Summary:
The methods and fields in the private section of DBImpl were all intermingled, making it hard to figure out where the fields/methods start and where they end. I cleaned up the code a little so that all the type declaration are at the beginning, followed by methods, and all the data fields are at the end. This follows
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5385
Differential Revision: D15566978
Pulled By: sagar0
fbshipit-source-id: 4618a7d819ad4e2d7cc9ae1af2c59f400140bb1b
Summary:
1. Fix a bug in WAL replay in which write batches with old sequence numbers are mistakenly inserted into memtables.
2. Add support for benchmarking secondary instance to db_bench_tool.
With changes made in this PR, we can start benchmarking secondary instance
using two processes. It is also possible to vary the frequency at which the
secondary instance tries to catch up with the primary. The info log of the
secondary can be found in a directory whose path can be specified with
'-secondary_path'.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5170
Differential Revision: D15564608
Pulled By: riversand963
fbshipit-source-id: ce97688ed3d33f69d3a0b9266ebbbbf887aa0ec8
Summary:
Fix flaky DBTest2.PresetCompressionDict test.
This PR fixes two issues with the test:
1. Replaces `GetSstFiles` with `TotalSize`, which is based on `DB::GetColumnFamilyMetaData` so that only the size of the live SST files is taken into consideration when computing the total size of all sst files. Earlier, with `GetSstFiles`, even obsolete files were getting picked up.
1. In ZSTD compression, it is sometimes possible that using a trained dictionary is not better than using an untrained one. Using a trained dictionary performs well in 99% of the cases, but still in the remaining ~1% of the cases (out of 10000 runs) using an untrained dictionary gets better compression results.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5378
Differential Revision: D15559100
Pulled By: sagar0
fbshipit-source-id: c35adbf13871f520a2cec48f8bad9ff27ff7a0b4
Summary:
Currently, when the block cache is used for index blocks as well, it is
not really the index block that is stored in the cache but an
IndexReader object. Since this object is not pure data (it has, for
instance, pointers that might dangle), it's not really sharable. To
avoid the issues around this, the current code uses a dummy unique cache
key for each TableReader to store the IndexReader, and erases the
IndexReader entry when the TableReader is closed. Instead of doing this,
the new code moves the IndexReader out of the cache altogether. In
particular, instead of the TableReader owning, or caching/pinning the
IndexReader based on the customer's settings, the TableReader
unconditionally owns the IndexReader, which in turn owns/caches/pins
the index block (which is itself sharable and thus can be safely put in
the cache without any hacks).
Note: the change has two side effects:
1) Partitions of partitioned indexes no longer affect the read
amplification statistics.
2) Eviction statistics for index blocks are temporarily broken. We plan to fix
this in a separate phase.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5298
Differential Revision: D15303203
Pulled By: ltamasi
fbshipit-source-id: 935a69ba59d87d5e44f42e2310619b790c366e47
Summary:
When the --reopen option is non-zero, the DB is reopened after every ops_per_thread/(reopen+1) ops, with the check being done after every op. With MultiGet, we might do multiple ops in one iteration, which broke the logic that checked when to synchronize among the threads and reopen the DB. This PR fixes that logic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5374
Differential Revision: D15559780
Pulled By: anand1976
fbshipit-source-id: ee6563a68045df7f367eca3cbc2500d3e26359ef
Summary:
There are too many types of files under util/. Some test related files don't belong to there or just are just loosely related. Mo
ve them to a new directory test_util/, so that util/ is cleaner.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5377
Differential Revision: D15551366
Pulled By: siying
fbshipit-source-id: 0f5c8653832354ef8caa31749c0143815d719e2c
Summary:
By increasing the ratio, we ensure that all files go through background deletion and eliminate flakiness due to timing of deletions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5366
Differential Revision: D15549992
Pulled By: anand1976
fbshipit-source-id: d137375cd791fc1a802841412755d6e2b8fd7688
Summary:
When dynamically setting options, we check the option type info and skip options that are marked deprecated. However this check is only done at top level, which results in bugs where SetOptions will corrupt option values and cause unexpected system behavior iff a deprecated second level option is set dynamically.
For exmaple, the following call:
```
dbfull()->SetOptions(
{{"compaction_options_fifo",
"{allow_compaction=true;max_table_files_size=1024;ttl=731;}"}});
```
was from pre 6.0 release when `ttl` was part of `compaction_options_fifo`. Now that it got moved out of `compaction_options_fifo`, this call will incorrectly set `compaction_options_fifo.max_table_files_size` to 731 (as `max_table_files_size` is the first one in `OptionsHelper::fifo_compaction_options_type_info` struct) and cause files to gett evicted much faster than expected.
This PR adds verification to second level options like `compaction_options_fifo.ttl` or `compaction_options_fifo.max_table_files_size` when set dynamically, and filter out those marked as deprecated.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5367
Differential Revision: D15530998
Pulled By: miasantreble
fbshipit-source-id: 818258be5c3abe09cd82d62f3c083572d70fecdd
Summary:
util/ means for lower level libraries, so it's a good idea to move the files which requires knowledge to DB out. Create a file/ and move some files there.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5375
Differential Revision: D15550935
Pulled By: siying
fbshipit-source-id: 61a9715dcde5386eebfb43e93f847bba1ae0d3f2
Summary:
This enables the user to set TransactionDBOptions::skip_concurrency_control so the standard `DB::Write(const WriteOptions& opts, WriteBatch* updates)` would skip the concurrency control. This would give higher throughput to the users who know their use case doesn't need concurrency control.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5330
Differential Revision: D15525932
Pulled By: maysamyabandeh
fbshipit-source-id: 68421ac1ba34f549a4a8de9ce4c2dccf6fb4b06b
Summary:
When committing a transaction without prepare, WritePrepared simply writes the batch to db and add the commit entry to CommitCache. When two_write_queues=true, following the rule of committing only from 2nd write queue, the first write, writes the batch and the only thing the 2nd write does is to write the commit entry to CommitCache. Currently the write batch in 2nd write is set to an empty LogData entry, while the write to the WAL could simply be entirely disabled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5327
Differential Revision: D15424546
Pulled By: maysamyabandeh
fbshipit-source-id: 3d9ea3922d5196984c584d62a3ed57e1f7ca7b9f
Summary:
The analyzer thinks max_allowed_ space can be 0. In that case, free_space will
be assigned as free_space. It fails to realize that the function call
GetFreeSpace actually sets the free_space variable properly, which is possibly
due to lack of inter-function call analysis.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5365
Differential Revision: D15521790
Pulled By: riversand963
fbshipit-source-id: 839d0a285a1c8773a28a385f0c3be4bb7fbe32cb
Summary:
If RocksDB is configured with a positive max_allowed_space (via sst file manager),
then the sst file manager should use this value instead of total free disk
space to determine whether to clear the background error of space limit
reached.
In DBSSTTest.DBWithMaxSpaceAllowed, we configure a low space limit that is very
likely lower than the free disk space of the test machine. Therefore, once the
test db encounters a Status::SpaceLimit, error handler will call into sst file
manager to start error recovery which may clear the bg error since disk free
space is larger than reserved_disk_buffer_.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5334
Differential Revision: D15501622
Pulled By: riversand963
fbshipit-source-id: 58035efc450b062d6b28c78c322005ec3705fb47
Summary:
Add some comments in db_impl.h. Also reordered function order a little bit so that I can add a comment to flag the area of functions implementing DB interface.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5338
Differential Revision: D15498284
Pulled By: siying
fbshipit-source-id: 3d7c59c8303577fe44d13c74ae84c7ce05164f77
Summary:
RocksDB always tries to perform a hard link operation on the external SST file to ingest. This operation can fail if the external SST resides on a different device/FS, or the underlying FS does not support hard link. Currently RocksDB assumes that if the link fails, the user is willing to perform file copy, which is not true according to the post. This commit provides an option named 'failed_move_fall_back_to_copy' for users to choose which behavior they want.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5333
Differential Revision: D15457597
Pulled By: HaoyuHuang
fbshipit-source-id: f3626e13f845db4f7ed970a53ec8a2b1f0d62214
Summary:
Add file comment in db/db_iter.h and minor changes in other parts.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5340
Differential Revision: D15484605
Pulled By: siying
fbshipit-source-id: 173771f9d5bd51303de5410ee5afd0a4af9d6572
Summary:
Modified FindIntraL0Compaction to stop picking more files if total
amount of compensated bytes would be larger than max_compaction_bytes
option.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5329
Differential Revision: D15435728
Pulled By: ThomasFersch
fbshipit-source-id: d118a6da88d5df8ee20944422ade37cf6b15d60c
Summary:
In version_set.cc, there is a function GetCurrentManifestPath. The goal of this task is to refactor ListColumnFamilies function so that ListColumnFamilies calls GetCurrentManifestPath to search for MANIFEST.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5331
Differential Revision: D15444524
Pulled By: HaoyuHuang
fbshipit-source-id: 1dcbd030bc0f2e835695741f450bba150f2f2903
Summary:
Remove PATENTS related wording from a few stragglers which still reference the old PATENTS file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5326
Differential Revision: D15423297
Pulled By: sagar0
fbshipit-source-id: 4babcddfc120b7d2fed6eb3898287cf8012bf8ea
Summary:
Right now, in log writer, we call flush after writing each physical record. I don't see the necessarity of it. Right now, the underlying writer has a buffer, so there isn't a concern that the write request is too large either. On the other hand, in an Env where every flush is expensive, the current approach is significantly slower than only flushing after a whole record finishes, when the record is very large.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5328
Differential Revision: D15425032
Pulled By: siying
fbshipit-source-id: 440ebef002dfbb60c59d8388c9ddfc83d79700aa
Summary:
WritePrepared transactions when configured with two_write_queues=true offers higher throughput with unordered_write feature without however compromising the rocksdb guarantees. This is because it performs ordering among writes in a 2nd step that is not tied to memtable write speed. The 2nd step is naturally provided by 2PC when the commit phase does the ordering as well. Without 2PC, the 2nd step would only be provided when we use two_write_queues=true, where WritePrepared after performing the writes, in a 2nd step uses the 2nd queue to assign order to the writes.
The patch clarifies the need for two_write_queues=true in the HISTORY and inline comments of unordered_writes. Moreover it extends the stress tests of WritePrepared to unordred_write.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5313
Differential Revision: D15379977
Pulled By: maysamyabandeh
fbshipit-source-id: 5b6f05b9b59285dcbf3b0532215ba9fe7d926e00
Summary:
RocksDB secondary can replay both MANIFEST and WAL now.
On the one hand, the memory usage by memtables will grow after replaying WAL for sometime. On the other hand, replaying the MANIFEST can bring the database persistent data to a more recent point in time, giving us the opportunity to discard some memtables containing out-dated data.
This PR coordinates the MANIFEST and WAL replay, using the updates from MANIFEST replay to update the active memtable and immutable memtable list of each column family.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5305
Differential Revision: D15386512
Pulled By: riversand963
fbshipit-source-id: a3ea6fc415f8382d8cf624f52a71ebdcffa3e355
Summary:
Previously if iterator upper/lower bound presents, `DBIter` will check the bound for every key. This patch turns the check into per-file or per-data block check when applicable, by checking against either file largest/smallest key or block index key.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5111
Differential Revision: D15330061
Pulled By: siying
fbshipit-source-id: 8a653fe3cd50d94d81eb2d13b087326c58ee2024
Summary:
In the current db_bench trace replay, the replay process strictly follows the timestamp to issue the queries. In some cases, user does not care about the time. Therefore, fast forward is needed for users to speed up the replay process.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5273
Differential Revision: D15389232
Pulled By: zhichao-cao
fbshipit-source-id: 735d629b9d2a167b05af3e4fa0ddf9d5d0be1806