Summary:
With atomic flush, RocksDB background flush will flush memtables of a column family up to the largest memtable id in the immutable memtable list. This can introduce a bug in the following scenario. A user thread inserts into a column family until the memtable is full and triggers a flush. This will add the column family to flush_scheduler_. Then the user thread writes another record to the column family. In the PreprocessWrite function, the user thread picks the column family from flush_scheduler_ and schedules a flush request. The flush request gaurantees to flush all the memtables up to the current largest memtable ID of the immutable memtable list. Then the user thread writes new data to the newly-created active memtable. After the write returns, the user thread closes the db. This can cause assertion failure when the background flush thread tries to install superversion for the column family. The solution is to not install flush results if the db has already set `shutting_down_` to true.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5254
Differential Revision: D15124149
Pulled By: riversand963
fbshipit-source-id: 0a667a41339dedb5a18bcb01b0bf11c275c04df0
Summary:
Improve the iterators performance when the user explicitly sets the readahead size via `ReadOptions.readahead_size`.
1. Stop creating new table readers when the user explicitly sets readahead size.
2. Make use of an internal buffer based on `FilePrefetchBuffer` instead of using `ReadaheadRandomAccessFileReader`, to handle the user readahead requests (for both buffered and direct io cases).
3. Add `readahead_size` to db_bench.
**Benchmarks:**
https://gist.github.com/sagar0/53693edc320a18abeaeca94ca32f5737
For 1 MB readahead, Buffered IO performance improves by 28% and Direct IO performance improves by 50%.
For 512KB readahead, Buffered IO performance improves by 30% and Direct IO performance improves by 67%.
**Test Plan:**
Updated `DBIteratorTest.ReadAhead` test to make sure that:
- no new table readers are created for iterators on setting ReadOptions.readahead_size
- At least "readahead" number of bytes are actually getting read on each iterator read.
TODO later:
- Use similar logic for compactions as well.
- This ties in nicely with #4052 and paves the way for removing ReadaheadRandomAcessFile later.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5246
Differential Revision: D15107946
Pulled By: sagar0
fbshipit-source-id: 2c1149729ca7d779e4e8b7710ba6f4e8cbfd3bea
Summary:
The newly added test CompactionJobTest.SnapshotRefresh sets the snapshot refresh period to 0 to stress the feature. This results into large number of refresh events, which in turn results into an UBSAN failure when a bitwise shift operand goes beyond the uint64_t size.
The patch fixes that by simplifying the shift logic to be done only by 2 bits after each refresh. Furthermore it verifies that the shift operation does not result in decreasing the refresh period.
Testing:
COMPILE_WITH_UBSAN=1 make -j32 compaction_job_test
./compaction_job_test --gtest_filter=CompactionJobTest.SnapshotRefresh
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5257
Differential Revision: D15106463
Pulled By: maysamyabandeh
fbshipit-source-id: f2718898ea7ba4fa9f7e87b70cf98fe647c0de80
Summary:
Part of compaction cpu goes to processing snapshot list, the larger the list the bigger the overhead. Although the lifetime of most of the snapshots is much shorter than the lifetime of compactions, the compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long. This should let the compaction to continue more efficiently with much smaller snapshot list.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5099
Differential Revision: D15086710
Pulled By: maysamyabandeh
fbshipit-source-id: 7649f56c3b6b2fb334962048150142a3bf9c1a12
Summary:
Currently one thread in RocksDB keeps a WAL file open while another thread
deletes it. Although the first thread never writes to the WAL again, it still
tries to close it in the end. This is fine on POSIX, but can be problematic on
other platforms, e.g. HDFS, etc.. It will either cause a lot of warning messages or
throw exceptions. The solution is to let the second thread close the WAL before deleting it.
RocksDB keeps the writers of the logs to delete in `logs_to_free_`, which is passed to `job_context` during `FindObsoleteFiles` (holding mutex). Then in `PurgeObsoleteFiles` (without mutex), these writers should close the logs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5233
Differential Revision: D15032670
Pulled By: riversand963
fbshipit-source-id: c55e8a612db8cc2306644001a5e6d53842a8f754
Summary:
We have a DB with ~4k column families and ~70k files. On shutdown, destroying the 4k ColumnFamilyHandle-s takes over 2 minutes. Most of this time is spent in VersionSet::AddLiveFiles() called from FindObsoleteFiles() from ~ColumnFamilyHandleImpl(). It's just iterating over the list of files in memory. This seems completely unnecessary as no obsolete files are actually found since the CFs are not even dropped. This PR fixes that.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5238
Differential Revision: D15056342
Pulled By: siying
fbshipit-source-id: 2aa342ef3770b4aa384ce81f8768e485480e4f08
Summary: PR https://github.com/facebook/rocksdb/pull/4899 implemented the general framework for RocksDB secondary instances. This PR adds the support for WAL tailing in `OpenAsSecondary`, which means after the `OpenAsSecondary` call, the secondary is now able to see primary's writes that are yet to be flushed. The secondary can see primary's writes in the WAL up to the moment of `OpenAsSecondary` call starts.
Differential Revision: D15059905
Pulled By: miasantreble
fbshipit-source-id: 44f71f548a30b38179a7940165e138f622de1f10
Summary:
In some cases, we want to known the smallest and largest sequence numbers of sstable files, to help us get more details.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5231
Differential Revision: D15038087
Pulled By: siying
fbshipit-source-id: c473c1ca07b53efe2f1884fa1ecdc8686f455ed8
Summary:
It's hard to get DBIter to directly use InternalIterator::NextAndGetResult() because the code change would be complicated. Instead, use IteratorWrapper, where Next() is already using NextAndGetResult(). Performance number is hard to measure because it is small and ther is variation. I run readseq many times, and there seems to be 1% gain.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5214
Differential Revision: D15003635
Pulled By: siying
fbshipit-source-id: 17af1965c409c2fe90cd85037fbd2c5a1364f82a
Summary:
Introduce BlockBasedTableOptions::index_shortening to give users control on which key shortening techniques to be used in building index blocks. Before this patch, both separators and successor keys where shortened in indexes. With this patch, the default is set to kShortenSeparators to only shorten the separators. Since each index block has many separators and only one successor (last key), the change should not have negative impact on index block size. However it should prevent many unnecessary block loads where due to approximation introduced by shorted successor, seek would land us to the previous block and then fix it by moving to the next one.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5174
Differential Revision: D14884185
Pulled By: al13n321
fbshipit-source-id: 1b08bc8c03edcf09b6b8c16e9a7eea08ad4dd534
Summary:
Savepoints are assumed to be used in a stack-wise fashion (only
the top element should be used), so they were stored by `WriteBatch`
in a member variable `save_points` using an std::stack.
Conceptually this is fine, but the implementation had a few issues:
- the `save_points_` instance variable was a plain pointer to a heap-
allocated `SavePoints` struct. The destructor of `WriteBatch` simply
deletes this pointer. However, the copy constructor of WriteBatch
just copied that pointer, meaning that copying a WriteBatch with
active savepoints will very likely have crashed before. Now a proper
copy of the savepoints is made in the copy constructor, and not just
a copy of the pointer
- `save_points_` was an std::stack, which defaults to `std::deque` for
the underlying container. A deque is a bit over the top here, as we
only need access to the most recent savepoint (i.e. stack.top()) but
never any elements at the front. std::deque is rather expensive to
initialize in common environments. For example, the STL implementation
shipped with GNU g++ will perform a heap allocation of more than 500
bytes to create an empty deque object. Although the `save_points_`
container is created lazily by RocksDB, moving from a deque to a plain
`std::vector` is much more memory-efficient. So `save_points_` is now
a vector.
- `save_points_` was changed from a plain pointer to an `std::unique_ptr`,
making ownership more explicit.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5192
Differential Revision: D15024074
Pulled By: maysamyabandeh
fbshipit-source-id: 5b128786d3789cde94e46465c9e91badd07a25d7
Summary:
My compiler doesn't inline DBIter::Next() to arena wrapped iterator, even if it is a direct forward. Adding this annotation makes it inlined. It might not always work but inlinging this function to arena wrapped iterator always feels like the right decision.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5217
Differential Revision: D15004086
Pulled By: siying
fbshipit-source-id: a4cffd79c6fb092669a3a90633c9aa5e494f8a66
Summary:
We found an issue in Periodic Compactions (introduced in #5166) where files were not being picked up for compactions as all the SST files created with older versions of RocksDB have `file_creation_time` as 0. (Note that `file_creation_time` is a new table property introduced in #5166).
To address this, Periodic compactions now fall back to looking at the `creation_time` table property or the file's modification time (as given by the Env) when `file_creation_time` table property is found to be 0.
Here how the file's modification time (and, in turn, the file age) is computed now:
1. Use `file_creation_time` table property if it is > 0.
1. If not, then use `creation_time` table property if it is > 0.
1. If not, then use file's mtime stat metadata given by the underlying Env.
Don't consider the file at all for compaction if the modification time cannot be correctly determined based on the above conditions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5184
Differential Revision: D14907795
Pulled By: sagar0
fbshipit-source-id: 4bb2f3631f9a3e04470c674a1d13544584e1e56c
Summary:
Reorganize the code so that no function call into ReadRangeDelAggregator is needed if there is no tomb range stone.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5202
Differential Revision: D14968155
Pulled By: siying
fbshipit-source-id: 0bd61911293c7a27b4e1b8d57c66d0c4ad6a6a5f
Summary:
Several small changes for Next():
1. Reducing branching by always update local_stats_.next_count_++ even if statistics is null. This should be faster than a branching.
2. Replacing ResetInternalKeysSkippedCounter() in Next() because the valid_ check is not needed in this case.
3. iter_->Valid() should always be true for non merge case. Remove this check.
4. Adding an inline annotation. It ends up with not picked up by my compiler, but it shouldn't hurt.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5200
Differential Revision: D15000391
Pulled By: siying
fbshipit-source-id: be97f61c708968234fb8e5cf272b5c2ac07dc4dd
Summary:
In long scans, virtual function calls of Next(), Valid(), key() and value() are not trivial. By introducing NextAndGetResult(), Some of the Next(), Valid() and key() calls are consolidated into one virtual function call to reduce CPU.
Also did some inline tricks and add some "final" randomly in some functions. Even without the "final" annotation, most Next() calls are inlined with -O3, but sometimes with a final it is inlined by O2 too. It doesn't hurt to add those final annotations.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5197
Differential Revision: D14945977
Pulled By: siying
fbshipit-source-id: 7003969f9a5f1d5717f0bda503b91d19ba75ed88
Summary:
`GetOverlappingInputsRangeBinarySearch` firstly use binary search
to find a index in the given range `[begin, end]`. But after find
the index, then use linear search to find the `start_index` and
`end_index`. So the search process degraded to linear time.
Here optmize the search process with below changes:
- use `std::lower_bound` and `std::upper_bound` to get
`lg(n)` search complexity.
- use uniformed lambda for search process.
- simplify process for `within_interval` true or false.
- remove function `ExtendFileRangeWithinInterval`
and `ExtendFileRangeOverlappingInterval`.
Signed-off-by: JiYou <jiyou09@gmail.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4987
Differential Revision: D14984192
Pulled By: riversand963
fbshipit-source-id: fae4b8e59a21b7e350718d60cdc94dd55ac81e89
Summary:
this PR fixes the following compile warning:
```
db/memtable.cc: In member function ‘virtual void rocksdb::MemTableIterator::Seek(const rocksdb::Slice&)’:
db/memtable.cc:321:22: error: declaration of ‘user_key’ shadows a member of 'this' [-Werror=shadow]
Slice user_key(ExtractUserKey(k));
^
db/memtable.cc: In member function ‘virtual void rocksdb::MemTableIterator::SeekForPrev(const rocksdb::Slice&)’:
db/memtable.cc:338:22: error: declaration of ‘user_key’ shadows a member of 'this' [-Werror=shadow]
Slice user_key(ExtractUserKey(k));
^
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5204
Differential Revision: D14970160
Pulled By: miasantreble
fbshipit-source-id: 388eb089f90c4528cc6d615dd4607fb53ceac705
Summary:
Depending on the config, manual compaction (leveled compaction style) does following compactions:
L0->L1
L1->L2
...
Ln-1 -> Ln
Ln -> Ln
The final Ln -> Ln compaction is partly unnecessary as it recompacts all the files that were just generated by the Ln-1 -> Ln. We should avoid recompacting such files. This rule should be applied to Lmax only.
Resolves issue https://github.com/facebook/rocksdb/issues/4995
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5138
Differential Revision: D14940106
Pulled By: miasantreble
fbshipit-source-id: 8d3cf5507a17e76f3333cfd4bac5256d005636e5
Summary:
Dummy cache size of 1MB is too large for small block sizes. Our GetDefaultCacheShardBits() use min_shard_size = 512L * 1024L to determine number of shards, so 1MB will excceeds the size of the whole shard and make the cache excceeds the budget.
Change it to 256KB accordingly.
There shouldn't be obvious performance impact, since inserting a cache entry every 256KB of memtable inserts is still infrequently enough.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5175
Differential Revision: D14954289
Pulled By: siying
fbshipit-source-id: 2c275255c1ac3992174e06529e44c55538325c94
Summary:
This is second attempt for #5101. Original commit message:
`BlockBasedTableIterator` avoid reading next block on `Next()` if it detects the iterator will be out of bound, by checking against index key. The optimization was added in #2239, and by the time it only check the bound per block. It seems later change make it a per-key check, which introduce unnecessary key comparisons.
This patch come with two fixes:
Fix 1: To optimize checking for bounds, we need comparing the bounds with index key as well. However BlockBasedTableIterator doesn't know whether its index iterator is internally using user keys or internal keys. The patch fixes that by extending InternalIterator with a user_key() function that is overridden by In IndexBlockIter.
Fix 2: In #5101 we return `IsOutOfBound()=true` when block index key is out of bound. But the index key can be larger than smallest key of the next file on the level. That file can be within upper bound and should not be filtered out.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5142
Differential Revision: D14907113
Pulled By: siying
fbshipit-source-id: ac95775c5b4e7b700f76ab43e39f45402c98fbfb
Summary:
Right now, two separate pieces of code are used to create WAL files in DBImpl::Open function of db_impl_open.cc and DBImpl::SwitchMemtable function of db_impl_write.cc. This code change simply creates 1 function called DBImpl::CreateWAL in db_impl_open.cc which is used to replace existing WAL creation logic in DBImpl::Open and DBImpl::SwitchMemtable.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5188
Differential Revision: D14942832
Pulled By: vjnadimpalli
fbshipit-source-id: d49230e04c36176015c8c1b422575872f92157fb
Summary:
Found this when test driving the new MultiGet. If you pass unsorted result with sorted_result = false you'll trigger the ASSERT incorrect even though we'll sort down below.
I've also added simple test cover sorted_result=true/false scenario copied from MultiGetSimple.
anand1976
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5195
Differential Revision: D14935475
Pulled By: yizhang82
fbshipit-source-id: 1d2af5e3a003847d965066a16e3b19da68acf170
Summary:
Before using prefix extractor `InDomain()` should be check. All uses in memtable.cc didn't check `InDomain()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5190
Differential Revision: D14923773
Pulled By: miasantreble
fbshipit-source-id: b3ad60bcca5f3a1a2b929a6eb34b0b7ba6326f04
Summary:
When ReadOption doesn't specify a snapshot, WritePrepared::Get used kMaxSequenceNumber to avoid the cost of creating a new snapshot object (that requires sync over db_mutex). This creates a race condition if it is reading from the writes of a transaction that had duplicate keys: each instance of duplicate key is inserted with a different sequence number and depending on the ordering the ::Get might skip the newer one and read the older one that is obsolete.
The patch fixes that by using last published seq as the snapshot sequence number. It also adds a check after the read is done to ensure that the max_evicted_seq has not advanced the aforementioned seq, which is a very unlikely event. If it did, then the read is not valid since the seq is not backed by an actually snapshot to let IsInSnapshot handle that properly when an overlapping commit is evicted from commit cache.
A unit test is added to reproduce the race condition with duplicate keys.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5147
Differential Revision: D14758815
Pulled By: maysamyabandeh
fbshipit-source-id: a56915657132cf6ba5e3f5ea1b5d78c803407719
Summary:
Since Statistics::measureTime() is deprecated, StatisticsImpl::measureTime() is not implemented. We realized that users might have a wrapped Statistics implementation in which measureTime() is implemented as forwarded to StatisticsImpl, and causes assert failure. In order to make the change less intrusive, we implement StatisticsImpl::measureTime(). We will revisit whether we need to remove it after several releases.
Also, add a test to make sure that a Statistics implementation using the old interface still works.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5181
Differential Revision: D14907089
Pulled By: siying
fbshipit-source-id: 29b6202fd04e30ed6f6adcaeb1000e87f10d1e1a
Summary:
When a new SST file is created via flush or compaction, we dump out the table properties, however only a few table properties are logged. The change here is to log all the table properties
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5168
Differential Revision: D14876928
Pulled By: vjnadimpalli
fbshipit-source-id: 1aca42ad00f9f650761d39e187f8beeb8700149b
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
Summary:
Change the behavior of OptimizeForSmallDb() so that it is less likely to go out of memory.
Change the behavior of OptimizeForPointLookup() to take advantage of the new memtable whole key filter, and move away from prefix extractor as well as hash-based indexing, as they are prone to misuse.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5165
Differential Revision: D14880709
Pulled By: siying
fbshipit-source-id: 9af30e3c9e151eceea6d6b38701a58f1f9fb692d
Summary:
Introducing Periodic Compactions.
This feature allows all the files in a CF to be periodically compacted. It could help in catching any corruptions that could creep into the DB proactively as every file is constantly getting re-compacted. And also, of course, it helps to cleanup data older than certain threshold.
- Introduced a new option `periodic_compaction_time` to control how long a file can live without being compacted in a CF.
- This works across all levels.
- The files are put in the same level after going through the compaction. (Related files in the same level are picked up as `ExpandInputstoCleanCut` is used).
- Compaction filters, if any, are invoked as usual.
- A new table property, `file_creation_time`, is introduced to implement this feature. This property is set to the time at which the SST file was created (and that time is given by the underlying Env/OS).
This feature can be enabled on its own, or in conjunction with `ttl`. It is possible to set a different time threshold for the bottom level when used in conjunction with ttl. Since `ttl` works only on 0 to last but one levels, you could set `ttl` to, say, 1 day, and `periodic_compaction_time` to, say, 7 days. Since `ttl < periodic_compaction_time` all files in last but one levels keep getting picked up based on ttl, and almost never based on periodic_compaction_time. The files in the bottom level get picked up for compaction based on `periodic_compaction_time`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5166
Differential Revision: D14884441
Pulled By: sagar0
fbshipit-source-id: 408426cbacb409c06386a98632dcf90bfa1bda47
Summary:
Create new function NPHash64() and GetSliceNPHash64(), which are currently
implemented using murmurhash.
Replace the current direct call of murmurhash() to use the new functions
if the hash results are not used in on-disk format.
This will make it easier to try out or switch to alternative functions
in the uses where data format compatibility doesn't need to be considered.
This part shouldn't have any performance impact.
Also, the sharded cache hash function is changed to the new format, because
it falls into this categoery. It doesn't show visible performance impact
in db_bench results. CPU showed by perf is increased from about 0.2% to 0.4%
in an extreme benchmark setting (4KB blocks, no-compression, everything
cached in block cache). We've known that the current hash function used,
our own Hash() has serious hash quality problem. It can generate a lots of
conflicts with similar input. In this use case, it means extra lock contention
for reads from the same file. This slight CPU regression is worthy to me
to counter the potential bad performance with hot keys. And hopefully this
will get further improved in the future with a better hash function.
cache_test's condition is relaxed a little bit to. The new hash is slightly
more skewed in this use case, but I manually checked the data and see
the hash results are still in a reasonable range.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5155
Differential Revision: D14834821
Pulled By: siying
fbshipit-source-id: ec9a2c0a2f8ae4b54d08b13a5c2e9cc97aa80cb5
Summary:
Expose DB methods to lock and unlock the WAL.
These methods are intended to use by MyRocks in order to obtain WAL
coordinates in consistent way.
Usage scenario is following:
MySQL has performance_schema.log_status which provides information that
enables a backup tool to copy the required log files without locking for
the duration of copy. To populate this table MySQL does following:
1. Lock the binary log. Transactions are not allowed to commit now
2. Save the binary log coordinates
3. Walk through the storage engines and lock writes on each engine. For
InnoDB, redo log is locked. For MyRocks, WAL should be locked.
4. Ask storage engines for their coordinates. InnoDB reports its current
LSN and checkpoint LSN. MyRocks should report active WAL files names
and sizes.
5. Release storage engine's locks
6. Unlock binary log
Backup tool will then use this information to copy InnoDB, RocksDB and
MySQL binary logs up to specified positions to end up with consistent DB
state after restore.
Currently, RocksDB allows to obtain the list of WAL files. Only missing
bit is the method to lock the writes to WAL files.
LockWAL method must flush the WAL in order for the reported size to be
accurate (GetSortedWALFiles is using file system stat call to return the
file size), also, since backup tool is going to copy the WAL, it is
better to be flushed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5146
Differential Revision: D14815447
Pulled By: maysamyabandeh
fbshipit-source-id: eec9535a6025229ed471119f19fe7b3d8ae888a3
Summary:
Annotate all of the logging functions to inform the compiler that these
use printf-style formatting arguments. This allows the compiler to emit
warnings if the format arguments are incorrect.
This also fixes many problems reported now that format string checking
is enabled. Many of these are simply mix-ups in the argument type (e.g,
int vs uint64_t), but in several cases the wrong number of arguments
were being passed in which can cause the code to crash.
The primary motivation for this was to fix the log message in
`DBImpl::SwitchMemtable()` which caused a segfault due to an extra %s
format parameter with no argument supplied.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5089
Differential Revision: D14574795
Pulled By: simpkins
fbshipit-source-id: 0921b03f0743652bf4ae21e414ff54b3bb65422a
Summary:
The ReadCallback was shared between all threads in IteratorWithLocalStatistics. A race condition was
hence introduced with recent changes that changes the content of ReadCallback. The patch fixes that by using a separate callback per thread.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5149
Differential Revision: D14761612
Pulled By: maysamyabandeh
fbshipit-source-id: 814a316aed046c318cb90e22379a6e32ac528949
Summary:
Although user should first call StartTrace to begin the RocksDB tracing function and call EndTrace to stop the tracing process, user can accidentally call EndTrace first. It will cause segment fault and crash the DB instance. The issue is fixed by checking the pointer first.
Test case added in db_test2.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5130
Differential Revision: D14691420
Pulled By: zhichao-cao
fbshipit-source-id: 3be13d2f944bc453728ef8eef67b68d7ad0939c8
Summary:
This PR address two open issues:
1. clang analyzer is paranoid about db_ being nullptr after DB::Open calls in the test.
See https://github.com/facebook/rocksdb/pull/5043#discussion_r271394579
Add an assert to keep clang happy
2. PR https://github.com/facebook/rocksdb/pull/5049 introduced a variable shadowing:
```
db/db_iterator_test.cc: In constructor ‘rocksdb::DBIteratorWithReadCallbackTest_ReadCallback_Test::TestBody()::TestReadCallback::TestReadCallback(rocksdb::SequenceNumber)’:
db/db_iterator_test.cc:2484:9: error: declaration of ‘max_visible_seq’ shadows a member of 'this' [-Werror=shadow]
: ReadCallback(max_visible_seq) {}
^
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5140
Differential Revision: D14735497
Pulled By: miasantreble
fbshipit-source-id: 3219ea75cf4ae04f64d889323f6779e84be98144
Summary:
In prepare phase of 2PC, the db promises to remember the prepared data, for possible future commits. To fulfill the promise the prepared data must be persisted in the WAL so that they could be recovered after a crash. The log that contains a prepare batch that is not committed yet, is marked so that it is not garbage collected before the transaction commits/rollbacks. The bug was that the write to the log file and the mark of the file was not atomic, and WAL gc could have happened before the WAL log is actually marked. This patch moves the marking logic to PreReleaseCallback so that the WAL gc logic that joins both write threads would see the WAL write and WAL mark atomically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5121
Differential Revision: D14665210
Pulled By: maysamyabandeh
fbshipit-source-id: 1d66aeb1c66a296cb4899a5a20c4d40c59e4b534
Summary:
WriteUnPrepared adds a virtual function, MaxUnpreparedSequenceNumber, to ReadCallback, which returns 0 unless WriteUnPrepared is enabled and the transaction has uncommitted data written to the DB. Together with snapshot sequence number, this determines the last sequence that is visible to reads.
The patch clarifies the guarantees of the GetIterator API in WriteUnPrepared transactions and make use of that to statically initialize the read callback and thus avoid the virtual call.
Furthermore it increases the minimum value for min_uncommitted from 0 to 1 as seq 0 is used only for last level keys that are committed in all snapshots.
The following benchmark shows +0.26% higher throughput in seekrandom benchmark.
Benchmark:
./db_bench --benchmarks=fillrandom --use_existing_db=0 --num=1000000 --db=/dev/shm/dbbench
./db_bench --benchmarks=seekrandom[X10] --use_existing_db=1 --db=/dev/shm/dbbench --num=1000000 --duration=60 --seek_nexts=100
seekrandom [AVG 10 runs] : 20355 ops/sec; 225.2 MB/sec
seekrandom [MEDIAN 10 runs] : 20425 ops/sec; 225.9 MB/sec
./db_bench_lessvirtual3 --benchmarks=seekrandom[X10] --use_existing_db=1 --db=/dev/shm/dbbench --num=1000000 --duration=60 --seek_nexts=100
seekrandom [AVG 10 runs] : 20409 ops/sec; 225.8 MB/sec
seekrandom [MEDIAN 10 runs] : 20487 ops/sec; 226.6 MB/sec
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5049
Differential Revision: D14366459
Pulled By: maysamyabandeh
fbshipit-source-id: ebaff8908332a5ae9af7defeadabcb624be660ef
Summary:
This reverts commit f29dc1b906.
In BlockBasedTableIterator, index_iter_->key() is sometimes a user key, so it is wrong to call ExtractUserKey() against it. This is a bug introduced by #5101.
Temporarily revert the diff to keep the branch clean.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5132
Differential Revision: D14718584
Pulled By: siying
fbshipit-source-id: 0ac55dc9b5dbc18c7809092146bdf7eb9364b9ad
Summary:
Just like ReadOptions::background_purge_on_iterator_cleanup but for ColumnFamilyHandle instead of Iterator.
In our use case we sometimes call ColumnFamilyHandle's destructor from low-latency threads, and sometimes it blocks the thread for a few seconds deleting the files. To avoid that, we can either offload ColumnFamilyHandle's destruction to a background thread on our side, or add this option on rocksdb side. This PR does the latter, to be consistent with how we solve exactly the same problem for iterators using background_purge_on_iterator_cleanup option.
(EDIT: It's avoid_unnecessary_blocking_io now, and affects both CF drops and iterator destructors.)
I'm not quite comfortable with having two separate options (background_purge_on_iterator_cleanup and background_purge_on_cf_cleanup) for such a rarely used thing. Maybe we should merge them? Rename background_purge_on_cf_cleanup to something like delete_files_on_background_threads_only or avoid_blocking_io_in_unexpected_places, and make iterators use it instead of the one in ReadOptions? I can do that here if you guys think it's better.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5043
Differential Revision: D14339233
Pulled By: al13n321
fbshipit-source-id: ccf7efa11c85c9a5b91d969bb55627d0fb01e7b8
Summary:
`BlockBasedTableIterator` avoid reading next block on `Next()` if it detects the iterator will be out of bound, by checking against index key. The optimization was added in #2239, and by the time it only check the bound per block. It seems later change make it a per-key check, which introduce unnecessary key comparisons.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5101
Differential Revision: D14678707
Pulled By: siying
fbshipit-source-id: 2372446116753c7892ea4cec7b4b49ef87ba463e
Summary:
**This PR updates RepeatableThread::wait, breaking some tests on OS X. The rest of the PR fixes the tests on OS X.**
`RepeatableThreadTest.MockEnvTest` uses `MockTimeEnv` and `RepeatableThread`. If `RepeatableThread::wait` calls `TimedWait` with a time smaller than or equal to the current (real) time, `TimedWait` returns immediately on certain platforms, e.g. OS X. #4560 addresses this issue by replacing `TimedWait` with `Wait` in test. This fixes the test but makes test/production code diverge, which is not optimal for test coverage. This PR proposes an alternative fix which unifies test and production code path for `RepeatableThread::wait`. We obtain the current (real) time in seconds and add 10 extra seconds to ensure that `RepeatableThread::wait` invokes `TimedWait` with a time greater than (real) current time. This is to prevent the `TimedWait` function from returning immediately without sleeping and releasing the mutex. If `TimedWait` returns immediately, the mutex will not be released, and `RepeatableThread::TEST_WaitForRun` never has a chance to execute the callback which, in this case, updates the result returned by `mock_env->NowMicros()`. Consequently, `RepeatableThread::wait` cannot break out of the loop, causing test to hang. The extra 10 seconds is a best-effort approach because there seems no reliable and deterministic way to provide the aforementioned guarantee. By the time `RepeatableThread::wait` is called, there is no guarantee that the `delay + mock_env->NowMicros()` will be greater than the current real time. However, 10 seconds should be sufficient in most cases. We will keep an eye for possible flakiness of this test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5107
Differential Revision: D14680885
Pulled By: riversand963
fbshipit-source-id: d1ecbe10e1dacd110bd464cd01e188bfee72b89e