Summary:
Partition Filters make use of a top-level index to find the partition that might have the bloom hash of the key. The index is with internal key format (before format version 3). Each partition contains the i) blooms of the keys in that range ii) bloom of prefixes of keys in that range, iii) the bloom of the prefix of the last key in the previous partition.
When ::SeekForPrev(key), we first perform a prefix bloom test on the SST file. The partition however is identified using the full internal key, rather than the prefix key. The reason is to be compatible with the internal key format of the top-level index. This creates a corner case. Example:
- SST k, Partition N: P1K1, P1K2
- SST k, top-level index: P1K2
- SST k+1, Partition 1: P2K1, P3K1
- SST k+1 top-level index: P3K1
When SeekForPrev(P1K3), it should point us to P1K2. However SST k top-level index would reject P1K3 since it is out of range.
One possible fix would be to search with the prefix P1 (instead of full internal key P1K3) however the details of properly comparing prefix with full internal key might get complicated. The fix we apply in this PR is to look into the last partition anyway even if the key is out of range.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5907
Differential Revision: D17889918
Pulled By: maysamyabandeh
fbshipit-source-id: 169fd7b3c71dbc08808eae5a8340611ebe5bdc1e
Summary:
Since we do not evict a file's blocks from block cache before that file
is deleted, we require a file's cache ID prefix is both unique and
non-reusable. However, the Windows functionality we were relying on only
guaranteed uniqueness. That meant a newly created file could be assigned
the same cache ID prefix as a deleted file. If the newly created file
had block offsets matching the deleted file, full cache keys could be
exactly the same, resulting in obsolete data blocks returned from cache
when trying to read from the new file.
We noticed this when running on FAT32 where compaction was writing out
of order keys due to reading obsolete blocks from its input files. The
functionality is documented as behaving the same on NTFS, although I
wasn't able to repro it there.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5844
Test Plan:
we had a reliable repro of out-of-order keys on FAT32 that
was fixed by this change
Differential Revision: D17752442
fbshipit-source-id: 95d983f9196cf415f269e19293b97341edbf7e00
Summary:
RocksDB has a MultiGet() API that implements batched key lookup for higher performance (https://github.com/facebook/rocksdb/blob/master/include/rocksdb/db.h#L468). Currently, batching is implemented in BlockBasedTableReader::MultiGet() for SST file lookups. One of the ways it improves performance is by pipelining bloom filter lookups (by prefetching required cachelines for all the keys in the batch, and then doing the probe) and thus hiding the cache miss latency. The same concept can be extended to the memtable as well. This PR involves implementing a pipelined bloom filter lookup in DynamicBloom, and implementing MemTable::MultiGet() that can leverage it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5818
Test Plan:
Existing tests
Performance Test:
Ran the below command which fills up the memtable and makes sure there are no flushes and then call multiget. Ran it on master and on the new change and see atleast 1% performance improvement across all the test runs I did. Sometimes the improvement was upto 5%.
TEST_TMPDIR=/data/users/$USER/benchmarks/feature/ numactl -C 10 ./db_bench -benchmarks="fillseq,multireadrandom" -num=600000 -compression_type="none" -level_compaction_dynamic_level_bytes -write_buffer_size=200000000 -target_file_size_base=200000000 -max_bytes_for_level_base=16777216 -reads=90000 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4 -statistics -memtable_whole_key_filtering=true -memtable_bloom_size_ratio=10
Differential Revision: D17578869
Pulled By: vjnadimpalli
fbshipit-source-id: 23dc651d9bf49db11d22375bf435708875a1f192
Summary:
The loop in OperateDb() is getting quite complicated with the introduction of multiple key operations such as MultiGet and Reseeks. This is resulting in a number of corner cases that hangs db_stress due to synchronization problems during reopen (i.e when -reopen=<> option is specified). This PR makes it more robust by ensuring all db_stress threads vote to reopen the DB the exact same number of times.
Most of the changes in this diff are due to indentation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5893
Test Plan: Run crash test
Differential Revision: D17823827
Pulled By: anand1976
fbshipit-source-id: ec893829f611ac7cac4057c0d3d99f9ffb6a6dd9
Summary:
Fixed some spots where converting size_t or uint_fast32_t to
uint32_t. Wrapped mt19937 in a new Random32 class to avoid future
such traps.
NB: I tried using Random32::Uniform (std::uniform_int_distribution) in
filter_bench instead of fastrange, but that more than doubled the dry
run time! So I added fastrange as Random32::Uniformish. ;)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5894
Test Plan: USE_CLANG=1 build, and manual re-run filter_bench
Differential Revision: D17825131
Pulled By: pdillinger
fbshipit-source-id: 68feee333b5f8193c084ded760e3d6679b405ecd
Summary:
This PR allows for the creation of custom env when using sst_dump. If
the user does not set options.env or set options.env to nullptr, then sst_dump
will automatically try to create a custom env depending on the path to the sst
file or db directory. In order to use this feature, the user must call
ObjectRegistry::Register() beforehand.
Test Plan (on devserver):
```
$make all && make check
```
All tests must pass to ensure this change does not break anything.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5845
Differential Revision: D17678038
Pulled By: riversand963
fbshipit-source-id: 58ecb4b3f75246d52b07c4c924a63ee61c1ee626
Summary:
This is the 2nd attempt after the revert of https://github.com/facebook/rocksdb/pull/4020
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5895
Test Plan:
```
./tools/db_crashtest.py blackbox --simple --interval=10 --max_key=10000000
```
Differential Revision: D17822137
Pulled By: maysamyabandeh
fbshipit-source-id: 3d148c0d8cc129080410ff859c04b544223c8ea3
Summary:
Example: using the tool before and after PR https://github.com/facebook/rocksdb/issues/5784 shows that
the refactoring, presumed performance-neutral, actually sped up SST
filters by about 3% to 8% (repeatable result):
Before:
- Dry run ns/op: 22.4725
- Single filter ns/op: 51.1078
- Random filter ns/op: 120.133
After:
+ Dry run ns/op: 22.2301
+ Single filter run ns/op: 47.4313
+ Random filter ns/op: 115.9
Only tests filters for the block-based table (full filters and
partitioned filters - same implementation; not block-based filters),
which seems to be the recommended format/implementation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5825
Differential Revision: D17804987
Pulled By: pdillinger
fbshipit-source-id: 0f18a9c254c57f7866030d03e7fa4ba503bac3c5
Summary:
Instead of hard coding Env::Default in TestEnv and a few other places, use the
DBTestBase::env_ that has been deduced from the constructor.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5886
Test Plan:
```
make check
```
Differential Revision: D17773029
Pulled By: riversand963
fbshipit-source-id: 7ce4e5175a487e9d281ea2c3aae3c41bffd44629
Summary:
This PR eliminates repeated lookups in associative or ordered containers when a single lookup suffices.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5875
Differential Revision: D17753172
Pulled By: anand1976
fbshipit-source-id: 796b02b760082521d8c42a1cb65a76bf0e6c1b8e
Summary:
When an iterator reseek happens with the user specifying a new iterate_upper_bound in ReadOptions, and the new seek position is at the end of the same data block, the Seek() ends up using a stale value of data_block_within_upper_bound_ and may return incorrect results.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5883
Test Plan: Added a new test case DBIteratorTest.IterReseekNewUpperBound. Verified that it failed due to the assertion failure without the fix, and passes with the fix.
Differential Revision: D17752740
Pulled By: anand1976
fbshipit-source-id: f9b635ff5d6aeb0e1bef102cf8b2f900efd378e3
Summary:
Broken type for shift in PR#5834. Fixing code means fixing
expected values in test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5882
Test Plan: thisisthetest
Differential Revision: D17746136
Pulled By: pdillinger
fbshipit-source-id: d3c456ed30b433d55fcab6fc7d836940fe3b46b8
Summary:
When multiple operations are performed in a db_stress thread in one loop
iteration, the reopen voting logic needs to take that into account. It
was doing that for MultiGet, but a new option was introduced recently to
do multiple iterator seeks per iteration, which broke it again. Fix the
logic to be more robust and agnostic of the type of operation performed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5876
Test Plan: Run db_stress
Differential Revision: D17733590
Pulled By: anand1976
fbshipit-source-id: 787f01abefa1e83bba43e0b4f4abb26699b2089e
Summary:
There was significant untested logic in FullFilterBitsReader in
the handling of serialized Bloom filter bits that cannot be generated by
FullFilterBitsBuilder in the current compilation. These now test many of
those corner-case behaviors, including bad metadata or filters created
with different cache line size than the current compiled-in value.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5834
Test Plan: thisisthetest
Differential Revision: D17726372
Pulled By: pdillinger
fbshipit-source-id: fb7b8003b5a8e6fb4666fe95206128f3d5835fc7
Summary:
Conflict resolving in 846e05005d ("Revert "Merging iterator to avoid child iterator reseek for some cases") caused some timer misplaced. Fix it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5874
Test Plan: See it build.
Differential Revision: D17705073
fbshipit-source-id: 9bd3a8dc4901ac33c2c6fc5b1091ffbc56a8529f
Summary:
Without this fix, compiler complains.
```
$ROCKSDB_NO_FBCODE=1 USE_CLANG=1 make ldb
table/block_based/full_filter_block.cc: In constructor ‘rocksdb::FullFilterBlockBuilder::FullFilterBlockBuilder(const rocksdb::SliceTransform*, bool, rocksdb::FilterBitsBuilder*)’:
table/block_based/full_filter_block.cc:20:43: error: declaration of ‘prefix_extractor’ shadows a member of 'this' [-Werror=shadow]
FilterBitsBuilder* filter_bits_builder)
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5872
Test Plan:
```
$ROCKSDB_NO_FBCODE=1 make all
```
Differential Revision: D17690058
Pulled By: riversand963
fbshipit-source-id: 19e3d9bd86e1123847095240e73d30da5d66240e
Summary:
This reverts commit 9fad3e21eb.
Iterator verification in stress tests sometimes fail for assertion
table/block_based/block_based_table_reader.cc:2973: void rocksdb::BlockBasedTableIterator<TBlockIter, TValue>::FindBlockForward() [with TBlockIter = rocksdb::DataBlockIter; TValue = rocksdb::Slice]: Assertion `!next_block_is_out_of_bound || user_comparator_.Compare(*read_options_.iterate_upper_bound, index_iter_->user_key()) <= 0' failed.
It is likely to be linked to https://github.com/facebook/rocksdb/pull/5286 together with https://github.com/facebook/rocksdb/pull/5468 as the former PR makes some child iterator's seek being avoided, so that upper bound condition fails to be updated there. Strictly speaking, the former PR was merged before the latter one, but the latter one feels a more important improvement so I choose to revert the former one for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5871
Differential Revision: D17689196
fbshipit-source-id: 4ded5be68f67bee2782d31a29cb72ea68f59dd8c
Summary:
Two more bug fixes in db_stress:
1. this is to complete the fix of the regression bug causing overflowing when supporting FLAGS_prefix_size = -1.
2. Fix regression bug in compare iterator itself:
(1) when creating control iterator, which used the same read option as the normal iterator by mistake; (2) the logic of comparing has some problems. Fix them.
(3) disable validation for lower bound now, which generated some wildly different results. Disabling it to make normal tests pass while investigating it.
3. Cleaning up snapshots in verification failure cases. Memory is leaked otherwise.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5867
Test Plan: Run "make crash_test" for a while and see at least 1 is fixed.
Differential Revision: D17671712
fbshipit-source-id: 011f98ea1a72aef23e19ff28656830c78699b402
Summary:
Atomic flush is incompatible with pipelined write. At least now.
If pipelined write is enabled, a thread performing write can exit the write
thread and start inserting into memtables. Consequently a thread performing
flush will enter write thread and race with memtable insertion by the former.
This will cause undefined result in terms of data persistence.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5860
Test Plan:
```
$make all && make check
```
Differential Revision: D17638944
Pulled By: riversand963
fbshipit-source-id: abc578dc49a5dbe41bc5adcecf448f8e042a6d49
Summary:
When prefix_size = -1, stress test crashes with run time error because of overflow. Fix it by not using -1 but 7 in prefix scan mode.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5862
Test Plan:
Run
python -u tools/db_crashtest.py --simple whitebox --random_kill_odd \
888887 --compression_type=zstd
and see it doesn't crash.
Differential Revision: D17642313
fbshipit-source-id: f029e7651498c905af1b1bee6d310ae50cdcda41
Summary:
For now, crash_test is not able to report any failure for the logic related to iterator upper, lower bounds or iterators, or reseek. These are features prone to errors. Improve db_stress in several ways:
(1) For each iterator run, reseek up to 3 times.
(2) For every iterator, create control iterator with upper or lower bound, with total order seek. Compare the results with the iterator.
(3) Make simple crash test to avoid prefix size to have more coverage.
(4) make prefix_size = 0 a valid size and -1 to indicate disabling prefix extractor.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5846
Test Plan: Manually hack the code to create wrong results and see they are caught by the tool.
Differential Revision: D17631760
fbshipit-source-id: acd460a177bd2124a5ffd7fff490702dba63030b
Summary:
Add unordered_write option api and related ut to rocksjava
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5839
Differential Revision: D17604446
Pulled By: maysamyabandeh
fbshipit-source-id: c6b07e85ca9d5e3a92973ddb6ab2bc079e53c9c1
Summary:
as title.
Test Plan (on devserver):
```
$make all && make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5855
Differential Revision: D17615125
Pulled By: riversand963
fbshipit-source-id: bd6ed8cf59eafff41f0d1fc044f39e8f3573172a
Summary:
This is a bug occaionally shows up in crash test, and this unit test is to reproduce it. The bug is following:
1. Database has multiple CFs.
2. Between one DB restart, the last log file is corrupted in the middle (not the tail)
3. During restart, DB crashes between flushes between two CFs.
The DB will fail to be opened again with error "SST file is ahead of WALs"
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5851
Test Plan: Run the test itself.
Differential Revision: D17614721
fbshipit-source-id: 1b0abce49b203a76a039e38e76bc940429975f20
Summary:
Partitioned filters make use of a top-level index to find the partition in which the filter resides. The top-level index has a key per partition. The key is guaranteed to be larger or equal than any key in that partition. When used with format_version 3, which excludes the sequence number form index keys, the separator key in the index could be equal to the prefix of the keys in the next partition. In this way, when searching for the key, the top-level index will lead us to the previous partition, which has no key with that prefix. The prefix bloom test thus returns false, although the prefix exists in the bloom of the next partition.
The patch fixes that by a hack: It always adds the prefix of the first key of the next partition to the bloom of the current partition. In this way, in the corner cases that the index will lead us to the previous partition, we still can find the bloom filter there.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5835
Differential Revision: D17513585
Pulled By: maysamyabandeh
fbshipit-source-id: e2d1ff26c759e6e03875c4d57f4228316ecf50e9
Summary:
The comparison of va_list and nullptr is always False under any arch, and will raise invalid operands of types error in aarch64 env (`error: invalid operands of types ‘va_list {aka __va_list}’ and ‘std::nullptr_t’ to binary ‘operator!=’`).
This patch removes this invalid assert.
Closes: https://github.com/facebook/rocksdb/issues/4277
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5836
Differential Revision: D17532470
fbshipit-source-id: ca98078ecbc6a9416c69de3bd6ffcfa33a0f0185
Summary:
format-diff.sh, a.k.a. 'make format', would use 'master'
to decide which commits are probably unpublished. Much better to use
facebook remote master since local master may not be caught up and may
have its own unpublished commits. Script now tries to compare against
facebook remote master branch (branch pointer is updated with any fetch
or pull), because those differences are what would be considered the
differences for a pull request.
Also, script would compare against *parent* of merge-base with that
reference point, which is just wrong since that includes the last
published commit.
In case of problems, you can now customize the reference point, by
setting the FORMAT_UPSTREAM variable.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5831
Test Plan: manual
Differential Revision: D17528462
Pulled By: pdillinger
fbshipit-source-id: 50fdb8795d683bf3c14d449669c1a5299e0dfa8b
Summary:
Further apply formatter to more recent commits.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5830
Test Plan: Run all existing tests.
Differential Revision: D17488031
fbshipit-source-id: 137458fd94d56dd271b8b40c522b03036943a2ab
Summary:
Some recent commits might not have passed through the formatter. I formatted recent 45 commits. The script hangs for more commits so I stopped there.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5827
Test Plan: Run all existing tests.
Differential Revision: D17483727
fbshipit-source-id: af23113ee63015d8a43d89a3bc2c1056189afe8f
Summary:
clang-analyzer has uncovered a bunch of places where the code is relying
on pointers being valid and one case (in VectorIterator) where a moved-from
object is being used:
In file included from db/range_tombstone_fragmenter.cc:17:
./util/vector_iterator.h:23:18: warning: Method called on moved-from object 'keys' of type 'std::vector'
current_(keys.size()) {
^~~~~~~~~~~
1 warning generated.
utilities/persistent_cache/block_cache_tier_file.cc:39:14: warning: Called C++ object pointer is null
Status s = env->NewRandomAccessFile(filepath, file, opt);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
utilities/persistent_cache/block_cache_tier_file.cc:47:19: warning: Called C++ object pointer is null
Status status = env_->GetFileSize(Path(), size);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
utilities/persistent_cache/block_cache_tier_file.cc:290:14: warning: Called C++ object pointer is null
Status s = env_->FileExists(Path());
^~~~~~~~~~~~~~~~~~~~~~~~
utilities/persistent_cache/block_cache_tier_file.cc:363:35: warning: Called C++ object pointer is null
CacheWriteBuffer* const buf = alloc_->Allocate();
^~~~~~~~~~~~~~~~~~
utilities/persistent_cache/block_cache_tier_file.cc:399:41: warning: Called C++ object pointer is null
const uint64_t file_off = buf_doff_ * alloc_->BufferSize();
^~~~~~~~~~~~~~~~~~~~
utilities/persistent_cache/block_cache_tier_file.cc:463:33: warning: Called C++ object pointer is null
size_t start_idx = lba.off_ / alloc_->BufferSize();
^~~~~~~~~~~~~~~~~~~~
utilities/persistent_cache/block_cache_tier_file.cc:515:5: warning: Called C++ object pointer is null
alloc_->Deallocate(bufs_[i]);
^~~~~~~~~~~~~~~~~~~~~~~~~~~~
7 warnings generated.
ar: creating librocksdb_debug.a
utilities/memory/memory_test.cc:68:25: warning: Called C++ object pointer is null
cache_set->insert(db->GetDBOptions().row_cache.get());
^~~~~~~~~~~~~~~~~~
1 warning generated.
The patch fixes these by adding assertions and explicitly passing in zero
when initializing VectorIterator::current_ (which preserves the existing
behavior).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5821
Test Plan: Ran make check and make analyze to make sure the warnings have disappeared.
Differential Revision: D17455949
Pulled By: ltamasi
fbshipit-source-id: 363619618ea649a0674287f9f3b3393e390571ee
Summary:
Make class ObsoleteFilesTest inherit from DBTestBase.
Test plan (on devserver):
```
$COMPILE_WITH_ASAN=1 make obsolete_files_test
$./obsolete_files_test
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5820
Differential Revision: D17452348
Pulled By: riversand963
fbshipit-source-id: b09f4581a18022ca2bfd79f2836c0bf7083f5f25
Summary:
Originally the loop of closing WAL in PurgeObsoleteFiles resides inside a loop
iterating over the candidate files. It should be moved out.
Test plan (devserver)
```
$COMPILE_WITH_ASAN=1 make -j32 all
$make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5804
Differential Revision: D17374350
Pulled By: riversand963
fbshipit-source-id: 2bee7343fc0481d9a385a87c7676491522285c96
Summary:
We are seeing a bug of wrong results with merging iterator's reseek avoidence feature and prefix extractor. Disable this optimization for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5815
Test Plan: Validated the same MyRocks case was fixed; run all existing tests.
Differential Revision: D17430776
fbshipit-source-id: aef664277ba0ab8a2e68331ff0db6ae682535371
Summary:
purge_queue_ maybe contains thousands sst files, for example manual compact a range. If full scan is triggered at the same time and the total sst files number is large, RocksDB will be blocked at https://github.com/facebook/rocksdb/blob/master/db/db_impl_files.cc#L150 for several seconds. In our environment we have 140,000 sst files and the manual compaction delete about 1000 sst files, it blocked about 2 minutes.
Commandeering https://github.com/facebook/rocksdb/issues/5290.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5796
Differential Revision: D17357775
Pulled By: riversand963
fbshipit-source-id: 20eacca917355b8de975ccc7b1c9a3e7bd5b201a
Summary:
https://github.com/facebook/rocksdb/issues/5797 charges the block cache with the total of user-provided charge plus the metadata charge. It had a bug where in MaintainPoolSize the user-provided charge was used instead of the total charge. The patch fixes that.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5813
Differential Revision: D17412783
Pulled By: maysamyabandeh
fbshipit-source-id: 45c0ac9f1e2233760db5ccd61399605cd74edc87