Summary:
The compressed block cache depends on reading the block compression marker beyond the payload block size. Only the payload bytes were being saved to and loaded from SecondaryCache -> boom!
This removes some unnecessary code attempting to combine these two competing features. Note that BlockContents was previously used for block-based filter in block cache, but that support has been removed.
Also marking block_cache_compressed as deprecated in this commit as we expect it to be replaced with SecondaryCache.
This problem was discovered during refactoring, but I didn't want to combine the bug fix with that refactoring.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10944
Test Plan: test added that fails on base revision (at least with ASAN)
Reviewed By: akankshamahajan15
Differential Revision: D41205578
Pulled By: pdillinger
fbshipit-source-id: 1b29d36c7a6552355ac6511fcdc67038ef4af29f
Summary:
The patch refines/reworks `MergeHelper::TimedFullMerge(WithEntity)`
a bit in two ways. First, it eliminates the recently introduced `TimedFullMerge`
overload, which makes the responsibilities clearer by making sure the query
result (`value` for `Get`, `columns` for `GetEntity`) is set uniformly in
`SaveValue` and `GetContext`. Second, it changes the interface of
`TimedFullMergeWithEntity` so it exposes its result in a serialized form; this
is a more decoupled design which will come in handy when adding support
for `Merge` with wide-column entities to `DBIter`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10932
Test Plan: `make check`
Reviewed By: akankshamahajan15
Differential Revision: D41129399
Pulled By: ltamasi
fbshipit-source-id: 69d8da358c77d4fc7e8c40f4dafc2c129a710677
Summary:
The patch fixes a bug where `GetContext::Merge` (and `MergeEntity`) does not update the ticker `READ_NUM_MERGE_OPERANDS` because it implicitly uses the default parameter value of `update_num_ops_stats=false` when calling `MergeHelper::TimedFullMerge`. Also, to prevent such issues going forward, the PR removes the default parameter values from the `TimedFullMerge` methods. In addition, it removes an unused/unnecessary parameter from `TimedFullMergeWithEntity`, and does some cleanup at the call sites of these methods.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10925
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D41096453
Pulled By: ltamasi
fbshipit-source-id: fc60646d32b4d516b8fe81e265c3f020a32fd7f8
Summary:
The patch adds `Merge` support for wide-column entities to the point lookup
APIs, i.e. `Get`, `MultiGet`, `GetEntity`, and `GetMergeOperands`. (I plan to
update the iterator and compaction logic in separate PRs.) In terms of semantics,
the `Merge` operation is applied to the default (anonymous) column; any other
columns in the entity are unaffected.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10916
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D40962311
Pulled By: ltamasi
fbshipit-source-id: 244bc9d172be1af2f204796b2f89104e4d2fa373
Summary:
This PR implements the originally disabled `Merge()` APIs when user-defined timestamp is enabled.
Simplest usage:
```cpp
// assume string append merge op is used with '.' as delimiter.
// ts1 < ts2
db->Put(WriteOptions(), "key", ts1, "v0");
db->Merge(WriteOptions(), "key", ts2, "1");
ReadOptions ro;
ro.timestamp = &ts2;
db->Get(ro, "key", &value);
ASSERT_EQ("v0.1", value);
```
Some code comments are added for clarity.
Note: support for timestamp in `DB::GetMergeOperands()` will be done in a follow-up PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10819
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D40603195
Pulled By: riversand963
fbshipit-source-id: f96d6f183258f3392d80377025529f7660503013
Summary:
Currently, a memtable's stats `num_deletes_` is incremented only if the entry is a regular delete (kTypeDeletion). Fix it by also accounting for kTypeSingleDeletion and kTypeDeletionWithTimestamp.
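As an illustrative sketch of the accounting change (not the actual memtable code; `kTypeDeletion`, `kTypeSingleDeletion`, and `kTypeDeletionWithTimestamp` are the internal `ValueType` constants, and `num_deletes_` is the memtable member being incremented):
```cpp
// Hypothetical helper: treat every flavor of point deletion as a delete when
// bumping num_deletes_, not just plain kTypeDeletion.
bool IsPointDeletion(ValueType type) {
  return type == kTypeDeletion || type == kTypeSingleDeletion ||
         type == kTypeDeletionWithTimestamp;
}
// ... in the memtable insert path:
// if (IsPointDeletion(type)) { ++num_deletes_; }
```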
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10886
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D40740754
Pulled By: riversand963
fbshipit-source-id: 7bde62cd6df136585bc5bfb1c426c7a8276c08e1
Summary:
The for loop is flagged as unreachable code because its increment can never be executed. Switch it to `if` (see the illustrative sketch after the compiler output below).
```
\table\merging_iterator.cc(823): error C2220: the following warning is treated as an error
\table\merging_iterator.cc(823): warning C4702: unreachable code
\table\merging_iterator.cc(1030): error C2220: the following warning is treated as an error
\table\merging_iterator.cc(1030): warning C4702: unreachable code
```
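For illustration only (a made-up fragment, not the actual merging_iterator.cc code; `children` is a hypothetical container), the shape of the warning and the fix looks roughly like this:
```cpp
// Before: the body always returns on the first iteration, so the `++i`
// increment is unreachable and MSVC emits C4702 when warnings are errors.
for (size_t i = 0; i < children.size(); ++i) {
  return &children[i];
}

// After: an `if` expresses the same "use the first child if any" intent
// without dead code.
if (!children.empty()) {
  return &children[0];
}
```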
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10897
Reviewed By: cbi42
Differential Revision: D40811790
Pulled By: jay-zhuang
fbshipit-source-id: fe8fd3e7cf3d6f710360c402b79763854d5120df
Summary:
The PR fixes the handling of `Merge`s in `GetEntity`. Note that `Merge` is not yet
supported for wide-column entities written using `PutEntity`; this change is
about returning correct (i.e. consistent with `Get`) results in cases like when the
base value is a plain old key-value written using `Put` or when there is no real base
value because we hit either a tombstone or the beginning of history.
Implementation-wise, the patch introduces a new wrapper around the existing
`MergeHelper::TimedFullMerge` that can store the merge result in either a string
(for the purposes of `Get`) or a `PinnableWideColumns` instance (for `GetEntity`).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10894
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D40782708
Pulled By: ltamasi
fbshipit-source-id: 3d700d56b2ef81f02ba1e2d93f6481bf13abcc90
Summary:
Right now in MergingIterator, for each range tombstone start and end key, we pop one end from heap and push the other end into the heap. This involves extra downheap and upheap cost. In the likely cases when a range tombstone iterator emits relatively adjacent keys, these keys should have similar order within all keys in the heap. This can happen when there is a burst of consecutive range tombstones, and most of the keys covered by them are dropped already. This PR uses `replace_top()` when inserting new range tombstone keys, which is more efficient in these common cases.
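A minimal, self-contained sketch of why `replace_top()` beats a pop followed by a push when the replacement key sorts near the old top (illustrative only; RocksDB's actual heap class and the range tombstone keys involved differ):
```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Pop-then-push on a std::vector-backed min-heap: a full sift-down
// (pop_heap) plus a full sift-up (push_heap).
void PopThenPush(std::vector<int>& heap, int new_key) {
  std::pop_heap(heap.begin(), heap.end(), std::greater<int>());
  heap.back() = new_key;
  std::push_heap(heap.begin(), heap.end(), std::greater<int>());
}

// replace_top: overwrite the root and do a single sift-down. When new_key is
// close in order to the old top (e.g. adjacent range tombstone keys), the
// sift-down terminates after very few swaps.
void ReplaceTop(std::vector<int>& heap, int new_key) {
  heap.front() = new_key;
  size_t i = 0;
  const size_t n = heap.size();
  while (true) {
    size_t l = 2 * i + 1, r = 2 * i + 2, smallest = i;
    if (l < n && heap[l] < heap[smallest]) smallest = l;
    if (r < n && heap[r] < heap[smallest]) smallest = r;
    if (smallest == i) break;
    std::swap(heap[i], heap[smallest]);
    i = smallest;
  }
}
```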
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10877
Test Plan:
- existing UT
- ran all flavors of stress test through sandcastle
- benchmark:
```
# Set up: --writes_per_range_tombstone=1 means one point write and one delete range
TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone ./db_bench --benchmarks=fillseq,levelstats --writes_per_range_tombstone=1 --max_num_range_tombstones=1000000 --range_tombstone_width=2 --num=100000000 --writes=800000 --max_bytes_for_level_base=4194304 --disable_auto_compactions --write_buffer_size=33554432 --key_size=64
Level Files Size(MB)
--------------------
0 8 152
1 0 0
2 0 0
3 0 0
4 0 0
5 0 0
6 0 0
# Benchmark
TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone/ ./db_bench --benchmarks=readseq[-W1][-X5],levelstats --use_existing_db=true --cache_size=3221225472 --num=100000000 --reads=1000000 --disable_auto_compactions=true --avoid_flush_during_recovery=true
# Pre PR
readseq [AVG 5 runs] : 1432116 (± 59664) ops/sec; 224.0 (± 9.3) MB/sec
readseq [MEDIAN 5 runs] : 1454886 ops/sec; 227.5 MB/sec
# Post PR
readseq [AVG 5 runs] : 1944425 (± 29521) ops/sec; 304.1 (± 4.6) MB/sec
readseq [MEDIAN 5 runs] : 1959430 ops/sec; 306.5 MB/sec
```
Reviewed By: ajkr
Differential Revision: D40710936
Pulled By: cbi42
fbshipit-source-id: cb782fb9cdcd26c0c3eb9443215a4ef4d2f79022
Summary:
Add some extra information to the output of `sst_dump --command=raw` to help debug some issues. Right now, the encoded block handle is printed out; it is more useful to print the offset and size directly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10873
Test Plan: Manually run it against a file and check the output.
Reviewed By: anand1976
Differential Revision: D40742289
fbshipit-source-id: 04d7de26e7f27e1595a7cc3ac1c1082e4e835b93
Summary:
FragmentedRangeTombstoneList has a member variable `seq_set_` that contains the sequence numbers of all range tombstones in a set. The set is constructed in `FragmentTombstones()` and is used only in `FragmentedRangeTombstoneList::ContainsRange()` which only happens during compaction. This PR moves the initialization of `seq_set_` to `FragmentedRangeTombstoneList::ContainsRange()`. This should speed up `FragmentTombstones()` when the range tombstone list is used for read/scan requests. Microbench shows the speed improvement to be ~45%.
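A conceptual sketch of the idea, under the assumption that lazily building the set on first use is the essence of the change (names are illustrative, not the actual `FragmentedRangeTombstoneList` members):
```cpp
#include <cstdint>
#include <set>
#include <utility>
#include <vector>

class TombstoneSeqIndexSketch {
 public:
  explicit TombstoneSeqIndexSketch(std::vector<uint64_t> tombstone_seqs)
      : tombstone_seqs_(std::move(tombstone_seqs)) {}

  // Only the compaction path calls this, so the read/scan path never pays
  // for building seq_set_.
  bool ContainsRange(uint64_t lower, uint64_t upper) {
    if (seq_set_.empty() && !tombstone_seqs_.empty()) {
      seq_set_.insert(tombstone_seqs_.begin(), tombstone_seqs_.end());
    }
    auto it = seq_set_.lower_bound(lower);
    return it != seq_set_.end() && *it <= upper;
  }

 private:
  std::vector<uint64_t> tombstone_seqs_;
  std::set<uint64_t> seq_set_;
};
```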
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10848
Test Plan:
- Existing tests and stress test: `python3 tools/db_crashtest.py whitebox --simple --verify_iterator_with_expected_state_one_in=5`.
- Microbench: update `range_del_aggregator_bench` to benchmark speed of `FragmentTombstones()`:
```
./range_del_aggregator_bench --num_range_tombstones=1000 --tombstone_start_upper_bound=50000000 --num_runs=10000 --tombstone_width_mean=200 --should_deletes_per_run=100 --use_compaction_range_del_aggregator=true
Before this PR:
=========================
Fragment Tombstones: 270.286 us
AddTombstones: 1.28933 us
ShouldDelete (first): 0.525528 us
ShouldDelete (rest): 0.0797519 us
After this PR: the time to fragment tombstones is pushed to AddTombstones(), which only happens during compaction.
=========================
Fragment Tombstones: 149.879 us
AddTombstones: 102.131 us
ShouldDelete (first): 0.565871 us
ShouldDelete (rest): 0.0729444 us
```
- db_bench: this should improve speed for fragmenting range tombstones for mutable memtable:
```
./db_bench --benchmarks=readwhilewriting --writes_per_range_tombstone=100 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=250000 --disable_auto_compactions --max_num_range_tombstones=100000 --finish_after_writes --write_buffer_size=1073741824 --threads=25
Before this PR:
readwhilewriting : 18.301 micros/op 1310445 ops/sec 4.769 seconds 6250000 operations; 28.1 MB/s (41001 of 250000 found)
After this PR:
readwhilewriting : 16.943 micros/op 1439376 ops/sec 4.342 seconds 6250000 operations; 23.8 MB/s (28977 of 250000 found)
```
Reviewed By: ajkr
Differential Revision: D40646227
Pulled By: cbi42
fbshipit-source-id: ea471667edb258f67d01cfd828588e80a89e4083
Summary:
`#include "db/range_tombstone_fragmenter.h"` seems to break some internal test for 7.8 release. I'm removing it from sst_file_reader.h for now to unblock release. This should be fine as it is only used in a unit test for DeleteRange with timestamp. In addition, it does not seem to be useful to support delete range for sst file writer, since the range tombstone won't cover any key (its sequence number is 0). So maybe we can remove it in the future.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10847
Test Plan: CI.
Reviewed By: akankshamahajan15
Differential Revision: D40620865
Pulled By: cbi42
fbshipit-source-id: be44b2f31e062bff87ed1b8d94482c3f7eaa370c
Summary:
Refactor the classes, APIs and data structures for block cache tracing to allow a user provided trace writer to be used. Currently, only a TraceWriter is supported, with a default built-in implementation of FileTraceWriter. The TraceWriter, however, takes a flat trace record and is thus only suitable for file tracing. This PR introduces an abstract BlockCacheTraceWriter class that takes a structured BlockCacheTraceRecord. The BlockCacheTraceWriter implementation can then format and log the record in whatever way it sees fit. The default BlockCacheTraceWriterImpl does file tracing using a user provided TraceWriter.
The existing `DB::StartBlockTrace` API will internally redirect to the changed `BlockCacheTrace::StartBlockCacheTrace`.
A new `DB::StartBlockTrace` overload is also added that directly takes a `BlockCacheTraceWriter` pointer.
This same philosophy can be applied to KV and IO tracing as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10811
Test Plan:
existing unit tests
Old API DB::StartBlockTrace checked with db_bench tool
create database
```
./db_bench --benchmarks="fillseq" \
--key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 \
--cache_index_and_filter_blocks --cache_size=1048576 \
--disable_auto_compactions=1 --disable_wal=1 --compression_type=none \
--min_level_to_compress=-1 --compression_ratio=1 --num=10000000
```
To trace block cache accesses when running readrandom benchmark:
```
./db_bench --benchmarks="readrandom" --use_existing_db --duration=60 \
--key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 \
--cache_index_and_filter_blocks --cache_size=1048576 \
--disable_auto_compactions=1 --disable_wal=1 --compression_type=none \
--min_level_to_compress=-1 --compression_ratio=1 --num=10000000 \
--threads=16 \
-block_cache_trace_file="/tmp/binary_trace_test_example" \
-block_cache_trace_max_trace_file_size_in_bytes=1073741824 \
-block_cache_trace_sampling_frequency=1
```
Reviewed By: anand1976
Differential Revision: D40435289
Pulled By: akankshamahajan15
fbshipit-source-id: fa2755f4788185e19f4605e731641cfd21ab3282
Summary:
The motivations for this change include
* Free up space in ClockHandle so that we can add data for secondary cache handling while still keeping within single cache line (64 byte) size.
* This change frees up space by eliminating the need for the `hash` field by making the fixed-size key itself a hash, using a 128-bit bijective (lossless) hash.
* Generally more customizability of ShardedCache (such as hashing) without worrying about virtual call overheads
* ShardedCache now uses static polymorphism (template) instead of dynamic polymorphism (virtual overrides) for the CacheShard (a minimal sketch follows the lists below). No obvious performance benefit is seen from the change (as mostly expected; most calls to virtual functions in CacheShard could already be optimized to static calls), but it offers more flexibility without incurring the runtime cost of adhering to a common interface (without type parameters or static callbacks).
* You'll also notice less `reinterpret_cast`ing and other boilerplate in the Cache implementations, as this can go in ShardedCache.
More detail:
* Don't have LRUCacheShard maintain `std::shared_ptr<SecondaryCache>` copies (extra refcount) when LRUCache can be in charge of keeping a `shared_ptr`.
* Renamed `capacity_mutex_` to `config_mutex_` to better represent the scope of what it guards.
* Some preparation for 64-bit hash and indexing in LRUCache, but didn't include the full change because of slight performance regression.
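A minimal sketch of the static-polymorphism idea mentioned in the motivations above (purely illustrative; the real `ShardedCache` template has a much larger interface and the shard types are RocksDB's LRU/clock shards):
```cpp
#include <cstddef>
#include <cstdint>
#include <memory>

template <class CacheShardT>  // e.g. an LRU or clock shard type
class ShardedCacheSketch {
 public:
  explicit ShardedCacheSketch(size_t num_shards)
      : num_shards_(num_shards), shards_(new CacheShardT[num_shards]) {}

  // Resolved at compile time to CacheShardT::Lookup; no virtual dispatch,
  // and each shard type may expose additional, shard-specific methods.
  bool Lookup(uint64_t hashed_key) {
    return shards_[hashed_key % num_shards_].Lookup(hashed_key);
  }

 private:
  size_t num_shards_;
  std::unique_ptr<CacheShardT[]> shards_;
};
```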
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10801
Test Plan:
Unit test updates were non-trivial because of major changes to the ClockCacheShard interface in handling of key vs. hash.
Performance:
Create with `TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16`
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X1000] -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=610000000 -duration 20 -threads=16
```
Before: `readrandom [AVG 150 runs] : 321147 (± 253) ops/sec`
After: `readrandom [AVG 150 runs] : 321530 (± 326) ops/sec`
So possibly ~0.1% improvement.
And with `-cache_type=hyper_clock_cache`:
Before: `readrandom [AVG 30 runs] : 614126 (± 7978) ops/sec`
After: `readrandom [AVG 30 runs] : 645349 (± 8087) ops/sec`
So roughly 5% improvement!
Reviewed By: anand1976
Differential Revision: D40252236
Pulled By: pdillinger
fbshipit-source-id: ff8fc70ef569585edc95bcbaaa0386f61355ae5b
Summary:
Instead of the existing calls to ps from gnu_parallel, call a new wrapper that runs ps, looks for unit-test-like processes, and uses pstack or gdb to print thread stack traces. Also, using `ps -wwf` instead of `ps -wf` ensures output is not cut off.
For security, CircleCI runs with restrictions on ptrace (/proc/sys/kernel/yama/ptrace_scope = 1), and this change adds a work-around to `InstallStackTraceHandler()` (only used by testing tools) to allow any process from the same user to debug it. (I've also touched >100 files to ensure all the unit tests call this function.)
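A sketch of the kind of work-around involved, assuming Linux with Yama ptrace restrictions (this is not necessarily the exact code added to `InstallStackTraceHandler()`):
```cpp
#include <sys/prctl.h>

// Allow any process owned by the same user (e.g. the gdb/pstack invoked by
// the CI wrapper) to ptrace-attach to this process, even when
// /proc/sys/kernel/yama/ptrace_scope is 1.
void AllowSameUserDebuggers() {
#if defined(PR_SET_PTRACER) && defined(PR_SET_PTRACER_ANY)
  (void)prctl(PR_SET_PTRACER, PR_SET_PTRACER_ANY, 0, 0, 0);
#endif
}
```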
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10828
Test Plan: local manual + temporary infinite loop in a unit test to observe in CircleCI
Reviewed By: hx235
Differential Revision: D40447634
Pulled By: pdillinger
fbshipit-source-id: 718a4c4a5b54fa0f9af2d01a446162b45e5e84e1
Summary:
Add user-defined timestamp support for range deletion. The new API is `DeleteRange(opt, cf, begin_key, end_key, ts)`; a usage sketch follows the list below. Most of the change is to update the comparator to compare without timestamp. Other than that, the major changes are
- internal range tombstone data structures (`FragmentedRangeTombstoneList`, `RangeTombstone`, etc.) to store timestamps.
- Garbage collection of range tombstones and range tombstone covered keys during compaction.
- Get()/MultiGet() to return the timestamp of a range tombstone when needed.
- Get/Iterator with range tombstones bounded by readoptions.timestamp.
- timestamp crash test now issues DeleteRange by default.
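A hedged usage sketch of the new API (modeled on the description above, not copied from the PR's tests; assumes a column family `cf` whose comparator has a user-defined timestamp and that `ts1` sorts before `ts2`):
```cpp
// Write a key at ts1, then delete the range ["key0", "key9") at ts2.
db->Put(WriteOptions(), cf, "key1", ts1, "v1");
db->DeleteRange(WriteOptions(), cf, "key0", "key9", ts2);

ReadOptions ro;
ro.timestamp = &ts2;
std::string value;
// Reading as of ts2: "key1" is covered by the range tombstone, so NotFound.
Status s = db->Get(ro, cf, "key1", &value);
assert(s.IsNotFound());

// Reading as of ts1 (before the tombstone) still sees "v1".
ro.timestamp = &ts1;
assert(db->Get(ro, cf, "key1", &value).ok());
```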
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10661
Test Plan:
- Added unit test: `make check`
- Stress test: `python3 tools/db_crashtest.py --enable_ts whitebox --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4`
- Ran `db_bench` to measure regression when timestamp is not enabled. The tests are for write (with some range deletion) and iterate with DB fitting in memory: `./db_bench --benchmarks=fillrandom,seekrandom --writes_per_range_tombstone=200 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=500000 --seek_nexts=10 --disable_auto_compactions -disable_wal=true --max_num_range_tombstones=1000`. Did not see consistent regression in the no-timestamp case.
| micros/op | fillrandom | seekrandom |
| --- | --- | --- |
|main| 2.58 |10.96|
|PR 10661| 2.68 |10.63|
Reviewed By: riversand963
Differential Revision: D39441192
Pulled By: cbi42
fbshipit-source-id: f05aca3c41605caf110daf0ff405919f300ddec2
Summary:
Try to align the compaction output file boundaries to the next level ones
(grandparent level), to reduce the level compaction write-amplification.
In level compaction, there is "wasted" data at the beginning and end of the
output level files. Aligning the file boundaries can avoid such "wasted" compaction.
With this PR, it tries to align the non-bottommost level file boundaries to their
next-level ones. It may cut a file when the file size is large enough (at least
50% of target_file_size) but not too large (at most 2x target_file_size).
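An illustrative sketch of the cut decision described above (my paraphrase of the 50% / 2x rule, not the actual compaction code):
```cpp
#include <cstdint>

// Cut the current compaction output at a grandparent (next-level) file
// boundary only if the output is already "large enough" (>= 50% of
// target_file_size) but has not yet hit the hard cap (2x target_file_size).
bool ShouldCutAtGrandparentBoundary(uint64_t current_output_size,
                                    uint64_t target_file_size,
                                    bool at_grandparent_boundary) {
  if (!at_grandparent_boundary) {
    return false;
  }
  return current_output_size >= target_file_size / 2 &&
         current_output_size <= target_file_size * 2;
}
```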
db_bench shows about 12.56% compaction reduction:
```
TEST_TMPDIR=/data/dbbench2 ./db_bench --benchmarks=fillrandom,readrandom -max_background_jobs=12 -num=400000000 -target_file_size_base=33554432
# baseline:
Flush(GB): cumulative 25.882, interval 7.216
Cumulative compaction: 285.90 GB write, 162.36 MB/s write, 269.68 GB read, 153.15 MB/s read, 2926.7 seconds
# with this change:
Flush(GB): cumulative 25.882, interval 7.753
Cumulative compaction: 249.97 GB write, 141.96 MB/s write, 233.74 GB read, 132.74 MB/s read, 2534.9 seconds
```
The compaction simulator shows a similar result (14% with 100G random data).
As a side effect, with this PR, the SST file size can exceed the
target_file_size, but is capped at 2x target_file_size. And there will be
smaller files. Here are file size statistics when loading 100GB with the target
file size 32MB:
```
baseline this_PR
count 1.656000e+03 1.705000e+03
mean 3.116062e+07 3.028076e+07
std 7.145242e+06 8.046139e+06
```
The feature is enabled by default; to revert to the old behavior, disable it
with `AdvancedColumnFamilyOptions.level_compaction_dynamic_file_size = false`.
Also includes https://github.com/facebook/rocksdb/issues/1963 to cut files before a skippable grandparent file. This targets
use cases like a user adding 2 or more non-overlapping data ranges at the same
time; it can reduce the overlap of the 2 datasets in the lower levels.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10655
Reviewed By: cbi42
Differential Revision: D39552321
Pulled By: jay-zhuang
fbshipit-source-id: 640d15f159ab0cd973f2426cfc3af266fc8bdde2
Summary:
Currently, without this fix, DBImpl::GetLatestSequenceForKey() may not return the latest sequence number for merge operands of the key. This can cause conflict checking during optimistic transaction commit phase to fail. Fix it by always returning the latest sequence number of the key, also considering range tombstones.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10724
Test Plan: make check
Reviewed By: cbi42
Differential Revision: D39756847
Pulled By: riversand963
fbshipit-source-id: 0764c3dd4cb24960b37e18adccc6e7feed0e6876
Summary:
We have a lot of confusing code because of mixed, sometimes
completely opposite uses of the term "raw block" or "raw contents",
sometimes within the same source file. For example, in `BlockBasedTableBuilder`,
`raw_block_contents` and `raw_size` generally referred to uncompressed block
contents and size, while `WriteRawBlock` referred to writing a block that
is already compressed if it is going to be. Meanwhile, in
`BlockBasedTable`, `raw_block_contents` either referred to a (maybe
compressed) block with trailer, or a maybe compressed block maybe
without trailer. (Note: left as follow-up work to use C++ typing to
better sort out the various kinds of BlockContents.)
This change primarily tries to apply some consistent terminology around
the kinds of block representations, avoiding the unclear "raw". (Any
meaning of "raw" assumes some bias toward the storage layer or toward
the logical data layer.) Preferred terminology:
* **Serialized block** - bytes that go into storage. For block-based table
(usually the case) this includes the block trailer. WART: block `size` may or
may not include the trailer; need to be clear about whether it does or not.
* **Maybe compressed block** - like a serialized block, but without the
trailer (or no promise of including a trailer). Must be accompanied by a
CompressionType.
* **Uncompressed block** - "payload" bytes that are either stored with no
compression, used as input to compression function, or result of
decompression function.
* **Parsed block** - an in-memory form of a block in block cache, as it is
used by the table reader. Different C++ types are used depending on the
block type (see block_like_traits.h).
Other refactorings:
* Misc corrections/improvements of internal API comments
* Remove a few misleading / unhelpful / redundant comments.
* Use move semantics in some places to simplify contracts
* Use better parameter names to indicate which parameters are used for
outputs
* Remove some extraneous `extern`
* Various clean-ups to `CacheDumperImpl` (mostly unnecessary code)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10408
Test Plan: existing tests
Reviewed By: akankshamahajan15
Differential Revision: D38172617
Pulled By: pdillinger
fbshipit-source-id: ccb99299f324ac5ca46996d34c5089621a4f260c
Summary:
Change the library order in PLATFORM_LDFLAGS to enable fbcode platform 10 build with folly. This PR also has a few fixes for platform 10 compiler errors.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10708
Test Plan:
ROCKSDB_FBCODE_BUILD_WITH_PLATFORM010=1 USE_COROUTINES=1 make -j64 check
ROCKSDB_FBCODE_BUILD_WITH_PLATFORM010=1 USE_FOLLY=1 make -j64 check
Reviewed By: ajkr
Differential Revision: D39666590
Pulled By: anand1976
fbshipit-source-id: 256a1127ef561399cd6299a6a392ca29bd68ca44
Summary:
When there is a single memtable without range tombstones and no SST files in the database, DBIter should wrap the memtable iterator directly. Currently we create a merging iterator on top of the memtable iterator and have DBIter wrap around it. This causes an iterator performance regression, and this PR fixes the issue.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10705
Test Plan:
- `make check`
- Performance:
- Set up: `./db_bench -benchmarks=filluniquerandom -write_buffer_size=$((1 << 30)) -num=10000`
- Benchmark: `./db_bench -benchmarks=seekrandom -use_existing_db=true -avoid_flush_during_recovery=true -write_buffer_size=$((1 << 30)) -num=10000 -threads=16 -duration=60 -seek_nexts=$seek_nexts`
| seek_nexts | main op/sec | https://github.com/facebook/rocksdb/issues/10705 | RocksDB v7.6 |
| --- | --- | --- | --- |
| 0 | 5746568 | 5749033 | 5786180 |
| 30 | 2411690 | 3006466 | 2837699 |
| 1000 | 102556 | 128902 | 124667 |
Reviewed By: ajkr
Differential Revision: D39644221
Pulled By: cbi42
fbshipit-source-id: 8063ff611ba31b0e5670041da3927c8c54b2097d
Summary:
Sanitize initial_auto_readahead_size if it is greater than max_auto_readahead_size in the case of async_io.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10660
Test Plan: Ran db_stress with initial_auto_readahead_size greater than max_auto_readahead_size.
Reviewed By: anand1976
Differential Revision: D39408095
Pulled By: akankshamahajan15
fbshipit-source-id: 07f933242f636cfbc7ccf042e0c8b959a8ec5f3a
Summary:
Although we've been tracking SST unique IDs in the DB manifest
unconditionally, checking has been opt-in and with an extra pass at DB::Open
time. This changes the behavior of `verify_sst_unique_id_in_manifest` to
check unique ID against manifest every time an SST file is opened through
table cache (normal DB operations), replacing the explicit pass over files
at DB::Open time. This change also enables the option by default and
removes the "EXPERIMENTAL" designation.
One possible criticism is that the option no longer ensures the integrity
of a DB at Open time. This is far from an all-or-nothing issue. Verifying
the IDs of all SST files hardly ensures all the data in the DB is readable.
(VerifyChecksum is supposed to do that.) Also, with
max_open_files=-1 (default, extremely common), all SST files are
opened at DB::Open time anyway.
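For reference, a sketch of opting out of the now-default verification (assuming the option remains the `Options`/`DBOptions` member named above):
```cpp
#include <cassert>
#include <string>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

void OpenWithoutUniqueIdVerification(const std::string& path, rocksdb::DB** db) {
  rocksdb::Options options;
  options.create_if_missing = true;
  // Now true by default; set to false to skip checking SST unique IDs
  // against the manifest when files are opened through the table cache.
  options.verify_sst_unique_id_in_manifest = false;
  rocksdb::Status s = rocksdb::DB::Open(options, path, db);
  assert(s.ok());
}
```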
Implementation details:
* `VerifySstUniqueIdInManifest()` functions are the extra/explicit pass
that is now removed.
* Unit tests that manipulate/corrupt table properties have to opt out of
this check, because that corrupts the "actual" unique id. (And even for
testing we don't currently have a mechanism to set "no unique id"
in the in-memory file metadata for new files.)
* A lot of other unit test churn relates to (a) default checking on, and
(b) checking on SST open even without DB::Open (e.g. on flush)
* Use `FileMetaData` for more `TableCache` operations (in place of
`FileDescriptor`) so that we have access to the unique_id whenever
we might need to open an SST file. **There is the possibility of
performance impact because we can no longer use the more
localized `fd` part of an `FdWithKeyRange` but instead follow the
`file_metadata` pointer. However, this change (possible regression)
is only done for `GetMemoryUsageByTableReaders`.**
* Removed a completely unnecessary constructor overload of
`TableReaderOptions`
Possible follow-up:
* Verification only happens when opening through table cache. Are there
more places where this should happen?
* Improve error message when there is a file size mismatch vs. manifest
(FIXME added in the appropriate place).
* I'm not sure there's a justification for `FileDescriptor` to be distinct from
`FileMetaData`.
* I'm skeptical that `FdWithKeyRange` really still makes sense for
optimizing some data locality by duplicating some data in memory, but I
could be wrong.
* An unnecessary overload of NewTableReader was recently added, in
the public API nonetheless (though unusable there). It should be cleaned
up to put most things under `TableReaderOptions`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10532
Test Plan:
updated unit tests
Performance test showing no significant difference (just noise I think):
`./db_bench -benchmarks=readwhilewriting[-X10] -num=3000000 -disable_wal=1 -bloom_bits=8 -write_buffer_size=1000000 -target_file_size_base=1000000`
Before: readwhilewriting [AVG 10 runs] : 68702 (± 6932) ops/sec
After: readwhilewriting [AVG 10 runs] : 68239 (± 7198) ops/sec
Reviewed By: jay-zhuang
Differential Revision: D38765551
Pulled By: pdillinger
fbshipit-source-id: a827a708155f12344ab2a5c16e7701c7636da4c2
Summary:
Delete range logic is moved from `DBIter` to `MergingIterator`, and `MergingIterator` will seek to the end of a range deletion if possible instead of scanning through each key and check with `RangeDelAggregator`.
With the invariant that a key in level L (consider memtable as the first level, each immutable and L0 as a separate level) has a larger sequence number than all keys in any level >L, a range tombstone `[start, end)` from level L covers all keys in its range in any level >L. This property motivates optimizations in iterator:
- in `Seek(target)`, if level L has a range tombstone `[start, end)` that covers `target.UserKey`, then for all levels > L, we can do Seek() on `end` instead of `target` to skip some range tombstone covered keys.
- in `Next()/Prev()`, if the current key is covered by a range tombstone `[start, end)` from level L, we can do `Seek` to `end` for all levels > L.
This PR implements the above optimizations in `MergingIterator`. As all range tombstone covered keys are now skipped in `MergingIterator`, the range tombstone logic is removed from `DBIter`. The idea in this PR is similar to https://github.com/facebook/rocksdb/issues/7317, but this PR leaves `InternalIterator` interface mostly unchanged. **Credit**: the cascading seek optimization and the sentinel key (discussed below) are inspired by [Pebble](https://github.com/cockroachdb/pebble/blob/master/merging_iter.go) and suggested by ajkr in https://github.com/facebook/rocksdb/issues/7317. The two optimizations are mostly implemented in `SeekImpl()/SeekForPrevImpl()` and `IsNextDeleted()/IsPrevDeleted()` in `merging_iterator.cc`. See comments for each method for more detail.
One notable change is that the minHeap/maxHeap used by `MergingIterator` now contains range tombstone end keys besides point key iterators. This helps to reduce the number of key comparisons. For example, for a range tombstone `[start, end)`, a `start` and an `end` `HeapItem` are inserted into the heap. When a `HeapItem` for range tombstone start key is popped from the minHeap, we know this range tombstone becomes "active" in the sense that, before the range tombstone's end key is popped from the minHeap, all the keys popped from this heap is covered by the range tombstone's internal key range `[start, end)`.
Another major change, *delete range sentinel key*, is made to `LevelIterator`. Before this PR, when all point keys in an SST file are iterated through in `MergingIterator`, a level iterator would advance to the next SST file in its level. In the case when an SST file has a range tombstone that covers keys beyond the SST file's last point key, advancing to the next SST file would lose this range tombstone. Consequently, `MergingIterator` could return keys that should have been deleted by some range tombstone. We prevent this by pretending that file boundaries in each SST file are sentinel keys. A `LevelIterator` now only advances the file iterator once the sentinel key is processed.
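Purely as a conceptual sketch of the heap contents described above (illustrative names only, not the actual merging_iterator.cc types):
```cpp
#include <cstddef>
#include <string>

// The merging heap now orders two kinds of items: the usual per-level point
// key iterators, and range tombstone boundary keys (including the per-file
// sentinel keys used by LevelIterator).
struct HeapItemSketch {
  enum class Type {
    kPointIterator,     // ordered by the child iterator's current key
    kDeleteRangeStart,  // start key of a fragmented range tombstone
    kDeleteRangeEnd,    // end key of a range tombstone / file-boundary sentinel
  };
  Type type = Type::kPointIterator;
  size_t level = 0;        // which child level this item belongs to
  std::string pinned_key;  // key used for ordering when not a point iterator
};
```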
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10449
Test Plan:
- Added many unit tests in db_range_del_test
- Stress test: `./db_stress --readpercent=5 --prefixpercent=19 --writepercent=20 -delpercent=10 --iterpercent=44 --delrangepercent=2`
- Additional iterator stress test is added to verify against iterators against expected state: https://github.com/facebook/rocksdb/issues/10538. This is based on ajkr's previous attempt https://github.com/facebook/rocksdb/pull/5506#issuecomment-506021913.
```
python3 ./tools/db_crashtest.py blackbox --simple --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152 --compression_type=none --max_background_compactions=8 --value_size_mult=33 --max_key=5000000 --interval=10 --duration=7200 --delrangepercent=3 --delpercent=9 --iterpercent=25 --writepercent=60 --readpercent=3 --prefixpercent=0 --num_iterations=1000 --range_deletion_width=100 --verify_iterator_with_expected_state_one_in=1
```
- Performance benchmark: I used a similar setup as in the blog [post](http://rocksdb.org/blog/2018/11/21/delete-range.html) that introduced DeleteRange, "a database with 5 million data keys, and 10000 range tombstones (ignoring those dropped during compaction) that were written in regular intervals after 4.5 million data keys were written". As expected, the performance with this PR depends on the range tombstone width.
```
# Setup:
TEST_TMPDIR=/dev/shm ./db_bench_main --benchmarks=fillrandom --writes=4500000 --num=5000000
TEST_TMPDIR=/dev/shm ./db_bench_main --benchmarks=overwrite --writes=500000 --num=5000000 --use_existing_db=true --writes_per_range_tombstone=50
# Scan entire DB
TEST_TMPDIR=/dev/shm ./db_bench_main --benchmarks=readseq[-X5] --use_existing_db=true --num=5000000 --disable_auto_compactions=true
# Short range scan (10 Next())
TEST_TMPDIR=/dev/shm/width-100/ ./db_bench_main --benchmarks=seekrandom[-X5] --use_existing_db=true --num=500000 --reads=100000 --seek_nexts=10 --disable_auto_compactions=true
# Long range scan(1000 Next())
TEST_TMPDIR=/dev/shm/width-100/ ./db_bench_main --benchmarks=seekrandom[-X5] --use_existing_db=true --num=500000 --reads=2500 --seek_nexts=1000 --disable_auto_compactions=true
```
Avg over 10 runs (some slower tests had fewer runs):
For the first column (tombstone), 0 means no range tombstone, 100-10000 means the width of the 10k range tombstones, and 1 means there is a single range tombstone in the entire DB (width is 1000). The 1-tombstone case is to test for regression when there are very few range tombstones in the DB, since having no range tombstones is likely to take a different code path than having range tombstones.
- Scan entire DB
| tombstone width | Pre-PR ops/sec | Post-PR ops/sec | ±% |
| ------------- | ------------- | ------------- | ------------- |
| 0 range tombstone |2525600 (± 43564) |2486917 (± 33698) |-1.53% |
| 100 |1853835 (± 24736) |2073884 (± 32176) |+11.87% |
| 1000 |422415 (± 7466) |1115801 (± 22781) |+164.15% |
| 10000 |22384 (± 227) |227919 (± 6647) |+918.22% |
| 1 range tombstone |2176540 (± 39050) |2434954 (± 24563) |+11.87% |
- Short range scan
| tombstone width | Pre-PR ops/sec | Post-PR ops/sec | ±% |
| ------------- | ------------- | ------------- | ------------- |
| 0 range tombstone |35398 (± 533) |35338 (± 569) |-0.17% |
| 100 |28276 (± 664) |31684 (± 331) |+12.05% |
| 1000 |7637 (± 77) |25422 (± 277) |+232.88% |
| 10000 |1367 |28667 |+1997.07% |
| 1 range tombstone |32618 (± 581) |32748 (± 506) |+0.4% |
- Long range scan
| tombstone width | Pre-PR ops/sec | Post-PR ops/sec | ±% |
| ------------- | ------------- | ------------- | ------------- |
| 0 range tombstone |2262 (± 33) |2353 (± 20) |+4.02% |
| 100 |1696 (± 26) |1926 (± 18) |+13.56% |
| 1000 |410 (± 6) |1255 (± 29) |+206.1% |
| 10000 |25 |414 |+1556.0% |
| 1 range tombstone |1957 (± 30) |2185 (± 44) |+11.65% |
- Microbench does not show significant regression: https://gist.github.com/cbi42/59f280f85a59b678e7e5d8561e693b61
Reviewed By: ajkr
Differential Revision: D38450331
Pulled By: cbi42
fbshipit-source-id: b5ef12e8d8c289ed2e163ccdf277f5039b511fca
Summary:
RocksDB does auto-readahead for iterators upon noticing more
than two reads for a table file if the user doesn't provide a readahead_size and reads are sequential.
A new option num_file_reads_for_auto_readahead is added which can be
configured and indicates after how many sequential reads prefetching should
start.
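A configuration sketch (assuming the option is exposed on `BlockBasedTableOptions`, alongside the other auto-readahead knobs):
```cpp
#include "rocksdb/options.h"
#include "rocksdb/table.h"

rocksdb::Options MakeOptionsWithReadaheadThreshold() {
  rocksdb::BlockBasedTableOptions table_options;
  // Start auto-readahead only after 4 sequential reads of a table file
  // (instead of the historical "more than two").
  table_options.num_file_reads_for_auto_readahead = 4;

  rocksdb::Options options;
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));
  return options;
}
```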
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10556
Test Plan: Existing and new unit test
Reviewed By: anand1976
Differential Revision: D38947147
Pulled By: akankshamahajan15
fbshipit-source-id: c9eeab495f84a8df7f701c42f04894e46440ad97
Summary:
This PR exploits parallelism in MultiGet across levels. It applies only to the coroutine version of MultiGet. Previously, MultiGet file reads from SST files in the same level were parallelized. With this PR, MultiGet batches with keys distributed across multiple levels are read in parallel. This is accomplished by splitting the keys not present in a level (determined by bloom filtering) into a separate batch, and processing the new batch in parallel with the original batch.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10535
Test Plan:
1. Ensure existing MultiGet unit tests pass, updating them as necessary
2. New unit tests - TODO
3. Run stress test - TODO
No noticeable regression (<1%) without async IO -
Without PR: `multireadrandom : 7.261 micros/op 1101724 ops/sec 60.007 seconds 66110936 operations; 571.6 MB/s (8168992 of 8168992 found)`
With PR: `multireadrandom : 7.305 micros/op 1095167 ops/sec 60.007 seconds 65717936 operations; 568.2 MB/s (8271992 of 8271992 found)`
For a fully cached DB, but with async IO option on, no regression observed (<1%) -
Without PR: `multireadrandom : 5.201 micros/op 1538027 ops/sec 60.005 seconds 92288936 operations; 797.9 MB/s (11540992 of 11540992 found) `
With PR: `multireadrandom : 5.249 micros/op 1524097 ops/sec 60.005 seconds 91452936 operations; 790.7 MB/s (11649992 of 11649992 found) `
Reviewed By: akankshamahajan15
Differential Revision: D38774009
Pulled By: anand1976
fbshipit-source-id: c955e259749f1c091590ade73105b3ee46cd0007
Summary:
The patch adds a new API `GetEntity` that can be used to perform
wide-column point lookups. It also extends the `Get` code path and
the `MemTable` / `MemTableList` and `Version` / `GetContext` logic
accordingly so that wide-column entities can be served from both
memtables and SSTs. If the result of a lookup is a wide-column entity
(`kTypeWideColumnEntity`), it is passed to the application in deserialized
form; if it is a plain old key-value (`kTypeValue`), it is presented as a
wide-column entity with a single default (anonymous) column.
(In contrast, regular `Get` returns plain old key-values as-is, and
returns the value of the default column for wide-column entities, see
https://github.com/facebook/rocksdb/issues/10483 .)
The result of `GetEntity` is a self-contained `PinnableWideColumns` object.
`PinnableWideColumns` contains a `PinnableSlice`, which either stores the
underlying data in its own buffer or holds on to a cache handle. It also contains
a `WideColumns` instance, which indexes the contents of the `PinnableSlice`,
so applications can access the values of columns efficiently.
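A usage sketch based on the description above (the exact signatures are my reading of the PR, so treat the details as assumptions rather than the authoritative API):
```cpp
#include <iostream>

#include "rocksdb/db.h"
#include "rocksdb/wide_columns.h"

void PrintEntity(rocksdb::DB* db, const rocksdb::Slice& key) {
  rocksdb::PinnableWideColumns result;
  rocksdb::Status s = db->GetEntity(rocksdb::ReadOptions(),
                                    db->DefaultColumnFamily(), key, &result);
  if (!s.ok()) {
    return;
  }
  // columns() indexes into the pinned buffer; name()/value() are Slices, so
  // no extra copies are made here.
  for (const rocksdb::WideColumn& column : result.columns()) {
    std::cout << column.name().ToString() << " = "
              << column.value().ToString() << "\n";
  }
}
```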
There are several pieces of functionality which are currently not supported
for wide-column entities: there is currently no `MultiGetEntity` or wide-column
iterator; also, `Merge` and `GetMergeOperands` are not supported, and there
is no `GetEntity` implementation for read-only and secondary instances.
We plan to implement these in future PRs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10540
Test Plan: `make check`
Reviewed By: akankshamahajan15
Differential Revision: D38847474
Pulled By: ltamasi
fbshipit-source-id: 42311a34ccdfe88b3775e847a5e2a5296e002b5b
Summary:
This reverts commit 0d885e80d4. The original commit causes an ASAN stack-use-after-return failure due to the `CreateCallback` being allocated on the stack and then used in another thread when a secondary cache object is promoted to the primary cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10541
Reviewed By: gitbw95
Differential Revision: D38850039
Pulled By: anand1976
fbshipit-source-id: 810c592b7de2523693f5bb267159b23b0ee9132c
Summary:
RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them.
This task is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10461
Reviewed By: siying
Differential Revision: D38672823
Pulled By: ltamasi
fbshipit-source-id: 90cf7362036563d79891f47be2cc24b827482743
Summary:
... so that cache keys can be derived from DB manifest data
before reading the file from storage--so that every part of the file
can potentially go in a persistent cache.
See updated comments in cache_key.cc for technical details. Importantly,
the new cache key encoding uses some fancy but efficient math to pack
data into the cache key without depending on the sizes of the various
pieces. This simplifies some existing code creating cache keys, like
cache warming before the file size is known.
This should provide us an essentially permanent mapping between SST
unique IDs and base cache keys, with the ability to "upgrade" SST
unique IDs (and thus cache keys) with new SST format_versions.
These cache keys are of similar, perhaps indistinguishable quality to
the previous generation. Before this change (see "corrected" days
between collision):
```
./cache_bench -stress_cache_key -sck_keep_bits=43
18 collisions after 2 x 90 days, est 10 days between (1.15292e+19 corrected)
```
After this change (keep 43 bits, up through 50, to validate "trajectory"
is ok on "corrected" days between collision):
```
19 collisions after 3 x 90 days, est 14.2105 days between (1.63836e+19 corrected)
16 collisions after 5 x 90 days, est 28.125 days between (1.6213e+19 corrected)
15 collisions after 7 x 90 days, est 42 days between (1.21057e+19 corrected)
15 collisions after 17 x 90 days, est 102 days between (1.46997e+19 corrected)
15 collisions after 49 x 90 days, est 294 days between (2.11849e+19 corrected)
15 collisions after 62 x 90 days, est 372 days between (1.34027e+19 corrected)
15 collisions after 53 x 90 days, est 318 days between (5.72858e+18 corrected)
15 collisions after 309 x 90 days, est 1854 days between (1.66994e+19 corrected)
```
However, the change does modify (probably weaken) the "guaranteed unique" promise from this
> SST files generated in a single process are guaranteed to have unique cache keys, unless/until number session ids * max file number = 2**86
to this (see https://github.com/facebook/rocksdb/issues/10388)
> With the DB id limitation, we only have nice guaranteed unique cache keys for files generated in a single process until biggest session_id_counter and offset_in_file reach combined 64 bits
I don't think this is a practical concern, though.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10394
Test Plan: unit tests updated, see simulation results above
Reviewed By: jay-zhuang
Differential Revision: D38667529
Pulled By: pdillinger
fbshipit-source-id: 49af3fe7f47e5b61162809a78b76c769fd519fba
Summary:
Some files are missing headers, and some headers are irregular. Fix them to make an internal checkup tool happy.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10519
Reviewed By: jay-zhuang
Differential Revision: D38603291
fbshipit-source-id: 13b1bbd6d48f5ee15ba20da67544396de48238f1
Summary:
A flag in WritableFileWriter is introduced to remember that an error has happened. Subsequent operations will fail with an assertion. Those operations, except Close(), are not supposed to be called anyway. This change will help catch bugs in tests and stress tests and limit the damage of a potential bug where writing to a file continues after a failure.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10489
Test Plan: Fix existing unit tests and watch crash tests for a while.
Reviewed By: anand1976
Differential Revision: D38473277
fbshipit-source-id: 09aafb971e56cfd7f9ef92ad15b883f54acf1366
Summary:
The patch adds support for wide-column entities to the existing query
APIs (`Get`, `MultiGet`, and iterator). Namely, when during a query a
wide-column entity is encountered, we will return the value of the default
(anonymous) column as the result. Later, we plan to add wide-column
specific query APIs which will enable retrieving entire wide-column entities
or a subset of their columns.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10483
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D38441881
Pulled By: ltamasi
fbshipit-source-id: 6444e79a31aff2470e866698e3a97985bc2b3543
Summary:
A lambda function dynamically allocates memory from the heap if it needs to
capture multiple values, which can be expensive.
Switch to explicitly using a local functor on the stack.
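An illustrative sketch of the pattern (not the actual RocksDB call site; the heap allocation typically comes from type-erasing a multi-capture lambda into `std::function`):
```cpp
#include <functional>
#include <string>
#include <vector>

// Explicit functor: its state is two raw pointers living on the stack.
struct CountMatches {
  const std::string* target;
  int* hits;
  void operator()(const std::string& key) const {
    if (key == *target) ++(*hits);
  }
};

void Before(const std::vector<std::string>& keys, const std::string& target,
            int* hits) {
  // Capturing lambda stored in std::function: when the captured state exceeds
  // the implementation's small-buffer size, building the callback heap-allocates.
  std::function<void(const std::string&)> cb =
      [&target, hits](const std::string& key) {
        if (key == target) ++(*hits);
      };
  for (const auto& k : keys) cb(k);
}

void After(const std::vector<std::string>& keys, const std::string& target,
           int* hits) {
  CountMatches cb{&target, hits};  // plain stack object, no type erasure
  for (const auto& k : keys) cb(k);
}
```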
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10453
Test Plan:
CI
db_bench shows ~2-3% read improvement:
```
# before the change
TEST_TMPDIR=/tmp/dbbench4 ./db_bench_main --benchmarks=filluniquerandom,readrandom -compression_type=none -max_background_jobs=12 -num=10000000
readrandom : 8.528 micros/op 117265 ops/sec 85.277 seconds 10000000 operations; 13.0 MB/s (10000000 of 10000000 found)
# after the change
TEST_TMPDIR=/tmp/dbbench5 ./db_bench_new --benchmarks=filluniquerandom,readrandom -compression_type=none -max_background_jobs=12 -num=10000000
readrandom : 8.263 micros/op 121015 ops/sec 82.634 seconds 10000000 operations; 13.4 MB/s (10000000 of 10000000 found)
```
details: https://gist.github.com/jay-zhuang/5ac0628db8fc9cbcb499e056d4cb5918
Micro-benchmark shows a similar improvement ~1-2%:
before the change:
https://gist.github.com/jay-zhuang/9dc0ebf51bbfbf4af82f6193d43cf75b
after the change:
https://gist.github.com/jay-zhuang/fc061f1813cd8f441109ad0b0fe7c185
Reviewed By: ajkr
Differential Revision: D38345056
Pulled By: jay-zhuang
fbshipit-source-id: f3597aeeee338a804d37bf2e81386d5a100665e0
Summary:
- Right now each read fragments the memtable range tombstones (https://github.com/facebook/rocksdb/issues/4808). This PR explores the idea of fragmenting memtable range tombstones in the write path, so reads can just use the cached fragmented tombstones without any fragmenting cost. This PR only does the caching for immutable memtables, and does so right before a memtable is added to the immutable memtable list. The fragmentation is done without holding the mutex to minimize its performance impact.
- db_bench is updated to print out the number of range deletions executed if there is any.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10380
Test Plan:
- CI, added asserts in various places to check whether a fragmented range tombstone list should have been constructed.
- Benchmark: as this PR only optimizes immutable memtable path, the number of writes in the benchmark is chosen such an immutable memtable is created and range tombstones are in that memtable.
```
single thread:
./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=100000 --max_num_range_tombstones=100
multi_thread
./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=15000 --reads=20000 --threads=32 --max_num_range_tombstones=100
```
Commit 99cdf16464a057ca44de2f747541dedf651bae9e is included in benchmark result. It was an earlier attempt where tombstones are fragmented for each write operation. Reader threads share it using a shared_ptr which would slow down multi-thread read performance as seen in benchmark results.
Results are averaged over 5 runs.
Single thread result:
| Max # tombstones | main fillrandom micros/op | 99cdf16464a057ca44de2f747541dedf651bae9e | Post PR | main readrandom micros/op | 99cdf16464a057ca44de2f747541dedf651bae9e | Post PR |
| ------------- | ------------- |------------- |------------- |------------- |------------- |------------- |
| 0 |6.68 |6.57 |6.72 |4.72 |4.79 |4.54 |
| 1 |6.67 |6.58 |6.62 |5.41 |4.74 |4.72 |
| 10 |6.59 |6.5 |6.56 |7.83 |4.69 |4.59 |
| 100 |6.62 |6.75 |6.58 |29.57 |5.04 |5.09 |
| 1000 |6.54 |6.82 |6.61 |320.33 |5.22 |5.21 |
32-thread result: note that "Max # tombstones" is per thread.
| Max # tombstones | main fillrandom micros/op | 99cdf16464a057ca44de2f747541dedf651bae9e | Post PR | main readrandom micros/op | 99cdf16464a057ca44de2f747541dedf651bae9e | Post PR |
| ------------- | ------------- |------------- |------------- |------------- |------------- |------------- |
| 0 |234.52 |260.25 |239.42 |5.06 |5.38 |5.09 |
| 1 |236.46 |262.0 |231.1 |19.57 |22.14 |5.45 |
| 10 |236.95 |263.84 |251.49 |151.73 |21.61 |5.73 |
| 100 |268.16 |296.8 |280.13 |2308.52 |22.27 |6.57 |
Reviewed By: ajkr
Differential Revision: D37916564
Pulled By: cbi42
fbshipit-source-id: 05d6d2e16df26c374c57ddcca13a5bfe9d5b731e
Summary:
This PR is the first step in enhancing the coroutines MultiGet to be able to lookup a batch in parallel across levels. By having a separate TableReader function for probing the bloom filters, we can quickly figure out which overlapping keys from a batch are definitely not in the file and can move on to the next level.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10432
Reviewed By: akankshamahajan15
Differential Revision: D38245910
Pulled By: anand1976
fbshipit-source-id: 3d20db2350378c3fe6f086f0c7ba5ff01d7f04de
Summary:
If a secondary cache is configured, it's possible that a cache lookup will get a hit in the secondary cache. In that case, `LRUCacheShard::Lookup` doesn't immediately update the `total_charge` for the item handle if the `wait` parameter is false (i.e. the caller will check for completeness later). However, `BlockBasedTable::GetEntryFromCache` assumes the handle is complete and calls `UpdateCacheHitMetrics`, which checks the usage of the cache item and fails the assert in https://github.com/facebook/rocksdb/blob/main/cache/lru_cache.h#L237 (`assert(total_charge >= meta_charge)`).
To fix this, we call `UpdateCacheHitMetrics` later in `MultiGet`, after waiting for all cache lookup completions.
Test plan -
Run crash test with changes from https://github.com/facebook/rocksdb/issues/10160
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10440
Reviewed By: gitbw95
Differential Revision: D38283968
Pulled By: anand1976
fbshipit-source-id: 31c54ef43517726c6e5fdda81899b364241dd7e1
Summary:
RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them.
This task is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10309
Reviewed By: ltamasi
Differential Revision: D38211655
Pulled By: gangliao
fbshipit-source-id: 65ef33337db4d85277cc6f9782d67c421ad71dd5
Summary:
Unit tests still haven't been fixed. Also need to add more tests. But I ran some simple fillrandom db_bench and the partitioning feels reasonable.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10393
Test Plan:
1. Make sure existing tests pass. This should cover some basic sub compaction logic to be correct and the partitioning result is reasonable;
2. Add a new unit test to ApproximateKeyAnchors()
3. Run some db_bench with max_subcompaction = 4 and watch the compaction is indeed partitioned evenly.
Reviewed By: jay-zhuang
Differential Revision: D38043783
fbshipit-source-id: 085008e0f85f9b7c5abff7800307618320efb19f
Summary:
To help service owners to manage their memory budget effectively, we have been working towards counting all major memory users inside RocksDB towards a single global memory limit (see e.g. https://github.com/facebook/rocksdb/wiki/Write-Buffer-Manager#cost-memory-used-in-memtable-to-block-cache). The global limit is specified by the capacity of the block-based table's block cache, and is technically implemented by inserting dummy entries ("reservations") into the block cache. The goal of this task is to support charging the memory usage of the new blob cache against this global memory limit when the backing cache of the blob cache and the block cache are different.
This PR is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10321
Reviewed By: ltamasi
Differential Revision: D37913590
Pulled By: gangliao
fbshipit-source-id: eaacf23907f82dc7d18964a3f24d7039a2937a72
Summary:
This will be used for tiered storage to preclude hot data from
compacting to the cold tier (the last level).
Internally, this adds a seqno-to-time mapping. A periodic_task is scheduled
to record the current_seqno -> current_time mapping at a certain cadence. When a
memtable is flushed, the mapping information is stored in an sstable property.
During compaction, the mapping information is merged to get the
approximate time of a sequence number, which is used to determine if a key
was recently inserted, and to preclude it from the last level if it was
(within the `preclude_last_level_data_seconds`).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10338
Test Plan: CI
Reviewed By: siying
Differential Revision: D37810187
Pulled By: jay-zhuang
fbshipit-source-id: 6953be7a18a99de8b1cb3b162d712f79c2b4899f
Summary:
InternalKeyComparator is an internal class which is a simple wrapper around Comparator. https://github.com/facebook/rocksdb/pull/8336 made Comparator customizable. As a side effect, the internal key comparator was made configurable too. This introduces overhead to this simple wrapper. For example, every InternalKeyComparator will have an std::vector attached to it, which consumes memory and possibly allocation overhead too.
We remove InternalKeyComparator from being customizable by making InternalKeyComparator not a subclass of Comparator.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10342
Test Plan: Run existing CI tests and make sure it doesn't fail
Reviewed By: riversand963
Differential Revision: D37771351
fbshipit-source-id: 917256ee04b2796ed82974549c734fb6c4d8ccee
Summary:
InternalKeyComparator is a thin wrapper around the user comparator. Storing a string for the name is relatively expensive for this small wrapper in terms of both CPU and memory usage. Try to remove it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10343
Test Plan: Run existing tests
Reviewed By: ajkr
Differential Revision: D37772469
fbshipit-source-id: d2d106a8d022193058fd7f6b220108e3d94aca34
Summary:
I noticed it would clean up some things to have Cache::Insert()
return our MemoryLimit Status instead of Incomplete for the case in
which the capacity limit is reached. I suspect this fixes some existing but
unknown bugs where this Incomplete could be confused with other uses
of Incomplete, especially no_io cases. This is the most suspicious case I
noticed, but was not able to reproduce a bug, in part because the existing
code is not covered by unit tests (FIXME added): 57adbf0e91/table/get_context.cc (L397)
I audited all the existing uses of IsIncomplete and updated those that
seemed relevant.
HISTORY updated with a clear warning to users of strict_capacity_limit=true
to update uses of `IsIncomplete()`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10262
Test Plan: updated unit tests
Reviewed By: hx235
Differential Revision: D37473155
Pulled By: pdillinger
fbshipit-source-id: 4bd9d9353ccddfe286b03ebd0652df8ce20f99cb
Summary:
…ta blocks
During MyShadow testing, ajkr helped me find out that with partitioned index and dictionary compression enabled, `PartitionedIndexIterator::InitPartitionedIndexBlock()` spent a considerable amount of time (1-2% CPU) on fetching the uncompression dictionary. Fetching the uncompression dict was not needed since the index blocks were not compressed (and even if they were, they would use an empty dictionary). This should only affect use cases with partitioned index, dictionary compression, and without the uncompression dictionary pinned. This PR updates NewDataBlockIterator to not fetch the uncompression dictionary when it is not for data blocks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10310
Test Plan:
1. `make check`
2. Perf benchmark: 1.5% (143950 -> 146176) improvement in op/sec for partitioned index + dict compression benchmark.
For default config without partitioned index and without dict compression, there is no regression in readrandom perf from multiple runs of db_bench.
```
# Set up for partitioned index with dictionary compression
TEST_TMPDIR=/dev/shm ./db_bench_main -benchmarks=filluniquerandom,compact -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -partition_index=true -compression_max_dict_bytes=16384 -compression_zstd_max_train_bytes=1638400
# Pre PR
TEST_TMPDIR=/dev/shm ./db_bench_main -use_existing_db=true -benchmarks=readrandom[-X50] -partition_index=true
readrandom [AVG 50 runs] : 143950 (± 1108) ops/sec; 15.9 (± 0.1) MB/sec
readrandom [MEDIAN 50 runs] : 144406 ops/sec; 16.0 MB/sec
# Post PR
TEST_TMPDIR=/dev/shm ./db_bench_opt -use_existing_db=true -benchmarks=readrandom[-X50] -partition_index=true
readrandom [AVG 50 runs] : 146176 (± 1121) ops/sec; 16.2 (± 0.1) MB/sec
readrandom [MEDIAN 50 runs] : 146014 ops/sec; 16.2 MB/sec
# Set up for no partitioned index and no dictionary compression
TEST_TMPDIR=/dev/shm/baseline ./db_bench_main -benchmarks=filluniquerandom,compact -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false
# Pre PR
TEST_TMPDIR=/dev/shm/baseline/ ./db_bench_main --use_existing_db=true "--benchmarks=readrandom[-X50]"
readrandom [AVG 50 runs] : 158546 (± 1000) ops/sec; 17.5 (± 0.1) MB/sec
readrandom [MEDIAN 50 runs] : 158280 ops/sec; 17.5 MB/sec
# Post PR
TEST_TMPDIR=/dev/shm/baseline/ ./db_bench_opt --use_existing_db=true "--benchmarks=readrandom[-X50]"
readrandom [AVG 50 runs] : 161061 (± 1520) ops/sec; 17.8 (± 0.2) MB/sec
readrandom [MEDIAN 50 runs] : 161596 ops/sec; 17.9 MB/sec
```
Reviewed By: ajkr
Differential Revision: D37631358
Pulled By: cbi42
fbshipit-source-id: 6ca2665e270e63871968e061ba4a99d3136785d9