Summary:
The patch cleans up the handling of cache hit/miss/insertion related
performance counters, get context counters, and statistics by
eliminating some code duplication and factoring out the affected logic
into separate methods. In addition, it makes the semantics of cache hit
metrics more consistent by changing the code so that accessing a
partition of partitioned indexes/filters through a pinned reference no
longer counts as a cache hit.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5408
Differential Revision: D15610883
Pulled By: ltamasi
fbshipit-source-id: ee749c18965077aca971d8f8bee8b24ed8fa76f1
Summary:
It's useful to be able to (optionally) associate key-value pairs with user-provided timestamps. This PR is an early effort towards this goal and continues the work of facebook#4942. A suite of new unit tests exist in DBBasicTestWithTimestampWithParam. Support for timestamp requires the user to provide timestamp as a slice in `ReadOptions` and `WriteOptions`. All timestamps of the same database must share the same length, format, etc. The format of the timestamp is the same throughout the same database, and the user is responsible for providing a comparator function (Comparator) to order the <key, timestamp> tuples. Once created, the format and length of the timestamp cannot change (at least for now).
Test plan (on devserver):
```
$COMPILE_WITH_ASAN=1 make -j32 all
$./db_basic_test --gtest_filter=Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/*
$make check
```
All tests must pass.
We also run the following db_bench tests to verify whether there is regression on Get/Put while timestamp is not enabled.
```
$TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillseq,readrandom -num=1000000
$TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=1000000
```
Repeat for 6 times for both versions.
Results are as follows:
```
| | readrandom | fillrandom |
| master | 16.77 MB/s | 47.05 MB/s |
| PR5079 | 16.44 MB/s | 47.03 MB/s |
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5079
Differential Revision: D15132946
Pulled By: riversand963
fbshipit-source-id: 833a0d657eac21182f0f206c910a6438154c742c
Summary:
The commit makes GetEntryFromCache become a member function. It also makes all its callers become member functions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5394
Differential Revision: D15579222
Pulled By: HaoyuHuang
fbshipit-source-id: 07509c42ee9022dcded54950012bd3bd562aa1ae
Summary:
Many methods are passing around pointers to non-const objects when in fact
they do not/should not modify said objects. The patch makes the semantics
clearer and also helps from a thread safety point-of-view by changing some
pointers to pointers-to-const and marking some instance methods as const.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5383
Differential Revision: D15562770
Pulled By: ltamasi
fbshipit-source-id: 89361dadbb8b25bbe54d17e8da28fee24a2419af
Summary:
Currently, when the block cache is used for index blocks as well, it is
not really the index block that is stored in the cache but an
IndexReader object. Since this object is not pure data (it has, for
instance, pointers that might dangle), it's not really sharable. To
avoid the issues around this, the current code uses a dummy unique cache
key for each TableReader to store the IndexReader, and erases the
IndexReader entry when the TableReader is closed. Instead of doing this,
the new code moves the IndexReader out of the cache altogether. In
particular, instead of the TableReader owning, or caching/pinning the
IndexReader based on the customer's settings, the TableReader
unconditionally owns the IndexReader, which in turn owns/caches/pins
the index block (which is itself sharable and thus can be safely put in
the cache without any hacks).
Note: the change has two side effects:
1) Partitions of partitioned indexes no longer affect the read
amplification statistics.
2) Eviction statistics for index blocks are temporarily broken. We plan to fix
this in a separate phase.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5298
Differential Revision: D15303203
Pulled By: ltamasi
fbshipit-source-id: 935a69ba59d87d5e44f42e2310619b790c366e47
Summary:
There are too many types of files under util/. Some test related files don't belong to there or just are just loosely related. Mo
ve them to a new directory test_util/, so that util/ is cleaner.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5377
Differential Revision: D15551366
Pulled By: siying
fbshipit-source-id: 0f5c8653832354ef8caa31749c0143815d719e2c
Summary:
Previously if iterator upper/lower bound presents, `DBIter` will check the bound for every key. This patch turns the check into per-file or per-data block check when applicable, by checking against either file largest/smallest key or block index key.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5111
Differential Revision: D15330061
Pulled By: siying
fbshipit-source-id: 8a653fe3cd50d94d81eb2d13b087326c58ee2024
Summary:
https://github.com/facebook/rocksdb/pull/5256 broke it: `block_iter_.user_key()` may not be valid even if `block_iter_points_to_real_block_` is true. E.g. if there was an IO error or Status::Incomplete.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5291
Differential Revision: D15273324
Pulled By: al13n321
fbshipit-source-id: 442e5b09f9884a58f92a6ac1ca93af719c219886
Summary:
CachableEntry is used in a variety of contexts: it may refer to a cached
object (i.e. an object in the block cache), an owned object, or an
unowned object; also, in some cases (most notably with iterators), the
responsibility of managing the pointed-to object gets handed off to
another object. Each of the above scenarios have different implications
for the lifecycle of the referenced object. For the most part, the patch
does not change the lifecycle of managed objects; however, it makes
these relationships explicit, and it also enables us to eliminate some
hacks and accident-prone code around releasing cache handles and
deleting/cleaning up objects. (The only places where the patch changes
how an objects are managed are the partitions of partitioned indexes and
filters.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5252
Differential Revision: D15101358
Pulled By: ltamasi
fbshipit-source-id: 9eb59e9ae5a7230e3345789762d0ba1f189485be
Summary:
Right now there is a potential race condition where two threads are created to iterate through the DB (https://gist.github.com/miasantreble/88f5798a397ee7cb8e7baff9db2d9e85). The problem is that in `BlockBasedTable::NewIndexIterator`, if both threads failed to find index_reader from block cache, they will call `CreateIndexReader->UpdateIndexType()` which creates a race to update `index_type` in the shared rep_ object. By checking the code, we realize the index type is always populated by `PrefetchIndexAndFilterBlocks` during the table `Open` call, so there is no need to update index type every time during iterator creation. This PR attempts to fix the race condition by removing the unnecessary call to `UpdateIndexType`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5288
Differential Revision: D15252509
Pulled By: miasantreble
fbshipit-source-id: 6e3258652121d5c76d267f7ac457e15c5e84756e
Summary:
Right now, when Seek() is called again, RocksDB always does a binary search against the files and index blocks, even if they end up with the same file/block. Improve it as following:
1. in LevelIterator, reseek first try to check the boundary of the current file. If it falls into the same file, skip the binary search to find the file
2. in block based table iterator, reseek skip to reseek the iterator block if the seek key is larger than the current key and lower than the index key (boundary of the current block and the next block).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5256
Differential Revision: D15105072
Pulled By: siying
fbshipit-source-id: 39634bdb4a881082451fa39cecd7ecf12160bf80
Summary:
At least one of the meta-block loading functions (`ReadRangeDelBlock`)
uses the same block reading function (`NewDataBlockIterator`) as data
block reads, which means it uses the dictionary handle. However, the
dictionary handle was uninitialized while reading meta-blocks, causing
readers to receive an error. This situation was only noticed when
`cache_index_and_filter_blocks=true`.
This PR initializes the handle to null while reading meta-blocks to
prevent the error. It also adds support to `db_stress` /
`db_crashtest.py` for `cache_index_and_filter_blocks`.
Fixes#5263.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5267
Differential Revision: D15149264
Pulled By: maysamyabandeh
fbshipit-source-id: 991d38a306c62db5976778bfb050fa3cd4a0671b
Summary:
Improve the iterators performance when the user explicitly sets the readahead size via `ReadOptions.readahead_size`.
1. Stop creating new table readers when the user explicitly sets readahead size.
2. Make use of an internal buffer based on `FilePrefetchBuffer` instead of using `ReadaheadRandomAccessFileReader`, to handle the user readahead requests (for both buffered and direct io cases).
3. Add `readahead_size` to db_bench.
**Benchmarks:**
https://gist.github.com/sagar0/53693edc320a18abeaeca94ca32f5737
For 1 MB readahead, Buffered IO performance improves by 28% and Direct IO performance improves by 50%.
For 512KB readahead, Buffered IO performance improves by 30% and Direct IO performance improves by 67%.
**Test Plan:**
Updated `DBIteratorTest.ReadAhead` test to make sure that:
- no new table readers are created for iterators on setting ReadOptions.readahead_size
- At least "readahead" number of bytes are actually getting read on each iterator read.
TODO later:
- Use similar logic for compactions as well.
- This ties in nicely with #4052 and paves the way for removing ReadaheadRandomAcessFile later.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5246
Differential Revision: D15107946
Pulled By: sagar0
fbshipit-source-id: 2c1149729ca7d779e4e8b7710ba6f4e8cbfd3bea
Summary:
In long scans, virtual function calls of Next(), Valid(), key() and value() are not trivial. By introducing NextAndGetResult(), Some of the Next(), Valid() and key() calls are consolidated into one virtual function call to reduce CPU.
Also did some inline tricks and add some "final" randomly in some functions. Even without the "final" annotation, most Next() calls are inlined with -O3, but sometimes with a final it is inlined by O2 too. It doesn't hurt to add those final annotations.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5197
Differential Revision: D14945977
Pulled By: siying
fbshipit-source-id: 7003969f9a5f1d5717f0bda503b91d19ba75ed88
Summary:
This is second attempt for #5101. Original commit message:
`BlockBasedTableIterator` avoid reading next block on `Next()` if it detects the iterator will be out of bound, by checking against index key. The optimization was added in #2239, and by the time it only check the bound per block. It seems later change make it a per-key check, which introduce unnecessary key comparisons.
This patch come with two fixes:
Fix 1: To optimize checking for bounds, we need comparing the bounds with index key as well. However BlockBasedTableIterator doesn't know whether its index iterator is internally using user keys or internal keys. The patch fixes that by extending InternalIterator with a user_key() function that is overridden by In IndexBlockIter.
Fix 2: In #5101 we return `IsOutOfBound()=true` when block index key is out of bound. But the index key can be larger than smallest key of the next file on the level. That file can be within upper bound and should not be filtered out.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5142
Differential Revision: D14907113
Pulled By: siying
fbshipit-source-id: ac95775c5b4e7b700f76ab43e39f45402c98fbfb
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
Summary:
The uncompression dictionary object has a Statistics pointer that might
dangle if the database closed. This patch evicts the dictionary from the
block cache when a table is closed, similarly to how index and filter
readers are handled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5150
Differential Revision: D14782422
Pulled By: ltamasi
fbshipit-source-id: 0cec9336c742c479aa92206e04521767f1aa9622
Summary:
This reverts commit f29dc1b906.
In BlockBasedTableIterator, index_iter_->key() is sometimes a user key, so it is wrong to call ExtractUserKey() against it. This is a bug introduced by #5101.
Temporarily revert the diff to keep the branch clean.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5132
Differential Revision: D14718584
Pulled By: siying
fbshipit-source-id: 0ac55dc9b5dbc18c7809092146bdf7eb9364b9ad
Summary:
`BlockBasedTableIterator` avoid reading next block on `Next()` if it detects the iterator will be out of bound, by checking against index key. The optimization was added in #2239, and by the time it only check the bound per block. It seems later change make it a per-key check, which introduce unnecessary key comparisons.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5101
Differential Revision: D14678707
Pulled By: siying
fbshipit-source-id: 2372446116753c7892ea4cec7b4b49ef87ba463e
Summary:
Following files were run through automatic formatter:
db/db_impl.cc
db/db_impl.h
db/db_impl_compaction_flush.cc
db/db_impl_debug.cc
db/db_impl_files.cc
db/db_impl_readonly.h
db/db_impl_write.cc
db/dbformat.cc
db/dbformat.h
table/block.cc
table/block.h
table/block_based_filter_block.cc
table/block_based_filter_block.h
table/block_based_filter_block_test.cc
table/block_based_table_builder.cc
table/block_based_table_reader.cc
table/block_based_table_reader.h
table/block_builder.cc
table/block_builder.h
table/block_fetcher.cc
table/block_prefix_index.cc
table/block_prefix_index.h
table/block_test.cc
table/format.cc
table/format.h
I could easily run all the files, but I don't want people to feel that
I'm doing it for lines of code changes :)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5114
Differential Revision: D14633040
Pulled By: siying
fbshipit-source-id: 3f346cb53bf21e8c10704400da548dfce1e89a52
Summary:
Currently `perf_context.user_key_comparison_count` is bump only in `InternalKeyComparator`. For places user comparator is used directly the counter is not bump. Fixing the majority of it.
Index iterator and filter code also use user comparator directly and don't bump the counter. It is not fixed in this patch.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5098
Differential Revision: D14603753
Pulled By: siying
fbshipit-source-id: 1cd41035644ca9e49b97a51030a5d1e15f5f3cae
Summary:
The code convention we are following, Google C++ Style, discourage
alias in header files, especially public headers:
https://google.github.io/styleguide/cppguide.html#Aliases
Remove some of them. Might removed some from .cc files as well to be consistent.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5113
Differential Revision: D14633030
Pulled By: siying
fbshipit-source-id: b990edc919d5de60295992284f980195e501d424
Summary:
Since `SstFileReader` don't know largest seqno of a file, it will fail this check when it open a file with global seqno: ca89ac2ba9/table/block_based_table_reader.cc (L730)
Changes:
* Pass largest_seqno=kMaxSequenceNumber from `SstFileReader` and allow it to bypass the above check.
* `BlockBasedTable::VerifyChecksum` also double check if checksum will match when excluding global seqno (this is to make the new test in sst_table_reader_test pass).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5097
Differential Revision: D14607434
Pulled By: riversand963
fbshipit-source-id: 9008599227c5fccbf9b73fee46b3bf4a1523f023
Summary:
Always enable properties block checksum verification for block-based table. For external SST file ingested with 'write_global_seqno==true', we use 'DecodeEntrySlow' to parse its blocks' contents so that the process will not die upon failing the assertion possibly caused by corruption.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4956
Differential Revision: D14012741
Pulled By: riversand963
fbshipit-source-id: 8b766e6f54b36f8f9e074c0e19e0926ec3cce186
Summary:
- If block cache disabled or not used for meta-blocks, `BlockBasedTableReader::Rep::uncompression_dict` owns the `UncompressionDict`. It is preloaded during `PrefetchIndexAndFilterBlocks`.
- If block cache is enabled and used for meta-blocks, block cache owns the `UncompressionDict`, which holds dictionary and digested dictionary when needed. It is never prefetched though there is a TODO for this in the code. The cache key is simply the compression dictionary block handle.
- New stats for compression dictionary accesses in block cache: "BLOCK_CACHE_COMPRESSION_DICT_*" and "compression_dict_block_read_count"
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4881
Differential Revision: D13663801
Pulled By: ajkr
fbshipit-source-id: bdcc54044e180855cdcc57639b493b0e016c9a3f
Summary:
The pointer `get_context` was passed as the value for the boolean argument `index_key_is_full`. Luckily the pointer was always non-null so evaluated to true which is the correct value for the boolean argument. But we were missing out on batch updates to stats since we were not passing anything for the `GetContext*` argument and it defaults to `nullptr`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4913
Differential Revision: D13791449
Pulled By: ajkr
fbshipit-source-id: dbe40bf406c64d34cb5298604145d18b9e0ca9be
Summary:
This is essentially a re-submission of #4251 with a few improvements:
- Split `CompressionDict` into two separate classes: `CompressionDict` and `UncompressionDict`
- Eliminated `Init` functions. Instead do all initialization work in constructors.
- Added test case for parallel DB open, which is the scenario where #4251 failed under TSAN.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4849
Differential Revision: D13606039
Pulled By: ajkr
fbshipit-source-id: 08c236059798c710db9cbf545fce0f371232d447
Summary:
Add a new per level counter for block cache hits, increase it by one on every successful attempt to get an entry from cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4796
Differential Revision: D13513688
Pulled By: zinoale
fbshipit-source-id: 104df038f1232e3356e162eb2d8ca138e34a8281
Summary:
Previously we were cleaning up range tombstone meta-block by calling `ReleaseCachedEntry`, which wouldn't work if `value != nullptr && cache_handle == nullptr`. This happened at least in the case with mmap reads and block cache both enabled. I noticed `NewDataBlockIterator` intends to handle all these cases, so migrated to that instead of `NewUnfragmentedRangeTombstoneIterator`.
Also changed the table-opening logic to fail on `ReadRangeDelBlock` failure, since that can cause data corruption. Added a test case to verify this behavior. Note the test case does not fail on `TryReopen` because failure to preload table handlers is not considered critical. However, it does fail on any read involving that file since it cannot return correct data.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4810
Differential Revision: D13534296
Pulled By: ajkr
fbshipit-source-id: 55dde1111717cea6ec4bf38418daab81ccef3599
Summary:
To support the flush/compaction use cases of RangeDelAggregator
in v2, FragmentedRangeTombstoneIterator now supports dropping tombstones
that cannot be read in the compaction output file. Furthermore,
FragmentedRangeTombstoneIterator supports the "snapshot striping" use
case by allowing an iterator to be split by a list of snapshots.
RangeDelAggregatorV2 will use these changes in a follow-up change.
In the process of making these changes, other miscellaneous cleanups
were also done in these files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4740
Differential Revision: D13287382
Pulled By: abhimadan
fbshipit-source-id: f5aeb03e1b3058049b80c02a558ee48f723fa48c
Summary:
Add counters to track block cache index/filter hits and misses. We currently count aggregate hits and misses, which includes index/filter/data blocks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4540
Differential Revision: D10459652
Pulled By: anand1976
fbshipit-source-id: 0c59eee7f12f5103dcb6686f0e7995babe63d425
Summary:
Refactored and simplified `BlockBasedTable::Open` to be similar to `BlockBasedTableBuilder::Finish` as both these functions complement each other. Also added `BlockBasedTableBuilder::WriteFooter` along the way.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4636
Differential Revision: D12933319
Pulled By: sagar0
fbshipit-source-id: 1ff1d02f6d80a63b5ba720a1fc75e71c7344137b
Summary:
Fix block based table reader not using memory_allocator when allocating index blocks and compression dictionary blocks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4678
Differential Revision: D13054594
Pulled By: yiwu-arbug
fbshipit-source-id: 379f25bcc665395662511c4f873f4b7b55104ce2
Summary:
Removed `one_time_use` flag, which removed the need for some
tests, and changed all `NewRangeTombstoneIterator` methods to return
`FragmentedRangeTombstoneIterators`.
These changes also led to removing `RangeDelAggregatorV2::AddUnfragmentedTombstones`
and one of the `MemTableListVersion::AddRangeTombstoneIterators` methods.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4692
Differential Revision: D13106570
Pulled By: abhimadan
fbshipit-source-id: cbab5432d7fc2d9cdfd8d9d40361a1bffaa8f845
Summary:
…ons (#4676)"
This reverts commit b32d087dbb.
`MemoryAllocator` needs to be with `Cache`, since cache entry can
outlive DB and block based table. The cache needs to hold reference to
memory allocator when deleting cache entry.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4697
Differential Revision: D13133490
Pulled By: yiwu-arbug
fbshipit-source-id: 8ef7e8a51263bfd929f892fd062665ff4ce9ce5a
Summary:
Rather than storing a `vector<RangeTombstone>`, we now store a
`vector<RangeTombstoneStack>` and a `vector<SequenceNumber>`. A
`RangeTombstoneStack` contains the start and end keys of a range tombstone
fragment, and indices into the seqnum vector to indicate which sequence
numbers the fragment is located at. The diagram below illustrates an
example:
```
tombstones_: [a, b) [c, e) [h, k)
| \ / \ / |
| \ / \ / |
v v v v
tombstone_seqs_: [ 5 3 10 7 2 8 6 ]
```
This format allows binary searching the tombstone list to use less key
comparisons, which helps in cases where there are many overlapping
tombstones. Also, this format makes it easier to add DBIter-like
semantics to `FragmentedRangeTombstoneIterator` in the future.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4632
Differential Revision: D13053103
Pulled By: abhimadan
fbshipit-source-id: e8220cc712fcf5be4d602913bb23ace8ea5f8ef0
Summary:
We carry compression type and "cachable" variables for every block in the block cache, while they take well-known values. 8-byte is wasted for each block (2-byte for useful information but it takes 8 bytes because of padding). With this change, these two variables are removed.
The cachable information is only useful in the process of reading the block. We use other information to infer from it. For compressed blocks, the compression type is a part of the block content itself so we can get it from there.
Some code is slightly refactored so that the cachable information can flow better.
Another change is to only use class BlockContents for compressed block, and narrow the class Block to only be used for uncompressed blocks, including blocks in compressed block cache. This can make the Block class less confusing. It also saves tens of bytes for each block in compressed block cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4650
Differential Revision: D12969070
Pulled By: siying
fbshipit-source-id: 548b62724e9eb66993026429fd9c7c3acd1f95ed
Summary:
Per offline discussion with siying, `MemoryAllocator` and `Cache` should be decouple. The idea is that memory allocator handles memory allocation, while cache handle cache policy.
It is normal that external cache libraries pack couple the two components for better optimization. If we want to integrate with such library in the future, we can make a wrapper of the library implementing both `Cache` and `MemoryAllocator` interface.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4676
Differential Revision: D13047662
Pulled By: yiwu-arbug
fbshipit-source-id: cd42e246d80ab600b4de47d073f7d2db308ce6dd
Summary:
Ran the following commands to recursively change all the files under RocksDB:
```
find . -type f -name "*.cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} +
```
Running `make format` updated some formatting on the files touched.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638
Differential Revision: D12934992
Pulled By: sagar0
fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8
Summary:
Rename the interface, as it is mean to be a generic interface for memory allocation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4590
Differential Revision: D10866340
Pulled By: yiwu-arbug
fbshipit-source-id: 85cb753351a40cb856c046aeaa3f3b369eef3d16
Summary:
This allows tombstone fragmenting to only be performed when the table is opened, and cached for subsequent accesses.
On the same DB used in #4449, running `readrandom` results in the following:
```
readrandom : 0.983 micros/op 1017076 ops/sec; 78.3 MB/s (63103 of 100000 found)
```
Now that Get performance in the presence of range tombstones is reasonable, I also compared the performance between a DB with range tombstones, "expanded" range tombstones (several point tombstones that cover the same keys the equivalent range tombstone would cover, a common workaround for DeleteRange), and no range tombstones. The created DBs had 5 million keys each, and DeleteRange was called at regular intervals (depending on the total number of range tombstones being written) after 4.5 million Puts. The table below summarizes the results of a `readwhilewriting` benchmark (in order to provide somewhat more realistic results):
```
Tombstones? | avg micros/op | stddev micros/op | avg ops/s | stddev ops/s
----------------- | ------------- | ---------------- | ------------ | ------------
None | 0.6186 | 0.04637 | 1,625,252.90 | 124,679.41
500 Expanded | 0.6019 | 0.03628 | 1,666,670.40 | 101,142.65
500 Unexpanded | 0.6435 | 0.03994 | 1,559,979.40 | 104,090.52
1k Expanded | 0.6034 | 0.04349 | 1,665,128.10 | 125,144.57
1k Unexpanded | 0.6261 | 0.03093 | 1,600,457.50 | 79,024.94
5k Expanded | 0.6163 | 0.05926 | 1,636,668.80 | 154,888.85
5k Unexpanded | 0.6402 | 0.04002 | 1,567,804.70 | 100,965.55
10k Expanded | 0.6036 | 0.05105 | 1,667,237.70 | 142,830.36
10k Unexpanded | 0.6128 | 0.02598 | 1,634,633.40 | 72,161.82
25k Expanded | 0.6198 | 0.04542 | 1,620,980.50 | 116,662.93
25k Unexpanded | 0.5478 | 0.0362 | 1,833,059.10 | 121,233.81
50k Expanded | 0.5104 | 0.04347 | 1,973,107.90 | 184,073.49
50k Unexpanded | 0.4528 | 0.03387 | 2,219,034.50 | 170,984.32
```
After a large enough quantity of range tombstones are written, range tombstone Gets can become faster than reading from an equivalent DB with several point tombstones.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4493
Differential Revision: D10842844
Pulled By: abhimadan
fbshipit-source-id: a7d44534f8120e6aabb65779d26c6b9df954c509
Summary:
PR https://github.com/facebook/rocksdb/pull/4226 introduced per-level perf context which allows breaking down perf context by levels.
This PR takes advantage of the feature to populate a few counters related to bloom filters
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4581
Differential Revision: D10518010
Pulled By: miasantreble
fbshipit-source-id: 011244561783ec860d32d5b0fa6bce6e78d70ef8
Summary:
Current implementation of perf context is level agnostic. Making it hard to do performance evaluation for the LSM tree. This PR adds `PerfContextByLevel` to decompose the counters by level.
This will be helpful when analyzing point and range query performance as well as tuning bloom filter
Also replaced __thread with thread_local keyword for perf_context
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4226
Differential Revision: D10369509
Pulled By: miasantreble
fbshipit-source-id: f1ced4e0de5fcebdb7f9cff36164516bc6382d82
Summary:
This is a conceptually simple change, but it touches many files to
pass the allocator through function calls.
We introduce CacheAllocator, which can be used by clients to configure
custom allocator for cache blocks. Our motivation is to hook this up
with folly's `JemallocNodumpAllocator`
(f43ce6d686/folly/experimental/JemallocNodumpAllocator.h),
but there are many other possible use cases.
Additionally, this commit cleans up memory allocation in
`util/compression.h`, making sure that all allocations are wrapped in a
unique_ptr as soon as possible.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4437
Differential Revision: D10132814
Pulled By: yiwu-arbug
fbshipit-source-id: be1343a4b69f6048df127939fea9bbc96969f564