Summary:
Currently in MergingIterator, for each range tombstone start and end key, we pop one end from the heap and then push the other end onto it. This pays for both a downheap and an upheap. In the likely case that a range tombstone iterator emits keys that are close together, consecutive keys should occupy a similar position among all keys in the heap. This can happen when there is a burst of consecutive range tombstones and most of the keys they cover have already been dropped. This PR uses `replace_top()` when inserting new range tombstone keys, which overwrites the top element in place and needs only a single downheap, so it is more efficient in these common cases.
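To illustrate the cost difference, here is a minimal sketch of a binary min-heap that counts key comparisons. `CountingMinHeap`, `make_heap_of`, `cost_pop_then_push`, and `cost_replace_top` are hypothetical names made up for this example; this is not RocksDB's actual heap or MergingIterator code, and it uses plain `int` keys instead of internal keys.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical illustration only; not RocksDB's heap implementation.
class CountingMinHeap {
 public:
  std::size_t comparisons = 0;  // number of key comparisons performed

  int top() const {
    assert(!data_.empty());
    return data_.front();
  }

  void push(int v) {  // append at the bottom, then upheap
    data_.push_back(v);
    sift_up(data_.size() - 1);
  }

  void pop() {  // move the last element to the root, then downheap
    assert(!data_.empty());
    data_.front() = data_.back();
    data_.pop_back();
    if (!data_.empty()) sift_down(0);
  }

  // Overwrite the root in place and restore heap order with one downheap.
  void replace_top(int v) {
    assert(!data_.empty());
    data_.front() = v;
    sift_down(0);
  }

 private:
  bool less(int a, int b) {
    ++comparisons;
    return a < b;
  }

  void sift_up(std::size_t i) {
    while (i > 0) {
      std::size_t parent = (i - 1) / 2;
      if (!less(data_[i], data_[parent])) break;
      std::swap(data_[i], data_[parent]);
      i = parent;
    }
  }

  void sift_down(std::size_t i) {
    const std::size_t n = data_.size();
    while (true) {
      std::size_t smallest = i;
      const std::size_t l = 2 * i + 1;
      const std::size_t r = l + 1;
      if (l < n && less(data_[l], data_[smallest])) smallest = l;
      if (r < n && less(data_[r], data_[smallest])) smallest = r;
      if (smallest == i) break;
      std::swap(data_[i], data_[smallest]);
      i = smallest;
    }
  }

  std::vector<int> data_;
};

// Builds a heap holding 10, 20, ..., 10 * n and resets the counter.
CountingMinHeap make_heap_of(int n) {
  CountingMinHeap h;
  for (int i = 1; i <= n; ++i) h.push(10 * i);
  h.comparisons = 0;  // only count the work done after setup
  return h;
}

// Comparisons needed to swap the top for new_key via pop() + push().
std::size_t cost_pop_then_push(int n, int new_key) {
  CountingMinHeap h = make_heap_of(n);
  h.pop();
  h.push(new_key);
  return h.comparisons;
}

// Comparisons needed to swap the top for new_key via replace_top().
std::size_t cost_replace_top(int n, int new_key) {
  CountingMinHeap h = make_heap_of(n);
  h.replace_top(new_key);
  return h.comparisons;
}
```

When the new key sorts close to the old top (for example, replacing 10 with 15 while the next smallest key is 20), `replace_top()` stops after comparing the root with its two children. The pop-then-push path first sinks the last element from the root down the tree and then bubbles the new key all the way back up.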
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10877
Test Plan:
- existing unit tests
- ran all flavors of the stress test through Sandcastle
- benchmark:
```
# Set up: --writes_per_range_tombstone=1 means one point write and one delete range
TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone ./db_bench --benchmarks=fillseq,levelstats --writes_per_range_tombstone=1 --max_num_range_tombstones=1000000 --range_tombstone_width=2 --num=100000000 --writes=800000 --max_bytes_for_level_base=4194304 --disable_auto_compactions --write_buffer_size=33554432 --key_size=64
Level Files Size(MB)
--------------------
  0      8      152
  1      0        0
  2      0        0
  3      0        0
  4      0        0
  5      0        0
  6      0        0
# Benchmark
TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone/ ./db_bench --benchmarks=readseq[-W1][-X5],levelstats --use_existing_db=true --cache_size=3221225472 --num=100000000 --reads=1000000 --disable_auto_compactions=true --avoid_flush_during_recovery=true
# Pre PR
readseq [AVG 5 runs] : 1432116 (± 59664) ops/sec; 224.0 (± 9.3) MB/sec
readseq [MEDIAN 5 runs] : 1454886 ops/sec; 227.5 MB/sec
# Post PR
readseq [AVG 5 runs] : 1944425 (± 29521) ops/sec; 304.1 (± 4.6) MB/sec
readseq [MEDIAN 5 runs] : 1959430 ops/sec; 306.5 MB/sec
```
Reviewed By: ajkr
Differential Revision: D40710936
Pulled By: cbi42
fbshipit-source-id: cb782fb9cdcd26c0c3eb9443215a4ef4d2f79022
Changyu Bi, 2 years ago, committed by Facebook GitHub Bot