Summary:
**Context/Summary:**
https://github.com/facebook/rocksdb/pull/10777 mistakenly added a history entry under 7.8 release but the PR is not included in 7.8. This mistake was due to rebase and merge didn't realize it was a conflict when "## Unreleased" was changed to "## 7.8.0 (10/22/2022)".
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10898
Test Plan: Make check
Reviewed By: akankshamahajan15
Differential Revision: D40861001
Pulled By: hx235
fbshipit-source-id: b2310c95490f6ebb90834a210c965a74c9560b51
Summary:
When there is a column family that doesn't get any traffic, its stats are still dumped when options.options.stats_dump_period_sec triggers. This sometimes spam the information logs. With this change, we skip the printing if there is not change, until 8 periods.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10891
Test Plan: Manually test the behavior with hacked db_bench setups.
Reviewed By: jay-zhuang
Differential Revision: D40777183
fbshipit-source-id: ef0b9a793e4f6282df099b464f01d1fb4c5a2cab
Summary:
Currently, a memtable's stats `num_deletes_` is incremented only if the entry is a regular delete (kTypeDeletion). We need to fix it by accounting for kTypeSingleDeletion and kTypeDeletionWithTimestamp.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10886
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D40740754
Pulled By: riversand963
fbshipit-source-id: 7bde62cd6df136585bc5bfb1c426c7a8276c08e1
Summary:
The for loop is marked as unreachable code because it will never call the increment. Switch it to `if`.
```
\table\merging_iterator.cc(823): error C2220: the following warning is treated as an error
\table\merging_iterator.cc(823): warning C4702: unreachable code
\table\merging_iterator.cc(1030): error C2220: the following warning is treated as an error
\table\merging_iterator.cc(1030): warning C4702: unreachable code
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10897
Reviewed By: cbi42
Differential Revision: D40811790
Pulled By: jay-zhuang
fbshipit-source-id: fe8fd3e7cf3d6f710360c402b79763854d5120df
Summary:
The motivation and benefit of current behavior of passing `LockInfo&&` as argument to AcquireLocked() and AcquireWithTimeout() is not clear to me. Furthermore, in AcquireWithTimeout(), we access members of `LockInfo&&` after it is passed to AcquireLocked() as rvalue ref. In addition, we may call `AcquireLocked()` with `std::move(lock_info)` multiple times.
This leads to linter warning of use-after-move. If future implementation of AcquireLocked() does something like moving-construct a new `LockedInfo` using the passed-in `LockInfo&&`, then the caller cannot use it because `LockInfo` has a member of type `autovector`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10874
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D40704210
Pulled By: riversand963
fbshipit-source-id: 20091df65b4fc63b072bcec9809efc49955d6d35
Summary:
The PR fixes the handling of `Merge`s in `GetEntity`. Note that `Merge` is not yet
supported for wide-column entities written using `PutEntity`; this change is
about returning correct (i.e. consistent with `Get`) results in cases like when the
base value is a plain old key-value written using `Put` or when there is no real base
value because we hit either a tombstone or the beginning of history.
Implementation-wise, the patch introduces a new wrapper around the existing
`MergeHelper::TimedFullMerge` that can store the merge result in either a string
(for the purposes of `Get`) or a `PinnableWideColumns` instance (for `GetEntity`).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10894
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D40782708
Pulled By: ltamasi
fbshipit-source-id: 3d700d56b2ef81f02ba1e2d93f6481bf13abcc90
Summary:
This option allows a custom cleanup command line for a non-Posix file system to be used by db_crashtest.py to cleanup between runs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10883
Test Plan: Run the whitebox crash test
Reviewed By: pdillinger
Differential Revision: D40726424
Pulled By: anand1976
fbshipit-source-id: b827f6b583ff78f9ca75ced2d96f7e58f5200432
Summary:
It's unsafe to call `malloc_usable_size` with an address not returned by a function from the `malloc` family (see https://github.com/facebook/rocksdb/issues/10798). The patch switches from using `new[]` / `delete[]` for `LRUHandle` to `malloc` / `free`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10884
Test Plan: `make check`
Reviewed By: pdillinger
Differential Revision: D40738089
Pulled By: ltamasi
fbshipit-source-id: ac5583f88125fee49c314639be6b6df85937fbee
Summary:
Right now in MergingIterator, for each range tombstone start and end key, we pop one end from heap and push the other end into the heap. This involves extra downheap and upheap cost. In the likely cases when a range tombstone iterator emits relatively adjacent keys, these keys should have similar order within all keys in the heap. This can happen when there is a burst of consecutive range tombstones, and most of the keys covered by them are dropped already. This PR uses `replace_top()` when inserting new range tombstone keys, which is more efficient in these common cases.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10877
Test Plan:
- existing UT
- ran all flavors of stress test through sandcastle
- benchmark:
```
# Set up: --writes_per_range_tombstone=1 means one point write and one delete range
TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone ./db_bench --benchmarks=fillseq,levelstats --writes_per_range_tombstone=1 --max_num_range_tombstones=1000000 --range_tombstone_width=2 --num=100000000 --writes=800000 --max_bytes_for_level_base=4194304 --disable_auto_compactions --write_buffer_size=33554432 --key_size=64
Level Files Size(MB)
--------------------
0 8 152
1 0 0
2 0 0
3 0 0
4 0 0
5 0 0
6 0 0
# Benchmark
TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone/ ./db_bench --benchmarks=readseq[-W1][-X5],levelstats --use_existing_db=true --cache_size=3221225472 --num=100000000 --reads=1000000 --disable_auto_compactions=true --avoid_flush_during_recovery=true
# Pre PR
readseq [AVG 5 runs] : 1432116 (± 59664) ops/sec; 224.0 (± 9.3) MB/sec
readseq [MEDIAN 5 runs] : 1454886 ops/sec; 227.5 MB/sec
# Post PR
readseq [AVG 5 runs] : 1944425 (± 29521) ops/sec; 304.1 (± 4.6) MB/sec
readseq [MEDIAN 5 runs] : 1959430 ops/sec; 306.5 MB/sec
```
Reviewed By: ajkr
Differential Revision: D40710936
Pulled By: cbi42
fbshipit-source-id: cb782fb9cdcd26c0c3eb9443215a4ef4d2f79022
Summary:
Add some extra information in outputs of "sst_dump --command=raw" to help debug some issues. Right now, encoded block handle is printed out. It is more useful to directly print out offset and size.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10873
Test Plan: Manually run it against a file and check the output.
Reviewed By: anand1976
Differential Revision: D40742289
fbshipit-source-id: 04d7de26e7f27e1595a7cc3ac1c1082e4e835b93
Summary:
The call to `folly::coro::collectAllRange()` should move the input `mget_tasks`. But just in case, assert and clear the std::vector before reusing.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10845
Reviewed By: akankshamahajan15
Differential Revision: D40611719
Pulled By: anand1976
fbshipit-source-id: 0f32b387cf5a2894b13389016c020b01ab479b5e
Summary:
If windows.h is not included in a particular way, it can conflict with other code including it. I don't know all the details, but having just one standard place where we include windows.h in header files seems best and seems to fix the internal issue we hit.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10885
Test Plan: CI and internal validation
Reviewed By: anand1976
Differential Revision: D40738945
Pulled By: pdillinger
fbshipit-source-id: 88f635e895b1c7b810baad159e6dbb8351344cac
Summary:
Resolves see https://github.com/facebook/rocksdb/issues/9006
Fixes 2 related issues with JNI local references in the RocksJava API.
1. Some instances of RocksJava API JNI code appear to have misunderstood the reason for `JNIEnv->EnsureLocalCapacity()` and are carrying out bogus checks which happen to fail with some larger parameter values (many column families in a single call, very long key names or values). Remove these checks and add some regression tests for the previous failures.
2. The helper for Transaction multiGet operations (`multiGet()`, `multiGetForUpdate()`,...) is limited in the number of keys it can `get()` for because it requires a corresponding number of live local references. Refactor the helper slightly, copying out the key contents within a loop so that the references don't have to exist at the same time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10674
Reviewed By: ajkr
Differential Revision: D40515361
Pulled By: jay-zhuang
fbshipit-source-id: f1be0126181a698b3ad27c0945a39c54d950aa25
Summary:
Currently, the name of `block_cache_trace_analyzer_tool` in `CMakeLists.txt` is somewhat confusing.
## Makefile
The same thing in Makefile is called `block_cache_trace_analyzer`.
```c++
block_cache_trace_analyzer: $(OBJ_DIR)/tools/block_cache_analyzer/block_cache_trace_analyzer_tool.o $(ANALYZE_OBJECTS) $(TOOLS_LIBRARY) $(LIBRARY)
$(AM_LINK)
```
## RocksDB Wiki
Also, in the [Block-cache-analysis-and-simulation-tools](https://github.com/facebook/rocksdb/wiki/Block-cache-analysis-and-simulation-tools#quick-start) of RocksDB Wiki, it is called `block_cache_trace_analyzer` too.
<img width="955" alt="Screen Shot 2022-10-13 at 20 07 09" src="https://user-images.githubusercontent.com/90088090/195591912-00b539b4-7f8c-4117-bf72-ac4eb51100d1.png">
Therefore, I think maybe it's better to rename `block_cache_trace_analyzer_tool` to `block_cache_trace_analyzer` in `CMakeLists.txt`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10814
Reviewed By: ajkr
Differential Revision: D40348522
Pulled By: jay-zhuang
fbshipit-source-id: f3d69d5880b27cdb8c8fe71df56fa3dbe1dc32fb
Summary:
Complements https://github.com/facebook/rocksdb/issues/10867 with some manual edits to avoid weird formatting or to avoid massive reformatting third party code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10870
Test Plan: `make check` etc
Reviewed By: riversand963
Differential Revision: D40686526
Pulled By: pdillinger
fbshipit-source-id: 6af988fe4b0a8ae4a5992ec2c3c37fe67584226e
Summary:
This is purely the result of running `clang-format -i` on files, except some files have been excluded for manual intervention in a separate PR
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10867
Test Plan: `make check`, `make check-headers`, `make format`
Reviewed By: jay-zhuang
Differential Revision: D40682086
Pulled By: pdillinger
fbshipit-source-id: 8673d978553ab99b516da7fb63ba0b82523337f8
Summary:
While PR#9749 nominally added support for XXH3 in the Java API, it did not update the `toCppChecksumType` method. As a result, setting the checksum type to XXH3 actually set it to CRC32c instead.
This commit adds the missing entry to portal.h, and also updates the tests so that they verify the options passed to RocksDB, instead of simply checking that the getter returns the value set by the setter.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10862
Reviewed By: pdillinger
Differential Revision: D40665031
Pulled By: ajkr
fbshipit-source-id: 2834419b3361a4bac47db3b858951fb451b5bdc8
Summary:
The patch adjusts the generation of values in batched ops stress tests so that the digits 0..9 are appended (instead of prepended) to the values written. This has the advantage of aligning the encoding of the "value base" into the value string across non-batched, batched, and CF consistency stress tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10872
Test Plan: Tested using some black box stress test runs.
Reviewed By: riversand963
Differential Revision: D40692847
Pulled By: ltamasi
fbshipit-source-id: 26bf8adff2944cbe416665f09c3bab89d80416b3
Summary:
Some lines of .h and .cc files are not properly fomatted. Clear them up with clang format.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10868
Test Plan: Watch existing CI to pass
Reviewed By: ajkr
Differential Revision: D40683485
fbshipit-source-id: 491fbb78b2cdcb948164f306829909ad816d5d0b
Summary:
FragmentedRangeTombstoneList has a member variable `seq_set_` that contains the sequence numbers of all range tombstones in a set. The set is constructed in `FragmentTombstones()` and is used only in `FragmentedRangeTombstoneList::ContainsRange()` which only happens during compaction. This PR moves the initialization of `seq_set_` to `FragmentedRangeTombstoneList::ContainsRange()`. This should speed up `FragmentTombstones()` when the range tombstone list is used for read/scan requests. Microbench shows the speed improvement to be ~45%.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10848
Test Plan:
- Existing tests and stress test: `python3 tools/db_crashtest.py whitebox --simple --verify_iterator_with_expected_state_one_in=5`.
- Microbench: update `range_del_aggregator_bench` to benchmark speed of `FragmentTombstones()`:
```
./range_del_aggregator_bench --num_range_tombstones=1000 --tombstone_start_upper_bound=50000000 --num_runs=10000 --tombstone_width_mean=200 --should_deletes_per_run=100 --use_compaction_range_del_aggregator=true
Before this PR:
=========================
Fragment Tombstones: 270.286 us
AddTombstones: 1.28933 us
ShouldDelete (first): 0.525528 us
ShouldDelete (rest): 0.0797519 us
After this PR: time to fragment tombstones is pushed to AddTombstones() which only happen during compaction.
=========================
Fragment Tombstones: 149.879 us
AddTombstones: 102.131 us
ShouldDelete (first): 0.565871 us
ShouldDelete (rest): 0.0729444 us
```
- db_bench: this should improve speed for fragmenting range tombstones for mutable memtable:
```
./db_bench --benchmarks=readwhilewriting --writes_per_range_tombstone=100 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=250000 --disable_auto_compactions --max_num_range_tombstones=100000 --finish_after_writes --write_buffer_size=1073741824 --threads=25
Before this PR:
readwhilewriting : 18.301 micros/op 1310445 ops/sec 4.769 seconds 6250000 operations; 28.1 MB/s (41001 of 250000 found)
After this PR:
readwhilewriting : 16.943 micros/op 1439376 ops/sec 4.342 seconds 6250000 operations; 23.8 MB/s (28977 of 250000 found)
```
Reviewed By: ajkr
Differential Revision: D40646227
Pulled By: cbi42
fbshipit-source-id: ea471667edb258f67d01cfd828588e80a89e4083
Summary:
**Context:**
Same as https://github.com/facebook/rocksdb/pull/5958#issue-511150930 but apply the fix to FIFO Compaction case
Repro:
```
COERCE_CONTEXT_SWICH=1 make -j56 db_stress
./db_stress --acquire_snapshot_one_in=0 --adaptive_readahead=0 --allow_data_in_errors=True --async_io=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=18 --bottommost_compression_type=disable --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=1 --checkpoint_one_in=0 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=0 --compact_range_one_in=1000 --compaction_pri=3 --open_files=-1 --compaction_style=2 --fifo_allow_compaction=1 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8388607 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test0/rocksdb_crashtest_whitebox --db_write_buffer_size=8388608 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=0 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=15 --index_type=3 --ingest_external_file_one_in=100 --initial_auto_readahead_size=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --log2_keys_per_lock=10 --long_running_snapshots=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=4194304 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --num_levels=1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=32 --open_write_fault_one_in=0 --ops_per_thread=200000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=0 --periodic_compaction_seconds=0 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --progress_reports=0 --read_fault_one_in=0 --readahead_size=16384 --readpercent=45 --recycle_log_file_num=1 --reopen=20 --ribbon_starting_level=999 --snapshot_hold_ops=1000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --unpartitioned_pinning=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=1 --use_merge=0 --use_multiget=1 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=524288 --write_dbid_to_manifest=0 --writepercent=35
put or merge error: Corruption: force_consistency_checks(DEBUG): VersionBuilder: L0 file https://github.com/facebook/rocksdb/issues/479 with seqno 23711 29070 vs. file https://github.com/facebook/rocksdb/issues/482 with seqno 27138 29049
```
**Summary:**
FIFO only does intra-L0 compaction in the following four cases. For other cases, FIFO drops data instead of compacting on data, which is irrelevant to the overlapping seqno issue we are solving.
- [FIFOCompactionPicker::PickSizeCompaction](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L155) when `total size < compaction_options_fifo.max_table_files_size` and `compaction_options_fifo.allow_compaction == true`
- For this path, we simply reuse the fix in `FindIntraL0Compaction` https://github.com/facebook/rocksdb/pull/5958/files#diff-c261f77d6dd2134333c4a955c311cf4a196a08d3c2bb6ce24fd6801407877c89R56
- This path was not stress-tested at all. Therefore we covered `fifo.allow_compaction` in stress test to surface the overlapping seqno issue we are fixing here.
- [FIFOCompactionPicker::PickCompactionToWarm](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L313) when `compaction_options_fifo.age_for_warm > 0`
- For this path, we simply replicate the idea in https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and skip files of largest seqno greater than `earliest_mem_seqno`
- This path was not stress-tested at all. However covering `age_for_warm` option worths a separate PR to deal with db stress compatibility. Therefore we manually tested this path for this PR
- [FIFOCompactionPicker::CompactRange](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L365) that ends up picking one of the above two compactions
- [CompactionPicker::CompactFiles](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.cc#L378)
- Since `SanitizeCompactionInputFiles()` will be called [before](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.h#L111-L113) `CompactionPicker::CompactFiles` , we simply replicate the idea in https://github.com/facebook/rocksdb/pull/5958#issue-511150930 in `SanitizeCompactionInputFiles()`. To simplify implementation, we return `Stats::Abort()` on encountering seqno-overlapped file when doing compaction to L0 instead of skipping the file and proceed with the compaction.
Some additional clean-up included in this PR:
- Renamed `earliest_memtable_seqno` to `earliest_mem_seqno` for consistent naming
- Added comment about `earliest_memtable_seqno` in related APIs
- Made parameter `earliest_memtable_seqno` constant and required
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10777
Test Plan:
- make check
- New unit test `TEST_P(DBCompactionTestFIFOCheckConsistencyWithParam, FlushAfterIntraL0CompactionWithIngestedFile)`corresponding to the above 4 cases, which will fail accordingly without the fix
- Regular CI stress run on this PR + stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761 and on FIFO compaction only
Reviewed By: ajkr
Differential Revision: D40090485
Pulled By: hx235
fbshipit-source-id: 52624186952ee7109117788741aeeac86b624a4f
Summary:
Run format check for .h and .cc files to clean the format
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10851
Test Plan: Watch CI tests to pass
Reviewed By: ajkr
Differential Revision: D40649723
fbshipit-source-id: 62d32cead0b3b8e6540e86d25451bd72642109eb
Summary:
Run "clang-format" against files under port to make it happy.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10849
Test Plan: Watch existing CI to pass.
Reviewed By: anand1976
Differential Revision: D40645839
fbshipit-source-id: 582b4215503223795cf6234af90cc4e8e4eba773
Summary:
Apply the formatting changes suggested by `clang-format`, except
where they would ruin the ASCII art in `blob_log_format.h`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10856
Test Plan: `make check`
Reviewed By: siying
Differential Revision: D40652224
Pulled By: ltamasi
fbshipit-source-id: 8c1f5757b758474ea3e8102a7c5a1cf9e6dc1402
Summary:
Run clang-format against files under include/
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10850
Test Plan: Watch existing CI to pass.
Reviewed By: ajkr
Differential Revision: D40646158
fbshipit-source-id: 8ce04b107c837630f4000a478d0c871577090263
Summary:
`#include "db/range_tombstone_fragmenter.h"` seems to break some internal test for 7.8 release. I'm removing it from sst_file_reader.h for now to unblock release. This should be fine as it is only used in a unit test for DeleteRange with timestamp. In addition, it does not seem to be useful to support delete range for sst file writer, since the range tombstone won't cover any key (its sequence number is 0). So maybe we can remove it in the future.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10847
Test Plan: CI.
Reviewed By: akankshamahajan15
Differential Revision: D40620865
Pulled By: cbi42
fbshipit-source-id: be44b2f31e062bff87ed1b8d94482c3f7eaa370c
Summary:
Allow the last level only compaction able to output result to penultimate level if the penultimate level is empty. Which will also block the other compaction output to the penultimate level.
(it includes the PR https://github.com/facebook/rocksdb/issues/10829)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10822
Reviewed By: siying
Differential Revision: D40389180
Pulled By: jay-zhuang
fbshipit-source-id: 4e5dcdce307795b5e07b5dd1fa29dd75bb093bad
Summary:
Right now UserComparatorWrapper is a Customizable object, although it is not, which introduces some intialization overhead for the object. In some benchmarks, it shows up in CPU profiling. Make it not configurable by defining most functions needed by UserComparatorWrapper to an interface and implement the interface.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10837
Test Plan: Make sure existing tests pass
Reviewed By: pdillinger
Differential Revision: D40528511
fbshipit-source-id: 70eaac89ecd55401a26e8ed32abbc413a9617c62
Summary:
Refactor the classes, APIs and data structures for block cache tracing to allow a user provided trace writer to be used. Currently, only a TraceWriter is supported, with a default built-in implementation of FileTraceWriter. The TraceWriter, however, takes a flat trace record and is thus only suitable for file tracing. This PR introduces an abstract BlockCacheTraceWriter class that takes a structured BlockCacheTraceRecord. The BlockCacheTraceWriter implementation can then format and log the record in whatever way it sees fit. The default BlockCacheTraceWriterImpl does file tracing using a user provided TraceWriter.
`DB::StartBlockTrace` will internally redirect to changed `BlockCacheTrace::StartBlockCacheTrace`.
New API `DB::StartBlockTrace` is also added that directly takes `BlockCacheTraceWriter` pointer.
This same philosophy can be applied to KV and IO tracing as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10811
Test Plan:
existing unit tests
Old API DB::StartBlockTrace checked with db_bench tool
create database
```
./db_bench --benchmarks="fillseq" \
--key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 \
--cache_index_and_filter_blocks --cache_size=1048576 \
--disable_auto_compactions=1 --disable_wal=1 --compression_type=none \
--min_level_to_compress=-1 --compression_ratio=1 --num=10000000
```
To trace block cache accesses when running readrandom benchmark:
```
./db_bench --benchmarks="readrandom" --use_existing_db --duration=60 \
--key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 \
--cache_index_and_filter_blocks --cache_size=1048576 \
--disable_auto_compactions=1 --disable_wal=1 --compression_type=none \
--min_level_to_compress=-1 --compression_ratio=1 --num=10000000 \
--threads=16 \
-block_cache_trace_file="/tmp/binary_trace_test_example" \
-block_cache_trace_max_trace_file_size_in_bytes=1073741824 \
-block_cache_trace_sampling_frequency=1
```
Reviewed By: anand1976
Differential Revision: D40435289
Pulled By: akankshamahajan15
fbshipit-source-id: fa2755f4788185e19f4605e731641cfd21ab3282
Summary:
In https://github.com/facebook/rocksdb/issues/10801 in ClockHandleTable::Evict, we saved a reference to the hash value (`const UniqueId64x2& hashed_key`) instead of saving the hash value itself before marking the handle as empty and thus free for use by other threads. This could lead to Rollback seeing the wrong hash value for updating the `displacements` after an entry is removed.
The fix is (like other places) to copy the hash value before it's released. (We could Rollback while we own the entry, but that creates more dependences between atomic updates, because in that case, based on the code, the Rollback writes would have to happen before or after the entry is released by marking empty. By doing the relaxed Rollback after marking empty, there's more opportunity for re-ordering / ILP.)
Intended follow-up: refactoring for better code sharing in clock_cache.cc
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10843
Test Plan: watch for clean crash test, TSAN
Reviewed By: siying
Differential Revision: D40579680
Pulled By: pdillinger
fbshipit-source-id: 258e43b3b80bc980a161d5c675ccc6708ecb8025
Summary:
Currently, the code in `SaveValue` that handles `kTypeValue` and
`kTypeBlobIndex` (and more recently, `kTypeWideColumnEntity`) is
mostly shared. This made sense originally; however, by now the
handling of these three value types has diverged significantly. The
patch makes the logic cleaner and also eliminates quite a bit of branching
by giving each value type its own `case` and removing a fall-through.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10840
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D40568420
Pulled By: ltamasi
fbshipit-source-id: 2e614606afd1c3d9c76d9b5f1efa0959fc174103