rocksdb

Commit Graph

Author	SHA1	Message	Date
anand76	8ffabdc226	Fix table cache leak in MultiGet with async_io (#10997 ) Summary: When MultiGet with the async_io option encounters an IO error in TableCache::GetTableReader, it may result in leakage of table cache handles due to queued coroutines being abandoned. This PR fixes it by ensuring any queued coroutines are run before aborting the MultiGet. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10997 Test Plan: 1. New unit test in db_basic_test 2. asan_crash Reviewed By: pdillinger Differential Revision: D41587244 Pulled By: anand1976 fbshipit-source-id: 900920cd3fba47cb0fc744a62facc5ffe2eccb64	2 years ago
Hui Xiao	2f76ab150d	Fix missing WAL in new manifest by rolling over the WAL deletion record from prev manifest (#10892 ) Summary: Context `Options::track_and_verify_wals_in_manifest = true` verifies each of the WALs tracked in manifest indeed presents in the WAL folder. If not, a corruption "Missing WAL with log number" will be thrown. `DB::SyncWAL()` called at a specific timing (i.e, at the `TEST_SYNC_POINT("FindObsoleteFiles::PostMutexUnlock")`) can record in a new manifest the WAL addition of a WAL file that already had a WAL deletion recorded in the previous manifest. And the WAL deletion record is not rollover-ed to the new manifest. So the new manifest creates the illusion of such WAL never gets deleted and should presents at db re/open. - Such WAL deletion record can be caused by flushing the memtable associated with that WAL and such WAL deletion can actually happen in` PurgeObsoleteFiles()`. As a consequence, upon `DB::Reopen()`, this WAL file can be deleted while manifest still has its WAL addition record , which causes a false alarm of corruption "Missing WAL with log number" to be thrown. Summary This PR fixes this false alarm by rolling over the WAL deletion record from prev manifest to the new manifest by adding the WAL deletion record to the new manifest. Test - Make check - Added new unit test `TEST_F(DBWALTest, FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL)` that failed before the fix and passed after - [Ongoing]CI stress test + aggressive value as in https://github.com/facebook/rocksdb/pull/10761 , which is how this false alarm was first surfaced, to confirm such false alarm disappears - [Ongoing]Regular CI stress test to confirm such fix didn't harm anything Pull Request resolved: https://github.com/facebook/rocksdb/pull/10892 Reviewed By: ajkr Differential Revision: D40778965 Pulled By: hx235 fbshipit-source-id: a512364bfdeb0b1a55c171890e60d856c528f37f	2 years ago
Hui Xiao	f1574a20ff	Revert PR 10777 "Fix FIFO causing overlapping seqnos in L0 files due to overla…" (#10999 ) Summary: Context/Summary: This reverts commit `fc74abb436` and related HISTORY record. The issue with PR 10777 or general approach using earliest_mem_seqno like https://github.com/facebook/rocksdb/pull/5958#issue-511150930 is that the earliest seqno of memtable of each CFs does not get persisted and will always start with 0 upon Recover(). Later when creating a new memtable in certain CF, we use the last seqno of the whole DB (but not of that CF from previous DB session) for this CF. This will lead to false positive overlapping seqno and PR 10777 will throw something like https://github.com/facebook/rocksdb/blob/main/db/compaction/compaction_picker.cc#L1002-L1004 Luckily a more elegant and complete solution to the overlapping seqno problem these PR aim to solve does not have above problem, see https://github.com/facebook/rocksdb/pull/10922. It is already being pursued and in the process of review. So we can just revert this PR and focus on getting PR10922 to land. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10999 Test Plan: make check Reviewed By: anand1976 Differential Revision: D41572604 Pulled By: hx235 fbshipit-source-id: 9d9bdf594abd235e2137045cef513ca0b14e0a3a	2 years ago
Hui Xiao	d8c043f7ad	Trigger FIFO file deletion in non L0 only if exceeding max_table_files_size (#10955 ) Summary: Context https://github.com/facebook/rocksdb/pull/10348 allows multi-level FIFO but accidentally made change to the logic of deleting files in `FIFOCompactionPicker::PickSizeCompaction`. With [this](https://github.com/facebook/rocksdb/pull/10348/files#diff-d8fb3d50749aa69b378de447e3d9cf2f48abe0281437f010b5d61365a7b813fdR156) and [this](https://github.com/facebook/rocksdb/pull/10348/files#diff-d8fb3d50749aa69b378de447e3d9cf2f48abe0281437f010b5d61365a7b813fdR235) together, it deletes one file in non-L0 even when `total_size <= mutable_cf_options.compaction_options_fifo.max_table_files_size`, which is incorrect. As a consequence, FIFO exercises more file deletion in our crash testing, which is not able to verify correctly on deleted keys in the file deleted by compaction. This results in errors `error : inconsistent values for key 000000000000239F000000000000012B000000000000028B: expected state has the key, Get() returns NotFound. Verification failed :(` or `Expected state has key 00000000000023A90000000000000003787878, iterator is at key 00000000000023A9000000000000004178 Column family: default, op_logs: S 00000000000023A90000000000000003787878` Summary: - Delete file for non-L0 only if `total_size <= mutable_cf_options.compaction_options_fifo.max_table_files_size` - Add some helpful log to LOG file Pull Request resolved: https://github.com/facebook/rocksdb/pull/10955 Test Plan: - Errors repro-ed by ``` ./db_stress --preserve_unverified_changes=1 --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --async_io=0 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=10 --bottommost_compression_type=none --bytes_per_sync=0 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=3 --compaction_style=2 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8589934591 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=xpress --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=1048576 --delpercent=0 --delrangepercent=0 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=xxh64 --flush_one_in=1000000 --format_version=4 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=10 --index_type=2 --ingest_external_file_one_in=1000000 --initial_auto_readahead_size=16384 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=False --log2_keys_per_lock=10 --long_running_snapshots=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=524288 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=0 --num_file_reads_for_auto_readahead=2 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=40000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=7 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=1000 --readahead_size=0 --readpercent=65 --recycle_log_file_num=1 --reopen=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=0 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=1 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=1 --use_full_merge_v1=1 --use_merge=0 --use_multiget=0 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_iterator_with_expected_state_one_in=0 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=1 --writepercent=20 ``` is gone after this fix - CI Reviewed By: ajkr Differential Revision: D41319441 Pulled By: hx235 fbshipit-source-id: 6939753767007f7449ea7055b1420aabd03d7709	2 years ago
Changyu Bi	534fb06dd3	Prevent iterating over range tombstones beyond `iterate_upper_bound` (#10966 ) Summary: Currently, `iterate_upper_bound` is not checked for range tombstone keys in MergingIterator. This may impact performance when there is a large number of range tombstones right after `iterate_upper_bound`. This PR fixes this issue by checking `iterate_upper_bound` in MergingIterator for range tombstone keys. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10966 Test Plan: - added unit test - stress test: `python3 tools/db_crashtest.py whitebox --simple --verify_iterator_with_expected_state_one_in=5 --delrangepercent=5 --prefixpercent=18 --writepercent=48 --readpercen=15 --duration=36000 --range_deletion_width=100` - ran different stress tests over sandcastle - Falcon team ran some test traffic and saw reduced CPU usage on processing range tombstones. Reviewed By: ajkr Differential Revision: D41414172 Pulled By: cbi42 fbshipit-source-id: 9b2c29eb3abb99327c6a649bdc412e70d863f981	2 years ago
Yanqin Jin	3d0d6b8140	Make best-efforts recovery verify SST unique ID before Version construction (#10962 ) Summary: The check for SST unique IDs added to best-efforts recovery (`Options::best_efforts_recovery` is true). With best_efforts_recovery being true, RocksDB will recover to the latest point in MANIFEST such that all valid SST files included up to this point pass unique ID checks as well. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10962 Test Plan: make check Reviewed By: pdillinger Differential Revision: D41378241 Pulled By: riversand963 fbshipit-source-id: a036064e2c17dec13d080a24ef2a9f85d607b16c	2 years ago
anand76	f4cfcfe824	Post 7.9.0 release branch cut updates (#10974 ) Summary: Update HISTORY.md, version.h, and check_format_compatible.sh Pull Request resolved: https://github.com/facebook/rocksdb/pull/10974 Reviewed By: akankshamahajan15 Differential Revision: D41455289 Pulled By: anand1976 fbshipit-source-id: 99888ebcb9109e5ced80584a66b20123f8783c0b	2 years ago
anand76	3ff6da6bd5	Update HISTORY.md for 7.9.0 (#10973 ) Summary: Update HISTORY.md for 7.9.0 release. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10973 Reviewed By: pdillinger Differential Revision: D41453720 Pulled By: anand1976 fbshipit-source-id: 47a23d4b6539ec6a9a09c9e69c026f7c8b10afa7	2 years ago
Peter Dillinger	e079d562af	Add a SecondaryCache::InsertSaved() API, use in CacheDumper impl (#10945 ) Summary: Can simplify some ugly code in cache_dump_load_impl.cc by having an API in SecondaryCache that can directly consume persisted data. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10945 Test Plan: existing tests for CacheDumper, added basic unit test Reviewed By: anand1976 Differential Revision: D41231497 Pulled By: pdillinger fbshipit-source-id: b8ec993ef7d3e7efd68aae8602fd3f858da58068	2 years ago
Andrew Kryczka	097f9f4425	Fix CompactionIterator flag for penultimate level output (#10967 ) Summary: We were not resetting it in non-debug mode so it could be true once and then stay true for future keys where it should be false. This PR adds the reset logic. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10967 Test Plan: - built `db_bench` with DEBUG_LEVEL=0 - ran benchmark: `TEST_TMPDIR=/dev/shm/prefix ./db_bench -benchmarks=fillrandom -compaction_style=1 -preserve_internal_time_seconds=100 -preclude_last_level_data_seconds=10 -write_buffer_size=1048576 -target_file_size_base=1048576 -subcompactions=8 -duration=120` - compared "output_to_penultimate_level: X bytes + last: Y bytes" lines in LOG output - Before this fix, Y was always zero - After this fix, Y gradually increased throughout the benchmark Reviewed By: riversand963 Differential Revision: D41417726 Pulled By: ajkr fbshipit-source-id: ace1e9a289e751a5b0c2fbaa8addd4eda5525329	2 years ago
Peter Dillinger	3182beeffc	Observe and warn about misconfigured HyperClockCache (#10965 ) Summary: Background. One of the core risks of chosing HyperClockCache is ending up with degraded performance if estimated_entry_charge is very significantly wrong. Too low leads to under-utilized hash table, which wastes a bit of (tracked) memory and likely increases access times due to larger working set size (more TLB misses). Too high leads to fully populated hash table (at some limit with reasonable lookup performance) and not being able to cache as many objects as the memory limit would allow. In either case, performance degradation is graceful/continuous but can be quite significant. For example, cutting block size in half without updating estimated_entry_charge could lead to a large portion of configured block cache memory (up to roughly 1/3) going unused. Fix. This change adds a mechanism through which the DB periodically probes the block cache(s) for "problems" to report, and adds diagnostics to the HyperClockCache for bad estimated_entry_charge. The periodic probing is currently done with DumpStats / stats_dump_period_sec, and diagnostics reported to info_log (normally LOG file). Pull Request resolved: https://github.com/facebook/rocksdb/pull/10965 Test Plan: unit test included. Doesn't cover all the implemented subtleties of reporting, but ensures basics of when to report or not. Also manual testing with db_bench. Create db with ``` ./db_bench --benchmarks=fillrandom,flush --num=3000000 --disable_wal=1 ``` Use and check LOG file for HyperClockCache for various block sizes (used as estimated_entry_charge) ``` ./db_bench --use_existing_db --benchmarks=readrandom --num=3000000 --duration=20 --stats_dump_period_sec=8 --cache_type=hyper_clock_cache -block_size=XXXX ``` Seeing warnings / errors or not as expected. Reviewed By: anand1976 Differential Revision: D41406932 Pulled By: pdillinger fbshipit-source-id: 4ca56162b73017e4b9cec2cad74466f49c27a0a7	2 years ago
Peter Dillinger	8c0f5b1fcf	Mark HyperClockCache as production-ready (#10963 ) Summary: After a couple minor bug fixes and successful productions roll-outs in a few places, I think we can mark this as production-ready. It has a clear value proposition for many workloads, even if we don't have clear advice for every workload yet. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10963 Test Plan: existing tests, comment changes only Reviewed By: siying Differential Revision: D41384083 Pulled By: pdillinger fbshipit-source-id: 56359f01a57bb28de8697666b342382fac72ce6d	2 years ago
Levi Tamasi	8fa8780932	Mention wide-column support in HISTORY.md (#10959 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10959 Reviewed By: akankshamahajan15 Differential Revision: D41348198 Pulled By: ltamasi fbshipit-source-id: 51e89d03c1fe87f576a766f609a7f233a519c83d	2 years ago
anand76	ecba6a320e	Add some async read stats (#10947 ) Summary: Add stats for time spent in the ReadAsync call, and async read errors. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10947 Test Plan: Run db_bench and look at stats Reviewed By: akankshamahajan15 Differential Revision: D41236637 Pulled By: anand1976 fbshipit-source-id: 70539b69a28491d57acead449436a761f7108acf	2 years ago
Peter Dillinger	f321e8fc98	Don't attempt to use SecondaryCache on block_cache_compressed (#10944 ) Summary: Compressed block cache depends on reading the block compression marker beyond the payload block size. Only the payload bytes were being saved and loaded from SecondaryCache -> boom! This removes some unnecessary code attempting to combine these two competing features. Note that BlockContents was previously used for block-based filter in block cache, but that support has been removed. Also marking block_cache_compressed as deprecated in this commit as we expect it to be replaced with SecondaryCache. This problem was discovered during refactoring but didn't want to combine bug fix with that refactoring. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10944 Test Plan: test added that fails on base revision (at least with ASAN) Reviewed By: akankshamahajan15 Differential Revision: D41205578 Pulled By: pdillinger fbshipit-source-id: 1b29d36c7a6552355ac6511fcdc67038ef4af29f	2 years ago
akankshamahajan	d1aca4a5ae	Fix async_io regression in scans (#10939 ) Summary: Fix async_io regression in scans due to incorrect check which was causing the valid data in buffer to be cleared during seek. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10939 Test Plan: - stress tests export CRASH_TEST_EXT_ARGS="--async_io=1" make crash_test -j32 - Ran db_bench command which was caught the regression: ./db_bench --db=/rocksdb_async_io_testing/prefix_scan --disable_wal=1 --use_existing_db=true --benchmarks="seekrandom" -key_size=32 -value_size=512 -num=50000000 -use_direct_reads=false -seek_nexts=963 -duration=30 -ops_between_duration_checks=1 --async_io=true --compaction_readahead_size=4194304 --log_readahead_size=0 --blob_compaction_readahead_size=0 --initial_auto_readahead_size=65536 --num_file_reads_for_auto_readahead=0 --max_auto_readahead_size=524288 seekrandom : 3777.415 micros/op 264 ops/sec 30.000 seconds 7942 operations; 132.3 MB/s (7942 of 7942 found) Reviewed By: anand1976 Differential Revision: D41173899 Pulled By: akankshamahajan15 fbshipit-source-id: 2d75b06457d65b1851c92382565d9c3fac329dfe	2 years ago
Levi Tamasi	fbd9077d66	Fix a bug where GetContext does not update READ_NUM_MERGE_OPERANDS (#10925 ) Summary: The patch fixes a bug where `GetContext::Merge` (and `MergeEntity`) does not update the ticker `READ_NUM_MERGE_OPERANDS` because it implicitly uses the default parameter value of `update_num_ops_stats=false` when calling `MergeHelper::TimedFullMerge`. Also, to prevent such issues going forward, the PR removes the default parameter values from the `TimedFullMerge` methods. In addition, it removes an unused/unnecessary parameter from `TimedFullMergeWithEntity`, and does some cleanup at the call sites of these methods. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10925 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D41096453 Pulled By: ltamasi fbshipit-source-id: fc60646d32b4d516b8fe81e265c3f020a32fd7f8	2 years ago
Andrew Kryczka	aa0a11e1b9	Fix flush picking non-consecutive memtables (#10921 ) Summary: Prevents `MemTableList::PickMemtablesToFlush()` from picking non-consecutive memtables. It leads to wrong ordering in L0 if the files are committed, or an error like below if force_consistency_checks=true catches it: ``` Corruption: force_consistency_checks: VersionBuilder: L0 file https://github.com/facebook/rocksdb/issues/25 with seqno 320416 368066 vs. file https://github.com/facebook/rocksdb/issues/24 with seqno 336037 352068 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10921 Test Plan: fix the expectation in the existing test of this behavior Reviewed By: riversand963 Differential Revision: D41046935 Pulled By: ajkr fbshipit-source-id: 783696bff56115063d5dc5856dfaed6a9881d1ab	2 years ago
akankshamahajan	ff9ad2c39b	Fix async_io failures in case there is error in reading data (#10890 ) Summary: Fix memory corruption error in scans if async_io is enabled. Memory corruption happened if data is overlapping between two buffers. If there is IOError while reading the data, it leads to empty buffer and other buffer already in progress of async read goes again for reading causing the error. Fix: Added check to abort IO in second buffer if curr_ got empty. This PR also fixes db_stress failures which happened when buffers are not aligned. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10890 Test Plan: - Ran make crash_test -j32 with async_io enabled. - Ran benchmarks to make sure there is no regression. Reviewed By: anand1976 Differential Revision: D40881731 Pulled By: akankshamahajan15 fbshipit-source-id: 39fcf2134c7b1bbb08415ede3e1ef261ac2dbc58	2 years ago
Yanqin Jin	7d26e4c5a3	Basic Support for Merge with user-defined timestamp (#10819 ) Summary: This PR implements the originally disabled `Merge()` APIs when user-defined timestamp is enabled. Simplest usage: ```cpp // assume string append merge op is used with '.' as delimiter. // ts1 < ts2 db->Put(WriteOptions(), "key", ts1, "v0"); db->Merge(WriteOptions(), "key", ts2, "1"); ReadOptions ro; ro.timestamp = &ts2; db->Get(ro, "key", &value); ASSERT_EQ("v0.1", value); ``` Some code comments are added for clarity. Note: support for timestamp in `DB::GetMergeOperands()` will be done in a follow-up PR. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10819 Test Plan: make check Reviewed By: ltamasi Differential Revision: D40603195 Pulled By: riversand963 fbshipit-source-id: f96d6f183258f3392d80377025529f7660503013	2 years ago
Hui Xiao	7f5e438aee	Move move wrong history entry out of 7.8 release (#10898 ) Summary: Context/Summary: https://github.com/facebook/rocksdb/pull/10777 mistakenly added a history entry under 7.8 release but the PR is not included in 7.8. This mistake was due to rebase and merge didn't realize it was a conflict when "## Unreleased" was changed to "## 7.8.0 (10/22/2022)". Pull Request resolved: https://github.com/facebook/rocksdb/pull/10898 Test Plan: Make check Reviewed By: akankshamahajan15 Differential Revision: D40861001 Pulled By: hx235 fbshipit-source-id: b2310c95490f6ebb90834a210c965a74c9560b51	2 years ago
sdong	d989300ad1	Avoid repeat periodic stats printing when there is no change (#10891 ) Summary: When there is a column family that doesn't get any traffic, its stats are still dumped when options.options.stats_dump_period_sec triggers. This sometimes spam the information logs. With this change, we skip the printing if there is not change, until 8 periods. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10891 Test Plan: Manually test the behavior with hacked db_bench setups. Reviewed By: jay-zhuang Differential Revision: D40777183 fbshipit-source-id: ef0b9a793e4f6282df099b464f01d1fb4c5a2cab	2 years ago
Changyu Bi	56715350d9	Reduce heap operations for range tombstone keys in iterator (#10877 ) Summary: Right now in MergingIterator, for each range tombstone start and end key, we pop one end from heap and push the other end into the heap. This involves extra downheap and upheap cost. In the likely cases when a range tombstone iterator emits relatively adjacent keys, these keys should have similar order within all keys in the heap. This can happen when there is a burst of consecutive range tombstones, and most of the keys covered by them are dropped already. This PR uses `replace_top()` when inserting new range tombstone keys, which is more efficient in these common cases. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10877 Test Plan: - existing UT - ran all flavors of stress test through sandcastle - benchmark: ``` # Set up: --writes_per_range_tombstone=1 means one point write and one delete range TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone ./db_bench --benchmarks=fillseq,levelstats --writes_per_range_tombstone=1 --max_num_range_tombstones=1000000 --range_tombstone_width=2 --num=100000000 --writes=800000 --max_bytes_for_level_base=4194304 --disable_auto_compactions --write_buffer_size=33554432 --key_size=64 Level Files Size(MB) -------------------- 0 8 152 1 0 0 2 0 0 3 0 0 4 0 0 5 0 0 6 0 0 # Benchmark TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone/ ./db_bench --benchmarks=readseq[-W1][-X5],levelstats --use_existing_db=true --cache_size=3221225472 --num=100000000 --reads=1000000 --disable_auto_compactions=true --avoid_flush_during_recovery=true # Pre PR readseq [AVG 5 runs] : 1432116 (± 59664) ops/sec; 224.0 (± 9.3) MB/sec readseq [MEDIAN 5 runs] : 1454886 ops/sec; 227.5 MB/sec # Post PR readseq [AVG 5 runs] : 1944425 (± 29521) ops/sec; 304.1 (± 4.6) MB/sec readseq [MEDIAN 5 runs] : 1959430 ops/sec; 306.5 MB/sec ``` Reviewed By: ajkr Differential Revision: D40710936 Pulled By: cbi42 fbshipit-source-id: cb782fb9cdcd26c0c3eb9443215a4ef4d2f79022	2 years ago
Hui Xiao	fc74abb436	Fix FIFO causing overlapping seqnos in L0 files due to overlapped seqnos between ingested files and memtable's (#10777 ) Summary: Context: Same as https://github.com/facebook/rocksdb/pull/5958#issue-511150930 but apply the fix to FIFO Compaction case Repro: ``` COERCE_CONTEXT_SWICH=1 make -j56 db_stress ./db_stress --acquire_snapshot_one_in=0 --adaptive_readahead=0 --allow_data_in_errors=True --async_io=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=18 --bottommost_compression_type=disable --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=1 --checkpoint_one_in=0 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=0 --compact_range_one_in=1000 --compaction_pri=3 --open_files=-1 --compaction_style=2 --fifo_allow_compaction=1 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8388607 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test0/rocksdb_crashtest_whitebox --db_write_buffer_size=8388608 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=0 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=15 --index_type=3 --ingest_external_file_one_in=100 --initial_auto_readahead_size=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --log2_keys_per_lock=10 --long_running_snapshots=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=4194304 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --num_levels=1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=32 --open_write_fault_one_in=0 --ops_per_thread=200000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=0 --periodic_compaction_seconds=0 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --progress_reports=0 --read_fault_one_in=0 --readahead_size=16384 --readpercent=45 --recycle_log_file_num=1 --reopen=20 --ribbon_starting_level=999 --snapshot_hold_ops=1000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --unpartitioned_pinning=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=1 --use_merge=0 --use_multiget=1 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=524288 --write_dbid_to_manifest=0 --writepercent=35 put or merge error: Corruption: force_consistency_checks(DEBUG): VersionBuilder: L0 file https://github.com/facebook/rocksdb/issues/479 with seqno 23711 29070 vs. file https://github.com/facebook/rocksdb/issues/482 with seqno 27138 29049 ``` Summary: FIFO only does intra-L0 compaction in the following four cases. For other cases, FIFO drops data instead of compacting on data, which is irrelevant to the overlapping seqno issue we are solving. - [FIFOCompactionPicker::PickSizeCompaction](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L155) when `total size < compaction_options_fifo.max_table_files_size` and `compaction_options_fifo.allow_compaction == true` - For this path, we simply reuse the fix in `FindIntraL0Compaction` https://github.com/facebook/rocksdb/pull/5958/files#diff-c261f77d6dd2134333c4a955c311cf4a196a08d3c2bb6ce24fd6801407877c89R56 - This path was not stress-tested at all. Therefore we covered `fifo.allow_compaction` in stress test to surface the overlapping seqno issue we are fixing here. - [FIFOCompactionPicker::PickCompactionToWarm](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L313) when `compaction_options_fifo.age_for_warm > 0` - For this path, we simply replicate the idea in https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and skip files of largest seqno greater than `earliest_mem_seqno` - This path was not stress-tested at all. However covering `age_for_warm` option worths a separate PR to deal with db stress compatibility. Therefore we manually tested this path for this PR - [FIFOCompactionPicker::CompactRange](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L365) that ends up picking one of the above two compactions - [CompactionPicker::CompactFiles](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.cc#L378) - Since `SanitizeCompactionInputFiles()` will be called [before](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.h#L111-L113) `CompactionPicker::CompactFiles` , we simply replicate the idea in https://github.com/facebook/rocksdb/pull/5958#issue-511150930 in `SanitizeCompactionInputFiles()`. To simplify implementation, we return `Stats::Abort()` on encountering seqno-overlapped file when doing compaction to L0 instead of skipping the file and proceed with the compaction. Some additional clean-up included in this PR: - Renamed `earliest_memtable_seqno` to `earliest_mem_seqno` for consistent naming - Added comment about `earliest_memtable_seqno` in related APIs - Made parameter `earliest_memtable_seqno` constant and required Pull Request resolved: https://github.com/facebook/rocksdb/pull/10777 Test Plan: - make check - New unit test `TEST_P(DBCompactionTestFIFOCheckConsistencyWithParam, FlushAfterIntraL0CompactionWithIngestedFile)`corresponding to the above 4 cases, which will fail accordingly without the fix - Regular CI stress run on this PR + stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761 and on FIFO compaction only Reviewed By: ajkr Differential Revision: D40090485 Pulled By: hx235 fbshipit-source-id: 52624186952ee7109117788741aeeac86b624a4f	2 years ago
akankshamahajan	daceb85c51	Update version.h, HISTORY.md and add branches to compatibility check (#10846 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10846 Reviewed By: ajkr Differential Revision: D40617997 Pulled By: akankshamahajan15 fbshipit-source-id: 4b2d6e85dbca7e73b930c4165869b693d3e4e137	2 years ago
akankshamahajan	9a55e5da17	Update HISTORY.md for 7.8 release (#10844 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10844 Reviewed By: ajkr Differential Revision: D40592956 Pulled By: akankshamahajan15 fbshipit-source-id: 6656f4bc5faa30fa7882bf44155f7931895590e2	2 years ago
Jay Zhuang	f726d29a82	Allow penultimate level output for the last level only compaction (#10822 ) Summary: Allow the last level only compaction able to output result to penultimate level if the penultimate level is empty. Which will also block the other compaction output to the penultimate level. (it includes the PR https://github.com/facebook/rocksdb/issues/10829) Pull Request resolved: https://github.com/facebook/rocksdb/pull/10822 Reviewed By: siying Differential Revision: D40389180 Pulled By: jay-zhuang fbshipit-source-id: 4e5dcdce307795b5e07b5dd1fa29dd75bb093bad	2 years ago
Peter Dillinger	27c9705ac4	Use kXXH3 as default checksum (CPU efficiency) (#10778 ) Summary: Since this has been supported for about a year, I think it's time to make it the default. This should improve CPU efficiency slightly on most hardware. A current DB performance comparison using buck+clang build: ``` TEST_TMPDIR=/dev/shm ./db_bench -checksum_type={1,4} -benchmarks=fillseq[-X1000] -num=3000000 -disable_wal ``` kXXH3 (+0.2% DB write throughput): `fillseq [AVG 1000 runs] : 822149 (± 1004) ops/sec; 91.0 (± 0.1) MB/sec` kCRC32c: `fillseq [AVG 1000 runs] : 820484 (± 1203) ops/sec; 90.8 (± 0.1) MB/sec` Micro benchmark comparison: ``` ./db_bench --benchmarks=xxh3[-X20],crc32c[-X20] ``` Machine 1, buck+clang build: `xxh3 [AVG 20 runs] : 3358616 (± 19091) ops/sec; 13119.6 (± 74.6) MB/sec` `crc32c [AVG 20 runs] : 2578725 (± 7742) ops/sec; 10073.1 (± 30.2) MB/sec` Machine 2, make+gcc build, DEBUG_LEVEL=0 PORTABLE=0: `xxh3 [AVG 20 runs] : 6182084 (± 137223) ops/sec; 24148.8 (± 536.0) MB/sec` `crc32c [AVG 20 runs] : 5032465 (± 42454) ops/sec; 19658.1 (± 165.8) MB/sec` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10778 Test Plan: make check, unit tests updated Reviewed By: ajkr Differential Revision: D40112510 Pulled By: pdillinger fbshipit-source-id: e59a8d50a60346137732f8668ba7cfac93be2b37	2 years ago
sdong	5d17297b76	Make UserComparatorWrapper not Customizable (#10837 ) Summary: Right now UserComparatorWrapper is a Customizable object, although it is not, which introduces some intialization overhead for the object. In some benchmarks, it shows up in CPU profiling. Make it not configurable by defining most functions needed by UserComparatorWrapper to an interface and implement the interface. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10837 Test Plan: Make sure existing tests pass Reviewed By: pdillinger Differential Revision: D40528511 fbshipit-source-id: 70eaac89ecd55401a26e8ed32abbc413a9617c62	2 years ago
akankshamahajan	0e7b27bfcf	Refactor block cache tracing APIs (#10811 ) Summary: Refactor the classes, APIs and data structures for block cache tracing to allow a user provided trace writer to be used. Currently, only a TraceWriter is supported, with a default built-in implementation of FileTraceWriter. The TraceWriter, however, takes a flat trace record and is thus only suitable for file tracing. This PR introduces an abstract BlockCacheTraceWriter class that takes a structured BlockCacheTraceRecord. The BlockCacheTraceWriter implementation can then format and log the record in whatever way it sees fit. The default BlockCacheTraceWriterImpl does file tracing using a user provided TraceWriter. `DB::StartBlockTrace` will internally redirect to changed `BlockCacheTrace::StartBlockCacheTrace`. New API `DB::StartBlockTrace` is also added that directly takes `BlockCacheTraceWriter` pointer. This same philosophy can be applied to KV and IO tracing as well. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10811 Test Plan: existing unit tests Old API DB::StartBlockTrace checked with db_bench tool create database ``` ./db_bench --benchmarks="fillseq" \ --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 \ --cache_index_and_filter_blocks --cache_size=1048576 \ --disable_auto_compactions=1 --disable_wal=1 --compression_type=none \ --min_level_to_compress=-1 --compression_ratio=1 --num=10000000 ``` To trace block cache accesses when running readrandom benchmark: ``` ./db_bench --benchmarks="readrandom" --use_existing_db --duration=60 \ --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 \ --cache_index_and_filter_blocks --cache_size=1048576 \ --disable_auto_compactions=1 --disable_wal=1 --compression_type=none \ --min_level_to_compress=-1 --compression_ratio=1 --num=10000000 \ --threads=16 \ -block_cache_trace_file="/tmp/binary_trace_test_example" \ -block_cache_trace_max_trace_file_size_in_bytes=1073741824 \ -block_cache_trace_sampling_frequency=1 ``` Reviewed By: anand1976 Differential Revision: D40435289 Pulled By: akankshamahajan15 fbshipit-source-id: fa2755f4788185e19f4605e731641cfd21ab3282	2 years ago
Changyu Bi	333abe9c55	Ignore max_compaction_bytes for compaction input that are within output key-range (#10835 ) Summary: When picking compaction input files, we sometimes stop picking a file that is fully included in the output key-range due to hitting max_compaction_bytes. Including these input files can potentially reduce WA at the expense of larger compactions. Larger compaction should be fine as files from input level are usually 10X smaller than files from output level. This PR adds a mutable CF option `ignore_max_compaction_bytes_for_input` that is enabled by default. We can remove this option once we are sure it is safe. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10835 Test Plan: - CI, a unit test on max_compaction_bytes fails before turning this flag off. - Benchmark does not show much difference in WA: `./db_bench --benchmarks=fillrandom,waitforcompaction,stats,levelstats -max_background_jobs=12 -num=2000000000 -target_file_size_base=33554432 --write_buffer_size=33554432` ``` main: Compaction Stats [default] Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 3/0 91.59 MB 0.8 70.9 0.0 70.9 200.8 129.9 0.0 1.5 25.2 71.2 2886.55 2463.45 9725 0.297 1093M 254K 0.0 0.0 L1 9/0 248.03 MB 1.0 392.0 129.8 262.2 391.7 129.5 0.0 3.0 69.0 68.9 5821.71 5536.90 804 7.241 6029M 5814K 0.0 0.0 L2 87/0 2.50 GB 1.0 537.0 128.5 408.5 533.8 125.2 0.7 4.2 69.5 69.1 7912.24 7323.70 4417 1.791 8299M 36M 0.0 0.0 L3 836/0 24.99 GB 1.0 616.9 118.3 498.7 594.5 95.8 5.2 5.0 66.9 64.5 9442.38 8490.28 4204 2.246 9749M 306M 0.0 0.0 L4 2355/0 62.95 GB 0.3 67.3 37.1 30.2 54.2 24.0 38.9 1.5 72.2 58.2 954.37 821.18 917 1.041 1076M 173M 0.0 0.0 Sum 3290/0 90.77 GB 0.0 1684.2 413.7 1270.5 1775.0 504.5 44.9 13.7 63.8 67.3 27017.25 24635.52 20067 1.346 26G 522M 0.0 0.0 Cumulative compaction: 1774.96 GB write, 154.29 MB/s write, 1684.19 GB read, 146.40 MB/s read, 27017.3 seconds This PR: Compaction Stats [default] Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 3/0 45.71 MB 0.8 72.9 0.0 72.9 202.8 129.9 0.0 1.6 25.4 70.7 2938.16 2510.36 9741 0.302 1124M 265K 0.0 0.0 L1 8/0 234.54 MB 0.9 384.5 129.8 254.7 384.2 129.6 0.0 3.0 69.0 68.9 5708.08 5424.43 791 7.216 5913M 5753K 0.0 0.0 L2 84/0 2.47 GB 1.0 543.1 128.6 414.5 539.9 125.4 0.7 4.2 69.6 69.2 7989.31 7403.13 4418 1.808 8393M 36M 0.0 0.0 L3 839/0 24.96 GB 1.0 615.6 118.4 497.2 593.2 96.0 5.1 5.0 66.6 64.1 9471.23 8489.31 4193 2.259 9726M 306M 0.0 0.0 L4 2360/0 63.04 GB 0.3 67.6 37.3 30.3 54.4 24.1 38.9 1.5 71.5 57.6 967.30 827.99 907 1.066 1080M 173M 0.0 0.0 Sum 3294/0 90.75 GB 0.0 1683.8 414.2 1269.6 1774.5 504.9 44.8 13.7 63.7 67.1 27074.08 24655.22 20050 1.350 26G 522M 0.0 0.0 Cumulative compaction: 1774.52 GB write, 157.09 MB/s write, 1683.77 GB read, 149.06 MB/s read, 27074.1 seconds ``` Reviewed By: ajkr Differential Revision: D40518319 Pulled By: cbi42 fbshipit-source-id: f4ea614bc0ebefe007ffaf05bb9aec9a8ca25b60	2 years ago
Andrew Kryczka	33ceea9b76	Add DB property for fast block cache stats collection (#10832 ) Summary: This new property allows users to trigger the background block cache stats collection mode through the `GetProperty()` and `GetMapProperty()` APIs. The background mode has much lower overhead at the expense of returning stale values in more cases. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10832 Test Plan: updated unit test Reviewed By: pdillinger Differential Revision: D40497883 Pulled By: ajkr fbshipit-source-id: bdcc93402f426463abb2153756aad9e295447343	2 years ago
Yueh-Hsuan Chiang	e267909ecf	Enable a multi-level db to smoothly migrate to FIFO via DB::Open (#10348 ) Summary: FIFO compaction can theoretically open a DB with any compaction style. However, the current code only allows FIFO compaction to open a DB with a single level. This PR relaxes the limitation of FIFO compaction and allows it to open a DB with multiple levels. Below is the read / write / compaction behavior: * The read behavior is untouched, and it works like a regular rocksdb instance. * The write behavior is untouched as well. When a FIFO compacted DB is opened with multiple levels, all new files will still be in level 0, and no files will be moved to a different level. * Compaction logic is extended. It will first identify the bottom-most non-empty level. Then, it will delete the oldest file in that level. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10348 Test Plan: Added a new test to verify the migration from level to FIFO where the db has multiple levels. Extended existing test cases in db_test and db_basic_test to also verify all entries of a key after reopening the DB with FIFO compaction. Reviewed By: jay-zhuang Differential Revision: D40233744 fbshipit-source-id: 6cc011d6c3467e6bfb9b6a4054b87619e69815e1	2 years ago
Jay Zhuang	5a5f21c489	Allow the last level data moving up to penultimate level (#10782 ) Summary: Lock the penultimate level for the whole compaction inputs range, so any key in that compaction is safe to move up from the last level to penultimate level. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10782 Reviewed By: siying Differential Revision: D40231540 Pulled By: siying fbshipit-source-id: ca115cc8b4018b35d797329fa85a19b06cc8c13e	2 years ago
Peter Dillinger	2d0380adbe	Allow manifest fix-up without requiring prior state (#10796 ) Summary: This change is motivated by ensuring that `ldb update_manifest` or `UpdateManifestForFilesState` can run without expecting files to open when the old temperature is provided (in case the FileSystem strictly interprets non-kUnknown), but ended up fixing a problem in `OfflineManifestWriter` (used by `ldb unsafe_remove_sst_file`) where it would open some SST files during recovery and expect them to match the prior manifest state, even if not required by the intended new state. Also update BackupEngine to retry with Temperature kUnknown when reading file with potentially "wrong" temperature. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10796 Test Plan: tests added/updated, that fail before the change(s) and now pass Reviewed By: jay-zhuang Differential Revision: D40232645 Pulled By: jay-zhuang fbshipit-source-id: b5aa2688aecfe0c320b80a7da689b315414c20be	2 years ago
akankshamahajan	ebf8c454fd	Provide support for async_io with tailing iterators (#10781 ) Summary: Provide support for async_io if ReadOptions.tailing is set true. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10781 Test Plan: - Update unit tests - Ran db_bench: ./db_bench --benchmarks="readrandom" --use_existing_db --use_tailing_iterator=1 --async_io=1 Reviewed By: anand1976 Differential Revision: D40128882 Pulled By: anand1976 fbshipit-source-id: 55e17855536871a5c47e2de92d238ae005c32d01	2 years ago
Jay Zhuang	c401f285c3	Add option `preserve_internal_time_seconds` to preserve the time info (#10747 ) Summary: Add option `preserve_internal_time_seconds` to preserve the internal time information. It's mostly for the migration of the existing data to tiered storage ( `preclude_last_level_data_seconds`). When the tiering feature is just enabled, the existing data won't have the time information to decide if it's hot or cold. Enabling this feature will start collect and preserve the time information for the new data. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10747 Reviewed By: siying Differential Revision: D39910141 Pulled By: siying fbshipit-source-id: 25c21638e37b1a7c44006f636b7d714fe7242138	2 years ago
Peter Dillinger	b205c6d029	Fix bug in HyperClockCache ApplyToEntries; cleanup (#10768 ) Summary: We have seen some rare crash test failures in HyperClockCache, and the source could certainly be a bug fixed in this change, in ClockHandleTable::ConstApplyToEntriesRange. It wasn't properly accounting for the fact that incrementing the acquire counter could be ineffective, due to parallel updates. (When incrementing the acquire counter is ineffective, it is incorrect to then decrement it.) This change includes some other minor clean-up in HyperClockCache, and adds stats_dump_period_sec with a much lower period to the crash test. This should be the primary caller of ApplyToEntries, in collecting cache entry stats. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10768 Test Plan: haven't been able to reproduce the failure, but should be in a better state (bug fix and improved crash test) Reviewed By: anand1976 Differential Revision: D40034747 Pulled By: anand1976 fbshipit-source-id: a06fcefe146e17ee35001984445cedcf3b63eb68	2 years ago
Yanqin Jin	4d82b94896	Sanitize min_write_buffer_number_to_merge to 1 with atomic_flush (#10773 ) Summary: With current implementation, within the same RocksDB instance, all column families with non-empty memtables will be scheduled for flush if RocksDB determines that any column family needs to be flushed, e.g. memtable full, write buffer manager, etc., if atomic flush is enabled. Not doing so can lead to data loss and inconsistency when WAL is disabled, which is a common setting when atomic flush is enabled. Therefore, setting a per-column-family knob, min_write_buffer_number_to_merge to a value greater than 1 is not compatible with atomic flush, and should be sanitized during column family creation and db open. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10773 Test Plan: Reproduce: D39993203 has detailed steps. Run the test with and without the fix. Reviewed By: cbi42 Differential Revision: D40077955 Pulled By: cbi42 fbshipit-source-id: 451a9179eb531ac42eaccf40b451b9dec4085240	2 years ago
Changyu Bi	eca47fb696	Ignore kBottommostFiles compaction logic when allow_ingest_behind (#10767 ) Summary: fix for https://github.com/facebook/rocksdb/issues/10752 where RocksDB could be in an infinite compaction loop (with compaction reason kBottommostFiles) if allow_ingest_behind is enabled and the bottommost level is unfilled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10767 Test Plan: Added a unit test to reproduce the compaction loop. Reviewed By: ajkr Differential Revision: D40031861 Pulled By: ajkr fbshipit-source-id: 71c4b02931fbe507a847632905404c9b8fa8c96b	2 years ago
Changyu Bi	ffde463a5f	Cleanup SuperVersion in Iterator::Refresh() (#10770 ) Summary: Fix a bug in Iterator::Refresh() where the local SV it obtained could be obsolete upon return, and should be cleaned up. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10770 Test Plan: added a unit test to reproduce the issue. Reviewed By: ajkr Differential Revision: D40063809 Pulled By: ajkr fbshipit-source-id: 619e728eb0f1ac9540b4d0ad38e43acc37a514b2	2 years ago
Yanqin Jin	edda219fc3	Manual flush with `wait=false` should not stall when writes stopped (#10001 ) Summary: When `FlushOptions::wait` is set to false, manual flush should not stall forever. If the database has already stopped writes, then the thread calling `DB::Flush()` with `FlushOptions::wait=false` should not enter the `DBImpl::write_thread_`. To prevent this, we should do a check at the beginning and return `TryAgain()` Resolves: https://github.com/facebook/rocksdb/issues/9892 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10001 Reviewed By: siying Differential Revision: D36422303 Pulled By: siying fbshipit-source-id: 723bd3065e8edc4f17c82449d0d6b95a2381ac0a	2 years ago
Jay Zhuang	f007ad8b4f	RoundRobin TTL compaction (#10725 ) Summary: For RoundRobin compaction, the data should be mostly sorted per level and within level. Use normal compaction picker for RR until all expired data is compacted. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10725 Reviewed By: ajkr Differential Revision: D39771069 Pulled By: jay-zhuang fbshipit-source-id: 7ccf88d7c093fad5673bda73a7b08cc4757780cd	2 years ago
akankshamahajan	ae0f9c3339	Add new property in IOOptions to skip recursing through directories and list only files during GetChildren. (#10668 ) Summary: Add new property "do_not_recurse" in IOOptions for underlying file system to skip iteration of directories during DB::Open if there are no sub directories and list only files. By default this property is set to false. This property is set true currently in the code where RocksDB is sure only files are needed during DB::Open. Provided support in PosixFileSystem to use "do_not_recurse". TestPlan: - Existing tests Pull Request resolved: https://github.com/facebook/rocksdb/pull/10668 Reviewed By: anand1976 Differential Revision: D39471683 Pulled By: akankshamahajan15 fbshipit-source-id: 90e32f0b86d5346d53bc2714d3a0e7002590527f	2 years ago
Changyu Bi	9f2363f4c4	User-defined timestamp support for `DeleteRange()` (#10661 ) Summary: Add user-defined timestamp support for range deletion. The new API is `DeleteRange(opt, cf, begin_key, end_key, ts)`. Most of the change is to update the comparator to compare without timestamp. Other than that, major changes are - internal range tombstone data structures (`FragmentedRangeTombstoneList`, `RangeTombstone`, etc.) to store timestamps. - Garbage collection of range tombstones and range tombstone covered keys during compaction. - Get()/MultiGet() to return the timestamp of a range tombstone when needed. - Get/Iterator with range tombstones bounded by readoptions.timestamp. - timestamp crash test now issues DeleteRange by default. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10661 Test Plan: - Added unit test: `make check` - Stress test: `python3 tools/db_crashtest.py --enable_ts whitebox --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4` - Ran `db_bench` to measure regression when timestamp is not enabled. The tests are for write (with some range deletion) and iterate with DB fitting in memory: `./db_bench--benchmarks=fillrandom,seekrandom --writes_per_range_tombstone=200 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=500000 --seek_nexts=10 --disable_auto_compactions -disable_wal=true --max_num_range_tombstones=1000`. Did not see consistent regression in no timestamp case. \| micros/op \| fillrandom \| seekrandom \| \| --- \| --- \| --- \| \|main\| 2.58 \|10.96\| \|PR 10661\| 2.68 \|10.63\| Reviewed By: riversand963 Differential Revision: D39441192 Pulled By: cbi42 fbshipit-source-id: f05aca3c41605caf110daf0ff405919f300ddec2	2 years ago
Jay Zhuang	f3cc66632b	Align compaction output file boundaries to the next level ones (#10655 ) Summary: Try to align the compaction output file boundaries to the next level ones (grandparent level), to reduce the level compaction write-amplification. In level compaction, there are "wasted" data at the beginning and end of the output level files. Align the file boundary can avoid such "wasted" compaction. With this PR, it tries to align the non-bottommost level file boundaries to its next level ones. It may cut file when the file size is large enough (at least 50% of target_file_size) and not too large (2x target_file_size). db_bench shows about 12.56% compaction reduction: ``` TEST_TMPDIR=/data/dbbench2 ./db_bench --benchmarks=fillrandom,readrandom -max_background_jobs=12 -num=400000000 -target_file_size_base=33554432 # baseline: Flush(GB): cumulative 25.882, interval 7.216 Cumulative compaction: 285.90 GB write, 162.36 MB/s write, 269.68 GB read, 153.15 MB/s read, 2926.7 seconds # with this change: Flush(GB): cumulative 25.882, interval 7.753 Cumulative compaction: 249.97 GB write, 141.96 MB/s write, 233.74 GB read, 132.74 MB/s read, 2534.9 seconds ``` The compaction simulator shows a similar result (14% with 100G random data). As a side effect, with this PR, the SST file size can exceed the target_file_size, but is capped at 2x target_file_size. And there will be smaller files. Here are file size statistics when loading 100GB with the target file size 32MB: ``` baseline this_PR count 1.656000e+03 1.705000e+03 mean 3.116062e+07 3.028076e+07 std 7.145242e+06 8.046139e+06 ``` The feature is enabled by default, to revert to the old behavior disable it with `AdvancedColumnFamilyOptions.level_compaction_dynamic_file_size = false` Also includes https://github.com/facebook/rocksdb/issues/1963 to cut file before skippable grandparent file. Which is for use case like user adding 2 or more non-overlapping data range at the same time, it can reduce the overlapping of 2 datasets in the lower levels. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10655 Reviewed By: cbi42 Differential Revision: D39552321 Pulled By: jay-zhuang fbshipit-source-id: 640d15f159ab0cd973f2426cfc3af266fc8bdde2	2 years ago
Changyu Bi	df492791b6	Fix segfault in Iterator::Refresh() (#10739 ) Summary: when a new internal iterator is constructed during iterator refresh, pointer to the previous memtable range tombstone iterator was not cleared. This could cause segfault for future `Refresh()` calls when they try to free the memtable range tombstones. This PR fixes this issue. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10739 Test Plan: added a unit test in db_range_del_test.cc to reproduce this issue. Reviewed By: ajkr, riversand963 Differential Revision: D39825283 Pulled By: cbi42 fbshipit-source-id: 3b59a2b73865aed39e28cdd5c1b57eed7991b94c	2 years ago
Yanqin Jin	52f2411722	Update HISTORY to mention PR #10724 (#10737 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10737 Reviewed By: cbi42 Differential Revision: D39825386 Pulled By: riversand963 fbshipit-source-id: a3c55f2777e034d6ae6ff44ef0219d9fbbf1cc96	2 years ago
Changyu Bi	93f46da1fa	Mention in HISTORY.md the fix in #10705 (#10720 ) Summary: Mention in HISTORY.md the fix in https://github.com/facebook/rocksdb/issues/10705. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10720 Reviewed By: riversand963 Differential Revision: D39709455 Pulled By: cbi42 fbshipit-source-id: 1a2c9dd8425c73f7095ddb1d0b1cca8ed35b7ef2	2 years ago
Akanksha Mahajan	0f4978d34f	Fix sqe->addr passed in cancel request in io_uring (#10644 ) Summary: Update io_uring_prep_cancel as it is now backward compatible. Also, io_uring_prep_cancel expects sqe->addr to match with read request submitted. It's being set wrong which is now fixed in this PR. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10644 Test Plan: - Ran internally with lastest liburing package and on RocksDB github repo with older version. - Ran seekrandom regression to confirm there is no regression. Reviewed By: anand1976 Differential Revision: D39284229 Pulled By: akankshamahajan15 fbshipit-source-id: fd52cdf23d676da114896163626b75c8ae09c980	2 years ago

1 2 3 4 5 ...

1436 Commits (15bb4ea08428aa70d078c3162de5f1b283919e8c)