rocksdb

Commit Graph

Author	SHA1	Message	Date
Jay Huh	81aeb15988	Add WaitForCompact with WaitForCompactOptions to public API (#11436 ) Summary: Context: This is the first PR for WaitForCompact() Implementation with WaitForCompactOptions. In this PR, we are introducing `Status WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` in the public API. This currently utilizes the existing internal `WaitForCompact()` implementation (with default abort_on_pause = false). `abort_on_pause` has been moved to `WaitForCompactOptions&`. In the later PRs, we will introduce the following two options in `WaitForCompactOptions` 1. `bool flush = false` by default - If true, flush before waiting for compactions to finish. Must be set to true to ensure no immediate compactions (except perhaps periodic compactions) after closing and re-opening the DB. 2. `bool close_db = false` by default - If true, will also close the DB upon compactions finishing. 1. struct `WaitForCompactOptions` added to options.h and `abort_on_pause` in the internal API moved to the option struct. 2. `Status WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` introduced in `db.h` 3. Changed the internal WaitForCompact() to `WaitForCompact(const WaitForCompactOptions& wait_for_compact_options)` and checks for the `abort_on_pause` inside the option. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11436 Test Plan: Following tests added - `DBCompactionTest::WaitForCompactWaitsOnCompactionToFinish` - `DBCompactionTest::WaitForCompactAbortOnPauseAborted` - `DBCompactionTest::WaitForCompactContinueAfterPauseNotAborted` - `DBCompactionTest::WaitForCompactShutdownWhileWaiting` - `TransactionTest::WaitForCompactAbortOnPause` NOTE: `TransactionTest::WaitForCompactAbortOnPause` was added to use `StackableDB` to ensure the wrapper function is in place. Reviewed By: pdillinger Differential Revision: D45799659 Pulled By: jaykorean fbshipit-source-id: b5b58f95957f2ab47d1221dee32a61d6cdc4685b	1 year ago
Jay Huh	586d78b31e	Remove wait_unscheduled from waitForCompact internal API (#11443 ) Summary: Context: In pull request https://github.com/facebook/rocksdb/issues/11436, we are introducing a new public API `waitForCompact(const WaitForCompactOptions& wait_for_compact_options)`. This API invokes the internal implementation `waitForCompact(bool wait_unscheduled=false)`. The unscheduled parameter indicates the compactions that are not yet scheduled but are required to process items in the queue. In certain cases, we are unable to wait for compactions, such as during a shutdown or when background jobs are paused. It is important to return the appropriate status in these scenarios. For all other cases, we should wait for all compaction and flush jobs, including the unscheduled ones. The primary purpose of this new API is to wait until the system has resolved its compaction debt. Currently, the usage of `wait_unscheduled` is limited to test code. This pull request eliminates the usage of wait_unscheduled. The internal `waitForCompact()` API now waits for unscheduled compactions unless the db is undergoing a shutdown. In the event of a shutdown, the API returns `Status::ShutdownInProgress()`. Additionally, a new parameter, `abort_on_pause`, has been introduced with a default value of `false`. This parameter addresses the possibility of waiting indefinitely for unscheduled jobs if `PauseBackgroundWork()` was called before `waitForCompact()` is invoked. By setting `abort_on_pause` to `true`, the API will immediately return `Status::Aborted`. Furthermore, all tests that previously called `waitForCompact(true)` have been fixed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11443 Test Plan: Existing tests that involve a shutdown in progress: - DBCompactionTest::CompactRangeShutdownWhileDelayed - DBTestWithParam::PreShutdownMultipleCompaction - DBTestWithParam::PreShutdownCompactionMiddle Reviewed By: pdillinger Differential Revision: D45923426 Pulled By: jaykorean fbshipit-source-id: 7dc93fe6a6841a7d9d2d72866fa647090dba8eae	2 years ago
Hui Xiao	151242ce46	Group rocksdb.sst.read.micros stat by IOActivity flush and compaction (#11288 ) Summary: Context: The existing stat rocksdb.sst.read.micros does not reflect each of compaction and flush cases but aggregate them, which is not so helpful for us to understand IO read behavior of each of them. Summary - Update `StopWatch` and `RandomAccessFileReader` to record `rocksdb.sst.read.micros` and `rocksdb.file.{flush/compaction}.read.micros` - Fixed the default histogram in `RandomAccessFileReader` - New field `ReadOptions/IOOptions::io_activity`; Pass `ReadOptions` through paths under db open, flush and compaction to where we can prepare `IOOptions` and pass it to `RandomAccessFileReader` - Use `thread_status_util` for assertion in `DbStressFSWrapper` for continuous testing on we are passing correct `io_activity` under db open, flush and compaction Pull Request resolved: https://github.com/facebook/rocksdb/pull/11288 Test Plan: - Stress test - Db bench 1: rocksdb.sst.read.micros COUNT ≈ sum of rocksdb.file.read.flush.micros's and rocksdb.file.read.compaction.micros's. (without blob) - May not be exactly the same due to `HistogramStat::Add` only guarantees atomic not accuracy across threads. ``` ./db_bench -db=/dev/shm/testdb/ -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 (-use_plain_table=1 -prefix_size=10) ``` ``` // BlockBasedTable rocksdb.sst.read.micros P50 : 2.009374 P95 : 4.968548 P99 : 8.110362 P100 : 43.000000 COUNT : 40456 SUM : 114805 rocksdb.file.read.flush.micros P50 : 1.871841 P95 : 3.872407 P99 : 5.540541 P100 : 43.000000 COUNT : 2250 SUM : 6116 rocksdb.file.read.compaction.micros P50 : 2.023109 P95 : 5.029149 P99 : 8.196910 P100 : 26.000000 COUNT : 38206 SUM : 108689 // PlainTable Does not apply ``` - Db bench 2: performance Read SETUP: db with 900 files ``` ./db_bench -db=/dev/shm/testdb/ -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=true -target_file_size_base=655 -compression_type=none ```run till convergence ``` ./db_bench -seed=1678564177044286 -use_existing_db=true -db=/dev/shm/testdb -benchmarks=readrandom[-X60] -statistics=true -num=1000000 -disable_auto_compactions=true -compression_type=none -bloom_bits=3 ``` Pre-change `readrandom [AVG 60 runs] : 21568 (± 248) ops/sec` Post-change (no regression, -0.3%) `readrandom [AVG 60 runs] : 21486 (± 236) ops/sec` Compaction/Flushrun till convergence ``` ./db_bench -db=/dev/shm/testdb2/ -seed=1678564177044286 -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=false -target_file_size_base=655 -compression_type=none rocksdb.sst.read.micros COUNT : 33820 rocksdb.sst.read.flush.micros COUNT : 1800 rocksdb.sst.read.compaction.micros COUNT : 32020 ``` Pre-change `fillseq [AVG 46 runs] : 1391 (± 214) ops/sec; 0.7 (± 0.1) MB/sec` Post-change (no regression, ~-0.4%) `fillseq [AVG 46 runs] : 1385 (± 216) ops/sec; 0.7 (± 0.1) MB/sec` Reviewed By: ajkr Differential Revision: D44007011 Pulled By: hx235 fbshipit-source-id: a54c89e4846dfc9a135389edf3f3eedfea257132	2 years ago
Murali Vilayannur	226ee25d30	Block fetch CPU time counters in perf context (#11342 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11342 Reviewed By: ajkr Differential Revision: D45026838 Pulled By: mnv104 fbshipit-source-id: 099ed9579922b8fa6e7d3332bbb829d50ec47d91	2 years ago
Andrew Kryczka	504fe4de80	Avoid allocations/copies for large `GetMergeOperands()` results (#10458 ) Summary: This PR avoids allocations and copies for the result of `GetMergeOperands()` when the average operand size is at least 256 bytes and the total operands size is at least 32KB. The `GetMergeOperands()` already included `PinnableSlice` but was calling `PinSelf()` (i.e., allocating and copying) for each operand. When this optimization takes effect, we instead call `PinSlice()` to skip that allocation and copy. Resources are pinned in order for the `PinnableSlice` to point to valid memory even after `GetMergeOperands()` returns. The pinned resources include a referenced `SuperVersion`, a `MergingContext`, and a `PinnedIteratorsManager`. They are bundled into a `GetMergeOperandsState`. We use `SharedCleanablePtr` to share that bundle among all `PinnableSlice`s populated by `GetMergeOperands()`. That way, the last `PinnableSlice` to be `Reset()` will cleanup the bundle, including unreferencing the `SuperVersion`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10458 Test Plan: - new DB level test - measured benefit/regression in a number of memtable scenarios Setup command: ``` $ ./db_bench -benchmarks=mergerandom -merge_operator=StringAppendOperator -num=$num -writes=16384 -key_size=16 -value_size=$value_sz -compression_type=none -write_buffer_size=1048576000 ``` Benchmark command: ``` ./db_bench -threads=$threads -use_existing_db=true -avoid_flush_during_recovery=true -write_buffer_size=1048576000 -benchmarks=readrandomoperands -merge_operator=StringAppendOperator -num=$num -duration=10 ``` Worst regression is when a key has many tiny operands: - Parameters: num=1 (implying 16384 operands per key), value_sz=8, threads=1 - `GetMergeOperands()` latency increases 682 micros -> 800 micros (+17%) The regression disappears into the noise (<1% difference) if we remove the `Reset()` loop and the size counting loop. The former is arguably needed regardless of this PR as the convention in `Get()` and `MultiGet()` is to `Reset()` the input `PinnableSlice`s at the start. The latter could be optimized to count the size as we accumulate operands rather than after the fact. Best improvement is when a key has large operands and high concurrency: - Parameters: num=4 (implying 4096 operands per key), value_sz=2KB, threads=32 - `GetMergeOperands()` latency decreases 11492 micros -> 437 micros (-96%). Reviewed By: cbi42 Differential Revision: D38336578 Pulled By: ajkr fbshipit-source-id: 48146d127e04cb7f2d4d2939a2b9dff3aba18258	2 years ago
Jay Zhuang	5490da20a5	Fix flaky db_basic_bench caused by unreleased iterator (#10058 ) Summary: Iterator is not freed after test is done (after the main for loop), which could cause db close failed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10058 Test Plan: Able to reproduce consistently with higher thread number, like 100, make sure it passes after the fix Reviewed By: ajkr Differential Revision: D36685823 Pulled By: jay-zhuang fbshipit-source-id: 4c98b8758d106bfe40cae670e689c3d284765bcf	3 years ago
Andrew Kryczka	7b7a37c069	Add microbenchmarks for `DB::GetMergeOperands()` (#9971 ) Summary: The new microbenchmarks, DBGetMergeOperandsInMemtable and DBGetMergeOperandsInSstFile, correspond to the two different LSMs tested: all data in one memtable and all data in one SST file, respectively. Both cases are parameterized by thread count (1 or 8) and merge operands per key (1, 32, or 1024). The SST file case is additionally parameterized by whether data is in block cache or mmap'd memory. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9971 Test Plan: ``` $ TEST_TMPDIR=/dev/shm/db_basic_bench/ ./db_basic_bench --benchmark_filter=DBGetMergeOperands The number of inputs is very large. DBGet will be repeated at least 192 times. The number of inputs is very large. DBGet will be repeated at least 192 times. 2022-05-09T13:15:40-07:00 Running ./db_basic_bench Run on (36 X 2570.91 MHz CPU s) CPU Caches: L1 Data 32 KiB (x18) L1 Instruction 32 KiB (x18) L2 Unified 1024 KiB (x18) L3 Unified 25344 KiB (x1) Load Average: 4.50, 4.33, 4.37 ---------------------------------------------------------------------------------------------------------------------------- Benchmark Time CPU Iterations UserCounters... ---------------------------------------------------------------------------------------------------------------------------- DBGetMergeOperandsInMemtable/entries_per_key:1/threads:1 846 ns 846 ns 849893 db_size=0 DBGetMergeOperandsInMemtable/entries_per_key:32/threads:1 2436 ns 2436 ns 305779 db_size=0 DBGetMergeOperandsInMemtable/entries_per_key:1024/threads:1 77226 ns 77224 ns 8152 db_size=0 DBGetMergeOperandsInMemtable/entries_per_key:1/threads:8 116 ns 929 ns 779368 db_size=0 DBGetMergeOperandsInMemtable/entries_per_key:32/threads:8 330 ns 2644 ns 280824 db_size=0 DBGetMergeOperandsInMemtable/entries_per_key:1024/threads:8 12466 ns 99718 ns 7200 db_size=0 DBGetMergeOperandsInSstFile/entries_per_key:1/mmap:0/threads:1 1640 ns 1640 ns 461262 db_size=21.7826M DBGetMergeOperandsInSstFile/entries_per_key:1/mmap:1/threads:1 1693 ns 1693 ns 439936 db_size=21.7826M DBGetMergeOperandsInSstFile/entries_per_key:32/mmap:0/threads:1 3999 ns 3999 ns 172881 db_size=19.6981M DBGetMergeOperandsInSstFile/entries_per_key:32/mmap:1/threads:1 5544 ns 5543 ns 135657 db_size=19.6981M DBGetMergeOperandsInSstFile/entries_per_key:1024/mmap:0/threads:1 78767 ns 78761 ns 8395 db_size=19.6389M DBGetMergeOperandsInSstFile/entries_per_key:1024/mmap:1/threads:1 157242 ns 157238 ns 4495 db_size=19.6389M DBGetMergeOperandsInSstFile/entries_per_key:1/mmap:0/threads:8 231 ns 1848 ns 347768 db_size=21.7826M DBGetMergeOperandsInSstFile/entries_per_key:1/mmap:1/threads:8 214 ns 1715 ns 393312 db_size=21.7826M DBGetMergeOperandsInSstFile/entries_per_key:32/mmap:0/threads:8 596 ns 4767 ns 142088 db_size=19.6981M DBGetMergeOperandsInSstFile/entries_per_key:32/mmap:1/threads:8 720 ns 5757 ns 118200 db_size=19.6981M DBGetMergeOperandsInSstFile/entries_per_key:1024/mmap:0/threads:8 11613 ns 92460 ns 7344 db_size=19.6389M DBGetMergeOperandsInSstFile/entries_per_key:1024/mmap:1/threads:8 19989 ns 159908 ns 4440 db_size=19.6389M ``` Reviewed By: jay-zhuang Differential Revision: D36258861 Pulled By: ajkr fbshipit-source-id: 04b733e1cc3a4a70ed9baa894c50fdf96c0d6064	3 years ago
sdong	736a7b5433	Remove own ToString() (#9955 ) Summary: ToString() is created as some platform doesn't support std::to_string(). However, we've already used std::to_string() by mistake for 16 months (in db/db_info_dumper.cc). This commit just remove ToString(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/9955 Test Plan: Watch CI tests Reviewed By: riversand963 Differential Revision: D36176799 fbshipit-source-id: bdb6dcd0e3a3ab96a1ac810f5d0188f684064471	3 years ago
Andrew Kryczka	440c7f6306	db_basic_bench fix for DB object cleanup (#9939 ) Summary: Use `unique_ptr<DB>` to make sure the DB object is deleted. Previously it was not, which led to accumulating file descriptors for deleted directories because a `DBImpl::db_dir_` from each test remained alive. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9939 Test Plan: run `lsof -p $(pidof db_basic_bench)` while benchmark runs; verify no FDs for deleted directories. Reviewed By: jay-zhuang Differential Revision: D36108761 Pulled By: ajkr fbshipit-source-id: cfe02646b038a445af7d5db8989eb1f40d658359	3 years ago
Andrew Kryczka	1eb279dcce	Add mmap DBGet microbench parameters (#9903 ) Summary: I tried evaluating https://github.com/facebook/rocksdb/issues/9611 using DBGet microbenchmarks but mostly found the change is well within the noise even for hundreds of repetitions; meanwhile, the InternalKeyComparator CPU it saves is 1-2% according to perf so it should be measurable. In this PR I tried adding a mmap mode that will bypass compression/checksum/block cache/file read to focus more on the block lookup paths, and also increased the Get() count. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9903 Reviewed By: jay-zhuang, riversand963 Differential Revision: D35907375 Pulled By: ajkr fbshipit-source-id: 69490d5040ef0863e1ce296724104d0aa7667215	3 years ago
Jingjing Wang	5a085d789d	pristine code Summary: This commit was generated using `mgt import`. pristine code for third-party libraries: third-party/benchmark upgrade google benchmark to v1.6.1 contains a local patch that reverts [this](https://github.com/google/benchmark/pull/1227?fbclid=IwAR2CCmIJmjU62SPPQQf_t8kdAsMjYv_Pa_GxabYUOdQpGPZUHKwbnYS_1oE) and changs `enum Flags` to be `enum Flags : uint32_t`. Reviewed By: chadaustin Differential Revision: D35136540 fbshipit-source-id: f3662f953cd87956e5e9b767e55e3697f99d3b49	3 years ago
Jay Zhuang	18463f8c00	Remove DBGet P95/P99 benchmark metrics (#9742 ) Summary: DBGet p95 and p99 have high variation, remove them for now. Also increase the iteration to 3 to avoid false positive. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9742 Test Plan: Internal CI Reviewed By: ajkr Differential Revision: D35082820 Pulled By: jay-zhuang fbshipit-source-id: facc1d56b94e54aa8c8852c207aae2ae4e4924b0	3 years ago
Patrick Somaru	ff8763c187	regenerate config jsons, reduce noise (#9644 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9644 Reviewed By: jay-zhuang Differential Revision: D34543778 fbshipit-source-id: eae5f2c0ced4c11d365d0049bdb288598e364e8f	3 years ago
Jay Zhuang	f4b2500e12	Add last level and non-last level read statistics (#9519 ) Summary: Add last level and non-last level read statistics: ``` LAST_LEVEL_READ_BYTES, LAST_LEVEL_READ_COUNT, NON_LAST_LEVEL_READ_BYTES, NON_LAST_LEVEL_READ_COUNT, ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/9519 Test Plan: added unittest Reviewed By: siying Differential Revision: D34062539 Pulled By: jay-zhuang fbshipit-source-id: 908644c3050878b4234febdc72e3e19d89af38cd	3 years ago
Jay Zhuang	980b9ff385	Add more micro-benchmark tests (#9436 ) Summary: * Add more micro-benchmark tests * Expose an API in DBImpl for waiting for compactions (still not visible to the user) * Add argument name for ribbon_bench * remove benchmark run from CI, as it runs too long. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9436 Test Plan: CI Reviewed By: riversand963 Differential Revision: D33777836 Pulled By: jay-zhuang fbshipit-source-id: c05de3bc082cc05b5d019f00b324e774bf4bbd96	3 years ago
Jay Zhuang	29102641dd	Skip directory fsync for filesystem btrfs (#8903 ) Summary: Directory fsync might be expensive on btrfs and it may not be needed. Here are 4 directory fsync cases: 1. creating a new file: dir-fsync is not needed on btrfs, as long as the new file itself is synced. 2. renaming a file: dir-fsync is not needed if the renamed file is synced. So an API `FsyncAfterFileRename(filename, ...)` is provided to sync the file on btrfs. By default, it just calls dir-fsync. 3. deleting files: dir-fsync is forced by set `IOOptions.force_dir_fsync = true` 4. renaming multiple files (like backup and checkpoint): dir-fsync is forced, the same as above. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8903 Test Plan: run tests on btrfs and non btrfs Reviewed By: ajkr Differential Revision: D30885059 Pulled By: jay-zhuang fbshipit-source-id: dd2730b31580b0bcaedffc318a762d7dbf25de4a	3 years ago

16 Commits (56ca9e3106e05c0c031b179ecbb9cb0865fc22dd)