Summary:
This PR changes the default value of
`CompactRangeOptions::exclusive_manual_compaction` from true to false so
manual `CompactRange()`s can run in parallel with other compactions. I
believe no artificial parallelism restriction is the intuitive behavior
so feel the old default value is a trap, which I have fallen into
several times, including yesterday.
`CompactRangeOptions::exclusive_manual_compaction == false` has been
used in both our correctness test and in production for years so should
be reasonably safe.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10317
Reviewed By: jay-zhuang
Differential Revision: D37659392
Pulled By: ajkr
fbshipit-source-id: 504915e978bbe300b79483d064070c75e93d91e5
Summary:
Allow sufficient subcompactions can be used when the number of input files is less than `max_subcompactions` under round-robin compaction priority.
Test Case:
Add `RoundRobinWithoutAdditionalResources` into `db_compaction_test`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10422
Reviewed By: ajkr
Differential Revision: D38186545
Pulled By: littlepig2013
fbshipit-source-id: b8e5098306f1e5b9561dfafafc8300a38f7fe88e
Summary:
Earlier implementation of round-robin priority can only pick one file at a time and disallows parallel compactions within the same level. In this PR, round-robin compaction policy will expand towards more input files with respecting some additional constraints, which are summarized as follows:
* Constraint 1: We can only pick consecutive files
- Constraint 1a: When a file is being compacted (or some input files are being compacted after expanding), we cannot choose it and have to stop choosing more files
- Constraint 1b: When we reach the last file (with the largest keys), we cannot choose more files (the next file will be the first one with small keys)
* Constraint 2: We should ensure the total compaction bytes (including the overlapped files from the next level) is no more than `mutable_cf_options_.max_compaction_bytes`
* Constraint 3: We try our best to pick as many files as possible so that the post-compaction level size can be just less than `MaxBytesForLevel(start_level_)`
* Constraint 4: If trivial move is allowed, we reuse the logic of `TryNonL0TrivialMove()` instead of expanding files with Constraint 3
More details can be found in `LevelCompactionBuilder::SetupOtherFilesWithRoundRobinExpansion()`.
The above optimization accelerates the process of moving the compaction cursor, in which the write-amp can be further reduced. While a large compaction may lead to high write stall, we break this large compaction into several subcompactions **regardless of** the `max_subcompactions` limit. The number of subcompactions for round-robin compaction priority is determined through the following steps:
* Step 1: Initialized against `max_output_file_limit`, the number of input files in the start level, and also the range size limit `ranges.size()`
* Step 2: Call `AcquireSubcompactionResources()`when max subcompactions is not sufficient, but we may or may not obtain desired resources, additional number of resources is stored in `extra_num_subcompaction_threads_reserved_`). Subcompaction limit is changed and update `num_planned_subcompactions` with `GetSubcompactionLimit()`
* Step 3: Call `ShrinkSubcompactionResources()` to ensure extra resources can be released (extra resources may exist for round-robin compaction when the number of actual number of subcompactions is less than the number of planned subcompactions)
More details can be found in `CompactionJob::AcquireSubcompactionResources()`,`CompactionJob::ShrinkSubcompactionResources()`, and `CompactionJob::ReleaseSubcompactionResources()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10341
Test Plan: Add `CompactionPriMultipleFilesRoundRobin[1-3]` unit test in `compaction_picker_test.cc` and `RoundRobinSubcompactionsAgainstResources.SubcompactionsUsingResources/[0-4]`, `RoundRobinSubcompactionsAgainstPressureToken.PressureTokenTest/[0-1]` in `db_compaction_test.cc`
Reviewed By: ajkr, hx235
Differential Revision: D37792644
Pulled By: littlepig2013
fbshipit-source-id: 7fecb7c4ffd97b34bbf6e3b760b2c35a772a0657
Summary:
## Problem Summary
RocksDB will acquire the global mutex of db instance for every time when user calls `Write`. When RocksDB schedules a lot of compaction jobs, it will compete the mutex with write thread and it will hurt the write performance.
## Problem Solution:
I want to use log_write_mutex to replace the global mutex in most case so that we do not acquire it in write-thread unless there is a write-stall event or a write-buffer-full event occur.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7516
Test Plan:
1. make check
2. CI
3. COMPILE_WITH_TSAN=1 make db_stress
make crash_test
make crash_test_with_multiops_wp_txn
make crash_test_with_multiops_wc_txn
make crash_test_with_atomic_flush
Reviewed By: siying
Differential Revision: D36908702
Pulled By: riversand963
fbshipit-source-id: 59b13881f4f5c0a58fd3ca79128a396d9cd98efe
Summary:
Using the Sequence number to time mapping to decide if a key is hot or not in
compaction and place it in the corresponding level.
Note: the feature is not complete, level compaction will run indefinitely until
all penultimate level data is cold and small enough to not trigger compaction.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10370
Test Plan:
CI
* Run basic db_bench for universal compaction manually
Reviewed By: siying
Differential Revision: D37892338
Pulled By: jay-zhuang
fbshipit-source-id: 792bbd91b1ccc2f62b5d14c53118434bcaac4bbe
Summary:
Support per_key_placement for last level compaction, which will
be used for tiered compaction.
* compaction iterator reports which level a key should output to;
* compaction get the output level information and check if it's safe to
output the data to penultimate level;
* all compaction output files will be installed.
* extra internal compaction stats added for penultimate level.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9964
Test Plan:
* Unittest
* db_bench, no significate difference: https://gist.github.com/jay-zhuang/3645f8fb97ec0ab47c10704bb39fd6e4
* microbench manual compaction no significate difference: https://gist.github.com/jay-zhuang/ba679b3e89e24992615ee9eef310e6dd
* run the db_stress multiple times (not covering the new feature) looks good (internal: https://fburl.com/sandcastle/9w84pp2m)
Reviewed By: ajkr
Differential Revision: D36249494
Pulled By: jay-zhuang
fbshipit-source-id: a96da57c8031c1df83e4a7a8567b657a112b80a3
Summary:
In leveled compaction, L0->L1 trivial move will allow more than one file to be moved in one compaction. This would allow L0 files to be moved down faster when data is loaded in sequential order, making slowdown or stop condition harder to hit. Also seek L0->L1 trivial move when only some files qualify.
1. We always try to find L0->L1 trivial move from the oldest files. Keep including newer files, until adding a new file won't trigger a trivial move
2. Modify the trivial move condition so that this compaction would be tagged as trivial move.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10188
Test Plan:
See throughput improvements with db_bench with fast fillseq benchmark and small L0 files:
./db_bench_l0_move --benchmarks=fillseq --compression_type=lz4 --write_buffer_size=5000000 --num=100000000 --value_size=1000 -level_compaction_dynamic_level_bytes
The throughput improved by about 50%. Stalling still happens though.
Reviewed By: jay-zhuang
Differential Revision: D37224743
fbshipit-source-id: 8958d97f22e12bdfc14d2e85930f6fa0070e9659
Summary:
The 'PersistRoundRobinCompactCursor' unit test in `db_compaction_test` may occasionally fail due to the inconsistent LSM state. The issue is fixed by adding `Flush()` and `WaitForFlushMemTable()` to produce a more predictable and stable LSM state.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10250
Test Plan: 'PersistRoundRobinCompactCursor' unit test in `db_compaction_test`
Reviewed By: jay-zhuang, riversand963
Differential Revision: D37426091
Pulled By: littlepig2013
fbshipit-source-id: 56fbaab0384c380c1f279a16dc8732b139c9f611
Summary:
The files behind the compaction cursor contain newer data than the files ahead of it. If a compaction writes a file that spans from before its output level’s cursor to after it, then data before the cursor will be contaminated with the old timestamp from the data after the cursor. To avoid this, we can split the output file into two – one entirely before the cursor and one entirely after the cursor. Note that, in rare cases, we **DO NOT** need to cut the file if it is a trivial move since the file will not be contaminated by older files. In such case, the compact cursor is not guaranteed to be the boundary of the file, but it does not hurt the round-robin selection process.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10227
Test Plan:
Add 'RoundRobinCutOutputAtCompactCursor' unit test in `db_compaction_test`
Task: [T122216351](https://www.internalfb.com/intern/tasks/?t=122216351)
Reviewed By: jay-zhuang
Differential Revision: D37388088
Pulled By: littlepig2013
fbshipit-source-id: 9246a6a084b6037b90d6ab3183ba4dfb75a3378d
Summary:
Add `kRoundRobin` as a compaction priority. The implementation is as follows.
- Define a cursor as the smallest Internal key in the successor of the selected file. Add `vector<InternalKey> compact_cursor_` into `VersionStorageInfo` where each element (`InternalKey`) in `compact_cursor_` represents a cursor. In round-robin compaction policy, we just need to select the first file (assuming files are sorted) and also has the smallest InternalKey larger than/equal to the cursor. After a file is chosen, we create a new `Fsize` vector which puts the selected file is placed at the first position in `temp`, the next cursor is then updated as the smallest InternalKey in successor of the selected file (the above logic is implemented in `SortFileByRoundRobin`).
- After a compaction succeeds, typically `InstallCompactionResults()`, we choose the next cursor for the input level and save it to `edit`. When calling `LogAndApply`, we save the next cursor with its level into some local variable and finally apply the change to `vstorage` in `SaveTo` function.
- Cursors are persist pair by pair (<level, InternalKey>) in `EncodeTo` so that they can be reconstructed when reopening. An empty cursor will not be encoded to MANIFEST
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10107
Test Plan: add unit test (`CompactionPriRoundRobin`) in `compaction_picker_test`, add `kRoundRobin` priority in `CompactionPriTest` from `db_compaction_test`, and add `PersistRoundRobinCompactCursor` in `db_compaction_test`
Reviewed By: ajkr
Differential Revision: D37316037
Pulled By: littlepig2013
fbshipit-source-id: 9f481748190ace416079139044e00df2968fb1ee
Summary:
Garbage collection is generally controlled by the BlobDB configuration options `enable_blob_garbage_collection` and `blob_garbage_collection_age_cutoff`. However, there might be use cases where we would want to temporarily override these options while performing a manual compaction. (One use case would be doing a full key-space manual compaction with full=100% garbage collection age cutoff in order to minimize the space occupied by the database.) Our goal here is to make it possible to override the configured GC parameters when using the `CompactRange` API to perform manual compactions. This PR would involve:
- Extending the `CompactRangeOptions` structure so clients can both force-enable and force-disable GC, as well as use a different cutoff than what's currently configured
- Storing whether blob GC should actually be enabled during a certain manual compaction and the cutoff to use in the `Compaction` object (considering the above overrides) and passing it to `CompactionIterator` via `CompactionProxy`
- Updating the BlobDB wiki to document the new options.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10073
Test Plan: Adding unit tests and adding the new options to the stress test tool.
Reviewed By: ltamasi
Differential Revision: D36848700
Pulled By: gangliao
fbshipit-source-id: c878ef101d1c612429999f513453c319f75d78e9
Summary:
ToString() is created as some platform doesn't support std::to_string(). However, we've already used std::to_string() by mistake for 16 months (in db/db_info_dumper.cc). This commit just remove ToString().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9955
Test Plan: Watch CI tests
Reviewed By: riversand963
Differential Revision: D36176799
fbshipit-source-id: bdb6dcd0e3a3ab96a1ac810f5d0188f684064471
Summary:
Right now we still don't fully use std::numeric_limits but use a macro, mainly for supporting VS 2013. Right now we only support VS 2017 and up so it is not a problem. The code comment claims that MinGW still needs it. We don't have a CI running MinGW so it's hard to validate. since we now require C++17, it's hard to imagine MinGW would still build RocksDB but doesn't support std::numeric_limits<>.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9954
Test Plan: See CI Runs.
Reviewed By: riversand963
Differential Revision: D36173954
fbshipit-source-id: a35a73af17cdcae20e258cdef57fcf29a50b49e0
Summary:
... by filling out remaining testing hole: handling of
db_pathsi+cf_paths. (Note that while GetLiveFilesStorageInfo works
with db_paths / cf_paths, Checkpoint and BackupEngine do not and
are marked appropriately.)
Also improved comments for "live files" APIs, and grouped them
together in db.h.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9868
Test Plan: Adding to existing unit tests
Reviewed By: jay-zhuang
Differential Revision: D35752254
Pulled By: pdillinger
fbshipit-source-id: c70eb67748fad61826e2f554b674638700abefb2
Summary:
In `FileMetaData`, we keep track of the lowest-numbered blob file
referenced by the SST file in question for the purposes of BlobDB's
garbage collection in the `oldest_blob_file_number` field, which is
updated in `UpdateBoundaries`. However, with the current code,
`BlobIndex` decoding errors (or invalid blob file numbers) are swallowed
in this method. The patch changes this by propagating these errors
and failing the corresponding flush/compaction. (Note that since blob
references are generated by the BlobDB code and also parsed by
`CompactionIterator`, in reality this can only happen in the case of
memory corruption.)
This change necessitated updating some unit tests that involved
fake/corrupt `BlobIndex` objects. Some of these just used a dummy string like
`"blob_index"` as a placeholder; these were replaced with real `BlobIndex`es.
Some were relying on the earlier behavior to simulate corruption; these
were replaced with `SyncPoint`-based test code that corrupts a valid
blob reference at read time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9851
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D35683671
Pulled By: ltamasi
fbshipit-source-id: f7387af9945c48e4d5c4cd864f1ba425c7ad51f6
Summary:
When sub compaction is decided for L0->L1 compaction, most of the cases, all L0 files will be involved in all sub compactions. However, it is not always the case. When files are generally (but not strictly) inserted in sequential order, there can be a subset of L0 files invovled. Yet RocksDB always open all those L0 files, and build an iterator, read many of the files' first of last block with expensive readahead. We trim some input files to reduce overhead a little bit.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9802
Test Plan: Add a unit test to cover this case and manually validate the behavior while running the test.
Reviewed By: ajkr
Differential Revision: D35371031
fbshipit-source-id: 701ed7375b5cbe41672e93b38fe8a1503dad08b6
Summary:
In https://github.com/facebook/rocksdb/issues/9659, when `DisableManualCompaction()` is issued, the foreground
manual compaction thread does not have to wait background compaction
thread to finish. Which could be a problem that the user re-enable
manual compaction with `EnableManualCompaction()`, it may re-enable the
BG compaction which supposed be cancelled.
This patch makes the FG compaction wait on
`manual_compaction_state.done`, which either be set by BG compaction or
Unschedule callback. Then when FG manual compaction thread returns, it
should not have BG compaction running. So shared_ptr is no longer needed
for `manual_compaction_state`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9694
Test Plan: a StressTest and unittest
Reviewed By: ajkr
Differential Revision: D34885472
Pulled By: jay-zhuang
fbshipit-source-id: e6476175b43e8c59cd49f5c09241036a0716c274
Summary:
https://github.com/facebook/rocksdb/issues/9625 didn't change the unschedule condition which was waiting for the background thread to clean-up the compaction.
make sure we only unschedule the task when it's scheduled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9659
Reviewed By: ajkr
Differential Revision: D34651820
Pulled By: jay-zhuang
fbshipit-source-id: 23f42081b15ec8886cd81cbf131b116e0c74dc2f
Summary:
PR https://github.com/facebook/rocksdb/issues/9557 introduced a race condition between manual compaction
foreground thread and background compaction thread.
This PR adds the ability to really unschedule manual compaction from
thread-pool queue by differentiate tag name for manual compaction and
other tasks.
Also fix an issue that db `close()` didn't cancel the manual compaction thread.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9625
Test Plan: unittest not hang
Reviewed By: ajkr
Differential Revision: D34410811
Pulled By: jay-zhuang
fbshipit-source-id: cb14065eabb8cf1345fa042b5652d4f788c0c40c
Summary:
Fix `DisableManualCompaction()` has to wait scheduled manual compaction to start the execution to cancel the job.
When a manual compaction in thread-pool queue is cancel, set the job is_canceled to true and clean the resource.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9557
Test Plan: added unittest that will hang without the change
Reviewed By: ajkr
Differential Revision: D34214910
Pulled By: jay-zhuang
fbshipit-source-id: 89dbaee78ddf26eb13ce862c2b15f4a098b36a78
Summary:
The patch replaces `std::map` with a sorted `std::vector` for
`VersionStorageInfo::blob_files_` and preallocates the space
for the `vector` before saving the `BlobFileMetaData` into the
new `VersionStorageInfo` in `VersionBuilder::Rep::SaveBlobFilesTo`.
These changes reduce the time the DB mutex is held while
saving new `Version`s, and using a sorted `vector` also makes
lookups faster thanks to better memory locality.
In addition, the patch introduces helper methods
`VersionStorageInfo::GetBlobFileMetaData` and
`VersionStorageInfo::GetBlobFileMetaDataLB` that can be used by
clients to perform lookups in the `vector`, and does some general
cleanup in the parts of code where blob file metadata are used.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9526
Test Plan:
Ran `make check` and the crash test script for a while.
Performance was tested using a load-optimized benchmark (`fillseq` with vector memtable, no WAL) and small file sizes so that a significant number of files are produced:
```
numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/ltamasi-dbbench --wal_dir=/data/ltamasi-dbbench --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --enable_blob_files=1 --blob_file_size=16777216 --min_blob_size=0 --blob_compression_type=lz4 --enable_blob_garbage_collection=1 --seed=<some value>
```
Final statistics before the patch:
```
Cumulative writes: 0 writes, 700M keys, 0 commit groups, 0.0 writes per commit group, ingest: 284.62 GB, 121.27 MB/s
Interval writes: 0 writes, 334K keys, 0 commit groups, 0.0 writes per commit group, ingest: 139.28 MB, 72.46 MB/s
```
With the patch:
```
Cumulative writes: 0 writes, 760M keys, 0 commit groups, 0.0 writes per commit group, ingest: 308.66 GB, 131.52 MB/s
Interval writes: 0 writes, 445K keys, 0 commit groups, 0.0 writes per commit group, ingest: 185.35 MB, 93.15 MB/s
```
Total time to complete the benchmark is 2611 seconds with the patch, down from 2986 secs.
Reviewed By: riversand963
Differential Revision: D34082728
Pulled By: ltamasi
fbshipit-source-id: fc598abf676dce436734d06bb9d2d99a26a004fc
Summary:
In RocksDB option new_table_reader_for_compaction_inputs has
not effect on Compaction or on the behavior of RocksDB library.
Therefore, we are removing it in the upcoming 7.0 release.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9443
Test Plan: CircleCI
Reviewed By: ajkr
Differential Revision: D33788508
Pulled By: akankshamahajan15
fbshipit-source-id: 324ca6f12bfd019e9bd5e1b0cdac39be5c3cec7d
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9266
This diff adds a new tag `CommitWithTimestamp`. Currently, there is no API to trigger writing
this tag to WAL, thus it is unavailable to users.
This is an ongoing effort to add user-defined timestamp support to write-committed transactions.
This diff also indicates all column families that may potentially participate in the same
transaction must either disable timestamp or have the same timestamp format, since
`CommitWithTimestamp` tag is followed by a single byte-array denoting the commit
timestamp of the transaction. We will enforce this checking in a future diff. We keep this
diff small.
Reviewed By: ltamasi
Differential Revision: D31721350
fbshipit-source-id: e1450811443647feb6ca01adec4c8aaae270ffc6
Summary:
- Fixed bug where bottom-pri manual compactions were counting towards `bg_compaction_scheduled_` instead of `bg_bottom_compaction_scheduled_`. It seems to have no negative effect.
- Fixed bug where automatic compaction scheduling did not consider `bg_bottom_compaction_scheduled_`. Now automatic compactions cannot be scheduled that exceed the per-DB compaction concurrency limit (`max_compactions`) when some existing compactions are bottommost.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9179
Test Plan: new unit test for manual/automatic. Also verified the existing automatic/automatic test ("ConcurrentBottomPriLowPriCompactions") hanged until changing it to explicitly enable concurrency.
Reviewed By: riversand963
Differential Revision: D32488048
Pulled By: ajkr
fbshipit-source-id: 20c4c0693678e81e43f85ed3cc3402fcf26e3310
Summary:
Right now, when options.ttl is set, compactions are triggered around the time when TTL is reached. This might cause extra compactions which are often bursty. This commit tries to mitigate it by picking those files earlier in normal compaction picking process. This is only implemented using kMinOverlappingRatio with Leveled compaction as it is the default value and it is more complicated to change other styles.
When a file is aged more than ttl/2, RocksDB starts to boost the compaction priority of files in normal compaction picking process, and hope by the time TTL is reached, very few extra compaction is needed.
In order for this to work, another change is made: during a compaction, if an output level file is older than ttl/2, cut output files based on original boundary (if it is not in the last level). This is to make sure that after an old file is moved to the next level, and new data is merged from the upper level, the new data falling into this range isn't reset with old timestamp. Without this change, in many cases, most files from one level will keep having old timestamp, even if they have newer data and we stuck in it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8749
Test Plan: Add a unit test to test the boosting logic. Will add a unit test to test it end-to-end.
Reviewed By: jay-zhuang
Differential Revision: D30735261
fbshipit-source-id: 503c2d89250b22911eb99e72b379be154de3428e
Summary:
The bug can impact the following scenario. There must be two `CompactRange()`s, call them A and B. Compaction A must have `change_level=true`. Compactions A and B must run in parallel, and new data must be added while they run as well.
Now, on to the details of the race condition. Compaction A must reach the refitting phase while B's next step is to trivial move new data (i.e., data that has been inserted behind A) down to the same level that A's refit targets (`CompactRangeOptions::target_level`). B must be unregistered (i.e., has not yet called `AddManualCompaction()` for the current `RunManualCompaction()`) while A invokes `DisableManualCompaction()`s to prepare for refitting. In the old code, B could still proceed to register a manual compaction, while A had disabled manual compaction.
The next part of the race condition is B picks and schedules a trivial move while A has released the lock in refitting phase in order to persist the LSM state change (i.e., the log phase of `LogAndApply()`). That way, B does not see the refitted data when picking a trivial-move compaction. So it is susceptible to picking one that overlaps.
Finally, B executes the picked trivial-move compaction. Trivial-move compactions are special in that they never check whether manual compaction is disabled. So the picked compaction causing overlap ends up being applied, leading to LSM corruption if `force_consistency_checks=false`, or entering read-only mode with `Status::Corruption` if `force_consistency_checks=true` (the default).
The fix is just to prevent B from registering itself in `RunManualCompaction()` while manual compactions are disabled, consequently preventing any trivial move or other compaction from being picked/scheduled.
Thanks to siying for finding the bug.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9077
Test Plan: The test does not go all the way in exposing the bug because it requires a compaction to be picked/scheduled while logging LSM state change for RefitLevel(). But the fix is to make such a compaction not picked/scheduled in the first place, so any repro of that scenario would end up hanging RefitLevel() logging. So instead I just verified no such compaction is registered in the scenario where `RefitLevel()` disables manual compactions.
Reviewed By: siying
Differential Revision: D31921908
Pulled By: ajkr
fbshipit-source-id: 9bb5d0e847ad428211227f40830c685c209fbecb
Summary:
This commit introduces incremental compaction in univeral style for space amplification. This follows the first improvement mentioned in https://rocksdb.org/blog/2021/04/12/universal-improvements.html . The implemention simply picks up files about size of max_compaction_bytes to compact and execute if the penalty is not too big. More optimizations can be done in the future, e.g. prioritizing between this compaction and other types. But for now, the feature is supposed to be functional and can often reduce frequency of full compactions, although it can introduce penalty.
In order to add cut files more efficiently so that more files from upper levels can be included, SST file cutting threshold (for current file + overlapping parent level files) is set to 1.5X of target file size. A 2MB target file size will generate files like this: https://gist.github.com/siying/29d2676fba417404f3c95e6c013c7de8 Number of files indeed increases but it is not out of control.
Two set of write benchmarks are run:
1. For ingestion rate limited scenario, we can see full compaction is mostly eliminated: https://gist.github.com/siying/959bc1186066906831cf4c808d6e0a19 . The write amp increased from 7.7 to 9.4, as expected. After applying file cutting, the number is improved to 8.9. In another benchmark, the write amp is even better with the incremental approach: https://gist.github.com/siying/d1c16c286d7c59c4d7bba718ca198163
2. For ingestion rate unlimited scenario, incremental compaction turns out to be too expensive most of the time and is not executed, as expected.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8655
Test Plan: Add unit tests to the functionality.
Reviewed By: ajkr
Differential Revision: D31787034
fbshipit-source-id: ce813e63b15a61d5a56e97bf8902a1b28e011beb
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8991
Test Plan: the new test hangs forever without this fix and passes with this fix.
Reviewed By: hx235
Differential Revision: D31456419
Pulled By: ajkr
fbshipit-source-id: a82c0e5560b6e6153089dccd8e46163c61b07bff
Summary:
Made SystemClock into a Customizable class, complete with CreateFromString.
Cleaned up some of the existing SystemClock implementations that were redundant (NoSleep was the same as the internal one for MockEnv).
Changed MockEnv construction to allow Clock to be passed to the Memory/MockFileSystem.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8636
Reviewed By: zhichao-cao
Differential Revision: D30483360
Pulled By: mrambacher
fbshipit-source-id: cd0e3a876c39f8c98fe13374c06e8edbd5b9f2a1
Summary:
This PR does the following:
-> Makes the MemTableRepFactory into a Customizable class and creatable/configurable via CreateFromString
-> Makes the existing implementations compatible with configurations
-> Moves the "SpecialRepFactory" test class into testutil, accessible via the ObjectRegistry or a NewSpecial API
New tests were added to validate the functionality and all existing tests pass. db_bench and memtablerep_bench were hand-tested to verify the functionality in those tools.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8419
Reviewed By: zhichao-cao
Differential Revision: D29558961
Pulled By: mrambacher
fbshipit-source-id: 81b7229636e4e649a0c914e73ac7b0f8454c931c
Summary:
Some FIFO users want to keep the data for longer, but the old data is rarely accessed. This feature allows users to configure FIFO compaction so that data older than a threshold is moved to a warm storage tier.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8310
Test Plan: Add several unit tests.
Reviewed By: ajkr
Differential Revision: D28493792
fbshipit-source-id: c14824ea634814dee5278b449ab5c98b6e0b5501
Summary:
The PerThreadDBPath has already specified a slash. It does not need to be specified when initializing the test path.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8555
Reviewed By: ajkr
Differential Revision: D29758399
Pulled By: jay-zhuang
fbshipit-source-id: 6d2b878523e3e8580536e2829cb25489844d9011
Summary:
Various tests had disabled valgrind due to it slowing down and timing
out (as is the case right now) the CI runs. Where a test was disabled with no comment,
I assumed slowness was the cause. For these tests that were slow under
valgrind, as well as the ones identified in https://github.com/facebook/rocksdb/issues/8352, this PR moves them
behind the compiler flag `-DROCKSDB_FULL_VALGRIND_RUN`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8475
Test Plan: running `make full_valgrind_test`, `make valgrind_test`, `make check`; will verify they appear working correctly
Reviewed By: jay-zhuang
Differential Revision: D29504843
Pulled By: ajkr
fbshipit-source-id: 2aac90749cfbd30d5ce11cb29a07a1b9314eeea7
Summary:
`DeleteFilesInRange()` marks deleting files to `being_compacted`
before deleting, which may cause ongoing compactions report corruption
exception or ASSERT for debug build.
Adding the missing `ComputeCompactionScore()` when `being_compacted` is set.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8434
Test Plan: Unittest
Reviewed By: ajkr
Differential Revision: D29276127
Pulled By: jay-zhuang
fbshipit-source-id: f5b223e3c1fc6d821e100e3f3442bc70c1d50cf7
Summary:
Recalculate the total size after generate new sst files.
New generated files might have different size as the previous time which
could cause the test failed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8396
Test Plan:
```
gtest-parallel ./db_compaction_test
--gtest_filter=DBCompactionTest.ManualCompactionMax -r 1000 -w 100
```
Reviewed By: akankshamahajan15
Differential Revision: D29083299
Pulled By: jay-zhuang
fbshipit-source-id: 49d4bd619cefc0f9a1f452f8759ff4c2ba1b6fdb
Summary:
Error:
```
db/db_compaction_test.cc:5211:47: warning: The left operand of '*' is a garbage value
uint64_t total = (l1_avg_size + l2_avg_size * 10) * 10;
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8325
Test Plan: `$ make analyze`
Reviewed By: pdillinger
Differential Revision: D28620916
Pulled By: jay-zhuang
fbshipit-source-id: f6d58ab84eefbcc905cda45afb9522b0c6d230f8
Summary:
Fix a bug that for manual compaction, `max_compaction_bytes` is only
limit the SST files from input level, but not overlapped files on output
level.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8269
Test Plan: `make check`
Reviewed By: ajkr
Differential Revision: D28231044
Pulled By: jay-zhuang
fbshipit-source-id: 9d7d03004f30cc4b1b9819830141436907554b7c
Summary:
Adds a new Cache::ApplyToAllEntries API that we expect to use
(in follow-up PRs) for efficiently gathering block cache statistics.
Notable features vs. old ApplyToAllCacheEntries:
* Includes key and deleter (in addition to value and charge). We could
have passed in a Handle but then more virtual function calls would be
needed to get the "fields" of each entry. We expect to use the 'deleter'
to identify the origin of entries, perhaps even more.
* Heavily tuned to minimize latency impact on operating cache. It
does this by iterating over small sections of each cache shard while
cycling through the shards.
* Supports tuning roughly how many entries to operate on for each
lock acquire and release, to control the impact on the latency of other
operations without excessive lock acquire & release. The right balance
can depend on the cost of the callback. Good default seems to be
around 256.
* There should be no need to disable thread safety. (I would expect
uncontended locks to be sufficiently fast.)
I have enhanced cache_bench to validate this approach:
* Reports a histogram of ns per operation, so we can look at the
ditribution of times, not just throughput (average).
* Can add a thread for simulated "gather stats" which calls
ApplyToAllEntries at a specified interval. We also generate a histogram
of time to run ApplyToAllEntries.
To make the iteration over some entries of each shard work as cleanly as
possible, even with resize between next set of entries, I have
re-arranged which hash bits are used for sharding and which for indexing
within a shard.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8225
Test Plan:
A couple of unit tests are added, but primary validation is manual, as
the primary risk is to performance.
The primary validation is using cache_bench to ensure that neither
the minor hashing changes nor the simulated stats gathering
significantly impact QPS or latency distribution. Note that adding op
latency histogram seriously impacts the benchmark QPS, so for a
fair baseline, we need the cache_bench changes (except remove simulated
stat gathering to make it compile). In short, we don't see any
reproducible difference in ops/sec or op latency unless we are gathering
stats nearly continuously. Test uses 10GB block cache with
8KB values to be somewhat realistic in the number of items to iterate
over.
Baseline typical output:
```
Complete in 92.017 s; Rough parallel ops/sec = 869401
Thread ops/sec = 54662
Operation latency (ns):
Count: 80000000 Average: 11223.9494 StdDev: 29.61
Min: 0 Median: 7759.3973 Max: 9620500
Percentiles: P50: 7759.40 P75: 14190.73 P99: 46922.75 P99.9: 77509.84 P99.99: 217030.58
------------------------------------------------------
[ 0, 1 ] 68 0.000% 0.000%
( 2900, 4400 ] 89 0.000% 0.000%
( 4400, 6600 ] 33630240 42.038% 42.038% ########
( 6600, 9900 ] 18129842 22.662% 64.700% #####
( 9900, 14000 ] 7877533 9.847% 74.547% ##
( 14000, 22000 ] 15193238 18.992% 93.539% ####
( 22000, 33000 ] 3037061 3.796% 97.335% #
( 33000, 50000 ] 1626316 2.033% 99.368%
( 50000, 75000 ] 421532 0.527% 99.895%
( 75000, 110000 ] 56910 0.071% 99.966%
( 110000, 170000 ] 16134 0.020% 99.986%
( 170000, 250000 ] 5166 0.006% 99.993%
( 250000, 380000 ] 3017 0.004% 99.996%
( 380000, 570000 ] 1337 0.002% 99.998%
( 570000, 860000 ] 805 0.001% 99.999%
( 860000, 1200000 ] 319 0.000% 100.000%
( 1200000, 1900000 ] 231 0.000% 100.000%
( 1900000, 2900000 ] 100 0.000% 100.000%
( 2900000, 4300000 ] 39 0.000% 100.000%
( 4300000, 6500000 ] 16 0.000% 100.000%
( 6500000, 9800000 ] 7 0.000% 100.000%
```
New, gather_stats=false. Median thread ops/sec of 5 runs:
```
Complete in 92.030 s; Rough parallel ops/sec = 869285
Thread ops/sec = 54458
Operation latency (ns):
Count: 80000000 Average: 11298.1027 StdDev: 42.18
Min: 0 Median: 7722.0822 Max: 6398720
Percentiles: P50: 7722.08 P75: 14294.68 P99: 47522.95 P99.9: 85292.16 P99.99: 228077.78
------------------------------------------------------
[ 0, 1 ] 109 0.000% 0.000%
( 2900, 4400 ] 793 0.001% 0.001%
( 4400, 6600 ] 34054563 42.568% 42.569% #########
( 6600, 9900 ] 17482646 21.853% 64.423% ####
( 9900, 14000 ] 7908180 9.885% 74.308% ##
( 14000, 22000 ] 15032072 18.790% 93.098% ####
( 22000, 33000 ] 3237834 4.047% 97.145% #
( 33000, 50000 ] 1736882 2.171% 99.316%
( 50000, 75000 ] 446851 0.559% 99.875%
( 75000, 110000 ] 68251 0.085% 99.960%
( 110000, 170000 ] 18592 0.023% 99.983%
( 170000, 250000 ] 7200 0.009% 99.992%
( 250000, 380000 ] 3334 0.004% 99.997%
( 380000, 570000 ] 1393 0.002% 99.998%
( 570000, 860000 ] 700 0.001% 99.999%
( 860000, 1200000 ] 293 0.000% 100.000%
( 1200000, 1900000 ] 196 0.000% 100.000%
( 1900000, 2900000 ] 69 0.000% 100.000%
( 2900000, 4300000 ] 32 0.000% 100.000%
( 4300000, 6500000 ] 10 0.000% 100.000%
```
New, gather_stats=true, 1 second delay between scans. Scans take about
1 second here so it's spending about 50% time scanning. Still the effect on
ops/sec and latency seems to be in the noise. Median thread ops/sec of 5 runs:
```
Complete in 91.890 s; Rough parallel ops/sec = 870608
Thread ops/sec = 54551
Operation latency (ns):
Count: 80000000 Average: 11311.2629 StdDev: 45.28
Min: 0 Median: 7686.5458 Max: 10018340
Percentiles: P50: 7686.55 P75: 14481.95 P99: 47232.60 P99.9: 79230.18 P99.99: 232998.86
------------------------------------------------------
[ 0, 1 ] 71 0.000% 0.000%
( 2900, 4400 ] 291 0.000% 0.000%
( 4400, 6600 ] 34492060 43.115% 43.116% #########
( 6600, 9900 ] 16727328 20.909% 64.025% ####
( 9900, 14000 ] 7845828 9.807% 73.832% ##
( 14000, 22000 ] 15510654 19.388% 93.220% ####
( 22000, 33000 ] 3216533 4.021% 97.241% #
( 33000, 50000 ] 1680859 2.101% 99.342%
( 50000, 75000 ] 439059 0.549% 99.891%
( 75000, 110000 ] 60540 0.076% 99.967%
( 110000, 170000 ] 14649 0.018% 99.985%
( 170000, 250000 ] 5242 0.007% 99.991%
( 250000, 380000 ] 3260 0.004% 99.995%
( 380000, 570000 ] 1599 0.002% 99.997%
( 570000, 860000 ] 1043 0.001% 99.999%
( 860000, 1200000 ] 471 0.001% 99.999%
( 1200000, 1900000 ] 275 0.000% 100.000%
( 1900000, 2900000 ] 143 0.000% 100.000%
( 2900000, 4300000 ] 60 0.000% 100.000%
( 4300000, 6500000 ] 27 0.000% 100.000%
( 6500000, 9800000 ] 7 0.000% 100.000%
( 9800000, 14000000 ] 1 0.000% 100.000%
Gather stats latency (us):
Count: 46 Average: 980387.5870 StdDev: 60911.18
Min: 879155 Median: 1033777.7778 Max: 1261431
Percentiles: P50: 1033777.78 P75: 1120666.67 P99: 1261431.00 P99.9: 1261431.00 P99.99: 1261431.00
------------------------------------------------------
( 860000, 1200000 ] 45 97.826% 97.826% ####################
( 1200000, 1900000 ] 1 2.174% 100.000%
Most recent cache entry stats:
Number of entries: 1295133
Total charge: 9.88 GB
Average key size: 23.4982
Average charge: 8.00 KB
Unique deleters: 3
```
Reviewed By: mrambacher
Differential Revision: D28295742
Pulled By: pdillinger
fbshipit-source-id: bbc4a552f91ba0fe10e5cc025c42cef5a81f2b95
Summary:
Extend support to track blob files in SST File manager.
This PR notifies SstFileManager whenever a new blob file is created,
via OnAddFile and an obsolete blob file deleted via OnDeleteFile
and delete file via ScheduleFileDeletion.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8037
Test Plan: Add new unit tests
Reviewed By: ltamasi
Differential Revision: D26891237
Pulled By: akankshamahajan15
fbshipit-source-id: 04c69ccfda2a73782fd5c51982dae58dd11979b6
Summary:
CompactionDeletionTriggerReopen was observed to be flaky recently:
https://app.circleci.com/pipelines/github/facebook/rocksdb/6030/workflows/787af4f3-b9f7-4645-8e8d-1fb0ebf05539/jobs/101451.
I went through it and the related tests and arrived at different
conclusions on what constraints we can expect on DB size. Some
constraints got looser and some got tighter. The particular constraint
that flaked got a lot looser so at least the flake linked above would have been prevented.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8036
Reviewed By: riversand963
Differential Revision: D26862566
Pulled By: ajkr
fbshipit-source-id: 3512b86b4fb41aeecae32e1c7382c03916d88d88
Summary:
The patch does the following:
1) Exposes the amount of data (number of bytes) read from blob files from
`BlobFileReader::GetBlob` / `Version::GetBlob`.
2) Tracks the total number and size of blobs read from blob files during a
compaction (due to garbage collection or compaction filter usage) in
`CompactionIterationStats` and propagates this data to
`InternalStats::CompactionStats` / `CompactionJobStats`.
3) Updates the formulae for write amplification calculations to include the
amount of data read from blob files.
4) Extends the compaction stats dump with a new column `Rblob(GB)` and
a new line containing the total number and size of blob files in the current
`Version` to complement the information about the shape and size of the LSM tree
that's already there.
5) Updates `CompactionJobStats` so that the number of files and amount of data
written by a compaction are broken down per file type (i.e. table/blob file).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8022
Test Plan: Ran `make check` and `db_bench`.
Reviewed By: riversand963
Differential Revision: D26801199
Pulled By: ltamasi
fbshipit-source-id: 28a5f072048a702643b28cb5971b4099acabbfb2
Summary:
The patch breaks down the "bytes written" (as well as the "number of output files")
compaction statistics into two, so the values are logged separately for table files
and blob files in the info log, and are shown in separate columns (`Write(GB)` for table
files, `Wblob(GB)` for blob files) when the compaction statistics are dumped.
This will also come in handy for fixing the write amplification statistics, which currently
do not consider the amount of data read from blob files during compaction. (This will
be fixed by an upcoming patch.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8013
Test Plan: Ran `make check` and `db_bench`.
Reviewed By: riversand963
Differential Revision: D26742156
Pulled By: ltamasi
fbshipit-source-id: 31d18ee8f90438b438ca7ed1ea8cbd92114442d5
Summary:
in PR https://github.com/facebook/rocksdb/issues/7419 , we introduce the new Append and PositionedAppend APIs to WritableFile at File System, which enable RocksDB to pass the data verification information (e.g., checksum of the data) to the lower layer. In this PR, we use the new API in WritableFileWriter, such that the file created via WritableFileWrite can pass the checksum to the storage layer. To control which types file should apply the checksum handoff, we add checksum_handoff_file_types to DBOptions. User can use this option to control which file types (Currently supported file tyes: kLogFile, kTableFile, kDescriptorFile.) should use the new Append and PositionedAppend APIs to handoff the verification information.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7523
Test Plan: add new unit test, pass make check/ make asan_check
Reviewed By: pdillinger
Differential Revision: D24313271
Pulled By: zhichao-cao
fbshipit-source-id: aafd69091ae85c3318e3e17cbb96fe7338da11d0
Summary:
Introduces and uses a SystemClock class to RocksDB. This class contains the time-related functions of an Env and these functions can be redirected from the Env to the SystemClock.
Many of the places that used an Env (Timer, PerfStepTimer, RepeatableThread, RateLimiter, WriteController) for time-related functions have been changed to use SystemClock instead. There are likely more places that can be changed, but this is a start to show what can/should be done. Over time it would be nice to migrate most (if not all) of the uses of the time functions from the Env to the SystemClock.
There are several Env classes that implement these functions. Most of these have not been converted yet to SystemClock implementations; that will come in a subsequent PR. It would be good to unify many of the Mock Timer implementations, so that they behave similarly and be tested similarly (some override Sleep, some use a MockSleep, etc).
Additionally, this change will allow new methods to be introduced to the SystemClock (like https://github.com/facebook/rocksdb/issues/7101 WaitFor) in a consistent manner across a smaller number of classes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7858
Reviewed By: pdillinger
Differential Revision: D26006406
Pulled By: mrambacher
fbshipit-source-id: ed10a8abbdab7ff2e23d69d85bd25b3e7e899e90
Summary:
In the original stacked BlobDB implementation, which writes blobs to blob files
immediately and treats blob files as logs, it makes sense to flush the file after
writing each blob to protect against process crashes; however, in the integrated
implementation, which builds blob files in the background jobs, this unnecessarily
reduces performance. This patch fixes this by simply adding a `do_flush` flag to
`BlobLogWriter`, which is set to `true` by the stacked implementation and to `false`
by the new code. Note: the change itself is trivial but the tests needed some work;
since in the new implementation, blobs are now buffered, adding a blob to
`BlobFileBuilder` is no longer guaranteed to result in an actual I/O. Therefore, we can
no longer rely on `FaultInjectionTestEnv` when testing failure cases; instead, we
manipulate the return values of I/O methods directly using `SyncPoint`s.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7892
Test Plan: `make check`
Reviewed By: jay-zhuang
Differential Revision: D26022814
Pulled By: ltamasi
fbshipit-source-id: b3dce419f312137fa70d84cdd9b908fd5d60d8cd
Summary:
The patch adds basic garbage collection support to the integrated BlobDB
implementation. Valid blobs residing in the oldest blob files are relocated
as they are encountered during compaction. The threshold that determines
which blob files qualify is computed based on the configuration option
`blob_garbage_collection_age_cutoff`, which was introduced in https://github.com/facebook/rocksdb/issues/7661 .
Once a blob is retrieved for the purposes of relocation, it passes through the
same logic that extracts large values to blob files in general. This means that
if, for instance, the size threshold for key-value separation (`min_blob_size`)
got changed or writing blob files got disabled altogether, it is possible for the
value to be moved back into the LSM tree. In particular, one way to re-inline
all blob values if needed would be to perform a full manual compaction with
`enable_blob_files` set to `false`, `enable_blob_garbage_collection` set to
`true`, and `blob_file_garbage_collection_age_cutoff` set to `1.0`.
Some TODOs that I plan to address in separate PRs:
1) We'll have to measure the amount of new garbage in each blob file and log
`BlobFileGarbage` entries as part of the compaction job's `VersionEdit`.
(For the time being, blob files are cleaned up solely based on the
`oldest_blob_file_number` relationships.)
2) When compression is used for blobs, the compression type hasn't changed,
and the blob still qualifies for being written to a blob file, we can simply copy
the compressed blob to the new file instead of going through decompression
and compression.
3) We need to update the formula for computing write amplification to account
for the amount of data read from blob files as part of GC.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7694
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D25069663
Pulled By: ltamasi
fbshipit-source-id: bdfa8feb09afcf5bca3b4eba2ba72ce2f15cd06a