Summary:
The blob cache enables an optimization on the read path: when a blob is found in the cache, we can avoid copying it into the buffer provided by the application. Instead, we can simply transfer ownership of the cache handle to the target `PinnableSlice`. (Note: this relies on the `Cleanable` interface, which is implemented by `PinnableSlice`.)
This has the potential to save a lot of CPU, especially with large blob values.
This task is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10297
Reviewed By: riversand963
Differential Revision: D37640311
Pulled By: gangliao
fbshipit-source-id: 92de0e35cc703d06c87c5c1861cc2899ec52234a
Summary:
Earlier implementation of cutting the output files with a compact cursor under Round-Robin priority uses `Valid()` to determine if the `output_split_key` is valid in `ShouldStopBefore`. This contributes to excessive CPU computation, as pointed out by [this issue](https://github.com/facebook/rocksdb/issues/10315). In this PR, we change the type of `output_split_key` to be `InternalKey*` and set it as `nullptr` if it is not going to be used in `ShouldStopBefore`, `Valid()` condition checking can be avoided using that pointer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10316
Reviewed By: ajkr
Differential Revision: D37661492
Pulled By: littlepig2013
fbshipit-source-id: 66ff1105f3378e5573d3a126fdaff9bb23b5498f
Summary:
I noticed it would clean up some things to have Cache::Insert()
return our MemoryLimit Status instead of Incomplete for the case in
which the capacity limit is reached. I suspect this fixes some existing but
unknown bugs where this Incomplete could be confused with other uses
of Incomplete, especially no_io cases. This is the most suspicious case I
noticed, but was not able to reproduce a bug, in part because the existing
code is not covered by unit tests (FIXME added): 57adbf0e91/table/get_context.cc (L397)
I audited all the existing uses of IsIncomplete and updated those that
seemed relevant.
HISTORY updated with a clear warning to users of strict_capacity_limit=true
to update uses of `IsIncomplete()`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10262
Test Plan: updated unit tests
Reviewed By: hx235
Differential Revision: D37473155
Pulled By: pdillinger
fbshipit-source-id: 4bd9d9353ccddfe286b03ebd0652df8ce20f99cb
Summary:
In leveled compaction, try to trivial move more than one files if possible, up to 4 files or max_compaction_bytes. This is to allow higher write throughput for some use cases where data is loaded in sequential order, where appying compaction results is the bottleneck.
When pick up a file to compact and it doesn't have overlapping files in the next level, try to expand to the next file if there is still no overlapping.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10190
Test Plan:
Add some unit tests.
For performance, Try to run
./db_bench_multi_move --benchmarks=fillseq --compression_type=lz4 --write_buffer_size=5000000 --num=100000000 --value_size=1000 -level_compaction_dynamic_level_bytes
Together with https://github.com/facebook/rocksdb/pull/10188 , stalling will be eliminated in this benchmark.
Reviewed By: jay-zhuang
Differential Revision: D37230647
fbshipit-source-id: 42b260f545c46abc5d90335ac2bbfcd09602b549
Summary:
Before this PR, it is required that application open RocksDB secondary
instance with `max_open_files = -1`. This is a hacky workaround that
prevents IOErrors on the seconary instance during point-lookup or range
scan caused by primary instance deleting the table files. This is not
necessary if the application can coordinate the primary and secondaries
so that primary does not delete files that are still being used by the
secondaries. Or users can provide a custom Env/FS implementation that
deletes the files only after all primary and secondary instances
indicate files are obsolete and deleted.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10260
Test Plan: make check
Reviewed By: jay-zhuang
Differential Revision: D37462633
Pulled By: riversand963
fbshipit-source-id: 9c2fc939f49663efa61e3d60c8f1e01d64b9d72c
Summary:
In leveled compaction, L0->L1 trivial move will allow more than one file to be moved in one compaction. This would allow L0 files to be moved down faster when data is loaded in sequential order, making slowdown or stop condition harder to hit. Also seek L0->L1 trivial move when only some files qualify.
1. We always try to find L0->L1 trivial move from the oldest files. Keep including newer files, until adding a new file won't trigger a trivial move
2. Modify the trivial move condition so that this compaction would be tagged as trivial move.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10188
Test Plan:
See throughput improvements with db_bench with fast fillseq benchmark and small L0 files:
./db_bench_l0_move --benchmarks=fillseq --compression_type=lz4 --write_buffer_size=5000000 --num=100000000 --value_size=1000 -level_compaction_dynamic_level_bytes
The throughput improved by about 50%. Stalling still happens though.
Reviewed By: jay-zhuang
Differential Revision: D37224743
fbshipit-source-id: 8958d97f22e12bdfc14d2e85930f6fa0070e9659
Summary:
In round-robin compaction priority, when splitting the compaction into sub-compactions, the earlier implementation takes into account the compact cursor to have full use of available sub-compactions. But this may result in unbalanced sub-compactions, so we remove this here. The removal does not affect the cursor-based splitting mechanism within a sub-compaction, and thus the output files are still ensured to be split according to the cursor.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10289
Reviewed By: ajkr
Differential Revision: D37559091
Pulled By: littlepig2013
fbshipit-source-id: b8b45b99f63b09cf873f7f049bcb4ab13871fffc
Summary:
The current level targets for dynamical leveling has a problem: the target level size will dramatically change after a L0->L1 compaction. When there are many L0 bytes, lower level compactions are delayed, but they will be resumed after the L0->L1 compaction finishes, so the expected write amplification benefits might not be realized. The proposal here is to revert the level targetting size, but instead relying on adjusting score for each level to prioritize levels that need to compact most.
Basic idea:
(1) target level size isn't adjusted, but score is adjusted. The reasoning is that with parallel compactions, holding compactions from happening might not be desirable, but we would like the compactions are scheduled from the level we feel most needed. For example, if we have a extra-large L2, we would like all compactions are scheduled for L2->L3 compactions, rather than L4->L5. This gets complicated when a large L0->L1 compaction is going on. Should we compact L2->L3 or L4->L5. So the proposal for that is:
(2) the score is calculated by actual level size / (target size + estimated upper bytes coming down). The reasoning is that if we have a large amount of pending L0/L1 bytes coming down, compacting L2->L3 might be more expensive, as when the L0 bytes are compacted down to L2, the actual L2->L3 fanout would change dramatically. On the other hand, when the amount of bytes coming down to L5, the impacts to L5->L6 fanout are much less. So when calculating target score, we can adjust it by adding estimated downward bytes to the target level size.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10057
Test Plan: Repurpose tests VersionStorageInfoTest.MaxBytesForLevelDynamicWithLargeL0_* tests to cover this scenario.
Reviewed By: ajkr
Differential Revision: D37539742
fbshipit-source-id: 9c154cbfe92023f918cf5d80875d8776ad4831a4
Summary:
- [x] Enabled blob caching for MultiGetBlob in RocksDB
- [x] Refactored MultiGetBlob logic and interface in RocksDB
- [x] Cleaned up Version::MultiGetBlob() and moved 'blob'-related code snippets into BlobSource
- [x] Add End-to-end test cases in db_blob_basic_test and also add unit tests in blob_source_test
This task is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10272
Reviewed By: ltamasi
Differential Revision: D37558112
Pulled By: gangliao
fbshipit-source-id: a73a6a94ffdee0024d5b2a39e6d1c1a7d38664db
Summary:
Add load_latest_options() to C api.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10152
Test Plan:
Extend the existing c_test by reopening db using the latest options file
at different parts of the test.
Reviewed By: akankshamahajan15
Differential Revision: D37305225
Pulled By: ajkr
fbshipit-source-id: 8b3bab73f56fa6fcbdba45aae393145d007b3962
Summary:
If the internal iterator is not valid, `SeekToLast` with iter_start_ts should have `valid_` is false without assertion failure.
Test plan
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10279
Reviewed By: ltamasi
Differential Revision: D37539393
Pulled By: riversand963
fbshipit-source-id: 8e94057838f8a05144fad5768f4d62f1893ec315
Summary:
This is the initial step in the development of a lock-free clock cache. This PR includes the base hash table design (which we mostly ported over from FastLRUCache) and the clock eviction algorithm. Importantly, it's still _not_ lock-free---all operations use a shard lock. Besides the locking, there are other features left as future work:
- Remove keys from the handles. Instead, use 128-bit bijective hashes of them for handle comparisons, probing (we need two 32-bit hashes of the key for double hashing) and sharding (we need one 6-bit hash).
- Remove the clock_usage_ field, which is updated on every lookup. Even if it were atomically updated, it could cause memory invalidations across cores.
- Middle insertions into the clock list.
- A test that exercises the clock eviction policy.
- Update the Java API of ClockCache and Java calls to C++.
Along the way, we improved the code and comments quality of FastLRUCache. These changes are relatively minor.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10273
Test Plan: ``make -j24 check``
Reviewed By: pdillinger
Differential Revision: D37522461
Pulled By: guidotag
fbshipit-source-id: 3d70b737dbb70dcf662f00cef8c609750f083943
Summary:
Currently, when installing a new super version, when stalling condition triggers, we compare estimated compaction bytes to previously, and if the new value is larger or equal to the previous one, we reduce the slowdown write rate. However, if concurrent compactions happen, the same value might be used. The result is that, although some compactions reduce estimated compaction bytes, we treat them as a signal for further slowing down. In some cases, it causes slowdown rate drops all the way to the minimum, far lower than needed.
Fix the bug by not triggering a re-calculation if a new super version doesn't have Version or a memtable change. With this fix, number of compaction finishes are still undercounted in this algorithm, but it is still better than the current bug where they are negatively counted.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10270
Test Plan: Run a benchmark where the slowdown rate is dropped to minimal unnessarily and see it is back to a normal value.
Reviewed By: ajkr
Differential Revision: D37497327
fbshipit-source-id: 9bca961cc38fed965c3af0fa6c9ca0efaa7637c4
Summary:
Resolves https://github.com/facebook/rocksdb/issues/9761
With this PR, applications can create an iterator with the following
```cpp
ReadOptions read_opts;
read_opts.timestamp = &ts_ub;
read_opts.iter_start_ts = &ts_lb;
auto* it = db->NewIterator(read_opts);
it->SeekToLast();
// or it->SeekForPrev("foo");
it->Prev();
...
```
The application can access different versions of the same user key via `key()`, `value()`, and `timestamp()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10200
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D37258074
Pulled By: riversand963
fbshipit-source-id: 3f0b866ade50dcff7ef60d506397a9dd6ec91565
Summary:
Most of the properties listed as required for prefix extractors
are not really required but offer some conveniences. This updates API
comments to clarify actual requirements, and adds tests to demonstrate
how previously presumed requirements can be safely violated.
This might seem like a useless exercise, but this relaxing of requirements
would be needed if we generalize prefixes to group keys not just at the
byte level but also based on bits or arbitrary value ranges. For
applications without a "natural" prefix size, having only byte-level
granularity often means one prefix size to the next differs in magnitude
by a factor of 256.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10245
Test Plan: Tests added, also covering missing Iterator cases from https://github.com/facebook/rocksdb/issues/10244
Reviewed By: bjlemaire
Differential Revision: D37371559
Pulled By: pdillinger
fbshipit-source-id: ab2dd719992eea7656e9042cf8542393e02fa244
Summary:
The `bytes_read` returned by the current BlobSource interface is ambiguous. The uncompressed blob size is returned if the cache hits. The size of the blob read from disk, presumably the compressed version, is returned if the cache misses. Two differing semantics might cause ambiguity and consistency issues. For example, this inconsistency causes the assertion failure (T124246362 and its hot fix is https://github.com/facebook/rocksdb/issues/10249).
This goal is to require that the value of `byte read` always be an on-disk blob record size.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10248
Reviewed By: ltamasi
Differential Revision: D37470292
Pulled By: gangliao
fbshipit-source-id: fbca521b2791d3674dbf2484cea5fcae2fdd94d2
Summary:
The patch fixes a couple of issues related to in-place updates: 1) the value type was not passed from
`MemTableInserter::PutCFImpl` to `MemTable::Update` and 2) `MemTable::UpdateCallback` was called
for any value type (with the callee's logic assuming `kTypeValue`) even though the callback mechanism
is only safe for plain values.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10254
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D37463644
Pulled By: ltamasi
fbshipit-source-id: 33802477dac0691681f416ae84c4d9742c6fe41a
Summary:
The patch builds on https://github.com/facebook/rocksdb/pull/9915 and adds
a new API called `PutEntity` that can be used to write a wide-column entity
to the database. The new API is added to both `DB` and `WriteBatch`. Note
that currently there is no way to retrieve these entities; more precisely, all
read APIs (`Get`, `MultiGet`, and iterator) return `NotSupported` when they
encounter a wide-column entity that is required to answer a query. Read-side
support (as well as other missing functionality like `Merge`, compaction filter,
and timestamp support) will be added in later PRs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10242
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D37369748
Pulled By: ltamasi
fbshipit-source-id: 7f5e412359ed7a400fd80b897dae5599dbcd685d
Summary:
The 'PersistRoundRobinCompactCursor' unit test in `db_compaction_test` may occasionally fail due to the inconsistent LSM state. The issue is fixed by adding `Flush()` and `WaitForFlushMemTable()` to produce a more predictable and stable LSM state.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10250
Test Plan: 'PersistRoundRobinCompactCursor' unit test in `db_compaction_test`
Reviewed By: jay-zhuang, riversand963
Differential Revision: D37426091
Pulled By: littlepig2013
fbshipit-source-id: 56fbaab0384c380c1f279a16dc8732b139c9f611
Summary:
Currently SortFileByOverlappingRatio() is O(nlogn). It is usually OK but When there are a lot of files in an LSM-tree, SortFileByOverlappingRatio() can take non-trivial amount of time. The problem is severe when the user is loading keys in sorted order, where compaction is only trivial move and this operation becomes the bottleneck and limit the total throughput. This commit makes SortFileByOverlappingRatio() only find the top 50 files based on score. 50 files are usually enough for the parallel compactions needed for the level, and in case it is not enough, we would fall back to random, which should be acceptable.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10161
Test Plan:
Run a fillseq that generates a lot of files, and observe throughput improved (although stall is not yet eliminated). The command ran:
TEST_TMPDIR=/dev/shm/ ./db_bench_sort --benchmarks=fillseq --compression_type=lz4 --write_buffer_size=5000000 --num=100000000 --value_size=1000
The throughput improved by 11%.
Reviewed By: jay-zhuang
Differential Revision: D37129469
fbshipit-source-id: 492da2ef5bfc7cdd6daa3986b50d2ff91f88542d
Summary:
This task is to fix assertion failures during the crash test runs. The cache entry size might not match value size because value size can include the on-disk (possibly compressed) size. Therefore, we removed the assertions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10249
Reviewed By: ltamasi
Differential Revision: D37407576
Pulled By: gangliao
fbshipit-source-id: 577559f267c5b2437bcd0631cd0efabb6dde3b69
Summary:
Resolves https://github.com/facebook/rocksdb/issues/10129
I extracted this fix from https://github.com/facebook/rocksdb/issues/7516 since it's also already a bug in main branch, and we want to
separate it from the main part of the PR.
There can be a race condition between two threads. Thread 1 executes
`DBImpl::FindObsoleteFiles()` while thread 2 executes `GetSortedWals()`.
```
Time thread 1 thread 2
| mutex_.lock
| read disable_delete_obsolete_files_
| ...
| wait on log_sync_cv and release mutex_
| mutex_.lock
| ++disable_delete_obsolete_files_
| mutex_.unlock
| mutex_.lock
| while (pending_purge_obsolete_files > 0) { bg_cv.wait;}
| wake up with mutex_ locked
| compute WALs tracked by MANIFEST
| mutex_.unlock
| wake up with mutex_ locked
| ++pending_purge_obsolete_files_
| mutex_.unlock
|
| delete obsolete WAL
| WAL missing but tracked in MANIFEST.
V
```
The fix proposed eliminates the possibility of the above by increasing
`pending_purge_obsolete_files_` before `FindObsoleteFiles()` can possibly release the mutex.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10187
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D37214235
Pulled By: riversand963
fbshipit-source-id: 556ab1b58ae6d19150169dfac4db08195c797184
Summary:
Add suggest_compact_range() and suggest_compact_range_cf() to C API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10175
Test Plan:
As verifying the result requires SyncPoint, which is not available in the c_test.c,
the test is currently done by invoking the functions and making sure it does not crash.
Reviewed By: jay-zhuang
Differential Revision: D37305191
Pulled By: ajkr
fbshipit-source-id: 0fe257b45914f6c9aeb985d8b1820dafc57a20db
Summary:
The files behind the compaction cursor contain newer data than the files ahead of it. If a compaction writes a file that spans from before its output level’s cursor to after it, then data before the cursor will be contaminated with the old timestamp from the data after the cursor. To avoid this, we can split the output file into two – one entirely before the cursor and one entirely after the cursor. Note that, in rare cases, we **DO NOT** need to cut the file if it is a trivial move since the file will not be contaminated by older files. In such case, the compact cursor is not guaranteed to be the boundary of the file, but it does not hurt the round-robin selection process.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10227
Test Plan:
Add 'RoundRobinCutOutputAtCompactCursor' unit test in `db_compaction_test`
Task: [T122216351](https://www.internalfb.com/intern/tasks/?t=122216351)
Reviewed By: jay-zhuang
Differential Revision: D37388088
Pulled By: littlepig2013
fbshipit-source-id: 9246a6a084b6037b90d6ab3183ba4dfb75a3378d
Summary:
There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10225
Test Plan:
Add test cases for MultiGetBlob
In this task, we added the new API MultiGetBlob() for BlobSource.
This PR is a part of https://github.com/facebook/rocksdb/issues/10156
Reviewed By: ltamasi
Differential Revision: D37358364
Pulled By: gangliao
fbshipit-source-id: aff053a37615d96d768fb9aedde17da5618c7ae6
Summary:
When a key is "out of domain" for the prefix_extractor (no
prefix assigned) then the Bloom filter is not queried. PerfContext
was counting this as a Bloom "hit" while Statistics doesn't count this
as a prefix Bloom checked. I think it's more accurate to call it neither
hit nor miss, so changing the counting to make it PerfContext coounting
more like Statistics.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10244
Test Plan:
tests updates and expanded (Get and MultiGet). Iterator test
coverage of the change will come in next PR
Reviewed By: bjlemaire
Differential Revision: D37371297
Pulled By: pdillinger
fbshipit-source-id: fed132fba6a92b2314ab898d449fce2d1586c157
Summary:
**Summary**
Make the mempurge option flag a Mutable Column Family option flag. Therefore, the mempurge feature can be dynamically toggled.
**Motivation**
RocksDB users prefer having the ability to switch features on and off without having to close and reopen the DB. This is particularly important if the feature causes issues and needs to be turned off. Dynamically changing a DB option flag does not seem currently possible.
Moreover, with this new change, the MemPurge feature can be toggled on or off independently between column families, which we see as a major improvement.
**Content of this PR**
This PR includes removal of the `experimental_mempurge_threshold` flag as a DB option flag, and its re-introduction as a `MutableCFOption` flag. I updated the code to handle dynamic changes of the flag (in particular inside the `FlushJob` file). Additionally, this PR includes a new test to demonstrate the capacity of the code to toggle the MemPurge feature on and off, as well as the addition in the `db_stress` module of 2 different mempurge threshold values (0.0 and 1.0) that can be randomly changed with the `set_option_one_in` flag. This is useful to stress test the dynamic changes.
**Benchmarking**
I will add numbers to prove that there is no performance impact within the next 12 hours.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10011
Reviewed By: pdillinger
Differential Revision: D36462357
Pulled By: bjlemaire
fbshipit-source-id: 5e3d63bdadf085c0572ecc2349e7dd9729ce1802
Summary:
* Add metadata related structs and functions in C API, including
- `rocksdb_get_column_family_metadata()` and `rocksdb_get_column_family_metadata_cf()`
that returns `rocksdb_column_family_metadata_t`.
- `rocksdb_column_family_metadata_t` and its get functions & destroy function.
- `rocksdb_level_metadata_t` and its and its get functions & destroy function.
- `rocksdb_file_metadata_t` and its and get functions & destroy functions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10207
Test Plan:
Extend the existing c_test.c to include additional checks for column_family_metadata
inside CheckCompaction.
Reviewed By: riversand963
Differential Revision: D37305209
Pulled By: ajkr
fbshipit-source-id: 0a5183206353acde145f5f9b632c3bace670aa6e
Summary:
https://github.com/facebook/rocksdb/issues/9984 changes the behavior of RocksDB: if logger creation failed during `SanitizeOptions()`,
`DB::Open()` will fail. However, since `SanitizeOptions()` is called in `DBImpl::DBImpl()`, we cannot
directly expose the error to caller without some additional work.
This is a first version proposal which:
- Adds a new member `init_logger_creation_s` to `DBImpl` to store the result of init logger creation
- Checks the error during `DB::Open()` and return it to caller if non-ok
This is not very ideal. We can alternatively move the logger creation logic out of the `SanitizeOptions()`.
Since `SanitizeOptions()` is used in other places, we need to check whether this change breaks anything
in case other callers of `SanitizeOptions()` assumes that a logger should be created.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10223
Test Plan: make check
Reviewed By: pdillinger
Differential Revision: D37321717
Pulled By: riversand963
fbshipit-source-id: 58042358a86369d606549dd9938933dd47591c4b
Summary:
So that DBImpl::RecoverLogFiles do not have to deal with implementation
details of WalFilter.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10214
Test Plan: make check
Reviewed By: ajkr
Differential Revision: D37299122
Pulled By: riversand963
fbshipit-source-id: acf1a80f1ef75da393d375f55968b2f3ac189816
Summary:
Add `kRoundRobin` as a compaction priority. The implementation is as follows.
- Define a cursor as the smallest Internal key in the successor of the selected file. Add `vector<InternalKey> compact_cursor_` into `VersionStorageInfo` where each element (`InternalKey`) in `compact_cursor_` represents a cursor. In round-robin compaction policy, we just need to select the first file (assuming files are sorted) and also has the smallest InternalKey larger than/equal to the cursor. After a file is chosen, we create a new `Fsize` vector which puts the selected file is placed at the first position in `temp`, the next cursor is then updated as the smallest InternalKey in successor of the selected file (the above logic is implemented in `SortFileByRoundRobin`).
- After a compaction succeeds, typically `InstallCompactionResults()`, we choose the next cursor for the input level and save it to `edit`. When calling `LogAndApply`, we save the next cursor with its level into some local variable and finally apply the change to `vstorage` in `SaveTo` function.
- Cursors are persist pair by pair (<level, InternalKey>) in `EncodeTo` so that they can be reconstructed when reopening. An empty cursor will not be encoded to MANIFEST
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10107
Test Plan: add unit test (`CompactionPriRoundRobin`) in `compaction_picker_test`, add `kRoundRobin` priority in `CompactionPriTest` from `db_compaction_test`, and add `PersistRoundRobinCompactCursor` in `db_compaction_test`
Reviewed By: ajkr
Differential Revision: D37316037
Pulled By: littlepig2013
fbshipit-source-id: 9f481748190ace416079139044e00df2968fb1ee
Summary:
There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache.
In this task, we formally introduced the blob source to RocksDB. BlobSource is a new abstraction layer that provides universal access to blobs, regardless of whether they are in the blob cache, secondary cache, or (remote) storage. Depending on user settings, it always fetch blobs from multi-tier cache and storage with minimal cost.
Note: The new `MultiGetBlob()` implementation is not included in the current PR. To go faster, we aim to create a separate PR for it in parallel!
This PR is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10198
Reviewed By: ltamasi
Differential Revision: D37294735
Pulled By: gangliao
fbshipit-source-id: 9cb50422d9dd1bc03798501c2778b6c7520c7a1e
Summary:
One consistency check in SaveTo() is dupilcated with the one within Apply(). Remove one of then in release mode to reduce time spent in DB mutex.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10169
Test Plan: Run existing tests and see nothing breaks.
Reviewed By: ltamasi
Differential Revision: D37157821
fbshipit-source-id: 73b89443a20b43362ff66d10b9212022034a8234
Summary:
include "include/rocksdb/blah.h" is messing up some internal
builds vs. include "rocksdb/blah." This fixes the bad case and adds a
check for future instances.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10213
Test Plan: back-port to 7.4 release candidate and watch internal build
Reviewed By: hx235
Differential Revision: D37296202
Pulled By: pdillinger
fbshipit-source-id: d7cc6b2c57d858dff0444f19320d83c8b4f9b185
Summary:
This bug was discovered after write batch checksum verification before WAL is added (https://github.com/facebook/rocksdb/issues/10114) and stress test with write batch checksum protection is turned on (https://github.com/facebook/rocksdb/issues/10037). In this [line](d5d8920f2c/db/write_batch.cc (L2887)), the number of checksums may not be consistent with `batch->Count()`. This PR fixes this issue.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10201
Test Plan:
```
./db_stress --batch_protection_bytes_per_key=8 --destroy_db_initially=1 --max_key=100000 --use_txn=1
```
Reviewed By: ajkr
Differential Revision: D37260799
Pulled By: cbi42
fbshipit-source-id: ff8dce7dcce295d689333bc9d892d17a843bf0ea
Summary:
`FlushWAL(true /* sync */)` is used internally and for manual WAL sync. It had a bug when used together with `track_and_verify_wals_in_manifest` where the synced size tracked in MANIFEST was larger than the number of bytes actually synced.
The bug could be repro'd almost immediately with the following crash test command: `python3 tools/db_crashtest.py blackbox --simple --write_buffer_size=524288 --max_bytes_for_level_base=2097152 --target_file_size_base=524288 --duration=3600 --interval=10 --sync_fault_injection=1 --disable_wal=0 --checkpoint_one_in=1000 --max_key=10000 --value_size_mult=33`.
An example error message produced by the above command is shown below. The error sometimes arose from the checkpoint and other times arose from the main stress test DB.
```
Corruption: Size mismatch: WAL (log number: 119) in MANIFEST is 27938 bytes , but actually is 27859 bytes on disk.
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10185
Test Plan:
- repro unit test
- the above crash test command no longer finds the error. It does find a different error after a while longer such as "Corruption: WAL file 481 required by manifest but not in directory list"
Reviewed By: riversand963
Differential Revision: D37200993
Pulled By: ajkr
fbshipit-source-id: 98e0071c1a89f4d009888512ed89f9219779ae5f
Summary:
**Context/Summary:**
https://github.com/facebook/rocksdb/pull/9424 added rate-limiting support for user reads, which does not include batched `MultiGet()`s that call `RandomAccessFileReader::MultiRead()`. The reason is that it's harder (compared with RandomAccessFileReader::Read()) to implement the ideal rate-limiting where we first call `RateLimiter::RequestToken()` for allowed bytes to multi-read and then consume those bytes by satisfying as many requests in `MultiRead()` as possible. For example, it can be tricky to decide whether we want partially fulfilled requests within one `MultiRead()` or not.
However, due to a recent urgent user request, we decide to pursue an elementary (but a conditionally ineffective) solution where we accumulate enough rate limiter requests toward the total bytes needed by one `MultiRead()` before doing that `MultiRead()`. This is not ideal when the total bytes are huge as we will actually consume a huge bandwidth from rate-limiter causing a burst on disk. This is not what we ultimately want with rate limiter. Therefore a follow-up work is noted through TODO comments.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10159
Test Plan:
- Modified existing unit test `DBRateLimiterOnReadTest/DBRateLimiterOnReadTest.NewMultiGet`
- Traced the underlying system calls `io_uring_enter` and verified they are 10 seconds apart from each other correctly under the setting of `strace -ftt -e trace=io_uring_enter ./db_bench -benchmarks=multireadrandom -db=/dev/shm/testdb2 -readonly -num=50 -threads=1 -multiread_batched=1 -batch_size=100 -duration=10 -rate_limiter_bytes_per_sec=200 -rate_limiter_refill_period_us=1000000 -rate_limit_bg_reads=1 -disable_auto_compactions=1 -rate_limit_user_ops=1` where each `MultiRead()` read about 2000 bytes (inspected by debugger) and the rate limiter grants 200 bytes per seconds.
- Stress test:
- Verified `./db_stress (-test_cf_consistency=1/test_batches_snapshots=1) -use_multiget=1 -cache_size=1048576 -rate_limiter_bytes_per_sec=10241024 -rate_limit_bg_reads=1 -rate_limit_user_ops=1` work
Reviewed By: ajkr, anand1976
Differential Revision: D37135172
Pulled By: hx235
fbshipit-source-id: 73b8e8f14761e5d4b77235dfe5d41f4eea968bcd
Summary:
There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache.
In this task, we added a new abstraction layer `BlobSource` to retrieve blobs from either blob cache or raw blob file. Note: For simplicity, the current PR only includes `GetBlob()`. `MultiGetBlob()` will be included in the next PR.
This PR is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10178
Reviewed By: ltamasi
Differential Revision: D37250507
Pulled By: gangliao
fbshipit-source-id: 3fc4a55a0cea955a3147bdc7dba06430e377259b
Summary:
folly DistributedMutex is faster than standard mutexes though
imposes some static obligations on usage. See
https://github.com/facebook/folly/blob/main/folly/synchronization/DistributedMutex.h
for details. Here we use this alternative for our Cache implementations
(especially LRUCache) for better locking performance, when RocksDB is
compiled with folly.
Also added information about which distributed mutex implementation is
being used to cache_bench output and to DB LOG.
Intended follow-up:
* Use DMutex in more places, perhaps improving API to support non-scoped
locking
* Fix linking with fbcode compiler (needs ROCKSDB_NO_FBCODE=1 currently)
Credit: Thanks Siying for reminding me about this line of work that was previously
left unfinished.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10179
Test Plan:
for correctness, existing tests. CircleCI config updated.
Also Meta-internal buck build updated.
For performance, ran simultaneous before & after cache_bench. Out of three
comparison runs, the middle improvement to ops/sec was +21%:
Baseline: USE_CLANG=1 DEBUG_LEVEL=0 make -j24 cache_bench (fbcode
compiler)
```
Complete in 20.201 s; Rough parallel ops/sec = 1584062
Thread ops/sec = 107176
Operation latency (ns):
Count: 32000000 Average: 9257.9421 StdDev: 122412.04
Min: 134 Median: 3623.0493 Max: 56918500
Percentiles: P50: 3623.05 P75: 10288.02 P99: 30219.35 P99.9: 683522.04 P99.99: 7302791.63
```
New: (add USE_FOLLY=1)
```
Complete in 16.674 s; Rough parallel ops/sec = 1919135 (+21%)
Thread ops/sec = 135487
Operation latency (ns):
Count: 32000000 Average: 7304.9294 StdDev: 108530.28
Min: 132 Median: 3777.6012 Max: 91030902
Percentiles: P50: 3777.60 P75: 10169.89 P99: 24504.51 P99.9: 59721.59 P99.99: 1861151.83
```
Reviewed By: anand1976
Differential Revision: D37182983
Pulled By: pdillinger
fbshipit-source-id: a17eb05f25b832b6a2c1356f5c657e831a5af8d1
Summary:
Added an option, `WriteOptions::protection_bytes_per_key`, that controls how many bytes per key we use for integrity protection in `WriteBatch`. It takes effect when `WriteBatch::GetProtectionBytesPerKey() == 0`.
Currently the only supported value is eight. Invoking a user API with it set to any other nonzero value will result in `Status::NotSupported` returned to the user.
There is also a bug fix for integrity protection with `inplace_callback`, where we forgot to take into account the possible change in varint length when calculating KV checksum for the final encoded buffer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10037
Test Plan:
- Manual
- Set default value of `WriteOptions::protection_bytes_per_key` to eight and ran `make check -j24`
- Enabled in MyShadow for 1+ week
- Automated
- Unit tests have a `WriteMode` that enables the integrity protection via `WriteOptions`
- Crash test - in most cases, use `WriteOptions::protection_bytes_per_key` to enable integrity protection
Reviewed By: cbi42
Differential Revision: D36614569
Pulled By: ajkr
fbshipit-source-id: 8650087ceac9b61b560f1e5fafe5e1baf9c725fb
Summary:
There was an interesting code path not covered by testing that
is difficult to replicate in a unit test, which is now covered using a
sync point. Specifically, the case of table_prefix_extractor == null and
!need_upper_bound_check in `BlockBasedTable::PrefixMayMatch`, which
can happen if table reader is open before extractor is registered with global
object registry, but is later registered and re-set with SetOptions. (We
don't have sufficient testing control over object registry to set that up
repeatedly.)
Also, this function has been renamed to `PrefixRangeMayMatch` for clarity
vs. other functions that are not the same.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10122
Test Plan: unit tests expanded
Reviewed By: siying
Differential Revision: D36944834
Pulled By: pdillinger
fbshipit-source-id: 9e52d9da1929a3e42bbc230fcdc3599949de7bdb