Summary:
With the current code, when a blob isn't found in the cache and gets read
from the blob file and then inserted into the cache, the application gets
passed the self-contained `PinnableSlice` resulting from the blob file read.
The patch changes this so that the `PinnableSlice` pins the cache entry
instead in this case.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10625
Test Plan: `make check`
Reviewed By: pdillinger
Differential Revision: D39220904
Pulled By: ltamasi
fbshipit-source-id: cb9c62881e3523b1e9f614e00bf503bac2fe3b0a
Summary:
The patch adds a dedicated cache entry role for blob values and switches
to a registered deleter so that blobs show up as a separate bucket
(as opposed to "Misc") in the cache occupancy statistics, e.g.
```
Block cache entry stats(count,size,portion): DataBlock(133515,531.73 MB,13.6866%) BlobValue(1824855,3.10 GB,81.7071%) Misc(1,0.00 KB,0%)
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10601
Test Plan: Ran `make check` and tested the cache occupancy statistics using `db_bench`.
Reviewed By: riversand963
Differential Revision: D39107915
Pulled By: ltamasi
fbshipit-source-id: 8446c3b190a41a144030df73f318eeda4398c125
Summary:
The patch improves the bookkeeping around the memory usage of
cached blobs in two ways: 1) it uses `malloc_usable_size`, which accounts
for allocator bin sizes etc., and 2) it also considers the memory usage
of the `BlobContents` object in addition to the blob itself. Note: some unit
tests had been relying on the cache charge being equal to the size of the
cached blob; these were updated.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10583
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D39060680
Pulled By: ltamasi
fbshipit-source-id: 3583adce2b4ce6e84861f3fadccbfd2e5a3cc482
Summary:
The patch introduces a new class called `BlobContents`, which represents
a single uncompressed blob value. We currently use `std::string` for this
purpose; `BlobContents` is somewhat smaller but the primary reason for a
dedicated class is that it enables certain improvements and optimizations
like eliding a copy when inserting a blob into the cache, using custom
allocators, or more control over and better accounting of the memory usage
of cached blobs (see https://github.com/facebook/rocksdb/issues/10484).
(We plan to implement these in subsequent PRs.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10571
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D39000965
Pulled By: ltamasi
fbshipit-source-id: f296eddf9dec4fc3e11cad525b462bdf63c78f96
Summary:
RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them.
This task is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10461
Reviewed By: siying
Differential Revision: D38672823
Pulled By: ltamasi
fbshipit-source-id: 90cf7362036563d79891f47be2cc24b827482743
Summary:
... so that cache keys can be derived from DB manifest data
before reading the file from storage--so that every part of the file
can potentially go in a persistent cache.
See updated comments in cache_key.cc for technical details. Importantly,
the new cache key encoding uses some fancy but efficient math to pack
data into the cache key without depending on the sizes of the various
pieces. This simplifies some existing code creating cache keys, like
cache warming before the file size is known.
This should provide us an essentially permanent mapping between SST
unique IDs and base cache keys, with the ability to "upgrade" SST
unique IDs (and thus cache keys) with new SST format_versions.
These cache keys are of similar, perhaps indistinguishable quality to
the previous generation. Before this change (see "corrected" days
between collision):
```
./cache_bench -stress_cache_key -sck_keep_bits=43
18 collisions after 2 x 90 days, est 10 days between (1.15292e+19 corrected)
```
After this change (keep 43 bits, up through 50, to validate "trajectory"
is ok on "corrected" days between collision):
```
19 collisions after 3 x 90 days, est 14.2105 days between (1.63836e+19 corrected)
16 collisions after 5 x 90 days, est 28.125 days between (1.6213e+19 corrected)
15 collisions after 7 x 90 days, est 42 days between (1.21057e+19 corrected)
15 collisions after 17 x 90 days, est 102 days between (1.46997e+19 corrected)
15 collisions after 49 x 90 days, est 294 days between (2.11849e+19 corrected)
15 collisions after 62 x 90 days, est 372 days between (1.34027e+19 corrected)
15 collisions after 53 x 90 days, est 318 days between (5.72858e+18 corrected)
15 collisions after 309 x 90 days, est 1854 days between (1.66994e+19 corrected)
```
However, the change does modify (probably weaken) the "guaranteed unique" promise from this
> SST files generated in a single process are guaranteed to have unique cache keys, unless/until number session ids * max file number = 2**86
to this (see https://github.com/facebook/rocksdb/issues/10388)
> With the DB id limitation, we only have nice guaranteed unique cache keys for files generated in a single process until biggest session_id_counter and offset_in_file reach combined 64 bits
I don't think this is a practical concern, though.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10394
Test Plan: unit tests updated, see simulation results above
Reviewed By: jay-zhuang
Differential Revision: D38667529
Pulled By: pdillinger
fbshipit-source-id: 49af3fe7f47e5b61162809a78b76c769fd519fba
Summary:
New blobdb has a bug in compaction filter, where `blob_value_` is not reset for next iterated key. This will cause blob_value_ not empty and previous value read from blob is passed into the filter function for next key, even if its value is not in blob. Fixed by reseting regardless of key type.
Test Case:
Add `FilterByValueLength` test case in `DBBlobCompactionTest`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10391
Reviewed By: riversand963
Differential Revision: D38629900
Pulled By: ltamasi
fbshipit-source-id: 47d23ff2e5ec697958a210db9e6ceeb8b2fc49fa
Summary:
A flag in WritableFileWriter is introduced to remember error has happened. Subsequent operations will fail with an assertion. Those operations, except Close() are not supposed to be called anyway. This change will help catch bug in tests and stress tests and limit damage of a potential bug of continue writing to a file after a failure.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10489
Test Plan: Fix existing unit tests and watch crash tests for a while.
Reviewed By: anand1976
Differential Revision: D38473277
fbshipit-source-id: 09aafb971e56cfd7f9ef92ad15b883f54acf1366
Summary:
We have recently added caching support to BlobDB, and separately,
implemented an optimization where reading blobs from the cache
results in the cache handle being transferred to the target `PinnableSlice`
(as opposed to the contents getting copied). With these changes,
it makes sense to reset the `PinnableSlice` storing the blob value in
`DBIter` as soon as we move to a different iterator position to prevent
us from holding on to the cache handle any longer than necessary.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10490
Test Plan: `make check`
Reviewed By: akankshamahajan15
Differential Revision: D38473630
Pulled By: ltamasi
fbshipit-source-id: 84c045ffac76436c6152fd0f5775b007f4051386
Summary:
Similarly to https://github.com/facebook/rocksdb/pull/10457, we now have
to explicitly set the `fill_cache` read option when reading blobs in
`DBIter` to prevent the cache from getting polluted by queries with
`fill_cache` set to false. (Before we added support for a blob cache,
the setting had not made any difference either way.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10492
Test Plan: `make check`
Reviewed By: akankshamahajan15
Differential Revision: D38476121
Pulled By: ltamasi
fbshipit-source-id: ea5c5e252f83e4a4e2c74156b37d40308d7e0c80
Summary:
Local static string is not friendly to Jemalloc arena aware implementation, as it will be allocated on the arena of the first caller, which causes crash if the allocated arena gets refunded earlier.
P.S. A Jemalloc arena aware implementation is each rocksdb instance only use certain Jemalloc arenas, and arena will be refunded after associated DB instance is destroyed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8103
Reviewed By: ajkr
Differential Revision: D38477235
Pulled By: ltamasi
fbshipit-source-id: a58d32cb647ed64c144b4736fb2d5db27c2c28f9
Summary:
During compaction, blobs are currently read using the default
`ReadOptions`, which has the `fill_cache` flag set to true. Earlier,
this didn't make any difference since we didn't have a blob cache;
however, now we have to explicitly set this flag to false to avoid
polluting the cache during compaction.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10457
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D38333528
Pulled By: ltamasi
fbshipit-source-id: 5b4d49a1e39543bee73c7df2aa9194fb101875e2
Summary:
RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them.
This task is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10309
Reviewed By: ltamasi
Differential Revision: D38211655
Pulled By: gangliao
fbshipit-source-id: 65ef33337db4d85277cc6f9782d67c421ad71dd5
Summary:
GCC-12 has strick check on variables, and thus
build fails when it finds read_req is not properly
initialized (-Werror=maybe-uninitialized). Add
default value to fix this.
Change-Id: Ib8a9085e2d613ee7b943b58a6a58e1bc351725d7
Signed-off-by: Jun He <jun.he@arm.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10312
Reviewed By: riversand963
Differential Revision: D37656997
Pulled By: ajkr
fbshipit-source-id: fe47492c913b34b3a03c04beeec9ec57831dcaff
Summary:
To help service owners to manage their memory budget effectively, we have been working towards counting all major memory users inside RocksDB towards a single global memory limit (see e.g. https://github.com/facebook/rocksdb/wiki/Write-Buffer-Manager#cost-memory-used-in-memtable-to-block-cache). The global limit is specified by the capacity of the block-based table's block cache, and is technically implemented by inserting dummy entries ("reservations") into the block cache. The goal of this task is to support charging the memory usage of the new blob cache against this global memory limit when the backing cache of the blob cache and the block cache are different.
This PR is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10321
Reviewed By: ltamasi
Differential Revision: D37913590
Pulled By: gangliao
fbshipit-source-id: eaacf23907f82dc7d18964a3f24d7039a2937a72
Summary:
Many workloads have temporal locality, where recently written items are read back in a short period of time. When using remote file systems, this is inefficient since it involves network traffic and higher latencies. Because of this, we would like to support prepopulating the blob cache during flush.
This task is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10298
Reviewed By: ltamasi
Differential Revision: D37908743
Pulled By: gangliao
fbshipit-source-id: 9feaed234bc719d38f0c02975c1ad19fa4bb37d1
Summary:
RocksDB supports a two-level cache hierarchy (see https://rocksdb.org/blog/2021/05/27/rocksdb-secondary-cache.html), where items evicted from the primary cache can be spilled over to the secondary cache, or items from the secondary cache can be promoted to the primary one. We have a CacheLib-based non-volatile secondary cache implementation that can be used to improve read latencies and reduce the amount of network bandwidth when using distributed file systems. In addition, we have recently implemented a compressed secondary cache that can be used as a replacement for the OS page cache when e.g. direct I/O is used. The goals of this task are to add support for using a secondary cache with the blob cache and to measure the potential performance gains using `db_bench`.
This task is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10349
Reviewed By: ltamasi
Differential Revision: D37896773
Pulled By: gangliao
fbshipit-source-id: 7804619ce4a44b73d9e11ad606640f9385969c84
Summary:
The blob cache enables an optimization on the read path: when a blob is found in the cache, we can avoid copying it into the buffer provided by the application. Instead, we can simply transfer ownership of the cache handle to the target `PinnableSlice`. (Note: this relies on the `Cleanable` interface, which is implemented by `PinnableSlice`.)
This has the potential to save a lot of CPU, especially with large blob values.
This task is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10297
Reviewed By: riversand963
Differential Revision: D37640311
Pulled By: gangliao
fbshipit-source-id: 92de0e35cc703d06c87c5c1861cc2899ec52234a
Summary:
I noticed it would clean up some things to have Cache::Insert()
return our MemoryLimit Status instead of Incomplete for the case in
which the capacity limit is reached. I suspect this fixes some existing but
unknown bugs where this Incomplete could be confused with other uses
of Incomplete, especially no_io cases. This is the most suspicious case I
noticed, but was not able to reproduce a bug, in part because the existing
code is not covered by unit tests (FIXME added): 57adbf0e91/table/get_context.cc (L397)
I audited all the existing uses of IsIncomplete and updated those that
seemed relevant.
HISTORY updated with a clear warning to users of strict_capacity_limit=true
to update uses of `IsIncomplete()`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10262
Test Plan: updated unit tests
Reviewed By: hx235
Differential Revision: D37473155
Pulled By: pdillinger
fbshipit-source-id: 4bd9d9353ccddfe286b03ebd0652df8ce20f99cb
Summary:
- [x] Enabled blob caching for MultiGetBlob in RocksDB
- [x] Refactored MultiGetBlob logic and interface in RocksDB
- [x] Cleaned up Version::MultiGetBlob() and moved 'blob'-related code snippets into BlobSource
- [x] Add End-to-end test cases in db_blob_basic_test and also add unit tests in blob_source_test
This task is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10272
Reviewed By: ltamasi
Differential Revision: D37558112
Pulled By: gangliao
fbshipit-source-id: a73a6a94ffdee0024d5b2a39e6d1c1a7d38664db
Summary:
The `bytes_read` returned by the current BlobSource interface is ambiguous. The uncompressed blob size is returned if the cache hits. The size of the blob read from disk, presumably the compressed version, is returned if the cache misses. Two differing semantics might cause ambiguity and consistency issues. For example, this inconsistency causes the assertion failure (T124246362 and its hot fix is https://github.com/facebook/rocksdb/issues/10249).
This goal is to require that the value of `byte read` always be an on-disk blob record size.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10248
Reviewed By: ltamasi
Differential Revision: D37470292
Pulled By: gangliao
fbshipit-source-id: fbca521b2791d3674dbf2484cea5fcae2fdd94d2
Summary:
This task is to fix assertion failures during the crash test runs. The cache entry size might not match value size because value size can include the on-disk (possibly compressed) size. Therefore, we removed the assertions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10249
Reviewed By: ltamasi
Differential Revision: D37407576
Pulled By: gangliao
fbshipit-source-id: 577559f267c5b2437bcd0631cd0efabb6dde3b69
Summary:
There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10225
Test Plan:
Add test cases for MultiGetBlob
In this task, we added the new API MultiGetBlob() for BlobSource.
This PR is a part of https://github.com/facebook/rocksdb/issues/10156
Reviewed By: ltamasi
Differential Revision: D37358364
Pulled By: gangliao
fbshipit-source-id: aff053a37615d96d768fb9aedde17da5618c7ae6
Summary:
There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache.
In this task, we formally introduced the blob source to RocksDB. BlobSource is a new abstraction layer that provides universal access to blobs, regardless of whether they are in the blob cache, secondary cache, or (remote) storage. Depending on user settings, it always fetch blobs from multi-tier cache and storage with minimal cost.
Note: The new `MultiGetBlob()` implementation is not included in the current PR. To go faster, we aim to create a separate PR for it in parallel!
This PR is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10198
Reviewed By: ltamasi
Differential Revision: D37294735
Pulled By: gangliao
fbshipit-source-id: 9cb50422d9dd1bc03798501c2778b6c7520c7a1e
Summary:
include "include/rocksdb/blah.h" is messing up some internal
builds vs. include "rocksdb/blah." This fixes the bad case and adds a
check for future instances.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10213
Test Plan: back-port to 7.4 release candidate and watch internal build
Reviewed By: hx235
Differential Revision: D37296202
Pulled By: pdillinger
fbshipit-source-id: d7cc6b2c57d858dff0444f19320d83c8b4f9b185
Summary:
There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache.
In this task, we added a new abstraction layer `BlobSource` to retrieve blobs from either blob cache or raw blob file. Note: For simplicity, the current PR only includes `GetBlob()`. `MultiGetBlob()` will be included in the next PR.
This PR is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10178
Reviewed By: ltamasi
Differential Revision: D37250507
Pulled By: gangliao
fbshipit-source-id: 3fc4a55a0cea955a3147bdc7dba06430e377259b
Summary:
Currently, if blob files are enabled (i.e. `enable_blob_files` is true), large values are extracted both during flush/recovery (when SST files are written into level 0 of the LSM tree) and during compaction into any LSM tree level. For certain use cases that have a mix of short-lived and long-lived values, it might make sense to support extracting large values only during compactions whose output level is greater than or equal to a specified LSM tree level (e.g. compactions into L1/L2/... or above). This could reduce the space amplification caused by large values that are turned into garbage shortly after being written at the price of some write amplification incurred by long-lived values whose extraction to blob files is delayed.
In order to achieve this, we would like to do the following:
- Add a new configuration option `blob_file_starting_level` (default: 0) to `AdvancedColumnFamilyOptions` (and `MutableCFOptions` and extend the related logic)
- Instantiate `BlobFileBuilder` in `BuildTable` (used during flush and recovery, where the LSM tree level is L0) and `CompactionJob` iff `enable_blob_files` is set and the LSM tree level is `>= blob_file_starting_level`
- Add unit tests for the new functionality, and add the new option to our stress tests (`db_stress` and `db_crashtest.py` )
- Add the new option to our benchmarking tool `db_bench` and the BlobDB benchmark script `run_blob_bench.sh`
- Add the new option to the `ldb` tool (see https://github.com/facebook/rocksdb/wiki/Administration-and-Data-Access-Tool)
- Ideally extend the C and Java bindings with the new option
- Update the BlobDB wiki to document the new option.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10077
Reviewed By: ltamasi
Differential Revision: D36884156
Pulled By: gangliao
fbshipit-source-id: 942bab025f04633edca8564ed64791cb5e31627d
Summary:
ToString() is created as some platform doesn't support std::to_string(). However, we've already used std::to_string() by mistake for 16 months (in db/db_info_dumper.cc). This commit just remove ToString().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9955
Test Plan: Watch CI tests
Reviewed By: riversand963
Differential Revision: D36176799
fbshipit-source-id: bdb6dcd0e3a3ab96a1ac810f5d0188f684064471
Summary:
In `FileMetaData`, we keep track of the lowest-numbered blob file
referenced by the SST file in question for the purposes of BlobDB's
garbage collection in the `oldest_blob_file_number` field, which is
updated in `UpdateBoundaries`. However, with the current code,
`BlobIndex` decoding errors (or invalid blob file numbers) are swallowed
in this method. The patch changes this by propagating these errors
and failing the corresponding flush/compaction. (Note that since blob
references are generated by the BlobDB code and also parsed by
`CompactionIterator`, in reality this can only happen in the case of
memory corruption.)
This change necessitated updating some unit tests that involved
fake/corrupt `BlobIndex` objects. Some of these just used a dummy string like
`"blob_index"` as a placeholder; these were replaced with real `BlobIndex`es.
Some were relying on the earlier behavior to simulate corruption; these
were replaced with `SyncPoint`-based test code that corrupts a valid
blob reference at read time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9851
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D35683671
Pulled By: ltamasi
fbshipit-source-id: f7387af9945c48e4d5c4cd864f1ba425c7ad51f6
Summary:
This information has been already available as part of the `rocksdb.blob-stats`
string property. The patch adds a dedicated integer property to make it easier
to surface this information in monitoring systems.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9835
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D35619495
Pulled By: ltamasi
fbshipit-source-id: 03fb0b228aa27d3859a1e3783bcb7eca095607f8
Summary:
Users can set the priority for file reads associated with their operation by setting `ReadOptions::rate_limiter_priority` to something other than `Env::IO_TOTAL`. Rate limiting `VerifyChecksum()` and `VerifyFileChecksums()` is the motivation for this PR, so it also includes benchmarks and minor bug fixes to get that working.
`RandomAccessFileReader::Read()` already had support for rate limiting compaction reads. I changed that rate limiting to be non-specific to compaction, but rather performed according to the passed in `Env::IOPriority`. Now the compaction read rate limiting is supported by setting `rate_limiter_priority = Env::IO_LOW` on its `ReadOptions`.
There is no default value for the new `Env::IOPriority` parameter to `RandomAccessFileReader::Read()`. That means this PR goes through all callers (in some cases multiple layers up the call stack) to find a `ReadOptions` to provide the priority. There are TODOs for cases I believe it would be good to let user control the priority some day (e.g., file footer reads), and no TODO in cases I believe it doesn't matter (e.g., trace file reads).
The API doc only lists the missing cases where a file read associated with a provided `ReadOptions` cannot be rate limited. For cases like file ingestion checksum calculation, there is no API to provide `ReadOptions` or `Env::IOPriority`, so I didn't count that as missing.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9424
Test Plan:
- new unit tests
- new benchmarks on ~50MB database with 1MB/s read rate limit and 100ms refill interval; verified with strace reads are chunked (at 0.1MB per chunk) and spaced roughly 100ms apart.
- setup command: `./db_bench -benchmarks=fillrandom,compact -db=/tmp/testdb -target_file_size_base=1048576 -disable_auto_compactions=true -file_checksum=true`
- benchmarks command: `strace -ttfe pread64 ./db_bench -benchmarks=verifychecksum,verifyfilechecksums -use_existing_db=true -db=/tmp/testdb -rate_limiter_bytes_per_sec=1048576 -rate_limit_bg_reads=1 -rate_limit_user_ops=true -file_checksum=true`
- crash test using IO_USER priority on non-validation reads with https://github.com/facebook/rocksdb/issues/9567 reverted: `python3 tools/db_crashtest.py blackbox --max_key=1000000 --write_buffer_size=524288 --target_file_size_base=524288 --level_compaction_dynamic_level_bytes=true --duration=3600 --rate_limit_bg_reads=true --rate_limit_user_ops=true --rate_limiter_bytes_per_sec=10485760 --interval=10`
Reviewed By: hx235
Differential Revision: D33747386
Pulled By: ajkr
fbshipit-source-id: a2d985e97912fba8c54763798e04f006ccc56e0c
Summary:
Extend the periodic statistics in the info log with the total amount of garbage
in blob files and the space amplification pertaining to blob files, where the
latter is defined as `total_blob_file_size / (total_blob_file_size - total_blob_garbage_size)`.
Also expose the space amp via the `rocksdb.blob-stats` DB property.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9538
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D34126855
Pulled By: ltamasi
fbshipit-source-id: 3153e7a0fe0eca440322db273f4deaabaccc51b2
Summary:
The patch replaces `std::map` with a sorted `std::vector` for
`VersionStorageInfo::blob_files_` and preallocates the space
for the `vector` before saving the `BlobFileMetaData` into the
new `VersionStorageInfo` in `VersionBuilder::Rep::SaveBlobFilesTo`.
These changes reduce the time the DB mutex is held while
saving new `Version`s, and using a sorted `vector` also makes
lookups faster thanks to better memory locality.
In addition, the patch introduces helper methods
`VersionStorageInfo::GetBlobFileMetaData` and
`VersionStorageInfo::GetBlobFileMetaDataLB` that can be used by
clients to perform lookups in the `vector`, and does some general
cleanup in the parts of code where blob file metadata are used.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9526
Test Plan:
Ran `make check` and the crash test script for a while.
Performance was tested using a load-optimized benchmark (`fillseq` with vector memtable, no WAL) and small file sizes so that a significant number of files are produced:
```
numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/ltamasi-dbbench --wal_dir=/data/ltamasi-dbbench --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --enable_blob_files=1 --blob_file_size=16777216 --min_blob_size=0 --blob_compression_type=lz4 --enable_blob_garbage_collection=1 --seed=<some value>
```
Final statistics before the patch:
```
Cumulative writes: 0 writes, 700M keys, 0 commit groups, 0.0 writes per commit group, ingest: 284.62 GB, 121.27 MB/s
Interval writes: 0 writes, 334K keys, 0 commit groups, 0.0 writes per commit group, ingest: 139.28 MB, 72.46 MB/s
```
With the patch:
```
Cumulative writes: 0 writes, 760M keys, 0 commit groups, 0.0 writes per commit group, ingest: 308.66 GB, 131.52 MB/s
Interval writes: 0 writes, 445K keys, 0 commit groups, 0.0 writes per commit group, ingest: 185.35 MB, 93.15 MB/s
```
Total time to complete the benchmark is 2611 seconds with the patch, down from 2986 secs.
Reviewed By: riversand963
Differential Revision: D34082728
Pulled By: ltamasi
fbshipit-source-id: fc598abf676dce436734d06bb9d2d99a26a004fc
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9266
This diff adds a new tag `CommitWithTimestamp`. Currently, there is no API to trigger writing
this tag to WAL, thus it is unavailable to users.
This is an ongoing effort to add user-defined timestamp support to write-committed transactions.
This diff also indicates all column families that may potentially participate in the same
transaction must either disable timestamp or have the same timestamp format, since
`CommitWithTimestamp` tag is followed by a single byte-array denoting the commit
timestamp of the transaction. We will enforce this checking in a future diff. We keep this
diff small.
Reviewed By: ltamasi
Differential Revision: D31721350
fbshipit-source-id: e1450811443647feb6ca01adec4c8aaae270ffc6
Summary:
The patch adds a new BlobDB configuration option `blob_compaction_readahead_size`
that can be used to enable prefetching data from blob files during compaction.
This is important when using storage with higher latencies like HDDs or remote filesystems.
If enabled, prefetching is used for all cases when blobs are read during compaction,
namely garbage collection, compaction filters (when the existing value has to be read from
a blob file), and `Merge` (when the value of the base `Put` is stored in a blob file).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9187
Test Plan: Ran `make check` and the stress/crash test.
Reviewed By: riversand963
Differential Revision: D32565512
Pulled By: ltamasi
fbshipit-source-id: 87be9cebc3aa01cc227bec6b5f64d827b8164f5d
Summary:
Allow compaction_job_test, db_io_failure_test, dbformat_test, deletefile_test, and fault_injection_test to use a custom Env object. Also move ```RegisterCustomObjects``` declaration to a header file to simplify things.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9087
Test Plan: Run manually using "buck test rocksdb/src:compaction_job_test_fbcode" etc.
Reviewed By: riversand963
Differential Revision: D32007222
Pulled By: anand1976
fbshipit-source-id: 99af58559e25bf61563dfa95dc46e31fa7375792
Summary:
There were three implementations of VectorIterator (util/vector_iterator, test_util/testutil.h and LoggingForwardVectorIterator). Merged them into one class to increase code coverage/testing and reduce duplication.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8901
Reviewed By: pdillinger
Differential Revision: D31022673
Pulled By: mrambacher
fbshipit-source-id: 8e3acbd2dfd60b4df609d02cc72846de2389d531
Summary:
On MacOS, there were errors building in LITE mode related to unused private member variables:
In file included from ./db/compaction/compaction_job.h:20:
./db/blob/blob_file_completion_callback.h:87:19: error: private field ‘sst_file_manager_’ is not used [-Werror,-Wunused-private-field]
SstFileManager* sst_file_manager_;
^
./db/blob/blob_file_completion_callback.h:88:22: error: private field ‘mutex_’ is not used [-Werror,-Wunused-private-field]
InstrumentedMutex* mutex_;
^
./db/blob/blob_file_completion_callback.h:89:17: error: private field ‘error_handler_’ is not used [-Werror,-Wunused-private-field]
ErrorHandler* error_handler_;
This PR resolves those build issues by removing the values as members in LITE mode and fixing the constructor to ignore the input values in LITE mode (otherwise we get unused parameter warnings).
Tested by validating compiles without warnings.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8981
Reviewed By: akankshamahajan15
Differential Revision: D31320141
Pulled By: mrambacher
fbshipit-source-id: d67875ebbd39a9555e4f09b2d37159566dd8a085
Summary:
Add comments for MultiGetBlob() that input argument `offsets` must be
sorted. In addition, add assertion for this condition in debug build.
Repeat the same for RandomAccessFileReader::MultiRead().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8972
Test Plan: make check
Reviewed By: pdillinger
Differential Revision: D31253205
Pulled By: riversand963
fbshipit-source-id: 98758229b8052f3aeb319d5584026b4de2d220a2
Summary:
`RandomAccessFileReader::MultiRead()` tries to merge requests in direct IO, assuming input IO requests are
sorted by offsets.
Add a test in direct IO mode.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8953
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D31183546
Pulled By: riversand963
fbshipit-source-id: 5d043ec68e2daa47a3149066150afd41ee3d73e6
Summary:
Made SystemClock into a Customizable class, complete with CreateFromString.
Cleaned up some of the existing SystemClock implementations that were redundant (NoSleep was the same as the internal one for MockEnv).
Changed MockEnv construction to allow Clock to be passed to the Memory/MockFileSystem.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8636
Reviewed By: zhichao-cao
Differential Revision: D30483360
Pulled By: mrambacher
fbshipit-source-id: cd0e3a876c39f8c98fe13374c06e8edbd5b9f2a1
Summary:
1. Extend FlushJobInfo and CompactionJobInfo with information about the blob files generated by flush/compaction jobs. This PR add two structures BlobFileInfo and BlobFileGarbageInfo that contains the required information of blob files.
2. Notify the creation and deletion of blob files through OnBlobFileCreationStarted, OnBlobFileCreated, and OnBlobFileDeleted.
3. Test OnFile*Finish operations notifications with Blob Files.
4. Log the blob file creation/deletion events through EventLogger in Log file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8675
Test Plan: Add new unit tests in listener_test
Reviewed By: ltamasi
Differential Revision: D30412613
Pulled By: akankshamahajan15
fbshipit-source-id: ca51b63c6e8c8d0485a38c503572bc5a82bd5d07
Summary:
The patch adjusts the definition of BlobDB's DB properties a bit by
switching to `GetBlobFileSize` from `GetTotalBlobBytes`. The
difference is that the value returned by `GetBlobFileSize` includes
the blob file header and footer as well, and thus matches the on-disk
size of blob files. In addition, the patch removes the `Version` number
from the `blob_stats` property, and updates/extends the unit tests a little.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8902
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D30859542
Pulled By: ltamasi
fbshipit-source-id: e3426d2d567bd1bd8c8636abdafaafa0743c854c
Summary:
RocksDB exposes certain internal statistics via the DB property interface.
However, there are currently no properties related to BlobDB.
For starters, we would like to add the following BlobDB properties:
`rocksdb.num-blob-files`: number of blob files in the current Version (kind of like `num-files-at-level` but note this is not per level, since blob files are not part of the LSM tree).
`rocksdb.blob-stats`: this could return the total number and size of all blob files, and potentially also the total amount of garbage (in bytes) in the blob files in the current Version.
`rocksdb.total-blob-file-size`: the total size of all blob files (as a blob counterpart for `total-sst-file-size`) of all Versions.
`rocksdb.live-blob-file-size`: the total size of all blob files in the current Version.
`rocksdb.estimate-live-data-size`: this is actually an existing property that we can extend so it considers blob files as well. When it comes to blobs, we actually have an exact value for live bytes. Namely, live bytes can be computed simply as total bytes minus garbage bytes, summed over the entire set of blob files in the Version.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8734
Test Plan:
```
➜ rocksdb git:(new_feature_blobDB_properties) ./db_blob_basic_test
[==========] Running 16 tests from 2 test cases.
[----------] Global test environment set-up.
[----------] 10 tests from DBBlobBasicTest
[ RUN ] DBBlobBasicTest.GetBlob
[ OK ] DBBlobBasicTest.GetBlob (12 ms)
[ RUN ] DBBlobBasicTest.MultiGetBlobs
[ OK ] DBBlobBasicTest.MultiGetBlobs (11 ms)
[ RUN ] DBBlobBasicTest.GetBlob_CorruptIndex
[ OK ] DBBlobBasicTest.GetBlob_CorruptIndex (10 ms)
[ RUN ] DBBlobBasicTest.GetBlob_InlinedTTLIndex
[ OK ] DBBlobBasicTest.GetBlob_InlinedTTLIndex (12 ms)
[ RUN ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber
[ OK ] DBBlobBasicTest.GetBlob_IndexWithInvalidFileNumber (9 ms)
[ RUN ] DBBlobBasicTest.GenerateIOTracing
[ OK ] DBBlobBasicTest.GenerateIOTracing (11 ms)
[ RUN ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile
[ OK ] DBBlobBasicTest.BestEffortsRecovery_MissingNewestBlobFile (13 ms)
[ RUN ] DBBlobBasicTest.GetMergeBlobWithPut
[ OK ] DBBlobBasicTest.GetMergeBlobWithPut (11 ms)
[ RUN ] DBBlobBasicTest.MultiGetMergeBlobWithPut
[ OK ] DBBlobBasicTest.MultiGetMergeBlobWithPut (14 ms)
[ RUN ] DBBlobBasicTest.BlobDBProperties
[ OK ] DBBlobBasicTest.BlobDBProperties (21 ms)
[----------] 10 tests from DBBlobBasicTest (124 ms total)
[----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest
[ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0
[ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/0 (12 ms)
[ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1
[ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.GetBlob_IOError/1 (10 ms)
[ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0
[ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/0 (10 ms)
[ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1
[ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.MultiGetBlobs_IOError/1 (10 ms)
[ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0
[ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/0 (1011 ms)
[ RUN ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1
[ OK ] DBBlobBasicTest/DBBlobBasicIOErrorTest.CompactionFilterReadBlob_IOError/1 (1013 ms)
[----------] 6 tests from DBBlobBasicTest/DBBlobBasicIOErrorTest (2066 ms total)
[----------] Global test environment tear-down
[==========] 16 tests from 2 test cases ran. (2190 ms total)
[ PASSED ] 16 tests.
```
Reviewed By: ltamasi
Differential Revision: D30690849
Pulled By: Zhiyi-Zhang
fbshipit-source-id: a7567319487ad76bd1a2e24bf143afdbbd9e4346