Summary:
If user runs `db_bench` with `-perf_level=2` or higher, db_bench should
print perf context after each of all benchmarks.
Or make `-perf_level` a per-benchmark switch.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10396
Test Plan: ./db_bench -benchmarks=fillseq,readseq -perf_level=2
Reviewed By: ajkr
Differential Revision: D38016324
Pulled By: riversand963
fbshipit-source-id: d83ea4abc34d40ffea394ca6abf0814bc5c0a2e0
Summary:
To help service owners to manage their memory budget effectively, we have been working towards counting all major memory users inside RocksDB towards a single global memory limit (see e.g. https://github.com/facebook/rocksdb/wiki/Write-Buffer-Manager#cost-memory-used-in-memtable-to-block-cache). The global limit is specified by the capacity of the block-based table's block cache, and is technically implemented by inserting dummy entries ("reservations") into the block cache. The goal of this task is to support charging the memory usage of the new blob cache against this global memory limit when the backing cache of the blob cache and the block cache are different.
This PR is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10321
Reviewed By: ltamasi
Differential Revision: D37913590
Pulled By: gangliao
fbshipit-source-id: eaacf23907f82dc7d18964a3f24d7039a2937a72
Summary:
After branch 7.5.fb branch is cut, following release process, upgrade version number to 7.6 and add 7.5.fb to format compatibility check.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10376
Test Plan: Watch CI
Reviewed By: ajkr
Differential Revision: D37927694
fbshipit-source-id: 71b37ae55ebb7c95a1bcc0d7eee643d6ba5f8461
Summary:
Many workloads have temporal locality, where recently written items are read back in a short period of time. When using remote file systems, this is inefficient since it involves network traffic and higher latencies. Because of this, we would like to support prepopulating the blob cache during flush.
This task is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10298
Reviewed By: ltamasi
Differential Revision: D37908743
Pulled By: gangliao
fbshipit-source-id: 9feaed234bc719d38f0c02975c1ad19fa4bb37d1
Summary:
ClockCache is still in experimental stage, and currently fails some pre-release fbcode tests. See https://www.internalfb.com/diff/D37772011. API calls to construct ClockCache are done via the function NewClockCache. For now, NewClockCache calls will return an LRUCache (with appropriate arguments), which is stable.
The idea that NewClockCache returns nullptr was also floated, but this would be interpreted as unsupported cache, and a default LRUCache would be constructed instead, potentially causing a performance regression that is harder to identify.
A new version of the NewClockCache function was created for our internal tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10351
Test Plan: ``make -j24 check`` and re-run the pre-release tests.
Reviewed By: pdillinger
Differential Revision: D37802685
Pulled By: guidotag
fbshipit-source-id: 0a8d10612ff21e576f7360cb13e20bc36e244972
Summary:
as title.
Test plan
- make check
- CI on PR
- TEST_TMPDIR=/dev/shm make crash_test_with_multiops_wp_txn (tested with successful run)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10350
Reviewed By: ajkr
Differential Revision: D37792872
Pulled By: riversand963
fbshipit-source-id: ff064093b7f715d0acf387af2e3ae87b1278b52b
Summary:
Previously the version was displayed as $major.$minor
This changes it to $major.$minor.$path
This also adds the git hash for the time from which RocksDB was built to the end of report.tsv. I confirmed that benchmark_log_tool.py still parses it and that the people
who consume/graph these results are OK with it.
Example output:
ops_sec mb_sec lsm_sz blob_sz c_wgb w_amp c_mbps c_wsecs c_csecs b_rgb b_wgb usec_op p50 p99 p99.9 p99.99 pmax uptime stall% Nstall u_cpu s_cpu rss test date version job_id githash
609488 244.1 1GB 0.0GB, 1.4 0.7 93.3 39 38 0 0 1.6 1.0 4 15 26 5365 15 0.0 0 0.1 0.0 0.5 fillseq.wal_disabled.v400 2022-06-29T13:36:05 7.5.0 6115254416
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10277
Test Plan: Run it
Reviewed By: jay-zhuang
Differential Revision: D37532418
Pulled By: mdcallag
fbshipit-source-id: 55e472640d51265819b228d3373c9fa9b62b660d
Summary:
This adds --undefok to support use of this script with BlobDB for db_bench versions prior
to 7.5 when the options land in a release.
While there is a limit to how far back this script can go WRT backwards compatiblity,
this is an easy change to support early 7.x releases.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10276
Test Plan: Run it with versions of db_bench that do not and then do support these options
Reviewed By: gangliao
Differential Revision: D37529299
Pulled By: mdcallag
fbshipit-source-id: 7bb1feec5c68760e6d64792c585bfbde4f5e52d8
Summary:
This is the initial step in the development of a lock-free clock cache. This PR includes the base hash table design (which we mostly ported over from FastLRUCache) and the clock eviction algorithm. Importantly, it's still _not_ lock-free---all operations use a shard lock. Besides the locking, there are other features left as future work:
- Remove keys from the handles. Instead, use 128-bit bijective hashes of them for handle comparisons, probing (we need two 32-bit hashes of the key for double hashing) and sharding (we need one 6-bit hash).
- Remove the clock_usage_ field, which is updated on every lookup. Even if it were atomically updated, it could cause memory invalidations across cores.
- Middle insertions into the clock list.
- A test that exercises the clock eviction policy.
- Update the Java API of ClockCache and Java calls to C++.
Along the way, we improved the code and comments quality of FastLRUCache. These changes are relatively minor.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10273
Test Plan: ``make -j24 check``
Reviewed By: pdillinger
Differential Revision: D37522461
Pulled By: guidotag
fbshipit-source-id: 3d70b737dbb70dcf662f00cef8c609750f083943
Summary:
Before this PR, when user-defined timestamp is enabled, db_stress disables compaction filter.
This is no longer necessary after this PR, since the `DbStressCompactionFilter` is now aware of
the presence of timestamps.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10259
Test Plan: TEST_TMPDIR=/dev/shm make crash_test_with_ts
Reviewed By: ajkr
Differential Revision: D37459692
Pulled By: riversand963
fbshipit-source-id: 8fe62e90a63bd9317fe1bb95a2b4984080c9e5ef
Summary:
Need to disable it for now as CI is failing, particularly `MultiOpsTxnsStressTest`. Investigation details in internal task T124324915. This PR disables mempurge more widely than `MultiOpsTxnsStressTest` until we know the issue is contained to that particular test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10252
Reviewed By: riversand963
Differential Revision: D37432948
Pulled By: ajkr
fbshipit-source-id: d0cf5b0e0ec7c3142c382a0347f35a4c34f4607a
Summary:
In order to facilitate correctness and performance testing, we would like to add the new blob cache to our stress test tool `db_stress` and our continuously running crash test script `db_crashtest.py`, as well as our synthetic benchmarking tool `db_bench` and the BlobDB performance testing script `run_blob_bench.sh`.
As part of this task, we would also like to utilize these benchmarking tools to get some initial performance numbers about the effectiveness of caching blobs.
This PR is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10202
Reviewed By: ltamasi
Differential Revision: D37325739
Pulled By: gangliao
fbshipit-source-id: deb65d0d414502270dd4c324d987fd5469869fa8
Summary:
There was a bug in the MultiGet enhancement in https://github.com/facebook/rocksdb/issues/9899 with data
block hash index, which was not caught because data block hash index was
never added to stress tests. This change fixes both issues.
Fixes https://github.com/facebook/rocksdb/issues/10186
I intend to pick this into the 7.4.0 release candidate
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10220
Test Plan:
Failure quickly reproduces in crash test with
kDataBlockBinaryAndHash, and does not seem to with the fix. Reproducing
the failure with a unit test I believe would be too tricky and fragile
to be worthwhile.
Reviewed By: anand1976
Differential Revision: D37315647
Pulled By: pdillinger
fbshipit-source-id: 9f648265bba867275edc752f7a56611a59401cba
Summary:
In FastLRUCache, we replace the current chained per-shard hash table by an open-addressing hash table. In particular, this allows us to preallocate all handles.
Because all handles are preallocated, this implementation doesn't support strict_capacity_limit = false (i.e., allowing insertions beyond the predefined capacity). This clashes with current assumptions of some tests, namely two tests in cache_test and the crash tests. We have disabled these for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10194
Test Plan: ``make -j24 check``
Reviewed By: pdillinger
Differential Revision: D37296770
Pulled By: guidotag
fbshipit-source-id: 232ff1b8260331d868ebf4e3e5d8ad709390b0ad
Summary:
There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache.
In this task, we formally introduced the blob source to RocksDB. BlobSource is a new abstraction layer that provides universal access to blobs, regardless of whether they are in the blob cache, secondary cache, or (remote) storage. Depending on user settings, it always fetch blobs from multi-tier cache and storage with minimal cost.
Note: The new `MultiGetBlob()` implementation is not included in the current PR. To go faster, we aim to create a separate PR for it in parallel!
This PR is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10198
Reviewed By: ltamasi
Differential Revision: D37294735
Pulled By: gangliao
fbshipit-source-id: 9cb50422d9dd1bc03798501c2778b6c7520c7a1e
Summary:
**Context/Summary:**
https://github.com/facebook/rocksdb/pull/9424 added rate-limiting support for user reads, which does not include batched `MultiGet()`s that call `RandomAccessFileReader::MultiRead()`. The reason is that it's harder (compared with RandomAccessFileReader::Read()) to implement the ideal rate-limiting where we first call `RateLimiter::RequestToken()` for allowed bytes to multi-read and then consume those bytes by satisfying as many requests in `MultiRead()` as possible. For example, it can be tricky to decide whether we want partially fulfilled requests within one `MultiRead()` or not.
However, due to a recent urgent user request, we decide to pursue an elementary (but a conditionally ineffective) solution where we accumulate enough rate limiter requests toward the total bytes needed by one `MultiRead()` before doing that `MultiRead()`. This is not ideal when the total bytes are huge as we will actually consume a huge bandwidth from rate-limiter causing a burst on disk. This is not what we ultimately want with rate limiter. Therefore a follow-up work is noted through TODO comments.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10159
Test Plan:
- Modified existing unit test `DBRateLimiterOnReadTest/DBRateLimiterOnReadTest.NewMultiGet`
- Traced the underlying system calls `io_uring_enter` and verified they are 10 seconds apart from each other correctly under the setting of `strace -ftt -e trace=io_uring_enter ./db_bench -benchmarks=multireadrandom -db=/dev/shm/testdb2 -readonly -num=50 -threads=1 -multiread_batched=1 -batch_size=100 -duration=10 -rate_limiter_bytes_per_sec=200 -rate_limiter_refill_period_us=1000000 -rate_limit_bg_reads=1 -disable_auto_compactions=1 -rate_limit_user_ops=1` where each `MultiRead()` read about 2000 bytes (inspected by debugger) and the rate limiter grants 200 bytes per seconds.
- Stress test:
- Verified `./db_stress (-test_cf_consistency=1/test_batches_snapshots=1) -use_multiget=1 -cache_size=1048576 -rate_limiter_bytes_per_sec=10241024 -rate_limit_bg_reads=1 -rate_limit_user_ops=1` work
Reviewed By: ajkr, anand1976
Differential Revision: D37135172
Pulled By: hx235
fbshipit-source-id: 73b8e8f14761e5d4b77235dfe5d41f4eea968bcd
Summary:
Added an option, `WriteOptions::protection_bytes_per_key`, that controls how many bytes per key we use for integrity protection in `WriteBatch`. It takes effect when `WriteBatch::GetProtectionBytesPerKey() == 0`.
Currently the only supported value is eight. Invoking a user API with it set to any other nonzero value will result in `Status::NotSupported` returned to the user.
There is also a bug fix for integrity protection with `inplace_callback`, where we forgot to take into account the possible change in varint length when calculating KV checksum for the final encoded buffer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10037
Test Plan:
- Manual
- Set default value of `WriteOptions::protection_bytes_per_key` to eight and ran `make check -j24`
- Enabled in MyShadow for 1+ week
- Automated
- Unit tests have a `WriteMode` that enables the integrity protection via `WriteOptions`
- Crash test - in most cases, use `WriteOptions::protection_bytes_per_key` to enable integrity protection
Reviewed By: cbi42
Differential Revision: D36614569
Pulled By: ajkr
fbshipit-source-id: 8650087ceac9b61b560f1e5fafe5e1baf9c725fb
Summary:
In https://github.com/facebook/rocksdb/issues/9535, release 7.0, we hid the old block-based filter from being created using
the public API, because of its inefficiency. Although we normally maintain read compatibility
on old DBs forever, filters are not required for reading a DB, only for optimizing read
performance. Thus, it should be acceptable to remove this code and the substantial
maintenance burden it carries as useful features are developed and validated (such
as user timestamp).
This change completely removes the code for reading and writing the old block-based
filters, net removing about 1370 lines of code no longer needed. Options removed from
testing / benchmarking tools. The prior existence is only evident in a couple of places:
* `CacheEntryRole::kDeprecatedFilterBlock` - We can update this public API enum in
a major release to minimize source code incompatibilities.
* A warning is logged when an old table file is opened that used the old block-based
filter. This is provided as a courtesy, and would be a pain to unit test, so manual testing
should suffice. Unfortunately, sst_dump does not tell you whether a file uses
block-based filter, and the structure of the code makes it very difficult to fix.
* To detect that case, `kObsoleteFilterBlockPrefix` (renamed from `kFilterBlockPrefix`)
for metaindex is maintained (for now).
Other notes:
* In some cases where numbers are associated with filter configurations, we have had to
update the assigned numbers so that they all correspond to something that exists.
* Fixed potential stat counting bug by assuming `filter_checked = false` for cases
like `filter == nullptr` rather than assuming `filter_checked = true`
* Removed obsolete `block_offset` and `prefix_extractor` parameters from several
functions.
* Removed some unnecessary checks `if (!table_prefix_extractor() && !prefix_extractor)`
because the caller guarantees the prefix extractor exists and is compatible
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10184
Test Plan:
tests updated, manually test new warning in LOG using base version to
generate a DB
Reviewed By: riversand963
Differential Revision: D37212647
Pulled By: pdillinger
fbshipit-source-id: 06ee020d8de3b81260ffc36ad0c1202cbf463a80
Summary:
Fix existing usage of non-ASCII and add a check to prevent
future use. Added `-n` option to greps to provide line numbers.
Alternative to https://github.com/facebook/rocksdb/issues/10147
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10164
Test Plan:
used new checker to find & fix cases, manually check
db_bench output is preserved
Reviewed By: akankshamahajan15
Differential Revision: D37148792
Pulled By: pdillinger
fbshipit-source-id: 68c8b57e7ab829369540d532590bf756938855c7
Summary:
There is `Options::allow_data_in_errors` that controls whether RocksDB
is allowed to log data, e.g. key, value, etc in LOG files. It is false
by default. However, in db_bench and db_stress, it is often ok to log
data because there is no concern about privacy.
This PR allows db_stress and db_bench to set this option on the command
line, while it remains false by default. Furthermore, make
crash/recovery test driven by db_crashtest.py to opt-in.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10171
Test Plan: Stress test and db_bench
Reviewed By: hx235
Differential Revision: D37163787
Pulled By: riversand963
fbshipit-source-id: 0242f24d292ba15b6faf8ff903963b85d3e011f8
Summary:
We make the size of the per-shard hash table fixed. The base level of the hash table is now preallocated with the required capacity. The user must provide an estimate of the size of the values.
Notice that even though the base level becomes fixed, the chains are still dynamic. Overall, the shard capacity mechanisms haven't changed, so we don't need to test this.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10154
Test Plan: `make -j24 check`
Reviewed By: pdillinger
Differential Revision: D37124451
Pulled By: guidotag
fbshipit-source-id: cba6ac76052fe0ec60b8ff4211b3de7650e80d0c
Summary:
See https://github.com/facebook/rocksdb/issues/10082 for more details. Trivial move
isn't done for universal when compaction is from L0 into L0. So a too small value for
num_levels with db_bench means there will be fewer trivial moves with universal and
that means that write-amp will increase.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10158
Test Plan: run it
Reviewed By: siying
Differential Revision: D37122519
Pulled By: mdcallag
fbshipit-source-id: 1cb39049676f68a6cc3ea8d105a9965f89d4d09e
Summary:
In RocksDB, keys are associated with (internal) sequence numbers which denote when the keys are written
to the database. Sequence numbers in different RocksDB instances are unrelated, thus not comparable.
It is nice if we can associate sequence numbers with their corresponding actual timestamps. One thing we can
do is to support user-defined timestamp, which allows the applications to specify the format of custom timestamps
and encode a timestamp with each key. More details can be found at https://github.com/facebook/rocksdb/wiki/User-defined-Timestamp-%28Experimental%29.
This PR provides a different but complementary approach. We can associate rocksdb snapshots (defined in
https://github.com/facebook/rocksdb/blob/7.2.fb/include/rocksdb/snapshot.h#L20) with **user-specified** timestamps.
Since a snapshot is essentially an object representing a sequence number, this PR establishes a bi-directional mapping between sequence numbers and timestamps.
In the past, snapshots are usually taken by readers. The current super-version is grabbed, and a `rocksdb::Snapshot`
object is created with the last published sequence number of the super-version. You can see that the reader actually
has no good idea of what timestamp to assign to this snapshot, because by the time the `GetSnapshot()` is called,
an arbitrarily long period of time may have already elapsed since the last write, which is when the last published
sequence number is written.
This observation motivates the creation of "timestamped" snapshots on the write path. Currently, this functionality is
exposed only to the layer of `TransactionDB`. Application can tell RocksDB to create a snapshot when a transaction
commits, effectively associating the last sequence number with a timestamp. It is also assumed that application will
ensure any two snapshots with timestamps should satisfy the following:
```
snapshot1.seq < snapshot2.seq iff. snapshot1.ts < snapshot2.ts
```
If the application can guarantee that when a reader takes a timestamped snapshot, there is no active writes going on
in the database, then we also allow the user to use a new API `TransactionDB::CreateTimestampedSnapshot()` to create
a snapshot with associated timestamp.
Code example
```cpp
// Create a timestamped snapshot when committing transaction.
txn->SetCommitTimestamp(100);
txn->SetSnapshotOnNextOperation();
txn->Commit();
// A wrapper API for convenience
Status Transaction::CommitAndTryCreateSnapshot(
std::shared_ptr<TransactionNotifier> notifier,
TxnTimestamp ts,
std::shared_ptr<const Snapshot>* ret);
// Create a timestamped snapshot if caller guarantees no concurrent writes
std::pair<Status, std::shared_ptr<const Snapshot>> snapshot = txn_db->CreateTimestampedSnapshot(100);
```
The snapshots created in this way will be managed by RocksDB with ref-counting and potentially shared with
other readers. We provide the following APIs for readers to retrieve a snapshot given a timestamp.
```cpp
// Return the timestamped snapshot correponding to given timestamp. If ts is
// kMaxTxnTimestamp, then we return the latest timestamped snapshot if present.
// Othersise, we return the snapshot whose timestamp is equal to `ts`. If no
// such snapshot exists, then we return null.
std::shared_ptr<const Snapshot> TransactionDB::GetTimestampedSnapshot(TxnTimestamp ts) const;
// Return the latest timestamped snapshot if present.
std::shared_ptr<const Snapshot> TransactionDB::GetLatestTimestampedSnapshot() const;
```
We also provide two additional APIs for stats collection and reporting purposes.
```cpp
Status TransactionDB::GetAllTimestampedSnapshots(
std::vector<std::shared_ptr<const Snapshot>>& snapshots) const;
// Return timestamped snapshots whose timestamps fall in [ts_lb, ts_ub) and store them in `snapshots`.
Status TransactionDB::GetTimestampedSnapshots(
TxnTimestamp ts_lb,
TxnTimestamp ts_ub,
std::vector<std::shared_ptr<const Snapshot>>& snapshots) const;
```
To prevent the number of timestamped snapshots from growing infinitely, we provide the following API to release
timestamped snapshots whose timestamps are older than or equal to a given threshold.
```cpp
void TransactionDB::ReleaseTimestampedSnapshotsOlderThan(TxnTimestamp ts);
```
Before shutdown, RocksDB will release all timestamped snapshots.
Comparison with user-defined timestamp and how they can be combined:
User-defined timestamp persists every key with a timestamp, while timestamped snapshots maintain a volatile
mapping between snapshots (sequence numbers) and timestamps.
Different internal keys with the same user key but different timestamps will be treated as different by compaction,
thus a newer version will not hide older versions (with smaller timestamps) unless they are eligible for garbage collection.
In contrast, taking a timestamped snapshot at a certain sequence number and timestamp prevents all the keys visible in
this snapshot from been dropped by compaction. Here, visible means (seq < snapshot and most recent).
The timestamped snapshot supports the semantics of reading at an exact point in time.
Timestamped snapshots can also be used with user-defined timestamp.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9879
Test Plan:
```
make check
TEST_TMPDIR=/dev/shm make crash_test_with_txn
```
Reviewed By: siying
Differential Revision: D35783919
Pulled By: riversand963
fbshipit-source-id: 586ad905e169189e19d3bfc0cb0177a7239d1bd4
Summary:
A recent diff add a few more fields to one of the db_bench output lines that gets parsed.
This diff updates tools/benchmark.sh to handle that.
overwrite : 7.939 micros/op 125963 ops/sec; 50.5 MB/s
overwrite : 7.854 micros/op 127320 ops/sec 1800.001 seconds 229176999 operations; 51.0 MB/s
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10124
Test Plan: Run it
Reviewed By: jay-zhuang
Differential Revision: D36945137
Pulled By: mdcallag
fbshipit-source-id: 9c96f79491411da997e369a3be9c6b921a21d0fa
Summary:
This PR updates secondary instance testing in stress test by default.
A background thread will be started (disabled by default), running a secondary instance tailing the logs of the primary.
Periodically (every 1 sec), this thread calls `TryCatchUpWithPrimary()` and uses point lookup or range scan
to read some random keys with only very basic verification to make sure no assertion failure is triggered.
Thanks to https://github.com/facebook/rocksdb/issues/10061 , we can enable secondary instance when user-defined timestamp is enabled.
Also removed a less useful test configuration, `secondary_catch_up_one_in`. This is very similar to the periodic
catch-up.
In the last commit, I decided not to enable it now, but just update the tests, since secondary instance does not
work well when the underlying file is renamed by primary, e.g. SstFileManager.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10121
Test Plan:
```
TEST_TMPDIR=/dev/shm/rocksdb make crash_test
TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_ts
TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_atomic_flush
```
Reviewed By: ajkr
Differential Revision: D36939458
Pulled By: riversand963
fbshipit-source-id: 1c065b7efc3690fc341569b9d369a5cbd8ef6b3e
Summary:
The patch attempts to fix three bugs in `verify_random_db.sh`:
1) https://github.com/facebook/rocksdb/pull/9937 changed the default for
`--try_load_options` to true in the script's use case, so we have to
explicitly set it to false if the corresponding argument of the script
is 0. This should fix the issue we've been seeing with our forward
compatibility tests where 7.3 is unable to open a database created by
the version on main after adding a new configuration option.
2) The script seems to support two "extra parameters"; however,
in practice, if the second one was set, only that one was passed on to
`ldb`. Now both get forwarded.
3) When running the `diff` command, the base DB directory was passed as
the second argument instead of the file containing the `ldb` output
(this actually seems to work, probably accidentally though).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10112
Reviewed By: pdillinger
Differential Revision: D36911363
Pulled By: ltamasi
fbshipit-source-id: fe29db4e28d373cee51a12322c59050fc50e926d
Summary:
db_bench can now run with FastLRUCache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10096
Test Plan:
- Temporarily add an ``assert(false)`` in the execution path that sets up the FastLRUCache. Run ``make -j24 db_bench``. Then test the appropriate code is used by running ``./db_bench -cache_type=fast_lru_cache`` and checking that the assert is called. Repeat for LRUCache.
- Verify that FastLRUCache (currently a clone of LRUCache) produces similar benchmark data than LRUCache, by comparing the outputs of ``./db_bench -benchmarks=fillseq,fillrandom,readseq,readrandom -cache_type=fast_lru_cache`` and ``./db_bench -benchmarks=fillseq,fillrandom,readseq,readrandom -cache_type=lru_cache``.
Reviewed By: gitbw95
Differential Revision: D36898774
Pulled By: guidotag
fbshipit-source-id: f9f6b6f6da124f88b21b3c8dee742fbb04eff773
Summary:
Currently, if blob files are enabled (i.e. `enable_blob_files` is true), large values are extracted both during flush/recovery (when SST files are written into level 0 of the LSM tree) and during compaction into any LSM tree level. For certain use cases that have a mix of short-lived and long-lived values, it might make sense to support extracting large values only during compactions whose output level is greater than or equal to a specified LSM tree level (e.g. compactions into L1/L2/... or above). This could reduce the space amplification caused by large values that are turned into garbage shortly after being written at the price of some write amplification incurred by long-lived values whose extraction to blob files is delayed.
In order to achieve this, we would like to do the following:
- Add a new configuration option `blob_file_starting_level` (default: 0) to `AdvancedColumnFamilyOptions` (and `MutableCFOptions` and extend the related logic)
- Instantiate `BlobFileBuilder` in `BuildTable` (used during flush and recovery, where the LSM tree level is L0) and `CompactionJob` iff `enable_blob_files` is set and the LSM tree level is `>= blob_file_starting_level`
- Add unit tests for the new functionality, and add the new option to our stress tests (`db_stress` and `db_crashtest.py` )
- Add the new option to our benchmarking tool `db_bench` and the BlobDB benchmark script `run_blob_bench.sh`
- Add the new option to the `ldb` tool (see https://github.com/facebook/rocksdb/wiki/Administration-and-Data-Access-Tool)
- Ideally extend the C and Java bindings with the new option
- Update the BlobDB wiki to document the new option.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10077
Reviewed By: ltamasi
Differential Revision: D36884156
Pulled By: gangliao
fbshipit-source-id: 942bab025f04633edca8564ed64791cb5e31627d
Summary:
Currently, the DB directory file descriptor is left open until the deconstruction process (`DB::Close()` does not close the file descriptor). To verify this, comment out the lines between `db_ = nullptr` and `db_->Close()` (line 512, 513, 514, 515 in ldb_cmd.cc) to leak the ``db_'' object, build `ldb` tool and run
```
strace --trace=open,openat,close ./ldb --db=$TEST_TMPDIR --ignore_unknown_options put K1 V1 --create_if_missing
```
There is one directory file descriptor that is not closed in the strace log.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10049
Test Plan: Add a new unit test DBBasicTest.DBCloseAllDirectoryFDs: Open a database with different WAL directory and three different data directories, and all directory file descriptors should be closed after calling Close(). Explicitly call Close() after a directory file descriptor is not used so that the counter of directory open and close should be equivalent.
Reviewed By: ajkr, hx235
Differential Revision: D36722135
Pulled By: littlepig2013
fbshipit-source-id: 07bdc2abc417c6b30997b9bbef1f79aa757b21ff
Summary:
Stress tests can run with the experimental FastLRUCache. Crash tests randomly choose between LRUCache and FastLRUCache.
Since only LRUCache supports a secondary cache, we validate the `--secondary_cache_uri` and `--cache_type` flags---when `--secondary_cache_uri` is set, the `--cache_type` is set to `lru_cache`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10081
Test Plan:
- To test that the FastLRUCache is used and the stress test runs successfully, run `make -j24 CRASH_TEST_EXT_ARGS=—duration=960 blackbox_crash_test_with_atomic_flush`. The cache type should sometimes be `fast_lru_cache`.
- To test the flag validation, run `make -j24 CRASH_TEST_EXT_ARGS="--duration=960 --secondary_cache_uri=x" blackbox_crash_test_with_atomic_flush` multiple times. The test will always be aborted (which is okay). Check that the cache type is always `lru_cache`.
Reviewed By: anand1976
Differential Revision: D36839908
Pulled By: guidotag
fbshipit-source-id: ebcdfdcd12ec04c96c09ae5b9c9d1e613bdd1725
Summary:
Thanks to https://github.com/facebook/rocksdb/issues/9919 and https://github.com/facebook/rocksdb/issues/10051 the known bugs in file ingestion (besides mmap read + file checksum) are fixed. Now we can try again to enable file ingestion in crash test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9357
Test Plan: stress file ingestion heavily for an hour: `$ TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox --max_key=1000000 --ingest_external_file_one_in=100 --duration=3600 --interval=20 --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152`
Reviewed By: riversand963
Differential Revision: D33410746
Pulled By: ajkr
fbshipit-source-id: d276431390995a67f68390d61c06a40945fdd280
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10053
Need to exit if ldb command fails, to avoid running db_bench on
empty/bad DB and considering the results valid.
Reviewed By: jay-zhuang
Differential Revision: D36673200
fbshipit-source-id: e0d78a0d397e0e335d82d9349bfd612d38ffb552
Summary:
Added rate limiter and read rate-limiting support to SequentialFileReader. I've updated call sites to SequentialFileReader::Read with appropriate IO priority (or left a TODO and specified IO_TOTAL for now).
The PR is separated into four commits: the first one added the rate-limiting support, but with some fixes in the unit test since the number of request bytes from rate limiter in SequentialFileReader are not accurate (there is overcharge at EOF). The second commit fixed this by allowing SequentialFileReader to check file size and determine how many bytes are left in the file to read. The third commit added benchmark related code. The fourth commit moved the logic of using file size to avoid overcharging the rate limiter into backup engine (the main user of SequentialFileReader).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9973
Test Plan:
- `make check`, backup_engine_test covers usage of SequentialFileReader with rate limiter.
- Run db_bench to check if rate limiting is throttling as expected: Verified that reads and writes are together throttled at 2MB/s, and at 0.2MB chunks that are 100ms apart.
- Set up: `./db_bench --benchmarks=fillrandom -db=/dev/shm/test_rocksdb`
- Benchmark:
```
strace -ttfe read,write ./db_bench --benchmarks=backup -db=/dev/shm/test_rocksdb --backup_rate_limit=2097152 --use_existing_db
strace -ttfe read,write ./db_bench --benchmarks=restore -db=/dev/shm/test_rocksdb --restore_rate_limit=2097152 --use_existing_db
```
- db bench on backup and restore to ensure no performance regression.
- backup (avg over 50 runs): pre-change: 1.90443e+06 micros/op; post-change: 1.8993e+06 micros/op (improve by 0.2%)
- restore (avg over 50 runs): pre-change: 1.79105e+06 micros/op; post-change: 1.78192e+06 micros/op (improve by 0.5%)
```
# Set up
./db_bench --benchmarks=fillrandom -db=/tmp/test_rocksdb -num=10000000
# benchmark
TEST_TMPDIR=/tmp/test_rocksdb
NUM_RUN=50
for ((j=0;j<$NUM_RUN;j++))
do
./db_bench -db=$TEST_TMPDIR -num=10000000 -benchmarks=backup -use_existing_db | egrep 'backup'
# Restore
#./db_bench -db=$TEST_TMPDIR -num=10000000 -benchmarks=restore -use_existing_db
done > rate_limit.txt && awk -v NUM_RUN=$NUM_RUN '{sum+=$3;sum_sqrt+=$3^2}END{print sum/NUM_RUN, sqrt(sum_sqrt/NUM_RUN-(sum/NUM_RUN)^2)}' rate_limit.txt >> rate_limit_2.txt
```
Reviewed By: hx235
Differential Revision: D36327418
Pulled By: cbi42
fbshipit-source-id: e75d4307cff815945482df5ba630c1e88d064691
Summary:
Essentially refactored the RangeMayExist implementation in
FullFilterBlockReader to FilterBlockReaderCommon so that it applies to
partitioned filters as well. (The function is not called for the
block-based filter case.) RangeMayExist is essentially a series of checks
around a possible PrefixMayExist, and I'm confident those checks should
be the same for partitioned as for full filters. (I think it's likely
that bugs remain in those checks, but this change is overall a simplifying
one.)
Added auto_prefix_mode support to db_bench
Other small fixes as well
Fixes https://github.com/facebook/rocksdb/issues/10003
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10012
Test Plan:
Expanded unit test that uses statistics to check for filter
optimization, fails without the production code changes here
Performance: populate two DBs with
```
TEST_TMPDIR=/dev/shm/rocksdb_nonpartitioned ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8
TEST_TMPDIR=/dev/shm/rocksdb_partitioned ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8 -partition_index_and_filters
```
Observe no measurable change in non-partitioned performance
```
TEST_TMPDIR=/dev/shm/rocksdb_nonpartitioned ./db_bench -benchmarks=seekrandom[-X1000] -num=10000000 -readonly -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8 -auto_prefix_mode -cache_index_and_filter_blocks=1 -cache_size=1000000000 -duration 20
```
Before: seekrandom [AVG 15 runs] : 11798 (± 331) ops/sec
After: seekrandom [AVG 15 runs] : 11724 (± 315) ops/sec
Observe big improvement with partitioned (also supported by bloom use statistics)
```
TEST_TMPDIR=/dev/shm/rocksdb_partitioned ./db_bench -benchmarks=seekrandom[-X1000] -num=10000000 -readonly -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8 -partition_index_and_filters -auto_prefix_mode -cache_index_and_filter_blocks=1 -cache_size=1000000000 -duration 20
```
Before: seekrandom [AVG 12 runs] : 2942 (± 57) ops/sec
After: seekrandom [AVG 12 runs] : 7489 (± 184) ops/sec
Reviewed By: siying
Differential Revision: D36469796
Pulled By: pdillinger
fbshipit-source-id: bcf1e2a68d347b32adb2b27384f945434e7a266d