Summary:
`Replayer::Execute()` can directly returns the result (e.g, request latency, DB::Get() return code, returned value, etc.)
`Replayer::Replay()` reports the results via a callback function.
New interface:
`TraceRecordResult` in "rocksdb/trace_record_result.h".
`DBTest2.TraceAndReplay` and `DBTest2.TraceAndManualReplay` are updated accordingly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8657
Reviewed By: ajkr
Differential Revision: D30290216
Pulled By: autopear
fbshipit-source-id: 3c8d4e6b180ec743de1a9d9dcaee86064c74f0d6
Summary:
New public interfaces:
`TraceRecord` and `TraceRecord::Handler`, available in "rocksdb/trace_record.h".
`Replayer`, available in `rocksdb/utilities/replayer.h`.
User can use `DB::NewDefaultReplayer()` to create a Replayer to auto/manual replay a trace file.
Unit tests:
- `./db_test2 --gtest_filter="DBTest2.TraceAndReplay"`: Updated with the internal API changes.
- `./db_test2 --gtest_filter="DBTest2.TraceAndManualReplay"`: New for manual replay.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8611
Reviewed By: ajkr
Differential Revision: D30266329
Pulled By: autopear
fbshipit-source-id: 1ecb3cbbedae0f6a67c18f0cc82e002b4d81b6f8
Summary:
Changes the API of the MemPurge process: the `bool experimental_allow_mempurge` and `experimental_mempurge_policy` flags have been replaced by a `double experimental_mempurge_threshold` option.
This change of API reflects another major change introduced in this PR: the MemPurgeDecider() function now works by sampling the memtables being flushed to estimate the overall amount of useful payload (payload minus the garbage), and then compare this useful payload estimate with the `double experimental_mempurge_threshold` value.
Therefore, when the value of this flag is `0.0` (default value), mempurge is simply deactivated. On the other hand, a value of `DBL_MAX` would be equivalent to always going through a mempurge regardless of the garbage ratio estimate.
At the moment, a `double experimental_mempurge_threshold` value else than 0.0 or `DBL_MAX` is opnly supported`with the `SkipList` memtable representation.
Regarding the sampling, this PR includes the introduction of a `MemTable::UniqueRandomSample` function that collects (approximately) random entries from the memtable by using the new `SkipList::Iterator::RandomSeek()` under the hood, or by iterating through each memtable entry, depending on the target sample size and the total number of entries.
The unit tests have been readapted to support this new API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8628
Reviewed By: pdillinger
Differential Revision: D30149315
Pulled By: bjlemaire
fbshipit-source-id: 1feef5390c95db6f4480ab4434716533d3947f27
Summary:
`GenericRateLimiter` slow path handles requests that cannot be satisfied
immediately. Such requests enter a queue, and their thread stays in `Request()`
until they are granted or the rate limiter is stopped. These threads are
responsible for unblocking themselves. The work to do so is split into two main
duties.
(1) Waiting for the next refill time.
(2) Refilling the bytes and granting requests.
Prior to this PR, the slow path logic involved a leader election algorithm to
pick one thread to perform (1) followed by (2). It elected the thread whose
request was at the front of the highest priority non-empty queue since that
request was most likely to be granted. This algorithm was efficient in terms of
reducing intermediate wakeups, which is a thread waking up only to resume
waiting after finding its request is not granted. However, the conceptual
complexity of this algorithm was too high. It took me a long time to draw a
timeline to understand how it works for just one edge case yet there were so
many.
This PR drops the leader election to reduce conceptual complexity. Now, the two
duties can be performed by whichever thread acquires the lock first. The risk
of this change is increasing the number of intermediate wakeups, however, we
took steps to mitigate that.
- `wait_until_refill_pending_` flag ensures only one thread performs (1). This\
prevents the thundering herd problem at the next refill time. The remaining\
threads wait on their condition variable with an unbounded duration -- thus we\
must remember to notify them to ensure forward progress.
- (1) is typically done by a thread at the front of a queue. This is trivial\
when the queues are initially empty as the first choice that arrives must be\
the only entry in its queue. When queues are initially non-empty, we achieve\
this by having (2) notify a thread at the front of a queue (preferring higher\
priority) to perform the next duty.
- We do not require any additional wakeup for (2). Typically it will just be\
done by the thread that finished (1).
Combined, the second and third bullet points above suggest the refill/granting
will typically be done by a request at the front of its queue. This is
important because one wakeup is saved when a granted request happens to be in an
already running thread.
Note there are a few cases that still lead to intermediate wakeup, however. The
first two are existing issues that also apply to the old algorithm, however, the
third (including both subpoints) is new.
- No request may be granted (only possible when rate limit dynamically\
decreases).
- Requests from a different queue may be granted.
- (2) may be run by a non-front request thread causing it to not be granted even\
if some requests in that same queue are granted. It can happen for a couple\
(unlikely) reasons.
- A new request may sneak in and grab the lock at the refill time, before the\
thread finishing (1) can wake up and grab it.
- A new request may sneak in and grab the lock and execute (1) before (2)'s\
chosen candidate can wake up and grab the lock. Then that non-front request\
thread performing (1) can carry over to perform (2).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8602
Test Plan:
- Use existing tests. The edge cases listed in the comment are all performance\
related; I could not really think of any related to correctness. The logic\
looks the same whether a thread wakes up/finishes its work early/on-time/late,\
or whether the thread is chosen vs. "steals" the work.
- Verified write throughput and CPU overhead are basically the same with and\
without this change, even in a rate limiter heavy workload:
Test command:
```
$ rm -rf /dev/shm/dbbench/ && TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -benchmarks=fillrandom -num_multi_db=64 -num_low_pri_threads=64 -num_high_pri_threads=64 -write_buffer_size=262144 -target_file_size_base=262144 -max_bytes_for_level_base=1048576 -rate_limiter_bytes_per_sec=16777216 -key_size=24 -value_size=1000 -num=10000 -compression_type=none -rate_limiter_refill_period_us=1000
```
Results before this PR:
```
fillrandom : 108.463 micros/op 9219 ops/sec; 9.0 MB/s
7.40user 8.84system 1:26.20elapsed 18%CPU (0avgtext+0avgdata 256140maxresident)k
```
Results after this PR:
```
fillrandom : 108.108 micros/op 9250 ops/sec; 9.0 MB/s
7.45user 8.23system 1:26.68elapsed 18%CPU (0avgtext+0avgdata 255688maxresident)k
```
Reviewed By: hx235
Differential Revision: D30048013
Pulled By: ajkr
fbshipit-source-id: 6741bba9d9dfbccab359806d725105817fef818b
Summary:
Some FIFO users want to keep the data for longer, but the old data is rarely accessed. This feature allows users to configure FIFO compaction so that data older than a threshold is moved to a warm storage tier.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8310
Test Plan: Add several unit tests.
Reviewed By: ajkr
Differential Revision: D28493792
fbshipit-source-id: c14824ea634814dee5278b449ab5c98b6e0b5501
Summary:
- Changed MergeOperator, CompactionFilter, and CompactionFilterFactory into Customizable classes.
- Added Options/Configurable/Object Registration for TTL and Cassandra variants
- Changed the StringAppend MergeOperators to accept a string delimiter rather than a simple char. Made the delimiter into a configurable option
- Added tests for new functionality
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8481
Reviewed By: zhichao-cao
Differential Revision: D30136050
Pulled By: mrambacher
fbshipit-source-id: 271d1772835935b6773abaf018ee71e42f9491af
Summary:
Introduction of a new `fillanddeleteuniquerandom` benchmark (`db_bench`) with 5 new option flags to simulate a benchmark where the following sequence is repeated multiple times:
"A set of keys S1 is inserted ('`disposable entries`'), then after some delay another set of keys S2 is inserted ('`persistent entries`') and the first set of keys S1 is deleted. S2 artificially represents the insertion of hypothetical results from some undefined computation done on the first set of keys S1. The next sequence can start as soon as the last disposable entry in the set S1 of this sequence is inserted, if the `delay` is non negligible."
New flags:
- `disposable_entries_delete_delay`: minimum delay in microseconds between insertion of the last `disposable` entry, and the start of the insertion of the first `persistent` entry.
- `disposable_entries_batch_size`: number of `disposable` entries inserted at the beginning of each sequence.
- `disposable_entries_value_size`: size of the random `value` string for the `disposable` entries.
- `persistent_entries_batch_size`: number of `persistent` entries inserted at the end of each sequence, right before the deletion of the `disposable` entries starts.
- `persistent_entries_value_size`: size of the random value string for the `persistent` entries.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8593
Reviewed By: pdillinger
Differential Revision: D29974436
Pulled By: bjlemaire
fbshipit-source-id: f578033e5b45e8268ba6fa6f38f4770c2e6e801d
Summary:
- Added Type/CreateFromString
- Added ability to load EventListeners to DBOptions
- Since EventListeners did not previously have a Name(), defaulted to "". If there is no name, the listener cannot be loaded from the ObjectRegistry.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8473
Reviewed By: zhichao-cao
Differential Revision: D29901488
Pulled By: mrambacher
fbshipit-source-id: 2d3a4aa6db1562ac03e7ad41b360e3521d486254
Summary:
Add `experimental_mempurge_policy` option flag and introduce two new `MemPurge` (Memtable Garbage Collection) policies: 'ALWAYS' and 'ALTERNATE'. Default value: ALTERNATE.
`ALWAYS`: every flush will first go through a `MemPurge` process. If the output is too big to fit into a single memtable, then the mempurge is aborted and a regular flush process carries on. `ALWAYS` is designed for user that need to reduce the number of L0 SST file created to a strict minimum, and can afford a small dent in performance (possibly hits to CPU usage, read efficiency, and maximum burst write throughput).
`ALTERNATE`: a flush is transformed into a `MemPurge` except if one of the memtables being flushed is the product of a previous `MemPurge`. `ALTERNATE` is a good tradeoff between reduction in number of L0 SST files created and performance. `ALTERNATE` perform particularly well for completely random garbage ratios, or garbage ratios anywhere in (0%,50%], and even higher when there is a wild variability in garbage ratios.
This PR also includes support for `experimental_mempurge_policy` in `db_bench`.
Testing was done locally by replacing all the `MemPurge` policies of the unit tests with `ALTERNATE`, as well as local testing with `db_crashtest.py` `whitebox` and `blackbox`. Overall, if an `ALWAYS` mempurge policy passes the tests, there is no reasons why an `ALTERNATE` policy would fail, and therefore the mempurge policy was set to `ALWAYS` for all mempurge unit tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8583
Reviewed By: pdillinger
Differential Revision: D29888050
Pulled By: bjlemaire
fbshipit-source-id: e2cf26646d66679f6f5fb29842624615610759c1
Summary:
PR https://github.com/facebook/rocksdb/issues/8519 fix db_bench_tool.cc for MSVC build errors by simply copy-paste, this PR fix the copy-paste while also works for MSVC.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8553
Reviewed By: ajkr
Differential Revision: D29838056
Pulled By: jay-zhuang
fbshipit-source-id: 0cd60c146b87a355c3dc1061dfe813169d75cea4
Summary:
Add flags `overwrite_probability` and `overwrite_window_size` flag to `db_bench`.
Add the possibility of performing a `filluniquerandom` benchmark with an overwrite probability.
For each write operation, there is a probability _p_ that the write is an overwrite (_p_=`overwrite_probability`).
When an overwrite is decided, the key is randomly chosen from the last _N_ keys previously inserted into the DB (with _N_=`overwrite_window_size`).
When a pure write is decided, the key inserted into the DB is unique and therefore will not be an overwrite.
The `overwrite_window_size` is used so that the user can decide if the overwrite are mostly targeting recently inserted keys (when `overwrite_window_size` is small compared to the total number of writes), or can also target keys inserted "a long time ago" (when `overwrite_window_size` is comparable to total number of writes).
Note that total number of writes = # of unique insertions + # of overwrites.
No unit test specifically added.
Local testing show the following **throughputs** for `filluniquerandom` with 1M total writes:
- bypass the code inserts (no `overwrite_probability` flag specified): ~14.0MB/s
- `overwrite_probability=0.99`, `overwrite_window_size=10`: ~17.0MB/s
- `overwrite_probability=0.10`, `overwrite_window_size=10`: ~14.0MB/s
- `overwrite_probability=0.99`, `overwrite_window_size=1M`: ~14.5MB/s
- `overwrite_probability=0.10`, `overwrite_window_size=1M`: ~14.0MB/s
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8569
Reviewed By: pdillinger
Differential Revision: D29818631
Pulled By: bjlemaire
fbshipit-source-id: d472b4ea4e457a4da7c4ee4f14b40cccd6a4587a
Summary:
Tiny PR to add the `experimental_allow_mempurge` to the `db_bench` tool (`Mempurge` is the current prototype for memtable garbage collection).
This is useful to benchmark the prototype of this new feature, stress test it and help find new meaningful heuristics for GC.
By default, the flag to allow `mempurge` is set to `false`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8546
Reviewed By: anand1976
Differential Revision: D29738338
Pulled By: bjlemaire
fbshipit-source-id: 01892883a2f1c714c110718674da05992d6e2dd6
Summary:
Right now, db_bench with seekrandom and multiple DB setup creates iterator for all DBs just to query one of them. It's different from most real workloads. Fix it by only creating iterators that will be queried.
Also fix a bug that DBs are not destroyed in multi-DB mode.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7818
Test Plan: Run db_bench with single/multiDB X using/not using tailing iterator with ASAN build, and validate the behavior is expected.
Reviewed By: ajkr
Differential Revision: D25720226
fbshipit-source-id: c2ff7ff7120e5ba64287a30b057c5d29b2cbe20b
Summary:
Fixed a few MSVC (VCToolsVersion=14.0) build errors and warnings
* `DEFINE_string` is a macro and VC compiler complains that it cannot put [ifdef-inside-define](https://stackoverflow.com/questions/5586429/ifdef-inside-define)
* `sleep()` is not a recognizable function. Use `FLAGS_env->SleepForMicroseconds` instead
* Define precise type in comparison to avoid mismatch warning
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8519
Reviewed By: jay-zhuang
Differential Revision: D29683086
fbshipit-source-id: 8c80941472089f8daba84ae29597e75e603850e4
Summary:
1. Fix printing of stats when there are no writes (wamp=0). Previously had a div0 error
2. Added multireadrandom command as a valid target
3. Added ability to pass additional command line options to db_bench. Now can say things like benchmark.sh readrandom --mmap_read and the option will be passed to db_bench.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8346
Reviewed By: zhichao-cao
Differential Revision: D29500436
Pulled By: mrambacher
fbshipit-source-id: 54e90708aae9133be3a903e35efdf8f8abbd86fa
Summary:
Hello and thanks for RocksDB,
Here is a PR to add file deletes, renames and ```Flush()```, ```Sync()```, ```Fsync()``` and ```Close()``` to file ops report.
The reason is to help tune RocksDB options when using an env/filesystem with high latencies for file level ("metadata") operations, typically seen during ```DB::Open``` (```db_bench -num 0``` also see https://github.com/facebook/rocksdb/pull/7203 where IOTracing does not trace ```DB::Open```).
Before:
```
> db_bench -benchmarks updaterandom -num 0 -report_file_operations true
...
Entries: 0
...
Num files opened: 12
Num Read(): 6
Num Append(): 8
Num bytes read: 6216
Num bytes written: 6289
```
After:
```
> db_bench -benchmarks updaterandom -num 0 -report_file_operations true
...
Entries: 0
...
Num files opened: 12
Num files deleted: 3
Num files renamed: 4
Num Flush(): 10
Num Sync(): 5
Num Fsync(): 1
Num Close(): 2
Num Read(): 6
Num Append(): 8
Num bytes read: 6216
Num bytes written: 6289
```
Before:
```
> db_bench -benchmarks updaterandom -report_file_operations true
...
Entries: 1000000
...
Num files opened: 18
Num Read(): 396339
Num Append(): 1000058
Num bytes read: 892030224
Num bytes written: 187569238
```
After:
```
> db_bench -benchmarks updaterandom -report_file_operations true
...
Entries: 1000000
...
Num files opened: 18
Num files deleted: 5
Num files renamed: 4
Num Flush(): 1000068
Num Sync(): 9
Num Fsync(): 1
Num Close(): 6
Num Read(): 396339
Num Append(): 1000058
Num bytes read: 892030224
Num bytes written: 187569238
```
Another example showing how using ```DB::OpenForReadOnly``` reduces file operations compared to ```((Optimistic)Transaction)DB::Open```:
```
> db_bench -benchmarks updaterandom -num 1
> db_bench -benchmarks updaterandom -num 0 -use_existing_db true -readonly true -report_file_operations true
...
Entries: 0
...
Num files opened: 8
Num files deleted: 0
Num files renamed: 0
Num Flush(): 0
Num Sync(): 0
Num Fsync(): 0
Num Close(): 0
Num Read(): 13
Num Append(): 0
Num bytes read: 374
Num bytes written: 0
```
```
> db_bench -benchmarks updaterandom -num 1
> db_bench -benchmarks updaterandom -num 0 -use_existing_db true -report_file_operations true
...
Entries: 0
...
Num files opened: 14
Num files deleted: 3
Num files renamed: 4
Num Flush(): 14
Num Sync(): 5
Num Fsync(): 1
Num Close(): 3
Num Read(): 11
Num Append(): 10
Num bytes read: 7291
Num bytes written: 7357
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8448
Reviewed By: anand1976
Differential Revision: D29333818
Pulled By: zhichao-cao
fbshipit-source-id: a06a8c87f799806462319115195b3e94faf5f542
Summary:
This PR prepopulates warm/hot data blocks which are already in memory
into block cache at the time of flush. On a flush, the data block that is
in memory (in memtables) get flushed to the device. If using Direct IO,
additional IO is incurred to read this data back into memory again, which
is avoided by enabling newly added option.
Right now, this is enabled only for flush for data blocks. We plan to
expand this option to cover compactions in the future and for other types
of blocks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8242
Test Plan: Add new unit test
Reviewed By: anand1976
Differential Revision: D28521703
Pulled By: akankshamahajan15
fbshipit-source-id: 7219d6958821cedce689a219c3963a6f1a9d5f05
Summary:
Marked the Ribbon filter and optimize_filters_for_memory features
as production-ready, each enabling memory savings for Bloom-like filters.
Use `NewRibbonFilterPolicy` in place of `NewBloomFilterPolicy` to use
Ribbon filters instead of Bloom, or `ribbonfilter` in place of
`bloomfilter` in configuration string.
Some small refactoring in db_stress.
Removed/refactored unused code in db_bench, in part preparing for future
default possibly being different from "disabled."
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8408
Test Plan:
Lots of prior automated, ad-hoc, and "real world" testing.
Updated tests for new API names. Quick db_bench test:
bloom fillrandom
77730 ops/sec
rocksdb.block.cache.filter.bytes.insert COUNT : 89929384
ribbon fillrandom
71492 ops/sec
rocksdb.block.cache.filter.bytes.insert COUNT : 64531384
Reviewed By: mrambacher
Differential Revision: D29140805
Pulled By: pdillinger
fbshipit-source-id: d742c922722421678f95ad85eeb0aaebc9f5e49a
Summary:
- Added CreateFromString method to Env and FilesSystem to replace LoadEnv/Load. This method/signature is a precursor to making these classes extend Customizable.
- Added CreateFromSystem to Env. This method standardizes creating an Env from the environment variables. Previously, some places would check TEST_ENV_URI and others would also check TEST_FS_URI. Now the code is more command/standardized.
- Added CreateFromFlags to Env. These method allows Env to be create from string options (such as GFLAGS options) in a more standard way.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8174
Reviewed By: zhichao-cao
Differential Revision: D28999603
Pulled By: mrambacher
fbshipit-source-id: 88e6911e7e91f908458a7fe10a20e93ecbc275fb
Summary:
This change gathers and publishes statistics about the
kinds of items in block cache. This is especially important for
profiling relative usage of cache by index vs. filter vs. data blocks.
It works by iterating over the cache during periodic stats dump
(InternalStats, stats_dump_period_sec) or on demand when
DB::Get(Map)Property(kBlockCacheEntryStats), except that for
efficiency and sharing among column families, saved data from
the last scan is used when the data is not considered too old.
The new information can be seen in info LOG, for example:
Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0
Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%)
And also through DB::GetProperty and GetMapProperty (here using
ldb just for demonstration):
$ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats
rocksdb.block-cache-entry-stats.bytes.data-block: 0
rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0
rocksdb.block-cache-entry-stats.bytes.filter-block: 0
rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0
rocksdb.block-cache-entry-stats.bytes.index-block: 178992
rocksdb.block-cache-entry-stats.bytes.misc: 0
rocksdb.block-cache-entry-stats.bytes.other-block: 0
rocksdb.block-cache-entry-stats.bytes.write-buffer: 0
rocksdb.block-cache-entry-stats.capacity: 8388608
rocksdb.block-cache-entry-stats.count.data-block: 0
rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0
rocksdb.block-cache-entry-stats.count.filter-block: 0
rocksdb.block-cache-entry-stats.count.filter-meta-block: 0
rocksdb.block-cache-entry-stats.count.index-block: 215
rocksdb.block-cache-entry-stats.count.misc: 1
rocksdb.block-cache-entry-stats.count.other-block: 0
rocksdb.block-cache-entry-stats.count.write-buffer: 0
rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290
rocksdb.block-cache-entry-stats.percent.data-block: 0.000000
rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000
rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000
rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000
rocksdb.block-cache-entry-stats.percent.index-block: 2.133751
rocksdb.block-cache-entry-stats.percent.misc: 0.000000
rocksdb.block-cache-entry-stats.percent.other-block: 0.000000
rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000
rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052
rocksdb.block-cache-entry-stats.secs_since_last_collection: 0
Solution detail - We need some way to flag what kind of blocks each
entry belongs to, preferably without changing the Cache API.
One of the complications is that Cache is a general interface that could
have other users that don't adhere to whichever convention we decide
on for keys and values. Or we would pay for an extra field in the Handle
that would only be used for this purpose.
This change uses a back-door approach, the deleter, to indicate the
"role" of a Cache entry (in addition to the value type, implicitly).
This has the added benefit of ensuring proper code origin whenever we
recognize a particular role for a cache entry; if the entry came from
some other part of the code, it will use an unrecognized deleter, which
we simply attribute to the "Misc" role.
An internal API makes for simple instantiation and automatic
registration of Cache deleters for a given value type and "role".
Another internal API, CacheEntryStatsCollector, solves the problem of
caching the results of a scan and sharing them, to ensure scans are
neither excessive nor redundant so as not to harm Cache performance.
Because code is added to BlocklikeTraits, it is pulled out of
block_based_table_reader.cc into its own file.
This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option
(could still be added), and with actual stat gathering.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297
Test Plan: manual testing with db_bench, and a couple of basic unit tests
Reviewed By: ltamasi
Differential Revision: D28488721
Pulled By: pdillinger
fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
Summary:
This PR adds a ```-secondary_cache_uri``` option to the cache_bench and db_bench tools to allow the user to specify a custom secondary cache URI. The object registry is used to create an instance of the ```SecondaryCache``` object of the type specified in the URI.
The main cache_bench code is packaged into a separate library, similar to db_bench.
An example invocation of db_bench with a secondary cache URI -
```db_bench --env_uri=ws://ws.flash_sandbox.vll1_2/ -db=anand/nvm_cache_2 -use_existing_db=true -benchmarks=readrandom -num=30000000 -key_size=32 -value_size=256 -use_direct_reads=true -cache_size=67108864 -cache_index_and_filter_blocks=true -secondary_cache_uri='cachelibwrapper://filename=/home/anand76/nvm_cache/cache_file;size=2147483648;regionSize=16777216;admPolicy=random;admProbability=1.0;volatileSize=8388608;bktPower=20;lockPower=12' -partition_index_and_filters=true -duration=1800```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8312
Reviewed By: zhichao-cao
Differential Revision: D28544325
Pulled By: anand1976
fbshipit-source-id: 8f209b9af900c459dc42daa7a610d5f00176eeed
Summary:
As the first part of the effort of having placing different files on different storage types, this change introduces several things:
(1) An experimental interface in FileSystem that specify temperature to a new file created.
(2) A test FileSystemWrapper, SimulatedHybridFileSystem, that simulates HDD for a file of "warm" temperature.
(3) A simple experimental feature ColumnFamilyOptions.bottommost_temperature. RocksDB would pass this value to FileSystem when creating any bottommost file.
(4) A db_bench parameter that applies the (2) and (3) to db_bench.
The motivation of the change is to introduce minimal changes that allow us to evolve tiered storage development.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8222
Test Plan:
./db_bench --benchmarks=fillrandom --write_buffer_size=2000000 -max_bytes_for_level_base=20000000 -level_compaction_dynamic_level_bytes --reads=100 -compaction_readahead_size=20000000 --reads=100000 -num=10000000
followed by
./db_bench --benchmarks=readrandom,stats --write_buffer_size=2000000 -max_bytes_for_level_base=20000000 -simulate_hybrid_fs_file=/tmp/warm_file_list -level_compaction_dynamic_level_bytes -compaction_readahead_size=20000000 --reads=500 --threads=16 -use_existing_db --num=10000000
and see results as expected.
Reviewed By: ajkr
Differential Revision: D28003028
fbshipit-source-id: 4724896d5205730227ba2f17c3fecb11261744ce
Summary:
Fixes https://github.com/facebook/rocksdb/issues/6548.
If we do not reset the pinnable slice before calling get, we will see the following assertion failure
while running the test with multiple column families.
```
db_bench: ./include/rocksdb/slice.h:168: void rocksdb::PinnableSlice::PinSlice(const rocksdb::Slice&, rocksdb::Cleanable*): Assertion `!pinned_' failed.
```
This happens in `BlockBasedTable::Get()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8154
Test Plan:
./db_bench --benchmarks=fillseq -num_column_families=3
./db_bench --benchmarks=readrandom -use_existing_db=1 -num_column_families=3
Reviewed By: ajkr
Differential Revision: D27587589
Pulled By: riversand963
fbshipit-source-id: 7379e7649ba40f046d6a4014c9ad629cb3f9a786
Summary:
The check in db_bench for table_cache_numshardbits was 0 < bits <= 20, whereas the check in LRUCache was 0 < bits < 20. Changed the two values to match to avoid a crash in db_bench on a null cache.
Fixes https://github.com/facebook/rocksdb/issues/7393
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8110
Reviewed By: zhichao-cao
Differential Revision: D27353522
Pulled By: mrambacher
fbshipit-source-id: a414bd23b5bde1f071146b34cfca5e35c02de869
Summary:
Add the new Append and PositionedAppend API to env WritableFile. User is able to benefit from the write checksum handoff API when using the legacy Env classes. FileSystem already implemented the checksum handoff API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8071
Test Plan: make check, added new unit test.
Reviewed By: anand1976
Differential Revision: D27177043
Pulled By: zhichao-cao
fbshipit-source-id: 430c8331fc81099fa6d00f4fff703b68b9e8080e
Summary:
The new options are:
* compact0 - compact L0 into L1 using one thread
* compact1 - compact L1 into L2 using one thread
* flush - flush memtable
* waitforcompaction - wait for compaction to finish
These are useful for reproducible benchmarks to help get the LSM tree shape
into a deterministic state. I wrote about this at:
http://smalldatum.blogspot.com/2021/02/read-only-benchmarks-with-lsm-are.html
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8027
Reviewed By: riversand963
Differential Revision: D27053861
Pulled By: ajkr
fbshipit-source-id: 1646f35584a3db03740fbeb47d91c3f00fb35d6e
Summary:
For performance purposes, the lower level routines were changed to use a SystemClock* instead of a std::shared_ptr<SystemClock>. The shared ptr has some performance degradation on certain hardware classes.
For most of the system, there is no risk of the pointer being deleted/invalid because the shared_ptr will be stored elsewhere. For example, the ImmutableDBOptions stores the Env which has a std::shared_ptr<SystemClock> in it. The SystemClock* within the ImmutableDBOptions is essentially a "short cut" to gain access to this constant resource.
There were a few classes (PeriodicWorkScheduler?) where the "short cut" property did not hold. In those cases, the shared pointer was preserved.
Using db_bench readrandom perf_level=3 on my EC2 box, this change performed as well or better than 6.17:
6.17: readrandom : 28.046 micros/op 854902 ops/sec; 61.3 MB/s (355999 of 355999 found)
6.18: readrandom : 32.615 micros/op 735306 ops/sec; 52.7 MB/s (290999 of 290999 found)
PR: readrandom : 27.500 micros/op 871909 ops/sec; 62.5 MB/s (367999 of 367999 found)
(Note that the times for 6.18 are prior to revert of the SystemClock).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8033
Reviewed By: pdillinger
Differential Revision: D27014563
Pulled By: mrambacher
fbshipit-source-id: ad0459eba03182e454391b5926bf5cdd45657b67
Summary:
For dictionary compression, we need to collect some representative samples of the data to be compressed, which we use to either generate or train (when `CompressionOptions::zstd_max_train_bytes > 0`) a dictionary. Previously, the strategy was to buffer all the data blocks during flush, and up to the target file size during compaction. That strategy allowed us to randomly pick samples from as wide a range as possible that'd be guaranteed to land in a single output file.
However, some users try to make huge files in memory-constrained environments, where this strategy can cause OOM. This PR introduces an option, `CompressionOptions::max_dict_buffer_bytes`, that limits how much data blocks are buffered before we switch to unbuffered mode (which means creating the per-SST dictionary, writing out the buffered data, and compressing/writing new blocks as soon as they are built). It is not strict as we currently buffer more than just data blocks -- also keys are buffered. But it does make a step towards giving users predictable memory usage.
Related changes include:
- Changed sampling for dictionary compression to select unique data blocks when there is limited availability of data blocks
- Made use of `BlockBuilder::SwapAndReset()` to save an allocation+memcpy when buffering data blocks for building a dictionary
- Changed `ParseBoolean()` to accept an input containing characters after the boolean. This is necessary since, with this PR, a value for `CompressionOptions::enabled` is no longer necessarily the final component in the `CompressionOptions` string.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7970
Test Plan:
- updated `CompressionOptions` unit tests to verify limit is respected (to the extent expected in the current implementation) in various scenarios of flush/compaction to bottommost/non-bottommost level
- looked at jemalloc heap profiles right before and after switching to unbuffered mode during flush/compaction. Verified memory usage in buffering is proportional to the limit set.
Reviewed By: pdillinger
Differential Revision: D26467994
Pulled By: ajkr
fbshipit-source-id: 3da4ef9fba59974e4ef40e40c01611002c861465
Summary:
The patch adds the configuration options of the new BlobDB implementation
to `db_bench` and adjusts the help messages of the old (`StackableDB`-based)
BlobDB's options to make it clear which implementation they pertain to.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7956
Test Plan: Ran `make check` and `db_bench` with the new options.
Reviewed By: jay-zhuang
Differential Revision: D26384808
Pulled By: ltamasi
fbshipit-source-id: b4405bb2c56cfd3506d4c32e3329c08dfdf69c94
Summary:
Currently, db_bench cleanup only deletes the main DB, if there's one.
Multiple DBs that are opened when --num_multi_db is specified are not
deleted, which can lead to crashes due to running compaction threads on
process exit.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7891
Test Plan: Run regression test
Reviewed By: jay-zhuang
Differential Revision: D26049914
Pulled By: anand1976
fbshipit-source-id: acef2821001ca5e208a96a6a273c724e56353316
Summary:
The multireadrandom benchmark, when run for a specific number of reads (--reads argument), should base the duration on the actual number of keys read rather than number of batches.
Tests:
Run db_bench multireadrandom benchmark
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7817
Reviewed By: zhichao-cao
Differential Revision: D25717230
Pulled By: anand1976
fbshipit-source-id: 13f4d8162268cf9a34918655e60302d0aba3864b
Summary:
Primarily this change refactors the optimize_filters_for_memory
code for Bloom filters, based on malloc_usable_size, to also work for
Ribbon filters.
This change also replaces the somewhat slow but general
BuiltinFilterBitsBuilder::ApproximateNumEntries with
implementation-specific versions for Ribbon (new) and Legacy Bloom
(based on a recently deleted version). The reason is to emphasize
speed in ApproximateNumEntries rather than 100% accuracy.
Justification: ApproximateNumEntries (formerly CalculateNumEntry) is
only used by RocksDB for range-partitioned filters, called each time we
start to construct one. (In theory, it should be possible to reuse the
estimate, but the abstractions provided by FilterPolicy don't really
make that workable.) But this is only used as a heuristic estimate for
hitting a desired partitioned filter size because of alignment to data
blocks, which have various numbers of unique keys or prefixes. The two
factors lead us to prioritize reasonable speed over 100% accuracy.
optimize_filters_for_memory adds extra complication, because precisely
calculating num_entries for some allowed number of bytes depends on state
with optimize_filters_for_memory enabled. And the allocator-agnostic
implementation of optimize_filters_for_memory, using malloc_usable_size,
means we would have to actually allocate memory, many times, just to
precisely determine how many entries (keys) could be added and stay below
some size budget, for the current state. (In a draft, I got this
working, and then realized the balance of speed vs. accuracy was all
wrong.)
So related to that, I have made CalculateSpace, an internal-only API
only used for testing, non-authoritative also if
optimize_filters_for_memory is enabled. This simplifies some code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7774
Test Plan:
unit test updated, and for FilterSize test, range of tested
values is greatly expanded (still super fast)
Also tested `db_bench -benchmarks=fillrandom,stats -bloom_bits=10 -num=1000000 -partition_index_and_filters -format_version=5 [-optimize_filters_for_memory] [-use_ribbon_filter]` with temporary debug output of generated filter sizes.
Bloom+optimize_filters_for_memory:
1 Filter size: 197 (224 in memory)
134 Filter size: 3525 (3584 in memory)
107 Filter size: 4037 (4096 in memory)
Total on disk: 904,506
Total in memory: 918,752
Ribbon+optimize_filters_for_memory:
1 Filter size: 3061 (3072 in memory)
110 Filter size: 3573 (3584 in memory)
58 Filter size: 4085 (4096 in memory)
Total on disk: 633,021 (-30.0%)
Total in memory: 634,880 (-30.9%)
Bloom (no offm):
1 Filter size: 261 (320 in memory)
1 Filter size: 3333 (3584 in memory)
240 Filter size: 3717 (4096 in memory)
Total on disk: 895,674 (-1% on disk vs. +offm; known tolerable overhead of offm)
Total in memory: 986,944 (+7.4% vs. +offm)
Ribbon (no offm):
1 Filter size: 2949 (3072 in memory)
1 Filter size: 3381 (3584 in memory)
167 Filter size: 3701 (4096 in memory)
Total on disk: 624,397 (-30.3% vs. Bloom)
Total in memory: 690,688 (-30.0% vs. Bloom)
Note that optimize_filters_for_memory is even more effective for Ribbon filter than for cache-local Bloom, because it can close the unused memory gap even tighter than Bloom filter, because of 16 byte increments for Ribbon vs. 64 byte increments for Bloom.
Reviewed By: jay-zhuang
Differential Revision: D25592970
Pulled By: pdillinger
fbshipit-source-id: 606fdaa025bb790d7e9c21601e8ea86e10541912
Summary:
db_bench currently does not allow overriding the default `arena_block_size `calculation ([memtable size/8](https://github.com/facebook/rocksdb/blob/master/db/column_family.cc#L216)). For memtables whose size is in gigabytes, the `arena_block_size` defaults to hundreds of megabytes (affecting performance).
Exposing this option in db_bench would allow us to test the workloads with various `arena_block_size` values.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7654
Reviewed By: jay-zhuang
Differential Revision: D24996812
Pulled By: ajkr
fbshipit-source-id: a5e3d2c83d9f89e1bb8382f2e8dd476c79e33bef
Summary:
This is a PR generated **semi-automatically** by an internal tool to remove unused includes and `using` statements.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7604
Test Plan: make check
Reviewed By: ajkr
Differential Revision: D24579392
Pulled By: riversand963
fbshipit-source-id: c4bfa6c6b08da1de186690d37eb73d8fff45aecd
Summary:
The patch introduces a helper method in `util/compression.h` called `UncompressData`
that dispatches calls to the correct uncompression method based on type, and changes
`UncompressBlockContentsForCompressionType` and `Benchmark::Uncompress` in
`db_bench` so they are implemented in terms of the new method. This eliminates
some code duplication. (`Benchmark::Compress` is also updated to use the previously
introduced `CompressData` helper.)
In addition, the patch brings the implementation of `Snappy_Uncompress` into sync with
the other uncompression methods by making the method compute the buffer size and allocate
the buffer itself. Finally, the patch eliminates some potentially risky back-and-forth conversions
between various unsigned and signed integer types by exposing the size of the allocated buffer
as a `size_t` instead of an `int`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7434
Test Plan:
`make check`
`./db_bench -benchmarks=compress,uncompress --compression_type ...`
Reviewed By: riversand963
Differential Revision: D23900011
Pulled By: ltamasi
fbshipit-source-id: b25df63ceec4639889be94acb22eb53e530c54e0
Summary:
Update db_bench so that we can run it with user-defined timestamp.
Currently, only 64-bit timestamp is supported, while others are disabled by assertion.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7389
Test Plan: ./db_bench -benchmarks=fillseq,fillrandom,readrandom,readsequential,....., -user_timestamp_size=8
Reviewed By: ltamasi
Differential Revision: D23720830
Pulled By: riversand963
fbshipit-source-id: 486eacbb82de9a5441e79a61bfa9beef6581608a
Summary:
This PR merges the functionality of making the ColumnFamilyOptions, TableFactory, and DBOptions into Configurable into a single PR, resolving any merge conflicts
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5753
Reviewed By: ajkr
Differential Revision: D23385030
Pulled By: zhichao-cao
fbshipit-source-id: 8b977a7731556230b9b8c5a081b98e49ee4f160a
Summary:
Also enables a pull request to trigger all the Travis
configurations by writing FULL_CI in the commit message. (See what I did
there?)
First issue
make: *** No rule to make target 'jl/util/crc32c_ppc_asm.o', needed by 'rocksdbjava'. Stop.
Second issue
tools/db_bench_tool.cc:5514:38: error: ‘gen_exp.rocksdb::Benchmark::GenerateTwoTermExpKeys::keyrange_size_’ may be used uninitialized in this function
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7359
Test Plan: CI
Reviewed By: zhichao-cao
Differential Revision: D23582132
Pulled By: pdillinger
fbshipit-source-id: 06d794673fd522ba11cf6398385387e6bd97ef89
Summary:
Delete database instances to make sure there are no loose threads
running before exit(). This fixes segfaults seen when running
workloads through CompositeEnvs with custom file systems.
For further background on the issues arising when using CompositeEnvs, see the discussion in:
https://github.com/facebook/rocksdb/pull/6878
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7327
Reviewed By: cheng-chang
Differential Revision: D23433244
Pulled By: ajkr
fbshipit-source-id: 4e19cf2067e3fe68c2a3fe1823f24b4091336bbe
Summary:
This pull request adds the parameter --fs_uri to db_bench and db_stress, creating a composite env combining the default env with a specified registered rocksdb file system.
This makes it easier to develop and test new RocksDB FileSystems.
The pull request also registers the posix file system for testing purposes.
Examples:
```
$./db_bench --fs_uri=posix:// --benchmarks=fillseq
$./db_stress --fs_uri=zenfs://nullb1
```
zenfs is a RocksDB FileSystem I'm developing to add support for zoned block devices, and in that case the zoned block device is specified in the uri (a zoned null block device in the above example).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6878
Reviewed By: siying
Differential Revision: D23023063
Pulled By: ajkr
fbshipit-source-id: 8b3fe7193ce45e683043b021779b7a4d547af247
Summary:
Adds compaction statistics (total bytes read and written) for compactions that occur for delete-triggered, periodic, and TTL compaction reasons.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7165
Test Plan:
TTL and periodic can be checked by runnning db_bench with the options activated:
/db_bench --benchmarks="fillrandom,stats" --statistics --num=10000000 -base_background_compactions=16 -periodic_compaction_seconds=1
./db_bench --benchmarks="fillrandom,stats" --statistics --num=10000000 -base_background_compactions=16 -fifo_compaction_ttl=1
Setting the time to one second causes non-zero bytes read/written for those compaction reasons. Disabling them or setting them to times longer than the test run length causes the stats to return to zero as expected.
Delete-triggered compaction counting is tested in DBTablePropertiesTest.DeletionTriggeredCompactionMarking
Reviewed By: ajkr
Differential Revision: D22693050
Pulled By: akabcenell
fbshipit-source-id: d15cef4d94576f703015c8942d5f0d492f69401d