Summary:
- Added Type/CreateFromString
- Added ability to load EventListeners to DBOptions
- Since EventListeners did not previously have a Name(), defaulted to "". If there is no name, the listener cannot be loaded from the ObjectRegistry.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8473
Reviewed By: zhichao-cao
Differential Revision: D29901488
Pulled By: mrambacher
fbshipit-source-id: 2d3a4aa6db1562ac03e7ad41b360e3521d486254
Summary:
Add `experimental_mempurge_policy` option flag and introduce two new `MemPurge` (Memtable Garbage Collection) policies: 'ALWAYS' and 'ALTERNATE'. Default value: ALTERNATE.
`ALWAYS`: every flush will first go through a `MemPurge` process. If the output is too big to fit into a single memtable, then the mempurge is aborted and a regular flush process carries on. `ALWAYS` is designed for user that need to reduce the number of L0 SST file created to a strict minimum, and can afford a small dent in performance (possibly hits to CPU usage, read efficiency, and maximum burst write throughput).
`ALTERNATE`: a flush is transformed into a `MemPurge` except if one of the memtables being flushed is the product of a previous `MemPurge`. `ALTERNATE` is a good tradeoff between reduction in number of L0 SST files created and performance. `ALTERNATE` perform particularly well for completely random garbage ratios, or garbage ratios anywhere in (0%,50%], and even higher when there is a wild variability in garbage ratios.
This PR also includes support for `experimental_mempurge_policy` in `db_bench`.
Testing was done locally by replacing all the `MemPurge` policies of the unit tests with `ALTERNATE`, as well as local testing with `db_crashtest.py` `whitebox` and `blackbox`. Overall, if an `ALWAYS` mempurge policy passes the tests, there is no reasons why an `ALTERNATE` policy would fail, and therefore the mempurge policy was set to `ALWAYS` for all mempurge unit tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8583
Reviewed By: pdillinger
Differential Revision: D29888050
Pulled By: bjlemaire
fbshipit-source-id: e2cf26646d66679f6f5fb29842624615610759c1
Summary:
Originally the 2 options `db_log_dir` and `wal_dir` will be reused in a snapshot db since the options files are just copied. By default, if `wal_dir` was not set when a db was created, it is set to the db's dir. Therefore, the snapshot db will use the same WAL dir. If both the original db and the snapshot db write to or delete from the WAL dir, one may modify or delete files which belong to the other. The same applies to `db_log_dir` as well, but as info log files are not copied or linked, it is simpler for this option.
2 arguments are added to `Checkpoint::CreateCheckpoint()`, allowing to override these 2 options.
`wal_dir`: If the function argument `wal_dir` is empty, or set to the original db location, or the checkpoint location, the snapshot's `wal_dir` option will be updated to the checkpoint location. Otherwise, the absolute path specified in the argument will be used. During checkpointing, live WAL files will be copied or linked the new location, instead of the current WAL dir specified in the original db.
`db_log_dir`: Same as `wal_dir`, but no files will be copied or linked.
A new unit test was added: `CheckpointTest.CheckpointWithOptionsDirsTest`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8572
Test Plan:
New unit test
```
checkpoint_test --gtest_filter="CheckpointTest.CheckpointWithOptionsDirsTest"
```
Output
```
Note: Google Test filter = CheckpointTest.CheckpointWithOptionsDirsTest
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from CheckpointTest
[ RUN ] CheckpointTest.CheckpointWithOptionsDirsTest
[ OK ] CheckpointTest.CheckpointWithOptionsDirsTest (11712 ms)
[----------] 1 test from CheckpointTest (11712 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (11713 ms total)
[ PASSED ] 1 test.
```
This test will fail without this patch. Just modify the code to remove the 2 arguments introduced in this patch in `CreateCheckpoint()`.
Reviewed By: zhichao-cao
Differential Revision: D29832761
Pulled By: autopear
fbshipit-source-id: e6a639b4d674380df82998c0839e79cab695fe29
Summary:
ObjectLibrary is shared between multiple DB instances, the
Register() could have race condition.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8574
Test Plan: pass the failed test
Reviewed By: ajkr
Differential Revision: D29855096
Pulled By: jay-zhuang
fbshipit-source-id: 541eed0bd495d2c963d858d81e7eabf1ba16153c
Summary:
There is an extra " in options.h (`"index block""`)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8550
Test Plan: None
Reviewed By: jay-zhuang
Differential Revision: D29746077
Pulled By: autopear
fbshipit-source-id: 2e5117296e5414b7c7440d990926bc1e567a0b4f
Summary:
I previously didn't notice the DB mutex was being held during
block cache entry stat scans, probably because I primarily checked for
read performance regressions, because they require the block cache and
are traditionally latency-sensitive.
This change does some refactoring to avoid holding DB mutex and to
avoid triggering and waiting for a scan in GetProperty("rocksdb.cfstats").
Some tests have to be updated because now the stats collector is
populated in the Cache aggressively on DB startup rather than lazily.
(I hope to clean up some of this added complexity in the future.)
This change also ensures proper treatment of need_out_of_mutex for
non-int DB properties.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8538
Test Plan:
Added unit test logic that uses sync points to fail if the DB mutex
is held during a scan, covering the various ways that a scan might be
triggered.
Performance test - the known impact to holding the DB mutex is on
TransactionDB, and the easiest way to see the impact is to hack the
scan code to almost always miss and take an artificially long time
scanning. Here I've injected an unconditional 5s sleep at the call to
ApplyToAllEntries.
Before (hacked):
$ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
randomtransaction : 433.219 micros/op 2308 ops/sec; 0.1 MB/s ( transactions:78999 aborts:0)
rocksdb.db.write.micros P50 : 16.135883 P95 : 36.622503 P99 : 66.036115 P100 : 5000614.000000 COUNT : 149677 SUM : 8364856
$ TEST_TMPDIR=/dev/shm ./db_bench.base_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
randomtransaction : 448.802 micros/op 2228 ops/sec; 0.1 MB/s ( transactions:75999 aborts:0)
rocksdb.db.write.micros P50 : 16.629221 P95 : 37.320607 P99 : 72.144341 P100 : 5000871.000000 COUNT : 143995 SUM : 13472323
Notice the 5s P100 write time.
After (hacked):
$ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
randomtransaction : 303.645 micros/op 3293 ops/sec; 0.1 MB/s ( transactions:98999 aborts:0)
rocksdb.db.write.micros P50 : 16.061871 P95 : 33.978834 P99 : 60.018017 P100 : 616315.000000 COUNT : 187619 SUM : 4097407
$ TEST_TMPDIR=/dev/shm ./db_bench.new_xxx -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
randomtransaction : 310.383 micros/op 3221 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0)
rocksdb.db.write.micros P50 : 16.270026 P95 : 35.786844 P99 : 64.302878 P100 : 603088.000000 COUNT : 183819 SUM : 4095918
P100 write is now ~0.6s. Not good, but it's the same even if I completely bypass all the scanning code:
$ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
randomtransaction : 311.365 micros/op 3211 ops/sec; 0.1 MB/s ( transactions:96999 aborts:0)
rocksdb.db.write.micros P50 : 16.274362 P95 : 36.221184 P99 : 68.809783 P100 : 649808.000000 COUNT : 183819 SUM : 4156767
$ TEST_TMPDIR=/dev/shm ./db_bench.new_skip -benchmarks=randomtransaction,stats -cache_index_and_filter_blocks=1 -bloom_bits=10 -partition_index_and_filters=1 -duration=30 -stats_dump_period_sec=12 -cache_size=100000000 -statistics -transaction_db 2>&1 | egrep 'db.db.write.micros|micros/op'
randomtransaction : 308.395 micros/op 3242 ops/sec; 0.1 MB/s ( transactions:97999 aborts:0)
rocksdb.db.write.micros P50 : 16.106222 P95 : 37.202403 P99 : 67.081875 P100 : 598091.000000 COUNT : 185714 SUM : 4098832
No substantial difference.
Reviewed By: siying
Differential Revision: D29738847
Pulled By: pdillinger
fbshipit-source-id: 1c5c155f5a1b62e4fea0fd4eeb515a8b7474027b
Summary:
Made the EncryptionProvider and BlockCipher classes inherit from Customizable. Added/fixed the CreateFromString method to these classes to create instances from builtin or registered classes. Added tests to verify that instances can be registered and retrieved as appropriate.
Added the ability to configure the builtin (CTR, ROT13) classes from configurable properties. Added the appropriate tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8354
Reviewed By: zhichao-cao
Differential Revision: D29558949
Pulled By: mrambacher
fbshipit-source-id: c20286b32d179777e060f51a58943e9b0cf81d04
Summary:
Implement an experimental feature called "MemPurge", which consists in purging "garbage" bytes out of a memtable and reuse the memtable struct instead of making it immutable and eventually flushing its content to storage.
The prototype is by default deactivated and is not intended for use. It is intended for correctness and validation testing. At the moment, the "MemPurge" feature can be switched on by using the `options.experimental_allow_mempurge` flag. For this early stage, when the allow_mempurge flag is set to `true`, all the flush operations will be rerouted to perform a MemPurge. This is a temporary design decision that will give us the time to explore meaningful heuristics to use MemPurge at the right time for relevant workloads . Moreover, the current MemPurge operation only supports `Puts`, `Deletes`, `DeleteRange` operations, and handles `Iterators` as well as `CompactionFilter`s that are invoked at flush time .
Three unit tests are added to `db_flush_test.cc` to test if MemPurge works correctly (and checks that the previously mentioned operations are fully supported thoroughly tested).
One noticeable design decision is the timing of the MemPurge operation in the memtable workflow: for this prototype, the mempurge happens when the memtable is switched (and usually made immutable). This is an inefficient process because it implies that the entirety of the MemPurge operation happens while holding the db_mutex. Future commits will make the MemPurge operation a background task (akin to the regular flush operation) and aim at drastically enhancing the performance of this operation. The MemPurge is also not fully "WAL-compatible" yet, but when the WAL is full, or when the regular MemPurge operation fails (or when the purged memtable still needs to be flushed), a regular flush operation takes place. Later commits will also correct these behaviors.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8454
Reviewed By: anand1976
Differential Revision: D29433971
Pulled By: bjlemaire
fbshipit-source-id: 6af48213554e35048a7e03816955100a80a26dc5
Summary:
Added the Customizable::ConfigureNewObject method. The method will configure the object if options are found and invoke PrepareOptions if the flag is set properly.
Added tests to test that PrepareOptions is properly called and to test if PrepareOptions fails.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8468
Reviewed By: zhichao-cao
Differential Revision: D29494703
Pulled By: mrambacher
fbshipit-source-id: d5767dee5d7a98620ac66190262101cd0aa9d2b7
Summary:
Change the job_id for remote compaction interface, which will include
both internal compaction job_id, also a sub_compaction_job_id. It is not
a backward compatible change. The user needs to update interface during
upgrade. (We will avoid backward incompatible change after the feature is
not experimental.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8364
Reviewed By: ajkr
Differential Revision: D28917301
Pulled By: jay-zhuang
fbshipit-source-id: 6d72a21f652bb517ad6954d0387b496797fc4e11
Summary:
Added BlobMetaData to ColumnFamilyMetaData and LiveBlobMetaData and DB API GetLiveBlobMetaData to retrieve it.
First pass at struct. More tests and maybe fields to come...
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8273
Reviewed By: ltamasi
Differential Revision: D29102400
Pulled By: mrambacher
fbshipit-source-id: 8a2383a4446328be6b91dced9841fdd3dfc80b73
Summary:
Implement the ```WaitAll()``` interface in ```LRUCache``` to allow callers to issue multiple lookups in parallel and wait for all of them to complete. Modify ```MultiGet``` to use this to parallelize the secondary cache lookups in order to reduce the overall latency. A call to ```cache->Lookup()``` returns a handle that has an incomplete value (nullptr), and the caller can call ```cache->IsReady()``` to check whether the lookup is complete, and pass a vector of handles to ```WaitAll``` to wait for completion. If any of the lookups fail, ```MultiGet``` will read the block from the SST file.
Another change in this PR is to rename ```SecondaryCacheHandle``` to ```SecondaryCacheResultHandle``` as it more accurately describes the return result of the secondary cache lookup, which is more like a future.
Tests:
1. Add unit tests in lru_cache_test
2. Benchmark results with no secondary cache configured
Master -
```
readrandom : 41.175 micros/op 388562 ops/sec; 106.7 MB/s (7277999 of 7277999 found)
readrandom : 41.217 micros/op 388160 ops/sec; 106.6 MB/s (7274999 of 7274999 found)
multireadrandom : 10.309 micros/op 1552082 ops/sec; (28908992 of 28908992 found)
multireadrandom : 10.321 micros/op 1550218 ops/sec; (29081984 of 29081984 found)
```
This PR -
```
readrandom : 41.158 micros/op 388723 ops/sec; 106.8 MB/s (7290999 of 7290999 found)
readrandom : 41.185 micros/op 388463 ops/sec; 106.7 MB/s (7287999 of 7287999 found)
multireadrandom : 10.277 micros/op 1556801 ops/sec; (29346944 of 29346944 found)
multireadrandom : 10.253 micros/op 1560539 ops/sec; (29274944 of 29274944 found)
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8405
Reviewed By: zhichao-cao
Differential Revision: D29190509
Pulled By: anand1976
fbshipit-source-id: 6f8eff6246712af8a297cfe22ea0d1c3b2a01bb0
Summary:
**Summary**:
2 new statistics counters are added to RocksDB: `MEMTABLE_PAYLOAD_BYTES_AT_FLUSH` and `MEMTABLE_GARBAGE_BYTES_AT_FLUSH`. The former tracks how many raw bytes of useful data are present on the memtable at flush time, whereas the latter is tracks how many of these raw bytes are considered garbage, meaning that they ended up not being imported on the SSTables resulting from the flush operations.
**Unit test**: run `make db_flush_test -j$(nproc); ./db_flush_test` to run the unit test.
This executable includes 3 tests, that test support and correct stat calculations for workloads with inserts, deletes, and DeleteRanges. The parameters are set such that the workloads are performed on a single memtable, and a single SSTable is created as a result of the flush operation. The flush operation is manually called in the test file. The tests verify that the values of these 2 statistics counters introduced in this PR can be exactly predicted, showing that we have a full understanding of the underlying operations.
**Performance testing**:
`./db_bench -statistics -benchmarks=fillrandom -num=10000000` repeated 10 times.
Timing done using "date" function in a bash script.
_Results_:
Original Rocksdb fork: mean 66.6 sec, std 1.18 sec.
This feature branch: mean 67.4 sec, std 1.35 sec.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8411
Reviewed By: akankshamahajan15
Differential Revision: D29150629
Pulled By: bjlemaire
fbshipit-source-id: 7b3c2e86d50c6aa34fa50fd134282eacb543a5b1
Summary:
This PR prepopulates warm/hot data blocks which are already in memory
into block cache at the time of flush. On a flush, the data block that is
in memory (in memtables) get flushed to the device. If using Direct IO,
additional IO is incurred to read this data back into memory again, which
is avoided by enabling newly added option.
Right now, this is enabled only for flush for data blocks. We plan to
expand this option to cover compactions in the future and for other types
of blocks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8242
Test Plan: Add new unit test
Reviewed By: anand1976
Differential Revision: D28521703
Pulled By: akankshamahajan15
fbshipit-source-id: 7219d6958821cedce689a219c3963a6f1a9d5f05
Summary:
Marked the Ribbon filter and optimize_filters_for_memory features
as production-ready, each enabling memory savings for Bloom-like filters.
Use `NewRibbonFilterPolicy` in place of `NewBloomFilterPolicy` to use
Ribbon filters instead of Bloom, or `ribbonfilter` in place of
`bloomfilter` in configuration string.
Some small refactoring in db_stress.
Removed/refactored unused code in db_bench, in part preparing for future
default possibly being different from "disabled."
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8408
Test Plan:
Lots of prior automated, ad-hoc, and "real world" testing.
Updated tests for new API names. Quick db_bench test:
bloom fillrandom
77730 ops/sec
rocksdb.block.cache.filter.bytes.insert COUNT : 89929384
ribbon fillrandom
71492 ops/sec
rocksdb.block.cache.filter.bytes.insert COUNT : 64531384
Reviewed By: mrambacher
Differential Revision: D29140805
Pulled By: pdillinger
fbshipit-source-id: d742c922722421678f95ad85eeb0aaebc9f5e49a
Summary:
- Added CreateFromString method to Env and FilesSystem to replace LoadEnv/Load. This method/signature is a precursor to making these classes extend Customizable.
- Added CreateFromSystem to Env. This method standardizes creating an Env from the environment variables. Previously, some places would check TEST_ENV_URI and others would also check TEST_FS_URI. Now the code is more command/standardized.
- Added CreateFromFlags to Env. These method allows Env to be create from string options (such as GFLAGS options) in a more standard way.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8174
Reviewed By: zhichao-cao
Differential Revision: D28999603
Pulled By: mrambacher
fbshipit-source-id: 88e6911e7e91f908458a7fe10a20e93ecbc275fb
Summary:
Longstanding tech debt
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8379
Test Plan: Better than not having an API contract
Reviewed By: jay-zhuang
Differential Revision: D29011131
Pulled By: pdillinger
fbshipit-source-id: 2c4796177733651954024fc17875f8642ca08d09
Summary:
(1)Make CompactionService derived from Customizable by defining two extra functions that are needed, as described in customizable.h comment section
(2)Revise the MyTestCompactionService class in compaction_service_test.cc to satisfy the class inheritance requirement
(3)Specify namespace of ToString() in compaction_service_test.cc to avoid function collision with CompactionService's ancestor classes
Test did:
make -j24 compaction_service_test
./compaction_service_test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8395
Reviewed By: jay-zhuang
Differential Revision: D29076068
Pulled By: hx235
fbshipit-source-id: c130100fa466939b3137e917f5fdc4b2ae8e37d4
Summary:
Makes the Comparator class into a Customizable object. Added/Updated the CreateFromString method to create Comparators. Added test for using the ObjectRegistry to create one.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8336
Reviewed By: jay-zhuang
Differential Revision: D28999612
Pulled By: mrambacher
fbshipit-source-id: bff2cb2814eeb9fef6a00fddc61d6e34b6fbcf2e
Summary:
Added the ability to cancel an in-progress range compaction by storing to an atomic "canceled" variable pointed to within the CompactRangeOptions structure.
Tested via two tests added to db_tests2.cc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8351
Reviewed By: ajkr
Differential Revision: D28808894
Pulled By: ddevec
fbshipit-source-id: cb321361c9e23b084b188bb203f11c375a22c2dd
Summary:
Update HISTORY and version to 6.21 on master.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8363
Reviewed By: jay-zhuang
Differential Revision: D28888818
Pulled By: anand1976
fbshipit-source-id: 9e5fac3b99ecc9f3b7d9f21474a39fa50decb117
Summary:
With Ribbon filter work and possible variance in actual bits
per key (or prefix; general term "entry") to achieve certain FP rates,
I've received a request to be able to track actual bits per key in
generated filters. This change adds a num_filter_entries table
property, which can be combined with filter_size to get bits per key
(entry).
This can vary from num_entries in at least these ways:
* Different versions of same key are only counted once in filters.
* With prefix filters, several user keys map to the same filter entry.
* A single filter can include both prefixes and user keys.
Note that FilterBlockBuilder::NumAdded() didn't do anything useful
except distinguish empty from non-empty.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8323
Test Plan: basic unit test included, others updated
Reviewed By: jay-zhuang
Differential Revision: D28596210
Pulled By: pdillinger
fbshipit-source-id: 529a111f3c84501e5a470bc84705e436ee68c376
Summary:
When a memtable is flushed, it will validate number of entries it reads, and compare the number with how many entries inserted into memtable. This serves as one sanity c\
heck against memory corruption. This change will also allow more counters to be added in the future for better validation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8288
Test Plan: Pass all existing tests
Reviewed By: ajkr
Differential Revision: D28369194
fbshipit-source-id: 7ff870380c41eab7f99eee508550dcdce32838ad
Summary:
This change gathers and publishes statistics about the
kinds of items in block cache. This is especially important for
profiling relative usage of cache by index vs. filter vs. data blocks.
It works by iterating over the cache during periodic stats dump
(InternalStats, stats_dump_period_sec) or on demand when
DB::Get(Map)Property(kBlockCacheEntryStats), except that for
efficiency and sharing among column families, saved data from
the last scan is used when the data is not considered too old.
The new information can be seen in info LOG, for example:
Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0
Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%)
And also through DB::GetProperty and GetMapProperty (here using
ldb just for demonstration):
$ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats
rocksdb.block-cache-entry-stats.bytes.data-block: 0
rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0
rocksdb.block-cache-entry-stats.bytes.filter-block: 0
rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0
rocksdb.block-cache-entry-stats.bytes.index-block: 178992
rocksdb.block-cache-entry-stats.bytes.misc: 0
rocksdb.block-cache-entry-stats.bytes.other-block: 0
rocksdb.block-cache-entry-stats.bytes.write-buffer: 0
rocksdb.block-cache-entry-stats.capacity: 8388608
rocksdb.block-cache-entry-stats.count.data-block: 0
rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0
rocksdb.block-cache-entry-stats.count.filter-block: 0
rocksdb.block-cache-entry-stats.count.filter-meta-block: 0
rocksdb.block-cache-entry-stats.count.index-block: 215
rocksdb.block-cache-entry-stats.count.misc: 1
rocksdb.block-cache-entry-stats.count.other-block: 0
rocksdb.block-cache-entry-stats.count.write-buffer: 0
rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290
rocksdb.block-cache-entry-stats.percent.data-block: 0.000000
rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000
rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000
rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000
rocksdb.block-cache-entry-stats.percent.index-block: 2.133751
rocksdb.block-cache-entry-stats.percent.misc: 0.000000
rocksdb.block-cache-entry-stats.percent.other-block: 0.000000
rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000
rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052
rocksdb.block-cache-entry-stats.secs_since_last_collection: 0
Solution detail - We need some way to flag what kind of blocks each
entry belongs to, preferably without changing the Cache API.
One of the complications is that Cache is a general interface that could
have other users that don't adhere to whichever convention we decide
on for keys and values. Or we would pay for an extra field in the Handle
that would only be used for this purpose.
This change uses a back-door approach, the deleter, to indicate the
"role" of a Cache entry (in addition to the value type, implicitly).
This has the added benefit of ensuring proper code origin whenever we
recognize a particular role for a cache entry; if the entry came from
some other part of the code, it will use an unrecognized deleter, which
we simply attribute to the "Misc" role.
An internal API makes for simple instantiation and automatic
registration of Cache deleters for a given value type and "role".
Another internal API, CacheEntryStatsCollector, solves the problem of
caching the results of a scan and sharing them, to ensure scans are
neither excessive nor redundant so as not to harm Cache performance.
Because code is added to BlocklikeTraits, it is pulled out of
block_based_table_reader.cc into its own file.
This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option
(could still be added), and with actual stat gathering.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297
Test Plan: manual testing with db_bench, and a couple of basic unit tests
Reviewed By: ltamasi
Differential Revision: D28488721
Pulled By: pdillinger
fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
Summary:
- Add class `FunctorWrapper` to invoke the function with given parameters
- Implement `StartThreadTyped` which wraps `StartThread` with type checking cover
- Demonstrate `StartThreadTyped` in test `util/thread_local_test.cc`
https://github.com/facebook/rocksdb/issues/8285
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8303
Reviewed By: ajkr
Differential Revision: D28539318
Pulled By: pdillinger
fbshipit-source-id: 624789c236bde31163deda95c1e1471aee68933e
Summary:
This PR adds a ```-secondary_cache_uri``` option to the cache_bench and db_bench tools to allow the user to specify a custom secondary cache URI. The object registry is used to create an instance of the ```SecondaryCache``` object of the type specified in the URI.
The main cache_bench code is packaged into a separate library, similar to db_bench.
An example invocation of db_bench with a secondary cache URI -
```db_bench --env_uri=ws://ws.flash_sandbox.vll1_2/ -db=anand/nvm_cache_2 -use_existing_db=true -benchmarks=readrandom -num=30000000 -key_size=32 -value_size=256 -use_direct_reads=true -cache_size=67108864 -cache_index_and_filter_blocks=true -secondary_cache_uri='cachelibwrapper://filename=/home/anand76/nvm_cache/cache_file;size=2147483648;regionSize=16777216;admPolicy=random;admProbability=1.0;volatileSize=8388608;bktPower=20;lockPower=12' -partition_index_and_filters=true -duration=1800```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8312
Reviewed By: zhichao-cao
Differential Revision: D28544325
Pulled By: anand1976
fbshipit-source-id: 8f209b9af900c459dc42daa7a610d5f00176eeed
Summary:
As a part of tiered storage, writing tempeature information to manifest is needed so that after DB recovery, RocksDB still has the tiering information, to implement some further necessary functionalities.
Also fix some issues in simulated hybrid FS.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8284
Test Plan: Add a new unit test to validate that the information is indeed written and read back.
Reviewed By: zhichao-cao
Differential Revision: D28335801
fbshipit-source-id: 56aeb2e6ea090be0200181dd968c8a7278037def
Summary:
Defined the abstract interface for a secondary cache in include/rocksdb/secondary_cache.h, and updated LRUCacheOptions to take a std::shared_ptr<SecondaryCache>. An item is initially inserted into the LRU (primary) cache. When it ages out and evicted from memory, its inserted into the secondary cache. On a LRU cache miss and successful lookup in the secondary cache, the item is promoted to the LRU cache. Only support synchronous lookup currently. The secondary cache would be used to implement a persistent (flash cache) or compressed cache.
Tests:
Results from cache_bench and db_bench don't show any regression due to these changes.
cache_bench results before and after this change -
Command
```./cache_bench -ops_per_thread=10000000 -threads=1```
Before
```Complete in 40.688 s; QPS = 245774```
```Complete in 40.486 s; QPS = 246996```
```Complete in 42.019 s; QPS = 237989```
After
```Complete in 40.672 s; QPS = 245869```
```Complete in 44.622 s; QPS = 224107```
```Complete in 42.445 s; QPS = 235599```
db_bench results before this change, and with this change + https://github.com/facebook/rocksdb/issues/8213 and https://github.com/facebook/rocksdb/issues/8191 -
Commands
```./db_bench --benchmarks="fillseq,compact" -num=30000000 -key_size=32 -value_size=256 -use_direct_io_for_flush_and_compaction=true -db=/home/anand76/nvm_cache/db -partition_index_and_filters=true```
```./db_bench -db=/home/anand76/nvm_cache/db -use_existing_db=true -benchmarks=readrandom -num=30000000 -key_size=32 -value_size=256 -use_direct_reads=true -cache_size=1073741824 -cache_numshardbits=6 -cache_index_and_filter_blocks=true -read_random_exp_range=17 -statistics -partition_index_and_filters=true -threads=16 -duration=300```
Before
```
DB path: [/home/anand76/nvm_cache/db]
readrandom : 80.702 micros/op 198104 ops/sec; 54.4 MB/s (3708999 of 3708999 found)
```
```
DB path: [/home/anand76/nvm_cache/db]
readrandom : 87.124 micros/op 183625 ops/sec; 50.4 MB/s (3439999 of 3439999 found)
```
After
```
DB path: [/home/anand76/nvm_cache/db]
readrandom : 77.653 micros/op 206025 ops/sec; 56.6 MB/s (3866999 of 3866999 found)
```
```
DB path: [/home/anand76/nvm_cache/db]
readrandom : 84.962 micros/op 188299 ops/sec; 51.7 MB/s (3535999 of 3535999 found)
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8271
Reviewed By: zhichao-cao
Differential Revision: D28357511
Pulled By: anand1976
fbshipit-source-id: d1cfa236f00e649a18c53328be10a8062a4b6da2
Summary:
The functions will be used for remote compaction parameter
input and result.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8247
Test Plan: `make check`
Reviewed By: ajkr
Differential Revision: D28104680
Pulled By: jay-zhuang
fbshipit-source-id: c0a5178e6277125118384278efea2acbf90aa6cb
Summary:
Adds a new Cache::ApplyToAllEntries API that we expect to use
(in follow-up PRs) for efficiently gathering block cache statistics.
Notable features vs. old ApplyToAllCacheEntries:
* Includes key and deleter (in addition to value and charge). We could
have passed in a Handle but then more virtual function calls would be
needed to get the "fields" of each entry. We expect to use the 'deleter'
to identify the origin of entries, perhaps even more.
* Heavily tuned to minimize latency impact on operating cache. It
does this by iterating over small sections of each cache shard while
cycling through the shards.
* Supports tuning roughly how many entries to operate on for each
lock acquire and release, to control the impact on the latency of other
operations without excessive lock acquire & release. The right balance
can depend on the cost of the callback. Good default seems to be
around 256.
* There should be no need to disable thread safety. (I would expect
uncontended locks to be sufficiently fast.)
I have enhanced cache_bench to validate this approach:
* Reports a histogram of ns per operation, so we can look at the
ditribution of times, not just throughput (average).
* Can add a thread for simulated "gather stats" which calls
ApplyToAllEntries at a specified interval. We also generate a histogram
of time to run ApplyToAllEntries.
To make the iteration over some entries of each shard work as cleanly as
possible, even with resize between next set of entries, I have
re-arranged which hash bits are used for sharding and which for indexing
within a shard.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8225
Test Plan:
A couple of unit tests are added, but primary validation is manual, as
the primary risk is to performance.
The primary validation is using cache_bench to ensure that neither
the minor hashing changes nor the simulated stats gathering
significantly impact QPS or latency distribution. Note that adding op
latency histogram seriously impacts the benchmark QPS, so for a
fair baseline, we need the cache_bench changes (except remove simulated
stat gathering to make it compile). In short, we don't see any
reproducible difference in ops/sec or op latency unless we are gathering
stats nearly continuously. Test uses 10GB block cache with
8KB values to be somewhat realistic in the number of items to iterate
over.
Baseline typical output:
```
Complete in 92.017 s; Rough parallel ops/sec = 869401
Thread ops/sec = 54662
Operation latency (ns):
Count: 80000000 Average: 11223.9494 StdDev: 29.61
Min: 0 Median: 7759.3973 Max: 9620500
Percentiles: P50: 7759.40 P75: 14190.73 P99: 46922.75 P99.9: 77509.84 P99.99: 217030.58
------------------------------------------------------
[ 0, 1 ] 68 0.000% 0.000%
( 2900, 4400 ] 89 0.000% 0.000%
( 4400, 6600 ] 33630240 42.038% 42.038% ########
( 6600, 9900 ] 18129842 22.662% 64.700% #####
( 9900, 14000 ] 7877533 9.847% 74.547% ##
( 14000, 22000 ] 15193238 18.992% 93.539% ####
( 22000, 33000 ] 3037061 3.796% 97.335% #
( 33000, 50000 ] 1626316 2.033% 99.368%
( 50000, 75000 ] 421532 0.527% 99.895%
( 75000, 110000 ] 56910 0.071% 99.966%
( 110000, 170000 ] 16134 0.020% 99.986%
( 170000, 250000 ] 5166 0.006% 99.993%
( 250000, 380000 ] 3017 0.004% 99.996%
( 380000, 570000 ] 1337 0.002% 99.998%
( 570000, 860000 ] 805 0.001% 99.999%
( 860000, 1200000 ] 319 0.000% 100.000%
( 1200000, 1900000 ] 231 0.000% 100.000%
( 1900000, 2900000 ] 100 0.000% 100.000%
( 2900000, 4300000 ] 39 0.000% 100.000%
( 4300000, 6500000 ] 16 0.000% 100.000%
( 6500000, 9800000 ] 7 0.000% 100.000%
```
New, gather_stats=false. Median thread ops/sec of 5 runs:
```
Complete in 92.030 s; Rough parallel ops/sec = 869285
Thread ops/sec = 54458
Operation latency (ns):
Count: 80000000 Average: 11298.1027 StdDev: 42.18
Min: 0 Median: 7722.0822 Max: 6398720
Percentiles: P50: 7722.08 P75: 14294.68 P99: 47522.95 P99.9: 85292.16 P99.99: 228077.78
------------------------------------------------------
[ 0, 1 ] 109 0.000% 0.000%
( 2900, 4400 ] 793 0.001% 0.001%
( 4400, 6600 ] 34054563 42.568% 42.569% #########
( 6600, 9900 ] 17482646 21.853% 64.423% ####
( 9900, 14000 ] 7908180 9.885% 74.308% ##
( 14000, 22000 ] 15032072 18.790% 93.098% ####
( 22000, 33000 ] 3237834 4.047% 97.145% #
( 33000, 50000 ] 1736882 2.171% 99.316%
( 50000, 75000 ] 446851 0.559% 99.875%
( 75000, 110000 ] 68251 0.085% 99.960%
( 110000, 170000 ] 18592 0.023% 99.983%
( 170000, 250000 ] 7200 0.009% 99.992%
( 250000, 380000 ] 3334 0.004% 99.997%
( 380000, 570000 ] 1393 0.002% 99.998%
( 570000, 860000 ] 700 0.001% 99.999%
( 860000, 1200000 ] 293 0.000% 100.000%
( 1200000, 1900000 ] 196 0.000% 100.000%
( 1900000, 2900000 ] 69 0.000% 100.000%
( 2900000, 4300000 ] 32 0.000% 100.000%
( 4300000, 6500000 ] 10 0.000% 100.000%
```
New, gather_stats=true, 1 second delay between scans. Scans take about
1 second here so it's spending about 50% time scanning. Still the effect on
ops/sec and latency seems to be in the noise. Median thread ops/sec of 5 runs:
```
Complete in 91.890 s; Rough parallel ops/sec = 870608
Thread ops/sec = 54551
Operation latency (ns):
Count: 80000000 Average: 11311.2629 StdDev: 45.28
Min: 0 Median: 7686.5458 Max: 10018340
Percentiles: P50: 7686.55 P75: 14481.95 P99: 47232.60 P99.9: 79230.18 P99.99: 232998.86
------------------------------------------------------
[ 0, 1 ] 71 0.000% 0.000%
( 2900, 4400 ] 291 0.000% 0.000%
( 4400, 6600 ] 34492060 43.115% 43.116% #########
( 6600, 9900 ] 16727328 20.909% 64.025% ####
( 9900, 14000 ] 7845828 9.807% 73.832% ##
( 14000, 22000 ] 15510654 19.388% 93.220% ####
( 22000, 33000 ] 3216533 4.021% 97.241% #
( 33000, 50000 ] 1680859 2.101% 99.342%
( 50000, 75000 ] 439059 0.549% 99.891%
( 75000, 110000 ] 60540 0.076% 99.967%
( 110000, 170000 ] 14649 0.018% 99.985%
( 170000, 250000 ] 5242 0.007% 99.991%
( 250000, 380000 ] 3260 0.004% 99.995%
( 380000, 570000 ] 1599 0.002% 99.997%
( 570000, 860000 ] 1043 0.001% 99.999%
( 860000, 1200000 ] 471 0.001% 99.999%
( 1200000, 1900000 ] 275 0.000% 100.000%
( 1900000, 2900000 ] 143 0.000% 100.000%
( 2900000, 4300000 ] 60 0.000% 100.000%
( 4300000, 6500000 ] 27 0.000% 100.000%
( 6500000, 9800000 ] 7 0.000% 100.000%
( 9800000, 14000000 ] 1 0.000% 100.000%
Gather stats latency (us):
Count: 46 Average: 980387.5870 StdDev: 60911.18
Min: 879155 Median: 1033777.7778 Max: 1261431
Percentiles: P50: 1033777.78 P75: 1120666.67 P99: 1261431.00 P99.9: 1261431.00 P99.99: 1261431.00
------------------------------------------------------
( 860000, 1200000 ] 45 97.826% 97.826% ####################
( 1200000, 1900000 ] 1 2.174% 100.000%
Most recent cache entry stats:
Number of entries: 1295133
Total charge: 9.88 GB
Average key size: 23.4982
Average charge: 8.00 KB
Unique deleters: 3
```
Reviewed By: mrambacher
Differential Revision: D28295742
Pulled By: pdillinger
fbshipit-source-id: bbc4a552f91ba0fe10e5cc025c42cef5a81f2b95
Summary:
Added ParseType, SerializeType, and TypesAreEqual methods to OptionTypeInfo. These methods can be used for serialization and deserialization of basic types.
Change the MutableCF/DB Options to use this format.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8249
Reviewed By: jay-zhuang
Differential Revision: D28351190
Pulled By: mrambacher
fbshipit-source-id: 72a78643b804f2f0bf59c32ffefa63346672ad16
Summary:
This change enables a couple of things:
- Different ConfigOptions can have different registry/factory associated with it, thereby allowing things like a "Test" ConfigOptions versus a "Production"
- The ObjectRegistry is created fewer times and can be re-used
The ConfigOptions can also be initialized/constructed from a DBOptions, in which case it will grab some of its settings (Env, Logger) from the DBOptions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8166
Reviewed By: zhichao-cao
Differential Revision: D27657952
Pulled By: mrambacher
fbshipit-source-id: ae1d6200bb7ab127405cdeefaba43c7fe694dfdd