Summary:
Implement an experimental feature called "MemPurge", which consists in purging "garbage" bytes out of a memtable and reuse the memtable struct instead of making it immutable and eventually flushing its content to storage.
The prototype is by default deactivated and is not intended for use. It is intended for correctness and validation testing. At the moment, the "MemPurge" feature can be switched on by using the `options.experimental_allow_mempurge` flag. For this early stage, when the allow_mempurge flag is set to `true`, all the flush operations will be rerouted to perform a MemPurge. This is a temporary design decision that will give us the time to explore meaningful heuristics to use MemPurge at the right time for relevant workloads . Moreover, the current MemPurge operation only supports `Puts`, `Deletes`, `DeleteRange` operations, and handles `Iterators` as well as `CompactionFilter`s that are invoked at flush time .
Three unit tests are added to `db_flush_test.cc` to test if MemPurge works correctly (and checks that the previously mentioned operations are fully supported thoroughly tested).
One noticeable design decision is the timing of the MemPurge operation in the memtable workflow: for this prototype, the mempurge happens when the memtable is switched (and usually made immutable). This is an inefficient process because it implies that the entirety of the MemPurge operation happens while holding the db_mutex. Future commits will make the MemPurge operation a background task (akin to the regular flush operation) and aim at drastically enhancing the performance of this operation. The MemPurge is also not fully "WAL-compatible" yet, but when the WAL is full, or when the regular MemPurge operation fails (or when the purged memtable still needs to be flushed), a regular flush operation takes place. Later commits will also correct these behaviors.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8454
Reviewed By: anand1976
Differential Revision: D29433971
Pulled By: bjlemaire
fbshipit-source-id: 6af48213554e35048a7e03816955100a80a26dc5
Summary:
Added the Customizable::ConfigureNewObject method. The method will configure the object if options are found and invoke PrepareOptions if the flag is set properly.
Added tests to test that PrepareOptions is properly called and to test if PrepareOptions fails.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8468
Reviewed By: zhichao-cao
Differential Revision: D29494703
Pulled By: mrambacher
fbshipit-source-id: d5767dee5d7a98620ac66190262101cd0aa9d2b7
Summary:
Change the job_id for remote compaction interface, which will include
both internal compaction job_id, also a sub_compaction_job_id. It is not
a backward compatible change. The user needs to update interface during
upgrade. (We will avoid backward incompatible change after the feature is
not experimental.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8364
Reviewed By: ajkr
Differential Revision: D28917301
Pulled By: jay-zhuang
fbshipit-source-id: 6d72a21f652bb517ad6954d0387b496797fc4e11
Summary:
Added BlobMetaData to ColumnFamilyMetaData and LiveBlobMetaData and DB API GetLiveBlobMetaData to retrieve it.
First pass at struct. More tests and maybe fields to come...
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8273
Reviewed By: ltamasi
Differential Revision: D29102400
Pulled By: mrambacher
fbshipit-source-id: 8a2383a4446328be6b91dced9841fdd3dfc80b73
Summary:
Implement the ```WaitAll()``` interface in ```LRUCache``` to allow callers to issue multiple lookups in parallel and wait for all of them to complete. Modify ```MultiGet``` to use this to parallelize the secondary cache lookups in order to reduce the overall latency. A call to ```cache->Lookup()``` returns a handle that has an incomplete value (nullptr), and the caller can call ```cache->IsReady()``` to check whether the lookup is complete, and pass a vector of handles to ```WaitAll``` to wait for completion. If any of the lookups fail, ```MultiGet``` will read the block from the SST file.
Another change in this PR is to rename ```SecondaryCacheHandle``` to ```SecondaryCacheResultHandle``` as it more accurately describes the return result of the secondary cache lookup, which is more like a future.
Tests:
1. Add unit tests in lru_cache_test
2. Benchmark results with no secondary cache configured
Master -
```
readrandom : 41.175 micros/op 388562 ops/sec; 106.7 MB/s (7277999 of 7277999 found)
readrandom : 41.217 micros/op 388160 ops/sec; 106.6 MB/s (7274999 of 7274999 found)
multireadrandom : 10.309 micros/op 1552082 ops/sec; (28908992 of 28908992 found)
multireadrandom : 10.321 micros/op 1550218 ops/sec; (29081984 of 29081984 found)
```
This PR -
```
readrandom : 41.158 micros/op 388723 ops/sec; 106.8 MB/s (7290999 of 7290999 found)
readrandom : 41.185 micros/op 388463 ops/sec; 106.7 MB/s (7287999 of 7287999 found)
multireadrandom : 10.277 micros/op 1556801 ops/sec; (29346944 of 29346944 found)
multireadrandom : 10.253 micros/op 1560539 ops/sec; (29274944 of 29274944 found)
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8405
Reviewed By: zhichao-cao
Differential Revision: D29190509
Pulled By: anand1976
fbshipit-source-id: 6f8eff6246712af8a297cfe22ea0d1c3b2a01bb0
Summary:
**Summary**:
2 new statistics counters are added to RocksDB: `MEMTABLE_PAYLOAD_BYTES_AT_FLUSH` and `MEMTABLE_GARBAGE_BYTES_AT_FLUSH`. The former tracks how many raw bytes of useful data are present on the memtable at flush time, whereas the latter is tracks how many of these raw bytes are considered garbage, meaning that they ended up not being imported on the SSTables resulting from the flush operations.
**Unit test**: run `make db_flush_test -j$(nproc); ./db_flush_test` to run the unit test.
This executable includes 3 tests, that test support and correct stat calculations for workloads with inserts, deletes, and DeleteRanges. The parameters are set such that the workloads are performed on a single memtable, and a single SSTable is created as a result of the flush operation. The flush operation is manually called in the test file. The tests verify that the values of these 2 statistics counters introduced in this PR can be exactly predicted, showing that we have a full understanding of the underlying operations.
**Performance testing**:
`./db_bench -statistics -benchmarks=fillrandom -num=10000000` repeated 10 times.
Timing done using "date" function in a bash script.
_Results_:
Original Rocksdb fork: mean 66.6 sec, std 1.18 sec.
This feature branch: mean 67.4 sec, std 1.35 sec.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8411
Reviewed By: akankshamahajan15
Differential Revision: D29150629
Pulled By: bjlemaire
fbshipit-source-id: 7b3c2e86d50c6aa34fa50fd134282eacb543a5b1
Summary:
This PR prepopulates warm/hot data blocks which are already in memory
into block cache at the time of flush. On a flush, the data block that is
in memory (in memtables) get flushed to the device. If using Direct IO,
additional IO is incurred to read this data back into memory again, which
is avoided by enabling newly added option.
Right now, this is enabled only for flush for data blocks. We plan to
expand this option to cover compactions in the future and for other types
of blocks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8242
Test Plan: Add new unit test
Reviewed By: anand1976
Differential Revision: D28521703
Pulled By: akankshamahajan15
fbshipit-source-id: 7219d6958821cedce689a219c3963a6f1a9d5f05
Summary:
Marked the Ribbon filter and optimize_filters_for_memory features
as production-ready, each enabling memory savings for Bloom-like filters.
Use `NewRibbonFilterPolicy` in place of `NewBloomFilterPolicy` to use
Ribbon filters instead of Bloom, or `ribbonfilter` in place of
`bloomfilter` in configuration string.
Some small refactoring in db_stress.
Removed/refactored unused code in db_bench, in part preparing for future
default possibly being different from "disabled."
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8408
Test Plan:
Lots of prior automated, ad-hoc, and "real world" testing.
Updated tests for new API names. Quick db_bench test:
bloom fillrandom
77730 ops/sec
rocksdb.block.cache.filter.bytes.insert COUNT : 89929384
ribbon fillrandom
71492 ops/sec
rocksdb.block.cache.filter.bytes.insert COUNT : 64531384
Reviewed By: mrambacher
Differential Revision: D29140805
Pulled By: pdillinger
fbshipit-source-id: d742c922722421678f95ad85eeb0aaebc9f5e49a
Summary:
- Added CreateFromString method to Env and FilesSystem to replace LoadEnv/Load. This method/signature is a precursor to making these classes extend Customizable.
- Added CreateFromSystem to Env. This method standardizes creating an Env from the environment variables. Previously, some places would check TEST_ENV_URI and others would also check TEST_FS_URI. Now the code is more command/standardized.
- Added CreateFromFlags to Env. These method allows Env to be create from string options (such as GFLAGS options) in a more standard way.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8174
Reviewed By: zhichao-cao
Differential Revision: D28999603
Pulled By: mrambacher
fbshipit-source-id: 88e6911e7e91f908458a7fe10a20e93ecbc275fb
Summary:
Longstanding tech debt
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8379
Test Plan: Better than not having an API contract
Reviewed By: jay-zhuang
Differential Revision: D29011131
Pulled By: pdillinger
fbshipit-source-id: 2c4796177733651954024fc17875f8642ca08d09
Summary:
(1)Make CompactionService derived from Customizable by defining two extra functions that are needed, as described in customizable.h comment section
(2)Revise the MyTestCompactionService class in compaction_service_test.cc to satisfy the class inheritance requirement
(3)Specify namespace of ToString() in compaction_service_test.cc to avoid function collision with CompactionService's ancestor classes
Test did:
make -j24 compaction_service_test
./compaction_service_test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8395
Reviewed By: jay-zhuang
Differential Revision: D29076068
Pulled By: hx235
fbshipit-source-id: c130100fa466939b3137e917f5fdc4b2ae8e37d4
Summary:
Makes the Comparator class into a Customizable object. Added/Updated the CreateFromString method to create Comparators. Added test for using the ObjectRegistry to create one.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8336
Reviewed By: jay-zhuang
Differential Revision: D28999612
Pulled By: mrambacher
fbshipit-source-id: bff2cb2814eeb9fef6a00fddc61d6e34b6fbcf2e
Summary:
Added the ability to cancel an in-progress range compaction by storing to an atomic "canceled" variable pointed to within the CompactRangeOptions structure.
Tested via two tests added to db_tests2.cc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8351
Reviewed By: ajkr
Differential Revision: D28808894
Pulled By: ddevec
fbshipit-source-id: cb321361c9e23b084b188bb203f11c375a22c2dd
Summary:
Update HISTORY and version to 6.21 on master.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8363
Reviewed By: jay-zhuang
Differential Revision: D28888818
Pulled By: anand1976
fbshipit-source-id: 9e5fac3b99ecc9f3b7d9f21474a39fa50decb117
Summary:
With Ribbon filter work and possible variance in actual bits
per key (or prefix; general term "entry") to achieve certain FP rates,
I've received a request to be able to track actual bits per key in
generated filters. This change adds a num_filter_entries table
property, which can be combined with filter_size to get bits per key
(entry).
This can vary from num_entries in at least these ways:
* Different versions of same key are only counted once in filters.
* With prefix filters, several user keys map to the same filter entry.
* A single filter can include both prefixes and user keys.
Note that FilterBlockBuilder::NumAdded() didn't do anything useful
except distinguish empty from non-empty.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8323
Test Plan: basic unit test included, others updated
Reviewed By: jay-zhuang
Differential Revision: D28596210
Pulled By: pdillinger
fbshipit-source-id: 529a111f3c84501e5a470bc84705e436ee68c376
Summary:
When a memtable is flushed, it will validate number of entries it reads, and compare the number with how many entries inserted into memtable. This serves as one sanity c\
heck against memory corruption. This change will also allow more counters to be added in the future for better validation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8288
Test Plan: Pass all existing tests
Reviewed By: ajkr
Differential Revision: D28369194
fbshipit-source-id: 7ff870380c41eab7f99eee508550dcdce32838ad
Summary:
This change gathers and publishes statistics about the
kinds of items in block cache. This is especially important for
profiling relative usage of cache by index vs. filter vs. data blocks.
It works by iterating over the cache during periodic stats dump
(InternalStats, stats_dump_period_sec) or on demand when
DB::Get(Map)Property(kBlockCacheEntryStats), except that for
efficiency and sharing among column families, saved data from
the last scan is used when the data is not considered too old.
The new information can be seen in info LOG, for example:
Block cache LRUCache@0x7fca62229330 capacity: 95.37 MB collections: 8 last_copies: 0 last_secs: 0.00178 secs_since: 0
Block cache entry stats(count,size,portion): DataBlock(7092,28.24 MB,29.6136%) FilterBlock(215,867.90 KB,0.888728%) FilterMetaBlock(2,5.31 KB,0.00544%) IndexBlock(217,180.11 KB,0.184432%) WriteBuffer(1,256.00 KB,0.262144%) Misc(1,0.00 KB,0%)
And also through DB::GetProperty and GetMapProperty (here using
ldb just for demonstration):
$ ./ldb --db=/dev/shm/dbbench/ get_property rocksdb.block-cache-entry-stats
rocksdb.block-cache-entry-stats.bytes.data-block: 0
rocksdb.block-cache-entry-stats.bytes.deprecated-filter-block: 0
rocksdb.block-cache-entry-stats.bytes.filter-block: 0
rocksdb.block-cache-entry-stats.bytes.filter-meta-block: 0
rocksdb.block-cache-entry-stats.bytes.index-block: 178992
rocksdb.block-cache-entry-stats.bytes.misc: 0
rocksdb.block-cache-entry-stats.bytes.other-block: 0
rocksdb.block-cache-entry-stats.bytes.write-buffer: 0
rocksdb.block-cache-entry-stats.capacity: 8388608
rocksdb.block-cache-entry-stats.count.data-block: 0
rocksdb.block-cache-entry-stats.count.deprecated-filter-block: 0
rocksdb.block-cache-entry-stats.count.filter-block: 0
rocksdb.block-cache-entry-stats.count.filter-meta-block: 0
rocksdb.block-cache-entry-stats.count.index-block: 215
rocksdb.block-cache-entry-stats.count.misc: 1
rocksdb.block-cache-entry-stats.count.other-block: 0
rocksdb.block-cache-entry-stats.count.write-buffer: 0
rocksdb.block-cache-entry-stats.id: LRUCache@0x7f3636661290
rocksdb.block-cache-entry-stats.percent.data-block: 0.000000
rocksdb.block-cache-entry-stats.percent.deprecated-filter-block: 0.000000
rocksdb.block-cache-entry-stats.percent.filter-block: 0.000000
rocksdb.block-cache-entry-stats.percent.filter-meta-block: 0.000000
rocksdb.block-cache-entry-stats.percent.index-block: 2.133751
rocksdb.block-cache-entry-stats.percent.misc: 0.000000
rocksdb.block-cache-entry-stats.percent.other-block: 0.000000
rocksdb.block-cache-entry-stats.percent.write-buffer: 0.000000
rocksdb.block-cache-entry-stats.secs_for_last_collection: 0.000052
rocksdb.block-cache-entry-stats.secs_since_last_collection: 0
Solution detail - We need some way to flag what kind of blocks each
entry belongs to, preferably without changing the Cache API.
One of the complications is that Cache is a general interface that could
have other users that don't adhere to whichever convention we decide
on for keys and values. Or we would pay for an extra field in the Handle
that would only be used for this purpose.
This change uses a back-door approach, the deleter, to indicate the
"role" of a Cache entry (in addition to the value type, implicitly).
This has the added benefit of ensuring proper code origin whenever we
recognize a particular role for a cache entry; if the entry came from
some other part of the code, it will use an unrecognized deleter, which
we simply attribute to the "Misc" role.
An internal API makes for simple instantiation and automatic
registration of Cache deleters for a given value type and "role".
Another internal API, CacheEntryStatsCollector, solves the problem of
caching the results of a scan and sharing them, to ensure scans are
neither excessive nor redundant so as not to harm Cache performance.
Because code is added to BlocklikeTraits, it is pulled out of
block_based_table_reader.cc into its own file.
This is a reformulation of https://github.com/facebook/rocksdb/issues/8276, without the type checking option
(could still be added), and with actual stat gathering.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8297
Test Plan: manual testing with db_bench, and a couple of basic unit tests
Reviewed By: ltamasi
Differential Revision: D28488721
Pulled By: pdillinger
fbshipit-source-id: 472f524a9691b5afb107934be2d41d84f2b129fb
Summary:
- Add class `FunctorWrapper` to invoke the function with given parameters
- Implement `StartThreadTyped` which wraps `StartThread` with type checking cover
- Demonstrate `StartThreadTyped` in test `util/thread_local_test.cc`
https://github.com/facebook/rocksdb/issues/8285
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8303
Reviewed By: ajkr
Differential Revision: D28539318
Pulled By: pdillinger
fbshipit-source-id: 624789c236bde31163deda95c1e1471aee68933e
Summary:
This PR adds a ```-secondary_cache_uri``` option to the cache_bench and db_bench tools to allow the user to specify a custom secondary cache URI. The object registry is used to create an instance of the ```SecondaryCache``` object of the type specified in the URI.
The main cache_bench code is packaged into a separate library, similar to db_bench.
An example invocation of db_bench with a secondary cache URI -
```db_bench --env_uri=ws://ws.flash_sandbox.vll1_2/ -db=anand/nvm_cache_2 -use_existing_db=true -benchmarks=readrandom -num=30000000 -key_size=32 -value_size=256 -use_direct_reads=true -cache_size=67108864 -cache_index_and_filter_blocks=true -secondary_cache_uri='cachelibwrapper://filename=/home/anand76/nvm_cache/cache_file;size=2147483648;regionSize=16777216;admPolicy=random;admProbability=1.0;volatileSize=8388608;bktPower=20;lockPower=12' -partition_index_and_filters=true -duration=1800```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8312
Reviewed By: zhichao-cao
Differential Revision: D28544325
Pulled By: anand1976
fbshipit-source-id: 8f209b9af900c459dc42daa7a610d5f00176eeed
Summary:
As a part of tiered storage, writing tempeature information to manifest is needed so that after DB recovery, RocksDB still has the tiering information, to implement some further necessary functionalities.
Also fix some issues in simulated hybrid FS.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8284
Test Plan: Add a new unit test to validate that the information is indeed written and read back.
Reviewed By: zhichao-cao
Differential Revision: D28335801
fbshipit-source-id: 56aeb2e6ea090be0200181dd968c8a7278037def
Summary:
Defined the abstract interface for a secondary cache in include/rocksdb/secondary_cache.h, and updated LRUCacheOptions to take a std::shared_ptr<SecondaryCache>. An item is initially inserted into the LRU (primary) cache. When it ages out and evicted from memory, its inserted into the secondary cache. On a LRU cache miss and successful lookup in the secondary cache, the item is promoted to the LRU cache. Only support synchronous lookup currently. The secondary cache would be used to implement a persistent (flash cache) or compressed cache.
Tests:
Results from cache_bench and db_bench don't show any regression due to these changes.
cache_bench results before and after this change -
Command
```./cache_bench -ops_per_thread=10000000 -threads=1```
Before
```Complete in 40.688 s; QPS = 245774```
```Complete in 40.486 s; QPS = 246996```
```Complete in 42.019 s; QPS = 237989```
After
```Complete in 40.672 s; QPS = 245869```
```Complete in 44.622 s; QPS = 224107```
```Complete in 42.445 s; QPS = 235599```
db_bench results before this change, and with this change + https://github.com/facebook/rocksdb/issues/8213 and https://github.com/facebook/rocksdb/issues/8191 -
Commands
```./db_bench --benchmarks="fillseq,compact" -num=30000000 -key_size=32 -value_size=256 -use_direct_io_for_flush_and_compaction=true -db=/home/anand76/nvm_cache/db -partition_index_and_filters=true```
```./db_bench -db=/home/anand76/nvm_cache/db -use_existing_db=true -benchmarks=readrandom -num=30000000 -key_size=32 -value_size=256 -use_direct_reads=true -cache_size=1073741824 -cache_numshardbits=6 -cache_index_and_filter_blocks=true -read_random_exp_range=17 -statistics -partition_index_and_filters=true -threads=16 -duration=300```
Before
```
DB path: [/home/anand76/nvm_cache/db]
readrandom : 80.702 micros/op 198104 ops/sec; 54.4 MB/s (3708999 of 3708999 found)
```
```
DB path: [/home/anand76/nvm_cache/db]
readrandom : 87.124 micros/op 183625 ops/sec; 50.4 MB/s (3439999 of 3439999 found)
```
After
```
DB path: [/home/anand76/nvm_cache/db]
readrandom : 77.653 micros/op 206025 ops/sec; 56.6 MB/s (3866999 of 3866999 found)
```
```
DB path: [/home/anand76/nvm_cache/db]
readrandom : 84.962 micros/op 188299 ops/sec; 51.7 MB/s (3535999 of 3535999 found)
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8271
Reviewed By: zhichao-cao
Differential Revision: D28357511
Pulled By: anand1976
fbshipit-source-id: d1cfa236f00e649a18c53328be10a8062a4b6da2
Summary:
The functions will be used for remote compaction parameter
input and result.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8247
Test Plan: `make check`
Reviewed By: ajkr
Differential Revision: D28104680
Pulled By: jay-zhuang
fbshipit-source-id: c0a5178e6277125118384278efea2acbf90aa6cb
Summary:
Adds a new Cache::ApplyToAllEntries API that we expect to use
(in follow-up PRs) for efficiently gathering block cache statistics.
Notable features vs. old ApplyToAllCacheEntries:
* Includes key and deleter (in addition to value and charge). We could
have passed in a Handle but then more virtual function calls would be
needed to get the "fields" of each entry. We expect to use the 'deleter'
to identify the origin of entries, perhaps even more.
* Heavily tuned to minimize latency impact on operating cache. It
does this by iterating over small sections of each cache shard while
cycling through the shards.
* Supports tuning roughly how many entries to operate on for each
lock acquire and release, to control the impact on the latency of other
operations without excessive lock acquire & release. The right balance
can depend on the cost of the callback. Good default seems to be
around 256.
* There should be no need to disable thread safety. (I would expect
uncontended locks to be sufficiently fast.)
I have enhanced cache_bench to validate this approach:
* Reports a histogram of ns per operation, so we can look at the
ditribution of times, not just throughput (average).
* Can add a thread for simulated "gather stats" which calls
ApplyToAllEntries at a specified interval. We also generate a histogram
of time to run ApplyToAllEntries.
To make the iteration over some entries of each shard work as cleanly as
possible, even with resize between next set of entries, I have
re-arranged which hash bits are used for sharding and which for indexing
within a shard.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8225
Test Plan:
A couple of unit tests are added, but primary validation is manual, as
the primary risk is to performance.
The primary validation is using cache_bench to ensure that neither
the minor hashing changes nor the simulated stats gathering
significantly impact QPS or latency distribution. Note that adding op
latency histogram seriously impacts the benchmark QPS, so for a
fair baseline, we need the cache_bench changes (except remove simulated
stat gathering to make it compile). In short, we don't see any
reproducible difference in ops/sec or op latency unless we are gathering
stats nearly continuously. Test uses 10GB block cache with
8KB values to be somewhat realistic in the number of items to iterate
over.
Baseline typical output:
```
Complete in 92.017 s; Rough parallel ops/sec = 869401
Thread ops/sec = 54662
Operation latency (ns):
Count: 80000000 Average: 11223.9494 StdDev: 29.61
Min: 0 Median: 7759.3973 Max: 9620500
Percentiles: P50: 7759.40 P75: 14190.73 P99: 46922.75 P99.9: 77509.84 P99.99: 217030.58
------------------------------------------------------
[ 0, 1 ] 68 0.000% 0.000%
( 2900, 4400 ] 89 0.000% 0.000%
( 4400, 6600 ] 33630240 42.038% 42.038% ########
( 6600, 9900 ] 18129842 22.662% 64.700% #####
( 9900, 14000 ] 7877533 9.847% 74.547% ##
( 14000, 22000 ] 15193238 18.992% 93.539% ####
( 22000, 33000 ] 3037061 3.796% 97.335% #
( 33000, 50000 ] 1626316 2.033% 99.368%
( 50000, 75000 ] 421532 0.527% 99.895%
( 75000, 110000 ] 56910 0.071% 99.966%
( 110000, 170000 ] 16134 0.020% 99.986%
( 170000, 250000 ] 5166 0.006% 99.993%
( 250000, 380000 ] 3017 0.004% 99.996%
( 380000, 570000 ] 1337 0.002% 99.998%
( 570000, 860000 ] 805 0.001% 99.999%
( 860000, 1200000 ] 319 0.000% 100.000%
( 1200000, 1900000 ] 231 0.000% 100.000%
( 1900000, 2900000 ] 100 0.000% 100.000%
( 2900000, 4300000 ] 39 0.000% 100.000%
( 4300000, 6500000 ] 16 0.000% 100.000%
( 6500000, 9800000 ] 7 0.000% 100.000%
```
New, gather_stats=false. Median thread ops/sec of 5 runs:
```
Complete in 92.030 s; Rough parallel ops/sec = 869285
Thread ops/sec = 54458
Operation latency (ns):
Count: 80000000 Average: 11298.1027 StdDev: 42.18
Min: 0 Median: 7722.0822 Max: 6398720
Percentiles: P50: 7722.08 P75: 14294.68 P99: 47522.95 P99.9: 85292.16 P99.99: 228077.78
------------------------------------------------------
[ 0, 1 ] 109 0.000% 0.000%
( 2900, 4400 ] 793 0.001% 0.001%
( 4400, 6600 ] 34054563 42.568% 42.569% #########
( 6600, 9900 ] 17482646 21.853% 64.423% ####
( 9900, 14000 ] 7908180 9.885% 74.308% ##
( 14000, 22000 ] 15032072 18.790% 93.098% ####
( 22000, 33000 ] 3237834 4.047% 97.145% #
( 33000, 50000 ] 1736882 2.171% 99.316%
( 50000, 75000 ] 446851 0.559% 99.875%
( 75000, 110000 ] 68251 0.085% 99.960%
( 110000, 170000 ] 18592 0.023% 99.983%
( 170000, 250000 ] 7200 0.009% 99.992%
( 250000, 380000 ] 3334 0.004% 99.997%
( 380000, 570000 ] 1393 0.002% 99.998%
( 570000, 860000 ] 700 0.001% 99.999%
( 860000, 1200000 ] 293 0.000% 100.000%
( 1200000, 1900000 ] 196 0.000% 100.000%
( 1900000, 2900000 ] 69 0.000% 100.000%
( 2900000, 4300000 ] 32 0.000% 100.000%
( 4300000, 6500000 ] 10 0.000% 100.000%
```
New, gather_stats=true, 1 second delay between scans. Scans take about
1 second here so it's spending about 50% time scanning. Still the effect on
ops/sec and latency seems to be in the noise. Median thread ops/sec of 5 runs:
```
Complete in 91.890 s; Rough parallel ops/sec = 870608
Thread ops/sec = 54551
Operation latency (ns):
Count: 80000000 Average: 11311.2629 StdDev: 45.28
Min: 0 Median: 7686.5458 Max: 10018340
Percentiles: P50: 7686.55 P75: 14481.95 P99: 47232.60 P99.9: 79230.18 P99.99: 232998.86
------------------------------------------------------
[ 0, 1 ] 71 0.000% 0.000%
( 2900, 4400 ] 291 0.000% 0.000%
( 4400, 6600 ] 34492060 43.115% 43.116% #########
( 6600, 9900 ] 16727328 20.909% 64.025% ####
( 9900, 14000 ] 7845828 9.807% 73.832% ##
( 14000, 22000 ] 15510654 19.388% 93.220% ####
( 22000, 33000 ] 3216533 4.021% 97.241% #
( 33000, 50000 ] 1680859 2.101% 99.342%
( 50000, 75000 ] 439059 0.549% 99.891%
( 75000, 110000 ] 60540 0.076% 99.967%
( 110000, 170000 ] 14649 0.018% 99.985%
( 170000, 250000 ] 5242 0.007% 99.991%
( 250000, 380000 ] 3260 0.004% 99.995%
( 380000, 570000 ] 1599 0.002% 99.997%
( 570000, 860000 ] 1043 0.001% 99.999%
( 860000, 1200000 ] 471 0.001% 99.999%
( 1200000, 1900000 ] 275 0.000% 100.000%
( 1900000, 2900000 ] 143 0.000% 100.000%
( 2900000, 4300000 ] 60 0.000% 100.000%
( 4300000, 6500000 ] 27 0.000% 100.000%
( 6500000, 9800000 ] 7 0.000% 100.000%
( 9800000, 14000000 ] 1 0.000% 100.000%
Gather stats latency (us):
Count: 46 Average: 980387.5870 StdDev: 60911.18
Min: 879155 Median: 1033777.7778 Max: 1261431
Percentiles: P50: 1033777.78 P75: 1120666.67 P99: 1261431.00 P99.9: 1261431.00 P99.99: 1261431.00
------------------------------------------------------
( 860000, 1200000 ] 45 97.826% 97.826% ####################
( 1200000, 1900000 ] 1 2.174% 100.000%
Most recent cache entry stats:
Number of entries: 1295133
Total charge: 9.88 GB
Average key size: 23.4982
Average charge: 8.00 KB
Unique deleters: 3
```
Reviewed By: mrambacher
Differential Revision: D28295742
Pulled By: pdillinger
fbshipit-source-id: bbc4a552f91ba0fe10e5cc025c42cef5a81f2b95
Summary:
Added ParseType, SerializeType, and TypesAreEqual methods to OptionTypeInfo. These methods can be used for serialization and deserialization of basic types.
Change the MutableCF/DB Options to use this format.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8249
Reviewed By: jay-zhuang
Differential Revision: D28351190
Pulled By: mrambacher
fbshipit-source-id: 72a78643b804f2f0bf59c32ffefa63346672ad16
Summary:
This change enables a couple of things:
- Different ConfigOptions can have different registry/factory associated with it, thereby allowing things like a "Test" ConfigOptions versus a "Production"
- The ObjectRegistry is created fewer times and can be re-used
The ConfigOptions can also be initialized/constructed from a DBOptions, in which case it will grab some of its settings (Env, Logger) from the DBOptions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8166
Reviewed By: zhichao-cao
Differential Revision: D27657952
Pulled By: mrambacher
fbshipit-source-id: ae1d6200bb7ab127405cdeefaba43c7fe694dfdd
Summary:
From HISTORY.md release note:
- Allow `CompactionFilter`s to apply in more table file creation scenarios such as flush and recovery. For compatibility, `CompactionFilter`s by default apply during compaction. Users can customize this behavior by overriding `CompactionFilterFactory::ShouldFilterTableFileCreation()`.
- Removed unused structure `CompactionFilterContext`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8243
Test Plan: added unit tests
Reviewed By: pdillinger
Differential Revision: D28088089
Pulled By: ajkr
fbshipit-source-id: 0799be7908e3b39fea09fc3f1ab00e13ad817fae
Summary:
An early design of BackupEngine used stackable DB, so I guess a
DB had to opt-in to being backupable. Unfortunately the naming of that
obsolete design still infects our public API and implementation.
This change fixes the public API, with a deprecated
backward-compatibility header. `BackupableDBOptions` is renamed to
`BackupEngineOptions` (copy-replace in the public header) and
backup_engine.h replaces backupable_db.h (present for backward
compatibility). The only other change in backupable_db.h ->
backup_engine.h is cleaning up headers.
Later changes will fix the internal implementation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8274
Test Plan:
The internal implementation of BackupEngine uses the name
BackupEngineOptions, while the unit tests use the old name
BackupableDBOptions. This gives me confidence that both still work.
Reviewed By: mrambacher
Differential Revision: D28259471
Pulled By: pdillinger
fbshipit-source-id: a25dbe327b9772143488e7bb0ec7139ee42d0613
Summary:
Larger arena block size does provide the benefit of reducing allocation overhead, however it may cause other troubles. For example, allocator is more likely not to allocate them to physical memory and trigger page fault. Weighing the risk, we cap the arena block size to 1MB. Users can always use a larger value if they want.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7907
Test Plan: Run all existing tests
Reviewed By: pdillinger
Differential Revision: D26135269
fbshipit-source-id: b7f55afd03e6ee1d8715f90fa11b6c33944e9ea8
Summary:
In testing for https://github.com/facebook/rocksdb/issues/8225 I found cache_bench would crash with
-use_clock_cache, as well as db_bench -use_clock_cache, but not
single-threaded. Smaller cache size hits failure much faster. ASAN
reported the failuer as calling malloc_usable_size on the `key` pointer
of a ClockCache handle after it was reportedly freed. On detailed
inspection I found this bad sequence of operations for a cache entry:
state=InCache=1,refs=1
[thread 1] Start ClockCacheShard::Unref (from Release, no mutex)
[thread 1] Decrement ref count
state=InCache=1,refs=0
[thread 1] Suspend before CalcTotalCharge (no mutex)
[thread 2] Start UnsetInCache (from Insert, mutex held)
[thread 2] clear InCache bit
state=InCache=0,refs=0
[thread 2] Calls RecycleHandle (based on pre-updated state)
[thread 2] Returns to Insert which calls Cleanup which deletes `key`
[thread 1] Resume ClockCacheShard::Unref
[thread 1] Read `key` in CalcTotalCharge
To fix this, I've added a field to the handle to store the metadata
charge so that we can efficiently remember everything we need from
the handle in Unref. We must not read from the handle again if we
decrement the count to zero with InCache=1, which means we don't own
the entry and someone else could eject/overwrite it immediately.
Note before this change, on amd64 sizeof(Handle) == 56 even though there
are only 48 bytes of data. Grouping together the uint32_t fields would
cut it down to 48, but I've added another uint32_t, which takes it
back up to 56. Not a big deal.
Also fixed DisownData to cooperate with ASAN as in LRUCache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8261
Test Plan:
Manual + adding use_clock_cache to db_crashtest.py
Base performance
./cache_bench -use_clock_cache
Complete in 17.060 s; QPS = 2458513
New performance
./cache_bench -use_clock_cache
Complete in 17.052 s; QPS = 2459695
Any difference is easily buried in small noise.
Crash test shows still more bug(s) in ClockCache, so I'm expecting to
disable ClockCache from production code in a follow-up PR (if we
can't find and fix the bug(s))
Reviewed By: mrambacher
Differential Revision: D28207358
Pulled By: pdillinger
fbshipit-source-id: aa7a9322afc6f18f30e462c75dbbe4a1206eb294
Summary:
As the first part of the effort of having placing different files on different storage types, this change introduces several things:
(1) An experimental interface in FileSystem that specify temperature to a new file created.
(2) A test FileSystemWrapper, SimulatedHybridFileSystem, that simulates HDD for a file of "warm" temperature.
(3) A simple experimental feature ColumnFamilyOptions.bottommost_temperature. RocksDB would pass this value to FileSystem when creating any bottommost file.
(4) A db_bench parameter that applies the (2) and (3) to db_bench.
The motivation of the change is to introduce minimal changes that allow us to evolve tiered storage development.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8222
Test Plan:
./db_bench --benchmarks=fillrandom --write_buffer_size=2000000 -max_bytes_for_level_base=20000000 -level_compaction_dynamic_level_bytes --reads=100 -compaction_readahead_size=20000000 --reads=100000 -num=10000000
followed by
./db_bench --benchmarks=readrandom,stats --write_buffer_size=2000000 -max_bytes_for_level_base=20000000 -simulate_hybrid_fs_file=/tmp/warm_file_list -level_compaction_dynamic_level_bytes -compaction_readahead_size=20000000 --reads=500 --threads=16 -use_existing_db --num=10000000
and see results as expected.
Reviewed By: ajkr
Differential Revision: D28003028
fbshipit-source-id: 4724896d5205730227ba2f17c3fecb11261744ce
Summary:
Add `num_levels`, `is_bottommost`, and table file creation
`reason` to `FilterBuildingContext`, in anticipation of more powerful
Bloom-like filter support.
To support this, added `is_bottommost` and `reason` to
`TableBuilderOptions`, which allowed removing `reason` parameter from
`rocksdb::BuildTable`.
I attempted to remove `skip_filters` from `TableBuilderOptions`, because
filter construction decisions should arise from options, not one-off
parameters. I could not completely remove it because the public API for
SstFileWriter takes a `skip_filters` parameter, and translating this
into an option change would mean awkwardly replacing the table_factory
if it is BlockBasedTableFactory with new filter_policy=nullptr option.
I marked this public skip_filters option as deprecated because of this
oddity. (skip_filters on the read side probably makes sense.)
At least `skip_filters` is now largely hidden for users of
`TableBuilderOptions` and is no longer used for implementing the
optimize_filters_for_hits option. Bringing the logic for that option
closer to handling of FilterBuildingContext makes it more obvious that
hese two are using the same notion of "bottommost." (Planned:
configuration options for Bloom-like filters that generalize
`optimize_filters_for_hits`)
Recommended follow-up: Try to get away from "bottommost level" naming of
things, which is inaccurate (see
VersionStorageInfo::RangeMightExistAfterSortedRun), and move to
"bottommost run" or just "bottommost."
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8246
Test Plan:
extended an existing unit test to exercise and check various
filter building contexts. Also, existing tests for
optimize_filters_for_hits validate some of the "bottommost" handling,
which is now closely connected to FilterBuildingContext::is_bottommost
through TableBuilderOptions::is_bottommost
Reviewed By: mrambacher
Differential Revision: D28099346
Pulled By: pdillinger
fbshipit-source-id: 2c1072e29c24d4ac404c761a7b7663292372600a
Summary:
Greatly reduced the not-quite-copy-paste giant parameter lists
of rocksdb::NewTableBuilder, rocksdb::BuildTable,
BlockBasedTableBuilder::Rep ctor, and BlockBasedTableBuilder ctor.
Moved weird separate parameter `uint32_t column_family_id` of
TableFactory::NewTableBuilder into TableBuilderOptions.
Re-ordered parameters to TableBuilderOptions ctor, so that `uint64_t
target_file_size` is not randomly placed between uint64_t timestamps
(was easy to mix up).
Replaced a couple of fields of BlockBasedTableBuilder::Rep with a
FilterBuildingContext. The motivation for this change is making it
easier to pass along more data into new fields in FilterBuildingContext
(follow-up PR).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8240
Test Plan: ASAN make check
Reviewed By: mrambacher
Differential Revision: D28075891
Pulled By: pdillinger
fbshipit-source-id: fddb3dbb8260a0e8bdcbb51b877ebabf9a690d4f
Summary:
As previously coded, a Configurable extension would need access to code not in the public API. This change moves RegisterOptions into the Configurable class and therefore available to public extensions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8223
Reviewed By: anand1976
Differential Revision: D27960188
Pulled By: mrambacher
fbshipit-source-id: ac88b19397183df633902def5b5701b9b65fbf40
Summary:
Add new C APIs to create the JemallocNodumpAllocator and set it on a Cache object.
`make test` passes with and without `DISABLE_JEMALLOC=1`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8178
Reviewed By: jay-zhuang
Differential Revision: D27944631
Pulled By: ajkr
fbshipit-source-id: 2531729aa285a8985c58f22f093c4d53029c4a7b
Summary:
Add comment to DisableManualCompaction() which was missing.
Also explictly return from DBImpl::CompactRange() to avoid memtable flush when manual compaction is disabled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8186
Test Plan: Run existing unit tests.
Reviewed By: jay-zhuang
Differential Revision: D27744517
fbshipit-source-id: 449548a48905903b888dc9612bd17480f6596a71
Summary:
When WriteBufferManager is shared across DBs and column families
to maintain memory usage under a limit, OOMs have been observed when flush cannot
finish but writes continuously insert to memtables.
In order to avoid OOMs, when memory usage goes beyond buffer_limit_ and DBs tries to write,
this change will stall incoming writers until flush is completed and memory_usage
drops.
Design: Stall condition: When total memory usage exceeds WriteBufferManager::buffer_size_
(memory_usage() >= buffer_size_) WriterBufferManager::ShouldStall() returns true.
DBImpl first block incoming/future writers by calling write_thread_.BeginWriteStall()
(which adds dummy stall object to the writer's queue).
Then DB is blocked on a state State::Blocked (current write doesn't go
through). WBStallInterface object maintained by every DB instance is added to the queue of
WriteBufferManager.
If multiple DBs tries to write during this stall, they will also be
blocked when check WriteBufferManager::ShouldStall() returns true.
End Stall condition: When flush is finished and memory usage goes down, stall will end only if memory
waiting to be flushed is less than buffer_size/2. This lower limit will give time for flush
to complete and avoid continous stalling if memory usage remains close to buffer_size.
WriterBufferManager::EndWriteStall() is called,
which removes all instances from its queue and signal them to continue.
Their state is changed to State::Running and they are unblocked. DBImpl
then signal all incoming writers of that DB to continue by calling
write_thread_.EndWriteStall() (which removes dummy stall object from the
queue).
DB instance creates WBMStallInterface which is an interface to block and
signal DBs during stall.
When DB needs to be blocked or signalled by WriteBufferManager,
state_for_wbm_ state is changed accordingly (RUNNING or BLOCKED).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7898
Test Plan: Added a new test db/db_write_buffer_manager_test.cc
Reviewed By: anand1976
Differential Revision: D26093227
Pulled By: akankshamahajan15
fbshipit-source-id: 2bbd982a3fb7033f6de6153aa92a221249861aae
Summary:
This partially reverts commit 10196d7edc.
The problem with this change is because of important filter use cases:
FIFO compaction and SST writer. FIFO "compaction" always uses level 0 so
would only use Ribbon filters if specifically including level 0 for the
Ribbon filter policy. SST writer sets level_at_creation=-1 to indicate
unknown level, and this would be treated the same as level 0 unless
fixed.
We are keeping the part about committing to permanent schema, which is
only changes to API comments and HISTORY.md.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8212
Test Plan: CI
Reviewed By: jay-zhuang
Differential Revision: D27896468
Pulled By: pdillinger
fbshipit-source-id: 50a775f7cba5d64fb729d9b982e355864020596e