Summary:
The patch adds a new BlobDB configuration option `blob_compaction_readahead_size`
that can be used to enable prefetching data from blob files during compaction.
This is important when using storage with higher latencies like HDDs or remote filesystems.
If enabled, prefetching is used for all cases when blobs are read during compaction,
namely garbage collection, compaction filters (when the existing value has to be read from
a blob file), and `Merge` (when the value of the base `Put` is stored in a blob file).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9187
Test Plan: Ran `make check` and the stress/crash test.
Reviewed By: riversand963
Differential Revision: D32565512
Pulled By: ltamasi
fbshipit-source-id: 87be9cebc3aa01cc227bec6b5f64d827b8164f5d
Summary:
Note: This PR is the 4th part of a bigger PR stack (https://github.com/facebook/rocksdb/pull/9073) and will rebase/merge only after the first three PRs (https://github.com/facebook/rocksdb/pull/9070, https://github.com/facebook/rocksdb/pull/9071, https://github.com/facebook/rocksdb/pull/9130) merge.
**Context:**
Similar to https://github.com/facebook/rocksdb/pull/8428, this PR is to track memory usage during (new) Bloom Filter (i.e,FastLocalBloom) and Ribbon Filter (i.e, Ribbon128) construction, moving toward the goal of [single global memory limit using block cache capacity](https://github.com/facebook/rocksdb/wiki/Projects-Being-Developed#improving-memory-efficiency). It also constrains the size of the banding portion of Ribbon Filter during construction by falling back to Bloom Filter if that banding is, at some point, larger than the available space in the cache under `LRUCacheOptions::strict_capacity_limit=true`.
The option to turn on this feature is `BlockBasedTableOptions::reserve_table_builder_memory = true` which by default is set to `false`. We [decided](https://github.com/facebook/rocksdb/pull/9073#discussion_r741548409) not to have separate option for separate memory user in table building therefore their memory accounting are all bundled under one general option.
**Summary:**
- Reserved/released cache for creation/destruction of three main memory users with the passed-in `FilterBuildingContext::cache_res_mgr` during filter construction:
- hash entries (i.e`hash_entries`.size(), we bucket-charge hash entries during insertion for performance),
- banding (Ribbon Filter only, `bytes_coeff_rows` +`bytes_result_rows` + `bytes_backtrack`),
- final filter (i.e, `mutable_buf`'s size).
- Implementation details: in order to use `CacheReservationManager::CacheReservationHandle` to account final filter's memory, we have to store the `CacheReservationManager` object and `CacheReservationHandle` for final filter in `XXPH3BitsFilterBuilder` as well as explicitly delete the filter bits builder when done with the final filter in block based table.
- Added option fo run `filter_bench` with this memory reservation feature
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9073
Test Plan:
- Added new tests in `db_bloom_filter_test` to verify filter construction peak cache reservation under combination of `BlockBasedTable::Rep::FilterType` (e.g, `kFullFilter`, `kPartitionedFilter`), `BloomFilterPolicy::Mode`(e.g, `kFastLocalBloom`, `kStandard128Ribbon`, `kDeprecatedBlock`) and `BlockBasedTableOptions::reserve_table_builder_memory`
- To address the concern for slow test: tests with memory reservation under `kFullFilter` + `kStandard128Ribbon` and `kPartitionedFilter` take around **3000 - 6000 ms** and others take around **1500 - 2000 ms**, in total adding **20000 - 25000 ms** to the test suit running locally
- Added new test in `bloom_test` to verify Ribbon Filter fallback on large banding in FullFilter
- Added test in `filter_bench` to verify that this feature does not significantly slow down Bloom/Ribbon Filter construction speed. Local result averaged over **20** run as below:
- FastLocalBloom
- baseline `./filter_bench -impl=2 -quick -runs 20 | grep 'Build avg'`:
- **Build avg ns/key: 29.56295** (DEBUG_LEVEL=1), **29.98153** (DEBUG_LEVEL=0)
- new feature (expected to be similar as above)`./filter_bench -impl=2 -quick -runs 20 -reserve_table_builder_memory=true | grep 'Build avg'`:
- **Build avg ns/key: 30.99046** (DEBUG_LEVEL=1), **30.48867** (DEBUG_LEVEL=0)
- new feature of RibbonFilter with fallback (expected to be similar as above) `./filter_bench -impl=2 -quick -runs 20 -reserve_table_builder_memory=true -strict_capacity_limit=true | grep 'Build avg'` :
- **Build avg ns/key: 31.146975** (DEBUG_LEVEL=1), **30.08165** (DEBUG_LEVEL=0)
- Ribbon128
- baseline `./filter_bench -impl=3 -quick -runs 20 | grep 'Build avg'`:
- **Build avg ns/key: 129.17585** (DEBUG_LEVEL=1), **130.5225** (DEBUG_LEVEL=0)
- new feature (expected to be similar as above) `./filter_bench -impl=3 -quick -runs 20 -reserve_table_builder_memory=true | grep 'Build avg' `:
- **Build avg ns/key: 131.61645** (DEBUG_LEVEL=1), **132.98075** (DEBUG_LEVEL=0)
- new feature of RibbonFilter with fallback (expected to be a lot faster than above due to fallback) `./filter_bench -impl=3 -quick -runs 20 -reserve_table_builder_memory=true -strict_capacity_limit=true | grep 'Build avg'` :
- **Build avg ns/key: 52.032965** (DEBUG_LEVEL=1), **52.597825** (DEBUG_LEVEL=0)
- And the warning message of `"Cache reservation for Ribbon filter banding failed due to cache full"` is indeed logged to console.
Reviewed By: pdillinger
Differential Revision: D31991348
Pulled By: hx235
fbshipit-source-id: 9336b2c60f44d530063da518ceaf56dac5f9df8e
Summary:
Adds changes to DBOptions (comparable to ColumnFamilyOptions) to allow some option values to be ignored on rehydration from the Options file. This is necessary for some customizable classes that were not registered with the ObjectRegistry but are saved/restored from the Options file.
All tests pass. Will run check_format_compatible.sh shortly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9045
Reviewed By: zhichao-cao
Differential Revision: D31761664
Pulled By: mrambacher
fbshipit-source-id: 300c2251639cce2b223481c3bb2a63877b1f3766
Summary:
The current BlobDB garbage collection logic works by relocating the valid
blobs from the oldest blob files as they are encountered during compaction,
and cleaning up blob files once they contain nothing but garbage. However,
with sufficiently skewed workloads, it is theoretically possible to end up in a
situation when few or no compactions get scheduled for the SST files that contain
references to the oldest blob files, which can lead to increased space amp due
to the lack of GC.
In order to efficiently handle such workloads, the patch adds a new BlobDB
configuration option called `blob_garbage_collection_force_threshold`,
which signals to BlobDB to schedule targeted compactions for the SST files
that keep alive the oldest batch of blob files if the overall ratio of garbage in
the given blob files meets the threshold *and* all the given blob files are
eligible for GC based on `blob_garbage_collection_age_cutoff`. (For example,
if the new option is set to 0.9, targeted compactions will get scheduled if the
sum of garbage bytes meets or exceeds 90% of the sum of total bytes in the
oldest blob files, assuming all affected blob files are below the age-based cutoff.)
The net result of these targeted compactions is that the valid blobs in the oldest
blob files are relocated and the oldest blob files themselves cleaned up (since
*all* SST files that rely on them get compacted away).
These targeted compactions are similar to periodic compactions in the sense
that they force certain SST files that otherwise would not get picked up to undergo
compaction and also in the sense that instead of merging files from multiple levels,
they target a single file. (Note: such compactions might still include neighboring files
from the same level due to the need of having a "clean cut" boundary but they never
include any files from any other level.)
This functionality is currently only supported with the leveled compaction style
and is inactive by default (since the default value is set to 1.0, i.e. 100%).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8994
Test Plan: Ran `make check` and tested using `db_bench` and the stress/crash tests.
Reviewed By: riversand963
Differential Revision: D31489850
Pulled By: ltamasi
fbshipit-source-id: 44057d511726a0e2a03c5d9313d7511b3f0c4eab
Summary:
Made SliceTransform into a Customizable class.
Would be nice to write a test that stored and used a custom transform in an SST table.
There are a set of tests (DBBlockFliterTest.PrefixExtractor*, SamePrefixTest.InDomainTest, PrefixTest.PrefixAndWholeKeyTest that run the same with or without a SliceTransform/PrefixFilter. Is this expected?
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8641
Reviewed By: zhichao-cao
Differential Revision: D31142793
Pulled By: mrambacher
fbshipit-source-id: bb08672fccbfdc263dcae21f25a62307e1facda1
Summary:
This PR does the following:
-> Makes the MemTableRepFactory into a Customizable class and creatable/configurable via CreateFromString
-> Makes the existing implementations compatible with configurations
-> Moves the "SpecialRepFactory" test class into testutil, accessible via the ObjectRegistry or a NewSpecial API
New tests were added to validate the functionality and all existing tests pass. db_bench and memtablerep_bench were hand-tested to verify the functionality in those tools.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8419
Reviewed By: zhichao-cao
Differential Revision: D29558961
Pulled By: mrambacher
fbshipit-source-id: 81b7229636e4e649a0c914e73ac7b0f8454c931c
Summary:
The atomic variable "is_prepared_" was keeping Configurable objects from being copy-constructed. Removed the atomic to allow copies.
Since the variable is only changed from false to true (and never back), there is no reason it had to be atomic.
Added tests that simple Configurable and Customizable objects can be put on the stack and copied.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8704
Reviewed By: anand1976
Differential Revision: D30530526
Pulled By: ltamasi
fbshipit-source-id: 4dd4439b3e5ad7fa396573d0b25d9fb709160576
Summary:
This is essentially resurrection and fixing of the part of
https://github.com/facebook/rocksdb/issues/8198 that was reverted in https://github.com/facebook/rocksdb/issues/8212, using data added in https://github.com/facebook/rocksdb/issues/8246. Basically,
when configuring Ribbon filter, you can specify an LSM level before which
Bloom will be used instead of Ribbon. But Bloom is only considered for
Leveled and Universal compaction styles and file going into a known LSM
level. This way, SST file writer, FIFO compaction, etc. use Ribbon filter as
you would expect with NewRibbonFilterPolicy.
So that this can be controlled with a single int value and so that flushes
can be distinguished from intra-L0, we consider flush to go to level -1 for
the purposes of this option. (Explained in API comment.)
I also expect the most common and recommended Ribbon configuration to
use Bloom during flush, to minimize slowing down writes and because according
to my estimates, Ribbon only pays off if the structure lives in memory for
more than an hour. Thus, I have changed the default for NewRibbonFilterPolicy
to be this mild hybrid configuration. I don't really want to add something like
NewHybridFilterPolicy because at least the mild hybrid configuration (Bloom for
flush, Ribbon otherwise) should be considered a natural choice.
C APIs also updated, but because they don't support overloading,
rocksdb_filterpolicy_create_ribbon is kept pure ribbon for clarity and
rocksdb_filterpolicy_create_ribbon_hybrid must be called for a hybrid
configuration. While touching C API, I changed bits per key options from
int to double.
BuiltinFilterPolicy is needed so that LevelThresholdFilterPolicy doesn't inherit
unused fields from BloomFilterPolicy.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8679
Test Plan: new + updated tests, including crash test
Reviewed By: jay-zhuang
Differential Revision: D30445797
Pulled By: pdillinger
fbshipit-source-id: 6f5aeddfd6d79f7e55493b563c2d1d2d568892e1
Summary:
- Fix issue with OptionType::Vector when the nested item is a Customizable with no names
- Fix issue with OptionType::Vector to appropriately wrap the elements in a Vector;
- Fix an issue with nested Customizable object with a null immutable object still appearing in the mutable options;
- Fix/Add tests for null/empty customizable objects
- Move the RegisterTestObjects from customizable_test into testutil.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8566
Reviewed By: zhichao-cao
Differential Revision: D30303724
Pulled By: mrambacher
fbshipit-source-id: 33fa8ea2a3b663210cb356da05e64aab7585b1b5
Summary:
Changes the API of the MemPurge process: the `bool experimental_allow_mempurge` and `experimental_mempurge_policy` flags have been replaced by a `double experimental_mempurge_threshold` option.
This change of API reflects another major change introduced in this PR: the MemPurgeDecider() function now works by sampling the memtables being flushed to estimate the overall amount of useful payload (payload minus the garbage), and then compare this useful payload estimate with the `double experimental_mempurge_threshold` value.
Therefore, when the value of this flag is `0.0` (default value), mempurge is simply deactivated. On the other hand, a value of `DBL_MAX` would be equivalent to always going through a mempurge regardless of the garbage ratio estimate.
At the moment, a `double experimental_mempurge_threshold` value else than 0.0 or `DBL_MAX` is opnly supported`with the `SkipList` memtable representation.
Regarding the sampling, this PR includes the introduction of a `MemTable::UniqueRandomSample` function that collects (approximately) random entries from the memtable by using the new `SkipList::Iterator::RandomSeek()` under the hood, or by iterating through each memtable entry, depending on the target sample size and the total number of entries.
The unit tests have been readapted to support this new API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8628
Reviewed By: pdillinger
Differential Revision: D30149315
Pulled By: bjlemaire
fbshipit-source-id: 1feef5390c95db6f4480ab4434716533d3947f27
Summary:
- Changed MergeOperator, CompactionFilter, and CompactionFilterFactory into Customizable classes.
- Added Options/Configurable/Object Registration for TTL and Cassandra variants
- Changed the StringAppend MergeOperators to accept a string delimiter rather than a simple char. Made the delimiter into a configurable option
- Added tests for new functionality
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8481
Reviewed By: zhichao-cao
Differential Revision: D30136050
Pulled By: mrambacher
fbshipit-source-id: 271d1772835935b6773abaf018ee71e42f9491af
Summary:
- Added Type/CreateFromString
- Added ability to load EventListeners to DBOptions
- Since EventListeners did not previously have a Name(), defaulted to "". If there is no name, the listener cannot be loaded from the ObjectRegistry.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8473
Reviewed By: zhichao-cao
Differential Revision: D29901488
Pulled By: mrambacher
fbshipit-source-id: 2d3a4aa6db1562ac03e7ad41b360e3521d486254
Summary:
Implement an experimental feature called "MemPurge", which consists in purging "garbage" bytes out of a memtable and reuse the memtable struct instead of making it immutable and eventually flushing its content to storage.
The prototype is by default deactivated and is not intended for use. It is intended for correctness and validation testing. At the moment, the "MemPurge" feature can be switched on by using the `options.experimental_allow_mempurge` flag. For this early stage, when the allow_mempurge flag is set to `true`, all the flush operations will be rerouted to perform a MemPurge. This is a temporary design decision that will give us the time to explore meaningful heuristics to use MemPurge at the right time for relevant workloads . Moreover, the current MemPurge operation only supports `Puts`, `Deletes`, `DeleteRange` operations, and handles `Iterators` as well as `CompactionFilter`s that are invoked at flush time .
Three unit tests are added to `db_flush_test.cc` to test if MemPurge works correctly (and checks that the previously mentioned operations are fully supported thoroughly tested).
One noticeable design decision is the timing of the MemPurge operation in the memtable workflow: for this prototype, the mempurge happens when the memtable is switched (and usually made immutable). This is an inefficient process because it implies that the entirety of the MemPurge operation happens while holding the db_mutex. Future commits will make the MemPurge operation a background task (akin to the regular flush operation) and aim at drastically enhancing the performance of this operation. The MemPurge is also not fully "WAL-compatible" yet, but when the WAL is full, or when the regular MemPurge operation fails (or when the purged memtable still needs to be flushed), a regular flush operation takes place. Later commits will also correct these behaviors.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8454
Reviewed By: anand1976
Differential Revision: D29433971
Pulled By: bjlemaire
fbshipit-source-id: 6af48213554e35048a7e03816955100a80a26dc5
Summary:
If a Customizable option was not mutable, it would still appear in the list of mutable options when serialized. This meant that when the immutable options were used to configure another immutable object, an "option not changeable" status would be returned.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8457
Reviewed By: zhichao-cao
Differential Revision: D29428298
Pulled By: mrambacher
fbshipit-source-id: 3945b0b822f8e5955a7c5590fe64dfd5bc1fe6a0
Summary:
Marked the Ribbon filter and optimize_filters_for_memory features
as production-ready, each enabling memory savings for Bloom-like filters.
Use `NewRibbonFilterPolicy` in place of `NewBloomFilterPolicy` to use
Ribbon filters instead of Bloom, or `ribbonfilter` in place of
`bloomfilter` in configuration string.
Some small refactoring in db_stress.
Removed/refactored unused code in db_bench, in part preparing for future
default possibly being different from "disabled."
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8408
Test Plan:
Lots of prior automated, ad-hoc, and "real world" testing.
Updated tests for new API names. Quick db_bench test:
bloom fillrandom
77730 ops/sec
rocksdb.block.cache.filter.bytes.insert COUNT : 89929384
ribbon fillrandom
71492 ops/sec
rocksdb.block.cache.filter.bytes.insert COUNT : 64531384
Reviewed By: mrambacher
Differential Revision: D29140805
Pulled By: pdillinger
fbshipit-source-id: d742c922722421678f95ad85eeb0aaebc9f5e49a
Summary:
Added ParseType, SerializeType, and TypesAreEqual methods to OptionTypeInfo. These methods can be used for serialization and deserialization of basic types.
Change the MutableCF/DB Options to use this format.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8249
Reviewed By: jay-zhuang
Differential Revision: D28351190
Pulled By: mrambacher
fbshipit-source-id: 72a78643b804f2f0bf59c32ffefa63346672ad16
Summary:
This change enables a couple of things:
- Different ConfigOptions can have different registry/factory associated with it, thereby allowing things like a "Test" ConfigOptions versus a "Production"
- The ObjectRegistry is created fewer times and can be re-used
The ConfigOptions can also be initialized/constructed from a DBOptions, in which case it will grab some of its settings (Env, Logger) from the DBOptions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8166
Reviewed By: zhichao-cao
Differential Revision: D27657952
Pulled By: mrambacher
fbshipit-source-id: ae1d6200bb7ab127405cdeefaba43c7fe694dfdd
Summary:
The ImmutableCFOptions contained a bunch of fields that belonged to the ImmutableDBOptions. This change cleans that up by introducing an ImmutableOptions struct. Following the pattern of Options struct, this class inherits from the DB and CFOption structs (of the Immutable form).
Only one structural change (the ImmutableCFOptions::fs was changed to a shared_ptr from a raw one) is in this PR. All of the other changes involve moving the member variables from the ImmutableCFOptions into the ImmutableOptions and changing member variables or function parameters as required for compilation purposes.
Follow-on PRs may do a further clean-up of the code, such as renaming variables (such as "ImmutableOptions cf_options") and potentially eliminating un-needed function parameters (there is no longer a need to pass both an ImmutableDBOptions and an ImmutableOptions to a function).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8262
Reviewed By: pdillinger
Differential Revision: D28226540
Pulled By: mrambacher
fbshipit-source-id: 18ae71eadc879dedbe38b1eb8e6f9ff5c7147dbf
Summary:
This PR is a first step at attempting to clean up some of the Mutable/Immutable Options code. With this change, a DBOption and a ColumnFamilyOption can be reconstructed from their Mutable and Immutable equivalents, respectively.
readrandom tests do not show any performance degradation versus master (though both are slightly slower than the current 6.19 release).
There are still fields in the ImmutableCFOptions that are not CF options but DB options. Eventually, I would like to move those into an ImmutableOptions (= ImmutableDBOptions+ImmutableCFOptions). But that will be part of a future PR to minimize changes and disruptions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8176
Reviewed By: pdillinger
Differential Revision: D27954339
Pulled By: mrambacher
fbshipit-source-id: ec6b805ba9afe6e094bffdbd76246c2d99aa9fad
Summary:
This partially reverts commit 10196d7edc.
The problem with this change is because of important filter use cases:
FIFO compaction and SST writer. FIFO "compaction" always uses level 0 so
would only use Ribbon filters if specifically including level 0 for the
Ribbon filter policy. SST writer sets level_at_creation=-1 to indicate
unknown level, and this would be treated the same as level 0 unless
fixed.
We are keeping the part about committing to permanent schema, which is
only changes to API comments and HISTORY.md.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8212
Test Plan: CI
Reviewed By: jay-zhuang
Differential Revision: D27896468
Pulled By: pdillinger
fbshipit-source-id: 50a775f7cba5d64fb729d9b982e355864020596e
Summary:
Since the Ribbon filter schema seems good (compatible back to
6.15.0), this change commits to long term support of the SST schema,
even though we expect the API for enabling Ribbon to change (still
called NewExperimentalRibbonFilterPolicy).
This also adds support for "hybrid" configuration in which some levels
use Bloom (higher levels, lower numbered) for speed and the rest use
Ribbon (lower levels, higher numbered) for memory space efficiency.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8198
Test Plan: unit test added, crash test support
Reviewed By: jay-zhuang
Differential Revision: D27831232
Pulled By: pdillinger
fbshipit-source-id: 90e528677689474d293ed6710b42ba89fbd5b5ab
Summary:
For dictionary compression, we need to collect some representative samples of the data to be compressed, which we use to either generate or train (when `CompressionOptions::zstd_max_train_bytes > 0`) a dictionary. Previously, the strategy was to buffer all the data blocks during flush, and up to the target file size during compaction. That strategy allowed us to randomly pick samples from as wide a range as possible that'd be guaranteed to land in a single output file.
However, some users try to make huge files in memory-constrained environments, where this strategy can cause OOM. This PR introduces an option, `CompressionOptions::max_dict_buffer_bytes`, that limits how much data blocks are buffered before we switch to unbuffered mode (which means creating the per-SST dictionary, writing out the buffered data, and compressing/writing new blocks as soon as they are built). It is not strict as we currently buffer more than just data blocks -- also keys are buffered. But it does make a step towards giving users predictable memory usage.
Related changes include:
- Changed sampling for dictionary compression to select unique data blocks when there is limited availability of data blocks
- Made use of `BlockBuilder::SwapAndReset()` to save an allocation+memcpy when buffering data blocks for building a dictionary
- Changed `ParseBoolean()` to accept an input containing characters after the boolean. This is necessary since, with this PR, a value for `CompressionOptions::enabled` is no longer necessarily the final component in the `CompressionOptions` string.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7970
Test Plan:
- updated `CompressionOptions` unit tests to verify limit is respected (to the extent expected in the current implementation) in various scenarios of flush/compaction to bottommost/non-bottommost level
- looked at jemalloc heap profiles right before and after switching to unbuffered mode during flush/compaction. Verified memory usage in buffering is proportional to the limit set.
Reviewed By: pdillinger
Differential Revision: D26467994
Pulled By: ajkr
fbshipit-source-id: 3da4ef9fba59974e4ef40e40c01611002c861465
Summary:
Added a "only_mutable_options" flag to the ConfigOptions. When set, the Configurable methods will only look at/update options that are marked as kMutable.
Fixed DB::SetOptions to allow for the update of any mutable TableFactory options. Fixes https://github.com/facebook/rocksdb/issues/7385.
Added tests for the new flag. Updated HISTORY.md
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7936
Reviewed By: akankshamahajan15
Differential Revision: D26389646
Pulled By: mrambacher
fbshipit-source-id: 6dc247f6e999fa2814059ebbd0af8face109fea0
Summary:
Memtable bloom filter is useful in many use cases. A default value on with conservative 1.5% memory can benefit more use cases than use cases impacted.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6584
Test Plan: Run all existing tests.
Reviewed By: pdillinger
Differential Revision: D20626739
fbshipit-source-id: 1dd45532b932139552519b8c2682bd954550c2f9
Summary:
Change the StringEnv and related classes to be based on FileSystem APIs rather than the corresponding Env ones. The StringSink and StringSource classes were changed to be based on the corresponding FS file classes.
Part of a cleanup to use the newer interfaces. This change also eliminates some of the casts/wrappers to LegacyFile classes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7786
Reviewed By: jay-zhuang
Differential Revision: D25761460
Pulled By: anand1976
fbshipit-source-id: 428ae8e32b3db97dbeeca08c9d3bb0d9d4d3a38f
Summary:
A temporary hack to work around a bug in 6.10, 6.11, 6.12, 6.13 and
6.14. The bug will write out 8 bytes to OPTIONS file from the starting
address of BlockBasedTableOptions.read_amp_bytes_per_bit which is
actually a uint32. Consequently, the value of read_amp_bytes_per_bit
written in the OPTIONS file is wrong. From 6.15, RocksDB will
try to parse the read_amp_bytes_per_bit from OPTIONS file as a uint32.
To be able to load OPTIONS file generated by affected releases before
the fix, we need to manually parse read_amp_bytes_per_bit with this hack.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7659
Test Plan:
Generate a db with current 6.14.fb (head at b6db05dbb5). Maybe use db_stress.
Checkout this PR, run
```
~/rocksdb/ldb --db=. --try_load_options --ignore_unknown_options idump --count_only
```
Expect success, and should not see
```
Failed: Invalid argument: Error parsing read_amp_bytes_per_bit:17179869184
```
Also
make check
Reviewed By: anand1976
Differential Revision: D24954752
Pulled By: riversand963
fbshipit-source-id: c7b802fc3e52acd050a4fc1cd475016122234394
Summary:
Added experimental public API for Ribbon filter:
NewExperimentalRibbonFilterPolicy(). This experimental API will
take a "Bloom equivalent" bits per key, and configure the Ribbon
filter for the same FP rate as Bloom would have but ~30% space
savings. (Note: optimize_filters_for_memory is not yet implemented
for Ribbon filter. That can be added with no effect on schema.)
Internally, the Ribbon filter is configured using a "one_in_fp_rate"
value, which is 1 over desired FP rate. For example, use 100 for 1%
FP rate. I'm expecting this will be used in the future for configuring
Bloom-like filters, as I expect people to more commonly hold constant
the filter accuracy and change the space vs. time trade-off, rather than
hold constant the space (per key) and change the accuracy vs. time
trade-off, though we might make that available.
### Benchmarking
```
$ ./filter_bench -impl=2 -quick -m_keys_total_max=200 -average_keys_per_filter=100000 -net_includes_hashing
Building...
Build avg ns/key: 34.1341
Number of filters: 1993
Total size (MB): 238.488
Reported total allocated memory (MB): 262.875
Reported internal fragmentation: 10.2255%
Bits/key stored: 10.0029
----------------------------
Mixed inside/outside queries...
Single filter net ns/op: 18.7508
Random filter net ns/op: 258.246
Average FP rate %: 0.968672
----------------------------
Done. (For more info, run with -legend or -help.)
$ ./filter_bench -impl=3 -quick -m_keys_total_max=200 -average_keys_per_filter=100000 -net_includes_hashing
Building...
Build avg ns/key: 130.851
Number of filters: 1993
Total size (MB): 168.166
Reported total allocated memory (MB): 183.211
Reported internal fragmentation: 8.94626%
Bits/key stored: 7.05341
----------------------------
Mixed inside/outside queries...
Single filter net ns/op: 58.4523
Random filter net ns/op: 363.717
Average FP rate %: 0.952978
----------------------------
Done. (For more info, run with -legend or -help.)
```
168.166 / 238.488 = 0.705 -> 29.5% space reduction
130.851 / 34.1341 = 3.83x construction time for this Ribbon filter vs. lastest Bloom filter (could make that as little as about 2.5x for less space reduction)
### Working around a hashing "flaw"
bloom_test discovered a flaw in the simple hashing applied in
StandardHasher when num_starts == 1 (num_slots == 128), showing an
excessively high FP rate. The problem is that when many entries, on the
order of number of hash bits or kCoeffBits, are associated with the same
start location, the correlation between the CoeffRow and ResultRow (for
efficiency) can lead to a solution that is "universal," or nearly so, for
entries mapping to that start location. (Normally, variance in start
location breaks the effective association between CoeffRow and
ResultRow; the same value for CoeffRow is effectively different if start
locations are different.) Without kUseSmash and with num_starts > 1 (thus
num_starts ~= num_slots), this flaw should be completely irrelevant. Even
with 10M slots, the chances of a single slot having just 16 (or more)
entries map to it--not enough to cause an FP problem, which would be local
to that slot if it happened--is 1 in millions. This spreadsheet formula
shows that: =1/(10000000*(1 - POISSON(15, 1, TRUE)))
As kUseSmash==false (the setting for Standard128RibbonBitsBuilder) is
intended for CPU efficiency of filters with many more entries/slots than
kCoeffBits, a very reasonable work-around is to disallow num_starts==1
when !kUseSmash, by making the minimum non-zero number of slots
2*kCoeffBits. This is the work-around I've applied. This also means that
the new Ribbon filter schema (Standard128RibbonBitsBuilder) is not
space-efficient for less than a few hundred entries. Because of this, I
have made it fall back on constructing a Bloom filter, under existing
schema, when that is more space efficient for small filters. (We can
change this in the future if we want.)
TODO: better unit tests for this case in ribbon_test, and probably
update StandardHasher for kUseSmash case so that it can scale nicely to
small filters.
### Other related changes
* Add Ribbon filter to stress/crash test
* Add Ribbon filter to filter_bench as -impl=3
* Add option string support, as in "filter_policy=experimental_ribbon:5.678;"
where 5.678 is the Bloom equivalent bits per key.
* Rename internal mode BloomFilterPolicy::kAuto to kAutoBloom
* Add a general BuiltinFilterBitsBuilder::CalculateNumEntry based on
binary searching CalculateSpace (inefficient), so that subclasses
(especially experimental ones) don't have to provide an efficient
implementation inverting CalculateSpace.
* Minor refactor FastLocalBloomBitsBuilder for new base class
XXH3pFilterBitsBuilder shared with new Standard128RibbonBitsBuilder,
which allows the latter to fall back on Bloom construction in some
extreme cases.
* Mostly updated bloom_test for Ribbon filter, though a test like
FullBloomTest::Schema is a next TODO to ensure schema stability
(in case this becomes production-ready schema as it is).
* Add some APIs to ribbon_impl.h for configuring Ribbon filters.
Although these are reasonably covered by bloom_test, TODO more unit
tests in ribbon_test
* Added a "tool" FindOccupancyForSuccessRate to ribbon_test to get data
for constructing the linear approximations in GetNumSlotsFor95PctSuccess.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7658
Test Plan:
Some unit tests updated but other testing is left TODO. This
is considered experimental but laying down schema compatibility as early
as possible in case it proves production-quality. Also tested in
stress/crash test.
Reviewed By: jay-zhuang
Differential Revision: D24899349
Pulled By: pdillinger
fbshipit-source-id: 9715f3e6371c959d923aea8077c9423c7a9f82b8
Summary:
This patch simply adds a couple of options that will enable users to
configure garbage collection when using the integrated BlobDB
implementation. The actual GC logic will be added in a separate step.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7661
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D24906544
Pulled By: ltamasi
fbshipit-source-id: ee0e056a712a4b4475cd90de8b27d969bd61b7e1
Summary:
Further refinement of the earlier PR. Now the Status is NotFound with a subcode of PathNotFound. Also the existing functions for options parsing/loading are reverted to return InvalidArgument no matter in which way the user-provided arguments are deemed invalid.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7563
Reviewed By: zhichao-cao
Differential Revision: D24422491
Pulled By: ajkr
fbshipit-source-id: ba6b237cd0584d3f925c5ba0d349aeb8c250af67
Summary:
Make LoadLatestOptions return PathNotFound if the options file does not exist. Added tests for the LoadOptions related methods.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7554
Reviewed By: akankshamahajan15
Differential Revision: D24298985
Pulled By: zhichao-cao
fbshipit-source-id: c9ae3cb12fc4a5bbef07743e1c1300f98a2441b3
Summary:
This option determines whether WALs will be tracked in MANIFEST and verified on recovery.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7275
Test Plan:
db_options_test
options_test
Reviewed By: pdillinger
Differential Revision: D23181418
Pulled By: cheng-chang
fbshipit-source-id: 5dd1cdc166f3dfc1c93c094df4a2f7734e3b4547
Summary:
This PR merges the functionality of making the ColumnFamilyOptions, TableFactory, and DBOptions into Configurable into a single PR, resolving any merge conflicts
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5753
Reviewed By: ajkr
Differential Revision: D23385030
Pulled By: zhichao-cao
fbshipit-source-id: 8b977a7731556230b9b8c5a081b98e49ee4f160a
Summary:
This PR does a few things:
- The "compression_opts" and "bottom_compression_opts" can now be read/written as name/value pairs of options (instead of only a colon-separated list;
- These options can now be read/written to the Options file;
- The parallel_threads value can now be set (either in the colon or name-value format).
The compression options are now stored and treated as a OptionTypeInfo::Struct by the options system, meaning they can be read and written like the other structs. This change allows them to be read/written easily to the options file.
Additionally, the colon-format was extended to allow support for setting parallel threads. Tests were added to test all of the option settings via the optional parameters in the colon format.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6817
Reviewed By: ajkr
Differential Revision: D22396004
Pulled By: zhichao-cao
fbshipit-source-id: 38bcf74b7e9cd5bc2a84540fac2e9ba4f765b2c8
Summary:
The OptionTypeInfo::Vector method allows a vector<T> to be converted to/from strings via the options.
The kVectorInt and kVectorCompressionType vectors were replaced with this methodology.
As part of this change, the NextToken method was added to the OptionTypeInfo. This method was refactored from code within the StringToMap function.
Future types that could use this functionality include the EventListener vectors.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6424
Reviewed By: cheng-chang
Differential Revision: D21832368
Pulled By: zhichao-cao
fbshipit-source-id: e1ca766faff139d54e6e8407a9ec09ece6517439
Summary:
Added code for generically handing structs to OptionTypeInfo. A struct is a collection of variables handled by their own map of OptionTypeInfos. Examples of structs include Compaction and Cache options.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6425
Reviewed By: siying
Differential Revision: D21668789
Pulled By: zhichao-cao
fbshipit-source-id: 064b110de39dadf82361ed4663f7ac1a535b0b07
Summary:
Tried making Status object enforce that it is checked in some way. In cases it is not checked, `PermitUncheckedError()` must be called explicitly.
Added a way to run tests (`ASSERT_STATUS_CHECKED=1 make -j48 check`) on a
whitelist. The effort appears significant to get each test to pass with
this assertion, so I only fixed up enough to get one test (`options_test`)
working and added it to the whitelist.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6798
Reviewed By: pdillinger
Differential Revision: D21377404
Pulled By: ajkr
fbshipit-source-id: 73236f9c8df38f01cf24ecac4a6d1661b72d077e
Summary:
Add methods and constructors for handling enums to the OptionTypeInfo. This change allows enums to be converted/compared without adding a special "type" to the OptionType.
This change addresses a couple of issues:
- It allows new enumerated types to be added to the options without editing the OptionType base class (and related methods)
- It standardizes the procedure for adding enumerated types to the options, reducing potential mistakes
- It moves the enum maps to the location where they are used, allowing them to be static file members rather than global values
- It reduces the number of types and cases that need to be handled in the various OptionType methods
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6423
Reviewed By: siying
Differential Revision: D21408713
Pulled By: zhichao-cao
fbshipit-source-id: fc492af285d011822578b95d186a0fce25d35626
Summary:
The current way of implementing CompressionOptions.parallel_threads introduces a format change. We plan to change CompressionOptions's serailization format to a new JSON-like format, which would be another format change. We would like to consolidate the two format changes into one, rather than making some users to change twice. Hold CompressionOptions.parallel_threads from being supported by option string for now. Will add it back after the general CompressionOptions's format change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6782
Test Plan: Run all existing tests.
Reviewed By: zhichao-cao
Differential Revision: D21338614
fbshipit-source-id: bca2dac3cb37d4e6e64b52cbbe8ea749cd848685
Summary:
Added functions for parsing, serializing, and comparing elements to OptionTypeInfo. These functions allow all of the special cases that could not be handled directly in the map of OptionTypeInfo to be moved into the map. Using these functions, every type can be handled via the map rather than special cased.
By adding these functions, the code for handling options can become more standardized (fewer special cases) and (eventually) handled completely by common classes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6422
Test Plan: pass make check
Reviewed By: siying
Differential Revision: D21269005
Pulled By: zhichao-cao
fbshipit-source-id: 9ba71c721a38ebf9ee88259d60bd81b3282b9077
Summary:
The methods in convenience.h are used to compare/convert objects to/from strings. There is a mishmash of parameters in use here with more needed in the future. This PR replaces those parameters with a single structure.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6389
Reviewed By: siying
Differential Revision: D21163707
Pulled By: zhichao-cao
fbshipit-source-id: f807b4cc7e2b0af3871536b69546b2604dfa81bd
Summary:
This PR adds support for pipelined & parallel compression optimization for `BlockBasedTableBuilder`. This optimization makes block building, block compression and block appending a pipeline, and uses multiple threads to accelerate block compression. Users can set `CompressionOptions::parallel_threads` greater than 1 to enable compression parallelism.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6262
Reviewed By: ajkr
Differential Revision: D20651306
fbshipit-source-id: 62125590a9c15b6d9071def9dc72589c1696a4cb
Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433
Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.
Differential Revision: D19977691
fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
Summary:
clang analysis shows following warning:
options/options_test.cc:1554:24: warning: The left operand of '-' is a garbage value
(file_size - 1) / readahead_size + 1);
~~~~~~~~~ ^
Explicitly initialize file_size and add an assertion to make clang analysis happy.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6398
Test Plan: Run "make analysis" and see the warning goes away.
Differential Revision: D19819662
fbshipit-source-id: 1589ea91c0c8f78242538f01448e4ad0e5fbc219
Summary:
Right, when reading from option files, no readahead is used and 8KB buffer is used. It might introduce high latency if the file system provide high latency and doesn't do readahead. Instead, introduce a readahead to the file. When calling inside DB, infer the value from options.log_readahead. Otherwise, a default 512KB readahead size is used.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6372
Test Plan: Add --log_readahead_size in db_bench. Run it with several options and observe read size from option files using strace.
Differential Revision: D19727739
fbshipit-source-id: e6d8053b0a64259abc087f1f388b9cd66fa8a583
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f