Summary:
Add timestamp support for MultiGet().
timestamp from readoptions is honored, and timestamps can be returned along with values.
MultiReadRandom perf test (10 minutes) on the same development machine ram drive with the same DB data shows no regression (within marge of error). The test is adapted from https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks.
base line (commit 17bef7d3a):
multireadrandom : 104.173 micros/op 307167 ops/sec; (5462999 of 5462999 found)
This PR:
multireadrandom : 104.199 micros/op 307095 ops/sec; (5307999 of 5307999 found)
.\db_bench --db=r:\rocksdb.github --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --cache_size=2147483648 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=r:\rocksdb.github\WAL_LOG --sync=0 --verify_checksum=1 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --duration=600 --benchmarks=multireadrandom --use_existing_db=1 --num=25000000 --threads=32 --allow_concurrent_memtable_write=0
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6483
Reviewed By: anand1976
Differential Revision: D20498373
Pulled By: riversand963
fbshipit-source-id: 8505f22bc40fd791bc7dd05e48d7e67c91edb627
Summary:
The current Env/FileSystem API separation has a couple of issues -
1. It requires the user to specify 2 options - ```Options::env``` and ```Options::file_system``` - which means they have to make code changes to benefit from the new APIs. Furthermore, there is a risk of accessing the same APIs in two different ways, through Env in the old way and through FileSystem in the new way. The two may not always match, for example, if env is ```PosixEnv``` and FileSystem is a custom implementation. Any stray RocksDB calls to env will use the ```PosixEnv``` implementation rather than the file_system implementation.
2. There needs to be a simple way for the FileSystem developer to instantiate an Env for backward compatibility purposes.
This PR solves the above issues and simplifies the migration in the following ways -
1. Embed a shared_ptr to the ```FileSystem``` in the ```Env```, and remove ```Options::file_system``` as a configurable option. This way, no code changes will be required in application code to benefit from the new API. The default Env constructor uses a ```LegacyFileSystemWrapper``` as the embedded ```FileSystem```.
1a. - This also makes it more robust by ensuring that even if RocksDB
has some stray calls to Env APIs rather than FileSystem, they will go
through the same object and thus there is no risk of getting out of
sync.
2. Provide a ```NewCompositeEnv()``` API that can be used to construct a
PosixEnv with a custom FileSystem implementation. This eliminates an
indirection to call Env APIs, and relieves the FileSystem developer of
the burden of having to implement wrappers for the Env APIs.
3. Add a couple of missing FileSystem APIs - ```SanitizeEnvOptions()``` and
```NewLogger()```
Tests:
1. New unit tests
2. make check and make asan_check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6552
Reviewed By: riversand963
Differential Revision: D20592038
Pulled By: anand1976
fbshipit-source-id: c3801ad4153f96d21d5a3ae26c92ba454d1bf1f7
Summary:
There are situations when RocksDB tries to recover, but the db is in an inconsistent state due to SST files referenced in the MANIFEST being missing. In this case, previous RocksDB will just fail the recovery and return a non-ok status.
This PR enables another possibility. During recovery, RocksDB checks possible MANIFEST files, and try to recover to the most recent state without missing table file. `VersionSet::Recover()` applies version edits incrementally and "materializes" a version only when this version does not reference any missing table file. After processing the entire MANIFEST, the version created last will be the latest version.
`DBImpl::Recover()` calls `VersionSet::Recover()`. Afterwards, WAL replay will *not* be performed.
To use this capability, set `options.best_efforts_recovery = true` when opening the db. Best-efforts recovery is currently incompatible with atomic flush.
Test plan (on devserver):
```
$make check
$COMPILE_WITH_ASAN=1 make all && make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6334
Reviewed By: anand1976
Differential Revision: D19778960
Pulled By: riversand963
fbshipit-source-id: c27ea80f29bc952e7d3311ecf5ee9c54393b40a8
Summary:
In Linux, when reopening DB with many SST files, profiling shows that 100% system cpu time spent for a couple of seconds for `GetLogicalBufferSize`. This slows down MyRocks' recovery time when site is down.
This PR introduces two new APIs:
1. `Env::RegisterDbPaths` and `Env::UnregisterDbPaths` lets `DB` tell the env when it starts or stops using its database directories . The `PosixFileSystem` takes this opportunity to set up a cache from database directories to the corresponding logical block sizes.
2. `LogicalBlockSizeCache` is defined only for OS_LINUX to cache the logical block sizes.
Other modifications:
1. rename `logical buffer size` to `logical block size` to be consistent with Linux terms.
2. declare `GetLogicalBlockSize` in `PosixHelper` to expose it to `PosixFileSystem`.
3. change the functions `IOError` and `IOStatus` in `env/io_posix.h` to have external linkage since they are used in other translation units too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6457
Test Plan:
1. A new unit test is added for `LogicalBlockSizeCache` in `env/io_posix_test.cc`.
2. A new integration test is added for `DB` operations related to the cache in `db/db_logical_block_size_cache_test.cc`.
`make check`
Differential Revision: D20131243
Pulled By: cheng-chang
fbshipit-source-id: 3077c50f8065c0bffb544d8f49fb10bba9408d04
Summary:
Add sequence_number_ptr to the checkpoint interface to expose the sequence number during taking the checkpoint. The number will be consistent with the seq # in rocksdb log.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5528
Test Plan: make check -j64
Reviewed By: Winger1994
Differential Revision: D16080209
fbshipit-source-id: 6dc3c7680287ee97d673c5e61f89aae1f43e33df
Summary:
Allow user to specify options.max_open_files != -1 with FIFO compaction.
If max_open_files != -1, not all table files are kept open.
In the past, FIFO style compaction requires all table files to be open in order
to read file creation time from table properties. Later, we added file creation
time to MANIFEST, making it possible to read file creation time without opening
file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6503
Test Plan: make check
Differential Revision: D20353758
Pulled By: riversand963
fbshipit-source-id: ba5c61a648419e47e9ef6d74e0e280e3ee24f296
Summary:
Preliminary support for iterator with user timestamp. Current implementation does not consider merge operator and reverse iterator. Auto compaction is also disabled in unit tests.
Create an iterator with timestamp.
```
...
read_opts.timestamp = &ts;
auto* iter = db->NewIterator(read_opts);
// target is key without timestamp.
for (iter->Seek(target); iter->Valid(); iter->Next()) {}
for (iter->SeekToFirst(); iter->Valid(); iter->Next()) {}
delete iter;
read_opts.timestamp = &ts1;
// lower_bound and upper_bound are without timestamp.
read_opts.iterate_lower_bound = &lower_bound;
read_opts.iterate_upper_bound = &upper_bound;
auto* iter1 = db->NewIterator(read_opts);
// Do Seek or SeekToFirst()
delete iter1;
```
Test plan (dev server)
```
$make check
```
Simple benchmarking (dev server)
1. The overhead introduced by this PR even when timestamp is disabled.
key size: 16 bytes
value size: 100 bytes
Entries: 1000000
Data reside in main memory, and try to stress iterator.
Repeated three times on master and this PR.
- Seek without next
```
./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3
```
master: 159047.0 ops/sec
this PR: 158922.3 ops/sec (2% drop in throughput)
- Seek and next 10 times
```
./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3 -seek_nexts=10
```
master: 109539.3 ops/sec
this PR: 107519.7 ops/sec (2% drop in throughput)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6255
Differential Revision: D19438227
Pulled By: riversand963
fbshipit-source-id: b66b4979486f8474619f4aa6bdd88598870b0746
Summary:
In most places in the code the variable names are spelled correctly as
COMMITTED but in a couple places not. This fixes them and ensures the
variable is always called COMMITTED everywhere.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6481
Differential Revision: D20306776
Pulled By: pdillinger
fbshipit-source-id: b6c1bfe41db559b4bc6955c530934460c07f7022
Summary:
In the current code base, we can use FaultInjectionTestEnv to simulate the env issue such as file write/read errors, which are used in most of the test. The PR https://github.com/facebook/rocksdb/issues/5761 introduce the File System as a new Env API. This PR implement the FaultInjectionTestFS, which can be used to simulate when File System has issues such as IO error. user can specify any IOStatus error as input, such that FS corresponding actions will return certain error to the caller.
A set of ErrorHandlerFSTests are introduced for testing
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6414
Test Plan: pass make asan_check, pass error_handler_fs_test.
Differential Revision: D20252421
Pulled By: zhichao-cao
fbshipit-source-id: e922038f8ce7e6d1da329fd0bba7283c4b779a21
Summary:
Added new Get() methods that return timestamp. Dummy implementation is given so that classes derived from DB don't need to be touched to provide their implementation. MultiGet is not included.
ReadRandom perf test (10 minutes) on the same development machine ram drive with the same DB data shows no regression (within marge of error). The test is adapted from https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks.
base line (commit 72ee067b9):
101.712 micros/op 314602 ops/sec; 36.0 MB/s (5658999 of 5658999 found)
This PR:
100.288 micros/op 319071 ops/sec; 36.5 MB/s (5674999 of 5674999 found)
./db_bench --db=r:\rocksdb.github --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --cache_size=2147483648 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=r:\rocksdb.github\WAL_LOG --sync=0 --verify_checksum=1 --delete_obsolete_files_period_micros=314572800 --max_background_compactions=4 --max_background_flushes=0 --level0_slowdown_writes_trigger=16 --level0_stop_writes_trigger=24 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --mmap_read=1 --mmap_write=0 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --duration=600 --benchmarks=readrandom --use_existing_db=1 --num=25000000 --threads=32
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6409
Differential Revision: D20200086
Pulled By: riversand963
fbshipit-source-id: 490edd74d924f62bd8ae9c29c2a6bbbb8410ca50
Summary:
Make kPageSize extern const size_t (used in draft https://github.com/facebook/rocksdb/issues/6427)
Make kLitteEndian constexpr bool
Clarify a couple of comments
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6443
Test Plan: make check, CI
Differential Revision: D20044558
Pulled By: pdillinger
fbshipit-source-id: e0c5cc13229c82726280dc0ddcba4078346b8418
Summary:
Adding a C API function to set `row_cache` on `rocksdb_options_t` as this functionality is missing.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6442
Differential Revision: D20036813
Pulled By: riversand963
fbshipit-source-id: c1fa95ea343345fbc1e57961d0d048e0e79be373
Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433
Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.
Differential Revision: D19977691
fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
Summary:
In the current code base, RocksDB generate the checksum for each block and verify the checksum at usage. Current PR enable SST file checksum. After a SST file is generated by Flush or Compaction, RocksDB generate the SST file checksum and store the checksum value and checksum method name in the vs_info and MANIFEST as part for the FileMetadata.
Added the enable_sst_file_checksum to Options to enable or disable file checksum. Added sst_file_checksum to Options such that user can plugin their own SST file checksum calculate method via overriding the SstFileChecksum class. The checksum information inlcuding uint32_t checksum value and a checksum name (string). A new tool is added to LDB such that user can dump out a list of file checksum information from MANIFEST. If user enables the file checksum but does not provide the sst_file_checksum instance, RocksDB will use the default crc32checksum implemented in table/sst_file_checksum_crc32c.h
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6216
Test Plan: Added the testing case in table_test and ldb_cmd_test to verify checksum is correct in different level. Pass make asan_check.
Differential Revision: D19171461
Pulled By: zhichao-cao
fbshipit-source-id: b2e53479eefc5bb0437189eaa1941670e5ba8b87
Summary:
It's logically correct for PinnableSlice to support move semantics to transfer ownership of the pinned memory region. This PR adds both move constructor and move assignment to PinnableSlice.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6374
Test Plan:
A set of unit tests for the move semantics are added in slice_test.
So `make slice_test && ./slice_test`.
Differential Revision: D19739254
Pulled By: cheng-chang
fbshipit-source-id: f898bd811bb05b2d87384ec58b645e9915e8e0b1
Summary:
The comments are for iterators, not Cleanable.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6371
Test Plan: no need
Differential Revision: D19727527
Pulled By: cheng-chang
fbshipit-source-id: c74aeffa27ea0ce15a36ff6f9694826712cd1c70
Summary:
When paranoid_checks is on, DBImpl::CheckConsistency() iterates over all sst files and calls Env::GetFileSize() for each of them. As far as I could understand, this is pretty arbitrary and doesn't affect correctness - if filesystem doesn't corrupt fsynced files, the file sizes will always match; if it does, it may as well corrupt contents as well as sizes, and rocksdb doesn't check contents on open.
If there are thousands of sst files, getting all their sizes takes a while. If, on top of that, Env is overridden to use some remote storage instead of local filesystem, it can be *really* slow and overload the remote storage service. This PR adds an option to not do GetFileSize(); instead it does GetChildren() for parent directory to check that all the expected sst files are at least present, but doesn't check their sizes.
We can't just disable paranoid_checks instead because paranoid_checks do a few other important things: make the DB read-only on write errors, print error messages on read errors, etc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6353
Test Plan: ran the added sanity check unit test. Will try it out in a LogDevice test cluster where the GetFileSize() calls are causing a lot of trouble.
Differential Revision: D19656425
Pulled By: al13n321
fbshipit-source-id: c2c421b367633033760d1f56747bad206d1fbf82
Summary:
The patch adds statistics support to the new BlobDB garbage collection implementation;
namely, it adds support for the following (pre-existing) tickers:
`BLOB_DB_GC_NUM_FILES`: the number of blob files obsoleted by the GC logic.
`BLOB_DB_GC_NUM_NEW_FILES`: the number of new blob files generated by the GC logic.
`BLOB_DB_GC_FAILURES`: the number of failed GC passes (where a GC pass is
equivalent to a (sub)compaction).
`BLOB_DB_GC_NUM_KEYS_RELOCATED`: the number of blobs relocated to new blob
files by the GC logic.
`BLOB_DB_GC_BYTES_RELOCATED`: the total size of blobs relocated to new blob files.
The tickers `BLOB_DB_GC_NUM_KEYS_OVERWRITTEN`, `BLOB_DB_GC_NUM_KEYS_EXPIRED`,
`BLOB_DB_GC_BYTES_OVERWRITTEN`, `BLOB_DB_GC_BYTES_EXPIRED`, and
`BLOB_DB_GC_MICROS` are not relevant for the new GC logic, and are thus marked
deprecated.
The patch also adds a couple of log messages that log the number and total size of
blobs encountered and relocated during a GC pass, as well as the number of blob
files created and obsoleted.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6296
Test Plan: Extended unit tests and used the BlobDB mode of `db_bench`.
Differential Revision: D19402513
Pulled By: ltamasi
fbshipit-source-id: d53d2bfbf4928a1db1e9346c67ebb9007b8932ec
Summary:
DBTest2.AutoPrefixMode1 doesn't pass because auto prefix mode is not supported there.
Fix it by disabling the test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6346
Test Plan: Run DBTest2.AutoPrefixMode1 in lite mode
Differential Revision: D19627486
fbshipit-source-id: fbde75260aeecb7e6fc406e09c19a71a95aa5f08
Summary:
Add a new option ReadOptions.auto_prefix_mode. When set to true, iterator should return the same result as total order seek, but may choose to do prefix seek internally, based on iterator upper bounds. Also fix two previous bugs when handling prefix extrator changes: (1) reverse iterator should not rely on upper bound to determine prefix. Fix it with skipping prefix check. (2) block-based filter is not handled properly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6314
Test Plan: (1) add a unit test; (2) add the check to stress test and run see whether it can pass at least one run.
Differential Revision: D19458717
fbshipit-source-id: 51c1bcc5cdd826c2469af201979a39600e779bce
Summary:
Adjusted history for 6.6.1 and 6.6.2, switched master version to 6.7.0.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6320
Differential Revision: D19499272
Pulled By: gfosco
fbshipit-source-id: 2bafb2456951f231e411e9c03aaa4c044f497684
Summary:
The function was left unimplemented. Although we currently don't have a use for that it was declared with an assert(0) to prevent mistakenly using the remove_prefix of the parent class. The function body with only assert(0) however causes issues with some compiler's warning levels. The patch implements the function to avoid the warning.
It also piggybacks some minor code warning for unnecessary semicolons after the function definition.s
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6330
Differential Revision: D19559062
Pulled By: maysamyabandeh
fbshipit-source-id: 3a022484f688c9abd4556e5412bcc2628ab96a00
Summary:
Help users that would benefit most from new Bloom filter
implementation by logging a warning that recommends the using
format_version >= 5.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6312
Test Plan:
$ (for BPK in 10 13 14 19 20 50; do ./filter_bench -quick -impl=0 -bits_per_key=$BPK -m_queries=1 2>&1; done) | grep 'its/key'
Bits/key actual: 10.0647
Bits/key actual: 13.0593
[WARN] [/block_based/filter_policy.cc:546] Using legacy Bloom filter with high (14) bits/key. Significant filter space and/or accuracy improvement is available with format_verion>=5.
Bits/key actual: 14.0581
[WARN] [/block_based/filter_policy.cc:546] Using legacy Bloom filter with high (19) bits/key. Significant filter space and/or accuracy improvement is available with format_verion>=5.
Bits/key actual: 19.0542
[WARN] [/block_based/filter_policy.cc:546] Using legacy Bloom filter with high (20) bits/key. Dramatic filter space and/or accuracy improvement is available with format_verion>=5.
Bits/key actual: 20.0584
[WARN] [/block_based/filter_policy.cc:546] Using legacy Bloom filter with high (50) bits/key. Dramatic filter space and/or accuracy improvement is available with format_verion>=5.
Bits/key actual: 50.0577
Differential Revision: D19457191
Pulled By: pdillinger
fbshipit-source-id: 073d94cde5c70e03a160f953e1100c15ea83eda4
Summary:
This PR adds a `rocksdb_options_set_atomic_flush` function to the C API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6307
Differential Revision: D19451313
Pulled By: ltamasi
fbshipit-source-id: 750495642ef55b1ea7e13477f85c38cd6574849c
Summary:
This makes it easier to call the functions from Rust as otherwise they require mutable types.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6283
Differential Revision: D19349991
Pulled By: wqfish
fbshipit-source-id: e8da7a75efe8cd97757baef8ca844a054f2519b4
Summary:
This is a continuation of https://github.com/facebook/rocksdb/pull/5320/files
I open a new mr for these purposes, half a year has past since the old mr is posted so it's almost impossible to fulfill some points below on the old mr, especially 5)
1) add validation modes for optimistic txns
2) modify unittests to test both modes
3) make format
4) refine hash functor
5) push to master
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6240
Differential Revision: D19301296
fbshipit-source-id: 5b5b3cbd39558f43947f7d2dec6cd31a06386edb
Summary:
Fix an error message when CURRENT is not found.
Test plan (dev server)
```
make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6264
Differential Revision: D19300699
Pulled By: riversand963
fbshipit-source-id: 303fa206386a125960ecca1dbdeff07422690caf
Summary:
Add oldest snapshot sequence property, so we can use `db.GetProperty("rocksdb.oldest-snapshot-sequence")` to get the sequence number of the oldest snapshot.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6228
Differential Revision: D19264145
Pulled By: maysamyabandeh
fbshipit-source-id: 67fbe5304d89cbc475bd404e30d1299f7b11c010
Summary:
It seems that the C-API doesn't expose the range delete functionality at the moment, so add the API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6259
Differential Revision: D19290320
Pulled By: pdillinger
fbshipit-source-id: 3f403a4c3446d2042d55f1ece7cdc9c040f40c27
Summary:
I found that CleanupSuperVersion() may block Get() for 30ms+ (per MemTable is 256MB).
Then I found "delete sv" in ~SuperVersion() takes the time.
The backtrace looks like this
DBImpl::GetImpl() -> DBImpl::ReturnAndCleanupSuperVersion() ->
DBImpl::CleanupSuperVersion() : delete sv; -> ~SuperVersion()
I think it's better to delete in a background thread, please review it。
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6146
Differential Revision: D18972066
fbshipit-source-id: 0f7b0b70b9bb1e27ad6fc1c8a408fbbf237ae08c
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
Summary:
The patch adds logic that relocates live blobs from the oldest N non-TTL
blob files as they are encountered during compaction (assuming the BlobDB
configuration option `enable_garbage_collection` is `true`), where N is defined
as the number of immutable non-TTL blob files multiplied by the value of
a new BlobDB configuration option called `garbage_collection_cutoff`.
(The default value of this parameter is 0.25, that is, by default the valid blobs
residing in the oldest 25% of immutable non-TTL blob files are relocated.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6121
Test Plan: Added unit test and tested using the BlobDB mode of `db_bench`.
Differential Revision: D18785357
Pulled By: ltamasi
fbshipit-source-id: 8c21c512a18fba777ec28765c88682bb1a5e694e
Summary:
Currently the default txn write policy in crash tests is WRITE_PREPARED. The patch randomly picks the write policy at the start of the crash test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6158
Test Plan:
```
make -j32 crash_test_with_txn
```
Differential Revision: D18946307
Pulled By: maysamyabandeh
fbshipit-source-id: f77d7a94f99a08791ef9626da153d284bf521950
Summary:
Adds two missing functions to the C-API:
- `rocksdb_block_based_options_set_data_block_index_type`
- `rocksdb_block_based_options_set_data_block_hash_ratio`
This enables users in other languages to enjoy the new(-ish) feature.
The changes here are partially overlapping with [another PR](https://github.com/facebook/rocksdb/pull/5630) but are more focused on the DataBlock indexing options.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6101
Differential Revision: D18765639
fbshipit-source-id: 4a8947e71b179f26fa1eb83c267dd47ee64ac3b3
Summary:
To reflect changes in PR https://github.com/facebook/rocksdb/issues/6072
This comment also implies that a seemingly valid use-case for
max_valid_backups_to_open is flawed: even if you only want to add a new
backup without trying to delete, you might need to clean up after a
backup creation that never finished. To clean up properly requires
opening all backups to get proper ref counts on shared files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6105
Test Plan: code comment only
Differential Revision: D18736716
Pulled By: pdillinger
fbshipit-source-id: 2447c0000eefe3a4ca606926bfe922a8456b0cb7
Summary:
options.periodic_compaction_seconds isn't supported when options.max_open_files != -1. It's because that the information of file creation time is stored in table properties and are not guaranteed to be loaded unless options.max_open_files = -1. Relax this constraint by storing the information in manifest.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6090
Test Plan: Pass all existing tests; Modify an existing test to force the manifest value to take 0 to simulate backward compatibility case; manually open the DB generated with the change by release 4.2.
Differential Revision: D18702268
fbshipit-source-id: 13e0bd94f546498a04f3dc5fc0d9dff5125ec9eb
Summary:
This change enables custom implementations of FilterPolicy to
wrap a variety of NewBloomFilterPolicy and select among them based on
contextual information such as table level and compaction style.
* Moves FilterBuildingContext to public API and elaborates it with more
useful data. (It would be nice to put more general options-like data,
but at the time this object is constructed, we are using internal APIs
ImmutableCFOptions and MutableCFOptions and don't have easy access to
ColumnFamilyOptions that I can tell.)
* Renames BloomFilterPolicy::GetFilterBitsBuilderInternal to
GetBuilderWithContext, because it's now public.
* Plumbs through the table's "level_at_creation" for filter building
context.
* Simplified some tests by adding GetBuilder() to
MockBlockBasedTableTester.
* Adds test as DBBloomFilterTest.ContextCustomFilterPolicy, including
sample wrapper class LevelAndStyleCustomFilterPolicy.
* Fixes a cross-test bug in DBBloomFilterTest.OptimizeFiltersForHits
where it does not reset perf context.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6088
Test Plan: make check, valgrind on db_bloom_filter_test
Differential Revision: D18697817
Pulled By: pdillinger
fbshipit-source-id: 5f987a2d7b07cc7a33670bc08ca6b4ca698c1cf4
Summary:
There's no technological impediment to allowing the Bloom
filter bits/key to be non-integer (fractional/decimal) values, and it
provides finer control over the memory vs. accuracy trade-off. This is
especially handy in using the format_version=5 Bloom filter in place
of the old one, because bits_per_key=9.55 provides the same accuracy as
the old bits_per_key=10.
This change not only requires refining the logic for choosing the best
num_probes for a given bits/key setting, it revealed a flaw in that logic.
As bits/key gets higher, the best num_probes for a cache-local Bloom
filter is closer to bpk / 2 than to bpk * 0.69, the best choice for a
standard Bloom filter. For example, at 16 bits per key, the best
num_probes is 9 (FP rate = 0.0843%) not 11 (FP rate = 0.0884%).
This change fixes and refines that logic (for the format_version=5
Bloom filter only, just in case) based on empirical tests to find
accuracy inflection points between each num_probes.
Although bits_per_key is now specified as a double, the new Bloom
filter converts/rounds this to "millibits / key" for predictable/precise
internal computations. Just in case of unforeseen compatibility
issues, we round to the nearest whole number bits / key for the
legacy Bloom filter, so as not to unlock new behaviors for it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6092
Test Plan: unit tests included
Differential Revision: D18711313
Pulled By: pdillinger
fbshipit-source-id: 1aa73295f152a995328cb846ef9157ae8a05522a
Summary:
By default options.ttl is disabled. We believe a better default will be 30 days, which means deleted data the database will be removed from SST files slightly after 30 days, for most of the cases.
Make the default UINT64_MAX - 1 to indicate that it is not overridden by users.
Change periodic_compaction_seconds to be UINT64_MAX - 1 to UINT64_MAX too to be consistent. Also fix a small bug in the previous periodic_compaction_seconds default code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6073
Test Plan: Add unit tests for it.
Differential Revision: D18669626
fbshipit-source-id: 957cd4374cafc1557d45a0ba002010552a378cc8
Summary:
`options.ttl` is now supported in universal compaction, similar to how periodic compactions are implemented in PR https://github.com/facebook/rocksdb/issues/5970 .
Setting `options.ttl` will simply set `options.periodic_compaction_seconds` to execute the periodic compactions code path.
Discarded PR https://github.com/facebook/rocksdb/issues/4749 in lieu of this.
This is a short term work-around/hack of falling back to periodic compactions when ttl is set.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6071
Test Plan: Added a unit test.
Differential Revision: D18668336
Pulled By: sagar0
fbshipit-source-id: e75f5b81ba949f77ef9eff05e44bb1c757f58612
Summary:
Production:
* Fixes GarbageCollect (and auto-GC triggered by PurgeOldBackups, DeleteBackup, or CreateNewBackup) to clean up backup directory independent of current settings (except max_valid_backups_to_open; see issue https://github.com/facebook/rocksdb/issues/4997) and prior settings used with same backup directory.
* Fixes GarbageCollect (and auto-GC) not to attempt to remove "." and ".." entries from directories.
* Clarifies contract with users in modifying BackupEngine operations. In short, leftovers from any incomplete operation are cleaned up by any subsequent call to that same kind of operation (PurgeOldBackups and DeleteBackup considered the same kind of operation). GarbageCollect is available to clean up after all kinds. (NB: right now PurgeOldBackups and DeleteBackup will clean up after incomplete CreateNewBackup, but we aren't promising to continue that behavior.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6023
Test Plan:
* Refactors open parameters to use an option enum, for readability, etc. (Also fixes an unused parameter bug in the redundant OpenDBAndBackupEngineShareWithChecksum.)
* Fixes an apparent bug in ShareTableFilesWithChecksumsTransition in which old backup data was destroyed in the transition to be tested. That test is now augmented to ensure GarbageCollect (or auto-GC) does not remove shared files when BackupEngine is opened with share_table_files=false.
* Augments DeleteTmpFiles test to ensure that CreateNewBackup does auto-GC when an incompletely created backup is detected.
Differential Revision: D18453559
Pulled By: pdillinger
fbshipit-source-id: 5e54e7b08d711b161bc9c656181012b69a8feac4
Summary:
Adds an improved, replacement Bloom filter implementation (FastLocalBloom) for full and partitioned filters in the block-based table. This replacement is faster and more accurate, especially for high bits per key or millions of keys in a single filter.
Speed
The improved speed, at least on recent x86_64, comes from
* Using fastrange instead of modulo (%)
* Using our new hash function (XXH3 preview, added in a previous commit), which is much faster for large keys and only *slightly* slower on keys around 12 bytes if hashing the same size many thousands of times in a row.
* Optimizing the Bloom filter queries with AVX2 SIMD operations. (Added AVX2 to the USE_SSE=1 build.) Careful design was required to support (a) SIMD-optimized queries, (b) compatible non-SIMD code that's simple and efficient, (c) flexible choice of number of probes, and (d) essentially maximized accuracy for a cache-local Bloom filter. Probes are made eight at a time, so any number of probes up to 8 is the same speed, then up to 16, etc.
* Prefetching cache lines when building the filter. Although this optimization could be applied to the old structure as well, it seems to balance out the small added cost of accumulating 64 bit hashes for adding to the filter rather than 32 bit hashes.
Here's nominal speed data from filter_bench (200MB in filters, about 10k keys each, 10 bits filter data / key, 6 probes, avg key size 24 bytes, includes hashing time) on Skylake DE (relatively low clock speed):
$ ./filter_bench -quick -impl=2 -net_includes_hashing # New Bloom filter
Build avg ns/key: 47.7135
Mixed inside/outside queries...
Single filter net ns/op: 26.2825
Random filter net ns/op: 150.459
Average FP rate %: 0.954651
$ ./filter_bench -quick -impl=0 -net_includes_hashing # Old Bloom filter
Build avg ns/key: 47.2245
Mixed inside/outside queries...
Single filter net ns/op: 63.2978
Random filter net ns/op: 188.038
Average FP rate %: 1.13823
Similar build time but dramatically faster query times on hot data (63 ns to 26 ns), and somewhat faster on stale data (188 ns to 150 ns). Performance differences on batched and skewed query loads are between these extremes as expected.
The only other interesting thing about speed is "inside" (query key was added to filter) vs. "outside" (query key was not added to filter) query times. The non-SIMD implementations are substantially slower when most queries are "outside" vs. "inside". This goes against what one might expect or would have observed years ago, as "outside" queries only need about two probes on average, due to short-circuiting, while "inside" always have num_probes (say 6). The problem is probably the nastily unpredictable branch. The SIMD implementation has few branches (very predictable) and has pretty consistent running time regardless of query outcome.
Accuracy
The generally improved accuracy (re: Issue https://github.com/facebook/rocksdb/issues/5857) comes from a better design for probing indices
within a cache line (re: Issue https://github.com/facebook/rocksdb/issues/4120) and improved accuracy for millions of keys in a single filter from using a 64-bit hash function (XXH3p). Design details in code comments.
Accuracy data (generalizes, except old impl gets worse with millions of keys):
Memory bits per key: FP rate percent old impl -> FP rate percent new impl
6: 5.70953 -> 5.69888
8: 2.45766 -> 2.29709
10: 1.13977 -> 0.959254
12: 0.662498 -> 0.411593
16: 0.353023 -> 0.0873754
24: 0.261552 -> 0.0060971
50: 0.225453 -> ~0.00003 (less than 1 in a million queries are FP)
Fixes https://github.com/facebook/rocksdb/issues/5857
Fixes https://github.com/facebook/rocksdb/issues/4120
Unlike the old implementation, this implementation has a fixed cache line size (64 bytes). At 10 bits per key, the accuracy of this new implementation is very close to the old implementation with 128-byte cache line size. If there's sufficient demand, this implementation could be generalized.
Compatibility
Although old releases would see the new structure as corrupt filter data and read the table as if there's no filter, we've decided only to enable the new Bloom filter with new format_version=5. This provides a smooth path for automatic adoption over time, with an option for early opt-in.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6007
Test Plan: filter_bench has been used thoroughly to validate speed, accuracy, and correctness. Unit tests have been carefully updated to exercise new and old implementations, as well as the logic to select an implementation based on context (format_version).
Differential Revision: D18294749
Pulled By: pdillinger
fbshipit-source-id: d44c9db3696e4d0a17caaec47075b7755c262c5f
Summary:
Add a new API that allows a user to call MultiGet specifying multiple keys belonging to different column families. This is mainly useful for users who want to do a consistent read of keys across column families, with the added performance benefits of batching and returning values using PinnableSlice.
As part of this change, the code in the original multi-column family MultiGet for acquiring the super versions has been refactored into a separate function that can be used by both, the batching and the non-batching versions of MultiGet.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5816
Test Plan:
make check
make asan_check
asan_crash_test
Differential Revision: D18408676
Pulled By: anand1976
fbshipit-source-id: 933e7bec91dd70e7b633be4ff623a1116cc28c8d
Summary:
Only if there is a crash, power failure, or I/O error in
DeleteBackup, shared or private files from the backup might be left
behind that are not cleaned up by PurgeOldBackups or DeleteBackup-- only
by GarbageCollect. This makes the BackupEngine API "leaky by default."
Even if it means a modest performance hit, I think we should make
Delete and Purge do as they say, with ongoing best effort: i.e. future
calls will attempt to finish any incomplete work from earlier calls.
This change does that by having DeleteBackup and PurgeOldBackups do a
GarbageCollect, unless (to minimize performance hit) this BackupEngine
has already done a GarbageCollect and there have been no
deletion-related I/O errors in that GarbageCollect or since then.
Rejected alternative 1: remove meta file last instead of first. This would in theory turn partially deleted backups into corrupted backups, but code changes would be needed to allow the missing files and consider it acceptably corrupt, rather than failing to open the BackupEngine. This might be a reasonable choice, but I mostly rejected it because it doesn't solve the legacy problem of cleaning up existing lingering files.
Rejected alternative 2: use a deletion marker file. If deletion started with creating a file that marks a backup as flagged for deletion, then we could reliably detect partially deleted backups and efficiently finish removing them. In addition to not solving the legacy problem, this could be precarious if there's a disk full situation, and we try to create a new file in order to delete some files. Ugh.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6015
Test Plan: Updated unit tests
Differential Revision: D18401333
Pulled By: pdillinger
fbshipit-source-id: 12944e372ce6809f3f5a4c416c3b321a8927d925