rocksdb

Commit Graph

Author	SHA1	Message	Date
Yanqin Jin	d10c5c08d3	Remove iter_start_seqnum and preserve_deletes (#9430 ) Summary: According to https://github.com/facebook/rocksdb/blob/6.27.fb/db/db_impl/db_impl.cc#L2896:L2911 and https://github.com/facebook/rocksdb/blob/6.27.fb/db/db_impl/db_impl_open.cc#L203:L208, we are going to remove `iter_start_seqnum` and `preserve_deletes` starting from RocksDB 7.0 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9430 Test Plan: make check and CI Reviewed By: ajkr Differential Revision: D33753639 Pulled By: riversand963 fbshipit-source-id: c80aab8e8d8fc33e52472fed524ed703d0ffc8b6	3 years ago
Akanksha Mahajan	74ccd1931e	Remove deprecated option DBOptions::skip_log_error_on_recovery (#9434 ) Summary: In RocksDB DBOptions::skip_log_error_on_recovery is marked as "NOT SUPPORTED" for a long time, and setting this option does not have any effect on the behavior of RocksDB library. Therefore, we are removing it in the upcoming 7.0 release. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9434 Test Plan: CircleCI Reviewed By: ajkr Differential Revision: D33763015 Pulled By: akankshamahajan15 fbshipit-source-id: 11f09643298da6c02d3dcdb090b996f4c3cfdd76	3 years ago
Jay Zhuang	22321e1027	Remove unused API base_background_compactions (#9462 ) Summary: The API is deprecated long time ago. Clean up the codebase by removing it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9462 Test Plan: CI, fake release: D33835220 Reviewed By: riversand963 Differential Revision: D33835103 Pulled By: jay-zhuang fbshipit-source-id: 6d2dc12c8e7fdbe2700865a3e61f0e3f78bd8184	3 years ago
Hui Xiao	1e0e883ca5	Remove deprecated API AdvancedColumnFamilyOptions::soft_rate_limit/hard_rate_limit (#9452 ) Summary: Context/Summary: AdvancedColumnFamilyOptions::soft_rate_limit/hard_rate_limit have been marked as deprecated and it's time to actually remove the code. - Keep `soft_rate_limit`/`hard_rate_limit` in `cf_mutable_options_type_info` to prevent throwing `InvalidArgument` in `GetColumnFamilyOptionsFromMap` when reading an option file still with these options (e.g, old option file generated from RocksDB before the deprecation) - Keep `soft_rate_limit`/`hard_rate_limit` in under `OptionsOldApiTest.GetOptionsFromMapTest` to test the case mentioned above. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9452 Test Plan: Rely on my eyeball and CI Reviewed By: ajkr Differential Revision: D33804938 Pulled By: hx235 fbshipit-source-id: 133d49f7ec5238d7efceeb0a3122a5792a2b9945	3 years ago
mrambacher	37ec9d0c12	Improve performance of SliceTransform::AsString (#9401 ) Summary: 1. Removed the options from the Capped/Fixed SliceTransforms. Instead these classes are created with id.number. This allows the GetID() id to be calculated and stored at class construction time. This change puts the construction back to similar to how it was prior to the Customizable changes for SliceTransform. 2. Improve the performance of AsString by using the ID only if there are no option properties (which is the case for all of the builtin transforms). Ran tests of calling AsString in a loop 5M times and found approximately a 10x performance increase vs the original code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9401 Reviewed By: pdillinger Differential Revision: D33668672 Pulled By: mrambacher fbshipit-source-id: d0075912c6ece8ed754ee543bc6b0b49a169b309	3 years ago
Jay Zhuang	3e27add385	Fix a backward compatibility issue (#9456 ) Summary: Fix a backward compatibility issue caused by removing `purge_redundant_kvs_while_flush`. Reserve the option internally. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9456 Test Plan: CI: https://app.circleci.com/pipelines/github/facebook/rocksdb/11122/workflows/b7bc0f35-1be8-432c-9292-79125e22ecc7/jobs/280595 Reviewed By: ajkr, ltamasi Differential Revision: D33808474 Pulled By: jay-zhuang fbshipit-source-id: 7c3b553bc8e85c8a560514e8e460a2dbaf25718d	3 years ago
Siddhartha Roychowdhury	c27ca23644	Add option for WAL compression algorithm (#9432 ) Summary: Add an option to set the WAL compression algorithm - wal_compression. TODO: WAL compression is not implemented and will only support zstd initially. Will be added in subsequent diffs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9432 Reviewed By: pdillinger Differential Revision: D33797275 Pulled By: sidroyc fbshipit-source-id: 8db81d9c9cea5e2e4f1445d3aecad8106137b8e7	3 years ago
Jay Zhuang	961d8dacf2	Remove unused option purge_redundant_kvs_while_flush (#9429 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9429 Test Plan: fake release for test: D33754513 Reviewed By: riversand963 Differential Revision: D33753637 Pulled By: jay-zhuang fbshipit-source-id: 18db4701e8f28dda8f1ab660c2be9890a8312c12	3 years ago
Jay Zhuang	022b400cba	Make `bottommost_temperature` dynamically changeable (#9402 ) Summary: Make `AdvancedColumnFamilyOptions.bottommost_temperature` dynamically changeable with `SetOptions` API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9402 Test Plan: added unittest Reviewed By: siying Differential Revision: D33674487 Pulled By: jay-zhuang fbshipit-source-id: 8943768156aa6197c63850a64238a8092527d517	3 years ago
Peter Dillinger	5576ded762	Add Options::DisableExtraChecks, clarify force_consistency_checks (#9363 ) Summary: In response to https://github.com/facebook/rocksdb/issues/9354, this PR adds a way for users to "opt out" of extra checks that can impact peak write performance, which currently only includes force_consistency_checks. I considered including some other options but did not see a db_bench performance difference. Also clarify in comment for force_consistency_checks that it can "slow down saturated writing." Pull Request resolved: https://github.com/facebook/rocksdb/pull/9363 Test Plan: basic coverage in unit tests Using my perf test in https://github.com/facebook/rocksdb/issues/9354 comment, I see force_consistency_checks=true -> 725360 ops/s force_consistency_checks=false -> 783072 ops/s Reviewed By: mrambacher Differential Revision: D33636559 Pulled By: pdillinger fbshipit-source-id: 25bfd006f4844675e7669b342817dd4c6a641e84	3 years ago
Jay Zhuang	9c6fb26033	Fix clang13 build error (#9374 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9374 Test Plan: Add CI for clang13 build Reviewed By: riversand963 Differential Revision: D33522867 Pulled By: jay-zhuang fbshipit-source-id: 642756825cf0b51e35861fb847ebaee4611b76ca	3 years ago
mrambacher	1973fcba11	Restore Regex support for ObjectLibrary::Register, rename new APIs to allow old one to be deprecated in the future (#9362 ) Summary: In order to support old-style regex function registration, restored the original "Register<T>(string, Factory)" method using regular expressions. The PatternEntry methods were left in place but renamed to AddFactory. The goal is to allow for the deprecation of the original regex Registry method in an upcoming release. Added modes to the PatternEntry kMatchZeroOrMore and kMatchAtLeastOne to match * or +, respectively (kMatchAtLeastOne was the original behavior). Pull Request resolved: https://github.com/facebook/rocksdb/pull/9362 Reviewed By: pdillinger Differential Revision: D33432562 Pulled By: mrambacher fbshipit-source-id: ed88ab3f9a2ad0d525c7bd1692873f9bb3209d02	3 years ago
Yanqin Jin	55a2105258	Make RocksDB codebase compatible with newer compilers like clang-12 (#9370 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9370 GCC and newer clang, e.g. clang-12 treat `std::unique_ptr` slightly differently. For the following code ``` #include <iostream> #include <memory> #include <type_traits> struct A { std::unique_ptr<int> m1; }; int main() { std::cout << std::boolalpha; std::cout << std::is_standard_layout<A>::value << '\n'; return 0; } ``` GCC11(C++20) (tested on https://en.cppreference.com/w/cpp/types/is_standard_layout) will print "true", while newer clang, e.g. clang-12 will print "false". This breaks the usage of `offsetof()` on structs with non-static members of type `std::unique_ptr`. Fixing this by replacing the builtin `offsetof` with a trick documented at https://gist.github.com/graphitemaster/494f21190bb2c63c5516. Reviewed By: jay-zhuang Differential Revision: D33420840 fbshipit-source-id: 02bde281dfa28809bec787ad0f7019e85dd9c607	3 years ago
mrambacher	fe31dc53ca	Make the Env class Customizable (#9293 ) Summary: Allows the Env to have options (Configurable) and loads like other Customizable classes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9293 Reviewed By: pdillinger, zhichao-cao Differential Revision: D33181591 Pulled By: mrambacher fbshipit-source-id: 55e823886c654d214eda9eedd45ccdc54dac14d7	3 years ago
mrambacher	1c39b7952b	Remove/Reduce use of Regex in ObjectRegistry/Library (#9264 ) Summary: Added new ObjectLibrary::Entry classes to replace/reduce the use of Regex. For simple factories that only do name matching, there are "StringEntry" and "AltStringEntry" classes. For classes that use some semblance of regular expressions, there is a PatternEntry class that can match a name and prefixes. There is also a class for Customizable::IndividualId format matches. Added tests for the new derivative classes and got all unit tests to pass. Resolves https://github.com/facebook/rocksdb/issues/9225. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9264 Reviewed By: pdillinger Differential Revision: D33062001 Pulled By: mrambacher fbshipit-source-id: c2d2143bd2d38bdf522705c8280c35381b135c03	3 years ago
mrambacher	423538a816	Make MemoryAllocator into a Customizable class (#8980 ) Summary: - Make MemoryAllocator and its implementations into a Customizable class. - Added a "DefaultMemoryAllocator" which uses new and delete - Added a "CountedMemoryAllocator" that counts the number of allocs and free - Updated the existing tests to use these new allocators - Changed the memkind allocator test into a generic test that can test the various allocators. - Added tests for creating all of the allocators - Added tests to verify/create the JemallocNodumpAllocator using its options. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8980 Reviewed By: zhichao-cao Differential Revision: D32990403 Pulled By: mrambacher fbshipit-source-id: 6fdfe8218c10dd8dfef34344a08201be1fa95c76	3 years ago
mrambacher	5486717ee2	Fix an issue with MemTableRepFactory::CreateFromString (#9273 ) Summary: If ignore_unsupported_options=true, then it is possible for MemTableRepFactory::CreateFromString to succeed without setting a result (result=nullptr). This would cause the original value to be overwritten with null and an error would be raised later when PrepareOptions is invoked. Added unit test for this condition. Will add (in another PR unless required by reviewers) comparable tests for all of the other Customizable classes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9273 Reviewed By: ltamasi Differential Revision: D32990365 Pulled By: mrambacher fbshipit-source-id: b150724c3f5ae7346357b3866244fd93466875c7	3 years ago
mrambacher	7cd5835a28	Make RateLimiter Customizable (#9141 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9141 Reviewed By: zhichao-cao Differential Revision: D32432190 Pulled By: mrambacher fbshipit-source-id: 7930ed88a02412128cd407b5063522484e45c6ce	3 years ago
mrambacher	7aa31ba4a9	Fix GetOptionsPtr for Wrapped Customizable; Allow null options map (#9213 ) Summary: 1. Fix GetOptionsPtr for Wrapped (Inner() != nullptr) Customizable objects. This allows the inner options to be returned via this method. 2. Allow the option type map to be nullptr. This allows objects to be registered as options (for GetOptionsPtr) but not be used by the configuration methods. Added tests as appropriate. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9213 Reviewed By: zhichao-cao Differential Revision: D32718882 Pulled By: mrambacher fbshipit-source-id: 563203d1f006a2629060feb31c5dff9a233e1e83	3 years ago
Levi Tamasi	dc5de45af8	Support readahead during compaction for blob files (#9187 ) Summary: The patch adds a new BlobDB configuration option `blob_compaction_readahead_size` that can be used to enable prefetching data from blob files during compaction. This is important when using storage with higher latencies like HDDs or remote filesystems. If enabled, prefetching is used for all cases when blobs are read during compaction, namely garbage collection, compaction filters (when the existing value has to be read from a blob file), and `Merge` (when the value of the base `Put` is stored in a blob file). Pull Request resolved: https://github.com/facebook/rocksdb/pull/9187 Test Plan: Ran `make check` and the stress/crash test. Reviewed By: riversand963 Differential Revision: D32565512 Pulled By: ltamasi fbshipit-source-id: 87be9cebc3aa01cc227bec6b5f64d827b8164f5d	3 years ago
Hui Xiao	74544d582f	Account Bloom/Ribbon filter construction memory in global memory limit (#9073 ) Summary: Note: This PR is the 4th part of a bigger PR stack (https://github.com/facebook/rocksdb/pull/9073) and will rebase/merge only after the first three PRs (https://github.com/facebook/rocksdb/pull/9070, https://github.com/facebook/rocksdb/pull/9071, https://github.com/facebook/rocksdb/pull/9130) merge. Context: Similar to https://github.com/facebook/rocksdb/pull/8428, this PR is to track memory usage during (new) Bloom Filter (i.e,FastLocalBloom) and Ribbon Filter (i.e, Ribbon128) construction, moving toward the goal of [single global memory limit using block cache capacity](https://github.com/facebook/rocksdb/wiki/Projects-Being-Developed#improving-memory-efficiency). It also constrains the size of the banding portion of Ribbon Filter during construction by falling back to Bloom Filter if that banding is, at some point, larger than the available space in the cache under `LRUCacheOptions::strict_capacity_limit=true`. The option to turn on this feature is `BlockBasedTableOptions::reserve_table_builder_memory = true` which by default is set to `false`. We [decided](https://github.com/facebook/rocksdb/pull/9073#discussion_r741548409) not to have separate option for separate memory user in table building therefore their memory accounting are all bundled under one general option. Summary: - Reserved/released cache for creation/destruction of three main memory users with the passed-in `FilterBuildingContext::cache_res_mgr` during filter construction: - hash entries (i.e`hash_entries`.size(), we bucket-charge hash entries during insertion for performance), - banding (Ribbon Filter only, `bytes_coeff_rows` +`bytes_result_rows` + `bytes_backtrack`), - final filter (i.e, `mutable_buf`'s size). - Implementation details: in order to use `CacheReservationManager::CacheReservationHandle` to account final filter's memory, we have to store the `CacheReservationManager` object and `CacheReservationHandle` for final filter in `XXPH3BitsFilterBuilder` as well as explicitly delete the filter bits builder when done with the final filter in block based table. - Added option fo run `filter_bench` with this memory reservation feature Pull Request resolved: https://github.com/facebook/rocksdb/pull/9073 Test Plan: - Added new tests in `db_bloom_filter_test` to verify filter construction peak cache reservation under combination of `BlockBasedTable::Rep::FilterType` (e.g, `kFullFilter`, `kPartitionedFilter`), `BloomFilterPolicy::Mode`(e.g, `kFastLocalBloom`, `kStandard128Ribbon`, `kDeprecatedBlock`) and `BlockBasedTableOptions::reserve_table_builder_memory` - To address the concern for slow test: tests with memory reservation under `kFullFilter` + `kStandard128Ribbon` and `kPartitionedFilter` take around 3000 - 6000 ms and others take around 1500 - 2000 ms, in total adding 20000 - 25000 ms to the test suit running locally - Added new test in `bloom_test` to verify Ribbon Filter fallback on large banding in FullFilter - Added test in `filter_bench` to verify that this feature does not significantly slow down Bloom/Ribbon Filter construction speed. Local result averaged over 20 run as below: - FastLocalBloom - baseline `./filter_bench -impl=2 -quick -runs 20 \| grep 'Build avg'`: - Build avg ns/key: 29.56295 (DEBUG_LEVEL=1), 29.98153 (DEBUG_LEVEL=0) - new feature (expected to be similar as above)`./filter_bench -impl=2 -quick -runs 20 -reserve_table_builder_memory=true \| grep 'Build avg'`: - Build avg ns/key: 30.99046 (DEBUG_LEVEL=1), 30.48867 (DEBUG_LEVEL=0) - new feature of RibbonFilter with fallback (expected to be similar as above) `./filter_bench -impl=2 -quick -runs 20 -reserve_table_builder_memory=true -strict_capacity_limit=true \| grep 'Build avg'` : - Build avg ns/key: 31.146975 (DEBUG_LEVEL=1), 30.08165 (DEBUG_LEVEL=0) - Ribbon128 - baseline `./filter_bench -impl=3 -quick -runs 20 \| grep 'Build avg'`: - Build avg ns/key: 129.17585 (DEBUG_LEVEL=1), 130.5225 (DEBUG_LEVEL=0) - new feature (expected to be similar as above) `./filter_bench -impl=3 -quick -runs 20 -reserve_table_builder_memory=true \| grep 'Build avg' `: - Build avg ns/key: 131.61645 (DEBUG_LEVEL=1), 132.98075 (DEBUG_LEVEL=0) - new feature of RibbonFilter with fallback (expected to be a lot faster than above due to fallback) `./filter_bench -impl=3 -quick -runs 20 -reserve_table_builder_memory=true -strict_capacity_limit=true \| grep 'Build avg'` : - Build avg ns/key: 52.032965 (DEBUG_LEVEL=1), 52.597825 (DEBUG_LEVEL=0) - And the warning message of `"Cache reservation for Ribbon filter banding failed due to cache full"` is indeed logged to console. Reviewed By: pdillinger Differential Revision: D31991348 Pulled By: hx235 fbshipit-source-id: 9336b2c60f44d530063da518ceaf56dac5f9df8e	3 years ago
Akanksha Mahajan	17ce1ca48b	Reuse internal auto readhead_size at each Level (expect L0) for Iterations (#9056 ) Summary: RocksDB does auto-readahead for iterators on noticing more than two sequential reads for a table file if user doesn't provide readahead_size. The readahead starts at 8KB and doubles on every additional read up to max_auto_readahead_size. However at each level, if iterator moves over next file, readahead_size starts again from 8KB. This PR introduces a new ReadOption "adaptive_readahead" which when set true will maintain readahead_size at each level. So when iterator moves from one file to another, new file's readahead_size will continue from previous file's readahead_size instead of scratch. However if reads are not sequential it will fall back to 8KB (default) with no prefetching for that block. 1. If block is found in cache but it was eligible for prefetch (block wasn't in Rocksdb's prefetch buffer), readahead_size will decrease by 8KB. 2. It maintains readahead_size for L1 - Ln levels. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9056 Test Plan: Added new unit tests Ran db_bench for "readseq, seekrandom, seekrandomwhilewriting, readrandom" with --adaptive_readahead=true and there was no regression if new feature is enabled. Reviewed By: anand1976 Differential Revision: D31773640 Pulled By: akankshamahajan15 fbshipit-source-id: 7332d16258b846ae5cea773009195a5af58f8f98	3 years ago
Peter Dillinger	2a3511a0df	Fix -Werror=type-limits seen in Travis (#9128 ) Summary: Work around annoying compiler warning-as-error from https://github.com/facebook/rocksdb/issues/9113 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9128 Test Plan: CI Reviewed By: jay-zhuang Differential Revision: D32181499 Pulled By: pdillinger fbshipit-source-id: d7e5f7857a29f7ba47c49c3aee7150b5763b65d9	3 years ago
Peter Dillinger	dfedc74d82	Some checksum code refactoring (#9113 ) Summary: To prepare for adding checksum to footer and "context aware" checksums. This also brings closely related code much closer together. Recently added `BlockBasedTableBuilder::ComputeBlockTrailer` for testing is made obsolete in the refactoring, as testing the checksums can happen at a lower level of abstraction. Also now checking for unrecognized checksum type on reading footer, rather than later on use. Also removed an obsolete function delcaration. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9113 Test Plan: existing tests worked before refactoring to remove `ComputeBlockTrailer`. And then refactored+improved tests using it. Reviewed By: mrambacher Differential Revision: D32090149 Pulled By: pdillinger fbshipit-source-id: 2879da683c1498ea85a3b70dace9b6d9f6b47b6e	3 years ago
mrambacher	f72c834eab	Make FileSystem a Customizable Class (#8649 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8649 Reviewed By: zhichao-cao Differential Revision: D32036059 Pulled By: mrambacher fbshipit-source-id: 4f1e7557ecac52eb849b83ae02b8d7d232112295	3 years ago
Peter Dillinger	a7d4bea43a	Implement XXH3 block checksum type (#9069 ) Summary: XXH3 - latest hash function that is extremely fast on large data, easily faster than crc32c on most any x86_64 hardware. In integrating this hash function, I have handled the compression type byte in a non-standard way to avoid using the streaming API (extra data movement and active code size because of hash function complexity). This approach got a thumbs-up from Yann Collet. Existing functionality change: * reject bad ChecksumType in options with InvalidArgument This change split off from https://github.com/facebook/rocksdb/issues/9058 because context-aware checksum is likely to be handled through different configuration than ChecksumType. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9069 Test Plan: tests updated, and substantially expanded. Unit tests now check that we don't accidentally change the values generated by the checksum algorithms ("schema test") and that we properly handle invalid/unrecognized checksum types in options or in file footer. DBTestBase::ChangeOptions (etc.) updated from two to one configuration changing from default CRC32c ChecksumType. The point of this test code is to detect possible interactions among features, and the likelihood of some bad interaction being detected by including configurations other than XXH3 and CRC32c--and then not detected by stress/crash test--is extremely low. Stress/crash test also updated (manual run long enough to see it accepts new checksum type). db_bench also updated for microbenchmarking checksums. ### Performance microbenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor) ./db_bench -benchmarks=crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3,crc32c,xxhash,xxhash64,xxh3 crc32c : 0.200 micros/op 5005220 ops/sec; 19551.6 MB/s (4096 per op) xxhash : 0.807 micros/op 1238408 ops/sec; 4837.5 MB/s (4096 per op) xxhash64 : 0.421 micros/op 2376514 ops/sec; 9283.3 MB/s (4096 per op) xxh3 : 0.171 micros/op 5858391 ops/sec; 22884.3 MB/s (4096 per op) crc32c : 0.206 micros/op 4859566 ops/sec; 18982.7 MB/s (4096 per op) xxhash : 0.793 micros/op 1260850 ops/sec; 4925.2 MB/s (4096 per op) xxhash64 : 0.410 micros/op 2439182 ops/sec; 9528.1 MB/s (4096 per op) xxh3 : 0.161 micros/op 6202872 ops/sec; 24230.0 MB/s (4096 per op) crc32c : 0.203 micros/op 4924686 ops/sec; 19237.1 MB/s (4096 per op) xxhash : 0.839 micros/op 1192388 ops/sec; 4657.8 MB/s (4096 per op) xxhash64 : 0.424 micros/op 2357391 ops/sec; 9208.6 MB/s (4096 per op) xxh3 : 0.162 micros/op 6182678 ops/sec; 24151.1 MB/s (4096 per op) As you can see, especially once warmed up, xxh3 is fastest. ### Performance macrobenchmark (PORTABLE=0 DEBUG_LEVEL=0, Broadwell processor) Test for I in `seq 1 50`; do for CHK in 0 1 2 3 4; do TEST_TMPDIR=/dev/shm/rocksdb$CHK ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=30000000 -checksum_type=$CHK 2>&1 \| grep 'micros/op' \| tee -a results-$CHK & done; wait; done Results (ops/sec) for FILE in results; do echo -n "$FILE "; awk '{ s += $5; c++; } END { print 1.0 s / c; }' < $FILE; done results-0 252118 # kNoChecksum results-1 251588 # kCRC32c results-2 251863 # kxxHash results-3 252016 # kxxHash64 results-4 252038 # kXXH3 Reviewed By: mrambacher Differential Revision: D31905249 Pulled By: pdillinger fbshipit-source-id: cb9b998ebe2523fc7c400eedf62124a78bf4b4d1	3 years ago
sdong	c66b4429ff	Incremental Space Amp Compactions in Universal Style (#8655 ) Summary: This commit introduces incremental compaction in univeral style for space amplification. This follows the first improvement mentioned in https://rocksdb.org/blog/2021/04/12/universal-improvements.html . The implemention simply picks up files about size of max_compaction_bytes to compact and execute if the penalty is not too big. More optimizations can be done in the future, e.g. prioritizing between this compaction and other types. But for now, the feature is supposed to be functional and can often reduce frequency of full compactions, although it can introduce penalty. In order to add cut files more efficiently so that more files from upper levels can be included, SST file cutting threshold (for current file + overlapping parent level files) is set to 1.5X of target file size. A 2MB target file size will generate files like this: https://gist.github.com/siying/29d2676fba417404f3c95e6c013c7de8 Number of files indeed increases but it is not out of control. Two set of write benchmarks are run: 1. For ingestion rate limited scenario, we can see full compaction is mostly eliminated: https://gist.github.com/siying/959bc1186066906831cf4c808d6e0a19 . The write amp increased from 7.7 to 9.4, as expected. After applying file cutting, the number is improved to 8.9. In another benchmark, the write amp is even better with the incremental approach: https://gist.github.com/siying/d1c16c286d7c59c4d7bba718ca198163 2. For ingestion rate unlimited scenario, incremental compaction turns out to be too expensive most of the time and is not executed, as expected. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8655 Test Plan: Add unit tests to the functionality. Reviewed By: ajkr Differential Revision: D31787034 fbshipit-source-id: ce813e63b15a61d5a56e97bf8902a1b28e011beb	4 years ago
Zhichao Cao	6d93b87588	Add lowest_used_cache_tier to ImmutableDBOptions to enable or disable Secondary Cache (#9050 ) Summary: Currently, if Secondary Cache is provided to the lru cache, it is used by default. We add CacheTier to advanced_options.h to describe the cache tier we used. Add a `lowest_used_cache_tier` option to `DBOptions` (immutable) and pass it to BlockBasedTableReader to decide if secondary cache will be used or not. By default it is `CacheTier::kNonVolatileTier`, which means, we always use both block cache (kVolatileTier) and secondary cache (kNonVolatileTier). By set it to `CacheTier::kVolatileTier`, the DB will not use the secondary cache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9050 Test Plan: added new tests Reviewed By: anand1976 Differential Revision: D31744769 Pulled By: zhichao-cao fbshipit-source-id: a0575ebd23e1c6dfcfc2b4c8578764e73b15bce6	4 years ago
mrambacher	8fb3fe8d39	Allow unregistered options to be ignored in DBOptions from files (#9045 ) Summary: Adds changes to DBOptions (comparable to ColumnFamilyOptions) to allow some option values to be ignored on rehydration from the Options file. This is necessary for some customizable classes that were not registered with the ObjectRegistry but are saved/restored from the Options file. All tests pass. Will run check_format_compatible.sh shortly. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9045 Reviewed By: zhichao-cao Differential Revision: D31761664 Pulled By: mrambacher fbshipit-source-id: 300c2251639cce2b223481c3bb2a63877b1f3766	4 years ago
Levi Tamasi	3e1bf771a3	Make it possible to force the garbage collection of the oldest blob files (#8994 ) Summary: The current BlobDB garbage collection logic works by relocating the valid blobs from the oldest blob files as they are encountered during compaction, and cleaning up blob files once they contain nothing but garbage. However, with sufficiently skewed workloads, it is theoretically possible to end up in a situation when few or no compactions get scheduled for the SST files that contain references to the oldest blob files, which can lead to increased space amp due to the lack of GC. In order to efficiently handle such workloads, the patch adds a new BlobDB configuration option called `blob_garbage_collection_force_threshold`, which signals to BlobDB to schedule targeted compactions for the SST files that keep alive the oldest batch of blob files if the overall ratio of garbage in the given blob files meets the threshold and all the given blob files are eligible for GC based on `blob_garbage_collection_age_cutoff`. (For example, if the new option is set to 0.9, targeted compactions will get scheduled if the sum of garbage bytes meets or exceeds 90% of the sum of total bytes in the oldest blob files, assuming all affected blob files are below the age-based cutoff.) The net result of these targeted compactions is that the valid blobs in the oldest blob files are relocated and the oldest blob files themselves cleaned up (since all SST files that rely on them get compacted away). These targeted compactions are similar to periodic compactions in the sense that they force certain SST files that otherwise would not get picked up to undergo compaction and also in the sense that instead of merging files from multiple levels, they target a single file. (Note: such compactions might still include neighboring files from the same level due to the need of having a "clean cut" boundary but they never include any files from any other level.) This functionality is currently only supported with the leveled compaction style and is inactive by default (since the default value is set to 1.0, i.e. 100%). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8994 Test Plan: Ran `make check` and tested using `db_bench` and the stress/crash tests. Reviewed By: riversand963 Differential Revision: D31489850 Pulled By: ltamasi fbshipit-source-id: 44057d511726a0e2a03c5d9313d7511b3f0c4eab	4 years ago
mrambacher	13ae16c315	Cleanup includes in dbformat.h (#8930 ) Summary: This header file was including everything and the kitchen sink when it did not need to. This resulted in many places including this header when they needed other pieces instead. Cleaned up this header to only include what was needed and fixed up the remaining code to include what was now missing. Hopefully, this sort of code hygiene cleanup will speed up the builds... Pull Request resolved: https://github.com/facebook/rocksdb/pull/8930 Reviewed By: pdillinger Differential Revision: D31142788 Pulled By: mrambacher fbshipit-source-id: 6b45de3f300750c79f751f6227dece9cfd44085d	4 years ago
mrambacher	7fd68b7c39	Make WalFilter, SstPartitionerFactory, FileChecksumGenFactory, and TableProperties Customizable (#8638 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8638 Reviewed By: zhichao-cao Differential Revision: D31024729 Pulled By: mrambacher fbshipit-source-id: 954c04ccab0b8dee64050a27aadf78ed119106c0	4 years ago
mrambacher	e0f697d2bd	Make SliceTransform into a Customizable class (#8641 ) Summary: Made SliceTransform into a Customizable class. Would be nice to write a test that stored and used a custom transform in an SST table. There are a set of tests (DBBlockFliterTest.PrefixExtractor*, SamePrefixTest.InDomainTest, PrefixTest.PrefixAndWholeKeyTest that run the same with or without a SliceTransform/PrefixFilter. Is this expected? Pull Request resolved: https://github.com/facebook/rocksdb/pull/8641 Reviewed By: zhichao-cao Differential Revision: D31142793 Pulled By: mrambacher fbshipit-source-id: bb08672fccbfdc263dcae21f25a62307e1facda1	4 years ago
mrambacher	6924869867	Make SystemClock into a Customizable Class (#8636 ) Summary: Made SystemClock into a Customizable class, complete with CreateFromString. Cleaned up some of the existing SystemClock implementations that were redundant (NoSleep was the same as the internal one for MockEnv). Changed MockEnv construction to allow Clock to be passed to the Memory/MockFileSystem. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8636 Reviewed By: zhichao-cao Differential Revision: D30483360 Pulled By: mrambacher fbshipit-source-id: cd0e3a876c39f8c98fe13374c06e8edbd5b9f2a1	4 years ago
mrambacher	dc0dc90cf5	Make Statistics a Customizable Class (#8637 ) Summary: Make the Statistics object into a Customizable object. Statistics can now be stored and created to/from the Options file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8637 Reviewed By: zhichao-cao Differential Revision: D30530550 Pulled By: mrambacher fbshipit-source-id: 5fc7d01d8431f37b2c205bbbd8342c9f697023bd	4 years ago
mrambacher	0fb938c448	Add support to the ObjectRegistry for ManagedObjects (#8658 ) Summary: ManagedObjects are shared pointer objects where RocksDB wants to share a single object between multiple configurations. For example, the Cache may be shared between multiple column families/tables or the Statistics may be shared between multiple databases. ManagedObjects are stored in the ObjectRegistry by Type (e.g. Cache) and ID. For a given type/ID name, a single object is stored. APIs were added to get/set/create these objects. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8658 Reviewed By: pdillinger Differential Revision: D30806273 Pulled By: mrambacher fbshipit-source-id: 832ac4423b210c4c4b4a456b35897334775d3160	4 years ago
mrambacher	beed86473a	Make MemTableRepFactory into a Customizable class (#8419 ) Summary: This PR does the following: -> Makes the MemTableRepFactory into a Customizable class and creatable/configurable via CreateFromString -> Makes the existing implementations compatible with configurations -> Moves the "SpecialRepFactory" test class into testutil, accessible via the ObjectRegistry or a NewSpecial API New tests were added to validate the functionality and all existing tests pass. db_bench and memtablerep_bench were hand-tested to verify the functionality in those tools. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8419 Reviewed By: zhichao-cao Differential Revision: D29558961 Pulled By: mrambacher fbshipit-source-id: 81b7229636e4e649a0c914e73ac7b0f8454c931c	4 years ago
Peter Dillinger	4750421ece	Replace most typedef with using= (#8751 ) Summary: Old typedef syntax is confusing Most but not all changes with perl -pi -e 's/typedef (.*) ([a-zA-Z0-9_]+);/using $2 = $1;/g' list_of_files make format Pull Request resolved: https://github.com/facebook/rocksdb/pull/8751 Test Plan: existing Reviewed By: zhichao-cao Differential Revision: D30745277 Pulled By: pdillinger fbshipit-source-id: 6f65f0631c3563382d43347896020413cc2366d9	4 years ago
mrambacher	6e63e77af1	Make Configurable/Customizable options copyable (#8704 ) Summary: The atomic variable "is_prepared_" was keeping Configurable objects from being copy-constructed. Removed the atomic to allow copies. Since the variable is only changed from false to true (and never back), there is no reason it had to be atomic. Added tests that simple Configurable and Customizable objects can be put on the stack and copied. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8704 Reviewed By: anand1976 Differential Revision: D30530526 Pulled By: ltamasi fbshipit-source-id: 4dd4439b3e5ad7fa396573d0b25d9fb709160576	4 years ago
mrambacher	2e062b2227	Fix LITE build (#8689 ) Summary: Conditional compilation of static functions not used in LITE mode. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8689 Reviewed By: ltamasi Differential Revision: D30476218 Pulled By: mrambacher fbshipit-source-id: 5f3af90982d34818f47d2cb1d36dd5816d0333a5	4 years ago
Peter Dillinger	2a383f21f4	Add Bloom/Ribbon hybrid API support (#8679 ) Summary: This is essentially resurrection and fixing of the part of https://github.com/facebook/rocksdb/issues/8198 that was reverted in https://github.com/facebook/rocksdb/issues/8212, using data added in https://github.com/facebook/rocksdb/issues/8246. Basically, when configuring Ribbon filter, you can specify an LSM level before which Bloom will be used instead of Ribbon. But Bloom is only considered for Leveled and Universal compaction styles and file going into a known LSM level. This way, SST file writer, FIFO compaction, etc. use Ribbon filter as you would expect with NewRibbonFilterPolicy. So that this can be controlled with a single int value and so that flushes can be distinguished from intra-L0, we consider flush to go to level -1 for the purposes of this option. (Explained in API comment.) I also expect the most common and recommended Ribbon configuration to use Bloom during flush, to minimize slowing down writes and because according to my estimates, Ribbon only pays off if the structure lives in memory for more than an hour. Thus, I have changed the default for NewRibbonFilterPolicy to be this mild hybrid configuration. I don't really want to add something like NewHybridFilterPolicy because at least the mild hybrid configuration (Bloom for flush, Ribbon otherwise) should be considered a natural choice. C APIs also updated, but because they don't support overloading, rocksdb_filterpolicy_create_ribbon is kept pure ribbon for clarity and rocksdb_filterpolicy_create_ribbon_hybrid must be called for a hybrid configuration. While touching C API, I changed bits per key options from int to double. BuiltinFilterPolicy is needed so that LevelThresholdFilterPolicy doesn't inherit unused fields from BloomFilterPolicy. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8679 Test Plan: new + updated tests, including crash test Reviewed By: jay-zhuang Differential Revision: D30445797 Pulled By: pdillinger fbshipit-source-id: 6f5aeddfd6d79f7e55493b563c2d1d2d568892e1	4 years ago
mrambacher	9eb002fcf0	Fix some minor issues in the Customizable infrastructure (#8566 ) Summary: - Fix issue with OptionType::Vector when the nested item is a Customizable with no names - Fix issue with OptionType::Vector to appropriately wrap the elements in a Vector; - Fix an issue with nested Customizable object with a null immutable object still appearing in the mutable options; - Fix/Add tests for null/empty customizable objects - Move the RegisterTestObjects from customizable_test into testutil. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8566 Reviewed By: zhichao-cao Differential Revision: D30303724 Pulled By: mrambacher fbshipit-source-id: 33fa8ea2a3b663210cb356da05e64aab7585b1b5	4 years ago
Baptiste Lemaire	a53563d86e	Re-add retired mempurge flag definitions for legacy-options-file temporary support. (#8650 ) Summary: Current internal regression tests pass in an old option flag `experimental_allow_mempurge` to a more recently built db. This flag was retired and removed in a recent PR (https://github.com/facebook/rocksdb/issues/8628), and therefore, the following error comes up : `Failed: Invalid argument: Could not find option: : experimental_allow_mempurge`. In this PR, I reintroduce the two flags retired in https://github.com/facebook/rocksdb/issues/8628, `experimental_allow_mempurge` and `experimental_mempurge_policy` in `db_options.cc` and mark them both as `kDeprecated`. This is a temporary fix to save us time to find a long term solution, which hopefully will consist in ignoring options prefixed with `experimental_` that are no longer recognized. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8650 Reviewed By: pdillinger Differential Revision: D30257307 Pulled By: bjlemaire fbshipit-source-id: 35303655fd2dd9789fd9e3c450e9d8009f3c1f54	4 years ago
Baptiste Lemaire	e3a96c4823	Memtable sampling for mempurge heuristic. (#8628 ) Summary: Changes the API of the MemPurge process: the `bool experimental_allow_mempurge` and `experimental_mempurge_policy` flags have been replaced by a `double experimental_mempurge_threshold` option. This change of API reflects another major change introduced in this PR: the MemPurgeDecider() function now works by sampling the memtables being flushed to estimate the overall amount of useful payload (payload minus the garbage), and then compare this useful payload estimate with the `double experimental_mempurge_threshold` value. Therefore, when the value of this flag is `0.0` (default value), mempurge is simply deactivated. On the other hand, a value of `DBL_MAX` would be equivalent to always going through a mempurge regardless of the garbage ratio estimate. At the moment, a `double experimental_mempurge_threshold` value else than 0.0 or `DBL_MAX` is opnly supported`with the `SkipList` memtable representation. Regarding the sampling, this PR includes the introduction of a `MemTable::UniqueRandomSample` function that collects (approximately) random entries from the memtable by using the new `SkipList::Iterator::RandomSeek()` under the hood, or by iterating through each memtable entry, depending on the target sample size and the total number of entries. The unit tests have been readapted to support this new API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8628 Reviewed By: pdillinger Differential Revision: D30149315 Pulled By: bjlemaire fbshipit-source-id: 1feef5390c95db6f4480ab4434716533d3947f27	4 years ago
sdong	e7c24168d8	Move old files to warm tier in FIFO compactions (#8310 ) Summary: Some FIFO users want to keep the data for longer, but the old data is rarely accessed. This feature allows users to configure FIFO compaction so that data older than a threshold is moved to a warm storage tier. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8310 Test Plan: Add several unit tests. Reviewed By: ajkr Differential Revision: D28493792 fbshipit-source-id: c14824ea634814dee5278b449ab5c98b6e0b5501	4 years ago
mrambacher	d057e8326d	Make MergeOperator+CompactionFilter/Factory into Customizable Classes (#8481 ) Summary: - Changed MergeOperator, CompactionFilter, and CompactionFilterFactory into Customizable classes. - Added Options/Configurable/Object Registration for TTL and Cassandra variants - Changed the StringAppend MergeOperators to accept a string delimiter rather than a simple char. Made the delimiter into a configurable option - Added tests for new functionality Pull Request resolved: https://github.com/facebook/rocksdb/pull/8481 Reviewed By: zhichao-cao Differential Revision: D30136050 Pulled By: mrambacher fbshipit-source-id: 271d1772835935b6773abaf018ee71e42f9491af	4 years ago
mrambacher	ab7f7c9e49	Allow WAL dir to change with db dir (#8582 ) Summary: Prior to this change, the "wal_dir" DBOption would always be set (defaults to dbname) when the DBOptions were sanitized. Because of this setitng in the options file, it was not possible to rename/relocate a database directory after it had been created and use the existing options file. After this change, the "wal_dir" option is only set under specific circumstances. Methods were added to the ImmutableDBOptions class to see if it is set and if it is set to something other than the dbname. Additionally, a method was added to retrieve the effective value of the WAL dir (either the option or the dbname/path). Tests were added to the core and ldb to test that a database could be created and renamed without issue. Additional tests for various permutations of wal_dir were also added. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8582 Reviewed By: pdillinger, autopear Differential Revision: D29881122 Pulled By: mrambacher fbshipit-source-id: 67d3d033dc8813d59917b0a3fba2550c0efd6dfb	4 years ago
mrambacher	3aee4fbd41	Make EventListener into a Customizable Class (#8473 ) Summary: - Added Type/CreateFromString - Added ability to load EventListeners to DBOptions - Since EventListeners did not previously have a Name(), defaulted to "". If there is no name, the listener cannot be loaded from the ObjectRegistry. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8473 Reviewed By: zhichao-cao Differential Revision: D29901488 Pulled By: mrambacher fbshipit-source-id: 2d3a4aa6db1562ac03e7ad41b360e3521d486254	4 years ago
Baptiste Lemaire	4361d6d163	Add simple heuristics for experimental mempurge. (#8583 ) Summary: Add `experimental_mempurge_policy` option flag and introduce two new `MemPurge` (Memtable Garbage Collection) policies: 'ALWAYS' and 'ALTERNATE'. Default value: ALTERNATE. `ALWAYS`: every flush will first go through a `MemPurge` process. If the output is too big to fit into a single memtable, then the mempurge is aborted and a regular flush process carries on. `ALWAYS` is designed for user that need to reduce the number of L0 SST file created to a strict minimum, and can afford a small dent in performance (possibly hits to CPU usage, read efficiency, and maximum burst write throughput). `ALTERNATE`: a flush is transformed into a `MemPurge` except if one of the memtables being flushed is the product of a previous `MemPurge`. `ALTERNATE` is a good tradeoff between reduction in number of L0 SST files created and performance. `ALTERNATE` perform particularly well for completely random garbage ratios, or garbage ratios anywhere in (0%,50%], and even higher when there is a wild variability in garbage ratios. This PR also includes support for `experimental_mempurge_policy` in `db_bench`. Testing was done locally by replacing all the `MemPurge` policies of the unit tests with `ALTERNATE`, as well as local testing with `db_crashtest.py` `whitebox` and `blackbox`. Overall, if an `ALWAYS` mempurge policy passes the tests, there is no reasons why an `ALTERNATE` policy would fail, and therefore the mempurge policy was set to `ALWAYS` for all mempurge unit tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8583 Reviewed By: pdillinger Differential Revision: D29888050 Pulled By: bjlemaire fbshipit-source-id: e2cf26646d66679f6f5fb29842624615610759c1	4 years ago
Jay Zhuang	42eaa45c1b	Avoid updating option if there's no value updated (#8518 ) Summary: Try avoid expensive updating options operation if `SetDBOptions()` does not change any option value. Skip updating is not guaranteed, for example, changing `bytes_per_sync` to `0` may still trigger updating, as the value could be sanitized. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8518 Test Plan: added unittest Reviewed By: riversand963 Differential Revision: D29672639 Pulled By: jay-zhuang fbshipit-source-id: b7931de62ceea6f1bdff0d1209adf1197d3ed1f4	4 years ago

1 2 3 4 5 ...

335 Commits (e267909ecfe92e79232f7700e41b6e12c4a87c01)