rocksdb

Commit Graph

Author	SHA1	Message	Date
Hui Xiao	920386f2b7	Detect (new) Bloom/Ribbon Filter construction corruption (#9342 ) Summary: Note: rebase on and merge after https://github.com/facebook/rocksdb/pull/9349, https://github.com/facebook/rocksdb/pull/9345, (optional) https://github.com/facebook/rocksdb/pull/9393 Context: (Quoted from pdillinger) Layers of information during new Bloom/Ribbon Filter construction in building block-based tables includes the following: a) set of keys to add to filter b) set of hashes to add to filter (64-bit hash applied to each key) c) set of Bloom indices to set in filter, with duplicates d) set of Bloom indices to set in filter, deduplicated e) final filter and its checksum This PR aims to detect corruption (e.g, unexpected hardware/software corruption on data structures residing in the memory for a long time) from b) to e) and leave a) as future works for application level. - b)'s corruption is detected by verifying the xor checksum of the hash entries calculated as the entries accumulate before being added to the filter. (i.e, `XXPH3FilterBitsBuilder::MaybeVerifyHashEntriesChecksum()`) - c) - e)'s corruption is detected by verifying the hash entries indeed exists in the constructed filter by re-querying these hash entries in the filter (i.e, `FilterBitsBuilder::MaybePostVerify()`) after computing the block checksum (except for PartitionFilter, which is done right after each `FilterBitsBuilder::Finish` for impl simplicity - see code comment for more). For this stage of detection, we assume hash entries are not corrupted after checking on b) since the time interval from b) to c) is relatively short IMO. Option to enable this feature of detection is `BlockBasedTableOptions::detect_filter_construct_corruption` which is false by default. Summary: - Implemented new functions `XXPH3FilterBitsBuilder::MaybeVerifyHashEntriesChecksum()` and `FilterBitsBuilder::MaybePostVerify()` - Ensured hash entries, final filter and banding and their [cache reservation ](https://github.com/facebook/rocksdb/issues/9073) are released properly despite corruption - See [Filter.construction.artifacts.release.point.pdf ](https://github.com/facebook/rocksdb/files/7923487/Design.Filter.construction.artifacts.release.point.pdf) for high-level design - Bundled and refactored hash entries's related artifact in XXPH3FilterBitsBuilder into `HashEntriesInfo` for better control on lifetime of these artifact during `SwapEntires`, `ResetEntries` - Ensured RocksDB block-based table builder calls `FilterBitsBuilder::MaybePostVerify()` after constructing the filter by `FilterBitsBuilder::Finish()` - When encountering such filter construction corruption, stop writing the filter content to files and mark such a block-based table building non-ok by storing the corruption status in the builder. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9342 Test Plan: - Added new unit test `DBFilterConstructionCorruptionTestWithParam.DetectCorruption` - Included this new feature in `DBFilterConstructionReserveMemoryTestWithParam.ReserveMemory` as this feature heavily touch ReserveMemory's impl - For fallback case, I run `./filter_bench -impl=3 -detect_filter_construct_corruption=true -reserve_table_builder_memory=true -strict_capacity_limit=true -quick -runs 10 \| grep 'Build avg'` to make sure nothing break. - Added to `filter_bench`: increased filter construction time by 30%, mostly by `MaybePostVerify()` - FastLocalBloom - Before change: `./filter_bench -impl=2 -quick -runs 10 \| grep 'Build avg'`: 28.86643s - After change: - `./filter_bench -impl=2 -detect_filter_construct_corruption=false -quick -runs 10 \| grep 'Build avg'` (expect a tiny increase due to MaybePostVerify is always called regardless): 27.6644s (-4% perf improvement might be due to now we don't drop bloom hash entry in `AddAllEntries` along iteration but in bulk later, same with the bypassing-MaybePostVerify case below) - `./filter_bench -impl=2 -detect_filter_construct_corruption=true -quick -runs 10 \| grep 'Build avg'` (expect acceptable increase): 34.41159s (+20%) - `./filter_bench -impl=2 -detect_filter_construct_corruption=true -quick -runs 10 \| grep 'Build avg'` (by-passing MaybePostVerify, expect minor increase): 27.13431s (-6%) - Standard128Ribbon - Before change: `./filter_bench -impl=3 -quick -runs 10 \| grep 'Build avg'`: 122.5384s - After change: - `./filter_bench -impl=3 -detect_filter_construct_corruption=false -quick -runs 10 \| grep 'Build avg'` (expect a tiny increase due to MaybePostVerify is always called regardless - verified by removing MaybePostVerify under this case and found only +-1ns difference): 124.3588s (+2%) - `./filter_bench -impl=3 -detect_filter_construct_corruption=true -quick -runs 10 \| grep 'Build avg'`(expect acceptable increase): 159.4946s (+30%) - `./filter_bench -impl=3 -detect_filter_construct_corruption=true -quick -runs 10 \| grep 'Build avg'`(by-passing MaybePostVerify, expect minor increase) : 125.258s (+2%) - Added to `db_stress`: `make crash_test`, `./db_stress --detect_filter_construct_corruption=true` - Manually smoke-tested: manually corrupted the filter construction in some db level tests with basic PUT and background flush. As expected, the error did get returned to users in subsequent PUT and Flush status. Reviewed By: pdillinger Differential Revision: D33746928 Pulled By: hx235 fbshipit-source-id: cb056426be5a7debc1cd16f23bc250f36a08ca57	3 years ago
Andrew Kryczka	c7ce03dce1	db_stress begin tracking expected state after verification (#9470 ) Summary: Previously we enabled tracking expected state changes during `FinishInitDb()`, as soon as the DB was opened. This meant tracing was enabled during `VerifyDb()`. This cost extra CPU by requiring `DBImpl::trace_mutex_` to be acquired on each read operation. It was unnecessary since we know there are no expected state changes during the `VerifyDb()` phase. So, this PR delays tracking expected state changes until after the `VerifyDb()` phase has completed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9470 Test Plan: Measured this PR reduced `VerifyDb()` 76% (387 -> 92 seconds) with `-disable_wal=1` (i.e., expected state tracking enabled). - benchmark command: `./db_stress -max_key=100000000 -ops_per_thread=1 -destroy_db_initially=1 -expected_values_dir=/dev/shm/dbstress_expected/ -db=/dev/shm/dbstress/ --clear_column_family_one_in=0 --disable_wal=1 --reopen=0` - without this PR, `VerifyDb()` takes 387 seconds: ``` 2022/01/30-21:43:04 Initializing worker threads Crash-recovery verification passed :) 2022/01/30-21:49:31 Starting database operations ``` - with this PR, `VerifyDb()` takes 92 seconds ``` 2022/01/30-21:59:06 Initializing worker threads Crash-recovery verification passed :) 2022/01/30-22:00:38 Starting database operations ``` Reviewed By: riversand963 Differential Revision: D33884596 Pulled By: ajkr fbshipit-source-id: 5f259de8087de5b0531f088e11297f37ed2f7685	3 years ago
Peter Dillinger	78aee6fedc	Remove obsolete backupable_db.h, utility_db.h (#9438 ) Summary: This also removes the obsolete names BackupableDBOptions and UtilityDB. API users must now use BackupEngineOptions and DBWithTTL::Open. In C API, `rocksdb_backupable_db_` is replaced `rocksdb_backup_engine_`. Similar renaming in Java API. In reference to https://github.com/facebook/rocksdb/issues/9389 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9438 Test Plan: CI Reviewed By: mrambacher Differential Revision: D33780269 Pulled By: pdillinger fbshipit-source-id: 4a6cfc5c1b4c78bcad790b9d3dd13c5fdf4a1fac	3 years ago
Hui Xiao	1e0e883ca5	Remove deprecated API AdvancedColumnFamilyOptions::soft_rate_limit/hard_rate_limit (#9452 ) Summary: Context/Summary: AdvancedColumnFamilyOptions::soft_rate_limit/hard_rate_limit have been marked as deprecated and it's time to actually remove the code. - Keep `soft_rate_limit`/`hard_rate_limit` in `cf_mutable_options_type_info` to prevent throwing `InvalidArgument` in `GetColumnFamilyOptionsFromMap` when reading an option file still with these options (e.g, old option file generated from RocksDB before the deprecation) - Keep `soft_rate_limit`/`hard_rate_limit` in under `OptionsOldApiTest.GetOptionsFromMapTest` to test the case mentioned above. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9452 Test Plan: Rely on my eyeball and CI Reviewed By: ajkr Differential Revision: D33804938 Pulled By: hx235 fbshipit-source-id: 133d49f7ec5238d7efceeb0a3122a5792a2b9945	3 years ago
Aravind Ramesh	2eac6bb120	db_stress: db_stress fails on custom filesystems. (#9352 ) Summary: db_stress listener service always uses default filesystem to operate, causing it to not recognize custom filesystem (like ZenFS plugin FS). Pass the env to db_stress listener with the correct filesystem information, so it can open the user intended filesystem. Signed-off-by: Aravind Ramesh <Aravind.Ramesh@wdc.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/9352 Reviewed By: riversand963 Differential Revision: D33776762 Pulled By: pdillinger fbshipit-source-id: e79bb9a544384f80ae9dd0108241ab9c83223954	3 years ago
Andrew Kryczka	6892f19b11	Test correctness with WAL disabled in non-txn blackbox crash tests (#9338 ) Summary: Recently we added the ability to verify some prefix of operations are recovered (AKA no "hole" in the recovered data) (https://github.com/facebook/rocksdb/issues/8966). Besides testing unsynced data loss scenarios, it is also useful to test WAL disabled use cases, where unflushed writes are expected to be lost. Note RocksDB only offers the prefix-recovery guarantee to WAL-disabled use cases that use atomic flush, so crash test always enables atomic flush when WAL is disabled. To verify WAL-disabled crash-recovery correctness globally, i.e., also in whitebox and blackbox transaction tests, it is possible but requires further changes. I added TODOs in db_crashtest.py. Depends on https://github.com/facebook/rocksdb/issues/9305. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9338 Test Plan: Running all crash tests and many instances of blackbox. Sandcastle links are in Phabricator diff test plan. Reviewed By: riversand963 Differential Revision: D33345333 Pulled By: ajkr fbshipit-source-id: f56dd7d2e5a78d59301bf4fc3fedb980eb31e0ce	3 years ago
Andrew Kryczka	863c78d2c9	Fix unsynced data loss correctness test with mixed `-test_batches_snapshots` (#9302 ) Summary: This fixes two bugs in the recently committed DB verification following crash-recovery with unsynced data loss (https://github.com/facebook/rocksdb/issues/8966): The first bug was in crash test runs involving mixed values for `-test_batches_snapshots`. The problem was we were neither restoring expected values nor enabling tracing when `-test_batches_snapshots=1`. This caused a future `-test_batches_snapshots=0` run to not find enough trace data to restore expected values. The fix is to restore expected values at the start of `-test_batches_snapshots=1` runs, but still leave tracing disabled as we do not need to track those KVs. The second bug was in `db_stress` runs that restore the expected values file and use compaction filter. The compaction filter was initialized to use the pre-restore expected values, which would be `munmap()`'d during `FileExpectedStateManager::Restore()`. Then compaction filter would run into a segfault. The fix is just to reorder compaction filter init after expected values restore. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9302 Test Plan: - To verify the first problem, the below sequence used to fail; now it passes. ``` $ ./db_stress --db=./test-db/ --expected_values_dir=./test-db-expected/ --max_key=100000 --ops_per_thread=1000 --sync_fault_injection=1 --clear_column_family_one_in=0 --destroy_db_initially=0 -reopen=0 -test_batches_snapshots=0 $ ./db_stress --db=./test-db/ --expected_values_dir=./test-db-expected/ --max_key=100000 --ops_per_thread=1000 --sync_fault_injection=1 --clear_column_family_one_in=0 --destroy_db_initially=0 -reopen=0 -test_batches_snapshots=1 $ ./db_stress --db=./test-db/ --expected_values_dir=./test-db-expected/ --max_key=100000 --ops_per_thread=1000 --sync_fault_injection=1 --clear_column_family_one_in=0 --destroy_db_initially=0 -reopen=0 -test_batches_snapshots=0 ``` - The second problem occurred rarely in the form of a SIGSEGV on a file that was `munmap()`d. I have not seen it after this PR though this doesn't prove much. Reviewed By: jay-zhuang Differential Revision: D33155283 Pulled By: ajkr fbshipit-source-id: 66fd0f0edf34015a010c30015f14f104734e964e	3 years ago
Andrew Kryczka	c9818b3325	db_stress verify with lost unsynced operations (#8966 ) Summary: When a previous run left behind historical state/trace files (implying it was run with --sync_fault_injection set), this PR uses them to restore the expected state according to the DB's recovered sequence number. That way, a tail of latest unsynced operations are permitted to be dropped, as is the case when data in page cache or certain `Env`s is lost. The point of the verification in this scenario is just to ensure there is no hole in the recovered data. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8966 Test Plan: - ran it a while, made sure it is restoring expected values using the historical state/trace files: ``` $ rm -rf ./tmp-db/ ./exp/ && mkdir -p ./tmp-db/ ./exp/ && while ./db_stress -compression_type=none -clear_column_family_one_in=0 -expected_values_dir=./exp -sync_fault_injection=1 -destroy_db_initially=0 -db=./tmp-db -max_key=1000000 -ops_per_thread=10000 -reopen=0 -threads=32 ; do : ; done ``` Reviewed By: pdillinger Differential Revision: D31219445 Pulled By: ajkr fbshipit-source-id: f0e1d51fe5b35465b00565c33331190ea38ba0ad	3 years ago
Yanqin Jin	e05c2bb549	Stress test for RocksDB transactions (#8936 ) Summary: Current db_stress does not cover complex read-write transactions. Therefore, this PR adds coverage for emulated MyRocks-style transactions in `MultiOpsTxnsStressTest`. To achieve this, we need: - Add a new operation type 'customops' so that we can add new complex groups of operations, e.g. transactions involving multiple read-write operations. - Implement three read-write transactions and two read-only ones to emulate MyRocks-style transactions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8936 Test Plan: ``` make check ./db_stress -test_multi_ops_txns -use_txn -clear_column_family_one_in=0 -column_families=1 -writepercent=0 -delpercent=0 -delrangepercent=0 -customopspercent=60 -readpercent=20 -prefixpercent=0 -iterpercent=20 -reopen=0 -ops_per_thread=100000 ``` Next step is to add more configurability and refine input generation and result reporting, which will done in separate follow-up PRs. Reviewed By: zhichao-cao Differential Revision: D31071795 Pulled By: riversand963 fbshipit-source-id: 50d7c828346ec643311336b904848a1588a37006	3 years ago
Akanksha Mahajan	9e4d56f2c9	Fix segmentation fault in table_options.prepopulate_block_cache when used with partition_filters (#9263 ) Summary: When table_options.prepopulate_block_cache is set to BlockBasedTableOptions::PrepopulateBlockCache::kFlushOnly and table_options.partition_filters is also set true, then there is segmentation failure when top level filter is fetched because its entered with wrong type in cache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9263 Test Plan: Updated unit tests; Ran db_stress: make crash_test -j32 Reviewed By: pdillinger Differential Revision: D32936566 Pulled By: akankshamahajan15 fbshipit-source-id: 8bd79e53830d3e3c1bb79787e1ffbc3cb46d4426	3 years ago
Andrew Kryczka	a6a6aad74e	db_stress support tracking historical values (#8960 ) Summary: When `--sync_fault_injection` is set, this PR takes a snapshot of the expected values and starts an operation trace when the DB is opened. These files are stored in `--expected_values_dir`. They will be used for recovering the expected state of the DB following a crash where a suffix of unsynced operations are allowed to be lost. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8960 Test Plan: injected crashed at various points in `FileExpectedStateManager` and verified the next run recovers the state/trace file with highest seqno and removes all older/temporary files. Note we don't use sync_fault_injection in CI crash tests yet. Reviewed By: pdillinger Differential Revision: D31194941 Pulled By: ajkr fbshipit-source-id: b0f935a529a0186c5a9c7709fcaa8829de8a84cf	3 years ago
Yanqin Jin	42fef0224f	Fix build for msvc (#9230 ) Summary: Test plan With Visual Studio 2017. ``` cd rocksdb mkdir build && cd build cmake -G "Visual Studio 15 Win64" -DWITH_GFLAGS=1 .. MSBuild rocksdb.sln /m /TARGET:cache_bench /TARGET:db_bench /TARGET:db_stress ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/9230 Reviewed By: akankshamahajan15 Differential Revision: D32705095 Pulled By: riversand963 fbshipit-source-id: 101e3533f5178b24c0535ddc47a39347ccfcf92c	3 years ago
Levi Tamasi	dc5de45af8	Support readahead during compaction for blob files (#9187 ) Summary: The patch adds a new BlobDB configuration option `blob_compaction_readahead_size` that can be used to enable prefetching data from blob files during compaction. This is important when using storage with higher latencies like HDDs or remote filesystems. If enabled, prefetching is used for all cases when blobs are read during compaction, namely garbage collection, compaction filters (when the existing value has to be read from a blob file), and `Merge` (when the value of the base `Put` is stored in a blob file). Pull Request resolved: https://github.com/facebook/rocksdb/pull/9187 Test Plan: Ran `make check` and the stress/crash test. Reviewed By: riversand963 Differential Revision: D32565512 Pulled By: ltamasi fbshipit-source-id: 87be9cebc3aa01cc227bec6b5f64d827b8164f5d	3 years ago
anand76	78556c14dd	Secondary cache error injection (#9002 ) Summary: Implement secondary cache error injection in db_stress. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9002 Reviewed By: zhichao-cao Differential Revision: D31874896 Pulled By: anand1976 fbshipit-source-id: 8cf04c061a4a44efa0fe88423d05cade67b85f73	3 years ago
Yanqin Jin	d263505417	Avoid div-by-zero error in db_stress (#9086 ) Summary: If a column family has 0 levels, then existing `TestCompactFiles(...)` may hit divide-by-zero. To fix, return early if the cf is empty. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9086 Test Plan: TBD Reviewed By: ajkr Differential Revision: D31986799 Pulled By: riversand963 fbshipit-source-id: 48f7dfb2b2b47cfc1315cb71ca80eb230d947f17	3 years ago
Peter Dillinger	ad5325a736	Experimental support for SST unique IDs (#8990 ) Summary: * New public header unique_id.h and function GetUniqueIdFromTableProperties which computes a universally unique identifier based on table properties of table files from recent RocksDB versions. * Generation of DB session IDs is refactored so that they are guaranteed unique in the lifetime of a process running RocksDB. (SemiStructuredUniqueIdGen, new test included.) Along with file numbers, this enables SST unique IDs to be guaranteed unique among SSTs generated in a single process, and "better than random" between processes. See https://github.com/pdillinger/unique_id * In addition to public API producing 'external' unique IDs, there is a function for producing 'internal' unique IDs, with functions for converting between the two. In short, the external ID is "safe" for things people might do with it, and the internal ID enables more "power user" features for the future. Specifically, the external ID goes through a hashing layer so that any subset of bits in the external ID can be used as a hash of the full ID, while also preserving uniqueness guarantees in the first 128 bits (bijective both on first 128 bits and on full 192 bits). Intended follow-up: * Use the internal unique IDs in cache keys. (Avoid conflicts with https://github.com/facebook/rocksdb/issues/8912) (The file offset can be XORed into the third 64-bit value of the unique ID.) * Publish the external unique IDs in FileStorageInfo (https://github.com/facebook/rocksdb/issues/8968) Pull Request resolved: https://github.com/facebook/rocksdb/pull/8990 Test Plan: Unit tests added, and checking of unique ids in stress test. NOTE in stress test we do not generate nearly enough files to thoroughly stress uniqueness, but the test trims off pieces of the ID to check for uniqueness so that we can infer (with some assumptions) stronger properties in the aggregate. Reviewed By: zhichao-cao, mrambacher Differential Revision: D31582865 Pulled By: pdillinger fbshipit-source-id: 1f620c4c86af9abe2a8d177b9ccf2ad2b9f48243	3 years ago
Levi Tamasi	3e1bf771a3	Make it possible to force the garbage collection of the oldest blob files (#8994 ) Summary: The current BlobDB garbage collection logic works by relocating the valid blobs from the oldest blob files as they are encountered during compaction, and cleaning up blob files once they contain nothing but garbage. However, with sufficiently skewed workloads, it is theoretically possible to end up in a situation when few or no compactions get scheduled for the SST files that contain references to the oldest blob files, which can lead to increased space amp due to the lack of GC. In order to efficiently handle such workloads, the patch adds a new BlobDB configuration option called `blob_garbage_collection_force_threshold`, which signals to BlobDB to schedule targeted compactions for the SST files that keep alive the oldest batch of blob files if the overall ratio of garbage in the given blob files meets the threshold and all the given blob files are eligible for GC based on `blob_garbage_collection_age_cutoff`. (For example, if the new option is set to 0.9, targeted compactions will get scheduled if the sum of garbage bytes meets or exceeds 90% of the sum of total bytes in the oldest blob files, assuming all affected blob files are below the age-based cutoff.) The net result of these targeted compactions is that the valid blobs in the oldest blob files are relocated and the oldest blob files themselves cleaned up (since all SST files that rely on them get compacted away). These targeted compactions are similar to periodic compactions in the sense that they force certain SST files that otherwise would not get picked up to undergo compaction and also in the sense that instead of merging files from multiple levels, they target a single file. (Note: such compactions might still include neighboring files from the same level due to the need of having a "clean cut" boundary but they never include any files from any other level.) This functionality is currently only supported with the leveled compaction style and is inactive by default (since the default value is set to 1.0, i.e. 100%). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8994 Test Plan: Ran `make check` and tested using `db_bench` and the stress/crash tests. Reviewed By: riversand963 Differential Revision: D31489850 Pulled By: ltamasi fbshipit-source-id: 44057d511726a0e2a03c5d9313d7511b3f0c4eab	3 years ago
Akanksha Mahajan	84d71f30c4	Enable SingleDelete with user defined ts in db_bench and crash tests (#8971 ) Summary: Enable SingleDelete with user defined timestamp in db_bench, db_stress and crash test Pull Request resolved: https://github.com/facebook/rocksdb/pull/8971 Test Plan: 1. For db_stress, ran the command for full duration: i) python3 -u tools/db_crashtest.py --enable_ts whitebox --nooverwritepercent=100 ii) make crash_test_with_ts 2. For db_bench, ran: ./db_bench -benchmarks=randomreplacekeys -user_timestamp_size=8 -use_single_deletes=true Reviewed By: riversand963 Differential Revision: D31246558 Pulled By: akankshamahajan15 fbshipit-source-id: 29cd8740c9921341e52f09242fca3c44d75a12b7	3 years ago
mrambacher	13ae16c315	Cleanup includes in dbformat.h (#8930 ) Summary: This header file was including everything and the kitchen sink when it did not need to. This resulted in many places including this header when they needed other pieces instead. Cleaned up this header to only include what was needed and fixed up the remaining code to include what was now missing. Hopefully, this sort of code hygiene cleanup will speed up the builds... Pull Request resolved: https://github.com/facebook/rocksdb/pull/8930 Reviewed By: pdillinger Differential Revision: D31142788 Pulled By: mrambacher fbshipit-source-id: 6b45de3f300750c79f751f6227dece9cfd44085d	3 years ago
Andrew Kryczka	5c92aa38ea	Avoid overwriting first non-OK Status in db_stress setup (#8907 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8907 Reviewed By: zhichao-cao Differential Revision: D30922081 Pulled By: ajkr fbshipit-source-id: ad7a32c21d0049342fd20c9b7f555e93674c3671	3 years ago
Peter Dillinger	c9cd5d25a8	Remove some unneeded code (#8736 ) Summary: * FullKey and ParseFullKey appear to serve no purpose in the public API (or anything else) so removed. Only use in one test updated. * NumberToString serves no purpose vs. ToString so removed, numerous calls updated * Remove unnecessary forward declarations in metadata.h by re-arranging class definitions. * Remove some unneeded semicolons Pull Request resolved: https://github.com/facebook/rocksdb/pull/8736 Test Plan: existing tests Reviewed By: mrambacher Differential Revision: D30700039 Pulled By: pdillinger fbshipit-source-id: 1e436a576f511a6ed8b4d97af7cc8216bc729af2	3 years ago
Peter Dillinger	2a383f21f4	Add Bloom/Ribbon hybrid API support (#8679 ) Summary: This is essentially resurrection and fixing of the part of https://github.com/facebook/rocksdb/issues/8198 that was reverted in https://github.com/facebook/rocksdb/issues/8212, using data added in https://github.com/facebook/rocksdb/issues/8246. Basically, when configuring Ribbon filter, you can specify an LSM level before which Bloom will be used instead of Ribbon. But Bloom is only considered for Leveled and Universal compaction styles and file going into a known LSM level. This way, SST file writer, FIFO compaction, etc. use Ribbon filter as you would expect with NewRibbonFilterPolicy. So that this can be controlled with a single int value and so that flushes can be distinguished from intra-L0, we consider flush to go to level -1 for the purposes of this option. (Explained in API comment.) I also expect the most common and recommended Ribbon configuration to use Bloom during flush, to minimize slowing down writes and because according to my estimates, Ribbon only pays off if the structure lives in memory for more than an hour. Thus, I have changed the default for NewRibbonFilterPolicy to be this mild hybrid configuration. I don't really want to add something like NewHybridFilterPolicy because at least the mild hybrid configuration (Bloom for flush, Ribbon otherwise) should be considered a natural choice. C APIs also updated, but because they don't support overloading, rocksdb_filterpolicy_create_ribbon is kept pure ribbon for clarity and rocksdb_filterpolicy_create_ribbon_hybrid must be called for a hybrid configuration. While touching C API, I changed bits per key options from int to double. BuiltinFilterPolicy is needed so that LevelThresholdFilterPolicy doesn't inherit unused fields from BloomFilterPolicy. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8679 Test Plan: new + updated tests, including crash test Reviewed By: jay-zhuang Differential Revision: D30445797 Pulled By: pdillinger fbshipit-source-id: 6f5aeddfd6d79f7e55493b563c2d1d2d568892e1	3 years ago
Jay Zhuang	0729b287e9	Exclude property kLiveSstFilesSizeAtTemperature from stress_test (#8668 ) Summary: Just like other per_level properties. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8668 Test Plan: stress_test Reviewed By: zhichao-cao Differential Revision: D30360967 Pulled By: jay-zhuang fbshipit-source-id: 70da2557b95c55e8081b04ebf1a909a0fe69488f	3 years ago
Baptiste Lemaire	e3a96c4823	Memtable sampling for mempurge heuristic. (#8628 ) Summary: Changes the API of the MemPurge process: the `bool experimental_allow_mempurge` and `experimental_mempurge_policy` flags have been replaced by a `double experimental_mempurge_threshold` option. This change of API reflects another major change introduced in this PR: the MemPurgeDecider() function now works by sampling the memtables being flushed to estimate the overall amount of useful payload (payload minus the garbage), and then compare this useful payload estimate with the `double experimental_mempurge_threshold` value. Therefore, when the value of this flag is `0.0` (default value), mempurge is simply deactivated. On the other hand, a value of `DBL_MAX` would be equivalent to always going through a mempurge regardless of the garbage ratio estimate. At the moment, a `double experimental_mempurge_threshold` value else than 0.0 or `DBL_MAX` is opnly supported`with the `SkipList` memtable representation. Regarding the sampling, this PR includes the introduction of a `MemTable::UniqueRandomSample` function that collects (approximately) random entries from the memtable by using the new `SkipList::Iterator::RandomSeek()` under the hood, or by iterating through each memtable entry, depending on the target sample size and the total number of entries. The unit tests have been readapted to support this new API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8628 Reviewed By: pdillinger Differential Revision: D30149315 Pulled By: bjlemaire fbshipit-source-id: 1feef5390c95db6f4480ab4434716533d3947f27	3 years ago
Akanksha Mahajan	052c24a668	Fix db_stress failure (#8632 ) Summary: FaultInjectionTestFS injects error in Rename operation. Because of injected error, info.log fails to be created if rename returns error and info_log is set to nullptr which leads to this assertion Pull Request resolved: https://github.com/facebook/rocksdb/pull/8632 Test Plan: run the db_stress job locally Reviewed By: ajkr Differential Revision: D30167387 Pulled By: akankshamahajan15 fbshipit-source-id: 8d08c4c33e8f0cabd368bbb498d21b9de0660067	3 years ago
Baptiste Lemaire	d6006f9c9b	Add experimental mempurge policy flag to db_stress. (#8588 ) Summary: Add `experimental_mempurge_policy` flag to `db_stress` and `db_crashtest.py`. This flag is only read if the `experimental_allow_mempurge` flag is set to `true`. This flag can take the following values: `kAlways`, and `kAlternate` (default). - `kAlways`: a flush is always redirected to a mempurge. If the mempurge aborts, the a regular flush proceeds. - `kAlternate`: if one or more of the flush input memtables is an mempurge output memtable, then a flush is performed, else a mempurge is carried out. Similar to kAlways, if a mempurge aborts, the FlushJob proceeds to a regular flush to storage. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8588 Reviewed By: pdillinger Differential Revision: D29934251 Pulled By: bjlemaire fbshipit-source-id: 90c1debed2029b9915d066914556547507c33dae	3 years ago
sdong	9b41082d4a	Complete the fix of stress open WAL drop fix (#8570 ) Summary: https://github.com/facebook/rocksdb/pull/8548 is not complete. We should instead cover all cases writable files are buffered, not just when failures are ingested. Extend it to any case where failures are ingested in DB open. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8570 Test Plan: Run db_stress and see it doesn't break Reviewed By: jay-zhuang Differential Revision: D29830415 fbshipit-source-id: 94449a0468fb2f7eec17423724008c9c63b2445d	3 years ago
Jay Zhuang	66ca5ac427	Cleanup cf handlers before deleting db (#8564 ) Summary: Delete column family handlers before deleting db to avoid `last_ref` assert. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8564 Test Plan: Inject compaction test in db_stress test Reviewed By: pdillinger Differential Revision: D29797375 Pulled By: jay-zhuang fbshipit-source-id: e8baf4d279f4db5d963db95c9445454156205501	3 years ago
Peter Dillinger	d5f3b77f23	Add GetMapProperty to db_stress (#8551 ) Summary: Already has good coverage for GetProperty and GetIntProperty but this one was missing. This should add more confidence to https://github.com/facebook/rocksdb/issues/8538 Pull Request resolved: https://github.com/facebook/rocksdb/pull/8551 Test Plan: brief local run with boosted probability showed no immediate issues Reviewed By: siying Differential Revision: D29746383 Pulled By: pdillinger fbshipit-source-id: 9f9f525bc1a7607f85e563e33bda1979ef197127	3 years ago
sdong	39a07c9651	DB Stress Reopen write failure to skip WAL (#8548 ) Summary: When DB Stress enables write failure in reopen, WAL files are also created with a wrapper writalbe file which buffers write until fsync. However, crash test currently expects all writes to WAL is persistent. This is at odd with the unsynced bytes dropped. To work it around temporarily, we disable WAL write failure for now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8548 Test Plan: Run db_stress. Manual printf to make sure only WAL files are skipped. Reviewed By: jay-zhuang Differential Revision: D29745095 fbshipit-source-id: 1879dd2c01abad7879ca243ee94570ec47c347f3	3 years ago
Baptiste Lemaire	0229a88dfe	Crashtest mempurge (#8545 ) Summary: Add `experiemental_allow_mempurge` flag support for `db_stress` and `db_crashtest.py`, with a `false` default value. I succesfully tested locally both `whitebox` and `blackbox` crash tests with `experiemental_allow_mempurge` flag set as true. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8545 Reviewed By: akankshamahajan15 Differential Revision: D29734513 Pulled By: bjlemaire fbshipit-source-id: 24316c0eccf6caf409e95c035f31d822c66714ae	3 years ago
sdong	b1a53db327	FaultInjectionTestFS::DeleteFilesCreatedAfterLastDirSync() to recover… (#8501 ) Summary: … small overwritten files. If a file is overwritten with renamed and the parent path is not synced, FaultInjectionTestFS::DeleteFilesCreatedAfterLastDirSync() will delete the file. However, RocksDB relies on file renaming to be atomic no matter whether the parent directory is synced or not, and the current behavior breaks the assumption and caused some false positive: https://github.com/facebook/rocksdb/pull/8489 Since the atomic renaming is used in CURRENT files, to fix the problem, in FaultInjectionTestFS::DeleteFilesCreatedAfterLastDirSync(), we recover the state of overwritten file if the file is small. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8501 Test Plan: Run stress test for a while and see it doesn't break. Reviewed By: anand1976 Differential Revision: D29594384 fbshipit-source-id: 589b5c2f0a9d2aca53752d7bdb0231efa5b3ae92	3 years ago
anand76	fcd8088333	Temporarily disable file deletion after open failure in db_stress (#8489 ) Summary: Write and metadata error injection during DB open was enabled in https://github.com/facebook/rocksdb/issues/8474. This causes crash tests to fail very frequently due to another fault injection feature that deletes files created after the last dir sync during DB open. In real life, a similar failure would happen if the FS returns error on the CURRENT file rename, but the rename actually succeeded and got partially persisted (dir entry for the old CURRENT file got removed, but the entry for the new one is not persisted). Temporarily disable the fault injection feature until we figure out the likelihood of this bug happening and the proper way to fix it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8489 Test Plan: Stress test can open the DB successfully Reviewed By: siying Differential Revision: D29564516 Pulled By: anand1976 fbshipit-source-id: ffd1650715ea3c5bf7131936b0ca6fcf66f4e14e	3 years ago
sdong	f33611d5e9	Stress test to inject read failures in DB reopen (#8476 ) Summary: Inject read failures in DB reopen, just as what we do for metadata writes and writes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8476 Test Plan: Some manual tests and make sure failures are triggered. Reviewed By: anand1976 Differential Revision: D29507283 fbshipit-source-id: d04da0163973447041038bd87701686a417c4e0c	3 years ago
mrambacher	570248aeff	Make SecondaryCache Customizable (#8480 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8480 Reviewed By: zhichao-cao Differential Revision: D29528740 Pulled By: mrambacher fbshipit-source-id: fd0f70d15f66611c8498257a9973f7e98ca13839	3 years ago
Zhichao Cao	a95a776d75	Inject fatal write failures to db_stress when DB is running (#8479 ) Summary: add the injest_error_severity to control if it is a retryable IO Error or a fatal or unrecoverable error. Use a flag to indicate, if fatal error comes, the flag is set and db is stopped (but not corrupted). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8479 Test Plan: run ./db_stress --reopen=0 --read_fault_one_in=1000 --write_fault_one_in=5 --disable_wal=true --write_buffer_size=3000000 -writepercent=5 -readpercent=50 --injest_error_severity=2 --column_families=1, make check Reviewed By: anand1976 Differential Revision: D29524271 Pulled By: zhichao-cao fbshipit-source-id: 1aa9fb9b5655b0adba6f5ad12005ca8c074c795b	3 years ago
sdong	ba224b75c7	Stress Test to inject write failures in reopen (#8474 ) Summary: Previously Stress can inject metadata write failures when reopening a DB. We extend it to file append too, in the same way. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8474 Test Plan: manually run crash test with various setting and make sure the failures are triggered as expected. Reviewed By: zhichao-cao Differential Revision: D29503116 fbshipit-source-id: e73a446e80ccbd09301a579280e56ff949381fab	3 years ago
anand76	6f9ed59b1d	Allow db_stress to use a secondary cache (#8455 ) Summary: Add a ```-secondary_cache_uri``` to db_stress to allow the user to specify a custom ```SecondaryCache``` object from the object registry. Also allow db_crashtest.py to be run with an alternate db_stress location. Together, these changes will allow us to run db_stress using FB internal components. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8455 Reviewed By: zhichao-cao Differential Revision: D29371972 Pulled By: anand1976 fbshipit-source-id: dd1b1fd80ebbedc11aa63d9246ea6ae49edb77c4	3 years ago
Akanksha Mahajan	be8199cdb9	Run Merge with Integrated BlobDB in stress, crash and db_bench (#8461 ) Summary: Run Merge with Intergrated BlobDB in stress tests, crash tests and db_bench. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8461 Test Plan: 1. python3 -u tools/db_crashtest.py --simple whitebox ---use_merge=1 --enable_blob_files=1 2. ./db_bench --benchmarks="readwhilemerging" --merge_operator=uint64add --enable_blob_files=true Reviewed By: ltamasi Differential Revision: D29394824 Pulled By: akankshamahajan15 fbshipit-source-id: 0a8e492b13129673e088fb8af3402ab678bb473a	3 years ago
Peter Dillinger	865a25101d	Mark Ribbon filter and optimize_filters_for_memory as production (#8408 ) Summary: Marked the Ribbon filter and optimize_filters_for_memory features as production-ready, each enabling memory savings for Bloom-like filters. Use `NewRibbonFilterPolicy` in place of `NewBloomFilterPolicy` to use Ribbon filters instead of Bloom, or `ribbonfilter` in place of `bloomfilter` in configuration string. Some small refactoring in db_stress. Removed/refactored unused code in db_bench, in part preparing for future default possibly being different from "disabled." Pull Request resolved: https://github.com/facebook/rocksdb/pull/8408 Test Plan: Lots of prior automated, ad-hoc, and "real world" testing. Updated tests for new API names. Quick db_bench test: bloom fillrandom 77730 ops/sec rocksdb.block.cache.filter.bytes.insert COUNT : 89929384 ribbon fillrandom 71492 ops/sec rocksdb.block.cache.filter.bytes.insert COUNT : 64531384 Reviewed By: mrambacher Differential Revision: D29140805 Pulled By: pdillinger fbshipit-source-id: d742c922722421678f95ad85eeb0aaebc9f5e49a	3 years ago
sdong	7f3a0f5bc6	db_stress: wait for compaction to finish after open with failure injection (#8270 ) Summary: When injecting in DB open, error can happen in background threads, causing DB open succeed, but DB is soon made read-only and subsequence writes will fail, which is not expected. To prevent it from happening, wait for compaction to finish before serving the traffic. If there is a failure, reopen. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8270 Test Plan: Run the test. Reviewed By: ajkr Differential Revision: D28230537 fbshipit-source-id: e2e97888904f9b9bb50c35ccf95b88c2319ef5c3	4 years ago
sdong	e19908cba6	Refactor kill point (#8241 ) Summary: Refactor kill point to one single class, rather than several extern variables. The intention was to drop unflushed data before killing to simulate some job, and I tried to a pointer to fault ingestion fs to the killing class, but it ended up with harder than I thought. Perhaps we'll need to do this in another way. But I thought the refactoring itself is good so I send it out. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8241 Test Plan: make release and run crash test for a while. Reviewed By: anand1976 Differential Revision: D28078486 fbshipit-source-id: f9182c1455f52e6851c13f88a21bade63bcec45f	4 years ago
Andrew Kryczka	0f42e50fec	Fix `GetLiveFiles()` returning OPTIONS-000000 (#8268 ) Summary: See release note in HISTORY.md. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8268 Test Plan: unit test repro Reviewed By: siying Differential Revision: D28227901 Pulled By: ajkr fbshipit-source-id: faf61d13b9e43a761e3d5dcf8203923126b51339	4 years ago
sdong	cde69a7cfd	db_stress to add --open_metadata_write_fault_one_in (#8235 ) Summary: DB Stress to add --open_metadata_write_fault_one_in which would randomly fail in some file metadata modification operations during DB Open, including file creation, close, renaming and directory sync. Some operations can fail before and after the operations take place. If DB open fails, db_stress would retry without the failure ingestion, and DB is expected to open successfully. This option is enabled in crash test in half of the time. Some follow up changes would allow write failures in open time, and ingesting those failures in non-DB open cases. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8235 Test Plan: Run stress tests for a while and see failures got triggered. This can reproduce the bug fixed by https://github.com/facebook/rocksdb/pull/8192 and a similar one that fails when fsyncing parent directory. Reviewed By: anand1976 Differential Revision: D28010944 fbshipit-source-id: 36a96da4dc3633e5f7680cef3ea0a900fcdb5558	4 years ago
Peter Dillinger	95f6add746	Revert Ribbon starting level support from #8198 (#8212 ) Summary: This partially reverts commit `10196d7edc`. The problem with this change is because of important filter use cases: FIFO compaction and SST writer. FIFO "compaction" always uses level 0 so would only use Ribbon filters if specifically including level 0 for the Ribbon filter policy. SST writer sets level_at_creation=-1 to indicate unknown level, and this would be treated the same as level 0 unless fixed. We are keeping the part about committing to permanent schema, which is only changes to API comments and HISTORY.md. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8212 Test Plan: CI Reviewed By: jay-zhuang Differential Revision: D27896468 Pulled By: pdillinger fbshipit-source-id: 50a775f7cba5d64fb729d9b982e355864020596e	4 years ago
Peter Dillinger	10196d7edc	Ribbon long-term support, starting level support (#8198 ) Summary: Since the Ribbon filter schema seems good (compatible back to 6.15.0), this change commits to long term support of the SST schema, even though we expect the API for enabling Ribbon to change (still called NewExperimentalRibbonFilterPolicy). This also adds support for "hybrid" configuration in which some levels use Bloom (higher levels, lower numbered) for speed and the rest use Ribbon (lower levels, higher numbered) for memory space efficiency. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8198 Test Plan: unit test added, crash test support Reviewed By: jay-zhuang Differential Revision: D27831232 Pulled By: pdillinger fbshipit-source-id: 90e528677689474d293ed6710b42ba89fbd5b5ab	4 years ago
Akanksha Mahajan	0be89e87fd	Enable backup/restore for Integrated BlobDB in stress and crash tests (#8165 ) Summary: Enable backup/restore functionality with Integrated BlobDB in db_stress and crash test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8165 Test Plan: Ran python3 -u tools/db_crashtest.py --simple whitebox along with : 1. decreased "backup_in_one" value for backups to be more frequent and 2. manually changed code for "enable_blob_file" to be always true and apply blobdb params 100% for testing purpose. Reviewed By: ltamasi Differential Revision: D27636025 Pulled By: akankshamahajan15 fbshipit-source-id: 0d0e0d1479ced163f992872dc998e79c581bfc99	4 years ago
Peter Dillinger	879357fdb0	Make backups openable as read-only DBs (#8142 ) Summary: A current limitation of backups is that you don't know the exact database state of when the backup was taken. With this new feature, you can at least inspect the backup's DB state without restoring it by opening it as a read-only DB. Rather than add something like OpenAsReadOnlyDB to the BackupEngine API, which would inhibit opening stackable DB implementations read-only (if/when their APIs support it), we instead provide a DB name and Env that can be used to open as a read-only DB. Possible follow-up work: * Add a version of GetBackupInfo for a single backup. * Let CreateNewBackup return the BackupID of the newly-created backup. Implementation details: Refactored ChrootFileSystem to split off new base class RemapFileSystem, which allows more general remapping of files. We use this base class to implement BackupEngineImpl::RemapSharedFileSystem. To minimize API impact, I decided to just add these fields `name_for_open` and `env_for_open` to those set by GetBackupInfo when include_file_details=true. Creating the RemapSharedFileSystem adds a bit to the memory consumption, perhaps unnecessarily in some cases, but this has been mitigated by (a) only initialize the RemapSharedFileSystem lazily when GetBackupInfo with include_file_details=true is called, and (b) using the existing `shared_ptr<FileInfo>` objects to hold most of the mapping data. To enhance API safety, RemapSharedFileSystem is wrapped by new ReadOnlyFileSystem which rejects any attempts to write. This uncovered a couple of places in which DB::OpenForReadOnly would write to the filesystem, so I fixed these. Added a release note because this affects logging. Additional minor refactoring in backupable_db.cc to support the new functionality. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8142 Test Plan: new test (run with ASAN and UBSAN), added to stress test and ran it for a while with amplified backup_one_in Reviewed By: ajkr Differential Revision: D27535408 Pulled By: pdillinger fbshipit-source-id: 04666d310aa0261ef6b2385c43ca793ce1dfd148	4 years ago
Yanqin Jin	08144bc2f5	Add user-defined timestamps to db_stress (#8061 ) Summary: Add some basic test for user-defined timestamp to db_stress. Currently, read with timestamp always tries to read using the current timestamp. Due to the per-key timestamp-sequence ordering constraint, we only add timestamp- related tests to the `NonBatchedOpsStressTest` since this test serializes accesses to the same key and uses a file to cross-check data correctness. The timestamp feature is not supported in a number of components, e.g. Merge, SingleDelete, DeleteRange, CompactionFilter, Readonly instance, secondary instance, SST file ingestion, transaction, etc. Therefore, db_stress should exit if user enables both timestamp and these features at the same time. The (currently) incompatible features can be found in `CheckAndSetOptionsForUserTimestamp`. This PR also fixes a bug triggered when timestamp is enabled together with `index_type=kBinarySearchWithFirstKey`. This bug fix will also be in another separate PR with more unit tests coverage. Fixing it here because I do not want to exclude the index type from crash test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8061 Test Plan: make crash_test_with_ts Reviewed By: jay-zhuang Differential Revision: D27056282 Pulled By: riversand963 fbshipit-source-id: c3e00ad1023fdb9ebbdf9601ec18270c5e2925a9	4 years ago
Levi Tamasi	0d800dadea	Adjust the set of potential min_blob_size values in stress/crash tests (#8085 ) Summary: Since our stress/crash tests by default generate values of size 8, 16, or 24, it does not make much sense to set `min_blob_size` to 256. The patch updates the set of potential `min_blob_size` values in the crash test script and in `db_stress` where it might be set dynamically using `SetOptions`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8085 Test Plan: Ran `make check` and tried the crash test script. Reviewed By: riversand963 Differential Revision: D27238620 Pulled By: ltamasi fbshipit-source-id: 4a96f9944b1ed9220d3045c5ab0b34c49009aeee	4 years ago

1 2 3 4

177 Commits (8860fc902a7c8fababcb44a6373a677135a63d14)