rocksdb

Commit Graph

Author	SHA1	Message	Date
Levi Tamasi	3c9eed688e	Enable moving a string or PinnableSlice into PinnableWideColumns (#11248 ) Summary: This makes it possible to eliminate some copies in `GetEntity` / `MultiGetEntity`, in particular when `Merge`s or blobs are involved. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11248 Test Plan: `make check` Reviewed By: akankshamahajan15 Differential Revision: D43544215 Pulled By: ltamasi fbshipit-source-id: bc4c8955a24bbd8bc4ab098e72133ead757f9707	3 years ago
Yu Zhang	f007b8fdea	Support iter_start_ts in integrated BlobDB (#11244 ) Summary: Fixed an issue during backward iteration when `iter_start_ts` is set in an integrated BlobDB. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11244 Test Plan: ```make check ./db_blob_basic_test --gtest_filter="DBBlobWithTimestampTest.IterateBlobs" tools/db_crashtest.py --stress_cmd=./db_stress --cleanup_cmd='' --enable_ts whitebox --random_kill_odd 888887 --enable_blob_files=1``` Reviewed By: ltamasi Differential Revision: D43506726 Pulled By: jowlyzhang fbshipit-source-id: 2cdc19ebf8da909d8d43d621353905784949a9f0	3 years ago
Changyu Bi	229297d1b8	Refactor AddRangeDels() + consider range tombstone during compaction file cutting (#11113 ) Summary: A second attempt after https://github.com/facebook/rocksdb/issues/10802, with bug fixes and refactoring. This PR updates compaction logic to take range tombstones into account when determining whether to cut the current compaction output file (https://github.com/facebook/rocksdb/issues/4811). Before this change, only point keys were considered, and range tombstones could cause large compactions. For example, if the current compaction outputs is a range tombstone [a, b) and 2 point keys y, z, they would be added to the same file, and may overlap with too many files in the next level and cause a large compaction in the future. This PR also includes ajkr's effort to simplify the logic to add range tombstones to compaction output files in `AddRangeDels()` ([https://github.com/facebook/rocksdb/issues/11078](https://github.com/facebook/rocksdb/pull/11078#issuecomment-1386078861)). The main change is for `CompactionIterator` to emit range tombstone start keys to be processed by `CompactionOutputs`. A new class `CompactionMergingIterator` is introduced to replace `MergingIterator` under `CompactionIterator` to enable emitting of range tombstone start keys. Further improvement after this PR include cutting compaction output at some grandparent boundary key (instead of the next output key) when cutting within a range tombstone to reduce overlap with grandparents. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11113 Test Plan: * added unit test in db_range_del_test * crash test with a small key range: `python3 tools/db_crashtest.py blackbox --simple --max_key=100 --interval=600 --write_buffer_size=262144 --target_file_size_base=256 --max_bytes_for_level_base=262144 --block_size=128 --value_size_mult=33 --subcompactions=10 --use_multiget=1 --delpercent=3 --delrangepercent=2 --verify_iterator_with_expected_state_one_in=2 --num_iterations=10` Reviewed By: ajkr Differential Revision: D42655709 Pulled By: cbi42 fbshipit-source-id: 8367e36ef5640e8f21c14a3855d4a8d6e360a34c	3 years ago
ywave	9fa9becf53	fix -Wrange-loop-analysis in Apple clang version 12.0.0 (clang-1200.0.32.29) (#11240 ) Summary: Fix complain ``` db/db_impl/db_impl_compaction_flush.cc:417:19: error: loop variable 'bg_flush_arg' of type 'const rocksdb::DBImpl::BGFlushArg' creates a copy from type 'const rocksdb::DBImpl::BGFlushArg' [-Werror,-Wrange-loop-analysis] for (const auto bg_flush_arg : bg_flush_args) { ^ db/db_impl/db_impl_compaction_flush.cc:417:8: note: use reference type 'const rocksdb::DBImpl::BGFlushArg &' to prevent copying for (const auto bg_flush_arg : bg_flush_args) { ^~~~~~~~~~~~~~~~~~~~~~~~~ & db/db_impl/db_impl_compaction_flush.cc:2911:21: error: loop variable 'bg_flush_arg' of type 'const rocksdb::DBImpl::BGFlushArg' creates a copy from type 'const rocksdb::DBImpl::BGFlushArg' [-Werror,-Wrange-loop-analysis] for (const auto bg_flush_arg : bg_flush_args) { ^ db/db_impl/db_impl_compaction_flush.cc:2911:10: note: use reference type 'const rocksdb::DBImpl::BGFlushArg &' to prevent copying for (const auto bg_flush_arg : bg_flush_args) { ^~~~~~~~~~~~~~~~~~~~~~~~~ & ``` from ```sh xxx@MacBook-Pro / % g++ -v Configured with: --prefix=/Library/Developer/CommandLineTools/usr --with-gxx-include-dir=/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/4.2.1 Apple clang version 12.0.0 (clang-1200.0.32.29) Target: x86_64-apple-darwin21.6.0 Thread model: posix ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11240 Reviewed By: cbi42 Differential Revision: D43458729 Pulled By: ajkr fbshipit-source-id: 26e110f83451509463a1bc308f737ccb693c9f45	3 years ago
Changyu Bi	1b48ecc2c6	Fix an assertion failure in DBIter::SeekToLast() when user-defined timestamp is enabled (#11223 ) Summary: in DBIter::SeekToLast(), key() can be called when iter is invalid and fails the following assertion: ``` ./db/db_iter.h:153: virtual rocksdb::Slice rocksdb::DBIter::key() const: Assertion `valid_' failed. ``` This happens when `iterate_upper_bound` and timestamp_lb_ are set. SeekForPrev(iterate_upper_bound_) positions the iterator on the same user key as iterate_upper_bound_. A subsequent PrevInternal() call makes the iterator invalid just be the call to key(). This PR fixes this issue by setting updating the seek key to have max sequence number AND max timestamp when the seek key has the same user key as *iterate_upper_bound_. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11223 Test Plan: - Added a unit test that would fail the above assertion before this fix. Reviewed By: jowlyzhang Differential Revision: D43283600 Pulled By: cbi42 fbshipit-source-id: 0dd3999845b722584679bbc95be2664b266005ba	3 years ago
leipeng	ea85148b78	DBIter::FindNextUserEntryInternal: do not PrepareValue for `Delete` (#11211 ) Summary: `kTypeDeletion/kTypeDeletionWithTimestamp/kTypeSingleDeletion` does not need access iter value, so omit `PrepareValue`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11211 Reviewed By: ajkr Differential Revision: D43253068 Pulled By: cbi42 fbshipit-source-id: 1945c7f8a90b6909128a0553b62d9fd1078b0a08	3 years ago
HuangYi	83bc03a99a	add c api to set option fail_if_not_bottommost_level (#11158 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11158 Reviewed By: cbi42 Differential Revision: D42870647 Pulled By: ajkr fbshipit-source-id: 1b71a1dd415c34c332cecf60c68ce37fe4393e2a	3 years ago
HuangYi	cfe50f7e77	add c api for HyperClockCache (#11110 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11110 Reviewed By: cbi42 Differential Revision: D42660941 Pulled By: ajkr fbshipit-source-id: e977d9b76dfd5d8c62335f961c275f3b810503d7	3 years ago
Matt Jurik	142b18d00b	C-API: Support multi-CF flush (#11112 ) Summary: This PR adds support to the c-api bindings for calling `Flush()` with multiple column families, which is useful for performing atomic flushes (assuming also that the db has been opened with `atomic_flush = true`). Pull Request resolved: https://github.com/facebook/rocksdb/pull/11112 Reviewed By: cbi42 Differential Revision: D42666382 Pulled By: ajkr fbshipit-source-id: 82f05bf32d28452d85c79ea42411c8fea961fd87	3 years ago
mrambacher	b6640c3117	Remove FactoryFunc from LoadXXXObject (#11203 ) Summary: The primary purpose of the FactoryFunc was to support LITE mode where the ObjectRegistry was not available. With the removal of LITE mode, the function was no longer required. Note that the MergeOperator had some private classes defined in header files. To gain access to their constructors (and name methods), the class definitions were moved into header files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11203 Reviewed By: cbi42 Differential Revision: D43160255 Pulled By: pdillinger fbshipit-source-id: f3a465fd5d1a7049b73ecf31e4b8c3762f6dae6c	3 years ago
Andrew Kryczka	25e1365227	Merge operator failed subcode (#11231 ) Summary: From HISTORY.md: Added a subcode of `Status::Corruption`, `Status::SubCode::kMergeOperatorFailed`, for users to identify corruption failures originating in the merge operator, as opposed to RocksDB's internally identified data corruptions. This is a followup to https://github.com/facebook/rocksdb/issues/11092, where we gave users the ability to keep running a DB despite merge operator failing. Now that the DB keeps running despite such failures, they want to be able to distinguish such failures from real corruptions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11231 Test Plan: updated unit test Reviewed By: akankshamahajan15 Differential Revision: D43396607 Pulled By: ajkr fbshipit-source-id: 17fbcc779ad724dafada8abd73efd38e1c5208b9	3 years ago
Levi Tamasi	9794acb597	Add a new MultiGetEntity API (#11222 ) Summary: The new `MultiGetEntity` API can be used to get a consistent view of a batch of keys, with the results presented as wide-column entities. Similarly to `GetEntity` and the iterator's `columns` API, if the entry corresponding to the key is a wide-column entity to start with, it is returned as-is, and if it is a plain key-value, it is wrapped into an entity with a single default column. Implementation-wise, the new API shares the logic of the batched `MultiGet` API (via the `MultiGetCommon` methods). Both single-CF and multi-CF `MultiGetEntity` APIs are provided, and blobs are also supported. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11222 Test Plan: `make check` Reviewed By: akankshamahajan15 Differential Revision: D43256950 Pulled By: ltamasi fbshipit-source-id: 47fb2cb7e2d0470e3580f43fdb2fe9e51f0e7005	3 years ago
Peter Dillinger	3cacd4b4ec	Put Cache and CacheWrapper in new public header (#11192 ) Summary: The definition of the Cache class should not be needed by the vast majority of RocksDB users, so I think it is just distracting to include it in cache.h, which is primarily needed for configuring and creating caches. This change moves the class to a new header advanced_cache.h. It is just cut-and-paste except for modifying the class API comment. In general, operations on shared_ptr<Cache> should continue to work when only a forward declaration of Cache is available, as long as all the Cache instances provided are already shared_ptr. See https://stackoverflow.com/a/17650101/454544 Also, the most common way to customize a Cache is by wrapping an existing implementation, so it makes sense to provide CacheWrapper in the public API. This was a cut-and-paste job except removing the implementation of Name() so that derived classes must provide it. Intended follow-up: consolidate Release() into one function to reduce customization bugs / confusion Pull Request resolved: https://github.com/facebook/rocksdb/pull/11192 Test Plan: `make check` Reviewed By: anand1976 Differential Revision: D43055487 Pulled By: pdillinger fbshipit-source-id: 7b05492df35e0f30b581b4c24c579bc275b6d110	3 years ago
Peter Dillinger	b7747bbc9f	Attempt fix flaky DBWriteTest.LockWALInEffect (#11209 ) Summary: Example failure: ``` [ RUN ] DBWriteTestInstance/DBWriteTest.LockWALInEffect/1 db/db_write_test.cc:646: Failure Put("key3", "value") Corruption: Not active ``` Presumably from a background compaction prior to Put. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11209 Test Plan: watch CI Reviewed By: akankshamahajan15 Differential Revision: D43147727 Pulled By: pdillinger fbshipit-source-id: a1c34ac5ab124bfe2f23205a30777990056e9082	3 years ago
anand76	77b61abc7b	Fix bug in WAL streaming uncompression (#11198 ) Summary: Fix a bug in the calculation of the input buffer address/offset in log_reader.cc. The bug is when consecutive fragments of a compressed record are located at the same offset in the log reader buffer, the second fragment input buffer is treated as a leftover from the previous input buffer. As a result, the offset in the `ZSTD_inBuffer` is not reset. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11198 Test Plan: Add a unit test in log_test.cc that fails without the fix and passes with it. Reviewed By: ajkr, cbi42 Differential Revision: D43102692 Pulled By: anand1976 fbshipit-source-id: aa2648f4802c33991b76a3233c5a58d4cc9e77fd	3 years ago
Levi Tamasi	876d281592	Add compaction filter support for wide-column entities (#11196 ) Summary: The patch adds compaction filter support for wide-column entities by introducing a new `CompactionFilter` API called `FilterV3`. This API is called for regular key-values, merge operands, and wide-column entities as well. It is passed the existing value/operand or wide-column structure and it can update the value or columns or keep/delete/etc. the key-value as usual. For compatibility, the default implementation of `FilterV3` keeps all wide-column entities and falls back to calling `FilterV2` for plain old key-values and merge operands. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11196 Test Plan: `make check` Reviewed By: akankshamahajan15 Differential Revision: D43094147 Pulled By: ltamasi fbshipit-source-id: 75acabe9a35254f7f404ba6173ee9c2774382ebd	3 years ago
Hui Xiao	9b66331388	Simplify TEST_F(DBWALTest, FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL) (#11186 ) Summary: Context/Summary: Simplify `TEST_F(DBWALTest, FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL)` based on https://github.com/facebook/rocksdb/pull/11016#pullrequestreview-1205020134 and delete unused sync points. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11186 Test Plan: - UT failed before fix in https://github.com/facebook/rocksdb/pull/10892 and passes after - Check UT not flaky when running with https://app.circleci.com/pipelines/github/facebook/rocksdb/21985/workflows/5f6cc355-78c1-46d8-89ee-0fd679725a8a/jobs/540878 Reviewed By: ajkr Differential Revision: D43034723 Pulled By: hx235 fbshipit-source-id: f503774987b8f3718505f99e95080a7fad28ac66	3 years ago
Peter Dillinger	0cf1008fe3	Deprecate write_global_seqno and default to false (#11179 ) Summary: This option has long been intended to be set to false by default and deprecated. It might never be practical to completely remove the feature, so that we can continue to test for backward compatibility by keeping the ability to generate DBs in the old way. Also improved API comments. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11179 Test Plan: existing tests (with one tiny update) Reviewed By: hx235 Differential Revision: D42973927 Pulled By: pdillinger fbshipit-source-id: e9bc161cb933266e094aea2dff8cc03753c39dab	3 years ago
Peter Dillinger	390cc0b156	Ensure LockWAL() stall cleared for UnlockWAL() return (#11172 ) Summary: Fixes https://github.com/facebook/rocksdb/issues/11160 By counting the number of stalls placed on a write queue, we can check in UnlockWAL() whether the stall present at the start of UnlockWAL() has been cleared by the end, or wait until it's cleared. More details in code comments and new unit test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11172 Test Plan: unit test added. Yes, it uses sleep to amplify failure on buggy behavior if present, but using a sync point to only allow new behavior would fail with the old code only because it doesn't contain the new sync point. Basically, using a sync point in UnlockWAL() could easily mask a regression by artificially limiting key behaviors. The test would only check that UnlockWAL() invokes code that should do the right thing, without checking that it does the right thing. Reviewed By: ajkr Differential Revision: D42894341 Pulled By: pdillinger fbshipit-source-id: 15c9da0ca383e6aec845b29f5447d76cecbf46c3	3 years ago
anand76	63da9cfa26	Return any errors returned by ReadAsync to the MultiGet caller (#11171 ) Summary: Currently, we incorrectly return a Status::Corruption to the MultiGet caller if the file system ReadAsync cannot issue a read and returns an error for some reason, such as IOStatus::NotSupported(). In this PR, we copy the ReadAsync error to the request status so it can be returned to the user. Tests: Update existing unit tests and add a new one for this scenario Pull Request resolved: https://github.com/facebook/rocksdb/pull/11171 Reviewed By: akankshamahajan15 Differential Revision: D42950057 Pulled By: anand1976 fbshipit-source-id: 85ffcb015fa6c064c311f8a28488fec78c487869	3 years ago
Levi Tamasi	df680b24ef	Clean up InvokeFilterIfNeeded a bit (#11174 ) Summary: The patch makes some code quality enhancements in `CompactionIterator::InvokeFilterIfNeeded` including the renaming of `filter` (which is most likely a remnant of the days before the `FilterV2` API when the compaction filter used to return a boolean) to `decision`, the removal of some outdated comments, the elimination of an `error` flag which was only used in one failure case out of many, as well as some small stylistic improvements. (Some the above will also come in handy when adding compaction filter support for wide-column entities.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/11174 Test Plan: `make check` Reviewed By: akankshamahajan15 Differential Revision: D42901408 Pulled By: ltamasi fbshipit-source-id: ab382d59a4990c5dfe1cee219d49e1d80902b666	3 years ago
Andrew Kryczka	071c33846d	Allow canceling manual compaction while waiting for conflicting compaction (#11165 ) Summary: This PR adds logic to the `RunManualCompaction()` loop to check for cancellation before waiting on any conflicting compactions to finish. In case of cancellation, `RunManualCompaction()` no longer waits on conflicting compactions Pull Request resolved: https://github.com/facebook/rocksdb/pull/11165 Test Plan: repro test case Reviewed By: cbi42 Differential Revision: D42864058 Pulled By: ajkr fbshipit-source-id: ea4dd1a8f294abe212905495a8fbe8f07fca3f5a	3 years ago
Levi Tamasi	a82021c3d0	Fix a bug where GetEntity would expose a blob reference (#11162 ) Summary: The patch fixes a feature interaction bug between BlobDB and the `GetEntity` API: without the patch, `GetEntity` would return the blob reference (wrapped into a single-column entity) instead of the actual blob value. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11162 Test Plan: `make check` Reviewed By: akankshamahajan15 Differential Revision: D42854092 Pulled By: ltamasi fbshipit-source-id: f750d0ff57def107da16f545077ddce9860ff21a	3 years ago
Peter Dillinger	94e3beec77	Cleanup, improve, stress test LockWAL() (#11143 ) Summary: The previous API comments for LockWAL didn't provide much about why you might want to use it, and didn't really meet what one would infer its contract was. Also, LockWAL was not in db_stress / crash test. In this change: * Implement a counting semantics for LockWAL()+UnlockWAL(), so that they can safely be used concurrently across threads or recursively within a thread. This should make the API much less bug-prone and easier to use. * Make sure no UnlockWAL() is needed after non-OK LockWAL() (to match RocksDB conventions) * Make UnlockWAL() reliably return non-OK when there's no matching LockWAL() (for debug-ability) * Clarify API comments on LockWAL(), UnlockWAL(), FlushWAL(), and SyncWAL(). Their exact meanings are not obvious, and I don't think it's appropriate to talk about implementation mutexes in the API comments, but about what operations might block each other. * Add LockWAL()/UnlockWAL() to db_stress and crash test, mostly to check for assertion failures, but also checks that latest seqno doesn't change while WAL is locked. This is simpler to add when LockWAL() is allowed in multiple threads. * Remove unnecessary use of sync points in test DBWALTest::LockWal. There was a bug during development of above changes that caused this test to fail sporadically, with and without this sync point change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11143 Test Plan: unit tests added / updated, added to stress/crash test Reviewed By: ajkr Differential Revision: D42848627 Pulled By: pdillinger fbshipit-source-id: 6d976c51791941a31fd8fbf28b0f82e888d9f4b4	3 years ago
Yu Zhang	24ac53d81a	Use user key on sst file for blob verification for Get and MultiGet (#11105 ) Summary: Use the user key on sst file for blob verification for `Get` and `MultiGet` instead of the user key passed from caller. Add tests for `Get` and `MultiGet` operations when user defined timestamp feature is enabled in a BlobDB. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11105 Test Plan: make V=1 db_blob_basic_test ./db_blob_basic_test --gtest_filter="DBBlobTestWithTimestamp.*" Reviewed By: ltamasi Differential Revision: D42716487 Pulled By: jowlyzhang fbshipit-source-id: 5987ecbb7e56ddf46d2467a3649369390789506a	3 years ago
akankshamahajan	79e57a39a3	Move ExternalSSTTestEnv to FileSystemWrapper (#11139 ) Summary: Migrate ExternalSSTTestEnv to FileSystemWrapper Pull Request resolved: https://github.com/facebook/rocksdb/pull/11139 Reviewed By: anand1976 Differential Revision: D42780180 Pulled By: akankshamahajan15 fbshipit-source-id: 9a4448c9fe5186b518235fe11e1a34dcad897cdd	3 years ago
sdong	4720ba4391	Remove RocksDB LITE (#11147 ) Summary: We haven't been actively mantaining RocksDB LITE recently and the size must have been gone up significantly. We are removing the support. Most of changes were done through following comments: unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE \| egrep '[.](cc\|h)'` by Peter Dillinger. Others changes were manually applied to build scripts, CircleCI manifests, ROCKSDB_LITE is used in an expression and file db_stress_test_base.cc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147 Test Plan: See CI Reviewed By: pdillinger Differential Revision: D42796341 fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2	3 years ago
Yu Zhang	6943ff6e50	Remove deprecated util functions in options_util.h (#11126 ) Summary: Remove the util functions in options_util.h that have previously been marked deprecated. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11126 Test Plan: `make check` Reviewed By: ltamasi Differential Revision: D42757496 Pulled By: jowlyzhang fbshipit-source-id: 2a138a3c207d0e0e0bbb4d99548cf2cadb44bcfb	3 years ago
Karim TAAM	a1e92bd956	use verify checksum option in block based table reader Open() (#11099 ) Summary: ## Description In this issue https://github.com/facebook/rocksdb/issues/11002 we found that when we use rocksdb with the `verify checksum` read_option to false the verification is done anyway By analyzing the code along the stacktrace I saw that at the level of https://github.com/facebook/rocksdb/compare/main...matkt:feature/use-verify-checksum-in-block-based-table-reader?expand=1#diff-57ed8c49db2bdd4db7618646a177397674bbf25beacacecb104070071d30129f we are not keeping all the options and we forget the `verify_checksum` the comment in this class suggests that it should be managed https://github.com/facebook/rocksdb/compare/main...matkt:feature/use-verify-checksum-in-block-based-table-reader?expand=1#diff-57ed8c49db2bdd4db7618646a177397674bbf25beacacecb104070071d30129fL581 <img width="1724" alt="204511641-86ab4b9b-45e5-4a2b-a13d-81fa26435d38" src="https://user-images.githubusercontent.com/26581503/213152802-c46bc1c7-a3a2-4a6f-9bb1-bf92ee93af7a.png"> this PR just adds the line to manage the `verify checksum` ## Tests - Running unit tests - Test without setting `verify checksum` and verifying that we are calling the checksum code - Test by setting `verify checksum` to true and verifying that we are calling the checksum code - Test by setting `verify checksum` to false and verifying that we are not calling the checksum code Pull Request resolved: https://github.com/facebook/rocksdb/pull/11099 Reviewed By: cbi42 Differential Revision: D42679881 Pulled By: ajkr fbshipit-source-id: c7dd10768282fd0699f7e1bf397ceb7adbea4ab6	3 years ago
Andrew Kryczka	b44cbbf709	Fix GetMergeOperands() returning MergeInProgress (#11136 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11136 Test Plan: the provided unit test used to fail due to `GetMergeOperands()` returning `Status::MergeInProgress()`; it passes now because the `GetMergeOperands()` call returns `Status::OK()` Reviewed By: pdillinger Differential Revision: D42759198 Pulled By: ajkr fbshipit-source-id: 878f9f40ccc1d7e2fe7b1352814bae3a49c19939	3 years ago
akankshamahajan	986c5b9d4e	Migrate TestEnv in listener_test.cc to FileSystemWrapper (#11125 ) Summary: Migrate derived classes from EnvWrapper to FileSystemWrapper so we can eventually deprecate the storage methods in Env. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11125 Test Plan: CircleCI jobs Reviewed By: anand1976 Differential Revision: D42732241 Pulled By: akankshamahajan15 fbshipit-source-id: c89a70a79fcfb13e158bf8919b1a87a9de133222	3 years ago
Levi Tamasi	6da2e20df3	Remove more obsolete statistics (#11131 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11131 Test Plan: `make check` Reviewed By: pdillinger Differential Revision: D42753997 Pulled By: ltamasi fbshipit-source-id: ce8b84c1e55374257e93ed74fd255c9b759723ce	3 years ago
Peter Dillinger	546e213c4f	Fix DelayWrite() calls for two_write_queues (#11130 ) Summary: PR https://github.com/facebook/rocksdb/issues/11020 fixed a case where it was easy to deadlock the DB with LockWAL() but introduced a bug showing up as a rare assertion failure in the stress test. Specifically, `assert(w->state == STATE_INIT)` in `WriteThread::LinkOne()` called from `BeginWriteStall()`, `DelayWrite()`, `WriteImplWALOnly()`. I haven't been about to generate a unit test that reproduces this failure but I believe the root cause is that DelayWrite() was never meant to be re-entrant, only called from the DB's write_thread_ leader. https://github.com/facebook/rocksdb/issues/11020 introduced a call to DelayWrite() from the nonmem_write_thread_ group leader. This fix is to make DelayWrite() apply to the specific write queue that it is being called from (inject a dummy write stall entry to the head of the appropriate write queue). WriteController is re-entrant, based on polling and state changes signalled with bg_cv_, so can manage stalling two queues. The only anticipated complication (called out by Andrew in previous PR) is that we don't want timed write delays being injected in parallel for the two queues, because that dimishes the intended throttling effect. Thus, we only allow timed delays for the primary write queue. HISTORY not updated because this is intended for the same release where the bug was introduced. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11130 Test Plan: Although I was not able to reproduce the assertion failure, I was able to reproduce a distinct flaw with what I believe is the same root cause: a kind of deadlock if both write queues need to wake up from stopped writes. Only one will be waiting on bg_cv_ (the other waiting in `LinkOne()` for the write queue to open up), so a single SignalAll() will only unblock one of the queues, with the other re-instating the stop until another signal on bg_cv_. A simple unit test is added for this case. Will also run crash_test_with_multiops_wc_txn for a while looking for issues. Reviewed By: ajkr Differential Revision: D42749330 Pulled By: pdillinger fbshipit-source-id: 4317dd899a93d57c26fd5af7143038f82d4d4d1b	3 years ago
anand76	bcbab59c55	Migrate ErrorEnv from EnvWrapper to FileSystemWrapper (#11124 ) Summary: Migrate ErrorEnv from EnvWrapper to FileSystemWrapper so we can eventually deprecate the storage methods in Env. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11124 Reviewed By: akankshamahajan15 Differential Revision: D42727791 Pulled By: anand1976 fbshipit-source-id: e8362ad624dc28e55c99fc35eda12866755f62c6	3 years ago
sdong	2800aa069a	Remove compressed block cache (#11117 ) Summary: Compressed block cache is replaced by compressed secondary cache. Remove the feature. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11117 Test Plan: See CI passes Reviewed By: pdillinger Differential Revision: D42700164 fbshipit-source-id: 6cbb24e460da29311150865f60ecb98637f9f67d	3 years ago
Peter Dillinger	4a9185340d	A better contract for best_efforts_recovery (#11085 ) Summary: Capture more of the original intent at a high level, without getting bogged down in low-level details. The old text made some weak promises about handling of LOCK files. There should be no specific concern for LOCK files, because we already rely on LockFile() to create the file if it's not present already. And the lock file is generally size 0, so don't have to worry about truncation. Added a unit test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11085 Test Plan: existing tests, and a new one. Reviewed By: siying Differential Revision: D42713233 Pulled By: pdillinger fbshipit-source-id: 2fce7c974d35fac065037c9c4c7326a59c9fe340	3 years ago
Hui Xiao	86fa2592be	Fix data race on `ColumnFamilyData::flush_reason` by letting FlushRequest/Job owns flush_reason instead of CFD (#11111 ) Summary: Context: Concurrent flushes on the same CF can set on `ColumnFamilyData::flush_reason` before each other flush finishes. An symptom is one CF has different flush_reason with others though all of them are in an atomic flush `db_stress: db/db_impl/db_impl_compaction_flush.cc:423: rocksdb::Status rocksdb::DBImpl::AtomicFlushMemTablesToOutputFiles(const rocksdb::autovector<rocksdb::DBImpl::BGFlushArg>&, bool, rocksdb::JobContext, rocksdb::LogBuffer, rocksdb::Env::Priority): Assertion cfd->GetFlushReason() == cfds[0]->GetFlushReason() failed. ` Summary:* Suggested by ltamasi, we now refactor and let FlushRequest/Job to own flush_reason as there is no good way to define `ColumnFamilyData::flush_reason` in face of concurrent flushes on the same CF (which wasn't the case a long time ago when `ColumnFamilyData::flush_reason ` first introduced`) Tets: - new unit test - make check - aggressive crash test rehearsal Pull Request resolved: https://github.com/facebook/rocksdb/pull/11111 Reviewed By: ajkr Differential Revision: D42644600 Pulled By: hx235 fbshipit-source-id: 8589c8184869d3415e5b780c887f877818a5ebaf	3 years ago
Andrew Kryczka	b7fbcefda8	Add API to limit blast radius of merge operator failure (#11092 ) Summary: Prior to this PR, `FullMergeV2()` can only return `false` to indicate failure, which causes any operation invoking it to fail. During a compaction, such a failure causes the compaction to fail and causes the DB to irreversibly enter read-only mode. Some users asked for a way to allow the merge operator to fail without such widespread damage. To limit the blast radius of merge operator failures, this PR introduces the `MergeOperationOutput::op_failure_scope` API. When unpopulated (`kDefault`) or set to `kTryMerge`, the merge operator failure handling is the same as before. When set to `kMustMerge`, merge operator failure still causes failure to operations that must merge (`Get()`, iterator, `MultiGet()`, etc.). However, under `kMustMerge`, flushes/compactions can survive merge operator failures by outputting the unmerged input operands. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11092 Reviewed By: siying Differential Revision: D42525673 Pulled By: ajkr fbshipit-source-id: 951dc3bf190f86347dccf3381be967565cda52ee	3 years ago
Changyu Bi	e9d6a0d7ce	Fix asan failure caused by range tombstone start key use-after-free (#11106 ) Summary: the `last_tombstone_start_user_key` variable in `BuildTable()` and in `CompactionOutputs::AddRangeDels()` may point to a start key that is freed if user-defined timestamp is enabled. This was causing ASAN failure and this PR fixes this issue. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11106 Test Plan: Added UT for repro. Reviewed By: ajkr Differential Revision: D42590862 Pulled By: cbi42 fbshipit-source-id: c493265ececdf89636d801d55ae929806c4d4b2c	3 years ago
Changyu Bi	4d0f9a995c	Consider TTL compaction file cutting earlier to prevent small output file (#11075 ) Summary: in `CompactionOutputs::ShouldStopBefore()`, TTL-related states, `cur_files_to_cut_for_ttl_` and `next_files_to_cut_for_ttl_`, are not updated if the function returns early. This can cause unnecessary compaction output file cuttings and hence produce smaller output files, which may hurt write amp. See the example in the unit test for how this "unnecessary file cutting" can happen. This PR fixes this issue by moving the code for updating TTL states earlier in `CompactionOutputs::ShouldStopBefore()` so that the states are updated for each key. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11075 Test Plan: - Added new unit test. Reviewed By: hx235 Differential Revision: D42398739 Pulled By: cbi42 fbshipit-source-id: 09fab66679c1a734abcfc31bcea33dd9aeb9dbc7	3 years ago
Changyu Bi	6a82b68788	Avoid counting extra range tombstone compensated size in `AddRangeDels()` (#11091 ) Summary: in `CompactionOutputs::AddRangeDels()`, range tombstones with the same start and end key but different sequence numbers all contribute to compensated range tombstone size. This PR removes this redundancy. This PR also includes a fix from https://github.com/facebook/rocksdb/issues/11067 where a range tombstone that is not within a file's range was being added to the file. This fixes an assertion failure for `icmp.Compare(start, end) <= 0` in VersionSet::ApproximateSize() when calculating compensated range tombstone size. Assertions and a comment/essay was added to reason that no such range tombstone will be added after this fix. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11091 Test Plan: - Added unit tests - Stress test with small key range: `python3 tools/db_crashtest.py blackbox --simple --max_key=100 --interval=600 --write_buffer_size=262144 --target_file_size_base=256 --max_bytes_for_level_base=262144 --block_size=128 --value_size_mult=33 --subcompactions=10` Reviewed By: ajkr Differential Revision: D42521588 Pulled By: cbi42 fbshipit-source-id: 5bda3fe38997995314e1f7592319af12b69bc4f8	3 years ago
Changyu Bi	f515d9d203	Revert #10802 Consider range tombstone in compaction output file cutting (#11089 ) Summary: This reverts commit `f02c708aa3` since it introduced several bugs (see https://github.com/facebook/rocksdb/issues/11078 and https://github.com/facebook/rocksdb/issues/11067 for attempts to fix them) and that I do not have a high confidence to fix all of them and ensure no further ones before the next release branch cut. There are also come existing issue found during bug fixing. We will work on it and try to merge it to the release after. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11089 Test Plan: existing CI. Reviewed By: ajkr Differential Revision: D42505972 Pulled By: cbi42 fbshipit-source-id: 2f66dcde6b85dc94977b317c2ce513872cfbc153	3 years ago
Peter Dillinger	9f7801c5f1	Major Cache refactoring, CPU efficiency improvement (#10975 ) Summary: This is several refactorings bundled into one to avoid having to incrementally re-modify uses of Cache several times. Overall, there are breaking changes to Cache class, and it becomes more of low-level interface for implementing caches, especially block cache. New internal APIs make using Cache cleaner than before, and more insulated from block cache evolution. Hopefully, this is the last really big block cache refactoring, because of rather effectively decoupling the implementations from the uses. This change also removes the EXPERIMENTAL designation on the SecondaryCache support in Cache. It seems reasonably mature at this point but still subject to change/evolution (as I warn in the API docs for Cache). The high-level motivation for this refactoring is to minimize code duplication / compounding complexity in adding SecondaryCache support to HyperClockCache (in a later PR). Other benefits listed below. * static_cast lines of code +29 -35 (net removed 6) * reinterpret_cast lines of code +6 -32 (net removed 26) ## cache.h and secondary_cache.h * Always use CacheItemHelper with entries instead of just a Deleter. There are several motivations / justifications: * Simpler for implementations to deal with just one Insert and one Lookup. * Simpler and more efficient implementation because we don't have to track which entries are using helpers and which are using deleters * Gets rid of hack to classify cache entries by their deleter. Instead, the CacheItemHelper includes a CacheEntryRole. This simplifies a lot of code (cache_entry_roles.h almost eliminated). Fixes https://github.com/facebook/rocksdb/issues/9428. * Makes it trivial to adjust SecondaryCache behavior based on kind of block (e.g. don't re-compress filter blocks). * It is arguably less convenient for many direct users of Cache, but direct users of Cache are now rare with introduction of typed_cache.h (below). * I considered and rejected an alternative approach in which we reduce customizability by assuming each secondary cache compatible value starts with a Slice referencing the uncompressed block contents (already true or mostly true), but we apparently intend to stack secondary caches. Saving an entry from a compressed secondary to a lower tier requires custom handling offered by SaveToCallback, etc. * Make CreateCallback part of the helper and introduce CreateContext to work with it (alternative to https://github.com/facebook/rocksdb/issues/10562). This cleans up the interface while still allowing context to be provided for loading/parsing values into primary cache. This model works for async lookup in BlockBasedTable reader (reader owns a CreateContext) under the assumption that it always waits on secondary cache operations to finish. (Otherwise, the CreateContext could be destroyed while async operation depending on it continues.) This likely contributes most to the observed performance improvement because it saves an std::function backed by a heap allocation. * Use char* for serialized data, e.g. in SaveToCallback, where void* was confusingly used. (We use `char` for serialized byte data all over RocksDB, with many advantages over `void`. `memcpy` etc. are legacy APIs that should not be mimicked.) * Add a type alias Cache::ObjectPtr = void, so that we can better indicate the intent of the void when it is to be the object associated with a Cache entry. Related: started (but did not complete) a refactoring to move away from "value" of a cache entry toward "object" or "obj". (It is confusing to call Cache a key-value store (like DB) when it is really storing arbitrary in-memory objects, not byte strings.) * Remove unnecessary key param from DeleterFn. This is good for efficiency in HyperClockCache, which does not directly store the cache key in memory. (Alternative to https://github.com/facebook/rocksdb/issues/10774) * Add allocator to Cache DeleterFn. This is a kind of future-proofing change in case we get more serious about using the Cache allocator for memory tracked by the Cache. Right now, only the uncompressed block contents are allocated using the allocator, and a pointer to that allocator is saved as part of the cached object so that the deleter can use it. (See CacheAllocationPtr.) If in the future we are able to "flatten out" our Cache objects some more, it would be good not to have to track the allocator as part of each object. * Removes legacy `ApplyToAllCacheEntries` and changes `ApplyToAllEntries` signature for Deleter->CacheItemHelper change. ## typed_cache.h Adds various "typed" interfaces to the Cache as internal APIs, so that most uses of Cache can use simple type safe code without casting and without explicit deleters, etc. Almost all of the non-test, non-glue code uses of Cache have been migrated. (Follow-up work: CompressedSecondaryCache deserves deeper attention to migrate.) This change expands RocksDB's internal usage of metaprogramming and SFINAE (https://en.cppreference.com/w/cpp/language/sfinae). The existing usages of Cache are divided up at a high level into these new interfaces. See updated existing uses of Cache for examples of how these are used. * PlaceholderCacheInterface - Used for making cache reservations, with entries that have a charge but no value. * BasicTypedCacheInterface<TValue> - Used for primary cache storage of objects of type TValue, which can be cleaned up with std::default_delete<TValue>. The role is provided by TValue::kCacheEntryRole or given in an optional template parameter. * FullTypedCacheInterface<TValue, TCreateContext> - Used for secondary cache compatible storage of objects of type TValue. In addition to BasicTypedCacheInterface constraints, we require TValue::ContentSlice() to return persistable data. This simplifies usage for the normal case of simple secondary cache compatibility (can give you a Slice to the data already in memory). In addition to TCreateContext performing the role of Cache::CreateContext, it is also expected to provide a factory function for creating TValue. * For each of these, there's a "Shared" version (e.g. FullTypedSharedCacheInterface) that holds a shared_ptr to the Cache, rather than assuming external ownership by holding only a raw `Cache*`. These interfaces introduce specific handle types for each interface instantiation, so that it's easy to see what kind of object is controlled by a handle. (Ultimately, this might not be worth the extra complexity, but it seems OK so far.) Note: I attempted to make the cache 'charge' automatically inferred from the cache object type, such as by expecting an ApproximateMemoryUsage() function, but this is not so clean because there are cases where we need to compute the charge ahead of time and don't want to re-compute it. ## block_cache.h This header is essentially the replacement for the old block_like_traits.h. It includes various things to support block cache access with typed_cache.h for block-based table. ## block_based_table_reader.cc Before this change, accessing the block cache here was an awkward mix of static polymorphism (template TBlocklike) and switch-case on a dynamic BlockType value. This change mostly unifies on static polymorphism, relying on minor hacks in block_cache.h to distinguish variants of Block. We still check BlockType in some places (especially for stats, which could be improved in follow-up work) but at least the BlockType is a static constant from the template parameter. (No more awkward partial redundancy between static and dynamic info.) This likely contributes to the overall performance improvement, but hasn't been tested in isolation. The other key source of simplification here is a more unified system of creating block cache objects: for directly populating from primary cache and for promotion from secondary cache. Both use BlockCreateContext, for context and for factory functions. ## block_based_table_builder.cc, cache_dump_load_impl.cc Before this change, warming caches was super ugly code. Both of these source files had switch statements to basically transition from the dynamic BlockType world to the static TBlocklike world. None of that mess is needed anymore as there's a new, untyped WarmInCache function that handles all the details just as promotion from SecondaryCache would. (Fixes `TODO akanksha: Dedup below code` in block_based_table_builder.cc.) ## Everything else Mostly just updating Cache users to use new typed APIs when reasonably possible, or changed Cache APIs when not. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10975 Test Plan: tests updated Performance test setup similar to https://github.com/facebook/rocksdb/issues/10626 (by cache size, LRUCache when not "hyper" for HyperClockCache): 34MB 1thread base.hyper -> kops/s: 0.745 io_bytes/op: 2.52504e+06 miss_ratio: 0.140906 max_rss_mb: 76.4844 34MB 1thread new.hyper -> kops/s: 0.751 io_bytes/op: 2.5123e+06 miss_ratio: 0.140161 max_rss_mb: 79.3594 34MB 1thread base -> kops/s: 0.254 io_bytes/op: 1.36073e+07 miss_ratio: 0.918818 max_rss_mb: 45.9297 34MB 1thread new -> kops/s: 0.252 io_bytes/op: 1.36157e+07 miss_ratio: 0.918999 max_rss_mb: 44.1523 34MB 32thread base.hyper -> kops/s: 7.272 io_bytes/op: 2.88323e+06 miss_ratio: 0.162532 max_rss_mb: 516.602 34MB 32thread new.hyper -> kops/s: 7.214 io_bytes/op: 2.99046e+06 miss_ratio: 0.168818 max_rss_mb: 518.293 34MB 32thread base -> kops/s: 3.528 io_bytes/op: 1.35722e+07 miss_ratio: 0.914691 max_rss_mb: 264.926 34MB 32thread new -> kops/s: 3.604 io_bytes/op: 1.35744e+07 miss_ratio: 0.915054 max_rss_mb: 264.488 233MB 1thread base.hyper -> kops/s: 53.909 io_bytes/op: 2552.35 miss_ratio: 0.0440566 max_rss_mb: 241.984 233MB 1thread new.hyper -> kops/s: 62.792 io_bytes/op: 2549.79 miss_ratio: 0.044043 max_rss_mb: 241.922 233MB 1thread base -> kops/s: 1.197 io_bytes/op: 2.75173e+06 miss_ratio: 0.103093 max_rss_mb: 241.559 233MB 1thread new -> kops/s: 1.199 io_bytes/op: 2.73723e+06 miss_ratio: 0.10305 max_rss_mb: 240.93 233MB 32thread base.hyper -> kops/s: 1298.69 io_bytes/op: 2539.12 miss_ratio: 0.0440307 max_rss_mb: 371.418 233MB 32thread new.hyper -> kops/s: 1421.35 io_bytes/op: 2538.75 miss_ratio: 0.0440307 max_rss_mb: 347.273 233MB 32thread base -> kops/s: 9.693 io_bytes/op: 2.77304e+06 miss_ratio: 0.103745 max_rss_mb: 569.691 233MB 32thread new -> kops/s: 9.75 io_bytes/op: 2.77559e+06 miss_ratio: 0.103798 max_rss_mb: 552.82 1597MB 1thread base.hyper -> kops/s: 58.607 io_bytes/op: 1449.14 miss_ratio: 0.0249324 max_rss_mb: 1583.55 1597MB 1thread new.hyper -> kops/s: 69.6 io_bytes/op: 1434.89 miss_ratio: 0.0247167 max_rss_mb: 1584.02 1597MB 1thread base -> kops/s: 60.478 io_bytes/op: 1421.28 miss_ratio: 0.024452 max_rss_mb: 1589.45 1597MB 1thread new -> kops/s: 63.973 io_bytes/op: 1416.07 miss_ratio: 0.0243766 max_rss_mb: 1589.24 1597MB 32thread base.hyper -> kops/s: 1436.2 io_bytes/op: 1357.93 miss_ratio: 0.0235353 max_rss_mb: 1692.92 1597MB 32thread new.hyper -> kops/s: 1605.03 io_bytes/op: 1358.04 miss_ratio: 0.023538 max_rss_mb: 1702.78 1597MB 32thread base -> kops/s: 280.059 io_bytes/op: 1350.34 miss_ratio: 0.023289 max_rss_mb: 1675.36 1597MB 32thread new -> kops/s: 283.125 io_bytes/op: 1351.05 miss_ratio: 0.0232797 max_rss_mb: 1703.83 Almost uniformly improving over base revision, especially for hot paths with HyperClockCache, up to 12% higher throughput seen (1597MB, 32thread, hyper). The improvement for that is likely coming from much simplified code for providing context for secondary cache promotion (CreateCallback/CreateContext), and possibly from less branching in block_based_table_reader. And likely a small improvement from not reconstituting key for DeleterFn. Reviewed By: anand1976 Differential Revision: D42417818 Pulled By: pdillinger fbshipit-source-id: f86bfdd584dce27c028b151ba56818ad14f7a432	3 years ago
Changyu Bi	0a2d3b663a	Fix some unit test failure in ExternalSSTFileBasicTest (#11070 ) Summary: valgrind build for `ExternalSSTFileBasicTest/ExternalSSTFileBasicTest.IngestFileWithMixedValueType` and `ExternalSSTFileBasicTest/ExternalSSTFileBasicTest.IngestFileWithGlobalSeqnoPickedSeqno` started failing (see error message in T141554665). I could not repro but I suspect it is due to file ingestion range overlapping with ongoing compaction, which caused a new global seqno being assigned after https://github.com/facebook/rocksdb/issues/10988. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11070 Test Plan: monitor future valgrind tests result. Reviewed By: hx235 Differential Revision: D42319056 Pulled By: cbi42 fbshipit-source-id: acbcd841a2a15e36b278f39ba514f4b9a6ee43ca	3 years ago
Niklas Fiekas	ff04fb154b	Add C API for ReadOptions::async_io (#11062 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11062 Reviewed By: hx235 Differential Revision: D42297489 Pulled By: ajkr fbshipit-source-id: 03fe1477c1ae1f8af73dc77a6986fdc7025edf4f	3 years ago
ywave	7f71880de9	Fix typo in flushing stats CF (#11055 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11055 Test Plan: make check Reviewed By: cbi42 Differential Revision: D42232828 Pulled By: ajkr fbshipit-source-id: 3b46514aebff4da7e47b9954b90800ba4a3ba30b	3 years ago
HuangYi	33aca893c2	add c-api for setting option optimize_filters_for_memory (#11044 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11044 Reviewed By: cbi42 Differential Revision: D42152851 Pulled By: ajkr fbshipit-source-id: 81710d9503ba4f23f112c72ebf16a48112e27158	3 years ago
Hui Xiao	9502856edd	Add missing range conflict check between file ingestion and RefitLevel() (#10988 ) Summary: Context: File ingestion never checks whether the key range it acts on overlaps with an ongoing RefitLevel() (used in `CompactRange()` with `change_level=true`). That's because RefitLevel() doesn't register and make its key range known to file ingestion. Though it checks overlapping with other compactions by https://github.com/facebook/rocksdb/blob/7.8.fb/db/external_sst_file_ingestion_job.cc#L998. RefitLevel() (used in `CompactRange()` with `change_level=true`) doesn't check whether the key range it acts on overlaps with an ongoing file ingestion. That's because file ingestion does not register and make its key range known to other compactions. - Note that non-refitlevel-compaction (e.g, manual compaction w/o RefitLevel() or general compaction) also does not check key range overlap with ongoing file ingestion for the same reason. - But it's fine. Credited to cbi42's discovery, `WaitForIngestFile` was called by background and foreground compactions. They were introduced in `0f88160f67`, `5c64fb67d2` and `87dfc1d23e`. - Regardless, this PR registers file ingestion like a compaction is a general approach that will also add range conflict check between file ingestion and non-refitlevel-compaction, though it has not been the issue motivated this PR. Above are bugs resulting in two bad consequences: - If file ingestion and RefitLevel() creates files in the same level, then range-overlapped files will be created at that level and caught as corruption by `force_consistency_checks=true` - If file ingestion and RefitLevel() creates file in different levels, then with one further compaction on the ingested file, it can result in two same keys both with seqno 0 in two different levels. Then with iterator's [optimization](`c62f322169/db/db_iter.cc (L342-L343)`) that assumes no two same keys both with seqno 0, it will either break this assertion in debug build or, even worst, return value of this same key for the key after it, which is the wrong value to return, in release build. Therefore we decide to introduce range conflict check for file ingestion and RefitLevel() inspired from the existing range conflict check among compactions. Summary: - Treat file ingestion job and RefitLevel() as `Compaction` of new compaction reasons: `CompactionReason::kExternalSstIngestion` and `CompactionReason::kRefitLevel` and register/unregister them. File ingestion is treated as compaction from L0 to different levels and RefitLevel() as compaction from source level to target level. - Check for `RangeOverlapWithCompaction` with other ongoing compactions, `RegisterCompaction()` on this "compaction" before changing the LSM state in `VersionStorageInfo`, and `UnregisterCompaction()` after changing. - Replace scattered fixes (`0f88160f67`, `5c64fb67d2` and `87dfc1d23e`.) that prevents overlapping between file ingestion and non-refit-level compaction with this fix cuz those practices are easy to overlook. - Misc: logic cleanup, see PR comments Pull Request resolved: https://github.com/facebook/rocksdb/pull/10988 Test Plan: - New unit test `DBCompactionTestWithOngoingFileIngestionParam*` that failed pre-fix and passed afterwards. - Made compatible with existing tests, see PR comments - make check - [Ongoing] Stress test rehearsal with normal value and aggressive CI value https://github.com/facebook/rocksdb/pull/10761 Reviewed By: cbi42 Differential Revision: D41535685 Pulled By: hx235 fbshipit-source-id: 549833a577ba1496d20a870583d4caa737da1258	3 years ago
Changyu Bi	cc6f323705	Include estimated bytes deleted by range tombstones in compensated file size (#10734 ) Summary: compensate file sizes in compaction picking so files with range tombstones are preferred, such that they get compacted down earlier as they tend to delete a lot of data. This PR adds a `compensated_range_deletion_size` field in FileMeta that is computed during Flush/Compaction and persisted in MANIFEST. This value is added to `compensated_file_size` which will be used for compaction picking. Currently, for a file in level L, `compensated_range_deletion_size` is set to the estimated bytes deleted by range tombstone of this file in all levels > L. This helps to reduce space amp when data in older levels are covered by range tombstones in level L. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10734 Test Plan: - Added unit tests. - benchmark to check if the above definition `compensated_range_deletion_size` is reducing space amp as intended, without affecting write amp too much. The experiment set up favorable for this optimization: large range tombstone issued infrequently. Command used: ``` ./db_bench -benchmarks=fillrandom,waitforcompaction,stats,levelstats -use_existing_db=false -avoid_flush_during_recovery=true -write_buffer_size=33554432 -level_compaction_dynamic_level_bytes=true -max_background_jobs=8 -max_bytes_for_level_base=134217728 -target_file_size_base=33554432 -writes_per_range_tombstone=500000 -range_tombstone_width=5000000 -num=50000000 -benchmark_write_rate_limit=8388608 -threads=16 -duration=1800 --max_num_range_tombstones=1000000000 ``` In this experiment, each thread wrote 16 range tombstones over the duration of 30 minutes, each range tombstone has width 5M that is the 10% of the key space width. Results shows this PR generates a smaller DB size. Compaction stats from this PR: ``` Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 2/0 31.54 MB 0.5 0.0 0.0 0.0 8.4 8.4 0.0 1.0 0.0 63.4 135.56 110.94 544 0.249 0 0 0.0 0.0 L4 3/0 96.55 MB 0.8 18.5 6.7 11.8 18.4 6.6 0.0 2.7 65.3 64.9 290.08 284.03 108 2.686 284M 1957K 0.0 0.0 L5 15/0 404.41 MB 1.0 19.1 7.7 11.4 18.8 7.4 0.3 2.5 66.6 65.7 292.93 285.34 220 1.332 293M 3808K 0.0 0.0 L6 143/0 4.12 GB 0.0 45.0 7.5 37.5 41.6 4.1 0.0 5.5 71.2 65.9 647.00 632.66 251 2.578 739M 47M 0.0 0.0 Sum 163/0 4.64 GB 0.0 82.6 21.9 60.7 87.2 26.5 0.3 10.4 61.9 65.4 1365.58 1312.97 1123 1.216 1318M 52M 0.0 0.0 ``` Compaction stats from main: ``` Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 0/0 0.00 KB 0.0 0.0 0.0 0.0 8.4 8.4 0.0 1.0 0.0 60.5 142.12 115.89 569 0.250 0 0 0.0 0.0 L4 3/0 85.68 MB 1.0 17.7 6.8 10.9 17.6 6.7 0.0 2.6 62.7 62.3 289.05 281.79 112 2.581 272M 2309K 0.0 0.0 L5 11/0 293.73 MB 1.0 18.8 7.5 11.2 18.5 7.2 0.5 2.5 64.9 63.9 296.07 288.50 220 1.346 288M 4365K 0.0 0.0 L6 130/0 3.94 GB 0.0 51.5 7.6 43.9 47.9 3.9 0.0 6.3 67.2 62.4 784.95 765.92 258 3.042 848M 51M 0.0 0.0 Sum 144/0 4.31 GB 0.0 88.0 21.9 66.0 92.3 26.3 0.5 11.0 59.6 62.5 1512.19 1452.09 1159 1.305 1409M 58M 0.0 0.0``` Reviewed By: ajkr Differential Revision: D39834713 Pulled By: cbi42 fbshipit-source-id: fe9341040b8704a8fbb10cad5cf5c43e962c7e6b	3 years ago
Peter Dillinger	e6b6e74154	Make CompactRange() more aware of SstPartitionerFactory (#11032 ) Summary: Some users are at least considering using SstPartitioner to support efficient physical migration of specific key ranges between RocksDB instances. One might expect manual `CompactRange()` over a narrow key range across some partition to enforce partitioning of any SST files crossing that partition boundary, but that currently only works if there are keys within that range. This change makes the overlap logic in CompactRange more aware of the partitioner to automatically select relevant files crossing a partition boundary, even when they otherwise would not be selected due to the compaction range falling in a gap between entries. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11032 Test Plan: unit test included Reviewed By: hx235 Differential Revision: D41981380 Pulled By: pdillinger fbshipit-source-id: 2fe445bdddc73c00276c20f295cc1fa33d15b05a	3 years ago

1 2 3 4 5 ...

5263 Commits (a11f1e12ca95ae6795f0a0674c3a3dbf82c157e9)