rocksdb

Commit Graph

Author	SHA1	Message	Date
anand76	a6691d0f65	Update stats to help users estimate MultiGet async IO impact (#10182 ) Summary: Add a couple of stats to help users estimate the impact of potential MultiGet perf improvements - 1. NUM_LEVEL_READ_PER_MULTIGET - A histogram stat for number of levels that required MultiGet to read from a file 2. MULTIGET_COROUTINE_COUNT - A ticker stat to count the number of times the coroutine version of MultiGetFromSST was used The NUM_DATA_BLOCKS_READ_PER_LEVEL stat is obsoleted as it doesn't provide useful information for MultiGet optimization. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10182 Reviewed By: akankshamahajan15 Differential Revision: D37213296 Pulled By: anand1976 fbshipit-source-id: 5d2b7708017c0e278578ae4bffac3926f6530efb	3 years ago
Yanqin Jin	4d31d3c2ed	Abort in dbg mode after logging (#10183 ) Summary: In CompactionIterator code, there are multiple places where the process will abort in dbg mode before logging the error message describing the cause. This PR changes only the logging behavior for compaction iterator so that error message is written to LOG before the process aborts in debug mode. Also updated the triggering condition for an assertion for single delete with user-defined timestamp. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10183 Test Plan: make check Reviewed By: akankshamahajan15 Differential Revision: D37190218 Pulled By: riversand963 fbshipit-source-id: 741bb007067be7cfbe94ac9e530ad4b2b339c009	3 years ago
Akanksha Mahajan	8353ae8b27	Add few optimizations in async_io for short scans (#10140 ) Summary: This PR adds a few optimizations to async_io for shorter scans. 1. If async_io is enabled, Seek creates a FilePrefetchBuffer object to fetch the data asynchronously. However, `FilePrefetchBuffer::num_file_reads_` wasn't taken into consideration when Next was called after Seek, so Next would go straight to prefetching. This PR fixes that: Next now prefetches only if `FilePrefetchBuffer::num_file_reads_` is greater than 2 and the blocks are sequential. This scenario applies only to implicit auto readahead. 2. When Seek calls TryReadFromCacheAsync to poll, it also issued a new async call because the TryReadFromCacheAsync flow hadn't been changed. It now returns after the poll instead of prefetching any further data. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10140 Test Plan: 1. Added a unit test 2. Ran crash_test with async_io = 1 to make sure nothing crashes. Reviewed By: anand1976 Differential Revision: D37042242 Pulled By: akankshamahajan15 fbshipit-source-id: b8e6b7cb2ee0886f37a8f53951948b9084e8ffda	3 years ago
Peter Dillinger	3d358a7e25	Fix handling of accidental truncation of IDENTITY file (#10173 ) Summary: A consequence of https://github.com/facebook/rocksdb/issues/9990 was requiring a non-empty DB ID to generate new SST files. But if the DB ID is not tracked in the manifest and the IDENTITY file is somehow truncated to 0 bytes, then an empty DB ID would be assigned, leading to crash. This change ensures a non-empty DB ID is assigned and set in the IDENTITY file. Also, * Some light refactoring to clean up the logic * (I/O efficiency) If the ID is tracked in the manifest and already matches the IDENTITY file, don't needlessly overwrite the file. * (Debugging) Log the DB ID to info log on open, because sometimes IDENTITY can change if DB is moved around (though it would be unusual for info log to be copied/moved without IDENTITY file) Pull Request resolved: https://github.com/facebook/rocksdb/pull/10173 Test Plan: unit tests expanded/updated Reviewed By: ajkr Differential Revision: D37176545 Pulled By: pdillinger fbshipit-source-id: a9b414cd35bfa33de48af322a36c24538d50bef1	3 years ago
Peter Dillinger	94329ae4ec	Use only ASCII in source files (#10164 ) Summary: Fix existing usage of non-ASCII and add a check to prevent future use. Added `-n` option to greps to provide line numbers. Alternative to https://github.com/facebook/rocksdb/issues/10147 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10164 Test Plan: used new checker to find & fix cases, manually check db_bench output is preserved Reviewed By: akankshamahajan15 Differential Revision: D37148792 Pulled By: pdillinger fbshipit-source-id: 68c8b57e7ab829369540d532590bf756938855c7	3 years ago
Changyu Bi	9882652b0e	Verify write batch checksum before WAL (#10114 ) Summary: Context: WriteBatch can have key-value checksums when it was created `with protection_bytes_per_key > 0`. This PR added checksum verification for write batches before they are written to WAL. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10114 Test Plan: - Added new unit tests to db_kv_checksum_test.cc: `make check -j32` - benchmark on performance regression: `./db_bench --benchmarks=fillrandom[-X20] -db=/dev/shm/test_rocksdb -write_batch_protection_bytes_per_key=8` - Pre-PR: ` fillrandom [AVG 20 runs] : 198875 (± 3006) ops/sec; 22.0 (± 0.3) MB/sec ` - Post-PR: ` fillrandom [AVG 20 runs] : 196487 (± 2279) ops/sec; 21.7 (± 0.3) MB/sec ` Mean regressed about 1% (198875 -> 196487 ops/sec). Reviewed By: ajkr Differential Revision: D36917464 Pulled By: cbi42 fbshipit-source-id: 29beb74edf65f04b1a890b4f650d873dc7ed790d	3 years ago
Ali Saidi	2e5a323dbd	Change the instruction used for a pause on arm64 (#10118 ) Summary: While the yield instruction conceptually sounds correct, on most platforms it is a simple nop that doesn't delay the execution anywhere close to what an x86 pause instruction does. In other projects with spin-wait loops an isb has been observed to be much closer to the x86 behavior. On a Graviton3 system the following test improves on average by 2x with this change averaged over 20 runs: ``` ./db_bench -benchmarks=fillrandom -threads=64 -batch_size=1 -memtablerep=skip_list -value_size=100 --num=100000 level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999 -disable_auto_compactions --max_write_buffer_number=8 -max_background_flushes=8 --disable_wal --write_buffer_size=160000000 --block_size=16384 --allow_concurrent_memtable_write -compression_type none ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10118 Reviewed By: jay-zhuang Differential Revision: D37120578 fbshipit-source-id: c20bde4298222edfab7ff7cb6d42497e7012400d	3 years ago
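The spin-wait idea in the commit above can be sketched as follows. This is a minimal illustration, not the RocksDB source: the names `CpuRelax` and `SpinUntilSet` are hypothetical, and the choice of `isb` on arm64 simply mirrors what the commit message describes.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper (not a RocksDB API): one "relax" step inside a
// spin-wait loop. On x86 this is `pause`; on arm64 it is `isb`, which the
// commit message reports as behaving much closer to `pause` than `yield`.
inline void CpuRelax() {
#if defined(__x86_64__) || defined(__i386__)
  asm volatile("pause");
#elif defined(__aarch64__)
  asm volatile("isb");
#else
  std::this_thread::yield();  // portable fallback for other architectures
#endif
}

// Spin until `flag` becomes true, relaxing the CPU between polls.
inline void SpinUntilSet(const std::atomic<bool>& flag) {
  while (!flag.load(std::memory_order_acquire)) {
    CpuRelax();
  }
}
```

A waiter would call `SpinUntilSet(ready)` while another thread eventually stores `true` into `ready` with release ordering.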
sdong	69a32eecab	Use madvise() for mmaped file advise (#10170 ) Summary: A recent PR https://github.com/facebook/rocksdb/pull/10142 enabled fadvise for mmaped file. However, we were told that it might not take effect and that madvise() should be used instead. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10170 Test Plan: Run existing tests Run a benchmark using mmap with advise random and see I/O size is indeed small. Reviewed By: anand1976 Differential Revision: D37158582 fbshipit-source-id: 8b3a74f0e89d2e16aac78ee4124c05841d4135c3	3 years ago
Yanqin Jin	ce419c0f10	Allow db_bench and db_stress to set `allow_data_in_errors` (#10171 ) Summary: There is `Options::allow_data_in_errors` that controls whether RocksDB is allowed to log data, e.g. key, value, etc in LOG files. It is false by default. However, in db_bench and db_stress, it is often ok to log data because there is no concern about privacy. This PR allows db_stress and db_bench to set this option on the command line, while it remains false by default. Furthermore, make crash/recovery test driven by db_crashtest.py to opt-in. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10171 Test Plan: Stress test and db_bench Reviewed By: hx235 Differential Revision: D37163787 Pulled By: riversand963 fbshipit-source-id: 0242f24d292ba15b6faf8ff903963b85d3e011f8	3 years ago
Akanksha Mahajan	19345de60d	fix cancel argument for latest liburing (#10168 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10168 the arg changed to u64 Reviewed By: ajkr Differential Revision: D37155407 fbshipit-source-id: 464eab2806675f148fce075a6fea369fa3d7a9bb	3 years ago
iseki	40dfa26049	Fix C4702 on windows (#10146 ) Summary: This code is unreachable when `ROCKSDB_LITE` is not defined, and it causes the build to fail in my environment, VS2019 16.11.15. ``` -- Selecting Windows SDK version 10.0.19041.0 to target Windows 10.0.19044. -- The CXX compiler identification is MSVC 19.29.30145.0 -- The C compiler identification is MSVC 19.29.30145.0 -- The ASM compiler identification is MSVC ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10146 Reviewed By: akankshamahajan15 Differential Revision: D37112916 Pulled By: ajkr fbshipit-source-id: e0b2bf3055d6fac1b3fb40b9f02c4cbae3f82757	3 years ago
mpoeter	77f4799515	Fix potential leak when reusing PinnableSlice instances. (#10166 ) Summary: `PinnableSlice` may hold a handle to a cache value which must be released to correctly decrement the ref-counter. However, when `PinnableSlice` variables are reused, e.g. like this: ``` PinnableSlice pin_slice; db.Get("foo", &pin_slice); db.Get("foo", &pin_slice); ``` then the second `Get` simply overwrites the old value in `pin_slice` and the handle returned by the first `Get` is _not_ released. This PR adds `Reset` calls to the `Get`/`MultiGet` calls that accept `PinnableSlice` arguments to ensure proper cleanup of old values. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10166 Reviewed By: hx235 Differential Revision: D37151632 Pulled By: ajkr fbshipit-source-id: 9dd3c3288300f560531b843f67db11aeb569a9ff	3 years ago
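The leak described in the commit above can be reproduced with a toy model. This sketch deliberately does not use RocksDB: `ToyCacheEntry`, `ToyPinnedSlice`, and `ToyGet` are hypothetical stand-ins, and the `reset_first` switch exists only to contrast the pre-fix and post-fix behavior.

```cpp
#include <string>

// Toy stand-in for a ref-counted cache value (not RocksDB's Cache).
struct ToyCacheEntry {
  std::string value;
  int refs = 0;
};

// Toy stand-in for PinnableSlice: holds one reference until Reset().
class ToyPinnedSlice {
 public:
  ~ToyPinnedSlice() { Reset(); }

  // Take a reference on `entry`. Does NOT release a previously pinned entry.
  void Pin(ToyCacheEntry* entry) {
    entry_ = entry;
    ++entry_->refs;
  }

  // Release the currently held reference, if any.
  void Reset() {
    if (entry_ != nullptr) {
      --entry_->refs;
      entry_ = nullptr;
    }
  }

 private:
  ToyCacheEntry* entry_ = nullptr;
};

// Toy Get. With reset_first == false it mimics the pre-fix behavior, where
// reusing the slice overwrites the old handle without releasing it.
inline void ToyGet(ToyCacheEntry* entry, ToyPinnedSlice* out,
                   bool reset_first) {
  if (reset_first) {
    out->Reset();
  }
  out->Pin(entry);
}
```

Calling `ToyGet` twice on the same slice with `reset_first == false` leaves the entry with two references but only one owner able to release them; with `reset_first == true` the count stays at one.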
Ali Saidi	b550fc0b09	Modify the instructions emitted for PREFETCH on arm64 (#10117 ) Summary: __builtin_prefetch(...., 1) prefetches into the L2 cache on x86 while the same emits a pldl3keep instruction on arm64 which doesn't seem to be close enough. Testing on Graviton3 and M1 systems with memtablerep_bench fillrandom and skiplist, throughput increased as follows when adjusting the 1 to 2 or 3: ``` 1 -> 2 1 -> 3 ---------------------------- Graviton3 +10% +15% M1 +10% +10% ``` Given that prefetching into the L1 cache seems to help, I chose that conversion. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10117 Reviewed By: pdillinger Differential Revision: D37120475 fbshipit-source-id: db1ef43f941445019c68316500a2250acc643d5e	3 years ago
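For readers unfamiliar with the last argument of `__builtin_prefetch`, here is a small sketch of a read prefetch with the highest locality hint (3), which compilers typically lower to an L1-targeting prefetch on arm64. The `Node`, `PrefetchForRead`, and `SumList` names are hypothetical and this is not the RocksDB `PREFETCH` macro.

```cpp
#include <cstddef>

struct Node {
  int value;
  Node* next;
};

// Hint that `addr` will be read soon. The third argument is the temporal
// locality hint in the range 0..3; 3 asks for the data to be kept as close
// to the core as possible.
inline void PrefetchForRead(const void* addr) {
#if defined(__GNUC__) || defined(__clang__)
  __builtin_prefetch(addr, 0 /* read */, 3 /* highest locality */);
#else
  (void)addr;  // no-op on compilers without the builtin
#endif
}

// Walk a linked list, prefetching the next node while summing the current one.
inline long SumList(const Node* head) {
  long sum = 0;
  for (const Node* n = head; n != nullptr; n = n->next) {
    if (n->next != nullptr) {
      PrefetchForRead(n->next);
    }
    sum += n->value;
  }
  return sum;
}
```

A prefetch is only a hint, so the function's result is the same with or without it; only the memory access timing may change.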
James Tucker	751d1a3e48	mingw: remove no-asynchronous-unwind-tables (#9963 ) Summary: This default is generally incompatible with other parts of mingw, and can be applied by outside users as-needed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9963 Reviewed By: akankshamahajan15 Differential Revision: D36302813 Pulled By: ajkr fbshipit-source-id: 9456b41a96bde302bacbc39e092ccecfcb42f34f	3 years ago
Gang Liao	cba398df8a	Add blob cache option in the column family options (#10155 ) Summary: There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache. This PR is a part of https://github.com/facebook/rocksdb/issues/10156 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10155 Reviewed By: ltamasi Differential Revision: D37150819 Pulled By: gangliao fbshipit-source-id: b807c7916ea5d411588128f8e22a49f171388fe2	3 years ago
tabokie	1d2950b8dd	fix a false positive case of parsing table factory from options file (#10094 ) Summary: During options file parsing, reset table factory before attempting to parse it from string. This avoids mistakenly treating the default table factory as a newly created one. Signed-off-by: tabokie <xy.tao@outlook.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/10094 Reviewed By: akankshamahajan15 Differential Revision: D36945378 Pulled By: ajkr fbshipit-source-id: 94b2604e5e87682063b4b78f6370f3e8f101dc44	3 years ago
Hui Xiao	d665afdbf3	Account memory of FileMetaData in global memory limit (#9924 ) Summary: Context/Summary: As revealed by heap profiling, allocation of `FileMetaData` for [newly created file added to a Version](https://github.com/facebook/rocksdb/pull/9924/files#diff-a6aa385940793f95a2c5b39cc670bd440c4547fa54fd44622f756382d5e47e43R774) can consume significant heap memory. This PR is to account that toward our global memory limit based on block cache capacity. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9924 Test Plan: - Previous `make check` verified there are only 2 places where the memory of the allocated `FileMetaData` can be released - New unit test `TEST_P(ChargeFileMetadataTestWithParam, Basic)` - db bench (CPU cost of `charge_file_metadata` in write and compact) - write micros/op: -0.24% : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 (remove this option for pre-PR) -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 \| egrep 'fillseq'` - compact micros/op -0.87% : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 -numdistinct=1000 && ./db_bench -benchmarks=compact -db=$TEST_TMPDIR -use_existing_db=1 -charge_file_metadata=1 -disable_auto_compactions=1 \| egrep 'compact'` table 1 - write #-run \| (pre-PR) avg micros/op \| std micros/op \| (post-PR) micros/op \| std micros/op \| change (%) -- \| -- \| -- \| -- \| -- \| -- 10 \| 3.9711 \| 0.264408 \| 3.9914 \| 0.254563 \| 0.5111933721 20 \| 3.83905 \| 0.0664488 \| 3.8251 \| 0.0695456 \| -0.3633711465 40 \| 3.86625 \| 0.136669 \| 3.8867 \| 0.143765 \| 0.5289363078 80 \| 3.87828 \| 0.119007 \| 3.86791 \| 0.115674 \| -0.2673865734 160 \| 3.87677 \| 0.162231 \| 3.86739 \| 0.16663 \| -0.2419539978 table 2 - compact #-run \| (pre-PR) avg micros/op \| std micros/op \| (post-PR) micros/op \| std micros/op \| 
change (%) -- \| -- \| -- \| -- \| -- \| -- 10 \| 2,399,650.00 \| 96,375.80 \| 2,359,537.00 \| 53,243.60 \| -1.67 20 \| 2,410,480.00 \| 89,988.00 \| 2,433,580.00 \| 91,121.20 \| 0.96 40 \| 2.41E+06 \| 121811 \| 2.39E+06 \| 131525 \| -0.96 80 \| 2.40E+06 \| 134503 \| 2.39E+06 \| 108799 \| -0.78 - stress test: `python3 tools/db_crashtest.py blackbox --charge_file_metadata=1 --cache_size=1` killed as normal Reviewed By: ajkr Differential Revision: D36055583 Pulled By: hx235 fbshipit-source-id: b60eab94707103cb1322cf815f05810ef0232625	3 years ago
Akanksha Mahajan	40d19bc12c	Fix the failure related to io_uring_prep_cancel (#10165 ) Summary: Fixes internal jobs that are failing with ``` error: no matching function for call to 'io_uring_prep_cancel' io_uring_prep_cancel(sqe, posix_handle, 0); ^~~~~~~~~~~~~~~~~~~~ note: candidate function not viable: no known conversion from 'rocksdb::Posix_IOHandle ' to '__u64' (aka 'unsigned long long') for 2nd argument static inline void io_uring_prep_cancel(struct io_uring_sqe sqe, ``` User data is set using the `io_uring_set_data` API, so there is no need to pass posix_handle here. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10165 Test Plan: CircleCI jobs Reviewed By: jay-zhuang Differential Revision: D37145233 Pulled By: akankshamahajan15 fbshipit-source-id: 05da650e1240e9c6fcc8aed5f0067308dccb164a	3 years ago
Guido Tagliavini Ponce	f105e1a501	Make the per-shard hash table fixed-size. (#10154 ) Summary: We make the size of the per-shard hash table fixed. The base level of the hash table is now preallocated with the required capacity. The user must provide an estimate of the size of the values. Notice that even though the base level becomes fixed, the chains are still dynamic. Overall, the shard capacity mechanisms haven't changed, so we don't need to test this. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10154 Test Plan: `make -j24 check` Reviewed By: pdillinger Differential Revision: D37124451 Pulled By: guidotag fbshipit-source-id: cba6ac76052fe0ec60b8ff4211b3de7650e80d0c	3 years ago
Yanqin Jin	bfaf8291c5	Fix a race condition in transaction stress test (#10157 ) Summary: Before this PR, there can be a race condition between the thread calling `StressTest::Open()` and a background compaction thread calling `MultiOpsTxnsStressTest::VerifyPkSkFast()`. ``` Time thread1 bg_compact_thr \| TransactionDB::Open(..., &txn_db_) \| db_ is still nullptr \| db_->GetSnapshot() // segfault \| db_ = txn_db_ V ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10157 Test Plan: CI Reviewed By: akankshamahajan15 Differential Revision: D37121653 Pulled By: riversand963 fbshipit-source-id: 6a53117f958e9ee86f77297fdeb843e5160a9331	3 years ago
Akanksha Mahajan	c0e0f30667	Implement AbortIO using io_uring (#10125 ) Summary: Implement AbortIO in posix using io_uring to cancel any pending read requests submitted. Requests are cancelled using io_uring_prep_cancel, which prepares an IORING_OP_ASYNC_CANCEL request. To cancel a request, the sqe must have ->addr set to the user_data of the request it wishes to cancel. If the request is cancelled successfully, the original request is completed with -ECANCELED and the cancel request is completed with a result of 0. If the request was already running, the original may or may not complete in error. The cancel request will complete with -EALREADY for that case. And finally, if the request to cancel wasn't found, the cancel request is completed with -ENOENT. Reference: https://kernel.dk/io_uring-whatsnew.pdf, https://lore.kernel.org/io-uring/d9a8d76d23690842f666c326631ecc2d85b6c1bc.1615566409.git.asml.silence@gmail.com/ Pull Request resolved: https://github.com/facebook/rocksdb/pull/10125 Test Plan: Existing Posix tests. Reviewed By: anand1976 Differential Revision: D36946970 Pulled By: akankshamahajan15 fbshipit-source-id: 3bc1f1521b3151d01a348fc6431eb3fc85db3a14	3 years ago
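The completion codes listed in the commit above can be summarized in a small classifier. This is only a sketch of the result-code semantics as the message describes them; it does not call liburing, and `CancelOutcome` and `ClassifyCancelResult` are hypothetical names.

```cpp
#include <cerrno>

// Possible outcomes of a cancel request, keyed by the `res` value of the
// cancel request's own completion (not the original request's completion).
enum class CancelOutcome {
  kCancelled,       // res == 0: original request completes with -ECANCELED
  kAlreadyRunning,  // res == -EALREADY: original may or may not fail
  kNotFound,        // res == -ENOENT: no matching request was found
  kOtherError       // any other value
};

inline CancelOutcome ClassifyCancelResult(int res) {
  switch (res) {
    case 0:
      return CancelOutcome::kCancelled;
    case -EALREADY:
      return CancelOutcome::kAlreadyRunning;
    case -ENOENT:
      return CancelOutcome::kNotFound;
    default:
      return CancelOutcome::kOtherError;
  }
}
```

A caller draining the completion queue could use this to decide whether it still needs to wait for the original read to finish (the `kAlreadyRunning` case).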
Mark Callaghan	04bd347995	Increase num_levels for universal from 8 to 40 (#10158 ) Summary: See https://github.com/facebook/rocksdb/issues/10082 for more details. Trivial move isn't done for universal when compaction is from L0 into L0. So a too small value for num_levels with db_bench means there will be fewer trivial moves with universal and that means that write-amp will increase. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10158 Test Plan: run it Reviewed By: siying Differential Revision: D37122519 Pulled By: mdcallag fbshipit-source-id: 1cb39049676f68a6cc3ea8d105a9965f89d4d09e	3 years ago
Peter Dillinger	ad135f3ffd	Document design/specification bugs with auto_prefix_mode (#10144 ) Summary: auto_prefix_mode is designed to use prefix filtering in a particular "safe" set of cases where the upper bound and the seek key have different prefixes: where the upper bound is the "same length immediate successor". These conditions are not sufficient to guarantee the same iteration results as total_order_seek if the DB contains "short" keys, less than the "full" (maximum) prefix length. We are not simply disabling the optimization in these successor cases because it is likely that users are essentially getting what they want out of existing usage. Especially if users are constructing successor bounds with the intention of doing a prefix-bounded seek, the existing behavior is more expected than the total_order_seek behavior. Consequently, for now we reconcile the bad specification of behavior by documenting the existing mismatch with total_order_seek. A closely related issue affects hypothetical comparators like ReverseBytewiseComparator: if they "correctly" implement IsSameLengthImmediateSuccessor, auto_prefix_mode could omit more entries (other than "short" keys noted above). Luckily, the built-in ReverseBytewiseComparator has an "incorrect" implementation of IsSameLengthImmediateSuccessor that effectively prevents prefix optimization and, thus, the bug. This is now documented as a new constraint on IsSameLengthImmediateSuccessor, and the implementation tweaked to be simply "safe" rather than "incorrect". This change also includes unit test updates to demonstrate the above issues. (Test was cleaned up for readability and simplicity.) Intended follow-up: * Tweak documented axioms for prefix_extractor (more details then) * Consider some sort of fix for this case. I don't know what that would look like without breaking the performance of existing code. 
Perhaps if all keys in an SST file have prefixes that are "full length," we can track that fact and use it to allow optimization with the "same length immediate successor", but that would only apply to new files. * Consider a better system of specifying prefix bounds Pull Request resolved: https://github.com/facebook/rocksdb/pull/10144 Test Plan: test updates included Reviewed By: siying Differential Revision: D37052710 Pulled By: pdillinger fbshipit-source-id: 5f63b7d65f3f214e4b143e0f9aa1749527c587db	3 years ago
Akanksha Mahajan	8273435c22	Bypass tests instead of skipping to resolve internal failure (#10148 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10148 Reviewed By: hx235 Differential Revision: D37092202 Pulled By: akankshamahajan15 fbshipit-source-id: 12fae5641a1c4ab584e586db95f4044273aba23a	3 years ago
Guido Tagliavini Ponce	415200d792	Assume fixed size key (#10137 ) Summary: FastLRUCache now only supports 16B keys. The tests have changed to reflect this. Because the unit tests were designed for caches that accept any string as keys, some tests are no longer compatible with FastLRUCache. We have disabled those for runs with FastLRUCache. (We could potentially change all tests to use 16B keys, but we don't because the cache public API does not require this.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/10137 Test Plan: make -j24 check Reviewed By: gitbw95 Differential Revision: D37083934 Pulled By: guidotag fbshipit-source-id: be1719cf5f8364a9a32bc4555bce1a0de3833b0d	3 years ago
sdong	80afa77660	Run fadvise with mmap file (#10142 ) Summary: Right now with mmap file, we don't run fadvise following users' requests. There is no reason for that so this diff does that. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10142 Test Plan: A simple readrandom against files with page cache dropped shows latency improvement from 7.8 us to 2.8: ./db_bench -use_existing_db --benchmarks=readrandom --num=100 Reviewed By: anand1976 Differential Revision: D37074975 fbshipit-source-id: ccc72bcac1b5fd634eb8fa2b6a5d9afe332e0bf6	3 years ago
Yanqin Jin	1777e5f7e9	Snapshots with user-specified timestamps (#9879 ) Summary: In RocksDB, keys are associated with (internal) sequence numbers which denote when the keys are written to the database. Sequence numbers in different RocksDB instances are unrelated, thus not comparable. It is nice if we can associate sequence numbers with their corresponding actual timestamps. One thing we can do is to support user-defined timestamp, which allows the applications to specify the format of custom timestamps and encode a timestamp with each key. More details can be found at https://github.com/facebook/rocksdb/wiki/User-defined-Timestamp-%28Experimental%29. This PR provides a different but complementary approach. We can associate rocksdb snapshots (defined in https://github.com/facebook/rocksdb/blob/7.2.fb/include/rocksdb/snapshot.h#L20) with user-specified timestamps. Since a snapshot is essentially an object representing a sequence number, this PR establishes a bi-directional mapping between sequence numbers and timestamps. In the past, snapshots are usually taken by readers. The current super-version is grabbed, and a `rocksdb::Snapshot` object is created with the last published sequence number of the super-version. You can see that the reader actually has no good idea of what timestamp to assign to this snapshot, because by the time the `GetSnapshot()` is called, an arbitrarily long period of time may have already elapsed since the last write, which is when the last published sequence number is written. This observation motivates the creation of "timestamped" snapshots on the write path. Currently, this functionality is exposed only to the layer of `TransactionDB`. Application can tell RocksDB to create a snapshot when a transaction commits, effectively associating the last sequence number with a timestamp. It is also assumed that application will ensure any two snapshots with timestamps should satisfy the following: ``` snapshot1.seq < snapshot2.seq iff. 
snapshot1.ts < snapshot2.ts ``` If the application can guarantee that when a reader takes a timestamped snapshot, there are no active writes going on in the database, then we also allow the user to use a new API `TransactionDB::CreateTimestampedSnapshot()` to create a snapshot with associated timestamp. Code example ```cpp // Create a timestamped snapshot when committing transaction. txn->SetCommitTimestamp(100); txn->SetSnapshotOnNextOperation(); txn->Commit(); // A wrapper API for convenience Status Transaction::CommitAndTryCreateSnapshot( std::shared_ptr<TransactionNotifier> notifier, TxnTimestamp ts, std::shared_ptr<const Snapshot>* ret); // Create a timestamped snapshot if caller guarantees no concurrent writes std::pair<Status, std::shared_ptr<const Snapshot>> snapshot = txn_db->CreateTimestampedSnapshot(100); ``` The snapshots created in this way will be managed by RocksDB with ref-counting and potentially shared with other readers. We provide the following APIs for readers to retrieve a snapshot given a timestamp. ```cpp // Return the timestamped snapshot corresponding to given timestamp. If ts is // kMaxTxnTimestamp, then we return the latest timestamped snapshot if present. // Otherwise, we return the snapshot whose timestamp is equal to `ts`. If no // such snapshot exists, then we return null. std::shared_ptr<const Snapshot> TransactionDB::GetTimestampedSnapshot(TxnTimestamp ts) const; // Return the latest timestamped snapshot if present. std::shared_ptr<const Snapshot> TransactionDB::GetLatestTimestampedSnapshot() const; ``` We also provide two additional APIs for stats collection and reporting purposes. ```cpp Status TransactionDB::GetAllTimestampedSnapshots( std::vector<std::shared_ptr<const Snapshot>>& snapshots) const; // Return timestamped snapshots whose timestamps fall in [ts_lb, ts_ub) and store them in `snapshots`. 
Status TransactionDB::GetTimestampedSnapshots( TxnTimestamp ts_lb, TxnTimestamp ts_ub, std::vector<std::shared_ptr<const Snapshot>>& snapshots) const; ``` To prevent the number of timestamped snapshots from growing infinitely, we provide the following API to release timestamped snapshots whose timestamps are older than or equal to a given threshold. ```cpp void TransactionDB::ReleaseTimestampedSnapshotsOlderThan(TxnTimestamp ts); ``` Before shutdown, RocksDB will release all timestamped snapshots. Comparison with user-defined timestamp and how they can be combined: User-defined timestamp persists every key with a timestamp, while timestamped snapshots maintain a volatile mapping between snapshots (sequence numbers) and timestamps. Different internal keys with the same user key but different timestamps will be treated as different by compaction, thus a newer version will not hide older versions (with smaller timestamps) unless they are eligible for garbage collection. In contrast, taking a timestamped snapshot at a certain sequence number and timestamp prevents all the keys visible in this snapshot from being dropped by compaction. Here, visible means (seq < snapshot and most recent). The timestamped snapshot supports the semantics of reading at an exact point in time. Timestamped snapshots can also be used with user-defined timestamp. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9879 Test Plan: ``` make check TEST_TMPDIR=/dev/shm make crash_test_with_txn ``` Reviewed By: siying Differential Revision: D35783919 Pulled By: riversand963 fbshipit-source-id: 586ad905e169189e19d3bfc0cb0177a7239d1bd4	3 years ago
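The bookkeeping described in the commit above (a volatile, ordered mapping between timestamps and sequence numbers) can be modeled with a toy registry. Everything here is a hypothetical illustration: `ToySnapshot`, `ToySnapshotRegistry`, and `kToyMaxTimestamp` are not RocksDB types, and rejecting any non-increasing pair in `Create` is a simplifying assumption made to enforce the stated ordering rule.

```cpp
#include <cstdint>
#include <limits>
#include <map>
#include <memory>
#include <vector>

using ToyTimestamp = uint64_t;
constexpr ToyTimestamp kToyMaxTimestamp =
    std::numeric_limits<ToyTimestamp>::max();

struct ToySnapshot {
  uint64_t seq;
  ToyTimestamp ts;
};

class ToySnapshotRegistry {
 public:
  // Registers (seq, ts). Returns false if the pair would violate
  // "snapshot1.seq < snapshot2.seq iff snapshot1.ts < snapshot2.ts"
  // relative to the latest registered snapshot.
  bool Create(uint64_t seq, ToyTimestamp ts) {
    if (!snapshots_.empty()) {
      const ToySnapshot& latest = *snapshots_.rbegin()->second;
      if (ts <= latest.ts || seq <= latest.seq) {
        return false;
      }
    }
    snapshots_[ts] = std::make_shared<const ToySnapshot>(ToySnapshot{seq, ts});
    return true;
  }

  // Exact-match lookup; kToyMaxTimestamp returns the latest snapshot.
  std::shared_ptr<const ToySnapshot> Get(ToyTimestamp ts) const {
    if (snapshots_.empty()) {
      return nullptr;
    }
    if (ts == kToyMaxTimestamp) {
      return snapshots_.rbegin()->second;
    }
    auto it = snapshots_.find(ts);
    if (it == snapshots_.end()) {
      return nullptr;
    }
    return it->second;
  }

  // Snapshots whose timestamps fall in [ts_lb, ts_ub).
  std::vector<std::shared_ptr<const ToySnapshot>> GetRange(
      ToyTimestamp ts_lb, ToyTimestamp ts_ub) const {
    std::vector<std::shared_ptr<const ToySnapshot>> out;
    auto end = snapshots_.lower_bound(ts_ub);
    for (auto it = snapshots_.lower_bound(ts_lb); it != end; ++it) {
      out.push_back(it->second);
    }
    return out;
  }

  // Drops snapshots with timestamp <= ts, following the commit text
  // ("older than or equal to a given threshold").
  void ReleaseOlderThanOrEqual(ToyTimestamp ts) {
    snapshots_.erase(snapshots_.begin(), snapshots_.upper_bound(ts));
  }

 private:
  std::map<ToyTimestamp, std::shared_ptr<const ToySnapshot>> snapshots_;
};
```

Because the snapshots are held by `std::shared_ptr`, a reader that obtained one through `Get` keeps it alive even after the registry releases its own reference, which matches the ref-counting behavior the message describes.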
gitbw95	f4052d13b7	Enable SecondaryCache::CreateFromString to create sec cache based on the uri for CompressedSecondaryCache (#10132 ) Summary: Update SecondaryCache::CreateFromString and enable it to create sec cache based on the uri for CompressedSecondaryCache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10132 Test Plan: Add unit tests. Reviewed By: anand1976 Differential Revision: D36996997 Pulled By: gitbw95 fbshipit-source-id: 882ad563cff6d38b306a53426ad7e47273f34edc	3 years ago
Peter Dillinger	d3a3b02134	Fix bug with kHashSearch and changing prefix_extractor with SetOptions (#10128 ) Summary: When opening an SST file created using index_type=kHashSearch, the current prefix_extractor would be saved, and used with hash index if the new current prefix_extractor at query time is compatible with the SST file. This is a problem if the prefix_extractor at SST open time is not compatible but SetOptions later changes (back) to one that is compatible. This change fixes that by using the known compatible (or missing) prefix extractor we save for use with prefix filtering. Detail: I have moved the InternalKeySliceTransform wrapper to avoid some indirection and remove unnecessary fields. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10128 Test Plan: expanded unit test (using some logic from https://github.com/facebook/rocksdb/issues/10122) that fails before fix and probably covers some other previously uncovered cases. Reviewed By: siying Differential Revision: D36955738 Pulled By: pdillinger fbshipit-source-id: 0c78a6b0d24054ef2f3cb237bf010c1c5589fb10	3 years ago
Yu Zhang	693dffd8e8	Return try again when full_history_ts_low is higher than requested ts (#10126 ) Summary: This PR helps handle the race condition mentioned in this comment thread: https://github.com/facebook/rocksdb/pull/7884#discussion_r572402281 In case where actual full_history_ts_low is higher than the user's requested ts, return a try again message so they don't have the misconception that data between [ts, full_history_ts_low) is kept. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10126 Test Plan: ``` $COMPILE_WITH_ASAN=1 make -j24 all $./db_with_timestamp_basic_test --gtest_filter=UpdateFullHistoryTsLowTest.ConcurrentUpdate $ make -j24 check ``` Reviewed By: riversand963 Differential Revision: D37055368 Pulled By: jowlyzhang fbshipit-source-id: 787fd0984a246540fa03ac227b1d232590d27828	3 years ago
Peter Dillinger	5fa6ef7f18	Fix fragile CacheTest::ApplyToAllEntriesDuringResize (#10145 ) Summary: As seen in https://github.com/facebook/rocksdb/issues/10137, simply churning the cache key hashes (e.g. by changing the raw cache keys) could trigger failure in this test, due to possibility of some cache shard exceeding its portion of capacity and evicting entries. Updated the test to be less fragile by using greater margins, and added a pre-check for evictions, which doesn't manifest as a race condition, before the main check that can race. Also added stack trace handler to cache_test for debugging. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10145 Test Plan: test thousands of iterations with gtest-parallel, including with changes in https://github.com/facebook/rocksdb/issues/10137 that were surfacing the problem. Pre-check without the fix would always fail with https://github.com/facebook/rocksdb/issues/10137 Reviewed By: guidotag Differential Revision: D37058771 Pulled By: pdillinger fbshipit-source-id: a7cf137967aef49c07ae9602d8523c63e7388fab	3 years ago
Bo Wang	1a3e23a251	Update jemalloc version for platform009 (#10143 ) Summary: Update jemalloc version for platform009. Current one is a bit old and the new one can bring some quick wins (e.g. new heap profiling features on devserver). Pull Request resolved: https://github.com/facebook/rocksdb/pull/10143 Test Plan: 1. The building and testing on devserver should work. 2. `db_bench` with `--dump_malloc_stats` `./db_bench --benchmarks=fillrandom --num=10000000 -db=/db_bench_1 ` `./db_bench --benchmarks=overwrite,stats --num=10000000 -use_existing_db -duration=10 --benchmark_write_rate_limit=2000000 -db=/db_bench_1 ` `./db_bench --benchmarks=seekrandom,stats --threads=16 --num=10000000 -use_existing_db -duration=120 --benchmark_write_rate_limit=52000000 -use_direct_reads --cache_size=520000000 --statistics -db=/db_bench_1 --dump_malloc_stats=true` Before this PR: jemalloc Version: "5.2.1-1303-g73b8faa7149e452f93e52005c89459da08343570" After this PR: jemalloc Version: Reviewed By: anand1976 Differential Revision: D37049347 Pulled By: gitbw95 fbshipit-source-id: 3fcd82cca989047b4bbdfdebe5beba2c4c255ed8	3 years ago
Akanksha Mahajan	ecfd4aef0c	Enable wal_compression in crash_tests (#10141 ) Summary: Same as title Pull Request resolved: https://github.com/facebook/rocksdb/pull/10141 Test Plan: ``` export CRASH_TEST_EXT_ARGS=" --wal_compression=zstd" make crash_test -j ``` Reviewed By: riversand963 Differential Revision: D37042810 Pulled By: akankshamahajan15 fbshipit-source-id: 53f0793d78241f1b5c954dcc808cb4c0a3e9172a	3 years ago
Akanksha Mahajan	f85b31a2e9	Fix bug for WalManager with compressed WAL (#10130 ) Summary: RocksDB uses WalManager to manage WAL files. In WalManager::ReadFirstLine(), the assumption is that reading the first record of a valid WAL file will return OK status and set the output sequence to non-zero value. This assumption has been broken by WAL compression which writes a `kSetCompressionType` record which is not associated with any sequence number. Consequently, WalManager::GetSortedWalsOfType() will skip these WALs and not return them to caller, e.g. Checkpoint, Backup, causing the operations to fail. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10130 Test Plan: - Newly Added test Reviewed By: riversand963 Differential Revision: D36985744 Pulled By: akankshamahajan15 fbshipit-source-id: dfde7b3be68b6a30b75b49479779748eedf29f7f	3 years ago
Mark Callaghan	9efae14428	Fix parsing of db_bench output (#10124 ) Summary: A recent diff added a few more fields to one of the db_bench output lines that gets parsed. This diff updates tools/benchmark.sh to handle that. Old format: overwrite : 7.939 micros/op 125963 ops/sec; 50.5 MB/s New format: overwrite : 7.854 micros/op 127320 ops/sec 1800.001 seconds 229176999 operations; 51.0 MB/s Pull Request resolved: https://github.com/facebook/rocksdb/pull/10124 Test Plan: Run it Reviewed By: jay-zhuang Differential Revision: D36945137 Pulled By: mdcallag fbshipit-source-id: 9c96f79491411da997e369a3be9c6b921a21d0fa	3 years ago
Yanqin Jin	f890527b16	Update test for secondary instance in stress test (#10121 ) Summary: This PR updates secondary instance testing in stress test by default. A background thread will be started (disabled by default), running a secondary instance tailing the logs of the primary. Periodically (every 1 sec), this thread calls `TryCatchUpWithPrimary()` and uses point lookup or range scan to read some random keys with only very basic verification to make sure no assertion failure is triggered. Thanks to https://github.com/facebook/rocksdb/issues/10061 , we can enable secondary instance when user-defined timestamp is enabled. Also removed a less useful test configuration, `secondary_catch_up_one_in`. This is very similar to the periodic catch-up. In the last commit, I decided not to enable it now, but just update the tests, since secondary instance does not work well when the underlying file is renamed by primary, e.g. SstFileManager. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10121 Test Plan: ``` TEST_TMPDIR=/dev/shm/rocksdb make crash_test TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_ts TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_atomic_flush ``` Reviewed By: ajkr Differential Revision: D36939458 Pulled By: riversand963 fbshipit-source-id: 1c065b7efc3690fc341569b9d369a5cbd8ef6b3e	3 years ago
Andrew Kryczka	ff32346415	Set db_stress defaults for TSAN deadlock detector (#10131 ) Summary: After https://github.com/facebook/rocksdb/issues/9357 we began seeing the following error attempting to acquire locks for file ingestion: ``` FATAL: ThreadSanitizer CHECK failed: /home/engshare/third-party2/llvm-fb/12/src/llvm/compiler-rt/lib/sanitizer_common/sanitizer_deadlock_detector.h:67 "((n_all_locks_)) < (((sizeof(all_locks_with_contexts_)/sizeof((all_locks_with_contexts_)[0]))))" (0x40, 0x40) ``` The command was using default values for `ingest_external_file_width` (1000) and `log2_keys_per_lock` (2). The expected number of locks needed to update those keys is then (1000 / 2^2) = 250, which is above the 0x40 (64) limit. This PR reduces the default value of `ingest_external_file_width` to 100 so the expected number of locks is 25, which is within the limit. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10131 Reviewed By: ltamasi Differential Revision: D36986307 Pulled By: ajkr fbshipit-source-id: e918cdb2fcc39517d585f1e5fd2539e185ada7c1	3 years ago
gitbw95	5cbee1f609	Add unit test to verify that the dynamic priority can be passed from compaction to FS (#10088 ) Summary: Add unit tests to verify that the dynamic priority can be passed from compaction to FS. Compaction reads&writes and other DB reads&writes share the same read&write paths to FSRandomAccessFile or FSWritableFile, so a MockTestFileSystem is added to replace the default filesystem from Env to intercept and verify the io_priority. To prepare the compaction input files, use the default filesystem from Env. To test the io priority of the compaction reads and writes, db_options_.fs is set as MockTestFileSystem. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10088 Test Plan: Add unit tests. Reviewed By: anand1976 Differential Revision: D36882528 Pulled By: gitbw95 fbshipit-source-id: 120adc15801966f2b8c9fc45285f590a3fff96d1	3 years ago
zczhu	b6de139df5	Handle "NotSupported" status by default implementation of Close() in … (#10127 ) Summary: The default implementation of the Close() function in the Directory/FSDirectory classes returns `NotSupported` status. However, we don't want operations that worked in older versions to begin failing after upgrading when run on FileSystems that have not implemented Directory::Close() yet. So upper-level callers of Close() must handle the "NotSupported" status properly instead of treating it as an error status. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10127 Reviewed By: ajkr Differential Revision: D36971112 Pulled By: littlepig2013 fbshipit-source-id: 100f0e6ad1191e1acc1ba6458c566a11724cf466	3 years ago
zczhu	3ee6c9baec	Consolidate manual_compaction_paused_ check (#10070 ) Summary: As pointed out in https://github.com/facebook/rocksdb/pull/8351#discussion_r645765422, the checks of `manual_compaction_paused` and `manual_compaction_canceled` can be consolidated by setting `canceled` to true in `DisableManualCompaction()` and back to false in the last call to `EnableManualCompaction()`. Changed Tests: The original `DBTest2.PausingManualCompaction1` uses a callback function to increase `manual_compaction_paused`, and the original CompactionJob/CompactionIterator with `manual_compaction_paused` can detect this. I changed the callback function so that it sets `*canceled` to true if `canceled` is not `nullptr` (to notify CompactionJob/CompactionIterator that the compaction has been canceled). Pull Request resolved: https://github.com/facebook/rocksdb/pull/10070 Test Plan: This change does not introduce new features, only a slight difference in the compaction implementation. Run the same manual compaction unit tests as before (e.g., PausingManualCompaction[1-4], CancelManualCompaction[1-2], CancelManualCompactionWithListener in db_test2, and db_compaction_test). Reviewed By: ajkr Differential Revision: D36949133 Pulled By: littlepig2013 fbshipit-source-id: c5dc4c956fbf8f624003a0f5ad2690240063a821	3 years ago
Yu Zhang	a101c9de60	Return "invalid argument" when read timestamp is too old (#10109 ) Summary: With this change, when a given read timestamp is smaller than the column family's full_history_ts_low, Get(), MultiGet() and iterator APIs will return Status::InvalidArgument(). Test plan ``` $COMPILE_WITH_ASAN=1 make -j24 all $./db_with_timestamp_basic_test --gtest_filter=DBBasicTestWithTimestamp.UpdateFullHistoryTsLow $ make -j24 check ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10109 Reviewed By: riversand963 Differential Revision: D36901126 Pulled By: jowlyzhang fbshipit-source-id: 255feb1a66195351f06c1d0e42acb1ff74527f86	3 years ago
zczhu	9f244b2119	Fix default implementation of close() function for Directory/FSDirecto… (#10123 ) Summary: As pointed out by anand1976 in his [comment](https://github.com/facebook/rocksdb/pull/10049#pullrequestreview-994255819), the previous implementation (adding a Close() function to the Directory/FSDirectory classes) is not backward-compatible. We also mistakenly added the default implementation `return Status::NotSupported("Close")` or `return IOStatus::NotSupported("Close")` in the WritableFile class in this [pull request](https://github.com/facebook/rocksdb/pull/10101). This pull request fixes the above issue. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10123 Reviewed By: ajkr Differential Revision: D36943661 Pulled By: littlepig2013 fbshipit-source-id: 9dc45f4d2ab3a9d51c30bdfde679f1d13c4d5509	3 years ago
Guido Tagliavini Ponce	2af132c341	Fix overflow bug in standard deviation computation. (#10100 ) Summary: There was an overflow bug when computing the variance in the HistogramStat class. This manifests, for instance, when running cache_bench with default arguments. This executes 32M lookups/inserts/deletes in a block cache, and then computes (among other things) the variance of the latencies. The variance is computed as ``variance = (cur_sum_squares * cur_num - cur_sum * cur_sum) / (cur_num * cur_num)``, where ``cur_sum_squares`` is the sum of the squares of the samples, ``cur_num`` is the number of samples, and ``cur_sum`` is the sum of the samples. Because the median latency in a typical run is around 3800 nanoseconds, both the ``cur_sum_squares * cur_num`` and ``cur_sum * cur_sum`` terms overflow as uint64_t. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10100 Test Plan: Added a unit test. Run ``make -j24 histogram_test && ./histogram_test``. Reviewed By: pdillinger Differential Revision: D36942738 Pulled By: guidotag fbshipit-source-id: 0af5fb9e2a297a284e8e74c24e604d302906006e	3 years ago
Peter Dillinger	4f78f9699b	Refactor: Add BlockTypes to make them imply C++ type in block cache (#10098 ) Summary: We have three related concepts: * BlockType: an internal enum conceptually indicating a type of SST file block * CacheEntryRole: a user-facing enum for categorizing block cache entries, which is also involved in associating cache entries with an appropriate deleter. Can include categories for non-block cache entries (e.g. memory reservations). * TBlocklike: a C++ type for the actual type behind a void* cache entry. We had some existing code ugliness because BlockType did not imply TBlocklike, because of various kinds of "filter" block. This refactoring fixes that with new BlockTypes. More clean-up can come in later work. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10098 Test Plan: existing tests Reviewed By: akankshamahajan15 Differential Revision: D36897945 Pulled By: pdillinger fbshipit-source-id: 3ae496b5caa81e0a0ed85e873eb5b525e2d9a295	3 years ago
Jay Zhuang	e36008d863	Disable CI benchmark from #9723 (#10119 ) Summary: The script is broken. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10119 Reviewed By: ltamasi Differential Revision: D36928076 Pulled By: jay-zhuang fbshipit-source-id: f325cfd00869c506c64573fe8192cb5b561825d6	3 years ago
Alan Paxton	2f4a0ffef8	CI Benchmarking with CircleCI Runner and OpenSearch Dashboard (EB 1088) (#9723 ) Summary: CircleCI runner based benchmarking. A runner is a dedicated machine configured for CircleCI to perform work on. Our work is a repeatable benchmark, the `benchmark-linux` job in `config.yml`. A runner, in CircleCI terminology, is a machine that is managed by the client (us) rather than running on CircleCI resources in the cloud. This means that we define and configure the iron, and that therefore the performance is repeatable and predictable, which is what we need for performance regression benchmarking. On a time schedule (or on commit, during branch development) benchmarks are set off on the runner, and then a script, `benchmark_log_tool.py`, is run, which parses the benchmark output and pushes it into a pre-configured OpenSearch document connected to an OpenSearch dashboard. Members of the team can examine benchmark performance changes on the dashboard. As time progresses we can add different benchmarks to the suite which gets run. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9723 Reviewed By: pdillinger Differential Revision: D35555626 Pulled By: jay-zhuang fbshipit-source-id: c6a905ca04494495c3784cfbb991f5ab90c807ee	3 years ago
yite.gu	560906ab33	Add a simple example of backup and restore (#10054 ) Summary: Add a simple example of backup and restore Signed-off-by: YiteGu <ess_gyt@qq.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/10054 Reviewed By: jay-zhuang Differential Revision: D36678141 Pulled By: ajkr fbshipit-source-id: 43545356baddb4c2c76c62cd63d7a3238d1f8a00	3 years ago
Levi Tamasi	e9c74bc474	Add wide column serialization primitives (#9915 ) Summary: The patch adds some low-level logic that can be used to serialize/deserialize a sorted vector of wide columns to/from a simple binary searchable string representation. Currently, there is no user-facing API; this will be implemented in subsequent stages. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9915 Test Plan: `make check` Reviewed By: siying Differential Revision: D35978076 Pulled By: ltamasi fbshipit-source-id: 33f5f6628ec3bcd8c8beab363b1978ac047a8788	3 years ago
Yanqin Jin	3e02c6e05a	Point-lookup returns timestamps of Delete and SingleDelete (#10056 ) Summary: If caller specifies a non-null `timestamp` argument in `DB::Get()` or a non-null `timestamps` in `DB::MultiGet()`, RocksDB will return the timestamps of the point tombstones. Note: DeleteRange is still unsupported. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10056 Test Plan: make check Reviewed By: ltamasi Differential Revision: D36677956 Pulled By: riversand963 fbshipit-source-id: 2d7af02cc7237b1829cd269086ea895a49d501ae	3 years ago
Hui Xiao	4bdcc80192	Increase ChargeTableReaderTest/ChargeTableReaderTest.Basic error tolerance rate from 1% to 5% (#10113 ) Summary: Context: https://github.com/facebook/rocksdb/pull/9748 added support to charge table reader memory to block cache. The test `ChargeTableReaderTest/ChargeTableReaderTest.Basic` estimates the table reader memory, calculates the expected number of table readers opened based on this estimate, and asserts this number against the actual number. The expected number of table readers opened, being based on estimated table reader memory, will not be 100% accurate and should have a tolerance for error. The tolerance was previously set to 1%, and the test recently hit the assertion failure `(opened_table_reader_num) <= (max_table_reader_num_capped_upper_bound), actual: 375 or 376 vs 374`, where `opened_table_reader_num` is the actual number opened and `max_table_reader_num_capped_upper_bound` is the estimated number (=371 * 1.01). I believe it's safe to increase the error tolerance from 1% to 5%, hence this PR. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10113 Test Plan: - CI again succeeds. Reviewed By: ajkr Differential Revision: D36911556 Pulled By: hx235 fbshipit-source-id: 259687dd77b450fea0f5658a5b567a1d31d4b1f7	3 years ago
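
The tolerance arithmetic in 4bdcc80192 (#10113) can be checked with a small sketch. `CappedUpperBound` is a hypothetical helper, not the test's actual code, and it assumes the bound is truncated to an integer, which is what the quoted figure 374 (= 371 * 1.01) suggests.

```cpp
#include <cstddef>

// Hypothetical helper (not RocksDB code): upper bound on the number of opened
// table readers, given an estimated count and a relative error tolerance.
// Assumption: the fractional part is truncated, matching 371 * 1.01 -> 374.
std::size_t CappedUpperBound(std::size_t estimated, double tolerance) {
  return static_cast<std::size_t>(estimated * (1.0 + tolerance));
}
```

Under these assumptions, `CappedUpperBound(371, 0.01)` is 374, below the observed 375 and 376, while `CappedUpperBound(371, 0.05)` is 389, which leaves room for both.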

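The overflow described in 2af132c341 (#10100) can be reproduced with rough numbers: 32,000,000 samples of about 3800 ns each. The sketch below is illustrative only. `MulOverflowsU64` and `VarianceDouble` are hypothetical names, and evaluating the formula in `double` is an assumed workaround, not necessarily the change made in the PR.

```cpp
#include <cstdint>
#include <limits>

// Returns true if a * b does not fit in uint64_t.
bool MulOverflowsU64(uint64_t a, uint64_t b) {
  return a != 0 && b > std::numeric_limits<uint64_t>::max() / a;
}

// Evaluates (sum_squares * num - sum * sum) / (num * num) in floating point,
// so the intermediate products cannot wrap around as they do in uint64_t.
// Assumption: the precision of double is acceptable for reporting statistics.
double VarianceDouble(uint64_t num, uint64_t sum, uint64_t sum_squares) {
  if (num == 0) return 0.0;
  const double n = static_cast<double>(num);
  const double s = static_cast<double>(sum);
  const double ss = static_cast<double>(sum_squares);
  const double v = (ss * n - s * s) / (n * n);
  return v > 0.0 ? v : 0.0;  // clamp small negative rounding errors
}
```

With `cur_num = 32000000`, `cur_sum = 121600000000` and `cur_sum_squares = 462080000000000`, both products are about 1.48e22, well above the uint64_t maximum of roughly 1.8e19.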

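The lock-count reasoning in ff32346415 (#10131) is simple enough to sketch. `ExpectedLocks` and `kTsanLockLimit` are hypothetical names used only for illustration. The sketch assumes the ingested keys are consecutive, so each group of 2^`log2_keys_per_lock` keys shares one lock.

```cpp
#include <cstdint>

// Limit reported in the ThreadSanitizer CHECK failure (0x40 == 64).
constexpr uint64_t kTsanLockLimit = 0x40;

// Hypothetical helper: expected number of key locks needed to cover `width`
// consecutive keys when each lock guards 2^log2_keys_per_lock keys.
uint64_t ExpectedLocks(uint64_t width, unsigned log2_keys_per_lock) {
  const uint64_t keys_per_lock = uint64_t{1} << log2_keys_per_lock;
  return (width + keys_per_lock - 1) / keys_per_lock;  // round up
}
```

With the old default, `ExpectedLocks(1000, 2)` is 250, above the limit of 64. With `ingest_external_file_width` reduced to 100 it is 25, which is within the limit.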
11759 Commits (6aef1a05d65d10731fada543ecab838c51d01156)
