rocksdb

Commit Graph

Author	SHA1	Message	Date
Jay Zhuang	5d3aefb682	Migrate to docker for CI run (#10496 ) Summary: Moved linux builds to using docker to avoid CI instability caused by dependency installation site down. Added the `Dockerfile` which is used to build the image. The build time is also significantly reduced, because no dependencies installation and with using 2xlarge+ instance for slow build (like tsan test). Also fixed a few issues detected while building this: * `DestoryDB()` Status not checked for a few tests * nullptr might be used in `inlineskiplist.cc` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10496 Test Plan: CI Reviewed By: ajkr Differential Revision: D38554200 Pulled By: jay-zhuang fbshipit-source-id: 16e8fb2bf07b9c84bb27fb18421c4d54f2f248fd	3 years ago
Jay Zhuang	fcccc412d7	Remove Travis CI (#10407 ) Summary: Travis CI is depreciated and haven't been maintained for some time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10407 Reviewed By: ajkr Differential Revision: D38078382 Pulled By: jay-zhuang fbshipit-source-id: f42057f2f41f722bdce56bf195f67a94835191fb	3 years ago
DaPorkchop_	6bebe65030	Correctly implement Create-/DropColumnFamilies for PessimisticTransactionDB (#10332 ) Summary: This overrides `CreateColumnFamilies` and `DropColumnFamilies` in `PessimisticTransactionDB` in order to add/remove the created column families to/from the lock manager. Fixes https://github.com/facebook/rocksdb/issues/10322. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10332 Reviewed By: ajkr Differential Revision: D37841079 Pulled By: riversand963 fbshipit-source-id: 854d7d9948b0089e0054a8f2875485ba44436fd2	3 years ago
Andrew Kryczka	5d6005c780	Add WriteOptions::protection_bytes_per_key (#10037 ) Summary: Added an option, `WriteOptions::protection_bytes_per_key`, that controls how many bytes per key we use for integrity protection in `WriteBatch`. It takes effect when `WriteBatch::GetProtectionBytesPerKey() == 0`. Currently the only supported value is eight. Invoking a user API with it set to any other nonzero value will result in `Status::NotSupported` returned to the user. There is also a bug fix for integrity protection with `inplace_callback`, where we forgot to take into account the possible change in varint length when calculating KV checksum for the final encoded buffer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10037 Test Plan: - Manual - Set default value of `WriteOptions::protection_bytes_per_key` to eight and ran `make check -j24` - Enabled in MyShadow for 1+ week - Automated - Unit tests have a `WriteMode` that enables the integrity protection via `WriteOptions` - Crash test - in most cases, use `WriteOptions::protection_bytes_per_key` to enable integrity protection Reviewed By: cbi42 Differential Revision: D36614569 Pulled By: ajkr fbshipit-source-id: 8650087ceac9b61b560f1e5fafe5e1baf9c725fb	3 years ago
Yanqin Jin	1777e5f7e9	Snapshots with user-specified timestamps (#9879 ) Summary: In RocksDB, keys are associated with (internal) sequence numbers which denote when the keys are written to the database. Sequence numbers in different RocksDB instances are unrelated, thus not comparable. It is nice if we can associate sequence numbers with their corresponding actual timestamps. One thing we can do is to support user-defined timestamp, which allows the applications to specify the format of custom timestamps and encode a timestamp with each key. More details can be found at https://github.com/facebook/rocksdb/wiki/User-defined-Timestamp-%28Experimental%29. This PR provides a different but complementary approach. We can associate rocksdb snapshots (defined in https://github.com/facebook/rocksdb/blob/7.2.fb/include/rocksdb/snapshot.h#L20) with user-specified timestamps. Since a snapshot is essentially an object representing a sequence number, this PR establishes a bi-directional mapping between sequence numbers and timestamps. In the past, snapshots are usually taken by readers. The current super-version is grabbed, and a `rocksdb::Snapshot` object is created with the last published sequence number of the super-version. You can see that the reader actually has no good idea of what timestamp to assign to this snapshot, because by the time the `GetSnapshot()` is called, an arbitrarily long period of time may have already elapsed since the last write, which is when the last published sequence number is written. This observation motivates the creation of "timestamped" snapshots on the write path. Currently, this functionality is exposed only to the layer of `TransactionDB`. Application can tell RocksDB to create a snapshot when a transaction commits, effectively associating the last sequence number with a timestamp. It is also assumed that application will ensure any two snapshots with timestamps should satisfy the following: ``` snapshot1.seq < snapshot2.seq iff. snapshot1.ts < snapshot2.ts ``` If the application can guarantee that when a reader takes a timestamped snapshot, there is no active writes going on in the database, then we also allow the user to use a new API `TransactionDB::CreateTimestampedSnapshot()` to create a snapshot with associated timestamp. Code example ```cpp // Create a timestamped snapshot when committing transaction. txn->SetCommitTimestamp(100); txn->SetSnapshotOnNextOperation(); txn->Commit(); // A wrapper API for convenience Status Transaction::CommitAndTryCreateSnapshot( std::shared_ptr<TransactionNotifier> notifier, TxnTimestamp ts, std::shared_ptr<const Snapshot>* ret); // Create a timestamped snapshot if caller guarantees no concurrent writes std::pair<Status, std::shared_ptr<const Snapshot>> snapshot = txn_db->CreateTimestampedSnapshot(100); ``` The snapshots created in this way will be managed by RocksDB with ref-counting and potentially shared with other readers. We provide the following APIs for readers to retrieve a snapshot given a timestamp. ```cpp // Return the timestamped snapshot correponding to given timestamp. If ts is // kMaxTxnTimestamp, then we return the latest timestamped snapshot if present. // Othersise, we return the snapshot whose timestamp is equal to `ts`. If no // such snapshot exists, then we return null. std::shared_ptr<const Snapshot> TransactionDB::GetTimestampedSnapshot(TxnTimestamp ts) const; // Return the latest timestamped snapshot if present. std::shared_ptr<const Snapshot> TransactionDB::GetLatestTimestampedSnapshot() const; ``` We also provide two additional APIs for stats collection and reporting purposes. ```cpp Status TransactionDB::GetAllTimestampedSnapshots( std::vector<std::shared_ptr<const Snapshot>>& snapshots) const; // Return timestamped snapshots whose timestamps fall in [ts_lb, ts_ub) and store them in `snapshots`. Status TransactionDB::GetTimestampedSnapshots( TxnTimestamp ts_lb, TxnTimestamp ts_ub, std::vector<std::shared_ptr<const Snapshot>>& snapshots) const; ``` To prevent the number of timestamped snapshots from growing infinitely, we provide the following API to release timestamped snapshots whose timestamps are older than or equal to a given threshold. ```cpp void TransactionDB::ReleaseTimestampedSnapshotsOlderThan(TxnTimestamp ts); ``` Before shutdown, RocksDB will release all timestamped snapshots. Comparison with user-defined timestamp and how they can be combined: User-defined timestamp persists every key with a timestamp, while timestamped snapshots maintain a volatile mapping between snapshots (sequence numbers) and timestamps. Different internal keys with the same user key but different timestamps will be treated as different by compaction, thus a newer version will not hide older versions (with smaller timestamps) unless they are eligible for garbage collection. In contrast, taking a timestamped snapshot at a certain sequence number and timestamp prevents all the keys visible in this snapshot from been dropped by compaction. Here, visible means (seq < snapshot and most recent). The timestamped snapshot supports the semantics of reading at an exact point in time. Timestamped snapshots can also be used with user-defined timestamp. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9879 Test Plan: ``` make check TEST_TMPDIR=/dev/shm make crash_test_with_txn ``` Reviewed By: siying Differential Revision: D35783919 Pulled By: riversand963 fbshipit-source-id: 586ad905e169189e19d3bfc0cb0177a7239d1bd4	3 years ago
Yu Zhang	a101c9de60	Return "invalid argument" when read timestamp is too old (#10109 ) Summary: With this change, when a given read timestamp is smaller than the column-family's full_history_ts_low, Get(), MultiGet() and iterators APIs will return Status::InValidArgument(). Test plan ``` $COMPILE_WITH_ASAN=1 make -j24 all $./db_with_timestamp_basic_test --gtest_filter=DBBasicTestWithTimestamp.UpdateFullHistoryTsLow $ make -j24 check ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10109 Reviewed By: riversand963 Differential Revision: D36901126 Pulled By: jowlyzhang fbshipit-source-id: 255feb1a66195351f06c1d0e42acb1ff74527f86	3 years ago
Yanqin Jin	3e02c6e05a	Point-lookup returns timestamps of Delete and SingleDelete (#10056 ) Summary: If caller specifies a non-null `timestamp` argument in `DB::Get()` or a non-null `timestamps` in `DB::MultiGet()`, RocksDB will return the timestamps of the point tombstones. Note: DeleteRange is still unsupported. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10056 Test Plan: make check Reviewed By: ltamasi Differential Revision: D36677956 Pulled By: riversand963 fbshipit-source-id: 2d7af02cc7237b1829cd269086ea895a49d501ae	3 years ago
Jay Zhuang	0adac6f88e	Deflake Transaction stress tests (#10063 ) Summary: TSAN test is slower, for `TransactionStressTest` and `DeadlockStress`, they're reaching the timeout limit of 600 seconds. Decreasing the transaction test number. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10063 Test Plan: CI Reviewed By: ajkr Differential Revision: D36711727 Pulled By: jay-zhuang fbshipit-source-id: 600f82a6d32108f52fbe5572fcc7497607b7fe98	3 years ago
XieJiSS	8b1df101da	fix: build on risc-v (#9215 ) Summary: Patch is modified from ~~https://reviews.llvm.org/file/data/du5ol5zctyqw53ma7dwz/PHID-FILE-knherxziu4tl4erti5ab/file~~ Tested on Arch Linux riscv64gc (qemu) UPDATE: Seems like the above link is broken, so I tried to search for a link pointing to the original merge request. It turned out to me that the LLVM guys are cherry-picking from `google/benchmark`, and the upstream should be this: `808571a52f/src/cycleclock.h (L190)` Pull Request resolved: https://github.com/facebook/rocksdb/pull/9215 Reviewed By: siying, jay-zhuang Differential Revision: D34170586 Pulled By: riversand963 fbshipit-source-id: 41b16b9f7f3bb0f3e7b26bb078eb575499c0f0f4	3 years ago
mrambacher	b11ff347b4	Use STATIC_AVOID_DESTRUCTION for static objects with non-trivial destructors (#9958 ) Summary: Changed the static objects that had non-trivial destructors to use the STATIC_AVOID_DESTRUCTION construct. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9958 Reviewed By: pdillinger Differential Revision: D36442982 Pulled By: mrambacher fbshipit-source-id: 029d47b1374d30d198bfede369a4c0ae7a4eb519	3 years ago
sdong	736a7b5433	Remove own ToString() (#9955 ) Summary: ToString() is created as some platform doesn't support std::to_string(). However, we've already used std::to_string() by mistake for 16 months (in db/db_info_dumper.cc). This commit just remove ToString(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/9955 Test Plan: Watch CI tests Reviewed By: riversand963 Differential Revision: D36176799 fbshipit-source-id: bdb6dcd0e3a3ab96a1ac810f5d0188f684064471	4 years ago
Yanqin Jin	2b5df21e95	Remove ifdef for try_emplace after upgrading to c++17 (#9932 ) Summary: Test plan make check Pull Request resolved: https://github.com/facebook/rocksdb/pull/9932 Reviewed By: ajkr Differential Revision: D36085404 Pulled By: riversand963 fbshipit-source-id: 2ece14ca0e2e4c1288339ff79e7e126b76eaf786	4 years ago
Yanqin Jin	2b5c29f9f3	Enforce the contract of SingleDelete (#9888 ) Summary: Enforce the contract of SingleDelete so that they are not mixed with Delete for the same key. Otherwise, it will lead to undefined behavior. See https://github.com/facebook/rocksdb/wiki/Single-Delete#notes. Also fix unit tests and write-unprepared. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9888 Test Plan: make check Reviewed By: ajkr Differential Revision: D35837817 Pulled By: riversand963 fbshipit-source-id: acd06e4dcba8cb18df92b44ed18c57e10e5a7635	4 years ago
Anvesh Komuravelli	aafb377bb5	Update protection info on recovered logs data (#9875 ) Summary: Update protection info on recovered logs data Pull Request resolved: https://github.com/facebook/rocksdb/pull/9875 Test Plan: - Benchmark setup: `TEST_TMPDIR=/dev/shm/100MB_WAL_DB/ ./db_bench -benchmarks=fillrandom -write_buffer_size=1048576000` - Benchmark command: `TEST_TMPDIR=/dev/shm/100MB_WAL_DB/ /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=overwrite -write_buffer_size=1048576000 -writes=1 -report_open_timing=true` - Results before this PR ``` OpenDb: 2350.14 milliseconds OpenDb: 2296.94 milliseconds OpenDb: 2184.29 milliseconds OpenDb: 2167.59 milliseconds OpenDb: 2231.24 milliseconds OpenDb: 2109.57 milliseconds OpenDb: 2197.71 milliseconds OpenDb: 2120.8 milliseconds OpenDb: 2148.12 milliseconds OpenDb: 2207.95 milliseconds ``` - Results after this PR ``` OpenDb: 2424.52 milliseconds OpenDb: 2359.84 milliseconds OpenDb: 2317.68 milliseconds OpenDb: 2339.4 milliseconds OpenDb: 2325.36 milliseconds OpenDb: 2321.06 milliseconds OpenDb: 2353.98 milliseconds OpenDb: 2344.64 milliseconds OpenDb: 2384.09 milliseconds OpenDb: 2428.58 milliseconds ``` Mean regressed 7.2% (2201.4 -> 2359.9) Reviewed By: ajkr Differential Revision: D36012787 Pulled By: akomurav fbshipit-source-id: d2aba09f29c6beb2fd0fe8e1e359be910b4ef02a	4 years ago
Yanqin Jin	94e245a14d	Improve stress test for MultiOpsTxnsStressTest (#9829 ) Summary: Adds more coverage to `MultiOpsTxnsStressTest` with a focus on write-prepared transactions. 1. Add a hack to manually evict commit cache entries. We currently cannot assign small values to `wp_commit_cache_bits` because it requires a prepared transaction to commit within a certain range of sequence numbers, otherwise it will throw. 2. Add coverage for commit-time-write-batch. If write policy is write-prepared, we need to set `use_only_the_last_commit_time_batch_for_recovery` to true. 3. After each flush/compaction, verify data consistency. This is possible since data size can be small: default numbers of primary/secondary keys are just 1000. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9829 Test Plan: ``` TEST_TMPDIR=/dev/shm/rocksdb_crashtest_blackbox/ make blackbox_crash_test_with_multiops_wp_txn ``` Reviewed By: pdillinger Differential Revision: D35806678 Pulled By: riversand963 fbshipit-source-id: d7fde7a29fda0fb481a61f553e0ca0c47da93616	4 years ago
Herman Lee	d9d456de49	Fix locktree accesses to PessimisticTransactions (#9898 ) Summary: The current locktree implementation stores the address of the PessimisticTransactions object as the TXNID. However, when a transaction is blocked on a lock, it records the list of waitees with conflicting locks using the rocksdb assigned TransactionID. This is performed by calling GetID() on PessimisticTransactions objects of the waitees, and then recorded in the waiter's list. However, there is no guarantee the objects are valid when recording the waitee list during the conflict callbacks because the waitee could have released the lock and freed the PessimisticTransactions object. The waitee/txnid values are only valid PessimisticTransaction objects while the mutex for the root of the locktree is held. The simplest fix for this problem is to use the address of the PessimisticTransaction as the TransactionID so that it is consistent with its usage in the locktree. The TXNID is only converted back to a PessimisticTransaction for the report_wait callbacks. Since these callbacks are now all made within the critical section where the lock_request queue mutx is held, these conversions will be safe. Otherwise, only the uint64_t TXNID of the waitee is registerd with the waiter transaction. The PessimisitcTransaction object of the waitee is never referenced. The main downside of this approach is the TransactionID will not change if the PessimisticTransaction object is reused for new transactions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9898 Test Plan: Add a new test case and run unit tests. Also verified with MyRocks workloads using range locks that the crash no longer happens. Reviewed By: riversand963 Differential Revision: D35950376 Pulled By: hermanlee fbshipit-source-id: 8c9cae272e23e487fc139b6a8ed5b8f8f24b1570	4 years ago
Yanqin Jin	d13825e586	Add rollback_deletion_type_callback to TxnDBOptions (#9873 ) Summary: This PR does not affect write-committed. Add a member, `rollback_deletion_type_callback` to TransactionDBOptions so that a write-prepared transaction, when rolling back, can call this callback to decide if a `Delete` or `SingleDelete` should be used to cancel a prior `Put` written to the database during prepare phase. The purpose of this PR is to prevent mixing `Delete` and `SingleDelete` for the same key, causing undefined behaviors. Without this PR, the following can happen: ``` // The application always issues SingleDelete when deleting keys. txn1->Put('a'); txn1->Prepare(); // writes to memtable and potentially gets flushed/compacted to Lmax txn1->Rollback(); // inserts DELETE('a') txn2->Put('a'); txn2->Commit(); // writes to memtable and potentially gets flushed/compacted ``` In the database, we may have ``` L0: [PUT('a', s=100)] L1: [DELETE('a', s=90)] Lmax: [PUT('a', s=0)] ``` If a compaction compacts L0 and L1, then we have ``` L1: [PUT('a', s=100)] Lmax: [PUT('a', s=0)] ``` If a future transaction issues a SingleDelete, we have ``` L0: [SD('a', s=110)] L1: [PUT('a', s=100)] Lmax: [PUT('a', s=0)] ``` Then, a compaction including L0, L1 and Lmax leads to ``` Lmax: [PUT('a', s=0)] ``` which is incorrect. Similar bugs reported and addressed in https://github.com/cockroachdb/pebble/issues/1255. Based on our team's current priority, we have decided to take this approach for now. We may come back and revisit in the future. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9873 Test Plan: make check Reviewed By: ltamasi Differential Revision: D35762170 Pulled By: riversand963 fbshipit-source-id: b28d56eefc786b53c9844b9ef4a7807acdd82c8d	4 years ago
Peter Dillinger	efd035164b	Meta-internal folly integration with F14FastMap (#9546 ) Summary: Especially after updating to C++17, I don't see a compelling case for requiring any folly components in RocksDB. I was able to purge the existing hard dependencies, and it can be quite difficult to strip out non-trivial components from folly for use in RocksDB. (The prospect of doing that on F14 has changed my mind on the best approach here.) But this change creates an optional integration where we can plug in components from folly at compile time, starting here with F14FastMap to replace std::unordered_map when possible (probably no public APIs for example). I have replaced the biggest CPU users of std::unordered_map with compile-time pluggable UnorderedMap which will use F14FastMap when USE_FOLLY is set. USE_FOLLY is always set in the Meta-internal buck build, and a simulation of that is in the Makefile for public CI testing. A full folly build is not needed, but checking out the full folly repo is much simpler for getting the dependency, and anything else we might want to optionally integrate in the future. Some picky details: * I don't think the distributed mutex stuff is actually used, so it was easy to remove. * I implemented an alternative to `folly::constexpr_log2` (which is much easier in C++17 than C++11) so that I could pull out the hard dependencies on `ConstexprMath.h` * I had to add noexcept move constructors/operators to some types to make F14's complainUnlessNothrowMoveAndDestroy check happy, and I added a macro to make that easier in some common cases. * Updated Meta-internal buck build to use folly F14Map (always) No updates to HISTORY.md nor INSTALL.md as this is not (yet?) considered a production integration for open source users. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9546 Test Plan: CircleCI tests updated so that a couple of them use folly. Most internal unit & stress/crash tests updated to use Meta-internal latest folly. (Note: they should probably use buck but they currently use Makefile.) Example performance improvement: when filter partitions are pinned in cache, they are tracked by PartitionedFilterBlockReader::filter_map_ and we can build a test that exercises that heavily. Build DB with ``` TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters ``` and test with (simultaneous runs with & without folly, ~20 times each to see convergence) ``` TEST_TMPDIR=/dev/shm/rocksdb ./db_bench_folly -readonly -use_existing_db -benchmarks=readrandom -num=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -partition_index_and_filters -duration=40 -pin_l0_filter_and_index_blocks_in_cache ``` Average ops/s no folly: 26229.2 Average ops/s with folly: 26853.3 (+2.4%) Reviewed By: ajkr Differential Revision: D34181736 Pulled By: pdillinger fbshipit-source-id: ffa6ad5104c2880321d8a1aa7187e00ab0d02e94	4 years ago
Yanqin Jin	1a1c5bda23	Disallow commit-time-batch for write-prepared/write-unprepared txn conditionally (#9794 ) Summary: For write-prepared/write-unprepared transactions, GetCommitTimeWriteBatch() can be used only if the transaction is started with `TransactionOptions::use_only_the_last_commit_time_batch_for_recovery` set to true. Otherwise, it is possible that multiple uncommitted versions of the same key exist in the database. During bottommost compaction, RocksDB may set the sequence numbers of both to zero once they become committed, causing output SST file to have two identical internal keys. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9794 Test Plan: make check pay special attention to the following ``` transaction_test --gtest_filter=MySQLStyleTransactionTest/MySQLStyleTransactionTest.TransactionStressTest/* ``` Reviewed By: lth Differential Revision: D35327214 Pulled By: riversand963 fbshipit-source-id: 3bae00a28359c10e96e4c6f676d20de5610d8a0f	4 years ago
Yanqin Jin	565fcead22	Fix clang-analyze by adding assertion (#9682 ) Summary: Clang-analyze complains about potential nullptr dereference. Fix by adding an assertion to make clang happy. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9682 Test Plan: USE_CLANG=1 make -j20 analyze_incremental Reviewed By: ltamasi Differential Revision: D34755210 Pulled By: riversand963 fbshipit-source-id: 948e1899846ee1aa05a1b500a11ff43b0b412e0a	4 years ago
Yanqin Jin	3b6dc049f7	Support user-defined timestamps in write-committed txns (#9629 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9629 Pessimistic transactions use pessimistic concurrency control, i.e. locking. Keys are locked upon first operation that writes the key or has the intention of writing. For example, `PessimisticTransaction::Put()`, `PessimisticTransaction::Delete()`, `PessimisticTransaction::SingleDelete()` will write to or delete a key, while `PessimisticTransaction::GetForUpdate()` is used by application to indicate to RocksDB that the transaction has the intention of performing write operation later in the same transaction. Pessimistic transactions support two-phase commit (2PC). A transaction can be `Prepared()`'ed and then `Commit()`. The prepare phase is similar to a promise: once `Prepare()` succeeds, the transaction has acquired the necessary resources to commit. The resources include locks, persistence of WAL, etc. Write-committed transaction is the default pessimistic transaction implementation. In RocksDB write-committed transaction, `Prepare()` will write data to the WAL as a prepare section. `Commit()` will write a commit marker to the WAL and then write data to the memtables. While writing to the memtables, different keys in the transaction's write batch will be assigned different sequence numbers in ascending order. Until commit/rollback, the transaction holds locks on the keys so that no other transaction can write to the same keys. Furthermore, the keys' sequence numbers represent the order in which they are committed and should be made visible. This is convenient for us to implement support for user-defined timestamps. Since column families with and without timestamps can co-exist in the same database, a transaction may or may not involve timestamps. Based on this observation, we add two optional members to each `PessimisticTransaction`, `read_timestamp_` and `commit_timestamp_`. If no key in the transaction's write batch has timestamp, then setting these two variables do not have any effect. For the rest of this commit, we discuss only the cases when these two variables are meaningful. read_timestamp_ is used mainly for validation, and should be set before first call to `GetForUpdate()`. Otherwise, the latter will return non-ok status. `GetForUpdate()` calls `TryLock()` that can verify if another transaction has written the same key since `read_timestamp_` till this call to `GetForUpdate()`. If another transaction has indeed written the same key, then validation fails, and RocksDB allows this transaction to refine `read_timestamp_` by increasing it. Note that a transaction can still use `Get()` with a different timestamp to read, but the result of the read should not be used to determine data that will be written later. commit_timestamp_ must be set after finishing writing and before transaction commit. This applies to both 2PC and non-2PC cases. In the case of 2PC, it's usually set after prepare phase succeeds. We currently require that the commit timestamp be chosen after all keys are locked. This means we disallow the `TransactionDB`-level APIs if user-defined timestamp is used by the transaction. Specifically, calling `PessimisticTransactionDB::Put()`, `PessimisticTransactionDB::Delete()`, `PessimisticTransactionDB::SingleDelete()`, etc. will return non-ok status because they specify timestamps before locking the keys. Users are also prompted to use the `Transaction` APIs when they receive the non-ok status. Reviewed By: ltamasi Differential Revision: D31822445 fbshipit-source-id: b82abf8e230216dc89cc519564a588224a88fd43	4 years ago
Yanqin Jin	6f12599863	Support WBWI for keys having timestamps (#9603 ) Summary: This PR supports inserting keys to a `WriteBatchWithIndex` for column families that enable user-defined timestamps and reading the keys back. The index does not have timestamps. Writing a key to WBWI is unchanged, because the underlying WriteBatch already supports it. When reading the keys back, we need to make sure to distinguish between keys with and without timestamps before comparison. When user calls `GetFromBatchAndDB()`, no timestamp is needed to query the batch, but a timestamp has to be provided to query the db. The assumption is that data in the batch must be newer than data from the db. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9603 Test Plan: make check Reviewed By: ltamasi Differential Revision: D34354849 Pulled By: riversand963 fbshipit-source-id: d25d1f84e2240ce543e521fa30595082fb8db9a0	4 years ago
Yanqin Jin	1cda273dc3	Fix a silent data loss for write-committed txn (#9571 ) Summary: The following sequence of events can cause silent data loss for write-committed transactions. ``` Time thread 1 bg flush \| db->Put("a") \| txn = NewTxn() \| txn->Put("b", "v") \| txn->Prepare() // writes only to 5.log \| db->SwitchMemtable() // memtable 1 has "a" \| // close 5.log, \| // creates 8.log \| trigger flush \| pick memtable 1 \| unlock db mutex \| write new sst \| txn->ctwb->Put("gtid", "1") // writes 8.log \| txn->Commit() // writes to 8.log \| // writes to memtable 2 \| compute min_log_number_to_keep_2pc, this \| will be 8 (incorrect). \| \| Purge obsolete wals, including 5.log \| V ``` At this point, writes of txn exists only in memtable. Close db without flush because db thinks the data in memtable are backed by log. Then reopen, the writes are lost except key-value pair {"gtid"->"1"}, only the commit marker of txn is in 8.log The reason lies in `PrecomputeMinLogNumberToKeep2PC()` which calls `FindMinPrepLogReferencedByMemTable()`. In the above example, when bg flush thread tries to find obsolete wals, it uses the information computed by `PrecomputeMinLogNumberToKeep2PC()`. The return value of `PrecomputeMinLogNumberToKeep2PC()` depends on three components - `PrecomputeMinLogNumberToKeepNon2PC()`. This represents the WAL that has unflushed data. As the name of this method suggests, it does not account for 2PC. Although the keys reside in the prepare section of a previous WAL, the column family references the current WAL when they are actually inserted into the memtable during txn commit. - `prep_tracker->FindMinLogContainingOutstandingPrep()`. This represents the WAL with a prepare section but the txn hasn't committed. - `FindMinPrepLogReferencedByMemTable()`. This represents the WAL on which some memtables (mutable and immutable) depend for their unflushed data. The bug lies in `FindMinPrepLogReferencedByMemTable()`. Originally, this function skips checking the column families that are being flushed, but the unit test added in this PR shows that they should not be. In this unit test, there is only the default column family, and one of its memtables has unflushed data backed by a prepare section in 5.log. We should return this information via `FindMinPrepLogReferencedByMemTable()`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9571 Test Plan: ``` ./transaction_test --gtest_filter=/TransactionTest.SwitchMemtableDuringPrepareAndCommit_WC/ make check ``` Reviewed By: siying Differential Revision: D34235236 Pulled By: riversand963 fbshipit-source-id: 120eb21a666728a38dda77b96276c6af72b008b1	4 years ago
Yanqin Jin	241b5aa15a	Timestamp-based validation for pessimistic txn (#9562 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9562 With per-transaction `read_timestamp_`, it is possible to perform transaction validation after locking a key in addition to sequence-based validation. Specifically, if a transaction has a read_timestamp, then we perform timestamp-based validation as well after the key is locked via `GetForUpdate()`. This is to make sure that no other transaction has modified the key and committed successfully since the read timestamp (but before the locking operation) which represents a consistent view of the database. Reviewed By: ltamasi Differential Revision: D31822034 fbshipit-source-id: c6f1828b7fc23e4f85e2d1ed73ff51464a058d91	4 years ago
Yanqin Jin	d6e1e6f37a	Add commit_timestamp and read_timestamp to Pessimistic transaction (#9537 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9537 Add `Transaction::SetReadTimestampForValidation()` and `Transaction::SetCommitTimestamp()` APIs with default implementation returning `Status::NotSupported()`. Currently, calling these two APIs do not have any effect. Also add checks to `PessimisticTransactionDB` to enforce that column families in the same db either - disable user-defined timestamp - enable 64-bit timestamp Just to clarify, a `PessimisticTransactionDB` can have some column families without timestamps as well as column families that enable timestamp. Each `PessimisticTransaction` can have two optional timestamps, `read_timestamp_` used for additional validation and `commit_timestamp_` which denotes when the transaction commits. For now, we are going to support `WriteCommittedTxn` (in a series of subsequent PRs) Once set, we do not allow decreasing `read_timestamp_`. The `commit_timestamp_` must be greater than `read_timestamp_` for each transaction and must be set before commit, unless the transaction does not involve any column family that enables user-defined timestamp. TransactionDB builds on top of RocksDB core `DB` layer. Though `DB` layer assumes that user-defined timestamps are byte arrays, `TransactionDB` uses uint64_t to store timestamps. When they are passed down, they are still interpreted as byte-arrays by `DB`. Reviewed By: ltamasi Differential Revision: D31567959 fbshipit-source-id: b0b6b69acab5d8e340cf174f33e8b09f1c3d3502	4 years ago
Yanqin Jin	3122cb4358	Revise APIs related to user-defined timestamp (#8946 ) Summary: ajkr reminded me that we have a rule of not including per-kv related data in `WriteOptions`. Namely, `WriteOptions` should not include information about "what-to-write", but should just include information about "how-to-write". According to this rule, `WriteOptions::timestamp` (experimental) is clearly a violation. Therefore, this PR removes `WriteOptions::timestamp` for compliance. After the removal, we need to pass timestamp info via another set of APIs. This PR proposes a set of overloaded functions `Put(write_opts, key, value, ts)`, `Delete(write_opts, key, ts)`, and `SingleDelete(write_opts, key, ts)`. Planned to add `Write(write_opts, batch, ts)`, but its complexity made me reconsider doing it in another PR (maybe). For better checking and returning error early, we also add a new set of APIs to `WriteBatch` that take extra `timestamp` information when writing to `WriteBatch`es. These set of APIs in `WriteBatchWithIndex` are currently not supported, and are on our TODO list. Removed `WriteBatch::AssignTimestamps()` and renamed `WriteBatch::AssignTimestamp()` to `WriteBatch::UpdateTimestamps()` since this method require that all keys have space for timestamps allocated already and multiple timestamps can be updated. The constructor of `WriteBatch` now takes a fourth argument `default_cf_ts_sz` which is the timestamp size of the default column family. This will be used to allocate space when calling APIs that do not specify a column family handle. Also, updated `DB::Get()`, `DB::MultiGet()`, `DB::NewIterator()`, `DB::NewIterators()` methods, replacing some assertions about timestamp to returning Status code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8946 Test Plan: make check ./db_bench -benchmarks=fillseq,fillrandom,readrandom,readseq,deleterandom -user_timestamp_size=8 ./db_stress --user_timestamp_size=8 -nooverwritepercent=0 -test_secondary=0 -secondary_catch_up_one_in=0 -continuous_verification_interval=0 Make sure there is no perf regression by running the following ``` ./db_bench_opt -db=/dev/shm/rocksdb -use_existing_db=0 -level0_stop_writes_trigger=256 -level0_slowdown_writes_trigger=256 -level0_file_num_compaction_trigger=256 -disable_wal=1 -duration=10 -benchmarks=fillrandom ``` Before this PR ``` DB path: [/dev/shm/rocksdb] fillrandom : 1.831 micros/op 546235 ops/sec; 60.4 MB/s ``` After this PR ``` DB path: [/dev/shm/rocksdb] fillrandom : 1.820 micros/op 549404 ops/sec; 60.8 MB/s ``` Reviewed By: ltamasi Differential Revision: D33721359 Pulled By: riversand963 fbshipit-source-id: c131561534272c120ffb80711d42748d21badf09	4 years ago
Sergei Petrunia	c9042db619	Range Locking: add support for escalation barriers (#9290 ) Summary: Range Locking supports Lock Escalation. Lock Escalation is invoked when lock memory is nearly exhausted and it reduced the amount of memory used by joining adjacent locks. Bridging the gap between certain locks has adverse effects. For example, in MyRocks it is not a good idea to bridge the gap between locks in different indexes, as that get the lock to cover large portions of indexes, or even entire indexes. Resolve this by introducing Escalation Barrier. The escalation process will call the user-provided barrier callback function: bool(const Endpoint& a, const Endpoint& b) If the function returns true, there's a barrier between a and b and Lock Escalation will not try to bridge the gap between a and b. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9290 Reviewed By: akankshamahajan15 Differential Revision: D33486753 Pulled By: riversand963 fbshipit-source-id: f97910b67aba0579ea1d35f523ca6863d3dd018e	4 years ago
Yanqin Jin	0376869f05	Remove using namespace (#9369 ) Summary: As title. This is part of an fb-internal task. First, remove all `using namespace` statements if applicable. Next, utilize multiple build platforms and see if anything is broken. Should anything become broken, fix the compilation errors with as little extra change as possible. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9369 Test Plan: internal build and make check make clean && make static_lib && cd examples && make all Reviewed By: pdillinger Differential Revision: D33517260 Pulled By: riversand963 fbshipit-source-id: 3fc4ce6402a073421dfd9a9b2d1c79441dca7a40	4 years ago
Sergei Petrunia	1b076e82db	Expose locktree's wait count in RangeLockManagerHandle::Counters (#9289 ) Summary: locktree is a module providing Range Locking. It has a counter for the number of times a lock acquisition request was blocked by an existing conflicting lock and had to wait for it to be released. Expose this counter in RangeLockManagerHandle::Counters::lock_wait_count. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9289 Reviewed By: jay-zhuang Differential Revision: D33079182 Pulled By: riversand963 fbshipit-source-id: 25b1a362d9da247536ab5007bd15900b319f139e	4 years ago
Hui Xiao	cd85439632	Make TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAcces less flaky (#9281 ) Summary: Context: [Rapid thread creation and deletion](https://github.com/facebook/rocksdb/blob/6.27.fb/utilities/transactions/write_prepared_transaction_test.cc#L439-L444) in `SnapshotConcurrentAccessTest.SnapshotConcurrentAcces` inside a [potentially big loop](https://github.com/facebook/rocksdb/blob/6.27.fb/utilities/transactions/write_prepared_transaction_test.cc#L1238-L1248) can lead to heavy-loading the system with many threads due to delay in actually cleaning up thread's resource in the kernel sometime. We ran into some [flaky failure](https://app.circleci.com/pipelines/github/facebook/rocksdb/10383/workflows/136f1005-80a9-4515-aee9-fe36ac6462a1/jobs/253289) in CI and reproduced it by below: - Command ``` Added `ROCKSDB_NAMESPACE::port::InstallStackTraceHandler();` like https://github.com/facebook/rocksdb/pull/9276 DEBUG_LEVEL=2 make -j56 write_prepared_transaction_test GTEST_CATCH_EXCEPTIONS=0 ~/gtest-parallel/gtest-parallel -r 200 -w 200 ./write_prepared_transaction_test --gtest_filter=TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1 ``` - Stack, where `write_prepared_transaction_test.cc:442` in `https://github.com/facebook/rocksdb/issues/9` points to thread creation ``` [ RUN ] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1 ....terminate called after throwing an instance of 'std::system_error' what(): Resource temporarily unavailable Received signal 6 (Aborted) #0 /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x38) [0x7fc114f39438] ... https://github.com/facebook/rocksdb/issues/7 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8e73) [0x7fc1158a5e73] ?? ??:0 https://github.com/facebook/rocksdb/issues/8 ./write_prepared_transaction_test() [0x4ca86c] std:🧵:thread<rocksdb::WritePreparedTransactionTestBase::SnapshotConcurrentAccessTestInternal(rocksdb::WritePreparedTxnDB, std::vector<unsigned long, std::allocator<unsigned long> > const&, std::vector<unsigned long, std::allocator<unsigned long> const&, rocksdb::WritePreparedTxnDB::CommitEntry&, unsigned long&, unsigned long, unsigned long, unsigned long, unsigned long)::{lambda()https://github.com/facebook/rocksdb/issues/1}>(rocksdb::WritePreparedTransactionTestBase::SnapshotConcurrentAccessTestInternal(rocksdb::WritePreparedTxnDB, s d::vector<unsigned long, std::allocator<unsigned long> > const&, std::vector<unsigned long, std::allocator<unsigned long> > const&, rocksdb::WritePreparedTxnDB::CommitEntry&, unsigned long&, unsigned long, unsigned long, unsigned long, unsigned long)::{l mbda()https://github.com/facebook/rocksdb/issues/1}&&) /usr/include/c++/5/thread:137 (discriminator 4) https://github.com/facebook/rocksdb/issues/9 ./write_prepared_transaction_test() [0x4bb80c] rocksdb::WritePreparedTransactionTestBase::SnapshotConcurrentAccessTestInternal(rocksdb::WritePreparedTxnDB*, std::vector<unsigned long, std::allocator<unsigned long> > const&, std::vector<unsigned long, std::allocator<unsigned long> > const&, rocksdb::W itePreparedTxnDB::CommitEntry&, unsigned long&, unsigned long, unsigned long, unsigned long, unsigned long) /home/circleci/project/utilities/transactions/write_prepared_transaction_test.cc:442 https://github.com/facebook/rocksdb/issues/10 ./write_prepared_transaction_test() [0x4407b6] rocksdb::SnapshotConcurrentAccessTest_SnapshotConcurrentAccess_Test::TestBody() /home/circleci/project/utilities/transactions/write_prepared_transaction_test.cc:1244 ... [109/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1 returned/aborted with exit code -6 (34462 ms) ``` - Move thread 2's work into current thread to avoid half of the thread creation cuz there is no difference in doing so. We expect this can make the thread-creation error less often, even though we can't gurantee it from happening again. Considering this is a trivial change with positive impact, it's still worth landing and monitor if it's enough to solve the problem in reality. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9281 Test Plan: Before the change, repeating the test 200 times with 200 workers failed `~/gtest-parallel/gtest-parallel -r 200 -w 200 ./write_prepared_transaction_test --gtest_filter=TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1` ``` [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from TwoWriteQueues/SnapshotConcurrentAccessTest [ RUN ] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1 ..unknown file: Failure C++ exception with description "Resource temporarily unavailable" thrown in the test body. [ FAILED ] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1, where GetParam() = (false, true, 1, 0, 1, 20) (11882 ms) [----------] 1 test from TwoWriteQueues/SnapshotConcurrentAccessTest (11882 ms total) [----------] Global test environment tear-down [==========] 1 test from 1 test case ran. (11882 ms total) [ PASSED ] 0 tests. [ FAILED ] 1 test, listed below: [ FAILED ] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1, where GetParam() = (false, true, 1, 0, 1, 20) ``` After the change: repeating the test 200 times with 200 workers didn't fail, even with repeating the "repeating" for 10 times like below `for i in {1..10}; do ~/gtest-parallel/gtest-parallel -r 200 -w 200 ./write_prepared_transaction_test --gtest_filter=TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1; done` ``` [200/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1 [200/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1 [200/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1 [200/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1 [200/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1 [200/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1 [200/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1 [200/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1 [200/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1 [200/200] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1 ``` It does failed when repeating the test 400 times with 400 workers `~/project$ ~/gtest-parallel/gtest-parallel -r 400 -w 400 ./write_prepared_transaction_test --gtest_filter=TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1` ``` [1/400] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1 (2928 ms) Note: Google Test filter = TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1 [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from TwoWriteQueues/SnapshotConcurrentAccessTest [ RUN ] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1 unknown file: Failure C++ exception with description "std::bad_alloc" thrown in the test body. [ FAILED ] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/1, where GetParam() = (false, true, 1, 0, 1, 20) (2597 ms) [----------] 1 test from TwoWriteQueues/SnapshotConcurrentAccessTest (2597 ms total) ``` Reviewed By: ajkr Differential Revision: D33026776 Pulled By: hx235 fbshipit-source-id: 509f57126392821e835e48396e5bf224f4f5dcac	4 years ago
Yanqin Jin	bd513fd075	Add commit marker with timestamp (#9266 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9266 This diff adds a new tag `CommitWithTimestamp`. Currently, there is no API to trigger writing this tag to WAL, thus it is unavailable to users. This is an ongoing effort to add user-defined timestamp support to write-committed transactions. This diff also indicates all column families that may potentially participate in the same transaction must either disable timestamp or have the same timestamp format, since `CommitWithTimestamp` tag is followed by a single byte-array denoting the commit timestamp of the transaction. We will enforce this checking in a future diff. We keep this diff small. Reviewed By: ltamasi Differential Revision: D31721350 fbshipit-source-id: e1450811443647feb6ca01adec4c8aaae270ffc6	4 years ago
Peter Dillinger	aec95b8c09	Debug "Resource temporarily unavailable" exception in CircleCI (#9276 ) Summary: This changes write_prepared_transaction_test under CircleCI to print a stack trace on unhandled exception, so that we can debug rare exceptions seen in CircleCI: [ RUN ] TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccess/24 .......unknown file: Failure C++ exception with description "Resource temporarily unavailable" thrown in the test body. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9276 Test Plan: manual run test with seeded 'throw', with and without CIRCLECI=true environment variable Reviewed By: ajkr, hx235 Differential Revision: D32996993 Pulled By: pdillinger fbshipit-source-id: e790408ce204b676d3d84a290e41be511b203bfa	4 years ago
lgqss	77c7085594	MemTableList::TrimHistory now use allocated bytes (#9020 ) Summary: Fix a bug when both max_write_buffer_size_to_maintain and max_write_buffer_number_to_maintain are 0. The bug was introduced in 6.5.0 and https://github.com/facebook/rocksdb/issues/5022. Fix https://github.com/facebook/rocksdb/issues/8371 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9020 Reviewed By: pdillinger Differential Revision: D32767084 Pulled By: ajkr fbshipit-source-id: c401ee6e2557230e892d0fe8abb4966cbd18e85f	4 years ago
Yanqin Jin	2035798834	Update TransactionUtil::CheckKeyForConflict to also use timestamps (#9162 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9162 Existing TransactionUtil::CheckKeyForConflict() performs only seq-based conflict checking. If user-defined timestamp is enabled, it should perform conflict checking based on timestamps too. Update TransactionUtil::CheckKey-related methods to verify the timestamp of the latest version of a key is smaller than the read timestamp. Note that CheckKeysForConflict() is not updated since it's used only by optimistic transaction, and we do not plan to update it in this upcoming batch of diffs. Existing GetLatestSequenceForKey() returns the sequence of the latest version of a specific user key. Since we support user-defined timestamp, we need to update this method to also return the timestamp (if enabled) of the latest version of the key. This will be needed for snapshot validation. Reviewed By: ltamasi Differential Revision: D31567960 fbshipit-source-id: 2e4a14aed267435a9aa91bc632d2411c01946d44	4 years ago
jsteemann	a7478070f3	Fix small issues (#5896 ) Summary: The individual commits in this PR should be self-explanatory. All small and _very_ low-priority changes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5896 Reviewed By: riversand963 Differential Revision: D18065108 Pulled By: mrambacher fbshipit-source-id: 236b1a1d9d21f982cc08aa67027108dde5eaf280	4 years ago
Yanqin Jin	5237b39d2e	Fix assertion error during compaction with write-prepared txn enabled (#9105 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9105 The user contract of SingleDelete is that: a SingleDelete can only be issued to a key that exists and has NOT been updated. For example, application can insert one key `key`, and uses a SingleDelete to delete it in the future. The `key` cannot be updated or removed using Delete. In reality, especially when write-prepared transaction is being used, things can get tricky. For example, a prepared transaction already writes `key` to the memtable after a successful Prepare(). Afterwards, should the transaction rollback, it will insert a Delete into the memtable to cancel out the prior Put. Consider the following sequence of operations. ``` // operation sequence 1 Begin txn Put(key) Prepare() Flush() Rollback txn Flush() ``` There will be two SSTs resulting from above. One of the contains a PUT, while the second one contains a Delete. It is also known that releasing a snapshot can lead to an L0 containing only a SD for a particular key. Consider the following operations following the above block. ``` // operation sequence 2 db->Put(key) db->SingleDelete(key) Flush() ``` The operation sequence 2 can result in an L0 with only the SD. Should there be a snapshot for conflict checking created before operation sequence 1, then an attempt to compact the db may hit the assertion failure below, because ikey_.type is Delete (from a rollback). ``` else if (clear_and_output_next_key_) { assert(ikey_.type == kTypeValue \|\| ikey_.type == kTypeBlobIndex); } ``` To fix the assertion failure, we can skip the SingleDelete if we detect an earlier Delete in the same snapshot interval. Reviewed By: ltamasi Differential Revision: D32056848 fbshipit-source-id: 23620a91e28562d91c45cf7e95f414b54b729748	4 years ago
Yanqin Jin	9b53f14a35	Fixed a bug in CompactionIterator when write-preared transaction is used (#9060 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9060 RocksDB bottommost level compaction may zero out an internal key's sequence if the key's sequence is in the earliest_snapshot. In write-prepared transaction, checking the visibility of a certain sequence in a specific released snapshot may return a "snapshot released" result. Therefore, it is possible, after a certain sequence of events, a PUT has its sequence zeroed out, but a subsequent SingleDelete of the same key will still be output with its original sequence. This violates the ascending order of keys and leads to incorrect result. The solution is to use an extra variable `last_key_seq_zeroed_` to track the information about visibility in earliest snapshot. With this variable, we can know for sure that a SingleDelete is in the earliest snapshot even if the said snapshot is released during compaction before processing the SD. Reviewed By: ltamasi Differential Revision: D31813016 fbshipit-source-id: d8cff59d6f34e0bdf282614034aaea99be9174e1	4 years ago
Yanqin Jin	fdf2a0d7eb	Fix a compaction bug for write-prepared txn (#9061 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9061 In write-prepared txn, checking a sequence's visibility in a released (old) snapshot may return "Snapshot released". Suppose we have two snapshots: ``` earliest_snap < earliest_write_conflict_snap ``` If we release `earliest_write_conflict_snap` but keep `earliest_snap` during bottommost level compaction, then it is possible that certain sequence of events can lead to a PUT being seq-zeroed followed by a SingleDelete of the same key. This violates the ascending order of keys, and will cause data inconsistency. Reviewed By: ltamasi Differential Revision: D31813017 fbshipit-source-id: dc68ba2541d1228489b93cf3edda5f37ed06f285	4 years ago
Jonathan Albrecht	e970248602	Add support for building on s390x platform (#8962 ) Summary: This PR adds support for building on s390x including updating travis CI. It uses the previous work in https://github.com/facebook/rocksdb/pull/6168 and adds some more changes to get all current tests (make check and jni tests) to pass. The tests were run with snappy, lz4, bzip2 and zstd all compiled in. There are a few pieces still needed to get the travis build working that I don't think I can do. adamretter is this something you could help with? 1. A prebuilt https://rocksdb-deps.s3-us-west-2.amazonaws.com/cmake/cmake-3.14.5-Linux-s390x.deb package 2. A https://hub.docker.com/r/evolvedbinary/rocksjava s390x image Not sure if there is more required for travis. Happy to help in any way I can. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8962 Reviewed By: mrambacher Differential Revision: D31802198 Pulled By: pdillinger fbshipit-source-id: 683511466fa6b505f85ba5a9964a268c6151f0c2	4 years ago
Yanqin Jin	e1139167ae	Inline an empty destructor (#9004 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9004 Inline an empty destructor Reviewed By: ltamasi Differential Revision: D31525561 fbshipit-source-id: 3b9e37f06b0c70529a5d2d660de21ea335c73611	4 years ago
Yanqin Jin	1a79839c59	Some code cleanup (#9003 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9003 cleanup some code before real work. Reviewed By: ltamasi Differential Revision: D31525563 fbshipit-source-id: 44558b3594f2200adc7d8621b08b06c77e358a27	4 years ago
mrambacher	13ae16c315	Cleanup includes in dbformat.h (#8930 ) Summary: This header file was including everything and the kitchen sink when it did not need to. This resulted in many places including this header when they needed other pieces instead. Cleaned up this header to only include what was needed and fixed up the remaining code to include what was now missing. Hopefully, this sort of code hygiene cleanup will speed up the builds... Pull Request resolved: https://github.com/facebook/rocksdb/pull/8930 Reviewed By: pdillinger Differential Revision: D31142788 Pulled By: mrambacher fbshipit-source-id: 6b45de3f300750c79f751f6227dece9cfd44085d	4 years ago
Levi Tamasi	be206db351	Deflake MySQLStyleTransactionTest.TransactionStressTest in "status checked" mode (#8947 ) Summary: There is a corner case when using WriteUnprepared transactions when `WriteUnpreparedTxn::Get` returns `Status::TryAgain` instead of propagating the result of `GetFromBatchAndDB`. The patch adds `PermitUncheckedError` to make the `ASSERT_STATUS_CHECKED` build pass in this case as well. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8947 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D31125422 Pulled By: ltamasi fbshipit-source-id: 42de51dcfa9384e032244c2b4d3f40e9a4111194	4 years ago
Peter Dillinger	4750421ece	Replace most typedef with using= (#8751 ) Summary: Old typedef syntax is confusing Most but not all changes with perl -pi -e 's/typedef (.*) ([a-zA-Z0-9_]+);/using $2 = $1;/g' list_of_files make format Pull Request resolved: https://github.com/facebook/rocksdb/pull/8751 Test Plan: existing Reviewed By: zhichao-cao Differential Revision: D30745277 Pulled By: pdillinger fbshipit-source-id: 6f65f0631c3563382d43347896020413cc2366d9	4 years ago
Yanqin Jin	2b367fa8cc	Fix bug caused by releasing snapshot(s) during compaction (#8608 ) Summary: In debug mode, we are seeing assertion failure as follows ``` db/compaction/compaction_iterator.cc:980: void rocksdb::CompactionIterator::PrepareOutput(): \ Assertion `ikey_.type != kTypeDeletion && ikey_.type != kTypeSingleDeletion' failed. ``` It is caused by releasing earliest snapshot during compaction between the execution of `NextFromInput()` and `PrepareOutput()`. In one case, as demonstrated in unit test `WritePreparedTransaction.ReleaseEarliestSnapshotDuringCompaction_WithSD2`, incorrect result may be returned by a following range scan if we disable assertion, as in opt compilation level: the SingleDelete marker's sequence number is zeroed out, but the preceding PUT is also outputted to the SST file after compaction. Due to the logic of DBIter, the PUT will not be skipped and will be returned by iterator in range scan. https://github.com/facebook/rocksdb/issues/8661 illustrates what happened. Fix by taking a more conservative approach: make compaction zero out sequence number only if key is in the earliest snapshot when the compaction starts. Another assertion failure is ``` Assertion `current_user_key_snapshot_ == last_snapshot' failed. ``` It's caused by releasing the snapshot between the PUT and SingleDelete during compaction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8608 Test Plan: make check Reviewed By: jay-zhuang Differential Revision: D30145645 Pulled By: riversand963 fbshipit-source-id: 699f58e66faf70732ad53810ccef43935d3bbe81	4 years ago
Yanqin Jin	066b51126d	Several simple local code clean-ups (#8565 ) Summary: This PR tries to remove some unnecessary checks as well as unreachable code blocks to improve readability. An obvious non-public API method naming typo is also corrected. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8565 Test Plan: make check Reviewed By: lth Differential Revision: D29963984 Pulled By: riversand963 fbshipit-source-id: cc96e8f09890e5cfe9b20eadb63bdca5484c150a	4 years ago
mrambacher	3aee4fbd41	Make EventListener into a Customizable Class (#8473 ) Summary: - Added Type/CreateFromString - Added ability to load EventListeners to DBOptions - Since EventListeners did not previously have a Name(), defaulted to "". If there is no name, the listener cannot be loaded from the ObjectRegistry. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8473 Reviewed By: zhichao-cao Differential Revision: D29901488 Pulled By: mrambacher fbshipit-source-id: 2d3a4aa6db1562ac03e7ad41b360e3521d486254	4 years ago
longlijian	4e4ec16957	Replace the namespace "rocksdb" to "ROCKSDB_NAMESPACE" (#8531 ) Summary: For more detail can reference the https://github.com/facebook/rocksdb/issues/6433 (https://github.com/facebook/rocksdb/pull/6433) Pull Request resolved: https://github.com/facebook/rocksdb/pull/8531 Reviewed By: siying Differential Revision: D29717057 Pulled By: ajkr fbshipit-source-id: 3ccad9501e5612590e54a7cf8c447118f323c7f4	4 years ago
Andrew Kryczka	ed8eb436db	Move slow valgrind tests behind -DROCKSDB_FULL_VALGRIND_RUN (#8475 ) Summary: Various tests had disabled valgrind due to it slowing down and timing out (as is the case right now) the CI runs. Where a test was disabled with no comment, I assumed slowness was the cause. For these tests that were slow under valgrind, as well as the ones identified in https://github.com/facebook/rocksdb/issues/8352, this PR moves them behind the compiler flag `-DROCKSDB_FULL_VALGRIND_RUN`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8475 Test Plan: running `make full_valgrind_test`, `make valgrind_test`, `make check`; will verify they appear working correctly Reviewed By: jay-zhuang Differential Revision: D29504843 Pulled By: ajkr fbshipit-source-id: 2aac90749cfbd30d5ce11cb29a07a1b9314eeea7	4 years ago
Jay Zhuang	f89423a57a	Revert "Revert "Snapshot release triggered compaction without multiple tombstones (#8357 )" (#8410 )" (#8438 ) Summary: This reverts commit `25be1ed66a`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8438 Test Plan: Run the impacted mysql test 40 times Reviewed By: ajkr Differential Revision: D29286247 Pulled By: jay-zhuang fbshipit-source-id: d3bd056971a19a8b012d5d0295fa045c012b3c04	4 years ago

1 2 3 4 5 ...

419 Commits (5d3aefb6821eef8816fdee74262702b1d9255f50)