Summary:
TransactionDB::Write can receive some optimization hints from the user. One is to skip the concurrency control mechanism. WritePreparedTxnDB is currently ignoring such hints. This patch optimizes WritePreparedTxnDB::Write for skip_concurrency_control and skip_duplicate_key_check hints.
Closes https://github.com/facebook/rocksdb/pull/3496
Differential Revision: D6971784
Pulled By: maysamyabandeh
fbshipit-source-id: cbab10ad538fa2b8bcb47e37c77724afe6e30f03
Summary:
WritePreparedTransactionTest has the UBSAN error because the wrong order of its parent class construction. Fix it.
Closes https://github.com/facebook/rocksdb/pull/3478
Differential Revision: D6928975
Pulled By: siying
fbshipit-source-id: 13edfd5cb9cf73f1ac5ae3b6f53061d32783733d
Summary:
Compared to DB::Write, TransactionDB::Write has the additional overhead of creating and initializing an internal transaction object, as well as the overhead of locking/unlocking the keys. This patch extends the TransactionDB::Write with an skip_cc option to allow the users to indicate that the write batch do not conflict with others and the concurrency control and its overhead can be skipped. TransactionDB::Write by default calls DB::Write when skip_cc is set, which works for WriteCommitted WritePolicy. Any other flavor of TransactionDB that is not compatible with this default behavior (such as WritePreparedTxnDB) can extend ::Write and implement their own approach for taking into account the skip_cc optimization.
Closes https://github.com/facebook/rocksdb/pull/3457
Differential Revision: D6877318
Pulled By: maysamyabandeh
fbshipit-source-id: 56f4e21db87ff71492db4e376fb7c2b03dfeab6b
Summary:
Deletes the transaction object at the end of the test.
Verified by:
- COMPILE_WITH_ASAN=1 make -j32 transaction_test
- ./transaction_test --gtest_filter="DBA**Duplicate*"
Closes https://github.com/facebook/rocksdb/pull/3470
Differential Revision: D6916473
Pulled By: maysamyabandeh
fbshipit-source-id: 8303df25408635d5d3ac2b25f309a3d15957c937
Summary:
This patch takes advantage of memtable being able to detect duplicate <key,seq> and returning TryAgain to handle duplicate keys in WritePrepared Txns. Through WriteBatchWithIndex's index it detects existence of at least a duplicate key in the write batch. If duplicate key was reported, it then pays the cost of counting the number of sub-patches by iterating over the write batch and pass it to DBImpl::Write. DB will make use of the provided batch_count to assign proper sequence numbers before sending them to the WAL. When later inserting the batch to the memtable, it increases the seq each time memtbale reports a duplicate (a sub-patch in our counting) and tries again.
Closes https://github.com/facebook/rocksdb/pull/3455
Differential Revision: D6873699
Pulled By: maysamyabandeh
fbshipit-source-id: db8487526c3a5dc1ddda0ea49f0f979b26ae648d
Summary:
Without this patch, ubsan_check is currently failing with this error:
```
utilities/transactions/write_prepared_transaction_test.cc:369:63: runtime error: member call on address 0x0000051649f8 which does not point to an object of type 'WithParamInterface'
0x0000051649f8: note: object has invalid vptr
```
Tested by `COMPILE_WITH_UBSAN=1 make -j32 transaction_test` and running `./write_prepared_transaction_test --gtest_filter=TwoWriteQueues/SnapshotConcurrentAccessTest.SnapshotConcurrentAccessTest1/0`
Closes https://github.com/facebook/rocksdb/pull/3444
Differential Revision: D6850087
Pulled By: maysamyabandeh
fbshipit-source-id: 5b254da8504b8757f7aec8a820ad464154da1a1d
Summary:
SnapshotConcurrentAccessTest sometimes times out when running on the test infra. This patch splits the test into smaller sub-tests to avoid the timeout. It also benefits from lower run-time of each sub-test and increases the coverage of the test. The overall run-time of each final sub-test is at most half of the original test so we should no longer see a timeout.
Closes https://github.com/facebook/rocksdb/pull/3435
Differential Revision: D6839427
Pulled By: maysamyabandeh
fbshipit-source-id: d53fdb157109e2438ca7fe447d0cf4b71f304bd8
Summary:
This patch addresses a couple of minor TODOs for WritePrepared Txn such as double checking some assert statements at runtime as well, skip extra AddPrepared in non-2pc transactions, and safety check for infinite loops.
Closes https://github.com/facebook/rocksdb/pull/3302
Differential Revision: D6617002
Pulled By: maysamyabandeh
fbshipit-source-id: ef6673c139cb49f64c0879508d2f573b78609aca
Summary:
DestroyDB that is used in tests loops over the files returned by ::GetChildren and delete them one by one. Such files might be already deleted in the file system (during DeleteObsoleteFileImpl for example) but will get actually deleted with a delay sometimes before ::DeleteFile is called on the file name. We have some test failures where FaultInjectionTestEnv::DeleteFile fails on assert(s.ok()) during DestroyDB. This patch removes the assert statement to fix that.
Closes https://github.com/facebook/rocksdb/pull/3324
Differential Revision: D6659545
Pulled By: maysamyabandeh
fbshipit-source-id: 4c9552fbcd494dcf3e61d475c11fc965c4388b2c
Summary:
A proper implementation of Iterator::Refresh() for WritePreparedTxnDB would require release and acquire another snapshot. Since MyRocks don't make use of Iterator::Refresh(), we just simply mark it as not supported.
Closes https://github.com/facebook/rocksdb/pull/3290
Differential Revision: D6599931
Pulled By: yiwu-arbug
fbshipit-source-id: 4e1632d967316431424f6e458254ecf9a97567cf
Summary:
Currently non-2pc writes do the 2nd dummy write to actually commit the transaction. This was necessary to ensure that publishing the commit sequence number will be done only from one queue (the queue that does not write to memtable). This is however not necessary when we have only one write queue, which is actually the setup that would be used by non-2pc writes. This patch eliminates the 2nd write when two_write_queues are disabled by updating the commit map in the 1st write.
Closes https://github.com/facebook/rocksdb/pull/3277
Differential Revision: D6575392
Pulled By: maysamyabandeh
fbshipit-source-id: 8ab458f7ca506905962f9166026b2ec81e749c46
Summary:
Add PreReleaseCallback to be called at the end of WriteImpl but before publishing the sequence number. The callback is used in WritePrepareTxn to i) update the commit map, ii) update the last published sequence number in the 2nd write queue. It also ensures that all the commits will go to the 2nd queue.
These changes will ensure that the commit map is updated before the sequence number is published and used by reading snapshots. If we use two write queues, the snapshots will use the seq number published by the 2nd queue. If we use one write queue (the default, the snapshots will use the last seq number in the memtable, which also indicates the last published seq number.
Closes https://github.com/facebook/rocksdb/pull/3205
Differential Revision: D6438959
Pulled By: maysamyabandeh
fbshipit-source-id: f8b6c434e94bc5f5ab9cb696879d4c23e2577ab9
Summary:
This patch implements MultiGet API for WritePreparedTxnDB and update the existing unit tests.
Closes https://github.com/facebook/rocksdb/pull/3196
Differential Revision: D6401493
Pulled By: maysamyabandeh
fbshipit-source-id: 51501a1e32645fc2da8680e77a50035f6530f2cc
Summary:
The sequence number was not properly advanced after a rollback marker. The patch extends the existing unit tests to detect the bug and also fixes it.
Closes https://github.com/facebook/rocksdb/pull/3157
Differential Revision: D6304291
Pulled By: maysamyabandeh
fbshipit-source-id: 1b519c44a5371b802da49c9e32bd00087a8da401
Summary:
This patch clarifies and refactors the logic around tracked keys in transactions.
Closes https://github.com/facebook/rocksdb/pull/3140
Differential Revision: D6290258
Pulled By: maysamyabandeh
fbshipit-source-id: 03b50646264cbcc550813c060b180fc7451a55c1
Summary:
Add tests to ensure that WritePrepared and WriteCommitted policies are cross compatible when the db WAL is empty. This is important when the admin want to switch between the policies. In such case, before the switch the admin needs to empty the WAL by i) committing/rollbacking all the pending transactions, ii) FlushMemTables
Closes https://github.com/facebook/rocksdb/pull/3118
Differential Revision: D6227247
Pulled By: maysamyabandeh
fbshipit-source-id: bcde3d92c1e89cda3b9cfa69f6a20af5d8993db7
Summary:
Adds two new counters:
`key_lock_wait_count` counts how many times a lock was blocked by another transaction and had to wait, instead of being granted the lock immediately.
`key_lock_wait_time` counts the time spent acquiring locks.
Closes https://github.com/facebook/rocksdb/pull/3107
Differential Revision: D6217332
Pulled By: lth
fbshipit-source-id: 55d4f46da5550c333e523263422fd61d6a46deb9
Summary:
Move WritePreparedTxnDB from pessimistic_transaction_db.h to its own header, write_prepared_txn_db.h
Closes https://github.com/facebook/rocksdb/pull/3114
Differential Revision: D6220987
Pulled By: maysamyabandeh
fbshipit-source-id: 18893fb4fdc6b809fe117dabb544080f9b4a301b
Summary:
Implements ValidateSnapshot for WritePrepared txns and also adds a unit test to clarify the contract of this function.
Closes https://github.com/facebook/rocksdb/pull/3101
Differential Revision: D6199405
Pulled By: maysamyabandeh
fbshipit-source-id: ace509934c307ea5d26f4bbac5f836d7c80fd240
Summary:
GetCommitTimeWriteBatch is currently used to store some state as part of commit in 2PC. In MyRocks it is specifically used to store some data that would be needed only during recovery. So it is not need to be stored in memtable right after each commit.
This patch enables an optimization to write the GetCommitTimeWriteBatch only to the WAL. The batch will be written to memtable during recovery when the WAL is replayed. To cover the case when WAL is deleted after memtable flush, the batch is also buffered and written to memtable right before each memtable flush.
Closes https://github.com/facebook/rocksdb/pull/3071
Differential Revision: D6148023
Pulled By: maysamyabandeh
fbshipit-source-id: 2d09bae5565abe2017c0327421010d5c0d55eaa7
Summary:
A few simple changes to allow RocksDB to be built on OpenBSD. Let me know if any further changes are needed.
Closes https://github.com/facebook/rocksdb/pull/3061
Differential Revision: D6138800
Pulled By: ajkr
fbshipit-source-id: a13a17b5dc051e6518bd56a8c5efd1d24dd81b0c
Summary:
Enable concurrent_prepare flag for WritePrepared transactions and extend the existing transaction tests with this config.
Closes https://github.com/facebook/rocksdb/pull/3046
Differential Revision: D6106534
Pulled By: maysamyabandeh
fbshipit-source-id: 88c8d21d45bc492beb0a131caea84a2ac5e7d38c
Summary:
On iterator create, take a snapshot, create a ReadCallback and pass the ReadCallback to the underlying DBIter to check if key is committed.
Closes https://github.com/facebook/rocksdb/pull/2981
Differential Revision: D6001471
Pulled By: yiwu-arbug
fbshipit-source-id: 3565c4cdaf25370ba47008b0e0cb65b31dfe79fe
Summary:
On WritePreparedTxnDB destruct there could be running compaction/flush holding a SnapshotChecker, which holds a pointer back to WritePreparedTxnDB. Make sure those jobs finished before destructing WritePreparedTxnDB.
This is caught by TransactionTest::SeqAdvanceTest.
Closes https://github.com/facebook/rocksdb/pull/2982
Differential Revision: D6002957
Pulled By: yiwu-arbug
fbshipit-source-id: f1e70390c9798d1bd7959f5c8e2a1c14100773c3
Summary:
Update Compaction/Flush to support WritePreparedTxnDB: Add SnapshotChecker which is a proxy to query WritePreparedTxnDB::IsInSnapshot. Pass SnapshotChecker to DBImpl on WritePreparedTxnDB open. CompactionIterator use it to check if a key has been committed and if it is visible to a snapshot. In CompactionIterator:
* check if key has been committed. If not, output uncommitted keys AS-IS.
* use SnapshotChecker to check if key is visible to a snapshot when in need.
* do not output key with seq = 0 if the key is not committed.
Closes https://github.com/facebook/rocksdb/pull/2926
Differential Revision: D5902907
Pulled By: yiwu-arbug
fbshipit-source-id: 945e037fdf0aa652dc5ba0ad879461040baa0320
Summary:
Compaction will output keys with sequence number 0, if it is visible to
earliest snapshot. Adding a test to make sure IsInSnapshot() report sequence number 0 is
visible to any snapshot.
Closes https://github.com/facebook/rocksdb/pull/2974
Differential Revision: D5990665
Pulled By: yiwu-arbug
fbshipit-source-id: ef50ebc777ff8ca688771f3ab598c7a609b0b65e
Summary:
With WriteCommitted, when the write batch has duplicate keys, the txn db simply inserts them to the db with different seq numbers and let the db ignore/merge the duplicate values at the read time. With WritePrepared all the entries of the batch are inserted with the same seq number which prevents us from benefiting from this simple solution.
This patch applies a hackish solution to unblock the end-to-end testing. The hack is to be replaced with a proper solution soon. The patch simply detects the duplicate key insertions, and mark the previous one as obsolete. Then before writing to the db it rewrites the batch eliminating the obsolete keys. This would incur a memcpy cost. Furthermore handing duplicate merge would require to do FullMerge instead of simply ignoring the previous value, which is not handled by this patch.
Closes https://github.com/facebook/rocksdb/pull/2969
Differential Revision: D5976337
Pulled By: maysamyabandeh
fbshipit-source-id: 114e65b66f137d8454ff2d1d782b8c05da95f989
Summary:
Implement the rollback of WritePrepared txns. For each modified value, it reads the value before the txn and write it back. This would cancel out the effect of transaction. It also remove the rolled back txn from prepared heap.
Closes https://github.com/facebook/rocksdb/pull/2946
Differential Revision: D5937575
Pulled By: maysamyabandeh
fbshipit-source-id: a6d3c47f44db3729f44b287a80f97d08dc4e888d
Summary:
Recover txns from the WAL. Also added some unit tests.
Closes https://github.com/facebook/rocksdb/pull/2901
Differential Revision: D5859596
Pulled By: maysamyabandeh
fbshipit-source-id: 6424967b231388093b4effffe0a3b1b7ec8caeb0
Summary:
SUMMARY
Moves the bytes_per_sync and wal_bytes_per_sync options from immutableoptions to mutable options. Also if wal_bytes_per_sync is changed, the wal file and memtables are flushed.
TEST PLAN
ran make check
all passed
Two new tests SetBytesPerSync, SetWalBytesPerSync check that after issuing setoptions with a new value for the var, the db options have the new value.
Closes https://github.com/facebook/rocksdb/pull/2893
Reviewed By: yiwu-arbug
Differential Revision: D5845814
Pulled By: TheRushingWookie
fbshipit-source-id: 93b52d779ce623691b546679dcd984a06d2ad1bd
Summary:
Looks like the API is simply missing. Adding it.
Closes https://github.com/facebook/rocksdb/pull/2937
Differential Revision: D5919955
Pulled By: yiwu-arbug
fbshipit-source-id: 6e2e9c96c29882b0bb4113d1f8efb72bffc57878
Summary:
The test didn't delete txn before creating a new one.
Closes https://github.com/facebook/rocksdb/pull/2913
Differential Revision: D5880236
Pulled By: yiwu-arbug
fbshipit-source-id: 7a4fcaada3d86332292754502cd8f4341143bf4f
Summary:
By default the seq number in DB is increased once per written key. WritePrepared txns requires the seq to be increased once per the entire batch so that the seq would be used as the prepare timestamp by which the transaction is identified. Also we need to increase seq for the commit marker since it would give a unique id to the commit timestamp of transactions.
Two unit tests are added to verify our understanding of how the seq should be increased. The recovery path requires much more work and is left to another patch.
Closes https://github.com/facebook/rocksdb/pull/2885
Differential Revision: D5837843
Pulled By: maysamyabandeh
fbshipit-source-id: a08960b93d727e1cf438c254d0c2636fb133cc1c
Summary:
We had two proposals for lock-free commit maps. This patch implements the latter one that was simpler. We can later experiment with both proposals.
In this impl each entry is an std::atomic of uint64_t, which are accessed via memory_order_acquire/release. In x86_64 arch this is compiled to simple reads and writes from memory.
Closes https://github.com/facebook/rocksdb/pull/2861
Differential Revision: D5800724
Pulled By: maysamyabandeh
fbshipit-source-id: 41abae9a4a5df050a8eb696c43de11c2770afdda
Summary:
This patch instruments the read path to verify each read value against an optional ReadCallback class. If the value is rejected, the reader moves on to the next value. The WritePreparedTxn makes use of this feature to skip sequence numbers that are not in the read snapshot.
Closes https://github.com/facebook/rocksdb/pull/2850
Differential Revision: D5787375
Pulled By: maysamyabandeh
fbshipit-source-id: 49d808b3062ab35e7ae98ad388f659757794184c
Summary:
This patch advances the max_evicted_seq_ is larger granularities to reduce the overhead of updating the relevant data structures.
It also refactor the related code and adds testing to that. As part of this patch some of the TODOs for removing usage of non-static const members are also addressed.
Closes https://github.com/facebook/rocksdb/pull/2844
Differential Revision: D5772928
Pulled By: maysamyabandeh
fbshipit-source-id: f4fcc2948be69c034f10812cf922ce5ab82ef98c
Summary:
TransactionCallback was never used. Remove it to avoid confusion.
Closes https://github.com/facebook/rocksdb/pull/2853
Differential Revision: D5787219
Pulled By: maysamyabandeh
fbshipit-source-id: e2b6a89537e3770a269ad38be71c4b0b160a88ac
Summary:
The patch skips write_prepared_transaction_test from travis as they time out there. They are still covered in daily runs of tests.
Closes https://github.com/facebook/rocksdb/pull/2836
Differential Revision: D5767203
Pulled By: maysamyabandeh
fbshipit-source-id: 51045ef98a745197136e14b2ec02fc6f38081b75
Summary:
Add -DPORTABLE=1
port::cacheline_aligned_alloc() has arguments swapped which prevents every single test from running.
Closes https://github.com/facebook/rocksdb/pull/2815
Differential Revision: D5751661
Pulled By: siying
fbshipit-source-id: e0857d6e138ec46035b3c23d7c3c751901a0a4a0