rocksdb

Commit Graph

Author	SHA1	Message	Date
Yanqin Jin	bd513fd075	Add commit marker with timestamp (#9266 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9266 This diff adds a new tag `CommitWithTimestamp`. Currently, there is no API to trigger writing this tag to WAL, thus it is unavailable to users. This is an ongoing effort to add user-defined timestamp support to write-committed transactions. This diff also indicates all column families that may potentially participate in the same transaction must either disable timestamp or have the same timestamp format, since `CommitWithTimestamp` tag is followed by a single byte-array denoting the commit timestamp of the transaction. We will enforce this checking in a future diff. We keep this diff small. Reviewed By: ltamasi Differential Revision: D31721350 fbshipit-source-id: e1450811443647feb6ca01adec4c8aaae270ffc6	3 years ago
lgqss	77c7085594	MemTableList::TrimHistory now use allocated bytes (#9020 ) Summary: Fix a bug when both max_write_buffer_size_to_maintain and max_write_buffer_number_to_maintain are 0. The bug was introduced in 6.5.0 and https://github.com/facebook/rocksdb/issues/5022. Fix https://github.com/facebook/rocksdb/issues/8371 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9020 Reviewed By: pdillinger Differential Revision: D32767084 Pulled By: ajkr fbshipit-source-id: c401ee6e2557230e892d0fe8abb4966cbd18e85f	3 years ago
Yanqin Jin	924616526a	Update WriteBatch::AssignTimestamp() and Add (#9205 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9205 Update WriteBatch::AssignTimestamp() APIs so that they take an additional argument, i.e. a function object called `checker` indicating the user-specified logic of performing checks on timestamp sizes. WriteBatch is a building block used by multiple other RocksDB components, each of which may track timestamp information in different data structures. For example, transaction can either write to `WriteBatchWithIndex` which is a `WriteBatch` with index, or write directly to raw `WriteBatch` if `Transaction::DisableIndexing()` is called. `WriteBatchWithIndex` keeps mapping from column family id to comparator, and transaction needs to keep similar information for the `WriteBatch` if user calls `Transaction::DisableIndexing()` (dynamically) so that we will know the size of each timestamp later. The bookkeeping info maintained by `WriteBatchWithIndex` and `Transaction` should not overlap. When we later call `WriteBatch::AssignTimestamp()`, we need to use these data structures to guarantee that we do not accidentally assign timestamps for keys from column families that disable timestamp. Reviewed By: ltamasi Differential Revision: D31735186 fbshipit-source-id: 8b1709ed880ac72f995aa9e012e5873b290840a7	3 years ago
Andrew Kryczka	d648cb47b9	Adapt key-value checksum for timestamp-suffixed keys (#8914 ) Summary: After https://github.com/facebook/rocksdb/issues/8725, keys added to `WriteBatch` may be timestamp-suffixed, while `WriteBatch` has no awareness of the timestamp size. Therefore, `WriteBatch` can no longer calculate timestamp checksum separately from the rest of the key's checksum in all cases. This PR changes the definition of key in KV checksum to include the timestamp suffix. That way we do not need to worry about where the timestamp begins within the key. I believe the only practical effect of this change is now `AssignTimestamp()` requires recomputing the whole key checksum (`UpdateK()`) rather than just the timestamp portion (`UpdateT()`). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8914 Test Plan: run stress command that used to fail ``` $ ./db_stress --batch_protection_bytes_per_key=8 -clear_column_family_one_in=0 -test_batches_snapshots=1 ``` Reviewed By: riversand963 Differential Revision: D30925715 Pulled By: ajkr fbshipit-source-id: c143f7ccb46c0efb390ad57ef415c250d754deff	3 years ago
Yanqin Jin	2a2b3e03a5	Allow WriteBatch to have keys with different timestamp sizes (#8725 ) Summary: In the past, we unnecessarily requires all keys in the same write batch to be from column families whose timestamps' formats are the same for simplicity. Specifically, we cannot use the same write batch to write to two column families, one of which enables timestamp while the other disables it. The limitation is due to the member `timestamp_size_` that used to exist in each `WriteBatch` object. We pass a timestamp_size to the constructor of `WriteBatch`. Therefore, users can simply use the old `WriteBatch::Put()`, `WriteBatch::Delete()`, etc APIs for write, while the internal implementation of `WriteBatch` will take care of memory allocation for timestamps. The above is not necessary. One the one hand, users can set up a memory buffer to store user key and then contiguously append the timestamp to the user key. Then the user can pass this buffer to the `WriteBatch::Put(Slice&)` API. On the other hand, users can set up a SliceParts object which is an array of Slices and let the last Slice to point to the memory buffer storing timestamp. Then the user can pass the SliceParts object to the `WriteBatch::Put(SliceParts&)` API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8725 Test Plan: make check Reviewed By: ltamasi Differential Revision: D30654499 Pulled By: riversand963 fbshipit-source-id: 9d848c77ad3c9dd629aa5fc4e2bc16fb0687b4a2	3 years ago
Yanqin Jin	066b51126d	Several simple local code clean-ups (#8565 ) Summary: This PR tries to remove some unnecessary checks as well as unreachable code blocks to improve readability. An obvious non-public API method naming typo is also corrected. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8565 Test Plan: make check Reviewed By: lth Differential Revision: D29963984 Pulled By: riversand963 fbshipit-source-id: cc96e8f09890e5cfe9b20eadb63bdca5484c150a	3 years ago
mrambacher	3dff28cf9b	Use SystemClock* instead of std::shared_ptr<SystemClock> in lower level routines (#8033 ) Summary: For performance purposes, the lower level routines were changed to use a SystemClock* instead of a std::shared_ptr<SystemClock>. The shared ptr has some performance degradation on certain hardware classes. For most of the system, there is no risk of the pointer being deleted/invalid because the shared_ptr will be stored elsewhere. For example, the ImmutableDBOptions stores the Env which has a std::shared_ptr<SystemClock> in it. The SystemClock* within the ImmutableDBOptions is essentially a "short cut" to gain access to this constant resource. There were a few classes (PeriodicWorkScheduler?) where the "short cut" property did not hold. In those cases, the shared pointer was preserved. Using db_bench readrandom perf_level=3 on my EC2 box, this change performed as well or better than 6.17: 6.17: readrandom : 28.046 micros/op 854902 ops/sec; 61.3 MB/s (355999 of 355999 found) 6.18: readrandom : 32.615 micros/op 735306 ops/sec; 52.7 MB/s (290999 of 290999 found) PR: readrandom : 27.500 micros/op 871909 ops/sec; 62.5 MB/s (367999 of 367999 found) (Note that the times for 6.18 are prior to revert of the SystemClock). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8033 Reviewed By: pdillinger Differential Revision: D27014563 Pulled By: mrambacher fbshipit-source-id: ad0459eba03182e454391b5926bf5cdd45657b67	4 years ago
Andrew Kryczka	78ee8564ad	Integrity protection for live updates to WriteBatch (#7748 ) Summary: This PR adds the foundation classes for key-value integrity protection and the first use case: protecting live updates from the source buffers added to `WriteBatch` through the destination buffer in `MemTable`. The width of the protection info is not yet configurable -- only eight bytes per key is supported. This PR allows users to enable protection by constructing `WriteBatch` with `protection_bytes_per_key == 8`. It does not yet expose a way for users to get integrity protection via other write APIs (e.g., `Put()`, `Merge()`, `Delete()`, etc.). The foundation classes (`ProtectionInfo.`) embed the coverage info in their type, and provide `Protect.()` and `Strip.()` functions to navigate between types with different coverage. For making bytes per key configurable (for powers of two up to eight) in the future, these classes are templated on the unsigned integer type used to store the protection info. That integer contains the XOR'd result of hashes with independent seeds for all covered fields. For integer fields, the hash is computed on the raw unadjusted bytes, so the result is endian-dependent. The most significant bytes are truncated when the hash value (8 bytes) is wider than the protection integer. When `WriteBatch` is constructed with `protection_bytes_per_key == 8`, we hold a `ProtectionInfoKVOTC` (i.e., one that covers key, value, optype aka `ValueType`, timestamp, and CF ID) for each entry added to the batch. The protection info is generated from the original buffers passed by the user, as well as the original metadata generated internally. When writing to memtable, each entry is transformed to a `ProtectionInfoKVOTS` (i.e., dropping coverage of CF ID and adding coverage of sequence number), since at that point we know the sequence number, and have already selected a memtable corresponding to a particular CF. This protection info is verified once the entry is encoded in the `MemTable` buffer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7748 Test Plan: - an integration test to verify a wide variety of single-byte changes to the encoded `MemTable` buffer are caught - add to stress/crash test to verify it works in variety of configs/operations without intentional corruption - [deferred] unit tests for `ProtectionInfo.` classes for edge cases like KV swap, `SliceParts` and `Slice` APIs are interchangeable, etc. Reviewed By: pdillinger Differential Revision: D25754492 Pulled By: ajkr fbshipit-source-id: e481bac6c03c2ab268be41359730f1ceb9964866	4 years ago
mrambacher	12f1137355	Add a SystemClock class to capture the time functions of an Env (#7858 ) Summary: Introduces and uses a SystemClock class to RocksDB. This class contains the time-related functions of an Env and these functions can be redirected from the Env to the SystemClock. Many of the places that used an Env (Timer, PerfStepTimer, RepeatableThread, RateLimiter, WriteController) for time-related functions have been changed to use SystemClock instead. There are likely more places that can be changed, but this is a start to show what can/should be done. Over time it would be nice to migrate most (if not all) of the uses of the time functions from the Env to the SystemClock. There are several Env classes that implement these functions. Most of these have not been converted yet to SystemClock implementations; that will come in a subsequent PR. It would be good to unify many of the Mock Timer implementations, so that they behave similarly and be tested similarly (some override Sleep, some use a MockSleep, etc). Additionally, this change will allow new methods to be introduced to the SystemClock (like https://github.com/facebook/rocksdb/issues/7101 WaitFor) in a consistent manner across a smaller number of classes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7858 Reviewed By: pdillinger Differential Revision: D26006406 Pulled By: mrambacher fbshipit-source-id: ed10a8abbdab7ff2e23d69d85bd25b3e7e899e90	4 years ago
Andrew Kryczka	dd6b7fc520	Return `Status` from `MemTable` mutation functions (#7656 ) Summary: This PR updates `MemTable::Add()`, `MemTable::Update()`, and `MemTable::UpdateCallback()` to return `Status` objects, and adapts the client code in `MemTableInserter`. The goal is to prepare these functions for key-value checksum, where we want to verify key-value integrity while adding to memtable. After this PR, the memtable mutation functions can report a failed integrity check by returning `Status::Corruption`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7656 Reviewed By: riversand963 Differential Revision: D24900497 Pulled By: ajkr fbshipit-source-id: 1a7e80581e3774676f2bbba2f0a0b04890f40009	4 years ago
Andrew Kryczka	1c5f13f2a5	Fail early when `merge_operator` not configured (#7667 ) Summary: An application may accidentally write merge operands without properly configuring `merge_operator`. We should alert them as early as possible that there's an API misuse. Previously RocksDB only notified them when a query or background operation needed to merge but couldn't. With this PR, RocksDB notifies them of the problem before applying the merge operand to the memtable (although it may already be in WAL, which seems it'd cause a crash loop until they enable `merge_operator`). Pull Request resolved: https://github.com/facebook/rocksdb/pull/7667 Reviewed By: riversand963 Differential Revision: D24933360 Pulled By: ajkr fbshipit-source-id: 3a4a2ceb0b7aed184113dd03b8efd735a8332f7f	4 years ago
Cheng Chang	73dbe10bbf	Fix write_batch_test when ASSERT_STATUS_CHECKED=1 (#7575 ) Summary: Without this PR, `ASSERT_STATUS_CHECKED=1 make -j32 write_batch_test && ./write_batch_test` fails. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7575 Test Plan: ASSERT_STATUS_CHECKED=1 make -j32 write_batch_test && ./write_batch_test Reviewed By: zhichao-cao Differential Revision: D24411442 Pulled By: cheng-chang fbshipit-source-id: f67dc43c44d6afcc6d7e5ff15c6ae9bbf4dfc943	4 years ago
Zhichao Cao	685cabdafa	Add trace_analyzer_test to ASSERT_STATUS_CHECKED list (#7480 ) Summary: Add trace_analyzer_test to ASSERT_STATUS_CHECKED list Pull Request resolved: https://github.com/facebook/rocksdb/pull/7480 Test Plan: ASSERT_STATUS_CHECKED=1 make -j48 trace_analyzer_test Reviewed By: riversand963 Differential Revision: D24033768 Pulled By: zhichao-cao fbshipit-source-id: b415045e6fab01d6193448650772368c21c6dba6	4 years ago
mrambacher	b7e1c5213f	Add some simulator cache and block tracer tests to ASSERT_STATUS_CHECKED (#7305 ) Summary: More tests now pass. When in doubt, I added a TODO comment to check what should happen with an ignored error. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7305 Reviewed By: akankshamahajan15 Differential Revision: D23301262 Pulled By: ajkr fbshipit-source-id: 5f120edc7393560aefc0633250277bbc7e8de9e6	4 years ago
mrambacher	e9befdebbf	Add EnvTestWithParam::OptionsTest to the ASSERT_STATUS_CHECKED passes (#7283 ) Summary: This test uses database functionality and required more extensive work to get it to pass than the other tests. The DB functionality required for this test now passes the check. When it was unclear what the proper behavior was for unchecked status codes, a TODO was added. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7283 Reviewed By: akankshamahajan15 Differential Revision: D23251497 Pulled By: ajkr fbshipit-source-id: 52b79629bdafa0a58de8ead1d1d66f141b331523	4 years ago
Jay Zhuang	00de699096	Replace reinterpret_cast with static_cast_with_check (#7067 ) Summary: Replace `reinterpret_cast` with `static_cast_with_check` for `DBImpl` and `ColumnFamilyHandleImpl`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7067 Reviewed By: siying Differential Revision: D22361587 Pulled By: jay-zhuang fbshipit-source-id: dfe9e8f3af39c3d27cc372c55ab9ad905eb0a5a1	4 years ago
Yanqin Jin	961c7590d6	Add timestamp to delete (#6253 ) Summary: Preliminary user-timestamp support for delete. If ["a", ts=100] exists, you can delete it by calling `DB::Delete(write_options, key)` in which `write_options.timestamp` points to a `ts` higher than 100. Implementation A new ValueType, i.e. `kTypeDeletionWithTimestamp` is added for deletion marker with timestamp. The reason for a separate `kTypeDeletionWithTimestamp`: RocksDB may drop tombstones (keys with kTypeDeletion) when compacting them to the bottom level. This is OK and useful if timestamp is disabled. When timestamp is enabled, should we still reuse `kTypeDeletion`, we may drop the tombstone with a more recent timestamp, causing deleted keys to re-appear. Test plan (dev server) ``` make check ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/6253 Reviewed By: ltamasi Differential Revision: D20995328 Pulled By: riversand963 fbshipit-source-id: a9e5c22968ad76f98e3dc6ee0151265a3f0df619	5 years ago
Andrew Kryczka	e9ba4ba348	validate range tombstone covers positive range (#6788 ) Summary: We found some files containing nothing but negative range tombstones, and unsurprisingly their metadata specified a negative range, which made things crash. Time to add a bit of user input validation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6788 Reviewed By: zhichao-cao Differential Revision: D21343719 Pulled By: ajkr fbshipit-source-id: f1c16e4c3e9fa150958c8c866176632a3206fb74	5 years ago
Derrick Pallas	5272305437	Fix FilterBench when RTTI=0 (#6732 ) Summary: The dynamic_cast in the filter benchmark causes release mode to fail due to no-rtti. Replace with static_cast_with_check. Signed-off-by: Derrick Pallas <derrick@pallas.us> Addition by peterd: Remove unnecessary 2nd template arg on all static_cast_with_check Pull Request resolved: https://github.com/facebook/rocksdb/pull/6732 Reviewed By: ltamasi Differential Revision: D21304260 Pulled By: pdillinger fbshipit-source-id: 6e8eb437c4ca5a16dbbfa4053d67c4ad55f1608c	5 years ago
Peter Dillinger	31da5e34c1	C++20 compatibility (#6697 ) Summary: Based on https://github.com/facebook/rocksdb/issues/6648 (CLA Signed), but heavily modified / extended: * Implicit capture of this via [=] deprecated in C++20, and [=,this] not standard before C++20 -> now using explicit capture lists * Implicit copy operator deprecated in gcc 9 -> add explicit '= default' definition * std::random_shuffle deprecated in C++17 and removed in C++20 -> migrated to a replacement in RocksDB random.h API * Add the ability to build with different std version though -DCMAKE_CXX_STANDARD=11/14/17/20 on the cmake command line * Minimal rebuild flag of MSVC is deprecated and is forbidden with /std:c++latest (C++20) * Added MSVC 2019 C++11 & MSVC 2019 C++20 in AppVeyor * Added GCC 9 C++11 & GCC9 C++20 in Travis Pull Request resolved: https://github.com/facebook/rocksdb/pull/6697 Test Plan: make check and CI Reviewed By: cheng-chang Differential Revision: D21020318 Pulled By: pdillinger fbshipit-source-id: 12311be5dbd8675a0e2c817f7ec50fa11c18ab91	5 years ago
sdong	fdf882ded2	Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433 ) Summary: When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433 Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag. Differential Revision: D19977691 fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e	5 years ago
Manuel Ung	908b1ee64e	WriteUnPrepared: Fix assertion during recovery (#6419 ) Summary: During recovery, multiple (un)prepared batches could exist in the same WAL record due to group commit. This breaks an assertion in `MemTableInserter::MarkBeginPrepare`. To fix, reset unprepared_batch_ to false after `MarkEndPrepare`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6419 Differential Revision: D19896148 Pulled By: lth fbshipit-source-id: b1a32ef88f775a0881264a18bd1a4a5b8c85eee3	5 years ago
Levi Tamasi	db7c687523	Fix a data race related to memtable trimming (#6187 ) Summary: https://github.com/facebook/rocksdb/pull/6177 introduced a data race involving `MemTableList::InstallNewVersion` and `MemTableList::NumFlushed`. The patch fixes this by caching whether the current version has any memtable history (i.e. flushed memtables that are kept around for transaction conflict checking) in an `std::atomic<bool>` member called `current_has_history_`, similarly to how `current_memory_usage_excluding_last_` is handled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6187 Test Plan: ``` make clean COMPILE_WITH_TSAN=1 make db_test -j24 ./db_test ``` Differential Revision: D19084059 Pulled By: ltamasi fbshipit-source-id: 327a5af9700fb7102baea2cc8903c085f69543b9	5 years ago
Levi Tamasi	bd8404feff	Do not schedule memtable trimming if there is no history (#6177 ) Summary: We have observed an increase in CPU load caused by frequent calls to `ColumnFamilyData::InstallSuperVersion` from `DBImpl::TrimMemtableHistory` when using `max_write_buffer_size_to_maintain` to limit the amount of memtable history maintained for transaction conflict checking. Part of the issue is that trimming can potentially be scheduled even if there is no memtable history. The patch adds a check that fixes this. See also https://github.com/facebook/rocksdb/pull/6169. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6177 Test Plan: Compared `perf` output for ``` ./db_bench -benchmarks=randomtransaction -optimistic_transaction_db=1 -statistics -stats_interval_seconds=1 -duration=90 -num=500000 --max_write_buffer_size_to_maintain=16000000 --transaction_set_snapshot=1 --threads=32 ``` before and after the change. There is a significant reduction for the call chain `rocksdb::DBImpl::TrimMemtableHistory` -> `rocksdb::ColumnFamilyData::InstallSuperVersion` -> `rocksdb::ThreadLocalPtr::StaticMeta::Scrape` even without https://github.com/facebook/rocksdb/pull/6169. Differential Revision: D19057445 Pulled By: ltamasi fbshipit-source-id: dff81882d7b280e17eda7d9b072a2d4882c50f79	5 years ago
sdong	e8263dbdaa	Apply formatter to recent 200+ commits. (#5830 ) Summary: Further apply formatter to more recent commits. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5830 Test Plan: Run all existing tests. Differential Revision: D17488031 fbshipit-source-id: 137458fd94d56dd271b8b40c522b03036943a2ab	5 years ago
Lingjing You	1a928c22a0	Add insert hints for each writebatch (#5728 ) Summary: Add insert hints for each writebatch so that they can be used in concurrent write, and add write option to enable it. Bench result (qps): `./db_bench --benchmarks=fillseq -allow_concurrent_memtable_write=true -num=4000000 -batch-size=1 -threads=1 -db=/data3/ylj/tmp -write_buffer_size=536870912 -num_column_families=4` master: \| batch size \ thread num \| 1 \| 2 \| 4 \| 8 \| \| ----------------------- \| ------- \| ------- \| ------- \| ------- \| \| 1 \| 387883 \| 220790 \| 308294 \| 490998 \| \| 10 \| 1397208 \| 978911 \| 1275684 \| 1733395 \| \| 100 \| 2045414 \| 1589927 \| 1798782 \| 2681039 \| \| 1000 \| 2228038 \| 1698252 \| 1839877 \| 2863490 \| fillseq with writebatch hint: \| batch size \ thread num \| 1 \| 2 \| 4 \| 8 \| \| ----------------------- \| ------- \| ------- \| ------- \| ------- \| \| 1 \| 286005 \| 223570 \| 300024 \| 466981 \| \| 10 \| 970374 \| 813308 \| 1399299 \| 1753588 \| \| 100 \| 1962768 \| 1983023 \| 2676577 \| 3086426 \| \| 1000 \| 2195853 \| 2676782 \| 3231048 \| 3638143 \| Pull Request resolved: https://github.com/facebook/rocksdb/pull/5728 Differential Revision: D17297240 fbshipit-source-id: b053590a6d77871f1ef2f911a7bd013b3899b26c	5 years ago
Wilfried Goesgens	fbab9913e2	upgrade gtest 1.7.0 => 1.8.1 for json result writing Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5332 Differential Revision: D17242232 fbshipit-source-id: c0d4646556a1335e51ac7382b986ca7f6ced7b64	5 years ago
Pratik Dhandharia	1b4c104a67	replace some reinterpret_cast with static_cast_with_check (#5740 ) Summary: This PR focuses on replacing some of the reinterpret_cast<DBImpl*> to static_cast_with_check<DBImpl, DB>. Files impacted: ./db/db_impl/db_impl_compaction_flush.cc ./db/write_batch.cc ./utilities/blob_db/blob_db_impl.cc ./utilities/transactions/pessimistic_transaction_db.cc ./utilities/transactions/transaction_base.cc ./utilities/transactions/write_prepared_txn_db.cc ./utilities/transactions/write_unprepared_txn_db.cc Pull Request resolved: https://github.com/facebook/rocksdb/pull/5740 Differential Revision: D17055691 Pulled By: pdhandharia fbshipit-source-id: 0f8034d1b32eade56e37d59c04b7bf236a81d8e8	5 years ago
Zhongyi Xie	2f41ecfe75	Refactor trimming logic for immutable memtables (#5022 ) Summary: MyRocks currently sets `max_write_buffer_number_to_maintain` in order to maintain enough history for transaction conflict checking. The effectiveness of this approach depends on the size of memtables. When memtables are small, it may not keep enough history; when memtables are large, this may consume too much memory. We are proposing a new way to configure memtable list history: by limiting the memory usage of immutable memtables. The new option is `max_write_buffer_size_to_maintain` and it will take precedence over the old `max_write_buffer_number_to_maintain` if they are both set to non-zero values. The new option accounts for the total memory usage of flushed immutable memtables and mutable memtable. When the total usage exceeds the limit, RocksDB may start dropping immutable memtables (which is also called trimming history), starting from the oldest one. The semantics of the old option actually works both as an upper bound and lower bound. History trimming will start if number of immutable memtables exceeds the limit, but it will never go below (limit-1) due to history trimming. In order the mimic the behavior with the new option, history trimming will stop if dropping the next immutable memtable causes the total memory usage go below the size limit. For example, assuming the size limit is set to 64MB, and there are 3 immutable memtables with sizes of 20, 30, 30. Although the total memory usage is 80MB > 64MB, dropping the oldest memtable will reduce the memory usage to 60MB < 64MB, so in this case no memtable will be dropped. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5022 Differential Revision: D14394062 Pulled By: miasantreble fbshipit-source-id: 60457a509c6af89d0993f988c9b5c2aa9e45f5c5	5 years ago
Manuel Ung	f622ca2c7c	WriteUnPrepared: savepoint support (#5627 ) Summary: Add savepoint support when the current transaction has flushed unprepared batches. Rolling back to savepoint is similar to rolling back a transaction. It requires the set of keys that have changed since the savepoint, re-reading the keys at the snapshot at that savepoint, and the restoring the old keys by writing out another unprepared batch. For this strategy to work though, we must be capable of reading keys at a savepoint. This does not work if keys were written out using the same sequence number before and after a savepoint. Therefore, when we flush out unprepared batches, we must split the batch by savepoint if any savepoints exist. eg. If we have the following: ``` Put(A) Put(B) Put(C) SetSavePoint() Put(D) Put(E) SetSavePoint() Put(F) ``` Then we will write out 3 separate unprepared batches: ``` Put(A) 1 Put(B) 1 Put(C) 1 Put(D) 2 Put(E) 2 Put(F) 3 ``` This is so that when we rollback to eg. the first savepoint, we can just read keys at snapshot_seq = 1. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5627 Differential Revision: D16584130 Pulled By: lth fbshipit-source-id: 6d100dd548fb20c4b76661bd0f8a2647e64477fa	5 years ago
Yanqin Jin	ae152ee666	Avoid user key copying for Get/Put/Write with user-timestamp (#5502 ) Summary: In previous https://github.com/facebook/rocksdb/issues/5079, we added user-specified timestamp to `DB::Get()` and `DB::Put()`. Limitation is that these two functions may cause extra memory allocation and key copy. The reason is that `WriteBatch` does not allocate extra memory for timestamps because it is not aware of timestamp size, and we did not provide an API to assign/update timestamp of each key within a `WriteBatch`. We address these issues in this PR by doing the following. 1. Add a `timestamp_size_` to `WriteBatch` so that `WriteBatch` can take timestamps into account when calling `WriteBatch::Put`, `WriteBatch::Delete`, etc. 2. Add APIs `WriteBatch::AssignTimestamp` and `WriteBatch::AssignTimestamps` so that application can assign/update timestamps for each key in a `WriteBatch`. 3. Avoid key copy in `GetImpl` by adding new constructor to `LookupKey`. Test plan (on devserver): ``` $make clean && COMPILE_WITH_ASAN=1 make -j32 all $./db_basic_test --gtest_filter=Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/* $make check ``` If the API extension looks good, I will add more unit tests. Some simple benchmark using db_bench. ``` $rm -rf /dev/shm/dbbench/* && TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillseq,readrandom -num=1000000 $rm -rf /dev/shm/dbbench/* && TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=1000000 -disable_wal=true ``` Master is at `a78503bd6c`. ``` \| \| readrandom \| fillrandom \| \| master \| 15.53 MB/s \| 25.97 MB/s \| \| PR5502 \| 16.70 MB/s \| 25.80 MB/s \| ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5502 Differential Revision: D16340894 Pulled By: riversand963 fbshipit-source-id: 51132cf792be07d1efc3ac33f5768c4ee2608bb8	5 years ago
Vijay Nadimpalli	cae22c53fb	Make format Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5395 Differential Revision: D15581698 Pulled By: vjnadimpalli fbshipit-source-id: f415972f16e784b1361714c202b97defcab46767	6 years ago
Vijay Nadimpalli	49c5a12dbe	Organizing rocksdb/db directory Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5390 Differential Revision: D15579388 Pulled By: vjnadimpalli fbshipit-source-id: 5bfc95e31554b8ff05b97b76d6534113f527f366	6 years ago
Maysam Yabandeh	f383641a1d	Unordered Writes (#5218 ) Summary: Performing unordered writes in rocksdb when unordered_write option is set to true. When enabled the writes to memtable are done without joining any write thread. This offers much higher write throughput since the upcoming writes would not have to wait for the slowest memtable write to finish. The tradeoff is that the writes visible to a snapshot might change over time. If the application cannot tolerate that, it should implement its own mechanisms to work around that. Using TransactionDB with WRITE_PREPARED write policy is one way to achieve that. Doing so increases the max throughput by 2.2x without however compromising the snapshot guarantees. The patch is prepared based on an original by siying Existing unit tests are extended to include unordered_write option. Benchmark Results: ``` TEST_TMPDIR=/dev/shm/ ./db_bench_unordered --benchmarks=fillrandom --threads=32 --num=10000000 -max_write_buffer_number=16 --max_background_jobs=64 --batch_size=8 --writes=3000000 -level0_file_num_compaction_trigger=99999 --level0_slowdown_writes_trigger=99999 --level0_stop_writes_trigger=99999 -enable_pipelined_write=false -disable_auto_compactions --unordered_write=1 ``` With WAL - Vanilla RocksDB: 78.6 MB/s - WRITER_PREPARED with unordered_write: 177.8 MB/s (2.2x) - unordered_write: 368.9 MB/s (4.7x with relaxed snapshot guarantees) Without WAL - Vanilla RocksDB: 111.3 MB/s - WRITER_PREPARED with unordered_write: 259.3 MB/s MB/s (2.3x) - unordered_write: 645.6 MB/s (5.8x with relaxed snapshot guarantees) - WRITER_PREPARED with unordered_write disable concurrency control: 185.3 MB/s MB/s (2.35x) Limitations: - The feature is not yet extended to `max_successive_merges` > 0. The feature is also incompatible with `enable_pipelined_write` = true as well as with `allow_concurrent_memtable_write` = false. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5218 Differential Revision: D15219029 Pulled By: maysamyabandeh fbshipit-source-id: 38f2abc4af8780148c6128acdba2b3227bc81759	6 years ago
jsteemann	de76909464	refactor SavePoints (#5192 ) Summary: Savepoints are assumed to be used in a stack-wise fashion (only the top element should be used), so they were stored by `WriteBatch` in a member variable `save_points` using an std::stack. Conceptually this is fine, but the implementation had a few issues: - the `save_points_` instance variable was a plain pointer to a heap- allocated `SavePoints` struct. The destructor of `WriteBatch` simply deletes this pointer. However, the copy constructor of WriteBatch just copied that pointer, meaning that copying a WriteBatch with active savepoints will very likely have crashed before. Now a proper copy of the savepoints is made in the copy constructor, and not just a copy of the pointer - `save_points_` was an std::stack, which defaults to `std::deque` for the underlying container. A deque is a bit over the top here, as we only need access to the most recent savepoint (i.e. stack.top()) but never any elements at the front. std::deque is rather expensive to initialize in common environments. For example, the STL implementation shipped with GNU g++ will perform a heap allocation of more than 500 bytes to create an empty deque object. Although the `save_points_` container is created lazily by RocksDB, moving from a deque to a plain `std::vector` is much more memory-efficient. So `save_points_` is now a vector. - `save_points_` was changed from a plain pointer to an `std::unique_ptr`, making ownership more explicit. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5192 Differential Revision: D15024074 Pulled By: maysamyabandeh fbshipit-source-id: 5b128786d3789cde94e46465c9e91badd07a25d7	6 years ago
Michael Liu	3c5d1b16b1	Apply modernize-use-override (3) Summary: Use C++11’s override and remove virtual where applicable. Change are automatically generated. bypass-lint drop-conflicts Reviewed By: igorsugak Differential Revision: D14131816 fbshipit-source-id: f20e7f7cecf2e699d70f5fa036f72c0e3f59b50e	6 years ago
Sagar Vemuri	dc3528077a	Update all unique/shared_ptr instances to be qualified with namespace std (#4638 ) Summary: Ran the following commands to recursively change all the files under RocksDB: ``` find . -type f -name ".cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} + find . -type f -name ".cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} + find . -type f -name ".cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} + find . -type f -name ".cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} + ``` Running `make format` updated some formatting on the files touched. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638 Differential Revision: D12934992 Pulled By: sagar0 fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8	6 years ago
Simon Grätzer	ceded4535d	WriteBatch::Iterate wrongly returns Status::Corruption (#4478 ) Summary: Wrong I overwrite `WriteBatch::Handler::Continue` to return _false_ at some point, I always get the `Status::Corruption` error. I don't think this check is used correctly here: The counter in `found` cannot reflect all entries in the WriteBatch when we exit the loop early. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4478 Differential Revision: D10317416 Pulled By: yiwu-arbug fbshipit-source-id: cccae3382805035f9b3239b66682b5fcbba6bb61	6 years ago
Manuel Ung	ea212e5316	WriteUnPrepared: Implement unprepared batches for transactions (#4104 ) Summary: This adds support for writing unprepared batches based on size defined in `TransactionOptions::max_write_batch_size`. This is done by overriding methods that modify data (Put/Delete/SingleDelete/Merge) and checking first if write batch size has exceeded threshold. If so, the write batch is written to DB as an unprepared batch. Support for Commit/Rollback for unprepared batch is added as well. This has been done by simply extending the WritePrepared Commit/Rollback logic to take care of all unprep_seq numbers either when updating prepare heap, or adding to commit map. For updating the commit map, this logic exists inside `WriteUnpreparedCommitEntryPreReleaseCallback`. A test change was also made to have transactions unregister themselves when committing without prepare. This is because with write unprepared, there may be unprepared entries (which act similarly to prepared entries) already when a commit is done without prepare. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4104 Differential Revision: D8785717 Pulled By: lth fbshipit-source-id: c02006e281ec1ce00f628e2a7beec0ee73096a91	6 years ago
Tamir Duberstein	7bee48bdbd	Add GCC 8 to Travis (#3433 ) Summary: - Avoid `strdup` to use jemalloc on Windows - Use `size_t` for consistency - Add GCC 8 to Travis - Add CMAKE_BUILD_TYPE=Release to Travis Pull Request resolved: https://github.com/facebook/rocksdb/pull/3433 Differential Revision: D6837948 Pulled By: sagar0 fbshipit-source-id: b8543c3a4da9cd07ee9a33f9f4623188e233261f	6 years ago
Manuel Ung	b9846370e9	WriteUnPrepared: Add support for recovering WriteUnprepared transactions (#4078 ) Summary: This adds support for recovering WriteUnprepared transactions through the following changes: - The information in `RecoveredTransaction` is extended so that it can reference multiple batches. - `MarkBeginPrepare` is extended with a bool indicating whether it is an unprepared begin, and this is passed down to `InsertRecoveredTransaction` to indicate whether the current transaction is prepared or not. - `WriteUnpreparedTxnDB::Initialize` is overridden so that it will rollback unprepared transactions from the recovered transactions. This can be done without updating the prepare heap/commit map, because this is before the DB has finished initializing, and after writing the rollback batch, those data structures should not contain information about the rolled back transaction anyway. Commit/Rollback of live transactions is still unimplemented and will come later. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4078 Differential Revision: D8703382 Pulled By: lth fbshipit-source-id: 7e0aada6c23bd39299f1f20d6c060492e0e6b60a	6 years ago
Manuel Ung	8ad63a4b86	WriteUnPrepared: Add new WAL marker kTypeBeginUnprepareXID (#4069 ) Summary: This adds a new WAL marker of type kTypeBeginUnprepareXID. Also, DBImpl now contains a field called batch_per_txn (meaning one WriteBatch per transaction, or possibly multiple WriteBatches). This would also indicate that this DB is using WriteUnprepared policy. Recovery code would be able to make use of this extra field on DBImpl in a separate diff. For now, it is just used to determine whether the WAL is compatible or not. Closes https://github.com/facebook/rocksdb/pull/4069 Differential Revision: D8675099 Pulled By: lth fbshipit-source-id: ca27cae1738e46d65f2bb92860fc759deb874749	6 years ago
Maysam Yabandeh	c3d1e36cce	WritePrepared Txn: enable TryAgain for duplicates at the end of the batch Summary: The WriteBatch::Iterate will try with a larger sequence number if the memtable reports a duplicate. This status is specified with TryAgain status. So far the assumption was that the last entry in the batch will never return TryAgain, which is correct when WAL is created via WritePrepared since it always appends a batch separator if a natural one does not exist. However when reading a WAL generated by WriteCommitted this batch separator might not exist. Although WritePrepared is not supposed to be able to read the WAL generated by WriteCommitted we should avoid confusing scenarios in which the behavior becomes unpredictable. The path fixes that by allowing TryAgain even for the last entry of the write batch. Closes https://github.com/facebook/rocksdb/pull/3747 Differential Revision: D7708391 Pulled By: maysamyabandeh fbshipit-source-id: bfaddaa9b14a4cdaff6977f6f63c789a6ab1ee0d	7 years ago
David Lai	3be9b36453	comment unused parameters to turn on -Wunused-parameter flag Summary: This PR comments out the rest of the unused arguments which allow us to turn on the -Wunused-parameter flag. This is the second part of a codemod relating to https://github.com/facebook/rocksdb/pull/3557. Closes https://github.com/facebook/rocksdb/pull/3662 Differential Revision: D7426121 Pulled By: Dayvedde fbshipit-source-id: 223994923b42bd4953eb016a0129e47560f7e352	7 years ago
Dmitri Smirnov	6f7b7f91b5	Optionally create DuplicateDetector Summary: Address issue https://github.com/facebook/rocksdb/issues/3579 Closes https://github.com/facebook/rocksdb/pull/3589 Differential Revision: D7221161 Pulled By: yiwu-arbug fbshipit-source-id: bd875ab0aa0e414dfa98b1bf036ba9b4ed351361	7 years ago
Bruce Mitchener	a3a3f5497c	Fix some typos in comments and docs. Summary: Closes https://github.com/facebook/rocksdb/pull/3568 Differential Revision: D7170953 Pulled By: siying fbshipit-source-id: 9cfb8dd88b7266da920c0e0c1e10fb2c5af0641c	7 years ago
Maysam Yabandeh	62277e15c3	WritePrepared Txn: Move DuplicateDetector to util Summary: Move DuplicateDetector and SetComparator to its own header file in util. It would also address a complaint in the unity test. Closes https://github.com/facebook/rocksdb/pull/3567 Differential Revision: D7163268 Pulled By: maysamyabandeh fbshipit-source-id: 6ddf82773473646dbbc1284ae601a78c4907c778	7 years ago
Andrew Kryczka	5d68243e61	Comment out unused variables Summary: Submitting on behalf of another employee. Closes https://github.com/facebook/rocksdb/pull/3557 Differential Revision: D7146025 Pulled By: ajkr fbshipit-source-id: 495ca5db5beec3789e671e26f78170957704e77e	7 years ago
Maysam Yabandeh	680864ae54	WritePrepared Txn: Fix bug with duplicate keys during recovery Summary: Fix the following bugs: - During recovery a duplicate key was inserted twice into the write batch of the recovery transaction, once when the memtable returns false (because it was duplicates) and once for the 2nd attempt. This would result into different SubBatch count measured when the recovered transactions is committing. - If a cf is flushed during recovery the memtable is not available to assist in detecting the duplicate key. This could result into not advancing the sequence number when iterating over duplicate keys of a flushed cf and hence inserting the next key with the wrong sequence number. - SubBacthCounter would reset the comparator to default comparator after the first duplicate key. The 2nd duplicate key hence would have gone through a wrong comparator and not being detected. Closes https://github.com/facebook/rocksdb/pull/3562 Differential Revision: D7149440 Pulled By: maysamyabandeh fbshipit-source-id: 91ec317b165f363f5d11ff8b8c47c81cebb8ed77	7 years ago
Igor Sugak	aba3409740	Back out "[codemod] - comment out unused parameters" Reviewed By: igorsugak fbshipit-source-id: 4a93675cc1931089ddd574cacdb15d228b1e5f37	7 years ago

1 2 3 4

174 Commits (297d9132752d9658989c1e99cfc836978b886ca1)