Summary:
We haven't been actively mantaining RocksDB LITE recently and the size must have been gone up significantly. We are removing the support.
Most of changes were done through following comments:
unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE | egrep '[.](cc|h)'`
by Peter Dillinger. Others changes were manually applied to build scripts, CircleCI manifests, ROCKSDB_LITE is used in an expression and file db_stress_test_base.cc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147
Test Plan: See CI
Reviewed By: pdillinger
Differential Revision: D42796341
fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2
Summary:
Currently, this original behavior should not lead to incorrect result, but will violate the contract of CompareWithTimestamp() that when a_has_ts or b_has_ts is false, the slice does not include timestamp.
Resolves https://github.com/facebook/rocksdb/issues/10709
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10742
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D39834096
Pulled By: riversand963
fbshipit-source-id: c597600f5a7820734f07d0926cdc224cea5eabe1
Summary:
Added an option, `WriteOptions::protection_bytes_per_key`, that controls how many bytes per key we use for integrity protection in `WriteBatch`. It takes effect when `WriteBatch::GetProtectionBytesPerKey() == 0`.
Currently the only supported value is eight. Invoking a user API with it set to any other nonzero value will result in `Status::NotSupported` returned to the user.
There is also a bug fix for integrity protection with `inplace_callback`, where we forgot to take into account the possible change in varint length when calculating KV checksum for the final encoded buffer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10037
Test Plan:
- Manual
- Set default value of `WriteOptions::protection_bytes_per_key` to eight and ran `make check -j24`
- Enabled in MyShadow for 1+ week
- Automated
- Unit tests have a `WriteMode` that enables the integrity protection via `WriteOptions`
- Crash test - in most cases, use `WriteOptions::protection_bytes_per_key` to enable integrity protection
Reviewed By: cbi42
Differential Revision: D36614569
Pulled By: ajkr
fbshipit-source-id: 8650087ceac9b61b560f1e5fafe5e1baf9c725fb
Summary:
ToString() is created as some platform doesn't support std::to_string(). However, we've already used std::to_string() by mistake for 16 months (in db/db_info_dumper.cc). This commit just remove ToString().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9955
Test Plan: Watch CI tests
Reviewed By: riversand963
Differential Revision: D36176799
fbshipit-source-id: bdb6dcd0e3a3ab96a1ac810f5d0188f684064471
Summary:
This PR supports inserting keys to a `WriteBatchWithIndex` for column families that enable user-defined timestamps
and reading the keys back. **The index does not have timestamps.**
Writing a key to WBWI is unchanged, because the underlying WriteBatch already supports it.
When reading the keys back, we need to make sure to distinguish between keys with and without timestamps before
comparison.
When user calls `GetFromBatchAndDB()`, no timestamp is needed to query the batch, but a timestamp has to be
provided to query the db. The assumption is that data in the batch must be newer than data from the db.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9603
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D34354849
Pulled By: riversand963
fbshipit-source-id: d25d1f84e2240ce543e521fa30595082fb8db9a0
Summary:
ajkr reminded me that we have a rule of not including per-kv related data in `WriteOptions`.
Namely, `WriteOptions` should not include information about "what-to-write", but should just
include information about "how-to-write".
According to this rule, `WriteOptions::timestamp` (experimental) is clearly a violation. Therefore,
this PR removes `WriteOptions::timestamp` for compliance.
After the removal, we need to pass timestamp info via another set of APIs. This PR proposes a set
of overloaded functions `Put(write_opts, key, value, ts)`, `Delete(write_opts, key, ts)`, and
`SingleDelete(write_opts, key, ts)`. Planned to add `Write(write_opts, batch, ts)`, but its complexity
made me reconsider doing it in another PR (maybe).
For better checking and returning error early, we also add a new set of APIs to `WriteBatch` that take
extra `timestamp` information when writing to `WriteBatch`es.
These set of APIs in `WriteBatchWithIndex` are currently not supported, and are on our TODO list.
Removed `WriteBatch::AssignTimestamps()` and renamed `WriteBatch::AssignTimestamp()` to
`WriteBatch::UpdateTimestamps()` since this method require that all keys have space for timestamps
allocated already and multiple timestamps can be updated.
The constructor of `WriteBatch` now takes a fourth argument `default_cf_ts_sz` which is the timestamp
size of the default column family. This will be used to allocate space when calling APIs that do not
specify a column family handle.
Also, updated `DB::Get()`, `DB::MultiGet()`, `DB::NewIterator()`, `DB::NewIterators()` methods, replacing
some assertions about timestamp to returning Status code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8946
Test Plan:
make check
./db_bench -benchmarks=fillseq,fillrandom,readrandom,readseq,deleterandom -user_timestamp_size=8
./db_stress --user_timestamp_size=8 -nooverwritepercent=0 -test_secondary=0 -secondary_catch_up_one_in=0 -continuous_verification_interval=0
Make sure there is no perf regression by running the following
```
./db_bench_opt -db=/dev/shm/rocksdb -use_existing_db=0 -level0_stop_writes_trigger=256 -level0_slowdown_writes_trigger=256 -level0_file_num_compaction_trigger=256 -disable_wal=1 -duration=10 -benchmarks=fillrandom
```
Before this PR
```
DB path: [/dev/shm/rocksdb]
fillrandom : 1.831 micros/op 546235 ops/sec; 60.4 MB/s
```
After this PR
```
DB path: [/dev/shm/rocksdb]
fillrandom : 1.820 micros/op 549404 ops/sec; 60.8 MB/s
```
Reviewed By: ltamasi
Differential Revision: D33721359
Pulled By: riversand963
fbshipit-source-id: c131561534272c120ffb80711d42748d21badf09
Summary:
The MultiGetFromBatchAndDB would fail if the PinnableSlice value being returned was pinned. This could happen if the value was retrieved from the DB (not memtable) or potentially if the values were reused (and a previous iteration returned a slice that was pinned).
This change resets the pinnable value to clear it prior to attempting to use it, thereby eliminating the problem with the value already being pinned.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8299
Reviewed By: jay-zhuang
Differential Revision: D28455426
Pulled By: mrambacher
fbshipit-source-id: a34d7d983ec9b6bb4c8a2b4892f72858d43e6972
Summary:
The WBWI has two differing modes of operation dependent on the value
of the constructor parameter `overwrite_key`.
Currently, regardless of the parameter, neither mode performs as
expected when using Merge. This PR remedies this by correctly invoking
the appropriate Merge Operator before returning results from the WBWI.
Examples of issues that exist which are solved by this PR:
## Example 1 with `overwrite_key=false`
Currently, from an empty database, the following sequence:
```
Put('k1', 'v1')
Merge('k1', 'v2')
Get('k1')
```
Incorrectly yields `v2`, that is to say that the Merge behaves like a Put.
## Example 2 with o`verwrite_key=true`
Currently, from an empty database, the following sequence:
```
Put('k1', 'v1')
Merge('k1', 'v2')
Get('k1')
```
Incorrectly yields `ERROR: kMergeInProgress`.
## Example 3 with `overwrite_key=false`
Currently, with a database containing `('k1' -> 'v1')`, the following sequence:
```
Merge('k1', 'v2')
GetFromBatchAndDB('k1')
```
Incorrectly yields `v1,v2`
## Example 4 with `overwrite_key=true`
Currently, with a database containing `('k1' -> 'v1')`, the following sequence:
```
Merge('k1', 'v1')
GetFromBatchAndDB('k1')
```
Incorrectly yields `ERROR: kMergeInProgress`.
## Example 5 with `overwrite_key=false`
Currently, from an empty database, the following sequence:
```
Put('k1', 'v1')
Merge('k1', 'v2')
GetFromBatchAndDB('k1')
```
Incorrectly yields `v1,v2`
## Example 6 with `overwrite_key=true`
Currently, from an empty database, `('k1' -> 'v1')`, the following sequence:
```
Put('k1', 'v1')
Merge('k1', 'v2')
GetFromBatchAndDB('k1')
```
Incorrectly yields `ERROR: kMergeInProgress`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8135
Reviewed By: pdillinger
Differential Revision: D27657938
Pulled By: mrambacher
fbshipit-source-id: 0fbda6bbc66bedeba96a84786d90141d776297df
Summary:
1. Made `WriteBatchWithIndexInternal` into a class that stores the `DB*` or `DBOptions*`.
2. Changed the `GetFromBatch` method to be non-static and use an instance of the class. Added `MergeKey` methods to perform the merge itself and return any status.
This change unifies the multiple calls to the `MergeHelper` under a single wrapped API.
Closes https://github.com/facebook/rocksdb/issues/6683
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6851
Reviewed By: ajkr
Differential Revision: D21706574
Pulled By: pdillinger
fbshipit-source-id: 6860bd64d62669aaa591846e914eed3b674e68b1
Summary:
Some clients do not close their iterators until after the transaction finishes. To handle this case, we will invalidate any iterators on transaction clear.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7733
Reviewed By: cheng-chang
Differential Revision: D25261158
Pulled By: lth
fbshipit-source-id: b91320f00c54cbe0e6882b794b34f3bb5640dbc0
Summary:
The dynamic_cast in the filter benchmark causes release mode to fail due to
no-rtti. Replace with static_cast_with_check.
Signed-off-by: Derrick Pallas <derrick@pallas.us>
Addition by peterd: Remove unnecessary 2nd template arg on all static_cast_with_check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6732
Reviewed By: ltamasi
Differential Revision: D21304260
Pulled By: pdillinger
fbshipit-source-id: 6e8eb437c4ca5a16dbbfa4053d67c4ad55f1608c
Summary:
Add timestamp support for MultiGet().
timestamp from readoptions is honored, and timestamps can be returned along with values.
MultiReadRandom perf test (10 minutes) on the same development machine ram drive with the same DB data shows no regression (within marge of error). The test is adapted from https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks.
base line (commit 17bef7d3a):
multireadrandom : 104.173 micros/op 307167 ops/sec; (5462999 of 5462999 found)
This PR:
multireadrandom : 104.199 micros/op 307095 ops/sec; (5307999 of 5307999 found)
.\db_bench --db=r:\rocksdb.github --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --cache_size=2147483648 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=r:\rocksdb.github\WAL_LOG --sync=0 --verify_checksum=1 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --duration=600 --benchmarks=multireadrandom --use_existing_db=1 --num=25000000 --threads=32 --allow_concurrent_memtable_write=0
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6483
Reviewed By: anand1976
Differential Revision: D20498373
Pulled By: riversand963
fbshipit-source-id: 8505f22bc40fd791bc7dd05e48d7e67c91edb627
Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433
Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.
Differential Revision: D19977691
fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6192
Test Plan:
Add a unit test that fails without the fix and passes now
make check
Differential Revision: D19124781
Pulled By: anand1976
fbshipit-source-id: 8c8cb6fa16c3fc23ec011e168561a13f76bbd783
Summary:
Add a new API that allows a user to call MultiGet specifying multiple keys belonging to different column families. This is mainly useful for users who want to do a consistent read of keys across column families, with the added performance benefits of batching and returning values using PinnableSlice.
As part of this change, the code in the original multi-column family MultiGet for acquiring the super versions has been refactored into a separate function that can be used by both, the batching and the non-batching versions of MultiGet.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5816
Test Plan:
make check
make asan_check
asan_crash_test
Differential Revision: D18408676
Pulled By: anand1976
fbshipit-source-id: 933e7bec91dd70e7b633be4ff623a1116cc28c8d
Summary:
According to
https://github.com/facebook/rocksdb/wiki/Rocksdb-BlockBasedTable-Format,
the block read by BlockBasedTable::ReadMetaBlock is actually the meta index
block. Therefore, it is better to rename the function to ReadMetaIndexBlock.
This PR also applies some format change to existing code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6009
Test Plan: make check
Differential Revision: D18333238
Pulled By: riversand963
fbshipit-source-id: 2c4340a29b3edba53d19c132cbfd04caf6242aed
Summary:
For MDEV-19670: MyRocks: key lookups into deleted data are very slow
BaseDeltaIterator remembers iterate_upper_bound and will not let delta_iterator_
walk above the iterate_upper_bound if base_iterator_ is not valid
anymore.
== Rationale ==
The most straightforward way would be to make the delta_iterator
(which is a rocksdb::WBWIIterator) to support iterator bounds. But
checking for bounds has an extra CPU overhead.
So we put the check into BaseDeltaIterator, and only make it when
base_iterator_ is not valid.
(note: We could take it even further, and move the check a few lines
down, and only check iterator bounds ourselves if base_iterator_ is
not valid AND delta_iterator_ hit a tombstone).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5403
Differential Revision: D15863092
Pulled By: maysamyabandeh
fbshipit-source-id: 8da458e7b9af95ff49356666f69664b4a6ccf49b
Summary:
Further apply formatter to more recent commits.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5830
Test Plan: Run all existing tests.
Differential Revision: D17488031
fbshipit-source-id: 137458fd94d56dd271b8b40c522b03036943a2ab
Summary:
This is a new API added to db.h to allow for fetching all merge operands associated with a Key. The main motivation for this API is to support use cases where doing a full online merge is not necessary as it is performance sensitive. Example use-cases:
1. Update subset of columns and read subset of columns -
Imagine a SQL Table, a row is encoded as a K/V pair (as it is done in MyRocks). If there are many columns and users only updated one of them, we can use merge operator to reduce write amplification. While users only read one or two columns in the read query, this feature can avoid a full merging of the whole row, and save some CPU.
2. Updating very few attributes in a value which is a JSON-like document -
Updating one attribute can be done efficiently using merge operator, while reading back one attribute can be done more efficiently if we don't need to do a full merge.
----------------------------------------------------------------------------------------------------
API :
Status GetMergeOperands(
const ReadOptions& options, ColumnFamilyHandle* column_family,
const Slice& key, PinnableSlice* merge_operands,
GetMergeOperandsOptions* get_merge_operands_options,
int* number_of_operands)
Example usage :
int size = 100;
int number_of_operands = 0;
std::vector<PinnableSlice> values(size);
GetMergeOperandsOptions merge_operands_info;
db_->GetMergeOperands(ReadOptions(), db_->DefaultColumnFamily(), "k1", values.data(), merge_operands_info, &number_of_operands);
Description :
Returns all the merge operands corresponding to the key. If the number of merge operands in DB is greater than merge_operands_options.expected_max_number_of_operands no merge operands are returned and status is Incomplete. Merge operands returned are in the order of insertion.
merge_operands-> Points to an array of at-least merge_operands_options.expected_max_number_of_operands and the caller is responsible for allocating it. If the status returned is Incomplete then number_of_operands will contain the total number of merge operands found in DB for key.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5604
Test Plan:
Added unit test and perf test in db_bench that can be run using the command:
./db_bench -benchmarks=getmergeoperands --merge_operator=sortlist
Differential Revision: D16657366
Pulled By: vjnadimpalli
fbshipit-source-id: 0faadd752351745224ee12d4ae9ef3cb529951bf
Summary:
Add savepoint support when the current transaction has flushed unprepared batches.
Rolling back to savepoint is similar to rolling back a transaction. It requires the set of keys that have changed since the savepoint, re-reading the keys at the snapshot at that savepoint, and the restoring the old keys by writing out another unprepared batch.
For this strategy to work though, we must be capable of reading keys at a savepoint. This does not work if keys were written out using the same sequence number before and after a savepoint. Therefore, when we flush out unprepared batches, we must split the batch by savepoint if any savepoints exist.
eg. If we have the following:
```
Put(A)
Put(B)
Put(C)
SetSavePoint()
Put(D)
Put(E)
SetSavePoint()
Put(F)
```
Then we will write out 3 separate unprepared batches:
```
Put(A) 1
Put(B) 1
Put(C) 1
Put(D) 2
Put(E) 2
Put(F) 3
```
This is so that when we rollback to eg. the first savepoint, we can just read keys at snapshot_seq = 1.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5627
Differential Revision: D16584130
Pulled By: lth
fbshipit-source-id: 6d100dd548fb20c4b76661bd0f8a2647e64477fa
Summary:
MultiGet batching was implemented in #5011 in order to reduce CPU utilization when looking up multiple keys at once. This PR implements corresponding ```MultiGet``` and ```MultiGetSingleCFForUpdate``` in ```rocksdb::Transaction``` that call the underlying batching implementation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5210
Differential Revision: D15048164
Pulled By: anand1976
fbshipit-source-id: c52f6043102ab0cbc723f4cba2a7b7d1767f6f52
Summary:
Current implementation of `current_over_upper_bound_` fails to take into consideration that keys might be invalid in either base iterator or delta iterator. Calling key() in such scenario will lead to assertion failure and runtime errors.
This PR addresses the bug by adding check for valid keys before calling `IsOverUpperBound()`, also added test coverage for iterate_upper_bound usage in BaseDeltaIterator
Also recommit https://github.com/facebook/rocksdb/pull/4656 (It was reverted earlier due to bugs)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4702
Differential Revision: D13146643
Pulled By: miasantreble
fbshipit-source-id: 6d136929da12d0f2e2a5cea474a8038ec5cdf1d0
Summary:
Currently transaction iterator does not apply `ReadOptions.iterate_upper_bound` when iterating. This PR attempts to fix the problem by having `BaseDeltaIterator` enforcing the upper bound check when iterator state is changed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4656
Differential Revision: D13039257
Pulled By: miasantreble
fbshipit-source-id: 909eb9f6b4597a4d80418fb139f32ec82c6ec1d1
Summary:
WriteBatchWithIndex's SeekForPrev() has a bug that we internally place the position just before the seek key rather than after. This makes the iterator to miss the result that is the same as the seek key. Fix it by position the iterator equal or smaller.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4559
Differential Revision: D10468534
Pulled By: siying
fbshipit-source-id: 2fb371ae809c561b60a1c11cef71e1c66fea1f19
Summary:
This adds support for writing unprepared batches based on size defined in `TransactionOptions::max_write_batch_size`. This is done by overriding methods that modify data (Put/Delete/SingleDelete/Merge) and checking first if write batch size has exceeded threshold. If so, the write batch is written to DB as an unprepared batch.
Support for Commit/Rollback for unprepared batch is added as well. This has been done by simply extending the WritePrepared Commit/Rollback logic to take care of all unprep_seq numbers either when updating prepare heap, or adding to commit map. For updating the commit map, this logic exists inside `WriteUnpreparedCommitEntryPreReleaseCallback`.
A test change was also made to have transactions unregister themselves when committing without prepare. This is because with write unprepared, there may be unprepared entries (which act similarly to prepared entries) already when a commit is done without prepare.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4104
Differential Revision: D8785717
Pulled By: lth
fbshipit-source-id: c02006e281ec1ce00f628e2a7beec0ee73096a91
Summary:
This adds a new WAL marker of type kTypeBeginUnprepareXID.
Also, DBImpl now contains a field called batch_per_txn (meaning one WriteBatch per transaction, or possibly multiple WriteBatches). This would also indicate that this DB is using WriteUnprepared policy.
Recovery code would be able to make use of this extra field on DBImpl in a separate diff. For now, it is just used to determine whether the WAL is compatible or not.
Closes https://github.com/facebook/rocksdb/pull/4069
Differential Revision: D8675099
Pulled By: lth
fbshipit-source-id: ca27cae1738e46d65f2bb92860fc759deb874749
Summary:
Before this PR, Iterator/InternalIterator may simultaneously have non-ok status() and Valid() = true. That state means that the last operation failed, but the iterator is nevertheless positioned on some unspecified record. Likely intended uses of that are:
* If some sst files are corrupted, a normal iterator can be used to read the data from files that are not corrupted.
* When using read_tier = kBlockCacheTier, read the data that's in block cache, skipping over the data that is not.
However, this behavior wasn't documented well (and until recently the wiki on github had misleading incorrect information). In the code there's a lot of confusion about the relationship between status() and Valid(), and about whether Seek()/SeekToLast()/etc reset the status or not. There were a number of bugs caused by this confusion, both inside rocksdb and in the code that uses rocksdb (including ours).
This PR changes the convention to:
* If status() is not ok, Valid() always returns false.
* Any seek operation resets status. (Before the PR, it depended on iterator type and on particular error.)
This does sacrifice the two use cases listed above, but siying said it's ok.
Overview of the changes:
* A commit that adds missing status checks in MergingIterator. This fixes a bug that actually affects us, and we need it fixed. `DBIteratorTest.NonBlockingIterationBugRepro` explains the scenario.
* Changes to lots of iterator types to make all of them conform to the new convention. Some bug fixes along the way. By far the biggest changes are in DBIter, which is a big messy piece of code; I tried to make it less big and messy but mostly failed.
* A stress-test for DBIter, to gain some confidence that I didn't break it. It does a few million random operations on the iterator, while occasionally modifying the underlying data (like ForwardIterator does) and occasionally returning non-ok status from internal iterator.
To find the iterator types that needed changes I searched for "public .*Iterator" in the code. Here's an overview of all 27 iterator types:
Iterators that didn't need changes:
* status() is always ok(), or Valid() is always false: MemTableIterator, ModelIter, TestIterator, KVIter (2 classes with this name anonymous namespaces), LoggingForwardVectorIterator, VectorIterator, MockTableIterator, EmptyIterator, EmptyInternalIterator.
* Thin wrappers that always pass through Valid() and status(): ArenaWrappedDBIter, TtlIterator, InternalIteratorFromIterator.
Iterators with changes (see inline comments for details):
* DBIter - an overhaul:
- It used to silently skip corrupted keys (`FindParseableKey()`), which seems dangerous. This PR makes it just stop immediately after encountering a corrupted key, just like it would for other kinds of corruption. Let me know if there was actually some deeper meaning in this behavior and I should put it back.
- It had a few code paths silently discarding subiterator's status. The stress test caught a few.
- The backwards iteration code path was expecting the internal iterator's set of keys to be immutable. It's probably always true in practice at the moment, since ForwardIterator doesn't support backwards iteration, but this PR fixes it anyway. See added DBIteratorTest.ReverseToForwardBug for an example.
- Some parts of backwards iteration code path even did things like `assert(iter_->Valid())` after a seek, which is never a safe assumption.
- It used to not reset status on seek for some types of errors.
- Some simplifications and better comments.
- Some things got more complicated from the added error handling. I'm open to ideas for how to make it nicer.
* MergingIterator - check status after every operation on every subiterator, and in some places assert that valid subiterators have ok status.
* ForwardIterator - changed to the new convention, also slightly simplified.
* ForwardLevelIterator - fixed some bugs and simplified.
* LevelIterator - simplified.
* TwoLevelIterator - changed to the new convention. Also fixed a bug that would make SeekForPrev() sometimes silently ignore errors from first_level_iter_.
* BlockBasedTableIterator - minor changes.
* BlockIter - replaced `SetStatus()` with `Invalidate()` to make sure non-ok BlockIter is always invalid.
* PlainTableIterator - some seeks used to not reset status.
* CuckooTableIterator - tiny code cleanup.
* ManagedIterator - fixed some bugs.
* BaseDeltaIterator - changed to the new convention and fixed a bug.
* BlobDBIterator - seeks used to not reset status.
* KeyConvertingIterator - some small change.
Closes https://github.com/facebook/rocksdb/pull/3810
Differential Revision: D7888019
Pulled By: al13n321
fbshipit-source-id: 4aaf6d3421c545d16722a815b2fa2e7912bc851d
Summary:
Make use of the index in WriteBatchWithIndex to also count the number of sub-batches. This eliminates the need to separately scan the batch to count the number of sub-batches once a duplicate key is detected.
Closes https://github.com/facebook/rocksdb/pull/3529
Differential Revision: D7049947
Pulled By: maysamyabandeh
fbshipit-source-id: 81cbf12c4e662541c772c7265a8f91631e25c7cd
Summary:
Somehow the indentation was incorrect in this file.
The only change in this PR is to get it right again in order to make the code more readable.
Please reject if you think it's not worth it.
Closes https://github.com/facebook/rocksdb/pull/3504
Differential Revision: D6996011
Pulled By: miasantreble
fbshipit-source-id: 060514a3a8c910d34bad795b36eb4d278512b154
Summary:
This patch takes advantage of memtable being able to detect duplicate <key,seq> and returning TryAgain to handle duplicate keys in WritePrepared Txns. Through WriteBatchWithIndex's index it detects existence of at least a duplicate key in the write batch. If duplicate key was reported, it then pays the cost of counting the number of sub-patches by iterating over the write batch and pass it to DBImpl::Write. DB will make use of the provided batch_count to assign proper sequence numbers before sending them to the WAL. When later inserting the batch to the memtable, it increases the seq each time memtbale reports a duplicate (a sub-patch in our counting) and tries again.
Closes https://github.com/facebook/rocksdb/pull/3455
Differential Revision: D6873699
Pulled By: maysamyabandeh
fbshipit-source-id: db8487526c3a5dc1ddda0ea49f0f979b26ae648d
Summary:
Add tests to ensure that WritePrepared and WriteCommitted policies are cross compatible when the db WAL is empty. This is important when the admin want to switch between the policies. In such case, before the switch the admin needs to empty the WAL by i) committing/rollbacking all the pending transactions, ii) FlushMemTables
Closes https://github.com/facebook/rocksdb/pull/3118
Differential Revision: D6227247
Pulled By: maysamyabandeh
fbshipit-source-id: bcde3d92c1e89cda3b9cfa69f6a20af5d8993db7
Summary:
Add per-exe execution capability
Add fix parsing of groups/tests
Add timer test exclusion
Fix unit tests
Ifdef threadpool specific tests that do not pass on Vista threadpool.
Remove spurious outout from prefix_test so test case listing works
properly.
Fix not using standard test directories results in file creation errors
in sst_dump_test.
BlobDb fixes:
In C++ end() iterators can not be dereferenced. They are not valid.
When deleting blob_db_ set it to nullptr before any other code executes.
Not fixed:. On Windows you can not delete a file while it is open.
[ RUN ] BlobDBTest.ReadWhileGC
d:\dev\rocksdb\rocksdb\utilities\blob_db\blob_db_test.cc(75): error: DestroyBlobDB(dbname_, options, bdb_options)
IO error: Failed to delete: d:/mnt/db\testrocksdb-17444/blob_db_test/blob_dir/000001.blob: Permission denied
d:\dev\rocksdb\rocksdb\utilities\blob_db\blob_db_test.cc(75): error: DestroyBlobDB(dbname_, options, bdb_options)
IO error: Failed to delete: d:/mnt/db\testrocksdb-17444/blob_db_test/blob_dir/000001.blob: Permission denied
write_batch
Should not call front() if there is a chance the container is empty
Closes https://github.com/facebook/rocksdb/pull/3152
Differential Revision: D6293274
Pulled By: sagar0
fbshipit-source-id: 318c3717c22087fae13b18715dffb24565dbd956
Summary:
The collapse of duplicate keys in write batch needs to sort the indexes of duplicate keys since it only checks the index in the batch with the head of the list of duplicate keys.
Closes https://github.com/facebook/rocksdb/pull/3093
Differential Revision: D6186800
Pulled By: maysamyabandeh
fbshipit-source-id: abc9ae8c2f1840445a5584f925cf86ecc6f37154
Summary:
MSVC does not support unused attribute at this time. A separate assignment line fixes the issue probably by being counted as usage for MSVC and it no longer complains about unused var.
Closes https://github.com/facebook/rocksdb/pull/3048
Differential Revision: D6126272
Pulled By: maysamyabandeh
fbshipit-source-id: 4907865db45fd75a39a15725c0695aaa17509c1f
Summary:
With WriteCommitted, when the write batch has duplicate keys, the txn db simply inserts them to the db with different seq numbers and let the db ignore/merge the duplicate values at the read time. With WritePrepared all the entries of the batch are inserted with the same seq number which prevents us from benefiting from this simple solution.
This patch applies a hackish solution to unblock the end-to-end testing. The hack is to be replaced with a proper solution soon. The patch simply detects the duplicate key insertions, and mark the previous one as obsolete. Then before writing to the db it rewrites the batch eliminating the obsolete keys. This would incur a memcpy cost. Furthermore handing duplicate merge would require to do FullMerge instead of simply ignoring the previous value, which is not handled by this patch.
Closes https://github.com/facebook/rocksdb/pull/2969
Differential Revision: D5976337
Pulled By: maysamyabandeh
fbshipit-source-id: 114e65b66f137d8454ff2d1d782b8c05da95f989
Summary:
This patch instruments the read path to verify each read value against an optional ReadCallback class. If the value is rejected, the reader moves on to the next value. The WritePreparedTxn makes use of this feature to skip sequence numbers that are not in the read snapshot.
Closes https://github.com/facebook/rocksdb/pull/2850
Differential Revision: D5787375
Pulled By: maysamyabandeh
fbshipit-source-id: 49d808b3062ab35e7ae98ad388f659757794184c
Summary:
The ::Get from DB is not augmented with an overload method that takes a PinnableSlice instead of a string. Transactions however are not yet upgraded to use the new API. As a result, transaction users such as MyRocks cannot benefit from it. This patch updates the transactional API with a PinnableSlice overload.
Closes https://github.com/facebook/rocksdb/pull/2736
Differential Revision: D5645770
Pulled By: maysamyabandeh
fbshipit-source-id: f6af520df902f842de1bcf99bed3e8dfc43ad96d