Summary:
First cut for early review; there are few conceptual points to answer and some code structure issues.
For conceptual points -
- restriction-wise, we're going to disallow ingest_behind if (use_seqno_zero_out=true || disable_auto_compaction=false), the user is responsible to properly open and close DB with required params
- we wanted to ingest into reserved bottom most level. Should we fail fast if bottom level isn't empty, or should we attempt to ingest if file fits there key-ranges-wise?
- Modifying AssignLevelForIngestedFile seems the place we we'd handle that.
On code structure - going to refactor GenerateAndAddExternalFile call in the test class to allow passing instance of IngestionOptions, that's just going to incur lots of changes at callsites.
Closes https://github.com/facebook/rocksdb/pull/2144
Differential Revision: D4873732
Pulled By: lightmark
fbshipit-source-id: 81cb698106b68ef8797f564453651d50900e153a
Summary:
Remove double buffering on RandomRead on Windows.
With more logic appear in file reader/write Read no longer
obeys forwarding calls to Windows implementation.
Previously direct_io (unbuffered) was only available on Windows
but now is supported as generic.
We remove intermediate buffering on Windows.
Remove random_access_max_buffer_size option which was windows specific.
Non-zero values for that opton introduced unnecessary lock contention.
Remove Env::EnableReadAhead(), Env::ShouldForwardRawRequest() that are
no longer necessary.
Add aligned buffer reads for cases when requested reads exceed read ahead size.
Closes https://github.com/facebook/rocksdb/pull/2105
Differential Revision: D4847770
Pulled By: siying
fbshipit-source-id: 8ab48f8e854ab498a4fd398a6934859792a2788f
Summary:
It is a potential bug that will be triggered if we ingest files before inserting the first key into an empty db.
0 is a special value reserved to indicate the concept of non-existence. But not good for seqno in this case because 0 is a valid seqno for ingestion(bulk loading)
Closes https://github.com/facebook/rocksdb/pull/2183
Differential Revision: D4919827
Pulled By: lightmark
fbshipit-source-id: 237eea40f88bd6487b66806109d90065dc02c362
Summary:
to void future bug that caused by the mix of userkey/internalkey
Closes https://github.com/facebook/rocksdb/pull/2084
Differential Revision: D4825889
Pulled By: lightmark
fbshipit-source-id: 28411db
Summary:
DB shutdown aborts running compactions by setting an atomic shutting_down=true that CompactionJob periodically checks. Without this PR it checks it before processing every _output_ value. If compaction filter filters everything out, the compaction is uninterruptible. This PR adds checks for shutting_down on every _input_ value (in CompactionIterator and MergeHelper).
There's also some minor code cleanup along the way.
Closes https://github.com/facebook/rocksdb/pull/1639
Differential Revision: D4306571
Pulled By: yiwu-arbug
fbshipit-source-id: f050890
Summary:
When deletion-collapsing mode is enabled (i.e., for DBIter/CompactionIterator), we maintain position in the tombstone maps across calls to ShouldDelete(). Since iterators often access keys sequentially (or reverse-sequentially), scanning forward/backward from the last position can be faster than binary-searching the map for every key.
- When Next() is invoked on an iterator, we use kForwardTraversal to scan forwards, if needed, until arriving at the range deletion containing the next key.
- Similarly for Prev(), we use kBackwardTraversal to scan backwards in the range deletion map.
- When the iterator seeks, we use kBinarySearch for repositioning
- After tombstones are added or before the first ShouldDelete() invocation, the current position is set to invalid, which forces kBinarySearch to be used.
- Non-iterator users (i.e., Get()) use kFullScan, which has the same behavior as before---scan the whole map for every key passed to ShouldDelete().
Closes https://github.com/facebook/rocksdb/pull/1701
Differential Revision: D4350318
Pulled By: ajkr
fbshipit-source-id: 5129b76
Summary:
Embarassingly enough, the first time I tried to use my new feature in logdevice it crashed with this assertion failure:
db/pinned_iterators_manager.h:30: void rocksdb::PinnedIteratorsManager::StartPinning(): Assertion `pinning_enabled == false' failed
The issue was that `pinned_iters_mgr_.StartPinning()` was called but `pinned_iters_mgr_.ReleasePinnedData()` wasn't.
Closes https://github.com/facebook/rocksdb/pull/1611
Differential Revision: D4265622
Pulled By: al13n321
fbshipit-source-id: 747b10f
Summary:
This adds the ability for compaction filter to say "drop this key-value, and also drop everything up to key x". This will cause the compaction to seek input iterator to x, without reading the data. This can make compaction much faster when large consecutive chunks of data are filtered out. See the changes in include/rocksdb/compaction_filter.h for the new API.
Along the way this diff also adds ability for compaction filter changing merge operands, similar to how it can change values; we're not going to use this feature, it just seemed easier and cleaner to implement it than to document that it's not implemented :)
The diff is not as big as it may seem, about half of the lines are a test.
Closes https://github.com/facebook/rocksdb/pull/1599
Differential Revision: D4252092
Pulled By: al13n321
fbshipit-source-id: 41e1e48
Summary:
Both the single deletion and the value are included in compaction outputs, so no need to update the stat for the value's deletion yet, otherwise it'd be double-counted.
Closes https://github.com/facebook/rocksdb/pull/1574
Differential Revision: D4241181
Pulled By: ajkr
fbshipit-source-id: c9aaa15
Summary:
- "rocksdb.compaction.key.drop.range_del" - number of keys dropped during compaction due to a range tombstone covering them
- "rocksdb.compaction.range_del.drop.obsolete" - number of range tombstones dropped due to compaction to bottom level and no snapshot saving them
- s/CompactionIteratorStats/CompactionIterationStats/g since this class is no longer specific to CompactionIterator -- it's also updated for range tombstone iteration during compaction
- Move the above class into a separate .h file to avoid circular dependency.
Closes https://github.com/facebook/rocksdb/pull/1520
Differential Revision: D4187179
Pulled By: ajkr
fbshipit-source-id: 10c2103
Summary:
Note: reviewed in https://reviews.facebook.net/D65115
- DBIter maintains a range tombstone accumulator. We don't cleanup obsolete tombstones yet, so if the user seeks back and forth, the same tombstones would be added to the accumulator multiple times.
- DBImpl::NewInternalIterator() (used to make DBIter's underlying iterator) adds memtable/L0 range tombstones, L1+ range tombstones are added on-demand during NewSecondaryIterator() (see D62205)
- DBIter uses ShouldDelete() when advancing to check whether keys are covered by range tombstones
Closes https://github.com/facebook/rocksdb/pull/1464
Differential Revision: D4131753
Pulled By: ajkr
fbshipit-source-id: be86559
Summary:
This diff introduces RangeDelAggregator, which takes ownership of iterators
provided to it via AddTombstones(). The tombstones are organized in a two-level
map (snapshot stripe -> begin key -> tombstone). Tombstone creation avoids data
copy by holding Slices returned by the iterator, which remain valid thanks to pinning.
For compaction, we create a hierarchical range tombstone iterator with structure
matching the iterator over compaction input data. An aggregator based on that
iterator is used by CompactionIterator to determine which keys are covered by
range tombstones. In case of merge operand, the same aggregator is used by
MergeHelper. Upon finishing each file in the compaction, relevant range tombstones
are added to the output file's range tombstone metablock and file boundaries are
updated accordingly.
To check whether a key is covered by range tombstone, RangeDelAggregator::ShouldDelete()
considers tombstones in the key's snapshot stripe. When this function is used outside of
compaction, it also checks newer stripes, which can contain covering tombstones. Currently
the intra-stripe check involves a linear scan; however, in the future we plan to collapse ranges
within a stripe such that binary search can be used.
RangeDelAggregator::AddToBuilder() adds all range tombstones in the table's key-range
to a new table's range tombstone meta-block. Since range tombstones may fall in the gap
between files, we may need to extend some files' key-ranges. The strategy is (1) first file
extends as far left as possible and other files do not extend left, (2) all files extend right
until either the start of the next file or the end of the last range tombstone in the gap,
whichever comes first.
One other notable change is adding release/move semantics to ScopedArenaIterator
such that it can be used to transfer ownership of an arena-allocated iterator, similar to
how unique_ptr is used for malloc'd data.
Depends on D61473
Test Plan: compaction_iterator_test, mock_table, end-to-end tests in D63927
Reviewers: sdong, IslamAbdelRahman, wanning, yhchiang, lightmark
Reviewed By: lightmark
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62205
Summary:
Issue scenario:
(1) We have 3 files in L1 and we issue a compaction that will compact them into 1 file in L2
(2) While compaction (1) is running, we flush a file into L0 and trigger another compaction that decide to move this file to L1 and then move it again to L2 (this file don't overlap with any other files)
(3) compaction (1) finishes and install the file it generated in L2, but this file overlap with the file we generated in (2) so we break the LSM consistency
Looks like this issue can be triggered by using non-exclusive manual compaction or AddFile()
Test Plan: unit tests
Reviewers: sdong
Reviewed By: sdong
Subscribers: hermanlee4, jkedgar, andrewkr, dhruba, yoshinorim
Differential Revision: https://reviews.facebook.net/D64947
Summary:
Added 2 statistics in compaction job statistics, to
identify if single deletes are not meeting a matching key
(fallthrough) or single deletes are meeting a merge, delete or
another single delete (i.e. not the expected case of put).
Test Plan: Tested the statistics using write_stress and compaction_job_stats_test
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61749
Summary:
This diff include these simple change
- Rename ReleasePinnedIterators to ReleasePinnedData
- Rename PinIteratorIfNeeded to PinIterator
- Use std::vector directly in PinnedIteratorsManager instead of std::unique_ptr<std::vector>
- Generalize PinnedIteratorsManager by adding PinPtr which can pin any pointer
Test Plan: existing tests
Reviewers: sdong, yiwu, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61305
Summary:
Solution is not to change db sequence number to start from 1 because 0 value is used in multiple other places.
Fix covers only compact_iterator::findEarliestVisibleSnapshot with updated logic to support snapshot's numbering starting from 0.
Test Plan:
run:
make all check
it should pass all tests
Reviewers: IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: lgalanis, mgalushka, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56601
Summary:
When subcompaction is used, all subcompactions share the same Compaction
pointer in CompactionJob while each subcompaction all keeps their mutable
stats in SubcompactionState. However, there're still some mutable part
that is currently store in the shared Compaction pointer.
This patch makes two changes:
1. Make the shared Compaction pointer const so that it can never be modified
during the compaction.
2. Move necessary states from Compaction to SubcompactionState.
3. Make functions of Compaction const if the function does not modify
its internal state.
Test Plan: rocksdb and MyRocks test
Reviewers: sdong, kradhakrishnan, andrewkr, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba, yoshinorim, gunnarku, leveldb
Differential Revision: https://reviews.facebook.net/D55923
Summary:
Currently, transactions can fail even if there is no actual write conflict. This is due to relying on only the memtables to check for write-conflicts. Users have to tune memtable settings to try to avoid this, but it's hard to figure out exactly how to tune these settings.
With this diff, TransactionDB will use both memtables and SST files to determine if there are any write conflicts. This relies on the fact that BlockBasedTable stores sequence numbers for all writes that happen after any open snapshot. Also, D50295 is needed to prevent SingleDelete from disappearing writes (the TODOs in this test code will be fixed once the other diff is approved and merged).
Note that Optimistic transactions will still rely on tuning memtable settings as we do not want to read from SST while on the write thread. Also, memtable settings can still be used to reduce how often TransactionDB needs to read SST files.
Test Plan: unit tests, db bench
Reviewers: rven, yhchiang, kradhakrishnan, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb, yoshinorim
Differential Revision: https://reviews.facebook.net/D50475
Summary: For Transactions, we want to start using the SST files to do write conflict checking. To do this, we need to make sure that compaction never removes all writes if an earlier snapshot exists. So I had to change the way we process SingleDeletes to sometimes leave a SingleDelete behind when we encounter a Put followed by a SingleDelete. See the comments in this diff for a more detailed explanation.
Test Plan: added more unit tests
Reviewers: rven, igor, kradhakrishnan, IslamAbdelRahman, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D50295
Summary:
D51183 was reverted due to breaking the LITE build.
This diff is the same as D51183 but with a fix for the LITE BUILD(D51693)
Test Plan: run all unit tests
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D51711
Summary:
D50475 enables using SST files for transaction write-conflict checking. In order for this to work, we need to make sure not to compact out SingleDeletes when there is an earlier transaction snapshot(D50295). If there is a long-held snapshot, this could reduce the benefit of the SingleDelete optimization.
This diff allows Transactions to mark snapshots as being used for write-conflict checking. Then, during compaction, we will be able to optimize SingleDeletes better in the future.
This diff adds a flag to SnapshotImpl which is used by Transactions. This diff also passes the earliest write-conflict snapshot's sequence number to CompactionIterator. This diff does not actually change Compaction (after this diff is pushed, D50295 will be able to use this information).
Test Plan: no behavior change, ran existing tests
Reviewers: rven, kradhakrishnan, yhchiang, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D51183
Summary:
Provide an API for compaction filter to specify that it needs
to be applied even if there are snapshots.
Test Plan: DBTestCompactionFilter.CompactionFilterIgnoreSnapshot
Reviewers: yhchiang, IslamAbdelRahman, sdong, anthony
Reviewed By: anthony
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D51087
Summary:
Separate a new class InternalIterator from class Iterator, when the look-up is done internally, which also means they operate on key with sequence ID and type.
This change will enable potential future optimizations but for now InternalIterator's functions are still the same as Iterator's.
At the same time, separate the cleanup function to a separate class and let both of InternalIterator and Iterator inherit from it.
Test Plan: Run all existing tests.
Reviewers: igor, yhchiang, anthony, kradhakrishnan, IslamAbdelRahman, rven
Reviewed By: rven
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D48549
Summary:
Since Andres' internship is over, I took over https://reviews.facebook.net/D42555 and rebased and simplified it a bit.
The behavior in this diff is a bit simpler than in D42555:
* only merge operators are passed through FilterMergeValue(). If fitler function returns true, the merge operator is ignored
* compaction filter is *not* called on: 1) results of merge operations and 2) base values that are getting merged with merge operands (the second case was also true in previous diff)
Do we also need a compaction filter to get called on merge results?
Test Plan: make && make check
Reviewers: lovro, tnovak, rven, yhchiang, sdong
Reviewed By: sdong
Subscribers: noetzli, kolmike, leveldb, dhruba, sdong
Differential Revision: https://reviews.facebook.net/D47847
Summary:
This patch fixes#7460559. It introduces SingleDelete as a new database
operation. This operation can be used to delete keys that were never
overwritten (no put following another put of the same key). If an overwritten
key is single deleted the behavior is undefined. Single deletion of a
non-existent key has no effect but multiple consecutive single deletions are
not allowed (see limitations).
In contrast to the conventional Delete() operation, the deletion entry is
removed along with the value when the two are lined up in a compaction. Note:
The semantics are similar to @igor's prototype that allowed to have this
behavior on the granularity of a column family (
https://reviews.facebook.net/D42093 ). This new patch, however, is more
aggressive when it comes to removing tombstones: It removes the SingleDelete
together with the value whenever there is no snapshot between them while the
older patch only did this when the sequence number of the deletion was older
than the earliest snapshot.
Most of the complex additions are in the Compaction Iterator, all other changes
should be relatively straightforward. The patch also includes basic support for
single deletions in db_stress and db_bench.
Limitations:
- Not compatible with cuckoo hash tables
- Single deletions cannot be used in combination with merges and normal
deletions on the same key (other keys are not affected by this)
- Consecutive single deletions are currently not allowed (and older version of
this patch supported this so it could be resurrected if needed)
Test Plan: make all check
Reviewers: yhchiang, sdong, rven, anthony, yoshinorim, igor
Reviewed By: igor
Subscribers: maykov, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43179
Summary:
During the refactoring, the condition that makes sure that compaction
filters are only applied to records newer than the latest snapshot
got butchered. This patch fixes the condition and adds a test case.
Test Plan: make db_compaction_filter_test && ./db_compaction_filter_test
Reviewers: rven, anthony, yhchiang, sdong, aekmekji, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D46707
Summary:
The current build is failing on some platforms due to an __unused__ attribute.
This patch prevents the problem by using a pattern similar to MergeHelper
(assert not on the variable but inside a condition that uses the variable). We
should have better error handling in both cases in the future.
Test Plan: make clean all check
Reviewers: rven, anthony, yhchiang, sdong, igor, aekmekji
Reviewed By: aekmekji
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D46623
Summary:
Builder and CompactionJob share a lot of fairly complex code. This patch
refactors this code into a separate class, the CompactionIterator. Because the
shared code is fairly complex, this patch hopefully improves maintainability.
While there are is a lot of potential for further improvements, the patch is
intentionally pretty close to the original structure because the change is
already complex enough.
Test Plan: make clean all check && ./db_stress
Reviewers: rven, anthony, yhchiang, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D46197