Summary:
Cockroachdb exposed this bug in #1778. The bug happens when a compaction's output files are ended due to exceeding max_compaction_bytes. In that case we weren't taking into account the next file's start key when deciding how far to extend the current file's max_key. This caused the non-overlapping key-range invariant to be violated.
Note this was correctly handled for the usual case of cutting compaction output, which is file size exceeding max_output_file_size. I am not sure why these are two separate code paths, but we can consider refactoring it to prevent such errors in the future.
Closes https://github.com/facebook/rocksdb/pull/1784
Differential Revision: D4430235
Pulled By: ajkr
fbshipit-source-id: 80af748
Summary:
DB shutdown aborts running compactions by setting an atomic shutting_down=true that CompactionJob periodically checks. Without this PR it checks it before processing every _output_ value. If compaction filter filters everything out, the compaction is uninterruptible. This PR adds checks for shutting_down on every _input_ value (in CompactionIterator and MergeHelper).
There's also some minor code cleanup along the way.
Closes https://github.com/facebook/rocksdb/pull/1639
Differential Revision: D4306571
Pulled By: yiwu-arbug
fbshipit-source-id: f050890
Summary:
seems it's expensive to check status since the underlying merge iterator checks status of all its children. so only do it when it's really necessary to get the status before invoking Next(), i.e., when we're advancing to get the first key in the next file.
Closes https://github.com/facebook/rocksdb/pull/1691
Differential Revision: D4343446
Pulled By: siying
fbshipit-source-id: 70ab315
Summary:
- "rocksdb.compaction.key.drop.range_del" - number of keys dropped during compaction due to a range tombstone covering them
- "rocksdb.compaction.range_del.drop.obsolete" - number of range tombstones dropped due to compaction to bottom level and no snapshot saving them
- s/CompactionIteratorStats/CompactionIterationStats/g since this class is no longer specific to CompactionIterator -- it's also updated for range tombstone iteration during compaction
- Move the above class into a separate .h file to avoid circular dependency.
Closes https://github.com/facebook/rocksdb/pull/1520
Differential Revision: D4187179
Pulled By: ajkr
fbshipit-source-id: 10c2103
Summary:
Previously we used TableCache::NewIterator() for multiple purposes (data
block iterator and range deletion iterator), and returned non-ok status in
the data block iterator. In one case where the caller only used the range
deletion block iterator (9e7cf3469b/db/version_set.cc (L965-L973)),
we didn't check/free the data block iterator containing non-ok status, which
caused a valgrind error.
So, this diff decouples creation of data block and range deletion block iterators,
and updates the callers accordingly. Both functions can return non-ok status
in an InternalIterator. Since the non-ok status is returned in an iterator that the
callers will definitely use, it should be more usable/less error-prone.
Closes https://github.com/facebook/rocksdb/pull/1513
Differential Revision: D4181423
Pulled By: ajkr
fbshipit-source-id: 835b8f5
Summary:
Adjusted AddToBuilder() to take lower_bound and upper_bound, which serve two purposes: (1) only range deletions overlapping with the interval [lower_bound, upper_bound) will be added to the output file, and (2) the output file's boundaries will not be extended before lower_bound or after upper_bound. Our computation of lower_bound/upper_bound consider both subcompaction boundaries and previous/next files within the subcompaction.
Test cases are here (level subcompactions: https://gist.github.com/ajkr/63c7eae3e9667c5ebdc0a7efb74ac332, and universal subcompactions: https://gist.github.com/ajkr/5a62af77c4ebe4052a1955c496d51fdb) but can't be included in this diff as they depend on committing the API first. They fail before this change and pass after.
Closes https://github.com/facebook/rocksdb/pull/1501
Reviewed By: yhchiang
Differential Revision: D4171685
Pulled By: ajkr
fbshipit-source-id: ee99db8
Summary:
This conditional should only open a new file that's dedicated to range deletions when it's the sole output of the subcompaction. Previously, we created such a file whenever the table builder was nullptr, which would've also been the case whenever the CompactionIterator's final key coincided with the final output table becoming full.
Closes https://github.com/facebook/rocksdb/pull/1507
Differential Revision: D4174613
Pulled By: ajkr
fbshipit-source-id: 9ffacea
Summary: we should not split file on level 0 in compaction because it will fail the following verification of seqno order on level 0
Test Plan: check with filldeterministic in db_bench
Reviewers: yhchiang, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D65193
Summary:
This diff introduces RangeDelAggregator, which takes ownership of iterators
provided to it via AddTombstones(). The tombstones are organized in a two-level
map (snapshot stripe -> begin key -> tombstone). Tombstone creation avoids data
copy by holding Slices returned by the iterator, which remain valid thanks to pinning.
For compaction, we create a hierarchical range tombstone iterator with structure
matching the iterator over compaction input data. An aggregator based on that
iterator is used by CompactionIterator to determine which keys are covered by
range tombstones. In case of merge operand, the same aggregator is used by
MergeHelper. Upon finishing each file in the compaction, relevant range tombstones
are added to the output file's range tombstone metablock and file boundaries are
updated accordingly.
To check whether a key is covered by range tombstone, RangeDelAggregator::ShouldDelete()
considers tombstones in the key's snapshot stripe. When this function is used outside of
compaction, it also checks newer stripes, which can contain covering tombstones. Currently
the intra-stripe check involves a linear scan; however, in the future we plan to collapse ranges
within a stripe such that binary search can be used.
RangeDelAggregator::AddToBuilder() adds all range tombstones in the table's key-range
to a new table's range tombstone meta-block. Since range tombstones may fall in the gap
between files, we may need to extend some files' key-ranges. The strategy is (1) first file
extends as far left as possible and other files do not extend left, (2) all files extend right
until either the start of the next file or the end of the last range tombstone in the gap,
whichever comes first.
One other notable change is adding release/move semantics to ScopedArenaIterator
such that it can be used to transfer ownership of an arena-allocated iterator, similar to
how unique_ptr is used for malloc'd data.
Depends on D61473
Test Plan: compaction_iterator_test, mock_table, end-to-end tests in D63927
Reviewers: sdong, IslamAbdelRahman, wanning, yhchiang, lightmark
Reviewed By: lightmark
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62205
Summary: we should not call ShouldStopBefore() in compaction when the compaction targets level 0. Otherwise, CheckConsistency will fail the assertion of seq number check on level 0.
Test Plan:
make all check -j64
I also manully test that using db_bench to compact files to level 0. Without this line change, the assertion files and multiple files are generated on level 0 after compaction.
Reviewers: yhchiang, andrewkr, yiwu, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D64269
Summary: Use ImmutableDBOptions/MutableDBOptions internally and DBOptions only for user-facing APIs. MutableDBOptions is barely a placeholder for now. I'll start to move options to MutableDBOptions in following diffs.
Test Plan:
make all check
Reviewers: yhchiang, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D64065
Summary: One more small refactor before I split DBOptions into mutable and immutable parts.
Test Plan: existing unit tests.
Reviewers: yhchiang, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D64047
Summary: The new variable introduced in 2149059f910149197d1a0f79ac08cf19465ea2d may be unitialized. Valgrind is failing because of it.
Test Plan: Run valgrind tests
Reviewers: yiwu, andrewkr, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D63201
Summary: To reduce number of options, merge source_compaction_factor, max_grandparent_overlap_bytes and expanded_compaction_factor into max_compaction_bytes.
Test Plan: Add two new unit tests. Run all existing tests, including jtest.
Reviewers: yhchiang, igor, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59829
Summary:
Added 2 statistics in compaction job statistics, to
identify if single deletes are not meeting a matching key
(fallthrough) or single deletes are meeting a merge, delete or
another single delete (i.e. not the expected case of put).
Test Plan: Tested the statistics using write_stress and compaction_job_stats_test
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61749
Summary: GetCurrentMutableCFOptions() can only be called when DB mutex is held so we cannot call it in CompactionJob::ProcessKeyValueCompaction() since it's not holding the db mutex
Test Plan: make check -j64
Reviewers: sdong, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D57471
Summary: Added EventListener::OnTableFileCreationStarted. EventListener::OnTableFileCreated will be called on failure case. User can check creation status via TableFileCreationInfo::status.
Test Plan: unit test.
Reviewers: dhruba, yhchiang, ott, sdong
Reviewed By: sdong
Subscribers: sdong, kradhakrishnan, IslamAbdelRahman, andrewkr, yhchiang, leveldb, ott, dhruba
Differential Revision: https://reviews.facebook.net/D56337
Summary:
This adds a new metablock containing a shared dictionary that is used
to compress all data blocks in the SST file. The size of the shared dictionary
is configurable in CompressionOptions and defaults to 0. It's currently only
used for zlib/lz4/lz4hc, but the block will be stored in the SST regardless of
the compression type if the user chooses a nonzero dictionary size.
During compaction, computes the dictionary by randomly sampling the first
output file in each subcompaction. It pre-computes the intervals to sample
by assuming the output file will have the maximum allowable length. In case
the file is smaller, some of the pre-computed sampling intervals can be beyond
end-of-file, in which case we skip over those samples and the dictionary will
be a bit smaller. After the dictionary is generated using the first file in a
subcompaction, it is loaded into the compression library before writing each
block in each subsequent file of that subcompaction.
On the read path, gets the dictionary from the metablock, if it exists. Then,
loads that dictionary into the compression library before reading each block.
Test Plan: new unit test
Reviewers: yhchiang, IslamAbdelRahman, cyan, sdong
Reviewed By: sdong
Subscribers: andrewkr, yoshinorim, kradhakrishnan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D52287
Summary:
In some case, it is possible to have two concesutive SST files might sharing
same boundary keys. However, in the assertion in Compaction::ShouldStopBefore,
it exclude such possibility.
This patch fix this issue by relaxing the assertion to allow the equal case.
Test Plan: rocksdb tests
Reviewers: IslamAbdelRahman, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D55875
Summary:
Added the column family name to the properties block. This property
is omitted only if the property is unavailable, such as when RepairDB()
writes SST files.
In a next diff, I will change RepairDB to use this new property for
deciding to which column family an existing SST file belongs. If this
property is missing, it will add it to the "unknown" column family (same
as its existing behavior).
Test Plan:
New unit test:
$ ./db_table_properties_test --gtest_filter=DBTablePropertiesTest.GetColumnFamilyNameProperty
Reviewers: IslamAbdelRahman, yhchiang, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D55605
Summary:
When subcompaction is used, all subcompactions share the same Compaction
pointer in CompactionJob while each subcompaction all keeps their mutable
stats in SubcompactionState. However, there're still some mutable part
that is currently store in the shared Compaction pointer.
This patch makes two changes:
1. Make the shared Compaction pointer const so that it can never be modified
during the compaction.
2. Move necessary states from Compaction to SubcompactionState.
3. Make functions of Compaction const if the function does not modify
its internal state.
Test Plan: rocksdb and MyRocks test
Reviewers: sdong, kradhakrishnan, andrewkr, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba, yoshinorim, gunnarku, leveldb
Differential Revision: https://reviews.facebook.net/D55923
Summary: Now we skip to add boundary keys to subcompaction candidates since we see an empty level. This makes subcompaction almost disabled for universal compaction. We should consider all files instead.
Test Plan: Run existing tests.
Reviewers: IslamAbdelRahman, andrewkr, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D55005
Summary:
Introude SstFileManager::SetMaxAllowedSpaceUsage() that can be used to limit the maximum space usage allowed for RocksDB.
When this limit is exceeded WriteImpl() will fail and return Status::Aborted()
Test Plan: unit testing
Reviewers: yhchiang, anthony, andrewkr, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D53763
Summary:
Add a new class SstFileTracker that will be notified whenever a DB add/delete/move and sst file, it will also replace DeleteScheduler
SstFileTracker can be used later to abort writes when we exceed a specific size
Test Plan: unit tests
Reviewers: rven, anthony, yhchiang, sdong
Reviewed By: sdong
Subscribers: igor, lovro, march, dhruba
Differential Revision: https://reviews.facebook.net/D50469
Summary:
We can avoid the dependency by forward-declaring ColumnFamilyData and
then treating it as a black box. That means callers of ThreadStatusUtil need to
explicitly provide more options, even if they can be derived from the
ColumnFamilyData, since ThreadStatusUtil doesn't include the definition.
This is part of a series of diffs to eliminate circular dependencies between
directories (e.g., db/* files depending on util/* files and vice-versa).
Test Plan:
$ ./db_test --gtest_filter=DBTest.GetThreadStatus
$ make -j32 commit-prereq
Reviewers: sdong, yhchiang, IslamAbdelRahman
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D53361
Summary:
D51183 was reverted due to breaking the LITE build.
This diff is the same as D51183 but with a fix for the LITE BUILD(D51693)
Test Plan: run all unit tests
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D51711
Summary:
D50475 enables using SST files for transaction write-conflict checking. In order for this to work, we need to make sure not to compact out SingleDeletes when there is an earlier transaction snapshot(D50295). If there is a long-held snapshot, this could reduce the benefit of the SingleDelete optimization.
This diff allows Transactions to mark snapshots as being used for write-conflict checking. Then, during compaction, we will be able to optimize SingleDeletes better in the future.
This diff adds a flag to SnapshotImpl which is used by Transactions. This diff also passes the earliest write-conflict snapshot's sequence number to CompactionIterator. This diff does not actually change Compaction (after this diff is pushed, D50295 will be able to use this information).
Test Plan: no behavior change, ran existing tests
Reviewers: rven, kradhakrishnan, yhchiang, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D51183
Summary: We don't yet have a CI build for iOS, so our iOS compile gets broken sometimes. Most of the errors are from assumption that size_t is 64-bit, while it's actually 32-bit on some (all?) iOS platforms. This diff fixes the compile.
Test Plan:
TARGET_OS=IOS make static_lib
Observe there are no warnings
Reviewers: sdong, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D49029
Summary:
Add kill points in:
1. after creating a file
2. before writing a manifest record
3. before syncing manifest
4. before creating a new current file
5. after creating a new current file
Test Plan: Run all current tests.
Reviewers: yhchiang, igor, anthony, IslamAbdelRahman, rven, kradhakrishnan
Reviewed By: kradhakrishnan
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D48855
Summary:
Separate a new class InternalIterator from class Iterator, when the look-up is done internally, which also means they operate on key with sequence ID and type.
This change will enable potential future optimizations but for now InternalIterator's functions are still the same as Iterator's.
At the same time, separate the cleanup function to a separate class and let both of InternalIterator and Iterator inherit from it.
Test Plan: Run all existing tests.
Reviewers: igor, yhchiang, anthony, kradhakrishnan, IslamAbdelRahman, rven
Reviewed By: rven
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D48549
Summary: It would be nice to have and access to table properties in compaction callbacks. In MyRocks project, it will make possible to update optimizer statistics online.
Test Plan: ran the unit test. Ran myrocks with the new way of collecting stats.
Reviewers: igor, rven, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D48267
Summary: Pass column family ID through TablePropertiesCollectorFactory::CreateTablePropertiesCollector() so that users can identify which column family this file is for and handle it differently.
Test Plan: Add unit test scenarios in tests related to table properties collectors to verify the information passed in is correct.
Reviewers: rven, yhchiang, anthony, kradhakrishnan, igor, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: yoshinorim, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D48411
Summary:
Since Andres' internship is over, I took over https://reviews.facebook.net/D42555 and rebased and simplified it a bit.
The behavior in this diff is a bit simpler than in D42555:
* only merge operators are passed through FilterMergeValue(). If fitler function returns true, the merge operator is ignored
* compaction filter is *not* called on: 1) results of merge operations and 2) base values that are getting merged with merge operands (the second case was also true in previous diff)
Do we also need a compaction filter to get called on merge results?
Test Plan: make && make check
Reviewers: lovro, tnovak, rven, yhchiang, sdong
Reviewed By: sdong
Subscribers: noetzli, kolmike, leveldb, dhruba, sdong
Differential Revision: https://reviews.facebook.net/D47847
Summary:
Although compaction time is recorded in the statistics,
it is helpful to include this value in the log output corresponding
to the end of compaction.
Test Plan: make all && make check
Reviewers: yhchiang, sdong, igor, noetzli, MarkCallaghan
Reviewed By: MarkCallaghan
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D47007
Summary:
Although there are currently counters to keep track of the
stall caused by having too many L0 files, there is no distinction as
to whether when that stall occurs either (A) L0-L1 compaction is taking
place to try and mitigate it, or (B) no L0-L1 compaction has been scheduled
at the moment. This diff adds a counter for (A) so that the nature of L0
stalls can be better understood.
Test Plan: make all && make check
Reviewers: sdong, igor, anthony, noetzli, yhchiang
Reviewed By: yhchiang
Subscribers: MarkCallaghan, dhruba
Differential Revision: https://reviews.facebook.net/D46749
Summary:
Builder and CompactionJob share a lot of fairly complex code. This patch
refactors this code into a separate class, the CompactionIterator. Because the
shared code is fairly complex, this patch hopefully improves maintainability.
While there are is a lot of potential for further improvements, the patch is
intentionally pretty close to the original structure because the change is
already complex enough.
Test Plan: make clean all check && ./db_stress
Reviewers: rven, anthony, yhchiang, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D46197
Summary:
Up to this point, the subcompactions that make up a compaction
job have been divided based on the key range of the L1 files, and each
subcompaction has handled the key range of only one file. However
DBOption.max_subcompactions allows the user to designate how many
subcompactions at most to perform. This patch updates the
CompactionJob::GetSubcompactionBoundaries() to determine these
divisions accordingly based on that option and other input/system factors.
The current approach orders the starting and/or ending keys of certain
compaction input files and then generates a histogram to approximate the
size covered by the key range between each consecutive pair of keys. Then
it groups these ranges into groups so that the sizes are approximately equal
to one another. The approach has also been adapted to work for universal
compaction as well instead of just for level-based compaction as it was before.
These subcompactions are then executed in parallel by locally spawning
threads, one for each. The results are then aggregated and the compaction
completed.
Test Plan: make all && make check
Reviewers: yhchiang, anthony, igor, noetzli, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43269