Summary:
D50475 enables using SST files for transaction write-conflict checking. In order for this to work, we need to make sure not to compact out SingleDeletes when there is an earlier transaction snapshot(D50295). If there is a long-held snapshot, this could reduce the benefit of the SingleDelete optimization.
This diff allows Transactions to mark snapshots as being used for write-conflict checking. Then, during compaction, we will be able to optimize SingleDeletes better in the future.
This diff adds a flag to SnapshotImpl which is used by Transactions. This diff also passes the earliest write-conflict snapshot's sequence number to CompactionIterator. This diff does not actually change Compaction (after this diff is pushed, D50295 will be able to use this information).
Test Plan: no behavior change, ran existing tests
Reviewers: rven, kradhakrishnan, yhchiang, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D51183
Summary: DBTest.DynamicCompactionOptions ocasionally fails during valgrind run. We sent a sleeping task to block compaction thread pool but we don't wait it to run.
Test Plan: Run the test multiple times in an environment which can cause failure.
Reviewers: rven, kradhakrishnan, igor, IslamAbdelRahman, anthony, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D51687
Summary:
This patch fix a race condition in persisting options which will cause a crash when:
* Thread A obtain cf options and start to persist options based on that cf options.
* Thread B kicks in and finish DropColumnFamily and delete cf_handle.
* Thread A wakes up and tries to finish the persisting options and crashes.
Test Plan: Add a test in column_family_test that can reproduce the crash
Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D51609
Summary:
Several tests in db_compaction_test are failing with aborts in
valgrind. These are LevelCompactionThirdPath, LevelCompactionPathUse and
CompressLevelCompaction. We now use the SpecialSkipListFactory to make
them more deterministic
Test Plan: valgrind
Reviewers: anthony, yhchiang, kradhakrishnan, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D51663
Summary: After the skip list optimization, ColumnFamilyTest.DifferentWriteBufferSizes can occasionally fail with flush triggering of column family 3. Insert more data to it to make sure flush will trigger.
Test Plan: Run it multiple times with both of jemaloc on and off and see it always passes. (Without thd commit the run with jemalloc fails with chance of about one in two)
Reviewers: rven, yhchiang, IslamAbdelRahman, anthony, kradhakrishnan, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D51645
Summary:
db_universal_compaction_test is still failing because of
UniversalCompactionNumLevels/DBTestUniversalCompaction.UniversalCompactionSecondPathRatio/0
https://travis-ci.org/facebook/rocksdb/jobs/94949919
Use same approach to fix other tests to fix this test
Test Plan: Run ./db_universal_compaction_test on mac and make sure all the tests pass
Reviewers: kradhakrishnan, yhchiang, rven, anthony, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D51591
Summary: Skip list now cannot estimate memory across allocators
consistently and hence triggers flush at different time. This breaks certain
unit tests.
The fix is to adopt key count instead of size for flush.
Test Plan: Ran test on dev box and mac (where it used to fail)
Reviewers: sdong
CC: leveldb@
Task ID: #9273334
Blame Rev:
Summary:
Fixes T8781168.
Added a new function EnableAutoCompactions in db.h to be publicly
avialable. This allows compaction to be re-enabled after disabling it via
SetOptions
Refactored code to set the dbptr earlier on in TransactionDB::Open and DB::Open
Temporarily disable auto_compaction in TransactionDB::Open until dbptr is set to
prevent race condition.
Test Plan:
Ran make all check
verified fix on myrocks side:
was able to reproduce the seg fault with
../tools/mysqltest.sh --mem --force rocksdb.drop_table
method was to manually sleep the thread after DB::Open but before TransactionDB ptr was
assigned in transaction_db_impl.cc:
DB::Open(db_options, dbname, column_families_copy, handles, &db);
clock_t goal = (60000 * 10) + clock();
while (goal > clock());
...dbptr(aka rdb) gets assigned below
verified my changes fixed the issue.
Also added unit test 'ToggleAutoCompaction' in transaction_test.cc
Reviewers: hermanlee4, anthony
Reviewed By: anthony
Subscribers: alex, dhruba
Differential Revision: https://reviews.facebook.net/D51147
Summary: Verifiction condition of DBTest.SuggestCompactRangeTest is too strict. Based on key distribution, we might have more small files in last level. Not check number of files in the last level.
Test Plan: Run DBTest.SuggestCompactRangeTest with both of jemalloc on and off.
Reviewers: rven, IslamAbdelRahman, yhchiang, kradhakrishnan, igor, anthony
Reviewed By: anthony
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D51501
Summary: DBTest.DynamicCompactionOptions sometimes fails the assert but I can't repro it locally. Make it more deterministic and readable and see whether the problem is still there.
Test Plan: Run tht test and make sure it passes
Reviewers: kradhakrishnan, yhchiang, igor, rven, IslamAbdelRahman, anthony
Reviewed By: anthony
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D51309
Summary: DBCompactionTestWithParam.CompactionTrigger fails in non-jemalloc build, after the skip list memtable change. Fix it by making mem table flush trigger by number of entries.
Test Plan: Run the test using both of jemalloc and non-jemalloc build.
Reviewers: anthony, IslamAbdelRahman, rven, kradhakrishnan, igor, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D51471
Summary: With recent commit 33e0c93826, db iterator skips perf context counter internal_key_skipped_count when blindly issuing internal Next(). Now increment the counter by one when issuing this Next()
Test Plan: Run all existing tests
Reviewers: rven, yhchiang, IslamAbdelRahman, kradhakrishnan, igor, anthony
Reviewed By: anthony
Subscribers: yoshinorim, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D51465
Summary: DBTest.SuggestCompactRangeTest fails for the case when jemalloc is disabled, including ASAN and valgrind builds. It is caused by the improvement of skip list, which allocates different size of nodes for a new records. Fix it by using a special mem table that triggers a flush by number of entries. In that way the behavior will be consistent for all allocators.
Test Plan: Run the test with both of DISABLE_JEMALLOC=1 and 0
Reviewers: anthony, rven, yhchiang, kradhakrishnan, igor, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D51423
Summary: When option.db_write_buffer_size is hit, we currently flush all column families. Move to flush the column family with the largest active memt table instead. In this way, we can avoid too many small files in some cases.
Test Plan: Modify test DBTest.SharedWriteBuffer to work with the updated behavior
Reviewers: kradhakrishnan, yhchiang, rven, anthony, IslamAbdelRahman, igor
Reviewed By: igor
Subscribers: march, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D51291
Summary: Now DBIter::Next() always compares with current key with itself first, which is unnecessary if the last key is not a merge key. I made the change and didn't see db_iter_test fails. Want to hear whether people have any idea what I miss.
Test Plan: Run all unit tests
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D48279
Summary:
This diff completes the creation of InlineSkipList<Cmp>, which is like
SkipList<const char*, Cmp> but it always allocates the key contiguously
with the node. This allows us to remove the pointer from the node
to the key. As a result the memory usage of the skip list is reduced
(by 1 to sizeof(void*) bytes depending on the padding required to align
the key storage), cache locality is improved, and we halve the number
of calls to the allocator.
For skip lists whose keys are freshly-allocated const char*,
InlineSkipList is stricly preferrable to SkipList. This diff doesn't
replace SkipList, however, because some of the use cases of SkipList in
RocksDB are either character sequences that are not allocated at the
same time as the skip list node allocation (for example
hash_linklist_rep) or have different key types (for example
write_batch_with_index). Taking advantage of inline allocation for
those cases is left to future work.
The perf win is biggest for small values. For single-threaded CPU-bound
(32M fillrandom operations with no WAL log) with 16 byte keys and 0 byte
values, the db_bench perf goes from ~310k ops/sec to ~410k ops/sec. For
large values the improvement is less pronounced, but seems to be between
5% and 10% on the same configuration.
Test Plan: make check
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D51123
Summary:
This diff is 2/3 in a sequence that introduces a skip list optimized
for a key that is a freshly-allocated const char*. The change is broken
into pieces to make it easier to review. This piece removes the Key
template type, introduces the AllocateKey interface, and changes the
unit test from using uint64_t as the Key type to using pointers to an 8
byte blob.
Test Plan: unit test
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D51285
Summary:
This diff is 1/3 in a sequence that introduces a skip list optimized for
a key that is a freshly-allocated const char*. The diff is broken into
pieces to make it easier to review. This piece only introduces the new
type by copying the existing SkipList, with mechanical naming changes
and reformatting.
Test Plan: new unit test
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D51279
* conversion from 'size_t' to 'type', by add static_cast
Tested:
* by build solution on Windows, Linux locally,
* run tests
* build CI system successful
Summary: DBTest.DynamicLevelCompressionPerLevel2 sometimes fails during valgrind runs. This causes our valgrind tests to fail. Not sure what the best fix is for this test, but hopefully this simple change is sufficient.
Test Plan: run test
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D51111
Summary:
Provide an API for compaction filter to specify that it needs
to be applied even if there are snapshots.
Test Plan: DBTestCompactionFilter.CompactionFilterIgnoreSnapshot
Reviewers: yhchiang, IslamAbdelRahman, sdong, anthony
Reviewed By: anthony
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D51087
Summary: SpecialEnv::time_elapse_only_sleep_ is not initialized, which might cause some test failures. Fix it.
Test Plan: Run some unit tests. Since tests already broken. Might want to commit it sooner.
Reviewers: IslamAbdelRahman, yhchiang, anthony
Reviewed By: anthony
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D50937
Summary: DBTest.MergeTestTime is a test verifying timing counters. Depending on real time may cause non-determinstic results. Change to fake time to be determinsitic.
Test Plan: Run the test and make sure it passes
Reviewers: yhchiang, anthony, rven, kradhakrishnan, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D50883
Summary:
Travis is now failing because we cannot compile forward_iterator_bench under MAC
https://travis-ci.org/facebook/rocksdb/jobs/91524025
In forward_iterator_bench.cc we are using multiple functions that are not available in MAC like
htobe64
be64toh
Blocking forward_iterator_bench under MAC
Test Plan: compile under mac
Reviewers: rven, yhchiang, anthony, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D50889
Summary:
db_tailing_iter_test was failing on some platforms because of
an incorrect allocation and use. This diff fixes the issue.
Test Plan:
db_tailing_iter_test
Run valgrind for db_tailing_iter_test
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D50835
Summary: Timing counters' upper bounds depend on platform. It frequently fails in valgrind runs. Relax the upper bound.
Test Plan: Run the same valgrind test and make sure it passes.
Reviewers: rven, anthony, kradhakrishnan, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D50829
Summary: Handle multiple calls to DBImpl::PauseBackgroundWork() and DBImpl::ContinueBackgroundWork()
Test Plan: rocksdb.information_schema handles this case.
Reviewers: igor
Reviewed By: igor
Subscribers: hermanlee4, jkedgar, dhruba
Differential Revision: https://reviews.facebook.net/D50781
Summary:
Currently RocksDB may break in lines like this:
for (size_t i = sorted_runs.size() - 1; i >= first_index_after; i--) {
if options.level0_file_num_compaction_trigger=0.
Fix it by not executing the logic of picking compactions if there is no file (sorted_runs.size() = 0). Also internally set options.level0_file_num_compaction_trigger=1 if users give a 0. 0 is a value makes no sense in RocksDB.
Test Plan: Run all tests. Will add a unit test too.
Reviewers: yhchiang, IslamAbdelRahman, anthony, kradhakrishnan, rven
Reviewed By: rven
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D50727
Summary:
Fixed Rocksdb lite build failure in forward_iterator_bench by
defining main for the ROCKSDB_LITE case
Test Plan: build ROCKSDB_LITE
Reviewers: anthony, yhchiang, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D50733
Summary:
Under a tailing workload, there were increased block cache
misses when a memtable was flushed because we were rebuilding iterators
in that case since the version set changed. This was exacerbated in the
case of iterate_upper_bound, since file iterators which were over the
iterate_upper_bound would have been deleted and are now brought back as
part of the Rebuild, only to be deleted again. We now renew the iterators
and only build iterators for files which are added and delete file
iterators for files which are deleted.
Refer to https://reviews.facebook.net/D50463 for previous version
Test Plan: DBTestTailingIterator.TailingIteratorTrimSeekToNext
Reviewers: anthony, IslamAbdelRahman, igor, tnovak, yhchiang, sdong
Reviewed By: sdong
Subscribers: yhchiang, march, dhruba, leveldb, lovro
Differential Revision: https://reviews.facebook.net/D50679
Summary:
Since level 0 files can overlap, two level 0 compactions cannot
run in parallel. Compact files needs to check this before running a
compaction.
Test Plan: CompactFilesTest.L0ConflictsFiles
Reviewers: igor, IslamAbdelRahman, anthony, sdong, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D50079
Summary:
There's no need for WriteImpl to flatten the write batch group
into a single WriteBatch if the WAL is disabled. This diff moves the
flattening into the WAL step, and skips flattening entirely if it isn't
needed. It's good for about 5% speedup on a multi-threaded workload
with no WAL.
This diff also adds clarifying comments about the chance for partial
failure of WriteBatchInternal::InsertInto, and always sets bg_error_ if
the memtable state diverges from the logged state or if a WriteBatch
succeeds only partially.
Benchmark for speedup:
db_bench -benchmarks=fillrandom -threads=16 -batch_size=1 -memtablerep=skip_list -value_size=0 --num=200000 -level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999 -disable_auto_compactions --max_write_buffer_number=8 -max_background_flushes=8 --disable_wal --write_buffer_size=160000000
Test Plan: asserts + make check
Reviewers: sdong, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D50583
Summary:
DBCompactionTest.SkipStatsUpdateTest relies on the number
of files opened during the DB::Open process, but the persisting
options file support altered this number and thus makes
DBCompactionTest.SkipStatsUpdateTest in certain environment.
This patch fixed this test failure.
Test Plan: db_compaction_test
Reviewers: igor, sdong, anthony, IslamAbdelRahman
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D50637
Summary:
This patch allows rocksdb to persist options into a file on
DB::Open, SetOptions, and Create / Drop ColumnFamily.
Options files are created under the same directory as the rocksdb
instance.
In addition, this patch also adds a fail_if_missing_options_file in DBOptions
that makes any function call return non-ok status when it is not able to
persist options properly.
// If true, then DB::Open / CreateColumnFamily / DropColumnFamily
// / SetOptions will fail if options file is not detected or properly
// persisted.
//
// DEFAULT: false
bool fail_if_missing_options_file;
Options file names are formatted as OPTIONS-<number>, and RocksDB
will always keep the latest two options files.
Test Plan:
Add options_file_test.
options_test
column_family_test
Reviewers: igor, IslamAbdelRahman, sdong, anthony
Reviewed By: anthony
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D48285
Summary:
Parallel writes will only be possible for certain combinations of
flags and WriteBatch contents. Traversing the WriteBatch at write time
to check these conditions would be expensive, but it is very cheap to
keep track of when building WriteBatch-es. When loading WriteBatch-es
during recovery, a deferred computation state is used so that the flags
never need to be computed.
Test Plan:
1. add asserts and EXPECT_EQ-s
2. make check
Reviewers: sdong, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D50337
Summary:
Using a TLS random instance for skiplist makes it smaller
(useful for hash_skiplist_rep) and prepares skiplist for concurrent
adds. This diff also modifies the branching factor math to avoid an
unnecessary division.
This diff has the effect of changing the sequence of skip list node
height choices made by tests, so it has the potential to cause unit
test failures for tests that implicitly rely on the exact structure
of the skip list. Tests that try to exactly trigger a compaction are
likely suspects for this problem (these tests have always been brittle to
changes in the skiplist details). I've minimizes this risk by reseeding
the main thread's Random at the beginning of each test, increasing the
universal compaction size_ratio limit from 101% to 105% for some tests,
and verifying that the tests pass many times.
Test Plan: for i in `seq 0 9`; do make check; done
Reviewers: sdong, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D50439
Enable C4307 'operator' : integral constant overflow
Longs and ints on Windows are 32-bit hence the overflow
Enable C4309 'conversion' : truncation of constant value
Enable C4512 'class' : assignment operator could not be generated
Enable C4701 Potentially uninitialized local variable 'name' used