If we are in kTolerateCorruptedTailRecords, treat these
errors as the end of the log. This is particularly
important for recycled logs, where we will regularly see
corrupted headers (bad length or checksum) when replaying
a log. If we are aligned with a block boundary or get lucky,
we will land on an old header and see the log number
mismatch, but more commonly we will land midway through
some previous block and record and effectively see noise.
These must be treated as the end of the log in order for
recycling to work.
This makes the LogTest.Recycle/1 test pass.
We also modify a number of existing tests because the
recycled log files behave fundamentally differently in that
they always stop when they reach the first bad record.
Signed-off-by: Sage Weil <sage@redhat.com>
Summary:
A couple of notes from the diff:
- The namespace block I added at the top of table_properties_collector.cc was in reaction to an issue i was having with PutVarint64 and reusing the "val" string. I'm not sure this is the cleanest way of doing this, but abstracting this out at least results in the correct behavior.
- I chose "rocksdb.merge.operands" as the property name. I am open to suggestions for better names.
- The change to sst_dump_tool.cc seems a bit inelegant to me. Is there a better way to do the if-else block?
Test Plan:
I added a test case in table_properties_collector_test.cc. It adds two merge operands and checks to make sure that both of them are reflected by GetMergeOperands. It also checks to make sure the wasPropertyPresent bool is properly set in the method.
Running both of these tests should pass:
./table_properties_collector_test
./sst_dump_test
Reviewers: IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D58119
Summary:
This is a part of effort to reduce the size of db_test.cc. We move the following tests to a separate file `db_io_failure_test.cc`:
* DropWrites
* DropWritesFlush
* NoSpaceCompactRange
* NonWritableFileSystem
* ManifestWriteError
* PutFailsParanoid
Test Plan: Run `make check` to see if the tests are working properly.
Reviewers: sdong, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D58341
Summary:
NotifyOnCompactionCompleted can unlock the mutex.
That mean that we can schedule a background compaction that will start before we ReleaseCompactionFiles().
Test Plan:
added unittest
existing unittest
Reviewers: yhchiang, sdong
Reviewed By: sdong
Subscribers: yoshinorim, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D58065
Summary: This tests that a prepared transaction is not lost after several crashes, restarts, and memtable flushes.
Test Plan: TwoPhaseLongPrepareTest
Reviewers: sdong
Subscribers: hermanlee4, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D58185
Summary:
- Make sure we clean up recovered_transactions_ on DBImpl destructor
- delete leaked txns and env in TransactionTest
Test Plan: Run transaction_test under valgrind
Reviewers: sdong, andrewkr, yhchiang, horuff
Reviewed By: horuff
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D58263
Summary:
Added a new abstraction to cache page to RocksDB designed for the read
cache use.
RocksDB current block cache is more of an object cache. For the persistent read cache
project, what we need is a page cache equivalent. This changes adds a cache
abstraction to RocksDB to cache pages called PersistentCache. PersistentCache can cache
uncompressed pages or raw pages (content as in filesystem). The user can
choose to operate PersistentCache either in COMPRESSED or UNCOMPRESSED mode.
Blame Rev:
Test Plan: Run unit tests
Reviewers: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D55707
Summary: DBTestXactLogIterator.TransactionLogIterator was failing due the sequence gaps. This was caused by an off-by-one error when calculating the new sequence number after recovering from logs.
Test Plan: db_log_iter_test
Reviewers: andrewkr
Subscribers: andrewkr, hermanlee4, dhruba, IslamAbdelRahman
Differential Revision: https://reviews.facebook.net/D58053
Summary:
GetObsoleteFiles() and LogAndApply() functions modify obsolete_manifests_ vector
we need to make sure that the mutex is held when we modify the obsolete_manifests_
Test Plan: run the test under TSAN
Reviewers: andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D58011
Summary:
1. prepare()
2. crash
3. recover
4. commit()
5. crash
6. data is lost
This is due to the transaction data still only residing in the WAL but because the logs were flushed on the first recovery the data is ignored on the second recovery. We must scan all logs found on recovery and only ignore redundant data at the time of replay. It is not possible to know which logs still contain relevant data at time of recovery. We cannot simply ignore a log because all of the non-2pc data it contains has already been written to L0.
The changes made to MemTableInserter are to ensure that prepared sections are still recovered even if all of the non-2pc data in that log has already been flushed to L0.
Test Plan: Provided test.
Reviewers: sdong
Subscribers: andrewkr, hermanlee4, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D57729
Summary:
Consider the following WAL with 4 batch entries prefixed with their sequence at time of memtable insert.
[1: BEGIN_PREPARE, PUT, PUT, PUT, PUT, END_PREPARE(a)]
[1: BEGIN_PREPARE, PUT, PUT, PUT, PUT, END_PREPARE(b)]
[4: COMMIT(a)]
[7: COMMIT(b)]
The first two batches do not consume any sequence numbers so are both prefixed with seq=1.
For 2pc commit, memtable insertion takes place before COMMIT batch is written to WAL.
We can see that sequence number consumption takes place between WAL entries giving us the seemingly sparse sequence prefix for WAL entries.
This is a valid WAL.
Because with 2PC markers one WriteBatch points to another batch containing its inserts a writebatch can consume more or less sequence numbers than the number of sequence consuming entries that it contains.
We can see that, given the entries in the WAL, 6 sequence ids were consumed. Yet on recovery the maximum sequence consumed would be 7 + 3 (the number of sequence numbers consumed by COMMIT(b))
So, now upon recovery we must track the actual consumption of sequence numbers.
In the provided scenario there will be no sequence gaps, but it is possible to produce a sequence gap. This should not be a problem though. correct?
Test Plan: provided test.
Reviewers: sdong
Subscribers: andrewkr, leveldb, dhruba, hermanlee4
Differential Revision: https://reviews.facebook.net/D57645
Summary:
This diff is built on top of WriteBatch modification: https://reviews.facebook.net/D54093 and adds the required functionality to rocksdb core necessary for rocksdb to support 2PC.
modfication of DBImpl::WriteImpl()
- added two arguments *uint64_t log_used = nullptr, uint64_t log_ref = 0;
- *log_used is an output argument which will return the log number which the incoming batch was inserted into, 0 if no WAL insert took place.
- log_ref is a supplied log_number which all memtables inserted into will reference after the batch insert takes place. This number will reside in 'FindMinPrepLogReferencedByMemTable()' until all Memtables insertinto have flushed.
- Recovery/writepath is now aware of prepared batches and commit and rollback markers.
Test Plan: There is currently no test on this diff. All testing of this functionality takes place in the Transaction layer/diff but I will add some testing.
Reviewers: IslamAbdelRahman, sdong
Subscribers: leveldb, santoshb, andrewkr, vasilep, dhruba, hermanlee4
Differential Revision: https://reviews.facebook.net/D56919
Summary: Adds three new WriteBatch data types: Prepare(xid), Commit(xid), Rollback(xid). Prepare(xid) should precede the (single) operation to which is applies. There can obviously be multiple Prepare(xid) markers. There should only be one Rollback(xid) or Commit(xid) marker yet not both. None of this logic is currently enforced and will most likely be implemented further up such as in the memtableinserter. All three markers are similar to PutLogData in that they are writebatch meta-data, ie stored but not counted. All three markers differ from PutLogData in that they will actually be written to disk. As for WriteBatchWithIndex, Prepare, Commit, Rollback are all implemented just as PutLogData and none are tested just as PutLogData.
Test Plan: single unit test in write_batch_test.
Reviewers: hermanlee4, sdong, anthony
Subscribers: leveldb, dhruba, vasilep, andrewkr
Differential Revision: https://reviews.facebook.net/D57867
Summary:
Add a new option that can be used to set a specific compression algorithm for bottommost level.
This option will only affect levels larger than base level.
I have also updated CompactionJobInfo to include the compression algorithm used in compaction
Test Plan:
added new unittest
existing unittests
Reviewers: andrewkr, yhchiang, sdong
Reviewed By: sdong
Subscribers: lightmark, andrewkr, dhruba, yoshinorim
Differential Revision: https://reviews.facebook.net/D57669
Summary: Currently we estimate bytes needed for compaction by assuming fanout value to be level multiplier. It overestimates when size of a level exceeds the target by large. We estimate by the ratio of actual sizes in levels instead.
Test Plan: Fix existing test cases and add a new one.
Reviewers: IslamAbdelRahman, igor, yhchiang
Reviewed By: yhchiang
Subscribers: MarkCallaghan, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D57789
Summary: Fixing error with win build where we compare int64_t with size_t.
Test Plan: make check
Reviewers: andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D57885
Summary: This is to provide a way for users to skip prefix bloom in point look-up.
Test Plan: Add a new unit test scenario.
Reviewers: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D57747
Summary: This test is failing under valgrind because we dont delete the Env that we allocated
Test Plan: run the test under valgrind
Reviewers: andrewkr, yhchiang, yiwu, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D57693
Summary: Fixing lite build broke in unit test. `FilesPerLevel()` depends on `DB::GetProperty()`, which lite build doesn't support.
Test Plan: OPT=-DROCKSDB_LITE make check -j64
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D57651
Summary:
Add an option `iterator_readahead_size` to `ReadOptions` to enable
configurable readahead for iterators similar to the corresponding
option for compaction.
Test Plan:
```
make commit_prereq
```
Reviewers: kumar.rangarajan, ott, igor, sdong
Reviewed By: sdong
Subscribers: yiwu, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D55419
Summary: We should not use IterKey::SetKey with copy = false except if we are pinning the iterator thru it's life time, otherwise we may release the temporarily pinned blocks and in this case the IterKey will be pointing to freed memory
Test Plan: added a new test
Reviewers: sdong, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D57561
Using explicit 64-bit type in conditional in platforms above 32-bits
This appears to be necessary on Mac OSX as std::conditional does not appear to short circuit and evaluates the third template arg
Making the third template arg be 64 bits explicitly works around this problem and will work on both 32 bit and 64+ bit platforms.
Summary: GetCurrentMutableCFOptions() can only be called when DB mutex is held so we cannot call it in CompactionJob::ProcessKeyValueCompaction() since it's not holding the db mutex
Test Plan: make check -j64
Reviewers: sdong, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D57471
Summary: Adds three new WriteBatch data types: Prepare(xid), Commit(xid), Rollback(xid). Prepare(xid) should precede the (single) operation to which is applies. There can obviously be multiple Prepare(xid) markers. There should only be one Rollback(xid) or Commit(xid) marker yet not both. None of this logic is currently enforced and will most likely be implemented further up such as in the memtableinserter. All three markers are similar to PutLogData in that they are writebatch meta-data, ie stored but not counted. All three markers differ from PutLogData in that they will actually be written to disk. As for WriteBatchWithIndex, Prepare, Commit, Rollback are all implemented just as PutLogData and none are tested just as PutLogData.
Test Plan: single unit test in write_batch_test.
Reviewers: hermanlee4, sdong, anthony
Subscribers: andrewkr, vasilep, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D54093
Summary: Added EventListener::OnTableFileCreationStarted. EventListener::OnTableFileCreated will be called on failure case. User can check creation status via TableFileCreationInfo::status.
Test Plan: unit test.
Reviewers: dhruba, yhchiang, ott, sdong
Reviewed By: sdong
Subscribers: sdong, kradhakrishnan, IslamAbdelRahman, andrewkr, yhchiang, leveldb, ott, dhruba
Differential Revision: https://reviews.facebook.net/D56337
Summary:
Changing several option defaults:
options.max_open_files changes from 5000 to -1
options.base_background_compactions changes from max_background_compactions to 1
options.wal_recovery_mode changes from kTolerateCorruptedTailRecords to kTolerateCorruptedTailRecords
options.compaction_pri changes from kByCompensatedSize to kByCompensatedSize
Test Plan: Write unit tests to see OldDefaults() works as expected.
Reviewers: IslamAbdelRahman, yhchiang, igor
Reviewed By: igor
Subscribers: MarkCallaghan, yiwu, kradhakrishnan, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56427
Summary: Warning is printed out with USE_CLANG=1 when including jemalloc.h. Disable it in that case.
Test Plan: Run db_bench with USE_CLANG=1 and not. Make sure they can all build and jemalloc status is printed out in the case where USE_CLANG is not set.
Reviewers: andrewkr, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D57399
Summary:
This test relies on "rocksdb.num-files-at-levelN" property that isn't
implemented in rocksdb lite. So we will compile it only for non-lite builds.
Test Plan:
$ make -j40 check 'OPT=-g -DROCKSDB_LITE'
Reviewers: sdong, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D57387
Summary:
There was one narrowing conversion in D52287 that only showed up with
clang on osx.
Test Plan:
$ make clean && USE_CLANG=1 DISABLE_JEMALLOC=1 TEST_TMPDIR=/dev/shm/rocksdb OPT=-g make -j32 check
Reviewers: sdong, lightmark, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D57357
Summary: There were a few narrowing conversions that clang didn't like.
Test Plan:
$ make clean && USE_CLANG=1 DISABLE_JEMALLOC=1 TEST_TMPDIR=/dev/shm/rocksdb OPT=-g make -j32 check
Reviewers: IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D57351
Summary: It looks like we mistakenly enable JEMALLOC even if it's not available on the machine, that's why travis is failing
Test Plan:
check on my devserver
check on my mac
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D57345
Summary:
This adds a new metablock containing a shared dictionary that is used
to compress all data blocks in the SST file. The size of the shared dictionary
is configurable in CompressionOptions and defaults to 0. It's currently only
used for zlib/lz4/lz4hc, but the block will be stored in the SST regardless of
the compression type if the user chooses a nonzero dictionary size.
During compaction, computes the dictionary by randomly sampling the first
output file in each subcompaction. It pre-computes the intervals to sample
by assuming the output file will have the maximum allowable length. In case
the file is smaller, some of the pre-computed sampling intervals can be beyond
end-of-file, in which case we skip over those samples and the dictionary will
be a bit smaller. After the dictionary is generated using the first file in a
subcompaction, it is loaded into the compression library before writing each
block in each subsequent file of that subcompaction.
On the read path, gets the dictionary from the metablock, if it exists. Then,
loads that dictionary into the compression library before reading each block.
Test Plan: new unit test
Reviewers: yhchiang, IslamAbdelRahman, cyan, sdong
Reviewed By: sdong
Subscribers: andrewkr, yoshinorim, kradhakrishnan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D52287
Summary:
Introduced option to dump malloc statistics using new option flag.
Added new command line option to db_bench tool to enable this
funtionality.
Also extended build to support environments with/without jemalloc.
Test Plan:
1) Build rocksdb using `make` command. Launch the following command
`./db_bench --benchmarks=fillrandom --dump_malloc_stats=true
--num=10000000` end verified that jemalloc dump is present in LOG file.
2) Build rocksdb using `DISABLE_JEMALLOC=1 make db_bench -j32` and ran
the same db_bench tool and found the following message in LOG file:
"Please compile with jemalloc to enable malloc dump".
3) Also built rocksdb using `make` command on MacOS to verify behavior
in non-FB environment.
Also to debug build configuration change temporary changed
AM_DEFAULT_VERBOSITY = 1 in Makefile to see compiler and build
tools output. For case 1) -DROCKSDB_JEMALLOC was present in compiler
command line. For both 2) and 3) this flag was not present.
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D57321
Summary: CompactedDB skips memtable. So we shouldn't use compacted DB if there is outstanding WAL files.
Test Plan: Change to options.max_open_files = -1 perf context test to create a compacted DB, which we shouldn't do.
Reviewers: yhchiang, kradhakrishnan, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D57057
Summary:
While trying to reuse PinData() / ReleasePinnedData() .. to optimize away some memcpys I realized that there is a significant overhead for using PinData() / ReleasePinnedData if they were called many times.
This diff refactor the pinning logic by introducing PinnedIteratorsManager a centralized component that will be created once and will be notified whenever we need to Pin an Iterator. This implementation have much less overhead than the original implementation
Test Plan:
make check -j64
COMPILE_WITH_ASAN=1 make check -j64
Reviewers: yhchiang, sdong, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56493
Summary:
The current implementation find the first different byte and try to increment it, if it cannot it return the original key
we can improve this by keep going after the first different byte to find the first non 0xFF byte and increment it
After trying this patch on some logdevice sst files I see decrease in there index block size by 8.5%
Test Plan: existing tests and updated test
Reviewers: yhchiang, andrewkr, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56241
Summary:
In this test some times automatic compactions do everything and Manual compaction become a no-op.
Update the test to make sure manual compaction is not a no-op
Test Plan: run the test
Reviewers: andrewkr, yhchiang, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D57189
Summary:
Now we collect compaction stats per column family, but report default colum family's stat as compaction stats for DB.
Fix it by reporting compaction stats per column family instead.
Test Plan: Run db_bench with --num_column_families=4 and see the number fixed.
Reviewers: IslamAbdelRahman, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D57063
Summary:
In https://reviews.facebook.net/D56271, we fixed an issue where
we consider flush as compaction. However, that makes us mistakenly
count FLUSH_WRITE_BYTES twice (one in flush_job and one in db_impl.)
This patch removes the one incremented in db_impl.
Test Plan: db_test
Reviewers: yiwu, andrewkr, IslamAbdelRahman, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D57111