Summary:
add DestroyColumnFamilyHandle(ColumnFamilyHandle**) to close column family instead of deleting cfh*
User should call this to close a cf and then we can detect the deletion in this function.
Test Plan: make all check -j64
Reviewers: andrewkr, yiwu, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D60765
Summary:
When write stalls because of auto compaction is disabled, or stop write trigger is reached,
user may change these two options to unblock writes. Unfortunately we had issue where the write
thread will block the attempt to persist the options, thus creating a deadlock. This diff
fix the issue and add two test cases to detect such deadlock.
Test Plan:
Run unit tests.
Also, revert db_impl.cc to master (but don't revert `DBImpl::BackgroundCompaction:Finish` sync point) and run db_options_test. Both tests should hit deadlock.
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D60627
Summary:
I was investigating performance issues in the SstFileWriter and found all of the following:
- The SstFileWriter::Add() function created a local InternalKey every time it was called generating a allocation and free each time. Changed to have an InternalKey member variable that can be reset with the new InternalKey::Set() function.
- In SstFileWriter::Add() the smallest_key and largest_key values were assigned the result of a ToString() call, but it is simpler to just assign them directly from the user's key.
- The Slice class had no move constructor so each time one was returned from a function a new one had to be allocated, the old data copied to the new, and the old one was freed. I added the move constructor which also required a copy constructor and assignment operator.
- The BlockBuilder::CurrentSizeEstimate() function calculates the current estimate size, but was being called 2 or 3 times for each key added. I changed the class to maintain a running estimate (equal to the original calculation) so that the function can return an already calculated value.
- The code in BlockBuilder::Add() that calculated the shared bytes between the last key and the new key duplicated what Slice::difference_offset does, so I replaced it with the standard function.
- BlockBuilder::Add() had code to copy just the changed portion into the last key value (and asserted that it now matched the new key). It is more efficient just to copy the whole new key over.
- Moved this same code up into the 'if (use_delta_encoding_)' since the last key value is only needed when delta encoding is on.
- FlushBlockBySizePolicy::BlockAlmostFull calculated a standard deviation value each time it was called, but this information would only change if block_size of block_size_deviation changed, so I created a member variable to hold the value to avoid the calculation each time.
- Each PutVarint??() function has a buffer and calls std::string::append(). Two or three calls in a row could share a buffer and a single call to std::string::append().
Some of these will be helpful outside of the SstFileWriter. I'm not 100% the addition of the move constructor is appropriate as I wonder why this wasn't done before - maybe because of compiler compatibility? I tried it on gcc 4.8 and 4.9.
Test Plan: The changes should not affect the results so the existing tests should all still work and no new tests were added. The value of the changes was seen by manually testing the SstFileWriter class through MyRocks and adding timing code to identify problem areas.
Reviewers: sdong, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59607
Summary: fix Rocksdb Unit Test USER_FAILURE
Test Plan: make all check -j64
Reviewers: sdong, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D60603
Summary:
DB::AddFile(std::string file_path) API that allow them to ingest an SST file created using SstFileWriter
We want to update this interface to be able to accept a list of files that will be ingested, DB::AddFile(std::vector<std::string> file_path_list).
Test Plan:
Add test case `AddExternalSstFileList` in `DBSSTTest`. To make sure:
1. files key ranges are not overlapping with each other
2. each file key range dont overlap with the DB key range
3. make sure no snapshots are held
Reviewers: andrewkr, sdong, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D58587
Summary: In concurrent memtable insert case, updating counters in MemTable::Add() can count for 5% CPU usage. By batch all the counters and update in the end of the write batch, the CPU overheads are overhead in the use cases where more than one key is updated in one write batch.
Test Plan:
Write throughput increases 12% with this benchmark setting:
TEST_TMPDIR=/dev/shm/ ./db_bench --benchmarks=fillrandom -disable_auto_compactions -level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999 -num=10000000 --writes=1000000 -max_background_flushes=16 -max_write_buffer_number=16 --threads=64 --batch_size=128 -allow_concurrent_memtable_write -enable_write_thread_adaptive_yield
Reviewers: andrewkr, IslamAbdelRahman, ngbronson, igor
Reviewed By: ngbronson
Subscribers: ngbronson, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60495
Summary:
Use @omegaga's awesome feature to avoid use of callbacks for ensuring
SyncPoints happen in a particular thread.
Depends on D60375.
Test Plan:
$ ./auto_roll_logger_test
Reviewers: omegaga, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, omegaga, leveldb
Differential Revision: https://reviews.facebook.net/D60471
Summary: Add markers to sync points. A marked sync point will only be active when it is on the same thread as the marker sync point.
Test Plan: Write a unit test to validate.
Reviewers: sdong, IslamAbdelRahman, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60375
Summary: MyRocks release integration build breaks because we treat warnings caused by unused variables as errors. Variable `edit` is only used in debug builds. Therefore we need to guard it using `#ifndef NDEBUG` check.
Test Plan:
- `[p]arc diff --preview` for the default validation.
- Verify that release build fails before this fix and passes after applying it.
Reviewers: andrewkr, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60423
Summary: We saw instances where total_log_size is off the real value, but I'm not able to reproduce it. Add more logging to help debugging when it happens again.
Test Plan: Run the unit test and see the logging.
Reviewers: andrewkr, yhchiang, igor, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60081
Summary: Add option write_buffer_manager to help users control total memory spent on memtables across multiple DB instances.
Test Plan: Add a new unit test.
Reviewers: yhchiang, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: adela, benj, sumeet, muthu, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59925
Summary: Reported in T11889874. When registering the cleanup function we should copy the option so that we can still access it if ReadOptions is deleted.
Test Plan: Add a unit test to reproduce this bug.
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60087
Summary: UBSan is unhappy because `cfd` is not initialized. This breaks UBSan build which in turn breaks MyRocks continuous integration with RocksDB which in turns makes me unhappy :-) Fix this.
Test Plan:
- `[p]arc diff --preview` + Sandcastle.
- Verify that `COMPILE_WITH_UBSAN=1 OPT=-g make J=1 ubsan_check` gets past the break.
Reviewers: andrewkr, hermanlee4, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60117
Summary: With max_size_amplification_percent = 0 to make sure that DBTestUniversalCompaction.UniversalCompactionSingleSortedRun tests the configuration to compact to one single sorted run.
Test Plan: Run all existing tests
Reviewers: yhchiang, andrewkr, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60021
Summary:
Overload RepairDB to take vector-of-ColumnFamilyDescriptor, which tells
us CF name + options. Also takes a ColumnFamilyOptions for unspecified column
families encountered during the repair.
One potentially confusing thing is that we store options in the constructor and
don't invoke AddColumnFamily() until discovering the CF in ScanTable. This is
because we don't know the CF ID until we find a table belonging to that CF.
Depends on D59781.
Test Plan:
$ ./repair_test
Reviewers: yhchiang, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D59853
Summary:
This diff uses the CF ID and CF name properties in the SST file
to associate recovered data with the proper column family. Depends on D59775.
- In ScanTable(), create column families in VersionSet each time a new one is discovered (via reading SST file properties)
- In ConvertLogToTable(), dump an SST file for every column family with data in the WAL
- In AddTables(), make a VersionEdit per-column family that adds all of that CF's tables
Test Plan:
$ ./repair_test
Reviewers: yhchiang, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D59781
Summary:
To support column families, it is easiest to use VersionSet to manage
our column families (if we don't have Versions then ColumnFamilyData always
behaves as a dummy column family). This diff only refactors the existing repair
logic to use VersionSet; the next two parts will add support for multiple
column families.
Test Plan:
$ ./repair_test
Reviewers: yhchiang, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D59775
Summary:
Add a read option `background_purge_on_iterator_cleanup` to avoid deleting files in foreground when destroying iterators.
Instead, a job is scheduled in high priority queue and would be executed in a separate background thread.
Test Plan: Add a variant of PurgeObsoleteFileTest. Turn on background purge option in the new test, and use sleeping task to ensure files are deleted in background.
Reviewers: IslamAbdelRahman, sdong
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59499
Summary:
DB::AddFile() right now always add the ingested file to L0
update the logic to add the file to the lowest possible level
Test Plan: unit tests
Reviewers: jkedgar, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, yoshinorim
Differential Revision: https://reviews.facebook.net/D59637
Summary: filter_deltes is not a frequently used feature. Remove it.
Test Plan: Run all test suites.
Reviewers: igor, yhchiang, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59427
Summary:
We dont report the bytes that we ingested from AddFile which make the write amplification numbers incorrect
Update InternalStats and add logging for AddFile()
Test Plan: Make sure the code compile and existing tests pass
Reviewers: lightmark, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59763
Summary: DBCompactionTest.SkipStatsUpdateTest sometimes fails. I don't see any verification related to the deletes issued. Remove them to avoid the uncertainty.
Test Plan: Run the test.
Reviewers: IslamAbdelRahman, andrewkr, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59613
Summary:
We have alot of code duplication whenever we call FullMerge we keep duplicating the instrumentation and statistics code
This is a simple diff to refactor the code to use TimedFullMerge instead of FullMerge
Test Plan: COMPILE_WITH_ASAN=1 make check -j64
Reviewers: andrewkr, yhchiang, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59577
Summary:
Add option to not flush memtable on open()
In case the option is enabled, don't delete existing log files by not updating log numbers to MANIFEST.
Will still flush if we need to (e.g. memtable full in the middle). In that case we also flush final memtable.
If wal_recovery_mode = kPointInTimeRecovery, do not halt immediately after encounter corruption. Instead, check if seq id of next log file is last_log_sequence + 1. In that case we continue recovery.
Test Plan: See unit test.
Reviewers: dhruba, horuff, sdong
Reviewed By: sdong
Subscribers: benj, yhchiang, andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D57813
Summary: It confuses some compilers to have slice.cc under multiple directories. Merge them.
Test Plan: Run existing tests
Reviewers: andrewkr, yhchiang, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59409
Summary: Currently, if users define both of full key bloom and prefix bloom in SST files. During Get(), if full key bloom shows the key may exist, we still go ahead and check prefix bloom. This is wasteful. If bloom filter for full keys exists, we should always ignore prefix bloom in Get().
Test Plan: Run existing tests
Reviewers: yhchiang, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D57825
Summary:
memtable_prefix_bloom_probes is not a critical option. Remove it to reduce number of options.
It's easier for users to make mistakes with memtable_prefix_bloom_bits, turn it to memtable_prefix_bloom_bits_ratio
Test Plan: Run all existing tests
Reviewers: yhchiang, igor, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: gunnarku, yoshinorim, MarkCallaghan, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59199
Summary: Backup options file to private directory
Test Plan:
backupable_db_test.cc, BackupOptions
Modify DB options by calling OpenDB for 3 times. Check the latest options file is in the right place. Also check no redundent files are backuped.
Reviewers: andrewkr
Reviewed By: andrewkr
Subscribers: leveldb, dhruba, andrewkr
Differential Revision: https://reviews.facebook.net/D59373
Summary:
Add a test to detect that when WAL gets truncated,
seq no's are checked to be contiguous.
This test is put in ColumnFamilyTest as it has the necessary
infrastructure/functions for flushing column families, which
we use to ensure 2 active WAL files
Test Plan:
This is a test, no feature has been added.
This test fails today and hence disabled
Reviewers: sdong
Reviewed By: sdong
Subscribers: lgalanis, dhruba, andrewkr, pritamdamania
Differential Revision: https://reviews.facebook.net/D59253
Summary: With `table_options.cache_index_and_filter_blocks = true`, index and filter blocks are stored in block cache. Then people are curious how much of the block cache total size is used by indexes and bloom filters. It will be nice we have a way to report that. It can help people tune performance and plan for optimized hardware setting. We add several enum values for db Statistics. BLOCK_CACHE_INDEX/FILTER_BYTES_INSERT - BLOCK_CACHE_INDEX/FILTER_BYTES_ERASE = current INDEX/FILTER total block size in bytes.
Test Plan:
write a test case called `DBBlockCacheTest.IndexAndFilterBlocksStats`. The result is:
```
[gzh@dev9927.prn1 ~/local/rocksdb] make db_block_cache_test -j64 && ./db_block_cache_test --gtest_filter=DBBlockCacheTest.IndexAndFilterBlocksStats
Makefile:101: Warning: Compiling in debug mode. Don't use the resulting binary in production
GEN util/build_version.cc
make: `db_block_cache_test' is up to date.
Note: Google Test filter = DBBlockCacheTest.IndexAndFilterBlocksStats
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBBlockCacheTest
[ RUN ] DBBlockCacheTest.IndexAndFilterBlocksStats
[ OK ] DBBlockCacheTest.IndexAndFilterBlocksStats (689 ms)
[----------] 1 test from DBBlockCacheTest (689 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (689 ms total)
[ PASSED ] 1 test.
```
Reviewers: IslamAbdelRahman, andrewkr, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D58677
Summary:
MemTableList::current_ could be written by background flush thread and
simultaneously read in the user thread (NumNotFlushed() is used in
SwitchMemtable()). Use the lock to prevent this case. Found the error from tsan.
Related: D58833
Test Plan:
$ OPT=-g COMPILE_WITH_TSAN=1 make -j64 db_test
$ TEST_TMPDIR=/dev/shm/rocksdb ./db_test --gtest_filter=DBTest.RepeatedWritesToSameKey
Reviewers: lightmark, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D59139
* Create a callback for memtable becoming immutable
Create a callback for memtable becoming immutable
Create a callback for memtable becoming immutable
moved notification outside the lock
Move sealed notification to unlocked portion of SwitchMemtable
* fix lite build
Summary:
We see some write stalls because of number of unflushed memtables. With existing logging I couldn't figure out what's happening exactly. See internal task t11446054 for details if interested. This diff adds:
- logging of memtable creation at info level; I wanted it on multiple occasions for different reasons; also include number of immutable memtables,
- logging of number of remaining immutable memtables after a flush.
Test Plan: ran tests
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D58833
Summary: We added more table properties for each SST file, so when using 2KB SST file size, the estimated size of SST files is off by almost half, causing the LSM tree structure not as expected. Fix it by making file size 4x as previously, as well as LSM base size. Also avoid the sleeping based synchronization and turn to use sync points.
Test Plan: Run paralell unit tests multiple times and make sure they always pass.
Reviewers: IslamAbdelRahman, kradhakrishnan
Reviewed By: kradhakrishnan
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D58749
Summary: Make CompactionOptionsFIFO a part of mutable_cf_options
Test Plan: UT
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, lgalanis, dhruba
Differential Revision: https://reviews.facebook.net/D58653
kPointInTimeRecovery is indistinguishable from
kTolerateCorruptedTailRecords in recycle mode since we define
the "end" of the log as the first corrupt record we encounter.
kAbsoluteConsistency doesn't make sense because even a clean
shutdown leaves old junk at the end of the log file.
Signed-off-by: Sage Weil <sage@redhat.com>
If we are in kTolerateCorruptedTailRecords, treat these
errors as the end of the log. This is particularly
important for recycled logs, where we will regularly see
corrupted headers (bad length or checksum) when replaying
a log. If we are aligned with a block boundary or get lucky,
we will land on an old header and see the log number
mismatch, but more commonly we will land midway through
some previous block and record and effectively see noise.
These must be treated as the end of the log in order for
recycling to work.
This makes the LogTest.Recycle/1 test pass.
We also modify a number of existing tests because the
recycled log files behave fundamentally differently in that
they always stop when they reach the first bad record.
Signed-off-by: Sage Weil <sage@redhat.com>