Summary:
Added these events:
* Recovery start, finish and also when recovery creates a file
* Trivial move
* Compaction start, finish and when compaction creates a file
* Flush start, finish
Also includes small fix to EventLogger
Also added option ROCKSDB_PRINT_EVENTS_TO_STDOUT which is useful when we debug things. I've spent far too much time chasing LOG files.
Still didn't get sst table properties in JSON. They are written very deeply into the stack. I'll address in separate diff.
TODO:
* Write specification. Let's first use this for a while and figure out what's good data to put here, too. After that we'll write spec
* Write tools that parse and analyze LOGs. This can be in python or go. Good intern task.
Test Plan: Ran db_bench with ROCKSDB_PRINT_EVENTS_TO_STDOUT. Here's the output: https://phabricator.fb.com/P19811976
Reviewers: sdong, yhchiang, rven, MarkCallaghan, kradhakrishnan, anthony
Reviewed By: anthony
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D37521
Summary: To further distinguish the corruption cases were caused by storage media or in memory states when writing it, add a paranoid check after writing the file to iterate all the rows.
Test Plan: Add a new unit test for it
Reviewers: rven, igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D37335
Summary:
The goal of this diff is to make Compaction class easier to use. This should also make new compaction algorithms easier to write (like CompactFiles from @yhchiang and dynamic leveled and multi-leveled universal from @sdong).
Here are couple of things demonstrating that Compaction class is hard to use:
1. we have two constructors of Compaction class
2. there's this thing called grandparents_, but it appears to only be setup for leveled compaction and not compactfiles
3. it's easy to introduce a subtle and dangerous bug like this: D36225
4. SetupBottomMostLevel() is hard to understand and it shouldn't be. See this comment: afbafeaeae/db/compaction.cc (L236-L241). It also made it harder for @yhchiang to write CompactFiles, as evidenced by this: afbafeaeae/db/compaction_picker.cc (L204-L210)
The problem is that we create Compaction object, which holds a lot of state, and then pass it around to some functions. After those functions are done mutating, then we call couple of functions on Compaction object, like SetupBottommostLevel() and MarkFilesBeingCompacted(). It is very hard to see what's happening with all that Compaction's state while it's travelling across different functions. If you're writing a new PickCompaction() function you need to try really hard to understand what are all the functions you need to run on Compaction object and what state you need to setup.
My proposed solution is to make important parts of Compaction immutable after construction. PickCompaction() should calculate compaction inputs and then pass them onto Compaction object once they are finalized. That makes it easy to create a new compaction -- just provide all the parameters to the constructor and you're done. No need to call confusing functions after you created your object.
This diff doesn't fully achieve that goal, but it comes pretty close. Here are some of the changes:
* have one Compaction constructor instead of two.
* inputs_ is constant after construction
* MarkFilesBeingCompacted() is now private to Compaction class and automatically called on construction/destruction.
* SetupBottommostLevel() is gone. Compaction figures it out on its own based on the input.
* CompactionPicker's functions are not passing around Compaction object anymore. They are only passing around the state that they need.
Test Plan:
make check
make asan_check
make valgrind_check
Reviewers: rven, anthony, sdong, yhchiang
Reviewed By: yhchiang
Subscribers: sdong, yhchiang, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D36687
Summary:
Currently users have no idea a key is add, delete or merge from TablePropertiesCollector call back. Add a new function to add it.
Also refactor the codes so that
(1) make table property collector and internal table property collector two separate data structures with the later one now exposed
(2) table builders only receive internal table properties
Test Plan: Add cases in table_properties_collector_test to cover both of old and new ways of using TablePropertiesCollector.
Reviewers: yhchiang, igor.sugak, rven, igor
Reviewed By: rven, igor
Subscribers: meyering, yoshinorim, maykov, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D35373
Summary:
We have addded new stats and perf_context for measuring the merge and filter operation time consumption.
We have bounded all the merge operations within the GUARD statment and collected the total time for these operations in the DB.
Test Plan: WIP
Reviewers: rven, yhchiang, kradhakrishnan, igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D34377
Summary:
Fix the deadlock issue in ThreadStatusSingleCompaction.
In the previous version of ThreadStatusSingleCompaction, the compaction
thread will wait for a SYNC_POINT while its db_mutex is held. However,
if the test hasn't finished its Put cycle while a compaction is running,
a deadlock will happen in the test.
Test Plan:
export ROCKSDB_TESTS=ThreadStatus
./db_test
Reviewers: sdong, igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D35001
Summary:
Provide an API which enables users to infor Rocksdb that it is
shutting down.
Test Plan: db_test
Reviewers: sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D34617
Summary:
When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases.
In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier.
Test Plan: New unit tests and pass tests suites including valgrind.
Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo
Reviewed By: ikabiljo
Subscribers: yoshinorim, ikabiljo, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31437
Summary:
Summary:
Added a new option to ColumnFamllyOptions - optimize_filters_for_hits. This option can be used in the case where most
accesses to the store are key hits and we dont need to optimize performance for key misses.
This is useful when you have a very large database and most of your lookups succeed. The option allows the store to
not store and use filters in the last level (the largest level which contains data). These filters can take a large amount of
space for large databases (in memory and on-disk). For the last level, these filters are only useful for key misses and not
for key hits. If we are not optimizing for key misses, we can choose to not store these filters for that level.
This option is only provided for BlockBasedTable. We skip the filters when we are compacting
Test Plan:
1. Modified db_test toalso run tests with an additonal option (skip_filters_on_last_level)
2. Added another unit test to db_test which specifically tests that filters are being skipped
Reviewers: rven, igor, sdong
Reviewed By: sdong
Subscribers: lgalanis, yoshinorim, MarkCallaghan, rven, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D33717
Summary: Allow GetThreadList to reflect flush activity.
Test Plan:
Developed ThreadStatusFlush test and updated ThreadStatusMultiCompaction test.
./db_test ./thread_list_test
Reviewers: sdong, rven, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D32871
Summary:
It would be good to assing background job their IDs. Two benefits:
1) makes LOGs more readable
2) I might use it in my EventLogger, which will try to make our LOG easier to read/query/visualize
Test Plan: ran rocksdb, read the LOG
Reviewers: sdong, rven, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31617
Summary:
Add a counter for collecting the wait time on db mutex.
Also add MutexWrapper and CondVarWrapper for measuring wait time.
Test Plan:
./db_test
export ROCKSDB_TESTS=MutexWaitStats
./db_test
verify stats output using db_bench
make clean
make release
./db_bench --statistics=1 --benchmarks=fillseq,readwhilewriting --num=10000 --threads=10
Sample output:
rocksdb.db.mutex.wait.micros COUNT : 7546866
Reviewers: MarkCallaghan, rven, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D32787
Summary:
1. If WAL directory is different from db directory. Sync the directory after creating a log file under it.
2. After creating an SST file, sync its parent directory instead of DB directory.
3. change the check of kResetDeleteUnsyncedFiles in fault_injection_test. Since we changed the behavior to sync log files' parent directory after first WAL sync, instead of creating, kResetDeleteUnsyncedFiles will not guarantee to show post sync updates.
Test Plan: make all check
Reviewers: yhchiang, rven, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D32067
Summary: Fixed a bug which could hide non-ok status in CompactionJob::Run()
Test Plan: make
Reviewers: sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D28995
Summary:
This is just a simple test that passes two files though a compaction. It shows the framework so that people can continue building new compaction *unit* tests.
In the future we might want to move some Compaction* tests from DBTest here. For example, CompactBetweenSnapshot seems a good candidate.
Hopefully this test can be simpler when we mock out VersionSet.
Test Plan: this is a test
Reviewers: ljin, rven, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D28449
Summary: So iOS size_t is 32-bit, so we need to static_cast<size_t> any uint64_t :(
Test Plan: TARGET_OS=IOS make static_lib
Reviewers: dhruba, ljin, yhchiang, rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D28743
Summary:
We need to turn on -Wshorten-64-to-32 for mobile. See D1671432 (internal phabricator) for details.
This diff turns on the warning flag and fixes all the errors. There were also some interesting errors that I might call bugs, especially in plain table. Going forward, I think it makes sense to have this flag turned on and be very very careful when converting 64-bit to 32-bit variables.
Test Plan: compiles
Reviewers: ljin, rven, yhchiang, sdong
Reviewed By: yhchiang
Subscribers: bobbaldwin, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D28689
Summary: Based on @sdong's feedback in the diff, we shouldn't keep db_mutex in CompactionJob's state. This diff removes db_mutex from CompactionJob state, by making next_file_number_ atomic. That way we only need to pass the lock to InstallCompactionResults() because of LogAndApply()
Test Plan: make check
Reviewers: ljin, yhchiang, rven, sdong
Reviewed By: sdong
Subscribers: sdong, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D28491
Summary:
Here's a prototype of redesigning pending_outputs_. This way, we don't have to expose pending_outputs_ to other classes (CompactionJob, FlushJob, MemtableList). DBImpl takes care of it.
Still have to write some comments, but should be good enough to start the discussion.
Test Plan: make check, will also run stress test
Reviewers: ljin, sdong, rven, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D28353
Summary:
1. fix possible overflow of the two stats by using uint64_t
2. use a similar source of data to calculate RecordDrop. Previous one is not correct.
Test Plan: See outputs of db_bench settings, and the results look reasonable
Reviewers: MarkCallaghan, ljin, igor
Reviewed By: igor
Subscribers: rven, leveldb, yhchiang, dhruba
Differential Revision: https://reviews.facebook.net/D28155
If a compaction filter implementation is simply filtering values, then
allocating the "changed values" bitmap is an extra memory allocation
that adds no value. Additionally, the compaction implementation has to
do marginally more work to calculate the offset into the bitmap
(vector<bool> specialization) for each record the filter did not mark
for deletion.
Explicitly handle the case where compact_->value_changed_buf_ is empty.
Summary: Apply InfoLogLevel to the logs in db/compaction_job.cc
Test Plan: db_test
Reviewers: ljin, sdong, igor
Reviewed By: igor
Subscribers: leveldb, MarkCallaghan, dhruba
Differential Revision: https://reviews.facebook.net/D28275
Summary:
Long awaited CompactionJob class! Move most compaction-related things from DBImpl to CompactionJob, making CompactionJob easier to test and understand.
Currently this is just replicating exactly the same functionality with as little as change as possible. As future work, we should:
1. Add CompactionJob tests (I think I'll do that tomorrow)
2. Reduce CompactionJob's state that it inherits from DBImpl
3. Figure out how to do yielding to flush better. Currently I implemented a callback as we agreed yesterday, but I don't think it's a good long term solution.
This reduces db_impl.cc from 5000+ LOC to 3400!
Test Plan: make check, will add CompactionJob-specific tests, probably also move some tests from db_test to compaction_job_test
Reviewers: rven, yhchiang, sdong, ljin
Reviewed By: ljin
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D27957