Summary:
This patch changes concurrency guarantees around ColumnFamilySet::column_families_ and ColumnFamilySet::column_families_data_.
Before:
* When mutating: lock DB mutex and spin lock
* When reading: lock DB mutex OR spin lock
After:
* When mutating: lock DB mutex and be in write thread
* When reading: lock DB mutex or be in write thread
That way, we eliminate the spin lock that protects these hash maps and simplify concurrency. That means we don't need to lock the spin lock during writing, since writing is mutually exclusive with column family create/drop (the only operations that mutate those hash maps).
With these new restrictions, I also needed to move column family create to the write thread (column family drop was already in the write thread).
Even though we don't need to lock the spin lock during write, impact on performance should be minimal -- the spin lock is almost never busy, so locking it is almost free.
This addresses task t5116919.
Test Plan:
make check
Stress test with lots and lots of column family drop and create:
time ./db_stress --threads=30 --ops_per_thread=5000000 --max_key=5000 --column_families=200 --clear_column_family_one_in=100000 --verify_before_write=0 --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress/
Reviewers: yhchiang, rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D30651
Summary: When trivial move commit is done, we log the summary of the input version instead of current. This is inconsistent with other log messages and confusing.
Test Plan: compiles
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D30939
Summary:
Since https://reviews.facebook.net/D16119, we ignore partial tailing writes. Because of that, we no longer need skip_log_error_on_recovery.
The documentation says "Skip log corruption error on recovery (If client is ok with losing most recent changes)", while the option actually ignores any corruption of the WAL (not only just the most recent changes). This is very dangerous and can lead to DB inconsistencies. This was originally set up to ignore partial tailing writes, which we now do automatically (after D16119). I have digged up old task t2416297 which confirms my findings.
Test Plan: There was actually no tests that verified correct behavior of skip_log_error_on_recovery.
Reviewers: yhchiang, rven, dhruba, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D30603
Summary:
This is a serious bug. If paranod_check == true and WAL is corrupted, we don't fail DB::Open(). I tried going into history and it seems we've been doing this for a long long time.
I found this when investigating t5852041.
Test Plan: Added unit test to verify correct behavior.
Reviewers: yhchiang, rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D30597
Summary:
GetThreadList() feature depends on the thread creation and destruction, which is currently handled under Env.
This patch moves GetThreadList() feature under Env to better manage the dependency of GetThreadList() feature
on thread creation and destruction.
Renamed ThreadStatusImpl to ThreadStatusUpdater. Add ThreadStatusUtil, which is a static class contains
utility functions for ThreadStatusUpdater.
Test Plan: run db_test, thread_list_test and db_bench and verify the life cycle of Env and ThreadStatusUpdater is properly managed.
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: ljin, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D30057
Summary: As title. We shouldn't need to execute flush from compaction if there are dedicated threads doing flushes.
Test Plan: make check
Reviewers: rven, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D30579
Summary:
There are two versions of FindObsoleteFiles():
* full scan, which is executed every 6 hours (and it's terribly slow)
* no full scan, which is executed every time a background process finishes and iterator is deleted
This diff is optimizing the second case (no full scan). Here's what we do before the diff:
* Get the list of obsolete files (files with ref==0). Some files in obsolete_files set might actually be live.
* Get the list of live files to avoid deleting files that are live.
* Delete files that are in obsolete_files and not in live_files.
After this diff:
* The only files with ref==0 that are still live are files that have been part of move compaction. Don't include moved files in obsolete_files.
* Get the list of obsolete files (which exclude moved files).
* No need to get the list of live files, since all files in obsolete_files need to be deleted.
I'll post the benchmark results, but you can get the feel of it here: https://reviews.facebook.net/D30123
This depends on D30123.
P.S. We should do full scan only in failure scenarios, not every 6 hours. I'll do this in a follow-up diff.
Test Plan:
One new unit test. Made sure that unit test fails if we don't have a `if (!f->moved)` safeguard in ~Version.
make check
Big number of compactions and flushes:
./db_stress --threads=30 --ops_per_thread=20000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0 --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000
Reviewers: yhchiang, rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D30249
Summary:
This one wasn't easy to find :)
What happens is we go through all cfds on flush_queue_ and find no cfds to flush, *but* the cfd is set to the last CF we looped through and following code assumes we want it flushed.
BTW @sdong do you think we should also make BackgroundFlush() only check a single cfd for flushing instead of doing this `while (!flush_queue_.empty())`?
Test Plan: regression test no longer fails
Reviewers: sdong, rven, yhchiang
Reviewed By: yhchiang
Subscribers: sdong, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D30591
Summary:
When scaling to higher number of column families, the worst bottleneck was MaybeScheduleFlushOrCompaction(), which did a for loop over all column families while holding a mutex. This patch addresses the issue.
The approach is similar to our earlier efforts: instead of a pull-model, where we do something for every column family, we can do a push-based model -- when we detect that column family is ready to be flushed/compacted, we add it to the flush_queue_/compaction_queue_. That way we don't need to loop over every column family in MaybeScheduleFlushOrCompaction.
Here are the performance results:
Command:
./db_bench --write_buffer_size=268435456 --db_write_buffer_size=268435456 --db=/fast-rocksdb-tmp/rocks_lots_of_cf --use_existing_db=0 --open_files=55000 --statistics=1 --histogram=1 --disable_data_sync=1 --max_write_buffer_number=2 --sync=0 --benchmarks=fillrandom --threads=16 --num_column_families=5000 --disable_wal=1 --max_background_flushes=16 --max_background_compactions=16 --level0_file_num_compaction_trigger=2 --level0_slowdown_writes_trigger=2 --level0_stop_writes_trigger=3 --hard_rate_limit=1 --num=33333333 --writes=33333333
Before the patch:
fillrandom : 26.950 micros/op 37105 ops/sec; 4.1 MB/s
After the patch:
fillrandom : 17.404 micros/op 57456 ops/sec; 6.4 MB/s
Next bottleneck is VersionSet::AddLiveFiles, which is painfully slow when we have a lot of files. This is coming in the next patch, but when I removed that code, here's what I got:
fillrandom : 7.590 micros/op 131758 ops/sec; 14.6 MB/s
Test Plan:
make check
two stress tests:
Big number of compactions and flushes:
./db_stress --threads=30 --ops_per_thread=20000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0 --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000
max_background_flushes=0, to verify that this case also works correctly
./db_stress --threads=30 --ops_per_thread=2000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0 --reopen=3 --max_background_compactions=3 --max_background_flushes=0 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000
Reviewers: ljin, rven, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D30123
Summary: Avoid unnecessary unlock and lock mutex when notifying events.
Test Plan: ./listener_test
Reviewers: igor
Reviewed By: igor
Subscribers: sdong, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D30267
Summary:
We now release the mutex before copying the files in the case
of the trivial move. This path does not use the compaction job.
Test Plan: DBTest.LevelCompactionThirdPath
Reviewers: yhchiang, igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D30381
Summary:
Allow Level-style compaction to place files in different paths
This diff provides the code for task 4854591. We now support level-compaction
to place files in different paths by specifying them in db_paths along with
the minimum level for files to store in that path.
Test Plan: ManualLevelCompactionOutputPathId in db_test.cc
Reviewers: yhchiang, MarkCallaghan, dhruba, yoshinorim, sdong
Reviewed By: sdong
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D29799
Summary: Now DB::GetSnapshot() doesn't scale to more column families, as it needs to go through all the column families to find whether snapshot is supported. This patch optimizes it.
Test Plan:
Add unit tests to cover negative cases.
make all check
Reviewers: yhchiang, rven, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D30093
Summary: This is a regression bug introduced by https://reviews.facebook.net/D24729 . max_total_wal_size would be off the target it should be more and more in the case that the a user holds the current super version after flush or compaction. This patch fixes it
Test Plan: make all check
Reviewers: yhchiang, rven, igor
Reviewed By: igor
Subscribers: ljin, yoshinorim, MarkCallaghan, hermanlee4, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D29961
Summary:
Add a counter in SnapshotList to show number of snapshots. Also a unix timestamp in every snapshot.
Add two DB Properties to return number of snapshots and timestamp of the oldest one.
Test Plan: Add unit test checking
Reviewers: yhchiang, rven, igor
Reviewed By: igor
Subscribers: leveldb, dhruba, MarkCallaghan
Differential Revision: https://reviews.facebook.net/D29919
Summary:
Introduces a new class for managing write buffer memory across column
families. We supplement ColumnFamilyOptions::write_buffer_size with
ColumnFamilyOptions::write_buffer, a shared pointer to a WriteBuffer
instance that enforces memory limits before flushing out to disk.
Test Plan: Added SharedWriteBuffer unit test to db_test.cc
Reviewers: sdong, rven, ljin, igor
Reviewed By: igor
Subscribers: tnovak, yhchiang, dhruba, xjin, MarkCallaghan, yoshinorim
Differential Revision: https://reviews.facebook.net/D22581
Summary:
In some environment such as android, the c++ library does not have
std::to_string. This path adds rocksdb::ToString(), which wraps std::to_string
when std::to_string is not available, and implements std::to_string
in the other case.
Test Plan:
make dbg -j32
./db_test
make clean
make dbg OPT=-DOS_ANDROID -j32
./db_test
Reviewers: ljin, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D29181
Summary:
An entry of ConstantColumnFamilyInfo is created when:
1. DB::Open
2. CreateColumnFamily.
However, there are cases that DB::Open could also call CreateColumnFamily
when create_missing_column_families=true. As a result, it will create
duplicate ConstantColumnFamilyInfo and one of them would be leaked.
Test Plan: ./deletefile_test
Reviewers: igor, sdong, ljin
Reviewed By: ljin
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D29307
Summary:
Add enable_thread_tracking to DBOptions to allow
tracking thread status related to the DB. Default is off.
Test Plan:
export ROCKSDB_TESTS=ThreadList
./db_test
Reviewers: ljin, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D29289
Summary:
Add GetThreadList API, which allows developer to track the
status of each process. Currently, calling GetThreadList will
only get the list of background threads in RocksDB with their
thread-id and thread-type (priority) set. Will add more support
on this in the later diffs.
ThreadStatus currently has the following properties:
// An unique ID for the thread.
const uint64_t thread_id;
// The type of the thread, it could be ROCKSDB_HIGH_PRIORITY,
// ROCKSDB_LOW_PRIORITY, and USER_THREAD
const ThreadType thread_type;
// The name of the DB instance where the thread is currently
// involved with. It would be set to empty string if the thread
// does not involve in any DB operation.
const std::string db_name;
// The name of the column family where the thread is currently
// It would be set to empty string if the thread does not involve
// in any column family.
const std::string cf_name;
// The event that the current thread is involved.
// It would be set to empty string if the information about event
// is not currently available.
Test Plan:
./thread_list_test
export ROCKSDB_TESTS=GetThreadList
./db_test
Reviewers: rven, igor, sdong, ljin
Reviewed By: ljin
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D25047
Summary: This way we can gurantee that old MemTables get destructed before DBImpl gets destructed, which might be useful if we want to make them depend on state from DBImpl.
Test Plan: make check with asserts in JobContext's destructor
Reviewers: ljin, sdong, yhchiang, rven, jonahcohen
Reviewed By: jonahcohen
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D28959
Summary: As a short-term fix, let's go back to previous way of calculating NeedsCompaction(). SIGSEGV happens because NeedsCompaction() can happen before super_version (and thus MutableCFOptions) is initialized.
Test Plan: make check
Reviewers: ljin, sdong, rven, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D28875
Summary:
Move NeedsCompaction() from VersionStorageInfo to CompactionPicker
to allow different compaction strategy to have their own way to
determine whether doing compaction is necessary.
When compaction style is set to kCompactionStyleNone, then
NeedsCompaction() will always return false.
Test Plan:
export ROCKSDB_TESTS=Compact
./db_test
Reviewers: ljin, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D28719
Summary: So iOS size_t is 32-bit, so we need to static_cast<size_t> any uint64_t :(
Test Plan: TARGET_OS=IOS make static_lib
Reviewers: dhruba, ljin, yhchiang, rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D28743
Summary:
We need to turn on -Wshorten-64-to-32 for mobile. See D1671432 (internal phabricator) for details.
This diff turns on the warning flag and fixes all the errors. There were also some interesting errors that I might call bugs, especially in plain table. Going forward, I think it makes sense to have this flag turned on and be very very careful when converting 64-bit to 32-bit variables.
Test Plan: compiles
Reviewers: ljin, rven, yhchiang, sdong
Reviewed By: yhchiang
Subscribers: bobbaldwin, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D28689
Summary: I mistakenly changed the behavior to ++next_file_number_ instead of next_file_number_++, as it should have been: 344edbb044/db/version_set.h (L539)
Test Plan: none. not sure if this would break anything. It's just different behavior, so I'd rather not risk
Reviewers: ljin, rven, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D28557
Summary: Based on @sdong's feedback in the diff, we shouldn't keep db_mutex in CompactionJob's state. This diff removes db_mutex from CompactionJob state, by making next_file_number_ atomic. That way we only need to pass the lock to InstallCompactionResults() because of LogAndApply()
Test Plan: make check
Reviewers: ljin, yhchiang, rven, sdong
Reviewed By: sdong
Subscribers: sdong, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D28491
Summary:
This diff adds three sets of APIs to RocksDB.
= GetColumnFamilyMetaData =
* This APIs allow users to obtain the current state of a RocksDB instance on one column family.
* See GetColumnFamilyMetaData in include/rocksdb/db.h
= EventListener =
* A virtual class that allows users to implement a set of
call-back functions which will be called when specific
events of a RocksDB instance happens.
* To register EventListener, simply insert an EventListener to ColumnFamilyOptions::listeners
= CompactFiles =
* CompactFiles API inputs a set of file numbers and an output level, and RocksDB
will try to compact those files into the specified level.
= Example =
* Example code can be found in example/compact_files_example.cc, which implements
a simple external compactor using EventListener, GetColumnFamilyMetaData, and
CompactFiles API.
Test Plan:
listener_test
compactor_test
example/compact_files_example
export ROCKSDB_TESTS=CompactFiles
db_test
export ROCKSDB_TESTS=MetaData
db_test
Reviewers: ljin, igor, rven, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D24705
Summary:
Here's a prototype of redesigning pending_outputs_. This way, we don't have to expose pending_outputs_ to other classes (CompactionJob, FlushJob, MemtableList). DBImpl takes care of it.
Still have to write some comments, but should be good enough to start the discussion.
Test Plan: make check, will also run stress test
Reviewers: ljin, sdong, rven, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D28353
Summary: as title
Test Plan: ./db_test
Reviewers: sdong, yhchiang, rven, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D28269
Summary: as title
Test Plan: ran it
Reviewers: yhchiang, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D28311
Summary: Apply InfoLogLevel to the logs in db/db_impl.cc
Test Plan:
db_test
db_bench
Reviewers: ljin, sdong, igor
Reviewed By: igor
Subscribers: leveldb, MarkCallaghan, dhruba
Differential Revision: https://reviews.facebook.net/D28233
Summary: Enforce the accessier naming convention in functions in version_set.h
Test Plan: make all check
Reviewers: ljin, yhchiang, rven, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D28143
Summary:
With the patch, thread pool size will be automatically increased if DB's options ask for more parallelism of compactions or flushes.
Too many users have been confused by the API. Change it to make it harder for users to make mistakes
Test Plan: Add two unit tests to cover the function.
Reviewers: yhchiang, rven, igor, MarkCallaghan, ljin
Reviewed By: ljin
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D27555
Summary:
Long awaited CompactionJob class! Move most compaction-related things from DBImpl to CompactionJob, making CompactionJob easier to test and understand.
Currently this is just replicating exactly the same functionality with as little as change as possible. As future work, we should:
1. Add CompactionJob tests (I think I'll do that tomorrow)
2. Reduce CompactionJob's state that it inherits from DBImpl
3. Figure out how to do yielding to flush better. Currently I implemented a callback as we agreed yesterday, but I don't think it's a good long term solution.
This reduces db_impl.cc from 5000+ LOC to 3400!
Test Plan: make check, will add CompactionJob-specific tests, probably also move some tests from db_test to compaction_job_test
Reviewers: rven, yhchiang, sdong, ljin
Reviewed By: ljin
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D27957
Summary:
...and fix all the errors :)
Jim suggested turning on -Wshadow because it helped him fix number of critical bugs in fbcode. I think it's a good idea to be -Wshadow clean.
Test Plan: compiles
Reviewers: yhchiang, rven, sdong, ljin
Reviewed By: ljin
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D27711
Summary:
Rename Version::Builder to VersionBuilder and expose its definition to a header.
Make VerisonBuilder not reference Version or ColumnFamilyData, only working with VersionStorageInfo.
Add version_builder_test which has a simple test.
Test Plan: make all check
Reviewers: rven, yhchiang, igor, ljin
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D27969
Summary: Decoupling code that deals with archived log files outside of DBImpl. That will make this code easier to reason about and test. It will also make the code easier to improve, because an improver doesn't have to understand DBImpl code in entirety.
Test Plan: added test
Reviewers: ljin, yhchiang, rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D27873
Summary:
Make compaction picker easier to test.
The basic idea is to separate a minimum subcomponent of Version to VersionStorageInfo, which just responsible to LSM tree. A stub VersionStorageInfo can then be easily created and passed into compaction picker so that we can check the outputs.
It now passes most tests. Still two things need to be done:
(1) deal with the FIFO compaction's file size.
(2) write an example test to make sure the interface can do the job.
Add a compaction_picker_test to make sure compaction picker codes can be easily unit tested.
Test Plan:
Pass all unit tests and compaction_picker_test
Reviewers: yhchiang, rven, igor, ljin
Reviewed By: ljin
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D27639
Summary:
This diff has two fixes.
1. Fix the bug where compaction does not fail when RocksDB can't create a new file.
2. When NewWritableFiles() fails in OpenCompactionOutputFiles(), previously such fail-to-created file will be still be included as a compaction output. This patch also fixes this bug.
3. Allow VersionEdit::EncodeTo() to return Status and add basic check.
Test Plan:
./version_edit_test
export ROCKSDB_TESTS=FileCreationRandomFailure
./db_test
Reviewers: ljin, sdong, nkg-, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D25581