rocksdb

Commit Graph

Author	SHA1	Message	Date
sdong	603b6da8b8	Add options.compaction_measure_io_stats to print write I/O stats in compactions Summary: Add options.compaction_measure_io_stats to print out / pass to listener accumulated time spent on write calls. Example outputs in info logs: 2015/08/12-16:27:59.463944 7fd428bff700 (Original Log Time 2015/08/12-16:27:59.463922) EVENT_LOG_v1 {"time_micros": 1439422079463897, "job": 6, "event": "compaction_finished", "output_level": 1, "num_output_files": 4, "total_output_size": 6900525, "num_input_records": 111483, "num_output_records": 106877, "file_write_nanos": 15663206, "file_range_sync_nanos": 649588, "file_fsync_nanos": 349614797, "file_prepare_write_nanos": 1505812, "lsm_state": [2, 4, 0, 0, 0, 0, 0]} Add two more counters in iostats_context. Also add a parameter of db_bench. Test Plan: Add a unit test. Also manually verify LOG outputs in db_bench Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D44115	9 years ago
Ari Ekmekji	40c64434d4	Parallelize L0-L1 Compaction: Restructure Compaction Job Summary: As of now compactions involving files from Level 0 and Level 1 are single threaded because the files in L0, although sorted, are not range partitioned like the other levels. This means that during L0-L1 compaction each file from L1 needs to be merged with potentially all the files from L0. This attempt to parallelize the L0-L1 compaction assigns a thread and a corresponding iterator to each L1 file that then considers only the key range found in that L1 file and only the L0 files that have those keys (and only the specific portion of those L0 files in which those keys are found). In this way the overlap is minimized and potentially eliminated between different iterators focusing on the same files. The first step is to restructure the compaction logic to break L0-L1 compactions into multiple, smaller, sequential compactions. Eventually each of these smaller jobs will be run simultaneously. Areas to pay extra attention to are # Correct aggregation of compaction job statistics across multiple threads # Proper opening/closing of output files (make sure each thread's is unique) # Keys that span multiple L1 files # Skewed distributions of keys within L0 files Test Plan: Make and run db_test (newer version has separate compaction tests) and compaction_job_stats_test Reviewers: igor, noetzli, anthony, sdong, yhchiang Reviewed By: yhchiang Subscribers: MarkCallaghan, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D42699	9 years ago
Igor Canadi	35ca59364c	Don't let flushes preempt compactions Summary: When we first started, max_background_flushes was 0 by default and compaction thread was executing flushes (since there was no flush thread). Then, we switched the default max_background_flushes to 1. However, we still support the case where there is no flush thread and flushes are done in compaction. This is making our code a bit more complicated. By not supporting this use-case we can make our code simpler. We have a special case that when you set max_background_flushes to 0, we schedule the flush to execute on the compaction thread. Test Plan: make check (there might be some unit tests that depend on this behavior) Reviewers: IslamAbdelRahman, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D41931	9 years ago
Igor Canadi	a96fcd09b7	Deprecate CompactionFilterV2 Summary: It has been around for a while and it looks like it never found any uses in the wild. It's also complicating our compaction_job code quite a bit. We're deprecating it in 3.13, but will put it back in 3.14 if we actually find users that need this feature. Test Plan: make check Reviewers: noetzli, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D42405	9 years ago
Andres Notzli	6b2d44b2ff	Refactoring of writing key/value pairs Summary: Before, writing key/value pairs out to files was done inside ProcessKeyValueCompaction(). To make ProcessKeyValueCompaction() more understandable, this patch moves the writing part to a separate function. This is intended to be a stepping stone for additional changes. Test Plan: make && make check Reviewers: sdong, rven, yhchiang, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D42243	9 years ago
Andres Notzli	ab137af4ba	Partial cleanup of CompactionJob Summary: Logging, dealing with key prefix batches and updating stats moved from CompactionJob::Run() into separate functions. Test Plan: make all && make check Reviewers: sdong, rven, yhchiang, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D41919	9 years ago
Yueh-Hsuan Chiang	bb1c74ce18	Fixed a bug of CompactionStats in multi-level universal compaction case Summary: Universal compaction can involve multiple levels. However, the current implementation of bytes_readn and bytes_readnp1 (and some other stats with postfix `n` and `np1`) assumes compaction can only have two levels. This patch fixes this bug and redefines bytes_readn and bytes_readnp1: * bytes_readnp1: the number of bytes read in the compaction output level. * bytes_readn: the total number of bytes read minus bytes_readnp1 Test Plan: Add a test in compaction_job_stats_test Reviewers: igor, sdong, rven, anthony, kradhakrishnan, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D40239	10 years ago
Yueh-Hsuan Chiang	fe5c6321cb	Allow EventListener::OnCompactionCompleted to return CompactionJobStats. Summary: Allow EventListener::OnCompactionCompleted to return CompactionJobStats, which contains useful information about a compaction. Example CompactionJobStats returned by OnCompactionCompleted(): smallest_output_key_prefix 05000000 largest_output_key_prefix 06990000 elapsed_time 42419 num_input_records 300 num_input_files 3 num_input_files_at_output_level 2 num_output_records 200 num_output_files 1 actual_bytes_input 167200 actual_bytes_output 110688 total_input_raw_key_bytes 5400 total_input_raw_value_bytes 300000 num_records_replaced 100 is_manual_compaction 1 Test Plan: Developed a mega test in db_test which covers 20 variables in CompactionJobStats. Reviewers: rven, igor, anthony, sdong Reviewed By: sdong Subscribers: tnovak, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38463	10 years ago
Yueh-Hsuan Chiang	fc83821270	Add EventListener::OnTableFileCreated() Summary: Add EventListener::OnTableFileCreated(), which will be called when a table file is created. This patch is part of the EventLogger and EventListener integration. Test Plan: Augment existing test in db/listener_test.cc Reviewers: anthony, kradhakrishnan, rven, igor, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D38865	10 years ago
Yueh-Hsuan Chiang	77a5a543a5	Allow GetThreadList() to report basic compaction operation properties. Summary: Now we're able to show more details about a compaction in GetThreadList() :) This patch allows GetThreadList() to report basic compaction operation properties. Basic compaction properties include: 1. job id 2. compaction input / output level 3. compaction property flags (is_manual, is_deletion, etc.) 4. total input bytes 5. the number of bytes read so far. 6. the number of bytes written so far. Flush operation properties will be done in a separate diff. Test Plan: /db_bench --threads=30 --num=1000000 --benchmarks=fillrandom --thread_status_per_interval=1 Sample output of tracking same job: ThreadID ThreadType cfName Operation ElapsedTime Stage State OperationProperties 140664171987072 Low Pri default Compaction 31.357 ms CompactionJob::FinishCompactionOutputFile BaseInputLevel 1 | BytesRead 2264663 | BytesWritten 1934241 | IsDeletion 0 | IsManual 0 | IsTrivialMove 0 | JobID 277 | OutputLevel 2 | TotalInputBytes 3964158 | ThreadID ThreadType cfName Operation ElapsedTime Stage State OperationProperties 140664171987072 Low Pri default Compaction 59.440 ms CompactionJob::FinishCompactionOutputFile BaseInputLevel 1 | BytesRead 2264663 | BytesWritten 1934241 | IsDeletion 0 | IsManual 0 | IsTrivialMove 0 | JobID 277 | OutputLevel 2 | TotalInputBytes 3964158 | ThreadID ThreadType cfName Operation ElapsedTime Stage State OperationProperties 140664171987072 Low Pri default Compaction 226.375 ms CompactionJob::Install BaseInputLevel 1 | BytesRead 3958013 | BytesWritten 3621940 | IsDeletion 0 | IsManual 0 | IsTrivialMove 0 | JobID 277 | OutputLevel 2 | TotalInputBytes 3964158 | Reviewers: sdong, rven, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D37653	10 years ago
Igor Canadi	65fe1cfbb3	Cleanup CompactionJob Summary: Couple changes: 1. instead of SnapshotList, just take a vector of snapshots 2. don't take a separate parameter is_snapshots_supported. If there are snapshots in the list, that means they are supported. I actually think we should get rid of this notion of snapshots not being supported. 3. don't pass in mutable_cf_options as a parameter. Lifetime of mutable_cf_options is a bit tricky to maintain, so it's better to not pass it in for the whole compaction job. We only really need it when we install the compaction results. Test Plan: make check Reviewers: sdong, rven, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D36627	10 years ago
Igor Canadi	1bb4928da9	Include bunch of more events into EventLogger Summary: Added these events: * Recovery start, finish and also when recovery creates a file * Trivial move * Compaction start, finish and when compaction creates a file * Flush start, finish Also includes small fix to EventLogger Also added option ROCKSDB_PRINT_EVENTS_TO_STDOUT which is useful when we debug things. I've spent far too much time chasing LOG files. Still didn't get sst table properties in JSON. They are written very deeply into the stack. I'll address in separate diff. TODO: * Write specification. Let's first use this for a while and figure out what's good data to put here, too. After that we'll write spec * Write tools that parse and analyze LOGs. This can be in python or go. Good intern task. Test Plan: Ran db_bench with ROCKSDB_PRINT_EVENTS_TO_STDOUT. Here's the output: https://phabricator.fb.com/P19811976 Reviewers: sdong, yhchiang, rven, MarkCallaghan, kradhakrishnan, anthony Reviewed By: anthony Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D37521	10 years ago
Anurag Indu	3d1a924ff3	Adding stats for the merge and filter operation Summary: We have added new stats and perf_context for measuring the merge and filter operation time consumption. We have bounded all the merge operations within the GUARD statement and collected the total time for these operations in the DB. Test Plan: WIP Reviewers: rven, yhchiang, kradhakrishnan, igor, sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D34377	10 years ago
Yueh-Hsuan Chiang	c594b0e89d	Allow GetThreadList() to report operation stage. Summary: Allow GetThreadList() to report operation stage. Test Plan: ./thread_list_test ./db_bench --benchmarks=fillrandom --num=100000 --threads=40 \ --max_background_compactions=10 --max_background_flushes=3 \ --thread_status_per_interval=1000 --key_size=16 --value_size=1000 \ --num_column_families=10 export ROCKSDB_TESTS=ThreadStatus ./db_test Sample output ThreadID ThreadType cfName Operation OP_StartTime ElapsedTime Stage State 140116265861184 Low Pri 140116270055488 Low Pri 140116274249792 High Pri column_family_name_000005 Flush 2015/03/10-14:58:11 0 us FlushJob::WriteLevel0Table 140116400078912 Low Pri column_family_name_000004 Compaction 2015/03/10-14:58:11 0 us CompactionJob::FinishCompactionOutputFile 140116358135872 Low Pri column_family_name_000006 Compaction 2015/03/10-14:58:10 1 us CompactionJob::FinishCompactionOutputFile 140116341358656 Low Pri 140116295221312 High Pri default Flush 2015/03/10-14:58:11 0 us FlushJob::WriteLevel0Table 140116324581440 Low Pri column_family_name_000009 Compaction 2015/03/10-14:58:11 0 us CompactionJob::ProcessKeyValueCompaction 140116278444096 Low Pri 140116299415616 Low Pri column_family_name_000008 Compaction 2015/03/10-14:58:11 0 us CompactionJob::FinishCompactionOutputFile 140116291027008 High Pri column_family_name_000001 Flush 2015/03/10-14:58:11 0 us FlushJob::WriteLevel0Table 140116286832704 Low Pri column_family_name_000002 Compaction 2015/03/10-14:58:11 0 us CompactionJob::FinishCompactionOutputFile 140116282638400 Low Pri Reviewers: rven, igor, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D34683	10 years ago
Igor Canadi	e7ea51a8e7	Introduce job_id for flush and compaction Summary: It would be good to assign background jobs their IDs. Two benefits: 1) makes LOGs more readable 2) I might use it in my EventLogger, which will try to make our LOG easier to read/query/visualize Test Plan: ran rocksdb, read the LOG Reviewers: sdong, rven, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D31617	10 years ago
Yueh-Hsuan Chiang	181191a1e4	Add a counter for collecting the wait time on db mutex. Summary: Add a counter for collecting the wait time on db mutex. Also add MutexWrapper and CondVarWrapper for measuring wait time. Test Plan: ./db_test export ROCKSDB_TESTS=MutexWaitStats ./db_test verify stats output using db_bench make clean make release ./db_bench --statistics=1 --benchmarks=fillseq,readwhilewriting --num=10000 --threads=10 Sample output: rocksdb.db.mutex.wait.micros COUNT : 7546866 Reviewers: MarkCallaghan, rven, sdong, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D32787	10 years ago
sdong	d888c95748	Sync WAL Directory and DB Path if different from DB directory Summary: 1. If the WAL directory is different from the db directory, sync the directory after creating a log file under it. 2. After creating an SST file, sync its parent directory instead of the DB directory. 3. Change the check of kResetDeleteUnsyncedFiles in fault_injection_test. Since we changed the behavior to sync log files' parent directory after first WAL sync, instead of creating, kResetDeleteUnsyncedFiles will not guarantee to show post sync updates. Test Plan: make all check Reviewers: yhchiang, rven, igor Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D32067	10 years ago
Igor Canadi	4a3bd2bad2	Optimize usage of Status in CompactionJob Summary: Based on @ljin feedback Test Plan: compiles Reviewers: ljin, yhchiang, sdong Reviewed By: sdong Subscribers: ljin, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D28515	10 years ago
Igor Canadi	e3d3567b5b	Get rid of mutex in CompactionJob's state Summary: Based on @sdong's feedback in the diff, we shouldn't keep db_mutex in CompactionJob's state. This diff removes db_mutex from CompactionJob state, by making next_file_number_ atomic. That way we only need to pass the lock to InstallCompactionResults() because of LogAndApply() Test Plan: make check Reviewers: ljin, yhchiang, rven, sdong Reviewed By: sdong Subscribers: sdong, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D28491	10 years ago
Igor Canadi	53af5d877d	Redesign pending_outputs_ Summary: Here's a prototype of redesigning pending_outputs_. This way, we don't have to expose pending_outputs_ to other classes (CompactionJob, FlushJob, MemtableList). DBImpl takes care of it. Still have to write some comments, but should be good enough to start the discussion. Test Plan: make check, will also run stress test Reviewers: ljin, sdong, rven, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D28353	10 years ago
sdong	2ea1219eb6	Fix RecordIn and RecordDrop stats Summary: 1. fix possible overflow of the two stats by using uint64_t 2. use a similar source of data to calculate RecordDrop. Previous one is not correct. Test Plan: See outputs of db_bench settings, and the results look reasonable Reviewers: MarkCallaghan, ljin, igor Reviewed By: igor Subscribers: rven, leveldb, yhchiang, dhruba Differential Revision: https://reviews.facebook.net/D28155	10 years ago
Igor Canadi	74eb4fbe93	CompactionJob Summary: Long awaited CompactionJob class! Move most compaction-related things from DBImpl to CompactionJob, making CompactionJob easier to test and understand. Currently this is just replicating exactly the same functionality with as little change as possible. As future work, we should: 1. Add CompactionJob tests (I think I'll do that tomorrow) 2. Reduce CompactionJob's state that it inherits from DBImpl 3. Figure out how to do yielding to flush better. Currently I implemented a callback as we agreed yesterday, but I don't think it's a good long term solution. This reduces db_impl.cc from 5000+ LOC to 3400! Test Plan: make check, will add CompactionJob-specific tests, probably also move some tests from db_test to compaction_job_test Reviewers: rven, yhchiang, sdong, ljin Reviewed By: ljin Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D27957	10 years ago
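
The EVENT_LOG_v1 line quoted in commit 603b6da8b8 carries a JSON payload after the marker, and commit 1bb4928da9 mentions tools that parse such LOG lines as future work. As a rough illustration only, the sketch below extracts that payload in Python and converts the write I/O counters to milliseconds. The helper names `parse_event_line` and `write_io_millis` are hypothetical, not part of RocksDB, and the sketch assumes each event sits on a single line with the JSON object following the `EVENT_LOG_v1` marker, as in the sample.

```python
import json

# Sample line copied from the commit 603b6da8b8 message in the table above.
SAMPLE = (
    '2015/08/12-16:27:59.463944 7fd428bff700 '
    '(Original Log Time 2015/08/12-16:27:59.463922) EVENT_LOG_v1 '
    '{"time_micros": 1439422079463897, "job": 6, '
    '"event": "compaction_finished", "output_level": 1, '
    '"num_output_files": 4, "total_output_size": 6900525, '
    '"num_input_records": 111483, "num_output_records": 106877, '
    '"file_write_nanos": 15663206, "file_range_sync_nanos": 649588, '
    '"file_fsync_nanos": 349614797, "file_prepare_write_nanos": 1505812, '
    '"lsm_state": [2, 4, 0, 0, 0, 0, 0]}'
)

MARKER = "EVENT_LOG_v1"

# Counter names as they appear in the sample output.
WRITE_IO_KEYS = (
    "file_write_nanos",
    "file_range_sync_nanos",
    "file_fsync_nanos",
    "file_prepare_write_nanos",
)


def parse_event_line(line):
    """Hypothetical helper: return the JSON payload of an event line as a
    dict, or None if the line has no EVENT_LOG_v1 marker."""
    _, marker, payload = line.partition(MARKER)
    if not marker:
        return None
    return json.loads(payload.strip())


def write_io_millis(event):
    """Hypothetical helper: map each write I/O counter present in the event
    from nanoseconds to milliseconds."""
    return {key: event[key] / 1e6 for key in WRITE_IO_KEYS if key in event}


if __name__ == "__main__":
    event = parse_event_line(SAMPLE)
    print(event["event"], "job", event["job"])
    for name, millis in write_io_millis(event).items():
        print(f"{name}: {millis:.3f} ms")
```

In this sample, `file_fsync_nanos` is far larger than the other three counters, which is the kind of breakdown the new option is meant to expose.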

22 Commits (603b6da8b8fb17d0d5a02668c306787c07f7bae8)