Summary:
MarkLogsSynced() was doing `logs_.erase(it++);`. The standard says:
```
all iterators and references are invalidated, unless the erased members are at an end (front or back) of the deque (in which case only iterators and references to the erased members are invalidated)
```
Because `it` is an iterator to the first element of the container, it is
invalidated by the erase. As a result, only one iteration is executed,
`log.getting_synced = false;` is never reached, and the
`while (logs_.front().getting_synced)` loop in `WriteImpl()` never terminates.
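A minimal standalone sketch of the bug and the fix; the `Log` struct and the
erase condition here are simplified stand-ins for the actual `logs_`
bookkeeping in MarkLogsSynced():
```
#include <deque>

struct Log {
  bool synced = false;
  bool getting_synced = true;
};

int main() {
  std::deque<Log> logs(3);
  logs[0].synced = logs[1].synced = true;

  // Buggy pattern: `logs.erase(it++)` erases the front element, which
  // invalidates `it` per the deque rules quoted above, so the
  // post-increment hands back an invalid iterator:
  //
  //   if (it->synced) logs.erase(it++);  // UB after the first erase
  //
  // Fixed pattern: use the iterator returned by erase().
  for (auto it = logs.begin(); it != logs.end();) {
    if (it->synced) {
      it = logs.erase(it);  // valid iterator to the next element
    } else {
      it->getting_synced = false;
      ++it;
    }
  }
  return 0;
}
```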
Test Plan: make db_bench && ./db_bench --benchmarks=fillsync
Reviewers: igor, rven, IslamAbdelRahman, anthony, kradhakrishnan, yhchiang, sdong, tnovak
Reviewed By: tnovak
Subscribers: kolmike, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D45807
Summary:
Just realized that after D45675, part of the code in
DBTest.ApproximateMemoryUsage does not really test anything anymore, so I
removed it.
Test Plan: make clean all check
Reviewers: rven, igor, sdong, anthony, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D45783
Summary:
The immutable memtable iterators are allocated from an arena, so there
is no benefit in deleting them. Also, the immutable memtables
themselves stay in memory as long as the version set containing them is
alive. We therefore do not remove immutable memtable iterators that are
over the upper bound. We now add immutable iterators to the test.
Test Plan: db_tailing_iter_test.TailingIteratorTrimSeekToNext
Reviewers: tnovak, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D45597
Summary: Add ZSTD compression type. The same way as adding LZ4.
Test Plan: run all tests. Generate files in db_bench. Make sure reads succeed. But the SST files cannot be opened in older versions. Also some other adhoc tests.
Reviewers: rven, anthony, IslamAbdelRahman, kradhakrishnan, igor
Reviewed By: igor
Subscribers: MarkCallaghan, maykov, yoshinorim, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D45747
Summary:
This patch fixes two issues in DBTest.ApproximateMemoryUsage:
- It was possible that a flush happened between getting the two properties in
Phase 1, resulting in different numbers for the properties and failing the
assertion. This is fixed by waiting for the flush to finish before getting
the properties.
- There was a similar issue in Phase 2 and additionally there was an issue that
rocksdb.size-all-mem-tables was not monotonically increasing because it was
possible that a flush happened just after getting the properties and then
another flush just before getting the properties in the next round. In this
situation, the reported memory usage decreased. This is fixed by forcing a
flush before getting the properties.
Note: during testing, I found that kFlushesPerRound does not seem very
accurate. I added a TODO for this and it would be great to get some input on
what to do there.
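A minimal sketch of the fixed pattern, using only the public RocksDB API;
the property names are borrowed from the D44229 entry later in this log, and
`FlushOptions::wait` stands in for the test harness's wait-for-flush helper:
```
#include <cstdint>
#include "rocksdb/db.h"

// Sketch: force any pending flush to complete before reading the
// properties, so both reads observe the same memtable state and
// "rocksdb.size-all-mem-tables" can only grow between rounds.
void MeasureMemtableMemory(rocksdb::DB* db, uint64_t* unflushed,
                           uint64_t* all) {
  rocksdb::FlushOptions fo;
  fo.wait = true;  // block until the flush has finished
  db->Flush(fo);
  db->GetIntProperty("rocksdb.cur-size-all-mem-tables", unflushed);
  db->GetIntProperty("rocksdb.size-all-mem-tables", all);
}
```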
Test Plan:
The first issue can be made more likely to trigger by inserting a
`usleep(10000);` between the calls to GetIntProperty() in Phase 1.
The second issue can be made more likely to trigger by inserting a
`if (r != 0) usleep(10000);` before the calls to GetIntProperty() and a
`usleep(10000);` after the calls.
Then execute make db_test && ./db_test --gtest_filter=DBTest.ApproximateMemoryUsage
Reviewers: rven, yhchiang, igor, sdong, anthony
Reviewed By: anthony
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D45675
Summary:
Add argument --show_table_properties to db_bench:
-show_table_properties (If true, then per-level table properties will be
printed on every stats interval when stats_interval is set and
stats_per_interval is on.) type: bool, default: false
Test Plan:
./db_bench --show_table_properties=1 --stats_interval=100000 --stats_per_interval=1
./db_bench --show_table_properties=1 --stats_interval=100000 --stats_per_interval=1 --num_column_families=2
Sample Output:
```
Compaction Stats [column_family_name_000001]
Level Files Size(MB) Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) Stall(cnt) KeyIn KeyDrop
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
L0 3/0 5 0.8 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 86.3 0 17 0.021 0 0 0
L1 5/0 9 0.9 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.000 0 0 0
L2 9/0 16 0.2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0 0.000 0 0 0
Sum 17/0 31 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 86.3 0 17 0.021 0 0 0
Int 0/0 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 83.9 0 2 0.022 0 0 0
Flush(GB): cumulative 0.030, interval 0.004
Stalls(count): 0 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 0 leveln_slowdown_soft, 0 leveln_slowdown_hard
Level[0]: # data blocks=2571; # entries=84813; raw key size=2035512; raw average key size=24.000000; raw value size=8481300; raw average value size=100.000000; data block size=5690119; index block size=82415; filter block size=0; (estimated) table size=5772534; filter policy name=N/A;
Level[1]: # data blocks=4285; # entries=141355; raw key size=3392520; raw average key size=24.000000; raw value size=14135500; raw average value size=100.000000; data block size=9487353; index block size=137377; filter block size=0; (estimated) table size=9624730; filter policy name=N/A;
Level[2]: # data blocks=7713; # entries=254439; raw key size=6106536; raw average key size=24.000000; raw value size=25443900; raw average value size=100.000000; data block size=17077893; index block size=247269; filter block size=0; (estimated) table size=17325162; filter policy name=N/A;
Level[3]: # data blocks=0; # entries=0; raw key size=0; raw average key size=0.000000; raw value size=0; raw average value size=0.000000; data block size=0; index block size=0; filter block size=0; (estimated) table size=0; filter policy name=N/A;
Level[4]: # data blocks=0; # entries=0; raw key size=0; raw average key size=0.000000; raw value size=0; raw average value size=0.000000; data block size=0; index block size=0; filter block size=0; (estimated) table size=0; filter policy name=N/A;
Level[5]: # data blocks=0; # entries=0; raw key size=0; raw average key size=0.000000; raw value size=0; raw average value size=0.000000; data block size=0; index block size=0; filter block size=0; (estimated) table size=0; filter policy name=N/A;
Level[6]: # data blocks=0; # entries=0; raw key size=0; raw average key size=0.000000; raw value size=0; raw average value size=0.000000; data block size=0; index block size=0; filter block size=0; (estimated) table size=0; filter policy name=N/A;
```
Reviewers: anthony, IslamAbdelRahman, MarkCallaghan, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D45651
Summary:
ReadaheadRandomAccessFile acts as a transparent layer on top of RandomAccessFile. When a Read() request is issued, it issues a much bigger request to the OS and caches the result. When a new request comes in and we already have the data cached, it doesn't have to issue any requests to the OS.
We add the ReadaheadRandomAccessFile layer only when a file is read during compactions.
D45105 was incorrectly closed by Phabricator because I committed it to a separate branch (not master), so I'm resubmitting the diff.
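A simplified sketch of the idea; the real ReadaheadRandomAccessFile also
deals with alignment, partial reads and error paths, so treat this as an
illustration rather than the shipped implementation:
```
#include <algorithm>
#include <cstring>
#include <memory>
#include "rocksdb/env.h"

// Hypothetical, simplified readahead wrapper over RandomAccessFile.
class ReadaheadSketch : public rocksdb::RandomAccessFile {
 public:
  ReadaheadSketch(std::unique_ptr<rocksdb::RandomAccessFile> file,
                  size_t readahead_size)
      : file_(std::move(file)),
        readahead_size_(readahead_size),
        buffer_(new char[readahead_size]) {}

  rocksdb::Status Read(uint64_t offset, size_t n, rocksdb::Slice* result,
                       char* scratch) const override {
    // Cache hit: the requested range is inside the buffered range.
    if (offset >= buffer_offset_ &&
        offset + n <= buffer_offset_ + buffer_len_) {
      std::memcpy(scratch, buffer_.get() + (offset - buffer_offset_), n);
      *result = rocksdb::Slice(scratch, n);
      return rocksdb::Status::OK();
    }
    // Cache miss: issue one big read to the OS and remember the result.
    rocksdb::Slice big;
    rocksdb::Status s = file_->Read(offset, std::max(n, readahead_size_),
                                    &big, buffer_.get());
    if (!s.ok()) return s;
    buffer_offset_ = offset;
    buffer_len_ = big.size();
    size_t copied = std::min(n, buffer_len_);
    std::memcpy(scratch, big.data(), copied);
    *result = rocksdb::Slice(scratch, copied);
    return s;
  }

 private:
  std::unique_ptr<rocksdb::RandomAccessFile> file_;
  const size_t readahead_size_;
  std::unique_ptr<char[]> buffer_;
  mutable uint64_t buffer_offset_ = 0;
  mutable size_t buffer_len_ = 0;
};
```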
Test Plan: make check
Reviewers: MarkCallaghan, sdong
Reviewed By: sdong
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D45123
Summary:
When DBIter changes its iterating direction from forward to backward, it might see some much larger keys with higher sequence IDs. With this commit, these entries are actively filtered out. It should fix the existing disabled tests in db_iter_test.
This may not be a perfect fix, but it introduces the least impact on the existing code, in order to be safe.
Test Plan:
Enable existing tests and make sure they pass. Add a new test DBIterWithMergeIterTest.InnerMergeIteratorDataRace8.
Also run all existing tests.
Reviewers: yhchiang, rven, anthony, IslamAbdelRahman, kradhakrishnan, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D45567
Summary:
DBTest.GetProperty was failing occasionally (see task #8131266). The reason was
that the test closed the database before the compaction was done. When the test
reopened the database, RocksDB would schedule a compaction which in turn
created table readers and led the test to fail the assertion that
rocksdb.estimate-table-readers-mem is 0. In most cases, GetIntProperty() of
rocksdb.estimate-table-readers-mem happened before the compaction created the
table readers, hiding the problem. This patch changes the
WaitForFlushMemTable() call to WaitForCompact(). WaitForFlushMemTable() is not
necessary because it is already being called a couple of lines before without
any insertions in-between.
Test Plan:
Insert `usleep(10000);` just after `Reopen(options);` on line 2333 to make the issue more likely, then run:
make db_test && while ./db_test --gtest_filter=DBTest.GetProperty; do true; done
Reviewers: rven, yhchiang, anthony, igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D45603
Summary:
It was pointed out to me that the members of SubCompactionState
'earliest_snapshot', 'latest_snapshot' and 'visible_at_tip' are never
modified by the subcompactions, so they can remain global variables
instead, to keep things simpler.
Test Plan: make all && make check
Reviewers: sdong, igor, noetzli, anthony, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D45477
Summary: There was a bad merge during refresh.
Test Plan: make -j all; make check
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D45555
Summary:
We earlier added a feature to delete file iterators when the
current key is over the iterate upper bound. We now add a whitebox test
to check if the file iterators were actually deleted.
Test Plan: Add check for a range which has deleted iterators.
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D45321
Summary:
After deleting file iterators which are over the iterate upper
bound, we also need to check for null pointers in
ResetIncompleteIterators.
Test Plan: db_tailing_iter_test.TailingIteratorTrimSeekToNext
Reviewers: tnovak, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D45525
Summary:
See task #7983654. The example was triggering an assert in compaction job
because the compaction was not marked as manual. With this patch,
CompactionPicker::FormCompaction() marks compactions as manual. This patch
also fixes a couple of typos, adds optimistic_transaction_example to
.gitignore and librocksdb as a dependency for examples. Adding librocksdb as
a dependency makes sure that the examples are built with the latest changes
in librocksdb.
Test Plan: make clean && cd examples && make all && ./compact_files_example
Reviewers: rven, sdong, anthony, igor, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D45117
Summary:
This patch adds "rocksdb.aggregated-table-properties"
and "rocksdb.aggregated-table-properties-at-levelN": the former
returns the aggregated table properties of a column family,
while the latter returns the aggregated table properties
of the specified level N.
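A minimal usage sketch, assuming an already-open `rocksdb::DB* db` (both
property names come from the summary above):
```
#include <iostream>
#include <string>
#include "rocksdb/db.h"

void PrintAggregatedProperties(rocksdb::DB* db) {
  std::string props;
  // Aggregated over the whole column family.
  if (db->GetProperty("rocksdb.aggregated-table-properties", &props)) {
    std::cout << props << std::endl;
  }
  // Aggregated over a single level; the level number replaces the N.
  if (db->GetProperty("rocksdb.aggregated-table-properties-at-level0",
                      &props)) {
    std::cout << props << std::endl;
  }
}
```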
Test Plan: Added tests in db_test
Reviewers: igor, sdong, IslamAbdelRahman, anthony
Reviewed By: anthony
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D45087
Summary:
This patch fixes a race condition in DBTest.DynamicMemtableOptions. In rare cases,
it was possible that the main thread would fill up both memtables before the flush
job acquired its work. Then, the flush job was flushing both memtables together,
producing only one L0 file while the test expected two. Now, the test waits for
flushes to finish earlier, to make sure that the memtables are flushed in separate
flush jobs.
Test Plan:
Insert "usleep(10000);" after "IOSTATS_SET_THREAD_POOL_ID(Env::Priority::HIGH);" in BGWorkFlush()
to make the issue more likely. Then test with:
make db_test && time while ./db_test --gtest_filter=*DynamicMemtableOptions; do true; done
Reviewers: rven, sdong, yhchiang, anthony, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D45429
Summary: As title
Test Plan: make check
Reviewers: yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D45447
Summary:
Currently, we only purge duplicate keys and deletions during flush if `earliest_seqno_in_memtable <= newest_snapshot`. This means that the newest snapshot happened before we first created the memtable. This is almost never true for MyRocks and MongoRocks.
This patch makes purging during flush able to understand snapshots. The main logic is copied from compaction_job.cc, although the logic over there is much more complicated and extensive. However, we should try to merge the common functionality at some point.
I need this patch to implement no_overwrite_i_promise functionality for flush. We'll also need this to support SingleDelete() during Flush(). @yoshinorim requested the feature.
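A rough sketch of the snapshot-aware purge rule; the helper names are
hypothetical, and the real logic (shared in spirit with compaction_job.cc)
also handles merges and deletion markers:
```
#include <cstdint>
#include <vector>

// Returns the earliest snapshot >= seq, i.e. the snapshot "bucket" this
// key version belongs to (UINT64_MAX if only future readers can see it).
uint64_t EarliestVisibleSnapshot(uint64_t seq,
                                 const std::vector<uint64_t>& snapshots) {
  for (uint64_t snap : snapshots) {  // snapshots sorted ascending
    if (snap >= seq) return snap;
  }
  return UINT64_MAX;
}

// Given two versions of the same user key with newer_seq > older_seq, the
// older version can be purged iff no snapshot separates the two, i.e. both
// fall into the same snapshot bucket: no reader can ever observe it.
bool CanPurgeOlderVersion(uint64_t newer_seq, uint64_t older_seq,
                          const std::vector<uint64_t>& snapshots) {
  return EarliestVisibleSnapshot(newer_seq, snapshots) ==
         EarliestVisibleSnapshot(older_seq, snapshots);
}
```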
Test Plan:
make check
I had to adjust some unit tests to account for this new behavior
Reviewers: yhchiang, yoshinorim, anthony, sdong, noetzli
Reviewed By: noetzli
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D42087
Summary:
Up until this point we had DBOptions.num_subcompactions, but
it is semantically more correct to call it max_subcompactions, since
we will schedule *up to* DBOptions.max_subcompactions smaller compactions
at a time during a compaction job.
I also added a --subcompactions option to db_bench
Test Plan: make all && make check
Reviewers: sdong, igor, anthony, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D45069
Summary: Add more test cases of data races causing wrong iterating results. Tag tests that do not pass as DISABLED_.
Test Plan: Run the tests
Reviewers: igor, rven, IslamAbdelRahman, anthony, kradhakrishnan, yhchiang
Reviewed By: yhchiang
Subscribers: tnovak, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D44907
Summary: Currently, compaction inputs share the same file descriptor and table reader as other foreground threads. This makes fadvise work less predictably. Add options.new_table_reader_for_compaction_inputs to enforce creating a new file descriptor and a new table reader for compaction inputs.
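For reference, a minimal sketch of enabling the behavior (the option name
comes from the summary above):
```
#include "rocksdb/options.h"

rocksdb::Options MakeOptions() {
  rocksdb::Options options;
  // Give compactions their own file descriptors and table readers, so
  // fadvise hints from compaction reads do not affect foreground reads.
  options.new_table_reader_for_compaction_inputs = true;
  return options;
}
```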
Test Plan: Add the option.
Reviewers: rven, anthony, kradhakrishnan, IslamAbdelRahman, igor, yhchiang
Reviewed By: igor
Subscribers: igor, MarkCallaghan, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D43311
Summary:
Add a counter of estimated bytes the DB needs to compact for all the compactions to finish. Expose it as a DB Property.
In the future, we can use a threshold on this counter to replace the soft and hard rate limits. A single threshold on the estimated compaction debt in bytes will be easier for users to reason about, when deciding whether to slow down or stop writes, than the more abstract soft and hard rate limits.
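A usage sketch; the summary above does not name the property, so the string
below is an assumption based on the name RocksDB ended up exposing:
```
#include <cstdint>
#include "rocksdb/db.h"

// Assumed property name (not stated in the summary above).
uint64_t PendingCompactionBytes(rocksdb::DB* db) {
  uint64_t bytes = 0;
  db->GetIntProperty("rocksdb.estimate-pending-compaction-bytes", &bytes);
  return bytes;  // estimated bytes compactions still need to rewrite
}
```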
Test Plan: Add unit tests
Reviewers: IslamAbdelRahman, yhchiang, rven, kradhakrishnan, anthony, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D44205
Summary:
Currently, ThreadStatusFlush uses two sync-points to ensure
there's a flush currently running when calling GetThreadList().
However, one of the sync-points is inside the db mutex, which could
cause a deadlock when there is a concurrent DB::Get() call.
This patch fixes the issue by moving the sync-point to a better
place, where the flush job does not hold the mutex.
Test Plan: db_test
Reviewers: igor, sdong, anthony, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D45045
Summary: Add a new DB property that calculates the total size of files used by all RocksDB Versions
Test Plan: Unit tests for the new property
Reviewers: igor, yhchiang, anthony, rven, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D44799
Summary: Removing two unused variables that prevented compilation.
Test Plan: make all
Reviewers: rven, sdong, yhchiang, anthony, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D44991
Summary:
This diff improves the memory utilization of tailing iterators in RocksDB
by freeing file iterators which are over the upper bound.
It is an update of Siying's original diff for improving the memory usage of
tailing iterators. The changes for the seek and next paths are now complete,
and a test has been added to exercise these paths while deleting file iterators
which are above the upper bound.
Test Plan: db_tailing_iter_test.TailingIteratorTrimSeekToNext
Reviewers: march, tnovak, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D43833
Summary: Update DestroyDB so that all SST files in the first path id go through DeleteScheduler instead of being deleted immediately
Test Plan: added a unittest
Reviewers: igor, yhchiang, anthony, kradhakrishnan, rven, sdong
Reviewed By: sdong
Subscribers: jeanxu2012, dhruba
Differential Revision: https://reviews.facebook.net/D44955
Summary:
Currently, GetIntProperty("rocksdb.cur-size-all-mem-tables") only returns
the memory usage of those memtables which have not yet been flushed.
This patch introduces GetIntProperty("rocksdb.size-all-mem-tables"),
which includes the memory usage of all memtables, including those that
have been flushed but are pinned by iterators.
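A small sketch contrasting the two properties (names from the summary above;
assumes an already-open `rocksdb::DB* db`):
```
#include <cstdint>
#include "rocksdb/db.h"

void ReportMemtableMemory(rocksdb::DB* db) {
  uint64_t unflushed = 0, all = 0;
  // Only memtables that have not been flushed yet.
  db->GetIntProperty("rocksdb.cur-size-all-mem-tables", &unflushed);
  // All memtables, including flushed ones still pinned by iterators.
  db->GetIntProperty("rocksdb.size-all-mem-tables", &all);
  // all >= unflushed always holds; the difference is the memory pinned
  // by iterators over already-flushed memtables.
}
```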
Test Plan: Added a test in db_test
Reviewers: igor, anthony, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D44229
Summary: There is a check that invalidates the iterator if a prefix extractor is specified but the upper bound is outside the prefix of the seek key. Relax this constraint to allow users to set the upper bound to the prefix immediately following the current one.
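A sketch of the now-allowed pattern, assuming a DB opened with a fixed-size
prefix extractor; the keys are hypothetical:
```
#include <memory>
#include "rocksdb/db.h"
#include "rocksdb/options.h"

void ScanOnePrefix(rocksdb::DB* db) {
  // Assume the DB was opened with a 4-byte fixed prefix extractor.
  rocksdb::ReadOptions ro;
  rocksdb::Slice upper("abce");  // the prefix right after "abcd"
  ro.iterate_upper_bound = &upper;  // previously rejected, now allowed
  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(ro));
  for (it->Seek("abcd"); it->Valid(); it->Next()) {
    // Every key seen here starts with "abcd".
  }
}
```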
Test Plan: make commit-prereq
Reviewers: igor, anthony, kradhakrishnan, yhchiang, rven
Reviewed By: rven
Subscribers: tnovak, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D44949
Summary:
A couple of variables were declared but only used in assertions,
which causes issues when building in fbcode.
Test Plan: make dbg and make release
Reviewers: yhchiang, sdong, igor, anthony, MarkCallaghan
Reviewed By: MarkCallaghan
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D44937
Summary:
In D43239 (https://reviews.facebook.net/D43239) the number
of subcompactions is set based on the number of L1 files with
unique starting keys. In certain cases when this number is very large,
it causes issues, particularly with the overlap between files, since
very small output files can be generated. This diff bounds the number
of subcompactions by the user option DBOptions.num_subcompactions.
Test Plan: ./db_test ./db_compaction_test
Reviewers: sdong, igor, anthony, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D44883
Summary:
Reseek mutable_iter in Next() if it is invalid and immutable_iter
is also invalid.
Test Plan: DBTestTailingIterator.TailingIteratorSeekToNext
Reviewers: tnovak, march, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D44865
Summary:
In D43239 (https://reviews.facebook.net/D43239) there is an
assertion to make sure a subcompaction's output is never empty at the
end of execution. This assertion, however, breaks the build because some
tests lead to exactly that scenario. So I have altered the logic
to handle this case instead of just failing the assertion.
The reason that it is possible for a subcompaction's output to be empty is
that during a sequential execution of subcompactions, if a user aborts the
compaction job then some of the later subcompactions to be executed may
have yet to process any keys and therefore have yet to generate output files.
This becomes very rare once the subcompactions are executed in parallel,
but for now they are still sequential so the case is possible when there is an
early termination, as in some of the tests.
Test Plan: ./db_test ./db_compaction_test
Reviewers: sdong, igor, anthony, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D44877
Summary:
In preparation for running multiple threads at the same time during
a compaction job, this patch assigns each subcompaction its own state
(instead of sharing the one global CompactionState). Each subcompaction then
uses this state to update its statistics, keep track of its snapshots, etc.
during the course of execution. Then at the end of all the executions the
statistics are aggregated across the subcompactions so that the final result
is the same as if only one larger compaction had run.
Test Plan: ./db_test ./db_compaction_test ./compaction_job_test
Reviewers: sdong, anthony, igor, noetzli, yhchiang
Reviewed By: yhchiang
Subscribers: MarkCallaghan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43239
Summary:
While working on supporting mixing merge operators with
single deletes ( https://reviews.facebook.net/D43179 ),
I realized that returning and dealing with merge results
can be made simpler. Submitting this as a separate diff
because it is not directly related to single deletes.
Before, callers of merge helper had to retrieve the merge
result in one of two ways depending on whether the merge
was successful or not (success = result of merge was single
kTypeValue). For successful merges, the caller could query
the resulting key/value pair and for unsuccessful merges,
the result could be retrieved in the form of two deques of
keys and values. However, with single deletes, a successful merge
does not return a single key/value pair (if merge
operands are merged with a single delete, we have to generate
a value and keep the original single delete around to make
sure that we are not accidentally producing a key overwrite).
In addition, the two existing call sites of the merge
helper were taking the same actions independently from whether
the merge was successful or not, so this patch simplifies that.
Test Plan: make clean all check
Reviewers: rven, sdong, yhchiang, anthony, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43353
Summary: In internal stats, remember the read latency histogram if statistics are enabled. It can then be retrieved from DB::GetProperty() with the "rocksdb.dbstats" property.
Test Plan: Manually run db_bench, print out "rocksdb.dbstats" by hand and make sure it prints as expected
Reviewers: igor, IslamAbdelRahman, rven, kradhakrishnan, anthony, yhchiang
Reviewed By: yhchiang
Subscribers: MarkCallaghan, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D44193
Summary:
This diff allows a Writer to join the next write batch group
without acquiring any locks. Waiting is performed via a per-Writer mutex,
so non-leader writers never need to acquire the db mutex.
It is now possible to join a write batch group after the leader has been
chosen but before the batch has been constructed. This diff doesn't
increase parallelism, but reduces synchronization overheads.
For some CPU-bound workloads (no WAL, RAM-sized working set) this can
substantially reduce contention on the db mutex in a multi-threaded
environment. With T=8 N=500000 in a CPU-bound scenario (see the test
plan) this is good for a 33% perf win. Not all scenarios see such a
win, but none show a loss. This code is slightly faster even for the
single-threaded case (about 2% for the CPU-bound scenario below).
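A condensed sketch of the join protocol; the names are hypothetical, and the
real implementation in the write thread also covers leader hand-off and
additional states:
```
#include <atomic>
#include <condition_variable>
#include <mutex>

// Hypothetical, simplified Writer; the real one lives in db/write_thread.h.
struct Writer {
  Writer* link_older = nullptr;    // writer that arrived just before us
  std::mutex mu;                   // per-Writer mutex: db mutex not needed
  std::condition_variable cv;
  bool done = false;
};

std::atomic<Writer*> newest_writer{nullptr};

// Lock-free join: push ourselves onto the front of the waiter list with a
// single CAS. Whoever lands on an empty list becomes the group leader.
bool JoinBatchGroup(Writer* w) {
  Writer* head = newest_writer.load(std::memory_order_relaxed);
  do {
    w->link_older = head;
  } while (!newest_writer.compare_exchange_weak(head, w));
  bool leader = (w->link_older == nullptr);
  if (!leader) {
    // Non-leader: block on our own mutex until the leader performs the
    // group write and marks us done. The db mutex is never touched.
    std::unique_lock<std::mutex> guard(w->mu);
    w->cv.wait(guard, [w] { return w->done; });
  }
  return leader;  // the leader later walks the list and signals each waiter
}
```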
Test Plan:
1. unit tests
2. COMPILE_WITH_TSAN=1 make check
3. stress high-contention scenarios with db_bench -benchmarks=fillrandom -threads=$T -batch_size=1 -memtablerep=skip_list -value_size=0 --num=$N -level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999 -disable_auto_compactions --max_write_buffer_number=8 -max_background_flushes=8 --disable_wal --write_buffer_size=160000000
Reviewers: sdong, igor, rven, ljin, yhchiang
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D43887
Summary:
Add options.compaction_measure_io_stats to print out, and pass to listeners, the accumulated time spent on write calls. Example output in info logs:
2015/08/12-16:27:59.463944 7fd428bff700 (Original Log Time 2015/08/12-16:27:59.463922) EVENT_LOG_v1 {"time_micros": 1439422079463897, "job": 6, "event": "compaction_finished", "output_level": 1, "num_output_files": 4, "total_output_size": 6900525, "num_input_records": 111483, "num_output_records": 106877, "file_write_nanos": 15663206, "file_range_sync_nanos": 649588, "file_fsync_nanos": 349614797, "file_prepare_write_nanos": 1505812, "lsm_state": [2, 4, 0, 0, 0, 0, 0]}
Add two more counters in iostats_context.
Also add a parameter of db_bench.
Test Plan: Add a unit test. Also manually verify LOG outputs in db_bench
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D44115
Summary: Iterator has a bug: if a child iterator reaches its end, the user issues a Prev(), and some extra rows are added at the end just before SeekToLast() is called on the child iterator, the position of the iterator can be misplaced.
Test Plan: Run the tests with or without valgrind
Reviewers: rven, yhchiang, IslamAbdelRahman, anthony
Reviewed By: anthony
Subscribers: tnovak, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D43671
Summary:
Initial implementation of Pessimistic Transactions. This diff contains the api changes discussed in D38913. This diff is pretty large, so let me know if people would prefer to meet up to discuss it.
MyRocks folks: please take a look at the API in include/rocksdb/utilities/transaction[_db].h and let me know if you have any issues.
Also, you'll notice a couple of TODOs in the implementation of RollbackToSavePoint(). After chatting with Siying, I'm going to send out a separate diff for an alternate implementation of this feature that implements the rollback inside of WriteBatch/WriteBatchWithIndex. We can then decide which route is preferable.
Next, I'm planning on doing some perf testing and then integrating this diff into MongoRocks for further testing.
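A minimal usage sketch against the new API (headers as named in the summary;
error handling mostly elided, and treat the exact option types as a sketch of
this first version):
```
#include <cassert>
#include "rocksdb/utilities/transaction.h"
#include "rocksdb/utilities/transaction_db.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::TransactionDBOptions txn_db_options;
  rocksdb::TransactionDB* txn_db = nullptr;
  rocksdb::Status s = rocksdb::TransactionDB::Open(
      options, txn_db_options, "/tmp/txn_example", &txn_db);
  assert(s.ok());

  // Writes inside the transaction take pessimistic locks; a conflicting
  // write from another transaction fails instead of silently racing.
  rocksdb::Transaction* txn =
      txn_db->BeginTransaction(rocksdb::WriteOptions());
  s = txn->Put("key", "value");
  if (s.ok()) {
    s = txn->Commit();
  } else {
    txn->Rollback();
  }
  delete txn;
  delete txn_db;
  return 0;
}
```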
Test Plan: Unit tests, db_bench parallel testing.
Reviewers: igor, rven, sdong, yhchiang, yoshinorim
Reviewed By: sdong
Subscribers: hermanlee4, maykov, spetrunia, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D40869
Summary: Add a new option that allows LoadTableHandlers to use multiple threads to load files on DB Open and Recovery
Test Plan:
make check -j64
COMPILE_WITH_TSAN=1 make check -j64
DISABLE_JEMALLOC=1 make all valgrind_check -j64 (still running)
Reviewers: yhchiang, anthony, rven, kradhakrishnan, igor, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D43755
Summary:
While working on single delete support for db_bench, I realized that
db_bench/db_stress contain a bunch of duplicate code related to
compression and found some typos. This patch removes duplicate code,
typos and a redundant #ifndef in internal_stats.cc.
Test Plan: make db_stress && make db_bench && ./db_bench --benchmarks=compress,uncompress
Reviewers: yhchiang, sdong, rven, anthony, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43965
Summary:
Key comparison is the single largest CPU user for CPU-bound
workloads. This diff reduces the number of comparisons in two ways.
The first is that it moves predecessor array gathering from
FindGreaterOrEqual to FindLessThan, so that FindGreaterOrEqual can
return immediately if compare_ returns 0. As part of this change I
moved the sequential insertion optimization into Insert, to remove the
undocumented (and smelly) requirement that prev must be equal to prev_
if it is non-null.
The second optimization is that all of the search functions skip calling
compare_ when moving to a lower level that has the same Next pointer.
With a branching factor of 4 we would expect this to happen 1/4 of
the time.
On a single-threaded CPU-bound workload (-benchmarks=fillrandom -threads=1
-batch_size=1 -memtablerep=skip_list -value_size=0 --num=1600000
-level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999
-disable_auto_compactions --max_write_buffer_number=8
-max_background_flushes=8 --disable_wal --write_buffer_size=160000000)
on my dev server this is good for a 7% perf win.
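A condensed sketch, with int keys, of the search-side changes: return as soon
as a comparison reports equality, and skip the comparison when descending to
a level whose next pointer is a node we already compared against:
```
// Hypothetical, simplified skip-list search; the real code lives in the
// skip list implementation and is templated on a key comparator.
struct Node {
  int key;
  Node* next[12];  // one forward pointer per level, up to a max height
};

Node* FindGreaterOrEqual(Node* head, int max_height, int key) {
  Node* x = head;
  int level = max_height - 1;
  Node* last_bigger = nullptr;  // node already known to compare >= key
  while (true) {
    Node* next = x->next[level];
    int cmp;
    if (next == nullptr || next == last_bigger) {
      cmp = 1;  // re-use the previous result: no comparison needed
    } else {
      cmp = (next->key < key) ? -1 : (next->key > key) ? 1 : 0;
    }
    if (cmp == 0 || (cmp > 0 && level == 0)) {
      return next;  // equal key found, or smallest node >= key at level 0
    } else if (cmp < 0) {
      x = next;  // next->key < key: keep moving right on this level
    } else {
      last_bigger = next;  // remember it, then drop down a level
      --level;
    }
  }
}
```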
Test Plan: unit tests
Reviewers: rven, ljin, yhchiang, sdong, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D43233