rocksdb

Commit Graph

Author	SHA1	Message	Date
Sage Weil	2b2a898e0b	db/log_reader: combine kBadRecord{Len,Checksum} for readability These vary only by the corruption string reported. Signed-off-by: Sage Weil <sage@redhat.com>	9 years ago
Sage Weil	34df1c94d5	db/log_reader: treat bad record length or checksum as EOF If we are in kTolerateCorruptedTailRecords, treat these errors as the end of the log. This is particularly important for recycled logs, where we will regularly see corrupted headers (bad length or checksum) when replaying a log. If we are aligned with a block boundary or get lucky, we will land on an old header and see the log number mismatch, but more commonly we will land midway through some previous block and record and effectively see noise. These must be treated as the end of the log in order for recycling to work. This makes the LogTest.Recycle/1 test pass. We also modify a number of existing tests because the recycled log files behave fundamentally differently in that they always stop when they reach the first bad record. Signed-off-by: Sage Weil <sage@redhat.com>	9 years ago
Sage Weil	7947aba68c	db/log_reader: move kBadRecord{Len,Checksum} handling into ReadRecord The behavior here needs to depend on the WAL recovery mode. No functional change in this patch. Signed-off-by: Sage Weil <sage@redhat.com>	9 years ago
Sage Weil	847e471db6	db/log_test: add recycle log test This currently fails because we do not properly map a corrupt header to the logical end of the log. Signed-off-by: Sage Weil <sage@redhat.com>	9 years ago
Aaron Orenstein	2073cf3775	Eliminate use of 'using namespace std'. Also remove a number of ADL references to std functions. Summary: Reduce use of argument-dependent name lookup in RocksDB. Test Plan: 'make check' passed. Reviewers: andrewkr Reviewed By: andrewkr Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D58203	9 years ago
Richard Cairns Jr	f6e404c20a	Added "number of merge operands" to statistics in ssts. Summary: A couple of notes from the diff: - The namespace block I added at the top of table_properties_collector.cc was in reaction to an issue i was having with PutVarint64 and reusing the "val" string. I'm not sure this is the cleanest way of doing this, but abstracting this out at least results in the correct behavior. - I chose "rocksdb.merge.operands" as the property name. I am open to suggestions for better names. - The change to sst_dump_tool.cc seems a bit inelegant to me. Is there a better way to do the if-else block? Test Plan: I added a test case in table_properties_collector_test.cc. It adds two merge operands and checks to make sure that both of them are reflected by GetMergeOperands. It also checks to make sure the wasPropertyPresent bool is properly set in the method. Running both of these tests should pass: ./table_properties_collector_test ./sst_dump_test Reviewers: IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D58119	9 years ago
omegaga	3c69f77c67	Move IO failure test to separate file Summary: This is a part of effort to reduce the size of db_test.cc. We move the following tests to a separate file `db_io_failure_test.cc`: * DropWrites * DropWritesFlush * NoSpaceCompactRange * NonWritableFileSystem * ManifestWriteError * PutFailsParanoid Test Plan: Run `make check` to see if the tests are working properly. Reviewers: sdong, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D58341	9 years ago
Islam AbdelRahman	c70a9335de	Fix mutex unlock issue between scheduled compaction and ReleaseCompactionFiles() Summary: NotifyOnCompactionCompleted can unlock the mutex. That mean that we can schedule a background compaction that will start before we ReleaseCompactionFiles(). Test Plan: added unittest existing unittest Reviewers: yhchiang, sdong Reviewed By: sdong Subscribers: yoshinorim, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D58065	9 years ago
Reid Horuff	a6254f2bd4	Long outstanding prepare test Summary: This tests that a prepared transaction is not lost after several crashes, restarts, and memtable flushes. Test Plan: TwoPhaseLongPrepareTest Reviewers: sdong Subscribers: hermanlee4, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D58185	9 years ago
Aaron Gao	43afd72bee	[rocksdb] make more options dynamic Summary: make more ColumnFamilyOptions dynamic: - compression - soft_pending_compaction_bytes_limit - hard_pending_compaction_bytes_limit - min_partial_merge_operands - report_bg_io_stats - paranoid_file_checks Test Plan: Add sanity check in `db_test.cc` for all above options except for soft_pending_compaction_bytes_limit and hard_pending_compaction_bytes_limit. All passed. Reviewers: andrewkr, sdong, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D57519	9 years ago
Islam AbdelRahman	f6aedb62c0	Fix Transaction memory leak Summary: - Make sure we clean up recovered_transactions_ on DBImpl destructor - delete leaked txns and env in TransactionTest Test Plan: Run transaction_test under valgrind Reviewers: sdong, andrewkr, yhchiang, horuff Reviewed By: horuff Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D58263	9 years ago
krad	a08c8c851a	Added PersistentCache abstraction Summary: Added a new abstraction to cache page to RocksDB designed for the read cache use. RocksDB current block cache is more of an object cache. For the persistent read cache project, what we need is a page cache equivalent. This changes adds a cache abstraction to RocksDB to cache pages called PersistentCache. PersistentCache can cache uncompressed pages or raw pages (content as in filesystem). The user can choose to operate PersistentCache either in COMPRESSED or UNCOMPRESSED mode. Blame Rev: Test Plan: Run unit tests Reviewers: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D55707	9 years ago
Reid Horuff	a400336398	TransactionLogIterator sequence gap fix Summary: DBTestXactLogIterator.TransactionLogIterator was failing due the sequence gaps. This was caused by an off-by-one error when calculating the new sequence number after recovering from logs. Test Plan: db_log_iter_test Reviewers: andrewkr Subscribers: andrewkr, hermanlee4, dhruba, IslamAbdelRahman Differential Revision: https://reviews.facebook.net/D58053	9 years ago
Islam AbdelRahman	560358dc93	Fix data race in GetObsoleteFiles() Summary: GetObsoleteFiles() and LogAndApply() functions modify obsolete_manifests_ vector we need to make sure that the mutex is held when we modify the obsolete_manifests_ Test Plan: run the test under TSAN Reviewers: andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D58011	9 years ago
Reid Horuff	c27061dae7	[rocksdb] 2PC double recovery bug fix Summary: 1. prepare() 2. crash 3. recover 4. commit() 5. crash 6. data is lost This is due to the transaction data still only residing in the WAL but because the logs were flushed on the first recovery the data is ignored on the second recovery. We must scan all logs found on recovery and only ignore redundant data at the time of replay. It is not possible to know which logs still contain relevant data at time of recovery. We cannot simply ignore a log because all of the non-2pc data it contains has already been written to L0. The changes made to MemTableInserter are to ensure that prepared sections are still recovered even if all of the non-2pc data in that log has already been flushed to L0. Test Plan: Provided test. Reviewers: sdong Subscribers: andrewkr, hermanlee4, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D57729	9 years ago
Reid Horuff	a657ee9a9c	[rocksdb] Recovery path sequence miscount fix Summary: Consider the following WAL with 4 batch entries prefixed with their sequence at time of memtable insert. [1: BEGIN_PREPARE, PUT, PUT, PUT, PUT, END_PREPARE(a)] [1: BEGIN_PREPARE, PUT, PUT, PUT, PUT, END_PREPARE(b)] [4: COMMIT(a)] [7: COMMIT(b)] The first two batches do not consume any sequence numbers so are both prefixed with seq=1. For 2pc commit, memtable insertion takes place before COMMIT batch is written to WAL. We can see that sequence number consumption takes place between WAL entries giving us the seemingly sparse sequence prefix for WAL entries. This is a valid WAL. Because with 2PC markers one WriteBatch points to another batch containing its inserts a writebatch can consume more or less sequence numbers than the number of sequence consuming entries that it contains. We can see that, given the entries in the WAL, 6 sequence ids were consumed. Yet on recovery the maximum sequence consumed would be 7 + 3 (the number of sequence numbers consumed by COMMIT(b)) So, now upon recovery we must track the actual consumption of sequence numbers. In the provided scenario there will be no sequence gaps, but it is possible to produce a sequence gap. This should not be a problem though. correct? Test Plan: provided test. Reviewers: sdong Subscribers: andrewkr, leveldb, dhruba, hermanlee4 Differential Revision: https://reviews.facebook.net/D57645	9 years ago
Reid Horuff	8a66c85e90	[rocksdb] Two Phase Transaction Summary: Two Phase Commit addition to RocksDB. See wiki: https://github.com/facebook/rocksdb/wiki/Two-Phase-Commit-Implementation Quip: https://fb.quip.com/pxZrAyrx53r3 Depends on: WriteBatch modification: https://reviews.facebook.net/D54093 Memtable Log Referencing and Prepared Batch Recovery: https://reviews.facebook.net/D56919 Test Plan: - SimpleTwoPhaseTransactionTest - PersistentTwoPhaseTransactionTest. - TwoPhaseRollbackTest - TwoPhaseMultiThreadTest - TwoPhaseLogRollingTest - TwoPhaseEmptyWriteTest - TwoPhaseExpirationTest Reviewers: IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: leveldb, hermanlee4, andrewkr, vasilep, dhruba, santoshb Differential Revision: https://reviews.facebook.net/D56925	9 years ago
Reid Horuff	1b8a2e8fdd	[rocksdb] Memtable Log Referencing and Prepared Batch Recovery Summary: This diff is built on top of WriteBatch modification: https://reviews.facebook.net/D54093 and adds the required functionality to rocksdb core necessary for rocksdb to support 2PC. modfication of DBImpl::WriteImpl() - added two arguments uint64_t log_used = nullptr, uint64_t log_ref = 0; - log_used is an output argument which will return the log number which the incoming batch was inserted into, 0 if no WAL insert took place. - log_ref is a supplied log_number which all memtables inserted into will reference after the batch insert takes place. This number will reside in 'FindMinPrepLogReferencedByMemTable()' until all Memtables insertinto have flushed. - Recovery/writepath is now aware of prepared batches and commit and rollback markers. Test Plan: There is currently no test on this diff. All testing of this functionality takes place in the Transaction layer/diff but I will add some testing. Reviewers: IslamAbdelRahman, sdong Subscribers: leveldb, santoshb, andrewkr, vasilep, dhruba, hermanlee4 Differential Revision: https://reviews.facebook.net/D56919	9 years ago
Reid Horuff	0460e9dcce	Modification of WriteBatch to support two phase commit Summary: Adds three new WriteBatch data types: Prepare(xid), Commit(xid), Rollback(xid). Prepare(xid) should precede the (single) operation to which is applies. There can obviously be multiple Prepare(xid) markers. There should only be one Rollback(xid) or Commit(xid) marker yet not both. None of this logic is currently enforced and will most likely be implemented further up such as in the memtableinserter. All three markers are similar to PutLogData in that they are writebatch meta-data, ie stored but not counted. All three markers differ from PutLogData in that they will actually be written to disk. As for WriteBatchWithIndex, Prepare, Commit, Rollback are all implemented just as PutLogData and none are tested just as PutLogData. Test Plan: single unit test in write_batch_test. Reviewers: hermanlee4, sdong, anthony Subscribers: leveldb, dhruba, vasilep, andrewkr Differential Revision: https://reviews.facebook.net/D57867	9 years ago
Islam AbdelRahman	d86f9b9c3f	Fix lite build Summary: Fix lite build Test Plan: run under lite Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D57945	9 years ago
Islam AbdelRahman	4b31723433	Add bottommost_compression option Summary: Add a new option that can be used to set a specific compression algorithm for bottommost level. This option will only affect levels larger than base level. I have also updated CompactionJobInfo to include the compression algorithm used in compaction Test Plan: added new unittest existing unittests Reviewers: andrewkr, yhchiang, sdong Reviewed By: sdong Subscribers: lightmark, andrewkr, dhruba, yoshinorim Differential Revision: https://reviews.facebook.net/D57669	9 years ago
sdong	bfb6b1b8a8	Estimate pending compaction bytes more accurately Summary: Currently we estimate bytes needed for compaction by assuming fanout value to be level multiplier. It overestimates when size of a level exceeds the target by large. We estimate by the ratio of actual sizes in levels instead. Test Plan: Fix existing test cases and add a new one. Reviewers: IslamAbdelRahman, igor, yhchiang Reviewed By: yhchiang Subscribers: MarkCallaghan, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D57789	9 years ago
Yi Wu	730f7e2e21	Fix win build Summary: Fixing error with win build where we compare int64_t with size_t. Test Plan: make check Reviewers: andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D57885	9 years ago
Andrew Kryczka	269f6b2e2d	Revert "Modification of WriteBatch to support two phase commit" Summary: Revert D54093 and D57453 Test Plan: running make check Reviewers: horuff, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D57819	9 years ago
sdong	7ccb8d6ef3	BlockBasedTable::Get() not to use prefix bloom if read_options.total_order_seek = true Summary: This is to provide a way for users to skip prefix bloom in point look-up. Test Plan: Add a new unit test scenario. Reviewers: IslamAbdelRahman Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D57747	9 years ago
Islam AbdelRahman	967476eaee	Fix valgrind (DBIteratorTest.ReadAhead) Summary: This test is failing under valgrind because we dont delete the Env that we allocated Test Plan: run the test under valgrind Reviewers: andrewkr, yhchiang, yiwu, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D57693	9 years ago
Yi Wu	a4ea345b04	Fixing lite build Summary: Fixing lite build broke in unit test. `FilesPerLevel()` depends on `DB::GetProperty()`, which lite build doesn't support. Test Plan: OPT=-DROCKSDB_LITE make check -j64 Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D57651	9 years ago
Yi Wu	24a24f013d	Enable configurable readahead for iterators Summary: Add an option `iterator_readahead_size` to `ReadOptions` to enable configurable readahead for iterators similar to the corresponding option for compaction. Test Plan: ``` make commit_prereq ``` Reviewers: kumar.rangarajan, ott, igor, sdong Reviewed By: sdong Subscribers: yiwu, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D55419	9 years ago
Islam AbdelRahman	ff4b3fb5b4	Fix Iterator::Prev memory pinning bug Summary: We should not use IterKey::SetKey with copy = false except if we are pinning the iterator thru it's life time, otherwise we may release the temporarily pinned blocks and in this case the IterKey will be pointing to freed memory Test Plan: added a new test Reviewers: sdong, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D57561	9 years ago
Islam AbdelRahman	6e801b0bd1	Eliminate memcpy in Iterator::Prev() by pinning blocks for keys spanning multiple blocks Summary: This diff is stacked on top of this diff https://reviews.facebook.net/D56493 The current Iterator::Prev() implementation need to copy every value since the underlying Iterator may move after reading the value. This can be optimized by making sure that the block containing the value is pinned until the Iterator move. which will improve the throughput by up to 1.5X master ``` ==> 1000000_Keys_100Byte.txt <== readreverse : 0.449 micros/op 2225887 ops/sec; 246.2 MB/s readreverse : 0.433 micros/op 2311508 ops/sec; 255.7 MB/s readreverse : 0.436 micros/op 2294335 ops/sec; 253.8 MB/s readreverse : 0.471 micros/op 2121295 ops/sec; 234.7 MB/s readreverse : 0.465 micros/op 2152227 ops/sec; 238.1 MB/s readreverse : 0.454 micros/op 2203011 ops/sec; 243.7 MB/s readreverse : 0.451 micros/op 2216095 ops/sec; 245.2 MB/s readreverse : 0.462 micros/op 2162447 ops/sec; 239.2 MB/s readreverse : 0.476 micros/op 2099151 ops/sec; 232.2 MB/s readreverse : 0.472 micros/op 2120710 ops/sec; 234.6 MB/s avg : 242.34 MB/s ==> 1000000_Keys_1KB.txt <== readreverse : 1.013 micros/op 986793 ops/sec; 978.7 MB/s readreverse : 0.942 micros/op 1061136 ops/sec; 1052.5 MB/s readreverse : 0.951 micros/op 1051901 ops/sec; 1043.3 MB/s readreverse : 0.932 micros/op 1072894 ops/sec; 1064.1 MB/s readreverse : 1.024 micros/op 976720 ops/sec; 968.7 MB/s readreverse : 0.935 micros/op 1069169 ops/sec; 1060.4 MB/s readreverse : 1.012 micros/op 988132 ops/sec; 980.1 MB/s readreverse : 0.962 micros/op 1039579 ops/sec; 1031.1 MB/s readreverse : 0.991 micros/op 1008924 ops/sec; 1000.7 MB/s readreverse : 1.004 micros/op 996144 ops/sec; 988.0 MB/s avg : 1016.76 MB/s ==> 1000000_Keys_10KB.txt <== readreverse : 4.167 micros/op 239952 ops/sec; 2346.9 MB/s readreverse : 4.070 micros/op 245713 ops/sec; 2403.3 MB/s readreverse : 4.572 micros/op 218733 ops/sec; 2139.4 MB/s readreverse : 4.497 micros/op 222388 ops/sec; 2175.2 MB/s readreverse : 4.203 micros/op 237920 ops/sec; 2327.1 MB/s readreverse : 4.206 micros/op 237756 ops/sec; 2325.5 MB/s readreverse : 4.181 micros/op 239149 ops/sec; 2339.1 MB/s readreverse : 4.157 micros/op 240552 ops/sec; 2352.8 MB/s readreverse : 4.187 micros/op 238848 ops/sec; 2336.1 MB/s readreverse : 4.106 micros/op 243575 ops/sec; 2382.4 MB/s avg : 2312.78 MB/s ==> 100000_Keys_100KB.txt <== readreverse : 41.281 micros/op 24224 ops/sec; 2366.0 MB/s readreverse : 39.722 micros/op 25175 ops/sec; 2458.9 MB/s readreverse : 40.319 micros/op 24802 ops/sec; 2422.5 MB/s readreverse : 39.762 micros/op 25149 ops/sec; 2456.4 MB/s readreverse : 40.916 micros/op 24440 ops/sec; 2387.1 MB/s readreverse : 41.188 micros/op 24278 ops/sec; 2371.4 MB/s readreverse : 40.061 micros/op 24962 ops/sec; 2438.1 MB/s readreverse : 40.221 micros/op 24862 ops/sec; 2428.4 MB/s readreverse : 40.084 micros/op 24947 ops/sec; 2436.7 MB/s readreverse : 40.655 micros/op 24597 ops/sec; 2402.4 MB/s avg : 2416.79 MB/s ==> 10000_Keys_1MB.txt <== readreverse : 298.038 micros/op 3355 ops/sec; 3355.3 MB/s readreverse : 335.001 micros/op 2985 ops/sec; 2985.1 MB/s readreverse : 286.956 micros/op 3484 ops/sec; 3484.9 MB/s readreverse : 329.954 micros/op 3030 ops/sec; 3030.8 MB/s readreverse : 306.428 micros/op 3263 ops/sec; 3263.5 MB/s readreverse : 330.749 micros/op 3023 ops/sec; 3023.5 MB/s readreverse : 328.903 micros/op 3040 ops/sec; 3040.5 MB/s readreverse : 324.853 micros/op 3078 ops/sec; 3078.4 MB/s readreverse : 320.488 micros/op 3120 ops/sec; 3120.3 MB/s readreverse : 320.536 micros/op 3119 ops/sec; 3119.8 MB/s avg : 3150.21 MB/s ``` After memcpy elimination ``` ==> 1000000_Keys_100Byte.txt <== readreverse : 0.395 micros/op 2529890 ops/sec; 279.9 MB/s readreverse : 0.368 micros/op 2715922 ops/sec; 300.5 MB/s readreverse : 0.384 micros/op 2603929 ops/sec; 288.1 MB/s readreverse : 0.375 micros/op 2663286 ops/sec; 294.6 MB/s readreverse : 0.357 micros/op 2802180 ops/sec; 310.0 MB/s readreverse : 0.363 micros/op 2757684 ops/sec; 305.1 MB/s readreverse : 0.372 micros/op 2689603 ops/sec; 297.5 MB/s readreverse : 0.379 micros/op 2638599 ops/sec; 291.9 MB/s readreverse : 0.375 micros/op 2663803 ops/sec; 294.7 MB/s readreverse : 0.375 micros/op 2665579 ops/sec; 294.9 MB/s avg: 295.72 MB/s (1.22 X) ==> 1000000_Keys_1KB.txt <== readreverse : 0.879 micros/op 1138112 ops/sec; 1128.8 MB/s readreverse : 0.842 micros/op 1187998 ops/sec; 1178.3 MB/s readreverse : 0.837 micros/op 1194915 ops/sec; 1185.1 MB/s readreverse : 0.845 micros/op 1182983 ops/sec; 1173.3 MB/s readreverse : 0.877 micros/op 1140308 ops/sec; 1131.0 MB/s readreverse : 0.849 micros/op 1177581 ops/sec; 1168.0 MB/s readreverse : 0.915 micros/op 1093284 ops/sec; 1084.3 MB/s readreverse : 0.863 micros/op 1159418 ops/sec; 1149.9 MB/s readreverse : 0.895 micros/op 1117670 ops/sec; 1108.5 MB/s readreverse : 0.852 micros/op 1174116 ops/sec; 1164.5 MB/s avg: 1147.17 MB/s (1.12 X) ==> 1000000_Keys_10KB.txt <== readreverse : 3.870 micros/op 258386 ops/sec; 2527.2 MB/s readreverse : 3.568 micros/op 280296 ops/sec; 2741.5 MB/s readreverse : 4.005 micros/op 249694 ops/sec; 2442.2 MB/s readreverse : 3.550 micros/op 281719 ops/sec; 2755.5 MB/s readreverse : 3.562 micros/op 280758 ops/sec; 2746.1 MB/s readreverse : 3.507 micros/op 285125 ops/sec; 2788.8 MB/s readreverse : 3.463 micros/op 288739 ops/sec; 2824.1 MB/s readreverse : 3.428 micros/op 291734 ops/sec; 2853.4 MB/s readreverse : 3.553 micros/op 281491 ops/sec; 2753.2 MB/s readreverse : 3.535 micros/op 282885 ops/sec; 2766.9 MB/s avg : 2719.89 MB/s (1.17 X) ==> 100000_Keys_100KB.txt <== readreverse : 22.815 micros/op 43830 ops/sec; 4281.0 MB/s readreverse : 29.957 micros/op 33381 ops/sec; 3260.4 MB/s readreverse : 25.334 micros/op 39473 ops/sec; 3855.4 MB/s readreverse : 23.037 micros/op 43409 ops/sec; 4239.8 MB/s readreverse : 27.810 micros/op 35958 ops/sec; 3512.1 MB/s readreverse : 30.327 micros/op 32973 ops/sec; 3220.6 MB/s readreverse : 29.704 micros/op 33665 ops/sec; 3288.2 MB/s readreverse : 29.423 micros/op 33987 ops/sec; 3319.6 MB/s readreverse : 23.334 micros/op 42856 ops/sec; 4185.9 MB/s readreverse : 29.969 micros/op 33368 ops/sec; 3259.1 MB/s avg : 3642.21 MB/s (1.5 X) ==> 10000_Keys_1MB.txt <== readreverse : 244.748 micros/op 4085 ops/sec; 4085.9 MB/s readreverse : 230.208 micros/op 4343 ops/sec; 4344.0 MB/s readreverse : 235.655 micros/op 4243 ops/sec; 4243.6 MB/s readreverse : 235.730 micros/op 4242 ops/sec; 4242.2 MB/s readreverse : 237.346 micros/op 4213 ops/sec; 4213.3 MB/s readreverse : 227.306 micros/op 4399 ops/sec; 4399.4 MB/s readreverse : 194.957 micros/op 5129 ops/sec; 5129.4 MB/s readreverse : 238.359 micros/op 4195 ops/sec; 4195.4 MB/s readreverse : 221.588 micros/op 4512 ops/sec; 4513.0 MB/s readreverse : 235.911 micros/op 4238 ops/sec; 4239.0 MB/s avg : 4360.52 MB/s (1.38 X) ``` Test Plan: COMPILE_WITH_ASAN=1 make check -j64 Reviewers: andrewkr, yhchiang, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56511	9 years ago
Warren Falk	b8cf9130f8	Fix #1110 , 32-bit build failure on Mac OSX (#1112 ) Using explicit 64-bit type in conditional in platforms above 32-bits This appears to be necessary on Mac OSX as std::conditional does not appear to short circuit and evaluates the third template arg Making the third template arg be 64 bits explicitly works around this problem and will work on both 32 bit and 64+ bit platforms.	9 years ago
Islam AbdelRahman	21441c09bd	Fix calling GetCurrentMutableCFOptions in CompactionJob::ProcessKeyValueCompaction() Summary: GetCurrentMutableCFOptions() can only be called when DB mutex is held so we cannot call it in CompactionJob::ProcessKeyValueCompaction() since it's not holding the db mutex Test Plan: make check -j64 Reviewers: sdong, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D57471	9 years ago
Reid Horuff	6e56a114be	Modification of WriteBatch to support two phase commit Summary: Adds three new WriteBatch data types: Prepare(xid), Commit(xid), Rollback(xid). Prepare(xid) should precede the (single) operation to which is applies. There can obviously be multiple Prepare(xid) markers. There should only be one Rollback(xid) or Commit(xid) marker yet not both. None of this logic is currently enforced and will most likely be implemented further up such as in the memtableinserter. All three markers are similar to PutLogData in that they are writebatch meta-data, ie stored but not counted. All three markers differ from PutLogData in that they will actually be written to disk. As for WriteBatchWithIndex, Prepare, Commit, Rollback are all implemented just as PutLogData and none are tested just as PutLogData. Test Plan: single unit test in write_batch_test. Reviewers: hermanlee4, sdong, anthony Subscribers: andrewkr, vasilep, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D54093	9 years ago
Yi Wu	a92049e3e7	Added EventListener::OnTableFileCreationStarted() callback Summary: Added EventListener::OnTableFileCreationStarted. EventListener::OnTableFileCreated will be called on failure case. User can check creation status via TableFileCreationInfo::status. Test Plan: unit test. Reviewers: dhruba, yhchiang, ott, sdong Reviewed By: sdong Subscribers: sdong, kradhakrishnan, IslamAbdelRahman, andrewkr, yhchiang, leveldb, ott, dhruba Differential Revision: https://reviews.facebook.net/D56337	9 years ago
sdong	6a14f7a976	Change several option defaults Summary: Changing several option defaults: options.max_open_files changes from 5000 to -1 options.base_background_compactions changes from max_background_compactions to 1 options.wal_recovery_mode changes from kTolerateCorruptedTailRecords to kTolerateCorruptedTailRecords options.compaction_pri changes from kByCompensatedSize to kByCompensatedSize Test Plan: Write unit tests to see OldDefaults() works as expected. Reviewers: IslamAbdelRahman, yhchiang, igor Reviewed By: igor Subscribers: MarkCallaghan, yiwu, kradhakrishnan, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56427	9 years ago
Peter Mattis	c6c770a1ac	Use prefix_same_as_start to avoid iteration in FindNextUserEntryInternal. (#1102 ) This avoids excessive iteration in tombstone fields.	9 years ago
sdong	992a8f83b7	Not enable jemalloc status printing if USE_CLANG=1 Summary: Warning is printed out with USE_CLANG=1 when including jemalloc.h. Disable it in that case. Test Plan: Run db_bench with USE_CLANG=1 and not. Make sure they can all build and jemalloc status is printed out in the case where USE_CLANG is not set. Reviewers: andrewkr, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D57399	9 years ago
Andrew Kryczka	a06faa6327	Skip PresetCompressionDict test for lite Summary: This test relies on "rocksdb.num-files-at-levelN" property that isn't implemented in rocksdb lite. So we will compile it only for non-lite builds. Test Plan: $ make -j40 check 'OPT=-g -DROCKSDB_LITE' Reviewers: sdong, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D57387	9 years ago
Andrew Kryczka	0f428c5619	Fix compression dictionary clang osx error Summary: There was one narrowing conversion in D52287 that only showed up with clang on osx. Test Plan: $ make clean && USE_CLANG=1 DISABLE_JEMALLOC=1 TEST_TMPDIR=/dev/shm/rocksdb OPT=-g make -j32 check Reviewers: sdong, lightmark, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D57357	9 years ago
Li Peng	6d4832a998	Merge pull request #1101 from flyd1005/wip-fix-typo fix typos and remove duplicated words	9 years ago
Andrew Kryczka	54de13abac	Fix compression dictionary clang errors Summary: There were a few narrowing conversions that clang didn't like. Test Plan: $ make clean && USE_CLANG=1 DISABLE_JEMALLOC=1 TEST_TMPDIR=/dev/shm/rocksdb OPT=-g make -j32 check Reviewers: IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D57351	9 years ago
Islam AbdelRahman	0850bc5147	Fix build on machines without jemalloc Summary: It looks like we mistakenly enable JEMALLOC even if it's not available on the machine, that's why travis is failing Test Plan: check on my devserver check on my mac Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D57345	9 years ago
Andrew Kryczka	843d2e3137	Shared dictionary compression using reference block Summary: This adds a new metablock containing a shared dictionary that is used to compress all data blocks in the SST file. The size of the shared dictionary is configurable in CompressionOptions and defaults to 0. It's currently only used for zlib/lz4/lz4hc, but the block will be stored in the SST regardless of the compression type if the user chooses a nonzero dictionary size. During compaction, computes the dictionary by randomly sampling the first output file in each subcompaction. It pre-computes the intervals to sample by assuming the output file will have the maximum allowable length. In case the file is smaller, some of the pre-computed sampling intervals can be beyond end-of-file, in which case we skip over those samples and the dictionary will be a bit smaller. After the dictionary is generated using the first file in a subcompaction, it is loaded into the compression library before writing each block in each subsequent file of that subcompaction. On the read path, gets the dictionary from the metablock, if it exists. Then, loads that dictionary into the compression library before reading each block. Test Plan: new unit test Reviewers: yhchiang, IslamAbdelRahman, cyan, sdong Reviewed By: sdong Subscribers: andrewkr, yoshinorim, kradhakrishnan, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D52287	9 years ago
Sergey Makarenko	1c80dfab24	Print memory allocation counters Summary: Introduced option to dump malloc statistics using new option flag. Added new command line option to db_bench tool to enable this funtionality. Also extended build to support environments with/without jemalloc. Test Plan: 1) Build rocksdb using `make` command. Launch the following command `./db_bench --benchmarks=fillrandom --dump_malloc_stats=true --num=10000000` end verified that jemalloc dump is present in LOG file. 2) Build rocksdb using `DISABLE_JEMALLOC=1 make db_bench -j32` and ran the same db_bench tool and found the following message in LOG file: "Please compile with jemalloc to enable malloc dump". 3) Also built rocksdb using `make` command on MacOS to verify behavior in non-FB environment. Also to debug build configuration change temporary changed AM_DEFAULT_VERBOSITY = 1 in Makefile to see compiler and build tools output. For case 1) -DROCKSDB_JEMALLOC was present in compiler command line. For both 2) and 3) this flag was not present. Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D57321	9 years ago
sdong	ac0e54b4c6	CompactedDB should not be used if there is outstanding WAL files Summary: CompactedDB skips memtable. So we shouldn't use compacted DB if there is outstanding WAL files. Test Plan: Change to options.max_open_files = -1 perf context test to create a compacted DB, which we shouldn't do. Reviewers: yhchiang, kradhakrishnan, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D57057	9 years ago
Islam AbdelRahman	d719b095dc	Introduce PinnedIteratorsManager (Reduce PinData() overhead / Refactor PinData) Summary: While trying to reuse PinData() / ReleasePinnedData() .. to optimize away some memcpys I realized that there is a significant overhead for using PinData() / ReleasePinnedData if they were called many times. This diff refactor the pinning logic by introducing PinnedIteratorsManager a centralized component that will be created once and will be notified whenever we need to Pin an Iterator. This implementation have much less overhead than the original implementation Test Plan: make check -j64 COMPILE_WITH_ASAN=1 make check -j64 Reviewers: yhchiang, sdong, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56493	9 years ago
Islam AbdelRahman	7c14abf2c7	Improve BytewiseComparatorImpl::FindShortestSeparator Summary: The current implementation find the first different byte and try to increment it, if it cannot it return the original key we can improve this by keep going after the first different byte to find the first non 0xFF byte and increment it After trying this patch on some logdevice sst files I see decrease in there index block size by 8.5% Test Plan: existing tests and updated test Reviewers: yhchiang, andrewkr, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56241	9 years ago
Islam AbdelRahman	f3eb0b5b8c	Make EventListenerTest.CompactionReasonLevel more deterministic Summary: In this test some times automatic compactions do everything and Manual compaction become a no-op. Update the test to make sure manual compaction is not a no-op Test Plan: run the test Reviewers: andrewkr, yhchiang, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D57189	9 years ago
sdong	7b78d623f7	Shouldn't report default column family's compaction stats as DB compaction stats Summary: Now we collect compaction stats per column family, but report default colum family's stat as compaction stats for DB. Fix it by reporting compaction stats per column family instead. Test Plan: Run db_bench with --num_column_families=4 and see the number fixed. Reviewers: IslamAbdelRahman, yhchiang Reviewed By: yhchiang Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D57063	9 years ago
Yueh-Hsuan Chiang	24110ce90c	Correct Statistics FLUSH_WRITE_BYTES Summary: In https://reviews.facebook.net/D56271, we fixed an issue where we consider flush as compaction. However, that makes us mistakenly count FLUSH_WRITE_BYTES twice (one in flush_job and one in db_impl.) This patch removes the one incremented in db_impl. Test Plan: db_test Reviewers: yiwu, andrewkr, IslamAbdelRahman, kradhakrishnan, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D57111	9 years ago

1 2 3 4 5 ...

2322 Commits (2b2a898e0b9ffe12e4ffb9e2bf4a697c843278f0)