rocksdb

Commit Graph

Author	SHA1	Message	Date
Aaron Gao	db2b4eb50e	avoid direct io in rocksdb_lite Summary: fix lite bugs disable direct io in lite mode Closes https://github.com/facebook/rocksdb/pull/1870 Differential Revision: D4559866 Pulled By: yiwu-arbug fbshipit-source-id: 3761c51	8 years ago
Sagar Vemuri	eb912a927e	Remove disableDataSync option Summary: Remove disableDataSync, and another similarly named disable_data_sync options. This is being done to simplify options, and also because the performance gains of this feature can be achieved by other methods. Closes https://github.com/facebook/rocksdb/pull/1859 Differential Revision: D4541292 Pulled By: sagar0 fbshipit-source-id: 5b3a6ca	8 years ago
Islam AbdelRahman	574b543f80	Rename merger.h -> merging_iterator.h Summary: merger.h was always a confusing name for me, simply give the file a better name Closes https://github.com/facebook/rocksdb/pull/1836 Differential Revision: D4505357 Pulled By: IslamAbdelRahman fbshipit-source-id: 07b28d8	8 years ago
Maysam Yabandeh	d0ba8ec8f9	Revert "PinnableSlice" Summary: This reverts commit `54d94e9c2c`. The pull request was landed by mistake. Closes https://github.com/facebook/rocksdb/pull/1755 Differential Revision: D4391678 Pulled By: maysamyabandeh fbshipit-source-id: 36d5149	8 years ago
Maysam Yabandeh	54d94e9c2c	PinnableSlice Summary: Currently the point lookup values are copied to a string provided by the user. This incures an extra memcpy cost. This patch allows doing point lookup via a PinnableSlice which pins the source memory location (instead of copying their content) and releases them after the content is consumed by the user. The old API of Get(string) is translated to the new API underneath. Here is the summary for improvements: 1. value 100 byte: 1.8% regular, 1.2% merge values 2. value 1k byte: 11.5% regular, 7.5% merge values 3. value 10k byte: 26% regular, 29.9% merge values The improvement for merge could be more if we extend this approach to pin the merge output and delay the full merge operation until the user actually needs it. We have put that for future work. PS: Sometimes we observe a small decrease in performance when switching from t5452014 to this patch but with the old Get(string) API. The difference is a little and could be noise. More importantly it is safely cancelled Closes https://github.com/facebook/rocksdb/pull/1732 Differential Revision: D4374613 Pulled By: maysamyabandeh fbshipit-source-id: a077f1a	8 years ago
Andrew Kryczka	48e8baebc0	Decouple data iterator and range deletion iterator in TableCache Summary: Previously we used TableCache::NewIterator() for multiple purposes (data block iterator and range deletion iterator), and returned non-ok status in the data block iterator. In one case where the caller only used the range deletion block iterator (`9e7cf3469b/db/version_set.cc (L965-L973)`), we didn't check/free the data block iterator containing non-ok status, which caused a valgrind error. So, this diff decouples creation of data block and range deletion block iterators, and updates the callers accordingly. Both functions can return non-ok status in an InternalIterator. Since the non-ok status is returned in an iterator that the callers will definitely use, it should be more usable/less error-prone. Closes https://github.com/facebook/rocksdb/pull/1513 Differential Revision: D4181423 Pulled By: ajkr fbshipit-source-id: 835b8f5	8 years ago
Andrew Kryczka	9e7cf3469b	DeleteRange user iterator support Summary: Note: reviewed in https://reviews.facebook.net/D65115 - DBIter maintains a range tombstone accumulator. We don't cleanup obsolete tombstones yet, so if the user seeks back and forth, the same tombstones would be added to the accumulator multiple times. - DBImpl::NewInternalIterator() (used to make DBIter's underlying iterator) adds memtable/L0 range tombstones, L1+ range tombstones are added on-demand during NewSecondaryIterator() (see D62205) - DBIter uses ShouldDelete() when advancing to check whether keys are covered by range tombstones Closes https://github.com/facebook/rocksdb/pull/1464 Differential Revision: D4131753 Pulled By: ajkr fbshipit-source-id: be86559	8 years ago
Andrew Kryczka	f998c9790f	DeleteRange Get support Summary: During Get()/MultiGet(), build up a RangeDelAggregator with range tombstones as we search through live memtable, immutable memtables, and SST files. This aggregator is then used by memtable.cc's SaveValue() and GetContext::SaveValue() to check whether keys are covered. added tests for Get on memtables/files; end-to-end tests mainly in https://reviews.facebook.net/D64761 Closes https://github.com/facebook/rocksdb/pull/1456 Differential Revision: D4111271 Pulled By: ajkr fbshipit-source-id: 6e388d4	8 years ago
Benoit Girard	2b16d664cb	Change max_bytes_for_level_multiplier to double Summary: Closes https://github.com/facebook/rocksdb/pull/1427 Differential Revision: D4094732 Pulled By: yiwu-arbug fbshipit-source-id: b9b79e9	8 years ago
Islam AbdelRahman	869ae5d786	Support IngestExternalFile (remove AddFile restrictions) Summary: Changes in the diff API changes: - Introduce IngestExternalFile to replace AddFile (I think this make the API more clear) - Introduce IngestExternalFileOptions (This struct will encapsulate the options for ingesting the external file) - Deprecate AddFile() API Logic changes: - If our file overlap with the memtable we will flush the memtable - We will find the first level in the LSM tree that our file key range overlap with the keys in it - We will find the lowest level in the LSM tree above the the level we found in step 2 that our file can fit in and ingest our file in it - We will assign a global sequence number to our new file - Remove AddFile restrictions by using global sequence numbers Other changes: - Refactor all AddFile logic to be encapsulated in ExternalSstFileIngestionJob Test Plan: unit tests (still need to add more) addfile_stress (https://reviews.facebook.net/D65037) Reviewers: yiwu, andrewkr, lightmark, yhchiang, sdong Reviewed By: sdong Subscribers: jkedgar, hcz, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D65061	8 years ago
Andrew Kryczka	6fbe96baf8	Compaction Support for Range Deletion Summary: This diff introduces RangeDelAggregator, which takes ownership of iterators provided to it via AddTombstones(). The tombstones are organized in a two-level map (snapshot stripe -> begin key -> tombstone). Tombstone creation avoids data copy by holding Slices returned by the iterator, which remain valid thanks to pinning. For compaction, we create a hierarchical range tombstone iterator with structure matching the iterator over compaction input data. An aggregator based on that iterator is used by CompactionIterator to determine which keys are covered by range tombstones. In case of merge operand, the same aggregator is used by MergeHelper. Upon finishing each file in the compaction, relevant range tombstones are added to the output file's range tombstone metablock and file boundaries are updated accordingly. To check whether a key is covered by range tombstone, RangeDelAggregator::ShouldDelete() considers tombstones in the key's snapshot stripe. When this function is used outside of compaction, it also checks newer stripes, which can contain covering tombstones. Currently the intra-stripe check involves a linear scan; however, in the future we plan to collapse ranges within a stripe such that binary search can be used. RangeDelAggregator::AddToBuilder() adds all range tombstones in the table's key-range to a new table's range tombstone meta-block. Since range tombstones may fall in the gap between files, we may need to extend some files' key-ranges. The strategy is (1) first file extends as far left as possible and other files do not extend left, (2) all files extend right until either the start of the next file or the end of the last range tombstone in the gap, whichever comes first. One other notable change is adding release/move semantics to ScopedArenaIterator such that it can be used to transfer ownership of an arena-allocated iterator, similar to how unique_ptr is used for malloc'd data. Depends on D61473 Test Plan: compaction_iterator_test, mock_table, end-to-end tests in D63927 Reviewers: sdong, IslamAbdelRahman, wanning, yhchiang, lightmark Reviewed By: lightmark Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D62205	8 years ago
Islam AbdelRahman	2ad68b971a	Support running consistency checks in release mode Summary: We always run consistency checks when compiling in debug mode allow users to set Options::force_consistency_checks to true to be able to run such checks even when compiling in release mode Test Plan: make check -j64 make release Reviewers: lightmark, sdong, yiwu Reviewed By: yiwu Subscribers: hermanlee4, andrewkr, yoshinorim, jkedgar, dhruba Differential Revision: https://reviews.facebook.net/D64701	8 years ago
Aaron Gao	f517d9dd09	Add SeekForPrev() to Iterator Summary: Add new Iterator API, `SeekForPrev`: find the last key that <= target key support prefix_extractor support prefix_same_as_start support upper_bound not supported in iterators without Prev() Also add tests in db_iter_test and db_iterator_test Pass all tests Cheers! Test Plan: make all check -j64 Reviewers: andrewkr, yiwu, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D64149	8 years ago
Yi Wu	9ed928e7a9	Split DBOptions into ImmutableDBOptions and MutableDBOptions Summary: Use ImmutableDBOptions/MutableDBOptions internally and DBOptions only for user-facing APIs. MutableDBOptions is barely a placeholder for now. I'll start to move options to MutableDBOptions in following diffs. Test Plan: make all check Reviewers: yhchiang, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D64065	8 years ago
Yi Wu	81747f1be6	Refactor MutableCFOptions Summary: * Change constructor of MutableCFOptions to depends only on ColumnFamilyOptions. * Move `max_subcompactions`, `compaction_options_fifo` and `compaction_pri` to ImmutableCFOptions to make it clear that they are immutable. Test Plan: existing unit tests. Reviewers: yhchiang, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D63945	8 years ago
omegaga	e70020e4f6	Only cache level 0 indexes and filter when opening table reader Summary: In T8216281 we decided to disable prefetching the index and filter during opening table handlers during startup (max_open_files = -1). Test Plan: Rely on `IndexAndFilterBlocksOfNewTableAddedToCache` to guarantee L0 indexes and filters are still cached and change `PinL0IndexAndFilterBlocksTest` to make sure other levels are not cached (maybe add one more test to test we don't cache other levels?) Reviewers: sdong, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59913	8 years ago
Islam AbdelRahman	68a8e6b8fa	Introduce FullMergeV2 (eliminate memcpy from merge operators) Summary: This diff update the code to pin the merge operator operands while the merge operation is done, so that we can eliminate the memcpy cost, to do that we need a new public API for FullMerge that replace the std::deque<std::string> with std::vector<Slice> This diff is stacked on top of D56493 and D56511 In this diff we - Update FullMergeV2 arguments to be encapsulated in MergeOperationInput and MergeOperationOutput which will make it easier to add new arguments in the future - Replace std::deque<std::string> with std::vector<Slice> to pass operands - Replace MergeContext std::deque with std::vector (based on a simple benchmark I ran https://gist.github.com/IslamAbdelRahman/78fc86c9ab9f52b1df791e58943fb187) - Allow FullMergeV2 output to be an existing operand ``` [Everything in Memtable \| 10K operands \| 10 KB each \| 1 operand per key] DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=10000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000 [FullMergeV2] readseq : 0.607 micros/op 1648235 ops/sec; 16121.2 MB/s readseq : 0.478 micros/op 2091546 ops/sec; 20457.2 MB/s readseq : 0.252 micros/op 3972081 ops/sec; 38850.5 MB/s readseq : 0.237 micros/op 4218328 ops/sec; 41259.0 MB/s readseq : 0.247 micros/op 4043927 ops/sec; 39553.2 MB/s [master] readseq : 3.935 micros/op 254140 ops/sec; 2485.7 MB/s readseq : 3.722 micros/op 268657 ops/sec; 2627.7 MB/s readseq : 3.149 micros/op 317605 ops/sec; 3106.5 MB/s readseq : 3.125 micros/op 320024 ops/sec; 3130.1 MB/s readseq : 4.075 micros/op 245374 ops/sec; 2400.0 MB/s ``` ``` [Everything in Memtable \| 10K operands \| 10 KB each \| 10 operand per key] DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=1000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000 [FullMergeV2] readseq : 3.472 micros/op 288018 ops/sec; 2817.1 MB/s readseq : 2.304 micros/op 434027 ops/sec; 4245.2 MB/s readseq : 1.163 micros/op 859845 ops/sec; 8410.0 MB/s readseq : 1.192 micros/op 838926 ops/sec; 8205.4 MB/s readseq : 1.250 micros/op 800000 ops/sec; 7824.7 MB/s [master] readseq : 24.025 micros/op 41623 ops/sec; 407.1 MB/s readseq : 18.489 micros/op 54086 ops/sec; 529.0 MB/s readseq : 18.693 micros/op 53495 ops/sec; 523.2 MB/s readseq : 23.621 micros/op 42335 ops/sec; 414.1 MB/s readseq : 18.775 micros/op 53262 ops/sec; 521.0 MB/s ``` ``` [Everything in Block cache \| 10K operands \| 10 KB each \| 1 operand per key] [FullMergeV2] $ DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions readseq : 14.741 micros/op 67837 ops/sec; 663.5 MB/s readseq : 1.029 micros/op 971446 ops/sec; 9501.6 MB/s readseq : 0.974 micros/op 1026229 ops/sec; 10037.4 MB/s readseq : 0.965 micros/op 1036080 ops/sec; 10133.8 MB/s readseq : 0.943 micros/op 1060657 ops/sec; 10374.2 MB/s [master] readseq : 16.735 micros/op 59755 ops/sec; 584.5 MB/s readseq : 3.029 micros/op 330151 ops/sec; 3229.2 MB/s readseq : 3.136 micros/op 318883 ops/sec; 3119.0 MB/s readseq : 3.065 micros/op 326245 ops/sec; 3191.0 MB/s readseq : 3.014 micros/op 331813 ops/sec; 3245.4 MB/s ``` ``` [Everything in Block cache \| 10K operands \| 10 KB each \| 10 operand per key] DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10-operands-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions [FullMergeV2] readseq : 24.325 micros/op 41109 ops/sec; 402.1 MB/s readseq : 1.470 micros/op 680272 ops/sec; 6653.7 MB/s readseq : 1.231 micros/op 812347 ops/sec; 7945.5 MB/s readseq : 1.091 micros/op 916590 ops/sec; 8965.1 MB/s readseq : 1.109 micros/op 901713 ops/sec; 8819.6 MB/s [master] readseq : 27.257 micros/op 36687 ops/sec; 358.8 MB/s readseq : 4.443 micros/op 225073 ops/sec; 2201.4 MB/s readseq : 5.830 micros/op 171526 ops/sec; 1677.7 MB/s readseq : 4.173 micros/op 239635 ops/sec; 2343.8 MB/s readseq : 4.150 micros/op 240963 ops/sec; 2356.8 MB/s ``` Test Plan: COMPILE_WITH_ASAN=1 make check -j64 Reviewers: yhchiang, andrewkr, sdong Reviewed By: sdong Subscribers: lovro, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D57075	8 years ago
John Alexander	9430333f84	New Statistics to track Compression/Decompression (#1197 ) * Added new statistics and refactored to allow ioptions to be passed around as required to access environment and statistics pointers (and, as a convenient side effect, info_log pointer). * Prevent incrementing compression counter when compression is turned off in options. * Prevent incrementing compression counter when compression is turned off in options. * Added two more supported compression types to test code in db_test.cc * Prevent incrementing compression counter when compression is turned off in options. * Added new StatsLevel that excludes compression timing. * Fixed casting error in coding.h * Fixed CompressionStatsTest for new StatsLevel. * Removed unused variable that was breaking the Linux build	8 years ago
Yi Wu	296545a2c7	Fix clang analyzer errors Summary: Fixing erros reported by clang static analyzer. * Removing some unused variables. * Adding assertions to fix false positives reported by clang analyzer. * Adding `__clang_analyzer__` macro to suppress false positive warnings. Test Plan: USE_CLANG=1 OPT=-g make analyze -j64 Reviewers: andrewkr, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D60549	8 years ago
Gunnar Kudrjavets	b954847fca	Fix release build for MyRocks by using debug-only code only in debug builds Summary: MyRocks release integration build breaks because we treat warnings caused by unused variables as errors. Variable `edit` is only used in debug builds. Therefore we need to guard it using `#ifndef NDEBUG` check. Test Plan: - `[p]arc diff --preview` for the default validation. - Verify that release build fails before this fix and passes after applying it. Reviewers: andrewkr, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D60423	8 years ago
sdong	32df9733d1	Add options.write_buffer_manager: control total memtable size across DB instances Summary: Add option write_buffer_manager to help users control total memory spent on memtables across multiple DB instances. Test Plan: Add a new unit test. Reviewers: yhchiang, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: adela, benj, sumeet, muthu, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59925	8 years ago
Aaron Gao	5aaef91d4a	group multiple batch of flush into one manifest file (one call to LogAndApply) Summary: Currently, if several flush outputs are committed together, we issue each manifest write per batch (1 batch = 1 flush = 1 sst file = 1+ continuous memtables). Each manifest write requires one fsync and one fsync to parent directory. In some cases, it becomes the bottleneck of write. We should batch them and write in one manifest write when possible. Test Plan: ` ./db_bench -benchmarks="fillseq" -max_write_buffer_number=16 -max_background_flushes=16 -disable_auto_compactions=true -min_write_buffer_number_to_merge=1 -write_buffer_size=65536 -level0_stop_writes_trigger=10000 -level0_slowdown_writes_trigger=10000` Before ``` Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags RocksDB: version 4.9 Date: Fri Jul 1 15:38:17 2016 CPU: 32 * Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz CPUCache: 20480 KB Keys: 16 bytes each Values: 100 bytes each (50 bytes after compression) Entries: 1000000 Prefix: 0 bytes Keys per prefix: 0 RawSize: 110.6 MB (estimated) FileSize: 62.9 MB (estimated) Write rate: 0 bytes/second Compression: Snappy Memtablerep: skip_list Perf Level: 1 WARNING: Assertions are enabled; benchmarks unnecessarily slow ------------------------------------------------ Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags DB path: [/tmp/rocksdbtest-112628/dbbench] fillseq : 166.277 micros/op 6014 ops/sec; 0.7 MB/s ``` After ``` Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags RocksDB: version 4.9 Date: Fri Jul 1 15:35:05 2016 CPU: 32 * Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz CPUCache: 20480 KB Keys: 16 bytes each Values: 100 bytes each (50 bytes after compression) Entries: 1000000 Prefix: 0 bytes Keys per prefix: 0 RawSize: 110.6 MB (estimated) FileSize: 62.9 MB (estimated) Write rate: 0 bytes/second Compression: Snappy Memtablerep: skip_list Perf Level: 1 WARNING: Assertions are enabled; benchmarks unnecessarily slow ------------------------------------------------ Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags DB path: [/tmp/rocksdbtest-112628/dbbench] fillseq : 52.328 micros/op 19110 ops/sec; 2.1 MB/s ``` Reviewers: andrewkr, IslamAbdelRahman, yhchiang, sdong Reviewed By: sdong Subscribers: igor, andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D60075	8 years ago
Islam AbdelRahman	7c919deccc	Reuse TimedFullMerge instead of FullMerge + instrumentation Summary: We have alot of code duplication whenever we call FullMerge we keep duplicating the instrumentation and statistics code This is a simple diff to refactor the code to use TimedFullMerge instead of FullMerge Test Plan: COMPILE_WITH_ASAN=1 make check -j64 Reviewers: andrewkr, yhchiang, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D59577	9 years ago
Ashish Shenoy	99765ed855	Clean up the ComputeCompactionScore() API Summary: Make CompactionOptionsFIFO a part of mutable_cf_options Test Plan: UT Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, lgalanis, dhruba Differential Revision: https://reviews.facebook.net/D58653	9 years ago
Aaron Orenstein	2073cf3775	Eliminate use of 'using namespace std'. Also remove a number of ADL references to std functions. Summary: Reduce use of argument-dependent name lookup in RocksDB. Test Plan: 'make check' passed. Reviewers: andrewkr Reviewed By: andrewkr Subscribers: leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D58203	9 years ago
Islam AbdelRahman	560358dc93	Fix data race in GetObsoleteFiles() Summary: GetObsoleteFiles() and LogAndApply() functions modify obsolete_manifests_ vector we need to make sure that the mutex is held when we modify the obsolete_manifests_ Test Plan: run the test under TSAN Reviewers: andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D58011	9 years ago
sdong	bfb6b1b8a8	Estimate pending compaction bytes more accurately Summary: Currently we estimate bytes needed for compaction by assuming fanout value to be level multiplier. It overestimates when size of a level exceeds the target by large. We estimate by the ratio of actual sizes in levels instead. Test Plan: Fix existing test cases and add a new one. Reviewers: IslamAbdelRahman, igor, yhchiang Reviewed By: yhchiang Subscribers: MarkCallaghan, leveldb, andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D57789	9 years ago
Andrew Kryczka	73a847ef89	Add per-level compression ratio property Summary: This is needed so we can measure compression ratio improvements achieved by D52287. The property compares raw data size against the total file size for a given level. If the level is empty it should return 0.0. Test Plan: new unit test Reviewers: IslamAbdelRahman, yhchiang, sdong Reviewed By: sdong Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D56967	9 years ago
Igor Canadi	ab4c62332e	Don't use version in the error message Summary: We use object `v` in the error message, which is not initialized if the edit is column family manipulation. This doesn't provide much useful info, so this diff is removing it. Instead, it dumps actual VersionEdit contents. Test Plan: compiles. would be great to get tests in version_set_test.cc that cover cases where a file write fails Reviewers: sdong, yhchiang, andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D56349	9 years ago
Aaron Gao	cc87075d63	No need to limit to 20 files in UpdateAccumulatedStats() if options.max_open_files=-1 Summary: There is a hardcoded constraint in our statistics collection that prevents reading properties from more than 20 SST files. This means our statistics will be very inaccurate for databases with > 20 files since additional files are just ignored. The purpose of constraining the number of files used is to bound the I/O performed during statistics collection, since these statistics need to be recomputed every time the database reopened. However, this constraint doesn't take into account the case where option "max_open_files" is -1. In that case, all the file metadata has already been read, so MaybeInitializeFileMetaData() won't incur any I/O cost. so this diff gets rid of the 20-file constraint in case max_open_files == -1. Test Plan: write into unit test db/db_properties_test.cc - "ValidateSampleNumber". We generate 20 files with 2 rows and 10 files with 1 row. If max_open_files !=-1, the `rocksdb.estimate-num-keys` should be (101 + 102)/20 * 30 = 45. Otherwise, it should be the ground truth, 50. {F1089153} Reviewers: andrewkr Reviewed By: andrewkr Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D56253	9 years ago
Marton Trencseni	9b51987521	Adding pin_l0_filter_and_index_blocks_in_cache feature and related fixes. Summary: When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache. What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention. Test Plan: 'export TEST_TMPDIR=/dev/shm/ && DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32' is OK. I didn't run the Java tests, I don't have Java set up on my devserver. Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D56133	9 years ago
Islam AbdelRahman	99ffb3d533	Fix perf_context::merge_operator_time_nanos calculation Summary: We were not measuring the time spent in merge_operator when called from Version::Get() Test Plan: added a unittest Reviewers: sdong, yhchiang Reviewed By: yhchiang Subscribers: andrewkr, dhruba Differential Revision: https://reviews.facebook.net/D55905	9 years ago
Yueh-Hsuan Chiang	be9816b3d9	Fix data race issue when sub-compaction is used in CompactionJob Summary: When subcompaction is used, all subcompactions share the same Compaction pointer in CompactionJob while each subcompaction all keeps their mutable stats in SubcompactionState. However, there're still some mutable part that is currently store in the shared Compaction pointer. This patch makes two changes: 1. Make the shared Compaction pointer const so that it can never be modified during the compaction. 2. Move necessary states from Compaction to SubcompactionState. 3. Make functions of Compaction const if the function does not modify its internal state. Test Plan: rocksdb and MyRocks test Reviewers: sdong, kradhakrishnan, andrewkr, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: andrewkr, dhruba, yoshinorim, gunnarku, leveldb Differential Revision: https://reviews.facebook.net/D55923	9 years ago
sdong	b1fafcaca6	Revert "Adding pin_l0_filter_and_index_blocks_in_cache feature." This reverts commit `522de4f59e`. It has bug of index block cleaning up.	9 years ago
Marton Trencseni	522de4f59e	Adding pin_l0_filter_and_index_blocks_in_cache feature. Summary: When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache. What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention. When the table reader is destroyed, it releases the pinned blocks (if there were any). This has to happen before the cache is destroyed, so I had to introduce a TableReader::Close(), to guarantee the order of destruction. Test Plan: Added two unit tests for this. Existing unit tests run fine (default is pin_l0_filter_and_index_blocks_in_cache=false). DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32 Mac: OK. Linux: with D55287 patched in it's OK. Reviewers: sdong Reviewed By: sdong Subscribers: andrewkr, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D54801	9 years ago
Edouard A	02e62ebbc8	Fixes warnings and ensure correct int behavior on 32-bit platforms.	9 years ago
Andrew Kryczka	d9620239d2	Cleanup stale manifests outside of full purge Summary: - Keep track of obsolete manifests in VersionSet - Updated FindObsoleteFiles() to put obsolete manifests in the JobContext for later use by PurgeObsoleteFiles() - Added test case that verifies a stale manifest is deleted by a non-full purge Test Plan: $ ./backupable_db_test --gtest_filter=BackupableDBTest.ChangeManifestDuringBackupCreation Reviewers: IslamAbdelRahman, yoshinorim, sdong Reviewed By: sdong Subscribers: andrewkr, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D55269	9 years ago
Andrew Kryczka	69c471bd9b	Handle concurrent manifest update and backup creation Summary: Fixed two related race conditions in backup creation. (1) CreateNewBackup() uses DB::DisableFileDeletions() to prevent table files from being deleted while it is copying; however, the MANIFEST file could still rotate during this time. The fix is to stop deleting the old manifest in the rotation logic. It will be deleted safely later when PurgeObsoleteFiles() runs (can only happen when file deletions are enabled). (2) CreateNewBackup() did not account for the CURRENT file being mutable. This is significant because the files returned by GetLiveFiles() contain a particular manifest filename, but the manifest to which CURRENT refers can change at any time. This causes problems when CURRENT changes between the call to GetLiveFiles() and when it's copied to the backup directory. To workaround this, I manually forge a CURRENT file referring to the manifest filename returned in GetLiveFiles(). (2) also applies to the checkpointing code, so let me know if this approach is good and I'll make the same change there. Test Plan: new test for roll manifest during backup creation. running the test before this change: $ ./backupable_db_test --gtest_filter=BackupableDBTest.ChangeManifestDuringBackupCreation ... IO error: /tmp/rocksdbtest-9383/backupable_db/MANIFEST-000001: No such file or directory running the test after this change: $ ./backupable_db_test --gtest_filter=BackupableDBTest.ChangeManifestDuringBackupCreation ... [ RUN ] BackupableDBTest.ChangeManifestDuringBackupCreation [ OK ] BackupableDBTest.ChangeManifestDuringBackupCreation (2836 ms) Reviewers: IslamAbdelRahman, anthony, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D54711	9 years ago
sdong	92a9ccf1a6	Add a new compaction priority that picks file whose overlapping ratio is smallest Summary: Add a new compaction priority as following: For every file, we calculate total size of files overalapping with the file in the next level, over the file's size itself. The file with smallest ratio will be picked first. My "db_bench --fillrandom" shows about 5% less compaction than kOldestSmallestSeqFirst if --hard_pending_compaction_bytes_limit value to keep LSM tree in shape. If not limiting hard_pending_compaction_bytes_limit, improvement is only 1% or 2%. Test Plan: Add a unit test Reviewers: andrewkr, kradhakrishnan, anthony, IslamAbdelRahman, yhchiang Reviewed By: yhchiang Subscribers: MarkCallaghan, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D54075	9 years ago
Baraa Hamodi	21e95811d1	Updated all copyright headers to the new format.	9 years ago
Andrew Kryczka	fdd70d1495	Skip filters for last L0 file if hit-optimized Summary: Following up on D53493, we can still enable the filter-skipping optimization for last file in L0. It's correct to assume the key will be present in the last L0 file when we're hit-optimized and L0 is deepest. The FilePicker encapsulates the state for traversing each level's files, so I needed to make it expose whether the returned file is last in its level. Test Plan: verified below test fails before this patch and passes afterwards. The change to how the test memtable is populated is needed so file 1 has keys (0, 30, 60), file 2 has keys (10, 40, 70), etc. $ ./db_universal_compaction_test --gtest_filter=UniversalCompactionNumLevels/DBTestUniversalCompaction.OptimizeFiltersForHits/* Reviewers: sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D53583	9 years ago
sdong	4b50f13540	Should not skip bloom filter for L0 during the query. Summary: It's a regression bug caused by `e089db40f9`. With the change, if options.optimize_filters_for_hits=true and there are only L0 files (like single level universal compaction), we skip all the files in L0, which is more than necessary. Fix it by always trying to query bloom filter for files in level 0. Test Plan: Add a unit test for it. Reviewers: anthony, rven, yhchiang, IslamAbdelRahman, kradhakrishnan, andrewkr Reviewed By: andrewkr Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D53493	9 years ago
Venkatesh Radhakrishnan	b7ecf3d214	Fix intermittent hang in ColumnFamilyTest.FlushAndDropRaceCondition Summary: ColumnFamilyTest.FlushAndDropRaceCondition sometimes hangs because the sync point, "FlushJob::InstallResults", sleeps holding the DB mutex. Fixing it by releasing the mutex before sleeping. Test Plan: seq 1000 \|parallel --gnu --eta 't=/dev/shm/rdb-{}; rm -rf $t; mkdir $t && export TEST_TMPDIR=$t; ./column_family_test -gtest_filter=FlushAndDropRaceCondition > $t/log-{}' Reviewers: IslamAbdelRahman, anthony, kradhakrishnan, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D53349	9 years ago
agiardullo	bcd4ccbc33	Revert D7809 Summary: Revert the functionaility of D7809 (but I'm keeping the logging and test code). We decided it was dangerous to ignore sync failures based on attempting to read the data written. The read does not tell us whether the data was synced. Test Plan: There was no test for the particular functionaility that was reverted. Keeping the test code from D7809 that tests whether we set the DB to be readonly when paranoid checks are enabled. Reviewers: rven, yhchiang, kradhakrishnan, IslamAbdelRahman, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D52989	9 years ago
Mike Kolupaev	34704d5c7b	[easy] Fixed a crash in LogAndApply() when CF creation failed Summary: That line used to dereference `column_family_data`, which is nullptr if we're creating a column family. Test Plan: `make -j check` Reviewers: sdong Reviewed By: sdong Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D52881	9 years ago
sdong	b9f77ba12b	When slowdown is triggered, reduce the write rate Summary: It's usually hard for users to set a value of options.delayed_write_rate. With this diff, after slowdown condition triggers, we greedily reduce write rate if estimated pending compaction bytes increase. If estimated compaction pending bytes drop, we increase the write rate. Test Plan: Add a unit test Test with db_bench setting: TEST_TMPDIR=/dev/shm/ ./db_bench --benchmarks=fillrandom -num=10000000 --soft_pending_compaction_bytes_limit=1000000000 --hard_pending_compaction_bytes_limit=3000000000 --delayed_write_rate=100000000 and make sure without the commit, write stop will happen, but with the commit, it will not happen. Reviewers: igor, anthony, rven, yhchiang, kradhakrishnan, IslamAbdelRahman Reviewed By: IslamAbdelRahman Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D52131	9 years ago
Andrew Kryczka	e089db40f9	Skip bottom-level filter block caching when hit-optimized Summary: When Get() or NewIterator() trigger file loads, skip caching the filter block if (1) optimize_filters_for_hits is set and (2) the file is on the bottommost level. Also skip checking filters under the same conditions, which means that for a preloaded file or a file that was trivially-moved to the bottom level, its filter block will eventually expire from the cache. - added parameters/instance variables in various places in order to propagate the config ("skip_filters") from version_set to block_based_table_reader - in BlockBasedTable::Rep, this optimization prevents filter from being loaded when the file is opened simply by setting filter_policy = nullptr - in BlockBasedTable::Get/BlockBasedTable::NewIterator, this optimization prevents filter from being used (even if it was loaded already) by setting filter = nullptr Test Plan: updated unit test: $ ./db_test --gtest_filter=DBTest.OptimizeFiltersForHits will also run 'make check' Reviewers: sdong, igor, paultuckfield, anthony, rven, kradhakrishnan, IslamAbdelRahman, yhchiang Reviewed By: yhchiang Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D51633	9 years ago
Gunnar Kudrjavets	97265f5f14	Fix minor bugs in delete operator, snprintf, and size_t usage Summary: List of changes: 1) Fix the snprintf() usage in cases where wrong variable was used to determine the output buffer size. 2) Remove unnecessary checks before calling delete operator. 3) Increase code correctness by using size_t type when getting vector's size. 4) Unify the coding style by removing namespace::std usage at the top of the file to confirm to the majority usage. 5) Fix various lint errors pointed out by 'arc lint'. Test Plan: Code review and build: git diff make clean make -j 32 commit-prereq arc lint Reviewers: kradhakrishnan, sdong, rven, anthony, yhchiang, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D51849	9 years ago
Dmitri Smirnov	aca403d2b5	Fix another rebase problems.	9 years ago
Dmitri Smirnov	a6fbdd64e0	Fix rebase issues and new code warnings.	9 years ago

... 6 7 8 9 10 ...

818 Commits (d140fbfd7dcdb52a66e94b0cf6a14516e409e1fb)