rocksdb

Commit Graph

Author	SHA1	Message	Date
Peter Dillinger	ac498cdb86	Remove a few unnecessary includes Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6046 Test Plan: make check, manual inspection Differential Revision: D18573044 Pulled By: pdillinger fbshipit-source-id: 7a5999fc08d798ce3157b56d4b36d24027409fc3	6 years ago
Levi Tamasi	279c488395	Mark blob files not needed by any memtables/SSTs obsolete (#6032 ) Summary: The patch adds logic to mark no longer needed blob files obsolete upon database open and whenever a flush or compaction completes. Unneeded blob files are detected by iterating through live immutable non-TTL blob files starting from the lowest-numbered one, and stopping when a blob file used by any SSTs or potentially used by memtables is found. (The latter is determined by comparing the sequence number at which the blob file became immutable with the largest sequence number received in flush notifications.) In addition, the patch cleans up the logic around closing and obsoleting blob files and enforces invariants around this area (blob files are now guaranteed to go through the stages mutable-non-obsolete, immutable-non-obsolete, and immutable-obsolete in this order). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6032 Test Plan: Extended unit tests and tested using the BlobDB mode of `db_bench`. Differential Revision: D18495610 Pulled By: ltamasi fbshipit-source-id: 11825b84af74f3f4abfd9bcae04e80870ae58961	6 years ago
sdong	a150604e10	db_stress to cover total order seek (#6039 ) Summary: Right now, in db_stress, as long as prefix extractor is defined, TestIterator always uses. There is value of cover total_order_seek = true when prefix extractor is define. Add a small chance that this flag is turned on. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6039 Test Plan: Run the test for a while. Differential Revision: D18539689 fbshipit-source-id: 568790dd7789c9986b83764b870df0423a122d99	6 years ago
anand76	5b9233bfe8	Fix a test failure on systems that don't have Snappy compression libraries (#6038 ) Summary: The ParallelIO/DBBasicTestWithParallelIO.MultiGet/11 test fails if Snappy compression library is not installed, since RocksDB defaults to Snappy if none is specified. So dynamically determine the supported compression types and pick the first one. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6038 Differential Revision: D18532370 Pulled By: anand1976 fbshipit-source-id: a0a735114d1f8892ea09f7c4af8688d7bcc5b075	6 years ago
Little-Wallace	f65ec09ef8	Fix IngestExternalFile's bug with two_write_queue (#5976 ) Summary: When two_write_queue enable, IngestExternalFile performs EnterUnbatched on both write queues. SwitchMemtable also EnterUnbatched on 2nd write queue when this option is enabled. When the call stack includes IngestExternalFile -> FlushMemTable -> SwitchMemtable, this results into a deadlock. The implemented solution is to pass on the existing writes_stopped argument in FlushMemTable to skip EnterUnbatched in SwitchMemtable. Fixes https://github.com/facebook/rocksdb/issues/5974 Pull Request resolved: https://github.com/facebook/rocksdb/pull/5976 Differential Revision: D18535943 Pulled By: maysamyabandeh fbshipit-source-id: a4f9d4964c10d4a7ca06b1e0102ca2ec395512bc	6 years ago
Maysam Yabandeh	0058daef7b	Disable SmallestUnCommittedSeq in Valgrind run (#6035 ) Summary: SmallestUnCommittedSeq sometimes takes too long when run under Valgrind. The patch disables it when the tests are run under Valgrind. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6035 Differential Revision: D18509198 Pulled By: maysamyabandeh fbshipit-source-id: 1191443b9fedb6b9c50d6b76f5c92371f5030230	6 years ago
Peter Dillinger	00d58a370e	Abandon use of folly::Optional (#6036 ) Summary: Had complications with LITE build and valgrind test. Reverts/fixes small parts of PR https://github.com/facebook/rocksdb/issues/6007 Pull Request resolved: https://github.com/facebook/rocksdb/pull/6036 Test Plan: make LITE=1 all check and ROCKSDB_VALGRIND_RUN=1 DISABLE_JEMALLOC=1 make -j24 db_bloom_filter_test && ROCKSDB_VALGRIND_RUN=1 DISABLE_JEMALLOC=1 ./db_bloom_filter_test Differential Revision: D18512238 Pulled By: pdillinger fbshipit-source-id: 37213cf0d309edf11c483fb4b2fb6c02c2cf2b28	6 years ago
sdong	6123611c42	crash_test: use large max_manifest_file_size most of the time. (#6034 ) Summary: Right now, crash_test always uses 16KB max_manifest_file_size value. It is good to cover logic of manifest file switch. However, information stored in manifest files might be useful in debugging failures. Switch to only use small manifest file size in 1/15 of the time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6034 Test Plan: Observe command generated by db_crash_test.py multiple times and see the --max_manifest_file_size value distribution. Differential Revision: D18513824 fbshipit-source-id: 7b3ae6dbe521a0918df41064e3fa5ecbf2466e04	6 years ago
Peter Dillinger	e8e7fb1dcf	More fixes to auto-GarbageCollect in BackupEngine (#6023 ) Summary: Production: * Fixes GarbageCollect (and auto-GC triggered by PurgeOldBackups, DeleteBackup, or CreateNewBackup) to clean up backup directory independent of current settings (except max_valid_backups_to_open; see issue https://github.com/facebook/rocksdb/issues/4997) and prior settings used with same backup directory. * Fixes GarbageCollect (and auto-GC) not to attempt to remove "." and ".." entries from directories. * Clarifies contract with users in modifying BackupEngine operations. In short, leftovers from any incomplete operation are cleaned up by any subsequent call to that same kind of operation (PurgeOldBackups and DeleteBackup considered the same kind of operation). GarbageCollect is available to clean up after all kinds. (NB: right now PurgeOldBackups and DeleteBackup will clean up after incomplete CreateNewBackup, but we aren't promising to continue that behavior.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/6023 Test Plan: * Refactors open parameters to use an option enum, for readability, etc. (Also fixes an unused parameter bug in the redundant OpenDBAndBackupEngineShareWithChecksum.) * Fixes an apparent bug in ShareTableFilesWithChecksumsTransition in which old backup data was destroyed in the transition to be tested. That test is now augmented to ensure GarbageCollect (or auto-GC) does not remove shared files when BackupEngine is opened with share_table_files=false. * Augments DeleteTmpFiles test to ensure that CreateNewBackup does auto-GC when an incompletely created backup is detected. Differential Revision: D18453559 Pulled By: pdillinger fbshipit-source-id: 5e54e7b08d711b161bc9c656181012b69a8feac4	6 years ago
Peter Dillinger	f059c7d9b9	New Bloom filter implementation for full and partitioned filters (#6007 ) Summary: Adds an improved, replacement Bloom filter implementation (FastLocalBloom) for full and partitioned filters in the block-based table. This replacement is faster and more accurate, especially for high bits per key or millions of keys in a single filter. Speed The improved speed, at least on recent x86_64, comes from * Using fastrange instead of modulo (%) * Using our new hash function (XXH3 preview, added in a previous commit), which is much faster for large keys and only slightly slower on keys around 12 bytes if hashing the same size many thousands of times in a row. * Optimizing the Bloom filter queries with AVX2 SIMD operations. (Added AVX2 to the USE_SSE=1 build.) Careful design was required to support (a) SIMD-optimized queries, (b) compatible non-SIMD code that's simple and efficient, (c) flexible choice of number of probes, and (d) essentially maximized accuracy for a cache-local Bloom filter. Probes are made eight at a time, so any number of probes up to 8 is the same speed, then up to 16, etc. * Prefetching cache lines when building the filter. Although this optimization could be applied to the old structure as well, it seems to balance out the small added cost of accumulating 64 bit hashes for adding to the filter rather than 32 bit hashes. Here's nominal speed data from filter_bench (200MB in filters, about 10k keys each, 10 bits filter data / key, 6 probes, avg key size 24 bytes, includes hashing time) on Skylake DE (relatively low clock speed): $ ./filter_bench -quick -impl=2 -net_includes_hashing # New Bloom filter Build avg ns/key: 47.7135 Mixed inside/outside queries... Single filter net ns/op: 26.2825 Random filter net ns/op: 150.459 Average FP rate %: 0.954651 $ ./filter_bench -quick -impl=0 -net_includes_hashing # Old Bloom filter Build avg ns/key: 47.2245 Mixed inside/outside queries... Single filter net ns/op: 63.2978 Random filter net ns/op: 188.038 Average FP rate %: 1.13823 Similar build time but dramatically faster query times on hot data (63 ns to 26 ns), and somewhat faster on stale data (188 ns to 150 ns). Performance differences on batched and skewed query loads are between these extremes as expected. The only other interesting thing about speed is "inside" (query key was added to filter) vs. "outside" (query key was not added to filter) query times. The non-SIMD implementations are substantially slower when most queries are "outside" vs. "inside". This goes against what one might expect or would have observed years ago, as "outside" queries only need about two probes on average, due to short-circuiting, while "inside" always have num_probes (say 6). The problem is probably the nastily unpredictable branch. The SIMD implementation has few branches (very predictable) and has pretty consistent running time regardless of query outcome. Accuracy The generally improved accuracy (re: Issue https://github.com/facebook/rocksdb/issues/5857) comes from a better design for probing indices within a cache line (re: Issue https://github.com/facebook/rocksdb/issues/4120) and improved accuracy for millions of keys in a single filter from using a 64-bit hash function (XXH3p). Design details in code comments. Accuracy data (generalizes, except old impl gets worse with millions of keys): Memory bits per key: FP rate percent old impl -> FP rate percent new impl 6: 5.70953 -> 5.69888 8: 2.45766 -> 2.29709 10: 1.13977 -> 0.959254 12: 0.662498 -> 0.411593 16: 0.353023 -> 0.0873754 24: 0.261552 -> 0.0060971 50: 0.225453 -> ~0.00003 (less than 1 in a million queries are FP) Fixes https://github.com/facebook/rocksdb/issues/5857 Fixes https://github.com/facebook/rocksdb/issues/4120 Unlike the old implementation, this implementation has a fixed cache line size (64 bytes). At 10 bits per key, the accuracy of this new implementation is very close to the old implementation with 128-byte cache line size. If there's sufficient demand, this implementation could be generalized. Compatibility Although old releases would see the new structure as corrupt filter data and read the table as if there's no filter, we've decided only to enable the new Bloom filter with new format_version=5. This provides a smooth path for automatic adoption over time, with an option for early opt-in. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6007 Test Plan: filter_bench has been used thoroughly to validate speed, accuracy, and correctness. Unit tests have been carefully updated to exercise new and old implementations, as well as the logic to select an implementation based on context (format_version). Differential Revision: D18294749 Pulled By: pdillinger fbshipit-source-id: d44c9db3696e4d0a17caaec47075b7755c262c5f	6 years ago
Fatih Şentürk	f382f44e39	fix typo (#6025 ) Summary: fix a typo at java readme page Pull Request resolved: https://github.com/facebook/rocksdb/pull/6025 Differential Revision: D18481232 fbshipit-source-id: 1c70c2435bcd4b02f25e28cd7e35c42273e07be0	6 years ago
sdong	bb23bfe63c	Fix a regression bug on total order seek with prefix enabled and range delete (#6028 ) Summary: Recent change https://github.com/facebook/rocksdb/pull/5861 mistakely use "prefix_extractor_ != nullptr" as the condition to determine whehter prefix bloom filter isused. It fails to consider read_options.total_order_seek, so it is wrong. The result is that an optimization for non-total-order seek is mistakely applied to total order seek, and introduces a bug in following corner case: Because of RangeDelete(), a file's largest key is extended. Seek key falls into the range deleted file, so level iterator seeks into the previous file without getting any key. The correct behavior is to place the iterator to the first key of the next file. However, an optimization is triggered and invalidates the iterator because it is out of the prefix range, causing wrong results. This behavior is reproduced in the unit test added. Fix the bug by setting prefix_extractor to be null if total order seek is used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6028 Test Plan: Add a unit test which fails without the fix. Differential Revision: D18479063 fbshipit-source-id: ac075f013029fcf69eb3a598f14c98cce3e810b3	6 years ago
Peter Dillinger	42b5494ec8	Fix BloomFilterPolicy changes for unsigned char (ARM) (#6024 ) Summary: Bug in PR https://github.com/facebook/rocksdb/issues/5941 when char is unsigned that should only affect assertion on unused/invalid filter metadata. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6024 Test Plan: on ARM: ./bloom_test && ./db_bloom_filter_test && ./block_based_filter_block_test && ./full_filter_block_test && ./partitioned_filter_block_test Differential Revision: D18461206 Pulled By: pdillinger fbshipit-source-id: 68a7c813a0b5791c05265edc03cdf52c78880e9a	6 years ago
anand76	6c7b1a0cc7	Batched MultiGet API for multiple column families (#5816 ) Summary: Add a new API that allows a user to call MultiGet specifying multiple keys belonging to different column families. This is mainly useful for users who want to do a consistent read of keys across column families, with the added performance benefits of batching and returning values using PinnableSlice. As part of this change, the code in the original multi-column family MultiGet for acquiring the super versions has been refactored into a separate function that can be used by both, the batching and the non-batching versions of MultiGet. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5816 Test Plan: make check make asan_check asan_crash_test Differential Revision: D18408676 Pulled By: anand1976 fbshipit-source-id: 933e7bec91dd70e7b633be4ff623a1116cc28c8d	6 years ago
sdong	a19de78da5	db_stress to cover SeekForPrev() (#6022 ) Summary: Right now, db_stress doesn't cover SeekForPrev(). Add the coverage, which mirrors what we do for Seek(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6022 Test Plan: Run "make crash_test". Do some manual source code hack to simular iterator wrong results and see it caught. Differential Revision: D18442193 fbshipit-source-id: 879b79000d5e33c625c7e970636de191ccd7776c	6 years ago
anand76	03ce7fb292	Fix a buffer overrun problem in BlockBasedTable::MultiGet (#6014 ) Summary: The calculation in BlockBasedTable::MultiGet for the required buffer length for reading in compressed blocks is incorrect. It needs to take the 5-byte block trailer into account. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6014 Test Plan: Add a unit test DBBasicTest.MultiGetBufferOverrun that fails in asan_check before the fix, and passes after. Differential Revision: D18412753 Pulled By: anand1976 fbshipit-source-id: 754dfb66be1d5f161a7efdf87be872198c7e3b72	6 years ago
蔡渠棠	f29e6b3be2	bugfix: MemTableList::RemoveOldMemTables invalid iterator after remov… (#6013 ) Summary: Fix issue https://github.com/facebook/rocksdb/issues/6012. I found that it may be caused by the following codes in function _RemoveOldMemTables()_ in db/memtable_list.cc : ``` for (auto it = memlist.rbegin(); it != memlist.rend(); ++it) { MemTable* mem = it; if (mem->GetNextLogNumber() > log_number) { break; } current_->Remove(mem, to_delete); ``` The iterator it* turns invalid after `current_->Remove(mem, to_delete);` Pull Request resolved: https://github.com/facebook/rocksdb/pull/6013 Test Plan: ``` make check ``` Differential Revision: D18401107 Pulled By: riversand963 fbshipit-source-id: bf0da3b868ed70f7aff24cf7b3e2049c0c5c7a4e	6 years ago
Sagar Vemuri	c17384fea4	Cascade TTL Compactions to move expired key ranges to bottom levels faster (#5992 ) Summary: When users use Level-Compaction-with-TTL by setting `cf_options.ttl`, the ttl-expired data could take n*ttl time to reach the bottom level (where n is the number of levels) due to how the `creation_time` table property was calculated for the newly created files during compaction. The creation time of new files was set to a max of all compaction-input-files-creation-times which essentially resulted in resetting the ttl as the key range moves across levels. This behavior is now fixed by changing the `creation_time` to be based on minimum of all compaction-input-files-creation-times; this will cause cascading compactions across levels for the ttl-expired data to move to the bottom level, resulting in getting rid of tombstones/deleted-data faster. This will help start cascading compactions to move the expired key range to the bottom-most level faster. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5992 Test Plan: `make check` Differential Revision: D18257883 Pulled By: sagar0 fbshipit-source-id: 00df0bb8d0b7e14d9fc239df2cba8559f3e54cbc	6 years ago
Levi Tamasi	8e7aa62813	BlobDB: Maintain mapping between blob files and SSTs (#6020 ) Summary: The patch adds logic to BlobDB to maintain the mapping between blob files and SSTs for which the blob file in question is the oldest blob file referenced by the SST file. The mapping is initialized during database open based on the information retrieved using `GetLiveFilesMetaData`, and updated after flushes/compactions based on the information received through the `EventListener` interface (or, in the case of manual compactions issued through the `CompactFiles` API, the `CompactionJobInfo` object). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6020 Test Plan: Added a unit test; also tested using the BlobDB mode of `db_bench`. Differential Revision: D18410508 Pulled By: ltamasi fbshipit-source-id: dd9e778af781cfdb0d7056298c54ba9cebdd54a5	6 years ago
Peter Dillinger	aa63abf698	Auto-GarbageCollect on PurgeOldBackups and DeleteBackup (#6015 ) Summary: Only if there is a crash, power failure, or I/O error in DeleteBackup, shared or private files from the backup might be left behind that are not cleaned up by PurgeOldBackups or DeleteBackup-- only by GarbageCollect. This makes the BackupEngine API "leaky by default." Even if it means a modest performance hit, I think we should make Delete and Purge do as they say, with ongoing best effort: i.e. future calls will attempt to finish any incomplete work from earlier calls. This change does that by having DeleteBackup and PurgeOldBackups do a GarbageCollect, unless (to minimize performance hit) this BackupEngine has already done a GarbageCollect and there have been no deletion-related I/O errors in that GarbageCollect or since then. Rejected alternative 1: remove meta file last instead of first. This would in theory turn partially deleted backups into corrupted backups, but code changes would be needed to allow the missing files and consider it acceptably corrupt, rather than failing to open the BackupEngine. This might be a reasonable choice, but I mostly rejected it because it doesn't solve the legacy problem of cleaning up existing lingering files. Rejected alternative 2: use a deletion marker file. If deletion started with creating a file that marks a backup as flagged for deletion, then we could reliably detect partially deleted backups and efficiently finish removing them. In addition to not solving the legacy problem, this could be precarious if there's a disk full situation, and we try to create a new file in order to delete some files. Ugh. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6015 Test Plan: Updated unit tests Differential Revision: D18401333 Pulled By: pdillinger fbshipit-source-id: 12944e372ce6809f3f5a4c416c3b321a8927d925	6 years ago
Yi Wu	72de842ac9	Fix DBFlushTest::FireOnFlushCompletedAfterCommittedResult hang (#6018 ) Summary: The test would fire two flushes to let them run in parallel. Previously it wait for the first job to be scheduled before firing the second. It is possible the job is not started before the second job being scheduled, making the two job combine into one. Change to wait for the first job being started. Fixes https://github.com/facebook/rocksdb/issues/6017 Pull Request resolved: https://github.com/facebook/rocksdb/pull/6018 Test Plan: ``` while ./db_flush_test --gtest_filter=FireOnFlushCompletedAfterCommittedResult; do :; done ``` and let it run for a while. Signed-off-by: Yi Wu <yiwu@pingcap.com> Differential Revision: D18405576 Pulled By: riversand963 fbshipit-source-id: 6ebb6262e033d5dc2ef81cb3eb410b314f2de4c9	6 years ago
Levi Tamasi	f80050fa8f	Add file number/oldest referenced blob file number to {Sst,Live}FileMetaData (#6011 ) Summary: The patch exposes the file numbers of the SSTs as well as the oldest blob files they contain a reference to through the GetColumnFamilyMetaData/ GetLiveFilesMetaData interface. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6011 Test Plan: Fixed and extended the existing unit tests. (The earlier ColumnFamilyMetaDataTest wasn't really testing anything because the generated memtables were never flushed, so the metadata structure was essentially empty.) Differential Revision: D18361697 Pulled By: ltamasi fbshipit-source-id: d5ed1d94ac70858b84393c48711441ddfe1251e9	6 years ago
Yun Tang	07a0ad3c29	Download bzip2 packages from sourceforge (#5995 ) Summary: From bzip2's official [download page](http://www.bzip.org/downloads.html), we could download it from sourceforge. This source would be more credible than previous web archive. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5995 Differential Revision: D18377662 fbshipit-source-id: e8353f83d5d6ea6067f78208b7bfb7f0d5b49c05	6 years ago
anand76	9836a1fa33	Fix MultiGet crash when no_block_cache is set (#5991 ) Summary: This PR fixes https://github.com/facebook/rocksdb/issues/5975. In ```BlockBasedTable::RetrieveMultipleBlocks()```, we were calling ```MaybeReadBlocksAndLoadToCache()```, which is a no-op if neither uncompressed nor compressed block cache are configured. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5991 Test Plan: 1. Add unit tests that fail with the old code and pass with the new 2. make check and asan_check Cc spetrunia Differential Revision: D18272744 Pulled By: anand1976 fbshipit-source-id: e62fa6090d1a6adf84fcd51dfd6859b03c6aebfe	6 years ago
sdong	1da1f04231	Stress test to relax the iterator verification case for lower bound (#5869 ) Summary: In stress test, all iterator verification is turned off is lower bound is enabled. This might be stricter than needed. This PR relaxes the condition and include the case where lower bound is lower than both of seek key and upper bound. It seems to work mostly fine when I run crash test locally. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5869 Test Plan: Run crash_test Differential Revision: D18363578 fbshipit-source-id: 23d57e11ea507949b8100f4190ddfbe8db052d5a	6 years ago
sdong	982a7532a7	Add two test cases for single sorted universal periodic compaction (#6002 ) Summary: It's useful to add test coverage for universal compaction's periodic compaction. Add two tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6002 Test Plan: Run the two tests Differential Revision: D18363544 fbshipit-source-id: bbd04b54057315f64f959709006412db1f76d170	6 years ago
sdong	f0b469e563	Turn on periodic compaction in universal by default if compaction filter is used. (#5994 ) Summary: Recently, periodic compaction got turned on by default for leveled compaction is compaction filter is used. Since periodic compaction is now supported in universal compaction too, we do the same default for universal now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5994 Test Plan: Add a new unit test. Differential Revision: D18363744 fbshipit-source-id: 5093288ce990ee3cab0e44ffd92d8489fbcd6a48	6 years ago
Peter Dillinger	7b3222e10a	Partial rebalance of TEST_GROUPs for Travis (#6010 ) Summary: TEST_GROUP=1 has sometimes been timing out but generally taking 45-50 minutes vs. 20-25 for groups 2-4. Beyond the compilation time, tests in group 1 consist of about 19 minutes of db_test, and 7 minutes of everything else. This change moves most of that "everything else" to group 2. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6010 Test Plan: Travis for this PR, oncall watch Travis Differential Revision: D18373536 Pulled By: pdillinger fbshipit-source-id: 0b3af004c71e4fd6bc01a94dac34cc3079fc9ce1	6 years ago
sdong	111ebf3161	db_stress: improve TestGet() failure printing (#5989 ) Summary: Right now, in db_stress's CF consistency test's TestGet case, if failure happens, we do normal string printing, rather than hex printing, so that some text is not printed out, which makes debugging harder. Fix it by printing hex instead. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5989 Test Plan: Build db_stress and see t passes. Differential Revision: D18363552 fbshipit-source-id: 09d1b8f6fbff37441cbe7e63a1aef27551226cec	6 years ago
Zhichao Cao	8ea087ad16	Workload generator (Mixgraph) based on prefix hotness (#5953 ) Summary: In the previous PR https://github.com/facebook/rocksdb/issues/4788, user can use db_bench mix_graph option to generate the workload that is from the social graph. The key is generated based on the key access hotness. In this PR, user can further model the key-range hotness and fit those to two-term-exponential distribution. First, user cuts the whole key space into small key ranges (e.g., key-ranges are the same size and the key-range number is the number of SST files). Then, user calculates the average access count per key of each key-range as the key-range hotness. Next, user fits the key-range hotness to two-term-exponential distribution (f(x) = f(x) = aexp(bx) + cexp(dx)) and generate the value of a, b, c, and d. They are the parameters in db_bench: prefix_dist_a, prefix_dist_b, prefix_dist_c, and prefix_dist_d. Finally, user can run db_bench by specify the parameters. For example: `./db_bench --benchmarks="mixgraph" -use_direct_io_for_flush_and_compaction=true -use_direct_reads=true -cache_size=268435456 -key_dist_a=0.002312 -key_dist_b=0.3467 -keyrange_dist_a=14.18 -keyrange_dist_b=-2.917 -keyrange_dist_c=0.0164 -keyrange_dist_d=-0.08082 -keyrange_num=30 -value_k=0.2615 -value_sigma=25.45 -iter_k=2.517 -iter_sigma=14.236 -mix_get_ratio=0.85 -mix_put_ratio=0.14 -mix_seek_ratio=0.01 -sine_mix_rate_interval_milliseconds=5000 -sine_a=350 -sine_b=0.0105 -sine_d=50000 --perf_level=2 -reads=1000000 -num=5000000 -key_size=48` Pull Request resolved: https://github.com/facebook/rocksdb/pull/5953 Test Plan: run db_bench with different parameters and checked the results. Differential Revision: D18053527 Pulled By: zhichao-cao fbshipit-source-id: 171f8b3142bd76462f1967c58345ad7e4f84bab7	6 years ago
Maysam Yabandeh	50804656d2	Enable write-conflict snapshot in stress tests (#5897 ) Summary: DBImpl extends the public GetSnapshot() with GetSnapshotForWriteConflictBoundary() method that takes snapshots specially for write-write conflict checking. Compaction treats such snapshots differently to avoid GCing a value written after that, so that the write conflict remains visible even after the compaction. The patch extends stress tests with such snapshots. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5897 Differential Revision: D17937476 Pulled By: maysamyabandeh fbshipit-source-id: bd8b0c578827990302194f63ae0181e15752951d	6 years ago
Peter Dillinger	834feaff05	Add clarifying/instructive header to TARGETS and defs.bzl Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6008 Differential Revision: D18343273 Pulled By: pdillinger fbshipit-source-id: f7d1c78d711bbfb0deea9ec88212c19ab2ec91b8	6 years ago
Yanqin Jin	67e735dbf9	Rename BlockBasedTable::ReadMetaBlock (#6009 ) Summary: According to https://github.com/facebook/rocksdb/wiki/Rocksdb-BlockBasedTable-Format, the block read by BlockBasedTable::ReadMetaBlock is actually the meta index block. Therefore, it is better to rename the function to ReadMetaIndexBlock. This PR also applies some format change to existing code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6009 Test Plan: make check Differential Revision: D18333238 Pulled By: riversand963 fbshipit-source-id: 2c4340a29b3edba53d19c132cbfd04caf6242aed	6 years ago
Sergei Petrunia	230bcae7b6	Add a limited support for iteration bounds into BaseDeltaIterator (#5403 ) Summary: For MDEV-19670: MyRocks: key lookups into deleted data are very slow BaseDeltaIterator remembers iterate_upper_bound and will not let delta_iterator_ walk above the iterate_upper_bound if base_iterator_ is not valid anymore. == Rationale == The most straightforward way would be to make the delta_iterator (which is a rocksdb::WBWIIterator) to support iterator bounds. But checking for bounds has an extra CPU overhead. So we put the check into BaseDeltaIterator, and only make it when base_iterator_ is not valid. (note: We could take it even further, and move the check a few lines down, and only check iterator bounds ourselves if base_iterator_ is not valid AND delta_iterator_ hit a tombstone). Pull Request resolved: https://github.com/facebook/rocksdb/pull/5403 Differential Revision: D15863092 Pulled By: maysamyabandeh fbshipit-source-id: 8da458e7b9af95ff49356666f69664b4a6ccf49b	6 years ago
Maysam Yabandeh	52733b4498	WritePrepared: Fix flaky test MaxCatchupWithNewSnapshot (#5850 ) Summary: MaxCatchupWithNewSnapshot tests that the snapshot sequence number will be larger than the max sequence number when the snapshot was taken. However since the test does not have access to the max sequence number when the snapshot was taken, it uses max sequence number after that, which could have advanced the snapshot by then, thus making the test flaky. The fix is to compare with max sequence number before the snapshot was taken, which is a lower bound for the value when the snapshot was taken. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5850 Test Plan: ~/gtest-parallel/gtest-parallel --repeat=12800 ./write_prepared_transaction_test --gtest_filter="MaxCatchupWithNewSnapshot" Differential Revision: D17608926 Pulled By: maysamyabandeh fbshipit-source-id: b122ae5a27f982b290bd60da852e28d3c5eb0136	6 years ago
sdong	0d91a981e9	Fix assertion in universal compaction periodic compaction (#6000 ) Summary: We recently added periodic compaction to universal compaction. An old assertion that we can't onlyl compact the last sorted run triggered. However, with periodic compaction, it is possible that we only compact the last sorted run, so the assertion now became stricter than needed. Relaxing this assertion. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6000 Test Plan: This should be a low risk change. Will observe whether stress test will pass after it. Differential Revision: D18285396 fbshipit-source-id: 9a6863debdf104c40a7f6c46ab62d84cdf5d8592	6 years ago
sdong	e4e1d35cc2	Revert "Disable pre-5.5 versions in the format compatibility test (#5990 )" (#5999 ) Summary: This reverts commit `351e25401b`. All branches have been fixed to buildable on FB environments, so we can revert it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5999 Differential Revision: D18281947 fbshipit-source-id: 6deaaf1b5df2349eee5d6ed9b91208cd7e23ec8e	6 years ago
Levi Tamasi	a44670e71b	Use aggregate initialization for FlushJobInfo/CompactionJobInfo (#5997 ) Summary: FlushJobInfo and CompactionJobInfo are aggregates; we should use the aggregate initialization syntax to ensure members (specifically those of built-in types) are value-initialized. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5997 Test Plan: make check Differential Revision: D18273398 Pulled By: ltamasi fbshipit-source-id: 35b1a63ad9ca01605d288329858af72fffd7f392	6 years ago
sdong	5b656584af	crash_test: disable periodic compaction in FIFO compaction. (#5993 ) Summary: A recent commit make periodic compaction option valid in FIFO, which means TTL. But we fail to disable it in crash test, causing assert failure. Fix it by having it disabled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5993 Test Plan: Restart "make crash_test" many times and make sure --periodic_compaction_seconds=0 is always the case when --compaction_style=2 Differential Revision: D18263223 fbshipit-source-id: c91a802017d83ae89ac43827d1b0012861933814	6 years ago
Peter Dillinger	18f57f5ef8	Add new persistent 64-bit hash (#5984 ) Summary: For upcoming new SST filter implementations, we will use a new 64-bit hash function (XXH3 preview, slightly modified). This change updates hash.{h,cc} for that change, adds unit tests, and out-of-lines the implementations to keep hash.h as clean/small as possible. In developing the unit tests, I discovered that the XXH3 preview always returns zero for the empty string. Zero is problematic for some algorithms (including an upcoming SST filter implementation) if it occurs more often than at the "natural" rate, so it should not be returned from trivial values using trivial seeds. I modified our fork of XXH3 to return a modest hash of the seed for the empty string. With hash function details out-of-lines in hash.h, it makes sense to enable XXH_INLINE_ALL, so that direct calls to XXH64/XXH32/XXH3p are inlined. To fix array-bounds warnings on some inline calls, I injected some casts to uintptr_t in xxhash.cc. (Issue reported to Yann.) Revised: Reverted using XXH_INLINE_ALL for now. Some Facebook checks are unhappy about #include on xxhash.cc file. I would fix that by rename to xxhash_cc.h, but to best preserve history I want to do that in a separate commit (PR) from the uintptr casts. Also updated filter_bench for this change, improving the performance predictability of dry run hashing and adding support for 64-bit hash (for upcoming new SST filter implementations, minor dead code in the tool for now). Pull Request resolved: https://github.com/facebook/rocksdb/pull/5984 Differential Revision: D18246567 Pulled By: pdillinger fbshipit-source-id: 6162fbf6381d63c8cc611dd7ec70e1ddc883fbb8	6 years ago
Peter Dillinger	685e895652	Prepare filter tests for more implementations (#5967 ) Summary: This change sets up for alternate implementations underlying BloomFilterPolicy: * Refactor BloomFilterPolicy and expose in internal .h file so that it's easy to iterate over / select implementations for testing, regardless of what the best public interface will look like. Most notably updated db_bloom_filter_test to use this. * Hide FullFilterBitsBuilder from unit tests (alternate derived classes planned); expose the part important for testing (CalculateSpace), as abstract class BuiltinFilterBitsBuilder. (Also cleaned up internally exposed interface to CalculateSpace.) * Rename BloomTest -> BlockBasedBloomTest for clarity (despite ongoing confusion between block-based table and block-based filter) * Assert that block-based filter construction interface is only used on BloomFilterPolicy appropriately constructed. (A couple of tests updated to add ", true".) Pull Request resolved: https://github.com/facebook/rocksdb/pull/5967 Test Plan: make check Differential Revision: D18138704 Pulled By: pdillinger fbshipit-source-id: 55ef9273423b0696309e251f50b8c1b5e9ec7597	6 years ago
Levi Tamasi	351e25401b	Disable pre-5.5 versions in the format compatibility test (#5990 ) Summary: We have updated earlier release branches going back to 5.5 so they are built using gcc7 by default. Disabling ancient versions before that until we figure out a plan for them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5990 Test Plan: Ran the script locally. Differential Revision: D18252386 Pulled By: ltamasi fbshipit-source-id: a7bbb30dc52ff2eaaf31a29ecc79f7cf4e2834dc	6 years ago
sdong	aa6f7d0995	Support periodic compaction in universal compaction (#5970 ) Summary: Previously, periodic compaction is not supported in universal compaction. Add the support using following approach: if any file is marked as qualified for periodid compaction, trigger a full compaction. If a full compaction is prevented by files being compacted, try to compact the higher levels than files currently being compacted. If in this way we can only compact the last sorted run and none of the file to be compacted qualifies for periodic compaction, skip the compact. This is to prevent the same single level compaction from being executed again and again. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5970 Test Plan: Add several test cases. Differential Revision: D18147097 fbshipit-source-id: 8ecc308154d9aca96fb192c51fbceba3947550c1	6 years ago
sdong	2a9e5caffe	Make FIFO compaction take default 30 days TTL by default (#5987 ) Summary: Right now, by default FIFO compaction has no TTL. We believe that a default TTL of 30 days will be better. With this patch, the default will be changed to 30 days. Default of Options.periodic_compaction_seconds will mean the same as options.ttl. If Options.ttl and Options.periodic_compaction_seconds left default, a default 30 days TTL will be used. If both options are set, the stricter value of the two will be used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5987 Test Plan: Add an option sanitize test to cover the case. Differential Revision: D18237935 fbshipit-source-id: a6dcea1f36c3849e13c0a69e413d73ad8eab58c9	6 years ago
Maysam Yabandeh	dccaf9f03c	Turn compaction asserts to runtime check (#5935 ) Summary: Compaction iterator has many assert statements that are active only during test runs. Some rare bugs would show up only at runtime could violate the assert condition but go unnoticed since assert statements are not compiled in release mode. Turning the assert statements to runtime check sone pors and cons: Pros: - A bug that would result into incorrect data would be detected early before the incorrect data is written to the disk. Cons: - Runtime overhead: which should be negligible since compaction cpu is the minority in the overall cpu usage - The assert statements might already being violated at runtime, and turning them to runtime failure might result into reliability issues. The patch takes a conservative step in this direction by logging the assert violations at runtime. If we see any violation reported in logs, we investigate. Otherwise, we can go ahead turning them to runtime error. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5935 Differential Revision: D18229697 Pulled By: maysamyabandeh fbshipit-source-id: f1890eca80ccd7cca29737f1825badb9aa8038a8	6 years ago
sdong	0337d87b42	crash_test: disable atomic flush with pipelined write (#5986 ) Summary: Recently, pipelined write is enabled even if atomic flush is enabled, which causing sanitizing failure in db_stress. Revert this change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5986 Test Plan: Run "make crash_test_with_atomic_flush" and see it to run for some while so that the old sanitizing error (which showed up quickly) doesn't show up. Differential Revision: D18228278 fbshipit-source-id: 27fdf2f8e3e77068c9725a838b9bef4ab25a2553	6 years ago
sdong	15119f08e2	Add more release branches to tools/check_format_compatible.sh (#5985 ) Summary: More release branches are created. We should include them in continuous format compatibility checks. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5985 Test Plan: Let's see whether it is passes. Differential Revision: D18226532 fbshipit-source-id: 75d8cad5b03ccea4ce16f00cea1f8b7893b0c0c8	6 years ago
sdong	a3960fc875	Move pipeline write waiting logic into WaitForPendingWrites() (#5716 ) Summary: In pipeline writing mode, memtable switching needs to wait for memtable writing to finish to make sure that when memtables are made immutable, inserts are not going to them. This is currently done in DBImpl::SwitchMemtable(). This is done after flush_scheduler_.TakeNextColumnFamily() is called to fetch the list of column families to switch. The function flush_scheduler_.TakeNextColumnFamily() itself, however, is not thread-safe when being called together with flush_scheduler_.ScheduleFlush(). This change provides a fix, which moves the waiting logic before flush_scheduler_.TakeNextColumnFamily(). WaitForPendingWrites() is a natural place where the logic can happen. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5716 Test Plan: Run all tests with ASAN and TSAN. Differential Revision: D18217658 fbshipit-source-id: b9c5e765c9989645bf10afda7c5c726c3f82f6c3	6 years ago
sdong	f22aaf8b3f	db_stress: CF Consistency check to use random CF to validate iterator results (#5983 ) Summary: Right now, in db_stress's iterator tests, we always use the same CF to validate iterator results. This commit changes it so that a randomized CF is used in Cf consistency test, where every CF should have exactly the same data. This would help catch more bugs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5983 Test Plan: Run "make crash_test_with_atomic_flush". Differential Revision: D18217643 fbshipit-source-id: 3ac998852a0378bb59790b20c5f236f6a5d681fe	6 years ago
Sagar Vemuri	4c9aa30a62	Auto enable Periodic Compactions if a Compaction Filter is used (#5865 ) Summary: - Periodic compactions are auto-enabled if a compaction filter or a compaction filter factory is set, in Level Compaction. - The default value of `periodic_compaction_seconds` is changed to UINT64_MAX, which lets RocksDB auto-tune periodic compactions as needed. An explicit value of 0 will still work as before ie. to disable periodic compactions completely. For now, on seeing a compaction filter along with a UINT64_MAX value for `periodic_compaction_seconds`, RocksDB will make SST files older than 30 days to go through periodic copmactions. Some RocksDB users make use of compaction filters to control when their data can be deleted, usually with a custom TTL logic. But it is occasionally possible that the compactions get delayed by considerable time due to factors like low writes to a key range, data reaching bottom level, etc before the TTL expiry. Periodic Compactions feature was originally built to help such cases. Now periodic compactions are auto enabled by default when compaction filters or compaction filter factories are used, as it is generally helpful to all cases to collect garbage. `periodic_compaction_seconds` is set to a large value, 30 days, in `SanitizeOptions` when RocksDB sees that a `compaction_filter` or `compaction_filter_factory` is used. This is done only for Level Compaction style. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5865 Test Plan: - Added a new test `DBCompactionTest.LevelPeriodicCompactionWithCompactionFilters` to make sure that `periodic_compaction_seconds` is set if either `compaction_filter` or `compaction_filter_factory` options are set. - `COMPILE_WITH_ASAN=1 make check` Differential Revision: D17659180 Pulled By: sagar0 fbshipit-source-id: 4887b9cf2e53cf2dc93a7b658c6b15e1181217ee	6 years ago

1 2 3 4 5 ...

8482 Commits (ac498cdb86e35862db348d3a5d5f0b82988f475a) All Branches Search

8482 Commits (ac498cdb86e35862db348d3a5d5f0b82988f475a)

All Branches