rocksdb

Commit Graph

Author	SHA1	Message	Date
Bingyi Sun	61d5a132c9	Fix typo: rename "bounary" to "boundary" in block.cc (#7328 ) Summary: Fix typo in comment for SeekForGetImpl(). Rename "bounary" to "boundary" Pull Request resolved: https://github.com/facebook/rocksdb/pull/7328 Reviewed By: riversand963 Differential Revision: D23439748 Pulled By: zhichao-cao fbshipit-source-id: 83a34c417c71a3210ce54a090d76c4d5571313f3	5 years ago
Jay Zhuang	c2485f2d81	Add buffer prefetch support for non directIO usecase (#7312) Summary: A new file interface `SupportPrefetch()` is added. When the user overrides it to `false`, an internal prefetch buffer will be used for readahead. This is useful for non-directIO use cases where the FS doesn't have readahead support. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7312 Reviewed By: anand1976 Differential Revision: D23329847 Pulled By: jay-zhuang fbshipit-source-id: 71cd4ce6f4a820840294e4e6aec111ab76175527	5 years ago
sdong	722814e357	Get() to fail with underlying failures in PartitionIndexReader::CacheDependencies() (#7297) Summary: Right now all I/O failures under PartitionIndexReader::CacheDependencies() are swallowed. This doesn't impact correctness, but we've decided that any I/O error in the read path should now be returned to users for awareness. Return errors in those cases instead. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7297 Test Plan: Add a new unit test that injects errors in this code path and checks that Get() fails. Only one I/O path is hit in PartitionIndexReader::CacheDependencies(). Several option changes were attempted but did not trigger other pread paths. Not sure whether other failure cases are even possible. Would rely on continuous stress testing to validate it. Reviewed By: anand1976 Differential Revision: D23257950 fbshipit-source-id: 859dbc92fa239996e1bb378329344d3d54168c03	5 years ago
mrambacher	b7e1c5213f	Add some simulator cache and block tracer tests to ASSERT_STATUS_CHECKED (#7305 ) Summary: More tests now pass. When in doubt, I added a TODO comment to check what should happen with an ignored error. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7305 Reviewed By: akankshamahajan15 Differential Revision: D23301262 Pulled By: ajkr fbshipit-source-id: 5f120edc7393560aefc0633250277bbc7e8de9e6	5 years ago
mrambacher	e9befdebbf	Add EnvTestWithParam::OptionsTest to the ASSERT_STATUS_CHECKED passes (#7283 ) Summary: This test uses database functionality and required more extensive work to get it to pass than the other tests. The DB functionality required for this test now passes the check. When it was unclear what the proper behavior was for unchecked status codes, a TODO was added. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7283 Reviewed By: akankshamahajan15 Differential Revision: D23251497 Pulled By: ajkr fbshipit-source-id: 52b79629bdafa0a58de8ead1d1d66f141b331523	5 years ago
Levi Tamasi	9d6f48ec1d	Clean up CompressBlock/CompressBlockInternal a bit (#7249 ) Summary: The patch cleans up and refactors `CompressBlock` and `CompressBlockInternal` a bit. In particular, it does the following: * It renames `CompressBlockInternal` to `CompressData` and moves it to `util/compression.h`, where other general compression-related utilities are located. This will facilitate reuse in the BlobDB write path. * The signature of the method is changed so it now takes `compression_format_version` (similarly to the compression library specific methods) instead of `format_version` (which is specific to the block based table). * `GetCompressionFormatForVersion` no longer takes `compression_type` as a parameter. This parameter was only used in a (not entirely up-to-date) assertion; also, removing it eliminates the need to ensure this precondition holds at all call sites. * Does some minor cleanup in `CompressBlock`, for instance, it is now possible to pass only one of `sampled_output_fast` and `sampled_output_slow`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7249 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D23087278 Pulled By: ltamasi fbshipit-source-id: e6316e45baed8b4e7de7c1780c90501c2a3439b3	5 years ago
Yuhong Guo	5444942f15	Fix cmake build on MacOS (#7205) Summary: 1. `std::random_shuffle` is deprecated and now we can use `std::shuffle` ``` /rocksdb/db/prefix_test.cc:590:12: error: 'random_shuffle<std::__1::__wrap_iter<unsigned long long *> >' is deprecated [-Werror,-Wdeprecated-declarations] std::random_shuffle(prefixes.begin(), prefixes.end()); ^ /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/algorithm:2982:1: note: 'random_shuffle<std::__1::__wrap_iter<unsigned long long *> >' has been explicitly marked deprecated here _LIBCPP_DEPRECATED_IN_CXX14 void ^ /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/__config:1107:39: note: expanded from macro '_LIBCPP_DEPRECATED_IN_CXX14' # define _LIBCPP_DEPRECATED_IN_CXX14 _LIBCPP_DEPRECATED ^ /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/__config:1090:48: note: expanded from macro '_LIBCPP_DEPRECATED' # define _LIBCPP_DEPRECATED __attribute__ ((deprecated)) ``` 2. `c_test` link error with `-DROCKSDB_BUILD_SHARED=OFF`: ``` [ 7%] Linking CXX executable c_test ld: library not found for -lrocksdb-shared clang: error: linker command failed with exit code 1 (use -v to see invocation) make[5]: *** [c_test] Error 1 make[4]: *** [CMakeFiles/c_test.dir/all] Error 2 make[4]: *** Waiting for unfinished jobs.... ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/7205 Reviewed By: ajkr Differential Revision: D23030641 Pulled By: pdillinger fbshipit-source-id: f270e50fc0b824ca1a0876ec5c65d33f55a72dd0	5 years ago
anand76	832b056a30	Enable IO timeouts for iterators (#7161 ) Summary: Introduce io_timeout in ReadOptions and enabled deadline/io_timeout for Iterators. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7161 Test Plan: New unit tests in db_basic_test Reviewed By: riversand963 Differential Revision: D22687352 Pulled By: anand1976 fbshipit-source-id: 67bbb0e6d7ae80b256589244468494292538c6ec	5 years ago
sdong	5c1a544122	Clean up InternalIterator upper bound logic a little bit (#7200) Summary: InternalIterator::IsOutOfBound() and InternalIterator::MayBeOutOfUpperBound() are two functions related to the upper bound check. It is hard for users to reason about this complexity. Consolidate the two functions into one that returns an enum result to improve readability. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7200 Test Plan: Run all existing tests. Would run crash test with atomic for a while. Reviewed By: anand1976 Differential Revision: D22833181 fbshipit-source-id: a0c724267056adbd0476bde74650e6c7226077e6	5 years ago
sdong	41c328fe57	Fix a perf regression that caused every key to go through upper bound check (#7209) Summary: https://github.com/facebook/rocksdb/pull/5289 introduced a performance regression that caused an upper bound check within every BlockBasedTableIterator::Next(). This is unnecessary if we've checked the boundary key for the current block and it is within the upper bound. Fix the regression. Also replace the boolean with an enum so that the code is slightly more readable. The change that introduced the regression was probably meant to fix a bug where the block upper bound check status is not reset after a new block is created. Fix that bug directly so that the regression can be avoided without hitting the bug. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7209 Test Plan: Run all existing tests. Will run atomic black box crash test for a while. Reviewed By: anand1976 Differential Revision: D22859246 fbshipit-source-id: cbdad1f5e656c55fd8b71726d5a4f6cb53ff9140	5 years ago
Andrew Kryczka	a4a4a2dabd	dedup ReadOptions in iterator hierarchy (#7210 ) Summary: Previously, a `ReadOptions` object was stored in every `BlockBasedTableIterator` and every `LevelIterator`. This redundancy consumes extra memory, resulting in the `Arena` making more allocations, and iteration observing worse cache performance. This PR migrates callers of `NewInternalIterator()` and `MakeInputIterator()` to provide a `ReadOptions` object guaranteed to outlive the returned iterator. When the iterator's lifetime will be managed by the user, this lifetime guarantee is achieved by storing the `ReadOptions` value in `ArenaWrappedDBIter`. Then, sub-iterators of `NewInternalIterator()` and `MakeInputIterator()` can hold a reference-to-const `ReadOptions`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7210 Test Plan: - `make check` under ASAN and valgrind - benchmark: on a DB with 2 L0 files and 3 L1+ levels, this PR reduced `Arena` allocation 4792 -> 4160 bytes. Reviewed By: anand1976 Differential Revision: D22861323 Pulled By: ajkr fbshipit-source-id: 54aebb3e89c872eeab0f5793b4b6e42878d093ce	5 years ago
sdong	692f6a3138	Implement NextAndGetResult() in memtable and level iterator (#7179) Summary: NextAndGetResult() is not implemented in memtable and is very simply implemented in level iterator. The result is that for a normal leveled iterator, a performance regression will be observed from calling PrepareValue() for most iterator Next() calls. Mitigate the problem by implementing the function for both iterators. In level iterator, the implementation cannot be perfect because when calling the file iterator's SeekToFirst() we don't have information about whether the value is prepared. Fortunately, the first key should not account for a big portion of the CPU. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7179 Test Plan: Run normal crash test for a while. Reviewed By: anand1976 Differential Revision: D22783840 fbshipit-source-id: c19f45cdf21b756190adef97a3b66ccde3936e05	5 years ago
mrambacher	d44cbc5314	Add hash of key/value checks when paranoid_file_checks=true (#7134) Summary: When paranoid_file_checks=true, a rolling key-value hash is generated and compared to what is written to the file. If the values do not match, the SST file is rejected. The check is in place for both flush and compaction jobs. Corresponding test added to corruption_test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7134 Reviewed By: cheng-chang Differential Revision: D22646149 fbshipit-source-id: 8fde1984a1a11edd3bd82a413acffc5ea7aa683f	5 years ago
Andrew Kryczka	643c863b72	minimize BlockIter comparator scope (#7149 ) Summary: PR https://github.com/facebook/rocksdb/issues/6944 transitioned `BlockIter` from using `Comparator` to using concrete `UserComparatorWrapper` and `InternalKeyComparator`. However, adding them as instance variables to `BlockIter` was not optimal. Bloating `BlockIter` caused the `ArenaWrappedDBIter`'s arena allocator to do more heap allocations (in certain cases) which harmed performance of `DB::NewIterator()`. This PR pushes down the concrete comparator objects to the point of usage, which forces them to be on the stack. As a result, the `BlockIter` is back to its original size prior to https://github.com/facebook/rocksdb/issues/6944 (actually a bit smaller since there were two `Comparator` before). Pull Request resolved: https://github.com/facebook/rocksdb/pull/7149 Test Plan: verified our internal `DB::NewIterator()`-heavy regression test no longer reports regression. Reviewed By: riversand963 Differential Revision: D22623189 Pulled By: ajkr fbshipit-source-id: f6d69accfe5de51e0bd9874a480b32b29909bab6	5 years ago
mrambacher	c7c7b07f06	More Makefile Cleanup (#7097 ) Summary: Cleans up some of the dependencies on test code in the Makefile while building tools: - Moves the test::RandomString, DBBaseTest::RandomString into Random - Moves the test::RandomHumanReadableString into Random - Moves the DestroyDir method into file_utils - Moves the SetupSyncPointsToMockDirectIO into sync_point. - Moves the FaultInjection Env and FS classes under env These changes allow all of the tools to build without dependencies on test_util, thereby simplifying the build dependencies. By moving the FaultInjection code, the dependency in db_stress on different libraries for debug vs release was eliminated. Tested both release and debug builds via Make and CMake for both static and shared libraries. More work remains to clean up how the tools are built and remove some unnecessary dependencies. There is also more work that should be done to get the Makefile and CMake to align in their builds -- what is in the libraries and the sizes of the executables are different. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7097 Reviewed By: riversand963 Differential Revision: D22463160 Pulled By: pdillinger fbshipit-source-id: e19462b53324ab3f0b7c72459dbc73165cc382b2	5 years ago
Andrew Kryczka	82611ee25a	save key comparisons in BlockIter::BinarySeek (#7068 ) Summary: This is a followup to https://github.com/facebook/rocksdb/issues/6646. In that PR, for simplicity I just appended a comparison against the 0th restart key in case `BinarySeek()`'s binary search landed at index 0. As a result there were `2/(N+1) + log_2(N)` key comparisons. This PR does it differently. Now we expand the binary search range by one so it also covers the case where target is at or before the restart key at index 0. As a result, it involves `log_2(N+1)` key comparisons. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7068 Test Plan: ran readrandom with mostly default settings and counted key comparisons using `PerfContext`. before: `user_key_comparison_count = 28881965` after: `user_key_comparison_count = 27823245` setup command: ``` $ TEST_TMPDIR=/dev/shm/dbbench ./db_bench -benchmarks=fillrandom,compact -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -max_background_jobs=12 -level_compaction_dynamic_level_bytes=true -num=10000000 ``` benchmark command: ``` $ TEST_TMPDIR=/dev/shm/dbbench/ ./db_bench -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=10000000 -compression_type=none -reads=1000000 -perf_level=3 ``` Reviewed By: anand1976 Differential Revision: D22357032 Pulled By: ajkr fbshipit-source-id: 8b01e9c1c2a4e9d02fc9dfe16c1cc0327f8bdf24	5 years ago
Zitan Chen	b35a2f9146	Fix GetFileDbIdentities (#7104 ) Summary: Although PR https://github.com/facebook/rocksdb/issues/7032 fixes the construction of the `SstFileDumper` in `GetFileDbIdentities` by setting a proper `Env` of the `Options` passed in the constructor, the file path was not corrected accordingly. This actually disables backup engine to use db session ids in the file names since the `db_session_id` is always empty. Now it is fixed by setting the correct path in the construction of `SstFileDumper`. Furthermore, to preserve the Direct IO property that backup engine already has, parameter `EnvOptions` is added to `GetFileDbIdentities` and `SstFileDumper`. The `BackupUsingDirectIO` test is updated accordingly. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7104 Test Plan: backupable_db_test and some manual tests. Reviewed By: ajkr Differential Revision: D22443245 Pulled By: gg814 fbshipit-source-id: 056a9bb8b82947c5e73d7c3fbb62bfe23af5e562	5 years ago
Akanksha Mahajan	54f171fe90	Update Flush policy in PartitionedIndexBuilder on switching from user-key to internal-key mode (#7096 ) Summary: When format_version is high enough to support user-key and there are index entries for same user key that spans multiple data blocks then it changes from user-key mode to internal-key mode. But the flush policy is not reset to point to Block Builder of internal-keys. After this switch, no entries are added to user key index partition result, thus it never triggers flushing the block. Fix: 1. After adding the entry in sub_builder_index_, if there is a switch from user-key to internal-key, then flush policy is updated to point to Block Builder of internal-keys index partition. 2. Set sub_builder_index_->seperator_is_key_plus_seq_ = true if seperator_is_key_plus_seq_ is set to true so that subsequent partitions can also use internal key mode. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7096 Test Plan: make check -j64 Reviewed By: ajkr Differential Revision: D22416598 Pulled By: akankshamahajan15 fbshipit-source-id: 01fc2dc07ea1b32f8fb803995ebe6e9a3fbe67ac	5 years ago
rockeet	b649d8cb97	Fixed Factory construct just for calling .Name() (#7080 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7080 Reviewed By: riversand963 Differential Revision: D22412352 Pulled By: ajkr fbshipit-source-id: 1d7f4c1621040a0130245139b52c3f4d3deac865	5 years ago
Andrew Kryczka	dd29ad4223	Separate internal and user key comparators in `BlockIter` (#6944) Summary: Replace `BlockIter::comparator_` and `IndexBlockIter::user_comparator_wrapper_` with a concrete `UserComparatorWrapper` and `InternalKeyComparator`. The motivation for this change was the inconvenience of not knowing the concrete type of `BlockIter::comparator_`, which prevented calling specialized internal key comparison functions to optimize comparison of keys with global seqno applied. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6944 Test Plan: benchmark setup -- single file DBs, in-memory, no compression. "normal_db" created by regular flush; "ingestion_db" created by ingesting a file. Both DBs have same contents. ``` $ TEST_TMPDIR=/dev/shm/normal_db/ ./db_bench -benchmarks=fillrandom,compact -write_buffer_size=10485760000 -disable_auto_compactions=true -compression_type=none -num=1000000 $ ./ldb write_extern_sst ./tmp.sst --db=/dev/shm/ingestion_db/dbbench/ --compression_type=no --hex --create_if_missing < <(./sst_dump --command=scan --output_hex --file=/dev/shm/normal_db/dbbench/000007.sst | awk 'began {print "0x" substr($1, 2, length($1) - 2), "==>", "0x" $5} ; /^Sst file format: block-based/ {began=1}') $ ./ldb ingest_extern_sst ./tmp.sst --db=/dev/shm/ingestion_db/dbbench/ ``` benchmark run command: ``` $ TEST_TMPDIR=/dev/shm/$DB/ ./db_bench -benchmarks=seekrandom -seek_nexts=$SEEK_NEXT -use_existing_db=true -cache_index_and_filter_blocks=false -num=1000000 -cache_size=0 -threads=1 -reads=200000000 -mmap_read=1 -verify_checksum=false ``` results: perf improved marginally for ingestion_db and did not change significantly for normal_db (ops/sec, with % change of PR6944 relative to master): SEEK_NEXT=0, normal_db: master 350880, PR6944 351040 (0.0); SEEK_NEXT=0, ingestion_db: master 343255, PR6944 349424 (1.8); SEEK_NEXT=10, normal_db: master 218711, PR6944 217892 (-0.4); SEEK_NEXT=10, ingestion_db: master 220334, PR6944 226437 (2.8). Reviewed By: pdillinger Differential Revision: D21924676 Pulled By: ajkr fbshipit-source-id: ea4288a2eefa8112eb6c651a671c1de18c12e538	5 years ago
Peter Dillinger	a680a7ea37	Un-revert #7049 , revert #7022 (#7071 ) Summary: Even though local bisection gave me a clear signal (and still does) that reverting https://github.com/facebook/rocksdb/issues/7049 would fix the failures in MultiThreadedDBTest, https://github.com/facebook/rocksdb/issues/7022 seems to be the root cause. Reverting https://github.com/facebook/rocksdb/issues/7022 and keeping https://github.com/facebook/rocksdb/issues/7049 seems to fix the issue in local reproducer also. (Had these landed in opposite order, bisection would have found the root cause.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/7071 Reviewed By: akankshamahajan15 Differential Revision: D22362857 Pulled By: pdillinger fbshipit-source-id: ed63df3d74e9d4ce1604de8fe43b216166c7a3f0	5 years ago
Akanksha Mahajan	5edfe3a3d8	Update Flush policy in PartitionedIndexBuilder on switching from user-key to internal-key mode (#7022 ) Summary: When format_version is high enough to support user-key and there are index entries for same user key that spans multiple data blocks then it changes from user-key mode to internal-key mode. But the flush policy is not reset to point to Block Builder of internal-keys. After this switch, no entries are added to user key index partition result, thus it never triggers flushing the block. Fix: After adding the entry in sub_builder_index_, if there is a switch from user-key to internal-key, then flush policy is updated to point to Block Builder of internal-keys index partition. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7022 Test Plan: 1. make check -j64 2. Added one unit test case Reviewed By: ajkr Differential Revision: D22197734 Pulled By: akankshamahajan15 fbshipit-source-id: d87e9e46bccab8e896ee6979d6b79c51f73d479e	5 years ago
Andrew Kryczka	8458532d58	Skip unnecessary allocation for mmap reads under 5000 bytes (#7043) Summary: With mmap enabled on an uncompressed file, we were previously always doing a heap allocation to obtain the scratch buffer for `RandomAccessFileReader::Read()`. However, that allocation was unnecessary as the underlying file reader returned a pointer into its mapped memory, not the provided scratch buffer. This PR passes the `BlockFetcher`'s inline buffer as the scratch buffer if the data block is small enough (less than `kDefaultStackBufferSize` bytes, currently 5000). Ideally we would not pass a scratch buffer at all for an mmap read; however, the `RandomAccessFile::Read()` API guarantees such a buffer is provided, and non-standard implementations may be relying on it even when `Options::allow_mmap_reads == true`. In that case, this PR still works but introduces an extra copy from the inline buffer to a heap buffer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7043 Reviewed By: cheng-chang Differential Revision: D22320606 Pulled By: ajkr fbshipit-source-id: ad964dd23df34e07d979c6032c2dfe5454c98b52	5 years ago
Anand Ananthabhotla	9a5886bd8c	Extend Get/MultiGet deadline support to table open (#6982 ) Summary: Current implementation of the ```read_options.deadline``` option only checks the deadline for random file reads during point lookups. This PR extends the checks to file opens, prefetches and preloads as part of table open. The main changes are in the ```BlockBasedTable```, partitioned index and filter readers, and ```TableCache``` to take ReadOptions as an additional parameter. In ```BlockBasedTable::Open```, in order to retain existing behavior w.r.t checksum verification and block cache usage, we filter out most of the options in ```ReadOptions``` except ```deadline```. However, having the ```ReadOptions``` gives us more flexibility to honor other options like verify_checksums, fill_cache etc. in the future. Additional changes in callsites due to function signature changes in ```NewTableReader()``` and ```FilePrefetchBuffer```. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6982 Test Plan: Add new unit tests in db_basic_test Reviewed By: riversand963 Differential Revision: D22219515 Pulled By: anand1976 fbshipit-source-id: 8a3b92f4a889808013838603aa3ca35229cd501b	5 years ago
sdong	f9817201af	Add unity build to CircleCI (#7026) Summary: We are still keeping the unity build working, so it's a good idea to add it to pre-commit CI. Use the latest GCC docker image just to get a little bit more coverage. Fix three small issues to make it pass. Also make unity_test run db_basic_test rather than db_test to cut the test time. There is no point in running expensive tests here. It was set to run db_test before db_basic_test was separated out. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7026 Test Plan: watch tests pass. Reviewed By: zhichao-cao Differential Revision: D22223197 fbshipit-source-id: baa3b6cbb623bf359829b63ce35715c75bcb0ed4	5 years ago
Zitan Chen	be41c61f22	Add a new option for BackupEngine to store table files under shared_checksum using DB session id in the backup filenames (#6997 ) Summary: `BackupableDBOptions::new_naming_for_backup_files` is added. This option is false by default. When it is true, backup table filenames under directory shared_checksum are of the form `<file_number>_<crc32c>_<db_session_id>.sst`. Note that when this option is true, it comes into effect only when both `share_files_with_checksum` and `share_table_files` are true. Three new test cases are added. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6997 Test Plan: Passed make check. Reviewed By: ajkr Differential Revision: D22098895 Pulled By: gg814 fbshipit-source-id: a1d9145e7fe562d71cde7ac995e17cb24fd42e76	5 years ago
sdong	9cc25190e1	Test CircleCI with CLANG-10 (#7025 ) Summary: It's useful to build RocksDB using a more recent clang version in CI. Add a CircleCI build and fix some issues with it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7025 Test Plan: See all tests pass. Reviewed By: pdillinger Differential Revision: D22215700 fbshipit-source-id: 914a729c2cd3f3ac4a627cc0ac58d4691dca2168	5 years ago
Peter Dillinger	5b2bbacb6f	Minimize memory internal fragmentation for Bloom filters (#6427) Summary: New experimental option BBTO::optimize_filters_for_memory builds filters that maximize their use of "usable size" from malloc_usable_size, which is also used to compute block cache charges. Rather than always "rounding up," we track state in the BloomFilterPolicy object to mix essentially "rounding down" and "rounding up" so that the average FP rate of all generated filters is the same as without the option. (YMMV as heavily accessed filters might be unluckily lower accuracy.) Thus, the option near-minimizes what the block cache considers as "memory used" for a given target Bloom filter false positive rate and Bloom filter implementation. There are no forward or backward compatibility issues with this change, though it only works on the format_version=5 Bloom filter. With Jemalloc, we see about 10% reduction in memory footprint (and block cache charge) for Bloom filters, but 1-2% increase in storage footprint, due to encoding efficiency losses (FP rate is non-linear with bits/key). Why not weighted random round up/down rather than state tracking? By only requiring malloc_usable_size, we don't actually know what the next larger and next smaller usable sizes for the allocator are. We pick a requested size, accept and use whatever usable size it has, and use the difference to inform our next choice. This allows us to narrow in on the right balance without tracking/predicting usable sizes. Why not weight history of generated filter false positive rates by number of keys? This could lead to excess skew in small filters after generating a large filter. Results from filter_bench with jemalloc (irrelevant details omitted): (normal keys/filter, but high variance) $ ./filter_bench -quick -impl=2 -average_keys_per_filter=30000 -vary_key_count_ratio=0.9 Build avg ns/key: 29.6278 Number of filters: 5516 Total size (MB): 200.046 Reported total allocated memory (MB): 220.597 Reported internal fragmentation: 10.2732% Bits/key stored: 10.0097 Average FP rate %: 0.965228 $ ./filter_bench -quick -impl=2 -average_keys_per_filter=30000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory Build avg ns/key: 30.5104 Number of filters: 5464 Total size (MB): 200.015 Reported total allocated memory (MB): 200.322 Reported internal fragmentation: 0.153709% Bits/key stored: 10.1011 Average FP rate %: 0.966313 (very few keys / filter, optimization not as effective due to ~59 byte internal fragmentation in blocked Bloom filter representation) $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000 -vary_key_count_ratio=0.9 Build avg ns/key: 29.5649 Number of filters: 162950 Total size (MB): 200.001 Reported total allocated memory (MB): 224.624 Reported internal fragmentation: 12.3117% Bits/key stored: 10.2951 Average FP rate %: 0.821534 $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory Build avg ns/key: 31.8057 Number of filters: 159849 Total size (MB): 200 Reported total allocated memory (MB): 208.846 Reported internal fragmentation: 4.42297% Bits/key stored: 10.4948 Average FP rate %: 0.811006 (high keys/filter) $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000000 -vary_key_count_ratio=0.9 Build avg ns/key: 29.7017 Number of filters: 164 Total size (MB): 200.352 Reported total allocated memory (MB): 221.5 Reported internal fragmentation: 10.5552% Bits/key stored: 10.0003 Average FP rate %: 0.969358 $ ./filter_bench -quick -impl=2 -average_keys_per_filter=1000000 -vary_key_count_ratio=0.9 -optimize_filters_for_memory Build avg ns/key: 30.7131 Number of filters: 160 Total size (MB): 200.928 Reported total allocated memory (MB): 200.938 Reported internal fragmentation: 0.00448054% Bits/key stored: 10.1852 Average FP rate %: 0.963387 And from db_bench (block cache) with jemalloc: $ ./db_bench -db=/dev/shm/dbbench.no_optimize -benchmarks=fillrandom -format_version=5 -value_size=90 -bloom_bits=10 -num=2000000 -threads=8 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false $ ./db_bench -db=/dev/shm/dbbench -benchmarks=fillrandom -format_version=5 -value_size=90 -bloom_bits=10 -num=2000000 -threads=8 -optimize_filters_for_memory -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false $ (for FILE in /dev/shm/dbbench.no_optimize/*.sst; do ./sst_dump --file=$FILE --show_properties | grep 'filter block' ; done) | awk '{ t += $4; } END { print t; }' 17063835 $ (for FILE in /dev/shm/dbbench/*.sst; do ./sst_dump --file=$FILE --show_properties | grep 'filter block' ; done) | awk '{ t += $4; } END { print t; }' 17430747 $ #^ 2.1% additional filter storage $ ./db_bench -db=/dev/shm/dbbench.no_optimize -use_existing_db -benchmarks=readrandom,stats -statistics -bloom_bits=10 -num=2000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false -duration=10 -cache_index_and_filter_blocks -cache_size=1000000000 rocksdb.block.cache.index.add COUNT : 33 rocksdb.block.cache.index.bytes.insert COUNT : 8440400 rocksdb.block.cache.filter.add COUNT : 33 rocksdb.block.cache.filter.bytes.insert COUNT : 21087528 rocksdb.bloom.filter.useful COUNT : 4963889 rocksdb.bloom.filter.full.positive COUNT : 1214081 rocksdb.bloom.filter.full.true.positive COUNT : 1161999 $ #^ 1.04 % observed FP rate $ ./db_bench -db=/dev/shm/dbbench -use_existing_db -benchmarks=readrandom,stats -statistics -bloom_bits=10 -num=2000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=false -optimize_filters_for_memory -duration=10 -cache_index_and_filter_blocks -cache_size=1000000000 rocksdb.block.cache.index.add COUNT : 33 rocksdb.block.cache.index.bytes.insert COUNT : 8448592 rocksdb.block.cache.filter.add COUNT : 33 rocksdb.block.cache.filter.bytes.insert COUNT : 18220328 rocksdb.bloom.filter.useful COUNT : 5360933 rocksdb.bloom.filter.full.positive COUNT : 1321315 rocksdb.bloom.filter.full.true.positive COUNT : 1262999 $ #^ 1.08 % observed FP rate, 13.6% less memory usage for filters (Due to specific key density, this example tends to generate filters that are "worse than average" for internal fragmentation. "Better than average" cases can show little or no improvement.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/6427 Test Plan: unit test added, 'make check' with gcc, clang and valgrind Reviewed By: siying Differential Revision: D22124374 Pulled By: pdillinger fbshipit-source-id: f3e3aa152f9043ddf4fae25799e76341d0d8714e	5 years ago
Peter Dillinger	25a0d0ca30	Fix block checksum for >=4GB, refactor (#6978 ) Summary: Although RocksDB falls over in various other ways with KVs around 4GB or more, this change fixes how XXH32 and XXH64 were being called by the block checksum code to support >= 4GB in case that should ever happen, or the code copied for other uses. This change is not a schema compatibility issue because the checksum verification code would checksum the first (block_size + 1) mod 2^32 bytes while the checksum construction code would checksum the first block_size mod 2^32 plus the compression type byte, meaning the XXH32/64 checksums for >=4GB block would not match about 255/256 times. While touching this code, I refactored to consolidate redundant implementations, improving diagnostics and performance tracking in some cases. Also used less confusing language in those diagnostics. Makes https://github.com/facebook/rocksdb/issues/6875 obsolete. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6978 Test Plan: I was able to write a test for this using an SST file writer and VerifyChecksum in a reader. The test fails before the fix, though I'm leaving the test disabled because I don't think it's worth the expense of running regularly. Reviewed By: gg814 Differential Revision: D22143260 Pulled By: pdillinger fbshipit-source-id: 982993d16134e8c50bea2269047f901c1783726e	5 years ago
sdong	223b57eeb8	Fix the bug that compressed cache is disabled in read-only DBs (#6990 ) Summary: Compressed block cache is disabled in https://github.com/facebook/rocksdb/pull/4650 for no good reason. Re-enable it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6990 Test Plan: Add a unit test to make sure a general function works with read-only DB + compressed block cache. Reviewed By: ltamasi Differential Revision: D22072755 fbshipit-source-id: 2a55df6363de23a78979cf6c747526359e5dc7a1	5 years ago
Zitan Chen	94d04529de	Store DB identity and DB session ID in SST files (#6983 ) Summary: `db_id` and `db_session_id` are now part of the table properties for all formats and stored in SST files. This adds about 99 bytes to each new SST file. The `TablePropertiesNames` for these two identifiers are `rocksdb.creating.db.identity` and `rocksdb.creating.session.identity`. In addition, SST files generated from SstFileWriter and Repairer have DB identity “SST Writer” and “DB Repairer”, respectively. Their DB session IDs are generated in the same way as `DB::GetDbSessionId`. A table property test is added. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6983 Test Plan: make check and some manual tests. Reviewed By: zhichao-cao Differential Revision: D22048826 Pulled By: gg814 fbshipit-source-id: afdf8c11424a6f509b5c0b06dafad584a80103c9	5 years ago
Levi Tamasi	aa8f1331af	Fix uninitialized memory read in table_test (#6980 ) Summary: When using parameterized tests, `gtest` sometimes prints the test parameters. If no other printing method is available, it essentially produces a hex dump of the object. This can cause issues with valgrind with types like `TestArgs` in `table_test`, where the object layout has gaps (with uninitialized contents) due to the members' alignment requirements. The patch fixes the uninitialized reads by providing an `operator<<` for `TestArgs` and also makes sure all members are initialized (in a consistent order) on all code paths. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6980 Test Plan: `valgrind --leak-check=full ./table_test` Reviewed By: siying Differential Revision: D22045536 Pulled By: ltamasi fbshipit-source-id: 6f5920ac28c712d0aa88162fffb80172ed769c32	5 years ago
Zhen Li	9c24a5cb4d	Fix persistent cache on windows (#6932 ) Summary: The persistent cache feature caused RocksDB to crash on Windows. I filed an issue for it, https://github.com/facebook/rocksdb/issues/6919. The cause is that no "persistent_cache_key_prefix" is generated for the persistent cache. Looking at the repo history, "GetUniqueIdFromFile" is not implemented on Windows. My fix adds a "NewId()" function in "persistent_cache" and uses it to generate the prefix for the persistent cache. In this PR, I also re-enable the related test cases defined in "db_test2" and "persistent_cache_test" for Windows. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6932 Test Plan: 1. Run the related test cases in "db_test2" and "persistent_cache_test" on Windows and confirm they pass. 2. Manually run db_bench.exe with "read_cache_path" and verify. Reviewed By: riversand963 Differential Revision: D21911608 Pulled By: cheng-chang fbshipit-source-id: cdfd938d54a385edbb2836b13aaa1d39b0a6f1c2	5 years ago
Levi Tamasi	bacd6edcbe	Turn HarnessTest in table_test into a parameterized test (#6974 ) Summary: `HarnessTest` in `table_test.cc` currently tests many parameter combinations sequentially in a loop. This is problematic from a testing perspective, since if the test fails, we have no way of knowing how many/which combinations have failed. It can also cause timeouts on our test system due to the sheer number of combinations tested. (Specifically, the parallel compression threads parameter added by https://github.com/facebook/rocksdb/pull/6262 seems to have been the last straw.) There is some DIY code there that splits the load among eight test cases but that does not appear to be sufficient anymore. Instead, the patch turns `HarnessTest` into a parameterized test, so all the parameter combinations can be tested separately and potentially concurrently. It also cleans up the tests a little, fixes `RandomizedLongDB`, which did not get updated when the parallel compression threads parameter was added, and turns `FooterTests` into a standalone test case (since it does not actually need a fixture class). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6974 Test Plan: `make check` Reviewed By: siying Differential Revision: D22029572 Pulled By: ltamasi fbshipit-source-id: 51baea670771c33928f2eb3902bd69dcf540aa41	5 years ago
Andrew Kryczka	e6be168aa5	save a key comparison in block seeks (#6646 ) Summary: This saves up to two key comparisons in block seeks. The first key comparison saved is a redundant key comparison against the restart key where the linear scan starts. This comparison is saved in all cases except when the found key is in the first restart interval. The second key comparison saved is a redundant key comparison against the restart key where the linear scan ends. This is only saved in cases where all keys in the restart interval are less than the target (probability roughly `1/restart_interval`). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6646 Test Plan: ran a benchmark with mostly default settings and counted key comparisons before: `user_key_comparison_count = 19399529` after: `user_key_comparison_count = 18431498` setup command: ``` $ TEST_TMPDIR=/dev/shm/dbbench ./db_bench -benchmarks=fillrandom,compact -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -max_background_jobs=12 -level_compaction_dynamic_level_bytes=true -num=10000000 ``` benchmark command: ``` $ TEST_TMPDIR=/dev/shm/dbbench/ ./db_bench -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=10000000 -compression_type=none -reads=1000000 -perf_level=3 ``` Reviewed By: pdillinger Differential Revision: D20849707 Pulled By: ajkr fbshipit-source-id: 1f01c5cd99ea771fd27974046e37b194f1cdcfac	5 years ago
Andrew Kryczka	02db03af8d	make L0 index/filter pinned memory usage predictable (#6911 ) Summary: Memory pinned by `pin_l0_filter_and_index_blocks_in_cache` needs to be predictable based on user config. This PR makes sure we do not pin extra memory for large files generated by intra-L0 (see https://github.com/facebook/rocksdb/issues/6889). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6911 Test Plan: unit test Reviewed By: siying Differential Revision: D21835818 Pulled By: ajkr fbshipit-source-id: a11a088549d06bed8aacc2548d266e5983f0ead4	5 years ago
Yanqin Jin	3020df9df5	Remove unnecessary inclusion of version_edit.h in env (#6952 ) Summary: In db_options.cc, we should avoid including header files from the `db` directory so that we do not introduce an unnecessary dependency. `version_edit.h` was included in `db_options.cc` because we need two constants, `kUnknownChecksum` and `kUnknownChecksumFuncName`. We can define these two constants as `constexpr` in the public header `file_checksum.h`. Test plan (devserver): make check Pull Request resolved: https://github.com/facebook/rocksdb/pull/6952 Reviewed By: zhichao-cao Differential Revision: D21925341 Pulled By: riversand963 fbshipit-source-id: 2902f3b74c97f0cf16c58ad24c095c787c3a40e2	5 years ago
Zhichao Cao	f941adef88	Clean up the dead code (#6946 ) Summary: Remove the dead code in table test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6946 Test Plan: run table_test Reviewed By: riversand963 Differential Revision: D21913563 Pulled By: zhichao-cao fbshipit-source-id: c0aa9f3b95dfe87dd7fb2cd4823784f08cb3ddd3	5 years ago
anand76	98b0cbea88	Check iterator status BlockBasedTableReader::VerifyChecksumInBlocks() (#6909 ) Summary: The ```for``` loop in ```VerifyChecksumInBlocks``` only checks ```index_iter->Valid()``` which could be ```false``` either due to reaching the end of the index or, in case of partitioned index, it could be due to a checksum mismatch error when reading a 2nd level index block. Instead of throwing away the index iterator status, we need to return any errors back to the caller. Tests: Add a test in block_based_table_reader_test.cc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6909 Reviewed By: pdillinger Differential Revision: D21833922 Pulled By: anand1976 fbshipit-source-id: bc778ebf1121dbbdd768689de5183f07a9f0beae	5 years ago
Peter Dillinger	c7432cc3c0	Fix more defects reported by Coverity Scan (#6935 ) Summary: Mostly uninitialized values: some probably written before use, but some seem like bugs. Also, destructor needs to be virtual, and possible use-after-free in test Pull Request resolved: https://github.com/facebook/rocksdb/pull/6935 Test Plan: make check Reviewed By: siying Differential Revision: D21885484 Pulled By: pdillinger fbshipit-source-id: e2e7cb0a0cf196f2b55edd16f0634e81f6cc8e08	5 years ago
sdong	afa3518839	Revert "Update googletest from 1.8.1 to 1.10.0 (#6808 )" (#6923 ) Summary: This reverts commit `8d87e9cea1`. Based on offline discussions, it's too early to upgrade to gtest 1.10, as it prevents some developers from using an older version of gtest to integrate to some other systems. Revert it for now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6923 Reviewed By: pdillinger Differential Revision: D21864799 fbshipit-source-id: d0726b1ff649fc911b9378f1763316200bd363fc	5 years ago
Peter Dillinger	9360776cb9	Fix handling of too-small filter partition size (#6905 ) Summary: Because ARM and some other platforms have a larger cache line size, they have a larger minimum filter size, which causes recently added PartitionedMultiGet test in db_bloom_filter_test to fail on those platforms. The code would actually end up using larger partitions, because keys_per_partition_ would be 0 and never == number of keys added. The code now attempts to get as close as possible to the small target size, while fully utilizing that filter size, if the target partition size is smaller than the minimum filter size. Also updated the test to break more uniformly across platforms Pull Request resolved: https://github.com/facebook/rocksdb/pull/6905 Test Plan: updated test, tested on ARM Reviewed By: anand1976 Differential Revision: D21840639 Pulled By: pdillinger fbshipit-source-id: 11684b6d35f43d2e98b85ddb2c8dcfd59d670817	5 years ago
Zhichao Cao	2adb7e3768	Fix potential overflow of unsigned type in for loop (#6902 ) Summary: x.size() - 1 or y - 1 can wrap around to an extremely large value when x.size() or y is 0 and the type is unsigned. The end condition of i in the for loop then becomes extremely large, potentially causing a segmentation fault. Fix them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6902 Test Plan: pass make asan_check Reviewed By: ajkr Differential Revision: D21843767 Pulled By: zhichao-cao fbshipit-source-id: 5b8b88155ac5a93d86246d832e89905a783bb5a1	5 years ago
Peter Dillinger	14eca6bf04	For ApproximateSizes, pro-rate table metadata size over data blocks (#6784 ) Summary: The implementation of GetApproximateSizes was inconsistent in its treatment of the size of non-data blocks of SST files, sometimes including it and sometimes not. This was at its worst with a large portion of the table file used by filters and a query for a small range that crossed a table boundary: the size estimate would include the large filter size. It's conceivable that someone might want only to know the size in terms of data blocks, but I believe that's unlikely enough to ignore for now. Similarly, there's no evidence the internal function ApproximateOffsetOf is used for anything other than a one-sided ApproximateSize, so I intend to refactor to remove redundancy in a follow-up commit. So to fix this, GetApproximateSizes (and implementation details ApproximateSize and ApproximateOffsetOf) now consistently include in their returned sizes a portion of table file metadata (incl filters and indexes) based on the size portion of the data blocks in range. In other words, if a key range covers data blocks that are X% by size of all the table's data blocks, the returned approximate size is X% of the total file size. It would technically be more accurate to attribute metadata based on number of keys, but that's not computationally efficient with the data available and is rarely a meaningful difference. Also includes miscellaneous comment improvements / clarifications. Also included is a new approximatesizerandom benchmark for db_bench. No significant performance difference seen with this change, whether ~700 ops/sec with cache_index_and_filter_blocks and small cache or ~150k ops/sec without cache_index_and_filter_blocks. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6784 Test Plan: Test added to DBTest.ApproximateSizesFilesWithErrorMargin. Old code running new test... [ RUN ] DBTest.ApproximateSizesFilesWithErrorMargin db/db_test.cc:1562: Failure Expected: (size) <= (11 * 100), actual: 9478 vs 1100 Other tests updated to reflect consistent accounting of metadata. Reviewed By: siying Differential Revision: D21334706 Pulled By: pdillinger fbshipit-source-id: 6f86870e45213334fedbe9c73b4ebb1d8d611185	5 years ago
sdong	298b00a396	Reduce dependency on gtest dependency in release code (#6907 ) Summary: Release code now depends on gtest, indirectly through including "test_util/testharness.h". This creates multiple problems. One important reason is the definition of IGNORE_STATUS_IF_ERROR() in test_util/testharness.h. Move it to sync_point.h instead. Note that utilities/cassandra/format.h still depends on "test_util/testharness.h". This will be resolved in a separate diff. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6907 Test Plan: Run all existing tests. Reviewed By: ajkr Differential Revision: D21829884 fbshipit-source-id: 9253c19ffde2936f3ae68998210f8e54f645a6e6	5 years ago
Adam Retter	8d87e9cea1	Update googletest from 1.8.1 to 1.10.0 (#6808 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6808 Reviewed By: anand1976 Differential Revision: D21483984 Pulled By: pdillinger fbshipit-source-id: 70c5eff2bd54ddba469761d95e4cd4611fb8e598	5 years ago
anand76	66942e8158	Avoid unnecessary reads of uncompression dictionary in MultiGet (#6906 ) Summary: We may sometimes read the uncompression dictionary when it's not necessary: when we look up a key in an SST file but the index indicates the key is not present. This can happen with index_type 3. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6906 Test Plan: make check Reviewed By: cheng-chang Differential Revision: D21828944 Pulled By: anand1976 fbshipit-source-id: 7aef4f0a39548d0874eafefd2687006d2652f9bb	5 years ago
Cheng Chang	bcb9e41080	Explicitly free allocated buffer when status is not ok (#6903 ) Summary: Currently we rely on `BlockContents` to implicitly free the allocated scratch buffer, but when IO error happens, it doesn't make sense to construct the `BlockContents` which might be corrupted. In the stress test, we find that `assert(req.result.size() == block_size(handle));` fails because of potential IO errors. In this PR, we explicitly free the scratch buffer on error without constructing `BlockContents`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6903 Test Plan: watch stress test Reviewed By: anand1976 Differential Revision: D21823869 Pulled By: cheng-chang fbshipit-source-id: 5603fc80e9bf3f44a9d7250ddebd871afe1eb89f	5 years ago
Andrew Kryczka	c5abf78bca	avoid `IterKey::UpdateInternalKey()` in `BlockIter` (#6843 ) Summary: `IterKey::UpdateInternalKey()` is an error-prone API as it's incompatible with `IterKey::TrimAppend()`, which is used for decoding delta-encoded internal keys. This PR stops using it in `BlockIter`. Instead, it assigns global seqno in a separate `IterKey`'s buffer when needed. The logic for safely getting a Slice with global seqno properly assigned is encapsulated in `GlobalSeqnoAppliedKey`. `BinarySeek()` is also migrated to use this API (previously it ignored global seqno entirely). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6843 Test Plan: benchmark setup -- single file DBs, in-memory, no compression. "normal_db" created by regular flush; "ingestion_db" created by ingesting a file. Both DBs have same contents. ``` $ TEST_TMPDIR=/dev/shm/normal_db/ ./db_bench -benchmarks=fillrandom,compact -write_buffer_size=10485760000 -disable_auto_compactions=true -compression_type=none -num=1000000 $ ./ldb write_extern_sst ./tmp.sst --db=/dev/shm/ingestion_db/dbbench/ --compression_type=no --hex --create_if_missing < <(./sst_dump --command=scan --output_hex --file=/dev/shm/normal_db/dbbench/000007.sst \| awk 'began {print "0x" substr($1, 2, length($1) - 2), "==>", "0x" $5} ; /^Sst file format: block-based/ {began=1}') $ ./ldb ingest_extern_sst ./tmp.sst --db=/dev/shm/ingestion_db/dbbench/ ``` benchmark run command: ``` TEST_TMPDIR=/dev/shm/$DB/ ./db_bench -benchmarks=seekrandom -seek_nexts=10 -use_existing_db=true -cache_index_and_filter_blocks=false -num=1000000 -cache_size=1048576000 -threads=1 -reads=40000000 ``` results: \| DB \| code \| throughput \| \|---\|---\|---\| \| normal_db \| master \| 267.9 \| \| normal_db \| PR6843 \| 254.2 (-5.1%) \| \| ingestion_db \| master \| 259.6 \| \| ingestion_db \| PR6843 \| 250.5 (-3.5%) \| Reviewed By: pdillinger Differential Revision: D21562604 Pulled By: ajkr fbshipit-source-id: 937596f836930515da8084d11755e1f247dcb264	5 years ago
Yanqin Jin	961c7590d6	Add timestamp to delete (#6253 ) Summary: Preliminary user-timestamp support for delete. If ["a", ts=100] exists, you can delete it by calling `DB::Delete(write_options, key)` in which `write_options.timestamp` points to a `ts` higher than 100. Implementation A new ValueType, i.e. `kTypeDeletionWithTimestamp` is added for deletion marker with timestamp. The reason for a separate `kTypeDeletionWithTimestamp`: RocksDB may drop tombstones (keys with kTypeDeletion) when compacting them to the bottom level. This is OK and useful if timestamp is disabled. When timestamp is enabled, should we still reuse `kTypeDeletion`, we may drop the tombstone with a more recent timestamp, causing deleted keys to re-appear. Test plan (dev server) ``` make check ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/6253 Reviewed By: ltamasi Differential Revision: D20995328 Pulled By: riversand963 fbshipit-source-id: a9e5c22968ad76f98e3dc6ee0151265a3f0df619	5 years ago


1267 Commits (a7d4bea43aaa2dba3af04bcf9a76ea2f7ad917e6)