Summary:
When an element is first inserted into the ClockCache, it is now assigned either medium or high clock priority, depending on whether its cache priority is low or high, respectively. This is a variant of LRUCache's midpoint insertions. The main difference is that LRUCache can specify the allocated capacity for high-priority elements via the ``high_pri_pool_ratio`` parameter. In contrast, in ClockCache, low- and high-priority elements compete for all cache slots, and one group can take over the other (of course, it takes more low-priority insertions to push out high-priority elements). However, just like LRUCache, ClockCache provides the following guarantee: a high-priority element will not be evicted before a low-priority element that was inserted earlier in time.
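As a sketch (the enum and function are illustrative, not the actual ClockCache code):
```cpp
#include "rocksdb/cache.h"

using ROCKSDB_NAMESPACE::Cache;

// Sketch: new elements never start at LOW clock priority, which preserves
// the guarantee that a high-priority element is not evicted before a
// low-priority element inserted earlier in time.
enum class ClockPriority { LOW, MEDIUM, HIGH };

ClockPriority InsertionClockPriority(Cache::Priority cache_priority) {
  return cache_priority == Cache::Priority::HIGH ? ClockPriority::HIGH
                                                 : ClockPriority::MEDIUM;
}
```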
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10305
Test Plan: ``make -j24 check``
Reviewed By: pdillinger
Differential Revision: D37607787
Pulled By: guidotag
fbshipit-source-id: 24d9f2523d2f4e6415e7f0029cc061fa275c2040
Summary:
The earlier implementation of cutting output files with a compact cursor under Round-Robin priority uses `Valid()` to determine whether the `output_split_key` is valid in `ShouldStopBefore`. This contributes to excessive CPU computation, as pointed out by [this issue](https://github.com/facebook/rocksdb/issues/10315). In this PR, we change the type of `output_split_key` to `InternalKey*` and set it to `nullptr` if it is not going to be used in `ShouldStopBefore`; the `Valid()` condition check can then be avoided by testing that pointer.
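A minimal sketch of the new check (parameter names are assumptions; the real `ShouldStopBefore` has more conditions):
```cpp
// Sketch: a nullptr test replaces the per-key Valid() call.
bool ShouldStopBefore(const Slice& internal_key,
                      const InternalKeyComparator* icmp,
                      const InternalKey* output_split_key) {
  // output_split_key is nullptr unless Round-Robin cursor cutting applies.
  if (output_split_key != nullptr &&
      icmp->Compare(internal_key, output_split_key->Encode()) >= 0) {
    return true;  // cut the output file at the compact cursor
  }
  return false;  // other stopping conditions elided
}
```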
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10316
Reviewed By: ajkr
Differential Revision: D37661492
Pulled By: littlepig2013
fbshipit-source-id: 66ff1105f3378e5573d3a126fdaff9bb23b5498f
Summary:
I noticed it would clean up some things to have Cache::Insert()
return our MemoryLimit Status instead of Incomplete for the case in
which the capacity limit is reached. I suspect this fixes some existing but
unknown bugs where this Incomplete could be confused with other uses
of Incomplete, especially no_io cases. This is the most suspicious case I
noticed, but was not able to reproduce a bug, in part because the existing
code is not covered by unit tests (FIXME added): 57adbf0e91/table/get_context.cc (L397)
I audited all the existing uses of IsIncomplete and updated those that
seemed relevant.
HISTORY updated with a clear warning to users of strict_capacity_limit=true
to update uses of `IsIncomplete()`
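For callers, the update looks roughly like this (a sketch, assuming `strict_capacity_limit=true`):
```cpp
// Sketch: checking for a full cache after this change.
Status s = cache->Insert(key, value, charge, deleter, &handle);
if (s.IsMemoryLimit()) {  // previously surfaced as s.IsIncomplete()
  // capacity limit reached under strict_capacity_limit=true
}
```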
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10262
Test Plan: updated unit tests
Reviewed By: hx235
Differential Revision: D37473155
Pulled By: pdillinger
fbshipit-source-id: 4bd9d9353ccddfe286b03ebd0652df8ce20f99cb
Summary:
This allows users to pass their git command with extra options if necessary.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10318
Reviewed By: ajkr
Differential Revision: D37661175
Pulled By: lth
fbshipit-source-id: 2a7cf27626c74f167471e6ec57e3870630a582b0
Summary:
…ta blocks
During MyShadow testing, ajkr helped me find out that with partitioned index and dictionary compression enabled, `PartitionedIndexIterator::InitPartitionedIndexBlock()` spent a considerable amount of time (1-2% CPU) on fetching the uncompression dictionary. Fetching the uncompression dict was not needed since the index blocks were not compressed (and even if they were, they would use an empty dictionary). This should only affect use cases with partitioned index and dictionary compression, and without the uncompression dictionary pinned. This PR updates NewDataBlockIterator to not fetch the uncompression dictionary when the iterator is not for data blocks.
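Conceptually, the guard is as simple as the following sketch (the real `NewDataBlockIterator` is templated and more involved):
```cpp
// Sketch: only data blocks can be compressed with a dictionary, so only
// they need the uncompression dictionary.
if (block_type == BlockType::kData) {
  // fetch the uncompression dictionary (possibly from the block cache)
} else {
  // index/filter/meta blocks: skip the fetch entirely, avoiding the
  // 1-2% CPU previously spent in InitPartitionedIndexBlock()
}
```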
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10310
Test Plan:
1. `make check`
2. Perf benchmark: 1.5% (143950 -> 146176) improvement in op/sec for partitioned index + dict compression benchmark.
For default config without partitioned index and without dict compression, there is no regression in readrandom perf from multiple runs of db_bench.
```
# Set up for partitioned index with dictionary compression
TEST_TMPDIR=/dev/shm ./db_bench_main -benchmarks=filluniquerandom,compact -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -partition_index=true -compression_max_dict_bytes=16384 -compression_zstd_max_train_bytes=1638400
# Pre PR
TEST_TMPDIR=/dev/shm ./db_bench_main -use_existing_db=true -benchmarks=readrandom[-X50] -partition_index=true
readrandom [AVG 50 runs] : 143950 (± 1108) ops/sec; 15.9 (± 0.1) MB/sec
readrandom [MEDIAN 50 runs] : 144406 ops/sec; 16.0 MB/sec
# Post PR
TEST_TMPDIR=/dev/shm ./db_bench_opt -use_existing_db=true -benchmarks=readrandom[-X50] -partition_index=true
readrandom [AVG 50 runs] : 146176 (± 1121) ops/sec; 16.2 (± 0.1) MB/sec
readrandom [MEDIAN 50 runs] : 146014 ops/sec; 16.2 MB/sec
# Set up for no partitioned index and no dictionary compression
TEST_TMPDIR=/dev/shm/baseline ./db_bench_main -benchmarks=filluniquerandom,compact -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false
# Pre PR
TEST_TMPDIR=/dev/shm/baseline/ ./db_bench_main --use_existing_db=true "--benchmarks=readrandom[-X50]"
readrandom [AVG 50 runs] : 158546 (± 1000) ops/sec; 17.5 (± 0.1) MB/sec
readrandom [MEDIAN 50 runs] : 158280 ops/sec; 17.5 MB/sec
# Post PR
TEST_TMPDIR=/dev/shm/baseline/ ./db_bench_opt --use_existing_db=true "--benchmarks=readrandom[-X50]"
readrandom [AVG 50 runs] : 161061 (± 1520) ops/sec; 17.8 (± 0.2) MB/sec
readrandom [MEDIAN 50 runs] : 161596 ops/sec; 17.9 MB/sec
```
Reviewed By: ajkr
Differential Revision: D37631358
Pulled By: cbi42
fbshipit-source-id: 6ca2665e270e63871968e061ba4a99d3136785d9
Summary:
Before this PR, we call `now()` to get the wall time before performing point-lookup and range
scans when user-defined timestamp is enabled.
With this PR, we expand the coverage to:
- read with an older timestamp that is larger than the wall time when the process starts but potentially smaller than now()
- add coverage for `ReadOptions::iter_start_ts != nullptr`
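For example, a minimal sketch of such reads (assuming a 64-bit timestamp whose encoding matches the column family's timestamp comparator; the encoding helpers are elided):
```cpp
// Sketch: point lookup at an older timestamp, then a range scan with a
// lower bound; ts_buf/ts_lb_buf hold encoded timestamps (assumed).
std::string ts_buf;         // older timestamp: > wall time at process start
Slice ts(ts_buf);
ReadOptions read_opts;
read_opts.timestamp = &ts;  // read the view as of `ts`, not now()
std::string value;
Status s = db->Get(read_opts, "foo", &value);

std::string ts_lb_buf;      // lower-bound timestamp
Slice ts_lb(ts_lb_buf);
read_opts.iter_start_ts = &ts_lb;  // iterator exposes timestamp() per version
std::unique_ptr<Iterator> it(db->NewIterator(read_opts));
```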
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10280
Test Plan:
```bash
make check
```
Also,
```bash
TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_ts
```
So far, we have had four successful runs of the above.
In addition,
```bash
TEST_TMPDIR=/dev/shm/rocksdb make crash_test
```
Succeeded twice showing no regression.
Reviewed By: ltamasi
Differential Revision: D37539805
Pulled By: riversand963
fbshipit-source-id: f2d9887ad95245945ce17a014d55bb93f00e1cb5
Summary:
Previously the version was displayed as $major.$minor
This changes it to $major.$minor.$patch
This also adds the git hash for the time from which RocksDB was built to the end of report.tsv. I confirmed that benchmark_log_tool.py still parses it and that the people
who consume/graph these results are OK with it.
Example output:
ops_sec mb_sec lsm_sz blob_sz c_wgb w_amp c_mbps c_wsecs c_csecs b_rgb b_wgb usec_op p50 p99 p99.9 p99.99 pmax uptime stall% Nstall u_cpu s_cpu rss test date version job_id githash
609488 244.1 1GB 0.0GB, 1.4 0.7 93.3 39 38 0 0 1.6 1.0 4 15 26 5365 15 0.0 0 0.1 0.0 0.5 fillseq.wal_disabled.v400 2022-06-29T13:36:05 7.5.0 6115254416
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10277
Test Plan: Run it
Reviewed By: jay-zhuang
Differential Revision: D37532418
Pulled By: mdcallag
fbshipit-source-id: 55e472640d51265819b228d3373c9fa9b62b660d
Summary:
In leveled compaction, try to trivially move more than one file if possible, up to 4 files or max_compaction_bytes. This is to allow higher write throughput for some use cases where data is loaded in sequential order, where applying compaction results is the bottleneck.
When picking a file to compact that doesn't have overlapping files in the next level, try to expand to the next file if there is still no overlap.
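A sketch of the expansion loop (the constant and helpers here are assumptions, not the actual picker code):
```cpp
// Sketch: greedily extend a trivial move with consecutive files, bounded
// by a file-count cap and max_compaction_bytes.
const size_t kMaxTrivialMoveFiles = 4;
while (inputs.size() < kMaxTrivialMoveFiles &&
       total_bytes <= max_compaction_bytes) {
  FileMetaData* next = NextFileInLevel(level, inputs.back());  // hypothetical
  if (next == nullptr || OverlapsNextLevel(next)) {            // hypothetical
    break;  // expanding further would no longer be a trivial move
  }
  inputs.push_back(next);
  total_bytes += next->fd.GetFileSize();
}
```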
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10190
Test Plan:
Add some unit tests.
For performance, Try to run
./db_bench_multi_move --benchmarks=fillseq --compression_type=lz4 --write_buffer_size=5000000 --num=100000000 --value_size=1000 -level_compaction_dynamic_level_bytes
Together with https://github.com/facebook/rocksdb/pull/10188 , stalling will be eliminated in this benchmark.
Reviewed By: jay-zhuang
Differential Revision: D37230647
fbshipit-source-id: 42b260f545c46abc5d90335ac2bbfcd09602b549
Summary:
Before this PR, it is required that applications open a RocksDB secondary
instance with `max_open_files = -1`. This is a hacky workaround that
prevents IOErrors on the secondary instance during point-lookups or range
scans caused by the primary instance deleting the table files. This is not
necessary if the application can coordinate the primary and secondaries
so that the primary does not delete files that are still being used by the
secondaries. Alternatively, users can provide a custom Env/FS implementation
that deletes the files only after all primary and secondary instances
indicate the files are obsolete and deleted.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10260
Test Plan: make check
Reviewed By: jay-zhuang
Differential Revision: D37462633
Pulled By: riversand963
fbshipit-source-id: 9c2fc939f49663efa61e3d60c8f1e01d64b9d72c
Summary:
In some cases, GetFileSize can fail in copy_file_cb. If it fails, we can
return immediately, since the subsequent code is meaningless; also add a
log message to let the user know that the problem happened here.
Signed-off-by: Yite Gu <ess_gyt@qq.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10176
Reviewed By: cbi42
Differential Revision: D37510888
Pulled By: ajkr
fbshipit-source-id: 044ad8c45852fd19b8cd564b11f65d40c39e296f
Summary:
We fix two bugs in CalcHashBits. The first one is an off-by-one error: the desired number of table slots is the real number ``capacity / (kLoadFactor * handle_charge)``, which should not be rounded down. The second one is that we should disallow inputs that set the element charge to 0, namely ``estimated_value_size == 0 && metadata_charge_policy == kDontChargeCacheMetadata``.
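A sketch of the corrected computation (function and parameter names are illustrative):
```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Sketch of the fixed logic: round the desired slot count *up*, and
// reject a zero element charge up front.
int CalcHashBitsSketch(size_t capacity, size_t handle_charge,
                       double load_factor /* kLoadFactor */) {
  assert(handle_charge > 0);  // disallowed: estimated_value_size == 0 with
                              // kDontChargeCacheMetadata yields charge 0
  double num_slots =
      static_cast<double>(capacity) / (load_factor * handle_charge);
  size_t target = static_cast<size_t>(std::ceil(num_slots));  // no rounding down
  int hash_bits = 0;
  while ((size_t{1} << hash_bits) < target) {
    ++hash_bits;  // smallest power of two with at least `target` slots
  }
  return hash_bits;
}
```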
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10295
Test Plan: CalcHashBits is tested by CalcHashBitsTest (in lru_cache_test.cc). The test now iterates over many more inputs; it covers, in particular, the rounding error edge case. Overall, the test is now more robust. Run ``make -j24 check``.
Reviewed By: pdillinger
Differential Revision: D37573797
Pulled By: guidotag
fbshipit-source-id: ea4f4439f7196ab1c1afb88f566fe92850537262
Summary:
If dbname and db_log_dir are on different filesystems (one
local and one remote), creation of dbname will fail because that path
doesn't exist with respect to db_log_dir.
This patch ignores the error returned on creation of dbname. If they
are on the same filesystem, db_log_dir creation will automatically return
the error in case there is any error in the creation of dbname.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10292
Test Plan: Existing unit tests
Reviewed By: riversand963
Differential Revision: D37567773
Pulled By: akankshamahajan15
fbshipit-source-id: 005d28c536208d4c126c8cb8e196d1d85b881100
Summary:
In leveled compaction, an L0->L1 trivial move will now allow more than one file to be moved in one compaction. This allows L0 files to be moved down faster when data is loaded in sequential order, making the slowdown or stop conditions harder to hit. Also, an L0->L1 trivial move is attempted even when only some files qualify:
1. We always try to find an L0->L1 trivial move starting from the oldest files. Keep including newer files until adding another file would no longer allow a trivial move.
2. Modify the trivial move condition so that such a compaction is tagged as a trivial move.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10188
Test Plan:
See throughput improvements with db_bench with fast fillseq benchmark and small L0 files:
./db_bench_l0_move --benchmarks=fillseq --compression_type=lz4 --write_buffer_size=5000000 --num=100000000 --value_size=1000 -level_compaction_dynamic_level_bytes
The throughput improved by about 50%. Stalling still happens though.
Reviewed By: jay-zhuang
Differential Revision: D37224743
fbshipit-source-id: 8958d97f22e12bdfc14d2e85930f6fa0070e9659
Summary:
In round-robin compaction priority, when splitting the compaction into sub-compactions, the earlier implementation takes into account the compact cursor to have full use of available sub-compactions. But this may result in unbalanced sub-compactions, so we remove this here. The removal does not affect the cursor-based splitting mechanism within a sub-compaction, and thus the output files are still ensured to be split according to the cursor.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10289
Reviewed By: ajkr
Differential Revision: D37559091
Pulled By: littlepig2013
fbshipit-source-id: b8b45b99f63b09cf873f7f049bcb4ab13871fffc
Summary:
This adds --undefok to support use of this script with BlobDB for db_bench versions prior
to 7.5, when the options will land in a release.
While there is a limit to how far back this script can go WRT backwards compatibility,
this is an easy change to support early 7.x releases.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10276
Test Plan: Run it with versions of db_bench that do not and then do support these options
Reviewed By: gangliao
Differential Revision: D37529299
Pulled By: mdcallag
fbshipit-source-id: 7bb1feec5c68760e6d64792c585bfbde4f5e52d8
Summary:
The current level targets for dynamic leveling have a problem: the target level size will dramatically change after an L0->L1 compaction. When there are many L0 bytes, lower-level compactions are delayed, but they will be resumed after the L0->L1 compaction finishes, so the expected write amplification benefits might not be realized. The proposal here is to keep the level target sizes unchanged and instead rely on adjusting the score for each level to prioritize the levels that most need to compact.
Basic idea:
(1) The target level size isn't adjusted, but the score is. The reasoning is that with parallel compactions, holding back compactions might not be desirable, but we would like compactions to be scheduled from the level where we feel they are most needed. For example, if we have an extra-large L2, we would like all compactions to be scheduled for L2->L3, rather than L4->L5. This gets complicated when a large L0->L1 compaction is going on: should we compact L2->L3 or L4->L5? So the proposal for that is:
(2) The score is calculated as ``actual level size / (target size + estimated upper-level bytes coming down)``. The reasoning is that if we have a large amount of pending L0/L1 bytes coming down, compacting L2->L3 might be more expensive: once the L0 bytes are compacted down to L2, the actual L2->L3 fanout would change dramatically. On the other hand, by the time those bytes come down to L5, the impact on the L5->L6 fanout is much smaller. So when calculating the score, we can adjust it by adding the estimated downward bytes to the target level size.
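In code form, the adjusted score is roughly (a sketch; names are illustrative):
```cpp
#include <cstdint>

// Sketch: a level expecting a large influx from upper levels is
// deprioritized, since compacting it now would largely be redone once
// those bytes arrive.
double AdjustedScore(uint64_t actual_level_bytes, uint64_t target_bytes,
                     uint64_t estimated_bytes_coming_down) {
  return static_cast<double>(actual_level_bytes) /
         static_cast<double>(target_bytes + estimated_bytes_coming_down);
}
```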
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10057
Test Plan: Repurpose tests VersionStorageInfoTest.MaxBytesForLevelDynamicWithLargeL0_* tests to cover this scenario.
Reviewed By: ajkr
Differential Revision: D37539742
fbshipit-source-id: 9c154cbfe92023f918cf5d80875d8776ad4831a4
Summary:
- [x] Enabled blob caching for MultiGetBlob in RocksDB
- [x] Refactored MultiGetBlob logic and interface in RocksDB
- [x] Cleaned up Version::MultiGetBlob() and moved 'blob'-related code snippets into BlobSource
- [x] Added end-to-end test cases in db_blob_basic_test and unit tests in blob_source_test
This task is a part of https://github.com/facebook/rocksdb/issues/10156
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10272
Reviewed By: ltamasi
Differential Revision: D37558112
Pulled By: gangliao
fbshipit-source-id: a73a6a94ffdee0024d5b2a39e6d1c1a7d38664db
Summary:
Add load_latest_options() to the C API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10152
Test Plan:
Extend the existing c_test by reopening db using the latest options file
at different parts of the test.
Reviewed By: akankshamahajan15
Differential Revision: D37305225
Pulled By: ajkr
fbshipit-source-id: 8b3bab73f56fa6fcbdba45aae393145d007b3962
Summary:
If the internal iterator is not valid, `SeekToLast` with `iter_start_ts` should set `valid_` to false instead of failing an assertion.
Test plan
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10279
Reviewed By: ltamasi
Differential Revision: D37539393
Pulled By: riversand963
fbshipit-source-id: 8e94057838f8a05144fad5768f4d62f1893ec315
Summary:
This is the initial step in the development of a lock-free clock cache. This PR includes the base hash table design (which we mostly ported over from FastLRUCache) and the clock eviction algorithm. Importantly, it's still _not_ lock-free---all operations use a shard lock. Besides the locking, there are other features left as future work:
- Remove keys from the handles. Instead, use 128-bit bijective hashes of them for handle comparisons, probing (we need two 32-bit hashes of the key for double hashing) and sharding (we need one 6-bit hash).
- Remove the clock_usage_ field, which is updated on every lookup. Even if it were atomically updated, it could cause memory invalidations across cores.
- Middle insertions into the clock list.
- A test that exercises the clock eviction policy.
- Update the Java API of ClockCache and Java calls to C++.
Along the way, we improved the code and comment quality of FastLRUCache. These changes are relatively minor.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10273
Test Plan: ``make -j24 check``
Reviewed By: pdillinger
Differential Revision: D37522461
Pulled By: guidotag
fbshipit-source-id: 3d70b737dbb70dcf662f00cef8c609750f083943
Summary:
`GetWindowsErrSz` may assign a `nullptr` to `std::string` in the event it cannot format the error code to a string. This will result in a crash when `std::string` attempts to calculate the length from `nullptr`.
The change here checks the output from `FormatMessageA` and only assigns to the output `std::string` if it is not null. Additionally, the call to free the buffer is only made if a non-null value is returned from `FormatMessageA`. In the event `FormatMessageA` does not output a string, an empty string is returned instead.
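The fixed pattern is roughly the standard Win32 idiom below (a sketch, not the exact RocksDB code):
```cpp
#include <windows.h>
#include <string>

// Sketch: never construct a std::string from a null char pointer.
std::string GetWindowsErrSzSketch(DWORD err) {
  LPSTR buf = nullptr;
  DWORD len = FormatMessageA(
      FORMAT_MESSAGE_ALLOCATE_BUFFER | FORMAT_MESSAGE_FROM_SYSTEM |
          FORMAT_MESSAGE_IGNORE_INSERTS,
      nullptr, err, 0 /* default language */,
      reinterpret_cast<LPSTR>(&buf), 0, nullptr);
  if (len == 0 || buf == nullptr) {
    return std::string();  // formatting failed: return empty, not nullptr
  }
  std::string msg(buf, len);
  LocalFree(buf);  // free only when FormatMessageA actually allocated
  return msg;
}
```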
Fixes https://github.com/facebook/rocksdb/issues/10274
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10282
Reviewed By: riversand963
Differential Revision: D37542143
Pulled By: ajkr
fbshipit-source-id: c21f5119ddb451f76960acec94639d0f538052f2
Summary:
Currently, when installing a new super version while a stalling condition is triggered, we compare the estimated compaction bytes to the previous value, and if the new value is larger than or equal to the previous one, we reduce the slowdown write rate. However, if concurrent compactions happen, the same value might be reused. The result is that, although some compactions reduce the estimated compaction bytes, we treat them as a signal for further slowing down. In some cases, this causes the slowdown rate to drop all the way to the minimum, far lower than needed.
Fix the bug by not triggering a re-calculation if the new super version doesn't have a Version or memtable change. With this fix, the number of compaction finishes is still undercounted in this algorithm, but that is still better than the current bug, where they are negatively counted.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10270
Test Plan: Run a benchmark where the slowdown rate is dropped to minimal unnessarily and see it is back to a normal value.
Reviewed By: ajkr
Differential Revision: D37497327
fbshipit-source-id: 9bca961cc38fed965c3af0fa6c9ca0efaa7637c4
Summary:
Resolves https://github.com/facebook/rocksdb/issues/9761
With this PR, applications can create an iterator with the following:
```cpp
ReadOptions read_opts;
read_opts.timestamp = &ts_ub;
read_opts.iter_start_ts = &ts_lb;
auto* it = db->NewIterator(read_opts);
it->SeekToLast();
// or it->SeekForPrev("foo");
it->Prev();
...
```
The application can access different versions of the same user key via `key()`, `value()`, and `timestamp()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10200
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D37258074
Pulled By: riversand963
fbshipit-source-id: 3f0b866ade50dcff7ef60d506397a9dd6ec91565
Summary:
Most of the properties listed as required for prefix extractors
are not really required but offer some conveniences. This updates API
comments to clarify actual requirements, and adds tests to demonstrate
how previously presumed requirements can be safely violated.
This might seem like a useless exercise, but this relaxing of requirements
would be needed if we generalize prefixes to group keys not just at the
byte level but also based on bits or arbitrary value ranges. For
applications without a "natural" prefix size, having only byte-level
granularity often means one prefix size to the next differs in magnitude
by a factor of 256.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10245
Test Plan: Tests added, also covering missing Iterator cases from https://github.com/facebook/rocksdb/issues/10244
Reviewed By: bjlemaire
Differential Revision: D37371559
Pulled By: pdillinger
fbshipit-source-id: ab2dd719992eea7656e9042cf8542393e02fa244
Summary:
We saw flakes with the following failure:
```
[ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/1
utilities/backup/backup_engine_test.cc:2667: Failure
Expected: (restore_time) > (0.8 * rate_limited_restore_time), actual: 48269 vs 60470.4
terminate called after throwing an instance of 'testing::internal::GoogleTestFailureException'
what(): utilities/backup/backup_engine_test.cc:2667: Failure
Expected: (restore_time) > (0.8 * rate_limited_restore_time), actual: 48269 vs 60470.4
Received signal 6 (Aborted)
t/run-backup_engine_test-RateLimiting-BackupEngineRateLimitingTestWithParam.RateLimiting-1: line 4: 1032887 Aborted (core dumped) TEST_TMPDIR=$d ./backup_engine_test --gtest_filter=RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/1
```
Investigation revealed we forgot to use the mock time `SystemClock` for
restore rate limiting. Then the test used wall clock time, which made
the execution of "GenericRateLimiter::Request:PostTimedWait"
non-deterministic as wall clock time might have advanced enough that
waiting was not needed.
This PR changes restore rate limiting to use
mock time, which guarantees we always execute
"GenericRateLimiter::Request:PostTimedWait". Then the assertions that
rely on times recorded inside that callback should be robust.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10271
Test Plan:
Applied the following patch which guaranteed repro before the fix.
Verified the test passes after this PR even with that patch applied.
```
diff --git a/util/rate_limiter.cc b/util/rate_limiter.cc
index f369e3220..6b3ed82fa 100644
--- a/util/rate_limiter.cc
+++ b/util/rate_limiter.cc
@@ -158,6 +158,7 @@ void GenericRateLimiter::SetBytesPerSecond(int64_t bytes_per_second) {
void GenericRateLimiter::Request(int64_t bytes, const Env::IOPriority pri,
Statistics* stats) {
+ usleep(100000);
assert(bytes <= refill_bytes_per_period_.load(std::memory_order_relaxed));
bytes = std::max(static_cast<int64_t>(0), bytes);
TEST_SYNC_POINT("GenericRateLimiter::Request");
```
Reviewed By: hx235
Differential Revision: D37499848
Pulled By: ajkr
fbshipit-source-id: fd790d5a192996be8ba13b656751ccc7d8cb8f6e
Summary:
The `bytes_read` returned by the current BlobSource interface is ambiguous. The uncompressed blob size is returned if the cache hits. The size of the blob read from disk, presumably the compressed version, is returned if the cache misses. The two differing semantics might cause ambiguity and consistency issues. For example, this inconsistency caused the assertion failure (T124246362, with its hot fix in https://github.com/facebook/rocksdb/issues/10249).
The goal is to require that the value of `bytes_read` always be the on-disk blob record size.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10248
Reviewed By: ltamasi
Differential Revision: D37470292
Pulled By: gangliao
fbshipit-source-id: fbca521b2791d3674dbf2484cea5fcae2fdd94d2
Summary:
The patch fixes a couple of issues related to in-place updates: 1) the value type was not passed from
`MemTableInserter::PutCFImpl` to `MemTable::Update` and 2) `MemTable::UpdateCallback` was called
for any value type (with the callee's logic assuming `kTypeValue`) even though the callback mechanism
is only safe for plain values.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10254
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D37463644
Pulled By: ltamasi
fbshipit-source-id: 33802477dac0691681f416ae84c4d9742c6fe41a
Summary:
Before this PR, when user-defined timestamp is enabled, db_stress disables compaction filter.
This is no longer necessary after this PR, since the `DbStressCompactionFilter` is now aware of
the presence of timestamps.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10259
Test Plan: TEST_TMPDIR=/dev/shm make crash_test_with_ts
Reviewed By: ajkr
Differential Revision: D37459692
Pulled By: riversand963
fbshipit-source-id: 8fe62e90a63bd9317fe1bb95a2b4984080c9e5ef
Summary:
The patch builds on https://github.com/facebook/rocksdb/pull/9915 and adds
a new API called `PutEntity` that can be used to write a wide-column entity
to the database. The new API is added to both `DB` and `WriteBatch`. Note
that currently there is no way to retrieve these entities; more precisely, all
read APIs (`Get`, `MultiGet`, and iterator) return `NotSupported` when they
encounter a wide-column entity that is required to answer a query. Read-side
support (as well as other missing functionality like `Merge`, compaction filter,
and timestamp support) will be added in later PRs.
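Write-side usage looks like the following sketch (key and column names are illustrative):
```cpp
#include "rocksdb/db.h"
#include "rocksdb/wide_columns.h"

using namespace ROCKSDB_NAMESPACE;

// Sketch: writing a wide-column entity with two columns.
Status WriteEntity(DB* db) {
  WideColumns columns{{"attr_name1", "attr_value1"},
                      {"attr_name2", "attr_value2"}};
  return db->PutEntity(WriteOptions(), db->DefaultColumnFamily(),
                       "my_key", columns);
}
```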
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10242
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D37369748
Pulled By: ltamasi
fbshipit-source-id: 7f5e412359ed7a400fd80b897dae5599dbcd685d
Summary:
Need to disable it for now as CI is failing, particularly `MultiOpsTxnsStressTest`. Investigation details in internal task T124324915. This PR disables mempurge more widely than `MultiOpsTxnsStressTest` until we know the issue is contained to that particular test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10252
Reviewed By: riversand963
Differential Revision: D37432948
Pulled By: ajkr
fbshipit-source-id: d0cf5b0e0ec7c3142c382a0347f35a4c34f4607a
Summary:
With https://github.com/facebook/rocksdb/pull/9996 , we can pass the rate_limiter_priority to FS for most cases. This PR is to update the code path for filter block reader.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10251
Test Plan: Current unit tests should pass.
Reviewed By: pdillinger
Differential Revision: D37427667
Pulled By: gitbw95
fbshipit-source-id: 1ce5b759b136efe4cfa48a6b97e2f837ff087433
Summary:
The 'PersistRoundRobinCompactCursor' unit test in `db_compaction_test` may occasionally fail due to the inconsistent LSM state. The issue is fixed by adding `Flush()` and `WaitForFlushMemTable()` to produce a more predictable and stable LSM state.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10250
Test Plan: 'PersistRoundRobinCompactCursor' unit test in `db_compaction_test`
Reviewed By: jay-zhuang, riversand963
Differential Revision: D37426091
Pulled By: littlepig2013
fbshipit-source-id: 56fbaab0384c380c1f279a16dc8732b139c9f611
Summary:
Currently SortFileByOverlappingRatio() is O(n log n). That is usually OK, but when there are a lot of files in an LSM-tree, SortFileByOverlappingRatio() can take a non-trivial amount of time. The problem is severe when the user is loading keys in sorted order, where compactions are only trivial moves; this operation then becomes the bottleneck and limits the total throughput. This commit makes SortFileByOverlappingRatio() only find the top 50 files based on score. 50 files are usually enough for the parallel compactions needed for the level, and in case they are not enough, we fall back to random, which should be acceptable.
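The selection can be done with a partial sort instead of a full sort, roughly as in this sketch (the per-file struct and score field are stand-ins, not the actual code):
```cpp
#include <algorithm>
#include <vector>

struct ScoredFile {  // stand-in for the real per-file struct
  double score;      // overlapping-ratio-based score (assumed)
};

// Sketch: order only the ~50 best-scored files, O(n log k) vs O(n log n);
// the remaining files stay in arbitrary order (the "random" fallback pool).
void SortTopByScore(std::vector<ScoredFile>& files) {
  constexpr size_t kTopK = 50;
  auto better = [](const ScoredFile& a, const ScoredFile& b) {
    return a.score > b.score;
  };
  if (files.size() > kTopK) {
    std::partial_sort(files.begin(), files.begin() + kTopK, files.end(),
                      better);
  } else {
    std::sort(files.begin(), files.end(), better);
  }
}
```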
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10161
Test Plan:
Run a fillseq that generates a lot of files, and observe throughput improved (although stall is not yet eliminated). The command ran:
TEST_TMPDIR=/dev/shm/ ./db_bench_sort --benchmarks=fillseq --compression_type=lz4 --write_buffer_size=5000000 --num=100000000 --value_size=1000
The throughput improved by 11%.
Reviewed By: jay-zhuang
Differential Revision: D37129469
fbshipit-source-id: 492da2ef5bfc7cdd6daa3986b50d2ff91f88542d
Summary:
This task is to fix assertion failures during the crash test runs. The cache entry size might not match value size because value size can include the on-disk (possibly compressed) size. Therefore, we removed the assertions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10249
Reviewed By: ltamasi
Differential Revision: D37407576
Pulled By: gangliao
fbshipit-source-id: 577559f267c5b2437bcd0631cd0efabb6dde3b69
Summary:
Resolves https://github.com/facebook/rocksdb/issues/10129
I extracted this fix from https://github.com/facebook/rocksdb/issues/7516 since it's also already a bug in the main branch, and we want to
separate it from the main part of the PR.
There can be a race condition between two threads. Thread 1 executes
`DBImpl::FindObsoleteFiles()` while thread 2 executes `GetSortedWals()`.
```
Time thread 1 thread 2
| mutex_.lock
| read disable_delete_obsolete_files_
| ...
| wait on log_sync_cv and release mutex_
| mutex_.lock
| ++disable_delete_obsolete_files_
| mutex_.unlock
| mutex_.lock
| while (pending_purge_obsolete_files > 0) { bg_cv.wait;}
| wake up with mutex_ locked
| compute WALs tracked by MANIFEST
| mutex_.unlock
| wake up with mutex_ locked
| ++pending_purge_obsolete_files_
| mutex_.unlock
|
| delete obsolete WAL
| WAL missing but tracked in MANIFEST.
V
```
The fix proposed eliminates the possibility of the above by increasing
`pending_purge_obsolete_files_` before `FindObsoleteFiles()` can possibly release the mutex.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10187
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D37214235
Pulled By: riversand963
fbshipit-source-id: 556ab1b58ae6d19150169dfac4db08195c797184
Summary:
Add suggest_compact_range() and suggest_compact_range_cf() to the C API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10175
Test Plan:
As verifying the result requires SyncPoint, which is not available in c_test.c,
the test currently just invokes the functions and makes sure they do not crash.
Reviewed By: jay-zhuang
Differential Revision: D37305191
Pulled By: ajkr
fbshipit-source-id: 0fe257b45914f6c9aeb985d8b1820dafc57a20db
Summary:
The files behind the compaction cursor contain newer data than the files ahead of it. If a compaction writes a file that spans from before its output level’s cursor to after it, then data before the cursor will be contaminated with the old timestamp from the data after the cursor. To avoid this, we can split the output file into two – one entirely before the cursor and one entirely after the cursor. Note that, in rare cases, we **DO NOT** need to cut the file if it is a trivial move since the file will not be contaminated by older files. In such case, the compact cursor is not guaranteed to be the boundary of the file, but it does not hurt the round-robin selection process.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10227
Test Plan:
Add 'RoundRobinCutOutputAtCompactCursor' unit test in `db_compaction_test`
Task: [T122216351](https://www.internalfb.com/intern/tasks/?t=122216351)
Reviewed By: jay-zhuang
Differential Revision: D37388088
Pulled By: littlepig2013
fbshipit-source-id: 9246a6a084b6037b90d6ab3183ba4dfb75a3378d
Summary:
There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10225
Test Plan:
Add test cases for MultiGetBlob
In this task, we added the new API MultiGetBlob() for BlobSource.
This PR is a part of https://github.com/facebook/rocksdb/issues/10156
Reviewed By: ltamasi
Differential Revision: D37358364
Pulled By: gangliao
fbshipit-source-id: aff053a37615d96d768fb9aedde17da5618c7ae6
Summary:
cache_bench wasn't generating 16B keys, which are necessary for FastLRUCache. Also:
- Added asserts in cache_bench, which assumes that inserts never fail. When they do fail (for example, if we used keys of the wrong size), the memory allocated to the values is leaked, and eventually the program crashes.
- Moved kCacheKeySize to the right spot.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10234
Test Plan:
``make -j24 check``. Also, run cache_bench with FastLRUCache and check that memory usage doesn't blow up:
``./cache_bench -cache_type=fast_lru_cache -num_shard_bits=6 -skewed=true \
-lookup_insert_percent=100 -lookup_percent=0 -insert_percent=0 -erase_percent=0 \
-populate_cache=true -cache_size=1073741824 -ops_per_thread=10000000 \
-value_bytes=8192 -resident_ratio=1 -threads=16``
Reviewed By: pdillinger
Differential Revision: D37382949
Pulled By: guidotag
fbshipit-source-id: b697a942ebb215de5d341f98dc8566763436ba9b