rocksdb

Commit Graph

Author	SHA1	Message	Date
Guido Tagliavini Ponce	7e1b417824	Revert NewClockCache signature (#10358 ) Summary: This complements https://github.com/facebook/rocksdb/issues/10351. This PR reverts NewClockCache's signature to an older version, expected by the users of the old (buggy) ClockCache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10358 Test Plan: ``make -j24 check`` and re-run the pre-release tests. Reviewed By: siying Differential Revision: D37832601 Pulled By: guidotag fbshipit-source-id: 32a91d3da4119be187935003b7b897272ceb1950	3 years ago
Changyu Bi	5f9fe7f21e	Added WAL compression checksum (#10319 ) Summary: Enabled zstd checksum flag in StreamingCompress so that WAL (de)compreression is protected by a checksum per compression frame. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10319 Test Plan: - `make check` - WAL perf: average ops/sec over 10 runs is 161226 pre PR and 159635 post PR (1% drop). ``` sudo TEST_TMPDIR=/dev/shm/memtable_write ./db_bench_checksum -benchmarks=fillseq -max_write_buffer_number=100 -num=1000000 -min_write_buffer_number_to_merge=10 -wal_compression=zstd ``` Reviewed By: ajkr Differential Revision: D37673311 Pulled By: cbi42 fbshipit-source-id: 9f34a3bfc2a82e5c80b1ec63bb339a7465108ec9	3 years ago
Bo Wang	86c2d0a95d	Add the secondary cache information into LRUCache:: GetPrintableOptions (#10346 ) Summary: If the primary cache is LRU cache and there is a secondary cache, add Secondary Cache printable options into LRUCache::GetPrintableOptions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10346 Test Plan: 1. Current Unit Tests should pass. 2. Use db_bench (with compressed_secondary_cache ) and the LOG should includes the new printable options from Seoncdary Cache. Reviewed By: jay-zhuang Differential Revision: D37779310 Pulled By: gitbw95 fbshipit-source-id: 88ce1f7df6b5f25740e598d9e7fa91e4c414cb8f	3 years ago
Guido Tagliavini Ponce	9645e66fc9	Temporarily return a LRUCache from NewClockCache (#10351 ) Summary: ClockCache is still in experimental stage, and currently fails some pre-release fbcode tests. See https://www.internalfb.com/diff/D37772011. API calls to construct ClockCache are done via the function NewClockCache. For now, NewClockCache calls will return an LRUCache (with appropriate arguments), which is stable. The idea that NewClockCache returns nullptr was also floated, but this would be interpreted as unsupported cache, and a default LRUCache would be constructed instead, potentially causing a performance regression that is harder to identify. A new version of the NewClockCache function was created for our internal tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10351 Test Plan: ``make -j24 check`` and re-run the pre-release tests. Reviewed By: pdillinger Differential Revision: D37802685 Pulled By: guidotag fbshipit-source-id: 0a8d10612ff21e576f7360cb13e20bc36e244972	3 years ago
Yanqin Jin	b283f041f5	Stop tracking syncing live WAL for performance (#10330 ) Summary: With https://github.com/facebook/rocksdb/issues/10087, applications calling `SyncWAL()` or writing with `WriteOptions::sync=true` can suffer from performance regression. This PR reverts to original behavior of tracking the syncing of closed WALs. After we revert back to old behavior, recovery, whether kPointInTime or kAbsoluteConsistency, may fail to detect corruption in synced WALs if the corruption is in the live WAL. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10330 Test Plan: make check Before https://github.com/facebook/rocksdb/issues/10087 ```bash fillsync : 750.269 micros/op 1332 ops/sec 75.027 seconds 100000 operations; 0.1 MB/s (100 ops) fillsync : 776.492 micros/op 1287 ops/sec 77.649 seconds 100000 operations; 0.1 MB/s (100 ops) fillsync [AVG 2 runs] : 1310 (± 44) ops/sec; 0.1 (± 0.0) MB/sec fillsync : 805.625 micros/op 1241 ops/sec 80.563 seconds 100000 operations; 0.1 MB/s (100 ops) fillsync [AVG 3 runs] : 1287 (± 51) ops/sec; 0.1 (± 0.0) MB/sec fillsync [AVG 3 runs] : 1287 (± 51) ops/sec; 0.1 (± 0.0) MB/sec fillsync [MEDIAN 3 runs] : 1287 ops/sec; 0.1 MB/sec ``` Before this PR and after https://github.com/facebook/rocksdb/issues/10087 ```bash fillsync : 1479.601 micros/op 675 ops/sec 147.960 seconds 100000 operations; 0.1 MB/s (100 ops) fillsync : 1626.080 micros/op 614 ops/sec 162.608 seconds 100000 operations; 0.1 MB/s (100 ops) fillsync [AVG 2 runs] : 645 (± 59) ops/sec; 0.1 (± 0.0) MB/sec fillsync : 1588.402 micros/op 629 ops/sec 158.840 seconds 100000 operations; 0.1 MB/s (100 ops) fillsync [AVG 3 runs] : 640 (± 35) ops/sec; 0.1 (± 0.0) MB/sec fillsync [AVG 3 runs] : 640 (± 35) ops/sec; 0.1 (± 0.0) MB/sec fillsync [MEDIAN 3 runs] : 629 ops/sec; 0.1 MB/sec ``` After this PR ```bash fillsync : 749.621 micros/op 1334 ops/sec 74.962 seconds 100000 operations; 0.1 MB/s (100 ops) fillsync : 865.577 micros/op 1155 ops/sec 86.558 seconds 100000 operations; 0.1 MB/s (100 ops) fillsync [AVG 2 runs] : 1244 (± 175) ops/sec; 0.1 (± 0.0) MB/sec fillsync : 845.837 micros/op 1182 ops/sec 84.584 seconds 100000 operations; 0.1 MB/s (100 ops) fillsync [AVG 3 runs] : 1223 (± 109) ops/sec; 0.1 (± 0.0) MB/sec fillsync [AVG 3 runs] : 1223 (± 109) ops/sec; 0.1 (± 0.0) MB/sec fillsync [MEDIAN 3 runs] : 1182 ops/sec; 0.1 MB/sec ``` Reviewed By: ajkr Differential Revision: D37725212 Pulled By: riversand963 fbshipit-source-id: 8fa7d13b3c7662be5d56351c42caf3266af937ae	3 years ago
sdong	769b156e65	Remove customized naming from InternalKeyComparator (#10343 ) Summary: InternalKeyComparator is a thin wrapper around user comparator. Storing a string for name is relatively expensive to this small wrapper for both CPU and memory usage. Try to remove it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10343 Test Plan: Run existing tests Reviewed By: ajkr Differential Revision: D37772469 fbshipit-source-id: d2d106a8d022193058fd7f6b220108e3d94aca34	3 years ago
Yanqin Jin	7679f22a89	Add coverage for the combination of write-prepared and WAL recycling (#10350 ) Summary: as title. Test plan - make check - CI on PR - TEST_TMPDIR=/dev/shm make crash_test_with_multiops_wp_txn (tested with successful run) Pull Request resolved: https://github.com/facebook/rocksdb/pull/10350 Reviewed By: ajkr Differential Revision: D37792872 Pulled By: riversand963 fbshipit-source-id: ff064093b7f715d0acf387af2e3ae87b1278b52b	3 years ago
Yanqin Jin	7e2004a123	Remove unused variables (#10327 ) Summary: As title. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10327 Test Plan: make check Reviewed By: gitbw95 Differential Revision: D37699040 Pulled By: riversand963 fbshipit-source-id: 305a88628907a47dea53c4d9aec9c2f5bb9b58df	3 years ago
Yanqin Jin	2f13f5f7d0	Add coverage for timestamped snapshot to MultiOpsTxnsStressTest (#10325 ) Summary: As title. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10325 Test Plan: ```bash TEST_TMPDIR=/dev/shm/rocksdb/ make crash_test_with_multiops_wc_txn TEST_TMPDIR=/dev/shm/rocksdb/ make crash_test_with_txn ``` Reviewed By: akankshamahajan15 Differential Revision: D37688742 Pulled By: riversand963 fbshipit-source-id: e198ace921898af63f99e869568c1a7bbf69f1a4	3 years ago
zczhu	96206531bc	Support reservation in thread pool (#10278 ) Summary: Add `ReserveThreads` and `ReleaseThreads` functions in thread pool to support reservation in for a specific thread pool. With this feature, a thread will be blocked if the number of waiting threads (noted by `num_waiting_threads_`) equals the number of reserved threads (noted by `reserved_threads_`), normally `reserved_threads_` is upper bounded by `num_waiting_threads_`; in rare cases (e.g. `SetBackgroundThreadsInternal` is called when some threads are already reserved), `num_waiting_threads_` can be less than `reserved_threads`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10278 Test Plan: Add `ReserveThreads` unit test in `env_test`. Update the unit test `SimpleColumnFamilyInfoTest` in `thread_list_test` with adding `ReserveThreads` related assertions. Reviewed By: hx235 Differential Revision: D37640946 Pulled By: littlepig2013 fbshipit-source-id: 4d691f6b9a433569f96ab52d52c3defe5b065367	3 years ago
Gang Liao	28586be8ec	Update HISTORY.md for blob cache (#10328 ) Summary: Update HISTORY.md for blob cache. Implementation can be found from Github issue https://github.com/facebook/rocksdb/issues/10156 (or Github PRs https://github.com/facebook/rocksdb/issues/10155, https://github.com/facebook/rocksdb/issues/10178, https://github.com/facebook/rocksdb/issues/10225, https://github.com/facebook/rocksdb/issues/10198, and https://github.com/facebook/rocksdb/issues/10272). Pull Request resolved: https://github.com/facebook/rocksdb/pull/10328 Reviewed By: riversand963 Differential Revision: D37732514 Pulled By: gangliao fbshipit-source-id: 4c942a41c07914bfc8db56a0d3cf4d3e53d5963f	3 years ago
Akanksha Mahajan	fc51b7f33a	Fix clang error implicit conversion loses integer precision (#10323 ) Summary: Fix error: implicit conversion loses integer precision: 'size_t' (aka 'unsigned long') to 'uint32_t' (aka 'unsigned int') [-Werror,-Wshorten-64-to-32] Pull Request resolved: https://github.com/facebook/rocksdb/pull/10323 Test Plan: USE_CLANG=1 make -j32 Reviewed By: gitbw95 Differential Revision: D37688250 Pulled By: akankshamahajan15 fbshipit-source-id: 443873e41279ee8bdbe8452818549792047532fb	3 years ago
Gang Liao	c987eb4712	Eliminate the copying of blobs when serving reads from the cache (#10297 ) Summary: The blob cache enables an optimization on the read path: when a blob is found in the cache, we can avoid copying it into the buffer provided by the application. Instead, we can simply transfer ownership of the cache handle to the target `PinnableSlice`. (Note: this relies on the `Cleanable` interface, which is implemented by `PinnableSlice`.) This has the potential to save a lot of CPU, especially with large blob values. This task is a part of https://github.com/facebook/rocksdb/issues/10156 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10297 Reviewed By: riversand963 Differential Revision: D37640311 Pulled By: gangliao fbshipit-source-id: 92de0e35cc703d06c87c5c1861cc2899ec52234a	3 years ago
Guido Tagliavini Ponce	c277aeb42c	Midpoint insertions in ClockCache (#10305 ) Summary: When an element is first inserted into the ClockCache, it is now assigned either medium or high clock priority, depending on whether its cache priority is low or high, respectively. This is a variant of LRUCache's midpoint insertions. The main difference is that LRUCache can specify the allocated capacity for high-priority elements via the ``high_pri_pool_ratio`` parameter. Contrarily, in ClockCache, low- and high-priority elements compete for all cache slots, and one group can take over the other (of course, it takes more low-priority insertions to push out high-priority elements). However, just as LRUCache, ClockCache provides the following guarantee: a high-priority element will not be evicted before a low-priority element that was inserted earlier in time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10305 Test Plan: ``make -j24 check`` Reviewed By: pdillinger Differential Revision: D37607787 Pulled By: guidotag fbshipit-source-id: 24d9f2523d2f4e6415e7f0029cc061fa275c2040	3 years ago
zczhu	8debfe2b21	Replace the output split key with its pointer in subcompaction (#10316 ) Summary: Earlier implementation of cutting the output files with a compact cursor under Round-Robin priority uses `Valid()` to determine if the `output_split_key` is valid in `ShouldStopBefore`. This contributes to excessive CPU computation, as pointed out by [this issue](https://github.com/facebook/rocksdb/issues/10315). In this PR, we change the type of `output_split_key` to be `InternalKey*` and set it as `nullptr` if it is not going to be used in `ShouldStopBefore`, `Valid()` condition checking can be avoided using that pointer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10316 Reviewed By: ajkr Differential Revision: D37661492 Pulled By: littlepig2013 fbshipit-source-id: 66ff1105f3378e5573d3a126fdaff9bb23b5498f	3 years ago
Peter Dillinger	e6c5e0ab9a	Have Cache use Status::MemoryLimit (#10262 ) Summary: I noticed it would clean up some things to have Cache::Insert() return our MemoryLimit Status instead of Incomplete for the case in which the capacity limit is reached. I suspect this fixes some existing but unknown bugs where this Incomplete could be confused with other uses of Incomplete, especially no_io cases. This is the most suspicious case I noticed, but was not able to reproduce a bug, in part because the existing code is not covered by unit tests (FIXME added): `57adbf0e91/table/get_context.cc (L397)` I audited all the existing uses of IsIncomplete and updated those that seemed relevant. HISTORY updated with a clear warning to users of strict_capacity_limit=true to update uses of `IsIncomplete()` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10262 Test Plan: updated unit tests Reviewed By: hx235 Differential Revision: D37473155 Pulled By: pdillinger fbshipit-source-id: 4bd9d9353ccddfe286b03ebd0652df8ce20f99cb	3 years ago
Manuel Ung	071fe39c05	Allow user to pass git command to makefile (#10318 ) Summary: This allows users to pass their git command with extra options if necessary. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10318 Reviewed By: ajkr Differential Revision: D37661175 Pulled By: lth fbshipit-source-id: 2a7cf27626c74f167471e6ec57e3870630a582b0	3 years ago
Akanksha Mahajan	2acbf386a3	Provide support for direct_reads with async_io (#10197 ) Summary: Provide support for use_direct_reads with async_io. TestPlan: - Updated unit tests - db_bench: Results in https://github.com/facebook/rocksdb/pull/10197#issuecomment-1159239420 - db_stress ``` export CRASH_TEST_EXT_ARGS=" --async_io=1 --use_direct_reads=1" make crash_test -j ``` - Ran db_bench on previous RocksDB version before any async_io implementation (as there have many changes in different PRs in this area) https://github.com/facebook/rocksdb/pull/10197#issuecomment-1160781563. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10197 Reviewed By: anand1976 Differential Revision: D37255646 Pulled By: akankshamahajan15 fbshipit-source-id: fec61ae15bf4d625f79dea56e4f86e0e307ba920	3 years ago
Mark Callaghan	177b2fa341	Set the value for --version, add --build_info (#10275 ) Summary: ./db_bench --version db_bench version 7.5.0 ./db_bench --build_info (RocksDB) 7.5.0 rocksdb_build_date: 2022-06-29 09:58:04 rocksdb_build_git_sha: `d96febeeaa` rocksdb_build_git_tag: print_version_githash Pull Request resolved: https://github.com/facebook/rocksdb/pull/10275 Test Plan: run it Reviewed By: ajkr Differential Revision: D37524720 Pulled By: mdcallag fbshipit-source-id: 0f6c819dbadf7b033a4a3ba2941992bb76b4ff99	3 years ago
Changyu Bi	f9cfc6a808	Updated NewDataBlockIterator to not fetch compression dict for non-da… (#10310 ) Summary: …ta blocks During MyShadow testing, ajkr helped me find out that with partitioned index and dictionary compression enabled, `PartitionedIndexIterator::InitPartitionedIndexBlock()` spent considerable amount of time (1-2% CPU) on fetching uncompression dictionary. Fetching uncompression dict was not needed since the index blocks were not compressed (and even if they were, they use empty dictionary). This should only affect use cases with partitioned index, dictionary compression and without uncompression dictionary pinned. This PR updates NewDataBlockIterator to not fetch uncompression dictionary when it is not for data blocks. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10310 Test Plan: 1. `make check` 2. Perf benchmark: 1.5% (143950 -> 146176) improvement in op/sec for partitioned index + dict compression benchmark. For default config without partitioned index and without dict compression, there is no regression in readrandom perf from multiple runs of db_bench. ``` # Set up for partitioned index with dictionary compression TEST_TMPDIR=/dev/shm ./db_bench_main -benchmarks=filluniquerandom,compact -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -partition_index=true -compression_max_dict_bytes=16384 -compression_zstd_max_train_bytes=1638400 # Pre PR TEST_TMPDIR=/dev/shm ./db_bench_main -use_existing_db=true -benchmarks=readrandom[-X50] -partition_index=true readrandom [AVG 50 runs] : 143950 (± 1108) ops/sec; 15.9 (± 0.1) MB/sec readrandom [MEDIAN 50 runs] : 144406 ops/sec; 16.0 MB/sec # Post PR TEST_TMPDIR=/dev/shm ./db_bench_opt -use_existing_db=true -benchmarks=readrandom[-X50] -partition_index=true readrandom [AVG 50 runs] : 146176 (± 1121) ops/sec; 16.2 (± 0.1) MB/sec readrandom [MEDIAN 50 runs] : 146014 ops/sec; 16.2 MB/sec # Set up for no partitioned index and no dictionary compression TEST_TMPDIR=/dev/shm/baseline ./db_bench_main -benchmarks=filluniquerandom,compact -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false # Pre PR TEST_TMPDIR=/dev/shm/baseline/ ./db_bench_main --use_existing_db=true "--benchmarks=readrandom[-X50]" readrandom [AVG 50 runs] : 158546 (± 1000) ops/sec; 17.5 (± 0.1) MB/sec readrandom [MEDIAN 50 runs] : 158280 ops/sec; 17.5 MB/sec # Post PR TEST_TMPDIR=/dev/shm/baseline/ ./db_bench_opt --use_existing_db=true "--benchmarks=readrandom[-X50]" readrandom [AVG 50 runs] : 161061 (± 1520) ops/sec; 17.8 (± 0.2) MB/sec readrandom [MEDIAN 50 runs] : 161596 ops/sec; 17.9 MB/sec ``` Reviewed By: ajkr Differential Revision: D37631358 Pulled By: cbi42 fbshipit-source-id: 6ca2665e270e63871968e061ba4a99d3136785d9	3 years ago
Changyu Bi	0ff7713112	Handoff checksum during WAL replay (#10212 ) Summary: Added checksum protection for write batch content read from WAL to when per key-value checksum is computed on the write batch. This gives full coverage on write batch integrity of WAL replay to memtable. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10212 Test Plan: - Added unit test and the existing tests (replay code path covers the change in this PR): `make -j32 check` - Stress test: ran `db_stress` for 30min. - Perf regression: ``` # setup TEST_TMPDIR=/dev/shm/100MB_WAL_DB/ ./db_bench -benchmarks=fillrandom -write_buffer_size=1048576000 # benchmark db open time TEST_TMPDIR=/dev/shm/100MB_WAL_DB/ /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=overwrite -write_buffer_size=1048576000 -writes=1 -report_open_timing=true For 20 runs, pre-PR avg: 3734.31ms, post-PR avg: 3790.06 ms (~1.5% regression). Pre-PR OpenDb: 3714.36 milliseconds OpenDb: 3622.71 milliseconds OpenDb: 3591.17 milliseconds OpenDb: 3674.7 milliseconds OpenDb: 3615.79 milliseconds OpenDb: 3982.83 milliseconds OpenDb: 3650.6 milliseconds OpenDb: 3809.26 milliseconds OpenDb: 3576.44 milliseconds OpenDb: 3638.12 milliseconds OpenDb: 3845.68 milliseconds OpenDb: 3677.32 milliseconds OpenDb: 3659.64 milliseconds OpenDb: 3837.55 milliseconds OpenDb: 3899.64 milliseconds OpenDb: 3840.72 milliseconds OpenDb: 3802.71 milliseconds OpenDb: 3573.27 milliseconds OpenDb: 3895.76 milliseconds OpenDb: 3778.02 milliseconds Post-PR: OpenDb: 3880.46 milliseconds OpenDb: 3709.02 milliseconds OpenDb: 3954.67 milliseconds OpenDb: 3955.64 milliseconds OpenDb: 3958.64 milliseconds OpenDb: 3631.28 milliseconds OpenDb: 3721 milliseconds OpenDb: 3729.89 milliseconds OpenDb: 3730.55 milliseconds OpenDb: 3966.32 milliseconds OpenDb: 3685.54 milliseconds OpenDb: 3573.17 milliseconds OpenDb: 3703.75 milliseconds OpenDb: 3873.62 milliseconds OpenDb: 3704.4 milliseconds OpenDb: 3820.98 milliseconds OpenDb: 3721.62 milliseconds OpenDb: 3770.86 milliseconds OpenDb: 3949.78 milliseconds OpenDb: 3760.07 milliseconds ``` Reviewed By: ajkr Differential Revision: D37302092 Pulled By: cbi42 fbshipit-source-id: 7346e625f453ce4c0e5d708776cd1fb2af6b068b	3 years ago
Yanqin Jin	caced09e79	Expand stress test coverage for user-defined timestamp (#10280 ) Summary: Before this PR, we call `now()` to get the wall time before performing point-lookup and range scans when user-defined timestamp is enabled. With this PR, we expand the coverage to: - read with an older timestamp which is larger then the wall time when the process starts but potentially smaller than now() - add coverage for `ReadOptions::iter_start_ts != nullptr` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10280 Test Plan: ```bash make check ``` Also, ```bash TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_ts ``` So far, we have had four successful runs of the above In addition, ```bash TEST_TMPDIR=/dev/shm/rocksdb make crash_test ``` Succeeded twice showing no regression. Reviewed By: ltamasi Differential Revision: D37539805 Pulled By: riversand963 fbshipit-source-id: f2d9887ad95245945ce17a014d55bb93f00e1cb5	3 years ago
Mark Callaghan	9eced1a344	Add the git hash and full RocksDB version to report.tsv (#10277 ) Summary: Previously the version was displayed as $major.$minor This changes it to $major.$minor.$path This also adds the git hash for the time from which RocksDB was built to the end of report.tsv. I confirmed that benchmark_log_tool.py still parses it and that the people who consume/graph these results are OK with it. Example output: ops_sec mb_sec lsm_sz blob_sz c_wgb w_amp c_mbps c_wsecs c_csecs b_rgb b_wgb usec_op p50 p99 p99.9 p99.99 pmax uptime stall% Nstall u_cpu s_cpu rss test date version job_id githash 609488 244.1 1GB 0.0GB, 1.4 0.7 93.3 39 38 0 0 1.6 1.0 4 15 26 5365 15 0.0 0 0.1 0.0 0.5 fillseq.wal_disabled.v400 2022-06-29T13:36:05 7.5.0 `6115254416` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10277 Test Plan: Run it Reviewed By: jay-zhuang Differential Revision: D37532418 Pulled By: mdcallag fbshipit-source-id: 55e472640d51265819b228d3373c9fa9b62b660d	3 years ago
sdong	a9565ccb26	Try to trivial move more than one files (#10190 ) Summary: In leveled compaction, try to trivial move more than one files if possible, up to 4 files or max_compaction_bytes. This is to allow higher write throughput for some use cases where data is loaded in sequential order, where appying compaction results is the bottleneck. When pick up a file to compact and it doesn't have overlapping files in the next level, try to expand to the next file if there is still no overlapping. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10190 Test Plan: Add some unit tests. For performance, Try to run ./db_bench_multi_move --benchmarks=fillseq --compression_type=lz4 --write_buffer_size=5000000 --num=100000000 --value_size=1000 -level_compaction_dynamic_level_bytes Together with https://github.com/facebook/rocksdb/pull/10188 , stalling will be eliminated in this benchmark. Reviewed By: jay-zhuang Differential Revision: D37230647 fbshipit-source-id: 42b260f545c46abc5d90335ac2bbfcd09602b549	3 years ago
Yanqin Jin	d6b9c4ae26	Update code comment and logging for secondary instance (#10260 ) Summary: Before this PR, it is required that application open RocksDB secondary instance with `max_open_files = -1`. This is a hacky workaround that prevents IOErrors on the seconary instance during point-lookup or range scan caused by primary instance deleting the table files. This is not necessary if the application can coordinate the primary and secondaries so that primary does not delete files that are still being used by the secondaries. Or users can provide a custom Env/FS implementation that deletes the files only after all primary and secondary instances indicate files are obsolete and deleted. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10260 Test Plan: make check Reviewed By: jay-zhuang Differential Revision: D37462633 Pulled By: riversand963 fbshipit-source-id: 9c2fc939f49663efa61e3d60c8f1e01d64b9d72c	3 years ago
yite.gu	a9117a3490	BackupEngine: we can return immediately if GetFileSize failed (#10176 ) Summary: In some case, GetFileSize would be failure in copy_file_cb. If failure, we can return immediately, the subsequent code is meaningless, and add a log info let user know that problem happen here. Singed-off-by: Yite Gu <ess_gyt@qq.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/10176 Reviewed By: cbi42 Differential Revision: D37510888 Pulled By: ajkr fbshipit-source-id: 044ad8c45852fd19b8cd564b11f65d40c39e296f	3 years ago
Guido Tagliavini Ponce	54f678cd86	Fix CalcHashBits (#10295 ) Summary: We fix two bugs in CalcHashBits. The first one is an off-by-one error: the desired number of table slots is the real number ``capacity / (kLoadFactor * handle_charge)``, which should not be rounded down. The second one is that we should disallow inputs that set the element charge to 0, namely ``estimated_value_size == 0 && metadata_charge_policy == kDontChargeCacheMetadata``. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10295 Test Plan: CalcHashBits is tested by CalcHashBitsTest (in lru_cache_test.cc). The test now iterates over many more inputs; it covers, in particular, the rounding error edge case. Overall, the test is now more robust. Run ``make -j24 check``. Reviewed By: pdillinger Differential Revision: D37573797 Pulled By: guidotag fbshipit-source-id: ea4f4439f7196ab1c1afb88f566fe92850537262	3 years ago
zczhu	e716bda010	Add FLAGS_compaction_pri into crash_test (#10255 ) Summary: Add FLAGS_compaction_pri into correctness test Pull Request resolved: https://github.com/facebook/rocksdb/pull/10255 Test Plan: run crash_test with FLAGS_compaction_pri Reviewed By: ajkr Differential Revision: D37510372 Pulled By: littlepig2013 fbshipit-source-id: 73d93a0a047d0c3993c8a512383dd6ee6acef641	3 years ago
Akanksha Mahajan	11215e0f3a	Fix bug in Logger creation if dbname and db_log_dir are on different filesystem (#10292 ) Summary: If dbname and db_log_dir are at different filesystems (one local and one remote), creation of dbname will fail because that path doesn't exist wrt to db_log_dir. This patch will ignore the error returned on creation of dbname. If they are on same filesystem, db_log_dir creation will automatically return the error in case there is any error in creation of dbname. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10292 Test Plan: Existing unit tests Reviewed By: riversand963 Differential Revision: D37567773 Pulled By: akankshamahajan15 fbshipit-source-id: 005d28c536208d4c126c8cb8e196d1d85b881100	3 years ago
sdong	4428c76181	Multi-File Trivial Move in L0->L1 (#10188 ) Summary: In leveled compaction, L0->L1 trivial move will allow more than one file to be moved in one compaction. This would allow L0 files to be moved down faster when data is loaded in sequential order, making slowdown or stop condition harder to hit. Also seek L0->L1 trivial move when only some files qualify. 1. We always try to find L0->L1 trivial move from the oldest files. Keep including newer files, until adding a new file won't trigger a trivial move 2. Modify the trivial move condition so that this compaction would be tagged as trivial move. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10188 Test Plan: See throughput improvements with db_bench with fast fillseq benchmark and small L0 files: ./db_bench_l0_move --benchmarks=fillseq --compression_type=lz4 --write_buffer_size=5000000 --num=100000000 --value_size=1000 -level_compaction_dynamic_level_bytes The throughput improved by about 50%. Stalling still happens though. Reviewed By: jay-zhuang Differential Revision: D37224743 fbshipit-source-id: 8958d97f22e12bdfc14d2e85930f6fa0070e9659	3 years ago
zczhu	4f51101d31	Remove compact cursor when split sub-compactions (#10289 ) Summary: In round-robin compaction priority, when splitting the compaction into sub-compactions, the earlier implementation takes into account the compact cursor to have full use of available sub-compactions. But this may result in unbalanced sub-compactions, so we remove this here. The removal does not affect the cursor-based splitting mechanism within a sub-compaction, and thus the output files are still ensured to be split according to the cursor. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10289 Reviewed By: ajkr Differential Revision: D37559091 Pulled By: littlepig2013 fbshipit-source-id: b8b45b99f63b09cf873f7f049bcb4ab13871fffc	3 years ago
Mark Callaghan	720ab355f9	Add undefok for BlobDB options not supported prior to 7.5 (#10276 ) Summary: This adds --undefok to support use of this script with BlobDB for db_bench versions prior to 7.5 when the options land in a release. While there is a limit to how far back this script can go WRT backwards compatiblity, this is an easy change to support early 7.x releases. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10276 Test Plan: Run it with versions of db_bench that do not and then do support these options Reviewed By: gangliao Differential Revision: D37529299 Pulled By: mdcallag fbshipit-source-id: 7bb1feec5c68760e6d64792c585bfbde4f5e52d8	3 years ago
sdong	b397dcd390	Change The Way Level Target And Compaction Score Are Calculated (#10057 ) Summary: The current level targets for dynamical leveling has a problem: the target level size will dramatically change after a L0->L1 compaction. When there are many L0 bytes, lower level compactions are delayed, but they will be resumed after the L0->L1 compaction finishes, so the expected write amplification benefits might not be realized. The proposal here is to revert the level targetting size, but instead relying on adjusting score for each level to prioritize levels that need to compact most. Basic idea: (1) target level size isn't adjusted, but score is adjusted. The reasoning is that with parallel compactions, holding compactions from happening might not be desirable, but we would like the compactions are scheduled from the level we feel most needed. For example, if we have a extra-large L2, we would like all compactions are scheduled for L2->L3 compactions, rather than L4->L5. This gets complicated when a large L0->L1 compaction is going on. Should we compact L2->L3 or L4->L5. So the proposal for that is: (2) the score is calculated by actual level size / (target size + estimated upper bytes coming down). The reasoning is that if we have a large amount of pending L0/L1 bytes coming down, compacting L2->L3 might be more expensive, as when the L0 bytes are compacted down to L2, the actual L2->L3 fanout would change dramatically. On the other hand, when the amount of bytes coming down to L5, the impacts to L5->L6 fanout are much less. So when calculating target score, we can adjust it by adding estimated downward bytes to the target level size. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10057 Test Plan: Repurpose tests VersionStorageInfoTest.MaxBytesForLevelDynamicWithLargeL0_* tests to cover this scenario. Reviewed By: ajkr Differential Revision: D37539742 fbshipit-source-id: 9c154cbfe92023f918cf5d80875d8776ad4831a4	3 years ago
Gang Liao	056e08d6c4	Enable blob caching for MultiGetBlob in RocksDB (#10272 ) Summary: - [x] Enabled blob caching for MultiGetBlob in RocksDB - [x] Refactored MultiGetBlob logic and interface in RocksDB - [x] Cleaned up Version::MultiGetBlob() and moved 'blob'-related code snippets into BlobSource - [x] Add End-to-end test cases in db_blob_basic_test and also add unit tests in blob_source_test This task is a part of https://github.com/facebook/rocksdb/issues/10156 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10272 Reviewed By: ltamasi Differential Revision: D37558112 Pulled By: gangliao fbshipit-source-id: a73a6a94ffdee0024d5b2a39e6d1c1a7d38664db	3 years ago
Andrew Kryczka	20754b3654	include compaction cursors in VersionEdit debug string (#10288 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10288 Test Plan: try it out - ``` $ ldb manifest_dump --db=/dev/shm/rocksdb.0uWV/rocksdb_crashtest_whitebox/ --hex --verbose \| grep CompactCursor \| head -3 CompactCursor: 1 '00000000000011D9000000000000012B0000000000000266' seq:0, type:1 CompactCursor: 1 '0000000000001F35000000000000012B0000000000000022' seq:0, type:1 CompactCursor: 2 '00000000000011D9000000000000012B0000000000000266' seq:0, type:1 ``` Reviewed By: littlepig2013 Differential Revision: D37557177 Pulled By: ajkr fbshipit-source-id: 7b76b857d9e7a9f3d53398a61bb1d4b78873b91e	3 years ago
Yueh-Hsuan Chiang	17a6f7faaf	Add load_latest_options() to C api (#10152 ) Summary: Add load_latest_options() to C api. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10152 Test Plan: Extend the existing c_test by reopening db using the latest options file at different parts of the test. Reviewed By: akankshamahajan15 Differential Revision: D37305225 Pulled By: ajkr fbshipit-source-id: 8b3bab73f56fa6fcbdba45aae393145d007b3962	3 years ago
Yanqin Jin	b87c355772	Fix assertion error with read_opts.iter_start_ts (#10279 ) Summary: If the internal iterator is not valid, `SeekToLast` with iter_start_ts should have `valid_` is false without assertion failure. Test plan make check Pull Request resolved: https://github.com/facebook/rocksdb/pull/10279 Reviewed By: ltamasi Differential Revision: D37539393 Pulled By: riversand963 fbshipit-source-id: 8e94057838f8a05144fad5768f4d62f1893ec315	3 years ago
Guido Tagliavini Ponce	57a0e2f304	Clock cache (#10273 ) Summary: This is the initial step in the development of a lock-free clock cache. This PR includes the base hash table design (which we mostly ported over from FastLRUCache) and the clock eviction algorithm. Importantly, it's still _not_ lock-free---all operations use a shard lock. Besides the locking, there are other features left as future work: - Remove keys from the handles. Instead, use 128-bit bijective hashes of them for handle comparisons, probing (we need two 32-bit hashes of the key for double hashing) and sharding (we need one 6-bit hash). - Remove the clock_usage_ field, which is updated on every lookup. Even if it were atomically updated, it could cause memory invalidations across cores. - Middle insertions into the clock list. - A test that exercises the clock eviction policy. - Update the Java API of ClockCache and Java calls to C++. Along the way, we improved the code and comments quality of FastLRUCache. These changes are relatively minor. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10273 Test Plan: ``make -j24 check`` Reviewed By: pdillinger Differential Revision: D37522461 Pulled By: guidotag fbshipit-source-id: 3d70b737dbb70dcf662f00cef8c609750f083943	3 years ago
Johnny Shaw	c2dc4c0c52	Fix GetWindowsErrSz nullptr bug (#10282 ) Summary: `GetWindowsErrSz` may assign a `nullptr` to `std::string` in the event it cannot format the error code to a string. This will result in a crash when `std::string` attempts to calculate the length from `nullptr`. The change here checks the output from `FormatMessageA` and only assigns to the otuput `std::string` if it is not null. Additionally, the call to free the buffer is only made if a non-null value is returned from `FormatMessageA`. In the event `FormatMessageA` does not output a string, an empty string is returned instead. Fixes https://github.com/facebook/rocksdb/issues/10274 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10282 Reviewed By: riversand963 Differential Revision: D37542143 Pulled By: ajkr fbshipit-source-id: c21f5119ddb451f76960acec94639d0f538052f2	3 years ago
leipeng	490fcac078	WriteBatch reorder fields to reduce padding (#10266 ) Summary: this reorder reduces sizeof(WriteBatch) by 16 bytes Pull Request resolved: https://github.com/facebook/rocksdb/pull/10266 Reviewed By: akankshamahajan15 Differential Revision: D37505201 Pulled By: ajkr fbshipit-source-id: 6cb6c3735073fcb63921f822d5e15670fecb1c26	3 years ago
sdong	6115254416	Fix A Bug Where Concurrent Compactions Cause Further Slowing Down (#10270 ) Summary: Currently, when installing a new super version, when stalling condition triggers, we compare estimated compaction bytes to previously, and if the new value is larger or equal to the previous one, we reduce the slowdown write rate. However, if concurrent compactions happen, the same value might be used. The result is that, although some compactions reduce estimated compaction bytes, we treat them as a signal for further slowing down. In some cases, it causes slowdown rate drops all the way to the minimum, far lower than needed. Fix the bug by not triggering a re-calculation if a new super version doesn't have Version or a memtable change. With this fix, number of compaction finishes are still undercounted in this algorithm, but it is still better than the current bug where they are negatively counted. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10270 Test Plan: Run a benchmark where the slowdown rate is dropped to minimal unnessarily and see it is back to a normal value. Reviewed By: ajkr Differential Revision: D37497327 fbshipit-source-id: 9bca961cc38fed965c3af0fa6c9ca0efaa7637c4	3 years ago
Edvard Davtyan	12bfd519de	Expose LRU cache num_shard_bits paramater in C api (#10222 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10222 Reviewed By: cbi42 Differential Revision: D37358171 Pulled By: ajkr fbshipit-source-id: e86285fdceaec943415ee9d482090009b00cbc95	3 years ago
Mark Callaghan	28f2d3cca6	Benchmark fix write amplification computation (#10236 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10236 Reviewed By: ajkr Differential Revision: D37489898 Pulled By: mdcallag fbshipit-source-id: 4b4565973b1f2c47342b4d1b857c8f89e91da145	3 years ago
Yanqin Jin	b6cfda1283	Support `iter_start_ts` for backward iteration (#10200 ) Summary: Resolves https://github.com/facebook/rocksdb/issues/9761 With this PR, applications can create an iterator with the following ```cpp ReadOptions read_opts; read_opts.timestamp = &ts_ub; read_opts.iter_start_ts = &ts_lb; auto* it = db->NewIterator(read_opts); it->SeekToLast(); // or it->SeekForPrev("foo"); it->Prev(); ... ``` The application can access different versions of the same user key via `key()`, `value()`, and `timestamp()`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10200 Test Plan: make check Reviewed By: ltamasi Differential Revision: D37258074 Pulled By: riversand963 fbshipit-source-id: 3f0b866ade50dcff7ef60d506397a9dd6ec91565	3 years ago
Peter Dillinger	d96febeeaa	Update/clarify required properties for prefix extractors (#10245 ) Summary: Most of the properties listed as required for prefix extractors are not really required but offer some conveniences. This updates API comments to clarify actual requirements, and adds tests to demonstrate how previously presumed requirements can be safely violated. This might seem like a useless exercise, but this relaxing of requirements would be needed if we generalize prefixes to group keys not just at the byte level but also based on bits or arbitrary value ranges. For applications without a "natural" prefix size, having only byte-level granularity often means one prefix size to the next differs in magnitude by a factor of 256. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10245 Test Plan: Tests added, also covering missing Iterator cases from https://github.com/facebook/rocksdb/issues/10244 Reviewed By: bjlemaire Differential Revision: D37371559 Pulled By: pdillinger fbshipit-source-id: ab2dd719992eea7656e9042cf8542393e02fa244	3 years ago
Andrew Kryczka	ca81b80d83	Deflake RateLimiting/BackupEngineRateLimitingTestWithParam (#10271 ) Summary: We saw flakes with the following failure: ``` [ RUN ] RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/1 utilities/backup/backup_engine_test.cc:2667: Failure Expected: (restore_time) > (0.8 * rate_limited_restore_time), actual: 48269 vs 60470.4 terminate called after throwing an instance of 'testing::internal::GoogleTestFailureException' what(): utilities/backup/backup_engine_test.cc:2667: Failure Expected: (restore_time) > (0.8 * rate_limited_restore_time), actual: 48269 vs 60470.4 Received signal 6 (Aborted) t/run-backup_engine_test-RateLimiting-BackupEngineRateLimitingTestWithParam.RateLimiting-1: line 4: 1032887 Aborted (core dumped) TEST_TMPDIR=$d ./backup_engine_test --gtest_filter=RateLimiting/BackupEngineRateLimitingTestWithParam.RateLimiting/1 ``` Investigation revealed we forgot to use the mock time `SystemClock` for restore rate limiting. Then the test used wall clock time, which made the execution of "GenericRateLimiter::Request:PostTimedWait" non-deterministic as wall clock time might have advanced enough that waiting was not needed. This PR changes restore rate limiting to use mock time, which guarantees we always execute "GenericRateLimiter::Request:PostTimedWait". Then the assertions that rely on times recorded inside that callback should be robust. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10271 Test Plan: Applied the following patch which guaranteed repro before the fix. Verified the test passes after this PR even with that patch applied. ``` diff --git a/util/rate_limiter.cc b/util/rate_limiter.cc index f369e3220..6b3ed82fa 100644 --- a/util/rate_limiter.cc +++ b/util/rate_limiter.cc @@ -158,6 +158,7 @@ void GenericRateLimiter::SetBytesPerSecond(int64_t bytes_per_second) { void GenericRateLimiter::Request(int64_t bytes, const Env::IOPriority pri, Statistics* stats) { + usleep(100000); assert(bytes <= refill_bytes_per_period_.load(std::memory_order_relaxed)); bytes = std::max(static_cast<int64_t>(0), bytes); TEST_SYNC_POINT("GenericRateLimiter::Request"); ``` Reviewed By: hx235 Differential Revision: D37499848 Pulled By: ajkr fbshipit-source-id: fd790d5a192996be8ba13b656751ccc7d8cb8f6e	3 years ago
Gang Liao	d7ebb58cb5	Add blob cache tickers, perf context statistics, and DB properties (#10203 ) Summary: In order to be able to monitor the performance of the new blob cache, we made the follow changes: - Add blob cache hit/miss/insertion tickers (see https://github.com/facebook/rocksdb/wiki/Statistics) - Extend the perf context similarly (see https://github.com/facebook/rocksdb/wiki/Perf-Context-and-IO-Stats-Context) - Implement new DB properties (see e.g. https://github.com/facebook/rocksdb/blob/main/include/rocksdb/db.h#L1042-L1051) that expose the capacity and current usage of the blob cache. This PR is a part of https://github.com/facebook/rocksdb/issues/10156 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10203 Reviewed By: ltamasi Differential Revision: D37478658 Pulled By: gangliao fbshipit-source-id: d8ee3f41d47315ef725e4551226330b4b6832e40	3 years ago
Guido Tagliavini Ponce	c6055cba30	Calculate table size of FastLRUCache more accurately (#10235 ) Summary: Calculate the required size of the hash table in FastLRUCache more accurately. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10235 Test Plan: ``make -j24 check`` Reviewed By: gitbw95 Differential Revision: D37460546 Pulled By: guidotag fbshipit-source-id: 7945128d6f002832f8ed922ef0151919f4350854	3 years ago
Gang Liao	a1eb02f089	Change the semantics of bytes_read in GetBlob/MultiGetBlob for consistency (#10248 ) Summary: The `bytes_read` returned by the current BlobSource interface is ambiguous. The uncompressed blob size is returned if the cache hits. The size of the blob read from disk, presumably the compressed version, is returned if the cache misses. Two differing semantics might cause ambiguity and consistency issues. For example, this inconsistency causes the assertion failure (T124246362 and its hot fix is https://github.com/facebook/rocksdb/issues/10249). This goal is to require that the value of `byte read` always be an on-disk blob record size. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10248 Reviewed By: ltamasi Differential Revision: D37470292 Pulled By: gangliao fbshipit-source-id: fbca521b2791d3674dbf2484cea5fcae2fdd94d2	3 years ago
Levi Tamasi	0d1e0722ef	Fix in-place updates for value types other than kTypeValue (#10254 ) Summary: The patch fixes a couple of issues related to in-place updates: 1) the value type was not passed from `MemTableInserter::PutCFImpl` to `MemTable::Update` and 2) `MemTable::UpdateCallback` was called for any value type (with the callee's logic assuming `kTypeValue`) even though the callback mechanism is only safe for plain values. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10254 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D37463644 Pulled By: ltamasi fbshipit-source-id: 33802477dac0691681f416ae84c4d9742c6fe41a	3 years ago

... 14 15 16 17 18 ...

12005 Commits (bb8fcc00448cc841395663b6a7fc7ed571e86d0f) All Branches Search

12005 Commits (bb8fcc00448cc841395663b6a7fc7ed571e86d0f)

All Branches