rocksdb

Commit Graph

Author	SHA1	Message	Date
Bo Wang	87b82f28a1	Split cache to minimize internal fragmentation (#10287 ) Summary: ### Summary: To minimize the internal fragmentation caused by the variable size of the compressed blocks, the original block is split according to the jemalloc bin size in `Insert()` and then merged back in `Lookup()`. Based on the analysis of the results of the following tests, from the overall internal fragmentation perspective, this PR does mitigate the internal fragmentation issue. _Do more myshadow tests with the latest commit. I finished several myshadow AB Testing and the results are promising. For the config of 4GB primary cache and 3GB secondary cache, Jemalloc resident stats show consistently ~0.15GB memory saving; the allocated and active stats show similar memory savings. The CPU usage is almost the same before and after this PR._ To evaluate the issue of memory fragmentation and the benefits of this PR, I conducted two sets of local tests as follows. T1 Keys: 16 bytes each (+ 0 bytes user-defined timestamp) Values: 100 bytes each (50 bytes after compression) Entries: 90000000 RawSize: 9956.4 MB (estimated) FileSize: 5664.8 MB (estimated) \| Test Name \| Primary Cache Size (MB) \| Compressed Secondary Cache Size (MB) \| \| - \| - \| - \| \| T1_3 \| 4000 \| 4000 \| \| T1_4 \| 2000 \| 3000 \| Populate the DB: ./db_bench --benchmarks=fillrandom --num=90000000 -db=/mem_fragmentation/db_bench_1 Overwrite it to a stable state: ./db_bench --benchmarks=overwrite --num=90000000 -use_existing_db -db=/mem_fragmentation/db_bench_1 Run read tests with different cache settings: T1_3: MALLOC_CONF="prof:true,prof_stats:true" ../rocksdb/db_bench --benchmarks=seekrandom --threads=16 --num=90000000 -use_existing_db --benchmark_write_rate_limit=52000000 -use_direct_reads --cache_size=4000000000 -compressed_secondary_cache_size=4000000000 -use_compressed_secondary_cache -db=/mem_fragmentation/db_bench_1 --print_malloc_stats=true > ~/temp/mem_frag/20220710/jemalloc_stats_json_T1_3_20220710 
-duration=1800 & T1_4: MALLOC_CONF="prof:true,prof_stats:true" ../rocksdb/db_bench --benchmarks=seekrandom --threads=16 --num=90000000 -use_existing_db --benchmark_write_rate_limit=52000000 -use_direct_reads --cache_size=2000000000 -compressed_secondary_cache_size=3000000000 -use_compressed_secondary_cache -db=/mem_fragmentation/db_bench_1 --print_malloc_stats=true > ~/temp/mem_frag/20220710/jemalloc_stats_json_T1_4_20220710 -duration=1800 & For T1_3 and T1_4, I also conducted the tests before and after this PR. The following table shows the important jemalloc stats. \| Test Name \| T1_3 \| T1_3 after mem defrag \| T1_4 \| T1_4 after mem defrag \| \| - \| - \| - \| - \| - \| \| allocated (MB) \| 8728 \| 8076 \| 5518 \| 5043 \| \| available (MB) \| 8753 \| 8092 \| 5536 \| 5051 \| \| external fragmentation rate \| 0.003 \| 0.002 \| 0.003 \| 0.0016 \| \| resident (MB) \| 8956 \| 8365 \| 5655 \| 5235 \| T2 Keys: 32 bytes each (+ 0 bytes user-defined timestamp) Values: 256 bytes each (128 bytes after compression) Entries: 40000000 RawSize: 10986.3 MB (estimated) FileSize: 6103.5 MB (estimated) \| Test Name \| Primary Cache Size (MB) \| Compressed Secondary Cache Size (MB) \| \| - \| - \| - \| \| T2_3 \| 4000 \| 4000 \| \| T2_4 \| 2000 \| 3000 \| Create DB (10GB): ./db_bench -benchmarks=fillrandom -use_direct_reads=true -num=40000000 -key_size=32 -value_size=256 -db=/mem_fragmentation/db_bench_2 Overwrite it to a stable state: ./db_bench --benchmarks=overwrite --num=40000000 -use_existing_db -key_size=32 -value_size=256 -db=/mem_fragmentation/db_bench_2 Run read tests with different cache settings: T2_3: MALLOC_CONF="prof:true,prof_stats:true" ./db_bench --benchmarks="mixgraph" -use_direct_io_for_flush_and_compaction=true -use_direct_reads=true -cache_size=4000000000 -compressed_secondary_cache_size=4000000000 -use_compressed_secondary_cache -keyrange_dist_a=14.18 -keyrange_dist_b=-2.917 -keyrange_dist_c=0.0164 -keyrange_dist_d=-0.08082 -keyrange_num=30 -value_k=0.2615 
-value_sigma=25.45 -iter_k=2.517 -iter_sigma=14.236 -mix_get_ratio=0.85 -mix_put_ratio=0.14 -mix_seek_ratio=0.01 -sine_mix_rate_interval_milliseconds=5000 -sine_a=1000 -sine_b=0.000073 -sine_d=400000 -reads=80000000 -num=40000000 -key_size=32 -value_size=256 -use_existing_db=true -db=/mem_fragmentation/db_bench_2 --print_malloc_stats=true > ~/temp/mem_frag/jemalloc_stats_T2_3 -duration=1800 & T2_4: MALLOC_CONF="prof:true,prof_stats:true" ./db_bench --benchmarks="mixgraph" -use_direct_io_for_flush_and_compaction=true -use_direct_reads=true -cache_size=2000000000 -compressed_secondary_cache_size=3000000000 -use_compressed_secondary_cache -keyrange_dist_a=14.18 -keyrange_dist_b=-2.917 -keyrange_dist_c=0.0164 -keyrange_dist_d=-0.08082 -keyrange_num=30 -value_k=0.2615 -value_sigma=25.45 -iter_k=2.517 -iter_sigma=14.236 -mix_get_ratio=0.85 -mix_put_ratio=0.14 -mix_seek_ratio=0.01 -sine_mix_rate_interval_milliseconds=5000 -sine_a=1000 -sine_b=0.000073 -sine_d=400000 -reads=80000000 -num=40000000 -key_size=32 -value_size=256 -use_existing_db=true -db=/mem_fragmentation/db_bench_2 --print_malloc_stats=true > ~/temp/mem_frag/jemalloc_stats_T2_4 -duration=1800 & For T2_3 and T2_4, I also conducted the tests before and after this PR. The following table shows the important jemalloc stats. \| Test Name \| T2_3 \| T2_3 after mem defrag \| T2_4 \| T2_4 after mem defrag \| \| - \| - \| - \| - \| - \| \| allocated (MB) \| 8425 \| 8093 \| 5426 \| 5149 \| \| available (MB) \| 8489 \| 8138 \| 5435 \| 5158 \| \| external fragmentation rate \| 0.008 \| 0.0055 \| 0.0017 \| 0.0017 \| \| resident (MB) \| 8676 \| 8392 \| 5541 \| 5321 \| Pull Request resolved: https://github.com/facebook/rocksdb/pull/10287 Test Plan: Unit tests. Reviewed By: anand1976 Differential Revision: D37743362 Pulled By: gitbw95 fbshipit-source-id: 0010c5af08addeacc5ebbc4ffe5be882fb1d38ad	3 years ago
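The split-on-insert, merge-on-lookup idea in the commit above can be sketched roughly as follows. This is a minimal illustration and not the code from the PR: the bin sizes in `kBinSizes` and the helper names `SplitIntoChunks` and `MergeChunks` are hypothetical, and the sketch only models chunk lengths, not the allocator's real behavior.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical allocator bin sizes in bytes, sorted ascending.
// These are illustrative values, not taken from the PR.
const std::vector<std::size_t> kBinSizes = {128, 256, 512, 1024, 2048, 4096};

// Greedily cut `block` into chunks whose lengths match a bin size, so each
// chunk would fill its bin exactly. A tail shorter than the smallest bin is
// kept as a final, smaller chunk.
std::vector<std::string> SplitIntoChunks(const std::string& block) {
  std::vector<std::string> chunks;
  std::size_t offset = 0;
  while (offset < block.size()) {
    const std::size_t remaining = block.size() - offset;
    // First bin strictly larger than `remaining`; the one before it (if any)
    // is the largest bin that still fits.
    auto it = std::upper_bound(kBinSizes.begin(), kBinSizes.end(), remaining);
    const std::size_t len = (it == kBinSizes.begin()) ? remaining : *(it - 1);
    chunks.emplace_back(block, offset, len);
    offset += len;
  }
  return chunks;
}

// Reassemble the original block from its chunks, as a lookup would.
std::string MergeChunks(const std::vector<std::string>& chunks) {
  std::string block;
  for (const auto& chunk : chunks) block += chunk;
  return block;
}
```

With these assumed bins, a 700-byte block splits into chunks of 512, 128, and 60 bytes, and merging them returns the original bytes.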
mpoeter	bef3127b00	Fix race in ExitAsBatchGroupLeader with pipelined writes (#9944 ) Summary: Resolves https://github.com/facebook/rocksdb/issues/9692 This PR adds a unit test that reproduces the race described in https://github.com/facebook/rocksdb/issues/9692 and a corresponding fix. The unit test does not have any assertions, because I could not find a reliable and safe way to assert that the writers list does not form a cycle. So with the old (buggy) code, the test would simply hang, while with the fix the test passes successfully. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9944 Reviewed By: pdillinger Differential Revision: D36134604 Pulled By: riversand963 fbshipit-source-id: ef636c5a79ddbef18658ab2f19ca9210a427324a	3 years ago
Peter Dillinger	27f3af5966	Fix serious FSDirectory use-after-Close bug (missing fsync) (#10460 ) Summary: TL;DR: due to a recent change, if you drop a column family, often that DB will no longer fsync after writing new SST files to remaining or new column families, which could lead to data loss on power loss. More bug detail: The intent of https://github.com/facebook/rocksdb/issues/10049 was to Close FSDirectory objects at DB::Close time rather than waiting for DB object destruction. Unfortunately, it also closes shared FSDirectory objects on DropColumnFamily (& destroy remaining handles), which can lead to use-after-Close on FSDirectory shared with remaining column families. Those "uses" are only Fsyncs (or redundant Closes). In the default Posix filesystem, an Fsync on a closed FSDirectory is a quiet no-op. Consequently (under most configurations), if you drop a column family, that DB will no longer fsync after writing new SST files to column families sharing the same directory (true under most configurations). More fix detail: Basically, this removes unnecessary Close ops on destroying ColumnFamilyData. We let `shared_ptr` take care of calling the destructor at the right time. If the intent was to require Close be called before destroying FSDirectory, that was not made clear by the author of FileSystem and was not at all enforced by https://github.com/facebook/rocksdb/issues/10049, which could have added `assert(fd_ == -1)` to `~PosixDirectory()` but did not. To keep this fix simple, we relax the unit test for https://github.com/facebook/rocksdb/issues/10049 to allow timely destruction of FSDirectory to suffice as Close (in CountedFileSystem). Added a TODO to revisit that. Also in this PR: * Added a TODO to share FSDirectory instances between DB and its column families. (Already shared among column families.) * Made DB::Close attempt to close all its open FSDirectory objects even if there is a failure in closing one. Also code clean-up around this logic. 
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10460 Test Plan: add an assert to check for use-after-Close. With that existing tests can detect the misuse. With fix, tests pass (except noted relaxing of unit test for https://github.com/facebook/rocksdb/issues/10049) Reviewed By: ajkr Differential Revision: D38357922 Pulled By: pdillinger fbshipit-source-id: d42079cadbedf0a969f03389bf586b3b4e1f9137	3 years ago
Peter Dillinger	9da97a3726	regression_test.sh: kill very old db_bench (and more) (#10441 ) Summary: If a db_bench process gets hung or runaway on a machine, that could prevent regression_test.sh from ever making progress. To fix that, regression_test.sh will now kill any db_bench process that is >12 hours old. Also made this more reliable by not using string matching (grep) to get db_bench process IDs. I also had to make some other updates to get local runs working reliably: * Fix some quoting hell and other dubious complexity with db_bench_cmd * Only save a DB for re-use when building it passes * Report failed command in more cases * Add safeguards against "rm -rf ." Pull Request resolved: https://github.com/facebook/rocksdb/pull/10441 Test Plan: manual (local and remote), with temporary changes e.g. to have a manageable age threshold etc. Reviewed By: riversand963 Differential Revision: D38285537 Pulled By: pdillinger fbshipit-source-id: 4d598876aedc38ac4bd9d8ddf32c5995d8e44db8	3 years ago
Levi Tamasi	cc8ded6152	Do not put blobs read during compaction into cache (#10457 ) Summary: During compaction, blobs are currently read using the default `ReadOptions`, which has the `fill_cache` flag set to true. Earlier, this didn't make any difference since we didn't have a blob cache; however, now we have to explicitly set this flag to false to avoid polluting the cache during compaction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10457 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D38333528 Pulled By: ltamasi fbshipit-source-id: 5b4d49a1e39543bee73c7df2aa9194fb101875e2	3 years ago
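The effect of the `fill_cache` flag described above can be illustrated with a small self-contained model. This is only a sketch under stated assumptions: `SketchReadOptions` and `SketchBlobSource` are hypothetical stand-ins written for this example, not RocksDB classes, and the real blob read path is more involved.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for the one option this sketch needs.
struct SketchReadOptions {
  bool fill_cache = true;  // mirrors the default described in the commit
};

// Hypothetical read path: a cache in front of slower backing storage.
class SketchBlobSource {
 public:
  explicit SketchBlobSource(
      std::unordered_map<std::string, std::string> storage)
      : storage_(std::move(storage)) {}

  // Returns the value for `key`; throws std::out_of_range if it is absent.
  std::string Get(const SketchReadOptions& options, const std::string& key) {
    auto hit = cache_.find(key);
    if (hit != cache_.end()) return hit->second;
    const std::string& value = storage_.at(key);
    // Only reads that opt in are allowed to populate the cache.
    if (options.fill_cache) cache_.emplace(key, value);
    return value;
  }

  std::size_t CachedEntries() const { return cache_.size(); }

 private:
  std::unordered_map<std::string, std::string> storage_;
  std::unordered_map<std::string, std::string> cache_;
};
```

A compaction-style read would pass options with `fill_cache = false`, so the values it touches never displace entries that user reads put in the cache.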
Yanqin Jin	fbfcf5cbcd	Remove unused fields from FileMetaData (temporarily) (#10443 ) Summary: FileMetaData::[min\|max]_timestamp are not currently being used or tracked by RocksDB, even when user-defined timestamp is enabled. Each of them is a std::string which can occupy 32 bytes. Remove them for now. They may be added back when we have a pressing need for them. When we do add them back, consider storing them in a more compact way, e.g. one boolean flag and a byte array of size 16. Per-file min/max timestamp bounds are available as table properties. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10443 Test Plan: make check Reviewed By: pdillinger Differential Revision: D38292275 Pulled By: riversand963 fbshipit-source-id: 841dc4e855ad8f8481c80cb020603de9607c9c94	3 years ago
sdong	cc2099803a	Use EnvLogger instead of PosixLogger (#10436 ) Summary: EnvLogger was built as a replacement for PosixLogger that supports multiple Envs. Make FileSystem use EnvLogger by default, remove the Posix FS specific implementation, and remove the PosixLogger code. Some hacky changes are made to make sure iostats are not polluted by logging, in order to pass existing unit tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10436 Test Plan: Run db_bench and watch info log files. Reviewed By: anand1976 Differential Revision: D38259855 fbshipit-source-id: 67d65874bfba7a33535b6d0dd0ed92cbbc9888b8	3 years ago
gitbw95	e1b176d274	Add CompressedSecondaryCache into stress test (#10442 ) Summary: The secondary cache is randomly disabled or enabled with CompressedSecondaryCache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10442 Test Plan: - To test that the CompressedSecondaryCache is used and the stress test runs successfully, run `make -j24 CRASH_TEST_EXT_ARGS=--duration=960 blackbox_crash_test` Reviewed By: anand1976 Differential Revision: D38290796 Pulled By: gitbw95 fbshipit-source-id: bb7027b39e0ed9c0c62835abe09e759898130ec8	3 years ago
Akanksha Mahajan	56463d443d	Provide support for subcompactions with user-defined timestamps (#10344 ) Summary: The subcompaction logic currently picks file boundaries as subcompaction boundaries. This is not compatible with user-defined timestamps because of two issues. Issue1: ReadOptions.iterate_lower_bound and ReadOptions.iterate_upper_bound contain timestamps, which results in an assertion failure because BlockBasedTableIterator expects bounds to be without timestamps. As a result of the wrong comparison, the end key is returned as user_key, resulting in an assertion failure. Issue2: Using file boundaries might result in two keys that differ only by user timestamp being processed by two different subcompactions (and thus two different CompactionIterator state machines), which in turn can cause data correctness issues. This PR provides support to re-enable subcompactions with user-defined timestamps. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10344 Test Plan: Added new unit test - Without fix for Issue1 unit test MultipleSubCompactions fails with error: ``` db_with_timestamp_compaction_test: ./db/compaction/clipping_iterator.h:247: void rocksdb::ClippingIterator::AssertBounds(): Assertion `!valid_ \|\| !end_ \|\| cmp_->Compare(key(), end_) < 0' failed. Received signal 6 (Aborted) #0 /usr/local/fbcode/platform009/lib/libc.so.6(gsignal+0x100) [0x7f8fbbbfe530] db_with_timestamp_compaction_test: ./db/compaction/clipping_iterator.h:247: void rocksdb::ClippingIterator::AssertBounds(): Assertion `!valid_ \|\| !end_ \|\| cmp_->Compare(key(), end_) < 0' failed. Aborted (core dumped) ``` Ran stress test `make crash_test_with_ts -j32` Reviewed By: riversand963 Differential Revision: D38220841 Pulled By: akankshamahajan15 fbshipit-source-id: 5d5cae2bd37fcaeba1e77fce0a69070ad4158ccb	3 years ago
anand76	54aebb2cc5	Fix cache metrics update when secondary cache is used (#10440 ) Summary: If a secondary cache is configured, it's possible that a cache lookup will get a hit in the secondary cache. In that case, the ```LRUCacheShard::Lookup``` doesn't immediately update the ```total_charge``` for the item handle if the ```wait``` parameter is false (i.e., the caller will check for completion later). However, ```BlockBasedTable::GetEntryFromCache``` assumes the handle is complete and calls ```UpdateCacheHitMetrics```, which checks the usage of the cache item and fails the assert in https://github.com/facebook/rocksdb/blob/main/cache/lru_cache.h#L237 (```assert(total_charge >= meta_charge)```). To fix this, we call ```UpdateCacheHitMetrics``` later in ```MultiGet```, after waiting for all cache lookup completions. Test plan - Run crash test with changes from https://github.com/facebook/rocksdb/issues/10160 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10440 Reviewed By: gitbw95 Differential Revision: D38283968 Pulled By: anand1976 fbshipit-source-id: 31c54ef43517726c6e5fdda81899b364241dd7e1	3 years ago
Bo Wang	1aab5b32ad	Update passing rate_limiter_priority for a PartitionedFilterBlockReader function to FS (#10438 ) Summary: Add param rate_limiter_parameter in PartitionedFilterBlockReader::GetFilterPartitionBlock . Pull Request resolved: https://github.com/facebook/rocksdb/pull/10438 Test Plan: Unit Tests. Reviewed By: anand1976 Differential Revision: D38266395 Pulled By: gitbw95 fbshipit-source-id: 3ed062a3b43d6df323371cb0d266f7fe869e9ad2	3 years ago
sdong	aec28ebae6	db_bench -use_stderr_info_logger to print timestamp (#10435 ) Summary: Right now db_bench -use_stderr_info_logger redirects RocksDB info logging to stderr, but no timestamp is printed out. Add a timestamp to the output. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10435 Test Plan: Run "db_bench -use_stderr_info_logger" Reviewed By: riversand963 Differential Revision: D38258699 fbshipit-source-id: 3fee6eb1205127b923bc6a660f86bd2742519aec	3 years ago
Peter Dillinger	15da225268	Fix regression_test.sh deleterandom duration (#10437 ) Summary: deleterandom tests are too fast to get good signal, e.g. --deletes=31250 in 0.170 seconds vs. --reads=1500000 in 288.491 seconds for readrandom. Removing the special handling (unknown motivation in `faa7eb3b99`) should suffice. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10437 Test Plan: watch continuous results Reviewed By: ltamasi Differential Revision: D38261185 Pulled By: pdillinger fbshipit-source-id: 0f1b1b19efccda5689027d36cc2f01307f36031d	3 years ago
Peter Dillinger	65036e4217	Revert "Add a blob-specific cache priority (#10309 )" (#10434 ) Summary: This reverts commit `8d178090be` because of a clear performance regression seen in internal dashboard https://fburl.com/unidash/tpz75iee Pull Request resolved: https://github.com/facebook/rocksdb/pull/10434 Reviewed By: ltamasi Differential Revision: D38256373 Pulled By: pdillinger fbshipit-source-id: 134aa00f50dd7b1bbe037c227884a351342ec44b	3 years ago
Andrew Kryczka	c7ccbb33a6	Allow manual compactions to run in parallel by default (#10317 ) Summary: This PR changes the default value of `CompactRangeOptions::exclusive_manual_compaction` from true to false so manual `CompactRange()`s can run in parallel with other compactions. I believe no artificial parallelism restriction is the intuitive behavior so feel the old default value is a trap, which I have fallen into several times, including yesterday. `CompactRangeOptions::exclusive_manual_compaction == false` has been used in both our correctness test and in production for years so should be reasonably safe. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10317 Reviewed By: jay-zhuang Differential Revision: D37659392 Pulled By: ajkr fbshipit-source-id: 504915e978bbe300b79483d064070c75e93d91e5	3 years ago
Jay Zhuang	87649d3288	Best efforts recovery to skip empty MANIFEST (#10416 ) Summary: Skip empty MANIFEST file during best_efforts_recovery. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10416 Test Plan: make failed db_stress test pass Reviewed By: riversand963 Differential Revision: D38126273 Pulled By: jay-zhuang fbshipit-source-id: 4498d322b09eaa194dd2cbf9c683d62ab54bfb01	3 years ago
Gang Liao	8d178090be	Add a blob-specific cache priority (#10309 ) Summary: RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them. This task is a part of https://github.com/facebook/rocksdb/issues/10156 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10309 Reviewed By: ltamasi Differential Revision: D38211655 Pulled By: gangliao fbshipit-source-id: 65ef33337db4d85277cc6f9782d67c421ad71dd5	3 years ago
Guido Tagliavini Ponce	d976f68977	Fix assertion failure and memory leak in ClockCache. (#10430 ) Summary: This fixes two issues: - [T127355728](https://www.internalfb.com/intern/tasks/?t=127355728): In the stress tests, when the ClockCache is operating close to full capacity and a burst of inserts are concurrently executed, every slot in the hash table may become occupied. This contradicts an assertion in the code, which is no longer valid in the lock-free setting. We are removing that assertion and handling the case of an insertion into a full table. - [T127427659](https://www.internalfb.com/intern/tasks/?t=127427659): There was a memory leak when an insertion is performed over capacity, but no handle is provided. In that case, a handle was dynamically allocated, but the pointer wasn't stored anywhere. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10430 Test Plan: - ``make -j24 check`` - ``make -j24 USE_CLANG=1 COMPILE_WITH_ASAN=1 COMPILE_WITH_UBSAN=1 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache" blackbox_crash_test_with_atomic_flush`` - ``make -j24 USE_CLANG=1 COMPILE_WITH_TSAN=1 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache" blackbox_crash_test_with_atomic_flush`` Reviewed By: pdillinger Differential Revision: D38226114 Pulled By: guidotag fbshipit-source-id: 18f6ab7e6214e11e9721d5ff289db1bf795d0008	3 years ago
Zichen Zhu	8b2d429251	Mention kRoundRobin in HISTORY.md (#10421 ) Summary: Update HISTORY.md for CompactionPri::kRoundRobin. Detailed implementation can be found in [PR10107](https://github.com/facebook/rocksdb/pull/10107), [PR10227](https://github.com/facebook/rocksdb/pull/10227), [PR10250](https://github.com/facebook/rocksdb/pull/10250), [PR10278](https://github.com/facebook/rocksdb/pull/10278), [PR10316](https://github.com/facebook/rocksdb/pull/10316), and [PR10341](https://github.com/facebook/rocksdb/pull/10341) Pull Request resolved: https://github.com/facebook/rocksdb/pull/10421 Reviewed By: ajkr Differential Revision: D38194070 Pulled By: littlepig2013 fbshipit-source-id: 4ce153dc0bf22cd865d09c5429955023dbc90f37	3 years ago
BilyZ98	8c0810de26	add trace tools flags in CMakeLists (#10404 ) Summary: It seems there are no flags in CMakeLists.txt to control the generation of trace tools, including trace_analyzer and block_cache_trace_analyzer. This PR adds that control. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10404 Reviewed By: ajkr Differential Revision: D38077673 Pulled By: jay-zhuang fbshipit-source-id: b4d83b3a3281edf34b2ef4a8715c2835e53ffc0f	3 years ago
Jay Zhuang	6a0010eb46	ldb to display public unique id and dump work with key range (#10417 ) Summary: 2 ldb command improvements: 1. `ldb manifest_dump --verbose` displays both the internal unique id and the public id, which is useful for manually checking sst_unique_id between manifest and SST; 2. `ldb dump` has `--from/to` options, but they were not working. Add support for them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10417 Test Plan: run the command locally ``` $ ldb manifest_dump --path=MANIFEST-000026 --verbose ... AddFile: 0 18 1023 'bar' seq:6, type:1 .. 'foo' seq:5, type:1 oldest_ancester_time:1658787615 file_creation_time:1658787615 file_checksum: file_checksum_func_name: Unknown unique_id(internal): {8800772265202404198,16149248642318466463} public_unique_id: F3E0A029B631D7D4-6E402DE08E771780 ``` ``` $ ldb dump --path=000036.sst --from=key000006 --to=key000009 Sst file format: block-based 'key000006' seq:2411, type:1 => value6 'key000007' seq:2412, type:1 => value7 'key000008' seq:2413, type:1 => value8 ... ``` Reviewed By: ajkr Differential Revision: D38136140 Pulled By: jay-zhuang fbshipit-source-id: 8be6eeaa07ff9f089e33011ebe90fd0b69d33bf3	3 years ago
Zichen Zhu	c945a9a664	Allow sufficient subcompactions under round-robin compaction priority (#10422 ) Summary: Allow sufficient subcompactions to be used when the number of input files is less than `max_subcompactions` under round-robin compaction priority. Test Case: Add `RoundRobinWithoutAdditionalResources` into `db_compaction_test` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10422 Reviewed By: ajkr Differential Revision: D38186545 Pulled By: littlepig2013 fbshipit-source-id: b8e5098306f1e5b9561dfafafc8300a38f7fe88e	3 years ago
Guido Tagliavini Ponce	9d7de6517c	Towards a production-quality ClockCache (#10418 ) Summary: In this PR we bring ClockCache closer to production quality. We implement the following changes: 1. Fixed a few bugs in ClockCache. 2. ClockCache now fully supports ``strict_capacity_limit == false``: When an insertion over capacity is commanded, we allocate a handle separately from the hash table. 3. ClockCache now runs on almost every test in cache_test. The only exceptions are a test where the LRU policy is required, and a test that dynamically increases the table capacity. 4. ClockCache now supports dynamically decreasing capacity via SetCapacity. (This is easy: we shrink the capacity upper bound and run the clock algorithm.) 5. Old FastLRUCache tests in lru_cache_test.cc are now also used on ClockCache. As a byproduct of 1. and 2. we are able to turn on ClockCache in the stress tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10418 Test Plan: - ``make -j24 USE_CLANG=1 COMPILE_WITH_ASAN=1 COMPILE_WITH_UBSAN=1 check`` - ``make -j24 USE_CLANG=1 COMPILE_WITH_TSAN=1 check`` - ``make -j24 USE_CLANG=1 COMPILE_WITH_ASAN=1 COMPILE_WITH_UBSAN=1 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache" blackbox_crash_test_with_atomic_flush`` - ``make -j24 USE_CLANG=1 COMPILE_WITH_TSAN=1 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache" blackbox_crash_test_with_atomic_flush`` Reviewed By: pdillinger Differential Revision: D38170673 Pulled By: guidotag fbshipit-source-id: 508987b9dc9d9d68f1a03eefac769820b680340a	3 years ago
Alan Paxton	8db8b98f98	Transaction.prepare should be public (#10412 ) Summary: The absence of a public modifier appears to be an omission. prepare() is necessary for the TM to participate as a peer in a distributed transaction. Also add basic “yes it does work in java” tests. Resolves https://github.com/facebook/rocksdb/issues/10283 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10412 Reviewed By: ajkr Differential Revision: D38135513 Pulled By: riversand963 fbshipit-source-id: ff52b96bc7218bc3bf12845dee49f5d8edf0e297	3 years ago
Jay Zhuang	3134471457	Deflake FlushStaleColumnFamilies test (#10409 ) Summary: Make the Stale Flush test more robust by explicitly checking the target CF is flushed. Currently it's flaky because the default CF may have more than 3 SSTs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10409 Test Plan: the test is more likely to fail on a resource-limited host: ``` gtest-parallel ./column_family_test --gtest_filter=FormatDef/ColumnFamilyTest.FlushStaleColumnFamilies/0 -r 1000 -w 100 ``` Reviewed By: ajkr Differential Revision: D38116383 Pulled By: jay-zhuang fbshipit-source-id: e27cc56f76f14d0936504f126104e3d87e3d0d5f	3 years ago
Jay Lee	84e9b6ee2d	full_history_ts_low should be const (#10411 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10411 Reviewed By: jay-zhuang Differential Revision: D38131644 Pulled By: riversand963 fbshipit-source-id: d241521dccff1ab8882ae0726ec368f84b7e8311	3 years ago
Changyu Bi	2fc6df37d6	Add checksum handshake for WAL fragment decompression (#10339 ) Summary: If WAL compression is enabled, WAL fragment decompression results are concatenated together in `log::Reader::ReadPhysicalRecord()`. This PR adds a checksum handshake to protect against memory corruption during the copying process. `checksum` is renamed to `record_checksum` in `ReadRecord()` to differentiate it from the `checksum_` flag that specifies whether the CRC32C checksum is verified. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10339 Test Plan: added checksum verification in log_test.cc, `make check -j32`. Reviewed By: ajkr Differential Revision: D37763734 Pulled By: cbi42 fbshipit-source-id: c4faa7c76b9ff1df35026edf31adfe4b47ae3154	3 years ago
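A checksum handshake around a concatenating copy, as described in the commit above, can be sketched like this. It is an illustration only: FNV-1a is used as a simple stand-in for whatever checksum the PR uses, and `Fnv1aUpdate` and `ConcatWithHandshake` are hypothetical names, not functions from `log::Reader`.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// 64-bit FNV-1a constants. FNV-1a is only a stand-in checksum here.
constexpr std::uint64_t kFnvOffsetBasis = 14695981039346656037ULL;
constexpr std::uint64_t kFnvPrime = 1099511628211ULL;

// Feed `data` into a running FNV-1a state and return the new state.
std::uint64_t Fnv1aUpdate(std::uint64_t state, const std::string& data) {
  for (unsigned char c : data) {
    state ^= c;
    state *= kFnvPrime;
  }
  return state;
}

// Concatenate `fragments` into `*out`. The checksum is accumulated over the
// source fragments while copying, then recomputed over the destination
// buffer. Returns true only if both agree, i.e. the copy was not corrupted.
bool ConcatWithHandshake(const std::vector<std::string>& fragments,
                         std::string* out) {
  std::uint64_t streaming = kFnvOffsetBasis;
  out->clear();
  for (const auto& fragment : fragments) {
    streaming = Fnv1aUpdate(streaming, fragment);  // checksum of source bytes
    out->append(fragment);                         // the copy being protected
  }
  return Fnv1aUpdate(kFnvOffsetBasis, *out) == streaming;
}
```

Because the stand-in checksum is computed byte by byte, the streaming value over the fragments equals the value over the concatenated buffer whenever the copy is faithful, so a mismatch signals corruption between the two passes.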
Alan Paxton	e637470f64	Run new benchmark script in branch. (#10303 ) Summary: Configure CI to run modernised benchmark script Pull Request resolved: https://github.com/facebook/rocksdb/pull/10303 Reviewed By: ramvadiv Differential Revision: D37719116 Pulled By: jay-zhuang fbshipit-source-id: 79ecb1cd0abd4d800c6906ba6673268c2adee10e	3 years ago
Peter Dillinger	01a2e20299	Account for DB ID in stress testing block cache keys (#10388 ) Summary: I recently discovered that block cache keys are slightly lower quality than previously thought, because my stress testing tool failed to simulate the effect of DB ID differences. This change updates the tool and gives us data to guide future developments. (No changes to production code here and now.) Nevertheless, the following promise still holds ``` // In fact, if our SST files are all < 4TB (see // BlockBasedTable::kMaxFileSizeStandardEncoding), then SST files generated // in a single process are guaranteed to have unique cache keys, unless/until // number session ids * max file number = 2**86 ... ``` because although different DB IDs could cause collision in file number and offset data, that would have to be using the same DB session (lower) to cause a block cache key collision, which is not possible in the same process. (A session is associated with only one DB ID.) This change fixes cache_bench -stress_cache_key to set and reset DB IDs in a parameterized way to evaluate the effect. Previous results assumed to be representative (using -sck_keep_bits=43): ``` 15 collisions after 15 x 90 days, est 90 days between (1.03763e+20 corrected) ``` or expected collision on a single machine every 104 billion billion days (see "corrected" value). 
After accounting for DB IDs, test never really changing, intermediate, and very frequently changing (using default -sck_db_count=100): ``` -sck_newdb_nreopen=1000000000: 15 collisions after 2 x 90 days, est 12 days between (1.38351e+19 corrected) -sck_newdb_nreopen=10000: 17 collisions after 2 x 90 days, est 10.5882 days between (1.22074e+19 corrected) -sck_newdb_nreopen=100: 19 collisions after 2 x 90 days, est 9.47368 days between (1.09224e+19 corrected) ``` or roughly 10x more often than previously thought (still extremely if not impossibly rare), and better than random base cache keys (with -sck_randomize), though < 10x better than random: ``` 31 collisions after 1 x 90 days, est 2.90323 days between (3.34719e+18 corrected) ``` If we simply fixed this by ignoring DB ID for cache keys, we would potentially have a shortage of entropy for some cases, such as small file numbers and offsets (e.g. many short-lived processes each using SstFileWriter to create a small file), because existing DB session IDs only provide ~103 bits of entropy. We could upgrade the entropy in DB session IDs to accommodate, but it's not known what all would be affected by changing from 20 digit session IDs to something larger. Instead, my plan is to 1) Move to block cache keys derived from SST unique IDs (so that we can derive block cache keys from manifest data without reading file on storage), and show no significant regression in expected collision rate. 2) Generate better SST unique IDs in format_version=6 (https://github.com/facebook/rocksdb/issues/9058), which should have ~100x lower expected/predicted collision rate based on simulations with this stress test: ``` ./cache_bench -stress_cache_key -sck_keep_bits=39 -sck_newdb_nreopen=100 -sck_footer_unique_id ... 
15 collisions after 19 x 90 days, est 114 days between (2.10293e+21 corrected) ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10388 Test Plan: no production changes Reviewed By: jay-zhuang Differential Revision: D37986714 Pulled By: pdillinger fbshipit-source-id: e759b2469e3365cb01c6661a69e0ab849ef4c3df	3 years ago
sdong	4e00748098	Fix a bug in hash linked list (#10401 ) Summary: In hash linked list, with a bucket of only one record, the following sequence can cause users to temporarily miss a record: Thread 1: Fetch the structure bucket x points to, which would be a Node n1 for a key, with its next pointer being null Thread 2: Insert a key to bucket x that is larger than the existing key. This will make n1->next point to a new node n2, and update bucket x to point to n1. Thread 1: sees n1->next is not null, so it thinks n1 is the header of a linked list and ignores the key of n1. Fix it by refetching the structure that bucket x points to when it sees n1->next is not null. This should work because if n1->next is not null, bucket x should already point to a linked list or skip list header. A related change is to reverse the order of testing for linked list and skip list. This is because after refetching the bucket, it might end up with a skip list, rather than a linked list. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10401 Test Plan: Run existing tests and make sure at least it doesn't regress. Reviewed By: jay-zhuang Differential Revision: D38064471 fbshipit-source-id: 142bb85e1546c803f47e3357aef3e76debccd8df	3 years ago
Guido Tagliavini Ponce	6a160e1fec	Lock-free ClockCache (#10390 ) Summary: ClockCache completely free of locks. As part of this PR we have also pushed clock algorithm functionality out of ClockCacheShard into ClockHandleTable, so that ClockCacheShard acts more as an interface and less as an actual data structure. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10390 Test Plan: - ``make -j24 check`` - ``make -j24 CRASH_TEST_EXT_ARGS="--duration=960 --cache_type=clock_cache --cache_size=1073741824 --block_size=16384" blackbox_crash_test_with_atomic_flush`` Reviewed By: pdillinger Differential Revision: D38106945 Pulled By: guidotag fbshipit-source-id: 6cbf6bd2397dc9f582809ccff5118a8a33ea6cb1	3 years ago
Zichen Zhu	8860fc902a	Support subcmpct using reserved resources for round-robin priority (#10341 ) Summary: The earlier implementation of round-robin priority can only pick one file at a time and disallows parallel compactions within the same level. In this PR, the round-robin compaction policy expands towards more input files while respecting some additional constraints, which are summarized as follows: * Constraint 1: We can only pick consecutive files - Constraint 1a: When a file is being compacted (or some input files are being compacted after expanding), we cannot choose it and have to stop choosing more files - Constraint 1b: When we reach the last file (with the largest keys), we cannot choose more files (the next file will be the first one with small keys) * Constraint 2: We should ensure the total compaction bytes (including the overlapped files from the next level) is no more than `mutable_cf_options_.max_compaction_bytes` * Constraint 3: We try our best to pick as many files as possible so that the post-compaction level size can be just less than `MaxBytesForLevel(start_level_)` * Constraint 4: If trivial move is allowed, we reuse the logic of `TryNonL0TrivialMove()` instead of expanding files with Constraint 3 More details can be found in `LevelCompactionBuilder::SetupOtherFilesWithRoundRobinExpansion()`. The above optimization accelerates the process of moving the compaction cursor, in which the write-amp can be further reduced. Because a large compaction may lead to a high write stall, we break this large compaction into several subcompactions regardless of the `max_subcompactions` limit. 
The number of subcompactions for round-robin compaction priority is determined through the following steps: * Step 1: Initialized against `max_output_file_limit`, the number of input files in the start level, and also the range size limit `ranges.size()` * Step 2: Call `AcquireSubcompactionResources()` when max subcompactions is not sufficient; we may or may not obtain the desired resources (the additional number of resources is stored in `extra_num_subcompaction_threads_reserved_`). The subcompaction limit is then changed and `num_planned_subcompactions` is updated with `GetSubcompactionLimit()` * Step 3: Call `ShrinkSubcompactionResources()` to ensure extra resources can be released (extra resources may exist for round-robin compaction when the actual number of subcompactions is less than the number of planned subcompactions) More details can be found in `CompactionJob::AcquireSubcompactionResources()`, `CompactionJob::ShrinkSubcompactionResources()`, and `CompactionJob::ReleaseSubcompactionResources()`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10341 Test Plan: Add `CompactionPriMultipleFilesRoundRobin[1-3]` unit test in `compaction_picker_test.cc` and `RoundRobinSubcompactionsAgainstResources.SubcompactionsUsingResources/[0-4]`, `RoundRobinSubcompactionsAgainstPressureToken.PressureTokenTest/[0-1]` in `db_compaction_test.cc` Reviewed By: ajkr, hx235 Differential Revision: D37792644 Pulled By: littlepig2013 fbshipit-source-id: 7fecb7c4ffd97b34bbf6e3b760b2c35a772a0657	3 years ago
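Constraints 1 through 3 from the commit message above can be sketched as a single forward scan over the files of the start level. This is only an illustration under stated assumptions, not the logic of `SetupOtherFilesWithRoundRobinExpansion()`: the `FileInfo` struct, the function name, and the choice to always admit the first file are hypothetical, and Constraint 4 (trivial move) is omitted.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified file descriptor for illustration only.
struct FileInfo {
  uint64_t size;                // bytes in this start-level file
  uint64_t next_level_overlap;  // bytes of overlapping next-level files
  bool being_compacted;
};

// Returns how many consecutive files, starting at `start`, would be picked.
std::size_t PickRoundRobinFiles(const std::vector<FileInfo>& files,
                                std::size_t start, uint64_t level_size,
                                uint64_t max_bytes_for_level,
                                uint64_t max_compaction_bytes) {
  std::size_t picked = 0;
  uint64_t picked_bytes = 0;      // bytes leaving the start level
  uint64_t compaction_bytes = 0;  // start-level bytes plus next-level overlap
  // Constraint 1b: the scan never wraps around past the last file.
  for (std::size_t i = start; i < files.size(); ++i) {
    const FileInfo& f = files[i];
    // Constraint 1a: stop at the first file that is already being compacted.
    if (f.being_compacted) break;
    // Constraint 2: keep total compaction bytes within the limit.
    // Assumption: the first file is always admitted so progress is possible.
    const uint64_t next_total =
        compaction_bytes + f.size + f.next_level_overlap;
    if (picked > 0 && next_total > max_compaction_bytes) break;
    compaction_bytes = next_total;
    picked_bytes += f.size;
    ++picked;
    // Constraint 3: stop once the remaining level size drops below target.
    if (picked_bytes >= level_size ||
        level_size - picked_bytes < max_bytes_for_level) {
      break;
    }
  }
  return picked;
}
```

The scan stops at whichever constraint triggers first, so the result is always a run of consecutive files beginning at the compaction cursor.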
sdong	252bea405e	Improve SubCompaction Partitioning (#10393 ) Summary: Unit tests still haven't been fixed. Also need to add more tests. But I ran some simple fillrandom db_bench and the partitioning feels reasonable. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10393 Test Plan: 1. Make sure existing tests pass. This should cover some basic sub compaction logic to be correct and the partitioning result is reasonable; 2. Add a new unit test to ApproximateKeyAnchors() 3. Run some db_bench with max_subcompaction = 4 and watch the compaction is indeed partitioned evenly. Reviewed By: jay-zhuang Differential Revision: D38043783 fbshipit-source-id: 085008e0f85f9b7c5abff7800307618320efb19f	3 years ago
Jay Zhuang	fcccc412d7	Remove Travis CI (#10407 ) Summary: Travis CI is deprecated and hasn't been maintained for some time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10407 Reviewed By: ajkr Differential Revision: D38078382 Pulled By: jay-zhuang fbshipit-source-id: f42057f2f41f722bdce56bf195f67a94835191fb	3 years ago
Yu Zhao 00540916	bfc737da21	fix typos in some code and comment (#10139 ) Summary: Minor issue, I just found a few typos on db_test and column_family while reading the code. And I have this PR opened to contribute. :) Pull Request resolved: https://github.com/facebook/rocksdb/pull/10139 Reviewed By: ajkr Differential Revision: D38007098 Pulled By: jay-zhuang fbshipit-source-id: 511947b32424c34348184691216640f32c410fb1	3 years ago
Andrew Kryczka	7b44724205	Fix WAL compression fragmentation test (#10402 ) Summary: Previously the "Fragmentation" test didn't cover fragmentation because the WAL data was compressible into trivial size. This PR changes it to use random data so the post-compression size is large enough to require fragmentation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10402 Reviewed By: cbi42 Differential Revision: D38065596 Pulled By: ajkr fbshipit-source-id: 0d5f89ca14d33546501a74b5d4fafbadc28a46a7	3 years ago
Jun He	5cf18c7634	Fix build error due to uninitialized read_req (#10312 ) Summary: GCC-12 has strict checks on variables, and thus the build fails when it finds read_req is not properly initialized (-Werror=maybe-uninitialized). Add a default value to fix this. Change-Id: Ib8a9085e2d613ee7b943b58a6a58e1bc351725d7 Signed-off-by: Jun He <jun.he@arm.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/10312 Reviewed By: riversand963 Differential Revision: D37656997 Pulled By: ajkr fbshipit-source-id: fe47492c913b34b3a03c04beeec9ec57831dcaff	3 years ago
LIU HU	8885b0537b	Fix underflow in FIFOCompactionPicker (#10386 ) Summary: Fix https://github.com/facebook/rocksdb/issues/10133 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10386 Reviewed By: riversand963 Differential Revision: D38067265 Pulled By: ajkr fbshipit-source-id: 3a99a98ac5d7ac37581b5b636fbfa7901563d834	3 years ago
Yanqin Jin	dd759537d0	Print perf context for all benchmarks if enabled (#10396 ) Summary: If user runs `db_bench` with `-perf_level=2` or higher, db_bench should print perf context after each of all benchmarks. Or make `-perf_level` a per-benchmark switch. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10396 Test Plan: ./db_bench -benchmarks=fillseq,readseq -perf_level=2 Reviewed By: ajkr Differential Revision: D38016324 Pulled By: riversand963 fbshipit-source-id: d83ea4abc34d40ffea394ca6abf0814bc5c0a2e0	3 years ago
dependabot[bot]	944ace8f70	Bump tzinfo from 1.2.9 to 1.2.10 in /docs (#10400 ) Summary: Bumps [tzinfo](https://github.com/tzinfo/tzinfo) from 1.2.9 to 1.2.10. Release notes (sourced from [tzinfo's releases](https://github.com/tzinfo/tzinfo/releases)), v1.2.10: - Fixed a relative path traversal bug that could cause arbitrary files to be loaded with `require` when used with `RubyDataSource`. Please refer to https://github.com/tzinfo/tzinfo/security/advisories/GHSA-5cm2-9h8c-rvfx for details. CVE-2022-31163. - Ignore the SECURITY file from Arch Linux's tzdata package (https://github.com/tzinfo/tzinfo/issues/134). [TZInfo v1.2.10 on RubyGems.org](https://rubygems.org/gems/tzinfo/versions/1.2.10) Changelog (sourced from [tzinfo's changelog](https://github.com/tzinfo/tzinfo/blob/master/CHANGES.md)): Version 1.2.10 - 19-Jul-2022, listing the same two fixes. Commits: - `0814dcd` Fix the release date. - `fd05e2a` Preparing v1.2.10. - `b98c32e` Merge branch 'fix-directory-traversal-1.2' into 1.2 - `ac3ee68` Remove unnecessary escaping of + within regex character classes. - `9d49bf9` Fix relative path loading tests. - `394c381` Remove `private_constant` for consistency and compatibility. - `5e9f990` Exclude Arch Linux's SECURITY file from the time zone index. - `17fc9e1` Workaround for 'Permission denied - NUL' errors with JRuby on Windows. - `6bd7a51` Update copyright years. - `9905ca9` Fix directory traversal in Timezone.get when using Ruby data source - Additional commits viewable in [compare view](https://github.com/tzinfo/tzinfo/compare/v1.2.9...v1.2.10) Pull Request resolved: https://github.com/facebook/rocksdb/pull/10400 Reviewed By: ajkr Differential Revision: D38064880 Pulled By: jay-zhuang fbshipit-source-id: 87854e33913ec14f119a090b2d3911d244b87af4	3 years ago
DaPorkchop_	6bebe65030	Correctly implement Create-/DropColumnFamilies for PessimisticTransactionDB (#10332 ) Summary: This overrides `CreateColumnFamilies` and `DropColumnFamilies` in `PessimisticTransactionDB` in order to add/remove the created column families to/from the lock manager. Fixes https://github.com/facebook/rocksdb/issues/10322. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10332 Reviewed By: ajkr Differential Revision: D37841079 Pulled By: riversand963 fbshipit-source-id: 854d7d9948b0089e0054a8f2875485ba44436fd2	3 years ago
Wallace	1e9bf25f61	Do not hold mutex when write keys if not necessary (#7516 ) Summary: ## Problem Summary RocksDB acquires the global mutex of the db instance every time a user calls `Write`. When RocksDB schedules a lot of compaction jobs, they compete for the mutex with the write thread, which hurts write performance. ## Problem Solution: I want to use log_write_mutex to replace the global mutex in most cases so that we do not acquire it in the write thread unless a write-stall event or a write-buffer-full event occurs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7516 Test Plan: 1. make check 2. CI 3. COMPILE_WITH_TSAN=1 make db_stress make crash_test make crash_test_with_multiops_wp_txn make crash_test_with_multiops_wc_txn make crash_test_with_atomic_flush Reviewed By: siying Differential Revision: D36908702 Pulled By: riversand963 fbshipit-source-id: 59b13881f4f5c0a58fd3ca79128a396d9cd98efe	3 years ago
Andrew Kryczka	a0c63083d3	Fix explanation of XOR usage in KV checksum blog post (#10392 ) Summary: Thanks pdillinger for reminding us that we are protected from swapping corruptions due to independent seeds (and for suggesting that approach in the first place). Pull Request resolved: https://github.com/facebook/rocksdb/pull/10392 Reviewed By: cbi42 Differential Revision: D37981819 Pulled By: ajkr fbshipit-source-id: 3ed32982ae1dbc88eb92569010f9f2e8d190c962	3 years ago
Yanqin Jin	b443d24f4d	Stop operating on DB in a stress test background thread (#10373 ) Summary: Stress test background threads do not coordinate with test worker threads for db reopen in the middle of a test run, thus accessing db obj in a stress test bg thread can race with test workers. Remove the TimestampedSnapshotThread. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10373 Test Plan: ``` ./db_stress --acquire_snapshot_one_in=0 --adaptive_readahead=0 --allow_concurrent_memtable_write=1 \ --allow_data_in_errors=True --async_io=0 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 \ --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=8 \ --block_size=16384 --bloom_bits=7.580319535285394 --bottommost_compression_type=disable \ --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache \ --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=0 --charge_filter_construction=1 \ --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kxxHash64 --clear_column_family_one_in=0 \ --compact_files_one_in=1000000 --compact_range_one_in=0 --compaction_pri=1 --compaction_ttl=0 \ --compression_max_dict_buffer_bytes=0 --compression_max_dict_bytes=0 --compression_parallel_threads=1 \ --compression_type=xpress --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 \ --continuous_verification_interval=0 --create_timestamped_snapshot_one_in=20 --data_block_index_type=0 \ --db=/dev/shm/rocksdb/ --db_write_buffer_size=0 --delpercent=5 --delrangepercent=0 --destroy_db_initially=1 \ --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=1 --enable_pipelined_write=0 \ --fail_if_options_file_error=1 --file_checksum_impl=xxh64 --flush_one_in=1000000 --format_version=2 \ --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 \ --get_sorted_wal_files_one_in=0 
--index_block_restart_interval=11 --index_type=0 --ingest_external_file_one_in=0 \ --iterpercent=0 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True \ --log2_keys_per_lock=10 --long_running_snapshots=0 --mark_for_compaction_one_file_in=10 \ --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=25000000 \ --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 \ --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.5 \ --memtable_whole_key_filtering=1 --memtablerep=skip_list --mmap_read=0 --mock_direct_io=True \ --nooverwritepercent=1 --open_files=500000 --open_metadata_write_fault_one_in=0 \ --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=20000 \ --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=2 \ --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=1 \ --prefixpercent=5 --prepopulate_block_cache=0 --progress_reports=0 --read_fault_one_in=1000 \ --readpercent=55 --recycle_log_file_num=0 --reopen=100 --ribbon_starting_level=8 \ --secondary_cache_fault_one_in=0 --secondary_cache_uri= --snapshot_hold_ops=100000 \ --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 \ --subcompactions=3 --sync=0 --sync_fault_injection=0 --target_file_size_base=2097152 \ --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=1 \ --txn_write_policy=0 --unordered_write=0 --unpartitioned_pinning=0 \ --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=1 --use_full_merge_v1=1 \ --use_merge=1 --use_multiget=0 --use_txn=1 --user_timestamp_size=0 --value_size_mult=32 \ --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 \ --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none \ --write_buffer_size=4194304 --write_dbid_to_manifest=0 
--writepercent=35 ``` make crash_test_with_txn make crash_test_with_multiops_wc_txn Reviewed By: jay-zhuang Differential Revision: D37903189 Pulled By: riversand963 fbshipit-source-id: cd1728ad7ba4ce4cf47af23c4f65dda0956744f9	3 years ago
Andrew Kryczka	e576f2ab19	Fix race conditions in GenericRateLimiter (#10374 ) Summary: Made locking strict for all accesses of `GenericRateLimiter` internal state. `SetBytesPerSecond()` was the main problem since it had no locking, while the two updates it makes need to be done as one atomic operation. The test case, "ConfigOptionsTest.ConfiguringOptionsDoesNotRevertRateLimiterBandwidth", is for the issue fixed in https://github.com/facebook/rocksdb/issues/10378, but I forgot to include the test there. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10374 Reviewed By: pdillinger Differential Revision: D37906367 Pulled By: ajkr fbshipit-source-id: ccde620d2a7f96d1401bdafd2bdb685cbefbafa5	3 years ago
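The race described in the commit above (two related updates in `SetBytesPerSecond()` that must appear as one atomic operation) can be illustrated with a small stand-alone class. This is a sketch under assumptions, not `GenericRateLimiter` itself: the class name, member names, and the refill formula are hypothetical, and overflow of the multiplication is ignored for brevity.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <utility>

// Hypothetical stand-in for a rate limiter; names are illustrative and do not
// mirror GenericRateLimiter's real members.
class SimpleRateLimiter {
 public:
  SimpleRateLimiter(int64_t bytes_per_sec, int64_t refill_period_us)
      : refill_period_us_(refill_period_us) {
    SetBytesPerSecond(bytes_per_sec);
  }

  // Both values are updated under one lock, so a reader never observes a new
  // rate paired with a stale per-period refill amount.
  void SetBytesPerSecond(int64_t bytes_per_sec) {
    std::lock_guard<std::mutex> guard(mu_);
    rate_bytes_per_sec_ = bytes_per_sec;
    refill_bytes_per_period_ = bytes_per_sec * refill_period_us_ / 1000000;
  }

  // Returns a consistent (rate, refill) pair, read under the same lock.
  std::pair<int64_t, int64_t> Snapshot() const {
    std::lock_guard<std::mutex> guard(mu_);
    return {rate_bytes_per_sec_, refill_bytes_per_period_};
  }

 private:
  mutable std::mutex mu_;
  const int64_t refill_period_us_;
  int64_t rate_bytes_per_sec_ = 0;
  int64_t refill_bytes_per_period_ = 0;
};
```

Without the lock in the setter, a concurrent reader could see the new rate together with the old refill amount, which is the kind of inconsistency strict locking rules out.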
Gang Liao	0b6bc101ba	Charge blob cache usage against the global memory limit (#10321 ) Summary: To help service owners to manage their memory budget effectively, we have been working towards counting all major memory users inside RocksDB towards a single global memory limit (see e.g. https://github.com/facebook/rocksdb/wiki/Write-Buffer-Manager#cost-memory-used-in-memtable-to-block-cache). The global limit is specified by the capacity of the block-based table's block cache, and is technically implemented by inserting dummy entries ("reservations") into the block cache. The goal of this task is to support charging the memory usage of the new blob cache against this global memory limit when the backing cache of the blob cache and the block cache are different. This PR is a part of https://github.com/facebook/rocksdb/issues/10156 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10321 Reviewed By: ltamasi Differential Revision: D37913590 Pulled By: gangliao fbshipit-source-id: eaacf23907f82dc7d18964a3f24d7039a2937a72	3 years ago
Jay Zhuang	18a61a1734	Fix seqno->time worker not scheduled with multi DB instances (#10383 ) Summary: `PeriodicWorkScheduler` is a global singleton, which was used to store the per-instance setting `record_seqno_time_cadence_`. Move that to db_impl.h, which is per-instance. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10383 Reviewed By: siying Differential Revision: D37928009 Pulled By: jay-zhuang fbshipit-source-id: e517754f4a9db98798ac04f72033d4b517f734e9	3 years ago
Changyu Bi	5b5144deb2	Per kv checksum blogpost (#10385 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10385 Test Plan: https://github.com/facebook/rocksdb/blob/main/docs/CONTRIBUTING.md Reviewed By: ajkr Differential Revision: D37944670 Pulled By: cbi42 fbshipit-source-id: 963c4d973dc748d4280b9c9d82dc31c33679f22a	3 years ago
Akanksha Mahajan	f6c4d7a576	Fix hang in MultiRead with O_DIRECT and io_uring (#10368 ) Summary: Fix a bug with O_DIRECT and io_uring at EOF: when bytes_read = 0, a wrong check added the request to the incomplete list, where it got stuck in an infinite loop because the read always returns bytes_read = 0. The bug was introduced by PR https://github.com/facebook/rocksdb/pull/10197 and that PR is not released yet in any release branch. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10368 Test Plan: Added new unit test Reviewed By: siying Differential Revision: D37885184 Pulled By: akankshamahajan15 fbshipit-source-id: 35b36a44b696d29b2f6f25301aa1b19547b4e03b	3 years ago
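The failure mode in the commit above (a request that keeps being retried although every retry returns zero bytes) can be shown with a generic retry loop. This sketch does not use io_uring or any RocksDB API; `ReadFn`, `ReadUntilDoneOrEof`, and `MakeStringReader` are hypothetical names, and the in-memory reader simply simulates short reads.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical reader callback: reads up to `n` bytes at `offset` into `out`
// and returns the number of bytes read (0 means EOF).
using ReadFn =
    std::function<std::size_t(std::size_t offset, std::size_t n, char* out)>;

// Retries partial reads until the request is satisfied or EOF is reached.
std::size_t ReadUntilDoneOrEof(const ReadFn& read, std::size_t offset,
                               std::size_t len, char* out) {
  std::size_t total = 0;
  while (total < len) {
    const std::size_t n = read(offset + total, len - total, out + total);
    // A zero-byte read means EOF: treat the request as finished. Re-queuing
    // it as "incomplete" would spin forever, since it can only return 0.
    if (n == 0) break;
    total += n;
  }
  return total;
}

// In-memory "file" that returns at most 4 bytes per call, to force retries.
ReadFn MakeStringReader(const std::string& data) {
  return [data](std::size_t offset, std::size_t n, char* out) -> std::size_t {
    if (offset >= data.size()) return 0;
    const std::size_t k =
        std::min({n, data.size() - offset, std::size_t{4}});
    std::copy_n(data.data() + offset, k, out);
    return k;
  };
}
```

Asking for 16 bytes from a 10-byte source terminates with 10 bytes read instead of looping.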
Andrew Kryczka	25cc564ff7	Make RateLimiter not Customizable (#10378 ) Summary: (PR created for informational/testing purposes only.) - Fixes lost dynamic updates to GenericRateLimiter bandwidth using `SetBytesPerSecond()` - Benefit over #10374 is eliminating race conditions with Configurable framework. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10378 Reviewed By: pdillinger Differential Revision: D37914865 fbshipit-source-id: d4f566d60ec9726d26932388c61671adf0ee0f30	3 years ago


11971 Commits (baf37a0e818dc334a0ed94f3d315155e2c138c93)
