rocksdb

Commit Graph

Author	SHA1	Message	Date
Levi Tamasi	3f57d84af4	Introduce a dedicated class to represent blob values (#10571 ) Summary: The patch introduces a new class called `BlobContents`, which represents a single uncompressed blob value. We currently use `std::string` for this purpose; `BlobContents` is somewhat smaller but the primary reason for a dedicated class is that it enables certain improvements and optimizations like eliding a copy when inserting a blob into the cache, using custom allocators, or more control over and better accounting of the memory usage of cached blobs (see https://github.com/facebook/rocksdb/issues/10484). (We plan to implement these in subsequent PRs.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/10571 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D39000965 Pulled By: ltamasi fbshipit-source-id: f296eddf9dec4fc3e11cad525b462bdf63c78f96	3 years ago
Andrew Kryczka	7ad4b38617	Ensure writes to WAL tail during `FlushWAL(true /* sync /)` will be synced (#10560 ) Summary: WAL append and switch can both happen between `FlushWAL(true / sync */)`'s sync operations and its call to `MarkLogsSynced()`. We permit this since locks need to be released for the sync operations. Such an appended/switched WAL is both inactive and incompletely synced at the time `MarkLogsSynced()` processes it. Prior to this PR, `MarkLogsSynced()` assumed all inactive WALs were fully synced and removed them from consideration for future syncs. That was wrong in the scenario described above and led to the latest append(s) never being synced. This PR changes `MarkLogsSynced()` to only remove inactive WALs from consideration for which all flushed data has been synced. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10560 Test Plan: repro unit test for the scenario described above. Without this PR, it fails on "key2" not found Reviewed By: riversand963 Differential Revision: D38957391 Pulled By: ajkr fbshipit-source-id: da77175eba97ff251a4219b227b3bb2d4843ed26	3 years ago
muthukrishnan.s	79ed4be80f	Add get_name, get_id for column family handle in C API (#10499 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10499 Reviewed By: hx235 Differential Revision: D38523859 Pulled By: ajkr fbshipit-source-id: 268bba1fcce4a3e20c51e498a79d7b476f663aea	3 years ago
Levi Tamasi	78bbdef530	Fix a typo in BlobSecondaryCacheTest (#10566 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10566 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D38989926 Pulled By: ltamasi fbshipit-source-id: 6402635fe745e4e7eb3083ef9ad9f04c0177d762	3 years ago
EdvardD	6e93d24935	Expose set_checksum function to C api (#10537 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10537 Reviewed By: hx235 Differential Revision: D38797662 Pulled By: ajkr fbshipit-source-id: a8db723c3eb9d5592cd78f8be7e442e4826686ad	3 years ago
Chen Lixiang	9593fd1c82	Fix wrong compression type and options in universal compaction picker (#10515 ) Summary: In UniversalCompactionBuilder::PickCompactionToReduceSortedRuns, we passed start_level to get compression type and options. I think that is wrong and we should use output_level instead. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10515 Reviewed By: hx235 Differential Revision: D38611335 Pulled By: ajkr fbshipit-source-id: bb860caed4b6c6bbde8f75fc50cf875a9f04723d	3 years ago
anand76	35cdd3e71e	MultiGet async IO across multiple levels (#10535 ) Summary: This PR exploits parallelism in MultiGet across levels. It applies only to the coroutine version of MultiGet. Previously, MultiGet file reads from SST files in the same level were parallelized. With this PR, MultiGet batches with keys distributed across multiple levels are read in parallel. This is accomplished by splitting the keys not present in a level (determined by bloom filtering) into a separate batch, and processing the new batch in parallel with the original batch. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10535 Test Plan: 1. Ensure existing MultiGet unit tests pass, updating them as necessary 2. New unit tests - TODO 3. Run stress test - TODO No noticeable regression (<1%) without async IO - Without PR: `multireadrandom : 7.261 micros/op 1101724 ops/sec 60.007 seconds 66110936 operations; 571.6 MB/s (8168992 of 8168992 found)` With PR: `multireadrandom : 7.305 micros/op 1095167 ops/sec 60.007 seconds 65717936 operations; 568.2 MB/s (8271992 of 8271992 found)` For a fully cached DB, but with async IO option on, no regression observed (<1%) - Without PR: `multireadrandom : 5.201 micros/op 1538027 ops/sec 60.005 seconds 92288936 operations; 797.9 MB/s (11540992 of 11540992 found) ` With PR: `multireadrandom : 5.249 micros/op 1524097 ops/sec 60.005 seconds 91452936 operations; 790.7 MB/s (11649992 of 11649992 found) ` Reviewed By: akankshamahajan15 Differential Revision: D38774009 Pulled By: anand1976 fbshipit-source-id: c955e259749f1c091590ade73105b3ee46cd0007	3 years ago
Levi Tamasi	81388b36e0	Add support for wide-column point lookups (#10540 ) Summary: The patch adds a new API `GetEntity` that can be used to perform wide-column point lookups. It also extends the `Get` code path and the `MemTable` / `MemTableList` and `Version` / `GetContext` logic accordingly so that wide-column entities can be served from both memtables and SSTs. If the result of a lookup is a wide-column entity (`kTypeWideColumnEntity`), it is passed to the application in deserialized form; if it is a plain old key-value (`kTypeValue`), it is presented as a wide-column entity with a single default (anonymous) column. (In contrast, regular `Get` returns plain old key-values as-is, and returns the value of the default column for wide-column entities, see https://github.com/facebook/rocksdb/issues/10483 .) The result of `GetEntity` is a self-contained `PinnableWideColumns` object. `PinnableWideColumns` contains a `PinnableSlice`, which either stores the underlying data in its own buffer or holds on to a cache handle. It also contains a `WideColumns` instance, which indexes the contents of the `PinnableSlice`, so applications can access the values of columns efficiently. There are several pieces of functionality which are currently not supported for wide-column entities: there is currently no `MultiGetEntity` or wide-column iterator; also, `Merge` and `GetMergeOperands` are not supported, and there is no `GetEntity` implementation for read-only and secondary instances. We plan to implement these in future PRs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10540 Test Plan: `make check` Reviewed By: akankshamahajan15 Differential Revision: D38847474 Pulled By: ltamasi fbshipit-source-id: 42311a34ccdfe88b3775e847a5e2a5296e002b5b	3 years ago
Andrew Kryczka	91166012c8	Prevent a case of WriteBufferManager flush thrashing (#6364 ) Summary: Previously, the flushes triggered by `WriteBufferManager` could affect the same CF repeatedly if it happens to get consecutive writes. Such flushes are not particularly useful for reducing memory usage since they switch nearly-empty memtables to immutable while they've just begun filling their first arena block. In fact they may not even reduce the mutable memory count if they involve replacing one mutable memtable containing one arena block with a new mutable memtable containing one arena block. Further, if such switches happen even a few times before a flush finishes, the immutable memtable limit will be reached and writes will stall. This PR adds a heuristic to not switch memtables to immutable for CFs that already have one or more immutable memtables awaiting flush. There is a memory usage regression if the user continues writing to the same CF, that DB does not have any CFs eligible for switching, flushes are not finishing, and the `WriteBufferManager` was constructed with `allow_stall=false`. Before, it would grow by switching nearly empty memtables until writes stall. Now, it would grow by filling memtables until writes stall. This feels like an acceptable behavior change because users who prefer to stall over violate the memory limit should be using `allow_stall=true`, which is unaffected by this PR. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6364 Test Plan: - Command: `rm -rf /dev/shm/dbbench/ && TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num_multi_db=8 -num_column_families=2 -write_buffer_size=4194304 -db_write_buffer_size=16777216 -compression_type=none -statistics=true -target_file_size_base=4194304 -max_bytes_for_level_base=16777216` - `rocksdb.db.write.stall` count before this PR: 175 - `rocksdb.db.write.stall` count after this PR: 0 Reviewed By: jay-zhuang Differential Revision: D20167197 Pulled By: ajkr fbshipit-source-id: 4a64064e9bc33d57c0a35f15547542d0191d0cb7	3 years ago
anand76	65814a4ae6	Fix range deletion handling in async MultiGet (#10534 ) Summary: The fix in https://github.com/facebook/rocksdb/issues/10513 was not complete w.r.t range deletion handling. It didn't handle the case where a file with a range tombstone covering a key also overlapped another key in the batch. In that case, ```mget_range``` would be non-empty. However, ```mget_range``` would only have the second key and, therefore, the first key would be skipped when iterating through the range tombstones in ```TableCache::MultiGet```. Test plan - 1. Add a unit test 2. Run stress tests Pull Request resolved: https://github.com/facebook/rocksdb/pull/10534 Reviewed By: akankshamahajan15 Differential Revision: D38773880 Pulled By: anand1976 fbshipit-source-id: dae491dbe52e18bbce5179b77b63f20771a66c00	3 years ago
Gang Liao	275cd80cdb	Add a blob-specific cache priority (#10461 ) Summary: RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them. This task is a part of https://github.com/facebook/rocksdb/issues/10156 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10461 Reviewed By: siying Differential Revision: D38672823 Pulled By: ltamasi fbshipit-source-id: 90cf7362036563d79891f47be2cc24b827482743	3 years ago
sdong	bc575c614c	Fix two extra headers (#10525 ) Summary: Fix copyright for two more extra headers to make internal tool happy. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10525 Reviewed By: jay-zhuang Differential Revision: D38661390 fbshipit-source-id: ab2d055bfd145dfe82b5bae7a6c25cc338c8de94	3 years ago
Changyu Bi	fd165c869d	Add memtable per key-value checksum (#10281 ) Summary: Append per key-value checksum to internal key. These checksums are verified on read paths including Get, Iterator and during Flush. Get and Iterator will return `Corruption` status if there is a checksum verification failure. Flush will make DB become read-only upon memtable entry checksum verification failure. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10281 Test Plan: - Added new unit test cases: `make check` - Benchmark on memtable insert ``` TEST_TMPDIR=/dev/shm/memtable_write ./db_bench -benchmarks=fillseq -disable_wal=true -max_write_buffer_number=100 -num=10000000 -min_write_buffer_number_to_merge=100 # avg over 10 runs Baseline: 1166936 ops/sec memtable 2 bytes kv checksum : 1.11674e+06 ops/sec (-4%) memtable 2 bytes kv checksum + write batch 8 bytes kv checksum: 1.08579e+06 ops/sec (-6.95%) write batch 8 bytes kv checksum: 1.17979e+06 ops/sec (+1.1%) ``` - Benchmark on only memtable read: ops/sec dropped 31% for `readseq` due to time spend on verifying checksum. ops/sec for `readrandom` dropped ~6.8%. ``` # Readseq sudo TEST_TMPDIR=/dev/shm/memtable_read ./db_bench -benchmarks=fillseq,readseq"[-X20]" -disable_wal=true -max_write_buffer_number=100 -num=10000000 -min_write_buffer_number_to_merge=100 readseq [AVG 20 runs] : 7432840 (± 212005) ops/sec; 822.3 (± 23.5) MB/sec readseq [MEDIAN 20 runs] : 7573878 ops/sec; 837.9 MB/sec With -memtable_protection_bytes_per_key=2: readseq [AVG 20 runs] : 5134607 (± 119596) ops/sec; 568.0 (± 13.2) MB/sec readseq [MEDIAN 20 runs] : 5232946 ops/sec; 578.9 MB/sec # Readrandom sudo TEST_TMPDIR=/dev/shm/memtable_read ./db_bench -benchmarks=fillrandom,readrandom"[-X10]" -disable_wal=true -max_write_buffer_number=100 -num=1000000 -min_write_buffer_number_to_merge=100 readrandom [AVG 10 runs] : 140236 (± 3938) ops/sec; 9.8 (± 0.3) MB/sec readrandom [MEDIAN 10 runs] : 140545 ops/sec; 9.8 MB/sec With -memtable_protection_bytes_per_key=2: readrandom [AVG 10 runs] : 130632 (± 2738) ops/sec; 9.1 (± 0.2) MB/sec readrandom [MEDIAN 10 runs] : 130341 ops/sec; 9.1 MB/sec ``` - Stress test: `python3 -u tools/db_crashtest.py whitebox --duration=1800` Reviewed By: ajkr Differential Revision: D37607896 Pulled By: cbi42 fbshipit-source-id: fdaefb475629d2471780d4a5f5bf81b44ee56113	3 years ago
Peter Dillinger	86a1e3e0e7	Derive cache keys from SST unique IDs (#10394 ) Summary: ... so that cache keys can be derived from DB manifest data before reading the file from storage--so that every part of the file can potentially go in a persistent cache. See updated comments in cache_key.cc for technical details. Importantly, the new cache key encoding uses some fancy but efficient math to pack data into the cache key without depending on the sizes of the various pieces. This simplifies some existing code creating cache keys, like cache warming before the file size is known. This should provide us an essentially permanent mapping between SST unique IDs and base cache keys, with the ability to "upgrade" SST unique IDs (and thus cache keys) with new SST format_versions. These cache keys are of similar, perhaps indistinguishable quality to the previous generation. Before this change (see "corrected" days between collision): ``` ./cache_bench -stress_cache_key -sck_keep_bits=43 18 collisions after 2 x 90 days, est 10 days between (1.15292e+19 corrected) ``` After this change (keep 43 bits, up through 50, to validate "trajectory" is ok on "corrected" days between collision): ``` 19 collisions after 3 x 90 days, est 14.2105 days between (1.63836e+19 corrected) 16 collisions after 5 x 90 days, est 28.125 days between (1.6213e+19 corrected) 15 collisions after 7 x 90 days, est 42 days between (1.21057e+19 corrected) 15 collisions after 17 x 90 days, est 102 days between (1.46997e+19 corrected) 15 collisions after 49 x 90 days, est 294 days between (2.11849e+19 corrected) 15 collisions after 62 x 90 days, est 372 days between (1.34027e+19 corrected) 15 collisions after 53 x 90 days, est 318 days between (5.72858e+18 corrected) 15 collisions after 309 x 90 days, est 1854 days between (1.66994e+19 corrected) ``` However, the change does modify (probably weaken) the "guaranteed unique" promise from this > SST files generated in a single process are guaranteed to have unique cache keys, unless/until number session ids * max file number = 2**86 to this (see https://github.com/facebook/rocksdb/issues/10388) > With the DB id limitation, we only have nice guaranteed unique cache keys for files generated in a single process until biggest session_id_counter and offset_in_file reach combined 64 bits I don't think this is a practical concern, though. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10394 Test Plan: unit tests updated, see simulation results above Reviewed By: jay-zhuang Differential Revision: D38667529 Pulled By: pdillinger fbshipit-source-id: 49af3fe7f47e5b61162809a78b76c769fd519fba	3 years ago
Peter Dillinger	9fa5c146d7	LOG more info on oldest snapshot and sequence numbers (#10454 ) Summary: The info LOG file does not currently give any direct information about the existence of old, live snapshots, nor how to estimate wall time from a sequence number within the scope of LOG history. This change addresses both with: * Logging smallest and largest seqnos for generated SST files, which can help associate sequence numbers with write time (based on flushes). * Logging oldest_snapshot_seqno for each compaction, which (along with that seqno info) helps us to determine how much old data might be kept around for old (leaked?) snapshots. Including the date here I thought might be excessive. I wanted to log the date and seqno of the oldest snapshot with periodic stats, but the current structure of the code doesn't really support that because `DumpDBStats` doesn't have access to the DB object. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10454 Test Plan: manual inspect LOG from `KEEP_DB=1 ./db_basic_test --gtest_filter=CompactBetweenSnapshots` Reviewed By: ajkr Differential Revision: D38326948 Pulled By: pdillinger fbshipit-source-id: 294918ffc04a419844146cd826045321b4d5c038	3 years ago
sdong	2297769b38	Fix regression issue of too large score (#10518 ) Summary: https://github.com/facebook/rocksdb/pull/10057 caused a regression bug: since the base level size is not adjusted based on L0 size anymore, L0 score might become very large. This makes compaction heavily favor L0->L1 compaction against L1->L2 compaction, and cause in some cases, data stuck in L1 without being moved down. We fix calculating a score of L0 by size(L0)/size(L1) in the case where L0 is large.. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10518 Test Plan: run db_bench against data on tmpfs and watch the behavior of data stuck in L1 goes away. Reviewed By: ajkr Differential Revision: D38603145 fbshipit-source-id: 4949e52dc28b54aacfe08417c6e6cc7e40a27225	3 years ago
sherriiiliu	4753e5a2e9	Fix wrong value passed in compaction filter in BlobDB (#10391 ) Summary: New blobdb has a bug in compaction filter, where `blob_value_` is not reset for next iterated key. This will cause blob_value_ not empty and previous value read from blob is passed into the filter function for next key, even if its value is not in blob. Fixed by reseting regardless of key type. Test Case: Add `FilterByValueLength` test case in `DBBlobCompactionTest` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10391 Reviewed By: riversand963 Differential Revision: D38629900 Pulled By: ltamasi fbshipit-source-id: 47d23ff2e5ec697958a210db9e6ceeb8b2fc49fa	3 years ago
sdong	9277569ba3	Add some missing headers (#10519 ) Summary: Some files miss headers. Also some headers are irregular. Fix them to make an internal checkup tool happy. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10519 Reviewed By: jay-zhuang Differential Revision: D38603291 fbshipit-source-id: 13b1bbd6d48f5ee15ba20da67544396de48238f1	3 years ago
Jay Zhuang	5d3aefb682	Migrate to docker for CI run (#10496 ) Summary: Moved linux builds to using docker to avoid CI instability caused by dependency installation site down. Added the `Dockerfile` which is used to build the image. The build time is also significantly reduced, because no dependencies installation and with using 2xlarge+ instance for slow build (like tsan test). Also fixed a few issues detected while building this: * `DestoryDB()` Status not checked for a few tests * nullptr might be used in `inlineskiplist.cc` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10496 Test Plan: CI Reviewed By: ajkr Differential Revision: D38554200 Pulled By: jay-zhuang fbshipit-source-id: 16e8fb2bf07b9c84bb27fb18421c4d54f2f248fd	3 years ago
Guido Tagliavini Ponce	a0798f6f92	Enable ClockCache in DB block cache test (#10482 ) Summary: A test in db_block_cache_test.cc was skipping ClockCache due to the 16-byte key length requirement. We fixed this. Along the way, we fixed a bug in ApplyToSomeEntries, which assumed the function being applied could modify handle metadata, and thus took an exclusive reference. This is incompatible with calls that need to inspect every element (including externally referenced ones) to gather stats. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10482 Test Plan: ``make -j24 check`` Reviewed By: anand1976 Differential Revision: D38553073 Pulled By: guidotag fbshipit-source-id: 0ed63fed4d3b89e5056b35b7091fce579f5647ae	3 years ago
sdong	911c0208b9	WritableFileWriter tries to skip operations after failure (#10489 ) Summary: A flag in WritableFileWriter is introduced to remember error has happened. Subsequent operations will fail with an assertion. Those operations, except Close() are not supposed to be called anyway. This change will help catch bug in tests and stress tests and limit damage of a potential bug of continue writing to a file after a failure. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10489 Test Plan: Fix existing unit tests and watch crash tests for a while. Reviewed By: anand1976 Differential Revision: D38473277 fbshipit-source-id: 09aafb971e56cfd7f9ef92ad15b883f54acf1366	3 years ago
Yanqin Jin	fee2c472d0	Include minimal contextual information in `CompactionIterator` (#10505 ) Summary: The main purpose is to make debugging easier without sacrificing performance. Instead of using a boolean variable for `CompactionIterator::valid_`, we can extend it to an `uint8_t`, using the LSB to denote if the compaction iterator is valid and 4 additional bits to denote where the iterator is set valid inside `NextFromInput()`. Therefore, when the control flow reaches `PrepareOutput()` and hits assertion there, we can have a better idea of what has gone wrong. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10505 Test Plan: make check ``` TEST_TMPDIR=/dev/shm/rocksdb time ./db_bench -compression_type=none -write_buffer_size=1073741824 -benchmarks=fillseq,flush ``` The above command has a 'flush' benchmark which uses `CompactionIterator`. I haven't observed any CPU regression or drop in throughput or latency increase. Reviewed By: ltamasi Differential Revision: D38551615 Pulled By: riversand963 fbshipit-source-id: 1250848fc118bb753d71fa9ff8ba840df999f5e0	3 years ago
anand76	0b02960d8c	Fix MultiGet range deletion handling and a memory leak (#10513 ) Summary: This PR fixes 2 bugs introduced in https://github.com/facebook/rocksdb/issues/10432 - 1. If the bloom filter returned a negative result for all MultiGet keys in a file, the range tombstones in that file were being ignored, resulting in incorrect results if those tombstones covered a key in a higher level. 2. If all the keys in a file were filtered out in `TableCache::MultiGetFilter`, the table cache handle was not being released. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10513 Test Plan: Add a new unit test that fails without this fix Reviewed By: akankshamahajan15 Differential Revision: D38548739 Pulled By: anand1976 fbshipit-source-id: a741a1e25d2e991d63f038100f126c2dc404a87c	3 years ago
Levi Tamasi	06b04127a8	Reset blob value as soon as it's not needed in DBIter (#10490 ) Summary: We have recently added caching support to BlobDB, and separately, implemented an optimization where reading blobs from the cache results in the cache handle being transferred to the target `PinnableSlice` (as opposed to the contents getting copied). With these changes, it makes sense to reset the `PinnableSlice` storing the blob value in `DBIter` as soon as we move to a different iterator position to prevent us from holding on to the cache handle any longer than necessary. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10490 Test Plan: `make check` Reviewed By: akankshamahajan15 Differential Revision: D38473630 Pulled By: ltamasi fbshipit-source-id: 84c045ffac76436c6152fd0f5775b007f4051386	3 years ago
Levi Tamasi	24bcab7d5d	Make queries return the value of the default column for wide-column entities (#10483 ) Summary: The patch adds support for wide-column entities to the existing query APIs (`Get`, `MultiGet`, and iterator). Namely, when during a query a wide-column entity is encountered, we will return the value of the default (anonymous) column as the result. Later, we plan to add wide-column specific query APIs which will enable retrieving entire wide-column entities or a subset of their columns. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10483 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D38441881 Pulled By: ltamasi fbshipit-source-id: 6444e79a31aff2470e866698e3a97985bc2b3543	3 years ago
Jay Zhuang	3f763763aa	Change `bottommost_temperture` to `last_level_temperture` (#10471 ) Summary: Change tiered compaction feature from `bottommost_temperture` to `last_level_temperture`. The old option is kept for migration purpose only, which is behaving the same as `last_level_temperture` and it will be removed in the next release. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10471 Test Plan: CI Reviewed By: siying Differential Revision: D38450621 Pulled By: jay-zhuang fbshipit-source-id: cc1cdf8bad409376fec0152abc0a64fb72a91527	3 years ago
Jay Zhuang	375534752a	Improve universal compaction picker for tiered compaction (#10467 ) Summary: Current universal compaction picker may cause extra size amplification compaction if there're more hot data on penultimate level. Improve the picker to skip the last level for size amp calculation if tiered compaction is enabled, which can 1. avoid extra unnecessary size amp compaction; 2. typically cold tier (the last level) is not size constrained, so skip size amp for cold tier is intended; Pull Request resolved: https://github.com/facebook/rocksdb/pull/10467 Test Plan: CI and added unittest Reviewed By: siying Differential Revision: D38391350 Pulled By: jay-zhuang fbshipit-source-id: 103c0731c05e0a7e8f267e9e829d022328be25d2	3 years ago
Levi Tamasi	0cc9e98bbb	Respect fill_cache when reading blobs in DBIter (#10492 ) Summary: Similarly to https://github.com/facebook/rocksdb/pull/10457, we now have to explicitly set the `fill_cache` read option when reading blobs in `DBIter` to prevent the cache from getting polluted by queries with `fill_cache` set to false. (Before we added support for a blob cache, the setting had not made any difference either way.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/10492 Test Plan: `make check` Reviewed By: akankshamahajan15 Differential Revision: D38476121 Pulled By: ltamasi fbshipit-source-id: ea5c5e252f83e4a4e2c74156b37d40308d7e0c80	3 years ago
Burton Li	e446bc65e6	Remove local static string (#8103 ) Summary: Local static string is not friendly to Jemalloc arena aware implementation, as it will be allocated on the arena of the first caller, which causes crash if the allocated arena gets refunded earlier. P.S. A Jemalloc arena aware implementation is each rocksdb instance only use certain Jemalloc arenas, and arena will be refunded after associated DB instance is destroyed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8103 Reviewed By: ajkr Differential Revision: D38477235 Pulled By: ltamasi fbshipit-source-id: a58d32cb647ed64c144b4736fb2d5db27c2c28f9	3 years ago
Jay Zhuang	edae671ce0	Re-enable SuggestCompactRangeTest and add Universal Compaction test (#10473 ) Summary: The feature `SuggestCompactRange()` is still experimental. Just re-add the test back. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10473 Test Plan: CI Reviewed By: akankshamahajan15 Differential Revision: D38427153 Pulled By: jay-zhuang fbshipit-source-id: 0b4491c947cbce6c18ff147b167e3c678633129a	3 years ago
Hui Xiao	56dbcb4f72	Deflake ChargeFileMetadataTestWithParam/ChargeFileMetadataTestWithParam.Basic/0 (#10481 ) Summary: Context/summary: `ChargeFileMetadataTestWithParam/ChargeFileMetadataTestWithParam.Basic/0 ` relies on `DBImpl::BackgroundCallCompaction:PurgedObsoleteFiles` happens before verifying `EXPECT_EQ(file_metadata_charge_only_cache->GetCacheCharge(), 1 * CacheReservationManagerImpl< CacheEntryRole::kFileMetadata>::GetDummyEntrySize());` or `EXPECT_EQ(file_metadata_charge_only_cache->GetCacheCharge(), 0);` to ensure appropriate cache reservation release is done before checking. However, this might not be the case under some timing delay and spurious wake-up as coerced below. ``` diff --git a/db/db_impl/db_impl_compaction_flush.cc b/db/db_impl/db_impl_compaction_flush.cc index 4378f3212..3e4f60853 100644 --- a/db/db_impl/db_impl_compaction_flush.cc +++ b/db/db_impl/db_impl_compaction_flush.cc @@ -2989,6 +2989,8 @@ void DBImpl::BackgroundCallCompaction(PrepickedCompaction* prepicked_compaction, if (job_context.HaveSomethingToClean() \|\| job_context.HaveSomethingToDelete() \|\| !log_buffer.IsEmpty()) { mutex_.Unlock(); + bg_cv_.SignalAll(); + usleep(1000); // Have to flush the info logs before bg_compaction_scheduled_-- // because if bg_flush_scheduled_ becomes 0 and the lock is // released, the deconstructor of DB can kick in and destroy all the // states of DB so info_log might not be available after that point. // It also applies to access other states that DB owns. log_buffer.FlushBufferToLog(); if (job_context.HaveSomethingToDelete()) { PurgeObsoleteFiles(job_context); TEST_SYNC_POINT("DBImpl::BackgroundCallCompaction:PurgedObsoleteFiles"); } ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10481 Test Plan: The test of interest failed often at the above coercion: After fix, the test of interest passed at the above coercion: Reviewed By: jay-zhuang Differential Revision: D38438256 Pulled By: hx235 fbshipit-source-id: de80ecdb250174f00e7c2f5e4d952695ed56f51e	3 years ago
Changyu Bi	9d77bf8f7b	Fragment memtable range tombstone in the write path (#10380 ) Summary: - Right now each read fragments the memtable range tombstones https://github.com/facebook/rocksdb/issues/4808. This PR explores the idea of fragmenting memtable range tombstones in the write path and reads can just read this cached fragmented tombstone without any fragmenting cost. This PR only does the caching for immutable memtable, and does so right before a memtable is added to an immutable memtable list. The fragmentation is done without holding mutex to minimize its performance impact. - db_bench is updated to print out the number of range deletions executed if there is any. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10380 Test Plan: - CI, added asserts in various places to check whether a fragmented range tombstone list should have been constructed. - Benchmark: as this PR only optimizes immutable memtable path, the number of writes in the benchmark is chosen such an immutable memtable is created and range tombstones are in that memtable. ``` single thread: ./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=100000 --max_num_range_tombstones=100 multi_thread ./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=1 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=15000 --reads=20000 --threads=32 --max_num_range_tombstones=100 ``` Commit 99cdf16464a057ca44de2f747541dedf651bae9e is included in benchmark result. It was an earlier attempt where tombstones are fragmented for each write operation. Reader threads share it using a shared_ptr which would slow down multi-thread read performance as seen in benchmark results. Results are averaged over 5 runs. Single thread result: \| Max # tombstones \| main fillrandom micros/op \| 99cdf16464a057ca44de2f747541dedf651bae9e \| Post PR \| main readrandom micros/op \| 99cdf16464a057ca44de2f747541dedf651bae9e \| Post PR \| \| ------------- \| ------------- \|------------- \|------------- \|------------- \|------------- \|------------- \| \| 0 \|6.68 \|6.57 \|6.72 \|4.72 \|4.79 \|4.54 \| \| 1 \|6.67 \|6.58 \|6.62 \|5.41 \|4.74 \|4.72 \| \| 10 \|6.59 \|6.5 \|6.56 \|7.83 \|4.69 \|4.59 \| \| 100 \|6.62 \|6.75 \|6.58 \|29.57 \|5.04 \|5.09 \| \| 1000 \|6.54 \|6.82 \|6.61 \|320.33 \|5.22 \|5.21 \| 32-thread result: note that "Max # tombstones" is per thread. \| Max # tombstones \| main fillrandom micros/op \| 99cdf16464a057ca44de2f747541dedf651bae9e \| Post PR \| main readrandom micros/op \| 99cdf16464a057ca44de2f747541dedf651bae9e \| Post PR \| \| ------------- \| ------------- \|------------- \|------------- \|------------- \|------------- \|------------- \| \| 0 \|234.52 \|260.25 \|239.42 \|5.06 \|5.38 \|5.09 \| \| 1 \|236.46 \|262.0 \|231.1 \|19.57 \|22.14 \|5.45 \| \| 10 \|236.95 \|263.84 \|251.49 \|151.73 \|21.61 \|5.73 \| \| 100 \|268.16 \|296.8 \|280.13 \|2308.52 \|22.27 \|6.57 \| Reviewed By: ajkr Differential Revision: D37916564 Pulled By: cbi42 fbshipit-source-id: 05d6d2e16df26c374c57ddcca13a5bfe9d5b731e	3 years ago
anand76	bf4532eb5c	Break TableReader MultiGet into filter and lookup stages (#10432 ) Summary: This PR is the first step in enhancing the coroutines MultiGet to be able to lookup a batch in parallel across levels. By having a separate TableReader function for probing the bloom filters, we can quickly figure out which overlapping keys from a batch are definitely not in the file and can move on to the next level. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10432 Reviewed By: akankshamahajan15 Differential Revision: D38245910 Pulled By: anand1976 fbshipit-source-id: 3d20db2350378c3fe6f086f0c7ba5ff01d7f04de	3 years ago
Yanqin Jin	538df26fcc	Deflake DBWALTest.RaceInstallFlushResultsWithWalObsoletion (#10456 ) Summary: Existing DBWALTest.RaceInstallFlushResultsWithWalObsoletion test relies on a specific interleaving of two background flush threads. We call them bg1 and bg2, and assume bg1 starts to install flush results ahead of bg2. After bg1 enters `ProcessManifestWrites`, bg1 waits for bg2 to also enter `MemTableList::TryInstallMemtableFlushResults()` before bg1 can proceed with MANIFEST write. However, if bg2 called `SyncClosedLogs()` and needed to commit to the MANIFEST but falls behind bg1, then bg2 needs to wait for bg1 to finish writing to MANIFEST. This is a circular dependency. Fix this by allowing bg2 to start only after bg1 grabs the chance to sync the WAL and commit to MANIFEST. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10456 Test Plan: 1. make check 2. export TEST_TMPDIR=/dev/shm && gtest-parallel -r 1000 -w 32 ./db_wal_test --gtest_filter=DBWALTest.RaceInstallFlushResultsWithWalObsoletion Reviewed By: ltamasi Differential Revision: D38391856 Pulled By: riversand963 fbshipit-source-id: 55f647d5b94e534c008a4dd2fb082675ddf58c96	3 years ago
Andrew Kryczka	504fe4de80	Avoid allocations/copies for large `GetMergeOperands()` results (#10458 ) Summary: This PR avoids allocations and copies for the result of `GetMergeOperands()` when the average operand size is at least 256 bytes and the total operands size is at least 32KB. The `GetMergeOperands()` already included `PinnableSlice` but was calling `PinSelf()` (i.e., allocating and copying) for each operand. When this optimization takes effect, we instead call `PinSlice()` to skip that allocation and copy. Resources are pinned in order for the `PinnableSlice` to point to valid memory even after `GetMergeOperands()` returns. The pinned resources include a referenced `SuperVersion`, a `MergingContext`, and a `PinnedIteratorsManager`. They are bundled into a `GetMergeOperandsState`. We use `SharedCleanablePtr` to share that bundle among all `PinnableSlice`s populated by `GetMergeOperands()`. That way, the last `PinnableSlice` to be `Reset()` will cleanup the bundle, including unreferencing the `SuperVersion`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10458 Test Plan: - new DB level test - measured benefit/regression in a number of memtable scenarios Setup command: ``` $ ./db_bench -benchmarks=mergerandom -merge_operator=StringAppendOperator -num=$num -writes=16384 -key_size=16 -value_size=$value_sz -compression_type=none -write_buffer_size=1048576000 ``` Benchmark command: ``` ./db_bench -threads=$threads -use_existing_db=true -avoid_flush_during_recovery=true -write_buffer_size=1048576000 -benchmarks=readrandomoperands -merge_operator=StringAppendOperator -num=$num -duration=10 ``` Worst regression is when a key has many tiny operands: - Parameters: num=1 (implying 16384 operands per key), value_sz=8, threads=1 - `GetMergeOperands()` latency increases 682 micros -> 800 micros (+17%) The regression disappears into the noise (<1% difference) if we remove the `Reset()` loop and the size counting loop. The former is arguably needed regardless of this PR as the convention in `Get()` and `MultiGet()` is to `Reset()` the input `PinnableSlice`s at the start. The latter could be optimized to count the size as we accumulate operands rather than after the fact. Best improvement is when a key has large operands and high concurrency: - Parameters: num=4 (implying 4096 operands per key), value_sz=2KB, threads=32 - `GetMergeOperands()` latency decreases 11492 micros -> 437 micros (-96%). Reviewed By: cbi42 Differential Revision: D38336578 Pulled By: ajkr fbshipit-source-id: 48146d127e04cb7f2d4d2939a2b9dff3aba18258	3 years ago
mpoeter	bef3127b00	Fix race in ExitAsBatchGroupLeader with pipelined writes (#9944 ) Summary: Resolves https://github.com/facebook/rocksdb/issues/9692 This PR adds a unit test that reproduces the race described in https://github.com/facebook/rocksdb/issues/9692 and an according fix. The unit test does not have any assertions, because I could not find a reliable and save way to assert that the writers list does not form a cycle. So with the old (buggy) code, the test would simply hang, while with the fix the test passes successfully. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9944 Reviewed By: pdillinger Differential Revision: D36134604 Pulled By: riversand963 fbshipit-source-id: ef636c5a79ddbef18658ab2f19ca9210a427324a	3 years ago
Peter Dillinger	27f3af5966	Fix serious FSDirectory use-after-Close bug (missing fsync) (#10460 ) Summary: TL;DR: due to a recent change, if you drop a column family, often that DB will no longer fsync after writing new SST files to remaining or new column families, which could lead to data loss on power loss. More bug detail: The intent of https://github.com/facebook/rocksdb/issues/10049 was to Close FSDirectory objects at DB::Close time rather than waiting for DB object destruction. Unfortunately, it also closes shared FSDirectory objects on DropColumnFamily (& destroy remaining handles), which can lead to use-after-Close on FSDirectory shared with remaining column families. Those "uses" are only Fsyncs (or redundant Closes). In the default Posix filesystem, an Fsync on a closed FSDirectory is a quiet no-op. Consequently (under most configurations), if you drop a column family, that DB will no longer fsync after writing new SST files to column families sharing the same directory (true under most configurations). More fix detail: Basically, this removes unnecessary Close ops on destroying ColumnFamilyData. We let `shared_ptr` take care of calling the destructor at the right time. If the intent was to require Close be called before destroying FSDirectory, that was not made clear by the author of FileSystem and was not at all enforced by https://github.com/facebook/rocksdb/issues/10049, which could have added `assert(fd_ == -1)` to `~PosixDirectory()` but did not. To keep this fix simple, we relax the unit test for https://github.com/facebook/rocksdb/issues/10049 to allow timely destruction of FSDirectory to suffice as Close (in CountedFileSystem). Added a TODO to revisit that. Also in this PR: * Added a TODO to share FSDirectory instances between DB and its column families. (Already shared among column families.) * Made DB::Close attempt to close all its open FSDirectory objects even if there is a failure in closing one. Also code clean-up around this logic. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10460 Test Plan: add an assert to check for use-after-Close. With that existing tests can detect the misuse. With fix, tests pass (except noted relaxing of unit test for https://github.com/facebook/rocksdb/issues/10049) Reviewed By: ajkr Differential Revision: D38357922 Pulled By: pdillinger fbshipit-source-id: d42079cadbedf0a969f03389bf586b3b4e1f9137	3 years ago
Levi Tamasi	cc8ded6152	Do not put blobs read during compaction into cache (#10457 ) Summary: During compaction, blobs are currently read using the default `ReadOptions`, which has the `fill_cache` flag set to true. Earlier, this didn't make any difference since we didn't have a blob cache; however, now we have to explicitly set this flag to false to avoid polluting the cache during compaction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10457 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D38333528 Pulled By: ltamasi fbshipit-source-id: 5b4d49a1e39543bee73c7df2aa9194fb101875e2	3 years ago
Yanqin Jin	fbfcf5cbcd	Remove unused fields from FileMetaData (temporarily) (#10443 ) Summary: FileMetaData::[min\|max]_timestamp are not currently being used or tracked by RocksDB, even when user-defined timestamp is enabled. Each of them is a std::string which can occupy 32 bytes. Remove them for now. They may be added back when we have a pressing need for them. When we do add them back, consider store them in a more compact way, e.g. one boolean flag and a byte array of size 16. Per file min/max timestamp bounds are available as table properties. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10443 Test Plan: make check Reviewed By: pdillinger Differential Revision: D38292275 Pulled By: riversand963 fbshipit-source-id: 841dc4e855ad8f8481c80cb020603de9607c9c94	3 years ago
Akanksha Mahajan	56463d443d	Provide support for subcompactions with user-defined timestamps (#10344 ) Summary: The subcompaction logic currently picks file boundaries as subcompaction boundaries. This is not compatible with user-defined timestamps because of two issues. Issue1: ReadOptions.iterate_lower_bound and ReadOptions.iterate_upper_bound contains timestamps which results in assertion failure as BlockBasedTableIterator expects bounds to be without timestamps. As result, because of wrong comparison end key is returned as user_key resulting in assertion failure. Issue2: Since it might result in two keys that only differ by user timestamp getting processed by two different subcompactions (and thus two different CompactionIterator state machines), which in turn can cause data correction issues. This PR provide support to reenable subcompactions with user-defined timestamps. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10344 Test Plan: Added new unit test - Without fix for Issue1 unit test MultipleSubCompactions fails with error: ``` db_with_timestamp_compaction_test: ./db/compaction/clipping_iterator.h:247: void rocksdb::ClippingIterat│ or::AssertBounds(): Assertion `!valid_ \|\| !end_ \|\| cmp_->Compare(key(), end_) < 0' failed. Received signal 6 (Aborted) │ #0 /usr/local/fbcode/platform009/lib/libc.so.6(gsignal+0x100) [0x7f8fbbbfe530] db_with_timestamp_compaction_test: ./db/compaction/clipping_iterator.h:247: void rocksdb::ClippingIterator::AssertBounds(): Assertion `!valid_ \|\| !end_ \|\| cmp_->Compare(key(), end_) < 0' failed. Aborted (core dumped) ``` Ran stress test `make crash_test_with_ts -j32` Reviewed By: riversand963 Differential Revision: D38220841 Pulled By: akankshamahajan15 fbshipit-source-id: 5d5cae2bd37fcaeba1e77fce0a69070ad4158ccb	3 years ago
Peter Dillinger	65036e4217	Revert "Add a blob-specific cache priority (#10309 )" (#10434 ) Summary: This reverts commit `8d178090be` because of a clear performance regression seen in internal dashboard https://fburl.com/unidash/tpz75iee Pull Request resolved: https://github.com/facebook/rocksdb/pull/10434 Reviewed By: ltamasi Differential Revision: D38256373 Pulled By: pdillinger fbshipit-source-id: 134aa00f50dd7b1bbe037c227884a351342ec44b	3 years ago
Andrew Kryczka	c7ccbb33a6	Allow manual compactions to run in parallel by default (#10317 ) Summary: This PR changes the default value of `CompactRangeOptions::exclusive_manual_compaction` from true to false so manual `CompactRange()`s can run in parallel with other compactions. I believe no artificial parallelism restriction is the intuitive behavior so feel the old default value is a trap, which I have fallen into several times, including yesterday. `CompactRangeOptions::exclusive_manual_compaction == false` has been used in both our correctness test and in production for years so should be reasonably safe. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10317 Reviewed By: jay-zhuang Differential Revision: D37659392 Pulled By: ajkr fbshipit-source-id: 504915e978bbe300b79483d064070c75e93d91e5	3 years ago
Jay Zhuang	87649d3288	Best efforts recovery to skip empty MANIFEST (#10416 ) Summary: Skip empty MANIFEST fie during best_efforts_recovery. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10416 Test Plan: make failed db_stress test pass Reviewed By: riversand963 Differential Revision: D38126273 Pulled By: jay-zhuang fbshipit-source-id: 4498d322b09eaa194dd2cbf9c683d62ab54bfb01	3 years ago
Gang Liao	8d178090be	Add a blob-specific cache priority (#10309 ) Summary: RocksDB's `Cache` abstraction currently supports two priority levels for items: high (used for frequently accessed/highly valuable SST metablocks like index/filter blocks) and low (used for SST data blocks). Blobs are typically lower-value targets for caching than data blocks, since 1) with BlobDB, data blocks containing blob references conceptually form an index structure which has to be consulted before we can read the blob value, and 2) cached blobs represent only a single key-value, while cached data blocks generally contain multiple KVs. Since we would like to make it possible to use the same backing cache for the block cache and the blob cache, it would make sense to add a new, lower-than-low cache priority level (bottom level) for blobs so data blocks are prioritized over them. This task is a part of https://github.com/facebook/rocksdb/issues/10156 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10309 Reviewed By: ltamasi Differential Revision: D38211655 Pulled By: gangliao fbshipit-source-id: 65ef33337db4d85277cc6f9782d67c421ad71dd5	3 years ago
Jay Zhuang	6a0010eb46	ldb to display public unique id and dump work with key range (#10417 ) Summary: 2 ldb command improvements: 1. `ldb manifest_dump --verbose` display both the internal unique id and public id. which is useful to manually check sst_unique_id between manifest and SST; 2. `ldb dump` has `--from/to` option, but not working. Add support for that. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10417 Test Plan: run the command locally ``` $ ldb manifest_dump --path=MANIFEST-000026 --verbose ... AddFile: 0 18 1023 'bar' seq:6, type:1 .. 'foo' seq:5, type:1 oldest_ancester_time:1658787615 file_creation_time:1658787615 file_checksum: file_checksum_func_name: Unknown unique_id(internal): {8800772265202404198,16149248642318466463} public_unique_id: F3E0A029B631D7D4-6E402DE08E771780 ``` ``` $ ldb dump --path=000036.sst --from=key000006 --to=key000009 Sst file format: block-based 'key000006' seq:2411, type:1 => value6 'key000007' seq:2412, type:1 => value7 'key000008' seq:2413, type:1 => value8 ... ``` Reviewed By: ajkr Differential Revision: D38136140 Pulled By: jay-zhuang fbshipit-source-id: 8be6eeaa07ff9f089e33011ebe90fd0b69d33bf3	3 years ago
Zichen Zhu	c945a9a664	Allow sufficient subcompactions under round-robin compaction priority (#10422 ) Summary: Allow sufficient subcompactions can be used when the number of input files is less than `max_subcompactions` under round-robin compaction priority. Test Case: Add `RoundRobinWithoutAdditionalResources` into `db_compaction_test` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10422 Reviewed By: ajkr Differential Revision: D38186545 Pulled By: littlepig2013 fbshipit-source-id: b8e5098306f1e5b9561dfafafc8300a38f7fe88e	3 years ago
Jay Zhuang	3134471457	Deflake FlushStaleColumnFamilies test (#10409 ) Summary: Make the Stale Flush test more robust by explicitly checking the target CF is flushed. Currently it's flaky because the default CF may have more than 3 SSTs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10409 Test Plan: the test more likely to fail on a resource limited host: ``` gtest-parallel ./column_family_test --gtest_filter=FormatDef/ColumnFamilyTest.FlushStaleColumnFamilies/0 -r 1000 -w 100 ``` Reviewed By: ajkr Differential Revision: D38116383 Pulled By: jay-zhuang fbshipit-source-id: e27cc56f76f14d0936504f126104e3d87e3d0d5f	3 years ago
Changyu Bi	2fc6df37d6	Add checksum handshake for WAL fragment decompression (#10339 ) Summary: If WAL compression is enabled, WAL fragment decompression results are concatenated together in `log::Reader::ReadPhysicalRecord()`. This PR adds checksum handshake to protect memory corruption during the copying process. `checksum` is renamed to `record_checksum` in `ReadRecord()` to differentiate it from `checksum_` flag that specifies whether CRC32C checksum is verified. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10339 Test Plan: added checksum verification in log_test.cc, `make check -j32`. Reviewed By: ajkr Differential Revision: D37763734 Pulled By: cbi42 fbshipit-source-id: c4faa7c76b9ff1df35026edf31adfe4b47ae3154	3 years ago
Zichen Zhu	8860fc902a	Support subcmpct using reserved resources for round-robin priority (#10341 ) Summary: Earlier implementation of round-robin priority can only pick one file at a time and disallows parallel compactions within the same level. In this PR, round-robin compaction policy will expand towards more input files with respecting some additional constraints, which are summarized as follows: * Constraint 1: We can only pick consecutive files - Constraint 1a: When a file is being compacted (or some input files are being compacted after expanding), we cannot choose it and have to stop choosing more files - Constraint 1b: When we reach the last file (with the largest keys), we cannot choose more files (the next file will be the first one with small keys) * Constraint 2: We should ensure the total compaction bytes (including the overlapped files from the next level) is no more than `mutable_cf_options_.max_compaction_bytes` * Constraint 3: We try our best to pick as many files as possible so that the post-compaction level size can be just less than `MaxBytesForLevel(start_level_)` * Constraint 4: If trivial move is allowed, we reuse the logic of `TryNonL0TrivialMove()` instead of expanding files with Constraint 3 More details can be found in `LevelCompactionBuilder::SetupOtherFilesWithRoundRobinExpansion()`. The above optimization accelerates the process of moving the compaction cursor, in which the write-amp can be further reduced. While a large compaction may lead to high write stall, we break this large compaction into several subcompactions regardless of the `max_subcompactions` limit. The number of subcompactions for round-robin compaction priority is determined through the following steps: * Step 1: Initialized against `max_output_file_limit`, the number of input files in the start level, and also the range size limit `ranges.size()` * Step 2: Call `AcquireSubcompactionResources()`when max subcompactions is not sufficient, but we may or may not obtain desired resources, additional number of resources is stored in `extra_num_subcompaction_threads_reserved_`). Subcompaction limit is changed and update `num_planned_subcompactions` with `GetSubcompactionLimit()` * Step 3: Call `ShrinkSubcompactionResources()` to ensure extra resources can be released (extra resources may exist for round-robin compaction when the number of actual number of subcompactions is less than the number of planned subcompactions) More details can be found in `CompactionJob::AcquireSubcompactionResources()`,`CompactionJob::ShrinkSubcompactionResources()`, and `CompactionJob::ReleaseSubcompactionResources()`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10341 Test Plan: Add `CompactionPriMultipleFilesRoundRobin[1-3]` unit test in `compaction_picker_test.cc` and `RoundRobinSubcompactionsAgainstResources.SubcompactionsUsingResources/[0-4]`, `RoundRobinSubcompactionsAgainstPressureToken.PressureTokenTest/[0-1]` in `db_compaction_test.cc` Reviewed By: ajkr, hx235 Differential Revision: D37792644 Pulled By: littlepig2013 fbshipit-source-id: 7fecb7c4ffd97b34bbf6e3b760b2c35a772a0657	3 years ago
sdong	252bea405e	Improve SubCompaction Partitioning (#10393 ) Summary: Unit tests still haven't been fixed. Also need to add more tests. But I ran some simple fillrandom db_bench and the partitioning feels reasonable. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10393 Test Plan: 1. Make sure existing tests pass. This should cover some basic sub compaction logic to be correct and the partitioning result is reasonable; 2. Add a new unit test to ApproximateKeyAnchors() 3. Run some db_bench with max_subcompaction = 4 and watch the compaction is indeed partitioned evenly. Reviewed By: jay-zhuang Differential Revision: D38043783 fbshipit-source-id: 085008e0f85f9b7c5abff7800307618320efb19f	3 years ago

1 2 3 4 5 ...

5043 Commits (3f57d84af4788e590bdf9ebf4669522917b82def)