Summary:
when the compaction output file is empty, the assertion in `TimestampTablePropertiesCollector::Finish()` breaks. This PR fixes this assert and added unit test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11015
Test Plan: added UT.
Reviewed By: ajkr
Differential Revision: D41716719
Pulled By: cbi42
fbshipit-source-id: d891d46be4c4805e3d49be6b80c9d75f1bd51080
Summary:
When MultiGet with the async_io option encounters an IO error in TableCache::GetTableReader, it may result in leakage of table cache handles due to queued coroutines being abandoned. This PR fixes it by ensuring any queued coroutines are run before aborting the MultiGet.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10997
Test Plan:
1. New unit test in db_basic_test
2. asan_crash
Reviewed By: pdillinger
Differential Revision: D41587244
Pulled By: anand1976
fbshipit-source-id: 900920cd3fba47cb0fc744a62facc5ffe2eccb64
Summary:
The change to `make_unique<char[]>` in https://github.com/facebook/rocksdb/issues/10810 inadvertently started initializing data in Arena blocks, which could lead to increased memory use due to (at least on our implementation) force-mapping pages as a result. This change goes back to `new char[]` while keeping all the other good parts of https://github.com/facebook/rocksdb/issues/10810.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11012
Test Plan: unit test added (fails on Linux before fix)
Reviewed By: anand1976
Differential Revision: D41658893
Pulled By: pdillinger
fbshipit-source-id: 267b7dccfadaeeb1be767d43c602a6abb0e71cd0
Summary:
**Context**
`Options::track_and_verify_wals_in_manifest = true` verifies each of the WALs tracked in manifest indeed presents in the WAL folder. If not, a corruption "Missing WAL with log number" will be thrown.
`DB::SyncWAL()` called at a specific timing (i.e, at the `TEST_SYNC_POINT("FindObsoleteFiles::PostMutexUnlock")`) can record in a new manifest the WAL addition of a WAL file that already had a WAL deletion recorded in the previous manifest.
And the WAL deletion record is not rollover-ed to the new manifest. So the new manifest creates the illusion of such WAL never gets deleted and should presents at db re/open.
- Such WAL deletion record can be caused by flushing the memtable associated with that WAL and such WAL deletion can actually happen in` PurgeObsoleteFiles()`.
As a consequence, upon `DB::Reopen()`, this WAL file can be deleted while manifest still has its WAL addition record , which causes a false alarm of corruption "Missing WAL with log number" to be thrown.
**Summary**
This PR fixes this false alarm by rolling over the WAL deletion record from prev manifest to the new manifest by adding the WAL deletion record to the new manifest.
**Test**
- Make check
- Added new unit test `TEST_F(DBWALTest, FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL)` that failed before the fix and passed after
- [Ongoing]CI stress test + aggressive value as in https://github.com/facebook/rocksdb/pull/10761 , which is how this false alarm was first surfaced, to confirm such false alarm disappears
- [Ongoing]Regular CI stress test to confirm such fix didn't harm anything
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10892
Reviewed By: ajkr
Differential Revision: D40778965
Pulled By: hx235
fbshipit-source-id: a512364bfdeb0b1a55c171890e60d856c528f37f
Summary:
**Context/Summary:**
This reverts commit fc74abb436 and related HISTORY record.
The issue with PR 10777 or general approach using earliest_mem_seqno like https://github.com/facebook/rocksdb/pull/5958#issue-511150930 is that the earliest seqno of memtable of each CFs does not get persisted and will always start with 0 upon Recover(). Later when creating a new memtable in certain CF, we use the last seqno of the whole DB (but not of that CF from previous DB session) for this CF. This will lead to false positive overlapping seqno and PR 10777 will throw something like https://github.com/facebook/rocksdb/blob/main/db/compaction/compaction_picker.cc#L1002-L1004
Luckily a more elegant and complete solution to the overlapping seqno problem these PR aim to solve does not have above problem, see https://github.com/facebook/rocksdb/pull/10922. It is already being pursued and in the process of review. So we can just revert this PR and focus on getting PR10922 to land.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10999
Test Plan: make check
Reviewed By: anand1976
Differential Revision: D41572604
Pulled By: hx235
fbshipit-source-id: 9d9bdf594abd235e2137045cef513ca0b14e0a3a
Summary:
In MergingIterator, if a range tombstone's start or end key is added to minHeap/maxHeap, the key is copied. This PR removes the copying of range tombstone keys by adding InternalKey comparator that compares `Slice` for internal key and `ParsedInternalKey` directly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10878
Test Plan:
- existing UT
- ran all flavors of stress test through sandcastle
- benchmarks: I did not get improvement when compiling with DEBUG_LEVEL=0, and saw many noise. With `OPTIMIZE_LEVEL="-O3" USE_LTO=1` I do see improvement.
```
# Favorable set up: half of the writes are DeleteRange.
TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone ./db_bench --benchmarks=fillseq,levelstats --writes_per_range_tombstone=1 --max_num_range_tombstones=1000000 --range_tombstone_width=2 --num=1000000 --max_bytes_for_level_base=4194304 --disable_auto_compactions --write_buffer_size=33554432 --key_size=50
# benchmark command
TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone ./db_bench --benchmarks=readseq[-W1][-X5],levelstats --use_existing_db=true --cache_size=3221225472 --disable_auto_compactions=true --avoid_flush_during_recovery=true --seek_nexts=100 --reads=1000000 --num=1000000 --threads=25
# main
readseq [AVG 5 runs] : 26017977 (± 371077) ops/sec; 3721.9 (± 53.1) MB/sec
readseq [MEDIAN 5 runs] : 26096905 ops/sec; 3733.2 MB/sec
# this PR
readseq [AVG 5 runs] : 27481724 (± 568758) ops/sec; 3931.3 (± 81.4) MB/sec
readseq [MEDIAN 5 runs] : 27323957 ops/sec; 3908.7 MB/sec
```
Reviewed By: ajkr
Differential Revision: D40711170
Pulled By: cbi42
fbshipit-source-id: 708cb584e2bd085a9ce0d2ef6a420489f721717f
Summary:
Currently, `iterate_upper_bound` is not checked for range tombstone keys in MergingIterator. This may impact performance when there is a large number of range tombstones right after `iterate_upper_bound`. This PR fixes this issue by checking `iterate_upper_bound` in MergingIterator for range tombstone keys.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10966
Test Plan:
- added unit test
- stress test: `python3 tools/db_crashtest.py whitebox --simple --verify_iterator_with_expected_state_one_in=5 --delrangepercent=5 --prefixpercent=18 --writepercent=48 --readpercen=15 --duration=36000 --range_deletion_width=100`
- ran different stress tests over sandcastle
- Falcon team ran some test traffic and saw reduced CPU usage on processing range tombstones.
Reviewed By: ajkr
Differential Revision: D41414172
Pulled By: cbi42
fbshipit-source-id: 9b2c29eb3abb99327c6a649bdc412e70d863f981
Summary:
Enabled output to penultimate level when file endpoints overlap. This is probably only possible when range tombstones span files. Otherwise the overlapping files would all be included in the penultimate level inputs thanks to our atomic compaction unit logic.
Also, corrected `penultimate_output_range_type_`, which is a minor fix as it appears only used for logging.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10961
Test Plan: updated unit test
Reviewed By: cbi42
Differential Revision: D41370615
Pulled By: ajkr
fbshipit-source-id: 7e75ec369a3b41b8382b336446c81825a4c4f572
Summary:
The check for SST unique IDs added to best-efforts recovery (`Options::best_efforts_recovery` is true).
With best_efforts_recovery being true, RocksDB will recover to the latest point in
MANIFEST such that all valid SST files included up to this point pass unique ID checks as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10962
Test Plan: make check
Reviewed By: pdillinger
Differential Revision: D41378241
Pulled By: riversand963
fbshipit-source-id: a036064e2c17dec13d080a24ef2a9f85d607b16c
Summary:
Closes https://github.com/facebook/rocksdb/issues/9909
- Constructing an Options from a DBOptions should use the Env from the DBOptions
- DBOptions should be constructed with the default Env as the env_, rather than null. Why ever not ?
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10666
Reviewed By: riversand963
Differential Revision: D40515418
Pulled By: ajkr
fbshipit-source-id: 4122ba3f537660720262694c21ab4bfb13b6f8de
Summary:
Since the latency measurement uses real time it is possible for the operation to complete in zero microseconds and then fail these checks. We saw this with the operation that invokes Get() on an invalid CF. This PR relaxes the assertions to allow for operations completing in zero microseconds.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10979
Reviewed By: riversand963
Differential Revision: D41478300
Pulled By: ajkr
fbshipit-source-id: 50ef096bd8f0162b31adb46f54ae6ddc337d0a5e
Summary:
before this PR, if there is a range tombstone-only file generated in penultimate level, it is marked the `last_level_temperature`. This PR fixes this issue.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10972
Test Plan: added unit test for this scenario.
Reviewed By: ajkr
Differential Revision: D41449215
Pulled By: cbi42
fbshipit-source-id: 1e06b5ae3bc0183db2991a45965a9807a7e8be0c
Summary:
Can simplify some ugly code in cache_dump_load_impl.cc by having an API in SecondaryCache that can directly consume persisted data.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10945
Test Plan: existing tests for CacheDumper, added basic unit test
Reviewed By: anand1976
Differential Revision: D41231497
Pulled By: pdillinger
fbshipit-source-id: b8ec993ef7d3e7efd68aae8602fd3f858da58068
Summary:
We were not resetting it in non-debug mode so it could be true once and then stay true for future keys where it should be false. This PR adds the reset logic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10967
Test Plan:
- built `db_bench` with DEBUG_LEVEL=0
- ran benchmark: `TEST_TMPDIR=/dev/shm/prefix ./db_bench -benchmarks=fillrandom -compaction_style=1 -preserve_internal_time_seconds=100 -preclude_last_level_data_seconds=10 -write_buffer_size=1048576 -target_file_size_base=1048576 -subcompactions=8 -duration=120`
- compared "output_to_penultimate_level: X bytes + last: Y bytes" lines in LOG output
- Before this fix, Y was always zero
- After this fix, Y gradually increased throughout the benchmark
Reviewed By: riversand963
Differential Revision: D41417726
Pulled By: ajkr
fbshipit-source-id: ace1e9a289e751a5b0c2fbaa8addd4eda5525329
Summary:
Background. One of the core risks of chosing HyperClockCache is ending up with degraded performance if estimated_entry_charge is very significantly wrong. Too low leads to under-utilized hash table, which wastes a bit of (tracked) memory and likely increases access times due to larger working set size (more TLB misses). Too high leads to fully populated hash table (at some limit with reasonable lookup performance) and not being able to cache as many objects as the memory limit would allow. In either case, performance degradation is graceful/continuous but can be quite significant. For example, cutting block size in half without updating estimated_entry_charge could lead to a large portion of configured block cache memory (up to roughly 1/3) going unused.
Fix. This change adds a mechanism through which the DB periodically probes the block cache(s) for "problems" to report, and adds diagnostics to the HyperClockCache for bad estimated_entry_charge. The periodic probing is currently done with DumpStats / stats_dump_period_sec, and diagnostics reported to info_log (normally LOG file).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10965
Test Plan:
unit test included. Doesn't cover all the implemented subtleties of reporting, but ensures basics of when to report or not.
Also manual testing with db_bench. Create db with
```
./db_bench --benchmarks=fillrandom,flush --num=3000000 --disable_wal=1
```
Use and check LOG file for HyperClockCache for various block sizes (used as estimated_entry_charge)
```
./db_bench --use_existing_db --benchmarks=readrandom --num=3000000 --duration=20 --stats_dump_period_sec=8 --cache_type=hyper_clock_cache -block_size=XXXX
```
Seeing warnings / errors or not as expected.
Reviewed By: anand1976
Differential Revision: D41406932
Pulled By: pdillinger
fbshipit-source-id: 4ca56162b73017e4b9cec2cad74466f49c27a0a7
Summary:
After a couple minor bug fixes and successful productions roll-outs in a few places, I think we can mark this as production-ready. It has a clear value proposition for many workloads, even if we don't have clear advice for every workload yet.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10963
Test Plan: existing tests, comment changes only
Reviewed By: siying
Differential Revision: D41384083
Pulled By: pdillinger
fbshipit-source-id: 56359f01a57bb28de8697666b342382fac72ce6d
Summary:
This was just a stepping stone to what eventually became HyperClockCache, and is now just more code to maintain.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10954
Test Plan: tests updated
Reviewed By: akankshamahajan15
Differential Revision: D41310123
Pulled By: pdillinger
fbshipit-source-id: 618ee148a1a0a29ee756ba8fe28359617b7cd67c
Summary:
No material changes to code or comments, just re-arranging things to prepare for a big refactoring, making it easier to what changed. Some specifics:
* This groups things together in Cache in anticipation of secondary cache features being marked production-ready (vs. experimental).
* CacheEntryRole will be needed in definition of class Cache, so that has been moved above it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10942
Test Plan: existing tests
Reviewed By: anand1976
Differential Revision: D41205509
Pulled By: pdillinger
fbshipit-source-id: 3f2559ab1651c758918dc97056951fa2b5eb0348
Summary:
With the recent changes, `GetMergeOperands` is now supported for wide-column entities as well, so we can use it for verification purposes in the non-batched stress tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10952
Test Plan: Ran a simple non-batched ops blackbox crash test.
Reviewed By: riversand963
Differential Revision: D41292114
Pulled By: ltamasi
fbshipit-source-id: 70b4c756a4a1fecb445c16c7096aad805a51203c
Summary:
Fix db_stress failure in async_io in FilePrefetchBuffer.
From the logs, assertion was caused when
- prev_offset_ = offset but somehow prev_len != 0 and explicit_prefetch_submitted_ = true. That scenario is when we send async request to prefetch buffer during seek but in second seek that data is found in cache. prev_offset_ and prev_len_ get updated but we were not setting explicit_prefetch_submitted_ = false because of which buffers were getting out of sync.
It's possible a read by another thread might have loaded the block into the cache in the meantime.
Particular assertion example:
```
prev_offset: 0, prev_len_: 8097 , offset: 0, length: 8097, actual_length: 8097 , actual_offset: 0 ,
curr_: 0, bufs_[curr_].offset_: 4096 ,bufs_[curr_].CurrentSize(): 48541 , async_len_to_read: 278528, bufs_[curr_].async_in_progress_: false
second: 1, bufs_[second].offset_: 282624 ,bufs_[second].CurrentSize(): 0, async_len_to_read: 262144 ,bufs_[second].async_in_progress_: true ,
explicit_prefetch_submitted_: true , copy_to_third_buffer: false
```
As we can see curr_ was expected to read 278528 but it read 48541. Also buffers are out of sync.
Also `explicit_prefetch_submitted_` is set true but prev_len not 0.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10949
Test Plan:
- Ran db_bench for regression to make sure there is no regression;
- Ran db_stress failing without this fix,
- Ran build-linux-mini-crashtest 7- 8 times locally + CircleCI
Reviewed By: anand1976
Differential Revision: D41257786
Pulled By: akankshamahajan15
fbshipit-source-id: 1d100f94f8c06bbbe4cc76ca27f1bbc820c2494f
Summary:
Add stats for time spent in the ReadAsync call, and async read errors.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10947
Test Plan: Run db_bench and look at stats
Reviewed By: akankshamahajan15
Differential Revision: D41236637
Pulled By: anand1976
fbshipit-source-id: 70539b69a28491d57acead449436a761f7108acf
Summary:
Compressed block cache depends on reading the block compression marker beyond the payload block size. Only the payload bytes were being saved and loaded from SecondaryCache -> boom!
This removes some unnecessary code attempting to combine these two competing features. Note that BlockContents was previously used for block-based filter in block cache, but that support has been removed.
Also marking block_cache_compressed as deprecated in this commit as we expect it to be replaced with SecondaryCache.
This problem was discovered during refactoring but didn't want to combine bug fix with that refactoring.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10944
Test Plan: test added that fails on base revision (at least with ASAN)
Reviewed By: akankshamahajan15
Differential Revision: D41205578
Pulled By: pdillinger
fbshipit-source-id: 1b29d36c7a6552355ac6511fcdc67038ef4af29f
Summary:
The patch extends the compaction logic to handle `Merge`s in conjunction with wide-column entities. As usual, the merge operation is applied to the anonymous default column, and any other columns are unaffected.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10946
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D41233722
Pulled By: ltamasi
fbshipit-source-id: dfd9b1362222f01bafcecb139eb48480eb279fed
Summary:
The patch adds `Merge` support for wide-column entities in `DBIter`. As before, the `Merge` operation is applied to the default column of the entity; any other columns are unchanged. As a small cleanup, the PR also changes the signature of `DBIter::Merge` to simply return a boolean instead of the `Merge` operation's `Status` since the actual `Status` is already stored in a member variable.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10941
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D41195471
Pulled By: ltamasi
fbshipit-source-id: 362cf555897296e252c3de5ddfbd569ef34f85ef
Summary:
The patch untangles some nested ifs in `MergeHelper::MergeUntil`. This will come in handy when extending the compaction logic to support `Merge` for wide-column entities, and also enables us to eliminate some repeated branching on value type and to decrease the scope of some variables.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10943
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D41201946
Pulled By: ltamasi
fbshipit-source-id: 890bd3d4e31cdccadca614489a94686d76485ba9
Summary:
The patch refines/reworks `MergeHelper::TimedFullMerge(WithEntity)`
a bit in two ways. First, it eliminates the recently introduced `TimedFullMerge`
overload, which makes the responsibilities clearer by making sure the query
result (`value` for `Get`, `columns` for `GetEntity`) is set uniformly in
`SaveValue` and `GetContext`. Second, it changes the interface of
`TimedFullMergeWithEntity` so it exposes its result in a serialized form; this
is a more decoupled design which will come in handy when adding support
for `Merge` with wide-column entities to `DBIter`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10932
Test Plan: `make check`
Reviewed By: akankshamahajan15
Differential Revision: D41129399
Pulled By: ltamasi
fbshipit-source-id: 69d8da358c77d4fc7e8c40f4dafc2c129a710677
Summary:
`DBIter::saved_value_` stores the result of any `Merge` that was performed to compute the iterator's current value. This value can be ditched whenever the iterator's position is changed, and is already cleared in `Seek`, `SeekForPrev`, `SeekToFirst`, and `SeekToLast`. With the patch, it is also cleared in `Next` and `Prev`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10934
Test Plan: `make check`
Reviewed By: akankshamahajan15
Differential Revision: D41133473
Pulled By: ltamasi
fbshipit-source-id: cf9e936f48151e64e455cc1664d6e9f4a03aa308
Summary:
Hello,
As discussed previously in this [discussion](https://github.com/facebook/rocksdb/pull/9680#discussion_r853105163), the mentioned PR introduced a regression in portable versions that compile with MSVC - crc_3way optimization won't be used even in cases where it is supported.
This PR aims to fix just that.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10667
Reviewed By: akankshamahajan15
Differential Revision: D40644592
Pulled By: ajkr
fbshipit-source-id: dadbeb10d57c19800e74288258ec3b96095557dd
Summary:
The patch fixes a bug where `GetContext::Merge` (and `MergeEntity`) does not update the ticker `READ_NUM_MERGE_OPERANDS` because it implicitly uses the default parameter value of `update_num_ops_stats=false` when calling `MergeHelper::TimedFullMerge`. Also, to prevent such issues going forward, the PR removes the default parameter values from the `TimedFullMerge` methods. In addition, it removes an unused/unnecessary parameter from `TimedFullMergeWithEntity`, and does some cleanup at the call sites of these methods.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10925
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D41096453
Pulled By: ltamasi
fbshipit-source-id: fc60646d32b4d516b8fe81e265c3f020a32fd7f8
Summary:
In transaction unit tests, replace a few member variable lambdas with
non-static methods. It's easier for gdb to work with variables in methods than in lambdas.
(Seen similar things to https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86675).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10924
Test Plan: make check
Reviewed By: jay-zhuang
Differential Revision: D41072241
Pulled By: riversand963
fbshipit-source-id: e4fa491de573c4656225a86a75af926c1df827f6
Summary:
Prevents `MemTableList::PickMemtablesToFlush()` from picking non-consecutive memtables. It leads to wrong ordering in L0 if the files are committed, or an error like below if force_consistency_checks=true catches it:
```
Corruption: force_consistency_checks: VersionBuilder: L0 file https://github.com/facebook/rocksdb/issues/25 with seqno 320416 368066 vs. file https://github.com/facebook/rocksdb/issues/24 with seqno 336037 352068
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10921
Test Plan: fix the expectation in the existing test of this behavior
Reviewed By: riversand963
Differential Revision: D41046935
Pulled By: ajkr
fbshipit-source-id: 783696bff56115063d5dc5856dfaed6a9881d1ab
Summary:
When performing Merge during range scan, iterator should understand value types of kDeletionWithTimestamp.
Also add an additional check in debug mode to MergeHelper, and account for the presence of compaction filter.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10915
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D40960039
Pulled By: riversand963
fbshipit-source-id: dd79d86d7c79d05755bb939a3d94e0c53ddd7f59
Summary:
The patch adds `Merge` support for wide-column entities to the point lookup
APIs, i.e. `Get`, `MultiGet`, `GetEntity`, and `GetMergeOperands`. (I plan to
update the iterator and compaction logic in separate PRs.) In terms of semantics,
the `Merge` operation is applied to the default (anonymous) column; any other
columns in the entity are unaffected.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10916
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D40962311
Pulled By: ltamasi
fbshipit-source-id: 244bc9d172be1af2f204796b2f89104e4d2fa373
Summary:
For clean-up and in preparation for some other anticipated changes, including
* A new dynamically-scaling variant of HyperClockCache
* SecondaryCache support for HyperClockCache
This change does some refactoring for current and future code sharing and reusability. (Including follow-up on https://github.com/facebook/rocksdb/issues/10843)
## clock_cache.h
* TBD whether new variant will be a HyperClockCache or use some other name, so namespace is just clock_cache for the family of structures.
* A number of helper functions introduced and used.
* Pre-emptively split ClockHandle (shared among lock-free clock cache variants) and HandleImpl (specific to a kind of Table), and introduce template to plug new Table implementation into ClockCacheShard.
## clock_cache.cc
* Mostly using helper functions. Some things like `Rollback()` and `FreeDataMarkEmpty()` were not combined because `Rollback()` is Table-specific while `FreeDataMarkEmpty()` can be used with different table implementations.
* Performance testing indicated that despite more opportunities for parallelism, making a local copy of handle data for processing after marking an entry empty was slower than doing that processing before marking the entry empty (but after marking it "under construction"), thus avoiding a few words of copying data. At least for now, this answers the "TODO? Delay freeing?" questions (no).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10887
Test Plan:
fixed a unit testing gap; other minor test updates for refactoring
No functionality change
## Performance
Same setup as https://github.com/facebook/rocksdb/issues/10801:
Before: `readrandom [AVG 81 runs] : 627992 (± 5124) ops/sec`
After: `readrandom [AVG 81 runs] : 637512 (± 4866) ops/sec`
I've been getting some inconsistent results on restarts like the system is not being fair to the two processes, so I'm not sure there's such a real difference.
Reviewed By: anand1976
Differential Revision: D40959240
Pulled By: pdillinger
fbshipit-source-id: 0a8f3646b3bdb5bc7aaad60b26790b0779189949
Summary:
With TSAN build on CircleCI (see mini-tsan in .circleci/config).
Sometimes `SeqAdvanceConcurrentTest.SeqAdvanceConcurrent` will get stuck when an experimental feature called
"unordered write" is enabled. Stack trace will be the following
```
Thread 7 (Thread 0x7f2284a1c700 (LWP 481523) "write_prepared_"):
#0 0x00000000004fa3f5 in __tsan_atomic64_load () at ./db/merge_context.h:15
https://github.com/facebook/rocksdb/issues/1 0x00000000005e5942 in std::__atomic_base<unsigned long>::load (this=0x7b74000012f8, __m=std::memory_order_seq_cst) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/atomic_base.h:481
https://github.com/facebook/rocksdb/issues/2 std::__atomic_base<unsigned long>::operator unsigned long (this=0x7b74000012f8) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/atomic_base.h:341
https://github.com/facebook/rocksdb/issues/3 0x00000000005bf001 in rocksdb::SeqAdvanceConcurrentTest_SeqAdvanceConcurrent_Test::TestBody()::$_9::operator()(void*) const (this=0x7b14000085e8) at utilities/transactions/write_prepared_transaction_test.cc:1702
Thread 6 (Thread 0x7f228421b700 (LWP 481521) "write_prepared_"):
#0 0x000000000052178c in __tsan::MetaMap::GetAndLock(__tsan::ThreadState*, unsigned long, unsigned long, bool, bool) () at ./db/merge_context.h:15
https://github.com/facebook/rocksdb/issues/1 0x00000000004fa48e in __tsan_atomic64_load () at ./db/merge_context.h:15
https://github.com/facebook/rocksdb/issues/2 0x00000000005e5942 in std::__atomic_base<unsigned long>::load (this=0x7b74000012f8, __m=std::memory_order_seq_cst) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/atomic_base.h:481
https://github.com/facebook/rocksdb/issues/3 std::__atomic_base<unsigned long>::operator unsigned long (this=0x7b74000012f8) at /usr/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/atomic_base.h:341
https://github.com/facebook/rocksdb/issues/4 0x00000000005bf001 in rocksdb::SeqAdvanceConcurrentTest_SeqAdvanceConcurrent_Test::TestBody()::$_9::operator()(void*) const (this=0x7b14000085e8) at utilities/transactions/write_prepared_transaction_test.cc:1702
```
This is problematic and suspicious. Two threads will get stuck in the same place trying to load from an atomic variable.
https://github.com/facebook/rocksdb/blob/7.8.fb/utilities/transactions/write_prepared_transaction_test.cc#L1694:L1707. Not sure why two threads can reach the same point.
The stack trace shows that there may be a deadlock, since the two threads are on the same write thread (one is doing Prepare, while the other is trying to commit).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10909
Test Plan:
On CircleCI mini-tsan, apply a patch first so that we have a higher chance of hitting the same problematic situation,
```
diff --git a/utilities/transactions/write_prepared_transaction_test.cc b/utilities/transactions/write_prepared_transaction_test.cc
index 4bc1f3744..bd5dc4924 100644
--- a/utilities/transactions/write_prepared_transaction_test.cc
+++ b/utilities/transactions/write_prepared_transaction_test.cc
@@ -1714,13 +1714,13 @@ TEST_P(SeqAdvanceConcurrentTest, SeqAdvanceConcurrent) {
size_t d = (n % base[bi + 1]) / base[bi];
switch (d) {
case 0:
- threads.emplace_back(txn_t0, bi);
+ threads.emplace_back(txn_t3, bi);
break;
case 1:
- threads.emplace_back(txn_t1, bi);
+ threads.emplace_back(txn_t3, bi);
break;
case 2:
- threads.emplace_back(txn_t2, bi);
+ threads.emplace_back(txn_t3, bi);
break;
case 3:
threads.emplace_back(txn_t3, bi);
```
then build and run tests
```
COMPILE_WITH_TSAN=1 CC=clang-13 CXX=clang++-13 ROCKSDB_DISABLE_ALIGNED_NEW=1 USE_CLANG=1 make V=1 -j32 check
gtest-parallel -r 100 ./write_prepared_transaction_test --gtest_filter=TwoWriteQueues/SeqAdvanceConcurrentTest.SeqAdvanceConcurrent/19
```
In the above, `SeqAdvanceConcurrent/19`. The tests 10 to 19 correspond to unordered write in which Prepare() and Commit() can both enter the same write thread.
Before this PR, there is a high chance of hitting the deadlock. With this PR, no deadlock has been encountered so far.
Reviewed By: ltamasi
Differential Revision: D40869387
Pulled By: riversand963
fbshipit-source-id: 81e82a70c263e4f3417597a201b081ee54f1deab