Summary:
This PR fixes a heap use after free bug in the async prefetch code that happens in the following scenario -
1. Scan thread starts 2 async reads for Seek, one for the seek block and one for prefetching
2. Before the first read in https://github.com/facebook/rocksdb/issues/1 completes, another thread reads and loads the block in cache
3. The first scan thread finds the block in cache, continues and the next block cache miss is for a block that spans the boundary of the 2 prefetch buffers, and the 1st read is complete but the 2nd one is not complete yet
4. The scan thread will reallocate (i.e free the old buffer and allocate a new one) the 2nd prefetch buffer, and the in-progress prefetch is orphaned
5. The orphaned prefetch finally completes, resulting in a use after free
Also add a few asserts to surface bugs earlier in the crash tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11049
Test Plan: Repro with db_stress and verify the fix
Reviewed By: akankshamahajan15
Differential Revision: D42181118
Pulled By: anand1976
fbshipit-source-id: 1ac55d2f64a89ce128c1c574262b8aa7d82eb8cc
Summary:
the [assertion](c3f720c60d/db/compaction/compaction_outputs.cc (L643)) in `CompactionOutputs::AddRangeDels()` can fail after https://github.com/facebook/rocksdb/pull/10802. The assertion fails when `lower_bound_from_range_tombstone` is true during `AddRangeDels()` for a new compaction output file, while the lower bound range tombstone key has seqno 0 and op_type kTypeRangeDeletion. It can have seqno 0 when it was truncated at a point key whose seqno was zeroed out during compaction, the seqno and op_type could be set [here](c3f720c60d/db/compaction/compaction_outputs.cc (L594)). This PR fixes the assertion excluding the case when `lower_bound_from_range_tombstone` is true.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11040
Test Plan: CI
Reviewed By: ajkr
Differential Revision: D42119914
Pulled By: cbi42
fbshipit-source-id: 0897e71b5304cb02aac30f71667b590c37b72baf
Summary:
The IO uring usage is disabled in RocksDB by default and, as a result, PosixRandomAccessFile::ReadAsync returns a NotSupported() status. This was causing stress test failures with MultiGet and async_io combination. Fix it by relying on redirection of ReadAsync to Read when default Env is used in db_stress.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11045
Reviewed By: akankshamahajan15
Differential Revision: D42136213
Pulled By: anand1976
fbshipit-source-id: fc7904d8ece74d7e8f2e1a34c3d70bd5774fb45f
Summary:
The db_stress code uses a wrapper Env on top of the raw/fault injection Env. The wrapper, DbStressEnvWrapper, is a legacy Env and thus has a default implementation of ReadAsync that just does a sync read. As a result, the ReadAsync implementations of PosixFileSystem and other file systems weren't being tested. Also, the ReadAsync interface wasn't implemented in FaultInjectionTestFS. This change implements the necessary interfaces in FaultInjectionTestFS and derives DbStressEnvWrapper from FileSystemWrapper rather than EnvWrapper.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11037
Test Plan: Run db_stress standalone and crash test. With this change, db_stress is able to repro the bug fixed in https://github.com/facebook/rocksdb/issues/10890.
Reviewed By: akankshamahajan15
Differential Revision: D42061290
Pulled By: anand1976
fbshipit-source-id: 7f0331fd15ee33fb4f7f0f4b22b206fe801ba074
Summary:
This PR is the first step for Issue https://github.com/facebook/rocksdb/issues/4811. Currently compaction output files are cut at point keys, and the decision is made mainly in `CompactionOutputs::ShouldStopBefore()`. This makes it possible for range tombstones to cause large compactions that does not respect `max_compaction_bytes`. For example, we can have a large range tombstone that overlaps with too many files from the next level. Another example is when there is a gap between a range tombstone and another key. The first issue may be more acceptable, as a lot of data is deleted. This PR address the second issue by calling `ShouldStopBefore()` for range tombstone start keys. The main change is for `CompactionIterator` to emit range tombstone start keys to be processed by `CompactionOutputs`. A new `CompactionMergingIterator` is introduced and only used under `CompactionIterator` for this purpose. Further improvement after this PR include 1) cut compaction output at some grandparent boundary key instead of at the next point key or range tombstone start key and 2) cut compaction output file within a large range tombstone (it may be easier and reasonable to only do it for range tombstones at the end of a compaction output).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10802
Test Plan:
- added unit tests in db_range_del_test.
- stress test: `python3 tools/db_crashtest.py whitebox --[simple|enable_ts] --verify_iterator_with_expected_state_one_in=5 --delrangepercent=5 --prefixpercent=2 --writepercent=58 --readpercen=21 --duration=36000 --range_deletion_width=1000000`
Reviewed By: ajkr, jay-zhuang
Differential Revision: D40308827
Pulled By: cbi42
fbshipit-source-id: a8fd6f70a3f09d0ef7a40e006f6c964bba8c00df
Summary:
Right now, in unit tests, when background tests are sleeping using SleepingBackgroundTask, and the test exits with test assertion failure, the process will hang and it might prevent us to see the test failure message in CI runs. Try to wake up the thread so that the test can exit correctly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11036
Test Plan: Watch CI succeeds
Reviewed By: riversand963
Differential Revision: D42020489
fbshipit-source-id: 5b8441b18d5f67bbb3ade59a1225a8d3c860c2eb
Summary:
Closes https://github.com/facebook/rocksdb/issues/10980
Reproduced as per the suggestion in the ticket, and `$ jcmd <PID> VM.native_memory | grep Internal` reports that we are no longer leaking internal memory with the suggested fix.
I did the repro in `MultiGetTest.java` which I have optimized imports on. It did not seem helpful to leave the test code around as it would be onerous to build a memory leak reproducer, and regression seems a remote possibility.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10981
Reviewed By: riversand963
Differential Revision: D41498748
Pulled By: ajkr
fbshipit-source-id: 8c6dd0d608172879c8bda479c7c9c05c12d34e70
Summary:
RocksDB has two public APIs: `DB::LockWAL()`/`DB::UnlockWAL()`. The current implementation acquires and
releases the internal `DBImpl::log_write_mutex_`.
According to the comment on `DBImpl::log_write_mutex_`: https://github.com/facebook/rocksdb/blob/7.8.fb/db/db_impl/db_impl.h#L2287:L2288
> Note: to avoid dealock, if needed to acquire both log_write_mutex_ and mutex_, the order should be first mutex_ and then log_write_mutex_.
This puts limitations on how applications can use the `LockWAL()` API. After `LockWAL()` returns ok, then application
should not perform any operation that acquires `mutex_`. Currently, the use case of `LockWAL()` is MyRocks implementing
the MySQL storage engine handlerton `lock_hton_log` interface. The operation that MyRocks performs after `LockWAL()`
is `GetSortedWalFiless()` which not only acquires mutex_, but also `log_write_mutex_`.
There are two issues:
1. Applications using these two APIs may hang if one thread calls `GetSortedWalFiles()` after
calling `LockWAL()` because log_write_mutex is not recursive.
2. Two threads may dead lock due to lock order inversion.
To fix these issues, we can modify the implementation of LockWAL so that it does not keep
`log_write_mutex_` held until UnlockWAL. To achieve the goal of locking the WAL, we can
instead manually inject a write stall so that all future writes will be stopped.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11020
Test Plan: make check
Reviewed By: ajkr
Differential Revision: D41785203
Pulled By: riversand963
fbshipit-source-id: 5ccb7a9c6eb9a2c3fa80fd2c399cc2568b8f89ce
Summary:
**Context:**
Sorting L0 files by `largest_seqno` has at least two inconvenience:
- File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap.
- For example, consider the following sequence of events ("key@n" indicates key at seqno "n")
- insert k1@1 to memtable m1
- ingest file s1 with k2@2, ingest file s2 with k3@3
- insert k4@4 to m1
- compact files s1, s2 and result in new file s3 of seqno range [2, 3]
- flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1
- However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption.
- Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption
- For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr for this example)
- an existing SST s1 contains only k1@1
- insert k1@2 to memtable m1
- ingest file s2 with k3@3, ingest file s3 with k4@4
- insert single delete k5@5 in m1
- flush m1 and result in new file s4 of seqno range [2, 5]
- compact s1, s2, s3 and result in new file s5 of seqno range [1, 4]
- compact s4 and result in new file s6 of seqno range [2] due to single delete
- By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno`
Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways:
- In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number` ordering check. This will result in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more.
- In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption.
**Summary:**
- Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`.
- `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`)
- Compaction output file is assigned with the minimum `epoch_number` among input files'
- Refit level: reuse refitted file's epoch_number
- Other paths needing `epoch_number` treatment:
- Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`
- Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`.
- Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair).
- Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder.
- Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery
- Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more
- Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag`
- Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above
- Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`.
- Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR.
- Misc:
- update existing tests with `epoch_number` so make check will pass
- update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases
- assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber()
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922
Test Plan:
- `make check`
- New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc`
- Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930
- [Ongoing] Compatibility test: manually run 36a5686ec0 (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox`
- [Ongoing] normal db stress test
- [Ongoing] db stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761
Reviewed By: ajkr
Differential Revision: D41063187
Pulled By: hx235
fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9
Summary:
Previously, the "latest" valid backup would not be updated on delete.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11029
Test Plan: unit test included (added to an existing test for efficiency)
Reviewed By: hx235
Differential Revision: D41967925
Pulled By: pdillinger
fbshipit-source-id: ca143354d281eb979557ea421902cd26803a1137
Summary:
Add a tiered storage migration test which would conflict with
an ongoing penultimate level compaction.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10908
Test Plan: Test only change
Reviewed By: anand1976
Differential Revision: D40864509
Pulled By: ajkr
fbshipit-source-id: e316e849a01a6c71a41be130101f909b6c0498cb
Summary:
Besides the existing ordering and validation, more is coming to VersionBuilder/VersionStorageInfo, like migration of epoch_numbers from older RocksDB versions. We should start using those common classes for importing CFs, instead of duplicating their ordering, validation, and migration logic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11028
Test Plan: rely on existing tests
Reviewed By: hx235
Differential Revision: D41865427
Pulled By: ajkr
fbshipit-source-id: 873f5cd87b8902a2380c3b71373ce0b0db3a0c50
Summary:
Previously, you could get a format_version error if SST file size was too small in manifest, or a weird "too short" error if too big in manifest. Now we ensure:
* Magic number error is reported first if we attempt to open an SST file and the footer is completely bad.
* Footer errors are reported with affected file.
* If manifest file size doesn't match actual, then the error includes expected and actual sizes (if an error is reported; in some cases we allow the file to be too big)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11009
Test Plan:
unit tests added, some manual
Previously, the code for "file too short" in footer processing was only covered by some tests attempting to verify SST checksums on non-SST files (fixed).
Reviewed By: siying
Differential Revision: D41656272
Pulled By: pdillinger
fbshipit-source-id: 3da32702eb5aaedbea0e5e74742ad57edd7ad3df
Summary:
Bumps [nokogiri](https://github.com/sparklemotion/nokogiri) from 1.13.9 to 1.13.10.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a href="https://github.com/sparklemotion/nokogiri/releases">nokogiri's releases</a>.</em></p>
<blockquote>
<h2>1.13.10 / 2022-12-07</h2>
<h3>Security</h3>
<ul>
<li>[CRuby] Address CVE-2022-23476, unchecked return value from <code>xmlTextReaderExpand</code>. See <a href="https://github.com/sparklemotion/nokogiri/security/advisories/GHSA-qv4q-mr5r-qprj">GHSA-qv4q-mr5r-qprj</a> for more information.</li>
</ul>
<h3>Improvements</h3>
<ul>
<li>[CRuby] <code>XML::Reader#attribute_hash</code> now returns <code>nil</code> on parse errors. This restores the behavior of <code>#attributes</code> from v1.13.7 and earlier. [<a href="https://github-redirect.dependabot.com/sparklemotion/nokogiri/issues/2715">https://github.com/facebook/rocksdb/issues/2715</a>]</li>
</ul>
<hr />
<p>sha256 checksums:</p>
<pre><code>777ce2e80f64772e91459b943e531dfef387e768f2255f9bc7a1655f254bbaa1 nokogiri-1.13.10-aarch64-linux.gem
b432ff47c51386e07f7e275374fe031c1349e37eaef2216759063bc5fa5624aa nokogiri-1.13.10-arm64-darwin.gem
73ac581ddcb680a912e92da928ffdbac7b36afd3368418f2cee861b96e8c830b nokogiri-1.13.10-java.gem
916aa17e624611dddbf2976ecce1b4a80633c6378f8465cff0efab022ebc2900 nokogiri-1.13.10-x64-mingw-ucrt.gem
0f85a1ad8c2b02c166a6637237133505b71a05f1bb41b91447005449769bced0 nokogiri-1.13.10-x64-mingw32.gem
91fa3a8724a1ce20fccbd718dafd9acbde099258183ac486992a61b00bb17020 nokogiri-1.13.10-x86-linux.gem
d6663f5900ccd8f72d43660d7f082565b7ffcaade0b9a59a74b3ef8791034168 nokogiri-1.13.10-x86-mingw32.gem
81755fc4b8130ef9678c76a2e5af3db7a0a6664b3cba7d9fe8ef75e7d979e91b nokogiri-1.13.10-x86_64-darwin.gem
51d5246705dedad0a09b374d09cc193e7383a5dd32136a690a3cd56e95adf0a3 nokogiri-1.13.10-x86_64-linux.gem
d3ee00f26c151763da1691c7fc6871ddd03e532f74f85101f5acedc2d099e958 nokogiri-1.13.10.gem
</code></pre>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a href="https://github.com/sparklemotion/nokogiri/blob/main/CHANGELOG.md">nokogiri's changelog</a>.</em></p>
<blockquote>
<h2>1.13.10 / 2022-12-07</h2>
<h3>Security</h3>
<ul>
<li>[CRuby] Address CVE-2022-23476, unchecked return value from <code>xmlTextReaderExpand</code>. See <a href="https://github.com/sparklemotion/nokogiri/security/advisories/GHSA-qv4q-mr5r-qprj">GHSA-qv4q-mr5r-qprj</a> for more information.</li>
</ul>
<h3>Improvements</h3>
<ul>
<li>[CRuby] <code>XML::Reader#attribute_hash</code> now returns <code>nil</code> on parse errors. This restores the behavior of <code>#attributes</code> from v1.13.7 and earlier. [<a href="https://github-redirect.dependabot.com/sparklemotion/nokogiri/issues/2715">https://github.com/facebook/rocksdb/issues/2715</a>]</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="4c80121dc3"><code>4c80121</code></a> version bump to v1.13.10</li>
<li><a href="85410e3841"><code>85410e3</code></a> Merge pull request <a href="https://github-redirect.dependabot.com/sparklemotion/nokogiri/issues/2715">https://github.com/facebook/rocksdb/issues/2715</a> from sparklemotion/flavorjones-fix-reader-error-hand...</li>
<li><a href="9fe0761c47"><code>9fe0761</code></a> fix(cruby): XML::Reader#attribute_hash returns nil on error</li>
<li><a href="3b9c736bee"><code>3b9c736</code></a> Merge pull request <a href="https://github-redirect.dependabot.com/sparklemotion/nokogiri/issues/2717">https://github.com/facebook/rocksdb/issues/2717</a> from sparklemotion/flavorjones-lock-psych-to-fix-bui...</li>
<li><a href="2efa87b49a"><code>2efa87b</code></a> test: skip large cdata test on system libxml2</li>
<li><a href="3187d6739c"><code>3187d67</code></a> dep(dev): pin psych to v4 until v5 builds in CI</li>
<li><a href="a16b4bf14c"><code>a16b4bf</code></a> style(rubocop): disable Minitest/EmptyLineBeforeAssertionMethods</li>
<li>See full diff in <a href="https://github.com/sparklemotion/nokogiri/compare/v1.13.9...v1.13.10">compare view</a></li>
</ul>
</details>
<br />
[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=nokogiri&package-manager=bundler&previous-version=1.13.9&new-version=1.13.10)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `dependabot rebase` will rebase this PR
- `dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `dependabot merge` will merge this PR after your CI passes on it
- `dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `dependabot cancel merge` will cancel a previously requested merge and block automerging
- `dependabot reopen` will reopen this PR if it is closed
- `dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
- `dependabot use these labels` will set the current labels as the default for future PRs for this repo and language
- `dependabot use these reviewers` will set the current reviewers as the default for future PRs for this repo and language
- `dependabot use these assignees` will set the current assignees as the default for future PRs for this repo and language
- `dependabot use this milestone` will set the current milestone as the default for future PRs for this repo and language
You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/facebook/rocksdb/network/alerts).
</details>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11024
Reviewed By: ltamasi
Differential Revision: D41830779
Pulled By: riversand963
fbshipit-source-id: d5ec751cfb947413a6eabe412871bf30b9db5674
Summary:
**Context/Summary:**
Credit to ajkr's https://github.com/facebook/rocksdb/pull/11016#pullrequestreview-1205020134,
flaky test https://app.circleci.com/pipelines/github/facebook/rocksdb/21985/workflows/5f6cc355-78c1-46d8-89ee-0fd679725a8a/jobs/540878 is due to `Flush()` called in the test returned earlier than obsoleted WAL being found in background flush and SyncWAL() was called (i.e, "sync_point_called" sets to true). Fix this by making checking `sync_point_called == true` after obsoleted WAL is found and `SyncWAL()` is called. Also rename the "sync_point_called" to be something more specific.
Also, fix a potential flakiness due to manually setting a log threshold to force new manifest creation. This is unreliable so I decided to use sync point to force new manifest creation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11016
Test Plan: make check
Reviewed By: pdillinger
Differential Revision: D41717786
Pulled By: hx235
fbshipit-source-id: ad1e4701a987285bbe6c8e7d9b05c4db06b4edf4
Summary:
when the compaction output file is empty, the assertion in `TimestampTablePropertiesCollector::Finish()` breaks. This PR fixes this assert and added unit test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11015
Test Plan: added UT.
Reviewed By: ajkr
Differential Revision: D41716719
Pulled By: cbi42
fbshipit-source-id: d891d46be4c4805e3d49be6b80c9d75f1bd51080
Summary:
When MultiGet with the async_io option encounters an IO error in TableCache::GetTableReader, it may result in leakage of table cache handles due to queued coroutines being abandoned. This PR fixes it by ensuring any queued coroutines are run before aborting the MultiGet.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10997
Test Plan:
1. New unit test in db_basic_test
2. asan_crash
Reviewed By: pdillinger
Differential Revision: D41587244
Pulled By: anand1976
fbshipit-source-id: 900920cd3fba47cb0fc744a62facc5ffe2eccb64
Summary:
The change to `make_unique<char[]>` in https://github.com/facebook/rocksdb/issues/10810 inadvertently started initializing data in Arena blocks, which could lead to increased memory use due to (at least on our implementation) force-mapping pages as a result. This change goes back to `new char[]` while keeping all the other good parts of https://github.com/facebook/rocksdb/issues/10810.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11012
Test Plan: unit test added (fails on Linux before fix)
Reviewed By: anand1976
Differential Revision: D41658893
Pulled By: pdillinger
fbshipit-source-id: 267b7dccfadaeeb1be767d43c602a6abb0e71cd0
Summary:
**Context**
`Options::track_and_verify_wals_in_manifest = true` verifies each of the WALs tracked in manifest indeed presents in the WAL folder. If not, a corruption "Missing WAL with log number" will be thrown.
`DB::SyncWAL()` called at a specific timing (i.e, at the `TEST_SYNC_POINT("FindObsoleteFiles::PostMutexUnlock")`) can record in a new manifest the WAL addition of a WAL file that already had a WAL deletion recorded in the previous manifest.
And the WAL deletion record is not rollover-ed to the new manifest. So the new manifest creates the illusion of such WAL never gets deleted and should presents at db re/open.
- Such WAL deletion record can be caused by flushing the memtable associated with that WAL and such WAL deletion can actually happen in` PurgeObsoleteFiles()`.
As a consequence, upon `DB::Reopen()`, this WAL file can be deleted while manifest still has its WAL addition record , which causes a false alarm of corruption "Missing WAL with log number" to be thrown.
**Summary**
This PR fixes this false alarm by rolling over the WAL deletion record from prev manifest to the new manifest by adding the WAL deletion record to the new manifest.
**Test**
- Make check
- Added new unit test `TEST_F(DBWALTest, FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL)` that failed before the fix and passed after
- [Ongoing]CI stress test + aggressive value as in https://github.com/facebook/rocksdb/pull/10761 , which is how this false alarm was first surfaced, to confirm such false alarm disappears
- [Ongoing]Regular CI stress test to confirm such fix didn't harm anything
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10892
Reviewed By: ajkr
Differential Revision: D40778965
Pulled By: hx235
fbshipit-source-id: a512364bfdeb0b1a55c171890e60d856c528f37f
Summary:
**Context/Summary:**
This reverts commit fc74abb436 and related HISTORY record.
The issue with PR 10777 or general approach using earliest_mem_seqno like https://github.com/facebook/rocksdb/pull/5958#issue-511150930 is that the earliest seqno of memtable of each CFs does not get persisted and will always start with 0 upon Recover(). Later when creating a new memtable in certain CF, we use the last seqno of the whole DB (but not of that CF from previous DB session) for this CF. This will lead to false positive overlapping seqno and PR 10777 will throw something like https://github.com/facebook/rocksdb/blob/main/db/compaction/compaction_picker.cc#L1002-L1004
Luckily a more elegant and complete solution to the overlapping seqno problem these PR aim to solve does not have above problem, see https://github.com/facebook/rocksdb/pull/10922. It is already being pursued and in the process of review. So we can just revert this PR and focus on getting PR10922 to land.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10999
Test Plan: make check
Reviewed By: anand1976
Differential Revision: D41572604
Pulled By: hx235
fbshipit-source-id: 9d9bdf594abd235e2137045cef513ca0b14e0a3a
Summary:
In MergingIterator, if a range tombstone's start or end key is added to minHeap/maxHeap, the key is copied. This PR removes the copying of range tombstone keys by adding InternalKey comparator that compares `Slice` for internal key and `ParsedInternalKey` directly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10878
Test Plan:
- existing UT
- ran all flavors of stress test through sandcastle
- benchmarks: I did not get improvement when compiling with DEBUG_LEVEL=0, and saw many noise. With `OPTIMIZE_LEVEL="-O3" USE_LTO=1` I do see improvement.
```
# Favorable set up: half of the writes are DeleteRange.
TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone ./db_bench --benchmarks=fillseq,levelstats --writes_per_range_tombstone=1 --max_num_range_tombstones=1000000 --range_tombstone_width=2 --num=1000000 --max_bytes_for_level_base=4194304 --disable_auto_compactions --write_buffer_size=33554432 --key_size=50
# benchmark command
TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone ./db_bench --benchmarks=readseq[-W1][-X5],levelstats --use_existing_db=true --cache_size=3221225472 --disable_auto_compactions=true --avoid_flush_during_recovery=true --seek_nexts=100 --reads=1000000 --num=1000000 --threads=25
# main
readseq [AVG 5 runs] : 26017977 (± 371077) ops/sec; 3721.9 (± 53.1) MB/sec
readseq [MEDIAN 5 runs] : 26096905 ops/sec; 3733.2 MB/sec
# this PR
readseq [AVG 5 runs] : 27481724 (± 568758) ops/sec; 3931.3 (± 81.4) MB/sec
readseq [MEDIAN 5 runs] : 27323957 ops/sec; 3908.7 MB/sec
```
Reviewed By: ajkr
Differential Revision: D40711170
Pulled By: cbi42
fbshipit-source-id: 708cb584e2bd085a9ce0d2ef6a420489f721717f
Summary:
Currently, `iterate_upper_bound` is not checked for range tombstone keys in MergingIterator. This may impact performance when there is a large number of range tombstones right after `iterate_upper_bound`. This PR fixes this issue by checking `iterate_upper_bound` in MergingIterator for range tombstone keys.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10966
Test Plan:
- added unit test
- stress test: `python3 tools/db_crashtest.py whitebox --simple --verify_iterator_with_expected_state_one_in=5 --delrangepercent=5 --prefixpercent=18 --writepercent=48 --readpercen=15 --duration=36000 --range_deletion_width=100`
- ran different stress tests over sandcastle
- Falcon team ran some test traffic and saw reduced CPU usage on processing range tombstones.
Reviewed By: ajkr
Differential Revision: D41414172
Pulled By: cbi42
fbshipit-source-id: 9b2c29eb3abb99327c6a649bdc412e70d863f981
Summary:
Enabled output to penultimate level when file endpoints overlap. This is probably only possible when range tombstones span files. Otherwise the overlapping files would all be included in the penultimate level inputs thanks to our atomic compaction unit logic.
Also, corrected `penultimate_output_range_type_`, which is a minor fix as it appears only used for logging.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10961
Test Plan: updated unit test
Reviewed By: cbi42
Differential Revision: D41370615
Pulled By: ajkr
fbshipit-source-id: 7e75ec369a3b41b8382b336446c81825a4c4f572
Summary:
The check for SST unique IDs added to best-efforts recovery (`Options::best_efforts_recovery` is true).
With best_efforts_recovery being true, RocksDB will recover to the latest point in
MANIFEST such that all valid SST files included up to this point pass unique ID checks as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10962
Test Plan: make check
Reviewed By: pdillinger
Differential Revision: D41378241
Pulled By: riversand963
fbshipit-source-id: a036064e2c17dec13d080a24ef2a9f85d607b16c
Summary:
Closes https://github.com/facebook/rocksdb/issues/9909
- Constructing an Options from a DBOptions should use the Env from the DBOptions
- DBOptions should be constructed with the default Env as the env_, rather than null. Why ever not ?
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10666
Reviewed By: riversand963
Differential Revision: D40515418
Pulled By: ajkr
fbshipit-source-id: 4122ba3f537660720262694c21ab4bfb13b6f8de
Summary:
Since the latency measurement uses real time it is possible for the operation to complete in zero microseconds and then fail these checks. We saw this with the operation that invokes Get() on an invalid CF. This PR relaxes the assertions to allow for operations completing in zero microseconds.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10979
Reviewed By: riversand963
Differential Revision: D41478300
Pulled By: ajkr
fbshipit-source-id: 50ef096bd8f0162b31adb46f54ae6ddc337d0a5e
Summary:
before this PR, if there is a range tombstone-only file generated in penultimate level, it is marked the `last_level_temperature`. This PR fixes this issue.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10972
Test Plan: added unit test for this scenario.
Reviewed By: ajkr
Differential Revision: D41449215
Pulled By: cbi42
fbshipit-source-id: 1e06b5ae3bc0183db2991a45965a9807a7e8be0c
Summary:
Can simplify some ugly code in cache_dump_load_impl.cc by having an API in SecondaryCache that can directly consume persisted data.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10945
Test Plan: existing tests for CacheDumper, added basic unit test
Reviewed By: anand1976
Differential Revision: D41231497
Pulled By: pdillinger
fbshipit-source-id: b8ec993ef7d3e7efd68aae8602fd3f858da58068
Summary:
We were not resetting it in non-debug mode so it could be true once and then stay true for future keys where it should be false. This PR adds the reset logic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10967
Test Plan:
- built `db_bench` with DEBUG_LEVEL=0
- ran benchmark: `TEST_TMPDIR=/dev/shm/prefix ./db_bench -benchmarks=fillrandom -compaction_style=1 -preserve_internal_time_seconds=100 -preclude_last_level_data_seconds=10 -write_buffer_size=1048576 -target_file_size_base=1048576 -subcompactions=8 -duration=120`
- compared "output_to_penultimate_level: X bytes + last: Y bytes" lines in LOG output
- Before this fix, Y was always zero
- After this fix, Y gradually increased throughout the benchmark
Reviewed By: riversand963
Differential Revision: D41417726
Pulled By: ajkr
fbshipit-source-id: ace1e9a289e751a5b0c2fbaa8addd4eda5525329
Summary:
Background. One of the core risks of chosing HyperClockCache is ending up with degraded performance if estimated_entry_charge is very significantly wrong. Too low leads to under-utilized hash table, which wastes a bit of (tracked) memory and likely increases access times due to larger working set size (more TLB misses). Too high leads to fully populated hash table (at some limit with reasonable lookup performance) and not being able to cache as many objects as the memory limit would allow. In either case, performance degradation is graceful/continuous but can be quite significant. For example, cutting block size in half without updating estimated_entry_charge could lead to a large portion of configured block cache memory (up to roughly 1/3) going unused.
Fix. This change adds a mechanism through which the DB periodically probes the block cache(s) for "problems" to report, and adds diagnostics to the HyperClockCache for bad estimated_entry_charge. The periodic probing is currently done with DumpStats / stats_dump_period_sec, and diagnostics reported to info_log (normally LOG file).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10965
Test Plan:
unit test included. Doesn't cover all the implemented subtleties of reporting, but ensures basics of when to report or not.
Also manual testing with db_bench. Create db with
```
./db_bench --benchmarks=fillrandom,flush --num=3000000 --disable_wal=1
```
Use and check LOG file for HyperClockCache for various block sizes (used as estimated_entry_charge)
```
./db_bench --use_existing_db --benchmarks=readrandom --num=3000000 --duration=20 --stats_dump_period_sec=8 --cache_type=hyper_clock_cache -block_size=XXXX
```
Seeing warnings / errors or not as expected.
Reviewed By: anand1976
Differential Revision: D41406932
Pulled By: pdillinger
fbshipit-source-id: 4ca56162b73017e4b9cec2cad74466f49c27a0a7
Summary:
After a couple minor bug fixes and successful productions roll-outs in a few places, I think we can mark this as production-ready. It has a clear value proposition for many workloads, even if we don't have clear advice for every workload yet.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10963
Test Plan: existing tests, comment changes only
Reviewed By: siying
Differential Revision: D41384083
Pulled By: pdillinger
fbshipit-source-id: 56359f01a57bb28de8697666b342382fac72ce6d
Summary:
This was just a stepping stone to what eventually became HyperClockCache, and is now just more code to maintain.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10954
Test Plan: tests updated
Reviewed By: akankshamahajan15
Differential Revision: D41310123
Pulled By: pdillinger
fbshipit-source-id: 618ee148a1a0a29ee756ba8fe28359617b7cd67c
Summary:
No material changes to code or comments, just re-arranging things to prepare for a big refactoring, making it easier to what changed. Some specifics:
* This groups things together in Cache in anticipation of secondary cache features being marked production-ready (vs. experimental).
* CacheEntryRole will be needed in definition of class Cache, so that has been moved above it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10942
Test Plan: existing tests
Reviewed By: anand1976
Differential Revision: D41205509
Pulled By: pdillinger
fbshipit-source-id: 3f2559ab1651c758918dc97056951fa2b5eb0348
Summary:
With the recent changes, `GetMergeOperands` is now supported for wide-column entities as well, so we can use it for verification purposes in the non-batched stress tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10952
Test Plan: Ran a simple non-batched ops blackbox crash test.
Reviewed By: riversand963
Differential Revision: D41292114
Pulled By: ltamasi
fbshipit-source-id: 70b4c756a4a1fecb445c16c7096aad805a51203c
Summary:
Fix db_stress failure in async_io in FilePrefetchBuffer.
From the logs, assertion was caused when
- prev_offset_ = offset but somehow prev_len != 0 and explicit_prefetch_submitted_ = true. That scenario is when we send async request to prefetch buffer during seek but in second seek that data is found in cache. prev_offset_ and prev_len_ get updated but we were not setting explicit_prefetch_submitted_ = false because of which buffers were getting out of sync.
It's possible a read by another thread might have loaded the block into the cache in the meantime.
Particular assertion example:
```
prev_offset: 0, prev_len_: 8097 , offset: 0, length: 8097, actual_length: 8097 , actual_offset: 0 ,
curr_: 0, bufs_[curr_].offset_: 4096 ,bufs_[curr_].CurrentSize(): 48541 , async_len_to_read: 278528, bufs_[curr_].async_in_progress_: false
second: 1, bufs_[second].offset_: 282624 ,bufs_[second].CurrentSize(): 0, async_len_to_read: 262144 ,bufs_[second].async_in_progress_: true ,
explicit_prefetch_submitted_: true , copy_to_third_buffer: false
```
As we can see curr_ was expected to read 278528 but it read 48541. Also buffers are out of sync.
Also `explicit_prefetch_submitted_` is set true but prev_len not 0.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10949
Test Plan:
- Ran db_bench for regression to make sure there is no regression;
- Ran db_stress failing without this fix,
- Ran build-linux-mini-crashtest 7- 8 times locally + CircleCI
Reviewed By: anand1976
Differential Revision: D41257786
Pulled By: akankshamahajan15
fbshipit-source-id: 1d100f94f8c06bbbe4cc76ca27f1bbc820c2494f
Summary:
Add stats for time spent in the ReadAsync call, and async read errors.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10947
Test Plan: Run db_bench and look at stats
Reviewed By: akankshamahajan15
Differential Revision: D41236637
Pulled By: anand1976
fbshipit-source-id: 70539b69a28491d57acead449436a761f7108acf