rocksdb

Commit Graph

Author	SHA1	Message	Date
anand76	dbf37c290a	Fix async prefetch heap use after free (#11049 ) Summary: This PR fixes a heap use after free bug in the async prefetch code that happens in the following scenario - 1. Scan thread starts 2 async reads for Seek, one for the seek block and one for prefetching 2. Before the first read in https://github.com/facebook/rocksdb/issues/1 completes, another thread reads and loads the block in cache 3. The first scan thread finds the block in cache, continues and the next block cache miss is for a block that spans the boundary of the 2 prefetch buffers, and the 1st read is complete but the 2nd one is not complete yet 4. The scan thread will reallocate (i.e free the old buffer and allocate a new one) the 2nd prefetch buffer, and the in-progress prefetch is orphaned 5. The orphaned prefetch finally completes, resulting in a use after free Also add a few asserts to surface bugs earlier in the crash tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11049 Test Plan: Repro with db_stress and verify the fix Reviewed By: akankshamahajan15 Differential Revision: D42181118 Pulled By: anand1976 fbshipit-source-id: 1ac55d2f64a89ce128c1c574262b8aa7d82eb8cc	3 years ago
Changyu Bi	53b703eafe	Fix an assertion failure in `CompactionOutputs::AddRangeDels()` (#11040 ) Summary: the [assertion](`c3f720c60d/db/compaction/compaction_outputs.cc (L643)`) in `CompactionOutputs::AddRangeDels()` can fail after https://github.com/facebook/rocksdb/pull/10802. The assertion fails when `lower_bound_from_range_tombstone` is true during `AddRangeDels()` for a new compaction output file, while the lower bound range tombstone key has seqno 0 and op_type kTypeRangeDeletion. It can have seqno 0 when it was truncated at a point key whose seqno was zeroed out during compaction, the seqno and op_type could be set [here](`c3f720c60d/db/compaction/compaction_outputs.cc (L594)`). This PR fixes the assertion excluding the case when `lower_bound_from_range_tombstone` is true. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11040 Test Plan: CI Reviewed By: ajkr Differential Revision: D42119914 Pulled By: cbi42 fbshipit-source-id: 0897e71b5304cb02aac30f71667b590c37b72baf	3 years ago
ehds	ddad943c29	snapshots of FragmentedRangeTombstoneList must in ascending order (#11046 ) Summary: `snapshots` argument of `FragmentedRangeTombstoneList` should be in ascending order. If we pass it in descending order order， it will not work. for example: ``` auto range_del_iter = MakeRangeDelIter({{"a", "e", 3},{"a","e", 6}}); FragmentedRangeTombstoneList fragment_list( std::move(range_del_iter), bytewise_icmp, true /* for_compaction /, {8 ,7 ,4} / snapshots /); FragmentedRangeTombstoneIterator iter(&fragment_list, bytewise_icmp, kMaxSequenceNumber / upper_bound */); VerifyFragmentedRangeDels(&iter, {{"a", "e", 6}, {"a", "e", 3}}); ``` VerifyFragmentedRangeDels will fail. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11046 Reviewed By: ajkr Differential Revision: D42148654 Pulled By: cbi42 fbshipit-source-id: a2e76f96dccf56fcca1a91cb8da9b99145f68026	3 years ago
anand76	692d6be358	Prevent db_stress failure when io_uring is disabled (#11045 ) Summary: The IO uring usage is disabled in RocksDB by default and, as a result, PosixRandomAccessFile::ReadAsync returns a NotSupported() status. This was causing stress test failures with MultiGet and async_io combination. Fix it by relying on redirection of ReadAsync to Read when default Env is used in db_stress. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11045 Reviewed By: akankshamahajan15 Differential Revision: D42136213 Pulled By: anand1976 fbshipit-source-id: fc7904d8ece74d7e8f2e1a34c3d70bd5774fb45f	3 years ago
anand76	c3f720c60d	Enable ReadAsync testing and fault injection in db_stress (#11037 ) Summary: The db_stress code uses a wrapper Env on top of the raw/fault injection Env. The wrapper, DbStressEnvWrapper, is a legacy Env and thus has a default implementation of ReadAsync that just does a sync read. As a result, the ReadAsync implementations of PosixFileSystem and other file systems weren't being tested. Also, the ReadAsync interface wasn't implemented in FaultInjectionTestFS. This change implements the necessary interfaces in FaultInjectionTestFS and derives DbStressEnvWrapper from FileSystemWrapper rather than EnvWrapper. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11037 Test Plan: Run db_stress standalone and crash test. With this change, db_stress is able to repro the bug fixed in https://github.com/facebook/rocksdb/issues/10890. Reviewed By: akankshamahajan15 Differential Revision: D42061290 Pulled By: anand1976 fbshipit-source-id: 7f0331fd15ee33fb4f7f0f4b22b206fe801ba074	3 years ago
Changyu Bi	f02c708aa3	Consider range tombstone in compaction output file cutting (#10802 ) Summary: This PR is the first step for Issue https://github.com/facebook/rocksdb/issues/4811. Currently compaction output files are cut at point keys, and the decision is made mainly in `CompactionOutputs::ShouldStopBefore()`. This makes it possible for range tombstones to cause large compactions that does not respect `max_compaction_bytes`. For example, we can have a large range tombstone that overlaps with too many files from the next level. Another example is when there is a gap between a range tombstone and another key. The first issue may be more acceptable, as a lot of data is deleted. This PR address the second issue by calling `ShouldStopBefore()` for range tombstone start keys. The main change is for `CompactionIterator` to emit range tombstone start keys to be processed by `CompactionOutputs`. A new `CompactionMergingIterator` is introduced and only used under `CompactionIterator` for this purpose. Further improvement after this PR include 1) cut compaction output at some grandparent boundary key instead of at the next point key or range tombstone start key and 2) cut compaction output file within a large range tombstone (it may be easier and reasonable to only do it for range tombstones at the end of a compaction output). Pull Request resolved: https://github.com/facebook/rocksdb/pull/10802 Test Plan: - added unit tests in db_range_del_test. - stress test: `python3 tools/db_crashtest.py whitebox --[simple\|enable_ts] --verify_iterator_with_expected_state_one_in=5 --delrangepercent=5 --prefixpercent=2 --writepercent=58 --readpercen=21 --duration=36000 --range_deletion_width=1000000` Reviewed By: ajkr, jay-zhuang Differential Revision: D40308827 Pulled By: cbi42 fbshipit-source-id: a8fd6f70a3f09d0ef7a40e006f6c964bba8c00df	3 years ago
sdong	1928902a6f	~SleepingBackgroundTask() to wake up the sleeping task (#11036 ) Summary: Right now, in unit tests, when background tests are sleeping using SleepingBackgroundTask, and the test exits with test assertion failure, the process will hang and it might prevent us to see the test failure message in CI runs. Try to wake up the thread so that the test can exit correctly. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11036 Test Plan: Watch CI succeeds Reviewed By: riversand963 Differential Revision: D42020489 fbshipit-source-id: 5b8441b18d5f67bbb3ade59a1225a8d3c860c2eb	3 years ago
Alan Paxton	6a8920f988	JNI native memory leak - release array elements (#10981 ) Summary: Closes https://github.com/facebook/rocksdb/issues/10980 Reproduced as per the suggestion in the ticket, and `$ jcmd <PID> VM.native_memory \| grep Internal` reports that we are no longer leaking internal memory with the suggested fix. I did the repro in `MultiGetTest.java` which I have optimized imports on. It did not seem helpful to leave the test code around as it would be onerous to build a memory leak reproducer, and regression seems a remote possibility. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10981 Reviewed By: riversand963 Differential Revision: D41498748 Pulled By: ajkr fbshipit-source-id: 8c6dd0d608172879c8bda479c7c9c05c12d34e70	3 years ago
Yanqin Jin	c93ba7db5d	Revise LockWAL/UnlockWAL implementation (#11020 ) Summary: RocksDB has two public APIs: `DB::LockWAL()`/`DB::UnlockWAL()`. The current implementation acquires and releases the internal `DBImpl::log_write_mutex_`. According to the comment on `DBImpl::log_write_mutex_`: https://github.com/facebook/rocksdb/blob/7.8.fb/db/db_impl/db_impl.h#L2287:L2288 > Note: to avoid dealock, if needed to acquire both log_write_mutex_ and mutex_, the order should be first mutex_ and then log_write_mutex_. This puts limitations on how applications can use the `LockWAL()` API. After `LockWAL()` returns ok, then application should not perform any operation that acquires `mutex_`. Currently, the use case of `LockWAL()` is MyRocks implementing the MySQL storage engine handlerton `lock_hton_log` interface. The operation that MyRocks performs after `LockWAL()` is `GetSortedWalFiless()` which not only acquires mutex_, but also `log_write_mutex_`. There are two issues: 1. Applications using these two APIs may hang if one thread calls `GetSortedWalFiles()` after calling `LockWAL()` because log_write_mutex is not recursive. 2. Two threads may dead lock due to lock order inversion. To fix these issues, we can modify the implementation of LockWAL so that it does not keep `log_write_mutex_` held until UnlockWAL. To achieve the goal of locking the WAL, we can instead manually inject a write stall so that all future writes will be stopped. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11020 Test Plan: make check Reviewed By: ajkr Differential Revision: D41785203 Pulled By: riversand963 fbshipit-source-id: 5ccb7a9c6eb9a2c3fa80fd2c399cc2568b8f89ce	3 years ago
Hui Xiao	98d5db5c2e	Sort L0 files by newly introduced epoch_num (#10922 ) Summary: Context: Sorting L0 files by `largest_seqno` has at least two inconvenience: - File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap. - For example, consider the following sequence of events ("key@n" indicates key at seqno "n") - insert k1@1 to memtable m1 - ingest file s1 with k2@2, ingest file s2 with k3@3 - insert k4@4 to m1 - compact files s1, s2 and result in new file s3 of seqno range [2, 3] - flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1 - However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption. - Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption - For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr for this example) - an existing SST s1 contains only k1@1 - insert k1@2 to memtable m1 - ingest file s2 with k3@3, ingest file s3 with k4@4 - insert single delete k5@5 in m1 - flush m1 and result in new file s4 of seqno range [2, 5] - compact s1, s2, s3 and result in new file s5 of seqno range [1, 4] - compact s4 and result in new file s6 of seqno range [2] due to single delete - By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno` Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways: - In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number` ordering check. This will result in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption` for more. - In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption. Summary:* - Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`. - `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`) - Compaction output file is assigned with the minimum `epoch_number` among input files' - Refit level: reuse refitted file's epoch_number - Other paths needing `epoch_number` treatment: - Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo` - Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`. - Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair). - Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder. - Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery - Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more - Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag` - Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above - Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`. - Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR. - Misc: - update existing tests with `epoch_number` so make check will pass - update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases - assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber() Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922 Test Plan: - `make check` - New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc` - Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930 - [Ongoing] Compatibility test: manually run `36a5686ec0` (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox` - [Ongoing] normal db stress test - [Ongoing] db stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761 Reviewed By: ajkr Differential Revision: D41063187 Pulled By: hx235 fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9	3 years ago
Peter Dillinger	9b34c097a1	Fix bug updating latest backup on delete (#11029 ) Summary: Previously, the "latest" valid backup would not be updated on delete. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11029 Test Plan: unit test included (added to an existing test for efficiency) Reviewed By: hx235 Differential Revision: D41967925 Pulled By: pdillinger fbshipit-source-id: ca143354d281eb979557ea421902cd26803a1137	3 years ago
Arvid Lunnemark	00238a386b	replace sprintf with its safe version snprintf (v2) (#11011 ) Summary: same motivations as https://github.com/facebook/rocksdb/pull/5475, applied to the last remaining `sprintf`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11011 Reviewed By: pdillinger Differential Revision: D41673500 Pulled By: ajkr fbshipit-source-id: 88618ea791cafad86a9a491799c45979d46e3544	3 years ago
Jay Zhuang	1078d860a9	Add an unittest for Periodic compaction conflict with ongoing compaction (#10908 ) Summary: Add a tiered storage migration test which would conflict with an ongoing penultimate level compaction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10908 Test Plan: Test only change Reviewed By: anand1976 Differential Revision: D40864509 Pulled By: ajkr fbshipit-source-id: e316e849a01a6c71a41be130101f909b6c0498cb	3 years ago
Andrew Kryczka	4d60cbc629	Use VersionBuilder for CF import ordering/validation (#11028 ) Summary: Besides the existing ordering and validation, more is coming to VersionBuilder/VersionStorageInfo, like migration of epoch_numbers from older RocksDB versions. We should start using those common classes for importing CFs, instead of duplicating their ordering, validation, and migration logic. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11028 Test Plan: rely on existing tests Reviewed By: hx235 Differential Revision: D41865427 Pulled By: ajkr fbshipit-source-id: 873f5cd87b8902a2380c3b71373ce0b0db3a0c50	3 years ago
Peter Dillinger	433d7e4594	Improve error messages for SST footer and size errors (#11009 ) Summary: Previously, you could get a format_version error if SST file size was too small in manifest, or a weird "too short" error if too big in manifest. Now we ensure: * Magic number error is reported first if we attempt to open an SST file and the footer is completely bad. * Footer errors are reported with affected file. * If manifest file size doesn't match actual, then the error includes expected and actual sizes (if an error is reported; in some cases we allow the file to be too big) Pull Request resolved: https://github.com/facebook/rocksdb/pull/11009 Test Plan: unit tests added, some manual Previously, the code for "file too short" in footer processing was only covered by some tests attempting to verify SST checksums on non-SST files (fixed). Reviewed By: siying Differential Revision: D41656272 Pulled By: pdillinger fbshipit-source-id: 3da32702eb5aaedbea0e5e74742ad57edd7ad3df	3 years ago
dependabot[bot]	6648dec0a3	Bump nokogiri from 1.13.9 to 1.13.10 in /docs (#11024 ) Summary: Bumps [nokogiri](https://github.com/sparklemotion/nokogiri) from 1.13.9 to 1.13.10. <details> <summary>Release notes</summary> <p><em>Sourced from <a href="https://github.com/sparklemotion/nokogiri/releases">nokogiri's releases</a>.</em></p> <blockquote> <h2>1.13.10 / 2022-12-07</h2> <h3>Security</h3> <ul> <li>[CRuby] Address CVE-2022-23476, unchecked return value from <code>xmlTextReaderExpand</code>. See <a href="https://github.com/sparklemotion/nokogiri/security/advisories/GHSA-qv4q-mr5r-qprj">GHSA-qv4q-mr5r-qprj</a> for more information.</li> </ul> <h3>Improvements</h3> <ul> <li>[CRuby] <code>XML::Reader#attribute_hash</code> now returns <code>nil</code> on parse errors. This restores the behavior of <code>#attributes</code> from v1.13.7 and earlier. [<a href="https://github-redirect.dependabot.com/sparklemotion/nokogiri/issues/2715">https://github.com/facebook/rocksdb/issues/2715</a>]</li> </ul> <hr /> <p>sha256 checksums:</p> <pre><code>777ce2e80f64772e91459b943e531dfef387e768f2255f9bc7a1655f254bbaa1 nokogiri-1.13.10-aarch64-linux.gem b432ff47c51386e07f7e275374fe031c1349e37eaef2216759063bc5fa5624aa nokogiri-1.13.10-arm64-darwin.gem 73ac581ddcb680a912e92da928ffdbac7b36afd3368418f2cee861b96e8c830b nokogiri-1.13.10-java.gem 916aa17e624611dddbf2976ecce1b4a80633c6378f8465cff0efab022ebc2900 nokogiri-1.13.10-x64-mingw-ucrt.gem 0f85a1ad8c2b02c166a6637237133505b71a05f1bb41b91447005449769bced0 nokogiri-1.13.10-x64-mingw32.gem 91fa3a8724a1ce20fccbd718dafd9acbde099258183ac486992a61b00bb17020 nokogiri-1.13.10-x86-linux.gem d6663f5900ccd8f72d43660d7f082565b7ffcaade0b9a59a74b3ef8791034168 nokogiri-1.13.10-x86-mingw32.gem 81755fc4b8130ef9678c76a2e5af3db7a0a6664b3cba7d9fe8ef75e7d979e91b nokogiri-1.13.10-x86_64-darwin.gem 51d5246705dedad0a09b374d09cc193e7383a5dd32136a690a3cd56e95adf0a3 nokogiri-1.13.10-x86_64-linux.gem d3ee00f26c151763da1691c7fc6871ddd03e532f74f85101f5acedc2d099e958 nokogiri-1.13.10.gem </code></pre> </blockquote> </details> <details> <summary>Changelog</summary> <p><em>Sourced from <a href="https://github.com/sparklemotion/nokogiri/blob/main/CHANGELOG.md">nokogiri's changelog</a>.</em></p> <blockquote> <h2>1.13.10 / 2022-12-07</h2> <h3>Security</h3> <ul> <li>[CRuby] Address CVE-2022-23476, unchecked return value from <code>xmlTextReaderExpand</code>. See <a href="https://github.com/sparklemotion/nokogiri/security/advisories/GHSA-qv4q-mr5r-qprj">GHSA-qv4q-mr5r-qprj</a> for more information.</li> </ul> <h3>Improvements</h3> <ul> <li>[CRuby] <code>XML::Reader#attribute_hash</code> now returns <code>nil</code> on parse errors. This restores the behavior of <code>#attributes</code> from v1.13.7 and earlier. [<a href="https://github-redirect.dependabot.com/sparklemotion/nokogiri/issues/2715">https://github.com/facebook/rocksdb/issues/2715</a>]</li> </ul> </blockquote> </details> <details> <summary>Commits</summary> <ul> <li><a href="`4c80121dc3`"><code>4c80121</code></a> version bump to v1.13.10</li> <li><a href="`85410e3841`"><code>85410e3</code></a> Merge pull request <a href="https://github-redirect.dependabot.com/sparklemotion/nokogiri/issues/2715">https://github.com/facebook/rocksdb/issues/2715</a> from sparklemotion/flavorjones-fix-reader-error-hand...</li> <li><a href="`9fe0761c47`"><code>9fe0761</code></a> fix(cruby): XML::Reader#attribute_hash returns nil on error</li> <li><a href="`3b9c736bee`"><code>3b9c736</code></a> Merge pull request <a href="https://github-redirect.dependabot.com/sparklemotion/nokogiri/issues/2717">https://github.com/facebook/rocksdb/issues/2717</a> from sparklemotion/flavorjones-lock-psych-to-fix-bui...</li> <li><a href="`2efa87b49a`"><code>2efa87b</code></a> test: skip large cdata test on system libxml2</li> <li><a href="`3187d6739c`"><code>3187d67</code></a> dep(dev): pin psych to v4 until v5 builds in CI</li> <li><a href="`a16b4bf14c`"><code>a16b4bf</code></a> style(rubocop): disable Minitest/EmptyLineBeforeAssertionMethods</li> <li>See full diff in <a href="https://github.com/sparklemotion/nokogiri/compare/v1.13.9...v1.13.10">compare view</a></li> </ul> </details> <br /> [![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=nokogiri&package-manager=bundler&previous-version=1.13.9&new-version=1.13.10)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores) Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- <details> <summary>Dependabot commands and options</summary> <br /> You can trigger Dependabot actions by commenting on this PR: - `dependabot rebase` will rebase this PR - `dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `dependabot merge` will merge this PR after your CI passes on it - `dependabot squash and merge` will squash and merge this PR after your CI passes on it - `dependabot cancel merge` will cancel a previously requested merge and block automerging - `dependabot reopen` will reopen this PR if it is closed - `dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) - `dependabot use these labels` will set the current labels as the default for future PRs for this repo and language - `dependabot use these reviewers` will set the current reviewers as the default for future PRs for this repo and language - `dependabot use these assignees` will set the current assignees as the default for future PRs for this repo and language - `dependabot use this milestone` will set the current milestone as the default for future PRs for this repo and language You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/facebook/rocksdb/network/alerts). </details> Pull Request resolved: https://github.com/facebook/rocksdb/pull/11024 Reviewed By: ltamasi Differential Revision: D41830779 Pulled By: riversand963 fbshipit-source-id: d5ec751cfb947413a6eabe412871bf30b9db5674	3 years ago
Hui Xiao	5f52dbc505	Move two history entries mistake out of 7.9 section (#11013 ) Summary: https://github.com/facebook/rocksdb/pull/10892 and https://github.com/facebook/rocksdb/pull/10955 mistakenly added two entries under sealed 7.9.history section. This PR fixes these two. No need to update 7.9 branch (https://github.com/facebook/rocksdb/blob/7.9.fb/HISTORY.md) cuz it's cut before these two PRs landed Pull Request resolved: https://github.com/facebook/rocksdb/pull/11013 Reviewed By: cbi42 Differential Revision: D41666514 Pulled By: hx235 fbshipit-source-id: c4bc7a29ff663664bf0be1ba1c7eab4d00a61528	3 years ago
Hui Xiao	15bb4ea084	Deflake DBWALTest.FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL (#11016 ) Summary: Context/Summary: Credit to ajkr's https://github.com/facebook/rocksdb/pull/11016#pullrequestreview-1205020134, flaky test https://app.circleci.com/pipelines/github/facebook/rocksdb/21985/workflows/5f6cc355-78c1-46d8-89ee-0fd679725a8a/jobs/540878 is due to `Flush()` called in the test returned earlier than obsoleted WAL being found in background flush and SyncWAL() was called (i.e, "sync_point_called" sets to true). Fix this by making checking `sync_point_called == true` after obsoleted WAL is found and `SyncWAL()` is called. Also rename the "sync_point_called" to be something more specific. Also, fix a potential flakiness due to manually setting a log threshold to force new manifest creation. This is unreliable so I decided to use sync point to force new manifest creation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11016 Test Plan: make check Reviewed By: pdillinger Differential Revision: D41717786 Pulled By: hx235 fbshipit-source-id: ad1e4701a987285bbe6c8e7d9b05c4db06b4edf4	3 years ago
Changyu Bi	23af6786a9	Fix an assertion failure in `TimestampTablePropertiesCollector` for empty output (#11015 ) Summary: when the compaction output file is empty, the assertion in `TimestampTablePropertiesCollector::Finish()` breaks. This PR fixes this assert and added unit test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11015 Test Plan: added UT. Reviewed By: ajkr Differential Revision: D41716719 Pulled By: cbi42 fbshipit-source-id: d891d46be4c4805e3d49be6b80c9d75f1bd51080	3 years ago
anand76	8ffabdc226	Fix table cache leak in MultiGet with async_io (#10997 ) Summary: When MultiGet with the async_io option encounters an IO error in TableCache::GetTableReader, it may result in leakage of table cache handles due to queued coroutines being abandoned. This PR fixes it by ensuring any queued coroutines are run before aborting the MultiGet. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10997 Test Plan: 1. New unit test in db_basic_test 2. asan_crash Reviewed By: pdillinger Differential Revision: D41587244 Pulled By: anand1976 fbshipit-source-id: 900920cd3fba47cb0fc744a62facc5ffe2eccb64	3 years ago
Peter Dillinger	95bf302189	Fix use of make_unique in Arena::AllocateNewBlock (#11012 ) Summary: The change to `make_unique<char[]>` in https://github.com/facebook/rocksdb/issues/10810 inadvertently started initializing data in Arena blocks, which could lead to increased memory use due to (at least on our implementation) force-mapping pages as a result. This change goes back to `new char[]` while keeping all the other good parts of https://github.com/facebook/rocksdb/issues/10810. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11012 Test Plan: unit test added (fails on Linux before fix) Reviewed By: anand1976 Differential Revision: D41658893 Pulled By: pdillinger fbshipit-source-id: 267b7dccfadaeeb1be767d43c602a6abb0e71cd0	3 years ago
WLeoo	be3a62a2e7	Fix an uninitialized variable warning for g++ 12.2.0 (#10995 ) Summary: /home/wl/rocksdbtry/rocksdb-WL/util/bloom_test.cc: In constructor ‘rocksdb::RawFilterTester::RawFilterTester()’: /home/wl/rocksdbtry/rocksdb-WL/util/bloom_test.cc:813:40: error: member ‘rocksdb::RawFilterTester::data_’ is used uninitialized [-Werror=uninitialized] Pull Request resolved: https://github.com/facebook/rocksdb/pull/10995 Reviewed By: cbi42 Differential Revision: D41620186 Pulled By: ajkr fbshipit-source-id: a6ebd3820ef12e0af322cbfb7eb553de5bdfcb29	3 years ago
Jay Zhuang	c068799d0c	Blog: Time Aware Tiered Storage in RocksDB (#11005 ) Summary: Blog for time aware tiered storage Pull Request resolved: https://github.com/facebook/rocksdb/pull/11005 Test Plan: document only Reviewed By: siying Differential Revision: D41620430 Pulled By: ajkr fbshipit-source-id: d85406a50a7b1553495f2f2f72143dbd90101b01	3 years ago
Hui Xiao	2f76ab150d	Fix missing WAL in new manifest by rolling over the WAL deletion record from prev manifest (#10892 ) Summary: Context `Options::track_and_verify_wals_in_manifest = true` verifies each of the WALs tracked in manifest indeed presents in the WAL folder. If not, a corruption "Missing WAL with log number" will be thrown. `DB::SyncWAL()` called at a specific timing (i.e, at the `TEST_SYNC_POINT("FindObsoleteFiles::PostMutexUnlock")`) can record in a new manifest the WAL addition of a WAL file that already had a WAL deletion recorded in the previous manifest. And the WAL deletion record is not rollover-ed to the new manifest. So the new manifest creates the illusion of such WAL never gets deleted and should presents at db re/open. - Such WAL deletion record can be caused by flushing the memtable associated with that WAL and such WAL deletion can actually happen in` PurgeObsoleteFiles()`. As a consequence, upon `DB::Reopen()`, this WAL file can be deleted while manifest still has its WAL addition record , which causes a false alarm of corruption "Missing WAL with log number" to be thrown. Summary This PR fixes this false alarm by rolling over the WAL deletion record from prev manifest to the new manifest by adding the WAL deletion record to the new manifest. Test - Make check - Added new unit test `TEST_F(DBWALTest, FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL)` that failed before the fix and passed after - [Ongoing]CI stress test + aggressive value as in https://github.com/facebook/rocksdb/pull/10761 , which is how this false alarm was first surfaced, to confirm such false alarm disappears - [Ongoing]Regular CI stress test to confirm such fix didn't harm anything Pull Request resolved: https://github.com/facebook/rocksdb/pull/10892 Reviewed By: ajkr Differential Revision: D40778965 Pulled By: hx235 fbshipit-source-id: a512364bfdeb0b1a55c171890e60d856c528f37f	3 years ago
Hui Xiao	f1574a20ff	Revert PR 10777 "Fix FIFO causing overlapping seqnos in L0 files due to overla…" (#10999 ) Summary: Context/Summary: This reverts commit `fc74abb436` and related HISTORY record. The issue with PR 10777 or general approach using earliest_mem_seqno like https://github.com/facebook/rocksdb/pull/5958#issue-511150930 is that the earliest seqno of memtable of each CFs does not get persisted and will always start with 0 upon Recover(). Later when creating a new memtable in certain CF, we use the last seqno of the whole DB (but not of that CF from previous DB session) for this CF. This will lead to false positive overlapping seqno and PR 10777 will throw something like https://github.com/facebook/rocksdb/blob/main/db/compaction/compaction_picker.cc#L1002-L1004 Luckily a more elegant and complete solution to the overlapping seqno problem these PR aim to solve does not have above problem, see https://github.com/facebook/rocksdb/pull/10922. It is already being pursued and in the process of review. So we can just revert this PR and focus on getting PR10922 to land. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10999 Test Plan: make check Reviewed By: anand1976 Differential Revision: D41572604 Pulled By: hx235 fbshipit-source-id: 9d9bdf594abd235e2137045cef513ca0b14e0a3a	3 years ago
Changyu Bi	6cdb7af9f8	Remove copying of range tombstones keys in iterator (#10878 ) Summary: In MergingIterator, if a range tombstone's start or end key is added to minHeap/maxHeap, the key is copied. This PR removes the copying of range tombstone keys by adding InternalKey comparator that compares `Slice` for internal key and `ParsedInternalKey` directly. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10878 Test Plan: - existing UT - ran all flavors of stress test through sandcastle - benchmarks: I did not get improvement when compiling with DEBUG_LEVEL=0, and saw many noise. With `OPTIMIZE_LEVEL="-O3" USE_LTO=1` I do see improvement. ``` # Favorable set up: half of the writes are DeleteRange. TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone ./db_bench --benchmarks=fillseq,levelstats --writes_per_range_tombstone=1 --max_num_range_tombstones=1000000 --range_tombstone_width=2 --num=1000000 --max_bytes_for_level_base=4194304 --disable_auto_compactions --write_buffer_size=33554432 --key_size=50 # benchmark command TEST_TMPDIR=/tmp/rocksdb-rangedel-test-all-tombstone ./db_bench --benchmarks=readseq[-W1][-X5],levelstats --use_existing_db=true --cache_size=3221225472 --disable_auto_compactions=true --avoid_flush_during_recovery=true --seek_nexts=100 --reads=1000000 --num=1000000 --threads=25 # main readseq [AVG 5 runs] : 26017977 (± 371077) ops/sec; 3721.9 (± 53.1) MB/sec readseq [MEDIAN 5 runs] : 26096905 ops/sec; 3733.2 MB/sec # this PR readseq [AVG 5 runs] : 27481724 (± 568758) ops/sec; 3931.3 (± 81.4) MB/sec readseq [MEDIAN 5 runs] : 27323957 ops/sec; 3908.7 MB/sec ``` Reviewed By: ajkr Differential Revision: D40711170 Pulled By: cbi42 fbshipit-source-id: 708cb584e2bd085a9ce0d2ef6a420489f721717f	3 years ago
Hui Xiao	d8c043f7ad	Trigger FIFO file deletion in non L0 only if exceeding max_table_files_size (#10955 ) Summary: Context https://github.com/facebook/rocksdb/pull/10348 allows multi-level FIFO but accidentally made change to the logic of deleting files in `FIFOCompactionPicker::PickSizeCompaction`. With [this](https://github.com/facebook/rocksdb/pull/10348/files#diff-d8fb3d50749aa69b378de447e3d9cf2f48abe0281437f010b5d61365a7b813fdR156) and [this](https://github.com/facebook/rocksdb/pull/10348/files#diff-d8fb3d50749aa69b378de447e3d9cf2f48abe0281437f010b5d61365a7b813fdR235) together, it deletes one file in non-L0 even when `total_size <= mutable_cf_options.compaction_options_fifo.max_table_files_size`, which is incorrect. As a consequence, FIFO exercises more file deletion in our crash testing, which is not able to verify correctly on deleted keys in the file deleted by compaction. This results in errors `error : inconsistent values for key 000000000000239F000000000000012B000000000000028B: expected state has the key, Get() returns NotFound. Verification failed :(` or `Expected state has key 00000000000023A90000000000000003787878, iterator is at key 00000000000023A9000000000000004178 Column family: default, op_logs: S 00000000000023A90000000000000003787878` Summary: - Delete file for non-L0 only if `total_size <= mutable_cf_options.compaction_options_fifo.max_table_files_size` - Add some helpful log to LOG file Pull Request resolved: https://github.com/facebook/rocksdb/pull/10955 Test Plan: - Errors repro-ed by ``` ./db_stress --preserve_unverified_changes=1 --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --async_io=0 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=10 --bottommost_compression_type=none --bytes_per_sync=0 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=3 --compaction_style=2 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8589934591 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=xpress --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=1048576 --delpercent=0 --delrangepercent=0 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=xxh64 --flush_one_in=1000000 --format_version=4 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=10 --index_type=2 --ingest_external_file_one_in=1000000 --initial_auto_readahead_size=16384 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=False --log2_keys_per_lock=10 --long_running_snapshots=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=524288 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=0 --num_file_reads_for_auto_readahead=2 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=40000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=7 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=1000 --readahead_size=0 --readpercent=65 --recycle_log_file_num=1 --reopen=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=0 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=1 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=1 --use_full_merge_v1=1 --use_merge=0 --use_multiget=0 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_iterator_with_expected_state_one_in=0 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=1 --writepercent=20 ``` is gone after this fix - CI Reviewed By: ajkr Differential Revision: D41319441 Pulled By: hx235 fbshipit-source-id: 6939753767007f7449ea7055b1420aabd03d7709	3 years ago
relife22	ed23fd7591	Add Apache Spark as a user (#10993 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10993 Reviewed By: ajkr Differential Revision: D41543962 Pulled By: cbi42 fbshipit-source-id: a895d7863543bd64734c5c9faa7b55b0732b3d60	3 years ago
Changyu Bi	534fb06dd3	Prevent iterating over range tombstones beyond `iterate_upper_bound` (#10966 ) Summary: Currently, `iterate_upper_bound` is not checked for range tombstone keys in MergingIterator. This may impact performance when there is a large number of range tombstones right after `iterate_upper_bound`. This PR fixes this issue by checking `iterate_upper_bound` in MergingIterator for range tombstone keys. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10966 Test Plan: - added unit test - stress test: `python3 tools/db_crashtest.py whitebox --simple --verify_iterator_with_expected_state_one_in=5 --delrangepercent=5 --prefixpercent=18 --writepercent=48 --readpercen=15 --duration=36000 --range_deletion_width=100` - ran different stress tests over sandcastle - Falcon team ran some test traffic and saw reduced CPU usage on processing range tombstones. Reviewed By: ajkr Differential Revision: D41414172 Pulled By: cbi42 fbshipit-source-id: 9b2c29eb3abb99327c6a649bdc412e70d863f981	3 years ago
Andrew Kryczka	54c2542df2	Support tiering when file endpoints overlap (#10961 ) Summary: Enabled output to penultimate level when file endpoints overlap. This is probably only possible when range tombstones span files. Otherwise the overlapping files would all be included in the penultimate level inputs thanks to our atomic compaction unit logic. Also, corrected `penultimate_output_range_type_`, which is a minor fix as it appears only used for logging. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10961 Test Plan: updated unit test Reviewed By: cbi42 Differential Revision: D41370615 Pulled By: ajkr fbshipit-source-id: 7e75ec369a3b41b8382b336446c81825a4c4f572	3 years ago
Yanqin Jin	3d0d6b8140	Make best-efforts recovery verify SST unique ID before Version construction (#10962 ) Summary: The check for SST unique IDs added to best-efforts recovery (`Options::best_efforts_recovery` is true). With best_efforts_recovery being true, RocksDB will recover to the latest point in MANIFEST such that all valid SST files included up to this point pass unique ID checks as well. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10962 Test Plan: make check Reviewed By: pdillinger Differential Revision: D41378241 Pulled By: riversand963 fbshipit-source-id: a036064e2c17dec13d080a24ef2a9f85d607b16c	3 years ago
jsteemann	d8e792e4cf	fix compile warnings (#10976 ) Summary: Fixes lots of compile warnings related to missing override specifiers, e.g. ``` ./3rdParty/rocksdb/trace_replay/block_cache_tracer.h:130:10: warning: ‘virtual rocksdb::Status rocksdb::BlockCacheTraceWriterImpl::WriteBlockAccess(const rocksdb::BlockCacheTraceRecord&, const rocksdb::Slice&, const rocksdb::Slice&, const rocksdb::Slice&)’ can be marked override [-Wsuggest-override] 130 \| Status WriteBlockAccess(const BlockCacheTraceRecord& record, \| ^~~~~~~~~~~~~~~~ ./3rdParty/rocksdb/trace_replay/block_cache_tracer.h:136:10: warning: ‘virtual rocksdb::Status rocksdb::BlockCacheTraceWriterImpl::WriteHeader()’ can be marked override [-Wsuggest-override] 136 \| Status WriteHeader(); \| ^~~~~~~~~~~ ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10976 Reviewed By: riversand963 Differential Revision: D41478588 Pulled By: ajkr fbshipit-source-id: d30b0457241999e38b16aacf6dabe3e691f7c46f	3 years ago
Alan Paxton	ae115eff8f	improve copying of Env in Options (#10666 ) Summary: Closes https://github.com/facebook/rocksdb/issues/9909 - Constructing an Options from a DBOptions should use the Env from the DBOptions - DBOptions should be constructed with the default Env as the env_, rather than null. Why ever not ? Pull Request resolved: https://github.com/facebook/rocksdb/pull/10666 Reviewed By: riversand963 Differential Revision: D40515418 Pulled By: ajkr fbshipit-source-id: 4122ba3f537660720262694c21ab4bfb13b6f8de	3 years ago
Andrew Kryczka	db9cbddc6f	Deflake DBTest2.TraceAndReplay by relaxing latency checks (#10979 ) Summary: Since the latency measurement uses real time it is possible for the operation to complete in zero microseconds and then fail these checks. We saw this with the operation that invokes Get() on an invalid CF. This PR relaxes the assertions to allow for operations completing in zero microseconds. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10979 Reviewed By: riversand963 Differential Revision: D41478300 Pulled By: ajkr fbshipit-source-id: 50ef096bd8f0162b31adb46f54ae6ddc337d0a5e	3 years ago
anand76	f4cfcfe824	Post 7.9.0 release branch cut updates (#10974 ) Summary: Update HISTORY.md, version.h, and check_format_compatible.sh Pull Request resolved: https://github.com/facebook/rocksdb/pull/10974 Reviewed By: akankshamahajan15 Differential Revision: D41455289 Pulled By: anand1976 fbshipit-source-id: 99888ebcb9109e5ced80584a66b20123f8783c0b	3 years ago
Changyu Bi	6c5ec92070	Set correct temperature for range tombstone only file in penultimate level (#10972 ) Summary: before this PR, if there is a range tombstone-only file generated in penultimate level, it is marked the `last_level_temperature`. This PR fixes this issue. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10972 Test Plan: added unit test for this scenario. Reviewed By: ajkr Differential Revision: D41449215 Pulled By: cbi42 fbshipit-source-id: 1e06b5ae3bc0183db2991a45965a9807a7e8be0c	3 years ago
anand76	3ff6da6bd5	Update HISTORY.md for 7.9.0 (#10973 ) Summary: Update HISTORY.md for 7.9.0 release. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10973 Reviewed By: pdillinger Differential Revision: D41453720 Pulled By: anand1976 fbshipit-source-id: 47a23d4b6539ec6a9a09c9e69c026f7c8b10afa7	3 years ago
Peter Dillinger	e079d562af	Add a SecondaryCache::InsertSaved() API, use in CacheDumper impl (#10945 ) Summary: Can simplify some ugly code in cache_dump_load_impl.cc by having an API in SecondaryCache that can directly consume persisted data. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10945 Test Plan: existing tests for CacheDumper, added basic unit test Reviewed By: anand1976 Differential Revision: D41231497 Pulled By: pdillinger fbshipit-source-id: b8ec993ef7d3e7efd68aae8602fd3f858da58068	3 years ago
Andrew Kryczka	097f9f4425	Fix CompactionIterator flag for penultimate level output (#10967 ) Summary: We were not resetting it in non-debug mode so it could be true once and then stay true for future keys where it should be false. This PR adds the reset logic. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10967 Test Plan: - built `db_bench` with DEBUG_LEVEL=0 - ran benchmark: `TEST_TMPDIR=/dev/shm/prefix ./db_bench -benchmarks=fillrandom -compaction_style=1 -preserve_internal_time_seconds=100 -preclude_last_level_data_seconds=10 -write_buffer_size=1048576 -target_file_size_base=1048576 -subcompactions=8 -duration=120` - compared "output_to_penultimate_level: X bytes + last: Y bytes" lines in LOG output - Before this fix, Y was always zero - After this fix, Y gradually increased throughout the benchmark Reviewed By: riversand963 Differential Revision: D41417726 Pulled By: ajkr fbshipit-source-id: ace1e9a289e751a5b0c2fbaa8addd4eda5525329	3 years ago
Peter Dillinger	3182beeffc	Observe and warn about misconfigured HyperClockCache (#10965 ) Summary: Background. One of the core risks of chosing HyperClockCache is ending up with degraded performance if estimated_entry_charge is very significantly wrong. Too low leads to under-utilized hash table, which wastes a bit of (tracked) memory and likely increases access times due to larger working set size (more TLB misses). Too high leads to fully populated hash table (at some limit with reasonable lookup performance) and not being able to cache as many objects as the memory limit would allow. In either case, performance degradation is graceful/continuous but can be quite significant. For example, cutting block size in half without updating estimated_entry_charge could lead to a large portion of configured block cache memory (up to roughly 1/3) going unused. Fix. This change adds a mechanism through which the DB periodically probes the block cache(s) for "problems" to report, and adds diagnostics to the HyperClockCache for bad estimated_entry_charge. The periodic probing is currently done with DumpStats / stats_dump_period_sec, and diagnostics reported to info_log (normally LOG file). Pull Request resolved: https://github.com/facebook/rocksdb/pull/10965 Test Plan: unit test included. Doesn't cover all the implemented subtleties of reporting, but ensures basics of when to report or not. Also manual testing with db_bench. Create db with ``` ./db_bench --benchmarks=fillrandom,flush --num=3000000 --disable_wal=1 ``` Use and check LOG file for HyperClockCache for various block sizes (used as estimated_entry_charge) ``` ./db_bench --use_existing_db --benchmarks=readrandom --num=3000000 --duration=20 --stats_dump_period_sec=8 --cache_type=hyper_clock_cache -block_size=XXXX ``` Seeing warnings / errors or not as expected. Reviewed By: anand1976 Differential Revision: D41406932 Pulled By: pdillinger fbshipit-source-id: 4ca56162b73017e4b9cec2cad74466f49c27a0a7	3 years ago
Yanqin Jin	a8a4ed52a4	Test Merge with timestamps in stress test (#10948 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10948 Test Plan: make crash_test_with_ts Reviewed By: ltamasi Differential Revision: D41390854 Pulled By: riversand963 fbshipit-source-id: 599e114da8e2b2bbff5628fb8c67fa0393a31c05	3 years ago
Peter Dillinger	8c0f5b1fcf	Mark HyperClockCache as production-ready (#10963 ) Summary: After a couple minor bug fixes and successful productions roll-outs in a few places, I think we can mark this as production-ready. It has a clear value proposition for many workloads, even if we don't have clear advice for every workload yet. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10963 Test Plan: existing tests, comment changes only Reviewed By: siying Differential Revision: D41384083 Pulled By: pdillinger fbshipit-source-id: 56359f01a57bb28de8697666b342382fac72ce6d	3 years ago
Levi Tamasi	8fa8780932	Mention wide-column support in HISTORY.md (#10959 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10959 Reviewed By: akankshamahajan15 Differential Revision: D41348198 Pulled By: ltamasi fbshipit-source-id: 51e89d03c1fe87f576a766f609a7f233a519c83d	3 years ago
Peter Dillinger	32520df1d9	Remove prototype FastLRUCache (#10954 ) Summary: This was just a stepping stone to what eventually became HyperClockCache, and is now just more code to maintain. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10954 Test Plan: tests updated Reviewed By: akankshamahajan15 Differential Revision: D41310123 Pulled By: pdillinger fbshipit-source-id: 618ee148a1a0a29ee756ba8fe28359617b7cd67c	3 years ago
Peter Dillinger	b55e70357c	Re-arrange cache.h to prepare for refactoring (#10942 ) Summary: No material changes to code or comments, just re-arranging things to prepare for a big refactoring, making it easier to what changed. Some specifics: * This groups things together in Cache in anticipation of secondary cache features being marked production-ready (vs. experimental). * CacheEntryRole will be needed in definition of class Cache, so that has been moved above it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10942 Test Plan: existing tests Reviewed By: anand1976 Differential Revision: D41205509 Pulled By: pdillinger fbshipit-source-id: 3f2559ab1651c758918dc97056951fa2b5eb0348	3 years ago
Levi Tamasi	b644baa1eb	Support using GetMergeOperands for verification with wide columns (#10952 ) Summary: With the recent changes, `GetMergeOperands` is now supported for wide-column entities as well, so we can use it for verification purposes in the non-batched stress tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10952 Test Plan: Ran a simple non-batched ops blackbox crash test. Reviewed By: riversand963 Differential Revision: D41292114 Pulled By: ltamasi fbshipit-source-id: 70b4c756a4a1fecb445c16c7096aad805a51203c	3 years ago
Akanksha Mahajan	1562524e63	Fix db_stress failure in async_io in FilePrefetchBuffer (#10949 ) Summary: Fix db_stress failure in async_io in FilePrefetchBuffer. From the logs, assertion was caused when - prev_offset_ = offset but somehow prev_len != 0 and explicit_prefetch_submitted_ = true. That scenario is when we send async request to prefetch buffer during seek but in second seek that data is found in cache. prev_offset_ and prev_len_ get updated but we were not setting explicit_prefetch_submitted_ = false because of which buffers were getting out of sync. It's possible a read by another thread might have loaded the block into the cache in the meantime. Particular assertion example: ``` prev_offset: 0, prev_len_: 8097 , offset: 0, length: 8097, actual_length: 8097 , actual_offset: 0 , curr_: 0, bufs_[curr_].offset_: 4096 ,bufs_[curr_].CurrentSize(): 48541 , async_len_to_read: 278528, bufs_[curr_].async_in_progress_: false second: 1, bufs_[second].offset_: 282624 ,bufs_[second].CurrentSize(): 0, async_len_to_read: 262144 ,bufs_[second].async_in_progress_: true , explicit_prefetch_submitted_: true , copy_to_third_buffer: false ``` As we can see curr_ was expected to read 278528 but it read 48541. Also buffers are out of sync. Also `explicit_prefetch_submitted_` is set true but prev_len not 0. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10949 Test Plan: - Ran db_bench for regression to make sure there is no regression; - Ran db_stress failing without this fix, - Ran build-linux-mini-crashtest 7- 8 times locally + CircleCI Reviewed By: anand1976 Differential Revision: D41257786 Pulled By: akankshamahajan15 fbshipit-source-id: 1d100f94f8c06bbbe4cc76ca27f1bbc820c2494f	3 years ago
xiaochenfan	0993c9225f	Fix broken dependency: update zlib from 1.2.12 to 1.2.13 (#10833 ) Summary: zlib(https://zlib.net/) has released v1.2.13. 1.2.12 is no longer available for downloading and Makefile for rocksdb will be broken due to can't find the source .tar.gz. https://nvd.nist.gov/vuln/detail/CVE-2022-37434 This pr update the version number and the shasum of new .tar.gz file. (1.2.13) Fixes https://github.com/facebook/rocksdb/issues/10876 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10833 Reviewed By: hx235 Differential Revision: D40575954 Pulled By: ajkr fbshipit-source-id: 3e560e453ddf58d045214fc4e64f83bef91f22e5	3 years ago
akankshamahajan	8515437594	Update unit test to avoid timeout (#10950 ) Summary: Update unit test to avoid timeout Pull Request resolved: https://github.com/facebook/rocksdb/pull/10950 Reviewed By: hx235 Differential Revision: D41258892 Pulled By: akankshamahajan15 fbshipit-source-id: cbfe94da63e9e54544a307845deb79ba42458301	3 years ago
anand76	ecba6a320e	Add some async read stats (#10947 ) Summary: Add stats for time spent in the ReadAsync call, and async read errors. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10947 Test Plan: Run db_bench and look at stats Reviewed By: akankshamahajan15 Differential Revision: D41236637 Pulled By: anand1976 fbshipit-source-id: 70539b69a28491d57acead449436a761f7108acf	3 years ago

... 3 4 5 6 7 ...

11865 Commits (e0e318f37072ba5f45a156eb45c0075e212b5b9f) All Branches Search

11865 Commits (e0e318f37072ba5f45a156eb45c0075e212b5b9f)

All Branches