Summary:
**Context/Summary:**
After https://github.com/facebook/rocksdb/pull/11631, file hint is not longer needed for compaction read. Therefore we can deprecate `Options::access_hint_on_compaction_start`. As this is a public API change, we should first mark the relevant APIs (including the Java's) deprecated and remove it in next major release 9.0.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11658
Test Plan: No code change
Reviewed By: ajkr
Differential Revision: D47997856
Pulled By: hx235
fbshipit-source-id: 16e015ae7728c224b1caef73143aa9915668f4ac
Summary:
Add a mutable column family option `memtable_max_range_deletions`. When non-zero, RocksDB will try to flush the current memtable after it has at least `memtable_max_range_deletions` range deletions. Java API is added and crash test is updated accordingly to randomly enable this option.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11358
Test Plan:
* New unit test: `DBRangeDelTest.MemtableMaxRangeDeletions`
* Ran crash test `python3 ./tools/db_crashtest.py whitebox --simple --memtable_max_range_deletions=20` and saw logs showing flushed memtables usually with 20 range deletions.
Reviewed By: ajkr
Differential Revision: D46582680
Pulled By: cbi42
fbshipit-source-id: f23d6fa8d8264ecf0a18d55c113ba03f5e2504da
Summary:
... used in validating some HyperClockCache development in progress.
* Revamp the "populate cache" step to avoid redundant insertions (very rare in practice) and more consistently approach the desired resident_ratio while maintaining appropriate skew (still not perfect).
* Track and print hit ratio on lookups, to ensure a fair comparison is happening between implementations etc.
* Add an option to disable tracking and printing histograms (lots of output)
* Add an option to specify a random seed (for more reproducibility)
* Remove confusing/redundant "-skewed" option
Uses BitwiseAnd from https://github.com/facebook/rocksdb/issues/11660 (tested there)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11661
Test Plan: manual
Reviewed By: akankshamahajan15, jowlyzhang
Differential Revision: D47937671
Pulled By: pdillinger
fbshipit-source-id: 85a2bb881b1bca4f63e015bac684105fd91c9f35
Summary:
Updated the main branch for the 8.5.fb branch cut. Also made unreleased_history/release.sh backdate to the last commit instead of the current date in case the release manager is a laggard like myself.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11642
Reviewed By: cbi42
Differential Revision: D47783574
Pulled By: ajkr
fbshipit-source-id: 4e2a80f5ccd542dc7dd0d22dfd7e59cb136325a1
Summary:
BottomNBits() - there is a single fast instruction for this on x86 since BMI2, but testing with godbolt indicates you need at least GCC 10 for the compiler to choose that instruction from the obvious C++ code. https://godbolt.org/z/5a7Ysd41h
BitwiseAnd() - this is a convenience function that works around the language flaw that the type of the result of x & y is the larger of the two input types, when it should be the smaller. This can save some ugly static_cast.
I expect to use both of these in coming HyperClockCache developments, and have applied them in a couple of places in existing code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11660
Test Plan: unit tests added
Reviewed By: jowlyzhang
Differential Revision: D47935531
Pulled By: pdillinger
fbshipit-source-id: d148c43a1e51df4a1c549b93aaf2725a3f8d3bd6
Summary:
Adds a few missing features to the C API:
1) Statistics level
2) Getting individual values instead of a serialized string
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11263
Test Plan: unit tests
Reviewed By: ajkr
Differential Revision: D47309963
Pulled By: hx235
fbshipit-source-id: 84df59db4045fc0fb3ea4aec451bc5c2afd2a248
Summary:
(Copied from https://www.internalfb.com/diff/D46606060)
This diff makes its files safe for use with -Wimplicit-fallthrough. Now that we're using C+20 there's no reason not to use this C++17 feature to make our code safer.
It's currently possible to write code like this:
```
switch(x){
case 1:
foo1();
case 2:
foo2();
break;
case 3:
foo3();
}
```
But that's scary because we don't know whether the fallthrough from case 1 was intentional or not.
The -Wimplicit-fallthrough flag will make this an error. The solution is to either fix the bug by inserting break or indicating intention by using [[fallthrough]]; (from C++17).
```
switch(x){
case 1:
foo1();
[[fallthrough]]; // Solution if we intended to fallthrough
break; // Solution if we did not intend to fallthrough
case 2:
foo2();
break;
case 3:
foo3();
}
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11663
Test Plan: Existing tests
Reviewed By: jowlyzhang
Differential Revision: D47961248
Pulled By: jaykorean
fbshipit-source-id: 0d374c721bf1b328c14949dc5c17693da7311d03
Summary:
https://github.com/facebook/rocksdb/issues/11653 broke some crash tests.
Apparently these Rollbacks are needed for pessimistic transaction cases. (I'm still not sure if the API makes any sense with regard to safe usage. It's certainly not documented. Will consider in follow-up PRs.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11656
Test Plan: manual crash test runs, crash_test_with_multiops_wc_txn and crash_test_with_multiops_wp_txn
Reviewed By: cbi42
Differential Revision: D47906280
Pulled By: pdillinger
fbshipit-source-id: d058a01b6dbb47a4f08d199e335364168304f81b
Summary:
## Context checksum
All RocksDB checksums currently use 32 bits of checking
power, which should be 1 in 4 billion false negative (FN) probability (failing to
detect corruption). This is true for random corruptions, and in some cases
small corruptions are guaranteed to be detected. But some possible
corruptions, such as in storage metadata rather than storage payload data,
would have a much higher FN rate. For example:
* Data larger than one SST block is replaced by data from elsewhere in
the same or another SST file. Especially with block_align=true, the
probability of exact block size match is probably around 1 in 100, making
the FN probability around that same. Without `block_align=true` the
probability of same block start location is probably around 1 in 10,000,
for FN probability around 1 in a million.
To solve this problem in new format_version=6, we add "context awareness"
to block checksum checks. The stored and expected checksum value is
modified based on the block's position in the file and which file it is in. The
modifications are cleverly chosen so that, for example
* blocks within about 4GB of each other are guaranteed to use different context
* blocks that are offset by exactly some multiple of 4GiB are guaranteed to use
different context
* files generated by the same process are guaranteed to use different context
for the same offsets, until wrap-around after 2^32 - 1 files
Thus, with format_version=6, if a valid SST block and checksum is misplaced,
its checksum FN probability should be essentially ideal, 1 in 4B.
## Footer checksum
This change also adds checksum protection to the SST footer (with
format_version=6), for the first time without relying on whole file checksum.
To prevent a corruption of the format_version in the footer (e.g. 6 -> 5) to
defeat the footer checksum, we change much of the footer data format
including an "extended magic number" in format_version 6 that would be
interpreted as empty index and metaindex block handles in older footer
versions. We also change the encoding of handles to free up space for
other new data in footer.
## More detail: making space in footer
In order to keep footer the same size in format_version=6 (avoid change to IO
patterns), we have to free up some space for new data. We do this two ways:
* Metaindex block handle is encoded down to 4 bytes (from 10) by assuming
it immediately precedes the footer, and by assuming it is < 4GB.
* Index block handle is moved into metaindex. (I don't know why it was
in footer to begin with.)
## Performance
In case of small performance penalty, I've made a "pay as you go" optimization
to compensate: replace `MutableCFOptions` in BlockBasedTableBuilder::Rep
with the only field used in that structure after construction: `prefix_extractor`.
This makes the PR an overall performance improvement (results below).
Nevertheless I'm seeing essentially no difference going from fv=5 to fv=6,
even including that improvement for both. That's based on extreme case table
write performance testing, many files with many blocks. This is relatively
checksum intensive (small blocks) and salt generation intensive (small files).
```
(for I in `seq 1 100`; do TEST_TMPDIR=/dev/shm/dbbench2 ./db_bench -benchmarks=fillseq -memtablerep=vector -disable_wal=1 -allow_concurrent_memtable_write=false -num=3000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -write_buffer_size=100000 -compression_type=none -block_size=1000; done) 2>&1 | grep micros/op | tee out
awk '{ tot += $5; n += 1; } END { print int(1.0 * tot / n) }' < out
```
Each value below is ops/s averaged over 100 runs, run simultaneously with competing
configuration for load fairness
Before -> after (both fv=5): 483530 -> 483673 (negligible)
Re-run 1: 480733 -> 485427 (1.0% faster)
Re-run 2: 483821 -> 484541 (0.1% faster)
Before (fv=5) -> after (fv=6): 482006 -> 485100 (0.6% faster)
Re-run 1: 482212 -> 485075 (0.6% faster)
Re-run 2: 483590 -> 484073 (0.1% faster)
After fv=5 -> after fv=6: 483878 -> 485542 (0.3% faster)
Re-run 1: 485331 -> 483385 (0.4% slower)
Re-run 2: 485283 -> 483435 (0.4% slower)
Re-run 3: 483647 -> 486109 (0.5% faster)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9058
Test Plan:
unit tests included (table_test, db_properties_test, salt in env_test). General DB tests
and crash test updated to test new format_version.
Also temporarily updated the default format version to 6 and saw some test failures. Almost all
were due to an inadvertent additional read in VerifyChecksum to verify the index block checksum,
though it's arguably a bug that VerifyChecksum does not appear to (re-)verify the index block
checksum, just assuming it was verified in opening the index reader (probably *usually* true but
probably not always true). Some other concerns about VerifyChecksum are left in FIXME
comments. The only remaining test failure on change of default (in block_fetcher_test) now
has a comment about how to upgrade the test.
The format compatibility test does not need updating because we have not updated the default
format_version.
Reviewed By: ajkr, mrambacher
Differential Revision: D33100915
Pulled By: pdillinger
fbshipit-source-id: 8679e3e572fa580181a737fd6d113ed53c5422ee
Summary:
In rare cases, optimistic transaction commit returns TryAgain. This change tolerates that intentional behavior in db_stress, up to a small limit in a row. This way, we don't miss a possible regression with excessive TryAgain, and trying again (rolling back the transaction) should have a well renewed chance of success as the writes will be associated with fresh sequence numbers.
Also, some of the APIs were not clear about Transaction semantics, so I have clarified:
* (Best I can tell....) Destroying a Transaction is safe without calling Rollback() (or at least should be). I don't know why it's a common pattern in our test code and examples to rollback before unconditional destruction. Stress test updated not to call Rollback unnecessarily (to test safe destruction).
* Despite essentially doing what is asked, simply trying Commit() again when it returns TryAgain does not have a chance of success, because of the transaction being bound to the DB state at the time of operations before Commit. Similar logic applies to Busy AFAIK. Commit() API comments updated, and expanded unit test in optimistic_transaction_test.
Also also, because I can't stop myself, I refactored a good portion of the transaction handling code in db_stress.
* Avoid existing and new copy-paste for most transaction interactions with a new ExecuteTransaction (higher-order) function.
* Use unique_ptr (nicely complements removing unnecessary Rollbacks)
* Abstract out a pattern for safely calling std::terminate() and use it in more places. (The TryAgain errors we saw did not have stack traces because of "terminate called recursively".)
Intended follow-up: resurrect use of `FLAGS_rollback_one_in` but also include non-trivial cases
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11653
Test Plan:
this is the test :)
Also, temporarily bypassed the new retry logic and boosted the chance of hitting TryAgain. Quickly reproduced the TryAgain error. Then re-enabled the new retry logic, and was not able to hit the error after running for tens of minutes, even with the boosted chances.
Reviewed By: cbi42
Differential Revision: D47882995
Pulled By: pdillinger
fbshipit-source-id: 21eadb1525423340dbf28d17cf166b9583311a0d
Summary:
Some trailing whitespace has leaked into HISTORY.md entries. This can lead to unexpected changes when modifying HISTORY.md with hygienic editors (e.g. for a patch release). This change should protect against future cases.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11652
Test Plan: manual
Reviewed By: akankshamahajan15
Differential Revision: D47882814
Pulled By: pdillinger
fbshipit-source-id: 148c3746d3b298cb6e1f655f0416d46619969086
Summary:
... to improve data integrity validation during compaction.
A new option `compaction_verify_record_count` is introduced for this verification and is enabled by default. One exception when the verification is not done is when a compaction filter returns kRemoveAndSkipUntil which can cause CompactionIterator to seek until some key and hence not able to keep track of the number of keys processed.
For expected number of input keys, we sum over the number of total keys - number of range tombstones across compaction input files (`CompactionJob::UpdateCompactionStats()`). Table properties are consulted if `FileMetaData` is not initialized for some input file. Since table properties for all input files were also constructed during `DBImpl::NotifyOnCompactionBegin()`, `Compaction::GetTableProperties()` is introduced to reduce duplicated code.
For actual number of keys processed, each subcompaction will record its number of keys processed to `sub_compact->compaction_job_stats.num_input_records` and aggregated when all subcompactions finish (`CompactionJob::AggregateCompactionStats()`). In the case when some subcompaction encountered kRemoveAndSkipUntil from compaction filter and does not have accurate count, it propagates this information through `sub_compact->compaction_job_stats.has_num_input_records`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11571
Test Plan:
* Add a new unit test `DBCompactionTest.VerifyRecordCount` for the corruption case.
* All other unit tests for non-corrupted case.
* Ran crash test for a few hours: `python3 ./tools/db_crashtest.py whitebox --simple`
Reviewed By: ajkr
Differential Revision: D47131965
Pulled By: cbi42
fbshipit-source-id: cc8e94565dd526c4347e9d3843ecf32f6727af92
Summary:
Add a built-in comparator that supports uint64_t style user-defined timestamps for ReverseBytewiseComparator.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11647
Test Plan:
Added a test wrapper for retrieving this comparator from registry and used it in this test:
`./udt_util_test`
Reviewed By: ltamasi
Differential Revision: D47848303
Pulled By: jowlyzhang
fbshipit-source-id: 5af5534a8c2d9195997d0308c8e194c1c797548c
Summary:
Fix use_after_free bug in async_io MultiReads when underlying FS enabled kFSBuffer. kFSBuffer is when underlying FS pass their own buffer instead of using RocksDB scratch in FSReadRequest
Since it's an experimental feature, added a hack for now to fix the bug.
Planning to make public API change to remove const from the callback as it doesn't make sense to use const.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11645
Test Plan: tested locally
Reviewed By: ltamasi
Differential Revision: D47819907
Pulled By: akankshamahajan15
fbshipit-source-id: 1faf5ef795bf27e2b3a60960374d91274931df8d
Summary:
Add support to allow enabling / disabling user-defined timestamps feature for an existing column family in combination with the in-Memtable only feature.
To do this, this PR includes:
1) Log the `persist_user_defined_timestamps` option per column family in Manifest to facilitate detecting an attempt to enable / disable UDT. This entry is enforced to be logged in the same VersionEdit as the user comparator name entry.
2) User-defined timestamps related options are validated when re-opening a column family, including user comparator name and the `persist_user_defined_timestamps` flag. These type of settings and settings change are considered valid:
a) no user comparator change and no effective `persist_user_defined_timestamp` flag change.
b) switch user comparator to enable UDT provided the immediately effective `persist_user_defined_timestamps` flag
is false.
c) switch user comparator to disable UDT provided that the before-change `persist_user_defined_timestamps` is
already false.
3) when an attempt to enable UDT is detected, we mark all its existing SST files as "having no UDT" by marking its `FileMetaData.user_defined_timestamps_persisted` flag to false and handle their file boundaries `FileMetaData.smallest`, `FileMetaData.largest` by padding a min timestamp.
4) while enabling / disabling UDT feature, timestamp size inconsistency in existing WAL logs are handled to make it compatible with the running user comparator.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11623
Test Plan:
```
make all check
./db_with_timestamp_basic_test --gtest-filter="*EnableDisableUDT*"
./db_wal_test --gtest_filter="*EnableDisableUDT*"
```
Reviewed By: ltamasi
Differential Revision: D47636862
Pulled By: jowlyzhang
fbshipit-source-id: dcd19f67292da3c3cc9584c09ad00331c9ab9322
Summary:
Make flush respect the cutoff timestamp `full_history_ts_low` as much as possible for the user-defined timestamps in Memtables only feature. We achieve this by not proceeding with the actual flushing but instead reschedule the same `FlushRequest` so a follow up flush job can continue with the check after some interval.
This approach doesn't work well for atomic flush, so this feature currently is not supported in combination with atomic flush. Furthermore, this approach also requires a customized method to get the next immediately bigger user-defined timestamp. So currently it's limited to comparator that use uint64_t as the user-defined timestamp format. This support can be extended when we add such a customized method to `AdvancedColumnFamilyOptions`.
For non atomic flush request, at any single time, a column family can only have as many as one FlushRequest for it in the `flush_queue_`. There is deduplication done at `FlushRequest` enqueueing(`SchedulePendingFlush`) and dequeueing time (`PopFirstFromFlushQueue`). We hold the db mutex between when a `FlushRequest` is popped from the queue and the same FlushRequest get rescheduled, so no other `FlushRequest` with a higher `max_memtable_id` can be added to the `flush_queue_` blocking us from re-enqueueing the same `FlushRequest`.
Flush is continued nevertheless if there is risk of entering write stall mode had the flush being postponed, e.g. due to accumulation of write buffers, exceeding the `max_write_buffer_number` setting. When this happens, the newest user-defined timestamp in the involved Memtables need to be tracked and we use it to increase the `full_history_ts_low`, which is an inclusive cutoff timestamp for which RocksDB promises to keep all user-defined timestamps equal to and newer than it.
Tet plan:
```
./column_family_test --gtest_filter="*RetainUDT*"
./memtable_list_test --gtest_filter="*WithTimestamp*"
./flush_job_test --gtest_filter="*WithTimestamp*"
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11599
Reviewed By: ajkr
Differential Revision: D47561586
Pulled By: jowlyzhang
fbshipit-source-id: 9400445f983dd6eac489e9dd0fb5d9b99637fe89
Summary:
this is stacked on https://github.com/facebook/rocksdb/issues/11550 to further clarify usage of these two options for universal compaction. Similar to FIFO, the two options have the same meaning for universal compaction, which can be confusing to use. For example, for universal compaction, dynamically changing the value of `ttl` has no impact on periodic compactions. Users should dynamically change `periodic_compaction_seconds` instead. From feature matrix (https://fburl.com/daiquery/5s647hwh), there are instances where users set `ttl` to non-zero value and `periodic_compaction_seconds` to 0. For backward compatibility reason, instead of deprecating `ttl`, comments are added to mention that `periodic_compaction_seconds` are preferred. In `SanitizeOptions()`, we update the value of `periodic_compaction_seconds` to take into account value of `ttl`. The logic is documented in relevant option comment.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11552
Test Plan: * updated existing unit test `DBTestUniversalCompaction2.PeriodicCompactionDefault`
Reviewed By: ajkr
Differential Revision: D47381434
Pulled By: cbi42
fbshipit-source-id: bc41f29f77318bae9a96be84dd89bf5617c7fd57
Summary:
Remove obsolete comment.
Support for WriteBatchWithIndex::NewIteratorWithBase when overwrite_key=false is added in https://github.com/facebook/rocksdb/pull/8135, as you can clearly see in the HISTORY.md.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11636
Reviewed By: jowlyzhang
Differential Revision: D47722955
Pulled By: ajkr
fbshipit-source-id: 4fa44a309d9708e9f4a1530918a9aaf7114c9032
Summary:
... ahead of dynamic variant.
* Introduce an Unref function for a common pattern. Cases that were previously using std::memory_order_acq_rel we doing so because we were saving the pre-updated value in case it might be used. Now we are explicitly throwing away the pre-updated value so do not need the acquire semantic, just release.
* Introduce a reusable EvictionData struct and TrackAndReleaseEvictedEntry() function.
* Based on a linter suggesting, use const Func& parameter type instead of Func for templated callable parameters.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11630
Test Plan: existing tests, and performance test with release build of cache_bench. Getting 1-2% difference between before & after from run to run, but inconsistent about which one is faster.
Reviewed By: jowlyzhang
Differential Revision: D47657334
Pulled By: pdillinger
fbshipit-source-id: 5cf2377c0d47a39143b04be6735f98c550e8bdc3
Summary:
**Context/Summary**
As titled. The benefit of doing so is to explicitly call readahead() instead of relying page cache behavior for compaction read when we know that we most likely need readahead as compaction read is sequential read .
**Test**
Extended the existing UT to cover compaction read case
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11631
Reviewed By: ajkr
Differential Revision: D47681437
Pulled By: hx235
fbshipit-source-id: 78792f64985c4dc44aa8f2a9c41ab3e8bbc0bc90
Summary:
Add `rocksdb_transactiondb_get_base_db` and `rocksdb_transactiondb_close_base_db` functions to the C API modeled after `rocksdb_optimistictransactiondb_get_base_db` and `rocksdb_optimistictransactiondb_close_base_db`:
ca50ccc71a/include/rocksdb/c.h (L2711-L2716)
With this pair of functions, it is possible to get a `rocksdb_t *` from a `rocksdb_transactiondb_t *`. The main goal is to be able to use the approximate memory usage API, only accessible to the `rocksdb_t *` type:
ca50ccc71a/include/rocksdb/c.h (L2821-L2833)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11562
Reviewed By: ajkr
Differential Revision: D47603343
Pulled By: jowlyzhang
fbshipit-source-id: c70cf6af5834026e232fe7791634db3a396f7d5e
Summary:
Add path existence check in the script to avoid script running even when db_bench executable does not exist or relative path is not right.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11621
Reviewed By: jowlyzhang
Differential Revision: D47552590
Pulled By: ajkr
fbshipit-source-id: f09ea069f69e067212b249a22ad755b76bc6063a