Summary:
Stacked on https://github.com/facebook/rocksdb/issues/11572
* Minimize use of std::function and lambdas to minimize chances of
compiler heap-allocating closures (unnecessary stress on allocator). It
appears that converting FindSlot to a template enables inlining the
lambda parameters, avoiding heap allocations.
* Clean up some logic with FindSlot (FIXMEs from https://github.com/facebook/rocksdb/issues/11572)
* Fix handling of rare case of probing all slots, with new unit test.
(Previously Insert would not roll back displacements in that case, which
would kill performance if it were to happen.)
* Add an -early_exit option to cache_bench for gathering memory stats
before deallocation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11601
Test Plan:
unit test added for probing all slots
## Seeing heap allocations
Run `MALLOC_CONF="stats_print:true" ./cache_bench -cache_type=hyper_clock_cache`
before https://github.com/facebook/rocksdb/issues/11572 vs. after this change. Before, we see this in the
interesting bin statistics:
```
size nrequests
---- ---------
32 578460
64 24340
8192 578460
```
And after:
```
size nrequests
---- ---------
32 (insignificant)
64 24370
8192 579130
```
## Performance test
Build with `make USE_CLANG=1 PORTABLE=0 DEBUG_LEVEL=0 -j32 cache_bench`
Run `./cache_bench -cache_type=hyper_clock_cache -ops_per_thread=5000000`
in before and after configurations, simultaneously:
```
Before: Complete in 33.244 s; Rough parallel ops/sec = 2406442
After: Complete in 32.773 s; Rough parallel ops/sec = 2441019
```
Reviewed By: jowlyzhang
Differential Revision: D47375092
Pulled By: pdillinger
fbshipit-source-id: 46f0f57257ddb374290a0a38c651764ea60ba410
Summary:
We observed `CompactionOutputs::UpdateGrandparentBoundaryInfo` consumes much time for `InternalKey::DecodeFrom` and `InternalKey::~InternalKey` in flame graph.
This PR omit the InternalKey object in `CompactionOutputs::UpdateGrandparentBoundaryInfo` .
![image](https://github.com/facebook/rocksdb/assets/1574991/661eaeec-2f46-46c6-a6a8-9738d6c191de)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11610
Reviewed By: ajkr
Differential Revision: D47426971
Pulled By: cbi42
fbshipit-source-id: f0d3a8186d778294515c0685032f5b395c4d6a62
Summary:
Separate out some functionality that will be common to both static and dynamic HCC into BaseClockTable. Table::InsertState and GrowIfNeeded will be used by the dynamic HCC so don't make much sense right now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11572
Test Plan:
existing tests. No functional changes intended.
Performance test in subsequent PR https://github.com/facebook/rocksdb/issues/11601
Reviewed By: jowlyzhang
Differential Revision: D47110496
Pulled By: pdillinger
fbshipit-source-id: 379bd433322a42ea28c0043b41ec24956d21e7aa
Summary:
I got the following error message when an SST file is recorded in MANIFEST but is missing from the db folder.
It's confusing in two ways:
1. The part about file "./074837.ldb" which RocksDB will attempt to open only after ./074837.sst is not found.
2. The last part about "No such file or directory in file ./MANIFEST-074507" sounds like `074837.ldb` is not found in manifest.
```
ldb --hex --db=. get some_key
Failed: Corruption: Corruption: IO error: No such file or directory: While open a file for random read: ./074837.ldb: No such file or directory in file ./MANIFEST-074507
```
Improving the error message a little bit:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11573
Test Plan:
run the same command after this PR
```
Failed: Corruption: Corruption: IO error: No such file or directory: While open a file for random read: ./074837.sst: No such file or directory The file ./MANIFEST-074507 may be corrupted.
```
Reviewed By: ajkr
Differential Revision: D47192056
Pulled By: cbi42
fbshipit-source-id: 06863f376cc4455803cffb2250c41399b4c39467
Summary:
Handle file boundaries `FileMetaData.smallest`, `FileMetaData.largest` for when `persist_user_defined_timestamps` is false:
1) on the manifest write path, the original user-defined timestamps in file boundaries are stripped. This stripping is done during `VersionEdit::Encode` to limit the effect of the stripping to only the persisted version of the file boundaries.
2) on the manifest read path during DB open, a a min timestamp is padded to the file boundaries. Ideally, this padding should happen during `VersionEdit::Decode` so that all in memory file boundaries have a compatible user key format as the running user comparator. However, because the user-defined timestamp size information is not available at that time. This change is added to `VersionEditHandler::OnNonCfOperation`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11578
Test Plan:
```
make all check
./version_edit_test --gtest_filter="*EncodeDecodeNewFile4HandleFileBoundary*".
./db_with_timestamp_basic_test --gtest_filter="*HandleFileBoundariesTest*"
```
Reviewed By: pdillinger
Differential Revision: D47309399
Pulled By: jowlyzhang
fbshipit-source-id: 21b4d54d2089a62826b31d779094a39cb2bbbd51
Summary:
Thanks pdillinger for pointing out this test hole. The test `DBWALTestWithTimestamp.Recover` that is intended to test recovery from WAL including user-defined timestamps doesn't achieve its promised coverage. Specifically, after https://github.com/facebook/rocksdb/issues/11557, timestamps will be removed during flush, and RocksDB by default flush memtables during recovery with `avoid_flush_during_recovery` defaults to false. This test didn't fail even if all the timestamps are quickly lost due to the default flush behavior.
This PR renamed test `Recover` to `RecoverAndNoFlush`, and updated it to verify timestamps are successfully recovered from WAL with some time-travel reads. `avoid_flush_during_recovery` is set to true to help do this verification.
On the other hand, for test `DBWALTestWithTimestamp.RecoverAndFlush`, since flush on reopen is DB's default behavior. Setting the flags `max_write_buffer` and `arena_block_size` are not really the factors that enforces the flush, so these flags are removed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11577
Test Plan: ./db_wal_test
Reviewed By: pdillinger
Differential Revision: D47142892
Pulled By: jowlyzhang
fbshipit-source-id: 9465e278806faa5885b541b4e32d99e698edef7d
Summary:
https://github.com/facebook/rocksdb/issues/11542 added a parameter to the C API `rocksdb_options_add_compact_on_deletion_collector_factory` which causes some internal builds to fail. External users using this API would also require code change. Making the API backward compatible by restoring the old C API and add the parameter to a new C API `rocksdb_options_add_compact_on_deletion_collector_factory_del_ratio`.
Also updated change log for 8.4 and will backport this change to 8.4 branch once landed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11593
Test Plan: `make c_test && ./c_test`
Reviewed By: akankshamahajan15
Differential Revision: D47299555
Pulled By: cbi42
fbshipit-source-id: 517dc093ef4cf02cac2fe4af4f1af13754bbda63
Summary:
both options `ttl` and `periodic_compaction_seconds` have the same meaning for FIFO compaction, which is redundant and can be confusing to use. For example, setting TTL to 0 does not disable TTL: user needs to also set periodic_compaction_seconds to 0. Another example is that dynamically setting `periodic_compaction_seconds` (surprisingly) has no effect on TTL compaction. This is because FIFO compaction picker internally only looks at value of `ttl`. The value of `ttl` is in `SanitizeOptions()` which take into account the value of `periodic_compaction_seconds`, but dynamically setting an option does not invoke this method.
This PR clarifies the usage of both options for FIFO compaction: only `ttl` should be used, `periodic_compaction_seconds` will not have any effect on FIFO compaction.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11550
Test Plan:
- updated existing unit test `DBOptionsTest.SanitizeFIFOPeriodicCompaction`
- checked existing values of both options in feature matrix: https://fburl.com/daiquery/xxd0gs9w. All current uses cases either have `periodic_compaction_seconds = 0` or have `periodic_compaction_seconds > ttl`, so should not cause change of behavior.
Reviewed By: ajkr
Differential Revision: D46902959
Pulled By: cbi42
fbshipit-source-id: a9ede235b276783b4906aaec443551fa62ceff4c
Summary:
`sst_dump --command=verify` did not set read_options.verify_checksum to true so it was not verifying checksum.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11576
Test Plan:
ran the same command on an SST file with bad checksum:
```
sst_dump --command=verify --file=...sst_file_with_bad_block_checksum
Before this PR:
options.env is 0x6ba048
Process ...sst_file_with_bad_block_checksum
Sst file format: block-based
The file is ok
After this PR:
options.env is 0x7f43f6690000
Process ...sst_file_with_bad_block_checksum
Sst file format: block-based
... is corrupted: Corruption: block checksum mismatch: stored = 2170109798, computed = 2170097510, type = 4 ...
```
Reviewed By: ajkr
Differential Revision: D47136284
Pulled By: cbi42
fbshipit-source-id: 07d68db715c00347145e5b83d649aef2c3f2acd9
Summary:
This should be a benign bug caused by a long lived typo, this PR fix this issue.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11398
Reviewed By: ajkr
Differential Revision: D47163379
Pulled By: cbi42
fbshipit-source-id: 531728cae496fd7ac1371bbbd64fc103c3a90dcf
Summary:
Logically strip the user-defined timestamp when L0 files are created during flush when `AdvancedColumnFamilyOptions.persist_user_defined_timestamps` is false. Logically stripping timestamp here means replacing the original user-defined timestamp with a mininum timestamp, which for now is hard coded to be all zeros bytes.
While working on this, I caught a missing piece on the `BlockBuilder` level for this feature. The current quick path `std::min(buffer_size, last_key_size)` needs a bit tweaking to work for this feature. When user-defined timestamp is stripped during block building, on writing first entry or right after resetting, `buffer` is empty and `buffer_size` is zero as usual. However, in follow-up writes, depending on the size of the stripped user-defined timestamp, and the size of the value, what's in `buffer` can sometimes be smaller than `last_key_size`, leading `std::min(buffer_size, last_key_size)` to truncate the `last_key`. Previous test doesn't caught the bug because in those tests, the size of the stripped user-defined timestamps bytes is smaller than the length of the value. In order to avoid the conditional operation, this PR changed the original trivial `std::min` operation into an arithmetic operation. Since this is a change in a hot and performance critical path, I did the following benchmark to check no observable regression is introduced.
```TEST_TMPDIR=/dev/shm/rocksdb1 ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=50000000```
Compiled with DEBUG_LEVEL=0
Test vs. control runs simulaneous for better accuracy, units = ops/sec
PR vs base:
Round 1: 350652 vs 349055
Round 2: 365733 vs 364308
Round 3: 355681 vs 354475
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11557
Test Plan:
New timestamp specific test added or existing tests augmented, both are parameterized with `UserDefinedTimestampTestMode`:
`UserDefinedTimestampTestMode::kNormal` -> UDT feature enabled, write / read with min timestamp
`UserDefinedTimestampTestMode::kStripUserDefinedTimestamps` -> UDT feature enabled, write / read with min timestamp, set Options.persist_user_defined_timestamps to false.
```
make all check
./db_wal_test --gtest_filter="*WithTimestamp*"
./flush_job_test --gtest_filter="*WithTimestamp*"
./repair_test --gtest_filter="*WithTimestamp*"
./block_based_table_reader_test
```
Reviewed By: pdillinger
Differential Revision: D47027664
Pulled By: jowlyzhang
fbshipit-source-id: e729193b6334dfc63aaa736d684d907a022571f5
Summary:
Expose the remaining fields of PlainTableOptions as arguments to `rocksdb_options_set_plain_table_factory` in the C API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11442
Reviewed By: ajkr
Differential Revision: D46786962
Pulled By: hx235
fbshipit-source-id: 8862083dde332bfecc5ff02f9375776ad35c11f5
Summary:
Add `skip_tmpdir_check` argument in crash script. If `tmp_dir` is on remote storage and exist, `isdir` will be false (checking on local storage) leading to exit. By passing `skip_tmpdir_check` with `crashtest.py`, the dir check can be skipped.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11539
Test Plan: Ran locally
Reviewed By: anand1976
Differential Revision: D46740456
Pulled By: akankshamahajan15
fbshipit-source-id: 8726882ef53d2c84b604c7515e84eda6d1bf797c
Summary:
I made some changes to add OpenBSD support.
Second time doing something like this, so I apologize in advance if I'm doing something wrong (had some minor hiccups with how github worked).
Fixes https://github.com/facebook/rocksdb/issues/11220
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11255
Reviewed By: akankshamahajan15
Differential Revision: D46361706
Pulled By: ajkr
fbshipit-source-id: 90922fa30197fe6d6f3c0e3ecca2dbb92c337277
Summary:
There are some comments on subclasses in EncryptedEnv module which are duplicate to their parent classes, it would be nice to remove the duplication and keep the consistency if the comments on parent classes updated in someday.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11549
Reviewed By: akankshamahajan15
Differential Revision: D47007061
Pulled By: ajkr
fbshipit-source-id: 8bfdaf9f2418a24ca951c30bb88e90ac861d9016
Summary:
Infer detected a(n) [Unnecessary Copy Intermediate](https://fbinfer.com/docs/next/all-issue-types#unnecessary copy intermediate) issue. variable &my_secondary_handles is copied unnecessarily into an intermediate on line 268. To avoid the copy, try moving it by calling std::move instead or alternatively change the callee's parameter type to const &.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11566
Reviewed By: akankshamahajan15
Differential Revision: D47057361
Pulled By: ajkr
fbshipit-source-id: bc5d7a71638aecbf976f1a163128b489c9e87fd8
Summary:
When num_file_reads_for_auto_readahead = 1, during seek, it would go for prefetchingextra data in second buffer along with seek data, that would lead to increase in read data and
discarded bytes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11560
Test Plan: Added unit test
Reviewed By: anand1976
Differential Revision: D47008102
Pulled By: akankshamahajan15
fbshipit-source-id: 566c6131cb5f968d5efb81fd0ab233ff7e534ab0
Summary:
https://github.com/facebook/rocksdb/issues/11378 added a new overloaded `CreateColumnFamilyWithImport` API and updated the virtual function in `StackableDB` and `DBImplReadOnly` to the newly overloaded one. This caused internal error when there is a derived class that tries to override the original `CreateColumnFamilyWithImport` function. This PR adds the original `CreateColumnFamilyWithImport` function back to `StackableDB` and `DBImplReadOnly`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11556
Test Plan: check if this fixes an internal build
Reviewed By: akankshamahajan15
Differential Revision: D46980506
Pulled By: cbi42
fbshipit-source-id: 975a6c5748bf9481499a62ee5997ca59e542e3bc
Summary:
1. Public API change: Replace `use_async_io` API in file_system with `SupportedOps` API which is used by underlying FileSystem to indicate to upper layers whether the FileSystem supports different operations introduced in `enum FSSupportedOps `. Right now operations are `async_io` and whether FS will provide its own buffer during reads or not. The api is changed to extend it to various FileSystem operations in one API rather than creating a separate API for each operation.
2. Provide support for underlying FS to pass their own buffer during Reads (async and sync read) instead of using RocksDB provided `scratch` (buffer) in `FSReadRequest`. Currently only MultiRead supports it and later will be extended to other reads as well (point lookup, scan etc). More details in how to enable in file_system.h
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11324
Test Plan: Tested locally
Reviewed By: anand1976
Differential Revision: D44465322
Pulled By: akankshamahajan15
fbshipit-source-id: 9ec9e08f839b5cc815e75d5dade6cd549998d0ec
Summary:
... instead of race-condition-laden FaultInjectionTestEnv. See https://app.circleci.com/pipelines/github/facebook/rocksdb/27912/workflows/4c63e5a8-597e-439d-8c7e-82308056af02/jobs/609648 and similar PR https://github.com/facebook/rocksdb/issues/11271
Had to fix the semantics of FaultInjectionTestFS Close() operations to allow a non-OK Close() to fulfill the obligation to close before destruction. To me, this is the obvious choice of Close contract, because what is the caller supposed to do if Close() fails and they still have an obligation to successfully close before object destruction? Call Close() in an infinite loop? Leak the object? I have added API comments to the Env and Filesystem Close() functions to clarify the contracts.
Note that `DB::Close()` has one exception to this kind of Close contract, but it is clearly described in API comments and it is really only for catching programming mistakes, not for dealing with exogenous errors.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11499
Test Plan: watch CI
Reviewed By: jowlyzhang
Differential Revision: D46375708
Pulled By: pdillinger
fbshipit-source-id: 03d4d8251e5df50a82ecd139f7e83f613015fe40
Summary:
Start to record the value of the flag `AdvancedColumnFamilyOptions.persist_user_defined_timestamps` in the Manifest and table properties for a SST file when it is created. And use the recorded flag when creating a table reader for the SST file. This flag's default value is true, it is only explicitly recorded if it's false.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11515
Test Plan:
```
make all check
./version_edit_test
```
Reviewed By: ltamasi
Differential Revision: D46920386
Pulled By: jowlyzhang
fbshipit-source-id: 075c20363d3d2cc1368422ecc805617ed135cc26
Summary:
... so that a non-cryptographic whole file checksum would be highly resistant
to manipulation by a user able to manipulate key-value data (e.g. a user whose data is
stored in RocksDB) and able to predict SST metadata such as DB session id and file
number based on read access to logs or DB files. The adversary would also need to predict
the salt in order to influence the checksum result toward collision with another file's
checksum.
This change is just internal code to support such a future feature. I think this should be a
passive feature, not option-controlled, because you probably won't think about needing it
until you discover you do need it, and it should be low cost, in space (16 bytes per SST
file) and CPU.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11331
Test Plan: Unit tests added to verify at least pseudorandom behavior. (Actually caught a bug in first draft!) The new "stress" style tests run in ~3ms each on my system.
Reviewed By: ajkr
Differential Revision: D46129415
Pulled By: pdillinger
fbshipit-source-id: 7972dc74487e062b29b1fd9c227425e922c98796
Summary:
The class `NewCompactOnDeletionCollectorFactory` exposes the parameter `delete_ratio`.
The C API `rocksdb_options_add_compact_on_deletion_collector_factory` does not allow a user to pass a delete ration to be passed down the the C++ class bellow.
The class has default value for the delete ratio which makes it pass the compilation and the tests.
closes https://github.com/facebook/rocksdb/issues/11541
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11542
Reviewed By: ajkr
Differential Revision: D46770908
Pulled By: cbi42
fbshipit-source-id: 7b5162fe459896052e392e2d85a8f6c01db3b464
Summary:
Calling `Flush` (even with `wait==true`) does not guarantee that obsolete WAL files are physically deleted before the call returns. The patch attempts to fix the resulting flakiness by using `SyncPoint`s to make sure `PurgeObsoleteFiles` finishes before checking for WAL deletions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11537
Test Plan:
```
gtest-parallel --repeat=1000 ./db_wal_test --gtest_filter="*SkipDeletedWALs*"
```
Reviewed By: pdillinger
Differential Revision: D46736050
Pulled By: ltamasi
fbshipit-source-id: 47a931b7a3a03ef681fbf4adb5a0b223d452703e
Summary:
`StressTest::optimistic_txn_db_` is currently not initialized by the constructor, which
can lead to assertion failures down the line in `StressTest::Open`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11547
Reviewed By: cbi42
Differential Revision: D46845658
Pulled By: ltamasi
fbshipit-source-id: 578b0f24fc00e3e97f24221fcdd003cc529439c2
Summary:
Context:
OptimisticTransactionDB has not been covered by db_stress (including crash test) like TransactionDB.
1. Adding the following gflag options to to test OptimisticTransactionDB
- `use_optimistic_txn`: When true, open OptimisticTransactionDB to test
- `occ_validation_policy`: `OccValidationPolicy::kValidateParallel = 1` by default.
- `share_occ_lock_buckets`: Use shared occ locks
- `occ_lock_bucket_count`: 500 by default. Number of buckets to use for shared occ lock.
2. Opening OptimisticTransactionDB and NewTxn/Commit added per `use_optimistic_txn` flag in `db_stress_test_base.cc`
3. OptimisticTransactionDB blackbox/whitebox test added in crash_test.mk
Please note that the existing flag `use_txn` is being used here. When `use_txn == true` and `use_optimistic_txn == false`, we use `TransactionDB` (a.k.a. pessimistic transaction db). When both `use_txn` and `use_optimistic_txn` are true, we use `OptimisticTransactionDB`. If `use_txn == false` but `use_optimistic_txn == true` throw error with message _"You cannot set use_optimistic_txn true while use_txn is false. Please set use_txn true if you want to use OptimisticTransactionDB"_.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11513
Test Plan:
**Crash Test**
Serial Validation
```
export CRASH_TEST_EXT_ARGS="--use_optimistic_txn=1 --use_txn=1 --use_put_entity_one_in=0 --occ_validation_policy=0"
make crash_test -j
```
Parallel Validation (no share bucket)
```
export CRASH_TEST_EXT_ARGS="--use_optimistic_txn=1 --use_txn=1 --use_put_entity_one_in=0 --occ_validation_policy=1 --share_occ_lock_buckets=0"
make crash_test -j
```
Parallel Validation (share bucket)
```
export CRASH_TEST_EXT_ARGS="--use_optimistic_txn=1 --use_txn=1 --use_put_entity_one_in=0 --occ_validation_policy=1 --share_occ_lock_buckets=1 --occ_lock_bucket_count=500"
make crash_test -j
```
**Stress Test**
```
./db_stress -use_optimistic_txn -threads=32
```
Reviewed By: pdillinger
Differential Revision: D46547387
Pulled By: jaykorean
fbshipit-source-id: ca19819ca6e0281694966998014b40d95d4e5960
Summary:
**Context/Summary:**
When db is upgrading to adopt [pr11406](https://github.com/facebook/rocksdb/pull/11406/), it's possible for RocksDB to infer a small tail size to prefetch for pre-upgrade files. Such small tail size would have caused 1 file read per index or filter partition if partitioned index or filer is used. This PR provides a UT to show this would not happen.
Misc: refactor the related UTs a bit to make this new UT more readable.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11522
Test Plan:
- New UT
If logic of upgrade is wrong e.g,
```
--- a/table/block_based/partitioned_index_reader.cc
+++ b/table/block_based/partitioned_index_reader.cc
@@ -166,7 +166,8 @@ Status PartitionIndexReader::CacheDependencies(
uint64_t prefetch_len = last_off - prefetch_off;
std::unique_ptr<FilePrefetchBuffer> prefetch_buffer;
if (tail_prefetch_buffer == nullptr || !tail_prefetch_buffer->Enabled() ||
- tail_prefetch_buffer->GetPrefetchOffset() > prefetch_off) {
+ (false && tail_prefetch_buffer->GetPrefetchOffset() > prefetch_off)) {
```
, then the UT will fail like below
```
[ RUN ] PrefetchTailTest/PrefetchTailTest.UpgradeToTailSizeInManifest/0
file/prefetch_test.cc:461: Failure
Expected: (db_open_file_read.count) < (num_index_partition), actual: 38 vs 33
Received signal 11 (Segmentation fault)
```
Reviewed By: pdillinger
Differential Revision: D46546707
Pulled By: hx235
fbshipit-source-id: 9897b0a975e9055963edac5451fd1cd9d6c45d0e
Summary:
Fix the error handling in `GetHostName` for non EFAULT, non EINVAL error. Current handling will cause stack overflow when non null-terminated c style string is in `name`, e.g. ENAMETOOLONG, when the `name` buffer is not big enough and the host name is truncated.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11544
Test Plan:
```
COMPILE_WITH_ASAN=1 make all check
```
Reviewed By: pdillinger
Differential Revision: D46775799
Pulled By: jowlyzhang
fbshipit-source-id: e0fc9400c50fe38bc1fd888b4fea5fe8706165bf
Summary:
This ticker combined with `rocksdb.files.marked.trash` can help give a better picture of how DeleteScheduler is keeping up.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11540
Test Plan:
```
./delete_scheduler_test
```
Reviewed By: ajkr
Differential Revision: D46746401
Pulled By: jowlyzhang
fbshipit-source-id: f3daa622aa3ddefe7d673e0cc257a47699d506df
Summary:
after https://github.com/facebook/rocksdb/issues/11321 and https://github.com/facebook/rocksdb/issues/11340 (both included in RocksDB v8.2), migration from `level_compaction_dynamic_level_bytes=false` to `level_compaction_dynamic_level_bytes=true` is automatic by RocksDB and requires no manual compaction from user. Making the option true by default as it has several advantages: 1. better space amplification guarantee (a more stable LSM shape). 2. compaction is more adaptive to write traffic. 3. automatic draining of unneeded levels. Wiki is updated with more detail: https://github.com/facebook/rocksdb/wiki/Leveled-Compaction#option-level_compaction_dynamic_level_bytes-and-levels-target-size.
The PR mostly contains fixes for unit tests as they assumed `level_compaction_dynamic_level_bytes=false`. Most notable change is commit f742be330c and b1928e42b3 which override the default option in DBTestBase to still set `level_compaction_dynamic_level_bytes=false` by default. This helps to reduce the change needed for unit tests. I think this default option override in unit tests is okay since the behavior of `level_compaction_dynamic_level_bytes=true` is tested by explicitly setting this option. Also, `level_compaction_dynamic_level_bytes=false` may be more desired in unit tests as it makes it easier to create a desired LSM shape.
Comment for option `level_compaction_dynamic_level_bytes` is updated to reflect this change and change made in https://github.com/facebook/rocksdb/issues/10057.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11525
Test Plan: `make -j32 J=32 check` several times to try to catch flaky tests due to this option change.
Reviewed By: ajkr
Differential Revision: D46654256
Pulled By: cbi42
fbshipit-source-id: 6b5827dae124f6f1fdc8cca2ac6f6fcd878830e1
Summary:
The original Feature Request is from [https://github.com/facebook/rocksdb/issues/11317](https://github.com/facebook/rocksdb/issues/11317).
Flink uses rocksdb as the state backend, all DB options are the same, and the keys of each DB instance are adjacent and there is no key overlap between two db instances.
In the Flink rescaling scenario, it is necessary to quickly split the DB according to a certain key range or quickly merge multiple DBs into one.
This PR is mainly used to quickly merge multiple DBs into one.
We hope to extend the function of `CreateColumnFamilyWithImports` to support creating ColumnFamily by importing multiple ColumnFamily with no overlapping keys.
The import logic is almost the same as `CreateColumnFamilyWithImport`, but it will check whether there is key overlap between CF when importing. The import will fail if there are key overlaps.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11378
Reviewed By: ajkr
Differential Revision: D46413709
Pulled By: cbi42
fbshipit-source-id: 846d0049fad11c59cf460fa846c345b26c658dfb
Summary:
Use another static object to join threads instead.
This change is motivated by a case in which some code using NewLRUCache() -> ShardedCacheBase -> SemiStructuredUniqueIdGen -> GenerateRawUniqueId() -> Env::Default() was happening
during static destruction.
I didn't see anything else in PosixEnv or base classes that would cause a problem by not
destroying. (WinEnv is already not destroyed; see env_default.cc)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11538UndefinedBehaviorSanitizer: undefined-behavior env/env_test.cc:3561:23 in
$
```
Test Plan:
test added, which would previously fail with UBSAN:
```
$ ./env_test --gtest_filter=*Destruct*
Note: Google Test filter = *Destruct*
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from EnvTestMisc
[ RUN ] EnvTestMisc.StaticDestruction
[ OK ] EnvTestMisc.StaticDestruction (0 ms)
[----------] 1 test from EnvTestMisc (0 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (0 ms total)
[ PASSED ] 1 test.
env/env_test.cc:3561:23: runtime error: member call on address 0x7f7b96671ca8 which does not point to an object of type 'rocksdb::Env'
0x7f7b96671ca8: note: object is of type 'N7rocksdb12ConfigurableE'
00 00 00 00 90 a7 f7 95 7b 7f 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
^~~~~~~~~~~~~~~~~~~~~~~
vptr for 'N7rocksdb12ConfigurableE'
Reviewed By: jowlyzhang
Differential Revision: D46737389
Pulled By: pdillinger
fbshipit-source-id: 0f80a443bf799ffc5641e898cf3a75f7d10a987b
Summary:
when a DB is configured with `allow_ingest_behind = true`, the last level should be reserved for ingested files and these files should not be included in any compaction. Currently, a major compaction can compact these files to smaller levels. This can cause future files to be rejected for ingest behind (see `ExternalSstFileIngestionJob::CheckLevelForIngestedBehindFile()`). This PR fixes the issue such that files in the last level is not included in any compaction.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11489
Test Plan: * Updated unit test `ExternalSSTFileTest.IngestBehind` to test that last level is not included in manual and auto-compaction.
Reviewed By: ajkr
Differential Revision: D46455711
Pulled By: cbi42
fbshipit-source-id: 5e2142c2a709ef932ad797897795021c06c4ac8c
Summary:
See "unreleased_history/new_features/obsolete_sst_files_size.md" for description
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11533
Test Plan: updated unit test
Reviewed By: jowlyzhang
Differential Revision: D46703152
Pulled By: ajkr
fbshipit-source-id: ea5e31cd6293eccc154130c13e66b5271f57c102
Summary:
the first CI step "Check buck targets and code format..." is failing with the following error message:
```
Download action repository 'wei/wget@v1' (SHA:c15e476d1463f4936cb54f882170d9d631f1aba5)
Error: An action could not be found at the URI 'c15e476d14'
```
Not sure why the action is lost, but it seems we can use wget directly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11532
Test Plan: watch CI job "Check buck targets and code format" passes
Reviewed By: ltamasi
Differential Revision: D46700626
Pulled By: cbi42
fbshipit-source-id: 53c09f27965444b533b3fe3755aec922571bba2c
Summary:
Add new tickers: `rocksdb.error.handler.bg.error.count`, `rocksdb.error.handler.bg.io.error.count`, `rocksdb.error.handler.bg.retryable.io.error.count` to replace the misspelled ones: `rocksdb.error.handler.bg.errro.count`, `rocksdb.error.handler.bg.io.errro.count`, `rocksdb.error.handler.bg.retryable.io.errro.count` ('error' instead of 'errro'). Users should switch to use the new tickers before 9.0 release as the misspelled old tickers will be completely removed then.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11509
Reviewed By: ltamasi
Differential Revision: D46542809
Pulled By: jowlyzhang
fbshipit-source-id: a2a6d8354af46a060de81d40ef6f5336a80bd32e
Summary:
The `rocksdb.db.get.micros` histogram is never updated if the DB is open in ReadOnly mode, as well as the `get_cpu_nanos` perf counter. An earlier PR (https://github.com/facebook/rocksdb/issues/4260) for some reason has only added the TODO line, not the accounting itself, so this one is intended to fix it, adding two lines to match [DBImplSecondary](4dafa5b220/db/db_impl/db_impl_secondary.cc (L366-L367)).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11521
Reviewed By: cbi42
Differential Revision: D46577330
Pulled By: jowlyzhang
fbshipit-source-id: be147923e763af32bbc18fd6bdf3aff8ebf08aee
Summary:
Fix a use-after-move issue in block.cc and added some unit tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11505
Test Plan:
```
make all check
./block_test
```
Reviewed By: ltamasi
Differential Revision: D46506188
Pulled By: jowlyzhang
fbshipit-source-id: 316ed8ddd221c00b2bce2cf9fd47eea686cd74a5
Summary:
Support was added in 8.1.0
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11517
Test Plan: comments only
Reviewed By: anand1976
Differential Revision: D46489929
Pulled By: pdillinger
fbshipit-source-id: 4fd30078389065c9ec225bf55b6773f1641f0646
Summary:
**Context:**
[PR11406](https://github.com/facebook/rocksdb/pull/11406/) caused more frequent read during db open reading files with no `tail_size` in the manifest as part of the upgrade to 11406. This is due to that PR introduced
- [smaller](https://github.com/facebook/rocksdb/pull/11406/files#diff-57ed8c49db2bdd4db7618646a177397674bbf25beacacecb104070071d30129fR833) prefetch tail buffer size compared to pre-11406 for small files (< 52 MB) when `tail_prefetch_stats` infers tail size to be 0 (usually happens when the stats does not have much historical data to infer early on)
- more read (up to # of partitioned filter/index) when such small prefetch tail buffer does not contain all the partitioned filter/index needed in CacheDependencies() since the [fallback logic](https://github.com/facebook/rocksdb/pull/11406/files#diff-d98f1a83de24412ad7f3527725dae7e28851c7222622c3cdb832d3cdf24bbf9fR165-R179) that prefetches all partitions at once will be [skipped](url) when such a small prefetch tail buffer is passed in
**Summary:**
- Revert the fallback prefetch buffer size change to preserve existing behavior fully during upgrading in `BlockBasedTable::PrefetchTail()`
- Use passed-in prefetch tail buffer in `CacheDependencies()` only if it has a smaller offset than the the offset of first partition filter/index, that is, at least as good as the existing prefetching behavior
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11516
Test Plan:
- db bench
Create db with small files prior to PR 11406
```
./db_bench -db=/tmp/testdb/ --partition_index_and_filters=1 --statistics=1 -benchmarks=fillseq -key_size=3200 -value_size=5 -num=1000000 -write_buffer_size=6550000 -disable_auto_compactions=true -compression_type=zstd`
```
Read db to see if post-pr has lower read qps (i.e, rocksdb.file.read.db.open.micros count) during db open.
```
./db_bench -use_direct_reads=1 --file_opening_threads=1 --threads=1 --use_existing_db=1 --seed=1682546046158958 --partition_index_and_filters=1 --statistics=1 --db=/tmp/testdb/ --benchmarks=readrandom --key_size=3200 --value_size=5 --num=100 --disable_auto_compactions=true --compression_type=zstd
```
Pre-PR:
```
rocksdb.file.read.db.open.micros P50 : 3.399023 P95 : 5.924468 P99 : 12.408333 P100 : 29.000000 COUNT : 611 SUM : 2539
```
Post-PR:
```
rocksdb.file.read.db.open.micros P50 : 593.736842 P95 : 861.605263 P99 : 1212.868421 P100 : 2663.000000 COUNT : 585 SUM : 345349
```
_Note: To control the starting offset of the prefetch tail buffer easier, I manually override the following to eliminate the effect of alignment_
```
class PosixRandomAccessFile : public FSRandomAccessFile {
virtual size_t GetRequiredBufferAlignment() const override {
- return logical_sector_size_;
+ return 1;
}
```
- CI
Reviewed By: pdillinger
Differential Revision: D46472566
Pulled By: hx235
fbshipit-source-id: 2fe14ac8d489d15b0e08e6f8fe4f46d5f110978e