Summary:
The bug can manifest in the following scenario. There must be two `CompactRange()`s, call them A and B. Compaction A must have `change_level=true`. Compactions A and B must run in parallel, and new data must be inserted while they run.
Now, on to the details of the race condition. Compaction A must reach the refitting phase while B's next step is to trivially move new data (i.e., data that has been inserted behind A) down to the same level that A's refit targets (`CompactRangeOptions::target_level`). B must be unregistered (i.e., it has not yet called `AddManualCompaction()` for the current `RunManualCompaction()`) while A invokes `DisableManualCompaction()` to prepare for refitting. In the old code, B could still proceed to register a manual compaction even though A had disabled manual compactions.
The next part of the race condition is that B picks and schedules a trivial move while A has released the lock during the refitting phase in order to persist the LSM state change (i.e., the log phase of `LogAndApply()`). That way, B does not see the refitted data when picking a trivial-move compaction, so it is susceptible to picking one that overlaps.
Finally, B executes the picked trivial-move compaction. Trivial-move compactions are special in that they never check whether manual compaction is disabled. So the overlapping compaction B picked ends up being applied, leading to LSM corruption if `force_consistency_checks=false`, or to entering read-only mode with `Status::Corruption` if `force_consistency_checks=true` (the default).
The fix is just to prevent B from registering itself in `RunManualCompaction()` while manual compactions are disabled, consequently preventing any trivial move or other compaction from being picked/scheduled.
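A standalone sketch of the idea (hypothetical class and names; the real patch lives in `DBImpl::RunManualCompaction()` and uses the existing `manual_compaction_paused_` machinery):
```
#include <condition_variable>
#include <mutex>

// Hypothetical model of the fix: DisableManualCompaction() (thread A) raises
// the pause count; RunManualCompaction() (thread B) may not register until it
// drops back to zero, so B can never pick a move over the in-flight refit.
class ManualCompactionGate {
 public:
  void Disable() {
    std::lock_guard<std::mutex> lk(mu_);
    ++paused_;
  }
  void Enable() {
    std::lock_guard<std::mutex> lk(mu_);
    --paused_;
    cv_.notify_all();
  }
  void WaitUntilRegistrationAllowed() {
    std::unique_lock<std::mutex> lk(mu_);
    cv_.wait(lk, [&] { return paused_ == 0; });
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  int paused_ = 0;
};
```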
Thanks to siying for finding the bug.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9077
Test Plan: The test does not go all the way in exposing the bug because doing so requires a compaction to be picked/scheduled while logging the LSM state change for `RefitLevel()`. But the fix is to make such a compaction never picked/scheduled in the first place, so any repro of that scenario would end up hanging `RefitLevel()`'s logging. So instead I just verified that no such compaction is registered in the scenario where `RefitLevel()` disables manual compactions.
Reviewed By: siying
Differential Revision: D31921908
Pulled By: ajkr
fbshipit-source-id: 9bb5d0e847ad428211227f40830c685c209fbecb
Summary:
Somewhat confusingly, index and filter partition blocks are
never owned by table readers, even with
cache_index_and_filter_blocks=false. They still go into the block cache
(possibly pinned by the table reader) if there is a block cache; if there is
no block cache, they are only loaded transiently on demand.
This PR primarily clarifies the options APIs and some internal code
comments.
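For illustration, a configuration sketch using standard `BlockBasedTableOptions` fields; per the clarified semantics, the partitions go through the block cache regardless of `cache_index_and_filter_blocks`:
```
#include "rocksdb/table.h"

rocksdb::BlockBasedTableOptions MakePartitionedTableOptions() {
  rocksdb::BlockBasedTableOptions bbto;
  bbto.index_type = rocksdb::BlockBasedTableOptions::kTwoLevelIndexSearch;
  bbto.partition_filters = true;
  // Partition blocks go through the block cache either way; this option only
  // governs ownership of the top-level index/filter, per the clarified docs.
  bbto.cache_index_and_filter_blocks = false;
  bbto.pin_top_level_index_and_filter = true;
  return bbto;
}
```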
Also, this closes a hypothetical data corruption vulnerability where
some but not all index partitions are pinned. I haven't been able to
reproduce a case where it can happen (the failure seems to propagate
to abort table open) but it's worth patching nonetheless.
Fixes https://github.com/facebook/rocksdb/issues/8979
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9068
Test Plan:
existing tests :-/ I could cover the new code using sync
points, but then I'd have to very carefully relax my `assert(false)`
Reviewed By: ajkr
Differential Revision: D31898284
Pulled By: pdillinger
fbshipit-source-id: f2511a7d3a36bc04b627935d8e6cfea6422f98be
Summary:
The comments in the `#endif` section at the end of the file were in the
wrong order.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9033
Reviewed By: mrambacher
Differential Revision: D31935856
Pulled By: ajkr
fbshipit-source-id: 24aca039993d6e27022cfe8d6434e90f2934c87c
Summary:
Force the POSIX locale for calls to 'git remote' that might have
locale-dependent formatting, as shown in a comment on https://github.com/facebook/rocksdb/issues/8731
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9079
Test Plan:
manual (haven't tried on a machine with a non-English default
locale)
Reviewed By: ltamasi
Differential Revision: D31943092
Pulled By: pdillinger
fbshipit-source-id: 7dbe5915824f39f73b412cc3d1a86a2521cf76c1
Summary:
https://github.com/facebook/rocksdb/pull/7898 enabled the write buffer manager to stall writes when memory usage exceeds the buffer size. This is really useful for limiting memory usage when running in containers. However, this feature is not yet visible in RocksJava.
This PR introduces the feature to RocksJava.
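The Java API wraps the existing C++ feature; for reference, a sketch of the underlying C++ setup from https://github.com/facebook/rocksdb/pull/7898 (buffer size arbitrary):
```
#include <memory>
#include "rocksdb/options.h"
#include "rocksdb/write_buffer_manager.h"

rocksdb::Options MakeOptionsWithWriteStall() {
  rocksdb::Options options;
  // Cap total memtable memory at 256MB; when the limit is exceeded, stall
  // writers instead of letting memory usage grow past the budget.
  options.write_buffer_manager = std::make_shared<rocksdb::WriteBufferManager>(
      256 << 20, /*cache=*/nullptr, /*allow_stall=*/true);
  return options;
}
```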
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9076
Reviewed By: akankshamahajan15
Differential Revision: D31931092
Pulled By: anand1976
fbshipit-source-id: 5531c16a87598663a02368c07b5e13a503164578
Summary:
This PR adds support for building on s390x including updating travis CI. It uses the previous work in https://github.com/facebook/rocksdb/pull/6168 and adds some more changes to get all current tests (make check and jni tests) to pass. The tests were run with snappy, lz4, bzip2 and zstd all compiled in.
There are a few pieces still needed to get the travis build working that I don't think I can do. adamretter, is this something you could help with?
1. A prebuilt https://rocksdb-deps.s3-us-west-2.amazonaws.com/cmake/cmake-3.14.5-Linux-s390x.deb package
2. A https://hub.docker.com/r/evolvedbinary/rocksjava s390x image
Not sure if there is more required for travis. Happy to help in any way I can.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8962
Reviewed By: mrambacher
Differential Revision: D31802198
Pulled By: pdillinger
fbshipit-source-id: 683511466fa6b505f85ba5a9964a268c6151f0c2
Summary:
In atomic flush, concurrent background flush threads will commit to the MANIFEST
one by one, in the order of the IDs of their picked memtables for all included column
families. Each time, a background flush thread decides whether to wait based on two
criteria:
- Is db stopped? If so, don't wait.
- Am I the one to commit the currently earliest memtable? If so, don't wait; proceed.
When atomic flush was implemented, an error writing to or syncing the MANIFEST would
cause the db to be stopped. Therefore, a background flush thread did not have to check
for a background error while waiting: if there had been such an error, `DBStopped()`
would have been true, and the thread would **not** wait forever.
After we improved error handling, RocksDB may map an IOError while writing to the MANIFEST
to a soft error if there is no WAL. This requires the background threads to check for a
background error while waiting; otherwise, a background flush thread may wait forever.
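A standalone sketch of the corrected wait predicate (hypothetical names, not the literal patch):
```
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Sketch: the predicate must also wake on a background error, since a soft
// error no longer implies DBStopped().
struct CommitWaiter {
  std::mutex mu;
  std::condition_variable cv;
  bool db_stopped = false;
  bool bg_error = false;  // new condition checked by this fix
  uint64_t earliest_pending_id = 0;

  // Returns true iff this thread may commit memtable batch `my_id`.
  bool WaitForTurn(uint64_t my_id) {
    std::unique_lock<std::mutex> lk(mu);
    cv.wait(lk, [&] {
      return db_stopped || bg_error || earliest_pending_id == my_id;
    });
    return !db_stopped && !bg_error;
  }
};
```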
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9034
Test Plan: make check
Reviewed By: zhichao-cao
Differential Revision: D31639225
Pulled By: riversand963
fbshipit-source-id: e9ab07c4d8f2eade238adeefe3e42dd9a5a3ebbd
Summary:
This PR does not change code semantics; it just makes the following cleanups:
1. remove the unused local objects `nonmem_w` and `lfile`
2. remove unnecessary null checks before `delete ptr` (deleting a null pointer is a no-op)
3. use `unique_ptr::reset` instead of `release` + `delete` (see the sketch after this list)
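For illustration of items 2 and 3, a generic standalone example (not RocksDB code):
```
#include <memory>

struct Resource {};

int main() {
  Resource* raw = nullptr;
  delete raw;  // item 2: deleting a null pointer is a no-op, no check needed

  std::unique_ptr<Resource> p(new Resource);
  delete p.release();  // old two-step pattern

  std::unique_ptr<Resource> q(new Resource);
  q.reset();  // item 3: one call deletes the object and nulls the pointer
}
```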
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9052
Reviewed By: zhichao-cao
Differential Revision: D31801661
Pulled By: anand1976
fbshipit-source-id: 16a77d45da8c8833bf5bf3bce546bb3711b335df
Summary:
This PR has no semantic changes; it just makes the code shorter.
`stats_` has the same value as `immutable_db_options_.stats`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9053
Reviewed By: zhichao-cao
Differential Revision: D31801603
Pulled By: anand1976
fbshipit-source-id: cbd8fe478d3e90ae078ace49b4f2eb9bb028ccf6
Summary:
This PR supports querying `GetMapProperty()` with "rocksdb.dbstats" to get the DB-level stats in a map format. It only reports cumulative stats over the DB lifetime and, as such, does not update the baseline for interval stats. Like other map properties, the string keys are not (yet) exposed in the public API.
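A usage sketch (assumes an already-open `db`):
```
#include <cstdio>
#include <map>
#include <string>
#include "rocksdb/db.h"

void PrintDBStats(rocksdb::DB* db) {
  std::map<std::string, std::string> stats;
  // "rocksdb.dbstats" is exposed as DB::Properties::kDBStats; values are
  // cumulative over the DB lifetime and don't reset the interval baseline.
  if (db->GetMapProperty(rocksdb::DB::Properties::kDBStats, &stats)) {
    for (const auto& kv : stats) {
      std::printf("%s = %s\n", kv.first.c_str(), kv.second.c_str());
    }
  }
}
```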
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9057
Test Plan: new unit test
Reviewed By: zhichao-cao
Differential Revision: D31781495
Pulled By: ajkr
fbshipit-source-id: 6f77d3aee8b4b1a015061b8c260a123859ceaf9b
Summary:
This commit introduces incremental compaction in universal style to reduce space amplification. This follows the first improvement mentioned in https://rocksdb.org/blog/2021/04/12/universal-improvements.html . The implementation simply picks files of about `max_compaction_bytes` in total size to compact and executes the compaction if the penalty is not too big. More optimizations can be done in the future, e.g. prioritizing between this compaction and other types. But for now, the feature is supposed to be functional and can often reduce the frequency of full compactions, although it can introduce a penalty.
In order to cut files more efficiently so that more files from upper levels can be included, the SST file cutting threshold (for the current file plus overlapping parent-level files) is set to 1.5X the target file size. A 2MB target file size will generate files like this: https://gist.github.com/siying/29d2676fba417404f3c95e6c013c7de8 The number of files indeed increases, but it is not out of control.
Two sets of write benchmarks were run:
1. For the rate-limited ingestion scenario, we can see full compaction is mostly eliminated: https://gist.github.com/siying/959bc1186066906831cf4c808d6e0a19 . The write amp increased from 7.7 to 9.4, as expected. After applying file cutting, the number improved to 8.9. In another benchmark, the write amp is even better with the incremental approach: https://gist.github.com/siying/d1c16c286d7c59c4d7bba718ca198163
2. For the unlimited ingestion rate scenario, incremental compaction turns out to be too expensive most of the time and is not executed, as expected.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8655
Test Plan: Add unit tests to the functionality.
Reviewed By: ajkr
Differential Revision: D31787034
fbshipit-source-id: ce813e63b15a61d5a56e97bf8902a1b28e011beb
Summary:
Currently, if a SecondaryCache is provided to the LRU cache, it is used by default. We add `CacheTier` to advanced_options.h to describe the cache tier being used, and add a `lowest_used_cache_tier` option to `DBOptions` (immutable), passing it to BlockBasedTableReader to decide whether the secondary cache will be used. By default it is `CacheTier::kNonVolatileTier`, which means we always use both the block cache (kVolatileTier) and the secondary cache (kNonVolatileTier). By setting it to `CacheTier::kVolatileTier`, the DB will not use the secondary cache.
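A configuration sketch using the names from this summary:
```
#include "rocksdb/options.h"

rocksdb::Options MakeBlockCacheOnlyOptions() {
  rocksdb::Options options;
  // Per this summary, the default (kNonVolatileTier) uses both the block
  // cache and the secondary cache; restricting to the volatile tier makes
  // the DB skip the secondary cache entirely.
  options.lowest_used_cache_tier = rocksdb::CacheTier::kVolatileTier;
  return options;
}
```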
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9050
Test Plan: added new tests
Reviewed By: anand1976
Differential Revision: D31744769
Pulled By: zhichao-cao
fbshipit-source-id: a0575ebd23e1c6dfcfc2b4c8578764e73b15bce6
Summary:
Right now, when picking a compaction, grandparent files are taken from output_level + 1. This usually works, but if that level doesn't have any overlapping file, it is more efficient to go further down. This is because the output files are likely to be trivially moved further down and might then violate max_compaction_bytes. This situation can naturally happen and might happen even more with TTL compactions. There is no harm in fixing it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9051
Test Plan: Run existing tests and see it passes. Also briefly run crash test.
Reviewed By: ajkr
Differential Revision: D31748829
fbshipit-source-id: 52b99ab4284dc816d22f34406d528a3c98ff6719
Summary:
... by bypassing tracking of last_key in BlockBuilder when
last_key is already known (for BlockBasedTableBuilder::data_block).
I tried extracting a base class of BlockBuilder without the last_key
tracking at all, but that became complicated by NewFlushBlockPolicy() in
the public API referencing BlockBuilder, which would need to be the base
class, and I don't want to replace nearly all the internal references to
BlockBuilder.
Possible follow-up:
* Investigate / consider using AddWithLastKey in more places
This improvement should stack with https://github.com/facebook/rocksdb/issues/9039
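A sketch of the new hook's shape; the exact parameters here are an assumption (see block_builder.h for the real declaration):
```
#include "rocksdb/slice.h"

// Sketch only: callers like BlockBasedTableBuilder already track the previous
// data-block key, so they can hand it in rather than have BlockBuilder copy
// every added key into its own last-key buffer.
class BlockBuilderSketch {
 public:
  // Original path: must copy `key` internally for the next delta encoding.
  void Add(const rocksdb::Slice& key, const rocksdb::Slice& value);
  // New path: the caller supplies the previous key, so no internal copy is kept.
  void AddWithLastKey(const rocksdb::Slice& key, const rocksdb::Slice& value,
                      const rocksdb::Slice& last_key);
};
```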
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9040
Test Plan:
TEST_TMPDIR=/dev/shm/rocksdb1 ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=50000000
Compiled with DEBUG_LEVEL=0
Test and control runs were simultaneous for better accuracy; units = ops/sec
Run 1: 278929 vs. 267799 (+4.2%)
Run 2: 281836 vs. 267432 (+5.4%)
Run 3: 278279 vs. 270454 (+2.9%)
(This benchmark is chosen to have detectable signal-to-noise, not to
represent expected improvement percent on real workloads.)
Reviewed By: mrambacher
Differential Revision: D31706033
Pulled By: pdillinger
fbshipit-source-id: 8a50fe6fefdd67b6d7665ffa687bbdcf5ad0d5ec
Summary:
Was not handling the case of `OnTableFileCreated()` being invoked for a
table file that was NOT created.
Also improved error reporting and caught a missing status check.
Also strengthened the db_stress listener to require file_size > 0 when
status.ok(). We would be violating the API contract if status is OK and
we didn't create a valid SST file.
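A sketch of what a correct listener now looks like (hypothetical listener class, standard `EventListener` API):
```
#include <cassert>
#include "rocksdb/listener.h"

class FileCreationChecker : public rocksdb::EventListener {
 public:
  void OnTableFileCreated(const rocksdb::TableFileCreationInfo& info) override {
    if (!info.status.ok()) {
      // The callback also fires when no table file was created; previously
      // this case was mishandled.
      return;
    }
    // Per the API contract, an OK status implies a valid, non-empty SST file.
    assert(info.file_size > 0);
  }
};
```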
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9054
Test Plan: make blackbox_crash_test for a while
Reviewed By: zhichao-cao
Differential Revision: D31765200
Pulled By: pdillinger
fbshipit-source-id: 7c527f5531bc239a5efd7a7b018545d480f926e2
Summary:
Adds changes to DBOptions (comparable to ColumnFamilyOptions) to allow some option values to be ignored on rehydration from the Options file. This is necessary for some customizable classes that were not registered with the ObjectRegistry but are saved/restored from the Options file.
All tests pass. Will run check_format_compatible.sh shortly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9045
Reviewed By: zhichao-cao
Differential Revision: D31761664
Pulled By: mrambacher
fbshipit-source-id: 300c2251639cce2b223481c3bb2a63877b1f3766
Summary:
Implementation of https://github.com/facebook/rocksdb/issues/8221, plus an extension of the Java options API to allow the get() of options from RocksDB. The extension allows more comprehensive testing of options on the Java side, by validating that the options are set on the C++ side.
Variations on methods:
MutableColumnFamilyOptions.MutableColumnFamilyOptionsBuilder getOptions()
MutableDBOptions.MutableDBOptionsBuilder getDBOptions()
retrieve the options via RocksDB C++ interfaces, and parse the resulting string into one of the Java-style option objects.
This necessitated generalising the parsing of option strings in Java, which now parses the full range of option strings returned by the C++ interface, rather than just a useful subset. As a result, the list separator was changed from `,` (comma) to `:` (colon).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8999
Reviewed By: jay-zhuang
Differential Revision: D31655487
Pulled By: ltamasi
fbshipit-source-id: c38e98145c81c61dc38238b0df580db176ce4efd
Summary:
* New public header unique_id.h and function GetUniqueIdFromTableProperties,
which computes a universally unique identifier based on the table properties
of table files from recent RocksDB versions. (Usage sketch below, after the
follow-up list.)
* Generation of DB session IDs is refactored so that they are
guaranteed unique in the lifetime of a process running RocksDB.
(SemiStructuredUniqueIdGen, new test included.) Along with file numbers,
this enables SST unique IDs to be guaranteed unique among SSTs generated
in a single process, and "better than random" between processes.
See https://github.com/pdillinger/unique_id
* In addition to public API producing 'external' unique IDs, there is a function
for producing 'internal' unique IDs, with functions for converting between the
two. In short, the external ID is "safe" for things people might do with it, and
the internal ID enables more "power user" features for the future. Specifically,
the external ID goes through a hashing layer so that any subset of bits in the
external ID can be used as a hash of the full ID, while also preserving
uniqueness guarantees in the first 128 bits (bijective both on first 128 bits
and on full 192 bits).
Intended follow-up:
* Use the internal unique IDs in cache keys. (Avoid conflicts with https://github.com/facebook/rocksdb/issues/8912) (The file offset can be XORed into
the third 64-bit value of the unique ID.)
* Publish the external unique IDs in FileStorageInfo (https://github.com/facebook/rocksdb/issues/8968)
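A hedged usage sketch of the external-ID API; the exact namespace and signature of `GetUniqueIdFromTableProperties` are assumptions here (check unique_id.h), as is the helper name:
```
#include <string>
#include "rocksdb/db.h"
#include "rocksdb/table_properties.h"
#include "rocksdb/unique_id.h"

// Hypothetical helper: collect the external unique ID of every live SST.
void CollectSstUniqueIds(rocksdb::DB* db) {
  rocksdb::TablePropertiesCollection all_props;
  if (!db->GetPropertiesOfAllTables(&all_props).ok()) {
    return;
  }
  for (const auto& file_and_props : all_props) {
    std::string id;
    // Assumed signature: Status(const TableProperties&, std::string*).
    rocksdb::Status s =
        rocksdb::GetUniqueIdFromTableProperties(*file_and_props.second, &id);
    if (s.ok()) {
      // `id` holds the external unique ID (192 bits; first 128 also unique).
    }
  }
}
```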
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8990
Test Plan:
Unit tests added, and checking of unique ids in stress test.
NOTE in stress test we do not generate nearly enough files to thoroughly
stress uniqueness, but the test trims off pieces of the ID to check for
uniqueness so that we can infer (with some assumptions) stronger
properties in the aggregate.
Reviewed By: zhichao-cao, mrambacher
Differential Revision: D31582865
Pulled By: pdillinger
fbshipit-source-id: 1f620c4c86af9abe2a8d177b9ccf2ad2b9f48243
Summary:
Add a property_bag option in FileOptions for direct FileSystem users to pass custom properties to the provider in APIs such as NewRandomAccessFile, NewWritableFile etc. This field will be ignored/not populated by RocksDB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9030
Reviewed By: zhichao-cao
Differential Revision: D31630643
Pulled By: anand1976
fbshipit-source-id: 1e1ddc5e2933ecada99a94eada5f309b674a03e8
Summary:
`dbs` should not be cleared, as it is reused later when reopening the DBs, so we have an out-of-bounds access with `dbnames[dbnum]`. The values left in the vector don't need to be reset, as the db pointer is an out parameter for `DB::Open`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9046
Reviewed By: pdillinger
Differential Revision: D31738263
Pulled By: ot
fbshipit-source-id: c619e947b8d3dbc3d896f29971f093d3e3c794d3
Summary:
If `DB::Close()` is called in a multi-threaded environment, resources
could be double released, causing an exception or assertion failure.
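A usage sketch of the scenario this fixes (hypothetical harness):
```
#include <thread>
#include <vector>
#include "rocksdb/db.h"

void CloseFromAllThreads(rocksdb::DB* db) {
  std::vector<std::thread> threads;
  for (int i = 0; i < 8; ++i) {
    threads.emplace_back([db] {
      // With this fix, concurrent Close() calls no longer double release
      // resources; extra callers just get a Status back.
      db->Close().PermitUncheckedError();
    });
  }
  for (auto& t : threads) {
    t.join();
  }
}
```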
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8970
Test Plan:
Test with multi-thread benchmark, with each thread try to
close the DB at the end.
Reviewed By: pdillinger
Differential Revision: D31242042
Pulled By: jay-zhuang
fbshipit-source-id: a61276b1b61e07732e375554106946aea86a23eb
Summary:
WaitForFlushMemTable() may only wait for the memtable flush, not for the background
flush job to finish, so the obsolete file may not have been purged yet.
fcaa7ff638/db/db_impl/db_impl_compaction_flush.cc (L2200-L2203)
Use WaitForCompact() instead to wait for background flush job.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9049
Test Plan: `gtest-parallel ./obsolete_files_test --gtest_filter=ObsoleteFilesTest.DeleteObsoleteOptionsFile -r 1000`
Reviewed By: zhichao-cao
Differential Revision: D31737343
Pulled By: jay-zhuang
fbshipit-source-id: 82276ebeae7c7c75a733d3e1fd1c130d45e4761f
Summary:
https://github.com/facebook/rocksdb/issues/8031 broke internal tests. This should fix them while preserving
the intended capability of getting the git commit id when hg is not used.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9047
Test Plan: already broken ¯\\_(ツ)_/¯
Reviewed By: zhichao-cao
Differential Revision: D31732198
Pulled By: pdillinger
fbshipit-source-id: 7dba8531ddca55a6de5e04978a1a1601aae4cee9
Summary:
Primarily, this change reserves space in the std::string for building
the next block once a block is finished, using `block_size` as
reservation size. Note: I also tried reusing the same std::string in the
common "unbuffered" path, but that showed no benefit or regression.
Secondarily, this slightly reduces the work in resetting `restarts_`.
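A sketch of the primary change's shape (hypothetical member names, not the literal patch):
```
#include <cstddef>
#include <string>

// Sketch: on Reset(), pre-size the buffer for the next block so appends while
// building don't repeatedly regrow the std::string.
struct BlockBufferSketch {
  std::string buffer;
  size_t block_size_hint = 4096;

  void Reset() {
    buffer.clear();                   // keeps existing capacity
    buffer.reserve(block_size_hint);  // primary change: reserve up front
  }
};
```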
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9039
Test Plan:
TEST_TMPDIR=/dev/shm/rocksdb1 ./db_bench -benchmarks=fillseq -memtablerep=vector -allow_concurrent_memtable_write=false -num=50000000
Compiled with DEBUG_LEVEL=0
Test and control runs were simultaneous for better accuracy; units = ops/sec
Run 1, Primary change only: 292697 vs. 280267 (+4.4%)
Run 2, Primary change only: 288763 vs. 279621 (+3.3%)
Run 1, Secondary change only: 260065 vs. 254232 (+2.3%)
Run 2, Secondary change only: 275925 vs. 272248 (+1.4%)
Run 1, Both changes: 284890 vs. 270372 (+5.3%)
Run 2, Both changes: 263511 vs. 258188 (+2.0%)
Reviewed By: zhichao-cao
Differential Revision: D31701253
Pulled By: pdillinger
fbshipit-source-id: 7e40810afbb98e6b6446955e77bda59e69b19ffd
Summary:
New classes FileStorageInfo and LiveFileStorageInfo and
'experimental' function DB::GetLiveFilesStorageInfo, which is intended
to largely replace several fragmented DB functions needed to create
checkpoints and backups.
This function is now used to create checkpoints and backups, because
it fixes many (probably not all) of the prior complexities of checkpoint
not having atomic access to DB metadata. This also ensures strong
functional test coverage of the new API. Specifically, much of the old
CheckpointImpl::CreateCustomCheckpoint has been migrated to and
updated in DBImpl::GetLiveFilesStorageInfo, with the former now
calling the latter.
Also, the class FileStorageInfo in metadata.h compatibly replaces
BackupFileInfo and serves as a new base class for SstFileMetaData.
Some old fields of SstFileMetaData are still provided (for now) but
deprecated.
Although FileStorageInfo::directory is accurate when using db_paths
and/or cf_paths, these have never been supported by Checkpoint
nor BackupEngine and still are not. This change does now detect
these cases and return NotSupported when appropriate. (More work
needed for support.)
Somehow this change broke ProgressCallbackDuringBackup, but
the progress_callback logic was dubious to begin with because it
would call the callback based on copy buffer size, not size actually
copied. Logic and test updated to track size actually copied
per-thread.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8968
Test Plan:
tests updated.
DB::GetLiveFilesStorageInfo mostly tested by use in CheckpointImpl.
DBTest.SnapshotFiles updated to also test GetLiveFilesStorageInfo,
including reading the data after DB close.
Added CheckpointTest.CheckpointWithDbPath (NotSupported).
Reviewed By: siying
Differential Revision: D31242045
Pulled By: pdillinger
fbshipit-source-id: b183d1ce9799e220daaefd6b3b5365d98de676c0
Summary:
The first parameter of SyncPoint::Process is "const std::string&". The majority, maybe all, of the actual calls to this function use a "const char *". The conversion before entering the function requires constructing a std::string object on the heap. This std::string is then typically not needed, because the first use of the string is as a rocksdb::Slice, which has a less costly conversion from char* to slice.
Example:
We have a load-and-iterate test. The test loads 10m keys and iterates over most of them via 10 rocksdb::Iterator objects. We used TCMALLOC to gather information about allocation and space usage during iteration.
- Before this PR: test took 32 min 17 sec
- After this PR: test took 1 min 14 sec
The TCMALLOC top object list before this PR:
<pre>
Total: 5105999 objects
5003717 98.0% 98.0% 5009471 98.1% rocksdb::DBIter::MergeValuesNewToOld (inline)
20260 0.4% 98.4% 20260 0.4% std::__cxx11::basic_string::_M_mutate
15214 0.3% 98.7% 15214 0.3% rocksdb::UncompressBlockContentsForCompressionType (inline)
13408 0.3% 99.0% 13408 0.3% std::_Rb_tree::_M_emplace_hint_unique [clone .constprop.416] (inline)
12957 0.3% 99.2% 12957 0.3% std::_Rb_tree::_M_emplace_hint_unique [clone .constprop.405] (inline)
9327 0.2% 99.4% 9327 0.2% std::_Rb_tree::_M_copy (inline)
7691 0.2% 99.5% 9919 0.2% JVM_FindSignal
2859 0.1% 99.6% 2859 0.1% rocksdb::Cleanable::RegisterCleanup
2844 0.1% 99.7% 2844 0.1% std::map::operator[] (inline)
</pre>
The "MergeValuesNewToOld (inline)" objects are the #define wrappers around SyncPoint::Process. We discovered this in a 5.18 rocksdb release, where TCMALLOC was more specific: a std::basic_string was being constructed. I believe that was before SyncPoint::Process was declared inline in subsequent releases.
The TCMALLOC top object list after this PR:
<pre>
Total: 104911 objects
45090 43.0% 43.0% 45090 43.0% rocksdb::Cleanable::RegisterCleanup
29995 28.6% 71.6% 29995 28.6% rocksdb::LRUCacheShard::Insert
15229 14.5% 86.1% 15229 14.5% rocksdb::UncompressBlockContentsForCompressionType (inline)
4373 4.2% 90.3% 4551 4.3% JVM_FindSignal
2881 2.7% 93.0% 2881 2.7% rocksdb::::ReadBlockFromFile (inline)
1162 1.1% 94.1% 1176 1.1% rocksdb::BlockFetcher::ReadBlockContents (inline)
1036 1.0% 95.1% 1036 1.0% std::__cxx11::basic_string::_M_mutate
869 0.8% 95.9% 869 0.8% std::vector::_M_realloc_insert (inline)
806 0.8% 96.7% 806 0.8% SnmpAgent::GetVariables (inline)
</pre>
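A sketch of the call-site cost being removed (hypothetical function names; the Slice behavior is standard):
```
#include <string>
#include "rocksdb/slice.h"

// Hypothetical signatures contrasting the two parameter types:
void ProcessByString(const std::string& point);     // forces an allocation
void ProcessBySlice(const rocksdb::Slice& point);   // wraps the pointer only

void CallSites() {
  // Constructs a temporary std::string (heap-allocated at this length):
  ProcessByString("DBImpl::MergeValuesNewToOld:1");
  // Wraps the literal in a (pointer, length) pair; no allocation:
  ProcessBySlice("DBImpl::MergeValuesNewToOld:1");
}
```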
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9023
Reviewed By: pdillinger
Differential Revision: D31610907
Pulled By: mrambacher
fbshipit-source-id: 574ff51b639dd46ad253a8e664a575f06b7cc85d
Summary:
EncryptedWritableFile is derived from FSWritableFile, which implements
the `IsSyncThreadSafe()` function as
bool IsSyncThreadSafe() const { return false; }
EncryptedWritableFile does not override this method from the base class,
so the `IsSyncThreadSafe()` function on an EncryptedWritableFile will
always return false.
This change adds an override of `IsSyncThreadSafe()` to
EncryptedWritableFile so that the latter will now ask its underlying
`file_` object for the thread-safety of sync operations.
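A sketch of the added override, per the description above:
```
// Sketch: delegate to the wrapped file instead of inheriting `return false;`.
bool EncryptedWritableFile::IsSyncThreadSafe() const {
  return file_->IsSyncThreadSafe();
}
```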
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8993
Reviewed By: jay-zhuang
Differential Revision: D31613123
Pulled By: ajkr
fbshipit-source-id: b18625e21a9911744eef3215c29913490e4b6001
Summary:
`valueIndexMap_` in the histogram is redundant, and searching `valueIndexMap_` is slower than searching `bucketValues_`.
This PR deletes `valueIndexMap_` and searches `bucketValues_` via `std::lower_bound` (sketched below).
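A sketch of the replacement lookup (hypothetical free function; the real change lives in the histogram code):
```
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// bucket_values is sorted ascending, so std::lower_bound can find the bucket
// index directly, with no separate value-to-index map.
size_t BucketIndexFor(const std::vector<uint64_t>& bucket_values,
                      uint64_t value) {
  auto it = std::lower_bound(bucket_values.begin(), bucket_values.end(), value);
  // Values past the largest bucket fall into the last bucket.
  if (it == bucket_values.end()) {
    return bucket_values.size() - 1;
  }
  return static_cast<size_t>(it - bucket_values.begin());
}
```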
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8625
Reviewed By: zhichao-cao
Differential Revision: D31613386
Pulled By: ajkr
fbshipit-source-id: d7415d724f5c8f41f80cbe82afd7467cfad6f009
Summary:
Running `apt install clang-format` on Ubuntu provides `clang-format-diff` rather than `clang-format-diff.py`. So I think it makes sense to make the format script compatible with this behavior.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9028
Reviewed By: ajkr
Differential Revision: D31634041
Pulled By: jay-zhuang
fbshipit-source-id: b936de791ddcafa6ff304039ef33936e1e04864d
Summary:
closes https://github.com/facebook/rocksdb/issues/8039
Unnecessary use of multiple local JNI references at the same time, 1 per key, was limiting the size of the key array. The local references don't need to be held simultaneously, so if we rearrange the code we can make it work for bigger key arrays.
Incidentally, make errors throw helpful exception messages rather than returning a null pointer.
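A sketch of the rearranged pattern on the native side (hypothetical helper; the JNI calls themselves are standard):
```
#include <jni.h>
#include <string>
#include <vector>

// Copy the keys out one at a time, releasing each local reference before
// taking the next, so the local reference table never holds one per key.
std::vector<std::string> CopyKeys(JNIEnv* env, jobjectArray jkeys) {
  std::vector<std::string> keys;
  const jsize len = env->GetArrayLength(jkeys);
  for (jsize i = 0; i < len; ++i) {
    jbyteArray jkey =
        static_cast<jbyteArray>(env->GetObjectArrayElement(jkeys, i));
    const jsize key_len = env->GetArrayLength(jkey);
    std::string key(static_cast<size_t>(key_len), '\0');
    env->GetByteArrayRegion(jkey, 0, key_len,
                            reinterpret_cast<jbyte*>(&key[0]));
    env->DeleteLocalRef(jkey);  // drop the reference before the next iteration
    keys.push_back(std::move(key));
  }
  return keys;
}
```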
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9012
Reviewed By: mrambacher
Differential Revision: D31580862
Pulled By: jay-zhuang
fbshipit-source-id: ce05831d52ede332e1b20e74d2dc621d219b9616
Summary:
Set the perf_level in `tools/regression_test.sh` in order to exercise `PerfContext` counters in regression tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8031
Test Plan: Manually run the script
Reviewed By: ajkr
Differential Revision: D31508269
Pulled By: anand1976
fbshipit-source-id: 20ddfd1cbca37f1439eed2870086a86d90653b44
Summary:
The code in `IngestExternalFiles()` that bumps the DB's sequence number
depending on what seqnos were assigned to the files has 3 bugs:
1) There is an assertion that the sequence number is increased in all the
affected column families, but this is unnecessary; it is fine if some files
stick to a lower sequence number. It is very easy to hit the assertion: it is
sufficient to insert 2 files in 2 CFs, one which overlaps the CF and one that
doesn't (for example the CF is empty). The line added in the
`IngestFilesIntoMultipleColumnFamilies_Success` test makes the assertion fail.
2) SetLastSequence() is called with the sum of all the bumps across CFs, but we
should take the maximum instead, as all CFs start with the current seqno and bump
it independently (see the sketch after this list).
3) The code above is accidentally under a `#ifndef NDEBUG`, so it doesn't run in
optimized builds, so some files may be assigned seqnos from the future.
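A sketch of the fix for bug 2 (hypothetical names):
```
#include <algorithm>
#include <cstdint>
#include <vector>

// Every CF starts from the same last_seqno and bumps independently, so the
// DB-wide sequence must advance by the maximum bump, not the sum.
uint64_t NewLastSequence(uint64_t last_seqno,
                         const std::vector<uint64_t>& seqno_bump_per_cf) {
  uint64_t max_bump = 0;
  for (uint64_t bump : seqno_bump_per_cf) {
    max_bump = std::max(max_bump, bump);
  }
  return last_seqno + max_bump;
}
```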
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9005
Test Plan:
Added line in `IngestFilesIntoMultipleColumnFamilies_Success` that
triggers the assertion, verified that the test (and all the others) pass after
the fix.
Reviewed By: ajkr
Differential Revision: D31597892
Pulled By: ot
fbshipit-source-id: c2d3237f90290df1178736ace8653a9623f5a770
Summary:
The patch adds a new BlobDB benchmarking script called `run_blob_bench.sh`.
It is a thin wrapper around `benchmark.sh` (similarly to `run_flash_bench.sh`):
it actually calls `benchmark.sh` a number of times, cycling through six workloads,
two write-only ones (bulk load and overwrite), two read/write ones (point lookups
while writing, range scans while writing), and two read-only ones (point lookups
and range scans).
Note: this is a simpler/cleaned up/reworked version of the script used to produce the
benchmark results in http://rocksdb.org/blog/2021/05/26/integrated-blob-db.html .
The new version takes advantage of several recent `benchmark.sh` improvements
like the ability to pass in arbitrary `db_bench` options or the possibility of using a
job ID.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9015
Test Plan: Ran the script manually with different parameter combinations.
Reviewed By: riversand963
Differential Revision: D31555277
Pulled By: ltamasi
fbshipit-source-id: 0e151b2f7b2cf6f66ed7f95455571492ad7ea87f
Summary:
EndWriteStall has a data race: `queue_.empty()` is checked outside of the
mutex, so once we enter the critical section another thread may already have
cleared the list, and accessing the `front()` is undefined behavior (and causes
interesting crashes under high concurrency).
This PR fixes the bug, and also rewrites the logic to make it easier to reason
about it. It also fixes another subtle bug: if some writers are stalled and
`SetBufferSize(0)` is called, which disables the WBM, the writers are not
unblocked because of an early `enabled()` check in `EndWriteStall()`.
It doesn't significantly change the locking behavior, as before writers won't
lock unless entering a stall condition, and `FreeMem` almost always locks if
stalling is allowed, but that is inevitable with the current design. Liveness is
guaranteed by the fact that if some writes are blocked, eventually all writes
will be blocked due to `stall_active_`, and eventually all memory is freed.
While at it, do a couple of optimizations:
- In `WBMStallInterface::Signal()`, signal the CV only after releasing the
lock (sketched after this list). Signaling under the lock is a common pitfall, as it causes the woken-up
thread to immediately go back to sleep because the mutex is still locked by
the awaker.
- Move all allocations and deallocations outside of the lock.
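A minimal generic sketch of the signal-after-unlock pattern from the first bullet:
```
#include <condition_variable>
#include <mutex>

std::mutex mu;
std::condition_variable cv;
bool stalled = true;

void EndStall() {
  {
    std::lock_guard<std::mutex> lk(mu);
    stalled = false;
  }  // release the mutex first...
  cv.notify_one();  // ...so the woken thread doesn't immediately block on it
}
```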
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9009
Test Plan:
```
USE_CLANG=1 make -j64 all check
```
Reviewed By: akankshamahajan15
Differential Revision: D31550668
Pulled By: ot
fbshipit-source-id: 5125387c3dc7ecaaa2b8bbc736e58c4156698580
Summary:
The current BlobDB garbage collection logic works by relocating the valid
blobs from the oldest blob files as they are encountered during compaction,
and cleaning up blob files once they contain nothing but garbage. However,
with sufficiently skewed workloads, it is theoretically possible to end up in a
situation when few or no compactions get scheduled for the SST files that contain
references to the oldest blob files, which can lead to increased space amp due
to the lack of GC.
In order to efficiently handle such workloads, the patch adds a new BlobDB
configuration option called `blob_garbage_collection_force_threshold`,
which signals to BlobDB to schedule targeted compactions for the SST files
that keep alive the oldest batch of blob files if the overall ratio of garbage in
the given blob files meets the threshold *and* all the given blob files are
eligible for GC based on `blob_garbage_collection_age_cutoff`. (For example,
if the new option is set to 0.9, targeted compactions will get scheduled if the
sum of garbage bytes meets or exceeds 90% of the sum of total bytes in the
oldest blob files, assuming all affected blob files are below the age-based cutoff.)
The net result of these targeted compactions is that the valid blobs in the oldest
blob files are relocated and the oldest blob files themselves cleaned up (since
*all* SST files that rely on them get compacted away).
These targeted compactions are similar to periodic compactions in the sense
that they force certain SST files that otherwise would not get picked up to undergo
compaction and also in the sense that instead of merging files from multiple levels,
they target a single file. (Note: such compactions might still include neighboring files
from the same level due to the need of having a "clean cut" boundary but they never
include any files from any other level.)
This functionality is currently only supported with the leveled compaction style
and is inactive by default (since the default value is set to 1.0, i.e. 100%).
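A configuration sketch with the relevant options (values arbitrary):
```
#include "rocksdb/options.h"

rocksdb::Options MakeBlobGCOptions() {
  rocksdb::Options options;
  options.enable_blob_files = true;
  options.enable_blob_garbage_collection = true;
  // Blob files in the oldest 25% (by age) are eligible for GC.
  options.blob_garbage_collection_age_cutoff = 0.25;
  // New option: schedule targeted compactions once >= 90% of the bytes in the
  // oldest eligible batch of blob files is garbage. 1.0 (the default) = off.
  options.blob_garbage_collection_force_threshold = 0.9;
  return options;
}
```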
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8994
Test Plan: Ran `make check` and tested using `db_bench` and the stress/crash tests.
Reviewed By: riversand963
Differential Revision: D31489850
Pulled By: ltamasi
fbshipit-source-id: 44057d511726a0e2a03c5d9313d7511b3f0c4eab
Summary:
`FaultInjectionTest{Env,FS}::ReopenWritableFile()` functions were accidentally deleting WALs from previous `db_stress` runs, causing verification to fail. They were operating under the assumption that `ReopenWritableFile()` would delete any existing file. It was a reasonable assumption considering the `{Env,FileSystem}::ReopenWritableFile()` documentation stated that would happen. The only problem was that neither the implementations we offer nor the "real" clients in RocksDB code followed that contract. So, this PR updates the contract as well as fixing the fault injection client usage.
The fault injection change exposed that `ExternalSSTFileBasicTest.SyncFailure` was relying on a fault injection `Env` dropping unsynced data written by a regular `Env`. I changed that test to make its `SstFileWriter` use fault injection `Env`, and also implemented `LinkFile()` in fault injection so the unsynced data is tracked under the new name.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8995
Test Plan:
- Verified it fixes the following failure:
```
$ ./db_stress --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/rocksdb_crashtest_whitebox --delpercent=5 --expected_values_dir=/dev/shm/rocksdb_crashtest_expected --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=100000 --max_key_len=3 --nooverwritepercent=1 --ops_per_thread=1000 --prefixpercent=0 --readpercent=60 --reopen=0 --target_file_size_base=1048576 --test_batches_snapshots=0 --write_buffer_size=1048576 --writepercent=35 --value_size_mult=33 -threads=1
...
$ ./db_stress --avoid_flush_during_recovery=1 --clear_column_family_one_in=0 --column_families=1 --db=/dev/shm/rocksdb_crashtest_whitebox --delpercent=5 --destroy_db_initially=0 --expected_values_dir=/dev/shm/rocksdb_crashtest_expected --iterpercent=10 --key_len_percent_dist=1,30,69 --max_bytes_for_level_base=4194304 --max_key=100000 --max_key_len=3 --nooverwritepercent=1 --open_files=-1 --open_metadata_write_fault_one_in=8 --open_write_fault_one_in=16 --ops_per_thread=1000 --prefix_size=-1 --prefixpercent=0 --readpercent=50 --sync=1 --target_file_size_base=1048576 --test_batches_snapshots=0 --write_buffer_size=1048576 --writepercent=35 --value_size_mult=33 -threads=1
...
Verification failed for column family 0 key 000000000000001300000000000000857878787878 (1143): Value not found: NotFound:
Crash-recovery verification failed :(
...
```
- `make check -j48`
Reviewed By: ltamasi
Differential Revision: D31495388
Pulled By: ajkr
fbshipit-source-id: 7886ccb6a07cb8b78ad7b6c1c341ccf40bb68385