Summary:
In https://github.com/facebook/rocksdb/issues/8454, I introduced a new process baptized `MemPurge` (memtable garbage collection). This new PR is built upon this past mempurge prototype.
In this PR, I made the `mempurge` process a background task, which provides superior performance since the mempurge process does not cling on the db_mutex anymore, and addresses severe restrictions from the past iteration (including a scenario where the past mempurge was failling, when a memtable was mempurged but was still referred to by an iterator/snapshot/...).
Now the mempurge process ressembles an in-memory compaction process: the stack of immutable memtables is filtered out, and the useful payload is used to populate an output memtable. If the output memtable is filled at more than 60% capacity (arbitrary heuristic) the mempurge process is aborted and a regular flush process takes place, else the output memtable is kept in the immutable memtable stack. Note that adding this output memtable to the `imm()` memtable stack does not trigger another flush process, so that the flush thread can go to sleep at the end of a successful mempurge.
MemPurge is activated by making the `experimental_allow_mempurge` flag `true`. When activated, the `MemPurge` process will always happen when the flush reason is `kWriteBufferFull`.
The 3 unit tests confirm that this process supports `Put`, `Get`, `Delete`, `DeleteRange` operators and is compatible with `Iterators` and `CompactionFilters`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8505
Reviewed By: pdillinger
Differential Revision: D29619283
Pulled By: bjlemaire
fbshipit-source-id: 8a99bee76b63a8211bff1a00e0ae32360aaece95
Summary:
Implement an experimental feature called "MemPurge", which consists in purging "garbage" bytes out of a memtable and reuse the memtable struct instead of making it immutable and eventually flushing its content to storage.
The prototype is by default deactivated and is not intended for use. It is intended for correctness and validation testing. At the moment, the "MemPurge" feature can be switched on by using the `options.experimental_allow_mempurge` flag. For this early stage, when the allow_mempurge flag is set to `true`, all the flush operations will be rerouted to perform a MemPurge. This is a temporary design decision that will give us the time to explore meaningful heuristics to use MemPurge at the right time for relevant workloads . Moreover, the current MemPurge operation only supports `Puts`, `Deletes`, `DeleteRange` operations, and handles `Iterators` as well as `CompactionFilter`s that are invoked at flush time .
Three unit tests are added to `db_flush_test.cc` to test if MemPurge works correctly (and checks that the previously mentioned operations are fully supported thoroughly tested).
One noticeable design decision is the timing of the MemPurge operation in the memtable workflow: for this prototype, the mempurge happens when the memtable is switched (and usually made immutable). This is an inefficient process because it implies that the entirety of the MemPurge operation happens while holding the db_mutex. Future commits will make the MemPurge operation a background task (akin to the regular flush operation) and aim at drastically enhancing the performance of this operation. The MemPurge is also not fully "WAL-compatible" yet, but when the WAL is full, or when the regular MemPurge operation fails (or when the purged memtable still needs to be flushed), a regular flush operation takes place. Later commits will also correct these behaviors.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8454
Reviewed By: anand1976
Differential Revision: D29433971
Pulled By: bjlemaire
fbshipit-source-id: 6af48213554e35048a7e03816955100a80a26dc5
Summary:
Currently, we either use the file system inode or a monotonically incrementing runtime ID as the block cache key prefix. However, if we use a monotonically incrementing runtime ID (in the case that the file system does not support inode id generation), in some cases, it cannot ensure uniqueness (e.g., we have secondary cache migrated from host to host). We use DbSessionID (20 bytes) + current file number (at most 10 bytes) as the new cache block key prefix when the secondary cache is enabled. So can accommodate scenarios such as transfer of cache state across hosts.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8360
Test Plan: add the test to lru_cache_test
Reviewed By: pdillinger
Differential Revision: D29006215
Pulled By: zhichao-cao
fbshipit-source-id: 6cff686b38d83904667a2bd39923cd030df16814
Summary:
This patch does two things:
1) Introduces some aliases in order to eliminate/prevent long-winded type names
w/r/t the internal table property collectors (see e.g.
`std::vector<std::unique_ptr<IntTblPropCollectorFactory>>`).
2) Makes it possible to apply only a subrange of table property collectors during
table building by turning `TableBuilderOptions::int_tbl_prop_collector_factories`
from a pointer to a `vector` into a range (i.e. a pair of iterators).
Rationale: I plan to introduce a BlobDB related table property collector, which
should only be applied during table creation if blob storage is enabled at the moment
(which can be changed dynamically). This change will make it possible to include/
exclude the BlobDB related collector as needed without having to introduce
a second `vector` of collectors in `ColumnFamilyData` with pretty much the same
contents.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8298
Test Plan: `make check`
Reviewed By: jay-zhuang
Differential Revision: D28430910
Pulled By: ltamasi
fbshipit-source-id: a81d28f2c59495865300f43deb2257d2e6977c8e
Summary:
The ImmutableCFOptions contained a bunch of fields that belonged to the ImmutableDBOptions. This change cleans that up by introducing an ImmutableOptions struct. Following the pattern of Options struct, this class inherits from the DB and CFOption structs (of the Immutable form).
Only one structural change (the ImmutableCFOptions::fs was changed to a shared_ptr from a raw one) is in this PR. All of the other changes involve moving the member variables from the ImmutableCFOptions into the ImmutableOptions and changing member variables or function parameters as required for compilation purposes.
Follow-on PRs may do a further clean-up of the code, such as renaming variables (such as "ImmutableOptions cf_options") and potentially eliminating un-needed function parameters (there is no longer a need to pass both an ImmutableDBOptions and an ImmutableOptions to a function).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8262
Reviewed By: pdillinger
Differential Revision: D28226540
Pulled By: mrambacher
fbshipit-source-id: 18ae71eadc879dedbe38b1eb8e6f9ff5c7147dbf
Summary:
## 1. Bug description:
When RocksDB Checkpoint, it may be stuck in `WaitUntilFlushWouldNotStallWrites` method.
## 2. Simple analysis of the reasons:
### 2.1 Configuration parameters:
```yaml
Compaction Style : Universal
max_write_buffer_number : 4
min_write_buffer_number_to_merge : 3
```
Checkpoint is usually very fast. When the Checkpoint is executed, `WaitUntilFlushWouldNotStallWrites` is called. If there are 2 Immutable MemTables, which are less than `min_write_buffer_number_to_merge`, they will not be flushed. But will enter this code.
```c++
// method: GetWriteStallConditionAndCause
if (mutable_cf_options.max_write_buffer_number> 3 &&
num_unflushed_memtables >=
mutable_cf_options.max_write_buffer_number-1) {
return {WriteStallCondition::kDelayed, WriteStallCause::kMemtableLimit};
}
```
code link: fbed72f03c/db/column_family.cc (L847)
Checkpoint thought there was a FlushJob, but it didn't. So will always wait.
### 2.2 solution:
Increase the restriction: the `number of Immutable MemTable` >= `min_write_buffer_number_to_merge will wait`.
If there are other better solutions, you can correct me.
### 2.3 Code that can reproduce the problem:
https://github.com/1996fanrui/fanrui-learning/blob/flink-1.12/module-java/src/main/java/com/dream/rocksdb/RocksDBCheckpointStuck.java
## 3. Interesting point
This bug will be triggered only when `the number of sorted runs >= level0_file_num_compaction_trigger`.
Because there is a break in WaitUntilFlushWouldNotStallWrites.
```c++
if (cfd->imm()->NumNotFlushed() <
cfd->ioptions()->min_write_buffer_number_to_merge &&
vstorage->l0_delay_trigger_count() <
mutable_cf_options.level0_file_num_compaction_trigger) {
break;
}
```
code link: fbed72f03c/db/db_impl/db_impl_compaction_flush.cc (L1974)
Universal may have `l0_delay_trigger_count() >= level0_file_num_compaction_trigger`, so this bug is triggered.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7921
Reviewed By: jay-zhuang
Differential Revision: D26900559
Pulled By: ajkr
fbshipit-source-id: 133c1252dad7393753f04a47590b68c7d8e670df
Summary:
In the write path, there is an optimization: when a new WAL is created during SwitchMemtable, we update the internal log number of the empty column families to the new WAL. `FindObsoleteFiles` marks a WAL as obsolete if the WAL's log number is less than `VersionSet::MinLogNumberWithUnflushedData`. After updating the empty column families' internal log number, `VersionSet::MinLogNumberWithUnflushedData` might change, so some WALs might become obsolete to be purged from disk.
For example, consider there are 3 column families: 0, 1, 2:
1. initially, all the column families' log number is 1;
2. write some data to cf0, and flush cf0, but the flush is pending;
3. now a new WAL 2 is created;
4. write data to cf1 and WAL 2, now cf0's log number is 1, cf1's log number is 2, cf2's log number is 2 (because cf1 and cf2 are empty, so their log numbers will be set to the highest log number);
5. now cf0's flush hasn't finished, flush cf1, a new WAL 3 is created, and cf1's flush finishes, now cf0's log number is 1, cf1's log number is 3, cf2's log number is 3, since WAL 1 still contains data for the unflushed cf0, no WAL can be deleted from disk;
6. now cf0's flush finishes, cf0's log number is 2 (because when cf0 was switching memtable, WAL 3 does not exist yet), cf1's log number is 3, cf2's log number is 3, so WAL 1 can be purged from disk now, but WAL 2 still cannot because `MinLogNumberToKeep()` is 2;
7. write data to cf2 and WAL 3, because cf0 is empty, its log number is updated to 3, so now cf0's log number is 3, cf1's log number is 3, cf2's log number is 3;
8. now if the background threads want to purge obsolete files from disk, WAL 2 can be purged because `MinLogNumberToKeep()` is 3. But there are only two flush results written to MANIFEST: the first is for flushing cf1, and the `MinLogNumberToKeep` is 1, the second is for flushing cf0, and the `MinLogNumberToKeep` is 2. So without this PR, if the DB crashes at this point and try to recover, `WalSet` will still expect WAL 2 to exist.
When WAL tracking is enabled, we assume WALs will only become obsolete after a flush result is written to MANIFEST in `MemtableList::TryInstallMemtableFlushResults` (or its atomic flush counterpart). The above situation breaks this assumption.
This PR tracks WAL obsoletion if necessary before updating the empty column families' log numbers.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7781
Test Plan:
watch existing tests and stress tests to pass.
`make -j48 blackbox_crash_test` on devserver
Reviewed By: ltamasi
Differential Revision: D25631695
Pulled By: cheng-chang
fbshipit-source-id: ca7fff967bdb42204b84226063d909893bc0a4ec
Summary:
Uncommon bug seen by ASAN with
ColumnFamilyTest.LiveIteratorWithDroppedColumnFamily, if the last two
references to a ColumnFamilyData are both SuperVersions (during
InstallSuperVersion). The fix is to use UnrefAndTryDelete even in
SuperVersion::Cleanup but with a parameter to avoid re-entering Cleanup
on the same SuperVersion being cleaned up.
ColumnFamilyData::Unref is considered unsafe so removed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7749
Test Plan: ./column_family_test --gtest_filter=*LiveIter* --gtest_repeat=100
Reviewed By: jay-zhuang
Differential Revision: D25354304
Pulled By: pdillinger
fbshipit-source-id: e78f3a3f67c40013b8432f31d0da8bec55c5321c
Summary:
Following https://github.com/facebook/rocksdb/issues/7655 and https://github.com/facebook/rocksdb/issues/7657, this PR adds `full_history_ts_low_` to `ColumnFamilyData`.
`ColumnFamilyData::full_history_ts_low_` will be used to create `FlushJob` and `CompactionJob`.
`ColumnFamilyData::full_history_ts_low` is persisted to the MANIFEST file. An application can only
increase its value. Consider the following case:
>
> The database has a key at ts=950. `full_history_ts_low` is first set to 1000, and then a GC is triggered
> and cleans up all data older than 1000. If the application sets `full_history_ts_low` to 900 afterwards,
> and tries to read at ts=960, the key at 950 is not seen. From the perspective of the read, the result
> is hard to reason. For simplicity, we just do now allow decreasing full_history_ts_low for now.
>
During recovery, the value of `full_history_ts_low` is restored for each column family if applicable. Note that
version edits in the MANIFEST file for the same column family may have `full_history_ts_low` unsorted due
to the potential interleaving of `LogAndApply` calls. Only the max will be used to restore the state of the
column family.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7740
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D25296217
Pulled By: riversand963
fbshipit-source-id: 24acda1df8262cd7cfdc6ce7b0ec56438abe242a
Summary:
Fixes Issue https://github.com/facebook/rocksdb/issues/7497
When allow_data_in_errors db_options is set, log error key details in `ParseInternalKey()`
Have fixed most of the calls. Have few TODOs still pending - because have to make more deeper changes to pass in the allow_data_in_errors flag. Will do those in a separate PR later.
Tests:
- make check
- some of the existing tests that exercise the "internal key too small" condition are: dbformat_test, cuckoo_table_builder_test
- some of the existing tests that exercise the corrupted key path are: corruption_test, merge_helper_test, compaction_iterator_test
Example of new status returns:
- Key too small - `Corrupted Key: Internal Key too small. Size=5`
- Corrupt key with allow_data_in_errors option set to false: `Corrupted Key: '<redacted>' seq:3, type:3`
- Corrupt key with allow_data_in_errors option set to true: `Corrupted Key: '61' seq:3, type:3`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7515
Reviewed By: ajkr
Differential Revision: D24240264
Pulled By: ramvadiv
fbshipit-source-id: bc48f5d4475ac19d7713e16df37505b31aac42e7
Summary:
The patch adds blob file support to the `Get` API by extending `Version` so that
whenever a blob reference is read from a file, the blob is retrieved from the corresponding
blob file and passed back to the caller. (This is assuming the blob reference is valid
and the blob file is actually part of the given `Version`.) It also introduces a cache
of `BlobFileReader`s called `BlobFileCache` that enables sharing `BlobFileReader`s
between callers. `BlobFileCache` uses the same backing cache as `TableCache`, so
`max_open_files` (if specified) limits the total number of open (table + blob) files.
TODO: proactively open/cache blob files and pin the cache handles of the readers in the
metadata objects similarly to what `VersionBuilder::LoadTableHandlers` does for
table files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7540
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D24260219
Pulled By: ltamasi
fbshipit-source-id: a8a2a4f11d3d04d6082201b52184bc4d7b0857ba
Summary:
Replace FSRandomAccessFile pointer with FSRandomAccessFilePtr
object in RandomAccessFileReader.
This new object wraps FSRandomAccessFile pointer.
Objective: If tracing is enabled, FSRandomAccessFile Ptr returns
FSRandomAccessFileTracingWrapper pointer that includes all necessary
information in IORecord and calls underlying FileSystem and invokes
IOTracer to dump that record in a binary file. If tracing is disabled
then, underlying FileSystem pointer is returned directly.
FSRandomAccessFilePtr wrapper class is added to bypass the FSRandomAccessFileWrapper when
tracing is disabled.
Test Plan: make check -j64
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7192
Reviewed By: anand1976
Differential Revision: D23356867
Pulled By: akankshamahajan15
fbshipit-source-id: 48f31168166a17a7444b40be44a9a9d4a5c7182c
Summary:
There are situations when RocksDB tries to recover, but the db is in an inconsistent state due to SST files referenced in the MANIFEST being missing. In this case, previous RocksDB will just fail the recovery and return a non-ok status.
This PR enables another possibility. During recovery, RocksDB checks possible MANIFEST files, and try to recover to the most recent state without missing table file. `VersionSet::Recover()` applies version edits incrementally and "materializes" a version only when this version does not reference any missing table file. After processing the entire MANIFEST, the version created last will be the latest version.
`DBImpl::Recover()` calls `VersionSet::Recover()`. Afterwards, WAL replay will *not* be performed.
To use this capability, set `options.best_efforts_recovery = true` when opening the db. Best-efforts recovery is currently incompatible with atomic flush.
Test plan (on devserver):
```
$make check
$COMPILE_WITH_ASAN=1 make all && make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6334
Reviewed By: anand1976
Differential Revision: D19778960
Pulled By: riversand963
fbshipit-source-id: c27ea80f29bc952e7d3311ecf5ee9c54393b40a8
Summary:
In Linux, when reopening DB with many SST files, profiling shows that 100% system cpu time spent for a couple of seconds for `GetLogicalBufferSize`. This slows down MyRocks' recovery time when site is down.
This PR introduces two new APIs:
1. `Env::RegisterDbPaths` and `Env::UnregisterDbPaths` lets `DB` tell the env when it starts or stops using its database directories . The `PosixFileSystem` takes this opportunity to set up a cache from database directories to the corresponding logical block sizes.
2. `LogicalBlockSizeCache` is defined only for OS_LINUX to cache the logical block sizes.
Other modifications:
1. rename `logical buffer size` to `logical block size` to be consistent with Linux terms.
2. declare `GetLogicalBlockSize` in `PosixHelper` to expose it to `PosixFileSystem`.
3. change the functions `IOError` and `IOStatus` in `env/io_posix.h` to have external linkage since they are used in other translation units too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6457
Test Plan:
1. A new unit test is added for `LogicalBlockSizeCache` in `env/io_posix_test.cc`.
2. A new integration test is added for `DB` operations related to the cache in `db/db_logical_block_size_cache_test.cc`.
`make check`
Differential Revision: D20131243
Pulled By: cheng-chang
fbshipit-source-id: 3077c50f8065c0bffb544d8f49fb10bba9408d04
Summary:
In the current code base, we can use Directory from Env to manage directory (e.g, Fsync()). The PR https://github.com/facebook/rocksdb/issues/5761 introduce the File System as a new Env API. So we further replace the Directory class in DB with FSDirectory such that we can have more IO information from IOStatus returned by FSDirectory.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6468
Test Plan: pass make asan_check
Differential Revision: D20195261
Pulled By: zhichao-cao
fbshipit-source-id: 93962cb9436852bfcfb76e086d9e7babd461cbe1
Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433
Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.
Differential Revision: D19977691
fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
Summary:
A relatively recent regression causes for every CF, create and open directory is called for the DB directory, unless CF has a private directory. This doesn't scale well with large number of column families.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6358
Test Plan: Run all existing tests and see it pass. strace with db_bench --num_column_families and observe it doesn't open directory for number of column families.
Differential Revision: D19675141
fbshipit-source-id: da01d9216f1dae3f03d4064fbd88ce71245bd9be
Summary:
I found that CleanupSuperVersion() may block Get() for 30ms+ (per MemTable is 256MB).
Then I found "delete sv" in ~SuperVersion() takes the time.
The backtrace looks like this
DBImpl::GetImpl() -> DBImpl::ReturnAndCleanupSuperVersion() ->
DBImpl::CleanupSuperVersion() : delete sv; -> ~SuperVersion()
I think it's better to delete in a background thread, please review it。
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6146
Differential Revision: D18972066
fbshipit-source-id: 0f7b0b70b9bb1e27ad6fc1c8a408fbbf237ae08c
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
Summary:
It's easy to cause coredump when closing ColumnFamilyHandle with unreleased iterators, especially iterators release is controlled by java GC when using JNI.
This patch fixed concurrent CF iteration and drop, we let iterators(actually SuperVersion) hold a ColumnFamilyData reference to prevent the CF from being released too early.
fixed https://github.com/facebook/rocksdb/issues/5982
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6147
Differential Revision: D18926378
fbshipit-source-id: 1dff6d068c603d012b81446812368bfee95a5e15
Summary:
We see this TSAN warning:
WARNING: ThreadSanitizer: data race (pid=282806)
Write of size 8 at 0x7b6c00000e38 by thread T16 (mutexes: write M1023578822185846136):
#0 operator delete(void*) <null> (libtsan.so.0+0x0000000795f8)
https://github.com/facebook/rocksdb/issues/1 rocksdb::DBImpl::BackgroundFlush(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::FlushReason*, rocksdb::Env::Priority) db/db_impl/db_impl_compaction_flush.cc:2202 (db_flush_test+0x00000060b462)
https://github.com/facebook/rocksdb/issues/2 rocksdb::DBImpl::BackgroundCallFlush(rocksdb::Env::Priority) db/db_impl/db_impl_compaction_flush.cc:2226 (db_flush_test+0x00000060cbd8)
https://github.com/facebook/rocksdb/issues/3 rocksdb::DBImpl::BGWorkFlush(void*) db/db_impl/db_impl_compaction_flush.cc:2073 (db_flush_test+0x00000060d5ac)
......
Previous atomic write of size 4 at 0x7b6c00000e38 by main thread:
#0 __tsan_atomic32_fetch_sub <null> (libtsan.so.0+0x00000006d721)
https://github.com/facebook/rocksdb/issues/1 std::__atomic_base<int>::fetch_sub(int, std::memory_order) /mnt/gvfs/third-party2/libgcc/c67031f0f739ac61575a061518d6ef5038f99f90/7.x/platform007/5620abc/include/c++/7.3.0/bits/atomic_base.h:524 (db_flush_test+0x0000005f9e38)
https://github.com/facebook/rocksdb/issues/2 rocksdb::ColumnFamilyData::Unref() db/column_family.h:286 (db_flush_test+0x0000005f9e38)
https://github.com/facebook/rocksdb/issues/3 rocksdb::DBImpl::FlushMemTable(rocksdb::ColumnFamilyData*, rocksdb::FlushOptions const&, rocksdb::FlushReason, bool) db/db_impl/db_impl_compaction_flush.cc:1624 (db_flush_test+0x0000005f9e38)
https://github.com/facebook/rocksdb/issues/4 rocksdb::DBImpl::TEST_FlushMemTable(rocksdb::ColumnFamilyData*, rocksdb::FlushOptions const&) db/db_impl/db_impl_debug.cc:127 (db_flush_test+0x00000061ace9)
https://github.com/facebook/rocksdb/issues/5 rocksdb::DBFlushTest_CFDropRaceWithWaitForFlushMemTables_Test::TestBody() db/db_flush_test.cc:320 (db_flush_test+0x0000004b44e5)
https://github.com/facebook/rocksdb/issues/6 void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) third-party/gtest-1.7.0/fused-src/gtest/gtest-all.cc:3824 (db_flush_test+0x000000be2988)
......
It's still very clear the cause of the warning is because that TSAN treats results from relaxed atomic::fetch_sub() as non-atomic with the operation itself. We can make it more explicit by bumping up the order to CS.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5723
Test Plan: Run all existing test.
Differential Revision: D16908250
fbshipit-source-id: bf17d39ed19058372bdf97f6440a743f88153021
Summary:
This PR integrates the block cache tracer class into db_impl.cc.
db_impl.cc contains a member variable of AtomicBlockCacheTraceWriter class and passes its reference to the block_based_table_reader.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5433
Differential Revision: D15728016
Pulled By: HaoyuHuang
fbshipit-source-id: 23d5659e8c82d556833dcc1a5558aac8c1f7db71
Summary:
Currently we validate options in DB::Open. However the validation step is missing when options are dynamically updated in ::SetOptions. The patch fixes that.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5368
Differential Revision: D15540101
Pulled By: maysamyabandeh
fbshipit-source-id: d27bbffd8f0252d1b50bcf59e0a70a278ed937f4
Summary:
Depending on the config, manual compaction (leveled compaction style) does following compactions:
L0->L1
L1->L2
...
Ln-1 -> Ln
Ln -> Ln
The final Ln -> Ln compaction is partly unnecessary as it recompacts all the files that were just generated by the Ln-1 -> Ln. We should avoid recompacting such files. This rule should be applied to Lmax only.
Resolves issue https://github.com/facebook/rocksdb/issues/4995
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5138
Differential Revision: D14940106
Pulled By: miasantreble
fbshipit-source-id: 8d3cf5507a17e76f3333cfd4bac5256d005636e5
Summary:
Avoid locking the DB mutex in order to reference SuperVersions. Instead, we get the thread local cached SuperVersion for each column family in the list. It depends on finding a sequence number that overlaps with all the open memtables. We start with the latest published sequence number, and if any of the memtables is sealed before we can get all the SuperVersions, the process is repeated. After a few times, give up and lock the DB mutex.
Tests:
1. Unit tests
2. make check
3. db_bench -
TEST_TMPDIR=/dev/shm ./db_bench -use_existing_db=true -benchmarks=readrandom -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=5000000 -reads=1000000 -threads=32 -compression_type=none -cache_size=1048576000 -batch_size=1 -bloom_bits=1
readrandom : 0.167 micros/op 5983920 ops/sec; 426.2 MB/s (1000000 of 1000000 found)
Multireadrandom with batch size 1:
multireadrandom : 0.176 micros/op 5684033 ops/sec; (1000000 of 1000000 found)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4754
Differential Revision: D13363550
Pulled By: anand1976
fbshipit-source-id: 6243e8de7dbd9c8bb490a8eca385da0c855b1dd4
Summary:
Sometimes we want to compact files as fast as possible, but don't want to set a large `max_subcompactions` in the `DBOptions` by default.
I add a `max_subcompactions` options to `CompactionOptions` so that we can choose a proper concurrency dynamically.
Closes https://github.com/facebook/rocksdb/pull/3775
Differential Revision: D7792357
Pulled By: ajkr
fbshipit-source-id: 94f54c3784dce69e40a229721a79a97e80cd6a6c
Summary:
We use `queued_for_flush_` to indicate a column family has been added to the
flush queue. Similarly and to be consistent in our naming, we need to use `queued_for_compaction_` to indicate a column family has been added to the compaction queue. In the past we used
`pending_compaction_` which can also be ambiguous.
Closes https://github.com/facebook/rocksdb/pull/3781
Differential Revision: D7790063
Pulled By: riversand963
fbshipit-source-id: 6786b11a4fcaea36dc9b4672233dbe042f921804
Summary:
With ColumnFamilyData::pending_flush_, we have the following code snippet in DBImpl::ScheedulePendingFlush
```
if (!cfd->pending_flush() && cfd->imm()->IsFlushPending()) {
...
}
```
`Pending` is ambiguous, and I feel `queued_for_flush` is a better name,
especially for the sake of readability.
Closes https://github.com/facebook/rocksdb/pull/3777
Differential Revision: D7783066
Pulled By: riversand963
fbshipit-source-id: f1bd8c8bfe5eafd2c94da0d8566c9b2b6bb57229
Summary:
In this change, an option to set different paths for different column families is added.
This option is set via cf_paths setting of ColumnFamilyOptions. This option will work in a similar fashion to db_paths setting. Cf_paths is a vector of Dbpath values which contains a pair of the absolute path and target size. Multiple levels in a Column family can go to different paths if cf_paths has more than one path.
To maintain backward compatibility, if cf_paths is not specified for a column family, db_paths setting will be used. Note that, if db_paths setting is also not specified, RocksDB already has code to use db_name as the only path.
Changes :
1) A new member "cf_paths" is added to ImmutableCfOptions. This is set, based on cf_paths setting of ColumnFamilyOptions and db_paths setting of ImmutableDbOptions. This member is used to identify the path information whenever files are accessed.
2) Validation checks are added for cf_paths setting based on existing checks for db_paths setting.
3) DestroyDB, PurgeObsoleteFiles etc. are edited to support multiple cf_paths.
4) Unit tests are added appropriately.
Closes https://github.com/facebook/rocksdb/pull/3102
Differential Revision: D6951697
Pulled By: ajkr
fbshipit-source-id: 60d2262862b0a8fd6605b09ccb0da32bb331787d
Summary:
When destorying column family handle after the column family has been deleted, the handle may hold share pointers of some objects in ColumnFamilyOptions, but in the destructor, the destructing order may cause some of the objects to be destoryed before being used by the following steps. Fix it by making a copy of the option object and destory it as the last step.
Closes https://github.com/facebook/rocksdb/pull/3610
Differential Revision: D7281025
Pulled By: siying
fbshipit-source-id: ac18f3b2841788cba4ccfa1abd8d59158c1113bc
Summary:
Add "rocksdb.live-sst-files-size" DB property which only include files of latest version. Existing "rocksdb.total-sst-files-size" include files from all versions and thus include files that's obsolete but not yet deleted. I'm going to use this new property to cap blob db sst + blob files size.
Closes https://github.com/facebook/rocksdb/pull/3548
Differential Revision: D7116939
Pulled By: yiwu-arbug
fbshipit-source-id: c6a52e45ce0f24ef78708156e1a923c1dd6bc79a
Summary:
CompactRange has a call to Flush because we guarantee that, at the time it's called, all existing keys in the range will be pushed through the user's compaction filter. However, previously the flush was done blindly, so it'd happen even if the memtable does not contain keys in the range specified by the user. This caused unnecessarily many L0 files to be created, leading to write stalls in some cases. This PR checks the memtable's contents, and decides to flush only if it overlaps with `CompactRange`'s range.
- Move the memtable overlap check logic from `ExternalSstFileIngestionJob` to `ColumnFamilyData::RangesOverlapWithMemtables`
- Reuse the above logic in `CompactRange` and skip flushing if no overlap
Closes https://github.com/facebook/rocksdb/pull/3520
Differential Revision: D7018897
Pulled By: ajkr
fbshipit-source-id: a3c6b1cfae56687b49dd89ccac7c948e53545934
Summary:
- Refactored logic for checking write stall condition to a helper function: `GetWriteStallConditionAndCause`. Now it is decoupled from the logic for updating WriteController / stats in `RecalculateWriteStallConditions`, so we can reuse it for predicting whether write stall will occur.
- Updated `CompactRange` to first check whether the one additional immutable memtable / L0 file would cause stalling before it flushes. If so, it waits until that is no longer true.
- Updated `bg_cv_` to be signaled on `SetOptions` calls. The stall conditions `CompactRange` cares about can change when (1) flush finishes, (2) compaction finishes, or (3) options dynamically change. The cv was already signaled for (1) and (2) but not yet for (3).
Closes https://github.com/facebook/rocksdb/pull/3381
Differential Revision: D6754983
Pulled By: ajkr
fbshipit-source-id: 5613e03f1524df7192dc6ae885d40fd8f091d972
Summary:
It's always a mystery from the logs why flush was triggered -- user triggered it manually, WriteBufferManager triggered it, logs were full, write buffer was full, etc.
This PR logs Flush reason whenever a flush is scheduled.
Closes https://github.com/facebook/rocksdb/pull/3401
Differential Revision: D6788142
Pulled By: miasantreble
fbshipit-source-id: a867e54d493c06adf5172bd36a180fb3faae3511
Summary:
Flush() call could be waiting indefinitely if min_write_buffer_number_to_merge is used. Consider the sequence:
1. User call Flush() with flush_options.wait = true
2. The manual flush started in the background
3. New memtable become immutable because of writes. The new memtable will not trigger flush if min_write_buffer_number_to_merge is not reached.
4. The manual flush finish.
Because of the new memtable created at step 3 not being flush, previous logic of WaitForFlushMemTable() keep waiting, despite the memtables it intent to flush has been flushed.
Here instead of checking if there are any more memtables to flush, WaitForFlushMemTable() also check the id of the earliest memtable. If the id is larger than that of latest memtable at the time flush was initiated, it means all the memtable at the time of flush start has all been flush.
Closes https://github.com/facebook/rocksdb/pull/3378
Differential Revision: D6746789
Pulled By: yiwu-arbug
fbshipit-source-id: 35e698f71c7f90b06337a93e6825f4ea3b619bfa
Summary:
A data race is caught by tsan_crash test between compaction and DropColumnFamily:
https://gist.github.com/yiwu-arbug/5a2b4baae05eeb99ae1719b650f30a44 Compaction checks if the column family has been dropped on each key input, while user can issue DropColumnFamily which updates cfd->dropped_, causing the data race. Fixing it by making cfd->dropped_ an atomic.
Closes https://github.com/facebook/rocksdb/pull/3250
Differential Revision: D6535991
Pulled By: yiwu-arbug
fbshipit-source-id: 5571df020beae7fa7db6fff5ad0d598f49962895
Summary:
Add a simple policy for NVMe write time life hint
Closes https://github.com/facebook/rocksdb/pull/3095
Differential Revision: D6298030
Pulled By: shligit
fbshipit-source-id: 9a72a42e32e92193af11599eb71f0cf77448e24d
Summary:
db/column_family.h :
79 ColumnFamilyHandleInternal()
CID 1322806 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
2. uninit_member: Non-static class member internal_cfd_ is not initialized in this constructor nor in any functions that it calls.
80 : ColumnFamilyHandleImpl(nullptr, nullptr, nullptr) {}
db/compacted_db_impl.cc:
18CompactedDBImpl::CompactedDBImpl(
19 const DBOptions& options, const std::string& dbname)
20 : DBImpl(options, dbname) {
2. uninit_member: Non-static class member cfd_ is not initialized in this constructor nor in any functions that it calls.
4. uninit_member: Non-static class member version_ is not initialized in this constructor nor in any functions that it calls.
CID 1396120 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
6. uninit_member: Non-static class member user_comparator_ is not initialized in this constructor nor in any functions that it calls.
21}
db/compaction_iterator.cc:
9. uninit_member: Non-static class member current_user_key_sequence_ is not initialized in this constructor nor in any functions that it calls.
11. uninit_member: Non-static class member current_user_key_snapshot_ is not initialized in this constructor nor in any functions that it calls.
CID 1419855 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
13. uninit_member: Non-static class member current_key_committed_ is not initialized in this constructor nor in any functions that it calls.
Closes https://github.com/facebook/rocksdb/pull/3084
Differential Revision: D6172999
Pulled By: sagar0
fbshipit-source-id: 084d73393faf8022c01359cfb445807b6a782460
Summary:
instad of for specific level -> instead of a specific level
Closes https://github.com/facebook/rocksdb/pull/3040
Differential Revision: D6090811
Pulled By: sagar0
fbshipit-source-id: 499edef0a6f596c448f61791e6aca8f5cce08e9c
Summary:
Add a new function in Listener to let the caller know when rocksdb
is stalling writes.
Closes https://github.com/facebook/rocksdb/pull/2897
Differential Revision: D5860124
Pulled By: schischi
fbshipit-source-id: ee791606169aa64f772c86f817cebf02624e05e1
Summary:
CreateColumnFamily() releases DB mutex after adding column family to the set and install super version (to write option file), so if users call GetAggregatedIntProperty() in the middle, then super version will be null and the process will crash. Fix it by skipping those column families without super version installed.
Maybe we should also fix the problem of releasing the lock when reading option file, but it is more risky. so I'm doing a quick and safer fix and we can investigate it later.
Closes https://github.com/facebook/rocksdb/pull/2475
Differential Revision: D5298053
Pulled By: siying
fbshipit-source-id: 4b3c8f91c60400b163fcc6cda8a0c77723be0ef6
Summary:
Move some files under util/ to new directories env/, monitoring/ options/ and cache/
Closes https://github.com/facebook/rocksdb/pull/2090
Differential Revision: D4833681
Pulled By: siying
fbshipit-source-id: 2fd8bef
Summary:
Consider the following single column family scenario:
prepare in log A
commit in log B
*WAL is too large, flush all CFs to releast log A*
*CFA is on log B so we do not see CFA is depending on log A so no flush is requested*
To fix this we must also consider the log containing the prepare section when determining what log a CF is dependent on.
Closes https://github.com/facebook/rocksdb/pull/1768
Differential Revision: D4403265
Pulled By: reidHoruff
fbshipit-source-id: ce800ff
Summary:
Current write stalling system has the problem of lacking of positive feedback if the restricted rate is already too low. Users sometimes stack in very low slowdown value. With the diff, we add a positive feedback (increasing the slowdown value) if we recover from slowdown state back to normal. To avoid the positive feedback to keep the slowdown value to be to high, we add issue a negative feedback every time we are close to the stop condition. Experiments show it is easier to reach a relative balance than before.
Also increase level0_stop_writes_trigger default from 24 to 32. Since level0_slowdown_writes_trigger default is 20, stop trigger 24 only gives four files as the buffer time to slowdown writes. In order to avoid stop in four files while 20 files have been accumulated, the slowdown value must be very low, which is amost the same as stop. It also doesn't give enough time for the slowdown value to converge. Increase it to 32 will smooth out the system.
Closes https://github.com/facebook/rocksdb/pull/1562
Differential Revision: D4218519
Pulled By: siying
fbshipit-source-id: 95e4088