Summary:
Previously, options.ttl cannot be set with options.max_open_files = -1, because it makes use of creation_time field in table properties, which is not available unless max_open_files = -1. With this commit, the information will be stored in manifest and when it is available, will be used instead.
Note that, this change will break forward compatibility for release 5.1 and older.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6060
Test Plan: Extend existing test case to options.max_open_files != -1, and simulate backward compatility in one test case by forcing the value to be 0.
Differential Revision: D18631623
fbshipit-source-id: 30c232a8672de5432ce9608bb2488ecc19138830
Summary:
Signed-off-by: Little-Wallace <bupt2013211450@gmail.com>
This PR is to fix unstable unit test added by (https://github.com/facebook/rocksdb/pull/5958).
I set SYNC_POINT in PickCompaction before. If IntraL0Compaction was trigger, the compact job which compact sst to base level would start instantly. If the compaction thread run faster than unittest main thread, we may observe the number of files in L0 reduce.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6061
Differential Revision: D18642301
fbshipit-source-id: 3e4da2ee963532b6e142336951ea3f47d46df148
Summary:
Use db mutex to protect the execution of Version::GetColumnFamilyMetaData()
called in DBImpl::GetColumnFamilyMetaData().
Without mutex, GetColumnFamilyMetaData() races with MarkFilesBeingCompacted()
for access to FileMetaData::being_compacted.
Other than mutex, there are several more alternatives.
- Make FileMetaData::being_compacted an atomic variable. This will make
FileMetaData non-copy-able.
- Separate being_compacted from FileMetaData. This requires re-organizing data
structures that are already used in many places.
Test Plan (dev server):
```
make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6056
Differential Revision: D18620488
Pulled By: riversand963
fbshipit-source-id: 87f89660b5d5e2ab4ef7962b7b2a7d00e346aa3b
Summary:
The new DB::MultiGet() doesn't validate input for num_keys > 1 and GCC-9 complains about it. Fix it by directly return when num_keys == 0
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6054
Test Plan: Build with GCC-9 and see it passes.
Differential Revision: D18608958
fbshipit-source-id: 1c279aff3c7fe6e9d5a6d085ed02550ecea4fdb2
Summary:
## Problem Description
Our process was abort when it call `CheckConsistency`. And the information in `stderr` show that "`L0 files seqno 3001491972 3004797440 vs. 3002875611 3004524421` ". Here are the causes of the accident I investigated.
* RocksDB will call `CheckConsistency` whenever `MANIFEST` file is update. It will check sequence number interval of every file, except files which were ingested.
* When one file is ingested into RocksDB, it will be assigned the value of global sequence number, and the minimum and maximum seqno of this file are equal, which are both equal to global sequence number.
* `CheckConsistency` determines whether the file is ingested by whether the smallest and largest seqno of an sstable file are equal.
* If IntraL0Compaction picks one sst which was ingested just now and compacted it into another sst, the `smallest_seqno` of this new file will be smaller than his `largest_seqno`.
* If more than one ingested file was ingested before memtable schedule flush, and they all compact into one new sstable file by `IntraL0Compaction`. The sequence interval of this new file will be included in the interval of the memtable. So `CheckConsistency` will return a `Corruption`.
* If a sstable was ingested after the memtable was schedule to flush, which would assign a larger seqno to it than memtable. Then the file was compacted with other files (these files were all flushed before the memtable) in L0 into one file. This compaction start before the flush job of memtable start, but completed after the flush job finish. So this new file produced by the compaction (we call it s1) would have a larger interval of sequence number than the file produced by flush (we call it s2). **But there was still some data in s1 written into RocksDB before the s2, so it's possible that some data in s2 was cover by old data in s1.** Of course, it would also make a `Corruption` because of overlap of seqno. There is the relationship of the files:
> s1.smallest_seqno < s2.smallest_seqno < s2.largest_seqno < s1.largest_seqno
So I skip pick sst file which was ingested in function `FindIntraL0Compaction `
## Reason
Here is my bug report: https://github.com/facebook/rocksdb/issues/5913
There are two situations that can cause the check to fail.
### First situation:
- First we ingest five external sst into Rocksdb, and they happened to be ingested in L0. and there had been some data in memtable, which make the smallest sequence number of memtable is less than which of sst that we ingest.
- If there had been one compaction job which compacted sst from L0 to L1, `LevelCompactionPicker` would trigger a `IntraL0Compaction` which would compact this five sst from L0 to L0. We call this sst A, which was merged from five ingested sst.
- Then some data was put into memtable, and memtable was flushed to L0. We called this sst B.
- RocksDB check consistency , and find the `smallest_seqno` of B is less than that of A and crash. Because A was merged from five sst, the smallest sequence number of it was less than the biggest sequece number of itself, so RocksDB could not tell if A was produce by ingested.
### Secondary situaion
- First we have flushed many sst in L0, we call them [s1, s2, s3].
- There is an immutable memtable request to be flushed, but because flush thread is busy, so it has not been picked. we call it m1. And at the moment, one sst is ingested into L0. We call it s4. Because s4 is ingested after m1 became immutable memtable, so it has a larger log sequence number than m1.
- m1 is flushed in L0. because it is small, this flush job finish quickly. we call it s5.
- [s1, s2, s3, s4] are compacted into one sst to L0, by IntraL0Compaction. We call it s6.
- compacted 4@0 files to L0
- When s6 is added into manifest, the corruption happened. because the largest sequence number of s6 is equal to s4, and they are both larger than that of s5. But because s1 is older than m1, so the smallest sequence number of s6 is smaller than that of s5.
- s6.smallest_seqno < s5.smallest_seqno < s5.largest_seqno < s6.largest_seqno
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5958
Differential Revision: D18601316
fbshipit-source-id: 5fe54b3c9af52a2e1400728f565e895cde1c7267
Summary:
The SetOptions API used by the test is not supported in LITE mode,
so we should skip the new chunk in this case.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6052
Test Plan: Ran the unit tests both in regular and LITE mode.
Differential Revision: D18601763
Pulled By: ltamasi
fbshipit-source-id: 883d6882771e0fb4aae72bb77ba4e63d9febec04
Summary:
Fix: when `db_iter` falls back to using seek by `FindValueForCurrentKeyUsingSeek`, `is_blob_` flag is not properly set on encountering BlobIndex.
Also patch existing test for the mentioned code path.
Signed-off-by: tabokie <xy.tao@outlook.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6051
Differential Revision: D18596274
Pulled By: ltamasi
fbshipit-source-id: 8e4714af263b99dc2c379707d50db88fe6799278
Summary:
GetSupportedCompressions() is not available in LITE build, so check and use Snappy compression in db_basic_test.cc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6050
Test Plan:
make LITE=1 check
make check
Differential Revision: D18588114
Pulled By: anand1976
fbshipit-source-id: a193de58c44f91bcc237107f25dbc1b9458eef3d
Summary:
The ParallelIO/DBBasicTestWithParallelIO.MultiGet/11 test fails if Snappy compression library is not installed, since RocksDB defaults to Snappy if none is specified. So dynamically determine the supported compression types and pick the first one.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6038
Differential Revision: D18532370
Pulled By: anand1976
fbshipit-source-id: a0a735114d1f8892ea09f7c4af8688d7bcc5b075
Summary:
When two_write_queue enable, IngestExternalFile performs EnterUnbatched on both write queues. SwitchMemtable also EnterUnbatched on 2nd write queue when this option is enabled. When the call stack includes IngestExternalFile -> FlushMemTable -> SwitchMemtable, this results into a deadlock.
The implemented solution is to pass on the existing writes_stopped argument in FlushMemTable to skip EnterUnbatched in SwitchMemtable.
Fixes https://github.com/facebook/rocksdb/issues/5974
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5976
Differential Revision: D18535943
Pulled By: maysamyabandeh
fbshipit-source-id: a4f9d4964c10d4a7ca06b1e0102ca2ec395512bc
Summary:
Had complications with LITE build and valgrind test.
Reverts/fixes small parts of PR https://github.com/facebook/rocksdb/issues/6007
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6036
Test Plan:
make LITE=1 all check
and
ROCKSDB_VALGRIND_RUN=1 DISABLE_JEMALLOC=1 make -j24 db_bloom_filter_test && ROCKSDB_VALGRIND_RUN=1 DISABLE_JEMALLOC=1 ./db_bloom_filter_test
Differential Revision: D18512238
Pulled By: pdillinger
fbshipit-source-id: 37213cf0d309edf11c483fb4b2fb6c02c2cf2b28
Summary:
Adds an improved, replacement Bloom filter implementation (FastLocalBloom) for full and partitioned filters in the block-based table. This replacement is faster and more accurate, especially for high bits per key or millions of keys in a single filter.
Speed
The improved speed, at least on recent x86_64, comes from
* Using fastrange instead of modulo (%)
* Using our new hash function (XXH3 preview, added in a previous commit), which is much faster for large keys and only *slightly* slower on keys around 12 bytes if hashing the same size many thousands of times in a row.
* Optimizing the Bloom filter queries with AVX2 SIMD operations. (Added AVX2 to the USE_SSE=1 build.) Careful design was required to support (a) SIMD-optimized queries, (b) compatible non-SIMD code that's simple and efficient, (c) flexible choice of number of probes, and (d) essentially maximized accuracy for a cache-local Bloom filter. Probes are made eight at a time, so any number of probes up to 8 is the same speed, then up to 16, etc.
* Prefetching cache lines when building the filter. Although this optimization could be applied to the old structure as well, it seems to balance out the small added cost of accumulating 64 bit hashes for adding to the filter rather than 32 bit hashes.
Here's nominal speed data from filter_bench (200MB in filters, about 10k keys each, 10 bits filter data / key, 6 probes, avg key size 24 bytes, includes hashing time) on Skylake DE (relatively low clock speed):
$ ./filter_bench -quick -impl=2 -net_includes_hashing # New Bloom filter
Build avg ns/key: 47.7135
Mixed inside/outside queries...
Single filter net ns/op: 26.2825
Random filter net ns/op: 150.459
Average FP rate %: 0.954651
$ ./filter_bench -quick -impl=0 -net_includes_hashing # Old Bloom filter
Build avg ns/key: 47.2245
Mixed inside/outside queries...
Single filter net ns/op: 63.2978
Random filter net ns/op: 188.038
Average FP rate %: 1.13823
Similar build time but dramatically faster query times on hot data (63 ns to 26 ns), and somewhat faster on stale data (188 ns to 150 ns). Performance differences on batched and skewed query loads are between these extremes as expected.
The only other interesting thing about speed is "inside" (query key was added to filter) vs. "outside" (query key was not added to filter) query times. The non-SIMD implementations are substantially slower when most queries are "outside" vs. "inside". This goes against what one might expect or would have observed years ago, as "outside" queries only need about two probes on average, due to short-circuiting, while "inside" always have num_probes (say 6). The problem is probably the nastily unpredictable branch. The SIMD implementation has few branches (very predictable) and has pretty consistent running time regardless of query outcome.
Accuracy
The generally improved accuracy (re: Issue https://github.com/facebook/rocksdb/issues/5857) comes from a better design for probing indices
within a cache line (re: Issue https://github.com/facebook/rocksdb/issues/4120) and improved accuracy for millions of keys in a single filter from using a 64-bit hash function (XXH3p). Design details in code comments.
Accuracy data (generalizes, except old impl gets worse with millions of keys):
Memory bits per key: FP rate percent old impl -> FP rate percent new impl
6: 5.70953 -> 5.69888
8: 2.45766 -> 2.29709
10: 1.13977 -> 0.959254
12: 0.662498 -> 0.411593
16: 0.353023 -> 0.0873754
24: 0.261552 -> 0.0060971
50: 0.225453 -> ~0.00003 (less than 1 in a million queries are FP)
Fixes https://github.com/facebook/rocksdb/issues/5857
Fixes https://github.com/facebook/rocksdb/issues/4120
Unlike the old implementation, this implementation has a fixed cache line size (64 bytes). At 10 bits per key, the accuracy of this new implementation is very close to the old implementation with 128-byte cache line size. If there's sufficient demand, this implementation could be generalized.
Compatibility
Although old releases would see the new structure as corrupt filter data and read the table as if there's no filter, we've decided only to enable the new Bloom filter with new format_version=5. This provides a smooth path for automatic adoption over time, with an option for early opt-in.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6007
Test Plan: filter_bench has been used thoroughly to validate speed, accuracy, and correctness. Unit tests have been carefully updated to exercise new and old implementations, as well as the logic to select an implementation based on context (format_version).
Differential Revision: D18294749
Pulled By: pdillinger
fbshipit-source-id: d44c9db3696e4d0a17caaec47075b7755c262c5f
Summary:
Recent change https://github.com/facebook/rocksdb/pull/5861 mistakely use "prefix_extractor_ != nullptr" as the condition to determine whehter prefix bloom filter isused. It fails to consider read_options.total_order_seek, so it is wrong. The result is that an optimization for non-total-order seek is mistakely applied to total order seek, and introduces a bug in following corner case:
Because of RangeDelete(), a file's largest key is extended. Seek key falls into the range deleted file, so level iterator seeks into the previous file without getting any key. The correct behavior is to place the iterator to the first key of the next file. However, an optimization is triggered and invalidates the iterator because it is out of the prefix range, causing wrong results. This behavior is reproduced in the unit test added.
Fix the bug by setting prefix_extractor to be null if total order seek is used.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6028
Test Plan: Add a unit test which fails without the fix.
Differential Revision: D18479063
fbshipit-source-id: ac075f013029fcf69eb3a598f14c98cce3e810b3
Summary:
Add a new API that allows a user to call MultiGet specifying multiple keys belonging to different column families. This is mainly useful for users who want to do a consistent read of keys across column families, with the added performance benefits of batching and returning values using PinnableSlice.
As part of this change, the code in the original multi-column family MultiGet for acquiring the super versions has been refactored into a separate function that can be used by both, the batching and the non-batching versions of MultiGet.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5816
Test Plan:
make check
make asan_check
asan_crash_test
Differential Revision: D18408676
Pulled By: anand1976
fbshipit-source-id: 933e7bec91dd70e7b633be4ff623a1116cc28c8d
Summary:
The calculation in BlockBasedTable::MultiGet for the required buffer length for reading in compressed blocks is incorrect. It needs to take the 5-byte block trailer into account.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6014
Test Plan: Add a unit test DBBasicTest.MultiGetBufferOverrun that fails in asan_check before the fix, and passes after.
Differential Revision: D18412753
Pulled By: anand1976
fbshipit-source-id: 754dfb66be1d5f161a7efdf87be872198c7e3b72
Summary:
Fix issue https://github.com/facebook/rocksdb/issues/6012.
I found that it may be caused by the following codes in function _RemoveOldMemTables()_ in **db/memtable_list.cc** :
```
for (auto it = memlist.rbegin(); it != memlist.rend(); ++it) {
MemTable* mem = *it;
if (mem->GetNextLogNumber() > log_number) {
break;
}
current_->Remove(mem, to_delete);
```
The iterator **it** turns invalid after `current_->Remove(mem, to_delete);`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6013
Test Plan:
```
make check
```
Differential Revision: D18401107
Pulled By: riversand963
fbshipit-source-id: bf0da3b868ed70f7aff24cf7b3e2049c0c5c7a4e
Summary:
When users use Level-Compaction-with-TTL by setting `cf_options.ttl`, the ttl-expired data could take n*ttl time to reach the bottom level (where n is the number of levels) due to how the `creation_time` table property was calculated for the newly created files during compaction. The creation time of new files was set to a max of all compaction-input-files-creation-times which essentially resulted in resetting the ttl as the key range moves across levels. This behavior is now fixed by changing the `creation_time` to be based on minimum of all compaction-input-files-creation-times; this will cause cascading compactions across levels for the ttl-expired data to move to the bottom level, resulting in getting rid of tombstones/deleted-data faster.
This will help start cascading compactions to move the expired key range to the bottom-most level faster.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5992
Test Plan: `make check`
Differential Revision: D18257883
Pulled By: sagar0
fbshipit-source-id: 00df0bb8d0b7e14d9fc239df2cba8559f3e54cbc
Summary:
The test would fire two flushes to let them run in parallel. Previously it wait for the first job to be scheduled before firing the second. It is possible the job is not started before the second job being scheduled, making the two job combine into one. Change to wait for the first job being started.
Fixes https://github.com/facebook/rocksdb/issues/6017
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6018
Test Plan:
```
while ./db_flush_test --gtest_filter=*FireOnFlushCompletedAfterCommittedResult*; do :; done
```
and let it run for a while.
Signed-off-by: Yi Wu <yiwu@pingcap.com>
Differential Revision: D18405576
Pulled By: riversand963
fbshipit-source-id: 6ebb6262e033d5dc2ef81cb3eb410b314f2de4c9
Summary:
The patch exposes the file numbers of the SSTs as well as the oldest blob
files they contain a reference to through the GetColumnFamilyMetaData/
GetLiveFilesMetaData interface.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6011
Test Plan:
Fixed and extended the existing unit tests. (The earlier ColumnFamilyMetaDataTest
wasn't really testing anything because the generated memtables were never
flushed, so the metadata structure was essentially empty.)
Differential Revision: D18361697
Pulled By: ltamasi
fbshipit-source-id: d5ed1d94ac70858b84393c48711441ddfe1251e9
Summary:
This PR fixes https://github.com/facebook/rocksdb/issues/5975. In ```BlockBasedTable::RetrieveMultipleBlocks()```, we were calling ```MaybeReadBlocksAndLoadToCache()```, which is a no-op if neither uncompressed nor compressed block cache are configured.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5991
Test Plan:
1. Add unit tests that fail with the old code and pass with the new
2. make check and asan_check
Cc spetrunia
Differential Revision: D18272744
Pulled By: anand1976
fbshipit-source-id: e62fa6090d1a6adf84fcd51dfd6859b03c6aebfe
Summary:
It's useful to add test coverage for universal compaction's periodic compaction. Add two tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6002
Test Plan: Run the two tests
Differential Revision: D18363544
fbshipit-source-id: bbd04b54057315f64f959709006412db1f76d170
Summary:
Recently, periodic compaction got turned on by default for leveled compaction is compaction filter is used. Since periodic compaction is now supported in universal compaction too, we do the same default for universal now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5994
Test Plan: Add a new unit test.
Differential Revision: D18363744
fbshipit-source-id: 5093288ce990ee3cab0e44ffd92d8489fbcd6a48
Summary:
For MDEV-19670: MyRocks: key lookups into deleted data are very slow
BaseDeltaIterator remembers iterate_upper_bound and will not let delta_iterator_
walk above the iterate_upper_bound if base_iterator_ is not valid
anymore.
== Rationale ==
The most straightforward way would be to make the delta_iterator
(which is a rocksdb::WBWIIterator) to support iterator bounds. But
checking for bounds has an extra CPU overhead.
So we put the check into BaseDeltaIterator, and only make it when
base_iterator_ is not valid.
(note: We could take it even further, and move the check a few lines
down, and only check iterator bounds ourselves if base_iterator_ is
not valid AND delta_iterator_ hit a tombstone).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5403
Differential Revision: D15863092
Pulled By: maysamyabandeh
fbshipit-source-id: 8da458e7b9af95ff49356666f69664b4a6ccf49b
Summary:
We recently added periodic compaction to universal compaction. An old assertion that we can't onlyl compact the last sorted run triggered. However, with periodic compaction, it is possible that we only compact the last sorted run, so the assertion now became stricter than needed. Relaxing this assertion.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6000
Test Plan: This should be a low risk change. Will observe whether stress test will pass after it.
Differential Revision: D18285396
fbshipit-source-id: 9a6863debdf104c40a7f6c46ab62d84cdf5d8592
Summary:
FlushJobInfo and CompactionJobInfo are aggregates; we should use the
aggregate initialization syntax to ensure members (specifically those of
built-in types) are value-initialized.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5997
Test Plan: make check
Differential Revision: D18273398
Pulled By: ltamasi
fbshipit-source-id: 35b1a63ad9ca01605d288329858af72fffd7f392
Summary:
This change sets up for alternate implementations underlying
BloomFilterPolicy:
* Refactor BloomFilterPolicy and expose in internal .h file so that it's easy to iterate over / select implementations for testing, regardless of what the best public interface will look like. Most notably updated db_bloom_filter_test to use this.
* Hide FullFilterBitsBuilder from unit tests (alternate derived classes planned); expose the part important for testing (CalculateSpace), as abstract class BuiltinFilterBitsBuilder. (Also cleaned up internally exposed interface to CalculateSpace.)
* Rename BloomTest -> BlockBasedBloomTest for clarity (despite ongoing confusion between block-based table and block-based filter)
* Assert that block-based filter construction interface is only used on BloomFilterPolicy appropriately constructed. (A couple of tests updated to add ", true".)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5967
Test Plan: make check
Differential Revision: D18138704
Pulled By: pdillinger
fbshipit-source-id: 55ef9273423b0696309e251f50b8c1b5e9ec7597
Summary:
Previously, periodic compaction is not supported in universal compaction. Add the support using following approach: if any file is marked as qualified for periodid compaction, trigger a full compaction. If a full compaction is prevented by files being compacted, try to compact the higher levels than files currently being compacted. If in this way we can only compact the last sorted run and none of the file to be compacted qualifies for periodic compaction, skip the compact. This is to prevent the same single level compaction from being executed again and again.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5970
Test Plan: Add several test cases.
Differential Revision: D18147097
fbshipit-source-id: 8ecc308154d9aca96fb192c51fbceba3947550c1
Summary:
Right now, by default FIFO compaction has no TTL. We believe that a default TTL of 30 days will be better. With this patch, the default will be changed to 30 days. Default of Options.periodic_compaction_seconds will mean the same as options.ttl. If Options.ttl and Options.periodic_compaction_seconds left default, a default 30 days TTL will be used. If both options are set, the stricter value of the two will be used.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5987
Test Plan: Add an option sanitize test to cover the case.
Differential Revision: D18237935
fbshipit-source-id: a6dcea1f36c3849e13c0a69e413d73ad8eab58c9
Summary:
Compaction iterator has many assert statements that are active only during test runs. Some rare bugs would show up only at runtime could violate the assert condition but go unnoticed since assert statements are not compiled in release mode. Turning the assert statements to runtime check sone pors and cons:
Pros:
- A bug that would result into incorrect data would be detected early before the incorrect data is written to the disk.
Cons:
- Runtime overhead: which should be negligible since compaction cpu is the minority in the overall cpu usage
- The assert statements might already being violated at runtime, and turning them to runtime failure might result into reliability issues.
The patch takes a conservative step in this direction by logging the assert violations at runtime. If we see any violation reported in logs, we investigate. Otherwise, we can go ahead turning them to runtime error.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5935
Differential Revision: D18229697
Pulled By: maysamyabandeh
fbshipit-source-id: f1890eca80ccd7cca29737f1825badb9aa8038a8
Summary:
In pipeline writing mode, memtable switching needs to wait for memtable writing to finish to make sure that when memtables are made immutable, inserts are not going to them. This is currently done in DBImpl::SwitchMemtable(). This is done after flush_scheduler_.TakeNextColumnFamily() is called to fetch the list of column families to switch. The function flush_scheduler_.TakeNextColumnFamily() itself, however, is not thread-safe when being called together with flush_scheduler_.ScheduleFlush().
This change provides a fix, which moves the waiting logic before flush_scheduler_.TakeNextColumnFamily(). WaitForPendingWrites() is a natural place where the logic can happen.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5716
Test Plan: Run all tests with ASAN and TSAN.
Differential Revision: D18217658
fbshipit-source-id: b9c5e765c9989645bf10afda7c5c726c3f82f6c3
Summary:
- Periodic compactions are auto-enabled if a compaction filter or a compaction filter factory is set, in Level Compaction.
- The default value of `periodic_compaction_seconds` is changed to UINT64_MAX, which lets RocksDB auto-tune periodic compactions as needed. An explicit value of 0 will still work as before ie. to disable periodic compactions completely. For now, on seeing a compaction filter along with a UINT64_MAX value for `periodic_compaction_seconds`, RocksDB will make SST files older than 30 days to go through periodic copmactions.
Some RocksDB users make use of compaction filters to control when their data can be deleted, usually with a custom TTL logic. But it is occasionally possible that the compactions get delayed by considerable time due to factors like low writes to a key range, data reaching bottom level, etc before the TTL expiry. Periodic Compactions feature was originally built to help such cases. Now periodic compactions are auto enabled by default when compaction filters or compaction filter factories are used, as it is generally helpful to all cases to collect garbage.
`periodic_compaction_seconds` is set to a large value, 30 days, in `SanitizeOptions` when RocksDB sees that a `compaction_filter` or `compaction_filter_factory` is used.
This is done only for Level Compaction style.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5865
Test Plan:
- Added a new test `DBCompactionTest.LevelPeriodicCompactionWithCompactionFilters` to make sure that `periodic_compaction_seconds` is set if either `compaction_filter` or `compaction_filter_factory` options are set.
- `COMPILE_WITH_ASAN=1 make check`
Differential Revision: D17659180
Pulled By: sagar0
fbshipit-source-id: 4887b9cf2e53cf2dc93a7b658c6b15e1181217ee
Summary:
Fix for lite build
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5971
Test Plan: make J=1 -j64 LITE=1 all check
Differential Revision: D18148306
Pulled By: vjnadimpalli
fbshipit-source-id: 5b9a3edc3e73e054fee6b96e6f6e583cecc898f3
Summary:
Adding a new API to db.h that allows users to get file_creation_time of the oldest file in the DB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5948
Test Plan: Added unit test.
Differential Revision: D18056151
Pulled By: vjnadimpalli
fbshipit-source-id: 448ec9d34cb6772e1e5a62db399ace00dcbfbb5d
Summary:
A bug occasionally shows up in crash test, and https://github.com/facebook/rocksdb/issues/5851 reproduces it.
The bug can surface in the following way.
1. Database has multiple column families.
2. Between one DB restart, the last log file is corrupted in the middle (not the tail)
3. During restart, DB crashes between flushing between two column families.
Then DB will fail to be opened again with error "SST file is ahead of WALs".
Solution is to update the log number associated with each column family altogether after flushing all column families' memtables. The version edits should be written to a new MANIFEST. Only after writing to all these version edits succeed does RocksDB (atomically) points the CURRENT file to the new MANIFEST.
Test plan (on devserver):
```
$make all && make check
```
Specifically
```
$make db_test2
$./db_test2 --gtest_filter=DBTest2.CrashInRecoveryMultipleCF
```
Also checked for compatibility as follows.
Use this branch, run DBTest2.CrashInRecoveryMultipleCF and preserve the db directory.
Then checkout 5.4, build ldb, and dump the MANIFEST.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5856
Differential Revision: D17620818
Pulled By: riversand963
fbshipit-source-id: b52ce5969c9a8052cacec2bd805fcfb373589039
Summary:
- Updated our included xxhash implementation to version 0.7.2 (== the latest dev version as of 2019-10-09).
- Using XXH_NAMESPACE (like other fb projects) to avoid potential name collisions.
- Added fastrange64, and unit tests for it and fastrange32. These are faster alternatives to hash % range.
- Use preview version of XXH3 instead of MurmurHash64A for NPHash64
-- Had to update cache_test to increase probability of passing for any given hash function.
- Use fastrange64 instead of % with uses of NPHash64
-- Had to fix WritePreparedTransactionTest.CommitOfDelayedPrepared to avoid deadlock apparently caused by new hash collision.
- Set default seed for NPHash64 because specifying a seed rarely makes sense for it.
- Removed unnecessary include xxhash.h in a popular .h file
- Rename preview version of XXH3 to XXH3p for clarity and to ease backward compatibility in case final version of XXH3 is integrated.
Relying on existing unit tests for NPHash64-related changes. Each new implementation of fastrange64 passed unit tests when manipulating my local build to select it. I haven't done any integration performance tests, but I consider the improved performance of the pieces being swapped in to be well established.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5909
Differential Revision: D18125196
Pulled By: pdillinger
fbshipit-source-id: f6bf83d49d20cbb2549926adf454fd035f0ecc0d
Summary:
This patch adds a number of new information elements to the FlushJobInfo and
CompactionJobInfo structures that are passed to EventListeners via the
OnFlush{Begin, Completed} and OnCompaction{Begin, Completed} callbacks.
Namely, for flushes, the file numbers of the new SST and the oldest blob file it
references are propagated. For compactions, the new pieces of information are
the file number, level, and the oldest blob file referenced by each compaction
input and output file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5962
Test Plan:
Extended the EventListener unit tests with logic that checks that these information
elements are correctly propagated from the corresponding FileMetaData.
Differential Revision: D18095568
Pulled By: ltamasi
fbshipit-source-id: 6874359a6aadb53366b5fe87adcb2f9bd27a0a56
Summary:
For more information on the original problem see this [link](https://github.com/facebook/rocksdb/issues/3977).
This change adds two new tests. They are identical other than one uses range tombstones and the other does not. Each test generates sub files at L2 which overlap with keys L3. The test that uses range tombstones generates a single file at L2. This single file will generate a very large range overlap that will in turn create excessively large compaction.
1: T001 - T005
2: 000 - 005
In contrast, the test that uses key ranges generates 3 files at L2. As a single file is compacted at a time, those 3 files will generate less work per compaction iteration.
1: 001 - 002
1: 003 - 004
1: 005
2: 000 - 005
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5956
Differential Revision: D18071631
Pulled By: dlambrig
fbshipit-source-id: 12abae75fb3e0b022d228c6371698aa5e53385df
Summary:
Several error paths in opening of a plain table would leak memory. PR https://github.com/facebook/rocksdb/issues/5940 opened the leak to one more error path, which happens to have been (mistakenly) exercised by CuckooTableDBTest.AdaptiveTable. That test has been fixed, and the exercising of
plain table error cases (more than before) has been added as BadOptions1 and BadOptions2
to PlainTableDBTest. This effectively moved the memory leak to plain_table_db_test.
Also here is a cheap fix for the memory leak, without (yet?) changing the signature of
ReadTableProperties. This fixes ASAN on unit tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5951
Test Plan: make COMPILE_WITH_ASAN=1 check
Differential Revision: D18051940
Pulled By: pdillinger
fbshipit-source-id: e2952930c09a2b46c4f1ff09818c5090426929de
Summary:
Right now, when LevelIterator::Seek() is called, when a file is filtered out by prefix bloom filter, the position is put to the beginning of the next file. This is a confusing internal interface because many keys in the levels are skipped. Avoid this behavior by checking the key of the next file against the seek key, and invalidate the whole iterator if the prefix doesn't match.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5861
Test Plan: Add a new unit test to validate the behavior; run all exsiting tests; run crash_test
Differential Revision: D17918213
fbshipit-source-id: f06b47d937c7cc8919001f18dcc3af5b28c9cdac
Summary:
A recent change introduced readahead inside VerifyChecksum(). However it is not compatible with mmap mode and generated wrong checksum verification failure. Fix it by not enabling readahead in mmap
mode.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5945
Test Plan: Add a unit test that used to fail.
Differential Revision: D18021443
fbshipit-source-id: 6f2eb600f81b26edb02222563a4006869d576bff
Summary:
Plain table SSTs could crash sst_dump because of a bug in
PlainTableReader that can leave table_properties_ as null. Even if it
was intended not to keep the table properties in some cases, they were
leaked on the offending code path.
Steps to reproduce:
$ db_bench --benchmarks=fillrandom --num=2000000 --use_plain_table --prefix-size=12
$ sst_dump --file=0000xx.sst --show_properties
from [] to []
Process /dev/shm/dbbench/000014.sst
Sst file format: plain table
Raw user collected properties
------------------------------
Segmentation fault (core dumped)
Also added missing unit testing of plain table full_scan_mode, and
an assertion in NewIterator to check for regression.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5940
Test Plan: new unit test, manual, make check
Differential Revision: D18018145
Pulled By: pdillinger
fbshipit-source-id: 4310c755e824c4cd6f3f86a3abc20dfa417c5e07
Summary:
The patch adds a new command line parameter --decode_blob_index to sst_dump.
If this switch is specified, sst_dump prints blob indexes in a human readable format,
printing the blob file number, offset, size, and expiration (if applicable) for blob
references, and the blob value (and expiration) for inlined blobs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5926
Test Plan:
Used db_bench's BlobDB mode to generate SST files containing blob references with
and without expiration, as well as inlined blobs with and without expiration (note: the
latter are stored as plain values), and confirmed sst_dump correctly prints all four types
of records.
Differential Revision: D17939077
Pulled By: ltamasi
fbshipit-source-id: edc5f58fee94ba35f6699c6a042d5758f5b3963d
Summary:
When there are concurrent flush job on the same CF, `OnFlushCompleted` can be called before the flush result being install to LSM. Fixing the issue by passing `FlushJobInfo` through `MemTable`, and the thread who commit the flush result can fetch the `FlushJobInfo` and fire `OnFlushCompleted` on behave of the thread actually writing the SST.
Fix https://github.com/facebook/rocksdb/issues/5892
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5908
Test Plan: Add new test. The test will fail without the fix.
Differential Revision: D17916144
Pulled By: riversand963
fbshipit-source-id: e18df67d9533b5baee52ae3605026cdeb05cbe10
Summary:
Without this PR, clang analyzer complains.
```
$USE_CLANG=1 make analyze
db/compaction/compaction_job_test.cc:161:20: warning: The left operand of '==' is a garbage value
if (key.type == kTypeBlobIndex) {
~~~~~~~~ ^
1 warning generated.
```
Test Plan (on devserver)
```
$USE_CLANG=1 make analyze
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5924
Differential Revision: D17923226
Pulled By: riversand963
fbshipit-source-id: 9d1eb769b5e0de7cb3d89dc90d1cfa895db7fdc8
Summary:
This is groundwork for adding garbage collection support to BlobDB. The
patch adds logic that keeps track of the oldest blob file referred to by
each SST file. The oldest blob file is identified during flush/
compaction (similarly to how the range of keys covered by the SST is
identified), and persisted in the manifest as a custom field of the new
file edit record. Blob indexes with TTL are ignored for the purposes of
identifying the oldest blob file (since such blob files are cleaned up by the
TTL logic in BlobDB).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5903
Test Plan:
Added new unit tests; also ran db_bench in BlobDB mode, inspected the
manifest using ldb, and confirmed (by scanning the SST files using
sst_dump) that the value of the oldest blob file number field matches
the contents of the file for each SST.
Differential Revision: D17859997
Pulled By: ltamasi
fbshipit-source-id: 21662c137c6259a6af70446faaf3a9912c550e90
Summary:
Compaction can call OnTableFileCreationCompleted(). If file is empty, "(nil)"
is used as the file name.
Do the same for flush.
Test plan (dev server):
```
make all
make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5905
Differential Revision: D17883285
Pulled By: riversand963
fbshipit-source-id: 6565884adbb00e8023d88b17dfb3b6eb92220b59
Summary:
Since we do not evict a file's blocks from block cache before that file
is deleted, we require a file's cache ID prefix is both unique and
non-reusable. However, the Windows functionality we were relying on only
guaranteed uniqueness. That meant a newly created file could be assigned
the same cache ID prefix as a deleted file. If the newly created file
had block offsets matching the deleted file, full cache keys could be
exactly the same, resulting in obsolete data blocks returned from cache
when trying to read from the new file.
We noticed this when running on FAT32 where compaction was writing out
of order keys due to reading obsolete blocks from its input files. The
functionality is documented as behaving the same on NTFS, although I
wasn't able to repro it there.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5844
Test Plan:
we had a reliable repro of out-of-order keys on FAT32 that
was fixed by this change
Differential Revision: D17752442
fbshipit-source-id: 95d983f9196cf415f269e19293b97341edbf7e00
Summary:
RocksDB has a MultiGet() API that implements batched key lookup for higher performance (https://github.com/facebook/rocksdb/blob/master/include/rocksdb/db.h#L468). Currently, batching is implemented in BlockBasedTableReader::MultiGet() for SST file lookups. One of the ways it improves performance is by pipelining bloom filter lookups (by prefetching required cachelines for all the keys in the batch, and then doing the probe) and thus hiding the cache miss latency. The same concept can be extended to the memtable as well. This PR involves implementing a pipelined bloom filter lookup in DynamicBloom, and implementing MemTable::MultiGet() that can leverage it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5818
Test Plan:
Existing tests
Performance Test:
Ran the below command which fills up the memtable and makes sure there are no flushes and then call multiget. Ran it on master and on the new change and see atleast 1% performance improvement across all the test runs I did. Sometimes the improvement was upto 5%.
TEST_TMPDIR=/data/users/$USER/benchmarks/feature/ numactl -C 10 ./db_bench -benchmarks="fillseq,multireadrandom" -num=600000 -compression_type="none" -level_compaction_dynamic_level_bytes -write_buffer_size=200000000 -target_file_size_base=200000000 -max_bytes_for_level_base=16777216 -reads=90000 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4 -statistics -memtable_whole_key_filtering=true -memtable_bloom_size_ratio=10
Differential Revision: D17578869
Pulled By: vjnadimpalli
fbshipit-source-id: 23dc651d9bf49db11d22375bf435708875a1f192