Summary:
Iterators reseek to the target key after iterating over max_sequential_skip_in_iterations invalid values. The logic is susceptible to an infinite loop bug, which has been present with WritePrepared Transactions up until 6.2 release. Although the bug is not present on master, the patch adds a unit test to prevent it from resurfacing again.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5727
Differential Revision: D16952759
Pulled By: maysamyabandeh
fbshipit-source-id: d0d973dddc8dfabd5a794931232aa4c862c74f51
Summary:
Adding a light weight API to get last live WAL file name and size. Meant to be used as a helper for backup/restore tooling in a larger ecosystem such as MySQL with a MyRocks storage engine.
Specifically within MySQL's backup/restore mechanism, this call can be made with a write lock on the mysql db to get a transactionally consistent snapshot of the current WAL file position along with other non-rocksdb log/data files.
Without this, the alternative would be to take the aforementioned lock, scan the WAL dir for all files, find the last file and note its exact size as the rocksdb 'checkpoint'.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5765
Differential Revision: D17172717
Pulled By: affandar
fbshipit-source-id: f2fabafd4c0e6fc45f126670c8c88a9f84cb8a37
Summary:
Each DB has a globally unique ID. A DB can be physically copied around, or backed-up and restored, and the users should be identify the same DB. This unique ID right now is stored as plain text in file IDENTITY under the DB directory. This approach introduces at least two problems: (1) the file is not checksumed; (2) the source of truth of a DB is the manifest file, which can be copied separately from IDENTITY file, causing the DB ID to be wrong.
The goal of this PR is solve this problem by moving the DB ID to manifest. To begin with we will write to both identity file and manifest. Write to Manifest is controlled via the flag write_dbid_to_manifest in Options and default is false.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5725
Test Plan: Added unit tests.
Differential Revision: D16963840
Pulled By: vjnadimpalli
fbshipit-source-id: 8a86a4c8c82c716003c40fd6b9d2d758030d92e9
Summary:
Before this PR, when the number of column families involved in a file ingestion exceeds 2, a bug in the looping logic prevents correct file number being assigned to each ingestion job.
Also skip deleting non-existing hard links during cleanup-after-failure.
Test plan (devserver)
```
$COMPILE_WITH_ASAN=1 make all
$./external_sst_file_test --gtest_filter=ExternalSSTFileTest/ExternalSSTFileTest.IngestFilesIntoMultipleColumnFamilies_*/*
$makke check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5760
Differential Revision: D17142982
Pulled By: riversand963
fbshipit-source-id: 06c1847a4e7a402647bcf28d124e70f2a0f9daf6
Summary:
Before this PR, the following sequence of events can cause assertion failure as shown below.
Stack trace (partial):
```
(gdb) bt
2 0x00007f59b350ad15 in __assert_fail_base (fmt=<optimized out>, assertion=assertion@entry=0x9f8390 "mark_as_compacted ? !inputs_[i][j]->being_compacted : inputs_[i][j]->being_compacted", file=file@entry=0x9e347c "db/compaction/compaction.cc", line=line@entry=395, function=function@entry=0xa21ec0 <rocksdb::Compaction::MarkFilesBeingCompacted(bool)::__PRETTY_FUNCTION__> "void rocksdb::Compaction::MarkFilesBeingCompacted(bool)") at assert.c:92
3 0x00007f59b350adc3 in __GI___assert_fail (assertion=assertion@entry=0x9f8390 "mark_as_compacted ? !inputs_[i][j]->being_compacted : inputs_[i][j]->being_compacted", file=file@entry=0x9e347c "db/compaction/compaction.cc", line=line@entry=395, function=function@entry=0xa21ec0 <rocksdb::Compaction::MarkFilesBeingCompacted(bool)::__PRETTY_FUNCTION__> "void rocksdb::Compaction::MarkFilesBeingCompacted(bool)") at assert.c:101
4 0x0000000000492ccd in rocksdb::Compaction::MarkFilesBeingCompacted (this=<optimized out>, mark_as_compacted=<optimized out>) at db/compaction/compaction.cc:394
5 0x000000000049467a in rocksdb::Compaction::Compaction (this=0x7f59af013000, vstorage=0x7f581af53030, _immutable_cf_options=..., _mutable_cf_options=..., _inputs=..., _output_level=<optimized out>, _target_file_size=0, _max_compaction_bytes=0, _output_path_id=0, _compression=<incomplete type>, _compression_opts=..., _max_subcompactions=0, _grandparents=..., _manual_compaction=false, _score=4, _deletion_compaction=true, _compaction_reason=rocksdb::CompactionReason::kFIFOTtl) at db/compaction/compaction.cc:241
6 0x00000000004af9bc in rocksdb::FIFOCompactionPicker::PickTTLCompaction (this=0x7f59b31a6900, cf_name=..., mutable_cf_options=..., vstorage=0x7f581af53030, log_buffer=log_buffer@entry=0x7f59b1bfa930) at db/compaction/compaction_picker_fifo.cc:101
7 0x00000000004b0771 in rocksdb::FIFOCompactionPicker::PickCompaction (this=0x7f59b31a6900, cf_name=..., mutable_cf_options=..., vstorage=0x7f581af53030, log_buffer=0x7f59b1bfa930) at db/compaction/compaction_picker_fifo.cc:201
8 0x00000000004838cc in rocksdb::ColumnFamilyData::PickCompaction (this=this@entry=0x7f59b31b3700, mutable_options=..., log_buffer=log_buffer@entry=0x7f59b1bfa930) at db/column_family.cc:933
9 0x00000000004f3645 in rocksdb::DBImpl::BackgroundCompaction (this=this@entry=0x7f59b3176000, made_progress=made_progress@entry=0x7f59b1bfa6bf, job_context=job_context@entry=0x7f59b1bfa760, log_buffer=log_buffer@entry=0x7f59b1bfa930, prepicked_compaction=prepicked_compaction@entry=0x0, thread_pri=rocksdb::Env::LOW) at db/db_impl/db_impl_compaction_flush.cc:2541
10 0x00000000004f5e2a in rocksdb::DBImpl::BackgroundCallCompaction (this=this@entry=0x7f59b3176000, prepicked_compaction=prepicked_compaction@entry=0x0, bg_thread_pri=bg_thread_pri@entry=rocksdb::Env::LOW) at db/db_impl/db_impl_compaction_flush.cc:2312
11 0x00000000004f648e in rocksdb::DBImpl::BGWorkCompaction (arg=<optimized out>) at db/db_impl/db_impl_compaction_flush.cc:2087
```
This can be caused by the following sequence of events.
```
Time
| thr bg_compact_thr1 bg_compact_thr2
| write
| flush
| mark all l0 as being compacted
| write
| flush
| add cf to queue again
| mark all l0 as being
| compacted, fail the
| assertion
V
```
Test plan (on devserver)
Since bg_compact_thr1 and bg_compact_thr2 are two threads executing the same
code, it is difficult to use sync point dependency to
coordinate their execution. Therefore, I choose to use db_stress.
```
$TEST_TMPDIR=/dev/shm/rocksdb ./db_stress --periodic_compaction_seconds=1 --max_background_compactions=20 --format_version=2 --memtablerep=skip_list --max_write_buffer_number=3 --cache_index_and_filter_blocks=1 --reopen=20 --recycle_log_file_num=0 --acquire_snapshot_one_in=10000 --delpercent=4 --log2_keys_per_lock=22 --compaction_ttl=1 --block_size=16384 --use_multiget=1 --compact_files_one_in=1000000 --target_file_size_multiplier=2 --clear_column_family_one_in=0 --max_bytes_for_level_base=10485760 --use_full_merge_v1=1 --target_file_size_base=2097152 --checkpoint_one_in=1000000 --mmap_read=0 --compression_type=zstd --writepercent=35 --readpercent=45 --subcompactions=4 --use_merge=0 --write_buffer_size=4194304 --test_batches_snapshots=0 --db=/dev/shm/rocksdb/rocksdb_crashtest_whitebox --use_direct_reads=0 --compact_range_one_in=1000000 --open_files=-1 --destroy_db_initially=0 --progress_reports=0 --compression_zstd_max_train_bytes=0 --snapshot_hold_ops=100000 --enable_pipelined_write=0 --nooverwritepercent=1 --compression_max_dict_bytes=0 --max_key=1000000 --prefixpercent=5 --flush_one_in=1000000 --ops_per_thread=40000 --index_block_restart_interval=7 --cache_size=1048576 --compaction_style=2 --verify_checksum=1 --delrangepercent=1 --use_direct_io_for_flush_and_compaction=0
```
This should see no assertion failure.
Last but not least,
```
$COMPILE_WITH_ASAN=1 make -j32 all
$make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5754
Differential Revision: D17109791
Pulled By: riversand963
fbshipit-source-id: 25fc46101235add158554e096540b72c324be078
Summary:
In order to foster healthy open source communities, we're adopting the
[Contributor Covenant](https://www.contributor-covenant.org/). It has been
built by open source community members and represents a shared understanding of
what is expected from a healthy community.
Reviewed By: josephsavona, danobi, rdzhabarov
Differential Revision: D17104640
fbshipit-source-id: d210000de686c5f0d97d602b50472d5869bc6a49
Summary:
Open-source users recently reported two occurrences of LSM-tree corruption (https://github.com/facebook/rocksdb/issues/5558 is one), which would be caught by options.force_consistency_checks = true. options.force_consistency_checks has a usability limitation because it crashes the service once inconsistency is detected. This makes the feature hard to use. Most users serve from multiple RocksDB shards per server and the impacts of crashing the service is higher than it should be.
Instead, we just pass the error back to users without killing the service, and ask them to deal with the problem accordingly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5744
Differential Revision: D17096940
Pulled By: pdhandharia
fbshipit-source-id: b6780039044e265f26ed2ad03c51f4abbe8b603c
Summary:
Row cache is not supported in LITE mode. So disable the test in that mode.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5756
Test Plan: make LITE=1 all check
Differential Revision: D17115684
Pulled By: anand1976
fbshipit-source-id: e6433c2e528674645cea76cdfc80ddc473708fc2
Summary:
https://github.com/facebook/rocksdb/pull/5741 added compaction TTL to crash test, but it causes assertion fails for FIFO compaction. Disable this combination for now while we debug the assertion failure.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5749
Test Plan: Run crash test and observe that when compaction_style=2, compaction_ttl is always 0.
Differential Revision: D17078292
fbshipit-source-id: 446821a3b9739956094d5e4f9be1251a15b57f5d
Summary:
Covering periodic compaction and compaction TTL can help us expose potential issues. Add it there.
Randomly select value for these two options.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5741
Test Plan: Run crash_test and see the perameters generated.
Differential Revision: D17059515
fbshipit-source-id: 8213974846a0b6a22fc13be705825c9054d1d097
Summary:
This condition is now a normal occurrence during write burst so there is
no need to warn the user about it. Here is a scenario where it happens
under completely normal conditions.
* Initially we have a DB of three levels (L0, L1, and L2) that is stable, i.e., compaction scores are all less than one.
* Now a write burst comes along. At first L0 blows up a bit in size as compaction hasn't had a chance to catch up.
* As a result of the above, `base_bytes_min` also increases since it is based on L0 size as of https://github.com/facebook/rocksdb/issues/4338
* If `base_bytes_min` increased enough (i.e., to be larger than L1), then we are shown the warning that the DB has more levels than necessary.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5742
Differential Revision: D17059221
fbshipit-source-id: e4a31d6eea42089a8d273095f19653991bd91bea
Summary:
in order to avoid reallocations for a scratch std::string on every call to Next().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5702
Differential Revision: D16867803
fbshipit-source-id: 1391220a1b172b23336bbc71dc0c79ccf3b1c701
Summary:
MyRocks currently sets `max_write_buffer_number_to_maintain` in order to maintain enough history for transaction conflict checking. The effectiveness of this approach depends on the size of memtables. When memtables are small, it may not keep enough history; when memtables are large, this may consume too much memory.
We are proposing a new way to configure memtable list history: by limiting the memory usage of immutable memtables. The new option is `max_write_buffer_size_to_maintain` and it will take precedence over the old `max_write_buffer_number_to_maintain` if they are both set to non-zero values. The new option accounts for the total memory usage of flushed immutable memtables and mutable memtable. When the total usage exceeds the limit, RocksDB may start dropping immutable memtables (which is also called trimming history), starting from the oldest one.
The semantics of the old option actually works both as an upper bound and lower bound. History trimming will start if number of immutable memtables exceeds the limit, but it will never go below (limit-1) due to history trimming.
In order the mimic the behavior with the new option, history trimming will stop if dropping the next immutable memtable causes the total memory usage go below the size limit. For example, assuming the size limit is set to 64MB, and there are 3 immutable memtables with sizes of 20, 30, 30. Although the total memory usage is 80MB > 64MB, dropping the oldest memtable will reduce the memory usage to 60MB < 64MB, so in this case no memtable will be dropped.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5022
Differential Revision: D14394062
Pulled By: miasantreble
fbshipit-source-id: 60457a509c6af89d0993f988c9b5c2aa9e45f5c5
Summary:
Crc32c Parallel computation coding optimization:
Macro unfolding removes the "for" loop and is good to decrease branch-miss in arm64 micro architecture
1024 Bytes is divided into 8(head) + 1008( 6 * 7 * 3 * 8 ) + 8(tail) three parts
Macro unfolding 42 loops to 6 CRC32C7X24BYTESs
1 CRC32C7X24BYTES containing 7 CRC32C24BYTESs
1, crc32c_test
[==========] Running 4 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 4 tests from CRC
[ RUN ] CRC.StandardResults
[ OK ] CRC.StandardResults (1 ms)
[ RUN ] CRC.Values
[ OK ] CRC.Values (0 ms)
[ RUN ] CRC.Extend
[ OK ] CRC.Extend (0 ms)
[ RUN ] CRC.Mask
[ OK ] CRC.Mask (0 ms)
[----------] 4 tests from CRC (1 ms total)
[----------] Global test environment tear-down
[==========] 4 tests from 1 test case ran. (1 ms total)
[ PASSED ] 4 tests.
2, db_bench --benchmarks="crc32c"
crc32c : 0.218 micros/op 4595390 ops/sec; 17950.7 MB/s (4096 per op)
3, repeated crc32c_test case 60000 times
perf stat -e branch-miss -- ./crc32c_test
before optimization:
739,426,504 branch-miss
after optimization:
1,128,572 branch-miss
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5675
Differential Revision: D16989210
fbshipit-source-id: 7204e6069bb6ed066d49c2d1b3ac385065a98557
Summary:
PR https://github.com/facebook/rocksdb/issues/5584 decoupled the uncompression dictionary object from the underlying block data; however, this defeats the purpose of the digested ZSTD dictionary, since the whole point
of the digest is to create it once and reuse it over and over again. This patch goes back to
storing the uncompression dictionary itself in the cache (which should be now safe to do,
since it no longer includes a Statistics pointer), while preserving the rest of the refactoring.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5645
Test Plan: make asan_check
Differential Revision: D16551864
Pulled By: ltamasi
fbshipit-source-id: 2a7e2d34bb16e70e3c816506d5afe1d842057800
Summary:
AtomicFlushStressTest is a powerful test, but right now we only run it for atomic_flush=true + disable_wal=true. We further extend it to the case where atomic_flush=false + disable_wal = false. All the workload generation and validation can stay the same.
Atomic flush crash test is also changed to switch between the two test scenarios. It makes the name "atomic flush crash test" out of sync from what it really does. We leave it as it is to avoid troubles with continous test set-up.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5729
Test Plan: Run "CRASH_TEST_KILL_ODD=188 TEST_TMPDIR=/dev/shm/ USE_CLANG=1 make whitebox_crash_test_with_atomic_flush", observe the settings used and see it passed.
Differential Revision: D16969791
fbshipit-source-id: 56e37487000ae631e31b0100acd7bdc441c04163
Summary:
To improve code readability, since RetrieveBlock already calls MaybeReadBlockAndLoadToCache, we avoid name similarity of the functions that call RetrieveBlock with MaybeReadBlockAndLoadToCache. The patch thus renames MaybeLoadBlocksToCache to RetrieveMultipleBlock and deletes GetDataBlockFromCache, which contains only two lines.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5726
Differential Revision: D16962535
Pulled By: maysamyabandeh
fbshipit-source-id: 99e8946808ce4eb7857592b9003812e3004f92d6
Summary:
The batched MultiGet() implementation was not correctly handling bloom filter lookups when whole_key_filtering is disabled. It was incorrectly skipping keys not in the prefix_extractor domain, and not calling transform for keys in domain. This PR fixes both problems by moving the domain check and transformation to the FilterBlockReader.
Tests:
Unit test (confirmed failed before the fix)
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5665
Differential Revision: D16902380
Pulled By: anand1976
fbshipit-source-id: a6be81ad68a6e37134a65246aec7a2c590eccf00
Summary:
The comments of snap_refresh_nanos advertise that the snapshot refresh feature will be disabled when the option is set to 0. This contract is however not honored in the code: https://github.com/facebook/rocksdb/pull/5278
The patch fixes that and also adds an assert to ensure that the feature is not used when the option is zero.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5724
Differential Revision: D16918185
Pulled By: maysamyabandeh
fbshipit-source-id: fec167287df7d85093e087fc39c0eb243e3bbd7e
Summary:
Recently readahead is introduced for checksum verifying. However, users cannot override the setting for the checksum verifying before external SST file ingestion. Introduce a new option for the purpose.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5721
Test Plan: Add a new unit test for it.
Differential Revision: D16906896
fbshipit-source-id: 218ec37001ddcc05411cefddbe233d15ab308476
Summary:
We see this TSAN warning:
WARNING: ThreadSanitizer: data race (pid=282806)
Write of size 8 at 0x7b6c00000e38 by thread T16 (mutexes: write M1023578822185846136):
#0 operator delete(void*) <null> (libtsan.so.0+0x0000000795f8)
https://github.com/facebook/rocksdb/issues/1 rocksdb::DBImpl::BackgroundFlush(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::FlushReason*, rocksdb::Env::Priority) db/db_impl/db_impl_compaction_flush.cc:2202 (db_flush_test+0x00000060b462)
https://github.com/facebook/rocksdb/issues/2 rocksdb::DBImpl::BackgroundCallFlush(rocksdb::Env::Priority) db/db_impl/db_impl_compaction_flush.cc:2226 (db_flush_test+0x00000060cbd8)
https://github.com/facebook/rocksdb/issues/3 rocksdb::DBImpl::BGWorkFlush(void*) db/db_impl/db_impl_compaction_flush.cc:2073 (db_flush_test+0x00000060d5ac)
......
Previous atomic write of size 4 at 0x7b6c00000e38 by main thread:
#0 __tsan_atomic32_fetch_sub <null> (libtsan.so.0+0x00000006d721)
https://github.com/facebook/rocksdb/issues/1 std::__atomic_base<int>::fetch_sub(int, std::memory_order) /mnt/gvfs/third-party2/libgcc/c67031f0f739ac61575a061518d6ef5038f99f90/7.x/platform007/5620abc/include/c++/7.3.0/bits/atomic_base.h:524 (db_flush_test+0x0000005f9e38)
https://github.com/facebook/rocksdb/issues/2 rocksdb::ColumnFamilyData::Unref() db/column_family.h:286 (db_flush_test+0x0000005f9e38)
https://github.com/facebook/rocksdb/issues/3 rocksdb::DBImpl::FlushMemTable(rocksdb::ColumnFamilyData*, rocksdb::FlushOptions const&, rocksdb::FlushReason, bool) db/db_impl/db_impl_compaction_flush.cc:1624 (db_flush_test+0x0000005f9e38)
https://github.com/facebook/rocksdb/issues/4 rocksdb::DBImpl::TEST_FlushMemTable(rocksdb::ColumnFamilyData*, rocksdb::FlushOptions const&) db/db_impl/db_impl_debug.cc:127 (db_flush_test+0x00000061ace9)
https://github.com/facebook/rocksdb/issues/5 rocksdb::DBFlushTest_CFDropRaceWithWaitForFlushMemTables_Test::TestBody() db/db_flush_test.cc:320 (db_flush_test+0x0000004b44e5)
https://github.com/facebook/rocksdb/issues/6 void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) third-party/gtest-1.7.0/fused-src/gtest/gtest-all.cc:3824 (db_flush_test+0x000000be2988)
......
It's still very clear the cause of the warning is because that TSAN treats results from relaxed atomic::fetch_sub() as non-atomic with the operation itself. We can make it more explicit by bumping up the order to CS.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5723
Test Plan: Run all existing test.
Differential Revision: D16908250
fbshipit-source-id: bf17d39ed19058372bdf97f6440a743f88153021
Summary:
Atomic white box test's kill odd is the same as normal test. However, in the scenario that only WritableFileWriter::Append() is blacklisted, WritableFileWriter::Flush() dominates the killing odds. Normally, most of WritableFileWriter::Flush() are called in WAL writes, where every write triggers a WAL flush. In atomic test, WAL is disabled, so the kill happens less frequently than we antipated. In some rare cases, the kill didn't end up with happening (for reasons I still don't fully understand) and cause the stress test timeout.
If WAL is disabled, make the odds 5x likely to trigger.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5717
Test Plan: Run whitebox_crash_test_with_atomic_flush and whitebox_crash_test and observe the kill odds printed out.
Differential Revision: D16897237
fbshipit-source-id: cbf5d96f6fc0e980523d0f1f94bf4e72cdb82d1c
Summary:
Right now VerifyChecksum() doesn't do read-ahead. In some use cases, users won't be able to achieve good performance. With this change, by default, RocksDB will do a default readahead, and users will be able to overwrite the readahead size by passing in a ReadOptions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5713
Test Plan: Add a new unit test.
Differential Revision: D16860874
fbshipit-source-id: 0cff0fe79ac855d3d068e6ccd770770854a68413
Summary:
Previous PR https://github.com/facebook/rocksdb/pull/3601 added support for making prefix_extractor dynamically mutable. However, there was a missing check for hash index when creating new BlockBasedTableIterator. While the check may be redundant because no other types of IndexReader makes uses of the flag, it is less error-prone to add the missing check so that future index reader implementation will not worry about violating the contract.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5712
Differential Revision: D16842052
Pulled By: miasantreble
fbshipit-source-id: aef11c0ff7a690ed248f5b8fe23481cac486b381
Summary:
In valgrind_test, TransactionTest.GetWithoutSnapshot ran 2 hours and still didn't finish. Black list from valgrind_test to prevent timeout.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5715
Test Plan: run "make valgrind_test" and see whether the test is still generated.
Differential Revision: D16866009
fbshipit-source-id: 92c78049b0bc1c2b9a0dfc1b7c8a9206b36f02f0
Summary:
Update HISTORY.md by removing a feature from "Unreleased" to 6.4.0 after cherry-picking related commits to 6.4.fb branch.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5714
Differential Revision: D16865334
Pulled By: riversand963
fbshipit-source-id: f17ede905a1dfbbcdf98806ca398c618cf54748a
Summary:
This was previously broken, as the performance context-related
macro signatures in file monitoring/perf_context_imp.h
deviated for the case when NPERF_CONTEXT was defined and when it
was not.
Update the macros for the `-DNPERF_CONTEXT` case, so it compiles.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5704
Differential Revision: D16867746
fbshipit-source-id: 05539724cb1f7955ecc42828365836a677759ad9
Summary:
VersionSet::ApproximateSize doesn't need to create two separate index iterators and do binary search for each in BlockBasedTable. So BlockBasedTable::ApproximateSize was added that creates the iterator once and uses it to calculate the data size between start and end keys.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5693
Differential Revision: D16774056
Pulled By: elipoz
fbshipit-source-id: 53ce262e1a057788243bf30cd9b8aa6581df1a18
Summary:
there is no need to return void*, as
std:🧵:thread(Func&& f, Args&&... args ) only requires `Func` to
be callable.
Signed-off-by: Kefu Chai <tchaikov@gmail.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5709
Differential Revision: D16832894
fbshipit-source-id: a1e1b876fa8d55589ef5feb5b27f3a435068b747
Summary:
Add a command in ldb so that users can print out tombstones in SST files.
In order to test the code, change the interface of LDBCommandRunner::RunCommand() so that it doesn't return from the program, but return the status code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5615
Test Plan: Add a new unit test
Differential Revision: D16550326
fbshipit-source-id: 88ddfe6984bdcbb3a528abdd115089df09eba52e
Summary:
Previously, the end key of a range deletion tombstone was considered exclusive for the purposes of deletion, but considered inclusive when checking if two SSTables overlap. For example, an SSTable with a range deletion tombstone [a, b) would be considered overlapping with an SSTable with a range deletion tombstone [b, c). This commit fixes this check.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5649
Differential Revision: D16808765
Pulled By: anand1976
fbshipit-source-id: 5c7ad1c027e4f778d35070e5dae1b8e6037e0d68
Summary:
PR https://github.com/facebook/rocksdb/issues/5298 (and subsequent related patches) unintentionally changed the
semantics of cache_index_and_filter_blocks: historically, this option
only affected the main index/filter block; with the changes, it affects
index/filter partitions as well. This can cause performance issues when
cache_index_and_filter_blocks is false since in this case, partitions are
neither cached nor preloaded (i.e. they are loaded on demand upon each
access). The patch reverts to the earlier behavior, that is, partitions
are cached similarly to data blocks regardless of the value of the above
option.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5705
Test Plan:
make check
./db_bench -benchmarks=fillrandom --statistics --stats_interval_seconds=1 --duration=30 --num=500000000 --bloom_bits=20 --partition_index_and_filters=true --cache_index_and_filter_blocks=false
./db_bench -benchmarks=readrandom --use_existing_db --statistics --stats_interval_seconds=1 --duration=10 --num=500000000 --bloom_bits=20 --partition_index_and_filters=true --cache_index_and_filter_blocks=false --cache_size=8000000000
Relevant statistics from the readrandom benchmark with the old code:
rocksdb.block.cache.index.miss COUNT : 0
rocksdb.block.cache.index.hit COUNT : 0
rocksdb.block.cache.index.add COUNT : 0
rocksdb.block.cache.index.bytes.insert COUNT : 0
rocksdb.block.cache.index.bytes.evict COUNT : 0
rocksdb.block.cache.filter.miss COUNT : 0
rocksdb.block.cache.filter.hit COUNT : 0
rocksdb.block.cache.filter.add COUNT : 0
rocksdb.block.cache.filter.bytes.insert COUNT : 0
rocksdb.block.cache.filter.bytes.evict COUNT : 0
With the new code:
rocksdb.block.cache.index.miss COUNT : 2500
rocksdb.block.cache.index.hit COUNT : 42696
rocksdb.block.cache.index.add COUNT : 2500
rocksdb.block.cache.index.bytes.insert COUNT : 4050048
rocksdb.block.cache.index.bytes.evict COUNT : 0
rocksdb.block.cache.filter.miss COUNT : 2500
rocksdb.block.cache.filter.hit COUNT : 4550493
rocksdb.block.cache.filter.add COUNT : 2500
rocksdb.block.cache.filter.bytes.insert COUNT : 10331040
rocksdb.block.cache.filter.bytes.evict COUNT : 0
Differential Revision: D16817382
Pulled By: ltamasi
fbshipit-source-id: 28a516b0da1f041a03313e0b70b28cf5cf205d00
Summary:
Fix a bug in write unprepared savepoints. When flushing the write batch according to savepoint boundaries, we were forgetting to flush the last write batch after the last savepoint, meaning that some data was not written to DB.
Also, add a small optimization where we avoid flushing empty batches.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5703
Differential Revision: D16811996
Pulled By: lth
fbshipit-source-id: 600c7e0e520ad7a8fad32d77e11d932453e68e3f
Summary:
Some accesses to blob_files_ and open_ttl_files_ in BlobDBImpl, as well
as to expiration_range_ in BlobFile were not properly synchronized.
The patch fixes this and also makes sure the invariant that obsolete_files_
is a subset of blob_files_ holds even when an attempt to delete an obsolete
blob file fails.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5698
Test Plan:
COMPILE_WITH_TSAN=1 make blob_db_test
gtest-parallel --repeat=1000 ./blob_db_test --gtest_filter="*ShutdownWait*"
The test fails with TSAN errors ~20 times out of 1000 without the patch but
completes successfully 1000 out of 1000 times with the fix.
Differential Revision: D16793235
Pulled By: ltamasi
fbshipit-source-id: 8034b987598d4fdc9f15098d4589cc49cde484e9
Summary:
In MyRocks, there are cases where we write while iterating through keys. This currently breaks WBWIIterator, because if a write batch flushes during iteration, the delta iterator would point to invalid memory.
For now, fix by disallowing flush if there are active iterators. In the future, we will loop through all the iterators on a transaction, and refresh the iterators when a write batch is flushed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5699
Differential Revision: D16794157
Pulled By: lth
fbshipit-source-id: 5d5bf70688bd68fe58e8a766475ae88fd1be3190
Summary:
Fix the following clang analyze failures:
```
In file included from utilities/transactions/transaction_test.cc:8:
./utilities/transactions/transaction_test.h:174:14: warning: Attempt to delete released memory
delete root_db;
^
```
The destructor of StackableDB already deletes the root db and there is no need to delete the db separately.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5700
Test Plan: USE_CLANG=1 TEST_TMPDIR=/dev/shm/rocksdb OPT=-g make -j24 analyze
Differential Revision: D16800579
Pulled By: maysamyabandeh
fbshipit-source-id: 64c2d70f23e07e6a15242add97c744902ea33be5
Summary:
Currently, if a write is done without a snapshot, then `largest_validated_seq_` is set to `kMaxSequenceNumber`. This is too aggressive, because an iterator with a snapshot created after this write should be valid.
Set `largest_validated_seq_` to `GetLastPublishedSequence` instead. The variable means that no keys in the current tracked key set has changed by other transactions since `largest_validated_seq_`.
Also, do some extra cleanup in Clear() for safety.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5697
Differential Revision: D16788613
Pulled By: lth
fbshipit-source-id: f2aa40b8b12e0c0cf9e38c940fecc8f1cc0d2385