Tag:
					 Branch:
					 Tree:
					1601433b3a
				
			
			
		
		main
				
					oxigraph-8.1.1
				
					oxigraph-8.3.2
				
					oxigraph-main
				
			
			
				
			
		${ noResults }
		
	
		
			4831 Commits (1601433b3ae915337d882d59ffdfcce92b96223a)
		
	
	
		
	
	| Author | SHA1 | Message | Date | 
|---|---|---|---|
|  Hui Xiao | ca0ef54f16 | Rate-limit automatic WAL flush after each user write (#9607) Summary: **Context:** WAL flush is currently not rate-limited by `Options::rate_limiter`. This PR is to provide rate-limiting to auto WAL flush, the one that automatically happen after each user write operation (i.e, `Options::manual_wal_flush == false`), by adding `WriteOptions::rate_limiter_options`. Note that we are NOT rate-limiting WAL flush that do NOT automatically happen after each user write, such as `Options::manual_wal_flush == true + manual FlushWAL()` (rate-limiting multiple WAL flushes), for the benefits of: - being consistent with [ReadOptions::rate_limiter_priority](https://github.com/facebook/rocksdb/blob/7.0.fb/include/rocksdb/options.h#L515) - being able to turn off some WAL flush's rate-limiting but not all (e.g, turn off specific the WAL flush of a critical user write like a service's heartbeat) `WriteOptions::rate_limiter_options` only accept `Env::IO_USER` and `Env::IO_TOTAL` currently due to an implementation constraint. - The constraint is that we currently queue parallel writes (including WAL writes) based on FIFO policy which does not factor rate limiter priority into this layer's scheduling. If we allow lower priorities such as `Env::IO_HIGH/MID/LOW` and such writes specified with lower priorities occurs before ones specified with higher priorities (even just by a tiny bit in arrival time), the former would have blocked the latter, leading to a "priority inversion" issue and contradictory to what we promise for rate-limiting priority. Therefore we only allow `Env::IO_USER` and `Env::IO_TOTAL` right now before improving that scheduling. A pre-requisite to this feature is to support operation-level rate limiting in `WritableFileWriter`, which is also included in this PR. **Summary:** - Renamed test suite `DBRateLimiterTest to DBRateLimiterOnReadTest` for adding a new test suite - Accept `rate_limiter_priority` in `WritableFileWriter`'s private and public write functions - Passed `WriteOptions::rate_limiter_options` to `WritableFileWriter` in the path of automatic WAL flush. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9607 Test Plan: - Added new unit test to verify existing flush/compaction rate-limiting does not break, since `DBTest, RateLimitingTest` is disabled and current db-level rate-limiting tests focus on read only (e.g, `db_rate_limiter_test`, `DBTest2, RateLimitedCompactionReads`). - Added new unit test `DBRateLimiterOnWriteWALTest, AutoWalFlush` - `strace -ftt -e trace=write ./db_bench -benchmarks=fillseq -db=/dev/shm/testdb -rate_limit_auto_wal_flush=1 -rate_limiter_bytes_per_sec=15 -rate_limiter_refill_period_us=1000000 -write_buffer_size=100000000 -disable_auto_compactions=1 -num=100` - verified that WAL flush(i.e, system-call _write_) were chunked into 15 bytes and each _write_ was roughly 1 second apart - verified the chunking disappeared when `-rate_limit_auto_wal_flush=0` - crash test: `python3 tools/db_crashtest.py blackbox --disable_wal=0 --rate_limit_auto_wal_flush=1 --rate_limiter_bytes_per_sec=10485760 --interval=10` killed as normal **Benchmarked on flush/compaction to ensure no performance regression:** - compaction with rate-limiting (see table 1, avg over 1280-run): pre-change: **915635 micros/op**; post-change: **907350 micros/op (improved by 0.106%)** ``` #!/bin/bash TEST_TMPDIR=/dev/shm/testdb START=1 NUM_DATA_ENTRY=8 N=10 rm -f compact_bmk_output.txt compact_bmk_output_2.txt dont_care_output.txt for i in $(eval echo "{$START..$NUM_DATA_ENTRY}") do NUM_RUN=$(($N*(2**($i-1)))) for j in $(eval echo "{$START..$NUM_RUN}") do ./db_bench --benchmarks=fillrandom -db=$TEST_TMPDIR -disable_auto_compactions=1 -write_buffer_size=6710886 > dont_care_output.txt && ./db_bench --benchmarks=compact -use_existing_db=1 -db=$TEST_TMPDIR -level0_file_num_compaction_trigger=1 -rate_limiter_bytes_per_sec=100000000 | egrep 'compact' done > compact_bmk_output.txt && awk -v NUM_RUN=$NUM_RUN '{sum+=$3;sum_sqrt+=$3^2}END{print sum/NUM_RUN, sqrt(sum_sqrt/NUM_RUN-(sum/NUM_RUN)^2)}' compact_bmk_output.txt >> compact_bmk_output_2.txt done ``` - compaction w/o rate-limiting (see table 2, avg over 640-run): pre-change: **822197 micros/op**; post-change: **823148 micros/op (regressed by 0.12%)** ``` Same as above script, except that -rate_limiter_bytes_per_sec=0 ``` - flush with rate-limiting (see table 3, avg over 320-run, run on the [patch]( | 4 years ago | 
|  Ezgi Çiçek | 27d6ef8e60 | Rename mutable_cf_options to signify explicity copy (#9666) Summary: Signify explicit copy with comment and better name for variable `mutable_cf_options` Pull Request resolved: https://github.com/facebook/rocksdb/pull/9666 Reviewed By: riversand963 Differential Revision: D34680934 Pulled By: ezgicicek fbshipit-source-id: b64ef18725fe523835d14ceb4b29bcdfe493f8ed | 4 years ago | 
|  Jay Zhuang | 36aec94d85 | `compression_per_level` should be used for flush and changeable (#9658) Summary: - Make `compression_per_level` dynamical changeable with `SetOptions`; - Fix a bug that `compression_per_level` is not used for flush; Pull Request resolved: https://github.com/facebook/rocksdb/pull/9658 Test Plan: CI Reviewed By: ajkr Differential Revision: D34700749 Pulled By: jay-zhuang fbshipit-source-id: a23b9dfa7ad03d393c1d71781d19e91de796f49c | 4 years ago | 
|  Peter Dillinger | ce60d0cbe5 | Test refactoring for Backups+Temperatures (#9655) Summary: In preparation for more support for file Temperatures in BackupEngine, this change does some test refactoring: * Move DBTest2::BackupFileTemperature test to BackupEngineTest::FileTemperatures, with some updates to make it work in the new home. This test will soon be expanded for deeper backup work. * Move FileTemperatureTestFS from db_test2.cc to db_test_util.h, to support sharing because of above moved test, but split off the "no link" part to the test needing it. * Use custom FileSystems in backupable_db_test rather than custom Envs, because going through Env file interfaces doesn't support temperatures. * Fix RemapFileSystem to map DirFsyncOptions::renamed_new_name parameter to FsyncWithDirOptions, which was required because this limitation caused a crash only after moving to higher fidelity of FileSystem interface (vs. LegacyDirectoryWrapper throwing away some parameter details) * `backupable_options_` -> `engine_options_` as part of the ongoing work to get rid of the obsolete "backupable" naming. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9655 Test Plan: test code updates only Reviewed By: jay-zhuang Differential Revision: D34622183 Pulled By: pdillinger fbshipit-source-id: f24b7a596a89b9e089e960f4e5d772575513e93f | 4 years ago | 
|  Hui Xiao | fc61e98ae6 | Attempt to deflake DBLogicalBlockSizeCacheTest.CreateColumnFamilies (#9516) Summary:
**Context:**
`DBLogicalBlockSizeCacheTest.CreateColumnFamilies` is flaky on a rare occurrence of assertion failure below
```
db/db_logical_block_size_cache_test.cc:210
Expected equality of these values:
  1
  cache_->GetRefCount(cf_path_0_)
    Which is: 2
```
Root-cause: `ASSERT_OK(db->DestroyColumnFamilyHandle(cfs[0]));` in the test may not successfully decrease the ref count of `cf_path_0_` since the decreasing only happens in the clean-up of `ColumnFamilyData` when `ColumnFamilyData` has no referencing to it, which may not be true when `db->DestroyColumnFamilyHandle(cfs[0])` is called since background work such as `DumpStats()` can hold reference to that `ColumnFamilyData` (suggested and repro-d by ajkr ). Similar case `ASSERT_OK(db->DestroyColumnFamilyHandle(cfs[1]));`.
See following for a deterministic repro:
```
 diff --git a/db/db_impl/db_impl.cc b/db/db_impl/db_impl.cc
index 196b428a3..4e7a834c4 100644
 --- a/db/db_impl/db_impl.cc
+++ b/db/db_impl/db_impl.cc
@@ -956,10 +956,16 @@ void DBImpl::DumpStats() {
         // near-atomically.
         // Get a ref before unlocking
         cfd->Ref();
+        if (cfd->GetName() == "cf1" || cfd->GetName() == "cf2") {
+          TEST_SYNC_POINT("DBImpl::DumpStats:PostCFDRef");
+        }
         {
           InstrumentedMutexUnlock u(&mutex_);
           cfd->internal_stats()->CollectCacheEntryStats(/*foreground=*/false);
         }
+        if (cfd->GetName() == "cf1" || cfd->GetName() == "cf2") {
+          TEST_SYNC_POINT("DBImpl::DumpStats::PreCFDUnrefAndTryDelete");
+        }
         cfd->UnrefAndTryDelete();
       }
     }
 diff --git a/db/db_logical_block_size_cache_test.cc b/db/db_logical_block_size_cache_test.cc
index 1057871c9..c3872c036 100644
 --- a/db/db_logical_block_size_cache_test.cc
+++ b/db/db_logical_block_size_cache_test.cc
@@ -9,6 +9,7 @@
 #include "env/io_posix.h"
 #include "rocksdb/db.h"
 #include "rocksdb/env.h"
+#include "test_util/sync_point.h"
 namespace ROCKSDB_NAMESPACE {
 class EnvWithCustomLogicalBlockSizeCache : public EnvWrapper {
@@ -183,6 +184,15 @@ TEST_F(DBLogicalBlockSizeCacheTest, CreateColumnFamilies) {
   ASSERT_EQ(1, cache_->GetRefCount(dbname_));
   std::vector<ColumnFamilyHandle*> cfs;
+  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
+  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->LoadDependency(
+      {{"DBLogicalBlockSizeCacheTest::CreateColumnFamilies::PostSetupTwoCFH",
+        "DBImpl::DumpStats:StartRunning"},
+       {"DBImpl::DumpStats:PostCFDRef",
+        "DBLogicalBlockSizeCacheTest::CreateColumnFamilies::PreDeleteTwoCFH"},
+       {"DBLogicalBlockSizeCacheTest::CreateColumnFamilies::"
+        "PostFinishCheckingRef",
+        "DBImpl::DumpStats::PreCFDUnrefAndTryDelete"}});
   ASSERT_OK(db->CreateColumnFamilies(cf_options, {"cf1", "cf2"}, &cfs));
   ASSERT_EQ(2, cache_->Size());
   ASSERT_TRUE(cache_->Contains(dbname_));
@@ -190,7 +200,7 @@ TEST_F(DBLogicalBlockSizeCacheTest, CreateColumnFamilies) {
   ASSERT_TRUE(cache_->Contains(cf_path_0_));
   ASSERT_EQ(2, cache_->GetRefCount(cf_path_0_));
   }
    // Delete one handle will not drop cache because another handle is still
   // referencing cf_path_0_.
+  TEST_SYNC_POINT(
+      "DBLogicalBlockSizeCacheTest::CreateColumnFamilies::PostSetupTwoCFH");
+  TEST_SYNC_POINT(
+      "DBLogicalBlockSizeCacheTest::CreateColumnFamilies::PreDeleteTwoCFH");
   ASSERT_OK(db->DestroyColumnFamilyHandle(cfs[0]));
   ASSERT_EQ(2, cache_->Size());
   ASSERT_TRUE(cache_->Contains(dbname_));
@@ -209,16 +221,20 @@ TEST_F(DBLogicalBlockSizeCacheTest, CreateColumnFamilies) {
   ASSERT_TRUE(cache_->Contains(cf_path_0_));
    // Will fail
   ASSERT_EQ(1, cache_->GetRefCount(cf_path_0_));
   // Delete the last handle will drop cache.
   ASSERT_OK(db->DestroyColumnFamilyHandle(cfs[1]));
   ASSERT_EQ(1, cache_->Size());
   ASSERT_TRUE(cache_->Contains(dbname_));
   // Will fail
   ASSERT_EQ(1, cache_->GetRefCount(dbname_));
+  TEST_SYNC_POINT(
+      "DBLogicalBlockSizeCacheTest::CreateColumnFamilies::"
+      "PostFinishCheckingRef");
   delete db;
   ASSERT_EQ(0, cache_->Size());
   ASSERT_OK(DestroyDB(dbname_, options,
       {{"cf1", cf_options}, {"cf2", cf_options}}));
+  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->DisableProcessing();
 }
```
**Summary**
- Removed the flaky assertion
- Clarified the comments for the test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9516
Test Plan:
- CI
- Monitor for future flakiness
Reviewed By: ajkr
Differential Revision: D34055232
Pulled By: hx235
fbshipit-source-id: 9bf83ae5fa88bf6fc829876494d4692082e4c357 | 4 years ago | 
|  Hui Xiao | 4a776d81cc | Dynamic toggling of BlockBasedTableOptions::detect_filter_construct_corruption (#9654) Summary: **Context/Summary:** As requested, `BlockBasedTableOptions::detect_filter_construct_corruption` can now be dynamically configured using `DB::SetOptions` after this PR Pull Request resolved: https://github.com/facebook/rocksdb/pull/9654 Test Plan: - New unit test Reviewed By: pdillinger Differential Revision: D34622609 Pulled By: hx235 fbshipit-source-id: c06773ef3d029e6bf1724d3a72dffd37a8ec66d9 | 4 years ago | 
|  Yanqin Jin | 659a16d52b | Fix bug causing incorrect data returned by snapshot read (#9648) Summary: This bug affects use cases that meet the following conditions - (has only the default column family or disables WAL) and - has at least one event listener - atomic flush is NOT affected. If the above conditions meet, then RocksDB can release the db mutex before picking all the existing memtables to flush. In the meantime, a snapshot can be created and db's sequence number can still be incremented. The upcoming flush will ignore this snapshot. A later read using this snapshot can return incorrect result. To fix this issue, we call the listeners callbacks after picking the memtables so that we avoid creating snapshots during this interval. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9648 Test Plan: make check Reviewed By: ajkr Differential Revision: D34555456 Pulled By: riversand963 fbshipit-source-id: 1438981e9f069a5916686b1a0ad7627f734cf0ee | 4 years ago | 
|  Yuriy Chernyshov | 73fd589b1a | Do not rely on ADL when invoking std::max_element (#9608) Summary: Certain STLs use raw pointers and ADL does not work for them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9608 Reviewed By: ajkr Differential Revision: D34583012 Pulled By: riversand963 fbshipit-source-id: 7de6bbc8a080c3e7243ce0d758fe83f1663168aa | 4 years ago | 
|  Jay Zhuang | db8647969d | Unschedule manual compaction from thread-pool queue (#9625) Summary: PR https://github.com/facebook/rocksdb/issues/9557 introduced a race condition between manual compaction foreground thread and background compaction thread. This PR adds the ability to really unschedule manual compaction from thread-pool queue by differentiate tag name for manual compaction and other tasks. Also fix an issue that db `close()` didn't cancel the manual compaction thread. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9625 Test Plan: unittest not hang Reviewed By: ajkr Differential Revision: D34410811 Pulled By: jay-zhuang fbshipit-source-id: cb14065eabb8cf1345fa042b5652d4f788c0c40c | 4 years ago | 
|  sdong | 33742c2a9f | Remove BlockBasedTableOptions.hash_index_allow_collision (#9454) Summary: BlockBasedTableOptions.hash_index_allow_collision is already deprecated and has no effect. Delete it for preparing 7.0 release. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9454 Test Plan: Run all existing tests. Reviewed By: ajkr Differential Revision: D33805827 fbshipit-source-id: ed8a436d1d083173ec6aef2a762ba02e1eefdc9d | 4 years ago | 
|  Andrew Kryczka | 9983eecdfb | Dedicate cacheline for DB mutex (#9637) Summary: We found a case of cacheline bouncing due to writers locking/unlocking `mutex_` and readers accessing `block_cache_tracer_`. We discovered it only after the issue was fixed by https://github.com/facebook/rocksdb/issues/9462 shifting the `DBImpl` members such that `mutex_` and `block_cache_tracer_` were naturally placed in separate cachelines in our regression testing setup. This PR forces the cacheline alignment of `mutex_` so we don't accidentally reintroduce the problem. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9637 Reviewed By: riversand963 Differential Revision: D34502233 Pulled By: ajkr fbshipit-source-id: 46aa313b7fe83e80c3de254e332b6fb242434c07 | 4 years ago | 
|  Hui Xiao | 87a8b3c8af | Deflake DBErrorHandlingFSTest.MultiCFWALWriteError (#9496) Summary: **Context:** As part of https://github.com/facebook/rocksdb/pull/6949, file deletion is disabled for faulty database on the IOError of MANIFEST write/sync and [re-enabled again during `DBImpl::Resume()` if all recovery is completed]( | 4 years ago | 
|  Siddhartha Roychowdhury | 21345d2823 | Streaming Compression API for WAL compression. (#9619) Summary: Implement a streaming compression API (compress/uncompress) to use for WAL compression. The log_writer would use the compress class/API to compress a record before writing it out in chunks. The log_reader would use the uncompress class/API to uncompress the chunks and combine into a single record. Added unit test to verify the API for different sizes/compression types. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9619 Test Plan: make -j24 check Reviewed By: anand1976 Differential Revision: D34437346 Pulled By: sidroyc fbshipit-source-id: b180569ad2ddcf3106380f8758b556cc0ad18382 | 4 years ago | 
|  Yanqin Jin | 6f12599863 | Support WBWI for keys having timestamps (#9603) Summary: This PR supports inserting keys to a `WriteBatchWithIndex` for column families that enable user-defined timestamps and reading the keys back. **The index does not have timestamps.** Writing a key to WBWI is unchanged, because the underlying WriteBatch already supports it. When reading the keys back, we need to make sure to distinguish between keys with and without timestamps before comparison. When user calls `GetFromBatchAndDB()`, no timestamp is needed to query the batch, but a timestamp has to be provided to query the db. The assumption is that data in the batch must be newer than data from the db. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9603 Test Plan: make check Reviewed By: ltamasi Differential Revision: D34354849 Pulled By: riversand963 fbshipit-source-id: d25d1f84e2240ce543e521fa30595082fb8db9a0 | 4 years ago | 
|  Andrew Kryczka | 8ca433f912 | Fix test race conditions with OnFlushCompleted() (#9617) Summary: We often see flaky tests due to `DB::Flush()` or `DBImpl::TEST_WaitForFlushMemTable()` not waiting until event listeners complete. For example, https://github.com/facebook/rocksdb/issues/9084, https://github.com/facebook/rocksdb/issues/9400, https://github.com/facebook/rocksdb/issues/9528, plus two new ones this week: "EventListenerTest.OnSingleDBFlushTest" and "DBFlushTest.FireOnFlushCompletedAfterCommittedResult". I ran a `make check` with the below race condition-coercing patch and fixed issues it found besides old BlobDB. ``` diff --git a/db/db_impl/db_impl_compaction_flush.cc b/db/db_impl/db_impl_compaction_flush.cc index 0e1864788..aaba68c4a 100644 --- a/db/db_impl/db_impl_compaction_flush.cc +++ b/db/db_impl/db_impl_compaction_flush.cc @@ -861,6 +861,8 @@ void DBImpl::NotifyOnFlushCompleted( mutable_cf_options.level0_stop_writes_trigger); // release lock while notifying events mutex_.Unlock(); + bg_cv_.SignalAll(); + sleep(1); { for (auto& info : *flush_jobs_info) { info->triggered_writes_slowdown = triggered_writes_slowdown; ``` The reason I did not fix old BlobDB issues is because it appears to have a fundamental (non-test) issue. In particular, it uses an EventListener to keep track of the files. OnFlushCompleted() could be delayed until even after a compaction involving that flushed file completes, causing the compaction to unexpectedly delete an untracked file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9617 Test Plan: `make check` including the race condition coercing patch Reviewed By: hx235 Differential Revision: D34384022 Pulled By: ajkr fbshipit-source-id: 2652ded39b415277c5d6a628414345223930514e | 4 years ago | 
|  Andrew Kryczka | 3379d1466f | Fix DBTest2.BackupFileTemperature memory leak (#9610) Summary: Valgrind was failing with the below error because we forgot to destroy the `BackupEngine` object: ``` ==421173== Command: ./db_test2 --gtest_filter=DBTest2.BackupFileTemperature ==421173== Note: Google Test filter = DBTest2.BackupFileTemperature [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from DBTest2 [ RUN ] DBTest2.BackupFileTemperature --421173-- WARNING: unhandled amd64-linux syscall: 425 --421173-- You may be able to write your own handler. --421173-- Read the file README_MISSING_SYSCALL_OR_IOCTL. --421173-- Nevertheless we consider this a bug. Please report --421173-- it at http://valgrind.org/support/bug_reports.html. [ OK ] DBTest2.BackupFileTemperature (3366 ms) [----------] 1 test from DBTest2 (3371 ms total) [----------] Global test environment tear-down [==========] 1 test from 1 test case ran. (3413 ms total) [ PASSED ] 1 test. ==421173== ==421173== HEAP SUMMARY: ==421173== in use at exit: 13,042 bytes in 195 blocks ==421173== total heap usage: 26,022 allocs, 25,827 frees, 27,555,265 bytes allocated ==421173== ==421173== 8 bytes in 1 blocks are possibly lost in loss record 6 of 167 ==421173== at 0x4838DBF: operator new(unsigned long) (vg_replace_malloc.c:344) ==421173== by 0x8D4606: allocate (new_allocator.h:114) ==421173== by 0x8D4606: allocate (alloc_traits.h:445) ==421173== by 0x8D4606: _M_allocate (stl_vector.h:343) ==421173== by 0x8D4606: reserve (vector.tcc:78) ==421173== by 0x8D4606: rocksdb::BackupEngineImpl::Initialize() (backupable_db.cc:1174) ==421173== by 0x8D5473: Initialize (backupable_db.cc:918) ==421173== by 0x8D5473: rocksdb::BackupEngine::Open(rocksdb::BackupEngineOptions const&, rocksdb::Env*, rocksdb::BackupEngine**) (backupable_db.cc:937) ==421173== by 0x50AC8F: Open (backup_engine.h:585) ==421173== by 0x50AC8F: rocksdb::DBTest2_BackupFileTemperature_Test::TestBody() (db_test2.cc:6996) ... ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/9610 Test Plan: ``` $ make -j24 ROCKSDBTESTS_SUBSET=db_test2 valgrind_check_some ``` Reviewed By: akankshamahajan15 Differential Revision: D34371210 Pulled By: ajkr fbshipit-source-id: 68154fcb0c51b28222efa23fa4ee02df8d925a18 | 4 years ago | 
|  Jay Zhuang | d3a2f284d9 | Add Temperature info in `NewSequentialFile()` (#9499) Summary: Add Temperature hints information from RocksDB in API `NewSequentialFile()`. backup and checkpoint operations need to open the source files with `NewSequentialFile()`, which will have the temperature hints. Other operations are not covered. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9499 Test Plan: Added unittest Reviewed By: pdillinger Differential Revision: D34006115 Pulled By: jay-zhuang fbshipit-source-id: 568b34602b76520e53128672bd07e9d886786a2f | 4 years ago | 
|  Bo Wang | 67f071fade | Fixes #9565 (#9586) Summary: [Compaction::IsTrivialMove]( | 4 years ago | 
|  Jay Zhuang | f4b2500e12 | Add last level and non-last level read statistics (#9519) Summary: Add last level and non-last level read statistics: ``` LAST_LEVEL_READ_BYTES, LAST_LEVEL_READ_COUNT, NON_LAST_LEVEL_READ_BYTES, NON_LAST_LEVEL_READ_COUNT, ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/9519 Test Plan: added unittest Reviewed By: siying Differential Revision: D34062539 Pulled By: jay-zhuang fbshipit-source-id: 908644c3050878b4234febdc72e3e19d89af38cd | 4 years ago | 
|  mrambacher | 30b08878d8 | Make FilterPolicy Customizable (#9590) Summary: Make FilterPolicy into a Customizable class. Allow new FilterPolicy to be discovered through the ObjectRegistry Pull Request resolved: https://github.com/facebook/rocksdb/pull/9590 Reviewed By: pdillinger Differential Revision: D34327367 Pulled By: mrambacher fbshipit-source-id: 37e7edac90ec9457422b72f359ab8ef48829c190 | 4 years ago | 
|  Jay Zhuang | 2fbc672732 | Add temperature information to the event listener callbacks (#9591) Summary: RocksDB try to provide temperature information in the event listener callbacks. The information is not guaranteed, as some operation like backup won't have these information. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9591 Test Plan: Added unittest Reviewed By: siying, pdillinger Differential Revision: D34309339 Pulled By: jay-zhuang fbshipit-source-id: 4aca4f270f99fa49186d85d300da42594663d6d7 | 4 years ago | 
|  anand76 | 627deb7ceb | Fix some MultiGet batching stats (#9583) Summary: The NUM_INDEX_AND_FILTER_BLOCKS_READ_PER_LEVEL, NUM_DATA_BLOCKS_READ_PER_LEVEL, and NUM_SST_READ_PER_LEVEL stats were being recorded only when the last file in a level happened to have hits. They are supposed to be updated for every level. Also, there was some overcounting of GetContextStats. This PR fixes both the problems. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9583 Test Plan: Update the unit test in db_basic_test Reviewed By: akankshamahajan15 Differential Revision: D34308044 Pulled By: anand1976 fbshipit-source-id: b3b36020fda26ba91bc6e0e47d52d58f4d7f656e | 4 years ago | 
|  Siddhartha Roychowdhury | 39b0d92153 | Add record to set WAL compression type if enabled (#9556) Summary: When WAL compression is enabled, add a record (new record type) to store the compression type to indicate that all subsequent records are compressed. The log reader will store the compression type when this record is encountered and use the type to uncompress the subsequent records. Compress and uncompress to be implemented in subsequent diffs. Enabled WAL compression in some WAL tests to check for regressions. Some tests that rely on offsets have been disabled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9556 Reviewed By: anand1976 Differential Revision: D34308216 Pulled By: sidroyc fbshipit-source-id: 7f10595e46f3277f1ea2d309fbf95e2e935a8705 | 4 years ago | 
|  Jay Zhuang | f092f0fa5d | Add subcompaction event API (#9311) Summary: Add event callback for subcompaction and adds a sub_job_id to identify it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9311 Reviewed By: ajkr Differential Revision: D33892707 Pulled By: jay-zhuang fbshipit-source-id: 57b5e5e594d61b2112d480c18a79a36751f65a4e | 4 years ago | 
|  Andrew Kryczka | babe56ddba | Add rate limiter priority to ReadOptions (#9424) Summary: Users can set the priority for file reads associated with their operation by setting `ReadOptions::rate_limiter_priority` to something other than `Env::IO_TOTAL`. Rate limiting `VerifyChecksum()` and `VerifyFileChecksums()` is the motivation for this PR, so it also includes benchmarks and minor bug fixes to get that working. `RandomAccessFileReader::Read()` already had support for rate limiting compaction reads. I changed that rate limiting to be non-specific to compaction, but rather performed according to the passed in `Env::IOPriority`. Now the compaction read rate limiting is supported by setting `rate_limiter_priority = Env::IO_LOW` on its `ReadOptions`. There is no default value for the new `Env::IOPriority` parameter to `RandomAccessFileReader::Read()`. That means this PR goes through all callers (in some cases multiple layers up the call stack) to find a `ReadOptions` to provide the priority. There are TODOs for cases I believe it would be good to let user control the priority some day (e.g., file footer reads), and no TODO in cases I believe it doesn't matter (e.g., trace file reads). The API doc only lists the missing cases where a file read associated with a provided `ReadOptions` cannot be rate limited. For cases like file ingestion checksum calculation, there is no API to provide `ReadOptions` or `Env::IOPriority`, so I didn't count that as missing. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9424 Test Plan: - new unit tests - new benchmarks on ~50MB database with 1MB/s read rate limit and 100ms refill interval; verified with strace reads are chunked (at 0.1MB per chunk) and spaced roughly 100ms apart. - setup command: `./db_bench -benchmarks=fillrandom,compact -db=/tmp/testdb -target_file_size_base=1048576 -disable_auto_compactions=true -file_checksum=true` - benchmarks command: `strace -ttfe pread64 ./db_bench -benchmarks=verifychecksum,verifyfilechecksums -use_existing_db=true -db=/tmp/testdb -rate_limiter_bytes_per_sec=1048576 -rate_limit_bg_reads=1 -rate_limit_user_ops=true -file_checksum=true` - crash test using IO_USER priority on non-validation reads with https://github.com/facebook/rocksdb/issues/9567 reverted: `python3 tools/db_crashtest.py blackbox --max_key=1000000 --write_buffer_size=524288 --target_file_size_base=524288 --level_compaction_dynamic_level_bytes=true --duration=3600 --rate_limit_bg_reads=true --rate_limit_user_ops=true --rate_limiter_bytes_per_sec=10485760 --interval=10` Reviewed By: hx235 Differential Revision: D33747386 Pulled By: ajkr fbshipit-source-id: a2d985e97912fba8c54763798e04f006ccc56e0c | 4 years ago | 
|  Yanqin Jin | 1cda273dc3 | Fix a silent data loss for write-committed txn (#9571) Summary:
The following sequence of events can cause silent data loss for write-committed
transactions.
```
Time    thread 1                                       bg flush
 |   db->Put("a")
 |   txn = NewTxn()
 |   txn->Put("b", "v")
 |   txn->Prepare()       // writes only to 5.log
 |   db->SwitchMemtable() // memtable 1 has "a"
 |                        // close 5.log,
 |                        // creates 8.log
 |   trigger flush
 |                                                  pick memtable 1
 |                                                  unlock db mutex
 |                                                  write new sst
 |   txn->ctwb->Put("gtid", "1") // writes 8.log
 |   txn->Commit() // writes to 8.log
 |                 // writes to memtable 2
 |                                               compute min_log_number_to_keep_2pc, this
 |                                               will be 8 (incorrect).
 |
 |                                             Purge obsolete wals, including 5.log
 |
 V
```
At this point, writes of txn exists only in memtable. Close db without flush because db thinks the data in
memtable are backed by log. Then reopen, the writes are lost except key-value pair {"gtid"->"1"},
only the commit marker of txn is in 8.log
The reason lies in `PrecomputeMinLogNumberToKeep2PC()` which calls `FindMinPrepLogReferencedByMemTable()`.
In the above example, when bg flush thread tries to find obsolete wals, it uses the information
computed by `PrecomputeMinLogNumberToKeep2PC()`. The return value of `PrecomputeMinLogNumberToKeep2PC()`
depends on three components
- `PrecomputeMinLogNumberToKeepNon2PC()`. This represents the WAL that has unflushed data. As the name of this method suggests, it does not account for 2PC. Although the keys reside in the prepare section of a previous WAL, the column family references the current WAL when they are actually inserted into the memtable during txn commit.
- `prep_tracker->FindMinLogContainingOutstandingPrep()`. This represents the WAL with a prepare section but the txn hasn't committed.
- `FindMinPrepLogReferencedByMemTable()`. This represents the WAL on which some memtables (mutable and immutable) depend for their unflushed data.
The bug lies in `FindMinPrepLogReferencedByMemTable()`. Originally, this function skips checking the column families
that are being flushed, but the unit test added in this PR shows that they should not be. In this unit test, there is
only the default column family, and one of its memtables has unflushed data backed by a prepare section in 5.log.
We should return this information via `FindMinPrepLogReferencedByMemTable()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9571
Test Plan:
```
./transaction_test --gtest_filter=*/TransactionTest.SwitchMemtableDuringPrepareAndCommit_WC/*
make check
```
Reviewed By: siying
Differential Revision: D34235236
Pulled By: riversand963
fbshipit-source-id: 120eb21a666728a38dda77b96276c6af72b008b1 | 4 years ago | 
|  Peter Dillinger | 1e403a0c6c | Fix assertion failure in FastLocalBloomBitsBuilder (#9585) Summary:
As in
```
db_stress: table/block_based/filter_policy.cc:316: rocksdb::{anonymous}::FastLocalBloomBitsBuilder::FastLocalBloomBitsBuilder(int, std::atomic<long int>*, std::shared_ptr<rocksdb::CacheReservationManager>, bool): Assertion `millibits_per_key >= 1000' failed.
```
This assertion failure was actually happening with our RibbonFilterPolicy
which falls back to Bloom for some cases, often for flush, but was
missing new special logic to skip generating filter for 0 bits per key
case. Fixed by adding the logic in other builtin FilterPolicy
implementations.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9585
Test Plan:
Updated db_bloom_filter_test to do more integration testing
of the RibbonFilterPolicy ("auto Ribbon") class, incl regression test
this with SkipFilterOnEssentiallyZeroBpk
Reviewed By: ajkr
Differential Revision: D34295101
Pulled By: pdillinger
fbshipit-source-id: 3488eb207fc1d67bbbd1301313714aa1b6406e6e | 4 years ago | 
|  Jay Zhuang | 31031c0210 | Remove deprecated RemoteCompaction API (#9570) Summary: Remove deprecated remote compaction APIs `CompactionService::Start()` and `CompactionService::WaitForComplete()`. Please use `CompactionService::StartV2()`, `CompactionService::WaitForCompleteV2()` instead, which provides the same information plus extra data like priority, db_id, etc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9570 Test Plan: CI Reviewed By: riversand963 Differential Revision: D34255969 Pulled By: jay-zhuang fbshipit-source-id: c6376eccdd1123f1c42ab53771b5f65f8160c325 | 4 years ago | 
|  Peter Dillinger | 8c681087c7 | Refactor FilterPolicies toward Customizable (#9567) Summary: Some changes to make it easier to make FilterPolicy customizable. Especially, create distinct classes for the different testing-only and user-facing built-in FilterPolicy modes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9567 Test Plan: tests updated, with no intended difference in functionality tested. No difference in test performance seen as a result of moving to string-based filter type configuration. Reviewed By: mrambacher Differential Revision: D34234694 Pulled By: pdillinger fbshipit-source-id: 8a94931a9e04c3bcca863a4f524cfd064aaf0122 | 4 years ago | 
|  Jay Zhuang | a0c569ee1d | Cancel manual compaction in thread-pool queue (#9557) Summary: Fix `DisableManualCompaction()` has to wait scheduled manual compaction to start the execution to cancel the job. When a manual compaction in thread-pool queue is cancel, set the job is_canceled to true and clean the resource. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9557 Test Plan: added unittest that will hang without the change Reviewed By: ajkr Differential Revision: D34214910 Pulled By: jay-zhuang fbshipit-source-id: 89dbaee78ddf26eb13ce862c2b15f4a098b36a78 | 4 years ago | 
|  Hui Xiao | 443d8ef094 | Fix PinSelf() read-after-free in DB::GetMergeOperands() (#9507) Summary:
**Context:**
Running the new test `DBMergeOperandTest.MergeOperandReadAfterFreeBug` prior to this fix surfaces the read-after-free bug of PinSef() as below:
```
READ of size 8 at 0x60400002529d thread T0
    https://github.com/facebook/rocksdb/issues/5 0x7f199a in rocksdb::PinnableSlice::PinSelf(rocksdb::Slice const&) include/rocksdb/slice.h:171
    https://github.com/facebook/rocksdb/issues/6 0x7f199a in rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::DBImpl::GetImplOptions&) db/db_impl/db_impl.cc:1919
    https://github.com/facebook/rocksdb/issues/7 0x540d63 in rocksdb::DBImpl::GetMergeOperands(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::PinnableSlice*, rocksdb::GetMergeOperandsOptions*, int*) db/db_impl/db_impl.h:203
freed by thread T0 here:
    https://github.com/facebook/rocksdb/issues/3 0x1191399 in rocksdb::cache_entry_roles_detail::RegisteredDeleter<rocksdb::Block, (rocksdb::CacheEntryRole)0>::Delete(rocksdb::Slice const&, void*) cache/cache_entry_roles.h:99
    https://github.com/facebook/rocksdb/issues/4 0x719348 in rocksdb::LRUHandle::Free() cache/lru_cache.h:205
    https://github.com/facebook/rocksdb/issues/5 0x71047f in rocksdb::LRUCacheShard::Release(rocksdb::Cache::Handle*, bool) cache/lru_cache.cc:547
    https://github.com/facebook/rocksdb/issues/6 0xa78f0a in rocksdb::Cleanable::DoCleanup() include/rocksdb/cleanable.h:60
    https://github.com/facebook/rocksdb/issues/7 0xa78f0a in rocksdb::Cleanable::Reset() include/rocksdb/cleanable.h:38
    https://github.com/facebook/rocksdb/issues/8 0xa78f0a in rocksdb::PinnedIteratorsManager::ReleasePinnedData() db/pinned_iterators_manager.h:71
    https://github.com/facebook/rocksdb/issues/9 0xd0c21b in rocksdb::PinnedIteratorsManager::~PinnedIteratorsManager() db/pinned_iterators_manager.h:24
    https://github.com/facebook/rocksdb/issues/10 0xd0c21b in rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, rocksdb::PinnableSlice*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >*, rocksdb::Status*, rocksdb::MergeContext*, unsigned long*, bool*, bool*, unsigned long*, rocksdb::ReadCallback*, bool*, bool) db/pinned_iterators_manager.h:22
    https://github.com/facebook/rocksdb/issues/11 0x7f0fdf in rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::DBImpl::GetImplOptions&) db/db_impl/db_impl.cc:1886
    https://github.com/facebook/rocksdb/issues/12 0x540d63 in rocksdb::DBImpl::GetMergeOperands(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::PinnableSlice*, rocksdb::GetMergeOperandsOptions*, int*) db/db_impl/db_impl.h:203
previously allocated by thread T0 here:
    https://github.com/facebook/rocksdb/issues/1 0x1239896 in rocksdb::AllocateBlock(unsigned long, **rocksdb::MemoryAllocator*)** memory/memory_allocator.h:35
    https://github.com/facebook/rocksdb/issues/2 0x1239896 in rocksdb::BlockFetcher::CopyBufferToHeapBuf() table/block_fetcher.cc:171
    https://github.com/facebook/rocksdb/issues/3 0x1239896 in rocksdb::BlockFetcher::GetBlockContents() table/block_fetcher.cc:206
    https://github.com/facebook/rocksdb/issues/4 0x122eae5 in rocksdb::BlockFetcher::ReadBlockContents() table/block_fetcher.cc:325
    https://github.com/facebook/rocksdb/issues/5 0x11b1f45 in rocksdb::Status rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::Block>(rocksdb::FilePrefetchBuffer*, rocksdb::ReadOptions const&, rocksdb::BlockHandle const&, rocksdb::UncompressionDict const&, bool, rocksdb::CachableEntry<rocksdb::Block>*, rocksdb::BlockType, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocksdb::BlockContents*) const table/block_based/block_based_table_reader.cc:1503
```
Here is the analysis:
- We have [PinnedIteratorsManager](https://github.com/facebook/rocksdb/blob/6.28.fb/db/version_set.cc#L1980) with `Cleanable` capability in our `Version::Get()` path. It's responsible for managing the life-time of pinned iterator and invoking registered cleanup functions during its own destruction.
  - For example in case above, the merge operands's clean-up gets associated with this manger in [GetContext::push_operand](https://github.com/facebook/rocksdb/blob/6.28.fb/table/get_context.cc#L405). During PinnedIteratorsManager's [destruction](https://github.com/facebook/rocksdb/blob/6.28.fb/db/pinned_iterators_manager.h#L67), the release function associated with those merge operand data is invoked.
**And that's what we see in "freed by thread T955 here" in ASAN.**
- Bug 🐛: `PinnedIteratorsManager` is local to `Version::Get()`  while the data of merge operands need to outlive `Version::Get` and stay till they get [PinSelf()](https://github.com/facebook/rocksdb/blob/6.28.fb/db/db_impl/db_impl.cc#L1905), **which is the read-after-free in ASAN.**
  - This bug is likely to be an overlook of `PinnedIteratorsManager` when developing the API `DB::GetMergeOperands` cuz the current logic works fine with the existing case of getting the *merged value* where the operands do not need to live that long.
- This bug was not surfaced much (even in its unit test) due to the release function associated with the merge operands (which are actually blocks put in cache as you can see in `BlockBasedTable::MaybeReadBlockAndLoadToCache` **in "previously allocated by" in ASAN report**) is a cache entry deleter.
The deleter will call `Cache::Release()` which, for LRU cache, won't immediately deallocate the block based on LRU policy [unless the cache is full or being instructed to force erase](https://github.com/facebook/rocksdb/blob/6.28.fb/cache/lru_cache.cc#L521-L531)
  - `DBMergeOperandTest.MergeOperandReadAfterFreeBug` makes the cache extremely small to force cache full.
**Summary:**
- Fix the bug by align `PinnedIteratorsManager`'s lifetime with the merge operands
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9507
Test Plan:
- New test `DBMergeOperandTest.MergeOperandReadAfterFreeBug`
- db bench on read path
  - Setup (LSM tree with several levels, cache the whole db to avoid read IO, warm cache with readseq to avoid read IO): `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks="fillrandom,readseq  -num=1000000 -cache_size=100000000  -write_buffer_size=10000 -statistics=1 -max_bytes_for_level_base=10000 -level0_file_num_compaction_trigger=1``TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks="readrandom" -num=1000000 -cache_size=100000000 `
  - Actual command run (run 20-run for 20 times and then average the 20-run's average micros/op)
     - `for j in {1..20}; do (for i in {1..20}; do rm -rf /dev/shm/rocksdb/ && TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks="fillrandom,readseq,readrandom" -num=1000000 -cache_size=100000000  -write_buffer_size=10000 -statistics=1 -max_bytes_for_level_base=10000 -level0_file_num_compaction_trigger=1 | egrep 'readrandom'; done > rr_output_pre.txt && (awk '{sum+=$3; sum_sqrt+=$3^2}END{print sum/20, sqrt(sum_sqrt/20-(sum/20)^2)}' rr_output_pre.txt) >> rr_output_pre_2.txt); done`
  - **Result: Pre-change: 3.79193 micros/op;   Post-change: 3.79528 micros/op (+0.09%)**
(pre-change)sorted avg micros/op of each 20-run | std of micros/op of each 20-run | (post-change) sorted avg micros/op of each 20-run | std of micros/op of each 20-run
-- | -- | -- | --
3.58355 | 0.265209 | 3.48715 | 0.382076
3.58845 | 0.519927 | 3.5832 | 0.382726
3.66415 | 0.452097 | 3.677 | 0.563831
3.68495 | 0.430897 | 3.68405 | 0.495355
3.70295 | 0.482893 | 3.68465 | 0.431438
3.719 | 0.463806 | 3.71945 | 0.457157
3.7393 | 0.453423 | 3.72795 | 0.538604
3.7806 | 0.527613 | 3.75075 | 0.444509
3.7817 | 0.426704 | 3.7683 | 0.468065
3.809 | 0.381033 | 3.8086 | 0.557378
3.80985 | 0.466011 | 3.81805 | 0.524833
3.8165 | 0.500351 | 3.83405 | 0.529339
3.8479 | 0.430326 | 3.86285 | 0.44831
3.85125 | 0.434108 | 3.8717 | 0.544098
3.8556 | 0.524602 | 3.895 | 0.411679
3.8656 | 0.476383 | 3.90965 | 0.566636
3.8911 | 0.488477 | 3.92735 | 0.608038
3.898 | 0.493978 | 3.9439 | 0.524511
3.97235 | 0.515008 | 3.9623 | 0.477416
3.9768 | 0.519993 | 3.98965 | 0.521481
- CI
Reviewed By: ajkr
Differential Revision: D34030519
Pulled By: hx235
fbshipit-source-id: a99ac585c11704c5ed93af033cb29ba0a7b16ae8 | 4 years ago | 
|  Peter Dillinger | e24734f843 | Use -Wno-invalid-offsetof instead of dangerous offset_of hack (#9563) Summary: After https://github.com/facebook/rocksdb/issues/9515 added a unique_ptr to Status, we see some warnings-as-error in some internal builds like this: ``` stderr: rocksdb/src/db/compaction/compaction_job.cc:2839:7: error: offset of on non-standard-layout type 'struct CompactionServiceResult' [-Werror,-Winvalid-offsetof] {offsetof(struct CompactionServiceResult, status), ^ ~~~~~~ ``` I see three potential solutions to resolving this: * Expand our use of an idiom that works around the warning (see offset_of functions removed in this change, inspired by https://gist.github.com/graphitemaster/494f21190bb2c63c5516) However, this construction is invoking undefined behavior that assumes consistent layout with no compiler-introduced indirection. A compiler incompatible with our assumptions will likely compile the code and exhibit undefined behavior. * Migrate to something in place of offset, like a function mapping CompactionServiceResult* to Status* (for the `status` field). This might be required in the long term. * **Selected:** Use our new C++17 dependency to use offsetof in a well-defined way when the compiler allows it. From a comment on https://gist.github.com/graphitemaster/494f21190bb2c63c5516: > A final note: in C++17, offsetof is conditionally supported, which > means that you can use it on any type (not just standard layout > types) and the compiler will error if it can't compile it correctly. > That appears to be the best option if you can live with C++17 and > don't need constexpr support. The C++17 semantics are confirmed on https://en.cppreference.com/w/cpp/types/offsetof, so we can suppress the warning as long as we accept that we might run into a compiler that rejects the code, and at that point we will find a solution, such as the more intrusive "migrate" solution above. Although this is currently only showing in our buck build, it will surely show up also with make and cmake, so I have updated those configurations as well. Also in the buck build, -Wno-expansion-to-defined does not appear to be needed anymore (both current compiler configurations) so I removed it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9563 Test Plan: Tried out buck builds with both current compiler configurations Reviewed By: riversand963 Differential Revision: D34220931 Pulled By: pdillinger fbshipit-source-id: d39436008259bd1eaaa87c77be69fb2a5b559e1f | 4 years ago | 
|  Peter Dillinger | 479eb1aad6 | Hide deprecated, inefficient block-based filter from public API (#9535) Summary: This change removes the ability to configure the deprecated, inefficient block-based filter in the public API. Options that would have enabled it now use "full" (and optionally partitioned) filters. Existing block-based filters can still be read and used, and a "back door" way to build them still exists, for testing and in case of trouble. About the only way this removal would cause an issue for users is if temporary memory for filter construction greatly increases. In HISTORY.md we suggest a few possible mitigations: partitioned filters, smaller SST files, or setting reserve_table_builder_memory=true. Or users who have customized a FilterPolicy using the CreateFilter/KeyMayMatch mechanism removed in https://github.com/facebook/rocksdb/issues/9501 will have to upgrade their code. (It's long past time for people to move to the new builder/reader customization interface.) This change also introduces some internal-use-only configuration strings for testing specific filter implementations while bypassing some compatibility / intelligence logic. This is intended to hint at a path toward making FilterPolicy Customizable, but it also gives us a "back door" way to configure block-based filter. Aside: updated db_bench so that -readonly implies -use_existing_db Pull Request resolved: https://github.com/facebook/rocksdb/pull/9535 Test Plan: Unit tests updated. Specifically, * BlockBasedTableTest.BlockReadCountTest is tweaked to validate the back door configuration interface and ignoring of `use_block_based_builder`. * BlockBasedTableTest.TracingGetTest is migrated from testing block-based filter access pattern to full filter access patter, by re-ordering some things. * Options test (pretty self-explanatory) Performance test - create with `./db_bench -db=/dev/shm/rocksdb1 -bloom_bits=10 -cache_index_and_filter_blocks=1 -benchmarks=fillrandom -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0` with and without `-use_block_based_filter`, which creates a DB with 21 SST files in L0. Read with `./db_bench -db=/dev/shm/rocksdb1 -readonly -bloom_bits=10 -cache_index_and_filter_blocks=1 -benchmarks=readrandom -num=10000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -duration=30` Without -use_block_based_filter: readrandom 464 ops/sec, 689280 KB DB With -use_block_based_filter: readrandom 169 ops/sec, 690996 KB DB No consistent difference with fillrandom Reviewed By: jay-zhuang Differential Revision: D34153871 Pulled By: pdillinger fbshipit-source-id: 31f4a933c542f8f09aca47fa64aec67832a69738 | 4 years ago | 
|  Akanksha Mahajan | 5c53b9008f | Fix failure in c_test (#9547) Summary: When tests are run with TMPD, c_test may fail because TMPD is not created by the test. It results in IO error: No such file or directory: While mkdir if missing: /tmp/rocksdb_test_tmp/rocksdb_c_test-0: No such file or directory Pull Request resolved: https://github.com/facebook/rocksdb/pull/9547 Test Plan: make -j32 c_test; TEST_TMPDIR=/tmp/rocksdb_test ./c_test Reviewed By: riversand963 Differential Revision: D34173298 Pulled By: akankshamahajan15 fbshipit-source-id: 5b5a01f5b842c2487b05b0708c8e9532241db7f8 | 4 years ago | 
|  Ezgi Çiçek | 95d9cb8357 | Avoid unnecessary copy of sample_slice map (#9551) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9551 Reviewed By: riversand963 Differential Revision: D34169574 Pulled By: ezgicicek fbshipit-source-id: 2e88db59b65bda269917a9b0bed17181a4afd281 | 4 years ago | 
|  Levi Tamasi | a1203edca4 | Rework VersionStorageInfo::ComputeFilesMarkedForForcedBlobGC a bit (#9548) Summary: We had a bug in `VersionStorageInfo::ComputeFilesMarkedForForcedBlobGC` related to the edge case where all blob files are part of the "oldest batch", i.e. where only the very oldest file has any linked SSTs. (See https://github.com/facebook/rocksdb/issues/9542) This PR tries to make the logic in this method clearer and also adds a unit test for the problematic case. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9548 Test Plan: `make check` Reviewed By: akankshamahajan15 Differential Revision: D34158959 Pulled By: ltamasi fbshipit-source-id: fbab6d749c569728382aa04f7b7c60c92cca7650 | 4 years ago | 
|  Levi Tamasi | 073ac54739 | Log blob file space amp and expose it via the rocksdb.blob-stats DB property (#9538) Summary: Extend the periodic statistics in the info log with the total amount of garbage in blob files and the space amplification pertaining to blob files, where the latter is defined as `total_blob_file_size / (total_blob_file_size - total_blob_garbage_size)`. Also expose the space amp via the `rocksdb.blob-stats` DB property. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9538 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D34126855 Pulled By: ltamasi fbshipit-source-id: 3153e7a0fe0eca440322db273f4deaabaccc51b2 | 4 years ago | 
|  Levi Tamasi | b2423f8dde | Fix off-by-one bug in VersionStorageInfo::ComputeFilesMarkedForForcedBlobGC (#9542) Summary: Fixes a bug introduced in https://github.com/facebook/rocksdb/issues/9526 where we index one position past the end of a `vector`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9542 Test Plan: `make asan_check` Will add a unit test in a separate PR. Reviewed By: akankshamahajan15 Differential Revision: D34145825 Pulled By: ltamasi fbshipit-source-id: 4e87c948407dee489d669a3e41f59e2fcc1228d8 | 4 years ago | 
|  Hui Xiao | c5cd31c12b | Fix TSAN data race in EventListenerTest.MultiCF (#9528) Summary:
**Context:**
`EventListenerTest.MultiCF` occasionally failed on TSAN data race as below:
```
WARNING: ThreadSanitizer: data race (pid=2047633)
  Read of size 8 at 0x7b6000001440 by main thread:
    #0 std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> >::size() const /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/stl_vector.h:916:40 (listener_test+0x52337c)
    https://github.com/facebook/rocksdb/issues/1 rocksdb::EventListenerTest_MultiCF_Test::TestBody() /home/circleci/project/db/listener_test.cc:384:7 (listener_test+0x52337c)
  Previous write of size 8 at 0x7b6000001440 by thread T2:
    #0 void std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> >::_M_realloc_insert<rocksdb::DB* const&>(__gnu_cxx::__normal_iterator<rocksdb::DB**, std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> > >, rocksdb::DB* const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/vector.tcc:503:31 (listener_test+0x550654)
    https://github.com/facebook/rocksdb/issues/1 std::vector<rocksdb::DB*, std::allocator<rocksdb::DB*> >::push_back(rocksdb::DB* const&) /usr/bin/../lib/gcc/x86_64-linux-gnu/9/../../../../include/c++/9/bits/stl_vector.h:1195:4 (listener_test+0x550654)
    https://github.com/facebook/rocksdb/issues/2 rocksdb::TestFlushListener::OnFlushCompleted(rocksdb::DB*, rocksdb::FlushJobInfo const&) /home/circleci/project/db/listener_test.cc:255:18 (listener_test+0x550654)
```
After investigation, it is due to the following:
(1) `ASSERT_OK(Flush(i));` before the read `std::vector::size()` is supposed to be [blocked on `DB::Impl::bg_cv_` for memtable flush to finish]( | 4 years ago | 
|  Levi Tamasi | 320d9a8e8a | Use a sorted vector instead of a map to store blob file metadata (#9526) Summary: The patch replaces `std::map` with a sorted `std::vector` for `VersionStorageInfo::blob_files_` and preallocates the space for the `vector` before saving the `BlobFileMetaData` into the new `VersionStorageInfo` in `VersionBuilder::Rep::SaveBlobFilesTo`. These changes reduce the time the DB mutex is held while saving new `Version`s, and using a sorted `vector` also makes lookups faster thanks to better memory locality. In addition, the patch introduces helper methods `VersionStorageInfo::GetBlobFileMetaData` and `VersionStorageInfo::GetBlobFileMetaDataLB` that can be used by clients to perform lookups in the `vector`, and does some general cleanup in the parts of code where blob file metadata are used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9526 Test Plan: Ran `make check` and the crash test script for a while. Performance was tested using a load-optimized benchmark (`fillseq` with vector memtable, no WAL) and small file sizes so that a significant number of files are produced: ``` numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/ltamasi-dbbench --wal_dir=/data/ltamasi-dbbench --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --enable_blob_files=1 --blob_file_size=16777216 --min_blob_size=0 --blob_compression_type=lz4 --enable_blob_garbage_collection=1 --seed=<some value> ``` Final statistics before the patch: ``` Cumulative writes: 0 writes, 700M keys, 0 commit groups, 0.0 writes per commit group, ingest: 284.62 GB, 121.27 MB/s Interval writes: 0 writes, 334K keys, 0 commit groups, 0.0 writes per commit group, ingest: 139.28 MB, 72.46 MB/s ``` With the patch: ``` Cumulative writes: 0 writes, 760M keys, 0 commit groups, 0.0 writes per commit group, ingest: 308.66 GB, 131.52 MB/s Interval writes: 0 writes, 445K keys, 0 commit groups, 0.0 writes per commit group, ingest: 185.35 MB, 93.15 MB/s ``` Total time to complete the benchmark is 2611 seconds with the patch, down from 2986 secs. Reviewed By: riversand963 Differential Revision: D34082728 Pulled By: ltamasi fbshipit-source-id: fc598abf676dce436734d06bb9d2d99a26a004fc | 4 years ago | 
|  Yanqin Jin | 685044dff2 | Remove timestamp from key in expected state (#9525) Summary: The keys as part of write batch read from trace file can contain trailing timestamps. This PR removes them before calling `ExpectedState`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9525 Test Plan: make check make crash_test_with_ts Reviewed By: ajkr Differential Revision: D34082358 Pulled By: riversand963 fbshipit-source-id: 78c925659e2a19e4a8278fb4a8ddf5070e265c04 | 4 years ago | 
|  Akanksha Mahajan | 9745c68eb1 | Remove deprecated option new_table_reader_for_compaction_inputs (#9443) Summary: In RocksDB option new_table_reader_for_compaction_inputs has not effect on Compaction or on the behavior of RocksDB library. Therefore, we are removing it in the upcoming 7.0 release. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9443 Test Plan: CircleCI Reviewed By: ajkr Differential Revision: D33788508 Pulled By: akankshamahajan15 fbshipit-source-id: 324ca6f12bfd019e9bd5e1b0cdac39be5c3cec7d | 4 years ago | 
|  Peter Dillinger | 68a9c186d0 | FilterPolicy API changes for 7.0 (#9501) Summary: * Inefficient block-based filter is no longer customizable in the public API, though (for now) can still be enabled. * Removed deprecated FilterPolicy::CreateFilter() and FilterPolicy::KeyMayMatch() * Removed `rocksdb_filterpolicy_create()` from C API * Change meaning of nullptr return from GetBuilderWithContext() from "use block-based filter" to "generate no filter in this case." This is a cleaner solution to the proposal in https://github.com/facebook/rocksdb/issues/8250. * Also, when user specifies bits_per_key < 0.5, we now round this down to "no filter" because we expect a filter with >= 80% FP rate is unlikely to be worth the CPU cost of accessing it (esp with cache_index_and_filter_blocks=1 or partition_filters=1). * bits_per_key >= 0.5 and < 1.0 is still rounded up to 1.0 (for 62% FP rate) * This also gives us some support for configuring filters from OPTIONS file as currently saved: `filter_policy=rocksdb.BuiltinBloomFilter`. Opening from such an options file will enable reading filters (an improvement) but not writing new ones. (See Customizable follow-up below.) * Also removed deprecated functions * FilterBitsBuilder::CalculateNumEntry() * FilterPolicy::GetFilterBitsBuilder() * NewExperimentalRibbonFilterPolicy() * Remove default implementations of * FilterBitsBuilder::EstimateEntriesAdded() * FilterBitsBuilder::ApproximateNumEntries() * FilterPolicy::GetBuilderWithContext() * Remove support for "filter_policy=experimental_ribbon" configuration string. * Allow "filter_policy=bloomfilter:n" without bool to discourage use of block-based filter. Some pieces for https://github.com/facebook/rocksdb/issues/9389 Likely follow-up (later PRs): * Refactoring toward FilterPolicy Customizable, so that we can generate filters with same configuration as before when configuring from options file. * Remove support for user enabling block-based filter (ignore `bool use_block_based_builder`) * Some months after this change, we could even remove read support for block-based filter, because it is not critical to DB data preservation. * Make FilterBitsBuilder::FinishV2 to avoid `using FilterBitsBuilder::Finish` mess and add support for specifying a MemoryAllocator (for cache warming) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9501 Test Plan: A number of obsolete tests deleted and new tests or test cases added or updated. Reviewed By: hx235 Differential Revision: D34008011 Pulled By: pdillinger fbshipit-source-id: a39a720457c354e00d5b59166b686f7f59e392aa | 4 years ago | 
|  satyajanga | 036bbab6f7 | Use the comparator from the sst file table properties in sst_dump_tool (#9491) Summary: We introduced a new Comparator for timestamp in user keys. In the sst_dump_tool by default we use BytewiseComparator to read sst files. This change allows us to read comparator_name from table properties in meta data block and use it to read. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9491 Test Plan: added unittests for new functionality. make check   Reviewed By: riversand963 Differential Revision: D33993614 Pulled By: satyajanga fbshipit-source-id: 4b5cf938e6d2cb3931d763bef5baccc900b8c536 | 4 years ago | 
|  Peter Dillinger | 5cb137a860 | Work around some new clang-analyze failures (#9515) Summary: ... seen only in internal clang-analyze runs after https://github.com/facebook/rocksdb/issues/9481 * Mostly, this works around falsely reported leaks by using std::unique_ptr in some places where clang-analyze was getting confused. (I didn't see any changes in C++17 that could make our Status implementation leak memory.) * Also fixed SetBGError returning address of a stack variable. * Also fixed another false null deref report by adding an assert. Also, use SKIP_LINK=1 to speed up `make analyze` Pull Request resolved: https://github.com/facebook/rocksdb/pull/9515 Test Plan: Was able to reproduce the reported errors locally and verify they're fixed (except SetBGError). Otherwise, existing tests Reviewed By: hx235 Differential Revision: D34054630 Pulled By: pdillinger fbshipit-source-id: 38600ef3da75ddca307dff96b7a1a523c2885c2e | 4 years ago | 
|  Levi Tamasi | 0cc0543893 | Mitigate the overhead of building the hash of file locations (#9504) Summary: The patch builds on the refactoring done in https://github.com/facebook/rocksdb/issues/9494 and improves the performance of building the hash of file locations in `VersionStorageInfo` in two ways. First, the hash building is moved from `AddFile` (which is called under the DB mutex) to a separate post-processing step done as part of `PrepareForVersionAppend` (during which the mutex is *not* held). Second, the space necessary for the hash is preallocated to prevent costly reallocation/rehashing operations. These changes mitigate the overhead of the file location hash, which can be significant with certain workloads where the baseline CPU usage is low (see https://github.com/facebook/rocksdb/issues/9351, which is a workload where keys are sorted, WAL is turned off, the vector memtable implementation is used, and there are lots of small SST files). Fixes https://github.com/facebook/rocksdb/issues/9351 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9504 Test Plan: `make check` ``` numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/ltamasi-dbbench --wal_dir=/data/ltamasi-dbbench --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --disable_wal=1 --seed=<some_seed> ``` Final statistics before this patch: ``` Cumulative writes: 0 writes, 697M keys, 0 commit groups, 0.0 writes per commit group, ingest: 283.25 GB, 241.08 MB/s Interval writes: 0 writes, 1264K keys, 0 commit groups, 0.0 writes per commit group, ingest: 525.69 MB, 176.67 MB/s ``` With the patch: ``` Cumulative writes: 0 writes, 759M keys, 0 commit groups, 0.0 writes per commit group, ingest: 308.57 GB, 262.63 MB/s Interval writes: 0 writes, 1555K keys, 0 commit groups, 0.0 writes per commit group, ingest: 646.61 MB, 215.11 MB/s ``` Reviewed By: riversand963 Differential Revision: D34014734 Pulled By: ltamasi fbshipit-source-id: acb2703677451d5ccaa7e9d950844b33d240695b | 4 years ago | 
|  Peter Dillinger | afc280fdfd | Enhance new cache key testing & comments (#9329) Summary: Follow-up to https://github.com/facebook/rocksdb/issues/9126 Added new unit tests to validate some of the claims of guaranteed uniqueness within certain large bounds. Also cleaned up the cache_bench -stress-cache-key tool with better comments and description. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9329 Test Plan: no changes to production code Reviewed By: mrambacher Differential Revision: D33269328 Pulled By: pdillinger fbshipit-source-id: 3a2b684a6b2b15f79dc872e563e3d16563be26de | 4 years ago | 
|  Levi Tamasi | 42e0751b3a | Clean up VersionStorageInfo a bit (#9494) Summary: The patch does some cleanup in and around `VersionStorageInfo`: * Renames the method `PrepareApply` to `PrepareAppend` in `Version` to make it clear that it is to be called before appending the `Version` to `VersionSet` (via `AppendVersion`), not before applying any `VersionEdit`s. * Introduces a helper method `VersionStorageInfo::PrepareForVersionAppend` (called by `Version::PrepareAppend`) that encapsulates the population of the various derived data structures in `VersionStorageInfo`, and turns the methods computing the derived structures (`UpdateNumNonEmptyLevels`, `CalculateBaseBytes` etc.) into private helpers. * Changes `Version::PrepareAppend` so it only calls `UpdateAccumulatedStats` if the `update_stats` flag is set. (Earlier, this was checked by the callee.) Related to this, it also moves the call to `ComputeCompensatedSizes` to `VersionStorageInfo::PrepareForVersionAppend`. * Updates and cleans up `version_builder_test`, `version_set_test`, and `compaction_picker_test` so `PrepareForVersionAppend` is called anytime a new `VersionStorageInfo` is set up or saved. This cleanup also involves splitting `VersionStorageInfoTest.MaxBytesForLevelDynamic` into multiple smaller test cases. * Fixes up a bunch of comments that were outdated or just plain incorrect. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9494 Test Plan: Ran `make check` and the crash test script for a while. Reviewed By: riversand963 Differential Revision: D33971666 Pulled By: ltamasi fbshipit-source-id: fda52faac7783041126e4f8dec0fe01bdcadf65a | 4 years ago | 
|  Baptiste Lemaire | bec9ab4316 | Remove deprecated option DBOptions::max_mem_compaction_level (#9446) Summary: In RocksDB, this option was already marked as "NOT SUPPORTED" for a long time, and setting this option does not have any effect on the behavior of RocksDB library. Therefore, we are removing it in the preparations of the upcoming 7.0 release. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9446 Reviewed By: ajkr Differential Revision: D33793048 Pulled By: bjlemaire fbshipit-source-id: 73316efdb194e90225005246673dae99e65577ae | 4 years ago | 
|  Yanqin Jin | 629e3e1d77 | Fix spelling in public API (#9490) Summary: I feel it would be nice if we can fix this spelling error. In `SizeApproximationOptions`, the `include_memtabtles` should be `include_memtables`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9490 Test Plan: make check Reviewed By: hx235 Differential Revision: D33949862 Pulled By: riversand963 fbshipit-source-id: b2be67501b65d4aabb6b8df1bf25eb8d54cc1466 | 4 years ago |