Always allow L0->L1 trivial move during manual compaction (#11375)

Summary:
During manual compaction (CompactRange()), L0->L1 trivial move is disabled when only L0 overlaps with the compacting key range (behavior introduced in https://github.com/facebook/rocksdb/issues/7368 to enforce the kForce* contract). With a large number of L0 files, this can cause high memory usage due to compaction readahead. This PR allows L0->L1 trivial move in this case, and does an L1->L1 intra-level compaction afterwards when needed (`bottommost_level_compaction` is kForce*). In brief, consider a DB with only L0 files where the user calls CompactRange(kForce, nullptr, nullptr):
- before this PR, RocksDB does an L0->L1 compaction with trivial move disallowed;
- after this PR, RocksDB does an L0->L1 compaction with trivial move allowed, followed by an L1->L1 compaction.
Users can use kForceOptimized to avoid the overhead of this extra L1->L1 compaction when the L0 files overlap and cannot be trivially moved.
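
The before/after behavior described above can be sketched as a small standalone model. This is not RocksDB code: `Bottommost`, `Step`, and `PlanManualCompaction` are hypothetical names, and the sketch assumes level compaction with `level_compaction_dynamic_level_bytes=false` and an empty `trim_ts`. It only lists which (input level, output level) pairs a manual compaction would request after this PR, as read from the diff.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for rocksdb::BottommostLevelCompaction.
enum class Bottommost { kSkip, kIfHaveCompactionFilter, kForce, kForceOptimized };

// One requested manual compaction: input level and output level.
struct Step {
  int input_level;
  int output_level;
};

// Sketch of the post-PR level-compaction branch. Assumes dynamic level bytes
// is off and trim_ts is empty, so every push-down step may be a trivial move.
std::vector<Step> PlanManualCompaction(int first_overlapped_level,
                                       int max_overlapped_level,
                                       Bottommost mode, bool has_filter) {
  assert(first_overlapped_level >= 0 &&
         first_overlapped_level <= max_overlapped_level);
  std::vector<Step> steps;
  int level = first_overlapped_level;
  // Push data down one level at a time; data in L0 always moves to L1.
  while (level < max_overlapped_level || level == 0) {
    steps.push_back({level, level + 1});
    ++level;
  }
  // Optional intra-level compaction at the final output level.
  const bool filter_case = mode == Bottommost::kIfHaveCompactionFilter &&
                           max_overlapped_level != 0 && has_filter;
  if (filter_case || mode == Bottommost::kForce ||
      mode == Bottommost::kForceOptimized) {
    steps.push_back({level, level});
  }
  return steps;
}
```

Under these assumptions, `PlanManualCompaction(0, 0, Bottommost::kForce, false)` yields two steps, L0->L1 followed by L1->L1, while `Bottommost::kSkip` yields only the L0->L1 step.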

This PR also fixes a bug (see previous discussion in https://github.com/facebook/rocksdb/issues/11041) where `final_output_level` of a manual compaction can be miscalculated when `level_compaction_dynamic_level_bytes=true`. This bug could cause the wrong level to be moved when CompactRangeOptions::change_level is specified.
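
A minimal sketch of the miscalculation, as read from the removed and added lines of the diff. The function names are hypothetical, the sentinel value is illustrative, and the pre-PR rule is simplified (the original also tracked the maximum output level across iterations).

```cpp
#include <cassert>

// Illustrative sentinel standing in for ColumnFamilyData::kCompactToBaseLevel.
constexpr int kCompactToBaseLevel = -2;

// Pre-PR rule (simplified): a compaction that targets the base level is
// assumed to have written to the last level of the LSM tree.
int OldFinalOutputLevel(int requested_output_level, int num_levels) {
  if (requested_output_level == kCompactToBaseLevel) {
    return num_levels - 1;
  }
  return requested_output_level;
}

// Post-PR rule: use the base level reported back by the compaction itself.
int NewFinalOutputLevel(int requested_output_level, int reported_base_level) {
  if (requested_output_level == kCompactToBaseLevel) {
    assert(reported_base_level > 0);
    return reported_base_level;
  }
  return requested_output_level;
}
```

For a 7-level DB whose base level is 4, the old rule reports level 6 while the data actually sits at level 4, so a follow-up `change_level` would act on the wrong level; the new rule reports 4.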

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11375

Test Plan: - Added new unit tests to test that L0 -> L1 compaction allows trivial move and L1 -> L1 compaction is done when needed.

Reviewed By: ajkr

Differential Revision: D44943518

Pulled By: cbi42

fbshipit-source-id: e9fb770d17b163c18a623e1d1bd6b81159192708
Changyu Bi 2 years ago committed by Facebook GitHub Bot
parent bd80433c73
commit 43e9a60bb2
 HISTORY.md                             |   1
 db/db_compaction_filter_test.cc        |   2
 db/db_compaction_test.cc               |  93
 db/db_impl/db_impl.h                   |   6
 db/db_impl/db_impl_compaction_flush.cc | 150
 db/db_sst_test.cc                      |   5
 6 files changed

--- a/HISTORY.md
+++ b/HISTORY.md
@@ -9,6 +9,7 @@
 * For level compaction with `level_compaction_dynamic_level_bytes=true`, RocksDB now trivially moves levels down to fill LSM starting from bottommost level during DB open. See more in comments for option `level_compaction_dynamic_level_bytes` (#11321).
 * User-provided `ReadOptions` take effect for more reads of non-`CacheEntryRole::kDataBlock` blocks.
 * For level compaction with `level_compaction_dynamic_level_bytes=true`, RocksDB now drains unnecessary levels through background compaction automatically (#11340). This together with #11321 makes it automatic to migrate other compaction settings to level compaction with `level_compaction_dynamic_level_bytes=true`. In addition, a live DB that becomes smaller will now have unnecessary levels drained which can help to reduce read and space amp.
+* If `CompactRange()` is called with `CompactRangeOptions::bottommost_level_compaction=kForce*` to compact from L0 to L1, RocksDB now will try to do trivial move from L0 to L1 and then do an intra L1 compaction, instead of a L0 to L1 compaction with trivial move disabled (#11375).
 ### Bug Fixes
 * In the DB::VerifyFileChecksums API, ensure that file system reads of SST files are equal to the readahead_size in ReadOptions, if specified. Previously, each read was 2x the readahead_size.

--- a/db/db_compaction_filter_test.cc
+++ b/db/db_compaction_filter_test.cc
@@ -742,7 +742,7 @@ TEST_F(DBTestCompactionFilter, CompactionFilterContextCfId) {
   ASSERT_TRUE(filter->compaction_filter_created());
 }
-// Compaction filters aplies to all records, regardless snapshots.
+// Compaction filters applies to all records, regardless snapshots.
 TEST_F(DBTestCompactionFilter, CompactionFilterIgnoreSnapshot) {
   std::string five = std::to_string(5);
   Options options = CurrentOptions();

--- a/db/db_compaction_test.cc
+++ b/db/db_compaction_test.cc
@@ -136,11 +136,12 @@ class DBCompactionTestWithParam
 class DBCompactionTestWithBottommostParam
     : public DBTestBase,
-      public testing::WithParamInterface<BottommostLevelCompaction> {
+      public testing::WithParamInterface<
+          std::tuple<BottommostLevelCompaction, bool>> {
  public:
   DBCompactionTestWithBottommostParam()
       : DBTestBase("db_compaction_test", /*env_do_fsync=*/true) {
-    bottommost_level_compaction_ = GetParam();
+    bottommost_level_compaction_ = std::get<0>(GetParam());
   }
   BottommostLevelCompaction bottommost_level_compaction_;
@@ -7339,10 +7340,63 @@ TEST_P(DBCompactionTestL0FilesMisorderCorruptionWithParam,
   Destroy(options_);
 }
+TEST_F(DBCompactionTest, SingleLevelUniveresal) {
+  // Tests that manual compaction works with single level universal compaction.
+  Options options = CurrentOptions();
+  options.compaction_style = kCompactionStyleUniversal;
+  options.disable_auto_compactions = true;
+  options.num_levels = 1;
+  DestroyAndReopen(options);
+  Random rnd(31);
+  for (int i = 0; i < 10; ++i) {
+    for (int j = 0; j < 50; ++j) {
+      ASSERT_OK(Put(Key(i * 100 + j), rnd.RandomString(50)));
+    }
+    ASSERT_OK(Flush());
+  }
+  ASSERT_EQ(NumTableFilesAtLevel(0), 10);
+  ASSERT_OK(db_->CompactRange(CompactRangeOptions(), nullptr, nullptr));
+  ASSERT_EQ(NumTableFilesAtLevel(0), 1);
+}
+TEST_F(DBCompactionTest, SingleOverlappingNonL0BottommostManualCompaction) {
+  // Tests that manual compact will rewrite bottommost level
+  // when there is only a single non-L0 level that overlaps with
+  // manual compaction range.
+  constexpr int kSstNum = 10;
+  Options options = CurrentOptions();
+  options.disable_auto_compactions = true;
+  options.num_levels = 7;
+  for (auto b : {BottommostLevelCompaction::kForce,
+                 BottommostLevelCompaction::kForceOptimized}) {
+    DestroyAndReopen(options);
+    // Generate some sst files on level 0 with sequence keys (no overlap)
+    for (int i = 0; i < kSstNum; i++) {
+      for (int j = 1; j < UCHAR_MAX; j++) {
+        auto key = std::string(kSstNum, '\0');
+        key[kSstNum - i] += static_cast<char>(j);
+        ASSERT_OK(Put(key, std::string(i % 1000, 'A')));
+      }
+      ASSERT_OK(Flush());
+    }
+    MoveFilesToLevel(4);
+    ASSERT_EQ(NumTableFilesAtLevel(4), kSstNum);
+    CompactRangeOptions cro;
+    cro.bottommost_level_compaction = b;
+    ASSERT_OK(db_->CompactRange(cro, nullptr, nullptr));
+    ASSERT_EQ(NumTableFilesAtLevel(4), 1);
+  }
+}
 TEST_P(DBCompactionTestWithBottommostParam, SequenceKeysManualCompaction) {
   constexpr int kSstNum = 10;
   Options options = CurrentOptions();
   options.disable_auto_compactions = true;
+  options.num_levels = 7;
+  const bool dynamic_level = std::get<1>(GetParam());
+  options.level_compaction_dynamic_level_bytes = dynamic_level;
   DestroyAndReopen(options);
   // Generate some sst files on level 0 with sequence keys (no overlap)
@@ -7360,25 +7414,42 @@ TEST_P(DBCompactionTestWithBottommostParam, SequenceKeysManualCompaction) {
   auto cro = CompactRangeOptions();
   cro.bottommost_level_compaction = bottommost_level_compaction_;
+  bool trivial_moved = false;
+  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
+      "DBImpl::BackgroundCompaction:TrivialMove",
+      [&](void* /*arg*/) { trivial_moved = true; });
+  ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
+  // All bottommost_level_compaction options should allow l0 -> l1 trivial move.
   ASSERT_OK(db_->CompactRange(cro, nullptr, nullptr));
+  ASSERT_TRUE(trivial_moved);
   if (bottommost_level_compaction_ == BottommostLevelCompaction::kForce ||
       bottommost_level_compaction_ ==
           BottommostLevelCompaction::kForceOptimized) {
-    // Real compaction to compact all sst files from level 0 to 1 file on level
-    // 1
-    ASSERT_EQ("0,1", FilesPerLevel(0));
+    // bottommost level should go through intra-level compaction
+    // and has only 1 file
+    if (dynamic_level) {
+      ASSERT_EQ("0,0,0,0,0,0,1", FilesPerLevel(0));
+    } else {
+      ASSERT_EQ("0,1", FilesPerLevel(0));
+    }
   } else {
-    // Just trivial move from level 0 -> 1
-    ASSERT_EQ("0," + std::to_string(kSstNum), FilesPerLevel(0));
+    // Just trivial move from level 0 -> 1/base
+    if (dynamic_level) {
+      ASSERT_EQ("0,0,0,0,0,0," + std::to_string(kSstNum), FilesPerLevel(0));
+    } else {
+      ASSERT_EQ("0," + std::to_string(kSstNum), FilesPerLevel(0));
+    }
   }
 }
 INSTANTIATE_TEST_CASE_P(
     DBCompactionTestWithBottommostParam, DBCompactionTestWithBottommostParam,
-    ::testing::Values(BottommostLevelCompaction::kSkip,
-                      BottommostLevelCompaction::kIfHaveCompactionFilter,
-                      BottommostLevelCompaction::kForce,
-                      BottommostLevelCompaction::kForceOptimized));
+    ::testing::Combine(
+        ::testing::Values(BottommostLevelCompaction::kSkip,
+                          BottommostLevelCompaction::kIfHaveCompactionFilter,
+                          BottommostLevelCompaction::kForce,
+                          BottommostLevelCompaction::kForceOptimized),
+        ::testing::Bool()));
 TEST_F(DBCompactionTest, UpdateLevelSubCompactionTest) {
   Options options = CurrentOptions();

--- a/db/db_impl/db_impl.h
+++ b/db/db_impl/db_impl.h
@@ -734,13 +734,17 @@ class DBImpl : public DB {
   // max_file_num_to_ignore allows bottom level compaction to filter out newly
   // compacted SST files. Setting max_file_num_to_ignore to kMaxUint64 will
   // disable the filtering
+  // If `final_output_level` is not nullptr, it is set to manual compaction's
+  // output level if returned status is OK, and it may or may not be set to
+  // manual compaction's output level if returned status is not OK.
   Status RunManualCompaction(ColumnFamilyData* cfd, int input_level,
                              int output_level,
                              const CompactRangeOptions& compact_range_options,
                              const Slice* begin, const Slice* end,
                              bool exclusive, bool disallow_trivial_move,
                              uint64_t max_file_num_to_ignore,
-                             const std::string& trim_ts);
+                             const std::string& trim_ts,
+                             int* final_output_level = nullptr);
   // Return an internal iterator over the current state of the database.
   // The keys of this iterator are internal keys (see format.h).

--- a/db/db_impl/db_impl_compaction_flush.cc
+++ b/db/db_impl/db_impl_compaction_flush.cc
@@ -1054,8 +1054,8 @@ Status DBImpl::CompactRangeInternal(const CompactRangeOptions& options,
     }
     s = RunManualCompaction(cfd, ColumnFamilyData::kCompactAllLevels,
                             final_output_level, options, begin, end, exclusive,
-                            false, std::numeric_limits<uint64_t>::max(),
-                            trim_ts);
+                            false /* disable_trivial_move */,
+                            std::numeric_limits<uint64_t>::max(), trim_ts);
   } else {
     int first_overlapped_level = kInvalidLevel;
     int max_overlapped_level = kInvalidLevel;
@@ -1142,74 +1142,83 @@ Status DBImpl::CompactRangeInternal(const CompactRangeOptions& options,
       CleanupSuperVersion(super_version);
     }
     if (s.ok() && first_overlapped_level != kInvalidLevel) {
-      // max_file_num_to_ignore can be used to filter out newly created SST
-      // files, useful for bottom level compaction in a manual compaction
-      uint64_t max_file_num_to_ignore = std::numeric_limits<uint64_t>::max();
-      uint64_t next_file_number = versions_->current_next_file_number();
-      final_output_level = max_overlapped_level;
-      int output_level;
-      for (int level = first_overlapped_level; level <= max_overlapped_level;
-           level++) {
-        bool disallow_trivial_move = false;
-        // in case the compaction is universal or if we're compacting the
-        // bottom-most level, the output level will be the same as input one.
-        // level 0 can never be the bottommost level (i.e. if all files are in
-        // level 0, we will compact to level 1)
-        if (cfd->ioptions()->compaction_style == kCompactionStyleUniversal ||
-            cfd->ioptions()->compaction_style == kCompactionStyleFIFO) {
-          output_level = level;
-        } else if (level == max_overlapped_level && level > 0) {
-          if (options.bottommost_level_compaction ==
-              BottommostLevelCompaction::kSkip) {
-            // Skip bottommost level compaction
-            continue;
-          } else if (options.bottommost_level_compaction ==
-                         BottommostLevelCompaction::kIfHaveCompactionFilter &&
-                     cfd->ioptions()->compaction_filter == nullptr &&
-                     cfd->ioptions()->compaction_filter_factory == nullptr) {
-            // Skip bottommost level compaction since we don't have a compaction
-            // filter
-            continue;
-          }
-          output_level = level;
-          // update max_file_num_to_ignore only for bottom level compaction
-          // because data in newly compacted files in middle levels may still
-          // need to be pushed down
-          max_file_num_to_ignore = next_file_number;
-        } else {
+      if (cfd->ioptions()->compaction_style == kCompactionStyleUniversal ||
+          cfd->ioptions()->compaction_style == kCompactionStyleFIFO) {
+        assert(first_overlapped_level == 0);
+        s = RunManualCompaction(
+            cfd, first_overlapped_level, first_overlapped_level, options, begin,
+            end, exclusive, true /* disallow_trivial_move */,
+            std::numeric_limits<uint64_t>::max() /* max_file_num_to_ignore */,
+            trim_ts);
+        final_output_level = max_overlapped_level;
+      } else {
+        assert(cfd->ioptions()->compaction_style == kCompactionStyleLevel);
+        uint64_t next_file_number = versions_->current_next_file_number();
+        // Start compaction from `first_overlapped_level`, one level down at a
+        // time, until output level >= max_overlapped_level.
+        // When max_overlapped_level == 0, we will still compact from L0 -> L1
+        // (or LBase), and followed by a bottommost level intra-level compaction
+        // at L1 (or LBase), if applicable.
+        int level = first_overlapped_level;
+        final_output_level = level;
+        int output_level, base_level;
+        while (level < max_overlapped_level || level == 0) {
           output_level = level + 1;
-          if (cfd->ioptions()->compaction_style == kCompactionStyleLevel &&
-              cfd->ioptions()->level_compaction_dynamic_level_bytes &&
+          if (cfd->ioptions()->level_compaction_dynamic_level_bytes &&
               level == 0) {
             output_level = ColumnFamilyData::kCompactToBaseLevel;
           }
-          // if it's a BottommostLevel compaction and `kForce*` compaction is
-          // set, disallow trivial move
-          if (level == max_overlapped_level &&
-              (options.bottommost_level_compaction ==
-                   BottommostLevelCompaction::kForce ||
-               options.bottommost_level_compaction ==
-                   BottommostLevelCompaction::kForceOptimized)) {
-            disallow_trivial_move = true;
+          // Use max value for `max_file_num_to_ignore` to always compact
+          // files down.
+          s = RunManualCompaction(
+              cfd, level, output_level, options, begin, end, exclusive,
+              !trim_ts.empty() /* disallow_trivial_move */,
+              std::numeric_limits<uint64_t>::max() /* max_file_num_to_ignore */,
+              trim_ts,
+              output_level == ColumnFamilyData::kCompactToBaseLevel
+                  ? &base_level
+                  : nullptr);
+          if (!s.ok()) {
+            break;
           }
+          if (output_level == ColumnFamilyData::kCompactToBaseLevel) {
+            assert(base_level > 0);
+            level = base_level;
+          } else {
+            ++level;
+          }
+          final_output_level = level;
+          TEST_SYNC_POINT("DBImpl::RunManualCompaction()::1");
+          TEST_SYNC_POINT("DBImpl::RunManualCompaction()::2");
         }
-        // trim_ts need real compaction to remove latest record
-        if (!trim_ts.empty()) {
-          disallow_trivial_move = true;
-        }
-        s = RunManualCompaction(cfd, level, output_level, options, begin, end,
-                                exclusive, disallow_trivial_move,
-                                max_file_num_to_ignore, trim_ts);
-        if (!s.ok()) {
-          break;
-        }
-        if (output_level == ColumnFamilyData::kCompactToBaseLevel) {
-          final_output_level = cfd->NumberLevels() - 1;
-        } else if (output_level > final_output_level) {
-          final_output_level = output_level;
+        if (s.ok()) {
+          assert(final_output_level > 0);
+          // bottommost level intra-level compaction
+          // TODO(cbi): this preserves earlier behavior where if
+          //  max_overlapped_level = 0 and bottommost_level_compaction is
+          //  kIfHaveCompactionFilter, we only do a L0 -> LBase compaction
+          //  and do not do intra-LBase compaction even when user configures
+          //  compaction filter. We may want to still do a LBase -> LBase
+          //  compaction in case there is some file in LBase that did not go
+          //  through L0 -> LBase compaction, and hence did not go through
+          //  compaction filter.
+          if ((options.bottommost_level_compaction ==
+                   BottommostLevelCompaction::kIfHaveCompactionFilter &&
+               max_overlapped_level != 0 &&
+               (cfd->ioptions()->compaction_filter != nullptr ||
+                cfd->ioptions()->compaction_filter_factory != nullptr)) ||
+              options.bottommost_level_compaction ==
+                  BottommostLevelCompaction::kForceOptimized ||
+              options.bottommost_level_compaction ==
+                  BottommostLevelCompaction::kForce) {
+            // Use `next_file_number` as `max_file_num_to_ignore` to avoid
+            // rewriting newly compacted files when it is kForceOptimized.
+            s = RunManualCompaction(
+                cfd, final_output_level, final_output_level, options, begin,
+                end, exclusive, !trim_ts.empty() /* disallow_trivial_move */,
+                next_file_number /* max_file_num_to_ignore */, trim_ts);
+          }
         }
-        TEST_SYNC_POINT("DBImpl::RunManualCompaction()::1");
-        TEST_SYNC_POINT("DBImpl::RunManualCompaction()::2");
       }
     }
   }
@@ -1853,7 +1862,8 @@ Status DBImpl::RunManualCompaction(
     ColumnFamilyData* cfd, int input_level, int output_level,
     const CompactRangeOptions& compact_range_options, const Slice* begin,
     const Slice* end, bool exclusive, bool disallow_trivial_move,
-    uint64_t max_file_num_to_ignore, const std::string& trim_ts) {
+    uint64_t max_file_num_to_ignore, const std::string& trim_ts,
+    int* final_output_level) {
   assert(input_level == ColumnFamilyData::kCompactAllLevels ||
          input_level >= 0);
@@ -2004,6 +2014,15 @@ Status DBImpl::RunManualCompaction(
     } else if (!scheduled) {
       if (compaction == nullptr) {
         manual.done = true;
+        if (final_output_level) {
+          // No compaction needed or there is a conflicting compaction.
+          // Still set `final_output_level` to the level where we would
+          // have compacted to.
+          *final_output_level = output_level;
+          if (output_level == ColumnFamilyData::kCompactToBaseLevel) {
+            *final_output_level = cfd->current()->storage_info()->base_level();
+          }
+        }
         bg_cv_.SignalAll();
         continue;
       }
@@ -2037,6 +2056,9 @@ Status DBImpl::RunManualCompaction(
       }
       scheduled = true;
       TEST_SYNC_POINT("DBImpl::RunManualCompaction:Scheduled");
+      if (final_output_level) {
+        *final_output_level = compaction->output_level();
+      }
     }
   }

--- a/db/db_sst_test.cc
+++ b/db/db_sst_test.cc
@@ -810,9 +810,10 @@ TEST_F(DBSSTTest, RateLimitedWALDelete) {
   // We created 4 sst files in L0
   ASSERT_EQ("4", FilesPerLevel(0));
-  // Compaction will move the 4 files in L0 to trash and create 1 L1 file
+  // Compaction will move the 4 files in L0 to trash and create 1 L1 file.
+  // Use kForceOptimized to not rewrite the new L1 file.
   CompactRangeOptions cro;
-  cro.bottommost_level_compaction = BottommostLevelCompaction::kForce;
+  cro.bottommost_level_compaction = BottommostLevelCompaction::kForceOptimized;
   ASSERT_OK(db_->CompactRange(cro, nullptr, nullptr));
   ASSERT_OK(dbfull()->TEST_WaitForCompact(true));
   ASSERT_EQ("0,1", FilesPerLevel(0));
