Skip WALs according to MinLogNumberToKeep when creating checkpoint (#7789)

Summary:
In a stress test failure, we observe that a WAL is skipped when creating checkpoint, although its log number >= MinLogNumberToKeep(). This might happen in the following case:

1. when creating the checkpoint, there are 2 column families: CF0 and CF1, and there are 2 WALs: 1, 2;
2. CF0's log number is 1, CF0's active memtable is empty, CF1's log number is 2, CF1's active memtable is not empty, WAL 2 is not empty, the sequence number points to WAL 2;
2. the checkpoint process flushes CF0, since CF0' active memtable is empty, there is no need to SwitchMemtable, thus no new WAL will be created, so CF0's log number is now 2, concurrently, some data is written to CF0 and WAL 2;
3. the checkpoint process flushes CF1, WAL 3 is created and CF1's log number is now 3, CF0's log number is still 2 because CF0 is not empty and WAL 2 contains its unflushed data concurrently written in step 2;
4.  the checkpoint process determines that WAL 1 and 2 are no longer needed according to [live_wal_files[i]->StartSequence() >= *sequence_number](https://github.com/facebook/rocksdb/blob/master/utilities/checkpoint/checkpoint_impl.cc#L388), so it skips linking them to the checkpoint directory;
5. but according to `MinLogNumberToKeep()`, WAL 2 still needs to be kept because CF0's log number is 2.

If the checkpoint is reopened in read-only mode, and only read from the snapshot with the initial sequence number, then there will be no data loss or data inconsistency.

But if the checkpoint is reopened and read from the most recent sequence number, suppose in step 3, there are also data concurrently written to CF1 and WAL 3, then the most recent sequence number refers to the latest entry in WAL 3, so the data written in step 2 should also be visible, but since WAL 2 is discarded, those data are lost.

When tracking WAL in MANIFEST is enabled, when reopening the checkpoint, since WAL 2 is still tracked in MANIFEST as alive, but it's missing from the checkpoint directory, a corruption will be reported.

This PR makes the checkpoint process to only skip a WAL if its log number < `MinLogNumberToKeep`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7789

Test Plan: watch existing tests to pass.

Reviewed By: ajkr

Differential Revision: D25662346

Pulled By: cheng-chang

fbshipit-source-id: 136471095baa01886cf44809455cf855f24857a0
main
cheng-chang 4 years ago committed by Facebook GitHub Bot
parent bd2645bc34
commit bdb7e544bd
  1. 10
      utilities/backupable/backupable_db_test.cc
  2. 12
      utilities/checkpoint/checkpoint_impl.cc

@ -70,6 +70,16 @@ class DummyDB : public StackableDB {
DBOptions GetDBOptions() const override { return DBOptions(options_); } DBOptions GetDBOptions() const override { return DBOptions(options_); }
using StackableDB::GetIntProperty;
bool GetIntProperty(ColumnFamilyHandle*, const Slice& property,
uint64_t* value) override {
if (property == DB::Properties::kMinLogNumberToKeep) {
*value = 1;
return true;
}
return false;
}
Status EnableFileDeletions(bool /*force*/) override { Status EnableFileDeletions(bool /*force*/) override {
EXPECT_TRUE(!deletions_enabled_); EXPECT_TRUE(!deletions_enabled_);
deletions_enabled_ = true; deletions_enabled_ = true;

@ -232,14 +232,11 @@ Status CheckpointImpl::CreateCustomCheckpoint(
// this will return live_files prefixed with "/" // this will return live_files prefixed with "/"
s = db_->GetLiveFiles(live_files, &manifest_file_size, flush_memtable); s = db_->GetLiveFiles(live_files, &manifest_file_size, flush_memtable);
if (!db_->GetIntProperty(DB::Properties::kMinLogNumberToKeep, &min_log_num)) {
return Status::InvalidArgument("cannot get the min log number to keep.");
}
if (s.ok() && db_options.allow_2pc) { if (s.ok() && db_options.allow_2pc) {
// If 2PC is enabled, we need to get minimum log number after the flush.
// Need to refetch the live files to recapture the snapshot.
if (!db_->GetIntProperty(DB::Properties::kMinLogNumberToKeep,
&min_log_num)) {
return Status::InvalidArgument(
"2PC enabled but cannot fine the min log number to keep.");
}
// We need to refetch live files with flush to handle this case: // We need to refetch live files with flush to handle this case:
// A previous 000001.log contains the prepare record of transaction tnx1. // A previous 000001.log contains the prepare record of transaction tnx1.
// The current log file is 000002.log, and sequence_number points to this // The current log file is 000002.log, and sequence_number points to this
@ -385,7 +382,6 @@ Status CheckpointImpl::CreateCustomCheckpoint(
for (size_t i = 0; s.ok() && i < wal_size; ++i) { for (size_t i = 0; s.ok() && i < wal_size; ++i) {
if ((live_wal_files[i]->Type() == kAliveLogFile) && if ((live_wal_files[i]->Type() == kAliveLogFile) &&
(!flush_memtable || (!flush_memtable ||
live_wal_files[i]->StartSequence() >= *sequence_number ||
live_wal_files[i]->LogNumber() >= min_log_num)) { live_wal_files[i]->LogNumber() >= min_log_num)) {
if (i + 1 == wal_size) { if (i + 1 == wal_size) {
s = copy_file_cb(db_options.wal_dir, live_wal_files[i]->PathName(), s = copy_file_cb(db_options.wal_dir, live_wal_files[i]->PathName(),

Loading…
Cancel
Save