A better contract for best_efforts_recovery (#11085)

Summary: Capture more of the original intent at a high level, without getting bogged down in low-level details. The old text made some weak promises about handling of LOCK files. There should be no specific concern for LOCK files, because we already rely on LockFile() to create the file if it's not present already. And the lock file is generally size 0, so we don't have to worry about truncation. Added a unit test.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11085

Test Plan: existing tests, and a new one.

Reviewed By: siying

Differential Revision: D42713233

Pulled By: pdillinger

fbshipit-source-id: 2fce7c974d35fac065037c9c4c7326a59c9fe340
2 years ago · 4a9185340d
parent e0ea0dc6bd
commit 4a9185340d
2 changed files with 51 additions and 30 deletions
--- a/db/db_basic_test.cc
+++ b/db/db_basic_test.cc
@@ -679,6 +679,27 @@ TEST_F(DBBasicTest, IdentityAcrossRestarts) {
  } while (ChangeCompactOptions());
 }
+TEST_F(DBBasicTest, LockFileRecovery) {
+  Options options = CurrentOptions();
+  // Regardless of best_efforts_recovery
+  for (bool ber : {false, true}) {
+    options.best_efforts_recovery = ber;
+    DestroyAndReopen(options);
+    std::string id1, id2;
+    ASSERT_OK(db_->GetDbIdentity(id1));
+    Close();
+    // Should be OK to re-open DB after lock file deleted
+    std::string lockfilename = LockFileName(dbname_);
+    ASSERT_OK(env_->DeleteFile(lockfilename));
+    Reopen(options);
+    // Should be same DB as before
+    ASSERT_OK(db_->GetDbIdentity(id2));
+    ASSERT_EQ(id1, id2);
+  }
+}
 #ifndef ROCKSDB_LITE
 TEST_F(DBBasicTest, Snapshot) {
  env_->SetMockSleep();
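
Note on the LOCK file reasoning in the commit message: the argument is that a lock routine which creates the file when it is absent makes a deleted LOCK file harmless, and that a zero-size file cannot be truncated any further. The sketch below illustrates that idea with plain POSIX calls. It is not RocksDB's LockFile() implementation; AcquireLockFile and FileSize are hypothetical helper names introduced only for this illustration, and the use of open(O_CREAT) plus fcntl(F_SETLK) is an assumption about how a POSIX-style lock routine might behave.

```cpp
// Sketch only, not RocksDB code. AcquireLockFile and FileSize are
// hypothetical helpers used to illustrate "create if missing, then lock".
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

#include <string>

// Opens `path`, creating it if it does not exist, and takes an exclusive
// advisory lock on the whole file. Returns the open descriptor, or -1 on
// failure. The caller releases the lock by closing the descriptor.
int AcquireLockFile(const std::string& path) {
  int fd = ::open(path.c_str(), O_RDWR | O_CREAT, 0644);
  if (fd < 0) {
    return -1;
  }
  struct flock lock_info = {};
  lock_info.l_type = F_WRLCK;
  lock_info.l_whence = SEEK_SET;
  lock_info.l_start = 0;
  lock_info.l_len = 0;  // 0 means "through end of file", i.e. the whole file
  if (::fcntl(fd, F_SETLK, &lock_info) == -1) {
    ::close(fd);
    return -1;
  }
  return fd;
}

// Returns the size of `path` in bytes, or -1 if it cannot be stat'ed.
long long FileSize(const std::string& path) {
  struct stat st;
  if (::stat(path.c_str(), &st) != 0) {
    return -1;
  }
  return static_cast<long long>(st.st_size);
}
```

Under these assumptions, deleting the lock file between two calls to AcquireLockFile simply causes it to be re-created empty, which mirrors what the LockFileRecovery test checks at the DB level.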
--- a/include/rocksdb/options.h
+++ b/include/rocksdb/options.h
@@ -1288,36 +1288,36 @@ struct DBOptions {
  // Default: nullptr
  std::shared_ptr<FileChecksumGenFactory> file_checksum_gen_factory = nullptr;
-  // By default, RocksDB recovery fails if any table/blob file referenced in the
-  // final version reconstructed from the
-  // MANIFEST are missing after scanning the MANIFEST pointed to by the
-  // CURRENT file. It can also fail if verification of unique SST id fails.
-  // Best-efforts recovery is another recovery mode that does not necessarily
-  // fail when certain table/blob files are missing/corrupted or have mismatched
-  // unique id table property. Instead, best-efforts recovery recovers each
-  // column family to a point in the MANIFEST that corresponds to a version. In
-  // such a version, all valid table/blob files referenced have the expected
-  // file size. For table files, their unique id table property match the
-  // MANIFEST.
-  //
-  // Best-efforts recovery does not need a valid CURRENT file, and tries to
-  // recover the database using one of the available MANIFEST files in the db
-  // directory.
-  // Best-efforts recovery tries the available MANIFEST files from high file
-  // numbers (newer) to low file numbers (older), and stops after finding the
-  // first MANIFEST file from which the db can be recovered to a state without
-  // invalid (missing/filesize-mismatch/unique-id-mismatch) table and blob
-  // files. It is possible that the database can be restored to an empty state
-  // with no table or blob files.
-  //
-  // Regardless of this option, the IDENTITY file
-  // is updated if needed during recovery to match the DB ID in the MANIFEST (if
-  // previously using write_dbid_to_manifest) or to be in some valid state
-  // (non-empty DB ID). Currently, not compatible with atomic flush.
-  // Furthermore, WAL files will not be used for recovery if
-  // best_efforts_recovery is true. Also requires either 1) LOCK file exists or
-  // 2) underlying env's LockFile() call returns ok even for non-existing LOCK
-  // file.
+  // By default, RocksDB will attempt to detect any data losses or corruptions
+  // in DB files and return an error to the user, either at DB::Open time or
+  // later during DB operation. The exception to this policy is the WAL file,
+  // whose recovery is controlled by the wal_recovery_mode option.
+  //
+  // Best-efforts recovery (this option set to true) signals a preference for
+  // opening the DB to any point-in-time valid state for each column family,
+  // including the empty/new state, versus the default of returning non-WAL
+  // data losses to the user as errors. In terms of RocksDB user data, this
+  // is like applying WALRecoveryMode::kPointInTimeRecovery to each column
+  // family rather than just the WAL.
+  //
+  // Best-efforts recovery (BER) is specifically designed to recover a DB with
+  // files that are missing or truncated to some smaller size, such as the
+  // result of an incomplete DB "physical" (FileSystem) copy. BER can also
+  // detect when an SST file has been replaced with a different one of the
+  // same size (assuming SST unique IDs are tracked in DB manifest).
+  // BER is not yet designed to produce a usable DB from other corruptions to
+  // DB files (which should generally be detectable by DB::VerifyChecksum()),
+  // and BER does not yet attempt to recover any WAL files.
+  //
+  // For example, if an SST or blob file referenced by the MANIFEST is missing,
+  // BER might be able to find a set of files corresponding to an old "point in
+  // time" version of the column family, possibly from an older MANIFEST
+  // file. Some other kinds of DB files (e.g. CURRENT, LOCK, IDENTITY) are
+  // either ignored or replaced with BER, or quietly fixed regardless of BER
+  // setting. BER does require at least one valid MANIFEST to recover to a
+  // non-trivial DB state, unlike `ldb repair`.
+  //
+  // Currently, best_efforts_recovery=true is not compatible with atomic flush.
  //
  // Default: false
  bool best_efforts_recovery = false;
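
Note on the MANIFEST fallback: the removed comment described BER trying MANIFEST files from newest to oldest and stopping at the first one whose referenced files are all present with the expected size, possibly ending at an empty state. The following is a deliberately simplified toy model of that selection rule, written in standalone C++. It is not RocksDB code: ToyManifest and PickRecoverableManifest are hypothetical names, the model ignores column families, unique IDs, and blob files, and real recovery can also stop at an intermediate version within a single MANIFEST.

```cpp
// Toy model only, not RocksDB code. ToyManifest and PickRecoverableManifest
// are hypothetical names used to illustrate "newest valid MANIFEST wins".
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct ToyManifest {
  uint64_t number;                        // higher number means newer
  std::map<std::string, uint64_t> files;  // referenced file -> expected size
};

// Scans manifests from newest to oldest. On success, stores in *chosen the
// number of the first manifest whose referenced files all exist in `on_disk`
// with exactly the expected size, and returns true. A manifest referencing
// no files is trivially valid (the "empty state"). Returns false if no
// manifest qualifies.
bool PickRecoverableManifest(std::vector<ToyManifest> manifests,
                             const std::map<std::string, uint64_t>& on_disk,
                             uint64_t* chosen) {
  std::sort(manifests.begin(), manifests.end(),
            [](const ToyManifest& a, const ToyManifest& b) {
              return a.number > b.number;
            });
  for (const ToyManifest& m : manifests) {
    bool all_valid = true;
    for (const auto& entry : m.files) {
      auto it = on_disk.find(entry.first);
      // A missing file or a size mismatch (e.g. truncation) invalidates
      // this manifest in the toy model.
      if (it == on_disk.end() || it->second != entry.second) {
        all_valid = false;
        break;
      }
    }
    if (all_valid) {
      *chosen = m.number;
      return true;
    }
  }
  return false;
}
```

In this model, a truncated SST referenced only by the newest manifest causes the selection to fall back to an older manifest that does not reference it, which matches the "incomplete physical copy" scenario the new comment calls out.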