From 362b8d4393a3bd389be3991ddd55254612486e76 Mon Sep 17 00:00:00 2001 From: Yanqin Jin Date: Thu, 20 Feb 2020 13:13:53 -0800 Subject: [PATCH] Fix MANIFEST name assignment (#6426) Summary: Currently, a new MANIFEST file is assigned a new file number when 1) no MANIFEST is open, or 2) current MANIFEST file size exceeds a threshold. This is not sufficient. There are cases when the caller explicitly specifies that a new MANIFEST be created. For example, if user sets options.write_dbid_to_manifest = true, and there are WAL files, then RocksDB will run into an issue during recovery. `DBImpl::Recover()` will call `LogAndApply()` to write dbid. At this point, the db being recovered creates a new MANIFEST, say, MANIFEST-000003. Since there are WALs, `DBImpl::RecoverLogFiles` will be called. Towards the end of this function, we call `LogAndApply(new_descriptor_log=true)`, which explicitly creates a new MANIFEST. However, the manifest_file_number is wrong before this fix. Consequently, RocksDB opens an existing, non-empty file for append, effectively truncating the file to zero. If a crash occurs, then there will be data loss. Test Plan (devserver): make check Pull Request resolved: https://github.com/facebook/rocksdb/pull/6426 Test Plan: make check Differential Revision: D19951866 Pulled By: riversand963 fbshipit-source-id: 4b1b9fc28d4fe2ac12764b388ef9e61f05e766da --- HISTORY.md | 1 + db/version_set.cc | 5 +++-- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/HISTORY.md b/HISTORY.md index f8355ef72..75966e3ce 100644 --- a/HISTORY.md +++ b/HISTORY.md @@ -12,6 +12,7 @@ * BlobDB now ignores trivially moved files when updating the mapping between blob files and SSTs. This should mitigate issue #6338 where out of order flush/compaction notifications could trigger an assertion with the earlier code. * Batched MultiGet() ignores IO errors while reading data blocks, causing it to potentially continue looking for a key and returning stale results. * `WriteBatchWithIndex::DeleteRange` returns `Status::NotSupported`. Previously it returned success even though reads on the batch did not account for range tombstones. The corresponding language bindings now cannot be used. In C, that includes `rocksdb_writebatch_wi_delete_range`, `rocksdb_writebatch_wi_delete_range_cf`, `rocksdb_writebatch_wi_delete_rangev`, and `rocksdb_writebatch_wi_delete_rangev_cf`. In Java, that includes `WriteBatchWithIndex::deleteRange`. +* Assign new MANIFEST file number when caller tries to create a new MANIFEST by calling LogAndApply(..., new_descriptor_log=true). This bug can cause MANIFEST being overwritten during recovery if options.write_dbid_to_manifest = true and there are WAL file(s). ### Performance Improvements * Perfom readahead when reading from option files. Inside DB, options.log_readahead_size will be used as the readahead size. In other cases, a default 512KB is used. diff --git a/db/version_set.cc b/db/version_set.cc index b413ba50f..e913a97dd 100644 --- a/db/version_set.cc +++ b/db/version_set.cc @@ -3767,8 +3767,6 @@ Status VersionSet::ProcessManifestWrites( if (!descriptor_log_ || manifest_file_size_ > db_options_->max_manifest_file_size) { TEST_SYNC_POINT("VersionSet::ProcessManifestWrites:BeforeNewManifest"); - pending_manifest_file_number_ = NewFileNumber(); - batch_edits.back()->SetNextFile(next_file_number_.load()); new_descriptor_log = true; } else { pending_manifest_file_number_ = manifest_file_number_; @@ -3779,6 +3777,9 @@ Status VersionSet::ProcessManifestWrites( // SwitchMemtable(). std::unordered_map curr_state; if (new_descriptor_log) { + pending_manifest_file_number_ = NewFileNumber(); + batch_edits.back()->SetNextFile(next_file_number_.load()); + // if we are writing out new snapshot make sure to persist max column // family. if (column_family_set_->GetMaxColumnFamily() > 0) {