Fix missing WAL in new manifest by rolling over the WAL deletion record from prev manifest (#10892)
Summary: **Context** `Options::track_and_verify_wals_in_manifest = true` verifies each of the WALs tracked in manifest indeed presents in the WAL folder. If not, a corruption "Missing WAL with log number" will be thrown. `DB::SyncWAL()` called at a specific timing (i.e, at the `TEST_SYNC_POINT("FindObsoleteFiles::PostMutexUnlock")`) can record in a new manifest the WAL addition of a WAL file that already had a WAL deletion recorded in the previous manifest. And the WAL deletion record is not rollover-ed to the new manifest. So the new manifest creates the illusion of such WAL never gets deleted and should presents at db re/open. - Such WAL deletion record can be caused by flushing the memtable associated with that WAL and such WAL deletion can actually happen in` PurgeObsoleteFiles()`. As a consequence, upon `DB::Reopen()`, this WAL file can be deleted while manifest still has its WAL addition record , which causes a false alarm of corruption "Missing WAL with log number" to be thrown. **Summary** This PR fixes this false alarm by rolling over the WAL deletion record from prev manifest to the new manifest by adding the WAL deletion record to the new manifest. **Test** - Make check - Added new unit test `TEST_F(DBWALTest, FixSyncWalOnObseletedWalWithNewManifestCausingMissingWAL)` that failed before the fix and passed after - [Ongoing]CI stress test + aggressive value as in https://github.com/facebook/rocksdb/pull/10761 , which is how this false alarm was first surfaced, to confirm such false alarm disappears - [Ongoing]Regular CI stress test to confirm such fix didn't harm anything Pull Request resolved: https://github.com/facebook/rocksdb/pull/10892 Reviewed By: ajkr Differential Revision: D40778965 Pulled By: hx235 fbshipit-source-id: a512364bfdeb0b1a55c171890e60d856c528f37fmain
parent
f1574a20ff
commit
2f76ab150d
Loading…
Reference in new issue