From 27f3af596609f3b86af5080cc870c869e836e7f2 Mon Sep 17 00:00:00 2001 From: Peter Dillinger Date: Tue, 2 Aug 2022 10:54:32 -0700 Subject: [PATCH] Fix serious FSDirectory use-after-Close bug (missing fsync) (#10460) Summary: TL;DR: due to a recent change, if you drop a column family, often that DB will no longer fsync after writing new SST files to remaining or new column families, which could lead to data loss on power loss. More bug detail: The intent of https://github.com/facebook/rocksdb/issues/10049 was to Close FSDirectory objects at DB::Close time rather than waiting for DB object destruction. Unfortunately, it also closes shared FSDirectory objects on DropColumnFamily (& destroy remaining handles), which can lead to use-after-Close on FSDirectory shared with remaining column families. Those "uses" are only Fsyncs (or redundant Closes). In the default Posix filesystem, an Fsync on a closed FSDirectory is a quiet no-op. Consequently (under most configurations), if you drop a column family, that DB will no longer fsync after writing new SST files to column families sharing the same directory (true under most configurations). More fix detail: Basically, this removes unnecessary Close ops on destroying ColumnFamilyData. We let `shared_ptr` take care of calling the destructor at the right time. If the intent was to require Close be called before destroying FSDirectory, that was not made clear by the author of FileSystem and was not at all enforced by https://github.com/facebook/rocksdb/issues/10049, which could have added `assert(fd_ == -1)` to `~PosixDirectory()` but did not. To keep this fix simple, we relax the unit test for https://github.com/facebook/rocksdb/issues/10049 to allow timely destruction of FSDirectory to suffice as Close (in CountedFileSystem). Added a TODO to revisit that. Also in this PR: * Added a TODO to share FSDirectory instances between DB and its column families. (Already shared among column families.) * Made DB::Close attempt to close all its open FSDirectory objects even if there is a failure in closing one. Also code clean-up around this logic. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10460 Test Plan: add an assert to check for use-after-Close. With that existing tests can detect the misuse. With fix, tests pass (except noted relaxing of unit test for https://github.com/facebook/rocksdb/issues/10049) Reviewed By: ajkr Differential Revision: D38357922 Pulled By: pdillinger fbshipit-source-id: d42079cadbedf0a969f03389bf586b3b4e1f9137 --- HISTORY.md | 3 ++- db/column_family.cc | 13 ---------- db/db_basic_test.cc | 6 ++--- db/db_impl/db_impl.h | 51 ++++++++++++-------------------------- db/db_impl/db_impl_open.cc | 1 + env/io_posix.cc | 1 + utilities/counted_fs.cc | 10 ++++++++ 7 files changed, 33 insertions(+), 52 deletions(-) diff --git a/HISTORY.md b/HISTORY.md index c37909e75..1ee8d9baf 100644 --- a/HISTORY.md +++ b/HISTORY.md @@ -13,6 +13,7 @@ * `CompactRangeOptions::exclusive_manual_compaction` is now false by default. This ensures RocksDB does not introduce artificial parallelism limitations by default. ### Bug Fixes +* Fix a bug starting in 7.4.0 in which some fsync operations might be skipped in a DB after any DropColumnFamily on that DB, until it is re-opened. This can lead to data loss on power loss. (For custom FileSystem implementations, this could lead to `FSDirectory::Fsync` or `FSDirectory::Close` after the first `FSDirectory::Close`; Also, valgrind could report call to `close()` with `fd=-1`.) * Fix a bug where `GenericRateLimiter` could revert the bandwidth set dynamically using `SetBytesPerSecond()` when a user configures a structure enclosing it, e.g., using `GetOptionsFromString()` to configure an `Options` that references an existing `RateLimiter` object. * Fix race conditions in `GenericRateLimiter`. * Fix a bug in `FIFOCompactionPicker::PickTTLCompaction` where total_size calculating might cause underflow @@ -32,7 +33,7 @@ * Provide support for ReadOptions.async_io with direct_io to improve Seek latency by using async IO to parallelize child iterator seek and doing asynchronous prefetching on sequential scans. * Added support for blob caching in order to cache frequently used blobs for BlobDB. * User can configure the new ColumnFamilyOptions `blob_cache` to enable/disable blob caching. - * Either sharing the backend cache with the block cache or using a completely separate cache is supported. + * Either sharing the backend cache with the block cache or using a completely separate cache is supported. * A new abstraction interface called `BlobSource` for blob read logic gives all users access to blobs, whether they are in the blob cache, secondary cache, or (remote) storage. Blobs can be potentially read both while handling user reads (`Get`, `MultiGet`, or iterator) and during compaction (while dealing with compaction filters, Merges, or garbage collection) but eventually all blob reads go through `Version::GetBlob` or, for MultiGet, `Version::MultiGetBlob` (and then get dispatched to the interface -- `BlobSource`). * Add experimental tiered compaction feature `AdvancedColumnFamilyOptions::preclude_last_level_data_seconds`, which makes sure the new data inserted within preclude_last_level_data_seconds won't be placed on cold tier (the feature is not complete). diff --git a/db/column_family.cc b/db/column_family.cc index fe416db9b..142f54a23 100644 --- a/db/column_family.cc +++ b/db/column_family.cc @@ -699,19 +699,6 @@ ColumnFamilyData::~ColumnFamilyData() { id_, name_.c_str()); } } - - if (data_dirs_.size()) { // Explicitly close data directories - Status s = Status::OK(); - for (auto& data_dir_ptr : data_dirs_) { - if (data_dir_ptr) { - s = data_dir_ptr->Close(IOOptions(), nullptr); - if (!s.ok()) { - // TODO(zichen): add `Status Close()` and `CloseDirectories() - s.PermitUncheckedError(); - } - } - } - } } bool ColumnFamilyData::UnrefAndTryDelete() { diff --git a/db/db_basic_test.cc b/db/db_basic_test.cc index 6b5f61c96..bfdf9880b 100644 --- a/db/db_basic_test.cc +++ b/db/db_basic_test.cc @@ -1209,9 +1209,9 @@ TEST_F(DBBasicTest, DBCloseAllDirectoryFDs) { s = db->Close(); auto* counted_fs = options.env->GetFileSystem()->CheckedCast(); - assert(counted_fs); - ASSERT_TRUE(counted_fs->counters()->dir_opens == - counted_fs->counters()->dir_closes); + ASSERT_TRUE(counted_fs != nullptr); + ASSERT_EQ(counted_fs->counters()->dir_opens, + counted_fs->counters()->dir_closes); ASSERT_OK(s); delete db; } diff --git a/db/db_impl/db_impl.h b/db/db_impl/db_impl.h index 51dad1a63..2a9a6b9c2 100644 --- a/db/db_impl/db_impl.h +++ b/db/db_impl/db_impl.h @@ -118,7 +118,6 @@ class Directories { IOStatus Close(const IOOptions& options, IODebugContext* dbg) { // close all directories for all database paths IOStatus s = IOStatus::OK(); - IOStatus temp_s = IOStatus::OK(); // The default implementation for Close() in Directory/FSDirectory class // "NotSupported" status, the upper level interface should be able to @@ -127,53 +126,35 @@ class Directories { // `FSDirectory::Close()` yet if (db_dir_) { - temp_s = db_dir_->Close(options, dbg); - if (!temp_s.ok()) { - if (temp_s.IsNotSupported()) { - temp_s.PermitUncheckedError(); - } else { - s = temp_s; - } + IOStatus temp_s = db_dir_->Close(options, dbg); + if (!temp_s.ok() && !temp_s.IsNotSupported() && s.ok()) { + s = std::move(temp_s); } } - if (!s.ok()) { - return s; - } + // Attempt to close everything even if one fails + s.PermitUncheckedError(); if (wal_dir_) { - s = wal_dir_->Close(options, dbg); - if (!temp_s.ok()) { - if (temp_s.IsNotSupported()) { - temp_s.PermitUncheckedError(); - } else { - s = temp_s; - } + IOStatus temp_s = wal_dir_->Close(options, dbg); + if (!temp_s.ok() && !temp_s.IsNotSupported() && s.ok()) { + s = std::move(temp_s); } } - if (!s.ok()) { - return s; - } + s.PermitUncheckedError(); - if (data_dirs_.size() > 0 && s.ok()) { - for (auto& data_dir_ptr : data_dirs_) { - if (data_dir_ptr) { - temp_s = data_dir_ptr->Close(options, dbg); - if (!temp_s.ok()) { - if (temp_s.IsNotSupported()) { - temp_s.PermitUncheckedError(); - } else { - return temp_s; - } - } + for (auto& data_dir_ptr : data_dirs_) { + if (data_dir_ptr) { + IOStatus temp_s = data_dir_ptr->Close(options, dbg); + if (!temp_s.ok() && !temp_s.IsNotSupported() && s.ok()) { + s = std::move(temp_s); } } } - // Mark temp_s as checked when temp_s is still the initial status - // (IOStatus::OK(), not checked yet) - temp_s.PermitUncheckedError(); + // Ready for caller + s.MustCheck(); return s; } diff --git a/db/db_impl/db_impl_open.cc b/db/db_impl/db_impl_open.cc index d5dd77c3c..fa32f4fd4 100644 --- a/db/db_impl/db_impl_open.cc +++ b/db/db_impl/db_impl_open.cc @@ -541,6 +541,7 @@ Status DBImpl::Recover( s = CheckConsistency(); } if (s.ok() && !read_only) { + // TODO: share file descriptors (FSDirectory) with SetDirectories above std::map> created_dirs; for (auto cfd : *versions_->GetColumnFamilySet()) { s = cfd->AddDirectories(&created_dirs); diff --git a/env/io_posix.cc b/env/io_posix.cc index d29aa3a4c..7a23f9e54 100644 --- a/env/io_posix.cc +++ b/env/io_posix.cc @@ -1678,6 +1678,7 @@ IOStatus PosixDirectory::Close(const IOOptions& /*opts*/, IOStatus PosixDirectory::FsyncWithDirOptions( const IOOptions& /*opts*/, IODebugContext* /*dbg*/, const DirFsyncOptions& dir_fsync_options) { + assert(fd_ >= 0); // Check use after close IOStatus s = IOStatus::OK(); #ifndef OS_AIX if (is_btrfs_) { diff --git a/utilities/counted_fs.cc b/utilities/counted_fs.cc index 6917dc06e..e43f3a191 100644 --- a/utilities/counted_fs.cc +++ b/utilities/counted_fs.cc @@ -211,6 +211,7 @@ class CountedRandomRWFile : public FSRandomRWFileOwnerWrapper { class CountedDirectory : public FSDirectoryWrapper { private: mutable CountedFileSystem* fs_; + bool closed_ = false; public: CountedDirectory(std::unique_ptr&& f, CountedFileSystem* fs) @@ -229,6 +230,7 @@ class CountedDirectory : public FSDirectoryWrapper { if (rv.ok()) { fs_->counters()->closes++; fs_->counters()->dir_closes++; + closed_ = true; } return rv; } @@ -242,6 +244,14 @@ class CountedDirectory : public FSDirectoryWrapper { } return rv; } + + ~CountedDirectory() { + if (!closed_) { + // TODO: fix DB+CF code to use explicit Close, not rely on destructor + fs_->counters()->closes++; + fs_->counters()->dir_closes++; + } + } }; } // anonymous namespace