From beb86addeb363a3cd2406cbd855c7717aa51760b Mon Sep 17 00:00:00 2001 From: anand76 Date: Tue, 25 Jan 2022 22:47:34 -0800 Subject: [PATCH] Fix race condition in SstFileManagerImpl error recovery code (#9435) Summary: There is a race in SstFileManagerImpl between the ClearError() function and CancelErrorRecovery(). The race can cause ClearError() to deref the file system pointer after it has been freed. This is likely to occur during process shutdown, when the order of destruction of the DB/Env/FileSystem and SstFileManagerImpl is not deterministic. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9435 Test Plan: Reproduce the crash in a TSAN build by introducing sleeps in the code, and verify with the fix. Reviewed By: siying Differential Revision: D33774696 Pulled By: anand1976 fbshipit-source-id: 643d3da31b8d2ee6d9b6db5d33327e0053ce3b83 --- HISTORY.md | 1 + file/sst_file_manager_impl.cc | 5 +++-- 2 files changed, 4 insertions(+), 2 deletions(-) diff --git a/HISTORY.md b/HISTORY.md index 8e64fb3ee..b08c4005b 100644 --- a/HISTORY.md +++ b/HISTORY.md @@ -31,6 +31,7 @@ Note: The next release will be major release 7.0. See https://github.com/faceboo * Fix a bug that FlushMemTable may return ok even flush not succeed. * Fixed a bug of Sync() and Fsync() not using `fcntl(F_FULLFSYNC)` on OS X and iOS. * Fixed a significant performance regression in version 6.26 when a prefix extractor is used on the read path (Seek, Get, MultiGet). (Excessive time was spent in SliceTransform::AsString().) +* Fixed a race condition in SstFileManagerImpl error recovery code that can cause a crash during process shutdown. ### New Features * Added RocksJava support for MacOS universal binary (ARM+x86) diff --git a/file/sst_file_manager_impl.cc b/file/sst_file_manager_impl.cc index f7ff03cc3..c4c411488 100644 --- a/file/sst_file_manager_impl.cc +++ b/file/sst_file_manager_impl.cc @@ -256,7 +256,7 @@ void SstFileManagerImpl::ClearError() { while (true) { MutexLock l(&mu_); - if (closing_) { + if (error_handler_list_.empty() || closing_) { return; } @@ -297,7 +297,8 @@ void SstFileManagerImpl::ClearError() { // Someone could have called CancelErrorRecovery() and the list could have // become empty, so check again here - if (s.ok() && !error_handler_list_.empty()) { + if (s.ok()) { + assert(!error_handler_list_.empty()); auto error_handler = error_handler_list_.front(); // Since we will release the mutex, set cur_instance_ to signal to the // shutdown thread, if it calls // CancelErrorRecovery() the meantime,