Fix race condition in SstFileManagerImpl error recovery code (#9435)

Summary:
There is a race in SstFileManagerImpl between the ClearError() function
and CancelErrorRecovery(). The race can cause ClearError() to deref the
file system pointer after it has been freed. This is likely to occur
during process shutdown, when the order of destruction of the
DB/Env/FileSystem and SstFileManagerImpl is not deterministic.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9435

Test Plan:
Reproduce the crash in a TSAN build by introducing sleeps in the code, and verify with
the fix.

Reviewed By: siying

Differential Revision: D33774696

Pulled By: anand1976

fbshipit-source-id: 643d3da31b8d2ee6d9b6db5d33327e0053ce3b83
main
anand76 2 years ago committed by Facebook GitHub Bot
parent 8822562d75
commit beb86addeb
  1. 1
      HISTORY.md
  2. 5
      file/sst_file_manager_impl.cc

@ -31,6 +31,7 @@ Note: The next release will be major release 7.0. See https://github.com/faceboo
* Fix a bug that FlushMemTable may return ok even flush not succeed.
* Fixed a bug of Sync() and Fsync() not using `fcntl(F_FULLFSYNC)` on OS X and iOS.
* Fixed a significant performance regression in version 6.26 when a prefix extractor is used on the read path (Seek, Get, MultiGet). (Excessive time was spent in SliceTransform::AsString().)
* Fixed a race condition in SstFileManagerImpl error recovery code that can cause a crash during process shutdown.
### New Features
* Added RocksJava support for MacOS universal binary (ARM+x86)

@ -256,7 +256,7 @@ void SstFileManagerImpl::ClearError() {
while (true) {
MutexLock l(&mu_);
if (closing_) {
if (error_handler_list_.empty() || closing_) {
return;
}
@ -297,7 +297,8 @@ void SstFileManagerImpl::ClearError() {
// Someone could have called CancelErrorRecovery() and the list could have
// become empty, so check again here
if (s.ok() && !error_handler_list_.empty()) {
if (s.ok()) {
assert(!error_handler_list_.empty());
auto error_handler = error_handler_list_.front();
// Since we will release the mutex, set cur_instance_ to signal to the
// shutdown thread, if it calls // CancelErrorRecovery() the meantime,

Loading…
Cancel
Save