rocksdb

fork of https://github.com/oxigraph/rocksdb and https://github.com/facebook/rocksdb for nextgraph and oxigraph

Yanqin Jin e0c84aa0dc Fix a race condition in WAL tracking causing DB open failure (#9715)	4 years ago

Summary:
There is a race condition if WAL tracking in the MANIFEST is enabled in a database that disables 2PC.

The race condition is between two background flush threads trying to install flush results to the MANIFEST.

Consider an example database with two column families: "default" (cfd0) and "cf1" (cfd1). Initially, both column families have one mutable (active) memtable whose data is backed by 6.log.

1. Trigger a manual flush for "cf1", creating a 7.log
2. Insert another key to "default", and trigger flush for "default", creating 8.log
3. BgFlushThread1 finishes writing 9.sst
4. BgFlushThread2 finishes writing 10.sst

```
Time  BgFlushThread1                          BgFlushThread2
 |    mutex_.Lock()
 |    precompute min_wal_to_keep as 6
 |    mutex_.Unlock()
 |                                            mutex_.Lock()
 |                                            precompute min_wal_to_keep as 6
 |                                            join MANIFEST write queue and mutex_.Unlock()
 |    write to MANIFEST
 |    mutex_.Lock()
 |    cfd1->log_number = 7
 |    Signal bg_flush_2 and mutex_.Unlock()
 |                                            wake up and mutex_.Lock()
 |                                            cfd0->log_number = 8
 |                                            FindObsoleteFiles() with job_context->log_number == 7
 |                                            mutex_.Unlock()
 |                                            PurgeObsoleteFiles() deletes 6.log
 V
```

As shown above, BgFlushThread2 thinks that the min WAL to keep is 6.log because "cf1" has unflushed data in 6.log (cf1.log_number=6). Similarly, BgFlushThread1 thinks that the min WAL to keep is also 6.log because "default" has unflushed data (default.log_number=6). No WAL deletion will be written to MANIFEST because 6 is equal to `versions_->wals_.min_wal_number_to_keep`, due to https://github.com/facebook/rocksdb/blob/7.1.fb/db/memtable_list.cc#L513:L514.

The bg flush thread that finishes last will perform file purging. `job_context.log_number` will be evaluated as 7, i.e. the min WAL that contains unflushed data, causing 6.log to be deleted. However, MANIFEST thinks 6.log should still exist. If you close the db at this point, you won't be able to re-open it if `track_and_verify_wal_in_manifest` is true.

We must handle the case of multiple bg flush threads, and it is difficult for one bg flush thread to know the correct min WAL number until the other bg flush threads have finished committing to the MANIFEST and updated the `cfd::log_number`.

To fix this issue, we rename an existing variable `min_log_number_to_keep_2pc` to `min_log_number_to_keep`, and use it to track WAL file deletion in non-2pc mode as well. This variable is updated only 1) during recovery with mutex held, or 2) in the MANIFEST write thread.

`min_log_number_to_keep` means RocksDB will delete WALs below it, although there may be WALs above it which are also obsolete. Formally, we will have [min_wal_to_keep, max_obsolete_wal]. During recovery, we make sure that only WALs above max_obsolete_wal are checked and added back to `alive_log_files_`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9715

Test Plan:
```
make check
```
Also ran stress test below (with asan) to make sure it completes successfully.
```
TEST_TMPDIR=/dev/shm/rocksdb OPT=-g ASAN_OPTIONS=disable_coredump=0 \
CRASH_TEST_EXT_ARGS=--compression_type=zstd SKIP_FORMAT_BUCK_CHECKS=1 \
make J=52 -j52 blackbox_asan_crash_test
```

Reviewed By: ltamasi
Differential Revision: D34984412
Pulled By: riversand963
fbshipit-source-id: c7b21a8d84751bb55ea79c9f387103d21b231005
compacted_db_impl.cc	Fix a timer crash caused by invalid memory management (#9656)	4 years ago
compacted_db_impl.h	Move compacted_db_impl.[c|h] to db/db_impl (#8082)	5 years ago
db_impl.cc	fix a bug, c api, if enable inplace_update_support, and use create sn… (#9471)	4 years ago
db_impl.h	Fix a race condition when disable and enable manual compaction (#9694)	4 years ago
db_impl_compaction_flush.cc	Expand auto recovery to background read errors (#9679)	4 years ago
db_impl_debug.cc	Add OpenAndTrimHistory API to support trimming data with specified timestamp (#9410)	4 years ago
db_impl_experimental.cc	Get DBTest passing Assert Status Checked (#7737)	4 years ago
db_impl_files.cc	Fix a race condition in WAL tracking causing DB open failure (#9715)	4 years ago
db_impl_open.cc	Fix a race condition in WAL tracking causing DB open failure (#9715)	4 years ago
db_impl_readonly.cc	Fix PinSelf() read-after-free in DB::GetMergeOperands() (#9507)	4 years ago
db_impl_readonly.h	RocksJava - Add errorIfLogFileExists parameter to RocksDB.openReadOnly (#7046)	5 years ago
db_impl_secondary.cc	Fix PinSelf() read-after-free in DB::GetMergeOperands() (#9507)	4 years ago
db_impl_secondary.h	Add commit marker with timestamp (#9266)	4 years ago
db_impl_write.cc	Return invalid argument if batch is null (#9744)	4 years ago