From 4d82b9489681f0cc2638700b834d1b1c59e27be2 Mon Sep 17 00:00:00 2001 From: Yanqin Jin Date: Wed, 5 Oct 2022 12:24:39 -0700 Subject: [PATCH] Sanitize min_write_buffer_number_to_merge to 1 with atomic_flush (#10773) Summary: With current implementation, within the same RocksDB instance, all column families with non-empty memtables will be scheduled for flush if RocksDB determines that any column family needs to be flushed, e.g. memtable full, write buffer manager, etc., if atomic flush is enabled. Not doing so can lead to data loss and inconsistency when WAL is disabled, which is a common setting when atomic flush is enabled. Therefore, setting a per-column-family knob, min_write_buffer_number_to_merge to a value greater than 1 is not compatible with atomic flush, and should be sanitized during column family creation and db open. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10773 Test Plan: Reproduce: D39993203 has detailed steps. Run the test with and without the fix. Reviewed By: cbi42 Differential Revision: D40077955 Pulled By: cbi42 fbshipit-source-id: 451a9179eb531ac42eaccf40b451b9dec4085240 --- HISTORY.md | 3 +++ db/column_family.cc | 15 +++++++++++++++ include/rocksdb/advanced_options.h | 5 ++++- 3 files changed, 22 insertions(+), 1 deletion(-) diff --git a/HISTORY.md b/HISTORY.md index 388c101ac..4a639a60a 100644 --- a/HISTORY.md +++ b/HISTORY.md @@ -19,6 +19,9 @@ ### New Features * Add a new option IOOptions.do_not_recurse that can be used by underlying file systems to skip recursing through sub directories and list only files in GetChildren API. +### Behavior Changes +* Sanitize min_write_buffer_number_to_merge to 1 if atomic flush is enabled to prevent unexpected data loss when WAL is disabled in a multi-column-family setting (#10773). + ## 7.7.0 (09/18/2022) ### Bug Fixes * Fixed a hang when an operation such as `GetLiveFiles` or `CreateNewBackup` is asked to trigger and wait for memtable flush on a read-only DB. Such indirect requests for memtable flush are now ignored on a read-only DB. diff --git a/db/column_family.cc b/db/column_family.cc index 85b98a859..eb2702691 100644 --- a/db/column_family.cc +++ b/db/column_family.cc @@ -233,6 +233,21 @@ ColumnFamilyOptions SanitizeOptions(const ImmutableDBOptions& db_options, result.min_write_buffer_number_to_merge = 1; } + if (db_options.atomic_flush && result.min_write_buffer_number_to_merge > 1) { + ROCKS_LOG_WARN( + db_options.logger, + "Currently, if atomic_flush is true, then triggering flush for any " + "column family internally (non-manual flush) will trigger flushing " + "all column families even if the number of memtables is smaller " + "min_write_buffer_number_to_merge. Therefore, configuring " + "min_write_buffer_number_to_merge > 1 is not compatible and should " + "be satinized to 1. Not doing so will lead to data loss and " + "inconsistent state across multiple column families when WAL is " + "disabled, which is a common setting for atomic flush"); + + result.min_write_buffer_number_to_merge = 1; + } + if (result.num_levels < 1) { result.num_levels = 1; } diff --git a/include/rocksdb/advanced_options.h b/include/rocksdb/advanced_options.h index b552d66ae..144b5c61e 100644 --- a/include/rocksdb/advanced_options.h +++ b/include/rocksdb/advanced_options.h @@ -271,7 +271,10 @@ struct AdvancedColumnFamilyOptions { // read amplification because a get request has to check in all of these // files. Also, an in-memory merge may result in writing lesser // data to storage if there are duplicate records in each of these - // individual write buffers. Default: 1 + // individual write buffers. + // If atomic flush is enabled (options.atomic_flush == true), then this + // option will be sanitized to 1. + // Default: 1 int min_write_buffer_number_to_merge = 1; // DEPRECATED