// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
// This source code is licensed under both the GPLv2 (found in the
// COPYING file in the root directory) and Apache 2.0 License
// (found in the LICENSE.Apache file in the root directory).
//
// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file. See the AUTHORS file for names of contributors.
#include <cinttypes>

#include "db/builder.h"
#include "db/db_impl/db_impl.h"
#include "db/error_handler.h"
#include "db/periodic_task_scheduler.h"
#include "env/composite_env_wrapper.h"
#include "file/filename.h"
#include "file/read_write_util.h"
#include "file/sst_file_manager_impl.h"
#include "file/writable_file_writer.h"
#include "logging/logging.h"
#include "monitoring/persistent_stats_history.h"
#include "monitoring/thread_status_util.h"
#include "options/options_helper.h"
#include "rocksdb/table.h"
#include "rocksdb/wal_filter.h"
#include "test_util/sync_point.h"
#include "util/rate_limiter_impl.h"
#include "util/udt_util.h"

namespace ROCKSDB_NAMESPACE {
Options SanitizeOptions(const std::string& dbname, const Options& src,
                        bool read_only, Status* logger_creation_s) {
  auto db_options =
      SanitizeOptions(dbname, DBOptions(src), read_only, logger_creation_s);
  ImmutableDBOptions immutable_db_options(db_options);
  auto cf_options =
      SanitizeOptions(immutable_db_options, ColumnFamilyOptions(src));
  return Options(db_options, cf_options);
}

DBOptions SanitizeOptions(const std::string& dbname, const DBOptions& src,
                          bool read_only, Status* logger_creation_s) {
  DBOptions result(src);

  if (result.env == nullptr) {
    result.env = Env::Default();
  }

  // result.max_open_files == -1 means an "infinite" number of open files.
  if (result.max_open_files != -1) {
    int max_max_open_files = port::GetMaxOpenFiles();
    if (max_max_open_files == -1) {
      max_max_open_files = 0x400000;
    }
    ClipToRange(&result.max_open_files, 20, max_max_open_files);
    TEST_SYNC_POINT_CALLBACK("SanitizeOptions::AfterChangeMaxOpenFiles",
                             &result.max_open_files);
  }

  if (result.info_log == nullptr && !read_only) {
    Status s = CreateLoggerFromOptions(dbname, result, &result.info_log);
    if (!s.ok()) {
      // No place suitable for logging
      result.info_log = nullptr;
      if (logger_creation_s) {
        *logger_creation_s = s;
      }
    }
  }

  if (!result.write_buffer_manager) {
    result.write_buffer_manager.reset(
        new WriteBufferManager(result.db_write_buffer_size));
  }
  auto bg_job_limits = DBImpl::GetBGJobLimits(
      result.max_background_flushes, result.max_background_compactions,
      result.max_background_jobs, true /* parallelize_compactions */);
  result.env->IncBackgroundThreadsIfNeeded(bg_job_limits.max_compactions,
                                           Env::Priority::LOW);
  result.env->IncBackgroundThreadsIfNeeded(bg_job_limits.max_flushes,
                                           Env::Priority::HIGH);

  if (result.rate_limiter.get() != nullptr) {
    if (result.bytes_per_sync == 0) {
      result.bytes_per_sync = 1024 * 1024;
    }
  }

  if (result.delayed_write_rate == 0) {
    if (result.rate_limiter.get() != nullptr) {
      result.delayed_write_rate = result.rate_limiter->GetBytesPerSecond();
    }
    if (result.delayed_write_rate == 0) {
      result.delayed_write_rate = 16 * 1024 * 1024;
    }
  }

  if (result.WAL_ttl_seconds > 0 || result.WAL_size_limit_MB > 0) {
    // recycle_log_file_num is a count, so disable recycling with 0.
    result.recycle_log_file_num = 0;
  }

  if (result.recycle_log_file_num &&
      (result.wal_recovery_mode ==
           WALRecoveryMode::kTolerateCorruptedTailRecords ||
       result.wal_recovery_mode == WALRecoveryMode::kPointInTimeRecovery ||
       result.wal_recovery_mode == WALRecoveryMode::kAbsoluteConsistency)) {
    // - kTolerateCorruptedTailRecords is inconsistent with recycle log file
    //   feature. WAL recycling expects recovery success upon encountering a
    //   corrupt record at the point where new data ends and recycled data
    //   remains at the tail. However, `kTolerateCorruptedTailRecords` must fail
    //   upon encountering any such corrupt record, as it cannot differentiate
    //   between this and a real corruption, which would cause committed updates
    //   to be truncated -- a violation of the recovery guarantee.
    // - kPointInTimeRecovery and kAbsoluteConsistency are incompatible with
    //   recycle log file feature temporarily due to a bug found introducing a
    //   hole in the recovered data
    //   (https://github.com/facebook/rocksdb/pull/7252#issuecomment-673766236).
    //   Besides this bug, we believe the features are fundamentally compatible.
    result.recycle_log_file_num = 0;
  }

  if (result.db_paths.size() == 0) {
    result.db_paths.emplace_back(dbname, std::numeric_limits<uint64_t>::max());
  } else if (result.wal_dir.empty()) {
    // Use dbname as default
    result.wal_dir = dbname;
  }
  if (!result.wal_dir.empty()) {
    // If there is a wal_dir already set, check to see if the wal_dir is the
    // same as the dbname AND the same as the db_path[0] (which must exist from
    // a few lines ago). If the wal_dir matches both of these values, then clear
    // the wal_dir value, which will make wal_dir == dbname. Most likely this
    // condition was the result of reading an old options file where we forced
    // wal_dir to be set (to dbname).
    auto npath = NormalizePath(dbname + "/");
    if (npath == NormalizePath(result.wal_dir + "/") &&
        npath == NormalizePath(result.db_paths[0].path + "/")) {
      result.wal_dir.clear();
    }
  }

  if (!result.wal_dir.empty() && result.wal_dir.back() == '/') {
    result.wal_dir = result.wal_dir.substr(0, result.wal_dir.size() - 1);
  }

  if (result.use_direct_reads && result.compaction_readahead_size == 0) {
    TEST_SYNC_POINT_CALLBACK("SanitizeOptions:direct_io", nullptr);
    result.compaction_readahead_size = 1024 * 1024 * 2;
  }

  // Force flush on DB open if 2PC is enabled, since with 2PC we have no
  // guarantee that consecutive log files have consecutive sequence ids, which
  // makes recovery complicated.
  if (result.allow_2pc) {
    result.avoid_flush_during_recovery = false;
  }

  ImmutableDBOptions immutable_db_options(result);
  if (!immutable_db_options.IsWalDirSameAsDBPath()) {
    // Either the WAL dir and db_paths[0]/db_name are not the same, or we
    // cannot tell for sure. In either case, assume they're different and
    // explicitly clean up the trash log files (bypassing DeleteScheduler).
    // Do this first so that even if we end up calling
    // DeleteScheduler::CleanupDirectory on the same dir later, it will be
    // safe.
    std::vector<std::string> filenames;
    IOOptions io_opts;
    io_opts.do_not_recurse = true;
    auto wal_dir = immutable_db_options.GetWalDir();
    Status s = immutable_db_options.fs->GetChildren(
        wal_dir, io_opts, &filenames, /*IODebugContext*=*/nullptr);
    s.PermitUncheckedError();  //**TODO: What to do on error?
    for (std::string& filename : filenames) {
      if (filename.find(".log.trash", filename.length() -
                                          std::string(".log.trash").length()) !=
          std::string::npos) {
        std::string trash_file = wal_dir + "/" + filename;
        result.env->DeleteFile(trash_file).PermitUncheckedError();
      }
    }
  }
  // When the DB is stopped, some .trash files may not have been deleted yet.
  // When we open the DB, we find these .trash files and schedule them to be
  // deleted (or delete them immediately if SstFileManager was not used).
  auto sfm = static_cast<SstFileManagerImpl*>(result.sst_file_manager.get());
  for (size_t i = 0; i < result.db_paths.size(); i++) {
    DeleteScheduler::CleanupDirectory(result.env, sfm, result.db_paths[i].path)
        .PermitUncheckedError();
  }

  // Create a default SstFileManager for purposes of tracking compaction size
  // and facilitating recovery from out of space errors.
  if (result.sst_file_manager.get() == nullptr) {
    std::shared_ptr<SstFileManager> sst_file_manager(
        NewSstFileManager(result.env, result.info_log));
    result.sst_file_manager = sst_file_manager;
  }

  // Supported wal compression types
  if (!StreamingCompressionTypeSupported(result.wal_compression)) {
    result.wal_compression = kNoCompression;
    ROCKS_LOG_WARN(result.info_log,
                   "wal_compression is disabled since only zstd is supported");
  }

  if (!result.paranoid_checks) {
    result.skip_checking_sst_file_sizes_on_db_open = true;
    ROCKS_LOG_INFO(result.info_log,
                   "file size check will be skipped during open.");
  }

  return result;
}

namespace {
Status ValidateOptionsByTable(
    const DBOptions& db_opts,
    const std::vector<ColumnFamilyDescriptor>& column_families) {
  Status s;
  for (auto& cf : column_families) {
    s = ValidateOptions(db_opts, cf.options);
    if (!s.ok()) {
      return s;
    }
  }
  return Status::OK();
}
}  // namespace

Status DBImpl::ValidateOptions(
    const DBOptions& db_options,
    const std::vector<ColumnFamilyDescriptor>& column_families) {
  Status s;
  for (auto& cfd : column_families) {
    s = ColumnFamilyData::ValidateOptions(db_options, cfd.options);
    if (!s.ok()) {
      return s;
    }
  }
  s = ValidateOptions(db_options);
  return s;
}

Status DBImpl::ValidateOptions(const DBOptions& db_options) {
  if (db_options.db_paths.size() > 4) {
    return Status::NotSupported(
        "More than four DB paths are not supported yet. ");
  }

  if (db_options.allow_mmap_reads && db_options.use_direct_reads) {
    // Protect against assert in PosixMMapReadableFile constructor
    return Status::NotSupported(
        "If memory mapped reads (allow_mmap_reads) are enabled "
        "then direct I/O reads (use_direct_reads) must be disabled. ");
  }

  if (db_options.allow_mmap_writes &&
      db_options.use_direct_io_for_flush_and_compaction) {
    return Status::NotSupported(
        "If memory mapped writes (allow_mmap_writes) are enabled "
        "then direct I/O writes (use_direct_io_for_flush_and_compaction) must "
        "be disabled. ");
  }

  if (db_options.keep_log_file_num == 0) {
    return Status::InvalidArgument("keep_log_file_num must be greater than 0");
  }

  if (db_options.unordered_write &&
      !db_options.allow_concurrent_memtable_write) {
    return Status::InvalidArgument(
        "unordered_write is incompatible with "
        "!allow_concurrent_memtable_write");
  }

  if (db_options.unordered_write && db_options.enable_pipelined_write) {
    return Status::InvalidArgument(
        "unordered_write is incompatible with enable_pipelined_write");
  }

  if (db_options.atomic_flush && db_options.enable_pipelined_write) {
    return Status::InvalidArgument(
        "atomic_flush is incompatible with enable_pipelined_write");
  }

  // TODO remove this restriction
  if (db_options.atomic_flush && db_options.best_efforts_recovery) {
    return Status::InvalidArgument(
        "atomic_flush is currently incompatible with best-efforts recovery");
  }

  if (db_options.use_direct_io_for_flush_and_compaction &&
      0 == db_options.writable_file_max_buffer_size) {
    return Status::InvalidArgument(
        "writes in direct IO require writable_file_max_buffer_size > 0");
  }

  return Status::OK();
}

Status DBImpl::NewDB(std::vector<std::string>* new_filenames) {
  VersionEdit new_db;
  Status s = SetIdentityFile(env_, dbname_);
  if (!s.ok()) {
    return s;
  }
  if (immutable_db_options_.write_dbid_to_manifest) {
    std::string temp_db_id;
    GetDbIdentityFromIdentityFile(&temp_db_id);
    new_db.SetDBId(temp_db_id);
  }
  new_db.SetLogNumber(0);
  new_db.SetNextFile(2);
  new_db.SetLastSequence(0);

  ROCKS_LOG_INFO(immutable_db_options_.info_log, "Creating manifest 1 \n");
  const std::string manifest = DescriptorFileName(dbname_, 1);
  {
    if (fs_->FileExists(manifest, IOOptions(), nullptr).ok()) {
      fs_->DeleteFile(manifest, IOOptions(), nullptr).PermitUncheckedError();
    }
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
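As a rough sketch of the translation described above, the following toy code maps an IO-level result to a generic status and marks retryable failures as soft errors. `ToyIOStatus`, `ToyStatus`, `ToySeverity`, and `ToStatus` are hypothetical stand-ins, not the RocksDB classes, and leaving the severity unset for non-retryable errors is an assumption made for the example.

```cpp
#include <cassert>
#include <string>

// Toy stand-ins for illustration only; the real RocksDB classes are richer.
enum class ToySeverity { kNoError, kSoftError };

struct ToyIOStatus {
  bool ok = true;
  bool retryable = false;
  std::string message;
};

struct ToyStatus {
  bool ok = true;
  ToySeverity severity = ToySeverity::kNoError;
  std::string message;
};

// Translate an IO-level result into a generic status, marking retryable
// failures as soft errors so a caller could attempt automatic recovery.
inline ToyStatus ToStatus(const ToyIOStatus& io_s) {
  ToyStatus s;
  s.ok = io_s.ok;
  s.message = io_s.message;
  if (!io_s.ok && io_s.retryable) {
    s.severity = ToySeverity::kSoftError;
  }
  return s;
}
```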
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
    std::unique_ptr<FSWritableFile> file;
    FileOptions file_options = fs_->OptimizeForManifestWrite(file_options_);
    s = NewWritableFile(fs_.get(), manifest, &file, file_options);
    if (!s.ok()) {
      return s;
    }
    FileTypeSet tmp_set = immutable_db_options_.checksum_handoff_file_types;
    file->SetPreallocationBlockSize(
        immutable_db_options_.manifest_preallocation_size);
    std::unique_ptr<WritableFileWriter> file_writer(new WritableFileWriter(
        std::move(file), manifest, file_options, immutable_db_options_.clock,
        io_tracer_, nullptr /* stats */, immutable_db_options_.listeners,
Using existing crc32c checksum in checksum handoff for Manifest and WAL (#8412)
Summary:
In PR https://github.com/facebook/rocksdb/issues/7523 , checksum handoff was introduced in RocksDB for WAL, Manifest, and SST files. When a user enables checksum handoff for a certain type of file, before the data is written to the lower-layer storage system, we calculate the checksum (crc32c) of each piece of data and pass the checksum down with the data, so that data verification can be done by the lower-layer storage system if it has the capability. However, this cannot cover the whole lifetime of the data in memory, and it potentially introduces extra checksum calculation overhead.
In this PR, we introduce a new interface in WritableFileWriter::Append, which allows the caller to pass the data and the checksum (crc32c) together. In this way, WritableFileWriter can directly use the passed-in checksum (crc32c) to generate the checksum of the data being passed down to the storage system. This saves the calculation overhead and achieves higher protection coverage. When new data is added together with its checksum, we use Crc32cCombine https://github.com/facebook/rocksdb/issues/8305 to combine the existing checksum and the new checksum. To avoid segmenting of the data by the rate limiter before it is stored, the rate limiter is called enough times to accumulate enough credits for a given write. At the current stage, this design only supports Manifest and WAL, which use log_writer.
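To make the idea of combining checksums concrete, here is a minimal sketch that derives the CRC32C of a concatenation from the CRC32C values of its two parts. It is not RocksDB's `Crc32cCombine`: the function names `Crc32c` and `CombineCrc32c` are chosen for this example, and the combine step uses a simple loop that is linear in the length of the second part, which is an assumption made for readability rather than speed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Reflected Castagnoli polynomial used by CRC32C.
constexpr std::uint32_t kCrc32cPoly = 0x82F63B78u;

// Bitwise CRC32C (initial value and final XOR are both 0xFFFFFFFF).
// Slow, but short enough to audit by eye.
inline std::uint32_t Crc32c(const std::string& data) {
  std::uint32_t crc = 0xFFFFFFFFu;
  for (unsigned char byte : data) {
    crc ^= byte;
    for (int bit = 0; bit < 8; ++bit) {
      crc = (crc >> 1) ^ ((crc & 1u) ? kCrc32cPoly : 0u);
    }
  }
  return crc ^ 0xFFFFFFFFu;
}

// Combine crc(A) and crc(B) into crc(A + B) without reading A or B again.
// crc_a is advanced through len_b zero bytes and then XORed with crc_b.
inline std::uint32_t CombineCrc32c(std::uint32_t crc_a, std::uint32_t crc_b,
                                   std::size_t len_b) {
  std::uint32_t shifted = crc_a;
  for (std::size_t i = 0; i < len_b * 8; ++i) {
    shifted = (shifted >> 1) ^ ((shifted & 1u) ? kCrc32cPoly : 0u);
  }
  return shifted ^ crc_b;
}
```

With this property a writer that already holds the checksum of its buffered data can fold in the checksum supplied by the caller for newly appended bytes, instead of recomputing over the whole buffer.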
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8412
Test Plan: make check, add new testing cases.
Reviewed By: anand1976
Differential Revision: D29151545
Pulled By: zhichao-cao
fbshipit-source-id: 75e2278c5126cfd58393c67b1efd18dcc7a30772
4 years ago
        nullptr, tmp_set.Contains(FileType::kDescriptorFile),
        tmp_set.Contains(FileType::kDescriptorFile)));
    log::Writer log(std::move(file_writer), 0, false);
    std::string record;
    new_db.EncodeTo(&record);
    s = log.AddRecord(record);
    if (s.ok()) {
      s = SyncManifest(&immutable_db_options_, log.file());
    }
  }
  if (s.ok()) {
    // Make "CURRENT" file that points to the new manifest file.
Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from a call. Specifically, for IO-related functions, the current Status cannot reflect IO error details such as the error scope, the retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of the write path and converted to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and compaction). The second job is to identify a retryable IO error as a HardError and set bg_error_ to HardError. In this case, the DB instance becomes read-only. The user is informed of the Status and needs to take action to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
5 years ago
    s = SetCurrentFile(fs_.get(), dbname_, 1, directories_.GetDbDir());
    if (new_filenames) {
      new_filenames->emplace_back(
          manifest.substr(manifest.find_last_of("/\\") + 1));
    }
  } else {
    fs_->DeleteFile(manifest, IOOptions(), nullptr).PermitUncheckedError();
  }
  return s;
}

IOStatus DBImpl::CreateAndNewDirectory(
    FileSystem* fs, const std::string& dirname,
    std::unique_ptr<FSDirectory>* directory) {
  // We call CreateDirIfMissing() as the directory may already exist (if we
  // are reopening a DB), when this happens we don't want creating the
  // directory to cause an error. However, we need to check if creating the
  // directory fails or else we may get an obscure message about the lock
  // file not existing. One real-world example of this occurring is if
  // env->CreateDirIfMissing() doesn't create intermediate directories, e.g.
  // when dbname_ is "dir/db" but when "dir" doesn't exist.
  IOStatus io_s = fs->CreateDirIfMissing(dirname, IOOptions(), nullptr);
  if (!io_s.ok()) {
    return io_s;
  }
  return fs->NewDirectory(dirname, IOOptions(), directory, nullptr);
}

IOStatus Directories::SetDirectories(FileSystem* fs, const std::string& dbname,
                                     const std::string& wal_dir,
                                     const std::vector<DbPath>& data_paths) {
  IOStatus io_s = DBImpl::CreateAndNewDirectory(fs, dbname, &db_dir_);
  if (!io_s.ok()) {
    return io_s;
  }
  if (!wal_dir.empty() && dbname != wal_dir) {
    io_s = DBImpl::CreateAndNewDirectory(fs, wal_dir, &wal_dir_);
    if (!io_s.ok()) {
      return io_s;
    }
  }

  data_dirs_.clear();
  for (auto& p : data_paths) {
    const std::string db_path = p.path;
    if (db_path == dbname) {
      data_dirs_.emplace_back(nullptr);
    } else {
      std::unique_ptr<FSDirectory> path_directory;
      io_s = DBImpl::CreateAndNewDirectory(fs, db_path, &path_directory);
      if (!io_s.ok()) {
        return io_s;
      }
      data_dirs_.emplace_back(path_directory.release());
    }
  }
  assert(data_dirs_.size() == data_paths.size());
  return IOStatus::OK();
}

Status DBImpl::Recover(
    const std::vector<ColumnFamilyDescriptor>& column_families, bool read_only,
    bool error_if_wal_file_exists, bool error_if_data_exists_in_wals,
Persist the new MANIFEST after successfully syncing the new WAL during recovery (#9922)
Summary:
In case of non-TransactionDB and avoid_flush_during_recovery = true, RocksDB won't
flush the data from WAL to L0 for all column families if possible. As a
result, not all column families can increase their log_numbers, and
min_log_number_to_keep won't change.
For transaction DB (.allow_2pc), even with the flush, there may be old WAL files that it must not delete because they can contain data of uncommitted transactions and min_log_number_to_keep won't change.
If we persist a new MANIFEST with
advanced log_numbers for some column families, then during a second
crash after persisting the MANIFEST, RocksDB will see some column
families' log_numbers larger than the corrupted wal, and the "column family inconsistency" error will be hit, causing recovery to fail.
As a solution, RocksDB will persist the new MANIFEST after successfully syncing the new WAL.
If a future recovery starts from the new MANIFEST, then it means the new WAL is successfully synced. Due to the sentinel empty write batch at the beginning, kPointInTimeRecovery of WAL is guaranteed to go after this point.
If a future recovery starts from the old MANIFEST, it means that writing the new MANIFEST failed. We won't have the "SST ahead of WAL" error.
Currently, RocksDB DB::Open() may create and write to two new MANIFEST files even before recovery succeeds. This PR buffers the edits in a structure and writes to a new MANIFEST after recovery is successful.
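The buffering approach can be sketched as follows. This is a simplified illustration under stated assumptions: `ToyRecoveryBuffer` is a hypothetical class, edits are modeled as plain strings, and the "MANIFEST" is just a vector, none of which reflects the actual RocksDB data structures.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for a structure that buffers MANIFEST
// edits during recovery. Edits are plain strings here.
class ToyRecoveryBuffer {
 public:
  void Add(const std::string& edit) { pending_.push_back(edit); }

  // Append the buffered edits to `manifest` only if recovery (including the
  // sync of the new WAL) succeeded. Returns the number of edits written.
  std::size_t Commit(bool recovery_ok, std::vector<std::string>* manifest) {
    if (!recovery_ok) {
      pending_.clear();  // nothing reaches the MANIFEST on failure
      return 0;
    }
    const std::size_t written = pending_.size();
    manifest->insert(manifest->end(), pending_.begin(), pending_.end());
    pending_.clear();
    return written;
  }

 private:
  std::vector<std::string> pending_;
};
```

Because nothing is persisted until `Commit` is called with a successful result, a crash in the middle of recovery leaves the old MANIFEST as the only source of truth.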
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9922
Test Plan:
1. Update unit tests to fail without this change
2. make crash_test -j
Branch with unit test and no fix https://github.com/facebook/rocksdb/pull/9942 to keep track of unit test (without fix)
Reviewed By: riversand963
Differential Revision: D36043701
Pulled By: akankshamahajan15
fbshipit-source-id: 5760970db0a0920fb73d3c054a4155733500acd9
3 years ago
    uint64_t* recovered_seq, RecoveryContext* recovery_ctx) {
  mutex_.AssertHeld();

  bool is_new_db = false;
  assert(db_lock_ == nullptr);
  std::vector<std::string> files_in_dbname;
  if (!read_only) {
    Status s = directories_.SetDirectories(fs_.get(), dbname_,
                                           immutable_db_options_.wal_dir,
                                           immutable_db_options_.db_paths);
    if (!s.ok()) {
      return s;
    }

    s = env_->LockFile(LockFileName(dbname_), &db_lock_);
    if (!s.ok()) {
      return s;
    }

    std::string current_fname = CurrentFileName(dbname_);
    // Path to any MANIFEST file in the db dir. It does not matter which one.
    // Since best-efforts recovery ignores CURRENT file, existence of a
    // MANIFEST indicates the recovery to recover existing db. If no MANIFEST
    // can be found, a new db will be created.
    std::string manifest_path;
    if (!immutable_db_options_.best_efforts_recovery) {
      s = env_->FileExists(current_fname);
    } else {
      s = Status::NotFound();
      IOOptions io_opts;
      io_opts.do_not_recurse = true;
      Status io_s = immutable_db_options_.fs->GetChildren(
          dbname_, io_opts, &files_in_dbname, /*IODebugContext*=*/nullptr);
      if (!io_s.ok()) {
        s = io_s;
        files_in_dbname.clear();
      }
      for (const std::string& file : files_in_dbname) {
        uint64_t number = 0;
        FileType type = kWalFile;  // initialize
        if (ParseFileName(file, &number, &type) && type == kDescriptorFile) {
          uint64_t bytes;
          s = env_->GetFileSize(DescriptorFileName(dbname_, number), &bytes);
          if (s.ok() && bytes != 0) {
            // Found non-empty MANIFEST (descriptor log), thus best-efforts
            // recovery does not have to treat the db as empty.
            manifest_path = dbname_ + "/" + file;
            break;
          }
        }
      }
    }
    if (s.IsNotFound()) {
      if (immutable_db_options_.create_if_missing) {
        s = NewDB(&files_in_dbname);
        is_new_db = true;
        if (!s.ok()) {
          return s;
        }
      } else {
        return Status::InvalidArgument(
            current_fname, "does not exist (create_if_missing is false)");
      }
    } else if (s.ok()) {
      if (immutable_db_options_.error_if_exists) {
        return Status::InvalidArgument(dbname_,
                                       "exists (error_if_exists is true)");
      }
    } else {
      // Unexpected error reading file
      assert(s.IsIOError());
      return s;
    }
    // Verify compatibility of file_options_ and filesystem
    {
      std::unique_ptr<FSRandomAccessFile> idfile;
      FileOptions customized_fs(file_options_);
      customized_fs.use_direct_reads |=
          immutable_db_options_.use_direct_io_for_flush_and_compaction;
      const std::string& fname =
          manifest_path.empty() ? current_fname : manifest_path;
      s = fs_->NewRandomAccessFile(fname, customized_fs, &idfile, nullptr);
      if (!s.ok()) {
        std::string error_str = s.ToString();
        // Check if unsupported Direct I/O is the root cause
        customized_fs.use_direct_reads = false;
        s = fs_->NewRandomAccessFile(fname, customized_fs, &idfile, nullptr);
        if (s.ok()) {
          return Status::InvalidArgument(
              "Direct I/O is not supported by the specified DB.");
        } else {
          return Status::InvalidArgument(
              "Found options incompatible with filesystem", error_str.c_str());
        }
      }
    }
  } else if (immutable_db_options_.best_efforts_recovery) {
    assert(files_in_dbname.empty());
    IOOptions io_opts;
    io_opts.do_not_recurse = true;
    Status s = immutable_db_options_.fs->GetChildren(
        dbname_, io_opts, &files_in_dbname, /*IODebugContext*=*/nullptr);
    if (s.IsNotFound()) {
      return Status::InvalidArgument(dbname_,
                                     "does not exist (open for read only)");
    } else if (s.IsIOError()) {
      return s;
    }
    assert(s.ok());
  }
  assert(db_id_.empty());
  Status s;
  bool missing_table_file = false;
  if (!immutable_db_options_.best_efforts_recovery) {
    s = versions_->Recover(column_families, read_only, &db_id_);
  } else {
    assert(!files_in_dbname.empty());
    s = versions_->TryRecover(column_families, read_only, files_in_dbname,
                              &db_id_, &missing_table_file);
    if (s.ok()) {
      // TryRecover may delete previous column_family_set_.
      column_family_memtables_.reset(
          new ColumnFamilyMemTablesImpl(versions_->GetColumnFamilySet()));
    }
  }
  if (!s.ok()) {
    return s;
  }
  if (s.ok() && !read_only) {
    for (auto cfd : *versions_->GetColumnFamilySet()) {
      // Try to trivially move files down the LSM tree to start from bottommost
      // level when level_compaction_dynamic_level_bytes is enabled. This should
      // only be useful when user is migrating to turning on this option.
      // If a user is migrating from Level Compaction with a smaller level
      // multiplier or from Universal Compaction, there may be too many
      // non-empty levels and the trivial moves here are not sufficed for
      // migration. Additional compactions are needed to drain unnecessary
      // levels.
      //
      // Note that this step moves files down LSM without consulting
      // SSTPartitioner. Further compactions are still needed if
      // the user wants to partition SST files.
      // Note that files moved in this step may not respect the compression
      // option in target level.
      if (cfd->ioptions()->compaction_style ==
              CompactionStyle::kCompactionStyleLevel &&
          cfd->ioptions()->level_compaction_dynamic_level_bytes &&
          !cfd->GetLatestMutableCFOptions()->disable_auto_compactions) {
        int to_level = cfd->ioptions()->num_levels - 1;
        // last level is reserved
        // allow_ingest_behind does not support Level Compaction,
        // and per_key_placement can have infinite compaction loop for Level
        // Compaction. Adjust to_level here just to be safe.
        if (cfd->ioptions()->allow_ingest_behind ||
            cfd->ioptions()->preclude_last_level_data_seconds > 0) {
          to_level -= 1;
        }
        // Whether this column family has a level trivially moved
        bool moved = false;
        // Fill the LSM starting from to_level and going up one level at a time.
        // Some loop invariants (when last level is not reserved):
        // - levels in (from_level, to_level] are empty, and
        // - levels in (to_level, last_level] are non-empty.
        for (int from_level = to_level; from_level >= 0; --from_level) {
          const std::vector<FileMetaData*>& level_files =
              cfd->current()->storage_info()->LevelFiles(from_level);
          if (level_files.empty() || from_level == 0) {
            continue;
          }
          assert(from_level <= to_level);
          // Trivial move files from `from_level` to `to_level`
          if (from_level < to_level) {
            if (!moved) {
              // lsm_state will look like "[1,2,3,4,5,6,0]" for an LSM with
              // 7 levels
              std::string lsm_state = "[";
              for (int i = 0; i < cfd->ioptions()->num_levels; ++i) {
                lsm_state += std::to_string(
                    cfd->current()->storage_info()->NumLevelFiles(i));
                if (i < cfd->ioptions()->num_levels - 1) {
                  lsm_state += ",";
                }
              }
              lsm_state += "]";
              ROCKS_LOG_WARN(immutable_db_options_.info_log,
                             "[%s] Trivially move files down the LSM when open "
                             "with level_compaction_dynamic_level_bytes=true,"
                             " lsm_state: %s (Files are moved only if DB "
                             "Recovery is successful).",
                             cfd->GetName().c_str(), lsm_state.c_str());
              moved = true;
            }
            ROCKS_LOG_WARN(
                immutable_db_options_.info_log,
                "[%s] Moving %zu files from from_level-%d to from_level-%d",
                cfd->GetName().c_str(), level_files.size(), from_level,
                to_level);
            VersionEdit edit;
            edit.SetColumnFamily(cfd->GetID());
            for (const FileMetaData* f : level_files) {
              edit.DeleteFile(from_level, f->fd.GetNumber());
              edit.AddFile(to_level, f->fd.GetNumber(), f->fd.GetPathId(),
                           f->fd.GetFileSize(), f->smallest, f->largest,
                           f->fd.smallest_seqno, f->fd.largest_seqno,
                           f->marked_for_compaction,
                           f->temperature,  // this can be different from
                                            // `last_level_temperature`
                           f->oldest_blob_file_number, f->oldest_ancester_time,
                           f->file_creation_time, f->epoch_number,
                           f->file_checksum, f->file_checksum_func_name,
Record and use the tail size to prefetch table tail (#11406)
Summary:
**Context:**
We prefetch the tail part of an SST file (i.e., the blocks after the data blocks until the end of the file) during each SST file open, in the hope of prefetching everything needed for later reads at once, e.g., footer, meta index, filter/index, etc. The existing approach to estimating the tail size to prefetch is the `TailPrefetchStats` heuristics introduced in https://github.com/facebook/rocksdb/pull/4156, which has caused small reads in unlucky cases (e.g., a small read into the tail buffer during table open in thread 1 under the same BlockBasedTableFactory object can make thread 2's tail prefetching use a small size that it shouldn't) and is hard to debug. Therefore we decided to record the exact tail size and use it directly to prefetch the tail of the SST instead of relying on heuristics.
**Summary:**
- Obtain and record in manifest the tail size in `BlockBasedTableBuilder::Finish()`
- For backward compatibility, we fall back to TailPrefetchStats and, as a last resort, to a simple heuristic that the tail size is a linear portion of the file size - see PR conversation for more.
- Make `tail_start_offset` part of the table properties and derive the tail size to record in manifest for external files (e.g., file ingestion, import CF) and db repair (with no access to manifest).
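The fallback order in the list above can be sketched as a small selection function. Everything here is hypothetical: `ChooseTailPrefetchSize`, its parameters, and in particular `fallback_divisor`, which stands in for the unspecified "linear portion of the file size", are assumptions made for illustration and are not taken from the PR.

```cpp
#include <cassert>
#include <cstdint>

// Pick how many bytes to prefetch from the end of an SST file.
//  - recorded_tail_size: exact size from the manifest, 0 if not recorded
//  - stats_suggestion:   size proposed by a statistics heuristic, 0 if none
//  - file_size:          total file size in bytes
//  - fallback_divisor:   last resort, prefetch file_size / fallback_divisor
inline std::uint64_t ChooseTailPrefetchSize(std::uint64_t recorded_tail_size,
                                            std::uint64_t stats_suggestion,
                                            std::uint64_t file_size,
                                            std::uint64_t fallback_divisor) {
  std::uint64_t size = 0;
  if (recorded_tail_size > 0) {
    size = recorded_tail_size;  // preferred: the exact recorded value
  } else if (stats_suggestion > 0) {
    size = stats_suggestion;  // older files: statistics-based estimate
  } else if (fallback_divisor > 0) {
    size = file_size / fallback_divisor;  // simple proportional guess
  }
  return size < file_size ? size : file_size;  // never exceed the file
}
```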
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11406
Test Plan:
1. New UT
2. db bench
Note: db bench on /tmp/ where direct read is supported is too slow to finish and the default pinning setting in db bench is not helpful to profile # sst read of Get. Therefore I hacked the following to obtain the following comparison.
```
diff --git a/table/block_based/block_based_table_reader.cc b/table/block_based/block_based_table_reader.cc
index bd5669f0f..791484c1f 100644
--- a/table/block_based/block_based_table_reader.cc
+++ b/table/block_based/block_based_table_reader.cc
@@ -838,7 +838,7 @@ Status BlockBasedTable::PrefetchTail(
&tail_prefetch_size);
// Try file system prefetch
- if (!file->use_direct_io() && !force_direct_prefetch) {
+ if (false && !file->use_direct_io() && !force_direct_prefetch) {
if (!file->Prefetch(prefetch_off, prefetch_len, ro.rate_limiter_priority)
.IsNotSupported()) {
prefetch_buffer->reset(new FilePrefetchBuffer(
diff --git a/tools/db_bench_tool.cc b/tools/db_bench_tool.cc
index ea40f5fa0..39a0ac385 100644
--- a/tools/db_bench_tool.cc
+++ b/tools/db_bench_tool.cc
@@ -4191,6 +4191,8 @@ class Benchmark {
std::shared_ptr<TableFactory>(NewCuckooTableFactory(table_options));
} else {
BlockBasedTableOptions block_based_options;
+ block_based_options.metadata_cache_options.partition_pinning =
+ PinningTier::kAll;
block_based_options.checksum =
static_cast<ChecksumType>(FLAGS_checksum_type);
if (FLAGS_use_hash_search) {
```
Create DB
```
./db_bench --bloom_bits=3 --use_existing_db=1 --seed=1682546046158958 --partition_index_and_filters=1 --statistics=1 -db=/dev/shm/testdb/ -benchmarks=readrandom -key_size=3200 -value_size=512 -num=1000000 -write_buffer_size=6550000 -disable_auto_compactions=false -target_file_size_base=6550000 -compression_type=none
```
ReadRandom
```
./db_bench --bloom_bits=3 --use_existing_db=1 --seed=1682546046158958 --partition_index_and_filters=1 --statistics=1 -db=/dev/shm/testdb/ -benchmarks=readrandom -key_size=3200 -value_size=512 -num=1000000 -write_buffer_size=6550000 -disable_auto_compactions=false -target_file_size_base=6550000 -compression_type=none
```
(a) Existing (Use TailPrefetchStats for tail size + use separate prefetch buffer in PartitionedFilter/IndexReader::CacheDependencies())
```
rocksdb.table.open.prefetch.tail.hit COUNT : 3395
rocksdb.sst.read.micros P50 : 5.655570 P95 : 9.931396 P99 : 14.845454 P100 : 585.000000 COUNT : 999905 SUM : 6590614
```
(b) This PR (Record tail size + use the same tail buffer in PartitionedFilter/IndexReader::CacheDependencies())
```
rocksdb.table.open.prefetch.tail.hit COUNT : 14257
rocksdb.sst.read.micros P50 : 5.173347 P95 : 9.015017 P99 : 12.912610 P100 : 228.000000 COUNT : 998547 SUM : 5976540
```
As we can see, we increase the prefetch tail hit count and decrease SST read count with this PR
3. Test backward compatibility by stepping through reading with post-PR code on a db generated pre-PR.
Reviewed By: pdillinger
Differential Revision: D45413346
Pulled By: hx235
fbshipit-source-id: 7d5e36a60a72477218f79905168d688452a4c064
2 years ago
                           f->unique_id, f->compensated_range_deletion_size,
                           f->tail_size, f->user_defined_timestamps_persisted);
              ROCKS_LOG_WARN(immutable_db_options_.info_log,
                             "[%s] Moving #%" PRIu64
                             " from from_level-%d to from_level-%d %" PRIu64
                             " bytes\n",
                             cfd->GetName().c_str(), f->fd.GetNumber(),
                             from_level, to_level, f->fd.GetFileSize());
            }
            recovery_ctx->UpdateVersionEdits(cfd, edit);
          }
          --to_level;
        }
      }
    }
  }
  s = SetupDBId(read_only, recovery_ctx);
  ROCKS_LOG_INFO(immutable_db_options_.info_log, "DB ID: %s\n", db_id_.c_str());
Fix a recovery corner case (#7621)
Summary:
Consider the following sequence of events:
1. Db flushed an SST with file number N, appended to MANIFEST, and tried to sync the MANIFEST.
2. Syncing MANIFEST failed and db crashed.
3. Db tried to recover with this MANIFEST. In the meantime, no entry about the newly-flushed SST was found in the MANIFEST. Therefore, RocksDB replayed WAL and tried to flush to an SST file reusing the same file number N. This failed because file system does not support overwrite. Then Db deleted this file.
4. Db crashed again.
5. Db tried to recover. When db read the MANIFEST, there was an entry referencing N.sst. This could happen probably because the append in step 1 finally reached the MANIFEST and became visible. Since N.sst had been deleted in step 3, recovery failed.
It is possible that N.sst created in step 1 is valid. Although step 3 would still fail since the MANIFEST was not synced properly in step 1 and 2, deleting N.sst would make it impossible for the db to recover even if the remaining part of MANIFEST was appended and visible after step 5.
After this PR, in step 3, immediately after recovering from MANIFEST, a new MANIFEST is created, then we find that N.sst is not referenced in the MANIFEST, so we delete it, and we'll not reuse N as file number. Then in step 5, since the new MANIFEST does not contain N.sst, the recovery failure situation in step 5 won't happen.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7621
Test Plan:
1. some tests are updated, because these tests assume that new MANIFEST is created after WAL recovery.
2. a new unit test is added in db_basic_test to simulate step 3.
Reviewed By: riversand963
Differential Revision: D24668144
Pulled By: cheng-chang
fbshipit-source-id: 90d7487fbad2bc3714f5ede46ea949895b15ae3b
4 years ago
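The fix described in this commit message can be illustrated with a toy model. This is a minimal sketch, not RocksDB code: `UnreferencedScan` and `ScanUnreferencedSsts` are hypothetical names, and file numbers are modeled as plain integers. It shows the two effects the message names: an SST absent from the MANIFEST is scheduled for deletion, and its number is never handed out again.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical result type for this sketch; not a RocksDB structure.
struct UnreferencedScan {
  std::vector<uint64_t> to_delete;  // on disk but absent from the MANIFEST
  uint64_t next_file_number;        // strictly greater than every number seen
};

// Toy model: after the MANIFEST has been replayed, any SST on disk that it
// does not reference is scheduled for deletion, and the file-number counter
// is advanced past it so a later flush cannot reuse the same number.
UnreferencedScan ScanUnreferencedSsts(const std::set<uint64_t>& on_disk,
                                      const std::set<uint64_t>& referenced,
                                      uint64_t next_file_number) {
  UnreferencedScan result{{}, next_file_number};
  for (uint64_t number : on_disk) {
    if (referenced.count(number) == 0) {
      result.to_delete.push_back(number);
    }
    result.next_file_number = std::max(result.next_file_number, number + 1);
  }
  return result;
}
```

With files 5, 7 and 9 on disk and only 5 and 7 referenced, file 9 plays the role of N.sst: it is deleted, and the next flush gets a number above 9.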
  if (s.ok() && !read_only) {
Persist the new MANIFEST after successfully syncing the new WAL during recovery (#9922)
Summary:
In case of non-TransactionDB and avoid_flush_during_recovery = true, RocksDB won't
flush the data from WAL to L0 for all column families if possible. As a
result, not all column families can increase their log_numbers, and
min_log_number_to_keep won't change.
For transaction DB (.allow_2pc), even with the flush, there may be old WAL files that it must not delete because they can contain data of uncommitted transactions and min_log_number_to_keep won't change.
If we persist a new MANIFEST with
advanced log_numbers for some column families, then during a second
crash after persisting the MANIFEST, RocksDB will see some column
families' log_numbers larger than the corrupted wal, and the "column family inconsistency" error will be hit, causing recovery to fail.
As a solution, RocksDB will persist the new MANIFEST after successfully syncing the new WAL.
If a future recovery starts from the new MANIFEST, then it means the new WAL is successfully synced. Due to the sentinel empty write batch at the beginning, kPointInTimeRecovery of WAL is guaranteed to go after this point.
If a future recovery starts from the old MANIFEST, it means writing the new MANIFEST failed. We won't have the "SST ahead of WAL" error.
Currently, RocksDB DB::Open() may create and write to two new MANIFEST files even before recovery succeeds. This PR buffers the edits in a structure and writes to a new MANIFEST after recovery is successful.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9922
Test Plan:
1. Update unit tests to fail without this change
2. make crash_test -j
Branch with unit test and no fix https://github.com/facebook/rocksdb/pull/9942 to keep track of unit test (without fix)
Reviewed By: riversand963
Differential Revision: D36043701
Pulled By: akankshamahajan15
fbshipit-source-id: 5760970db0a0920fb73d3c054a4155733500acd9
3 years ago
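The buffering idea in this commit message can be sketched as follows. This is an illustrative model under stated assumptions, not the RocksDB implementation: `ManifestSketch` and `RecoveryBufferSketch` are hypothetical types, and edits are modeled as strings. Edits collected during recovery stay in memory and reach the MANIFEST only after the new WAL is known to be synced.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for this sketch; these are not RocksDB types.
struct ManifestSketch {
  std::vector<std::string> persisted_edits;
};

class RecoveryBufferSketch {
 public:
  // Record an edit in memory only; nothing reaches the MANIFEST yet.
  void Buffer(const std::string& edit) { pending_.push_back(edit); }

  // Write all buffered edits, but only once the new WAL is known to be
  // synced. Returns false (and writes nothing) otherwise.
  bool Commit(bool new_wal_synced, ManifestSketch* manifest) {
    if (!new_wal_synced) {
      return false;
    }
    for (const std::string& edit : pending_) {
      manifest->persisted_edits.push_back(edit);
    }
    pending_.clear();
    return true;
  }

 private:
  std::vector<std::string> pending_;
};
```

If a crash happens before `Commit` succeeds, the old MANIFEST is still the one on disk, so no column family appears to have advanced past a WAL that was never synced.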
    s = DeleteUnreferencedSstFiles(recovery_ctx);
  }

  if (immutable_db_options_.paranoid_checks && s.ok()) {
    s = CheckConsistency();
  }
  if (s.ok() && !read_only) {
Fix serious FSDirectory use-after-Close bug (missing fsync) (#10460)
Summary:
TL;DR: due to a recent change, if you drop a column family,
often that DB will no longer fsync after writing new SST files
to remaining or new column families, which could lead to data
loss on power loss.
More bug detail:
The intent of https://github.com/facebook/rocksdb/issues/10049 was to Close FSDirectory objects at
DB::Close time rather than waiting for DB object destruction.
Unfortunately, it also closes shared FSDirectory objects on
DropColumnFamily (& destroy remaining handles), which can lead
to use-after-Close on FSDirectory shared with remaining column
families. Those "uses" are only Fsyncs (or redundant Closes). In
the default Posix filesystem, an Fsync on a closed FSDirectory is a
quiet no-op. Consequently (under most configurations), if you drop
a column family, that DB will no longer fsync after writing new SST
files to column families sharing the same directory (true under most
configurations).
More fix detail:
Basically, this removes unnecessary Close ops on destroying
ColumnFamilyData. We let `shared_ptr` take care of calling the
destructor at the right time. If the intent was to require Close be
called before destroying FSDirectory, that was not made clear by the
author of FileSystem and was not at all enforced by https://github.com/facebook/rocksdb/issues/10049, which
could have added `assert(fd_ == -1)` to `~PosixDirectory()` but did
not. To keep this fix simple, we relax the unit test for https://github.com/facebook/rocksdb/issues/10049 to allow
timely destruction of FSDirectory to suffice as Close (in
CountedFileSystem). Added a TODO to revisit that.
Also in this PR:
* Added a TODO to share FSDirectory instances between DB and its column
families. (Already shared among column families.)
* Made DB::Close attempt to close all its open FSDirectory objects even
if there is a failure in closing one. Also code clean-up around this
logic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10460
Test Plan:
add an assert to check for use-after-Close. With that
existing tests can detect the misuse. With fix, tests pass (except noted
relaxing of unit test for https://github.com/facebook/rocksdb/issues/10049)
Reviewed By: ajkr
Differential Revision: D38357922
Pulled By: pdillinger
fbshipit-source-id: d42079cadbedf0a969f03389bf586b3b4e1f9137
3 years ago
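The use-after-Close hazard and the `shared_ptr` fix described in this commit message can be modeled in a few lines. This is a sketch with hypothetical names (`DirSketch`, `DropOwnerWithClose`, `DropOwnerWithReset`), not the FSDirectory API. It assumes, as the message states for the default Posix filesystem, that syncing a closed directory silently does nothing.

```cpp
#include <cassert>
#include <memory>

// Hypothetical directory handle for this sketch; not the FSDirectory API.
class DirSketch {
 public:
  void Close() { open_ = false; }

  // Mirrors the "quiet no-op" behavior: returns whether a sync was
  // actually performed.
  bool Fsync() {
    if (!open_) {
      return false;
    }
    ++syncs_;
    return true;
  }

  int syncs() const { return syncs_; }

 private:
  bool open_ = true;
  int syncs_ = 0;
};

// Buggy teardown: explicitly closes a handle other owners may still use.
void DropOwnerWithClose(std::shared_ptr<DirSketch>& owner) {
  owner->Close();
  owner.reset();
}

// Fixed teardown: only release this owner's reference and let the last
// owner's destruction clean up.
void DropOwnerWithReset(std::shared_ptr<DirSketch>& owner) { owner.reset(); }
```

After `DropOwnerWithReset`, the remaining owners can still sync; after `DropOwnerWithClose`, every later `Fsync` through a surviving reference is skipped without any error.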
    // TODO: share file descriptors (FSDirectory) with SetDirectories above
    std::map<std::string, std::shared_ptr<FSDirectory>> created_dirs;
    for (auto cfd : *versions_->GetColumnFamilySet()) {
      s = cfd->AddDirectories(&created_dirs);
      if (!s.ok()) {
        return s;
      }
    }
  }

  std::vector<std::string> files_in_wal_dir;
  if (s.ok()) {
    // Initial max_total_in_memory_state_ before recovery wals. Log recovery
    // may check this value to decide whether to flush.
    max_total_in_memory_state_ = 0;
    for (auto cfd : *versions_->GetColumnFamilySet()) {
      auto* mutable_cf_options = cfd->GetLatestMutableCFOptions();
      max_total_in_memory_state_ += mutable_cf_options->write_buffer_size *
                                    mutable_cf_options->max_write_buffer_number;
    }

    SequenceNumber next_sequence(kMaxSequenceNumber);
    default_cf_handle_ = new ColumnFamilyHandleImpl(
        versions_->GetColumnFamilySet()->GetDefault(), this, &mutex_);
    default_cf_internal_stats_ = default_cf_handle_->cfd()->internal_stats();

    // Recover from all newer log files than the ones named in the
    // descriptor (new log files may have been added by the previous
    // incarnation without registering them in the descriptor).
    //
    // Note that prev_log_number() is no longer used, but we pay
    // attention to it in case we are recovering a database
    // produced by an older version of rocksdb.
    auto wal_dir = immutable_db_options_.GetWalDir();
    if (!immutable_db_options_.best_efforts_recovery) {
      IOOptions io_opts;
      io_opts.do_not_recurse = true;
      s = immutable_db_options_.fs->GetChildren(
          wal_dir, io_opts, &files_in_wal_dir, /*IODebugContext*=*/nullptr);
    }
    if (s.IsNotFound()) {
      return Status::InvalidArgument("wal_dir not found", wal_dir);
    } else if (!s.ok()) {
      return s;
    }

    std::unordered_map<uint64_t, std::string> wal_files;
    for (const auto& file : files_in_wal_dir) {
      uint64_t number;
      FileType type;
      if (ParseFileName(file, &number, &type) && type == kWalFile) {
        if (is_new_db) {
          return Status::Corruption(
              "While creating a new Db, wal_dir contains "
              "existing log file: ",
              file);
        } else {
          wal_files[number] = LogFileName(wal_dir, number);
        }
      }
    }

    if (immutable_db_options_.track_and_verify_wals_in_manifest) {
      if (!immutable_db_options_.best_efforts_recovery) {
        // Verify WALs in MANIFEST.
        s = versions_->GetWalSet().CheckWals(env_, wal_files);
      }  // else since best effort recovery does not recover from WALs, no need
         // to check WALs.
    } else if (!versions_->GetWalSet().GetWals().empty()) {
      // Tracking is disabled, clear previously tracked WALs from MANIFEST,
      // otherwise, in the future, if WAL tracking is enabled again,
      // since the WALs deleted when WAL tracking is disabled are not persisted
      // into MANIFEST, WAL check may fail.
      VersionEdit edit;
      WalNumber max_wal_number =
          versions_->GetWalSet().GetWals().rbegin()->first;
      edit.DeleteWalsBefore(max_wal_number + 1);
      assert(recovery_ctx != nullptr);
      assert(versions_->GetColumnFamilySet() != nullptr);
      recovery_ctx->UpdateVersionEdits(
          versions_->GetColumnFamilySet()->GetDefault(), edit);
    }
    if (!s.ok()) {
      return s;
    }

    if (!wal_files.empty()) {
      if (error_if_wal_file_exists) {
        return Status::Corruption(
            "The db was opened in readonly mode with error_if_wal_file_exists "
            "flag but a WAL file already exists");
      } else if (error_if_data_exists_in_wals) {
        for (auto& wal_file : wal_files) {
          uint64_t bytes;
          s = env_->GetFileSize(wal_file.second, &bytes);
          if (s.ok()) {
            if (bytes > 0) {
              return Status::Corruption(
                  "error_if_data_exists_in_wals is set but there is data "
                  "in WAL files.");
            }
          }
        }
      }
    }

    if (!wal_files.empty()) {
      // Recover in the order in which the wals were generated
      std::vector<uint64_t> wals;
      wals.reserve(wal_files.size());
      for (const auto& wal_file : wal_files) {
        wals.push_back(wal_file.first);
      }
      std::sort(wals.begin(), wals.end());

      bool corrupted_wal_found = false;
      s = RecoverLogFiles(wals, &next_sequence, read_only, &corrupted_wal_found,
                          recovery_ctx);
      if (corrupted_wal_found && recovered_seq != nullptr) {
        *recovered_seq = next_sequence;
      }
      if (!s.ok()) {
        // Clear memtables if recovery failed
        for (auto cfd : *versions_->GetColumnFamilySet()) {
          cfd->CreateNewMemtable(*cfd->GetLatestMutableCFOptions(),
                                 kMaxSequenceNumber);
        }
      }
    }
  }

  if (read_only) {
    // If we are opening as read-only, we need to update options_file_number_
    // to reflect the most recent OPTIONS file. It does not matter for regular
    // read-write db instance because options_file_number_ will later be
    // updated to versions_->NewFileNumber() in RenameTempFileToOptionsFile.
    std::vector<std::string> filenames;
    if (s.ok()) {
      const std::string normalized_dbname = NormalizePath(dbname_);
      const std::string normalized_wal_dir =
          NormalizePath(immutable_db_options_.GetWalDir());
      if (immutable_db_options_.best_efforts_recovery) {
        filenames = std::move(files_in_dbname);
      } else if (normalized_dbname == normalized_wal_dir) {
        filenames = std::move(files_in_wal_dir);
      } else {
        IOOptions io_opts;
        io_opts.do_not_recurse = true;
        s = immutable_db_options_.fs->GetChildren(
            GetName(), io_opts, &filenames, /*IODebugContext*=*/nullptr);
      }
    }
    if (s.ok()) {
      uint64_t number = 0;
      uint64_t options_file_number = 0;
      FileType type;
      for (const auto& fname : filenames) {
        if (ParseFileName(fname, &number, &type) && type == kOptionsFile) {
          options_file_number = std::max(number, options_file_number);
        }
      }
      versions_->options_file_number_ = options_file_number;
      uint64_t options_file_size = 0;
      if (options_file_number > 0) {
        s = env_->GetFileSize(OptionsFileName(GetName(), options_file_number),
                              &options_file_size);
      }
      versions_->options_file_size_ = options_file_size;
    }
  }
  return s;
}
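
The WAL discovery step in the function above (list the directory, keep only WAL files, replay in ascending number order) can be tried in isolation. This is a minimal sketch: `ParseWalNameSketch` and `WalReplayOrder` are hypothetical helpers, and the parser assumes names of the form `<digits>.log`, which is far narrower than what RocksDB's own `ParseFileName` accepts.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical parser for this sketch: accepts "<digits>.log" only.
bool ParseWalNameSketch(const std::string& name, uint64_t* number) {
  const std::string suffix = ".log";
  if (name.size() <= suffix.size() ||
      name.compare(name.size() - suffix.size(), suffix.size(), suffix) != 0) {
    return false;
  }
  uint64_t value = 0;
  for (std::size_t i = 0; i < name.size() - suffix.size(); ++i) {
    if (name[i] < '0' || name[i] > '9') {
      return false;
    }
    value = value * 10 + static_cast<uint64_t>(name[i] - '0');
  }
  *number = value;
  return true;
}

// Collect WAL numbers from a directory listing and return them in ascending
// order, i.e. the order in which the WALs were generated.
std::vector<uint64_t> WalReplayOrder(const std::vector<std::string>& listing) {
  std::unordered_map<uint64_t, std::string> wal_files;
  for (const std::string& name : listing) {
    uint64_t number = 0;
    if (ParseWalNameSketch(name, &number)) {
      wal_files[number] = name;
    }
  }
  std::vector<uint64_t> wals;
  wals.reserve(wal_files.size());
  for (const auto& entry : wal_files) {
    wals.push_back(entry.first);
  }
  std::sort(wals.begin(), wals.end());
  return wals;
}
```

The explicit sort matters because an unordered map gives no iteration order, while replay must follow the order in which the logs were written.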

Status DBImpl::PersistentStatsProcessFormatVersion() {
  mutex_.AssertHeld();
  Status s;
  // persist version when stats CF doesn't exist
  bool should_persist_format_version = !persistent_stats_cfd_exists_;
  mutex_.Unlock();
  if (persistent_stats_cfd_exists_) {
    // Check persistent stats format version compatibility. Drop and recreate
    // persistent stats CF if format version is incompatible
    uint64_t format_version_recovered = 0;
    Status s_format = DecodePersistentStatsVersionNumber(
        this, StatsVersionKeyType::kFormatVersion, &format_version_recovered);
    uint64_t compatible_version_recovered = 0;
    Status s_compatible = DecodePersistentStatsVersionNumber(
        this, StatsVersionKeyType::kCompatibleVersion,
        &compatible_version_recovered);
    // abort reading from existing stats CF if any of following is true:
    // 1. failed to read format version or compatible version from disk
    // 2. sst's format version is greater than current format version, meaning
    // this sst is encoded with a newer RocksDB release, and current compatible
    // version is below the sst's compatible version
    if (!s_format.ok() || !s_compatible.ok() ||
        (kStatsCFCurrentFormatVersion < format_version_recovered &&
         kStatsCFCompatibleFormatVersion < compatible_version_recovered)) {
      if (!s_format.ok() || !s_compatible.ok()) {
        ROCKS_LOG_WARN(
            immutable_db_options_.info_log,
            "Recreating persistent stats column family since reading "
            "persistent stats version key failed. Format key: %s, compatible "
            "key: %s",
            s_format.ToString().c_str(), s_compatible.ToString().c_str());
      } else {
        ROCKS_LOG_WARN(
            immutable_db_options_.info_log,
            "Recreating persistent stats column family due to corrupted or "
            "incompatible format version. Recovered format: %" PRIu64
            "; recovered format compatible since: %" PRIu64 "\n",
            format_version_recovered, compatible_version_recovered);
      }
      s = DropColumnFamily(persist_stats_cf_handle_);
      if (s.ok()) {
        s = DestroyColumnFamilyHandle(persist_stats_cf_handle_);
      }
      ColumnFamilyHandle* handle = nullptr;
      if (s.ok()) {
        ColumnFamilyOptions cfo;
        OptimizeForPersistentStats(&cfo);
        s = CreateColumnFamily(cfo, kPersistentStatsColumnFamilyName, &handle);
      }
      if (s.ok()) {
        persist_stats_cf_handle_ = static_cast<ColumnFamilyHandleImpl*>(handle);
        // should also persist version here because old stats CF is discarded
        should_persist_format_version = true;
      }
    }
  }
  if (should_persist_format_version) {
    // Persistent stats CF being created for the first time, need to write
    // format version key
    WriteBatch batch;
    if (s.ok()) {
      s = batch.Put(persist_stats_cf_handle_, kFormatVersionKeyString,
                    std::to_string(kStatsCFCurrentFormatVersion));
    }
    if (s.ok()) {
      s = batch.Put(persist_stats_cf_handle_, kCompatibleVersionKeyString,
                    std::to_string(kStatsCFCompatibleFormatVersion));
    }
    if (s.ok()) {
      WriteOptions wo;
      wo.low_pri = true;
      wo.no_slowdown = true;
      wo.sync = false;
      s = Write(wo, &batch);
    }
  }
  mutex_.Lock();
  return s;
}
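
The compatibility rule in the function above reduces to a small predicate. This sketch restates it with hypothetical names, and the two constants are placeholder values chosen for illustration, not the values RocksDB defines. The stats column family is recreated when either version key cannot be read, or when both the recovered format version and the recovered compatible version are newer than what the running code supports.

```cpp
#include <cassert>
#include <cstdint>

// Placeholder constants for this sketch (assumed values).
constexpr uint64_t kCurrentFormatSketch = 1;
constexpr uint64_t kCompatibleFormatSketch = 1;

// Returns true when the existing persistent stats data should be discarded.
bool ShouldRecreateStatsCf(bool read_format_ok, bool read_compatible_ok,
                           uint64_t format_recovered,
                           uint64_t compatible_recovered) {
  if (!read_format_ok || !read_compatible_ok) {
    return true;  // a version key was missing or unreadable
  }
  // Newer on-disk format alone is tolerated; it is rejected only if the
  // writer also declared it incompatible with this reader.
  return kCurrentFormatSketch < format_recovered &&
         kCompatibleFormatSketch < compatible_recovered;
}
```

The two-number scheme lets a newer release bump the format version without forcing older readers to drop the data, as long as the compatible version stays put.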

Status DBImpl::InitPersistStatsColumnFamily() {
  mutex_.AssertHeld();
  assert(!persist_stats_cf_handle_);
  ColumnFamilyData* persistent_stats_cfd =
      versions_->GetColumnFamilySet()->GetColumnFamily(
          kPersistentStatsColumnFamilyName);
  persistent_stats_cfd_exists_ = persistent_stats_cfd != nullptr;

  Status s;
  if (persistent_stats_cfd != nullptr) {
    // We are recovering from a DB which already contains persistent stats CF,
    // the CF is already created in VersionSet::ApplyOneVersionEdit, but
    // column family handle was not. Need to explicitly create handle here.
    persist_stats_cf_handle_ =
        new ColumnFamilyHandleImpl(persistent_stats_cfd, this, &mutex_);
  } else {
    mutex_.Unlock();
    ColumnFamilyHandle* handle = nullptr;
    ColumnFamilyOptions cfo;
    OptimizeForPersistentStats(&cfo);
    s = CreateColumnFamily(cfo, kPersistentStatsColumnFamilyName, &handle);
    persist_stats_cf_handle_ = static_cast<ColumnFamilyHandleImpl*>(handle);
    mutex_.Lock();
  }
  return s;
}

Status DBImpl::LogAndApplyForRecovery(const RecoveryContext& recovery_ctx) {
  mutex_.AssertHeld();
  assert(versions_->descriptor_log_ == nullptr);
  const ReadOptions read_options(Env::IOActivity::kDBOpen);
  Status s = versions_->LogAndApply(
      recovery_ctx.cfds_, recovery_ctx.mutable_cf_opts_, read_options,
      recovery_ctx.edit_lists_, &mutex_, directories_.GetDbDir());
  if (s.ok() && !(recovery_ctx.files_to_delete_.empty())) {
    mutex_.Unlock();
    for (const auto& fname : recovery_ctx.files_to_delete_) {
      s = env_->DeleteFile(fname);
      if (!s.ok()) {
        break;
      }
    }
    mutex_.Lock();
  }
  return s;
}

void DBImpl::InvokeWalFilterIfNeededOnColumnFamilyToWalNumberMap() {
  if (immutable_db_options_.wal_filter == nullptr) {
    return;
  }
  assert(immutable_db_options_.wal_filter != nullptr);
  WalFilter& wal_filter = *(immutable_db_options_.wal_filter);

  std::map<std::string, uint32_t> cf_name_id_map;
  std::map<uint32_t, uint64_t> cf_lognumber_map;
  assert(versions_);
  assert(versions_->GetColumnFamilySet());
  for (auto cfd : *versions_->GetColumnFamilySet()) {
    assert(cfd);
    cf_name_id_map.insert(std::make_pair(cfd->GetName(), cfd->GetID()));
    cf_lognumber_map.insert(std::make_pair(cfd->GetID(), cfd->GetLogNumber()));
  }

  wal_filter.ColumnFamilyLogNumberMap(cf_lognumber_map, cf_name_id_map);
}

bool DBImpl::InvokeWalFilterIfNeededOnWalRecord(uint64_t wal_number,
                                                const std::string& wal_fname,
                                                log::Reader::Reporter& reporter,
                                                Status& status,
                                                bool& stop_replay,
                                                WriteBatch& batch) {
  if (immutable_db_options_.wal_filter == nullptr) {
    return true;
  }
  assert(immutable_db_options_.wal_filter != nullptr);
  WalFilter& wal_filter = *(immutable_db_options_.wal_filter);

  WriteBatch new_batch;
  bool batch_changed = false;

  bool process_current_record = true;

  WalFilter::WalProcessingOption wal_processing_option =
      wal_filter.LogRecordFound(wal_number, wal_fname, batch, &new_batch,
                                &batch_changed);

  switch (wal_processing_option) {
    case WalFilter::WalProcessingOption::kContinueProcessing:
      // do nothing, proceed normally
      break;
    case WalFilter::WalProcessingOption::kIgnoreCurrentRecord:
      // skip current record
      process_current_record = false;
      break;
    case WalFilter::WalProcessingOption::kStopReplay:
      // skip current record and stop replay
      process_current_record = false;
      stop_replay = true;
      break;
    case WalFilter::WalProcessingOption::kCorruptedRecord: {
      status = Status::Corruption("Corruption reported by Wal Filter ",
                                  wal_filter.Name());
      MaybeIgnoreError(&status);
      if (!status.ok()) {
        process_current_record = false;
        reporter.Corruption(batch.GetDataSize(), status);
      }
      break;
    }
    default: {
      // logical error which should not happen. If RocksDB throws, we would
      // just do `throw std::logic_error`.
      assert(false);
      status = Status::NotSupported(
          "Unknown WalProcessingOption returned by Wal Filter ",
          wal_filter.Name());
      MaybeIgnoreError(&status);
      if (!status.ok()) {
        // Ignore the error with current record processing.
        stop_replay = true;
      }
      break;
    }
  }

  if (!process_current_record) {
    return false;
  }

  if (batch_changed) {
    // Make sure that the count in the new batch is
    // within the original count.
    int new_count = WriteBatchInternal::Count(&new_batch);
    int original_count = WriteBatchInternal::Count(&batch);
    if (new_count > original_count) {
      ROCKS_LOG_FATAL(
          immutable_db_options_.info_log,
          "Recovering log #%" PRIu64
          " mode %d log filter %s returned "
          "more records (%d) than original (%d) which is not allowed. "
          "Aborting recovery.",
          wal_number, static_cast<int>(immutable_db_options_.wal_recovery_mode),
          wal_filter.Name(), new_count, original_count);
      status = Status::NotSupported(
          "More than original # of records "
          "returned by Wal Filter ",
          wal_filter.Name());
      return false;
    }
    // Set the same sequence number in the new_batch
    // as the original batch.
    WriteBatchInternal::SetSequence(&new_batch,
                                    WriteBatchInternal::Sequence(&batch));
    batch = new_batch;
  }
  return true;
}
|
|
|
|
|
|
|
|
// REQUIRES: wal_numbers are sorted in ascending order
|
|
|
|
Status DBImpl::RecoverLogFiles(const std::vector<uint64_t>& wal_numbers,
|
|
|
|
SequenceNumber* next_sequence, bool read_only,
|
                               bool* corrupted_wal_found,
                               RecoveryContext* recovery_ctx) {
  struct LogReporter : public log::Reader::Reporter {
    Env* env;
    Logger* info_log;
    const char* fname;
    Status* status;  // nullptr if immutable_db_options_.paranoid_checks==false
    void Corruption(size_t bytes, const Status& s) override {
      ROCKS_LOG_WARN(info_log, "%s%s: dropping %d bytes; %s",
                     (status == nullptr ? "(ignoring error) " : ""), fname,
                     static_cast<int>(bytes), s.ToString().c_str());
      if (status != nullptr && status->ok()) {
        *status = s;
      }
    }
  };

  mutex_.AssertHeld();
  Status status;
  std::unordered_map<int, VersionEdit> version_edits;
  // no need to refcount because iteration is under mutex
  for (auto cfd : *versions_->GetColumnFamilySet()) {
    VersionEdit edit;
    edit.SetColumnFamily(cfd->GetID());
    version_edits.insert({cfd->GetID(), edit});
  }
  int job_id = next_job_id_.fetch_add(1);
  {
    auto stream = event_logger_.Log();
    stream << "job" << job_id << "event"
           << "recovery_started";
    stream << "wal_files";
    stream.StartArray();
    for (auto wal_number : wal_numbers) {
      stream << wal_number;
    }
    stream.EndArray();
  }

  // No-op for immutable_db_options_.wal_filter == nullptr.
  InvokeWalFilterIfNeededOnColumnFamilyToWalNumberMap();

  bool stop_replay_by_wal_filter = false;
  bool stop_replay_for_corruption = false;
  bool flushed = false;
  uint64_t corrupted_wal_number = kMaxSequenceNumber;
  uint64_t min_wal_number = MinLogNumberToKeep();
  if (!allow_2pc()) {
    // In non-2pc mode, we skip WALs that do not back unflushed data.
    min_wal_number =
        std::max(min_wal_number, versions_->MinLogNumberWithUnflushedData());
  }
  for (auto wal_number : wal_numbers) {
    if (wal_number < min_wal_number) {
      ROCKS_LOG_INFO(immutable_db_options_.info_log,
                     "Skipping log #%" PRIu64
                     " since it is older than min log to keep #%" PRIu64,
                     wal_number, min_wal_number);
      continue;
    }
    // The previous incarnation may not have written any MANIFEST
    // records after allocating this log number. So we manually
    // update the file number allocation counter in VersionSet.
    versions_->MarkFileNumberUsed(wal_number);
    // Open the log file
    std::string fname =
        LogFileName(immutable_db_options_.GetWalDir(), wal_number);

    ROCKS_LOG_INFO(immutable_db_options_.info_log,
                   "Recovering log #%" PRIu64 " mode %d", wal_number,
                   static_cast<int>(immutable_db_options_.wal_recovery_mode));
    auto logFileDropped = [this, &fname]() {
      uint64_t bytes;
      if (env_->GetFileSize(fname, &bytes).ok()) {
        auto info_log = immutable_db_options_.info_log.get();
        ROCKS_LOG_WARN(info_log, "%s: dropping %d bytes", fname.c_str(),
                       static_cast<int>(bytes));
      }
    };
    if (stop_replay_by_wal_filter) {
      logFileDropped();
      continue;
    }

    std::unique_ptr<SequentialFileReader> file_reader;
    {
      std::unique_ptr<FSSequentialFile> file;
      status = fs_->NewSequentialFile(
          fname, fs_->OptimizeForLogRead(file_options_), &file, nullptr);
      if (!status.ok()) {
        MaybeIgnoreError(&status);
        if (!status.ok()) {
          return status;
        } else {
          // Fail with one log file, but that's ok.
          // Try next one.
          continue;
        }
      }
      file_reader.reset(new SequentialFileReader(
          std::move(file), fname, immutable_db_options_.log_readahead_size,
          io_tracer_));
    }

    // Create the log reader.
    LogReporter reporter;
    reporter.env = env_;
    reporter.info_log = immutable_db_options_.info_log.get();
    reporter.fname = fname.c_str();
    if (!immutable_db_options_.paranoid_checks ||
        immutable_db_options_.wal_recovery_mode ==
            WALRecoveryMode::kSkipAnyCorruptedRecords) {
      reporter.status = nullptr;
    } else {
      reporter.status = &status;
    }
    // We intentionally make log::Reader do checksumming even if
    // paranoid_checks==false so that corruptions cause entire commits
    // to be skipped instead of propagating bad information (like overly
    // large sequence numbers).
    log::Reader reader(immutable_db_options_.info_log, std::move(file_reader),
                       &reporter, true /*checksum*/, wal_number);

    // Read all the records and add them to a memtable. Whether incomplete
    // records at the tail end of the log are tolerated depends on
    // wal_recovery_mode, which is passed to ReadRecord() below.
    std::string scratch;
    Slice record;

    const UnorderedMap<uint32_t, size_t>& running_ts_sz =
        versions_->GetRunningColumnFamiliesTimestampSize();

    TEST_SYNC_POINT_CALLBACK("DBImpl::RecoverLogFiles:BeforeReadWal",
                             /*arg=*/nullptr);
    uint64_t record_checksum;
    while (!stop_replay_by_wal_filter &&
           reader.ReadRecord(&record, &scratch,
                             immutable_db_options_.wal_recovery_mode,
                             &record_checksum) &&
           status.ok()) {
      if (record.size() < WriteBatchInternal::kHeader) {
        reporter.Corruption(record.size(),
                            Status::Corruption("log record too small"));
        continue;
      }

      // We create a new batch and initialize with a valid prot_info_ to store
      // the data checksums
      WriteBatch batch;

      status = WriteBatchInternal::SetContents(&batch, record);
      if (!status.ok()) {
        return status;
      }

      const UnorderedMap<uint32_t, size_t>& record_ts_sz =
          reader.GetRecordedTimestampSize();
      // TODO(yuzhangyu): update mode to kReconcileInconsistency when user
      // comparator can be changed.
      status = HandleWriteBatchTimestampSizeDifference(
          &batch, running_ts_sz, record_ts_sz,
          TimestampSizeConsistencyMode::kVerifyConsistency);
      if (!status.ok()) {
        return status;
      }
      TEST_SYNC_POINT_CALLBACK(
          "DBImpl::RecoverLogFiles:BeforeUpdateProtectionInfo:batch", &batch);
      TEST_SYNC_POINT_CALLBACK(
          "DBImpl::RecoverLogFiles:BeforeUpdateProtectionInfo:checksum",
          &record_checksum);
      status = WriteBatchInternal::UpdateProtectionInfo(
          &batch, 8 /* bytes_per_key */, &record_checksum);
      if (!status.ok()) {
        return status;
      }

      SequenceNumber sequence = WriteBatchInternal::Sequence(&batch);

      if (immutable_db_options_.wal_recovery_mode ==
          WALRecoveryMode::kPointInTimeRecovery) {
        // In point-in-time recovery mode, if sequence id of log files are
        // consecutive, we continue recovery despite corruption. This could
        // happen when we open and write to a corrupted DB, where sequence id
        // will start from the last sequence id we recovered.
        if (sequence == *next_sequence) {
          stop_replay_for_corruption = false;
        }
        if (stop_replay_for_corruption) {
          logFileDropped();
          break;
        }
      }

      // For the default case of wal_filter == nullptr, always performs no-op
      // and returns true.
      if (!InvokeWalFilterIfNeededOnWalRecord(wal_number, fname, reporter,
                                              status, stop_replay_by_wal_filter,
                                              batch)) {
        continue;
      }

      // If column family was not found, it might mean that the WAL write
      // batch refers to a column family that was dropped after the
      // insert. We don't want to fail the whole write batch in that case --
      // we just ignore the update.
      // That's why we set ignore missing column families to true
      bool has_valid_writes = false;
      status = WriteBatchInternal::InsertInto(
          &batch, column_family_memtables_.get(), &flush_scheduler_,
          &trim_history_scheduler_, true, wal_number, this,
          false /* concurrent_memtable_writes */, next_sequence,
          &has_valid_writes, seq_per_batch_, batch_per_txn_);
      MaybeIgnoreError(&status);
      if (!status.ok()) {
        // We are treating this as a failure while reading since we read valid
        // blocks that do not form coherent data
        reporter.Corruption(record.size(), status);
        continue;
      }

      if (has_valid_writes && !read_only) {
        // we can do this because this is called before client has access to the
        // DB and there is only a single thread operating on DB
        ColumnFamilyData* cfd;

        while ((cfd = flush_scheduler_.TakeNextColumnFamily()) != nullptr) {
          cfd->UnrefAndTryDelete();
          // If this asserts, it means that InsertInto failed in
          // filtering updates to already-flushed column families
          assert(cfd->GetLogNumber() <= wal_number);
          auto iter = version_edits.find(cfd->GetID());
          assert(iter != version_edits.end());
          VersionEdit* edit = &iter->second;
          status = WriteLevel0TableForRecovery(job_id, cfd, cfd->mem(), edit);
          if (!status.ok()) {
            // Reflect errors immediately so that conditions like full
            // file-systems cause the DB::Open() to fail.
            return status;
          }
          flushed = true;

          cfd->CreateNewMemtable(*cfd->GetLatestMutableCFOptions(),
                                 *next_sequence);
        }
      }
    }

    if (!status.ok()) {
      if (status.IsNotSupported()) {
        // We should not treat NotSupported as corruption. It is rather a clear
        // sign that we are processing a WAL that is produced by an incompatible
        // version of the code.
        return status;
      }
      if (immutable_db_options_.wal_recovery_mode ==
          WALRecoveryMode::kSkipAnyCorruptedRecords) {
        // We should ignore all errors unconditionally
        status = Status::OK();
      } else if (immutable_db_options_.wal_recovery_mode ==
                 WALRecoveryMode::kPointInTimeRecovery) {
        if (status.IsIOError()) {
          ROCKS_LOG_ERROR(immutable_db_options_.info_log,
                          "IOError during point-in-time reading log #%" PRIu64
                          " seq #%" PRIu64
                          ". %s. This likely mean loss of synced WAL, "
                          "thus recovery fails.",
                          wal_number, *next_sequence,
                          status.ToString().c_str());
          return status;
        }
        // We should ignore the error but not continue replaying
        status = Status::OK();
        stop_replay_for_corruption = true;
        corrupted_wal_number = wal_number;
        if (corrupted_wal_found != nullptr) {
          *corrupted_wal_found = true;
        }
        ROCKS_LOG_INFO(immutable_db_options_.info_log,
                       "Point in time recovered to log #%" PRIu64
                       " seq #%" PRIu64,
                       wal_number, *next_sequence);
      } else {
        assert(immutable_db_options_.wal_recovery_mode ==
                   WALRecoveryMode::kTolerateCorruptedTailRecords ||
               immutable_db_options_.wal_recovery_mode ==
                   WALRecoveryMode::kAbsoluteConsistency);
        return status;
      }
    }

    flush_scheduler_.Clear();
    trim_history_scheduler_.Clear();
    auto last_sequence = *next_sequence - 1;
    if ((*next_sequence != kMaxSequenceNumber) &&
        (versions_->LastSequence() <= last_sequence)) {
      versions_->SetLastAllocatedSequence(last_sequence);
      versions_->SetLastPublishedSequence(last_sequence);
      versions_->SetLastSequence(last_sequence);
    }
  }
  // Compare the corrupted log number to all column families' current log
  // numbers. Abort Open() if any column family's log number is greater than
  // the corrupted log number, which means the CF contains data beyond the
  // point of corruption. This could happen during PIT recovery when the WAL is
  // corrupted and some (but not all) CFs are flushed.
  // Exclude the PIT case where no log is dropped after the corruption point.
  // This is to cover the case for empty wals after corrupted log, in which we
  // don't reset stop_replay_for_corruption.
  if (stop_replay_for_corruption == true &&
      (immutable_db_options_.wal_recovery_mode ==
           WALRecoveryMode::kPointInTimeRecovery ||
       immutable_db_options_.wal_recovery_mode ==
           WALRecoveryMode::kTolerateCorruptedTailRecords)) {
    for (auto cfd : *versions_->GetColumnFamilySet()) {
Fix the false positive alert of CF consistency check in WAL recovery (#8207)
Summary:
In current RocksDB, when recovering information from the WAL, we run a consistency check for each column family when one WAL file is corrupted and PointInTimeRecovery is set. However, the check reports a false positive "SST file is ahead of WALs" alert when a CF's current log number is greater than the corrupted WAL number (which normally means the CF contains data beyond the corrupted WAL) because a new column family was created during a flush. In this case, a new, empty WAL is created during the flush. In addition, for some reason (e.g., a storage issue, or a crash before SyncCloseLog is called), the old WAL is corrupted. The new CF has no data, so it has no consistency issue.
Fix: when checking cfd->GetLogNumber() > corrupted_wal_number, also check cfd->GetLiveSstFilesSize() > 0, so that CFs with no SST file data skip the check.
Note a potential inconsistency that this fix ignores: an empty CF can also be the result of write+delete. In this case, no SST files are generated after the flush, but the CF still has entries in the WAL. When the WAL is corrupted, the DB might be inconsistent.
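The condition described in the fix can be sketched in isolation as follows. This is a minimal illustration, not the RocksDB implementation: `CfState` and `FindInconsistentCf` are hypothetical names, and the struct fields only stand in for the values returned by `cfd->GetLogNumber()` and `cfd->GetLiveSstFilesSize()`.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the per-column-family state the check reads.
struct CfState {
  std::string name;
  uint64_t log_number;           // stands in for cfd->GetLogNumber()
  uint64_t live_sst_files_size;  // stands in for cfd->GetLiveSstFilesSize()
};

// Returns a pointer to the first column family that is ahead of the corrupted
// WAL and also has SST data, or nullptr if every column family passes.
const CfState* FindInconsistentCf(const std::vector<CfState>& cfs,
                                  uint64_t corrupted_wal_number) {
  for (const CfState& cf : cfs) {
    // An empty CF (no live SST data) is skipped even if its log number is
    // ahead of the corrupted WAL; this avoids the false positive.
    if (cf.log_number > corrupted_wal_number && cf.live_sst_files_size > 0) {
      return &cf;
    }
  }
  return nullptr;
}
```

With a corrupted WAL number of 6, a newly created CF with log number 8 and no SST data passes the check, while a CF with log number 9 and non-empty SST files is reported.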
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8207
Test Plan: added unit test, make crash_test
Reviewed By: riversand963
Differential Revision: D27898839
Pulled By: zhichao-cao
fbshipit-source-id: 931fc2d8b92dd00b4169bf84b94e712fd688a83e
4 years ago
|
|
|
// One special case causes cfd->GetLogNumber() > corrupted_wal_number but
|
|
|
|
// the CF is still consistent: If a new column family is created during
|
|
|
|
// the flush and the WAL sync fails at the same time, the new CF points to
|
|
|
|
// the new WAL but the old WAL is corrupted. Since the new CF is empty, it
|
|
|
|
// is still consistent. We add the check of CF sst file size to avoid the
|
|
|
|
// false positive alert.
|
|
|
|
|
|
|
|
// Note that the check of (cfd->GetLiveSstFilesSize() > 0) may lead to
|
|
|
|
// overlooking a very rare inconsistency case caused by data
|
|
|
|
// cancellation. One CF is empty due to KV deletion. But those operations
|
|
|
|
// are in the WAL. If the WAL is corrupted, the status of this CF might
|
|
|
|
// not be consistent with others. However, the consistency check will be
|
|
|
|
// bypassed due to empty CF.
|
|
|
|
// TODO: a better and complete implementation is needed to ensure strict
|
|
|
|
// consistency check in WAL recovery including handling the tailing
|
|
|
|
// issues.
|
|
|
|
if (cfd->GetLogNumber() > corrupted_wal_number &&
|
|
|
|
cfd->GetLiveSstFilesSize() > 0) {
|
|
|
|
ROCKS_LOG_ERROR(immutable_db_options_.info_log,
|
|
|
|
"Column family inconsistency: SST file contains data"
|
|
|
|
" beyond the point of corruption.");
|
|
|
|
return Status::Corruption("SST file is ahead of WALs in CF " +
|
|
|
|
cfd->GetName());
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
// True if there's any data in the WALs; if not, we can skip re-processing
|
|
|
|
// them later
|
|
|
|
bool data_seen = false;
|
|
|
|
if (!read_only) {
|
|
|
|
// no need to refcount since client still doesn't have access
|
|
|
|
// to the DB and can not drop column families while we iterate
|
|
|
|
const WalNumber max_wal_number = wal_numbers.back();
|
|
|
|
for (auto cfd : *versions_->GetColumnFamilySet()) {
|
|
|
|
auto iter = version_edits.find(cfd->GetID());
|
|
|
|
assert(iter != version_edits.end());
|
|
|
|
VersionEdit* edit = &iter->second;
|
|
|
|
|
|
|
|
if (cfd->GetLogNumber() > max_wal_number) {
|
|
|
|
// Column family cfd has already flushed the data
|
|
|
|
// from all wals. Memtable has to be empty because
|
|
|
|
// we filter the updates based on wal_number
|
|
|
|
// (in WriteBatch::InsertInto)
|
|
|
|
assert(cfd->mem()->GetFirstSequenceNumber() == 0);
|
|
|
|
assert(edit->NumEntries() == 0);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
TEST_SYNC_POINT_CALLBACK(
|
|
|
|
"DBImpl::RecoverLogFiles:BeforeFlushFinalMemtable", /*arg=*/nullptr);
|
|
|
|
|
|
|
|
// flush the final memtable (if non-empty)
|
|
|
|
if (cfd->mem()->GetFirstSequenceNumber() != 0) {
|
|
|
|
// If flush happened in the middle of recovery (e.g. due to memtable
|
|
|
|
// being full), we flush at the end. Otherwise we'll need to record
|
|
|
|
// where we were on last flush, which makes the logic complicated.
|
|
|
|
if (flushed || !immutable_db_options_.avoid_flush_during_recovery) {
|
|
|
|
status = WriteLevel0TableForRecovery(job_id, cfd, cfd->mem(), edit);
|
|
|
|
if (!status.ok()) {
|
|
|
|
// Recovery failed
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
flushed = true;
|
|
|
|
|
|
|
|
cfd->CreateNewMemtable(*cfd->GetLatestMutableCFOptions(),
|
|
|
|
versions_->LastSequence());
|
|
|
|
}
|
|
|
|
data_seen = true;
|
|
|
|
}
|
|
|
|
|
|
|
|
// Update the log number info in the version edit corresponding to this
|
|
|
|
// column family. Note that the version edits will be written to MANIFEST
|
|
|
|
// together later.
|
|
|
|
// writing wal_number in the manifest means that any log file
|
|
|
|
// with number strictly less than (wal_number + 1) is already
|
|
|
|
// recovered and should be ignored on next reincarnation.
|
|
|
|
// Since we already recovered max_wal_number, we want all wals
|
|
|
|
// with numbers `<= max_wal_number` (includes this one) to be ignored
|
|
|
|
if (flushed || cfd->mem()->GetFirstSequenceNumber() == 0) {
|
|
|
|
edit->SetLogNumber(max_wal_number + 1);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (status.ok()) {
|
|
|
|
// we must mark the next log number as used, even though it's
|
|
|
|
// not actually used. that is because VersionSet assumes
|
|
|
|
// VersionSet::next_file_number_ always to be strictly greater than any
|
|
|
|
// log number
|
|
|
|
versions_->MarkFileNumberUsed(max_wal_number + 1);
|
Persist the new MANIFEST after successfully syncing the new WAL during recovery (#9922)
Summary:
In case of non-TransactionDB and avoid_flush_during_recovery = true, RocksDB won't
flush the data from WAL to L0 for all column families if possible. As a
result, not all column families can increase their log_numbers, and
min_log_number_to_keep won't change.
For transaction DB (.allow_2pc), even with the flush, there may be old WAL files that it must not delete because they can contain data of uncommitted transactions and min_log_number_to_keep won't change.
If we persist a new MANIFEST with
advanced log_numbers for some column families, then during a second
crash after persisting the MANIFEST, RocksDB will see some column
families' log_numbers larger than the corrupted wal, and the "column family inconsistency" error will be hit, causing recovery to fail.
As a solution, RocksDB will persist the new MANIFEST after successfully syncing the new WAL.
If a future recovery starts from the new MANIFEST, then it means the new WAL is successfully synced. Due to the sentinel empty write batch at the beginning, kPointInTimeRecovery of WAL is guaranteed to go after this point.
If a future recovery starts from the old MANIFEST, it means writing the new MANIFEST failed. We won't have the "SST ahead of WAL" error.
Currently, RocksDB DB::Open() may create and write to two new MANIFEST files even before recovery succeeds. This PR buffers the edits in a structure and writes to a new MANIFEST after recovery is successful.
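The buffering idea above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `EditSketch`, `RecoveryBufferSketch`, and `PersistIfWalSynced` are hypothetical names, an edit is reduced to a single log number, and the MANIFEST is modeled as a plain vector.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical, simplified stand-in for a version edit: only the log number.
struct EditSketch {
  uint64_t log_number;
};

// Collects edits per column family ID during recovery instead of writing
// them to the MANIFEST immediately.
class RecoveryBufferSketch {
 public:
  void Update(uint32_t cf_id, const EditSketch& edit) {
    pending_[cf_id].push_back(edit);
  }

  // Appends all buffered edits to `manifest` only if the new WAL was synced.
  // Returns true if the edits were persisted.
  bool PersistIfWalSynced(bool new_wal_synced,
                          std::vector<EditSketch>* manifest) {
    assert(manifest != nullptr);
    if (!new_wal_synced) {
      return false;  // The old MANIFEST stays untouched.
    }
    for (const auto& entry : pending_) {
      manifest->insert(manifest->end(), entry.second.begin(),
                       entry.second.end());
    }
    pending_.clear();
    return true;
  }

 private:
  std::map<uint32_t, std::vector<EditSketch>> pending_;
};
```

Because nothing reaches the modeled MANIFEST until the sync has succeeded, a crash before that point leaves the old log numbers in place, which is the property the summary relies on to avoid the "SST ahead of WAL" error.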
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9922
Test Plan:
1. Update unit tests to fail without this change
2. make crash_test -j
Branch with unit test and no fix https://github.com/facebook/rocksdb/pull/9942 to keep track of unit test (without fix)
Reviewed By: riversand963
Differential Revision: D36043701
Pulled By: akankshamahajan15
fbshipit-source-id: 5760970db0a0920fb73d3c054a4155733500acd9
3 years ago
|
|
|
assert(recovery_ctx != nullptr);
|
|
|
|
|
|
|
|
for (auto* cfd : *versions_->GetColumnFamilySet()) {
|
|
|
|
auto iter = version_edits.find(cfd->GetID());
|
|
|
|
assert(iter != version_edits.end());
|
|
|
|
recovery_ctx->UpdateVersionEdits(cfd, iter->second);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (flushed || !data_seen) {
|
|
|
|
VersionEdit wal_deletion;
|
Fix a race condition in WAL tracking causing DB open failure (#9715)
Summary:
There is a race condition if WAL tracking in the MANIFEST is enabled in a database that disables 2PC.
The race condition is between two background flush threads trying to install flush results to the MANIFEST.
Consider an example database with two column families: "default" (cfd0) and "cf1" (cfd1). Initially,
both column families have one mutable (active) memtable whose data is backed by 6.log.
1. Trigger a manual flush for "cf1", creating a 7.log
2. Insert another key to "default", and trigger flush for "default", creating 8.log
3. BgFlushThread1 finishes writing 9.sst
4. BgFlushThread2 finishes writing 10.sst
```
Time  BgFlushThread1                          BgFlushThread2
 |    mutex_.Lock()
 |    precompute min_wal_to_keep as 6
 |    mutex_.Unlock()
 |                                            mutex_.Lock()
 |                                            precompute min_wal_to_keep as 6
 |                                            join MANIFEST write queue and mutex_.Unlock()
 |    write to MANIFEST
 |    mutex_.Lock()
 |    cfd1->log_number = 7
 |    Signal bg_flush_2 and mutex_.Unlock()
 |                                            wake up and mutex_.Lock()
 |                                            cfd0->log_number = 8
 |                                            FindObsoleteFiles() with job_context->log_number == 7
 |                                            mutex_.Unlock()
 |                                            PurgeObsoleteFiles() deletes 6.log
 V
```
As shown above, BgFlushThread2 thinks that the min wal to keep is 6.log because "cf1" has unflushed data in 6.log (cf1.log_number=6).
Similarly, BgFlushThread1 thinks that the min wal to keep is also 6.log because "default" has unflushed data (default.log_number=6).
No WAL deletion will be written to MANIFEST because 6 is equal to `versions_->wals_.min_wal_number_to_keep`,
due to https://github.com/facebook/rocksdb/blob/7.1.fb/db/memtable_list.cc#L513:L514.
The bg flush thread that finishes last will perform file purging. `job_context.log_number` will be evaluated as 7, i.e.
the min wal that contains unflushed data, causing 6.log to be deleted. However, MANIFEST thinks 6.log should still exist.
If you close the db at this point, you won't be able to re-open it if `track_and_verify_wals_in_manifest` is true.
We must handle the case of multiple bg flush threads, and it is difficult for one bg flush thread to know
the correct min wal number until the other bg flush threads have finished committing to the manifest and updated
the `cfd::log_number`.
To fix this issue, we rename an existing variable `min_log_number_to_keep_2pc` to `min_log_number_to_keep`,
and use it to track WAL file deletion in non-2pc mode as well.
This variable is updated only 1) during recovery with mutex held, or 2) in the MANIFEST write thread.
`min_log_number_to_keep` means RocksDB will delete WALs below it, although there may be WALs
above it which are also obsolete. Formally, we will have [min_wal_to_keep, max_obsolete_wal]. During recovery, we
make sure that only WALs above max_obsolete_wal are checked and added back to `alive_log_files_`.
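The deletion rule implied by `min_log_number_to_keep` can be illustrated with a small sketch. `WalsSafeToDelete` is a hypothetical helper, not a RocksDB function; it only models the statement that WALs numbered below the threshold may be deleted.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Returns the WAL numbers that are strictly below `min_log_number_to_keep`.
// WALs at or above the threshold are kept, even if some of them happen to be
// obsolete as well.
std::vector<uint64_t> WalsSafeToDelete(
    const std::vector<uint64_t>& wal_numbers,
    uint64_t min_log_number_to_keep) {
  std::vector<uint64_t> deletable;
  for (uint64_t wal : wal_numbers) {
    if (wal < min_log_number_to_keep) {
      deletable.push_back(wal);
    }
  }
  return deletable;
}
```

In the two-thread scenario above, while the threshold is still 6, 6.log is not deletable; it becomes deletable only once the threshold has advanced to 7.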
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9715
Test Plan:
```
make check
```
Also ran stress test below (with asan) to make sure it completes successfully.
```
TEST_TMPDIR=/dev/shm/rocksdb OPT=-g ASAN_OPTIONS=disable_coredump=0 \
CRASH_TEST_EXT_ARGS=--compression_type=zstd SKIP_FORMAT_BUCK_CHECKS=1 \
make J=52 -j52 blackbox_asan_crash_test
```
Reviewed By: ltamasi
Differential Revision: D34984412
Pulled By: riversand963
fbshipit-source-id: c7b21a8d84751bb55ea79c9f387103d21b231005
3 years ago
|
|
|
if (immutable_db_options_.track_and_verify_wals_in_manifest) {
|
|
|
|
wal_deletion.DeleteWalsBefore(max_wal_number + 1);
|
|
|
|
}
|
|
|
|
if (!allow_2pc()) {
|
|
|
|
// In non-2pc mode, flushing the memtables of the column families
|
|
|
|
// means we can advance min_log_number_to_keep.
|
|
|
|
wal_deletion.SetMinLogNumberToKeep(max_wal_number + 1);
|
|
|
|
}
|
|
|
|
assert(versions_->GetColumnFamilySet() != nullptr);
|
|
|
|
recovery_ctx->UpdateVersionEdits(
|
|
|
|
versions_->GetColumnFamilySet()->GetDefault(), wal_deletion);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (status.ok()) {
|
|
|
|
if (data_seen && !flushed) {
|
|
|
|
status = RestoreAliveLogFiles(wal_numbers);
|
    } else if (!wal_numbers.empty()) {  // If there's no data in the WAL, or we
                                        // flushed all the data, still
      // truncate the log file. If the process goes into a crash loop before
      // the file is deleted, the preallocated space will never get freed.
      const bool truncate = !read_only;
      GetLogSizeAndMaybeTruncate(wal_numbers.back(), truncate, nullptr)
          .PermitUncheckedError();
    }
  }

  event_logger_.Log() << "job" << job_id << "event"
                      << "recovery_finished";

  return status;
}

Status DBImpl::GetLogSizeAndMaybeTruncate(uint64_t wal_number, bool truncate,
                                          LogFileNumberSize* log_ptr) {
  LogFileNumberSize log(wal_number);
  std::string fname =
      LogFileName(immutable_db_options_.GetWalDir(), wal_number);
  Status s;
  // This gets the apparent size of the WAL, not including preallocated space.
  s = env_->GetFileSize(fname, &log.size);
Update WAL corruption test so that it fails without fix (#9942)
Summary:
In case of non-TransactionDB and avoid_flush_during_recovery = true, RocksDB won't
flush the data from WAL to L0 for all column families if possible. As a
result, not all column families can increase their log_numbers, and
min_log_number_to_keep won't change.
For transaction DB (.allow_2pc), even with the flush, there may be old WAL files that it must not delete because they can contain data of uncommitted transactions and min_log_number_to_keep won't change.
If we persist a new MANIFEST with
advanced log_numbers for some column families, then during a second
crash after persisting the MANIFEST, RocksDB will see some column
families' log_numbers larger than the corrupted WAL, and the "column family inconsistency" error will be hit, causing recovery to fail.
This PR updates the unit tests to emulate the errors; the tests fail without a fix.
Error:
```
[ RUN ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecovery/0
db/corruption_test.cc:1190: Failure
DB::Open(options, dbname_, cf_descs, &handles, &db_)
Corruption: SST file is ahead of WALs in CF test_cf
[ FAILED ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecovery/0, where GetParam() = (true, false) (91 ms)
[ RUN ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecovery/1
db/corruption_test.cc:1190: Failure
DB::Open(options, dbname_, cf_descs, &handles, &db_)
Corruption: SST file is ahead of WALs in CF test_cf
[ FAILED ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecovery/1, where GetParam() = (false, false) (92 ms)
[ RUN ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecovery/2
db/corruption_test.cc:1190: Failure
DB::Open(options, dbname_, cf_descs, &handles, &db_)
Corruption: SST file is ahead of WALs in CF test_cf
[ FAILED ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecovery/2, where GetParam() = (true, true) (95 ms)
[ RUN ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecovery/3
db/corruption_test.cc:1190: Failure
DB::Open(options, dbname_, cf_descs, &handles, &db_)
Corruption: SST file is ahead of WALs in CF test_cf
[ FAILED ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecovery/3, where GetParam() = (false, true) (92 ms)
[ RUN ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.TxnDbCrashDuringRecovery/0
db/corruption_test.cc:1354: Failure
TransactionDB::Open(options, txn_db_opts, dbname_, cf_descs, &handles, &txn_db)
Corruption: SST file is ahead of WALs in CF default
[ FAILED ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.TxnDbCrashDuringRecovery/0, where GetParam() = (true, false) (94 ms)
[ RUN ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.TxnDbCrashDuringRecovery/1
db/corruption_test.cc:1354: Failure
TransactionDB::Open(options, txn_db_opts, dbname_, cf_descs, &handles, &txn_db)
Corruption: SST file is ahead of WALs in CF default
[ FAILED ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.TxnDbCrashDuringRecovery/1, where GetParam() = (false, false) (97 ms)
[ RUN ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.TxnDbCrashDuringRecovery/2
db/corruption_test.cc:1354: Failure
TransactionDB::Open(options, txn_db_opts, dbname_, cf_descs, &handles, &txn_db)
Corruption: SST file is ahead of WALs in CF default
[ FAILED ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.TxnDbCrashDuringRecovery/2, where GetParam() = (true, true) (94 ms)
[ RUN ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.TxnDbCrashDuringRecovery/3
db/corruption_test.cc:1354: Failure
TransactionDB::Open(options, txn_db_opts, dbname_, cf_descs, &handles, &txn_db)
Corruption: SST file is ahead of WALs in CF default
[ FAILED ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.TxnDbCrashDuringRecovery/3, where GetParam() = (false, true) (91 ms)
[ RUN ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecoveryWithFlush/0
db/corruption_test.cc:1483: Failure
DB::Open(options, dbname_, cf_descs, &handles, &db_)
Corruption: SST file is ahead of WALs in CF default
[ FAILED ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecoveryWithFlush/0, where GetParam() = (true, false) (93 ms)
[ RUN ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecoveryWithFlush/1
db/corruption_test.cc:1483: Failure
DB::Open(options, dbname_, cf_descs, &handles, &db_)
Corruption: SST file is ahead of WALs in CF default
[ FAILED ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecoveryWithFlush/1, where GetParam() = (false, false) (94 ms)
[ RUN ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecoveryWithFlush/2
db/corruption_test.cc:1483: Failure
DB::Open(options, dbname_, cf_descs, &handles, &db_)
Corruption: SST file is ahead of WALs in CF default
[ FAILED ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecoveryWithFlush/2, where GetParam() = (true, true) (90 ms)
[ RUN ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecoveryWithFlush/3
db/corruption_test.cc:1483: Failure
DB::Open(options, dbname_, cf_descs, &handles, &db_)
Corruption: SST file is ahead of WALs in CF default
[ FAILED ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecoveryWithFlush/3, where GetParam() = (false, true) (93 ms)
[----------] 12 tests from CorruptionTest/CrashDuringRecoveryWithCorruptionTest (1116 ms total)
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9942
Test Plan: Not needed
Reviewed By: riversand963
Differential Revision: D36324112
Pulled By: akankshamahajan15
fbshipit-source-id: cab2075ac4ebe48f5ef93a6ea162558aa4fc334d
3 years ago
  TEST_SYNC_POINT_CALLBACK("DBImpl::GetLogSizeAndMaybeTruncate:0", /*arg=*/&s);
  if (s.ok() && truncate) {
    std::unique_ptr<FSWritableFile> last_log;
    Status truncate_status = fs_->ReopenWritableFile(
        fname,
        fs_->OptimizeForLogWrite(
            file_options_,
            BuildDBOptions(immutable_db_options_, mutable_db_options_)),
        &last_log, nullptr);
    if (truncate_status.ok()) {
      truncate_status = last_log->Truncate(log.size, IOOptions(), nullptr);
    }
    if (truncate_status.ok()) {
      truncate_status = last_log->Close(IOOptions(), nullptr);
    }
    // Failing to truncate is not a critical error.
    if (!truncate_status.ok() && !truncate_status.IsNotSupported()) {
      ROCKS_LOG_WARN(immutable_db_options_.info_log,
                     "Failed to truncate log #%" PRIu64 ": %s", wal_number,
                     truncate_status.ToString().c_str());
    }
  }
  if (log_ptr) {
    *log_ptr = log;
  }
  return s;
}

Status DBImpl::RestoreAliveLogFiles(const std::vector<uint64_t>& wal_numbers) {
  if (wal_numbers.empty()) {
    return Status::OK();
  }
  Status s;
  mutex_.AssertHeld();
  assert(immutable_db_options_.avoid_flush_during_recovery);
  // Mark these as alive so they'll be considered for deletion later by
  // FindObsoleteFiles()
  total_log_size_ = 0;
  log_empty_ = false;
Fix a race condition in WAL tracking causing DB open failure (#9715)
Summary:
There is a race condition if WAL tracking in the MANIFEST is enabled in a database that disables 2PC.
The race condition is between two background flush threads trying to install flush results to the MANIFEST.
Consider an example database with two column families: "default" (cfd0) and "cf1" (cfd1). Initially,
both column families have one mutable (active) memtable whose data is backed by 6.log.
1. Trigger a manual flush for "cf1", creating a 7.log
2. Insert another key to "default", and trigger flush for "default", creating 8.log
3. BgFlushThread1 finishes writing 9.sst
4. BgFlushThread2 finishes writing 10.sst
```
Time  BgFlushThread1                              BgFlushThread2
 |    mutex_.Lock()
 |    precompute min_wal_to_keep as 6
 |    mutex_.Unlock()
 |                                                mutex_.Lock()
 |                                                precompute min_wal_to_keep as 6
 |                                                join MANIFEST write queue and mutex_.Unlock()
 |    write to MANIFEST
 |    mutex_.Lock()
 |    cfd1->log_number = 7
 |    Signal bg_flush_2 and mutex_.Unlock()
 |                                                wake up and mutex_.Lock()
 |                                                cfd0->log_number = 8
 |                                                FindObsoleteFiles() with job_context->log_number == 7
 |                                                mutex_.Unlock()
 |                                                PurgeObsoleteFiles() deletes 6.log
 V
```
As shown above, BgFlushThread2 thinks that the min WAL to keep is 6.log because "cf1" has unflushed data in 6.log (cf1.log_number=6).
Similarly, BgFlushThread1 thinks that the min WAL to keep is also 6.log because "default" has unflushed data (default.log_number=6).
No WAL deletion will be written to MANIFEST because 6 is equal to `versions_->wals_.min_wal_number_to_keep`,
due to https://github.com/facebook/rocksdb/blob/7.1.fb/db/memtable_list.cc#L513:L514.
The bg flush thread that finishes last will perform file purging. `job_context.log_number` will be evaluated as 7, i.e.
the min wal that contains unflushed data, causing 6.log to be deleted. However, MANIFEST thinks 6.log should still exist.
If you close the db at this point, you won't be able to re-open it if `track_and_verify_wal_in_manifest` is true.
We must handle the case of multiple bg flush threads, and it is difficult for one bg flush thread to know
the correct min wal number until the other bg flush threads have finished committing to the manifest and updated
the `cfd::log_number`.
To fix this issue, we rename an existing variable `min_log_number_to_keep_2pc` to `min_log_number_to_keep`,
and use it to track WAL file deletion in non-2pc mode as well.
This variable is updated only 1) during recovery with mutex held, or 2) in the MANIFEST write thread.
`min_log_number_to_keep` means RocksDB will delete WALs below it, although there may be WALs
above it which are also obsolete. Formally, we will have [min_wal_to_keep, max_obsolete_wal]. During recovery, we
make sure that only WALs above max_obsolete_wal are checked and added back to `alive_log_files_`.
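A rough sketch of the two rules above, under stated assumptions: `SketchMinLogToKeep` and `SketchSelectAliveWals` are hypothetical names, and the "only moves forward" behavior of the tracker is an assumption of this sketch and is not stated in this PR. The filter follows the shape of the non-2pc check in `RestoreAliveLogFiles()`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical tracker for a min_log_number_to_keep style value. WALs with
// a number below the tracked value are treated as deletable.
class SketchMinLogToKeep {
 public:
  // Assumption of this sketch: the value never moves backwards.
  void Advance(uint64_t candidate) {
    if (candidate > value_) {
      value_ = candidate;
    }
  }
  bool CanDelete(uint64_t wal_number) const { return wal_number < value_; }
  uint64_t value() const { return value_; }

 private:
  uint64_t value_ = 0;
};

// Hypothetical helper: in non-2pc mode, WALs that do not back unflushed data
// are skipped and never added back to the alive list.
std::vector<uint64_t> SketchSelectAliveWals(
    const std::vector<uint64_t>& wal_numbers,
    uint64_t min_wal_with_unflushed_data, bool allow_2pc) {
  std::vector<uint64_t> alive;
  for (uint64_t wal_number : wal_numbers) {
    if (!allow_2pc && wal_number < min_wal_with_unflushed_data) {
      continue;
    }
    alive.push_back(wal_number);
  }
  return alive;
}
```

With WALs {6, 7, 8} and 7 as the smallest WAL holding unflushed data, the non-2pc filter keeps only 7 and 8, so an already obsolete 6.log is not resurrected during recovery.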
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9715
Test Plan:
```
make check
```
Also ran stress test below (with asan) to make sure it completes successfully.
```
TEST_TMPDIR=/dev/shm/rocksdb OPT=-g ASAN_OPTIONS=disable_coredump=0 \
CRASH_TEST_EXT_ARGS=--compression_type=zstd SKIP_FORMAT_BUCK_CHECKS=1 \
make J=52 -j52 blackbox_asan_crash_test
```
Reviewed By: ltamasi
Differential Revision: D34984412
Pulled By: riversand963
fbshipit-source-id: c7b21a8d84751bb55ea79c9f387103d21b231005
3 years ago
  uint64_t min_wal_with_unflushed_data =
      versions_->MinLogNumberWithUnflushedData();
  for (auto wal_number : wal_numbers) {
    if (!allow_2pc() && wal_number < min_wal_with_unflushed_data) {
      // In non-2pc mode, the WAL files not backing unflushed data are not
      // alive, thus should not be added to the alive_log_files_.
      continue;
    }
    // We preallocate space for WALs, but after a crash and restart that
    // preallocated space is no longer needed. It is likely that only the last
    // log has such preallocated space, so we only truncate the last log.
    LogFileNumberSize log;
    s = GetLogSizeAndMaybeTruncate(
        wal_number, /*truncate=*/(wal_number == wal_numbers.back()), &log);
    if (!s.ok()) {
      break;
    }
    total_log_size_ += log.size;
    alive_log_files_.push_back(log);
  }
  return s;
}

Status DBImpl::WriteLevel0TableForRecovery(int job_id, ColumnFamilyData* cfd,
                                           MemTable* mem, VersionEdit* edit) {
  mutex_.AssertHeld();
Fix a silent data loss for write-committed txn (#9571)
Summary:
The following sequence of events can cause silent data loss for write-committed
transactions.
```
Time  thread 1                                        bg flush
 |    db->Put("a")
 |    txn = NewTxn()
 |    txn->Put("b", "v")
 |    txn->Prepare()        // writes only to 5.log
 |    db->SwitchMemtable()  // memtable 1 has "a"
 |                          // close 5.log,
 |                          // creates 8.log
 |    trigger flush
 |                                                    pick memtable 1
 |                                                    unlock db mutex
 |                                                    write new sst
 |    txn->ctwb->Put("gtid", "1")  // writes 8.log
 |    txn->Commit()  // writes to 8.log
 |                   // writes to memtable 2
 |                                                    compute min_log_number_to_keep_2pc, this
 |                                                    will be 8 (incorrect).
 |
 |                                                    Purge obsolete wals, including 5.log
 |
 V
```
At this point, the writes of txn exist only in the memtable. Close the db without flushing, because the db thinks the data in
the memtable is backed by the log. After reopening, the writes are lost except for the key-value pair {"gtid"->"1"};
only the commit marker of txn is in 8.log.
The reason lies in `PrecomputeMinLogNumberToKeep2PC()` which calls `FindMinPrepLogReferencedByMemTable()`.
In the above example, when bg flush thread tries to find obsolete wals, it uses the information
computed by `PrecomputeMinLogNumberToKeep2PC()`. The return value of `PrecomputeMinLogNumberToKeep2PC()`
depends on three components
- `PrecomputeMinLogNumberToKeepNon2PC()`. This represents the WAL that has unflushed data. As the name of this method suggests, it does not account for 2PC. Although the keys reside in the prepare section of a previous WAL, the column family references the current WAL when they are actually inserted into the memtable during txn commit.
- `prep_tracker->FindMinLogContainingOutstandingPrep()`. This represents the WAL that contains a prepare section whose txn hasn't committed.
- `FindMinPrepLogReferencedByMemTable()`. This represents the WAL on which some memtables (mutable and immutable) depend for their unflushed data.
The bug lies in `FindMinPrepLogReferencedByMemTable()`. Originally, this function skipped checking the column families
that are being flushed, but the unit test added in this PR shows that they should not be skipped. In this unit test, there is
only the default column family, and one of its memtables has unflushed data backed by a prepare section in 5.log.
We should return this information via `FindMinPrepLogReferencedByMemTable()`.
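How the three components combine can be sketched as below. This is a simplified illustration, not the body of `PrecomputeMinLogNumberToKeep2PC()`: the function name is hypothetical, and treating 0 as "no such WAL" for the two prepare-related inputs is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: returns the smallest WAL number that must be kept,
// given the three components listed above.
// Assumption of this sketch: 0 means "no such WAL" for the last two inputs.
uint64_t SketchMinLogToKeep2PC(uint64_t min_log_non_2pc,
                               uint64_t min_log_with_outstanding_prep,
                               uint64_t min_prep_log_referenced_by_memtable) {
  uint64_t result = min_log_non_2pc;
  if (min_log_with_outstanding_prep != 0 &&
      min_log_with_outstanding_prep < result) {
    result = min_log_with_outstanding_prep;
  }
  if (min_prep_log_referenced_by_memtable != 0 &&
      min_prep_log_referenced_by_memtable < result) {
    result = min_prep_log_referenced_by_memtable;
  }
  return result;
}
```

In the timeline above, the non-2PC component is 8 and the txn has already committed, so only the memtable-referenced component can report 5. If that component is skipped (passed as 0 here), the result is 8 and 5.log gets purged; with it, the result is 5 and 5.log is kept.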
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9571
Test Plan:
```
./transaction_test --gtest_filter=*/TransactionTest.SwitchMemtableDuringPrepareAndCommit_WC/*
make check
```
Reviewed By: siying
Differential Revision: D34235236
Pulled By: riversand963
fbshipit-source-id: 120eb21a666728a38dda77b96276c6af72b008b1
3 years ago
  assert(cfd);
  assert(cfd->imm());
  // The immutable memtable list must be empty.
  assert(std::numeric_limits<uint64_t>::max() ==
         cfd->imm()->GetEarliestMemTableID());

  const uint64_t start_micros = immutable_db_options_.clock->NowMicros();

  FileMetaData meta;
  std::vector<BlobFileAddition> blob_file_additions;

  std::unique_ptr<std::list<uint64_t>::iterator> pending_outputs_inserted_elem(
      new std::list<uint64_t>::iterator(
          CaptureCurrentFileNumberInPendingOutputs()));
  meta.fd = FileDescriptor(versions_->NewFileNumber(), 0, 0);
  ReadOptions ro;
  ro.total_order_seek = true;
  ro.io_activity = Env::IOActivity::kDBOpen;
  Arena arena;
  Status s;
  TableProperties table_properties;
  {
    ScopedArenaIterator iter(mem->NewIterator(ro, &arena));
    ROCKS_LOG_DEBUG(immutable_db_options_.info_log,
                    "[%s] [WriteLevel0TableForRecovery]"
                    " Level-0 table #%" PRIu64 ": started",
                    cfd->GetName().c_str(), meta.fd.GetNumber());

    // Get the latest mutable cf options while the mutex is still locked
    const MutableCFOptions mutable_cf_options =
        *cfd->GetLatestMutableCFOptions();
    bool paranoid_file_checks =
        cfd->GetLatestMutableCFOptions()->paranoid_file_checks;

    int64_t _current_time = 0;
    immutable_db_options_.clock->GetCurrentTime(&_current_time)
        .PermitUncheckedError();  // ignore error
    const uint64_t current_time = static_cast<uint64_t>(_current_time);
    meta.oldest_ancester_time = current_time;
Sort L0 files by newly introduced epoch_num (#10922)
Summary:
**Context:**
Sorting L0 files by `largest_seqno` has at least two inconveniences:
- File ingestion and compaction involving ingested files can create files whose seqno range overlaps with existing files. `force_consistency_check=true` will catch such overlapping seqno ranges, even harmless ones.
  - For example, consider the following sequence of events ("key@n" indicates key at seqno "n")
    - insert k1@1 to memtable m1
    - ingest file s1 with k2@2, ingest file s2 with k3@3
    - insert k4@4 to m1
    - compact files s1, s2 and result in new file s3 of seqno range [2, 3]
    - flush m1 and result in new file s4 of seqno range [1, 4]. `force_consistency_check=true` will then think s4 and s3 have a file reordering corruption that might cause returning an old value of k1
  - However, this caught corruption is a false positive, since s1 and s2 will not have keys overlapping with k1 or anything inserted into m1 before ingesting s1, by the requirement of file ingestion (otherwise m1 would be flushed first, before any of the file ingestion completes). Therefore, there is in fact no file reordering corruption.
- Single delete can decrease a file's largest seqno, and ordering by `largest_seqno` can then introduce a wrong ordering, hence a file reordering corruption
  - For example, consider the following sequence of events ("key@n" indicates key at seqno "n", credit to ajkr for this example)
    - an existing SST s1 contains only k1@1
    - insert k1@2 to memtable m1
    - ingest file s2 with k3@3, ingest file s3 with k4@4
    - insert single delete k5@5 in m1
    - flush m1 and result in new file s4 of seqno range [2, 5]
    - compact s1, s2, s3 and result in new file s5 of seqno range [1, 4]
    - compact s4 and result in new file s6 of seqno range [2] due to single delete
  - By the last step, we have file ordering by largest seqno (">" means "newer"): s5 > s6, while s6 contains a newer version of k1's value (i.e., k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent it from happening when ordering by `largest_seqno`
Therefore, we are redesigning the sorting criteria of L0 files to avoid the above inconveniences. Credit to ajkr, we now introduce `epoch_num`, which describes the order of a file being flushed or ingested/imported (a compaction output file will have the minimum `epoch_num` among its input files'). This avoids the above inconveniences in the following ways:
- In the first case above, `force_consistency_check=true` will no longer check for overlapping seqno ranges but will check `epoch_number` ordering instead. This results in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction), which won't trigger a false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more.
- In the second case above, this results in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which is a correct file ordering that causes no corruption.
**Summary:**
- Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`.
  - `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`)
  - Compaction output file is assigned the minimum `epoch_number` among input files'
    - Refit level: reuse refitted file's epoch_number
  - Other paths needing `epoch_number` treatment:
    - Import column families: reuse file's epoch_number if it exists. If not, assign one based on `NewestFirstBySeqNo`
    - Repair: reuse file's epoch_number if it exists. If not, assign one based on `NewestFirstBySeqNo`.
  - Assigning a new epoch_number to a file and adding this file to the LSM tree should be atomic. This is guaranteed by assigning epoch_number right upon `VersionEdit::AddFile()`, where this version edit will be applied to the LSM tree shape right after, either while holding the db mutex (e.g., flush, file ingestion, import column family) or because there is only 1 ongoing edit per CF (e.g., WriteLevel0TableForRecovery, Repair).
  - Assigning the minimum input epoch number to the compaction output file won't misorder L0 files (even through a later `Refit(target_level=0)`). This is because, for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction, hence no misorder.
- Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery
  - Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more
  - Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag`
- Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above
  - Due to the backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use the old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them until we infer their epoch number. See usages of `EpochNumberRequirement`.
- Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and its outdated tests for file reordering corruption, because that fix can be replaced by this PR.
- Misc:
  - update existing tests with `epoch_number` so make check will pass
  - update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases
  - assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber()
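The ordering rule can be illustrated with a small sketch based on the single-delete example from the context section. `SketchL0File`, `SketchSortL0NewestFirst`, and `SketchCompactionOutputEpoch` are hypothetical names for illustration and not the actual RocksDB comparator or metadata types; the concrete epoch numbers 1 to 4 are also assumed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, trimmed-down file metadata for illustration only.
struct SketchL0File {
  std::string name;
  uint64_t epoch_number;
  uint64_t largest_seqno;
};

// Orders L0 files newest first by epoch_number (largest_seqno is ignored).
void SketchSortL0NewestFirst(std::vector<SketchL0File>* files) {
  std::sort(files->begin(), files->end(),
            [](const SketchL0File& a, const SketchL0File& b) {
              return a.epoch_number > b.epoch_number;
            });
}

// A compaction output takes the minimum epoch_number among its inputs.
// Precondition: inputs is non-empty.
uint64_t SketchCompactionOutputEpoch(const std::vector<SketchL0File>& inputs) {
  assert(!inputs.empty());
  uint64_t min_epoch = inputs.front().epoch_number;
  for (const auto& file : inputs) {
    min_epoch = std::min(min_epoch, file.epoch_number);
  }
  return min_epoch;
}
```

Taking s1, s2, s3 with assumed epochs 1, 2, 3 and s4 with epoch 4, the output s5 gets epoch 1 and s6 (compacted from s4 alone) keeps epoch 4. Sorting by epoch then places s6 before s5 as the newer file, even though s5 has the larger `largest_seqno`.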
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922
Test Plan:
- `make check`
- New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc`
- Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930
- [Ongoing] Compatibility test: manually run https://github.com/ajkr/rocksdb/commit/36a5686ec012f35a4371e409aa85c404ca1c210d (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox`
- [Ongoing] normal db stress test
- [Ongoing] db stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761
Reviewed By: ajkr
Differential Revision: D41063187
Pulled By: hx235
fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9
2 years ago
    meta.epoch_number = cfd->NewEpochNumber();
    {
      auto write_hint = cfd->CalculateSSTWriteHint(0);
      mutex_.Unlock();

      SequenceNumber earliest_write_conflict_snapshot;
      std::vector<SequenceNumber> snapshot_seqs =
          snapshots_.GetAll(&earliest_write_conflict_snapshot);
      auto snapshot_checker = snapshot_checker_.get();
      if (use_custom_gc_ && snapshot_checker == nullptr) {
        snapshot_checker = DisableGCSnapshotChecker::Instance();
      }
      std::vector<std::unique_ptr<FragmentedRangeTombstoneIterator>>
          range_del_iters;
      auto range_del_iter =
          // This is called during recovery, where a live memtable is flushed
          // directly. In this case, no fragmented tombstone list is cached in
          // this memtable yet.
          mem->NewRangeTombstoneIterator(ro, kMaxSequenceNumber,
                                         false /* immutable_memtable */);
      if (range_del_iter != nullptr) {
        range_del_iters.emplace_back(range_del_iter);
      }
Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. The user is informed of the Status and needs to take action to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
5 years ago
      IOStatus io_s;
      TableBuilderOptions tboptions(
          *cfd->ioptions(), mutable_cf_options, cfd->internal_comparator(),
          cfd->int_tbl_prop_collector_factories(),
          GetCompressionFlush(*cfd->ioptions(), mutable_cf_options),
Add more LSM info to FilterBuildingContext (#8246)
Summary:
Add `num_levels`, `is_bottommost`, and table file creation
`reason` to `FilterBuildingContext`, in anticipation of more powerful
Bloom-like filter support.
To support this, added `is_bottommost` and `reason` to
`TableBuilderOptions`, which allowed removing `reason` parameter from
`rocksdb::BuildTable`.
I attempted to remove `skip_filters` from `TableBuilderOptions`, because
filter construction decisions should arise from options, not one-off
parameters. I could not completely remove it because the public API for
SstFileWriter takes a `skip_filters` parameter, and translating this
into an option change would mean awkwardly replacing the table_factory
if it is BlockBasedTableFactory with new filter_policy=nullptr option.
I marked this public skip_filters option as deprecated because of this
oddity. (skip_filters on the read side probably makes sense.)
At least `skip_filters` is now largely hidden for users of
`TableBuilderOptions` and is no longer used for implementing the
optimize_filters_for_hits option. Bringing the logic for that option
closer to handling of FilterBuildingContext makes it more obvious that
these two are using the same notion of "bottommost." (Planned:
configuration options for Bloom-like filters that generalize
`optimize_filters_for_hits`)
Recommended follow-up: Try to get away from "bottommost level" naming of
things, which is inaccurate (see
VersionStorageInfo::RangeMightExistAfterSortedRun), and move to
"bottommost run" or just "bottommost."
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8246
Test Plan:
extended an existing unit test to exercise and check various
filter building contexts. Also, existing tests for
optimize_filters_for_hits validate some of the "bottommost" handling,
which is now closely connected to FilterBuildingContext::is_bottommost
through TableBuilderOptions::is_bottommost
Reviewed By: mrambacher
Differential Revision: D28099346
Pulled By: pdillinger
fbshipit-source-id: 2c1072e29c24d4ac404c761a7b7663292372600a
4 years ago
          mutable_cf_options.compression_opts, cfd->GetID(), cfd->GetName(),
          0 /* level */, false /* is_bottommost */,
          TableFileCreationReason::kRecovery, 0 /* oldest_key_time */,
          0 /* file_creation_time */, db_id_, db_session_id_,
          0 /* target_file_size */, meta.fd.GetNumber());
      SeqnoToTimeMapping empty_seqno_time_mapping;
      Version* version = cfd->current();
      version->Ref();
      const ReadOptions read_option(Env::IOActivity::kDBOpen);
      uint64_t num_input_entries = 0;
      s = BuildTable(
          dbname_, versions_.get(), immutable_db_options_, tboptions,
          file_options_for_compaction_, read_option, cfd->table_cache(),
          iter.get(), std::move(range_del_iters), &meta, &blob_file_additions,
CompactionIterator sees consistent view of which keys are committed (#9830)
Summary:
**This PR does not affect the functionality of `DB` and write-committed transactions.**
`CompactionIterator` uses `KeyCommitted(seq)` to determine if a key in the database is committed.
As the name 'write-committed' implies, if write-committed policy is used, a key exists in the database only if
it is committed. In fact, the implementation of `KeyCommitted()` is as follows:
```
inline bool KeyCommitted(SequenceNumber seq) {
  // For non-txn-db and write-committed, snapshot_checker_ is always nullptr.
  return snapshot_checker_ == nullptr ||
         snapshot_checker_->CheckInSnapshot(seq, kMaxSequence) ==
             SnapshotCheckerResult::kInSnapshot;
}
```
With that being said, we focus on write-prepared/write-unprepared transactions.
A few notes:
- A key can exist in the db even if it's uncommitted. Therefore, we rely on `snapshot_checker_` to determine data visibility. We also require that all writes go through transaction API instead of the raw `WriteBatch` + `Write`, thus at most one uncommitted version of one user key can exist in the database.
- `CompactionIterator` outputs a key as long as the key is uncommitted.
Due to the above reasons, it is possible that `CompactionIterator` decides to output an uncommitted key without
doing further checks on the key (`NextFromInput()`). By the time the key is being prepared for output, the key may have become
committed, because `snapshot_checker_->CheckInSnapshot(seq, kMaxSequence)` now returns `kInSnapshot` in the implementation of `KeyCommitted()`.
Then `CompactionIterator` will try to zero its sequence number and hit an assertion error if the key is a tombstone.
To fix this issue, we should make the `CompactionIterator` see a consistent view of the input keys. Note that
for write-prepared/write-unprepared, the background flush/compaction jobs already take a "job snapshot" before starting
processing keys. The job snapshot is released only after the entire flush/compaction finishes. We can use this snapshot
to determine whether a key is committed or not with minor change to `KeyCommitted()`.
```
inline bool KeyCommitted(SequenceNumber sequence) {
  // For non-txn-db and write-committed, snapshot_checker_ is always nullptr.
  return snapshot_checker_ == nullptr ||
         snapshot_checker_->CheckInSnapshot(sequence, job_snapshot_) ==
             SnapshotCheckerResult::kInSnapshot;
}
```
As a result, whether a key is committed or not remains constant throughout the compaction, causing no trouble
for `CompactionIterator`'s assertions.
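The difference between checking against the latest state and checking against a fixed job snapshot can be illustrated with a minimal, self-contained sketch. `ToySnapshotChecker`, its `commit_seq` map, and both helper functions are hypothetical stand-ins for illustration only; they are not RocksDB's `SnapshotChecker` API.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <map>

using SequenceNumber = uint64_t;
constexpr SequenceNumber kMaxSequence =
    std::numeric_limits<SequenceNumber>::max();

// Hypothetical stand-in for a snapshot checker: maps the sequence number of
// a prepared key to the sequence number at which it was committed.
struct ToySnapshotChecker {
  std::map<SequenceNumber, SequenceNumber> commit_seq;

  // A key is visible in `snapshot` only if it was committed at or before it.
  bool InSnapshot(SequenceNumber seq, SequenceNumber snapshot) const {
    auto it = commit_seq.find(seq);
    return it != commit_seq.end() && it->second <= snapshot;
  }
};

// Old behavior: the answer can flip while a compaction is running, because
// it depends on whatever has been committed so far.
inline bool KeyCommittedLatest(const ToySnapshotChecker& checker,
                               SequenceNumber seq) {
  return checker.InSnapshot(seq, kMaxSequence);
}

// New behavior: the answer is pinned to the job snapshot taken before the
// compaction started, so it cannot change mid-job.
inline bool KeyCommittedAtJobSnapshot(const ToySnapshotChecker& checker,
                                      SequenceNumber seq,
                                      SequenceNumber job_snapshot) {
  return checker.InSnapshot(seq, job_snapshot);
}
```

In this sketch, a key that gets committed after the job snapshot is reported as committed by `KeyCommittedLatest` but stays uncommitted for `KeyCommittedAtJobSnapshot`, which is the stable view the fix relies on.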
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9830
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D35561162
Pulled By: riversand963
fbshipit-source-id: 0e00d200c195240341cfe6d34cbc86798b315b9f
3 years ago
          snapshot_seqs, earliest_write_conflict_snapshot, kMaxSequenceNumber,
          snapshot_checker, paranoid_file_checks, cfd->internal_stats(), &io_s,
          io_tracer_, BlobFileCreationReason::kRecovery,
          empty_seqno_time_mapping, &event_logger_, job_id, Env::IO_HIGH,
          nullptr /* table_properties */, write_hint,
          nullptr /*full_history_ts_low*/, &blob_callback_, version,
          &num_input_entries);
      version->Unref();
      LogFlush(immutable_db_options_.info_log);
      ROCKS_LOG_DEBUG(immutable_db_options_.info_log,
                      "[%s] [WriteLevel0TableForRecovery]"
                      " Level-0 table #%" PRIu64 ": %" PRIu64 " bytes %s",
                      cfd->GetName().c_str(), meta.fd.GetNumber(),
                      meta.fd.GetFileSize(), s.ToString().c_str());
      mutex_.Lock();

      // TODO(AR) is this ok?
      if (!io_s.ok() && s.ok()) {
        s = io_s;
      }

      uint64_t total_num_entries = mem->num_entries();
      if (s.ok() && total_num_entries != num_input_entries) {
        std::string msg = "Expected " + std::to_string(total_num_entries) +
                          " entries in memtable, but read " +
                          std::to_string(num_input_entries);
        ROCKS_LOG_WARN(immutable_db_options_.info_log,
                       "[%s] [JOB %d] Level-0 flush during recover: %s",
                       cfd->GetName().c_str(), job_id, msg.c_str());
        if (immutable_db_options_.flush_verify_memtable_count) {
          s = Status::Corruption(msg);
        }
      }
    }
  }
  ReleaseFileNumberFromPendingOutputs(pending_outputs_inserted_elem);

  // Note that if file_size is zero, the file has been deleted and
  // should not be added to the manifest.
  const bool has_output = meta.fd.GetFileSize() > 0;

  constexpr int level = 0;

  if (s.ok() && has_output) {
    edit->AddFile(level, meta.fd.GetNumber(), meta.fd.GetPathId(),
                  meta.fd.GetFileSize(), meta.smallest, meta.largest,
                  meta.fd.smallest_seqno, meta.fd.largest_seqno,
                  meta.marked_for_compaction, meta.temperature,
                  meta.oldest_blob_file_number, meta.oldest_ancester_time,
                  meta.file_creation_time, meta.epoch_number,
                  meta.file_checksum, meta.file_checksum_func_name,
                  meta.unique_id, meta.compensated_range_deletion_size,
                  meta.tail_size, meta.user_defined_timestamps_persisted);

    for (const auto& blob : blob_file_additions) {
      edit->AddBlobFile(blob);
    }
  }

  InternalStats::CompactionStats stats(CompactionReason::kFlush, 1);
  stats.micros = immutable_db_options_.clock->NowMicros() - start_micros;

  if (has_output) {
    stats.bytes_written = meta.fd.GetFileSize();
    stats.num_output_files = 1;
  }

  const auto& blobs = edit->GetBlobFileAdditions();
  for (const auto& blob : blobs) {
    stats.bytes_written_blob += blob.GetTotalBlobBytes();
  }

  stats.num_output_files_blob = static_cast<int>(blobs.size());

  cfd->internal_stats()->AddCompactionStats(level, Env::Priority::USER, stats);
  cfd->internal_stats()->AddCFStats(
      InternalStats::BYTES_FLUSHED,
      stats.bytes_written + stats.bytes_written_blob);
  RecordTick(stats_, COMPACT_WRITE_BYTES, meta.fd.GetFileSize());
  return s;
}

Status DB::Open(const Options& options, const std::string& dbname, DB** dbptr) {
  DBOptions db_options(options);
  ColumnFamilyOptions cf_options(options);
  std::vector<ColumnFamilyDescriptor> column_families;
  column_families.push_back(
      ColumnFamilyDescriptor(kDefaultColumnFamilyName, cf_options));
  if (db_options.persist_stats_to_disk) {
    column_families.push_back(
        ColumnFamilyDescriptor(kPersistentStatsColumnFamilyName, cf_options));
  }
  std::vector<ColumnFamilyHandle*> handles;
  Status s = DB::Open(db_options, dbname, column_families, &handles, dbptr);
  if (s.ok()) {
    if (db_options.persist_stats_to_disk) {
      assert(handles.size() == 2);
    } else {
      assert(handles.size() == 1);
    }
    // I can delete the handle since DBImpl is always holding a reference to
    // default column family
    if (db_options.persist_stats_to_disk && handles[1] != nullptr) {
      delete handles[1];
    }
    delete handles[0];
  }
  return s;
}

Status DB::Open(const DBOptions& db_options, const std::string& dbname,
                const std::vector<ColumnFamilyDescriptor>& column_families,
                std::vector<ColumnFamilyHandle*>* handles, DB** dbptr) {
  const bool kSeqPerBatch = true;
  const bool kBatchPerTxn = true;
  ThreadStatusUtil::SetEnableTracking(db_options.enable_thread_tracking);
  ThreadStatusUtil::SetThreadOperation(ThreadStatus::OperationType::OP_DBOPEN);
  Status s = DBImpl::Open(db_options, dbname, column_families, handles, dbptr,
                          !kSeqPerBatch, kBatchPerTxn);
  ThreadStatusUtil::ResetThreadStatus();
  return s;
}

// TODO: Implement the trimming in flush code path.
// TODO: Perform trimming before inserting into memtable during recovery.
// TODO: Pick files with max_timestamp > trim_ts by each file's timestamp meta
// info, and handle only these files to reduce io.
Status DB::OpenAndTrimHistory(
    const DBOptions& db_options, const std::string& dbname,
    const std::vector<ColumnFamilyDescriptor>& column_families,
    std::vector<ColumnFamilyHandle*>* handles, DB** dbptr,
    std::string trim_ts) {
  assert(dbptr != nullptr);
  assert(handles != nullptr);
  auto validate_options = [&db_options] {
    if (db_options.avoid_flush_during_recovery) {
      return Status::InvalidArgument(
          "avoid_flush_during_recovery incompatible with "
          "OpenAndTrimHistory");
    }
    return Status::OK();
  };
  auto s = validate_options();
  if (!s.ok()) {
    return s;
  }

  DB* db = nullptr;
  s = DB::Open(db_options, dbname, column_families, handles, &db);
  if (!s.ok()) {
    return s;
  }
  assert(db);
  CompactRangeOptions options;
  options.bottommost_level_compaction =
      BottommostLevelCompaction::kForceOptimized;
  auto db_impl = static_cast_with_check<DBImpl>(db);
  for (auto handle : *handles) {
    assert(handle != nullptr);
    auto cfh = static_cast_with_check<ColumnFamilyHandleImpl>(handle);
    auto cfd = cfh->cfd();
    assert(cfd != nullptr);
    // Only compact column families with timestamp enabled
    if (cfd->user_comparator() != nullptr &&
        cfd->user_comparator()->timestamp_size() > 0) {
      s = db_impl->CompactRangeInternal(options, handle, nullptr, nullptr,
                                        trim_ts);
      if (!s.ok()) {
        break;
      }
    }
  }
  auto clean_op = [&handles, &db] {
    for (auto handle : *handles) {
      auto temp_s = db->DestroyColumnFamilyHandle(handle);
      assert(temp_s.ok());
    }
    handles->clear();
    delete db;
  };
  if (!s.ok()) {
    clean_op();
    return s;
  }

  *dbptr = db;
  return s;
}

IOStatus DBImpl::CreateWAL(uint64_t log_file_num, uint64_t recycle_log_number,
                           size_t preallocate_block_size,
                           log::Writer** new_log) {
  IOStatus io_s;
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether it's retryable or not, the scope (i.e., fault domain) of the error, etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy, etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
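The translation step described above can be sketched in isolation. Everything in this block (`ToyIOStatus`, `ToyStatus`, `ToySeverity`, `TranslateIOStatus`) is a hypothetical simplification rather than the actual RocksDB types; in particular, leaving non-retryable errors at an unspecified severity is an assumption, since the summary only states what happens for retryable errors.

```cpp
#include <cassert>
#include <string>

// Hypothetical severity levels for this sketch only.
enum class ToySeverity { kNoError, kSoftError, kUnspecified };

// Hypothetical IO-layer status carrying extra metadata about the error.
struct ToyIOStatus {
  bool ok = true;
  bool retryable = false;
  std::string message;
};

// Hypothetical general-purpose status used by the rest of the code.
struct ToyStatus {
  bool ok = true;
  ToySeverity severity = ToySeverity::kNoError;
  std::string message;
};

// Translate the IO status into a general status. A retryable IO error is
// marked as a soft error so that error handling can attempt automatic
// recovery; other errors are left unspecified (an assumption of this sketch).
inline ToyStatus TranslateIOStatus(const ToyIOStatus& io_s) {
  ToyStatus s;
  if (io_s.ok) {
    return s;
  }
  s.ok = false;
  s.message = io_s.message;
  s.severity =
      io_s.retryable ? ToySeverity::kSoftError : ToySeverity::kUnspecified;
  return s;
}
```

The point of the design is that callers written against the older `Status`-returning interface keep working, while the retryable flag is not lost on the way up.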
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
  std::unique_ptr<FSWritableFile> lfile;

  DBOptions db_options =
      BuildDBOptions(immutable_db_options_, mutable_db_options_);
  FileOptions opt_file_options =
      fs_->OptimizeForLogWrite(file_options_, db_options);
  std::string wal_dir = immutable_db_options_.GetWalDir();
  std::string log_fname = LogFileName(wal_dir, log_file_num);

  if (recycle_log_number) {
    ROCKS_LOG_INFO(immutable_db_options_.info_log,
                   "reusing log %" PRIu64 " from recycle list\n",
                   recycle_log_number);
    std::string old_log_fname = LogFileName(wal_dir, recycle_log_number);
    TEST_SYNC_POINT("DBImpl::CreateWAL:BeforeReuseWritableFile1");
    TEST_SYNC_POINT("DBImpl::CreateWAL:BeforeReuseWritableFile2");
    io_s = fs_->ReuseWritableFile(log_fname, old_log_fname, opt_file_options,
                                  &lfile, /*dbg=*/nullptr);
  } else {
    io_s = NewWritableFile(fs_.get(), log_fname, &lfile, opt_file_options);
  }

  if (io_s.ok()) {
    lfile->SetWriteLifeTimeHint(CalculateWALWriteHint());
    lfile->SetPreallocationBlockSize(preallocate_block_size);

    const auto& listeners = immutable_db_options_.listeners;
    FileTypeSet tmp_set = immutable_db_options_.checksum_handoff_file_types;
    std::unique_ptr<WritableFileWriter> file_writer(new WritableFileWriter(
        std::move(lfile), log_fname, opt_file_options,
        immutable_db_options_.clock, io_tracer_, nullptr /* stats */, listeners,
Using existing crc32c checksum in checksum handoff for Manifest and WAL (#8412)
Summary:
In PR https://github.com/facebook/rocksdb/issues/7523 , checksum handoff is introduced in RocksDB for WAL, Manifest, and SST files. When a user enables checksum handoff for a certain type of file, before the data is written to the lower layer storage system, we calculate the checksum (crc32c) of each piece of data and pass the checksum down with the data, such that data verification can be done by the lower layer storage system if it has the capability. However, it cannot cover the whole lifetime of the data in memory, and it potentially introduces extra checksum calculation overhead.
In this PR, we introduce a new interface in WritableFileWriter::Append, which allows the caller to pass the data and the checksum (crc32c) together. In this way, WritableFileWriter can directly use the passed-in checksum (crc32c) to generate the checksum of data being passed down to the storage system. It saves the calculation overhead and achieves higher protection coverage. When a new checksum is added with the data, we use Crc32cCombine https://github.com/facebook/rocksdb/issues/8305 to combine the existing checksum and the new checksum. To avoid the segmenting of data by rate-limiter before it is stored, rate-limiter is called enough times to accumulate enough credits for a certain write. This design only supports Manifest and WAL, which use log_writer, in the current stage.
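The basic handoff idea, where a checksum travels with the data and the lower layer verifies it, can be sketched as follows. This is an illustration under stated assumptions: `Crc32c` is a plain bitwise CRC-32C written for this example (not RocksDB's optimized implementation), `ToyStorage` is a hypothetical lower layer, and the sketch does not implement `Crc32cCombine` or rate limiting.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Bitwise CRC-32C (Castagnoli), using the reflected polynomial 0x82F63B78.
// Slow but short; real implementations use tables or hardware instructions.
inline uint32_t Crc32c(const std::string& data) {
  uint32_t crc = 0xFFFFFFFFu;
  for (unsigned char byte : data) {
    crc ^= byte;
    for (int bit = 0; bit < 8; ++bit) {
      // Shift right; XOR in the polynomial when the low bit was set.
      crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
  }
  return ~crc;
}

// Hypothetical lower-layer storage that accepts a handed-off checksum and
// rejects the write if the data does not match it.
struct ToyStorage {
  std::string contents;

  bool Append(const std::string& data, uint32_t handed_off_crc) {
    if (Crc32c(data) != handed_off_crc) {
      return false;  // Data was corrupted somewhere above this layer.
    }
    contents += data;
    return true;
  }
};
```

A caller that already holds a checksum for its record (as the log writer does) passes it along with the data, so corruption introduced between checksum creation and the storage layer is detected at `Append` time.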
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8412
Test Plan: make check, add new testing cases.
Reviewed By: anand1976
Differential Revision: D29151545
Pulled By: zhichao-cao
fbshipit-source-id: 75e2278c5126cfd58393c67b1efd18dcc7a30772
4 years ago
        nullptr, tmp_set.Contains(FileType::kWalFile),
        tmp_set.Contains(FileType::kWalFile)));
    *new_log = new log::Writer(std::move(file_writer), log_file_num,
                               immutable_db_options_.recycle_log_file_num > 0,
                               immutable_db_options_.manual_wal_flush,
                               immutable_db_options_.wal_compression);
    io_s = (*new_log)->AddCompressionTypeRecord();
  }
  return io_s;
}

Status DBImpl::Open(const DBOptions& db_options, const std::string& dbname,
                    const std::vector<ColumnFamilyDescriptor>& column_families,
                    std::vector<ColumnFamilyHandle*>* handles, DB** dbptr,
                    const bool seq_per_batch, const bool batch_per_txn) {
  Status s = ValidateOptionsByTable(db_options, column_families);
  if (!s.ok()) {
    return s;
  }

  s = ValidateOptions(db_options, column_families);
  if (!s.ok()) {
    return s;
  }

  *dbptr = nullptr;
Persist the new MANIFEST after successfully syncing the new WAL during recovery (#9922)
Summary:
In case of non-TransactionDB and avoid_flush_during_recovery = true, RocksDB won't
flush the data from WAL to L0 for all column families if possible. As a
result, not all column families can increase their log_numbers, and
min_log_number_to_keep won't change.
For transaction DB (.allow_2pc), even with the flush, there may be old WAL files that it must not delete because they can contain data of uncommitted transactions and min_log_number_to_keep won't change.
If we persist a new MANIFEST with
advanced log_numbers for some column families, then during a second
crash after persisting the MANIFEST, RocksDB will see some column
families' log_numbers larger than the corrupted wal, and the "column family inconsistency" error will be hit, causing recovery to fail.
As a solution, RocksDB will persist the new MANIFEST after successfully syncing the new WAL.
If a future recovery starts from the new MANIFEST, then it means the new WAL is successfully synced. Due to the sentinel empty write batch at the beginning, kPointInTimeRecovery of WAL is guaranteed to go after this point.
If future recovery starts from the old MANIFEST, it means writing the new MANIFEST failed. We won't have the "SST ahead of WAL" error.
Currently, RocksDB DB::Open() may create and write to two new MANIFEST files even before recovery succeeds. This PR buffers the edits in a structure and writes to a new MANIFEST after recovery is successful.
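The ordering described above (buffer edits during recovery, persist them only once the new WAL is synced) can be sketched with toy types. `ToyManifest`, `ToyRecoveryContext`, and `FinishRecovery` are hypothetical names for illustration; the real `RecoveryContext` holds version edits rather than strings.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the on-disk MANIFEST: a list of persisted records.
struct ToyManifest {
  std::vector<std::string> records;
};

// Hypothetical buffer that collects edits while recovery is in progress.
struct ToyRecoveryContext {
  std::vector<std::string> pending_edits;

  void Buffer(const std::string& edit) { pending_edits.push_back(edit); }
};

// Persist the buffered edits only if the new WAL was synced successfully.
// On failure nothing is written, so a later recovery starts from the old,
// still consistent MANIFEST.
inline bool FinishRecovery(bool new_wal_synced, ToyRecoveryContext& ctx,
                           ToyManifest& manifest) {
  if (!new_wal_synced) {
    return false;
  }
  for (const std::string& edit : ctx.pending_edits) {
    manifest.records.push_back(edit);
  }
  ctx.pending_edits.clear();
  return true;
}
```

Because the manifest is untouched until the sync succeeds, a crash at any earlier point leaves no advanced log numbers behind.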
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9922
Test Plan:
1. Update unit tests to fail without this change
2. make crash_test -j
Branch with unit test and no fix https://github.com/facebook/rocksdb/pull/9942 to keep track of unit test (without fix)
Reviewed By: riversand963
Differential Revision: D36043701
Pulled By: akankshamahajan15
fbshipit-source-id: 5760970db0a0920fb73d3c054a4155733500acd9
3 years ago
  assert(handles);
  handles->clear();

  size_t max_write_buffer_size = 0;
  for (auto cf : column_families) {
    max_write_buffer_size =
        std::max(max_write_buffer_size, cf.options.write_buffer_size);
  }

  DBImpl* impl = new DBImpl(db_options, dbname, seq_per_batch, batch_per_txn);
  if (!impl->immutable_db_options_.info_log) {
    s = impl->init_logger_creation_s_;
    delete impl;
    return s;
  } else {
    assert(impl->init_logger_creation_s_.ok());
  }
  s = impl->env_->CreateDirIfMissing(impl->immutable_db_options_.GetWalDir());
  if (s.ok()) {
    std::vector<std::string> paths;
    for (auto& db_path : impl->immutable_db_options_.db_paths) {
      paths.emplace_back(db_path.path);
    }
    for (auto& cf : column_families) {
      for (auto& cf_path : cf.options.cf_paths) {
        paths.emplace_back(cf_path.path);
      }
    }
    for (auto& path : paths) {
      s = impl->env_->CreateDirIfMissing(path);
      if (!s.ok()) {
        break;
      }
    }
Auto recovery from out of space errors (#4164)
Summary:
This commit implements automatic recovery from a Status::NoSpace() error
during background operations such as write callback, flush and
compaction. The broad design is as follows -
1. Compaction errors are treated as soft errors and don't put the
database in read-only mode. A compaction is delayed until enough free
disk space is available to accommodate the compaction outputs, which is
estimated based on the input size. This means that users can continue to
write, and we rely on the WriteController to delay or stop writes if the
compaction debt becomes too high due to a persistent low disk space
condition.
2. Errors during write callback and flush are treated as hard errors,
i.e., the database is put in read-only mode and goes back to read-write
only after certain recovery actions are taken.
3. Both types of recovery rely on the SstFileManagerImpl to poll for
sufficient disk space. We assume that there is a 1-1 mapping between an
SFM and the underlying OS storage container. For cases where multiple
DBs are hosted on a single storage container, the user is expected to
allocate a single SFM instance and use the same one for all the DBs. If
no SFM is specified by the user, DBImpl::Open() will allocate one, but
this will be one per DB and each DB will recover independently. The
recovery implemented by SFM is as follows -
a) On the first occurrence of an out of space error during compaction,
subsequent compactions will be delayed until the disk free space check
indicates enough available space. The required space is computed as the
sum of input sizes.
b) The free space check requirement will be removed once the amount of
free space is greater than the size reserved by in-progress
compactions when the first error occurred.
c) If the out of space error is a hard error, a background thread in
SFM will poll for sufficient headroom before triggering the recovery
of the database and putting it back in read-write mode. The headroom is
calculated as the sum of the write_buffer_size of all the DB instances
associated with the SFM.
4. EventListener callbacks will be called at the start and completion of
automatic recovery. Users can disable the auto recovery in the start
callback, and later initiate it manually by calling DB::Resume()
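The two space estimates in the design above reduce to simple sums. The following sketch shows them in isolation; the function names are hypothetical, and treating "enough space" as free space greater than or equal to the estimate is an assumption of this example rather than something the summary specifies.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Space needed by a compaction, estimated as the sum of its input file sizes.
inline uint64_t RequiredCompactionSpace(
    const std::vector<uint64_t>& input_file_sizes) {
  return std::accumulate(input_file_sizes.begin(), input_file_sizes.end(),
                         uint64_t{0});
}

// A delayed compaction may proceed once the free space covers the estimate.
inline bool CanStartCompaction(uint64_t free_space,
                               const std::vector<uint64_t>& input_file_sizes) {
  return free_space >= RequiredCompactionSpace(input_file_sizes);
}

// Headroom required before recovering from a hard error: the sum of
// write_buffer_size over all DB instances sharing the file manager.
inline uint64_t HardErrorHeadroom(
    const std::vector<uint64_t>& write_buffer_sizes) {
  return std::accumulate(write_buffer_sizes.begin(), write_buffer_sizes.end(),
                         uint64_t{0});
}
```

For example, a compaction over inputs of 10, 20, and 30 bytes would wait until at least 60 bytes are free, and two DBs with 64-byte write buffers would need 128 bytes of headroom before a hard-error recovery is triggered.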
Todo:
1. More extensive testing
2. Add disk full condition to db_stress (follow-on PR)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164
Differential Revision: D9846378
Pulled By: anand1976
fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a
6 years ago

    // For recovery from NoSpace() error, we can only handle
    // the case where the database is stored in a single path
    if (paths.size() <= 1) {
      impl->error_handler_.EnableAutoRecovery();
    }
  }
  if (s.ok()) {
    s = impl->CreateArchivalDirectory();
  }
  if (!s.ok()) {
    delete impl;
    return s;
  }

  impl->wal_in_db_path_ = impl->immutable_db_options_.IsWalDirSameAsDBPath();
  RecoveryContext recovery_ctx;
  impl->mutex_.Lock();

  // Handles create_if_missing, error_if_exists
  uint64_t recovered_seq(kMaxSequenceNumber);
  s = impl->Recover(column_families, false /* read_only */,
                    false /* error_if_wal_file_exists */,
                    false /* error_if_data_exists_in_wals */, &recovered_seq,
                    &recovery_ctx);
  if (s.ok()) {
    uint64_t new_log_number = impl->versions_->NewFileNumber();
    log::Writer* new_log = nullptr;
    const size_t preallocate_block_size =
        impl->GetWalPreallocateBlockSize(max_write_buffer_size);
    s = impl->CreateWAL(new_log_number, 0 /*recycle_log_number*/,
                        preallocate_block_size, &new_log);
    if (s.ok()) {
      InstrumentedMutexLock wl(&impl->log_write_mutex_);
      impl->logfile_number_ = new_log_number;
      assert(new_log != nullptr);
Do not track obsolete WALs in MANIFEST even if they are synced (#7725)
Summary:
Consider the case:
1. All column families are flushed, so all WALs become obsolete, but no WAL is removed from disk yet because the removal is asynchronous. A VersionEdit is written to MANIFEST indicating that WALs before a certain WAL number are obsolete; let's say this number is 3;
2. `SyncWAL` is called, so all the on-disk WALs are synced, and if track_and_verify_wal_in_manifest=true, the WALs will be tracked in MANIFEST, let's say the WAL numbers are 1 and 2;
3. DB crashes;
4. During DB recovery, when replaying MANIFEST, we first see that WALs with number < 3 are obsolete, then we see that WAL 1 and 2 are synced, so according to the current implementation of `WalSet`, the `WalSet` will be recovered to include WAL 1 and 2;
5. WAL 1 and 2 are asynchronously deleted from disk, then the WAL verification algorithm fails with `Corruption: missing WAL`.
The above case is reproduced in a new unit test `DBBasicTestTrackWal::DoNotTrackObsoleteWal`.
The fix is to maintain the upper bound of the obsolete WAL numbers: any WAL with a number less than the maintained number is considered obsolete, so it shouldn't be tracked even if it is later synced. The number is maintained in `WalSet`.
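A minimal sketch of the idea, assuming a simplified model: `ToyWalSet` and its methods are hypothetical and are not the actual `WalSet` interface. It only shows how remembering the obsolete upper bound makes a late "synced" record for WAL 1 or 2 a no-op.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical, simplified stand-in for WalSet; not the RocksDB class.
class ToyWalSet {
 public:
  // Records a synced WAL unless it is already known to be obsolete.
  // Returns true if the WAL is tracked after the call.
  bool AddSyncedWal(uint64_t number, uint64_t synced_size) {
    if (number < min_wal_number_to_keep_) {
      return false;  // obsolete: ignored even though it was synced later
    }
    wals_[number] = synced_size;
    return true;
  }

  // Marks every WAL with a number below `number` as obsolete and remembers
  // the bound so that later sync records for those WALs are ignored.
  void DeleteWalsBefore(uint64_t number) {
    if (number > min_wal_number_to_keep_) {
      min_wal_number_to_keep_ = number;
    }
    wals_.erase(wals_.begin(), wals_.lower_bound(number));
  }

  bool IsTracked(uint64_t number) const { return wals_.count(number) > 0; }

 private:
  std::map<uint64_t, uint64_t> wals_;    // WAL number -> synced size
  uint64_t min_wal_number_to_keep_ = 0;  // upper bound of obsolete WALs
};
```

Replaying the scenario above against this model (obsolete bound 3 first, then sync records for WAL 1 and 2) leaves the set empty, so a later verification pass would not expect those files on disk.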
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7725
Test Plan:
1. a new unit test `DBBasicTestTrackWal::DoNotTrackObsoleteWal` is added.
2. run `make crash_test` on devserver.
Reviewed By: riversand963
Differential Revision: D25238914
Pulled By: cheng-chang
fbshipit-source-id: f5dccd57c3d89f19565ec5731f2d42f06d272b72
4 years ago
      assert(impl->logs_.empty());
      impl->logs_.emplace_back(new_log_number, new_log);
    }

    if (s.ok()) {
      impl->alive_log_files_.push_back(
          DBImpl::LogFileNumberSize(impl->logfile_number_));
      // In WritePrepared there could be a gap in sequence numbers. This breaks
      // the trick we use in kPointInTimeRecovery which assumes the first seq in
      // the log right after the corrupted log is one larger than the last seq
      // we read from the wals. To let this trick keep working, we add a dummy
      // entry with the expected sequence to the first log right after recovery.
      // In non-WritePrepared case also the new log after recovery could be
      // empty, and thus missing the consecutive seq hint to distinguish
      // middle-log corruption from corrupted-log-remained-after-recovery. This
      // case also will be addressed by a dummy write.
      if (recovered_seq != kMaxSequenceNumber) {
        WriteBatch empty_batch;
        WriteBatchInternal::SetSequence(&empty_batch, recovered_seq);
        WriteOptions write_options;
        uint64_t log_used, log_size;
        log::Writer* log_writer = impl->logs_.back().writer;
        LogFileNumberSize& log_file_number_size = impl->alive_log_files_.back();

        assert(log_writer->get_log_number() == log_file_number_size.number);
        impl->mutex_.AssertHeld();
Rate-limit automatic WAL flush after each user write (#9607)
Summary:
**Context:**
WAL flush is currently not rate-limited by `Options::rate_limiter`. This PR provides rate-limiting for automatic WAL flush, the one that automatically happens after each user write operation (i.e., `Options::manual_wal_flush == false`), by adding `WriteOptions::rate_limiter_options`.
Note that we are NOT rate-limiting WAL flushes that do NOT automatically happen after each user write, such as `Options::manual_wal_flush == true + manual FlushWAL()` (rate-limiting multiple WAL flushes), for the benefits of:
- being consistent with [ReadOptions::rate_limiter_priority](https://github.com/facebook/rocksdb/blob/7.0.fb/include/rocksdb/options.h#L515)
- being able to turn off rate-limiting for some WAL flushes but not all (e.g., turn it off specifically for the WAL flush of a critical user write like a service's heartbeat)
`WriteOptions::rate_limiter_options` only accepts `Env::IO_USER` and `Env::IO_TOTAL` currently due to an implementation constraint.
- The constraint is that we currently queue parallel writes (including WAL writes) based on a FIFO policy which does not factor rate limiter priority into this layer's scheduling. If we allow lower priorities such as `Env::IO_HIGH/MID/LOW` and writes specified with lower priorities occur before ones specified with higher priorities (even just by a tiny bit in arrival time), the former would block the latter, leading to a "priority inversion" issue that contradicts what we promise for rate-limiting priority. Therefore we only allow `Env::IO_USER` and `Env::IO_TOTAL` right now before improving that scheduling.
A pre-requisite to this feature is to support operation-level rate limiting in `WritableFileWriter`, which is also included in this PR.
**Summary:**
- Renamed test suite `DBRateLimiterTest` to `DBRateLimiterOnReadTest` to make room for a new test suite
- Accept `rate_limiter_priority` in `WritableFileWriter`'s private and public write functions
- Passed `WriteOptions::rate_limiter_options` to `WritableFileWriter` in the path of automatic WAL flush.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9607
Test Plan:
- Added new unit test to verify existing flush/compaction rate-limiting does not break, since `DBTest, RateLimitingTest` is disabled and current db-level rate-limiting tests focus on read only (e.g, `db_rate_limiter_test`, `DBTest2, RateLimitedCompactionReads`).
- Added new unit test `DBRateLimiterOnWriteWALTest, AutoWalFlush`
- `strace -ftt -e trace=write ./db_bench -benchmarks=fillseq -db=/dev/shm/testdb -rate_limit_auto_wal_flush=1 -rate_limiter_bytes_per_sec=15 -rate_limiter_refill_period_us=1000000 -write_buffer_size=100000000 -disable_auto_compactions=1 -num=100`
- verified that WAL flushes (i.e., the _write_ system calls) were chunked into 15 bytes and each _write_ was roughly 1 second apart
- verified the chunking disappeared when `-rate_limit_auto_wal_flush=0`
- crash test: `python3 tools/db_crashtest.py blackbox --disable_wal=0 --rate_limit_auto_wal_flush=1 --rate_limiter_bytes_per_sec=10485760 --interval=10` killed as normal
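The chunking observed in the `strace` check above can be illustrated with a small sketch. It assumes a deliberately simplified model in which each refill period grants a fixed byte budget and one write is issued per period; `ChunkWrite` is a hypothetical helper, not a RocksDB function, and the real rate limiter also handles priorities and concurrent requests.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: splits a flush of `total_bytes` into the sizes of the
// individual write() calls that a byte budget of `bytes_per_refill` per
// refill period would allow. A budget of 0 is treated here as "no limit".
std::vector<size_t> ChunkWrite(size_t total_bytes, size_t bytes_per_refill) {
  std::vector<size_t> chunks;
  if (bytes_per_refill == 0) {
    if (total_bytes > 0) {
      chunks.push_back(total_bytes);  // single unthrottled write
    }
    return chunks;
  }
  size_t remaining = total_bytes;
  while (remaining > 0) {
    const size_t n = std::min(remaining, bytes_per_refill);
    chunks.push_back(n);  // one write per refill period
    remaining -= n;
  }
  return chunks;
}
```

With the test-plan settings (15 bytes per second and a 1,000,000 us refill period), a 40-byte flush would be issued as writes of 15, 15, and 10 bytes under this model, roughly one second apart.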
**Benchmarked on flush/compaction to ensure no performance regression:**
- compaction with rate-limiting (see table 1, avg over 1280-run): pre-change: **908316 micros/op**; post-change: **907350 micros/op (improved by 0.106%)**
```
#!/bin/bash
TEST_TMPDIR=/dev/shm/testdb
START=1
NUM_DATA_ENTRY=8
N=10
rm -f compact_bmk_output.txt compact_bmk_output_2.txt dont_care_output.txt
for i in $(eval echo "{$START..$NUM_DATA_ENTRY}")
do
NUM_RUN=$(($N*(2**($i-1))))
for j in $(eval echo "{$START..$NUM_RUN}")
do
./db_bench --benchmarks=fillrandom -db=$TEST_TMPDIR -disable_auto_compactions=1 -write_buffer_size=6710886 > dont_care_output.txt && ./db_bench --benchmarks=compact -use_existing_db=1 -db=$TEST_TMPDIR -level0_file_num_compaction_trigger=1 -rate_limiter_bytes_per_sec=100000000 | egrep 'compact'
done > compact_bmk_output.txt && awk -v NUM_RUN=$NUM_RUN '{sum+=$3;sum_sqrt+=$3^2}END{print sum/NUM_RUN, sqrt(sum_sqrt/NUM_RUN-(sum/NUM_RUN)^2)}' compact_bmk_output.txt >> compact_bmk_output_2.txt
done
```
- compaction w/o rate-limiting (see table 2, avg over 640-run): pre-change: **822197 micros/op**; post-change: **823148 micros/op (regressed by 0.12%)**
```
Same as above script, except that -rate_limiter_bytes_per_sec=0
```
- flush with rate-limiting (see table 3, avg over 320-run, run on the [patch](https://github.com/hx235/rocksdb/commit/ee5c6023a9f6533fab9afdc681568daa21da4953) to augment current db_bench): pre-change: **745752 micros/op**; post-change: **745331 micros/op (improved by 0.06%)**
```
#!/bin/bash
TEST_TMPDIR=/dev/shm/testdb
START=1
NUM_DATA_ENTRY=8
N=10
rm -f flush_bmk_output.txt flush_bmk_output_2.txt
for i in $(eval echo "{$START..$NUM_DATA_ENTRY}")
do
NUM_RUN=$(($N*(2**($i-1))))
for j in $(eval echo "{$START..$NUM_RUN}")
do
./db_bench -db=$TEST_TMPDIR -write_buffer_size=1048576000 -num=1000000 -rate_limiter_bytes_per_sec=100000000 -benchmarks=fillseq,flush | egrep 'flush'
done > flush_bmk_output.txt && awk -v NUM_RUN=$NUM_RUN '{sum+=$3;sum_sqrt+=$3^2}END{print sum/NUM_RUN, sqrt(sum_sqrt/NUM_RUN-(sum/NUM_RUN)^2)}' flush_bmk_output.txt >> flush_bmk_output_2.txt
done
```
- flush w/o rate-limiting (see table 4, avg over 320-run, run on the [patch](https://github.com/hx235/rocksdb/commit/ee5c6023a9f6533fab9afdc681568daa21da4953) to augment current db_bench): pre-change: **487512 micros/op**; post-change: **485856 micros/op (improved by 0.34%)**
```
Same as above script, except that -rate_limiter_bytes_per_sec=0
```
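The `awk` one-liner in the benchmark scripts above reports the mean and the population standard deviation using sqrt(E[x^2] - (E[x])^2); note that its `sum_sqrt` variable actually accumulates the sum of squares. A minimal C++ sketch of the same computation follows. `MeanAndStd` is a name introduced here for illustration only.

```cpp
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// Returns {mean, population standard deviation} of `samples`, using the same
// formula as the awk script: sqrt(sum_sq / n - (sum / n)^2).
std::pair<double, double> MeanAndStd(const std::vector<double>& samples) {
  if (samples.empty()) {
    return {0.0, 0.0};
  }
  double sum = 0.0;
  double sum_sq = 0.0;
  for (double x : samples) {
    sum += x;
    sum_sq += x * x;
  }
  const double n = static_cast<double>(samples.size());
  const double mean = sum / n;
  double variance = sum_sq / n - mean * mean;
  if (variance < 0.0) {
    variance = 0.0;  // guard against tiny negative values from rounding
  }
  return {mean, std::sqrt(variance)};
}
```

This single-pass formula is convenient for a shell pipeline, though it can lose precision when the values are large relative to their spread, as micros/op figures in the hundreds of thousands may be.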
| table 1 - compact with rate-limiting|
#-run | (pre-change) avg micros/op | std micros/op | (post-change) avg micros/op | std micros/op | change in avg micros/op (%)
-- | -- | -- | -- | -- | --
10 | 896978 | 16046.9 | 901242 | 15670.9 | 0.475373978
20 | 893718 | 15813 | 886505 | 17544.7 | -0.8070778478
40 | 900426 | 23882.2 | 894958 | 15104.5 | -0.6072681153
80 | 906635 | 21761.5 | 903332 | 23948.3 | -0.3643141948
160 | 898632 | 21098.9 | 907583 | 21145 | 0.9960695813
320 | 905252 | 22785.5 | 908106 | 25325.5 | 0.3152713278
640 | 905213 | 23598.6 | 906741 | 21370.5 | 0.1688000504
**1280** | **908316** | **23533.1** | **907350** | **24626.8** | **-0.1063506533**
average over #-run | 901896.25 | 21064.9625 | 901977.125 | 20592.025 | 0.008967217682
| table 2 - compact w/o rate-limiting|
#-run | (pre-change) avg micros/op | std micros/op | (post-change) avg micros/op | std micros/op | change in avg micros/op (%)
-- | -- | -- | -- | -- | --
10 | 811211 | 26996.7 | 807586 | 28456.4 | -0.4468627768
20 | 815465 | 14803.7 | 814608 | 28719.7 | -0.105093413
40 | 809203 | 26187.1 | 797835 | 25492.1 | -1.404839082
80 | 822088 | 28765.3 | 822192 | 32840.4 | 0.01265071379
160 | 821719 | 36344.7 | 821664 | 29544.9 | -0.006693285661
320 | 820921 | 27756.4 | 821403 | 28347.7 | 0.05871454135
**640** | **822197** | **28960.6** | **823148** | **30055.1** | **0.1156657103**
average over #-run | 8.18E+05 | 2.71E+04 | 8.15E+05 | 2.91E+04 | -0.25
| table 3 - flush with rate-limiting|
#-run | (pre-change) avg micros/op | std micros/op | (post-change) avg micros/op | std micros/op | change in avg micros/op (%)
-- | -- | -- | -- | -- | --
10 | 741721 | 11770.8 | 740345 | 5949.76 | -0.1855144994
20 | 735169 | 3561.83 | 743199 | 9755.77 | 1.09226586
40 | 743368 | 8891.03 | 742102 | 8683.22 | -0.1703059588
80 | 742129 | 8148.51 | 743417 | 9631.58 | 0.1735547324
160 | 749045 | 9757.21 | 746256 | 9191.86 | -0.3723407806
**320** | **745752** | **9819.65** | **745331** | **9840.62** | **-0.0564530836**
640 | 749006 | 11080.5 | 748173 | 10578.7 | -0.1112140624
average over #-run | 743741.4286 | 9004.218571 | 744117.5714 | 9090.215714 | 0.05057441238
| table 4 - flush w/o rate-limiting|
#-run | (pre-change) avg micros/op | std micros/op | (post-change) avg micros/op | std micros/op | change in avg micros/op (%)
-- | -- | -- | -- | -- | --
10 | 477283 | 24719.6 | 473864 | 12379 | -0.7163464863
20 | 486743 | 20175.2 | 502296 | 23931.3 | 3.195320734
40 | 482846 | 15309.2 | 489820 | 22259.5 | 1.444352858
80 | 491490 | 21883.1 | 490071 | 23085.7 | -0.2887139108
160 | 493347 | 28074.3 | 483609 | 21211.7 | -1.973864238
**320** | **487512** | **21401.5** | **485856** | **22195.2** | **-0.3396839462**
640 | 490307 | 25418.6 | 485435 | 22405.2 | -0.9936631539
average over #-run | 4.87E+05 | 2.24E+04 | 4.87E+05 | 2.11E+04 | 0.00E+00
Reviewed By: ajkr
Differential Revision: D34442441
Pulled By: hx235
fbshipit-source-id: 4790f13e1e5c0a95ae1d1cc93ffcf69dc6e78bdd
3 years ago
        s = impl->WriteToWAL(empty_batch, log_writer, &log_used, &log_size,
                             Env::IO_TOTAL, log_file_number_size);
        if (s.ok()) {
          // Need to fsync, otherwise it might get lost after a power reset.
          s = impl->FlushWAL(false);
Update WAL corruption test so that it fails without fix (#9942)
Summary:
In case of non-TransactionDB and avoid_flush_during_recovery = true, RocksDB won't
flush the data from WAL to L0 for all column families if possible. As a
result, not all column families can increase their log_numbers, and
min_log_number_to_keep won't change.
For transaction DB (.allow_2pc), even with the flush, there may be old WAL files that it must not delete because they can contain data of uncommitted transactions and min_log_number_to_keep won't change.
If we persist a new MANIFEST with
advanced log_numbers for some column families, then during a second
crash after persisting the MANIFEST, RocksDB will see some column
families' log_numbers larger than the corrupted WAL, and the "column family inconsistency" error will be hit, causing recovery to fail.
This PR updates the unit tests to emulate the errors; the tests fail without a fix.
Error:
```
[ RUN ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecovery/0
db/corruption_test.cc:1190: Failure
DB::Open(options, dbname_, cf_descs, &handles, &db_)
Corruption: SST file is ahead of WALs in CF test_cf
[ FAILED ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecovery/0, where GetParam() = (true, false) (91 ms)
[ RUN ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecovery/1
db/corruption_test.cc:1190: Failure
DB::Open(options, dbname_, cf_descs, &handles, &db_)
Corruption: SST file is ahead of WALs in CF test_cf
[ FAILED ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecovery/1, where GetParam() = (false, false) (92 ms)
[ RUN ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecovery/2
db/corruption_test.cc:1190: Failure
DB::Open(options, dbname_, cf_descs, &handles, &db_)
Corruption: SST file is ahead of WALs in CF test_cf
[ FAILED ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecovery/2, where GetParam() = (true, true) (95 ms)
[ RUN ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecovery/3
db/corruption_test.cc:1190: Failure
DB::Open(options, dbname_, cf_descs, &handles, &db_)
Corruption: SST file is ahead of WALs in CF test_cf
[ FAILED ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecovery/3, where GetParam() = (false, true) (92 ms)
[ RUN ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.TxnDbCrashDuringRecovery/0
db/corruption_test.cc:1354: Failure
TransactionDB::Open(options, txn_db_opts, dbname_, cf_descs, &handles, &txn_db)
Corruption: SST file is ahead of WALs in CF default
[ FAILED ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.TxnDbCrashDuringRecovery/0, where GetParam() = (true, false) (94 ms)
[ RUN ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.TxnDbCrashDuringRecovery/1
db/corruption_test.cc:1354: Failure
TransactionDB::Open(options, txn_db_opts, dbname_, cf_descs, &handles, &txn_db)
Corruption: SST file is ahead of WALs in CF default
[ FAILED ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.TxnDbCrashDuringRecovery/1, where GetParam() = (false, false) (97 ms)
[ RUN ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.TxnDbCrashDuringRecovery/2
db/corruption_test.cc:1354: Failure
TransactionDB::Open(options, txn_db_opts, dbname_, cf_descs, &handles, &txn_db)
Corruption: SST file is ahead of WALs in CF default
[ FAILED ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.TxnDbCrashDuringRecovery/2, where GetParam() = (true, true) (94 ms)
[ RUN ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.TxnDbCrashDuringRecovery/3
db/corruption_test.cc:1354: Failure
TransactionDB::Open(options, txn_db_opts, dbname_, cf_descs, &handles, &txn_db)
Corruption: SST file is ahead of WALs in CF default
[ FAILED ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.TxnDbCrashDuringRecovery/3, where GetParam() = (false, true) (91 ms)
[ RUN ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecoveryWithFlush/0
db/corruption_test.cc:1483: Failure
DB::Open(options, dbname_, cf_descs, &handles, &db_)
Corruption: SST file is ahead of WALs in CF default
[ FAILED ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecoveryWithFlush/0, where GetParam() = (true, false) (93 ms)
[ RUN ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecoveryWithFlush/1
db/corruption_test.cc:1483: Failure
DB::Open(options, dbname_, cf_descs, &handles, &db_)
Corruption: SST file is ahead of WALs in CF default
[ FAILED ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecoveryWithFlush/1, where GetParam() = (false, false) (94 ms)
[ RUN ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecoveryWithFlush/2
db/corruption_test.cc:1483: Failure
DB::Open(options, dbname_, cf_descs, &handles, &db_)
Corruption: SST file is ahead of WALs in CF default
[ FAILED ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecoveryWithFlush/2, where GetParam() = (true, true) (90 ms)
[ RUN ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecoveryWithFlush/3
db/corruption_test.cc:1483: Failure
DB::Open(options, dbname_, cf_descs, &handles, &db_)
Corruption: SST file is ahead of WALs in CF default
[ FAILED ] CorruptionTest/CrashDuringRecoveryWithCorruptionTest.CrashDuringRecoveryWithFlush/3, where GetParam() = (false, true) (93 ms)
[----------] 12 tests from CorruptionTest/CrashDuringRecoveryWithCorruptionTest (1116 ms total)
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9942
Test Plan: Not needed
Reviewed By: riversand963
Differential Revision: D36324112
Pulled By: akankshamahajan15
fbshipit-source-id: cab2075ac4ebe48f5ef93a6ea162558aa4fc334d
3 years ago
          TEST_SYNC_POINT_CALLBACK("DBImpl::Open::BeforeSyncWAL", /*arg=*/&s);
          if (s.ok()) {
            s = log_writer->file()->Sync(impl->immutable_db_options_.use_fsync);
          }
        }
      }
    }
  }
Persist the new MANIFEST after successfully syncing the new WAL during recovery (#9922)
Summary:
In case of non-TransactionDB and avoid_flush_during_recovery = true, RocksDB won't
flush the data from WAL to L0 for all column families if possible. As a
result, not all column families can increase their log_numbers, and
min_log_number_to_keep won't change.
For transaction DB (.allow_2pc), even with the flush, there may be old WAL files that it must not delete because they can contain data of uncommitted transactions and min_log_number_to_keep won't change.
If we persist a new MANIFEST with
advanced log_numbers for some column families, then during a second
crash after persisting the MANIFEST, RocksDB will see some column
families' log_numbers larger than the corrupted WAL, and the "column family inconsistency" error will be hit, causing recovery to fail.
As a solution, RocksDB will persist the new MANIFEST after successfully syncing the new WAL.
If a future recovery starts from the new MANIFEST, it means the new WAL was successfully synced. Due to the sentinel empty write batch at the beginning, kPointInTimeRecovery of the WAL is guaranteed to go after this point.
If a future recovery starts from the old MANIFEST, it means that writing the new MANIFEST failed. We won't have the "SST ahead of WAL" error.
Currently, RocksDB DB::Open() may create and write to two new MANIFEST files even before recovery succeeds. This PR buffers the edits in a structure and writes to a new MANIFEST after recovery is successful.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9922
Test Plan:
1. Update unit tests to fail without this change
2. make crash_test -j
Branch with unit test and no fix https://github.com/facebook/rocksdb/pull/9942 to keep track of unit test (without fix)
Reviewed By: riversand963
Differential Revision: D36043701
Pulled By: akankshamahajan15
fbshipit-source-id: 5760970db0a0920fb73d3c054a4155733500acd9
3 years ago
  if (s.ok()) {
    s = impl->LogAndApplyForRecovery(recovery_ctx);
  }

  if (s.ok() && impl->immutable_db_options_.persist_stats_to_disk) {
    impl->mutex_.AssertHeld();
    s = impl->InitPersistStatsColumnFamily();
  }

  if (s.ok()) {
    // set column family handles
    for (auto cf : column_families) {
      auto cfd =
          impl->versions_->GetColumnFamilySet()->GetColumnFamily(cf.name);
      if (cfd != nullptr) {
        handles->push_back(
            new ColumnFamilyHandleImpl(cfd, impl, &impl->mutex_));
        impl->NewThreadStatusCfInfo(cfd);
      } else {
        if (db_options.create_missing_column_families) {
          // missing column family, create it
          ColumnFamilyHandle* handle = nullptr;
          impl->mutex_.Unlock();
          s = impl->CreateColumnFamily(cf.options, cf.name, &handle);
          impl->mutex_.Lock();
          if (s.ok()) {
            handles->push_back(handle);
          } else {
            break;
          }
        } else {
          s = Status::InvalidArgument("Column family not found", cf.name);
          break;
        }
      }
    }
  }

  if (s.ok()) {
    SuperVersionContext sv_context(/* create_superversion */ true);
    for (auto cfd : *impl->versions_->GetColumnFamilySet()) {
      impl->InstallSuperVersionAndScheduleWork(
          cfd, &sv_context, *cfd->GetLatestMutableCFOptions());
    }
    sv_context.Clean();
  }

  if (s.ok() && impl->immutable_db_options_.persist_stats_to_disk) {
    // try to read format version
    s = impl->PersistentStatsProcessFormatVersion();
  }

  if (s.ok()) {
    for (auto cfd : *impl->versions_->GetColumnFamilySet()) {
      if (!cfd->mem()->IsSnapshotSupported()) {
        impl->is_snapshot_supported_ = false;
      }
      if (cfd->ioptions()->merge_operator != nullptr &&
          !cfd->mem()->IsMergeOperatorSupported()) {
        s = Status::InvalidArgument(
            "The memtable of column family %s does not support merge operator "
            "its options.merge_operator is non-null",
            cfd->GetName().c_str());
      }
      if (!s.ok()) {
        break;
      }
    }
  }
  TEST_SYNC_POINT("DBImpl::Open:Opened");
  Status persist_options_status;
  if (s.ok()) {
    // Persist RocksDB Options before scheduling the compaction.
    // The WriteOptionsFile() will release and lock the mutex internally.
    persist_options_status = impl->WriteOptionsFile(
        false /*need_mutex_lock*/, false /*need_enter_write_thread*/);

    *dbptr = impl;
    impl->opened_successfully_ = true;
    impl->DeleteObsoleteFiles();
    TEST_SYNC_POINT("DBImpl::Open:AfterDeleteFiles");
    impl->MaybeScheduleFlushOrCompaction();
  } else {
    persist_options_status.PermitUncheckedError();
  }
  impl->mutex_.Unlock();

  auto sfm = static_cast<SstFileManagerImpl*>(
      impl->immutable_db_options_.sst_file_manager.get());
  if (s.ok() && sfm) {
    // Set Statistics ptr for SstFileManager to dump the stats of
    // DeleteScheduler.
    sfm->SetStatisticsPtr(impl->immutable_db_options_.statistics);
    ROCKS_LOG_INFO(impl->immutable_db_options_.info_log,
                   "SstFileManager instance %p", sfm);

    // Notify SstFileManager about all sst files that already exist in
    // db_paths[0] and cf_paths[0] when the DB is opened.

    // SstFileManagerImpl needs to know sizes of the files. For files whose size
    // we already know (sst files that appear in manifest - typically that's the
    // vast majority of all files), we'll pass the size to SstFileManager.
    // For all other files SstFileManager will query the size from filesystem.

    std::vector<ColumnFamilyMetaData> metadata;
    impl->GetAllColumnFamilyMetaData(&metadata);

    std::unordered_map<std::string, uint64_t> known_file_sizes;
    for (const auto& md : metadata) {
      for (const auto& lmd : md.levels) {
        for (const auto& fmd : lmd.files) {
          known_file_sizes[fmd.relative_filename] = fmd.size;
        }
      }
      for (const auto& bmd : md.blob_files) {
        std::string name = bmd.blob_file_name;
        // The BlobMetaData.blob_file_name may start with "/".
        if (!name.empty() && name[0] == '/') {
          name = name.substr(1);
        }
        known_file_sizes[name] = bmd.blob_file_size;
      }
    }

    std::vector<std::string> paths;
    paths.emplace_back(impl->immutable_db_options_.db_paths[0].path);
    for (auto& cf : column_families) {
      if (!cf.options.cf_paths.empty()) {
        paths.emplace_back(cf.options.cf_paths[0].path);
      }
    }
    // Remove duplicate paths.
    std::sort(paths.begin(), paths.end());
    paths.erase(std::unique(paths.begin(), paths.end()), paths.end());
    IOOptions io_opts;
    io_opts.do_not_recurse = true;
    for (auto& path : paths) {
      std::vector<std::string> existing_files;
      impl->immutable_db_options_.fs
          ->GetChildren(path, io_opts, &existing_files,
                        /*IODebugContext*=*/nullptr)
          .PermitUncheckedError();  // TODO: What to do on error?
      for (auto& file_name : existing_files) {
        uint64_t file_number;
        FileType file_type;
        std::string file_path = path + "/" + file_name;
        if (ParseFileName(file_name, &file_number, &file_type) &&
            (file_type == kTableFile || file_type == kBlobFile)) {
          // TODO: Check for errors from OnAddFile?
          if (known_file_sizes.count(file_name)) {
            // We're assuming that each sst file name exists in at most one of
            // the paths.
            sfm->OnAddFile(file_path, known_file_sizes.at(file_name))
                .PermitUncheckedError();
          } else {
            sfm->OnAddFile(file_path).PermitUncheckedError();
          }
        }
      }
    }
Auto recovery from out of space errors (#4164)
Summary:
This commit implements automatic recovery from a Status::NoSpace() error
during background operations such as write callback, flush and
compaction. The broad design is as follows -
1. Compaction errors are treated as soft errors and don't put the
database in read-only mode. A compaction is delayed until enough free
disk space is available to accommodate the compaction outputs, which is
estimated based on the input size. This means that users can continue to
write, and we rely on the WriteController to delay or stop writes if the
compaction debt becomes too high due to a persistent low disk space
condition.
2. Errors during write callback and flush are treated as hard errors,
i.e. the database is put in read-only mode and goes back to read-write
only after certain recovery actions are taken.
3. Both types of recovery rely on the SstFileManagerImpl to poll for
sufficient disk space. We assume that there is a 1-1 mapping between an
SFM and the underlying OS storage container. For cases where multiple
DBs are hosted on a single storage container, the user is expected to
allocate a single SFM instance and use the same one for all the DBs. If
no SFM is specified by the user, DBImpl::Open() will allocate one, but
this will be one per DB and each DB will recover independently. The
recovery implemented by SFM is as follows -
a) On the first occurrence of an out of space error during compaction,
subsequent compactions will be delayed until the disk free space check
indicates enough available space. The required space is computed as the
sum of input sizes.
b) The free space check requirement will be removed once the amount of
free space is greater than the size reserved by in-progress
compactions when the first error occurred.
c) If the out of space error is a hard error, a background thread in
SFM will poll for sufficient headroom before triggering the recovery
of the database and putting it in read-write mode. The headroom is
calculated as the sum of the write_buffer_size of all the DB instances
associated with the SFM.
4. EventListener callbacks will be called at the start and completion of
automatic recovery. Users can disable the auto recovery in the start
callback, and later initiate it manually by calling DB::Resume()
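The headroom rule for hard errors described in 3c can be sketched as below. This is an illustrative model only: `RequiredHeadroom` and `CanResumeWrites` are hypothetical names, and the free-space figure is passed in as a plain number rather than queried from a file system.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper: the headroom is modeled as the sum of the
// write_buffer_size values of all DB instances sharing one SFM.
uint64_t RequiredHeadroom(const std::vector<uint64_t>& write_buffer_sizes) {
  return std::accumulate(write_buffer_sizes.begin(), write_buffer_sizes.end(),
                         uint64_t{0});
}

// Hypothetical poll predicate: recovery may be triggered once the free
// space reported for the storage container covers the headroom.
bool CanResumeWrites(uint64_t free_bytes,
                     const std::vector<uint64_t>& write_buffer_sizes) {
  return free_bytes >= RequiredHeadroom(write_buffer_sizes);
}
```

For example, two DBs with 64 MB and 32 MB write buffers would need 96 MB free under this model before the background thread attempts recovery.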
Todo:
1. More extensive testing
2. Add disk full condition to db_stress (follow-on PR)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164
Differential Revision: D9846378
Pulled By: anand1976
fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a
6 years ago
    // Reserve some disk buffer space. This is a heuristic - when we run out
    // of disk space, this ensures that there is at least write_buffer_size
    // amount of free space before we resume DB writes. In low disk space
    // conditions, we want to avoid a lot of small L0 files due to frequent
    // WAL write failures and resultant forced flushes
    sfm->ReserveDiskBuffer(max_write_buffer_size,
                           impl->immutable_db_options_.db_paths[0].path);
  }


  if (s.ok()) {
    ROCKS_LOG_HEADER(impl->immutable_db_options_.info_log, "DB pointer %p",
                     impl);
    LogFlush(impl->immutable_db_options_.info_log);
Add manual_wal_flush, FlushWAL() to stress/crash test (#10698)
Summary:
**Context/Summary:**
Introduce `manual_wal_flush_one_in` as titled.
- When `manual_wal_flush_one_in > 0`, we also need tracing to correctly verify recovery because WAL data can be lost in this case when `FlushWAL()` is not explicitly called by users of RocksDB (in our case, db stress) and the recovery from such potential WAL data loss is a prefix recovery that requires tracing to verify. As another consequence, we need to disable features that can't run under unsynced data loss with `manual_wal_flush_one_in`
Incompatibilities fixed along the way:
```
db_stress: db/db_impl/db_impl_open.cc:2063: static rocksdb::Status rocksdb::DBImpl::Open(const rocksdb::DBOptions&, const string&, const std::vector<rocksdb::ColumnFamilyDescriptor>&, std::vector<rocksdb::ColumnFamilyHandle*>*, rocksdb::DB**, bool, bool): Assertion `impl->TEST_WALBufferIsEmpty()' failed.
```
- It turns out that `Writer::AddCompressionTypeRecord`, which runs before this assertion, calls `EmitPhysicalRecord(kSetCompressionType, encode.data(), encode.size());` but does not trigger a flush if `manual_wal_flush` is set. As a result, `impl->TEST_WALBufferIsEmpty()` is false.
- As suggested, the assertion is removed and the violation case is handled by `FlushWAL(sync=true)`, along with renaming `TEST_WALBufferIsEmpty()` to `WALBufferIsEmpty()` since it is now used in production code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10698
Test Plan:
- Locally running `python3 tools/db_crashtest.py blackbox --manual_wal_flush_one_in=1 --manual_wal_flush=1 --sync_wal_one_in=100 --atomic_flush=1 --flush_one_in=100 --column_families=3`
- Joined https://github.com/facebook/rocksdb/pull/10624 in auto CI testings with all RocksDB stress/crash test jobs
Reviewed By: ajkr
Differential Revision: D39593752
Pulled By: ajkr
fbshipit-source-id: 3a2135bb792c52d2ffa60257d4fbc557fb04d2ce
2 years ago
    if (!impl->WALBufferIsEmpty()) {
      s = impl->FlushWAL(false);
      if (s.ok()) {
        // Sync is needed otherwise WAL buffered data might get lost after a
        // power reset.
        log::Writer* log_writer = impl->logs_.back().writer;
        s = log_writer->file()->Sync(impl->immutable_db_options_.use_fsync);
      }
    }
    if (s.ok() && !persist_options_status.ok()) {
      s = Status::IOError(
          "DB::Open() failed --- Unable to persist Options file",
          persist_options_status.ToString());
    }
  }
  if (!s.ok()) {
    ROCKS_LOG_WARN(impl->immutable_db_options_.info_log,
                   "DB::Open() failed: %s", s.ToString().c_str());
  }
move dump stats to a separate thread (#4382)
Summary:
Currently, statistics are supposed to be dumped to the info log at intervals of `options.stats_dump_period_sec`. However, the implementation bound the dump to the compaction thread, meaning that if the database has been serving very light traffic, the stats may not get dumped at all.
We decided to move stats dumping into a new timed thread using `TimerQueue`, which is already used in blob_db. This will allow us to schedule new timed tasks with more deterministic behavior.
Tested with db_bench using `--stats_dump_period_sec=20` in command line:
> LOG:2018/09/17-14:07:45.575025 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS -------
LOG:2018/09/17-14:08:05.643286 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS -------
LOG:2018/09/17-14:08:25.691325 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS -------
LOG:2018/09/17-14:08:45.740989 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS -------
LOG content:
> 2018/09/17-14:07:45.575025 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS -------
2018/09/17-14:07:45.575080 7fe99fbfe700 [WARN] [db/db_impl.cc:606]
** DB Stats **
Uptime(secs): 20.0 total, 20.0 interval
Cumulative writes: 4447K writes, 4447K keys, 4447K commit groups, 1.0 writes per commit group, ingest: 5.57 GB, 285.01 MB/s
Cumulative WAL: 4447K writes, 0 syncs, 4447638.00 writes per sync, written: 5.57 GB, 285.01 MB/s
Cumulative stall: 00:00:0.012 H:M:S, 0.1 percent
Interval writes: 4447K writes, 4447K keys, 4447K commit groups, 1.0 writes per commit group, ingest: 5700.71 MB, 285.01 MB/s
Interval WAL: 4447K writes, 0 syncs, 4447638.00 writes per sync, written: 5.57 MB, 285.01 MB/s
Interval stall: 00:00:0.012 H:M:S, 0.1 percent
** Compaction Stats [default] **
Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4382
Differential Revision: D9933051
Pulled By: miasantreble
fbshipit-source-id: 6d12bb1e4977674eea4bf2d2ac6d486b814bb2fa
6 years ago
  if (s.ok()) {
    s = impl->StartPeriodicTaskScheduler();
  }

  if (s.ok()) {
    s = impl->RegisterRecordSeqnoTimeWorker();
  }
  if (!s.ok()) {
    for (auto* h : *handles) {
      delete h;
    }
    handles->clear();
    delete impl;
    *dbptr = nullptr;
  }
  return s;
}
}  // namespace ROCKSDB_NAMESPACE