Define WAL related classes to be used in VersionEdit and VersionSet (#7164)
Summary:
`WalAddition` and `WalDeletion` are defined in `wal_edit.h` and used in `VersionEdit`.
`WalAddition` represents the event of creating a new WAL (no size, just the log number) or closing a WAL (with size).
`WalDeletion` represents the event of deleting or archiving a WAL; it means the WAL is no longer alive (it won't be replayed during recovery).
`WalSet` is the set of alive WALs kept in `VersionSet`.

1. Why use `WalDeletion` instead of relying on `MinLogNumber` to identify outdated WALs

On recovery, we can compute `MinLogNumber()` from the log numbers kept in MANIFEST; any log with a number < MinLogNumber can be ignored. So it seems that we don't need to persist `WalDeletion` to MANIFEST, since we can ignore the WALs based on MinLogNumber.

But `MinLogNumber()` is actually a lower bound; it does not mean that every log starting from MinLogNumber must exist. This is because of a corner case: when a column family is empty and never flushed, its log number is set to the largest log number, but that number is not persisted in MANIFEST. Say there are 2 column families. When the DB is created, the first WAL has log number 1, so 1 is persisted to MANIFEST for both column families. Then CF 0 stays empty and is never flushed, while CF 1 is updated and flushed, so a new WAL with log number 2 is created and persisted to MANIFEST for CF 1. But CF 0's log number in MANIFEST is still 1. So on recovery, MinLogNumber is 1, but since log 1 only contains data for CF 1, and CF 1 is flushed, log 1 might have already been deleted from disk.

We could make `MinLogNumber()` exactly the minimum log number that must exist by persisting the most recent log number for empty column families that are not flushed. But if there are N such column families, then every time a new WAL is created, we would need to add N records to MANIFEST.

In the current design, a record is persisted to MANIFEST only when a WAL is created, closed, or deleted/archived, so the number of WAL related records is bounded by 3x the number of WALs.

2. Why keep `WalSet` in `VersionSet` instead of applying the `VersionEdit`s to `VersionStorageInfo`

`VersionEdit`s were originally designed to track the addition and deletion of SST files. SST files belong to column families: each column family has a list of `Version`s, and each `Version` keeps the set of active SST files in `VersionStorageInfo`.

But WALs are a DB-level concept; they are not bound to specific column families. So logically it does not make sense to store WALs in a column family's `Version`s.

Also, a `Version`'s purpose is to keep references to SST / blob files, so that they are not deleted until no version references them. But a WAL is deleted regardless of version references.

So we keep the WALs in `VersionSet` for the purpose of writing out a snapshot of the DB state when creating new MANIFESTs.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/7164

Test Plan:
make version_edit_test && ./version_edit_test
make wal_edit_test && ./wal_edit_test

Reviewed By: ltamasi

Differential Revision: D22677936

Pulled By: cheng-chang

fbshipit-source-id: 5a3b6890140e572ffd79eb37e6e4c3c32361a859
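The `MinLogNumber()` corner case in point 1 can be replayed with a small standalone model. This is only a sketch: `ManifestLogNumbers`, `Scenario`, `BuildCornerCase`, and the free function `MinLogNumber` below are hypothetical names made up for illustration and are not RocksDB APIs. The model tracks just the per-column-family log numbers recorded in MANIFEST and the set of WAL files assumed to be on disk.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>
#include <map>
#include <set>
#include <string>

// Hypothetical model (not RocksDB code): the log number recorded in MANIFEST
// for each column family, keyed by column family name.
using ManifestLogNumbers = std::map<std::string, uint64_t>;

// Smallest log number recorded for any column family. As the commit message
// explains, this is only a lower bound on the WALs that still exist.
uint64_t MinLogNumber(const ManifestLogNumbers& cf_log_numbers) {
  uint64_t min_log = std::numeric_limits<uint64_t>::max();
  for (const auto& kv : cf_log_numbers) {
    min_log = std::min(min_log, kv.second);
  }
  return min_log;
}

struct Scenario {
  ManifestLogNumbers manifest;
  std::set<uint64_t> wals_on_disk;
};

// Replays the two-column-family example from the commit message.
Scenario BuildCornerCase() {
  Scenario s;
  // DB creation: WAL 1 is recorded in MANIFEST for both column families.
  s.manifest["cf0"] = 1;
  s.manifest["cf1"] = 1;
  s.wals_on_disk.insert(1);
  // CF 1 is updated and flushed: WAL 2 is created and recorded for CF 1 only.
  // CF 0 is empty and never flushed, so its MANIFEST entry stays at 1.
  s.manifest["cf1"] = 2;
  s.wals_on_disk.insert(2);
  // WAL 1 only held CF 1 data that is now flushed, so it may be deleted.
  s.wals_on_disk.erase(1);
  return s;
}
```

In this model `MinLogNumber(s.manifest)` is 1 even though WAL 1 is no longer in `s.wals_on_disk`, which is the gap that an explicit `WalDeletion` record is meant to close.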
parent 124fbd96d8
commit cd48ecaa1a
@@ -0,0 +1,175 @@
// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
// This source code is licensed under both the GPLv2 (found in the
// COPYING file in the root directory) and Apache 2.0 License
// (found in the LICENSE.Apache file in the root directory).

#include "db/wal_edit.h"

#include "rocksdb/slice.h"
#include "rocksdb/status.h"
#include "util/coding.h"

namespace ROCKSDB_NAMESPACE {

void WalAddition::EncodeTo(std::string* dst) const {
  PutVarint64(dst, number_);

  if (metadata_.HasSize()) {
    PutVarint32(dst, static_cast<uint32_t>(WalAdditionTag::kSize));
    PutVarint64(dst, metadata_.GetSizeInBytes());
  }

  PutVarint32(dst, static_cast<uint32_t>(WalAdditionTag::kTerminate));
}

Status WalAddition::DecodeFrom(Slice* src) {
  constexpr char class_name[] = "WalAddition";

  if (!GetVarint64(src, &number_)) {
    return Status::Corruption(class_name, "Error decoding WAL log number");
  }

  while (true) {
    uint32_t tag_value = 0;
    if (!GetVarint32(src, &tag_value)) {
      return Status::Corruption(class_name, "Error decoding tag");
    }
    WalAdditionTag tag = static_cast<WalAdditionTag>(tag_value);
    switch (tag) {
      case WalAdditionTag::kSize: {
        uint64_t size = 0;
        if (!GetVarint64(src, &size)) {
          return Status::Corruption(class_name, "Error decoding WAL file size");
        }
        metadata_.SetSizeInBytes(size);
        break;
      }
      // TODO: process future tags such as checksum.
      case WalAdditionTag::kTerminate:
        return Status::OK();
      default: {
        std::stringstream ss;
        ss << "Unknown tag " << tag_value;
        return Status::Corruption(class_name, ss.str());
      }
    }
  }
}

JSONWriter& operator<<(JSONWriter& jw, const WalAddition& wal) {
  jw << "LogNumber" << wal.GetLogNumber() << "SizeInBytes"
     << wal.GetMetadata().GetSizeInBytes();
  return jw;
}

std::ostream& operator<<(std::ostream& os, const WalAddition& wal) {
  os << "log_number: " << wal.GetLogNumber()
     << " size_in_bytes: " << wal.GetMetadata().GetSizeInBytes();
  return os;
}

std::string WalAddition::DebugString() const {
  std::ostringstream oss;
  oss << *this;
  return oss.str();
}

void WalDeletion::EncodeTo(std::string* dst) const {
  PutVarint64(dst, number_);
}

Status WalDeletion::DecodeFrom(Slice* src) {
  constexpr char class_name[] = "WalDeletion";

  if (!GetVarint64(src, &number_)) {
    return Status::Corruption(class_name, "Error decoding WAL log number");
  }

  return Status::OK();
}

JSONWriter& operator<<(JSONWriter& jw, const WalDeletion& wal) {
  jw << "LogNumber" << wal.GetLogNumber();
  return jw;
}

std::ostream& operator<<(std::ostream& os, const WalDeletion& wal) {
  os << "log_number: " << wal.GetLogNumber();
  return os;
}

std::string WalDeletion::DebugString() const {
  std::ostringstream oss;
  oss << *this;
  return oss.str();
}

Status WalSet::AddWal(const WalAddition& wal) {
  auto it = wals_.lower_bound(wal.GetLogNumber());
  if (wal.GetMetadata().HasSize()) {
    // Closing a WAL: it must already exist without a size.
    if (it == wals_.end() || it->first != wal.GetLogNumber()) {
      std::stringstream ss;
      ss << "WAL " << wal.GetLogNumber() << " is not created before closing";
      return Status::Corruption("WalSet", ss.str());
    }
    if (it->second.HasSize()) {
      std::stringstream ss;
      ss << "WAL " << wal.GetLogNumber() << " is closed more than once";
      return Status::Corruption("WalSet", ss.str());
    }
    it->second = wal.GetMetadata();
  } else {
    // Creating a WAL: it must not exist beforehand.
    if (it != wals_.end() && it->first == wal.GetLogNumber()) {
      std::stringstream ss;
      ss << "WAL " << wal.GetLogNumber() << " is created more than once";
      return Status::Corruption("WalSet", ss.str());
    }
    wals_[wal.GetLogNumber()] = wal.GetMetadata();
  }
  return Status::OK();
}

Status WalSet::AddWals(const WalAdditions& wals) {
  Status s;
  for (const WalAddition& wal : wals) {
    s = AddWal(wal);
    if (!s.ok()) {
      break;
    }
  }
  return s;
}

Status WalSet::DeleteWal(const WalDeletion& wal) {
  auto it = wals_.lower_bound(wal.GetLogNumber());
  // The WAL must exist and must have been closed.
  if (it == wals_.end() || it->first != wal.GetLogNumber()) {
    std::stringstream ss;
    ss << "WAL " << wal.GetLogNumber() << " must exist before deletion";
    return Status::Corruption("WalSet", ss.str());
  }
  if (!it->second.HasSize()) {
    std::stringstream ss;
    ss << "WAL " << wal.GetLogNumber() << " must be closed before deletion";
    return Status::Corruption("WalSet", ss.str());
  }
  wals_.erase(it);
  return Status::OK();
}

Status WalSet::DeleteWals(const WalDeletions& wals) {
  Status s;
  for (const WalDeletion& wal : wals) {
    s = DeleteWal(wal);
    if (!s.ok()) {
      break;
    }
  }
  return s;
}

void WalSet::Reset() { wals_.clear(); }

}  // namespace ROCKSDB_NAMESPACE
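The record layout produced by `WalAddition::EncodeTo` above (log number, an optional `kSize` tag followed by the size, then `kTerminate`) can be tried out without the RocksDB tree. The sketch below is a standalone approximation: `AppendVarint`, `ReadVarint`, `EncodeWalAddition`, `DecodeWalAddition`, and `DecodedWal` are hypothetical stand-ins for `PutVarint64`/`GetVarint64` and the class itself, and it assumes the usual little-endian base-128 varint encoding. The tag values 1 and 2 are taken from `WalAdditionTag` in this commit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Stand-in for PutVarint32/PutVarint64: 7 payload bits per byte, with the
// high bit set on every byte except the last.
void AppendVarint(std::string* dst, uint64_t v) {
  while (v >= 0x80) {
    dst->push_back(static_cast<char>((v & 0x7F) | 0x80));
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));
}

// Stand-in for GetVarint64: reads one varint starting at *pos and advances
// *pos past it. Returns false on truncated or overlong input.
bool ReadVarint(const std::string& src, size_t* pos, uint64_t* out) {
  uint64_t result = 0;
  for (int shift = 0; shift <= 63 && *pos < src.size(); shift += 7) {
    uint64_t byte = static_cast<unsigned char>(src[(*pos)++]);
    result |= (byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) {
      *out = result;
      return true;
    }
  }
  return false;
}

constexpr uint64_t kTerminateTag = 1;  // WalAdditionTag::kTerminate
constexpr uint64_t kSizeTag = 2;       // WalAdditionTag::kSize

struct DecodedWal {
  uint64_t number = 0;
  uint64_t size_bytes = 0;  // 0 means "size unknown", as in WalMetadata
};

// Layout: <number> [kSize <size>] kTerminate.
std::string EncodeWalAddition(uint64_t number, uint64_t size_bytes) {
  std::string dst;
  AppendVarint(&dst, number);
  if (size_bytes != 0) {
    AppendVarint(&dst, kSizeTag);
    AppendVarint(&dst, size_bytes);
  }
  AppendVarint(&dst, kTerminateTag);
  return dst;
}

// Mirrors the tag loop in DecodeFrom: read tags until kTerminate and fail on
// unknown tags or truncated input.
bool DecodeWalAddition(const std::string& src, DecodedWal* out) {
  size_t pos = 0;
  if (!ReadVarint(src, &pos, &out->number)) return false;
  while (true) {
    uint64_t tag = 0;
    if (!ReadVarint(src, &pos, &tag)) return false;
    if (tag == kTerminateTag) return true;
    if (tag != kSizeTag) return false;
    if (!ReadVarint(src, &pos, &out->size_bytes)) return false;
  }
}
```

Because every record ends with an explicit terminator tag, a new optional field (the TODO mentions a checksum) could later be appended as another tag without changing how existing fields are framed.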
@@ -0,0 +1,143 @@
// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
// This source code is licensed under both the GPLv2 (found in the
// COPYING file in the root directory) and Apache 2.0 License
// (found in the LICENSE.Apache file in the root directory).

// WAL related classes used in VersionEdit and VersionSet.

#pragma once

#include <map>
#include <ostream>
#include <string>
#include <vector>

#include "logging/event_logger.h"
#include "rocksdb/rocksdb_namespace.h"

namespace ROCKSDB_NAMESPACE {

class JSONWriter;
class Slice;
class Status;

using WalNumber = uint64_t;

// Metadata of a WAL.
class WalMetadata {
 public:
  WalMetadata() = default;

  explicit WalMetadata(uint64_t size_bytes) : size_bytes_(size_bytes) {}

  bool HasSize() const { return size_bytes_ != kUnknownWalSize; }

  void SetSizeInBytes(uint64_t bytes) { size_bytes_ = bytes; }

  uint64_t GetSizeInBytes() const { return size_bytes_; }

 private:
  // Sentinel for an unknown size, used while the WAL is not closed yet.
  constexpr static uint64_t kUnknownWalSize = 0;

  // Size of a closed WAL in bytes.
  uint64_t size_bytes_ = kUnknownWalSize;
};

// These tags are persisted to MANIFEST, so they are part of the user API.
enum class WalAdditionTag : uint32_t {
  // Indicates that there are no more tags.
  kTerminate = 1,
  // Size in bytes.
  kSize = 2,
  // Add tags in the future, such as checksum?
};

// Records the event of adding a WAL in VersionEdit.
class WalAddition {
 public:
  WalAddition() : number_(0), metadata_() {}

  explicit WalAddition(WalNumber number) : number_(number), metadata_() {}

  WalAddition(WalNumber number, WalMetadata meta)
      : number_(number), metadata_(std::move(meta)) {}

  WalNumber GetLogNumber() const { return number_; }

  const WalMetadata& GetMetadata() const { return metadata_; }

  void EncodeTo(std::string* dst) const;

  Status DecodeFrom(Slice* src);

  std::string DebugString() const;

 private:
  WalNumber number_;
  WalMetadata metadata_;
};

std::ostream& operator<<(std::ostream& os, const WalAddition& wal);
JSONWriter& operator<<(JSONWriter& jw, const WalAddition& wal);

using WalAdditions = std::vector<WalAddition>;

// Records the event of deleting/archiving a WAL in VersionEdit.
class WalDeletion {
 public:
  WalDeletion() : number_(0) {}

  explicit WalDeletion(WalNumber number) : number_(number) {}

  WalNumber GetLogNumber() const { return number_; }

  void EncodeTo(std::string* dst) const;

  Status DecodeFrom(Slice* src);

  std::string DebugString() const;

 private:
  WalNumber number_;
};

std::ostream& operator<<(std::ostream& os, const WalDeletion& wal);
JSONWriter& operator<<(JSONWriter& jw, const WalDeletion& wal);

using WalDeletions = std::vector<WalDeletion>;

// Used in VersionSet to keep the current set of WALs.
//
// When a WAL is created, closed, deleted, or archived,
// a VersionEdit is logged to MANIFEST and
// the WAL is added to or deleted from WalSet.
//
// Not thread safe, needs external synchronization such as holding DB mutex.
class WalSet {
 public:
  // Add WAL(s).
  // If the WAL has size, it means the WAL is closed,
  // then there must be an existing WAL without size that is added
  // when creating the WAL, otherwise, return Status::Corruption.
  // Can happen when applying a VersionEdit or recovering from MANIFEST.
  Status AddWal(const WalAddition& wal);
  Status AddWals(const WalAdditions& wals);

  // Delete WAL(s).
  // The WAL to be deleted must exist and be closed, otherwise,
  // return Status::Corruption.
  // Can happen when applying a VersionEdit or recovering from MANIFEST.
  Status DeleteWal(const WalDeletion& wal);
  Status DeleteWals(const WalDeletions& wals);

  // Resets the internal state.
  void Reset();

  const std::map<WalNumber, WalMetadata>& GetWals() const { return wals_; }

 private:
  std::map<WalNumber, WalMetadata> wals_;
};

}  // namespace ROCKSDB_NAMESPACE
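The create, close, delete ordering that `WalSet` enforces, and the "at most 3 records per WAL" bound from the commit message, can be illustrated with a reduced model. This is a sketch, not the RocksDB class: `MiniWalSet`, `WalState`, and the method names are hypothetical, `false` stands in for `Status::Corruption`, and an explicit `closed` flag replaces the size-based `HasSize()` check used by `WalMetadata`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

// Hypothetical per-WAL state; the real WalMetadata infers "closed" from size.
struct WalState {
  bool closed = false;
  uint64_t size_bytes = 0;
};

// Simplified model of the WalSet rules. Each successful call counts as one
// record that would be persisted to MANIFEST.
class MiniWalSet {
 public:
  // A WAL may be created only once.
  bool Create(uint64_t number) {
    bool inserted = wals_.emplace(number, WalState()).second;
    if (inserted) ++records_;
    return inserted;
  }

  // A WAL may be closed only if it exists and is still open.
  bool Close(uint64_t number, uint64_t size_bytes) {
    auto it = wals_.find(number);
    if (it == wals_.end() || it->second.closed) return false;
    it->second.closed = true;
    it->second.size_bytes = size_bytes;
    ++records_;
    return true;
  }

  // A WAL may be deleted only if it exists and has been closed.
  bool Delete(uint64_t number) {
    auto it = wals_.find(number);
    if (it == wals_.end() || !it->second.closed) return false;
    wals_.erase(it);
    ++records_;
    return true;
  }

  size_t AliveCount() const { return wals_.size(); }
  size_t RecordCount() const { return records_; }

 private:
  std::map<uint64_t, WalState> wals_;
  size_t records_ = 0;
};
```

Since each WAL can pass through `Create`, `Close`, and `Delete` at most once, `RecordCount()` can never exceed three times the number of WALs ever created, which matches the bound stated in the summary.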
@@ -0,0 +1,127 @@
// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
// This source code is licensed under both the GPLv2 (found in the
// COPYING file in the root directory) and Apache 2.0 License
// (found in the LICENSE.Apache file in the root directory).

#include "db/wal_edit.h"

#include "port/port.h"
#include "port/stack_trace.h"
#include "test_util/testharness.h"
#include "test_util/testutil.h"

namespace ROCKSDB_NAMESPACE {

TEST(WalSet, AddDeleteReset) {
  WalSet wals;
  ASSERT_TRUE(wals.GetWals().empty());

  // Create WAL 1 - 10.
  for (WalNumber log_number = 1; log_number <= 10; log_number++) {
    ASSERT_OK(wals.AddWal(WalAddition(log_number)));
  }
  ASSERT_EQ(wals.GetWals().size(), 10);

  // Close WAL 1 - 5.
  for (WalNumber log_number = 1; log_number <= 5; log_number++) {
    ASSERT_OK(wals.AddWal(WalAddition(log_number, WalMetadata(100))));
  }
  ASSERT_EQ(wals.GetWals().size(), 10);

  // Delete WAL 1 - 5.
  for (WalNumber log_number = 1; log_number <= 5; log_number++) {
    ASSERT_OK(wals.DeleteWal(WalDeletion(log_number)));
  }
  ASSERT_EQ(wals.GetWals().size(), 5);

  WalNumber expected_log_number = 6;
  for (const auto& it : wals.GetWals()) {
    WalNumber log_number = it.first;
    ASSERT_EQ(log_number, expected_log_number++);
  }

  wals.Reset();
  ASSERT_TRUE(wals.GetWals().empty());
}

TEST(WalSet, Overwrite) {
  constexpr WalNumber kNumber = 100;
  constexpr uint64_t kBytes = 200;
  WalSet wals;
  ASSERT_OK(wals.AddWal(WalAddition(kNumber)));
  ASSERT_FALSE(wals.GetWals().at(kNumber).HasSize());
  ASSERT_OK(wals.AddWal(WalAddition(kNumber, WalMetadata(kBytes))));
  ASSERT_TRUE(wals.GetWals().at(kNumber).HasSize());
  ASSERT_EQ(wals.GetWals().at(kNumber).GetSizeInBytes(), kBytes);
}

TEST(WalSet, CreateTwice) {
  constexpr WalNumber kNumber = 100;
  WalSet wals;
  ASSERT_OK(wals.AddWal(WalAddition(kNumber)));
  Status s = wals.AddWal(WalAddition(kNumber));
  ASSERT_TRUE(s.IsCorruption());
  ASSERT_TRUE(s.ToString().find("WAL 100 is created more than once") !=
              std::string::npos);
}

TEST(WalSet, CloseTwice) {
  constexpr WalNumber kNumber = 100;
  constexpr uint64_t kBytes = 200;
  WalSet wals;
  ASSERT_OK(wals.AddWal(WalAddition(kNumber)));
  ASSERT_OK(wals.AddWal(WalAddition(kNumber, WalMetadata(kBytes))));
  Status s = wals.AddWal(WalAddition(kNumber, WalMetadata(kBytes)));
  ASSERT_TRUE(s.IsCorruption());
  ASSERT_TRUE(s.ToString().find("WAL 100 is closed more than once") !=
              std::string::npos);
}

TEST(WalSet, CloseBeforeCreate) {
  constexpr WalNumber kNumber = 100;
  constexpr uint64_t kBytes = 200;
  WalSet wals;
  Status s = wals.AddWal(WalAddition(kNumber, WalMetadata(kBytes)));
  ASSERT_TRUE(s.IsCorruption());
  ASSERT_TRUE(s.ToString().find("WAL 100 is not created before closing") !=
              std::string::npos);
}

TEST(WalSet, CreateAfterClose) {
  constexpr WalNumber kNumber = 100;
  constexpr uint64_t kBytes = 200;
  WalSet wals;
  ASSERT_OK(wals.AddWal(WalAddition(kNumber)));
  ASSERT_OK(wals.AddWal(WalAddition(kNumber, WalMetadata(kBytes))));
  Status s = wals.AddWal(WalAddition(kNumber));
  ASSERT_TRUE(s.IsCorruption());
  ASSERT_TRUE(s.ToString().find("WAL 100 is created more than once") !=
              std::string::npos);
}

TEST(WalSet, DeleteNonExistingWal) {
  constexpr WalNumber kNonExistingNumber = 100;
  WalSet wals;
  Status s = wals.DeleteWal(WalDeletion(kNonExistingNumber));
  ASSERT_TRUE(s.IsCorruption());
  ASSERT_TRUE(s.ToString().find("WAL 100 must exist before deletion") !=
              std::string::npos);
}

TEST(WalSet, DeleteNonClosedWal) {
  constexpr WalNumber kNumber = 100;
  WalSet wals;
  ASSERT_OK(wals.AddWal(WalAddition(kNumber)));
  Status s = wals.DeleteWal(WalDeletion(kNumber));
  ASSERT_TRUE(s.IsCorruption());
  ASSERT_TRUE(s.ToString().find("WAL 100 must be closed before deletion") !=
              std::string::npos);
}

}  // namespace ROCKSDB_NAMESPACE

int main(int argc, char** argv) {
  ROCKSDB_NAMESPACE::port::InstallStackTraceHandler();
  ::testing::InitGoogleTest(&argc, argv);
  return RUN_ALL_TESTS();
}