Always verify SST unique IDs on SST file open (#10532)

Summary:
Although we've been tracking SST unique IDs in the DB manifest
unconditionally, checking has been opt-in and required an extra pass at
DB::Open time. This changes the behavior of `verify_sst_unique_id_in_manifest`
to check the unique ID against the manifest every time an SST file is opened
through the table cache (normal DB operations), replacing the explicit pass
over files
at DB::Open time. This change also enables the option by default and
removes the "EXPERIMENTAL" designation.

One possible criticism is that the option no longer ensures the integrity
of a DB at Open time. This is far from an all-or-nothing issue. Verifying
the IDs of all SST files hardly ensures all the data in the DB is readable.
(VerifyChecksum is supposed to do that.) Also, with
max_open_files=-1 (default, extremely common), all SST files are
opened at DB::Open time anyway.
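
For illustration, a minimal sketch (not taken from the diff; the DB path is
illustrative) of how a caller interacts with the option, which now defaults to
true. The explicit opt-out mirrors what the updated unit tests do when they
deliberately corrupt table properties:

```cpp
#include <cassert>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  // On by default after this change: each SST file opened through the
  // table cache has its unique ID checked against the MANIFEST.
  assert(options.verify_sst_unique_id_in_manifest);

  // Tests/tools that deliberately rewrite table properties must opt out,
  // because that corrupts the "actual" unique ID:
  // options.verify_sst_unique_id_in_manifest = false;

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/db_example", &db);
  // A unique ID mismatch surfaces as Status::Corruption when the file is
  // opened (which, with max_open_files=-1, happens during DB::Open).
  if (s.ok()) {
    delete db;
  }
  return s.ok() ? 0 : 1;
}
```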

Implementation details:
* The `VerifySstUniqueIdInManifest()` functions implemented the extra/explicit
pass and are now removed; the check instead happens when a table file is
opened (see the sketch after this list).
* Unit tests that manipulate/corrupt table properties have to opt out of
this check, because that corrupts the "actual" unique id. (And even for
testing we don't currently have a mechanism to set "no unique id"
in the in-memory file metadata for new files.)
* A lot of other unit test churn relates to (a) checking now being on by
default, and (b) checking happening on SST open even without DB::Open
(e.g. on flush)
* Use `FileMetaData` for more `TableCache` operations (in place of
`FileDescriptor`) so that we have access to the unique_id whenever
we might need to open an SST file. **There is the possibility of
performance impact because we can no longer use the more
localized `fd` part of an `FdWithKeyRange` but instead follow the
`file_metadata` pointer. However, this change (possible regression)
is only done for `GetMemoryUsageByTableReaders`.**
* Removed a completely unnecessary constructor overload of
`TableReaderOptions`
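
For readers skimming the diff, a condensed, non-verbatim sketch of the new
verification path (the real code is in `TableCache::GetTableReader` and
`BlockBasedTable::Open` below; surrounding plumbing is omitted):

```cpp
// In TableCache::GetTableReader: pass the MANIFEST's unique_id through
// TableReaderOptions, or a null ID to skip verification.
UniqueId64x2 expected_unique_id = ioptions_.verify_sst_unique_id_in_manifest
                                      ? file_meta.unique_id
                                      : kNullUniqueId64x2;  // null == skip

// In BlockBasedTable::Open: recompute the ID from the file's own table
// properties and compare against what the MANIFEST recorded.
if (expected_unique_id != kNullUniqueId64x2) {
  UniqueId64x2 actual_unique_id{};
  Status s = GetSstInternalUniqueId(props->db_id, props->db_session_id,
                                    props->orig_file_number,
                                    &actual_unique_id, /*force*/ true);
  assert(s.ok());  // cannot fail with force=true
  if (expected_unique_id != actual_unique_id) {
    return Status::Corruption("Mismatch in unique ID on table file " +
                              std::to_string(cur_file_num));
  }
}
```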

Possible follow-up:
* Verification only happens when opening through table cache. Are there
more places where this should happen?
* Improve error message when there is a file size mismatch vs. manifest
(FIXME added in the appropriate place).
* I'm not sure there's a justification for `FileDescriptor` to be distinct from
`FileMetaData`.
* I'm skeptical that `FdWithKeyRange` really still makes sense for
optimizing some data locality by duplicating some data in memory, but I
could be wrong.
* An unnecessary overload of `NewTableReader` was recently added, in
the public API no less (though unusable there). It should be cleaned
up to put most things under `TableReaderOptions`.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10532

Test Plan:
updated unit tests
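
The updated tests count verifications via the new sync points rather than
relying on a separate DB::Open pass; a representative pattern (adapted from
db/db_test2.cc in the diff):

```cpp
std::atomic_int passed{0}, skipped{0};
SyncPoint::GetInstance()->SetCallBack(
    "BlockBasedTable::Open::PassedVerifyUniqueId",
    [&](void* /*arg*/) { passed++; });
SyncPoint::GetInstance()->SetCallBack(
    "BlockBasedTable::Open::SkippedVerifyUniqueId",
    [&](void* /*arg*/) { skipped++; });
SyncPoint::GetInstance()->EnableProcessing();

Reopen(options);  // every SST opened through the table cache is checked
EXPECT_GT(passed, 0);
EXPECT_EQ(skipped, 0);
```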

Performance test showing no significant difference (just noise I think):
`./db_bench -benchmarks=readwhilewriting[-X10] -num=3000000 -disable_wal=1 -bloom_bits=8 -write_buffer_size=1000000 -target_file_size_base=1000000`
Before: readwhilewriting [AVG 10 runs] : 68702 (± 6932) ops/sec
After: readwhilewriting [AVG 10 runs] : 68239 (± 7198) ops/sec

Reviewed By: jay-zhuang

Differential Revision: D38765551

Pulled By: pdillinger

fbshipit-source-id: a827a708155f12344ab2a5c16e7701c7636da4c2
Branch: main
Author: Peter Dillinger (committed by Facebook GitHub Bot)
Parent: d490bfcdb6
Commit: 6de7081cf3
26 changed files (lines changed in parentheses):

  1. HISTORY.md (1)
  2. db/compaction/compaction_job.cc (2)
  3. db/compaction/compaction_service_test.cc (22)
  4. db/db_block_cache_test.cc (4)
  5. db/db_impl/db_impl.h (5)
  6. db/db_impl/db_impl_open.cc (31)
  7. db/db_table_properties_test.cc (6)
  8. db/db_test2.cc (59)
  9. db/external_sst_file_basic_test.cc (34)
  10. db/repair.cc (4)
  11. db/repair_test.cc (3)
  12. db/table_cache.cc (105)
  13. db/table_cache.h (17)
  14. db/table_cache_sync_and_async.h (4)
  15. db/version_builder.cc (2)
  16. db/version_edit.h (3)
  17. db/version_set.cc (43)
  18. db/version_set.h (2)
  19. include/rocksdb/options.h (33)
  20. table/block_based/block_based_table_factory.cc (4)
  21. table/block_based/block_based_table_reader.cc (52)
  22. table/block_based/block_based_table_reader.h (3)
  23. table/format.cc (4)
  24. table/table_builder.h (30)
  25. table/table_test.cc (24)
  26. utilities/backup/backup_engine_test.cc (4)

@ -17,6 +17,7 @@
* Revert to using the default metadata charge policy when creating an LRU cache via the Java API. * Revert to using the default metadata charge policy when creating an LRU cache via the Java API.
### Behavior Change ### Behavior Change
* DBOptions::verify_sst_unique_id_in_manifest is now an on-by-default feature that verifies SST file identity whenever they are opened by a DB, rather than only at DB::Open time.
* Right now, when the option migration tool (OptionChangeMigration()) migrates to FIFO compaction, it compacts all the data into one single SST file and move to L0. This might create a problem for some users: the giant file may be soon deleted to satisfy max_table_files_size, and might cayse the DB to be almost empty. We change the behavior so that the files are cut to be smaller, but these files might not follow the data insertion order. With the change, after the migration, migrated data might not be dropped by insertion order by FIFO compaction. * Right now, when the option migration tool (OptionChangeMigration()) migrates to FIFO compaction, it compacts all the data into one single SST file and move to L0. This might create a problem for some users: the giant file may be soon deleted to satisfy max_table_files_size, and might cayse the DB to be almost empty. We change the behavior so that the files are cut to be smaller, but these files might not follow the data insertion order. With the change, after the migration, migrated data might not be dropped by insertion order by FIFO compaction.
* When a block is firstly found from `CompressedSecondaryCache`, we just insert a dummy block into the primary cache and don’t erase the block from `CompressedSecondaryCache`. A standalone handle is returned to the caller. Only if the block is found again from `CompressedSecondaryCache` before the dummy block is evicted, we erase the block from `CompressedSecondaryCache` and insert it into the primary cache. * When a block is firstly found from `CompressedSecondaryCache`, we just insert a dummy block into the primary cache and don’t erase the block from `CompressedSecondaryCache`. A standalone handle is returned to the caller. Only if the block is found again from `CompressedSecondaryCache` before the dummy block is evicted, we erase the block from `CompressedSecondaryCache` and insert it into the primary cache.
* When a block is firstly evicted from the primary cache to `CompressedSecondaryCache`, we just insert a dummy block in `CompressedSecondaryCache`. Only if it is evicted again before the dummy block is evicted from the cache, it is treated as a hot block and is inserted into `CompressedSecondaryCache`. * When a block is firstly evicted from the primary cache to `CompressedSecondaryCache`, we just insert a dummy block in `CompressedSecondaryCache`. Only if it is evicted again before the dummy block is evicted from the cache, it is treated as a hot block and is inserted into `CompressedSecondaryCache`.

@ -464,7 +464,7 @@ void CompactionJob::GenSubcompactionBoundaries() {
FileMetaData* f = flevel->files[i].file_metadata; FileMetaData* f = flevel->files[i].file_metadata;
std::vector<TableReader::Anchor> my_anchors; std::vector<TableReader::Anchor> my_anchors;
Status s = cfd->table_cache()->ApproximateKeyAnchors( Status s = cfd->table_cache()->ApproximateKeyAnchors(
ReadOptions(), icomp, f->fd, my_anchors); ReadOptions(), icomp, *f, my_anchors);
if (!s.ok() || my_anchors.empty()) { if (!s.ok() || my_anchors.empty()) {
my_anchors.emplace_back(f->largest.user_key(), f->fd.GetFileSize()); my_anchors.emplace_back(f->largest.user_key(), f->fd.GetFileSize());
} }

@ -289,16 +289,6 @@ TEST_F(CompactionServiceTest, BasicCompactions) {
auto s = static_cast<Status*>(status); auto s = static_cast<Status*>(status);
*s = Status::Aborted("MyTestCompactionService failed to compact!"); *s = Status::Aborted("MyTestCompactionService failed to compact!");
}); });
// tracking success unique id verification
std::atomic_int verify_passed{0};
SyncPoint::GetInstance()->SetCallBack(
"Version::VerifySstUniqueIds::Passed", [&](void* arg) {
// override job status
auto id = static_cast<UniqueId64x2*>(arg);
assert(*id != kNullUniqueId64x2);
verify_passed++;
});
SyncPoint::GetInstance()->EnableProcessing(); SyncPoint::GetInstance()->EnableProcessing();
Status s; Status s;
@ -324,9 +314,15 @@ TEST_F(CompactionServiceTest, BasicCompactions) {
} }
ASSERT_TRUE(s.IsAborted()); ASSERT_TRUE(s.IsAborted());
// Test verification // Test re-open and successful unique id verification
ASSERT_EQ(verify_passed, 0); std::atomic_int verify_passed{0};
options.verify_sst_unique_id_in_manifest = true; SyncPoint::GetInstance()->SetCallBack(
"BlockBasedTable::Open::PassedVerifyUniqueId", [&](void* arg) {
// override job status
auto id = static_cast<UniqueId64x2*>(arg);
assert(*id != kNullUniqueId64x2);
verify_passed++;
});
Reopen(options); Reopen(options);
ASSERT_GT(verify_passed, 0); ASSERT_GT(verify_passed, 0);
} }

@ -1570,6 +1570,10 @@ TEST_P(DBBlockCacheKeyTest, StableCacheKeys) {
options.statistics = ROCKSDB_NAMESPACE::CreateDBStatistics(); options.statistics = ROCKSDB_NAMESPACE::CreateDBStatistics();
options.env = test_env.get(); options.env = test_env.get();
// Corrupting the table properties corrupts the unique id.
// Ignore the unique id recorded in the manifest.
options.verify_sst_unique_id_in_manifest = false;
BlockBasedTableOptions table_options; BlockBasedTableOptions table_options;
int key_count = 0; int key_count = 0;

@ -2179,11 +2179,6 @@ class DBImpl : public DB {
IOStatus CreateWAL(uint64_t log_file_num, uint64_t recycle_log_number, IOStatus CreateWAL(uint64_t log_file_num, uint64_t recycle_log_number,
size_t preallocate_block_size, log::Writer** new_log); size_t preallocate_block_size, log::Writer** new_log);
// Verify SST file unique id between Manifest and table properties to make
// sure they're the same. Currently only used during DB open when
// `verify_sst_unique_id_in_manifest = true`.
Status VerifySstUniqueIdInManifest();
// Validate self-consistency of DB options // Validate self-consistency of DB options
static Status ValidateOptions(const DBOptions& db_options); static Status ValidateOptions(const DBOptions& db_options);
// Validate self-consistency of DB options and its consistency with cf options // Validate self-consistency of DB options and its consistency with cf options

@ -525,12 +525,6 @@ Status DBImpl::Recover(
if (!s.ok()) { if (!s.ok()) {
return s; return s;
} }
if (immutable_db_options_.verify_sst_unique_id_in_manifest) {
s = VerifySstUniqueIdInManifest();
if (!s.ok()) {
return s;
}
}
s = SetupDBId(read_only, recovery_ctx); s = SetupDBId(read_only, recovery_ctx);
ROCKS_LOG_INFO(immutable_db_options_.info_log, "DB ID: %s\n", db_id_.c_str()); ROCKS_LOG_INFO(immutable_db_options_.info_log, "DB ID: %s\n", db_id_.c_str());
if (s.ok() && !read_only) { if (s.ok() && !read_only) {
@ -708,31 +702,6 @@ Status DBImpl::Recover(
return s; return s;
} }
Status DBImpl::VerifySstUniqueIdInManifest() {
mutex_.AssertHeld();
ROCKS_LOG_INFO(
immutable_db_options_.info_log,
"Verifying SST unique id between MANIFEST and SST file table properties");
Status status;
for (auto cfd : *versions_->GetColumnFamilySet()) {
if (!cfd->IsDropped()) {
auto version = cfd->current();
version->Ref();
mutex_.Unlock();
status = version->VerifySstUniqueIds();
mutex_.Lock();
version->Unref();
if (!status.ok()) {
ROCKS_LOG_WARN(immutable_db_options_.info_log,
"SST unique id mismatch in column family \"%s\": %s",
cfd->GetName().c_str(), status.ToString().c_str());
return status;
}
}
}
return status;
}
Status DBImpl::PersistentStatsProcessFormatVersion() { Status DBImpl::PersistentStatsProcessFormatVersion() {
mutex_.AssertHeld(); mutex_.AssertHeld();
Status s; Status s;

@ -186,6 +186,12 @@ TEST_F(DBTablePropertiesTest, InvalidIgnored) {
}); });
SyncPoint::GetInstance()->EnableProcessing(); SyncPoint::GetInstance()->EnableProcessing();
// Corrupting the table properties corrupts the unique id.
// Ignore the unique id recorded in the manifest.
auto options = CurrentOptions();
options.verify_sst_unique_id_in_manifest = false;
Reopen(options);
// Build file // Build file
for (int i = 0; i < 10; ++i) { for (int i = 0; i < 10; ++i) {
ASSERT_OK(db_->Put(WriteOptions(), std::to_string(i), "val")); ASSERT_OK(db_->Put(WriteOptions(), std::to_string(i), "val"));

@ -7339,18 +7339,18 @@ TEST_F(DBTest2, SstUniqueIdVerifyBackwardCompatible) {
auto options = CurrentOptions(); auto options = CurrentOptions();
options.level0_file_num_compaction_trigger = kLevel0Trigger; options.level0_file_num_compaction_trigger = kLevel0Trigger;
options.statistics = CreateDBStatistics(); options.statistics = CreateDBStatistics();
// Skip for now
options.verify_sst_unique_id_in_manifest = false;
Reopen(options);
// Existing manifest doesn't have unique id
SyncPoint::GetInstance()->SetCallBack(
"VersionEdit::EncodeTo:UniqueId", [&](void* arg) {
auto unique_id = static_cast<UniqueId64x2*>(arg);
// remove id before writing it to manifest
(*unique_id)[0] = 0;
(*unique_id)[1] = 0;
});
std::atomic_int skipped = 0; std::atomic_int skipped = 0;
SyncPoint::GetInstance()->SetCallBack("Version::VerifySstUniqueIds::Skipped", std::atomic_int passed = 0;
[&](void* /*arg*/) { skipped++; }); SyncPoint::GetInstance()->SetCallBack(
"BlockBasedTable::Open::SkippedVerifyUniqueId",
[&](void* /*arg*/) { skipped++; });
SyncPoint::GetInstance()->SetCallBack(
"BlockBasedTable::Open::PassedVerifyUniqueId",
[&](void* /*arg*/) { passed++; });
SyncPoint::GetInstance()->EnableProcessing(); SyncPoint::GetInstance()->EnableProcessing();
// generate a few SSTs // generate a few SSTs
@ -7361,13 +7361,28 @@ TEST_F(DBTest2, SstUniqueIdVerifyBackwardCompatible) {
ASSERT_OK(Flush()); ASSERT_OK(Flush());
} }
// Reopen without verification // Verification has been skipped on files so far
Reopen(options); EXPECT_EQ(skipped, kNumSst);
EXPECT_EQ(passed, 0);
// Reopen with verification, but it's skipped because manifest doesn't have id // Reopen with verification
options.verify_sst_unique_id_in_manifest = true; options.verify_sst_unique_id_in_manifest = true;
skipped = 0;
passed = 0;
Reopen(options); Reopen(options);
ASSERT_EQ(skipped, kNumSst); EXPECT_EQ(skipped, 0);
EXPECT_EQ(passed, kNumSst);
// Now simulate no unique id in manifest for next file
// NOTE: this only works for loading manifest from disk,
// not in-memory manifest, so we need to re-open below.
SyncPoint::GetInstance()->SetCallBack(
"VersionEdit::EncodeTo:UniqueId", [&](void* arg) {
auto unique_id = static_cast<UniqueId64x2*>(arg);
// remove id before writing it to manifest
(*unique_id)[0] = 0;
(*unique_id)[1] = 0;
});
// test compaction generated Sst // test compaction generated Sst
for (int i = kNumSst; i < kLevel0Trigger; i++) { for (int i = kNumSst; i < kLevel0Trigger; i++) {
@ -7382,11 +7397,13 @@ TEST_F(DBTest2, SstUniqueIdVerifyBackwardCompatible) {
ASSERT_EQ("0,1", FilesPerLevel(0)); ASSERT_EQ("0,1", FilesPerLevel(0));
#endif // ROCKSDB_LITE #endif // ROCKSDB_LITE
// Reopen with verification should fail // Reopen (with verification)
options.verify_sst_unique_id_in_manifest = true; ASSERT_TRUE(options.verify_sst_unique_id_in_manifest);
skipped = 0; skipped = 0;
passed = 0;
Reopen(options); Reopen(options);
ASSERT_EQ(skipped, 1); EXPECT_EQ(skipped, 1);
EXPECT_EQ(passed, 0);
} }
TEST_F(DBTest2, SstUniqueIdVerify) { TEST_F(DBTest2, SstUniqueIdVerify) {
@ -7394,11 +7411,15 @@ TEST_F(DBTest2, SstUniqueIdVerify) {
const int kLevel0Trigger = 4; const int kLevel0Trigger = 4;
auto options = CurrentOptions(); auto options = CurrentOptions();
options.level0_file_num_compaction_trigger = kLevel0Trigger; options.level0_file_num_compaction_trigger = kLevel0Trigger;
// Allow mismatch for now
options.verify_sst_unique_id_in_manifest = false;
Reopen(options);
SyncPoint::GetInstance()->SetCallBack( SyncPoint::GetInstance()->SetCallBack(
"PropertyBlockBuilder::AddTableProperty:Start", [&](void* props_vs) { "PropertyBlockBuilder::AddTableProperty:Start", [&](void* props_vs) {
auto props = static_cast<TableProperties*>(props_vs); auto props = static_cast<TableProperties*>(props_vs);
// update table property session_id to a different one // update table property session_id to a different one, which
// changes unique ID
props->db_session_id = DBImpl::GenerateDbSessionId(nullptr); props->db_session_id = DBImpl::GenerateDbSessionId(nullptr);
}); });
SyncPoint::GetInstance()->EnableProcessing(); SyncPoint::GetInstance()->EnableProcessing();
@ -7444,6 +7465,8 @@ TEST_F(DBTest2, SstUniqueIdVerifyMultiCFs) {
const int kLevel0Trigger = 4; const int kLevel0Trigger = 4;
auto options = CurrentOptions(); auto options = CurrentOptions();
options.level0_file_num_compaction_trigger = kLevel0Trigger; options.level0_file_num_compaction_trigger = kLevel0Trigger;
// Allow mismatch for now
options.verify_sst_unique_id_in_manifest = false;
CreateAndReopenWithCF({"one", "two"}, options); CreateAndReopenWithCF({"one", "two"}, options);

@ -1874,9 +1874,6 @@ TEST_F(ExternalSSTFileBasicTest, VerifySstUniqueId) {
ASSERT_OK(db_->IngestExternalFile(db_->DefaultColumnFamily(), {external_file}, ASSERT_OK(db_->IngestExternalFile(db_->DefaultColumnFamily(), {external_file},
IngestExternalFileOptions())); IngestExternalFileOptions()));
auto options = CurrentOptions();
options.verify_sst_unique_id_in_manifest = true;
Reopen(options);
// Test ingest file without session_id and db_id (for example generated by an // Test ingest file without session_id and db_id (for example generated by an
// older version of sst_writer) // older version of sst_writer)
@ -1887,12 +1884,21 @@ TEST_F(ExternalSSTFileBasicTest, VerifySstUniqueId) {
props->db_session_id = ""; props->db_session_id = "";
props->db_id = ""; props->db_id = "";
}); });
std::atomic_int skipped = 0; std::atomic_int skipped = 0, passed = 0;
SyncPoint::GetInstance()->SetCallBack("Version::VerifySstUniqueIds::Skipped", SyncPoint::GetInstance()->SetCallBack(
[&](void* /*arg*/) { skipped++; }); "BlockBasedTable::Open::SkippedVerifyUniqueId",
SyncPoint::GetInstance()->EnableProcessing(); [&](void* /*arg*/) { skipped++; });
SyncPoint::GetInstance()->SetCallBack(
"BlockBasedTable::Open::PassedVerifyUniqueId",
[&](void* /*arg*/) { passed++; });
SyncPoint::GetInstance()->EnableProcessing(); SyncPoint::GetInstance()->EnableProcessing();
auto options = CurrentOptions();
ASSERT_TRUE(options.verify_sst_unique_id_in_manifest);
Reopen(options);
ASSERT_EQ(skipped, 0);
ASSERT_EQ(passed, 2); // one flushed + one ingested
external_file = sst_files_dir_ + "/file_to_ingest2.sst"; external_file = sst_files_dir_ + "/file_to_ingest2.sst";
{ {
SstFileWriter sst_file_writer{EnvOptions(), CurrentOptions()}; SstFileWriter sst_file_writer{EnvOptions(), CurrentOptions()};
@ -1905,12 +1911,18 @@ TEST_F(ExternalSSTFileBasicTest, VerifySstUniqueId) {
ASSERT_OK(db_->IngestExternalFile(db_->DefaultColumnFamily(), {external_file}, ASSERT_OK(db_->IngestExternalFile(db_->DefaultColumnFamily(), {external_file},
IngestExternalFileOptions())); IngestExternalFileOptions()));
options.statistics = CreateDBStatistics(); // Two table file opens skipping verification:
options.verify_sst_unique_id_in_manifest = true; // * ExternalSstFileIngestionJob::GetIngestedFileInfo
ASSERT_EQ(skipped, 0); // * TableCache::GetTableReader
ASSERT_EQ(skipped, 2);
ASSERT_EQ(passed, 2);
// Check same after re-open (except no GetIngestedFileInfo)
skipped = 0;
passed = 0;
Reopen(options); Reopen(options);
// only one sst file is not verified because of missing unique_id
ASSERT_EQ(skipped, 1); ASSERT_EQ(skipped, 1);
ASSERT_EQ(passed, 2);
} }
TEST_F(ExternalSSTFileBasicTest, StableSnapshotWhileLoggingToManifest) { TEST_F(ExternalSSTFileBasicTest, StableSnapshotWhileLoggingToManifest) {

@ -508,8 +508,8 @@ class Repairer {
file_size); file_size);
std::shared_ptr<const TableProperties> props; std::shared_ptr<const TableProperties> props;
if (status.ok()) { if (status.ok()) {
status = table_cache_->GetTableProperties(file_options_, icmp_, status = table_cache_->GetTableProperties(file_options_, icmp_, t->meta,
t->meta.fd, &props); &props);
} }
if (status.ok()) { if (status.ok()) {
auto s = auto s =

@ -48,7 +48,7 @@ class RepairTest : public DBTestBase {
void ReopenWithSstIdVerify() { void ReopenWithSstIdVerify() {
std::atomic_int verify_passed{0}; std::atomic_int verify_passed{0};
SyncPoint::GetInstance()->SetCallBack( SyncPoint::GetInstance()->SetCallBack(
"Version::VerifySstUniqueIds::Passed", [&](void* arg) { "BlockBasedTable::Open::PassedVerifyUniqueId", [&](void* arg) {
// override job status // override job status
auto id = static_cast<UniqueId64x2*>(arg); auto id = static_cast<UniqueId64x2*>(arg);
assert(*id != kNullUniqueId64x2); assert(*id != kNullUniqueId64x2);
@ -60,6 +60,7 @@ class RepairTest : public DBTestBase {
Reopen(options); Reopen(options);
ASSERT_GT(verify_passed, 0); ASSERT_GT(verify_passed, 0);
SyncPoint::GetInstance()->DisableProcessing();
} }
}; };

@ -116,14 +116,14 @@ void TableCache::ReleaseHandle(Cache::Handle* handle) {
Status TableCache::GetTableReader( Status TableCache::GetTableReader(
const ReadOptions& ro, const FileOptions& file_options, const ReadOptions& ro, const FileOptions& file_options,
const InternalKeyComparator& internal_comparator, const FileDescriptor& fd, const InternalKeyComparator& internal_comparator,
bool sequential_mode, bool record_read_stats, HistogramImpl* file_read_hist, const FileMetaData& file_meta, bool sequential_mode, bool record_read_stats,
std::unique_ptr<TableReader>* table_reader, HistogramImpl* file_read_hist, std::unique_ptr<TableReader>* table_reader,
const std::shared_ptr<const SliceTransform>& prefix_extractor, const std::shared_ptr<const SliceTransform>& prefix_extractor,
bool skip_filters, int level, bool prefetch_index_and_filter_in_cache, bool skip_filters, int level, bool prefetch_index_and_filter_in_cache,
size_t max_file_size_for_l0_meta_pin, Temperature file_temperature) { size_t max_file_size_for_l0_meta_pin, Temperature file_temperature) {
std::string fname = std::string fname = TableFileName(
TableFileName(ioptions_.cf_paths, fd.GetNumber(), fd.GetPathId()); ioptions_.cf_paths, file_meta.fd.GetNumber(), file_meta.fd.GetPathId());
std::unique_ptr<FSRandomAccessFile> file; std::unique_ptr<FSRandomAccessFile> file;
FileOptions fopts = file_options; FileOptions fopts = file_options;
fopts.temperature = file_temperature; fopts.temperature = file_temperature;
@ -156,14 +156,21 @@ Status TableCache::GetTableReader(
record_read_stats ? ioptions_.stats : nullptr, SST_READ_MICROS, record_read_stats ? ioptions_.stats : nullptr, SST_READ_MICROS,
file_read_hist, ioptions_.rate_limiter.get(), ioptions_.listeners, file_read_hist, ioptions_.rate_limiter.get(), ioptions_.listeners,
file_temperature, level == ioptions_.num_levels - 1)); file_temperature, level == ioptions_.num_levels - 1));
UniqueId64x2 expected_unique_id;
if (ioptions_.verify_sst_unique_id_in_manifest) {
expected_unique_id = file_meta.unique_id;
} else {
expected_unique_id = kNullUniqueId64x2; // null ID == no verification
}
s = ioptions_.table_factory->NewTableReader( s = ioptions_.table_factory->NewTableReader(
ro, ro,
TableReaderOptions( TableReaderOptions(ioptions_, prefix_extractor, file_options,
ioptions_, prefix_extractor, file_options, internal_comparator, internal_comparator, skip_filters, immortal_tables_,
skip_filters, immortal_tables_, false /* force_direct_prefetch */, false /* force_direct_prefetch */, level,
level, fd.largest_seqno, block_cache_tracer_, block_cache_tracer_, max_file_size_for_l0_meta_pin,
max_file_size_for_l0_meta_pin, db_session_id_, fd.GetNumber()), db_session_id_, file_meta.fd.GetNumber(),
std::move(file_reader), fd.GetFileSize(), table_reader, expected_unique_id, file_meta.fd.largest_seqno),
std::move(file_reader), file_meta.fd.GetFileSize(), table_reader,
prefetch_index_and_filter_in_cache); prefetch_index_and_filter_in_cache);
TEST_SYNC_POINT("TableCache::GetTableReader:0"); TEST_SYNC_POINT("TableCache::GetTableReader:0");
} }
@ -179,14 +186,14 @@ void TableCache::EraseHandle(const FileDescriptor& fd, Cache::Handle* handle) {
Status TableCache::FindTable( Status TableCache::FindTable(
const ReadOptions& ro, const FileOptions& file_options, const ReadOptions& ro, const FileOptions& file_options,
const InternalKeyComparator& internal_comparator, const FileDescriptor& fd, const InternalKeyComparator& internal_comparator,
Cache::Handle** handle, const FileMetaData& file_meta, Cache::Handle** handle,
const std::shared_ptr<const SliceTransform>& prefix_extractor, const std::shared_ptr<const SliceTransform>& prefix_extractor,
const bool no_io, bool record_read_stats, HistogramImpl* file_read_hist, const bool no_io, bool record_read_stats, HistogramImpl* file_read_hist,
bool skip_filters, int level, bool prefetch_index_and_filter_in_cache, bool skip_filters, int level, bool prefetch_index_and_filter_in_cache,
size_t max_file_size_for_l0_meta_pin, Temperature file_temperature) { size_t max_file_size_for_l0_meta_pin, Temperature file_temperature) {
PERF_TIMER_GUARD_WITH_CLOCK(find_table_nanos, ioptions_.clock); PERF_TIMER_GUARD_WITH_CLOCK(find_table_nanos, ioptions_.clock);
uint64_t number = fd.GetNumber(); uint64_t number = file_meta.fd.GetNumber();
Slice key = GetSliceForFileNumber(&number); Slice key = GetSliceForFileNumber(&number);
*handle = cache_->Lookup(key); *handle = cache_->Lookup(key);
TEST_SYNC_POINT_CALLBACK("TableCache::FindTable:0", TEST_SYNC_POINT_CALLBACK("TableCache::FindTable:0",
@ -204,11 +211,12 @@ Status TableCache::FindTable(
} }
std::unique_ptr<TableReader> table_reader; std::unique_ptr<TableReader> table_reader;
Status s = GetTableReader( Status s =
ro, file_options, internal_comparator, fd, false /* sequential mode */, GetTableReader(ro, file_options, internal_comparator, file_meta,
record_read_stats, file_read_hist, &table_reader, prefix_extractor, false /* sequential mode */, record_read_stats,
skip_filters, level, prefetch_index_and_filter_in_cache, file_read_hist, &table_reader, prefix_extractor,
max_file_size_for_l0_meta_pin, file_temperature); skip_filters, level, prefetch_index_and_filter_in_cache,
max_file_size_for_l0_meta_pin, file_temperature);
if (!s.ok()) { if (!s.ok()) {
assert(table_reader == nullptr); assert(table_reader == nullptr);
RecordTick(ioptions_.stats, NO_FILE_ERRORS); RecordTick(ioptions_.stats, NO_FILE_ERRORS);
@ -251,8 +259,8 @@ InternalIterator* TableCache::NewIterator(
table_reader = fd.table_reader; table_reader = fd.table_reader;
if (table_reader == nullptr) { if (table_reader == nullptr) {
s = FindTable( s = FindTable(
options, file_options, icomparator, fd, &handle, prefix_extractor, options, file_options, icomparator, file_meta, &handle,
options.read_tier == kBlockCacheTier /* no_io */, prefix_extractor, options.read_tier == kBlockCacheTier /* no_io */,
!for_compaction /* record_read_stats */, file_read_hist, skip_filters, !for_compaction /* record_read_stats */, file_read_hist, skip_filters,
level, true /* prefetch_index_and_filter_in_cache */, level, true /* prefetch_index_and_filter_in_cache */,
max_file_size_for_l0_meta_pin, file_meta.temperature); max_file_size_for_l0_meta_pin, file_meta.temperature);
@ -341,7 +349,8 @@ Status TableCache::GetRangeTombstoneIterator(
TableReader* t = fd.table_reader; TableReader* t = fd.table_reader;
Cache::Handle* handle = nullptr; Cache::Handle* handle = nullptr;
if (t == nullptr) { if (t == nullptr) {
s = FindTable(options, file_options_, internal_comparator, fd, &handle); s = FindTable(options, file_options_, internal_comparator, file_meta,
&handle);
if (s.ok()) { if (s.ok()) {
t = GetTableReaderFromHandle(handle); t = GetTableReaderFromHandle(handle);
} }
@ -464,8 +473,8 @@ Status TableCache::Get(
if (!done) { if (!done) {
assert(s.ok()); assert(s.ok());
if (t == nullptr) { if (t == nullptr) {
s = FindTable(options, file_options_, internal_comparator, fd, &handle, s = FindTable(options, file_options_, internal_comparator, file_meta,
prefix_extractor, &handle, prefix_extractor,
options.read_tier == kBlockCacheTier /* no_io */, options.read_tier == kBlockCacheTier /* no_io */,
true /* record_read_stats */, file_read_hist, skip_filters, true /* record_read_stats */, file_read_hist, skip_filters,
level, true /* prefetch_index_and_filter_in_cache */, level, true /* prefetch_index_and_filter_in_cache */,
@ -560,7 +569,7 @@ Status TableCache::MultiGetFilter(
mget_range->end()); mget_range->end());
if (t == nullptr) { if (t == nullptr) {
s = FindTable( s = FindTable(
options, file_options_, internal_comparator, fd, &handle, options, file_options_, internal_comparator, file_meta, &handle,
prefix_extractor, options.read_tier == kBlockCacheTier /* no_io */, prefix_extractor, options.read_tier == kBlockCacheTier /* no_io */,
true /* record_read_stats */, file_read_hist, /*skip_filters=*/false, true /* record_read_stats */, file_read_hist, /*skip_filters=*/false,
level, true /* prefetch_index_and_filter_in_cache */, level, true /* prefetch_index_and_filter_in_cache */,
@ -589,10 +598,11 @@ Status TableCache::MultiGetFilter(
Status TableCache::GetTableProperties( Status TableCache::GetTableProperties(
const FileOptions& file_options, const FileOptions& file_options,
const InternalKeyComparator& internal_comparator, const FileDescriptor& fd, const InternalKeyComparator& internal_comparator,
const FileMetaData& file_meta,
std::shared_ptr<const TableProperties>* properties, std::shared_ptr<const TableProperties>* properties,
const std::shared_ptr<const SliceTransform>& prefix_extractor, bool no_io) { const std::shared_ptr<const SliceTransform>& prefix_extractor, bool no_io) {
auto table_reader = fd.table_reader; auto table_reader = file_meta.fd.table_reader;
// table already been pre-loaded? // table already been pre-loaded?
if (table_reader) { if (table_reader) {
*properties = table_reader->GetTableProperties(); *properties = table_reader->GetTableProperties();
@ -601,8 +611,8 @@ Status TableCache::GetTableProperties(
} }
Cache::Handle* table_handle = nullptr; Cache::Handle* table_handle = nullptr;
Status s = FindTable(ReadOptions(), file_options, internal_comparator, fd, Status s = FindTable(ReadOptions(), file_options, internal_comparator,
&table_handle, prefix_extractor, no_io); file_meta, &table_handle, prefix_extractor, no_io);
if (!s.ok()) { if (!s.ok()) {
return s; return s;
} }
@ -615,12 +625,12 @@ Status TableCache::GetTableProperties(
Status TableCache::ApproximateKeyAnchors( Status TableCache::ApproximateKeyAnchors(
const ReadOptions& ro, const InternalKeyComparator& internal_comparator, const ReadOptions& ro, const InternalKeyComparator& internal_comparator,
const FileDescriptor& fd, std::vector<TableReader::Anchor>& anchors) { const FileMetaData& file_meta, std::vector<TableReader::Anchor>& anchors) {
Status s; Status s;
TableReader* t = fd.table_reader; TableReader* t = file_meta.fd.table_reader;
Cache::Handle* handle = nullptr; Cache::Handle* handle = nullptr;
if (t == nullptr) { if (t == nullptr) {
s = FindTable(ro, file_options_, internal_comparator, fd, &handle); s = FindTable(ro, file_options_, internal_comparator, file_meta, &handle);
if (s.ok()) { if (s.ok()) {
t = GetTableReaderFromHandle(handle); t = GetTableReaderFromHandle(handle);
} }
@ -636,17 +646,18 @@ Status TableCache::ApproximateKeyAnchors(
size_t TableCache::GetMemoryUsageByTableReader( size_t TableCache::GetMemoryUsageByTableReader(
const FileOptions& file_options, const FileOptions& file_options,
const InternalKeyComparator& internal_comparator, const FileDescriptor& fd, const InternalKeyComparator& internal_comparator,
const FileMetaData& file_meta,
const std::shared_ptr<const SliceTransform>& prefix_extractor) { const std::shared_ptr<const SliceTransform>& prefix_extractor) {
auto table_reader = fd.table_reader; auto table_reader = file_meta.fd.table_reader;
// table already been pre-loaded? // table already been pre-loaded?
if (table_reader) { if (table_reader) {
return table_reader->ApproximateMemoryUsage(); return table_reader->ApproximateMemoryUsage();
} }
Cache::Handle* table_handle = nullptr; Cache::Handle* table_handle = nullptr;
Status s = FindTable(ReadOptions(), file_options, internal_comparator, fd, Status s = FindTable(ReadOptions(), file_options, internal_comparator,
&table_handle, prefix_extractor, true); file_meta, &table_handle, prefix_extractor, true);
if (!s.ok()) { if (!s.ok()) {
return 0; return 0;
} }
@ -672,17 +683,18 @@ void TableCache::Evict(Cache* cache, uint64_t file_number) {
} }
uint64_t TableCache::ApproximateOffsetOf( uint64_t TableCache::ApproximateOffsetOf(
const Slice& key, const FileDescriptor& fd, TableReaderCaller caller, const Slice& key, const FileMetaData& file_meta, TableReaderCaller caller,
const InternalKeyComparator& internal_comparator, const InternalKeyComparator& internal_comparator,
const std::shared_ptr<const SliceTransform>& prefix_extractor) { const std::shared_ptr<const SliceTransform>& prefix_extractor) {
uint64_t result = 0; uint64_t result = 0;
TableReader* table_reader = fd.table_reader; TableReader* table_reader = file_meta.fd.table_reader;
Cache::Handle* table_handle = nullptr; Cache::Handle* table_handle = nullptr;
if (table_reader == nullptr) { if (table_reader == nullptr) {
const bool for_compaction = (caller == TableReaderCaller::kCompaction); const bool for_compaction = (caller == TableReaderCaller::kCompaction);
Status s = FindTable(ReadOptions(), file_options_, internal_comparator, fd, Status s =
&table_handle, prefix_extractor, false /* no_io */, FindTable(ReadOptions(), file_options_, internal_comparator, file_meta,
!for_compaction /* record_read_stats */); &table_handle, prefix_extractor, false /* no_io */,
!for_compaction /* record_read_stats */);
if (s.ok()) { if (s.ok()) {
table_reader = GetTableReaderFromHandle(table_handle); table_reader = GetTableReaderFromHandle(table_handle);
} }
@ -699,17 +711,18 @@ uint64_t TableCache::ApproximateOffsetOf(
} }
uint64_t TableCache::ApproximateSize( uint64_t TableCache::ApproximateSize(
const Slice& start, const Slice& end, const FileDescriptor& fd, const Slice& start, const Slice& end, const FileMetaData& file_meta,
TableReaderCaller caller, const InternalKeyComparator& internal_comparator, TableReaderCaller caller, const InternalKeyComparator& internal_comparator,
const std::shared_ptr<const SliceTransform>& prefix_extractor) { const std::shared_ptr<const SliceTransform>& prefix_extractor) {
uint64_t result = 0; uint64_t result = 0;
TableReader* table_reader = fd.table_reader; TableReader* table_reader = file_meta.fd.table_reader;
Cache::Handle* table_handle = nullptr; Cache::Handle* table_handle = nullptr;
if (table_reader == nullptr) { if (table_reader == nullptr) {
const bool for_compaction = (caller == TableReaderCaller::kCompaction); const bool for_compaction = (caller == TableReaderCaller::kCompaction);
Status s = FindTable(ReadOptions(), file_options_, internal_comparator, fd, Status s =
&table_handle, prefix_extractor, false /* no_io */, FindTable(ReadOptions(), file_options_, internal_comparator, file_meta,
!for_compaction /* record_read_stats */); &table_handle, prefix_extractor, false /* no_io */,
!for_compaction /* record_read_stats */);
if (s.ok()) { if (s.ok()) {
table_reader = GetTableReaderFromHandle(table_handle); table_reader = GetTableReaderFromHandle(table_handle);
} }

@ -160,7 +160,7 @@ class TableCache {
Status FindTable( Status FindTable(
const ReadOptions& ro, const FileOptions& toptions, const ReadOptions& ro, const FileOptions& toptions,
const InternalKeyComparator& internal_comparator, const InternalKeyComparator& internal_comparator,
const FileDescriptor& file_fd, Cache::Handle**, const FileMetaData& file_meta, Cache::Handle**,
const std::shared_ptr<const SliceTransform>& prefix_extractor = nullptr, const std::shared_ptr<const SliceTransform>& prefix_extractor = nullptr,
const bool no_io = false, bool record_read_stats = true, const bool no_io = false, bool record_read_stats = true,
HistogramImpl* file_read_hist = nullptr, bool skip_filters = false, HistogramImpl* file_read_hist = nullptr, bool skip_filters = false,
@ -180,14 +180,14 @@ class TableCache {
Status GetTableProperties( Status GetTableProperties(
const FileOptions& toptions, const FileOptions& toptions,
const InternalKeyComparator& internal_comparator, const InternalKeyComparator& internal_comparator,
const FileDescriptor& file_meta, const FileMetaData& file_meta,
std::shared_ptr<const TableProperties>* properties, std::shared_ptr<const TableProperties>* properties,
const std::shared_ptr<const SliceTransform>& prefix_extractor = nullptr, const std::shared_ptr<const SliceTransform>& prefix_extractor = nullptr,
bool no_io = false); bool no_io = false);
Status ApproximateKeyAnchors(const ReadOptions& ro, Status ApproximateKeyAnchors(const ReadOptions& ro,
const InternalKeyComparator& internal_comparator, const InternalKeyComparator& internal_comparator,
const FileDescriptor& file_meta, const FileMetaData& file_meta,
std::vector<TableReader::Anchor>& anchors); std::vector<TableReader::Anchor>& anchors);
// Return total memory usage of the table reader of the file. // Return total memory usage of the table reader of the file.
@ -195,19 +195,19 @@ class TableCache {
size_t GetMemoryUsageByTableReader( size_t GetMemoryUsageByTableReader(
const FileOptions& toptions, const FileOptions& toptions,
const InternalKeyComparator& internal_comparator, const InternalKeyComparator& internal_comparator,
const FileDescriptor& fd, const FileMetaData& file_meta,
const std::shared_ptr<const SliceTransform>& prefix_extractor = nullptr); const std::shared_ptr<const SliceTransform>& prefix_extractor = nullptr);
// Returns approximated offset of a key in a file represented by fd. // Returns approximated offset of a key in a file represented by fd.
uint64_t ApproximateOffsetOf( uint64_t ApproximateOffsetOf(
const Slice& key, const FileDescriptor& fd, TableReaderCaller caller, const Slice& key, const FileMetaData& file_meta, TableReaderCaller caller,
const InternalKeyComparator& internal_comparator, const InternalKeyComparator& internal_comparator,
const std::shared_ptr<const SliceTransform>& prefix_extractor = nullptr); const std::shared_ptr<const SliceTransform>& prefix_extractor = nullptr);
// Returns approximated data size between start and end keys in a file // Returns approximated data size between start and end keys in a file
// represented by fd (the start key must not be greater than the end key). // represented by fd (the start key must not be greater than the end key).
uint64_t ApproximateSize( uint64_t ApproximateSize(
const Slice& start, const Slice& end, const FileDescriptor& fd, const Slice& start, const Slice& end, const FileMetaData& file_meta,
TableReaderCaller caller, TableReaderCaller caller,
const InternalKeyComparator& internal_comparator, const InternalKeyComparator& internal_comparator,
const std::shared_ptr<const SliceTransform>& prefix_extractor = nullptr); const std::shared_ptr<const SliceTransform>& prefix_extractor = nullptr);
@ -234,8 +234,9 @@ class TableCache {
Status GetTableReader( Status GetTableReader(
const ReadOptions& ro, const FileOptions& file_options, const ReadOptions& ro, const FileOptions& file_options,
const InternalKeyComparator& internal_comparator, const InternalKeyComparator& internal_comparator,
const FileDescriptor& fd, bool sequential_mode, bool record_read_stats, const FileMetaData& file_meta, bool sequential_mode,
HistogramImpl* file_read_hist, std::unique_ptr<TableReader>* table_reader, bool record_read_stats, HistogramImpl* file_read_hist,
std::unique_ptr<TableReader>* table_reader,
const std::shared_ptr<const SliceTransform>& prefix_extractor = nullptr, const std::shared_ptr<const SliceTransform>& prefix_extractor = nullptr,
bool skip_filters = false, int level = -1, bool skip_filters = false, int level = -1,
bool prefetch_index_and_filter_in_cache = true, bool prefetch_index_and_filter_in_cache = true,

@ -67,8 +67,8 @@ DEFINE_SYNC_AND_ASYNC(Status, TableCache::MultiGet)
if (s.ok() && !table_range.empty()) { if (s.ok() && !table_range.empty()) {
if (t == nullptr) { if (t == nullptr) {
assert(handle == nullptr); assert(handle == nullptr);
s = FindTable(options, file_options_, internal_comparator, fd, &handle, s = FindTable(options, file_options_, internal_comparator, file_meta,
prefix_extractor, &handle, prefix_extractor,
options.read_tier == kBlockCacheTier /* no_io */, options.read_tier == kBlockCacheTier /* no_io */,
true /* record_read_stats */, file_read_hist, skip_filters, true /* record_read_stats */, file_read_hist, skip_filters,
level, true /* prefetch_index_and_filter_in_cache */, level, true /* prefetch_index_and_filter_in_cache */,

@ -1274,7 +1274,7 @@ class VersionBuilder::Rep {
int level = files_meta[file_idx].second; int level = files_meta[file_idx].second;
statuses[file_idx] = table_cache_->FindTable( statuses[file_idx] = table_cache_->FindTable(
ReadOptions(), file_options_, ReadOptions(), file_options_,
*(base_vstorage_->InternalComparator()), file_meta->fd, *(base_vstorage_->InternalComparator()), *file_meta,
&file_meta->table_reader_handle, prefix_extractor, false /*no_io */, &file_meta->table_reader_handle, prefix_extractor, false /*no_io */,
true /* record_read_stats */, true /* record_read_stats */,
internal_stats->GetFileReadHist(level), false, level, internal_stats->GetFileReadHist(level), false, level,

@ -23,6 +23,7 @@
#include "rocksdb/advanced_options.h" #include "rocksdb/advanced_options.h"
#include "rocksdb/cache.h" #include "rocksdb/cache.h"
#include "table/table_reader.h" #include "table/table_reader.h"
#include "table/unique_id_impl.h"
#include "util/autovector.h" #include "util/autovector.h"
namespace ROCKSDB_NAMESPACE { namespace ROCKSDB_NAMESPACE {
@ -104,8 +105,6 @@ constexpr uint64_t kUnknownFileCreationTime = 0;
extern uint64_t PackFileNumberAndPathId(uint64_t number, uint64_t path_id); extern uint64_t PackFileNumberAndPathId(uint64_t number, uint64_t path_id);
using UniqueId64x2 = std::array<uint64_t, 2>;
// A copyable structure contains information needed to read data from an SST // A copyable structure contains information needed to read data from an SST
// file. It can contain a pointer to a table reader opened for the file, or // file. It can contain a pointer to a table reader opened for the file, or
// file number and size, which can be used to create a new table reader for it. // file number and size, which can be used to create a new table reader for it.

@ -1543,7 +1543,7 @@ Status Version::GetTableProperties(std::shared_ptr<const TableProperties>* tp,
auto table_cache = cfd_->table_cache(); auto table_cache = cfd_->table_cache();
auto ioptions = cfd_->ioptions(); auto ioptions = cfd_->ioptions();
Status s = table_cache->GetTableProperties( Status s = table_cache->GetTableProperties(
file_options_, cfd_->internal_comparator(), file_meta->fd, tp, file_options_, cfd_->internal_comparator(), *file_meta, tp,
mutable_cf_options_.prefix_extractor, true /* no io */); mutable_cf_options_.prefix_extractor, true /* no io */);
if (s.ok()) { if (s.ok()) {
return s; return s;
@ -1736,7 +1736,8 @@ size_t Version::GetMemoryUsageByTableReaders() {
for (auto& file_level : storage_info_.level_files_brief_) { for (auto& file_level : storage_info_.level_files_brief_) {
for (size_t i = 0; i < file_level.num_files; i++) { for (size_t i = 0; i < file_level.num_files; i++) {
total_usage += cfd_->table_cache()->GetMemoryUsageByTableReader( total_usage += cfd_->table_cache()->GetMemoryUsageByTableReader(
file_options_, cfd_->internal_comparator(), file_level.files[i].fd, file_options_, cfd_->internal_comparator(),
*file_level.files[i].file_metadata,
mutable_cf_options_.prefix_extractor); mutable_cf_options_.prefix_extractor);
} }
} }
@ -1833,40 +1834,6 @@ void Version::GetCreationTimeOfOldestFile(uint64_t* creation_time) {
*creation_time = oldest_time; *creation_time = oldest_time;
} }
Status Version::VerifySstUniqueIds() const {
for (int level = 0; level < storage_info_.num_non_empty_levels_; level++) {
for (FileMetaData* meta : storage_info_.LevelFiles(level)) {
if (meta->unique_id != kNullUniqueId64x2) {
std::shared_ptr<const TableProperties> props;
Status s =
GetTableProperties(&props, meta); // may open the file if it's not
if (!s.ok()) {
return s;
}
UniqueId64x2 id;
s = GetSstInternalUniqueId(props->db_id, props->db_session_id,
props->orig_file_number, &id);
if (!s.ok() || id != meta->unique_id) {
std::ostringstream oss;
oss << "SST #" << meta->fd.GetNumber() << " unique ID mismatch. ";
oss << "Manifest: "
<< InternalUniqueIdToHumanString(&(meta->unique_id)) << ", ";
if (s.ok()) {
oss << "Table Properties: " << InternalUniqueIdToHumanString(&id);
} else {
oss << "Failed to get Table Properties: " << s.ToString();
}
return Status::Corruption("VersionSet", oss.str());
}
TEST_SYNC_POINT_CALLBACK("Version::VerifySstUniqueIds::Passed", &id);
} else {
TEST_SYNC_POINT_CALLBACK("Version::VerifySstUniqueIds::Skipped", meta);
}
}
}
return Status::OK();
}
InternalIterator* Version::TEST_GetLevelIterator( InternalIterator* Version::TEST_GetLevelIterator(
const ReadOptions& read_options, MergeIteratorBuilder* merge_iter_builder, const ReadOptions& read_options, MergeIteratorBuilder* merge_iter_builder,
int level, bool allow_unprepared_value) { int level, bool allow_unprepared_value) {
@ -6300,7 +6267,7 @@ uint64_t VersionSet::ApproximateOffsetOf(Version* v, const FdWithKeyRange& f,
TableCache* table_cache = v->cfd_->table_cache(); TableCache* table_cache = v->cfd_->table_cache();
if (table_cache != nullptr) { if (table_cache != nullptr) {
result = table_cache->ApproximateOffsetOf( result = table_cache->ApproximateOffsetOf(
key, f.file_metadata->fd, caller, icmp, key, *f.file_metadata, caller, icmp,
v->GetMutableCFOptions().prefix_extractor); v->GetMutableCFOptions().prefix_extractor);
} }
} }
@ -6340,7 +6307,7 @@ uint64_t VersionSet::ApproximateSize(Version* v, const FdWithKeyRange& f,
return 0; return 0;
} }
return table_cache->ApproximateSize( return table_cache->ApproximateSize(
start, end, f.file_metadata->fd, caller, icmp, start, end, *f.file_metadata, caller, icmp,
v->GetMutableCFOptions().prefix_extractor); v->GetMutableCFOptions().prefix_extractor);
} }

@ -959,8 +959,6 @@ class Version {
const MutableCFOptions& GetMutableCFOptions() { return mutable_cf_options_; } const MutableCFOptions& GetMutableCFOptions() { return mutable_cf_options_; }
Status VerifySstUniqueIds() const;
InternalIterator* TEST_GetLevelIterator( InternalIterator* TEST_GetLevelIterator(
const ReadOptions& read_options, MergeIteratorBuilder* merge_iter_builder, const ReadOptions& read_options, MergeIteratorBuilder* merge_iter_builder,
int level, bool allow_unprepared_value); int level, bool allow_unprepared_value);

@ -516,22 +516,25 @@ struct DBOptions {
// Default: false // Default: false
bool track_and_verify_wals_in_manifest = false; bool track_and_verify_wals_in_manifest = false;
// EXPERIMENTAL: This API/behavior is subject to change // If true, verifies the SST unique id between MANIFEST and actual file
// If true, during DB-open it verifies the SST unique id between MANIFEST // each time an SST file is opened. This check ensures an SST file is not
// and SST properties, which is to make sure the SST is not overwritten or // overwritten or misplaced. A corruption error will be reported if mismatch
// misplaced. A corruption error will be reported if mismatch detected, but // detected, but only when MANIFEST tracks the unique id, which starts from
// only when MANIFEST tracks the unique id, which starts from version 7.3. // RocksDB version 7.3. Although the tracked internal unique id is related
// The unique id is an internal unique id and subject to change. // to the one returned by GetUniqueIdFromTableProperties, that is subject to
// // change.
// Note: // NOTE: verification is currently only done on SST files using block-based
// 1. if enabled, it opens every SST files during DB open to read the unique // table format.
// id from SST properties, so it's recommended to have `max_open_files=-1` //
// to pre-open the SST files before the verification. // Setting to false should only be needed in case of unexpected problems.
// 2. existing SST files won't have its unique_id tracked in MANIFEST, then //
// verification will be skipped. // Although an early version of this option opened all SST files for
// verification on DB::Open, that is no longer guaranteed. However, as
// documented in an above option, if max_open_files is -1, DB will open all
// files on DB::Open().
// //
// Default: false // Default: true
bool verify_sst_unique_id_in_manifest = false; bool verify_sst_unique_id_in_manifest = true;
// Use the specified object to interact with the environment, // Use the specified object to interact with the environment,
// e.g. to read/write files, schedule background work, etc. In the near // e.g. to read/write files, schedule background work, etc. In the near

@ -628,8 +628,8 @@ Status BlockBasedTableFactory::NewTableReader(
table_reader_options.force_direct_prefetch, &tail_prefetch_stats_, table_reader_options.force_direct_prefetch, &tail_prefetch_stats_,
table_reader_options.block_cache_tracer, table_reader_options.block_cache_tracer,
table_reader_options.max_file_size_for_l0_meta_pin, table_reader_options.max_file_size_for_l0_meta_pin,
table_reader_options.cur_db_session_id, table_reader_options.cur_db_session_id, table_reader_options.cur_file_num,
table_reader_options.cur_file_num); table_reader_options.unique_id);
} }
TableBuilder* BlockBasedTableFactory::NewTableBuilder( TableBuilder* BlockBasedTableFactory::NewTableBuilder(

@ -10,6 +10,7 @@
#include <algorithm> #include <algorithm>
#include <array> #include <array>
#include <atomic>
#include <cstdint> #include <cstdint>
#include <limits> #include <limits>
#include <memory> #include <memory>
@ -588,7 +589,7 @@ Status BlockBasedTable::Open(
TailPrefetchStats* tail_prefetch_stats, TailPrefetchStats* tail_prefetch_stats,
BlockCacheTracer* const block_cache_tracer, BlockCacheTracer* const block_cache_tracer,
size_t max_file_size_for_l0_meta_pin, const std::string& cur_db_session_id, size_t max_file_size_for_l0_meta_pin, const std::string& cur_db_session_id,
uint64_t cur_file_num) { uint64_t cur_file_num, UniqueId64x2 expected_unique_id) {
table_reader->reset(); table_reader->reset();
Status s; Status s;
@ -597,7 +598,7 @@ Status BlockBasedTable::Open(
// From read_options, retain deadline, io_timeout, and rate_limiter_priority. // From read_options, retain deadline, io_timeout, and rate_limiter_priority.
// In future, we may retain more // In future, we may retain more
// options. Specifically, w ignore verify_checksums and default to // options. Specifically, we ignore verify_checksums and default to
// checksum verification anyway when creating the index and filter // checksum verification anyway when creating the index and filter
// readers. // readers.
ReadOptions ro; ReadOptions ro;
@ -682,6 +683,53 @@ Status BlockBasedTable::Open(
if (!s.ok()) { if (!s.ok()) {
return s; return s;
} }
// Check expected unique id if provided
if (expected_unique_id != kNullUniqueId64x2) {
auto props = rep->table_properties;
if (!props) {
return Status::Corruption("Missing table properties on file " +
std::to_string(cur_file_num) +
" with known unique ID");
}
UniqueId64x2 actual_unique_id{};
s = GetSstInternalUniqueId(props->db_id, props->db_session_id,
props->orig_file_number, &actual_unique_id,
/*force*/ true);
assert(s.ok()); // because force=true
if (expected_unique_id != actual_unique_id) {
return Status::Corruption(
"Mismatch in unique ID on table file " +
std::to_string(cur_file_num) +
". Expected: " + InternalUniqueIdToHumanString(&expected_unique_id) +
" Actual: " + InternalUniqueIdToHumanString(&actual_unique_id));
}
TEST_SYNC_POINT_CALLBACK("BlockBasedTable::Open::PassedVerifyUniqueId",
&actual_unique_id);
} else {
TEST_SYNC_POINT_CALLBACK("BlockBasedTable::Open::SkippedVerifyUniqueId",
nullptr);
if (ioptions.verify_sst_unique_id_in_manifest && ioptions.logger) {
// A crude but isolated way of reporting unverified files. This should not
// be an ongoing concern so doesn't deserve a place in Statistics IMHO.
static std::atomic<uint64_t> unverified_count{0};
auto prev_count =
unverified_count.fetch_add(1, std::memory_order_relaxed);
if (prev_count == 0) {
ROCKS_LOG_WARN(
ioptions.logger,
"At least one SST file opened without unique ID to verify: %" PRIu64
".sst",
cur_file_num);
} else if (prev_count % 1000 == 0) {
ROCKS_LOG_WARN(
ioptions.logger,
"Another ~1000 SST files opened without unique ID to verify");
}
}
}
// Set up prefix extractor as needed
bool force_null_table_prefix_extractor = false; bool force_null_table_prefix_extractor = false;
TEST_SYNC_POINT_CALLBACK( TEST_SYNC_POINT_CALLBACK(
"BlockBasedTable::Open::ForceNullTablePrefixExtractor", "BlockBasedTable::Open::ForceNullTablePrefixExtractor",

@ -108,7 +108,8 @@ class BlockBasedTable : public TableReader {
TailPrefetchStats* tail_prefetch_stats = nullptr, TailPrefetchStats* tail_prefetch_stats = nullptr,
BlockCacheTracer* const block_cache_tracer = nullptr, BlockCacheTracer* const block_cache_tracer = nullptr,
size_t max_file_size_for_l0_meta_pin = 0, size_t max_file_size_for_l0_meta_pin = 0,
const std::string& cur_db_session_id = "", uint64_t cur_file_num = 0); const std::string& cur_db_session_id = "", uint64_t cur_file_num = 0,
UniqueId64x2 expected_unique_id = {});
bool PrefixRangeMayMatch(const Slice& internal_key, bool PrefixRangeMayMatch(const Slice& internal_key,
const ReadOptions& read_options, const ReadOptions& read_options,

@ -390,6 +390,10 @@ Status ReadFooterFromFile(const IOOptions& opts, RandomAccessFileReader* file,
// Check that we actually read the whole footer from the file. It may be // Check that we actually read the whole footer from the file. It may be
// that size isn't correct. // that size isn't correct.
if (footer_input.size() < Footer::kMinEncodedLength) { if (footer_input.size() < Footer::kMinEncodedLength) {
// FIXME: this error message is bad. We should be checking whether the
// provided file_size matches what's on disk, at least in this case.
// Unfortunately FileSystem/Env does not provide a way to get the size
// of an open file, so getting file size requires a full path seek.
return Status::Corruption("file is too short (" + return Status::Corruption("file is too short (" +
std::to_string(file_size) + std::to_string(file_size) +
" bytes) to be an " " bytes) to be an "

@ -22,6 +22,7 @@
#include "options/cf_options.h" #include "options/cf_options.h"
#include "rocksdb/options.h" #include "rocksdb/options.h"
#include "rocksdb/table_properties.h" #include "rocksdb/table_properties.h"
#include "table/unique_id_impl.h"
#include "trace_replay/block_cache_tracer.h" #include "trace_replay/block_cache_tracer.h"
namespace ROCKSDB_NAMESPACE { namespace ROCKSDB_NAMESPACE {
@ -40,25 +41,8 @@ struct TableReaderOptions {
bool _force_direct_prefetch = false, int _level = -1, bool _force_direct_prefetch = false, int _level = -1,
BlockCacheTracer* const _block_cache_tracer = nullptr, BlockCacheTracer* const _block_cache_tracer = nullptr,
size_t _max_file_size_for_l0_meta_pin = 0, size_t _max_file_size_for_l0_meta_pin = 0,
const std::string& _cur_db_session_id = "", uint64_t _cur_file_num = 0) const std::string& _cur_db_session_id = "", uint64_t _cur_file_num = 0,
: TableReaderOptions( UniqueId64x2 _unique_id = {}, SequenceNumber _largest_seqno = 0)
_ioptions, _prefix_extractor, _env_options, _internal_comparator,
_skip_filters, _immortal, _force_direct_prefetch, _level,
0 /* _largest_seqno */, _block_cache_tracer,
_max_file_size_for_l0_meta_pin, _cur_db_session_id, _cur_file_num) {
}
// @param skip_filters Disables loading/accessing the filter block
TableReaderOptions(
const ImmutableOptions& _ioptions,
const std::shared_ptr<const SliceTransform>& _prefix_extractor,
const EnvOptions& _env_options,
const InternalKeyComparator& _internal_comparator, bool _skip_filters,
bool _immortal, bool _force_direct_prefetch, int _level,
SequenceNumber _largest_seqno,
BlockCacheTracer* const _block_cache_tracer,
size_t _max_file_size_for_l0_meta_pin,
const std::string& _cur_db_session_id, uint64_t _cur_file_num)
: ioptions(_ioptions), : ioptions(_ioptions),
prefix_extractor(_prefix_extractor), prefix_extractor(_prefix_extractor),
env_options(_env_options), env_options(_env_options),
@ -71,7 +55,8 @@ struct TableReaderOptions {
block_cache_tracer(_block_cache_tracer), block_cache_tracer(_block_cache_tracer),
max_file_size_for_l0_meta_pin(_max_file_size_for_l0_meta_pin), max_file_size_for_l0_meta_pin(_max_file_size_for_l0_meta_pin),
cur_db_session_id(_cur_db_session_id), cur_db_session_id(_cur_db_session_id),
cur_file_num(_cur_file_num) {} cur_file_num(_cur_file_num),
unique_id(_unique_id) {}
const ImmutableOptions& ioptions; const ImmutableOptions& ioptions;
const std::shared_ptr<const SliceTransform>& prefix_extractor; const std::shared_ptr<const SliceTransform>& prefix_extractor;
@ -88,7 +73,7 @@ struct TableReaderOptions {
// What level this table/file is on, -1 for "not set, don't know." Used // What level this table/file is on, -1 for "not set, don't know." Used
// for level-specific statistics. // for level-specific statistics.
int level; int level;
// largest seqno in the table // largest seqno in the table (or 0 means unknown???)
SequenceNumber largest_seqno; SequenceNumber largest_seqno;
BlockCacheTracer* const block_cache_tracer; BlockCacheTracer* const block_cache_tracer;
// Largest L0 file size whose meta-blocks may be pinned (can be zero when // Largest L0 file size whose meta-blocks may be pinned (can be zero when
@ -98,6 +83,9 @@ struct TableReaderOptions {
std::string cur_db_session_id; std::string cur_db_session_id;
uint64_t cur_file_num; uint64_t cur_file_num;
// Known unique_id or {}, kNullUniqueId64x2 means unknown
UniqueId64x2 unique_id;
}; };
struct TableBuilderOptions { struct TableBuilderOptions {

@ -407,7 +407,7 @@ class TableConstructor : public Constructor {
EXPECT_EQ(TEST_GetSink()->contents().size(), builder->FileSize()); EXPECT_EQ(TEST_GetSink()->contents().size(), builder->FileSize());
// Open the table // Open the table
uniq_id_ = cur_uniq_id_++; file_num_ = cur_file_num_++;
return Reopen(ioptions, moptions); return Reopen(ioptions, moptions);
} }
@ -438,15 +438,15 @@ class TableConstructor : public Constructor {
virtual Status Reopen(const ImmutableOptions& ioptions, virtual Status Reopen(const ImmutableOptions& ioptions,
const MutableCFOptions& moptions) { const MutableCFOptions& moptions) {
std::unique_ptr<FSRandomAccessFile> source(new test::StringSource( std::unique_ptr<FSRandomAccessFile> source(new test::StringSource(
TEST_GetSink()->contents(), uniq_id_, ioptions.allow_mmap_reads)); TEST_GetSink()->contents(), file_num_, ioptions.allow_mmap_reads));
file_reader_.reset(new RandomAccessFileReader(std::move(source), "test")); file_reader_.reset(new RandomAccessFileReader(std::move(source), "test"));
return ioptions.table_factory->NewTableReader( return ioptions.table_factory->NewTableReader(
TableReaderOptions(ioptions, moptions.prefix_extractor, soptions, TableReaderOptions(ioptions, moptions.prefix_extractor, soptions,
*last_internal_comparator_, /*skip_filters*/ false, *last_internal_comparator_, /*skip_filters*/ false,
/*immortal*/ false, false, level_, largest_seqno_, /*immortal*/ false, false, level_,
&block_cache_tracer_, moptions.write_buffer_size, "", &block_cache_tracer_, moptions.write_buffer_size, "",
uniq_id_), file_num_, kNullUniqueId64x2, largest_seqno_),
std::move(file_reader_), TEST_GetSink()->contents().size(), std::move(file_reader_), TEST_GetSink()->contents().size(),
&table_reader_); &table_reader_);
} }
@ -469,14 +469,14 @@ class TableConstructor : public Constructor {
private: private:
void Reset() { void Reset() {
uniq_id_ = 0; file_num_ = 0;
table_reader_.reset(); table_reader_.reset();
file_writer_.reset(); file_writer_.reset();
file_reader_.reset(); file_reader_.reset();
} }
const ReadOptions read_options_; const ReadOptions read_options_;
uint64_t uniq_id_; uint64_t file_num_;
std::unique_ptr<WritableFileWriter> file_writer_; std::unique_ptr<WritableFileWriter> file_writer_;
std::unique_ptr<RandomAccessFileReader> file_reader_; std::unique_ptr<RandomAccessFileReader> file_reader_;
std::unique_ptr<TableReader> table_reader_; std::unique_ptr<TableReader> table_reader_;
@ -486,11 +486,11 @@ class TableConstructor : public Constructor {
TableConstructor(); TableConstructor();
static uint64_t cur_uniq_id_; static uint64_t cur_file_num_;
EnvOptions soptions; EnvOptions soptions;
Env* env_; Env* env_;
}; };
uint64_t TableConstructor::cur_uniq_id_ = 1; uint64_t TableConstructor::cur_file_num_ = 1;
class MemTableConstructor: public Constructor { class MemTableConstructor: public Constructor {
public: public:
@ -1310,7 +1310,7 @@ class FileChecksumTestHelper {
Status CalculateFileChecksum(FileChecksumGenerator* file_checksum_generator, Status CalculateFileChecksum(FileChecksumGenerator* file_checksum_generator,
std::string* checksum) { std::string* checksum) {
assert(file_checksum_generator != nullptr); assert(file_checksum_generator != nullptr);
cur_uniq_id_ = checksum_uniq_id_++; cur_file_num_ = checksum_file_num_++;
test::StringSink* ss_rw = test::StringSink* ss_rw =
static_cast<test::StringSink*>(file_writer_->writable_file()); static_cast<test::StringSink*>(file_writer_->writable_file());
std::unique_ptr<FSRandomAccessFile> source( std::unique_ptr<FSRandomAccessFile> source(
@ -1344,17 +1344,17 @@ class FileChecksumTestHelper {
private: private:
bool convert_to_internal_key_; bool convert_to_internal_key_;
uint64_t cur_uniq_id_; uint64_t cur_file_num_;
std::unique_ptr<WritableFileWriter> file_writer_; std::unique_ptr<WritableFileWriter> file_writer_;
std::unique_ptr<RandomAccessFileReader> file_reader_; std::unique_ptr<RandomAccessFileReader> file_reader_;
std::unique_ptr<TableBuilder> table_builder_; std::unique_ptr<TableBuilder> table_builder_;
stl_wrappers::KVMap kv_map_; stl_wrappers::KVMap kv_map_;
test::StringSink* sink_ = nullptr; test::StringSink* sink_ = nullptr;
static uint64_t checksum_uniq_id_; static uint64_t checksum_file_num_;
}; };
uint64_t FileChecksumTestHelper::checksum_uniq_id_ = 1; uint64_t FileChecksumTestHelper::checksum_file_num_ = 1;
INSTANTIATE_TEST_CASE_P(FormatVersions, BlockBasedTableTest, INSTANTIATE_TEST_CASE_P(FormatVersions, BlockBasedTableTest,
testing::ValuesIn(test::kFooterFormatVersionsToTest)); testing::ValuesIn(test::kFooterFormatVersionsToTest));

@ -2083,6 +2083,10 @@ TEST_F(BackupEngineTest, ShareTableFilesWithChecksumsOldFileNaming) {
}); });
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing(); ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
// Corrupting the table properties corrupts the unique id.
// Ignore the unique id recorded in the manifest.
options_.verify_sst_unique_id_in_manifest = false;
OpenDBAndBackupEngine(true, false, kShareWithChecksum); OpenDBAndBackupEngine(true, false, kShareWithChecksum);
FillDB(db_.get(), 0, keys_iteration); FillDB(db_.get(), 0, keys_iteration);
CloseDBAndBackupEngine(); CloseDBAndBackupEngine();
