From 230660be7333a2fc6069fd860f1701fb5dfc5243 Mon Sep 17 00:00:00 2001 From: Peter Dillinger Date: Thu, 18 Nov 2021 11:42:12 -0800 Subject: [PATCH] Improve / clean up meta block code & integrity (#9163) Summary: * Checksums are now checked on meta blocks unless specifically suppressed or not applicable (e.g. plain table). (Was other way around.) This means a number of cases that were not checking checksums now are, including direct read TableProperties in Version::GetTableProperties (fixed in meta_blocks ReadTableProperties), reading any block from PersistentCache (fixed in BlockFetcher), read TableProperties in SstFileDumper (ldb/sst_dump/BackupEngine) before table reader open, maybe more. * For that to work, I moved the global_seqno+TableProperties checksum logic to the shared table/ code, because that is used by many utilies such as SstFileDumper. * Also for that to work, we have to know when we're dealing with a block that has a checksum (trailer), so added that capability to Footer based on magic number, and from there BlockFetcher. * Knowledge of trailer presence has also fixed a problem where other table formats were reading blocks including bytes for a non-existant trailer--and awkwardly kind-of not using them, e.g. no shared code checking checksums. (BlockFetcher compression type was populated incorrectly.) Now we only read what is needed. * Minimized code duplication and differing/incompatible/awkward abstractions in meta_blocks.{cc,h} (e.g. SeekTo in metaindex block without parsing block handle) * Moved some meta block handling code from table_properties*.* * Moved some code specific to block-based table from shared table/ code to BlockBasedTable class. The checksum stuff means we can't completely separate it, but things that don't need to be in shared table/ code should not be. * Use unique_ptr rather than raw ptr in more places. (Note: you can std::move from unique_ptr to shared_ptr.) Without enhancements to GetPropertiesOfAllTablesTest (see below), net reduction of roughly 100 lines of code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9163 Test Plan: existing tests and * Enhanced DBTablePropertiesTest.GetPropertiesOfAllTablesTest to verify that checksums are now checked on direct read of table properties by TableCache (new test would fail before this change) * Also enhanced DBTablePropertiesTest.GetPropertiesOfAllTablesTest to test putting table properties under old meta name * Also generally enhanced that same test to actually test what it was supposed to be testing already, by kicking things out of table cache when we don't want them there. Reviewed By: ajkr, mrambacher Differential Revision: D32514757 Pulled By: pdillinger fbshipit-source-id: 507964b9311d186ae8d1131182290cbd97a99fa9 --- HISTORY.md | 2 + Makefile | 7 + db/corruption_test.cc | 2 +- db/db_table_properties_test.cc | 79 ++++- db/plain_table_db_test.cc | 40 ++- db/table_cache.cc | 10 + db/table_cache.h | 3 + db/table_properties_collector_test.cc | 14 +- db/version_edit_handler.cc | 6 +- db/version_set.cc | 7 +- include/rocksdb/table_properties.h | 4 - .../block_based/block_based_table_builder.cc | 10 +- table/block_based/block_based_table_reader.cc | 141 +++------ table/block_based/block_based_table_reader.h | 27 +- table/block_based/block_prefetcher.cc | 9 +- table/block_based/block_test.cc | 7 +- table/block_based/filter_policy.cc | 3 +- table/block_based/flush_block_policy.cc | 3 +- table/block_based/partitioned_filter_block.cc | 3 +- table/block_based/partitioned_index_reader.cc | 4 +- table/block_based/reader_common.cc | 1 + table/block_fetcher.cc | 28 +- table/block_fetcher.h | 24 +- table/cuckoo/cuckoo_table_builder_test.cc | 6 +- table/cuckoo/cuckoo_table_reader.cc | 18 +- table/format.cc | 23 +- table/format.h | 38 +-- table/meta_blocks.cc | 278 ++++++++---------- table/meta_blocks.h | 59 ++-- table/persistent_cache_helper.cc | 5 +- table/persistent_cache_options.h | 2 + table/plain/plain_table_reader.cc | 14 +- table/sst_file_dumper.cc | 11 +- table/table_properties.cc | 57 +--- table/table_properties_internal.h | 20 -- table/table_test.cc | 31 +- 36 files changed, 493 insertions(+), 503 deletions(-) diff --git a/HISTORY.md b/HISTORY.md index 68cac534b..75c45d183 100644 --- a/HISTORY.md +++ b/HISTORY.md @@ -17,6 +17,7 @@ * Added input sanitization on negative bytes passed into `GenericRateLimiter::Request`. * Fixed an assertion failure in CompactionIterator when write-prepared transaction is used. We prove that certain operations can lead to a Delete being followed by a SingleDelete (same user key). We can drop the SingleDelete. * Fixed a bug of timestamp-based GC which can cause all versions of a key under full_history_ts_low to be dropped. This bug will be triggered when some of the ikeys' timestamps are lower than full_history_ts_low, while others are newer. +* In some cases outside of the DB read and compaction paths, SST block checksums are now checked where they were not before. * Explicitly check for and disallow the `BlockBasedTableOptions` if insertion into one of {`block_cache`, `block_cache_compressed`, `persistent_cache`} can show up in another of these. (RocksDB expects to be able to use the same key for different physical data among tiers.) ### Behavior Changes @@ -29,6 +30,7 @@ * Made FileSystem extend the Customizable class and added a CreateFromString method. Implementations need to be registered with the ObjectRegistry and to implement a Name() method in order to be created via this method. * Clarified in API comments that RocksDB is not exception safe for callbacks and custom extensions. An exception propagating into RocksDB can lead to undefined behavior, including data loss, unreported corruption, deadlocks, and more. * Marked `WriteBufferManager` as `final` because it is not intended for extension. +* Removed unimportant implementation details from table_properties.h * Add API `FSDirectory::FsyncWithDirOptions()`, which provides extra information like directory fsync reason in `DirFsyncOptions`. File system like btrfs is using that to skip directory fsync for creating a new file, or when renaming a file, fsync the target file instead of the directory, which improves the `DB::Open()` speed by ~20%. * `DB::Open()` is not going be blocked by obsolete file purge if `DBOptions::avoid_unnecessary_blocking_io` is set to true. * In builds where glibc provides `gettid()`, info log ("LOG" file) lines now print a system-wide thread ID from `gettid()` instead of the process-local `pthread_self()`. For all users, the thread ID format is changed from hexadecimal to decimal integer. diff --git a/Makefile b/Makefile index 67e312eb9..8a1d3f1e8 100644 --- a/Makefile +++ b/Makefile @@ -306,6 +306,13 @@ ifdef COMPILE_WITH_ASAN EXEC_LDFLAGS += -fsanitize=address PLATFORM_CCFLAGS += -fsanitize=address PLATFORM_CXXFLAGS += -fsanitize=address +ifeq ($(LIB_MODE),shared) +ifdef USE_CLANG +# Fix false ODR violation; see https://github.com/google/sanitizers/issues/1017 + EXEC_LDFLAGS += -mllvm -asan-use-private-alias=1 + PLATFORM_CXXFLAGS += -mllvm -asan-use-private-alias=1 +endif +endif endif # TSAN doesn't work well with jemalloc. If we're compiling with TSAN, we should use regular malloc. diff --git a/db/corruption_test.cc b/db/corruption_test.cc index c592de875..09825b1be 100644 --- a/db/corruption_test.cc +++ b/db/corruption_test.cc @@ -544,7 +544,7 @@ TEST_F(CorruptionTest, RangeDeletionCorrupted) { fs->GetFileSize(filename, file_opts.io_options, &file_size, nullptr)); BlockHandle range_del_handle; - ASSERT_OK(FindMetaBlock( + ASSERT_OK(FindMetaBlockInFile( file_reader.get(), file_size, kBlockBasedTableMagicNumber, ImmutableOptions(options_), kRangeDelBlock, &range_del_handle)); diff --git a/db/db_table_properties_test.cc b/db/db_table_properties_test.cc index 788e2a9d7..308b95cef 100644 --- a/db/db_table_properties_test.cc +++ b/db/db_table_properties_test.cc @@ -14,8 +14,10 @@ #include "port/port.h" #include "port/stack_trace.h" #include "rocksdb/db.h" +#include "rocksdb/types.h" #include "rocksdb/utilities/table_properties_collectors.h" #include "table/format.h" +#include "table/meta_blocks.h" #include "test_util/testharness.h" #include "test_util/testutil.h" #include "util/random.h" @@ -63,21 +65,49 @@ class DBTablePropertiesTest : public DBTestBase, TEST_F(DBTablePropertiesTest, GetPropertiesOfAllTablesTest) { Options options = CurrentOptions(); options.level0_file_num_compaction_trigger = 8; + // Part of strategy to prevent pinning table files + options.max_open_files = 42; Reopen(options); + // Create 4 tables for (int table = 0; table < 4; ++table) { + // Use old meta name for table properties for one file + if (table == 3) { + SyncPoint::GetInstance()->SetCallBack( + "BlockBasedTableBuilder::WritePropertiesBlock:Meta", [&](void* meta) { + *reinterpret_cast(meta) = + &kPropertiesBlockOldName; + }); + SyncPoint::GetInstance()->EnableProcessing(); + } + // Build file for (int i = 0; i < 10 + table; ++i) { ASSERT_OK(db_->Put(WriteOptions(), ToString(table * 100 + i), "val")); } ASSERT_OK(db_->Flush(FlushOptions())); } + SyncPoint::GetInstance()->DisableProcessing(); + std::string original_session_id; + ASSERT_OK(db_->GetDbSessionId(original_session_id)); + + // Part of strategy to prevent pinning table files + SyncPoint::GetInstance()->SetCallBack( + "VersionEditHandler::LoadTables:skip_load_table_files", + [&](void* skip_load) { *reinterpret_cast(skip_load) = true; }); + SyncPoint::GetInstance()->EnableProcessing(); // 1. Read table properties directly from file Reopen(options); + // Clear out auto-opened files + dbfull()->TEST_table_cache()->EraseUnRefEntries(); + ASSERT_EQ(dbfull()->TEST_table_cache()->GetUsage(), 0U); VerifyTableProperties(db_, 10 + 11 + 12 + 13); // 2. Put two tables to table cache and Reopen(options); + // Clear out auto-opened files + dbfull()->TEST_table_cache()->EraseUnRefEntries(); + ASSERT_EQ(dbfull()->TEST_table_cache()->GetUsage(), 0U); // fetch key from 1st and 2nd table, which will internally place that table to // the table cache. for (int i = 0; i < 2; ++i) { @@ -88,12 +118,57 @@ TEST_F(DBTablePropertiesTest, GetPropertiesOfAllTablesTest) { // 3. Put all tables to table cache Reopen(options); - // fetch key from 1st and 2nd table, which will internally place that table to - // the table cache. + // fetch key from all tables, which will place them in table cache. for (int i = 0; i < 4; ++i) { Get(ToString(i * 100 + 0)); } VerifyTableProperties(db_, 10 + 11 + 12 + 13); + + // 4. Try to read CORRUPT properties (a) directly from file, and (b) + // through reader on Get + + // It's not practical to prevent table file read on Open, so we + // corrupt after open and after purging table cache. + for (bool direct : {true, false}) { + Reopen(options); + // Clear out auto-opened files + dbfull()->TEST_table_cache()->EraseUnRefEntries(); + ASSERT_EQ(dbfull()->TEST_table_cache()->GetUsage(), 0U); + + TablePropertiesCollection props; + ASSERT_OK(db_->GetPropertiesOfAllTables(&props)); + std::string sst_file = props.begin()->first; + + // Corrupt the file's TableProperties using session id + std::string contents; + ASSERT_OK( + ReadFileToString(env_->GetFileSystem().get(), sst_file, &contents)); + size_t pos = contents.find(original_session_id); + ASSERT_NE(pos, std::string::npos); + ASSERT_OK(test::CorruptFile(env_, sst_file, static_cast(pos), 1, + /*verify checksum fails*/ false)); + + // Try to read CORRUPT properties + if (direct) { + ASSERT_TRUE(db_->GetPropertiesOfAllTables(&props).IsCorruption()); + } else { + bool found_corruption = false; + for (int i = 0; i < 4; ++i) { + std::string result = Get(ToString(i * 100 + 0)); + if (result.find_first_of("Corruption: block checksum mismatch") != + std::string::npos) { + found_corruption = true; + } + } + ASSERT_TRUE(found_corruption); + } + + // UN-corrupt file for next iteration + ASSERT_OK(test::CorruptFile(env_, sst_file, static_cast(pos), 1, + /*verify checksum fails*/ false)); + } + + SyncPoint::GetInstance()->DisableProcessing(); } TEST_F(DBTablePropertiesTest, CreateOnDeletionCollectorFactory) { diff --git a/db/plain_table_db_test.cc b/db/plain_table_db_test.cc index 0654ce994..9932de039 100644 --- a/db/plain_table_db_test.cc +++ b/db/plain_table_db_test.cc @@ -266,24 +266,22 @@ class TestPlainTableReader : public PlainTableReader { const EnvOptions& env_options, const InternalKeyComparator& icomparator, EncodingType encoding_type, uint64_t file_size, int bloom_bits_per_key, double hash_table_ratio, size_t index_sparseness, - const TableProperties* table_properties, + std::unique_ptr&& props, std::unique_ptr&& file, const ImmutableOptions& ioptions, const SliceTransform* prefix_extractor, bool* expect_bloom_not_match, bool store_index_in_file, uint32_t column_family_id, const std::string& column_family_name) : PlainTableReader(ioptions, std::move(file), env_options, icomparator, - encoding_type, file_size, table_properties, + encoding_type, file_size, props.get(), prefix_extractor), expect_bloom_not_match_(expect_bloom_not_match) { Status s = MmapDataIfNeeded(); EXPECT_TRUE(s.ok()); - s = PopulateIndex(const_cast(table_properties), - bloom_bits_per_key, hash_table_ratio, index_sparseness, - 2 * 1024 * 1024); + s = PopulateIndex(props.get(), bloom_bits_per_key, hash_table_ratio, + index_sparseness, 2 * 1024 * 1024); EXPECT_TRUE(s.ok()); - TableProperties* props = const_cast(table_properties); EXPECT_EQ(column_family_id, static_cast(props->column_family_id)); EXPECT_EQ(column_family_name, props->column_family_name); if (store_index_in_file) { @@ -297,7 +295,7 @@ class TestPlainTableReader : public PlainTableReader { EXPECT_TRUE(num_blocks_ptr != props->user_collected_properties.end()); } } - table_properties_.reset(props); + table_properties_ = std::move(props); } ~TestPlainTableReader() override {} @@ -337,26 +335,24 @@ class TestPlainTableFactory : public PlainTableFactory { std::unique_ptr&& file, uint64_t file_size, std::unique_ptr* table, bool /*prefetch_index_and_filter_in_cache*/) const override { - TableProperties* props = nullptr; - auto s = - ReadTableProperties(file.get(), file_size, kPlainTableMagicNumber, - table_reader_options.ioptions, &props, - true /* compression_type_missing */); + std::unique_ptr props; + auto s = ReadTableProperties(file.get(), file_size, kPlainTableMagicNumber, + table_reader_options.ioptions, &props); EXPECT_TRUE(s.ok()); if (store_index_in_file_) { BlockHandle bloom_block_handle; - s = FindMetaBlock(file.get(), file_size, kPlainTableMagicNumber, - table_reader_options.ioptions, - BloomBlockBuilder::kBloomBlock, &bloom_block_handle, - /* compression_type_missing */ true); + s = FindMetaBlockInFile(file.get(), file_size, kPlainTableMagicNumber, + table_reader_options.ioptions, + BloomBlockBuilder::kBloomBlock, + &bloom_block_handle); EXPECT_TRUE(s.ok()); BlockHandle index_block_handle; - s = FindMetaBlock(file.get(), file_size, kPlainTableMagicNumber, - table_reader_options.ioptions, - PlainTableIndexBuilder::kPlainTableIndexBlock, - &index_block_handle, /* compression_type_missing */ true); + s = FindMetaBlockInFile(file.get(), file_size, kPlainTableMagicNumber, + table_reader_options.ioptions, + PlainTableIndexBuilder::kPlainTableIndexBlock, + &index_block_handle); EXPECT_TRUE(s.ok()); } @@ -370,8 +366,8 @@ class TestPlainTableFactory : public PlainTableFactory { std::unique_ptr new_reader(new TestPlainTableReader( table_reader_options.env_options, table_reader_options.internal_comparator, encoding_type, file_size, - bloom_bits_per_key_, hash_table_ratio_, index_sparseness_, props, - std::move(file), table_reader_options.ioptions, + bloom_bits_per_key_, hash_table_ratio_, index_sparseness_, + std::move(props), std::move(file), table_reader_options.ioptions, table_reader_options.prefix_extractor, expect_bloom_not_match_, store_index_in_file_, column_family_id_, column_family_name_)); diff --git a/db/table_cache.cc b/db/table_cache.cc index fe549329a..8bcd81c52 100644 --- a/db/table_cache.cc +++ b/db/table_cache.cc @@ -655,6 +655,16 @@ size_t TableCache::GetMemoryUsageByTableReader( return ret; } +bool TableCache::HasEntry(Cache* cache, uint64_t file_number) { + Cache::Handle* handle = cache->Lookup(GetSliceForFileNumber(&file_number)); + if (handle) { + cache->Release(handle); + return true; + } else { + return false; + } +} + void TableCache::Evict(Cache* cache, uint64_t file_number) { cache->Erase(GetSliceForFileNumber(&file_number)); } diff --git a/db/table_cache.h b/db/table_cache.h index e8144dcd0..da172833b 100644 --- a/db/table_cache.h +++ b/db/table_cache.h @@ -125,6 +125,9 @@ class TableCache { // Evict any entry for the specified file number static void Evict(Cache* cache, uint64_t file_number); + // Query whether specified file number is currently in cache + static bool HasEntry(Cache* cache, uint64_t file_number); + // Clean table handle and erase it from the table cache // Used in DB close, or the file is not live anymore. void EraseHandle(const FileDescriptor& fd, Cache::Handle* handle); diff --git a/db/table_properties_collector_test.cc b/db/table_properties_collector_test.cc index e6f88aca9..010136247 100644 --- a/db/table_properties_collector_test.cc +++ b/db/table_properties_collector_test.cc @@ -291,11 +291,9 @@ void TestCustomizedTablePropertiesCollector( std::unique_ptr fake_file_reader( new RandomAccessFileReader(std::move(source), "test")); - TableProperties* props; + std::unique_ptr props; Status s = ReadTableProperties(fake_file_reader.get(), fwf->contents().size(), - magic_number, ioptions, &props, - true /* compression_type_missing */); - std::unique_ptr props_guard(props); + magic_number, ioptions, &props); ASSERT_OK(s); auto user_collected = props->user_collected_properties; @@ -432,13 +430,11 @@ void TestInternalKeyPropertiesCollector( std::unique_ptr reader( new RandomAccessFileReader(std::move(source), "test")); - TableProperties* props; - Status s = - ReadTableProperties(reader.get(), fwf->contents().size(), magic_number, - ioptions, &props, true /* compression_type_missing */); + std::unique_ptr props; + Status s = ReadTableProperties(reader.get(), fwf->contents().size(), + magic_number, ioptions, &props); ASSERT_OK(s); - std::unique_ptr props_guard(props); auto user_collected = props->user_collected_properties; uint64_t deleted = GetDeletedKeys(user_collected); ASSERT_EQ(5u, deleted); // deletes + single-deletes diff --git a/db/version_edit_handler.cc b/db/version_edit_handler.cc index 7e6c4e9e7..39ed82580 100644 --- a/db/version_edit_handler.cc +++ b/db/version_edit_handler.cc @@ -517,7 +517,11 @@ Status VersionEditHandler::MaybeCreateVersion(const VersionEdit& /*edit*/, Status VersionEditHandler::LoadTables(ColumnFamilyData* cfd, bool prefetch_index_and_filter_in_cache, bool is_initial_load) { - if (skip_load_table_files_) { + bool skip_load_table_files = skip_load_table_files_; + TEST_SYNC_POINT_CALLBACK( + "VersionEditHandler::LoadTables:skip_load_table_files", + &skip_load_table_files); + if (skip_load_table_files) { return Status::OK(); } assert(cfd != nullptr); diff --git a/db/version_set.cc b/db/version_set.cc index d46c7bc50..557698e9f 100644 --- a/db/version_set.cc +++ b/db/version_set.cc @@ -1271,7 +1271,6 @@ Status Version::GetTableProperties(std::shared_ptr* tp, return s; } - TableProperties* raw_table_properties; // By setting the magic number to kInvalidTableMagicNumber, we can by // pass the magic number check in the footer. std::unique_ptr file_reader( @@ -1279,16 +1278,16 @@ Status Version::GetTableProperties(std::shared_ptr* tp, std::move(file), file_name, nullptr /* env */, io_tracer_, nullptr /* stats */, 0 /* hist_type */, nullptr /* file_read_hist */, nullptr /* rate_limiter */, ioptions->listeners)); + std::unique_ptr props; s = ReadTableProperties( file_reader.get(), file_meta->fd.GetFileSize(), Footer::kInvalidTableMagicNumber /* table's magic number */, *ioptions, - &raw_table_properties, false /* compression_type_missing */); + &props); if (!s.ok()) { return s; } + *tp = std::move(props); RecordTick(ioptions->stats, NUMBER_DIRECT_LOAD_TABLE_PROPERTIES); - - *tp = std::shared_ptr(raw_table_properties); return s; } diff --git a/include/rocksdb/table_properties.h b/include/rocksdb/table_properties.h index e6d59c821..b9ddc2a84 100644 --- a/include/rocksdb/table_properties.h +++ b/include/rocksdb/table_properties.h @@ -71,10 +71,6 @@ struct TablePropertiesNames { static const std::string kFastCompressionEstimatedDataSize; }; -extern const std::string kPropertiesBlock; -extern const std::string kCompressionDictBlock; -extern const std::string kRangeDelBlock; - // `TablePropertiesCollector` provides the mechanism for users to collect // their own properties that they are interested in. This class is essentially // a collection of callback functions that will be invoked during table diff --git a/table/block_based/block_based_table_builder.cc b/table/block_based/block_based_table_builder.cc index 780eb4f37..92e9cb9ba 100644 --- a/table/block_based/block_based_table_builder.cc +++ b/table/block_based/block_based_table_builder.cc @@ -46,6 +46,7 @@ #include "table/block_based/full_filter_block.h" #include "table/block_based/partitioned_filter_block.h" #include "table/format.h" +#include "table/meta_blocks.h" #include "table/table_builder.h" #include "util/coding.h" #include "util/compression.h" @@ -62,6 +63,8 @@ extern const std::string kHashIndexPrefixesMetadataBlock; // Without anonymous namespace here, we fail the warning -Wmissing-prototypes namespace { +constexpr size_t kBlockTrailerSize = BlockBasedTable::kBlockTrailerSize; + // Create a filter block builder based on its type. FilterBlockBuilder* CreateFilterBlockBuilder( const ImmutableCFOptions& /*opt*/, const MutableCFOptions& mopt, @@ -1722,7 +1725,12 @@ void BlockBasedTableBuilder::WritePropertiesBlock( &props_block_size); } #endif // !NDEBUG - meta_index_builder->Add(kPropertiesBlock, properties_block_handle); + + const std::string* properties_block_meta = &kPropertiesBlock; + TEST_SYNC_POINT_CALLBACK( + "BlockBasedTableBuilder::WritePropertiesBlock:Meta", + &properties_block_meta); + meta_index_builder->Add(*properties_block_meta, properties_block_handle); } } diff --git a/table/block_based/block_based_table_reader.cc b/table/block_based/block_based_table_reader.cc index 75b9994ff..348fd3009 100644 --- a/table/block_based/block_based_table_reader.cc +++ b/table/block_based/block_based_table_reader.cc @@ -17,6 +17,7 @@ #include "cache/cache_entry_roles.h" #include "cache/sharded_cache.h" +#include "db/compaction/compaction_picker.h" #include "db/dbformat.h" #include "db/pinned_iterators_manager.h" #include "file/file_prefetch_buffer.h" @@ -747,75 +748,27 @@ Status BlockBasedTable::PrefetchTail( return s; } -Status BlockBasedTable::TryReadPropertiesWithGlobalSeqno( - const ReadOptions& ro, FilePrefetchBuffer* prefetch_buffer, - const Slice& handle_value, TableProperties** table_properties) { - assert(table_properties != nullptr); - // If this is an external SST file ingested with write_global_seqno set to - // true, then we expect the checksum mismatch because checksum was written - // by SstFileWriter, but its global seqno in the properties block may have - // been changed during ingestion. In this case, we read the properties - // block, copy it to a memory buffer, change the global seqno to its - // original value, i.e. 0, and verify the checksum again. - BlockHandle props_block_handle; - CacheAllocationPtr tmp_buf; - Status s = ReadProperties(ro, handle_value, rep_->file.get(), prefetch_buffer, - rep_->footer, rep_->ioptions, table_properties, - false /* verify_checksum */, &props_block_handle, - &tmp_buf, false /* compression_type_missing */, - nullptr /* memory_allocator */); - if (s.ok() && tmp_buf) { - const auto seqno_pos_iter = - (*table_properties) - ->properties_offsets.find( - ExternalSstFilePropertyNames::kGlobalSeqno); - size_t block_size = static_cast(props_block_handle.size()); - if (seqno_pos_iter != (*table_properties)->properties_offsets.end()) { - uint64_t global_seqno_offset = seqno_pos_iter->second; - EncodeFixed64( - tmp_buf.get() + global_seqno_offset - props_block_handle.offset(), 0); - } - s = ROCKSDB_NAMESPACE::VerifyBlockChecksum( - rep_->footer.checksum(), tmp_buf.get(), block_size, - rep_->file->file_name(), props_block_handle.offset()); - } - return s; -} - Status BlockBasedTable::ReadPropertiesBlock( const ReadOptions& ro, FilePrefetchBuffer* prefetch_buffer, InternalIterator* meta_iter, const SequenceNumber largest_seqno) { - bool found_properties_block = true; Status s; - s = SeekToPropertiesBlock(meta_iter, &found_properties_block); + BlockHandle handle; + s = FindOptionalMetaBlock(meta_iter, kPropertiesBlock, &handle); if (!s.ok()) { ROCKS_LOG_WARN(rep_->ioptions.logger, "Error when seeking to properties block from file: %s", s.ToString().c_str()); - } else if (found_properties_block) { + } else if (!handle.IsNull()) { s = meta_iter->status(); - TableProperties* table_properties = nullptr; + std::unique_ptr table_properties; if (s.ok()) { - s = ReadProperties( - ro, meta_iter->value(), rep_->file.get(), prefetch_buffer, - rep_->footer, rep_->ioptions, &table_properties, - true /* verify_checksum */, nullptr /* ret_block_handle */, - nullptr /* ret_block_contents */, - false /* compression_type_missing */, nullptr /* memory_allocator */); + s = ReadTablePropertiesHelper( + ro, handle, rep_->file.get(), prefetch_buffer, rep_->footer, + rep_->ioptions, &table_properties, nullptr /* memory_allocator */); } IGNORE_STATUS_IF_ERROR(s); - if (s.IsCorruption()) { - s = TryReadPropertiesWithGlobalSeqno( - ro, prefetch_buffer, meta_iter->value(), &table_properties); - IGNORE_STATUS_IF_ERROR(s); - } - std::unique_ptr props_guard; - if (table_properties != nullptr) { - props_guard.reset(table_properties); - } - if (!s.ok()) { ROCKS_LOG_WARN(rep_->ioptions.logger, "Encountered error while reading data from properties " @@ -823,7 +776,7 @@ Status BlockBasedTable::ReadPropertiesBlock( s.ToString().c_str()); } else { assert(table_properties != nullptr); - rep_->table_properties.reset(props_guard.release()); + rep_->table_properties = std::move(table_properties); rep_->blocks_maybe_compressed = rep_->table_properties->compression_name != CompressionTypeToString(kNoCompression); @@ -898,15 +851,14 @@ Status BlockBasedTable::ReadRangeDelBlock( const InternalKeyComparator& internal_comparator, BlockCacheLookupContext* lookup_context) { Status s; - bool found_range_del_block; BlockHandle range_del_handle; - s = SeekToRangeDelBlock(meta_iter, &found_range_del_block, &range_del_handle); + s = FindOptionalMetaBlock(meta_iter, kRangeDelBlock, &range_del_handle); if (!s.ok()) { ROCKS_LOG_WARN( rep_->ioptions.logger, "Error when seeking to range delete tombstones block from file: %s", s.ToString().c_str()); - } else if (found_range_del_block && !range_del_handle.IsNull()) { + } else if (!range_del_handle.IsNull()) { std::unique_ptr iter(NewDataBlockIterator( read_options, range_del_handle, /*input_iter=*/nullptr, BlockType::kRangeDeletion, @@ -969,9 +921,8 @@ Status BlockBasedTable::PrefetchIndexAndFilterBlocks( rep_->index_type == BlockBasedTableOptions::kTwoLevelIndexSearch); // Find compression dictionary handle - bool found_compression_dict = false; - s = SeekToCompressionDictBlock(meta_iter, &found_compression_dict, - &rep_->compression_dict_handle); + s = FindOptionalMetaBlock(meta_iter, kCompressionDictBlock, + &rep_->compression_dict_handle); if (!s.ok()) { return s; } @@ -1243,7 +1194,7 @@ Status BlockBasedTable::GetDataBlockFromCache( RecordTick(statistics, BLOCK_CACHE_COMPRESSED_HIT); compressed_block = reinterpret_cast( block_cache_compressed->Value(block_cache_compressed_handle)); - CompressionType compression_type = compressed_block->get_compression_type(); + CompressionType compression_type = GetBlockCompressionType(*compressed_block); assert(compression_type != kNoCompression); // Retrieve the uncompressed contents into a new buffer @@ -1523,8 +1474,9 @@ Status BlockBasedTable::MaybeReadBlockAndLoadToCache( // Update the block details so that PrefetchBuffer can use the read // pattern to determine if reads are sequential or not for // prefetching. It should also take in account blocks read from cache. - prefetch_buffer->UpdateReadPattern( - handle.offset(), block_size(handle), ro.adaptive_readahead); + prefetch_buffer->UpdateReadPattern(handle.offset(), + BlockSizeWithTrailer(handle), + ro.adaptive_readahead); } } } @@ -1569,7 +1521,7 @@ Status BlockBasedTable::MaybeReadBlockAndLoadToCache( } } } else { - raw_block_comp_type = contents->get_compression_type(); + raw_block_comp_type = GetBlockCompressionType(*contents); } if (s.ok()) { @@ -1731,7 +1683,7 @@ void BlockBasedTable::RetrieveMultipleBlocks( if (use_shared_buffer && !file->use_direct_io() && prev_end == handle.offset()) { req_offset_for_block.emplace_back(prev_len); - prev_len += block_size(handle); + prev_len += BlockSizeWithTrailer(handle); } else { // No compression or current block and previous one is not adjacent: // Step 1, create a new request for previous blocks @@ -1752,13 +1704,13 @@ void BlockBasedTable::RetrieveMultipleBlocks( // Step 2, remeber the previous block info prev_offset = handle.offset(); - prev_len = block_size(handle); + prev_len = BlockSizeWithTrailer(handle); req_offset_for_block.emplace_back(0); } req_idx_for_block.emplace_back(read_reqs.size()); PERF_COUNTER_ADD(block_read_count, 1); - PERF_COUNTER_ADD(block_read_byte, block_size(handle)); + PERF_COUNTER_ADD(block_read_byte, BlockSizeWithTrailer(handle)); } // Handle the last block and process the pending last request if (prev_len != 0) { @@ -1815,7 +1767,7 @@ void BlockBasedTable::RetrieveMultipleBlocks( Status s = req.status; if (s.ok()) { if ((req.result.size() != req.len) || - (req_offset + block_size(handle) > req.result.size())) { + (req_offset + BlockSizeWithTrailer(handle) > req.result.size())) { s = Status::Corruption( "truncated block read from " + rep_->file->file_name() + " offset " + ToString(handle.offset()) + ", expected " + @@ -1829,7 +1781,7 @@ void BlockBasedTable::RetrieveMultipleBlocks( // We allocated a buffer for this block. Give ownership of it to // BlockContents so it can free the memory assert(req.result.data() == req.scratch); - assert(req.result.size() == block_size(handle)); + assert(req.result.size() == BlockSizeWithTrailer(handle)); assert(req_offset == 0); std::unique_ptr raw_block(req.scratch); raw_block_contents = BlockContents(std::move(raw_block), handle.size()); @@ -1852,9 +1804,9 @@ void BlockBasedTable::RetrieveMultipleBlocks( // begin address of each read request, we need to add the offset // in each read request. Checksum is stored in the block trailer, // beyond the payload size. - s = ROCKSDB_NAMESPACE::VerifyBlockChecksum( - footer.checksum(), data + req_offset, handle.size(), - rep_->file->file_name(), handle.offset()); + s = VerifyBlockChecksum(footer.checksum(), data + req_offset, + handle.size(), rep_->file->file_name(), + handle.offset()); TEST_SYNC_POINT_CALLBACK("RetrieveMultipleBlocks:VerifyChecksum", &s); } } else if (!use_shared_buffer) { @@ -1875,11 +1827,12 @@ void BlockBasedTable::RetrieveMultipleBlocks( // In all other cases, the raw block is either uncompressed into a heap // buffer or there is no cache at all. CompressionType compression_type = - raw_block_contents.get_compression_type(); + GetBlockCompressionType(raw_block_contents); if (use_shared_buffer && (compression_type == kNoCompression || (compression_type != kNoCompression && rep_->table_options.block_cache_compressed))) { - Slice raw = Slice(req.result.data() + req_offset, block_size(handle)); + Slice raw = + Slice(req.result.data() + req_offset, BlockSizeWithTrailer(handle)); raw_block_contents = BlockContents( CopyBufferToHeap(GetMemoryAllocator(rep_->table_options), raw), handle.size()); @@ -1913,7 +1866,7 @@ void BlockBasedTable::RetrieveMultipleBlocks( } CompressionType compression_type = - raw_block_contents.get_compression_type(); + GetBlockCompressionType(raw_block_contents); BlockContents contents; if (compression_type != kNoCompression) { UncompressionContext context(compression_type); @@ -2669,7 +2622,7 @@ void BlockBasedTable::MultiGet(const ReadOptions& read_options, } } else { block_handles.emplace_back(handle); - total_len += block_size(handle); + total_len += BlockSizeWithTrailer(handle); } } @@ -2690,7 +2643,7 @@ void BlockBasedTable::MultiGet(const ReadOptions& read_options, // or a false positive. We need to read the data block from // the SST file results[i].Reset(); - total_len += block_size(block_handles[i]); + total_len += BlockSizeWithTrailer(block_handles[i]); } else { block_handles[i] = BlockHandle::NullBlockHandle(); } @@ -3088,20 +3041,22 @@ Status BlockBasedTable::VerifyChecksumInMetaBlocks( s = handle.DecodeFrom(&input); BlockContents contents; const Slice meta_block_name = index_iter->key(); - BlockFetcher block_fetcher( - rep_->file.get(), nullptr /* prefetch buffer */, rep_->footer, - ReadOptions(), handle, &contents, rep_->ioptions, - false /* decompress */, false /*maybe_compressed*/, - GetBlockTypeForMetaBlockByName(meta_block_name), - UncompressionDict::GetEmptyDict(), rep_->persistent_cache_options); - s = block_fetcher.ReadBlockContents(); - if (s.IsCorruption() && meta_block_name == kPropertiesBlock) { - TableProperties* table_properties; - ReadOptions ro; - s = TryReadPropertiesWithGlobalSeqno(ro, nullptr /* prefetch_buffer */, - index_iter->value(), - &table_properties); - delete table_properties; + if (meta_block_name == kPropertiesBlock) { + // Unfortunate special handling for properties block checksum w/ + // global seqno + std::unique_ptr table_properties; + s = ReadTablePropertiesHelper(ReadOptions(), handle, rep_->file.get(), + nullptr /* prefetch_buffer */, rep_->footer, + rep_->ioptions, &table_properties, + nullptr /* memory_allocator */); + } else { + s = BlockFetcher( + rep_->file.get(), nullptr /* prefetch buffer */, rep_->footer, + ReadOptions(), handle, &contents, rep_->ioptions, + false /* decompress */, false /*maybe_compressed*/, + GetBlockTypeForMetaBlockByName(meta_block_name), + UncompressionDict::GetEmptyDict(), rep_->persistent_cache_options) + .ReadBlockContents(); } if (!s.ok()) { break; diff --git a/table/block_based/block_based_table_reader.h b/table/block_based/block_based_table_reader.h index 31c7b946b..45c8a7e73 100644 --- a/table/block_based/block_based_table_reader.h +++ b/table/block_based/block_based_table_reader.h @@ -19,6 +19,7 @@ #include "table/block_based/cachable_entry.h" #include "table/block_based/filter_block.h" #include "table/block_based/uncompression_dict_reader.h" +#include "table/format.h" #include "table/table_properties_internal.h" #include "table/table_reader.h" #include "table/two_level_iterator.h" @@ -68,6 +69,9 @@ class BlockBasedTable : public TableReader { static const size_t kInitAutoReadaheadSize = 8 * 1024; static const int kMinNumFileReadsToStartAutoReadahead = 2; + // 1-byte compression type + 32-bit checksum + static constexpr size_t kBlockTrailerSize = 5; + // Attempt to open the table that is stored in bytes [0..file_size) // of "file", and read the metadata entries necessary to allow // retrieving data from the table. @@ -224,6 +228,25 @@ class BlockBasedTable : public TableReader { bool redundant, Statistics* const statistics); + // Get the size to read from storage for a BlockHandle. size_t because we + // are about to load into memory. + static inline size_t BlockSizeWithTrailer(const BlockHandle& handle) { + return static_cast(handle.size() + kBlockTrailerSize); + } + + // It's the caller's responsibility to make sure that this is + // for raw block contents, which contains the compression + // byte in the end. + static inline CompressionType GetBlockCompressionType(const char* block_data, + size_t block_size) { + return static_cast(block_data[block_size]); + } + static inline CompressionType GetBlockCompressionType( + const BlockContents& contents) { + assert(contents.is_raw_block); + return GetBlockCompressionType(contents.data.data(), contents.data.size()); + } + // Retrieve all key value pairs from data blocks in the table. // The key retrieved are internal keys. Status GetKVPairsFromDataBlocks(std::vector* kv_pair_blocks); @@ -431,10 +454,6 @@ class BlockBasedTable : public TableReader { FilePrefetchBuffer* prefetch_buffer, std::unique_ptr* metaindex_block, std::unique_ptr* iter); - Status TryReadPropertiesWithGlobalSeqno(const ReadOptions& ro, - FilePrefetchBuffer* prefetch_buffer, - const Slice& handle_value, - TableProperties** table_properties); Status ReadPropertiesBlock(const ReadOptions& ro, FilePrefetchBuffer* prefetch_buffer, InternalIterator* meta_iter, diff --git a/table/block_based/block_prefetcher.cc b/table/block_based/block_prefetcher.cc index 2ecfc751e..e488e4f5c 100644 --- a/table/block_based/block_prefetcher.cc +++ b/table/block_based/block_prefetcher.cc @@ -8,6 +8,8 @@ // found in the LICENSE file. See the AUTHORS file for names of contributors. #include "table/block_based/block_prefetcher.h" +#include "table/block_based/block_based_table_reader.h" + namespace ROCKSDB_NAMESPACE { void BlockPrefetcher::PrefetchIfNeeded(const BlockBasedTable::Rep* rep, const BlockHandle& handle, @@ -36,7 +38,7 @@ void BlockPrefetcher::PrefetchIfNeeded(const BlockBasedTable::Rep* rep, return; } - size_t len = static_cast(block_size(handle)); + size_t len = BlockBasedTable::BlockSizeWithTrailer(handle); size_t offset = handle.offset(); // If FS supports prefetching (readahead_limit_ will be non zero in that case) @@ -80,8 +82,9 @@ void BlockPrefetcher::PrefetchIfNeeded(const BlockBasedTable::Rep* rep, // If prefetch is not supported, fall back to use internal prefetch buffer. // Discarding other return status of Prefetch calls intentionally, as // we can fallback to reading from disk if Prefetch fails. - Status s = rep->file->Prefetch(handle.offset(), - block_size(handle) + readahead_size_); + Status s = rep->file->Prefetch( + handle.offset(), + BlockBasedTable::BlockSizeWithTrailer(handle) + readahead_size_); if (s.IsNotSupported()) { rep->CreateFilePrefetchBufferIfNotExists(initial_auto_readahead_size_, max_auto_readahead_size, diff --git a/table/block_based/block_test.cc b/table/block_based/block_test.cc index c9f12d54e..4b7ba09b1 100644 --- a/table/block_based/block_test.cc +++ b/table/block_based/block_test.cc @@ -4,7 +4,10 @@ // (found in the LICENSE.Apache file in the root directory). // +#include "table/block_based/block.h" + #include + #include #include #include @@ -20,7 +23,7 @@ #include "rocksdb/iterator.h" #include "rocksdb/slice_transform.h" #include "rocksdb/table.h" -#include "table/block_based/block.h" +#include "table/block_based/block_based_table_reader.h" #include "table/block_based/block_builder.h" #include "table/format.h" #include "test_util/testharness.h" @@ -520,7 +523,7 @@ void GenerateRandomIndexEntries(std::vector *separators, separators->emplace_back(*it++); uint64_t size = rnd.Uniform(1024 * 16); BlockHandle handle(offset, size); - offset += size + kBlockTrailerSize; + offset += size + BlockBasedTable::kBlockTrailerSize; block_handles->emplace_back(handle); } } diff --git a/table/block_based/filter_policy.cc b/table/block_based/filter_policy.cc index b0f345c6a..a2a0728b9 100644 --- a/table/block_based/filter_policy.cc +++ b/table/block_based/filter_policy.cc @@ -19,6 +19,7 @@ #include "logging/logging.h" #include "rocksdb/slice.h" #include "table/block_based/block_based_filter_block.h" +#include "table/block_based/block_based_table_reader.h" #include "table/block_based/filter_policy_internal.h" #include "table/block_based/full_filter_block.h" #include "third-party/folly/folly/ConstexprMath.h" @@ -158,7 +159,7 @@ class XXPH3FilterBitsBuilder : public BuiltinFilterBitsBuilder { // Filter blocks are loaded into block cache with their block trailer. // We need to make sure that's accounted for in choosing a // fragmentation-friendly size. - const size_t kExtraPadding = kBlockTrailerSize; + const size_t kExtraPadding = BlockBasedTable::kBlockTrailerSize; size_t requested = rv + kExtraPadding; // Allocate and get usable size diff --git a/table/block_based/flush_block_policy.cc b/table/block_based/flush_block_policy.cc index 149cae3dc..13332accb 100644 --- a/table/block_based/flush_block_policy.cc +++ b/table/block_based/flush_block_policy.cc @@ -11,6 +11,7 @@ #include "rocksdb/options.h" #include "rocksdb/slice.h" #include "rocksdb/utilities/customizable_util.h" +#include "table/block_based/block_based_table_reader.h" #include "table/block_based/block_builder.h" #include "table/block_based/flush_block_policy.h" #include "table/format.h" @@ -62,7 +63,7 @@ class FlushBlockBySizePolicy : public FlushBlockPolicy { data_block_builder_.EstimateSizeAfterKV(key, value); if (align_) { - estimated_size_after += kBlockTrailerSize; + estimated_size_after += BlockBasedTable::kBlockTrailerSize; return estimated_size_after > block_size_; } diff --git a/table/block_based/partitioned_filter_block.cc b/table/block_based/partitioned_filter_block.cc index 4808a1547..a2718b25d 100644 --- a/table/block_based/partitioned_filter_block.cc +++ b/table/block_based/partitioned_filter_block.cc @@ -475,7 +475,8 @@ Status PartitionedFilterBlockReader::CacheDependencies(const ReadOptions& ro, // Read the last block's offset biter.SeekToLast(); handle = biter.value().handle; - uint64_t last_off = handle.offset() + handle.size() + kBlockTrailerSize; + uint64_t last_off = + handle.offset() + handle.size() + BlockBasedTable::kBlockTrailerSize; uint64_t prefetch_len = last_off - prefetch_off; std::unique_ptr prefetch_buffer; rep->CreateFilePrefetchBuffer(0, 0, &prefetch_buffer, diff --git a/table/block_based/partitioned_index_reader.cc b/table/block_based/partitioned_index_reader.cc index f5934e691..71af6af9a 100644 --- a/table/block_based/partitioned_index_reader.cc +++ b/table/block_based/partitioned_index_reader.cc @@ -9,6 +9,7 @@ #include "table/block_based/partitioned_index_reader.h" #include "file/random_access_file_reader.h" +#include "table/block_based/block_based_table_reader.h" #include "table/block_based/partitioned_index_iterator.h" namespace ROCKSDB_NAMESPACE { @@ -146,7 +147,8 @@ Status PartitionIndexReader::CacheDependencies(const ReadOptions& ro, return biter.status(); } handle = biter.value().handle; - uint64_t last_off = handle.offset() + block_size(handle); + uint64_t last_off = + handle.offset() + BlockBasedTable::BlockSizeWithTrailer(handle); uint64_t prefetch_len = last_off - prefetch_off; std::unique_ptr prefetch_buffer; rep->CreateFilePrefetchBuffer(0, 0, &prefetch_buffer, diff --git a/table/block_based/reader_common.cc b/table/block_based/reader_common.cc index 14ed9c79c..cfaa6791c 100644 --- a/table/block_based/reader_common.cc +++ b/table/block_based/reader_common.cc @@ -22,6 +22,7 @@ void ForceReleaseCachedEntry(void* arg, void* h) { cache->Release(handle, true /* force_erase */); } +// WART: this is specific to block-based table Status VerifyBlockChecksum(ChecksumType type, const char* data, size_t block_size, const std::string& file_name, uint64_t offset) { diff --git a/table/block_fetcher.cc b/table/block_fetcher.cc index 90558168e..03ca5ce47 100644 --- a/table/block_fetcher.cc +++ b/table/block_fetcher.cc @@ -15,9 +15,11 @@ #include "logging/logging.h" #include "memory/memory_allocator.h" #include "monitoring/perf_context_imp.h" +#include "rocksdb/compression_type.h" #include "rocksdb/env.h" #include "table/block_based/block.h" #include "table/block_based/block_based_table_reader.h" +#include "table/block_based/block_type.h" #include "table/block_based/reader_common.h" #include "table/format.h" #include "table/persistent_cache_helper.h" @@ -26,12 +28,19 @@ namespace ROCKSDB_NAMESPACE { -inline void BlockFetcher::CheckBlockChecksum() { - // Check the crc of the type and the block contents - if (read_options_.verify_checksums) { - io_status_ = status_to_io_status(ROCKSDB_NAMESPACE::VerifyBlockChecksum( - footer_.checksum(), slice_.data(), block_size_, file_->file_name(), - handle_.offset())); +inline void BlockFetcher::ProcessTrailerIfPresent() { + if (footer_.GetBlockTrailerSize() > 0) { + assert(footer_.GetBlockTrailerSize() == BlockBasedTable::kBlockTrailerSize); + if (read_options_.verify_checksums) { + io_status_ = status_to_io_status( + VerifyBlockChecksum(footer_.checksum(), slice_.data(), block_size_, + file_->file_name(), handle_.offset())); + } + compression_type_ = + BlockBasedTable::GetBlockCompressionType(slice_.data(), block_size_); + } else { + // E.g. plain table or cuckoo table + compression_type_ = kNoCompression; } } @@ -63,7 +72,7 @@ inline bool BlockFetcher::TryGetFromPrefetchBuffer() { if (io_s.ok() && prefetch_buffer_->TryReadFromCache( opts, handle_.offset(), block_size_with_trailer_, &slice_, &io_s, for_compaction_)) { - CheckBlockChecksum(); + ProcessTrailerIfPresent(); if (!io_status_.ok()) { return true; } @@ -88,6 +97,7 @@ inline bool BlockFetcher::TryGetCompressedBlockFromPersistentCache() { heap_buf_ = CacheAllocationPtr(raw_data.release()); used_buf_ = heap_buf_.get(); slice_ = Slice(heap_buf_.get(), block_size_); + ProcessTrailerIfPresent(); return true; } else if (!io_status_.IsNotFound() && ioptions_.logger) { assert(!io_status_.ok()); @@ -290,7 +300,7 @@ IOStatus BlockFetcher::ReadBlockContents() { " bytes, got " + ToString(slice_.size())); } - CheckBlockChecksum(); + ProcessTrailerIfPresent(); if (io_status_.ok()) { InsertCompressedBlockToPersistentCacheIfNeeded(); } else { @@ -298,8 +308,6 @@ IOStatus BlockFetcher::ReadBlockContents() { } } - compression_type_ = get_block_compression_type(slice_.data(), block_size_); - if (do_uncompress_ && compression_type_ != kNoCompression) { PERF_TIMER_GUARD(block_decompress_time); // compressed page, uncompress, update cache diff --git a/table/block_fetcher.h b/table/block_fetcher.h index 74807ef51..67ca35840 100644 --- a/table/block_fetcher.h +++ b/table/block_fetcher.h @@ -37,12 +37,15 @@ namespace ROCKSDB_NAMESPACE { class BlockFetcher { public: BlockFetcher(RandomAccessFileReader* file, - FilePrefetchBuffer* prefetch_buffer, const Footer& footer, - const ReadOptions& read_options, const BlockHandle& handle, - BlockContents* contents, const ImmutableOptions& ioptions, + FilePrefetchBuffer* prefetch_buffer, + const Footer& footer /* ref retained */, + const ReadOptions& read_options, + const BlockHandle& handle /* ref retained */, + BlockContents* contents, + const ImmutableOptions& ioptions /* ref retained */, bool do_uncompress, bool maybe_compressed, BlockType block_type, - const UncompressionDict& uncompression_dict, - const PersistentCacheOptions& cache_options, + const UncompressionDict& uncompression_dict /* ref retained */, + const PersistentCacheOptions& cache_options /* ref retained */, MemoryAllocator* memory_allocator = nullptr, MemoryAllocator* memory_allocator_compressed = nullptr, bool for_compaction = false) @@ -57,7 +60,7 @@ class BlockFetcher { maybe_compressed_(maybe_compressed), block_type_(block_type), block_size_(static_cast(handle_.size())), - block_size_with_trailer_(block_size(handle_)), + block_size_with_trailer_(block_size_ + footer.GetBlockTrailerSize()), uncompression_dict_(uncompression_dict), cache_options_(cache_options), memory_allocator_(memory_allocator), @@ -67,7 +70,12 @@ class BlockFetcher { } IOStatus ReadBlockContents(); - CompressionType get_compression_type() const { return compression_type_; } + inline CompressionType get_compression_type() const { + return compression_type_; + } + inline size_t GetBlockSizeWithTrailer() const { + return block_size_with_trailer_; + } #ifndef NDEBUG int TEST_GetNumStackBufMemcpy() const { return num_stack_buf_memcpy_; } @@ -126,6 +134,6 @@ class BlockFetcher { void GetBlockContents(); void InsertCompressedBlockToPersistentCacheIfNeeded(); void InsertUncompressedBlockToPersistentCacheIfNeeded(); - void CheckBlockChecksum(); + void ProcessTrailerIfPresent(); }; } // namespace ROCKSDB_NAMESPACE diff --git a/table/cuckoo/cuckoo_table_builder_test.cc b/table/cuckoo/cuckoo_table_builder_test.cc index f3194ec93..5d6ebfbdd 100644 --- a/table/cuckoo/cuckoo_table_builder_test.cc +++ b/table/cuckoo/cuckoo_table_builder_test.cc @@ -68,10 +68,9 @@ class CuckooBuilderTest : public testing::Test { ImmutableOptions ioptions(options); // Assert Table Properties. - TableProperties* props = nullptr; + std::unique_ptr props; ASSERT_OK(ReadTableProperties(file_reader.get(), read_file_size, - kCuckooTableMagicNumber, ioptions, - &props, true /* compression_type_missing */)); + kCuckooTableMagicNumber, ioptions, &props)); // Check unused bucket. std::string unused_key = props->user_collected_properties[ CuckooTablePropertyNames::kEmptyKey]; @@ -108,7 +107,6 @@ class CuckooBuilderTest : public testing::Test { ASSERT_EQ(props->raw_key_size, keys.size()*props->fixed_key_len); ASSERT_EQ(props->column_family_id, 0); ASSERT_EQ(props->column_family_name, kDefaultColumnFamilyName); - delete props; // Check contents of the bucket. std::vector keys_found(keys.size(), false); diff --git a/table/cuckoo/cuckoo_table_reader.cc b/table/cuckoo/cuckoo_table_reader.cc index 4045d4528..44725767b 100644 --- a/table/cuckoo/cuckoo_table_reader.cc +++ b/table/cuckoo/cuckoo_table_reader.cc @@ -58,14 +58,16 @@ CuckooTableReader::CuckooTableReader( status_ = Status::InvalidArgument("File is not mmaped"); return; } - TableProperties* props = nullptr; - status_ = ReadTableProperties(file_.get(), file_size, kCuckooTableMagicNumber, - ioptions, &props, true /* compression_type_missing */); - if (!status_.ok()) { - return; + { + std::unique_ptr props; + status_ = ReadTableProperties(file_.get(), file_size, + kCuckooTableMagicNumber, ioptions, &props); + if (!status_.ok()) { + return; + } + table_props_ = std::move(props); } - table_props_.reset(props); - auto& user_props = props->user_collected_properties; + auto& user_props = table_props_->user_collected_properties; auto hash_funs = user_props.find(CuckooTablePropertyNames::kNumHashFunc); if (hash_funs == user_props.end()) { status_ = Status::Corruption("Number of hash functions not found"); @@ -79,7 +81,7 @@ CuckooTableReader::CuckooTableReader( } unused_key_ = unused_key->second; - key_length_ = static_cast(props->fixed_key_len); + key_length_ = static_cast(table_props_->fixed_key_len); auto user_key_len = user_props.find(CuckooTablePropertyNames::kUserKeyLength); if (user_key_len == user_props.end()) { status_ = Status::Corruption("User key length not found"); diff --git a/table/format.cc b/table/format.cc index d8add0f80..c68fbb132 100644 --- a/table/format.cc +++ b/table/format.cc @@ -97,8 +97,10 @@ const BlockHandle BlockHandle::kNullBlockHandle(0, 0); void IndexValue::EncodeTo(std::string* dst, bool have_first_key, const BlockHandle* previous_handle) const { if (previous_handle) { + // WART: this is specific to Block-based table assert(handle.offset() == previous_handle->offset() + - previous_handle->size() + kBlockTrailerSize); + previous_handle->size() + + BlockBasedTable::kBlockTrailerSize); PutVarsignedint64(dst, handle.size() - previous_handle->size()); } else { handle.EncodeTo(dst); @@ -117,9 +119,10 @@ Status IndexValue::DecodeFrom(Slice* input, bool have_first_key, if (!GetVarsignedint64(input, &delta)) { return Status::Corruption("bad delta-encoded index value"); } - handle = BlockHandle( - previous_handle->offset() + previous_handle->size() + kBlockTrailerSize, - previous_handle->size() + delta); + // WART: this is specific to Block-based table + handle = BlockHandle(previous_handle->offset() + previous_handle->size() + + BlockBasedTable::kBlockTrailerSize, + previous_handle->size() + delta); } else { Status s = handle.DecodeFrom(input); if (!s.ok()) { @@ -163,6 +166,18 @@ inline uint64_t UpconvertLegacyFooterFormat(uint64_t magic_number) { } } // namespace +void Footer::set_table_magic_number(uint64_t magic_number) { + assert(!HasInitializedTableMagicNumber()); + table_magic_number_ = magic_number; + if (magic_number == kBlockBasedTableMagicNumber || + magic_number == kLegacyBlockBasedTableMagicNumber) { + block_trailer_size_ = + static_cast(BlockBasedTable::kBlockTrailerSize); + } else { + block_trailer_size_ = 0; + } +} + // legacy footer format: // metaindex handle (varint64 offset, varint64 size) // index handle (varint64 offset, varint64 size) diff --git a/table/format.h b/table/format.h index 870a40b58..c64c1ebea 100644 --- a/table/format.h +++ b/table/format.h @@ -185,12 +185,13 @@ class Footer { // convert this object to a human readable form std::string ToString() const; + // Block trailer size used by file with this footer (e.g. 5 for block-based + // table and 0 for plain table) + inline size_t GetBlockTrailerSize() const { return block_trailer_size_; } + private: // REQUIRES: magic number wasn't initialized. - void set_table_magic_number(uint64_t magic_number) { - assert(!HasInitializedTableMagicNumber()); - table_magic_number_ = magic_number; - } + void set_table_magic_number(uint64_t magic_number); // return true if @table_magic_number_ is set to a value different // from @kInvalidTableMagicNumber. @@ -200,6 +201,7 @@ class Footer { uint32_t version_; ChecksumType checksum_; + uint8_t block_trailer_size_ = 0; // set based on magic number BlockHandle metaindex_handle_; BlockHandle index_handle_; uint64_t table_magic_number_ = 0; @@ -213,19 +215,6 @@ Status ReadFooterFromFile(const IOOptions& opts, RandomAccessFileReader* file, uint64_t file_size, Footer* footer, uint64_t enforce_table_magic_number = 0); -// 1-byte compression type + 32-bit checksum -static const size_t kBlockTrailerSize = 5; - -// Make block size calculation for IO less error prone -inline uint64_t block_size(const BlockHandle& handle) { - return handle.size() + kBlockTrailerSize; -} - -inline CompressionType get_block_compression_type(const char* block_data, - size_t block_size) { - return static_cast(block_data[block_size]); -} - // Computes a checksum using the given ChecksumType. Sometimes we need to // include one more input byte logically at the end but not part of the main // data buffer. If data_size >= 1, then @@ -242,12 +231,13 @@ uint32_t ComputeBuiltinChecksumWithLastByte(ChecksumType type, const char* data, // BlockContents objects representing data read from mmapped files only point // into the mmapped region. struct BlockContents { - Slice data; // Actual contents of data + // Points to block payload (without trailer) + Slice data; CacheAllocationPtr allocation; #ifndef NDEBUG - // Whether the block is a raw block, which contains compression type - // byte. It is only used for assertion. + // Whether there is a known trailer after what is pointed to by `data`. + // See BlockBasedTable::GetCompressionType. bool is_raw_block = false; #endif // NDEBUG @@ -269,14 +259,6 @@ struct BlockContents { // Returns whether the object has ownership of the underlying data bytes. bool own_bytes() const { return allocation.get() != nullptr; } - // It's the caller's responsibility to make sure that this is - // for raw block contents, which contains the compression - // byte in the end. - CompressionType get_compression_type() const { - assert(is_raw_block); - return get_block_compression_type(data.data(), data.size()); - } - // The additional memory space taken by the block data. size_t usable_size() const { if (allocation.get() != nullptr) { diff --git a/table/meta_blocks.cc b/table/meta_blocks.cc index ee92dcebb..20d6e354d 100644 --- a/table/meta_blocks.cc +++ b/table/meta_blocks.cc @@ -11,18 +11,27 @@ #include "db/table_properties_collector.h" #include "file/random_access_file_reader.h" #include "logging/logging.h" +#include "rocksdb/options.h" #include "rocksdb/table.h" #include "rocksdb/table_properties.h" #include "table/block_based/block.h" +#include "table/block_based/reader_common.h" #include "table/format.h" #include "table/internal_iterator.h" #include "table/persistent_cache_helper.h" +#include "table/sst_file_writer_collectors.h" #include "table/table_properties_internal.h" #include "test_util/sync_point.h" #include "util/coding.h" namespace ROCKSDB_NAMESPACE { +const std::string kPropertiesBlock = "rocksdb.properties"; +// Old property block name for backward compatibility +const std::string kPropertiesBlockOldName = "rocksdb.stats"; +const std::string kCompressionDictBlock = "rocksdb.compression_dict"; +const std::string kRangeDelBlock = "rocksdb.range_del"; + MetaIndexBuilder::MetaIndexBuilder() : meta_index_block_(new BlockBuilder(1 /* restart interval */)) {} @@ -211,40 +220,33 @@ bool NotifyCollectTableCollectorsOnFinish( return all_succeeded; } -Status ReadProperties(const ReadOptions& read_options, - const Slice& handle_value, RandomAccessFileReader* file, - FilePrefetchBuffer* prefetch_buffer, const Footer& footer, - const ImmutableOptions& ioptions, - TableProperties** table_properties, bool verify_checksum, - BlockHandle* ret_block_handle, - CacheAllocationPtr* verification_buf, - bool /*compression_type_missing*/, - MemoryAllocator* memory_allocator) { +// FIXME: should be a parameter for reading table properties to use persistent +// cache? +Status ReadTablePropertiesHelper( + const ReadOptions& ro, const BlockHandle& handle, + RandomAccessFileReader* file, FilePrefetchBuffer* prefetch_buffer, + const Footer& footer, const ImmutableOptions& ioptions, + std::unique_ptr* table_properties, + MemoryAllocator* memory_allocator) { assert(table_properties); - Slice v = handle_value; - BlockHandle handle; - if (!handle.DecodeFrom(&v).ok()) { - return Status::InvalidArgument("Failed to decode properties block handle"); - } - + // If this is an external SST file ingested with write_global_seqno set to + // true, then we expect the checksum mismatch because checksum was written + // by SstFileWriter, but its global seqno in the properties block may have + // been changed during ingestion. For this reason, we initially read + // and process without checksum verification, then later try checksum + // verification so that if it fails, we can copy to a temporary buffer with + // global seqno set to its original value, i.e. 0, and attempt checksum + // verification again. + ReadOptions modified_ro = ro; + modified_ro.verify_checksums = false; BlockContents block_contents; - Status s; - // FIXME: should be a parameter for reading table properties to use persistent - // cache - PersistentCacheOptions cache_options; - ReadOptions ro = read_options; - ro.verify_checksums = verify_checksum; - - BlockFetcher block_fetcher(file, prefetch_buffer, footer, ro, handle, + BlockFetcher block_fetcher(file, prefetch_buffer, footer, modified_ro, handle, &block_contents, ioptions, false /* decompress */, false /*maybe_compressed*/, BlockType::kProperties, - UncompressionDict::GetEmptyDict(), cache_options, - memory_allocator); - s = block_fetcher.ReadBlockContents(); - // property block is never compressed. Need to add uncompress logic if we are - // to compress it.. - + UncompressionDict::GetEmptyDict(), + PersistentCacheOptions::kEmpty, memory_allocator); + Status s = block_fetcher.ReadBlockContents(); if (!s.ok()) { return s; } @@ -254,7 +256,7 @@ Status ReadProperties(const ReadOptions& read_options, properties_block.NewDataIterator(BytewiseComparator(), kDisableGlobalSequenceNumber, &iter); - auto new_table_properties = new TableProperties(); + std::unique_ptr new_table_properties{new TableProperties}; // All pre-defined properties of type uint64_t std::unordered_map predefined_uint64_properties = { {TablePropertiesNames::kOriginalFileNumber, @@ -370,21 +372,30 @@ Status ReadProperties(const ReadOptions& read_options, {key, raw_val.ToString()}); } } - if (s.ok()) { - *table_properties = new_table_properties; - if (ret_block_handle != nullptr) { - *ret_block_handle = handle; - } - if (verification_buf != nullptr) { - size_t len = static_cast(handle.size() + kBlockTrailerSize); - *verification_buf = - ROCKSDB_NAMESPACE::AllocateBlock(len, memory_allocator); - if (verification_buf->get() != nullptr) { - memcpy(verification_buf->get(), block_contents.data.data(), len); + + // Modified version of BlockFetcher checksum verification + // (See write_global_seqno comment above) + if (s.ok() && footer.GetBlockTrailerSize() > 0) { + s = VerifyBlockChecksum(footer.checksum(), properties_block.data(), + properties_block.size(), file->file_name(), + handle.offset()); + if (s.IsCorruption()) { + const auto seqno_pos_iter = new_table_properties->properties_offsets.find( + ExternalSstFilePropertyNames::kGlobalSeqno); + if (seqno_pos_iter != new_table_properties->properties_offsets.end()) { + std::string tmp_buf(properties_block.data(), + block_fetcher.GetBlockSizeWithTrailer()); + uint64_t global_seqno_offset = seqno_pos_iter->second - handle.offset(); + EncodeFixed64(&tmp_buf[static_cast(global_seqno_offset)], 0); + s = VerifyBlockChecksum(footer.checksum(), tmp_buf.data(), + properties_block.size(), file->file_name(), + handle.offset()); } } - } else { - delete new_table_properties; + } + + if (s.ok()) { + *table_properties = std::move(new_table_properties); } return s; @@ -393,101 +404,91 @@ Status ReadProperties(const ReadOptions& read_options, Status ReadTableProperties(RandomAccessFileReader* file, uint64_t file_size, uint64_t table_magic_number, const ImmutableOptions& ioptions, - TableProperties** properties, - bool compression_type_missing, + std::unique_ptr* properties, MemoryAllocator* memory_allocator, FilePrefetchBuffer* prefetch_buffer) { - // -- Read metaindex block + BlockHandle block_handle; Footer footer; - IOOptions opts; - auto s = ReadFooterFromFile(opts, file, prefetch_buffer, file_size, &footer, - table_magic_number); + Status s = FindMetaBlockInFile(file, file_size, table_magic_number, ioptions, + kPropertiesBlock, &block_handle, + memory_allocator, prefetch_buffer, &footer); if (!s.ok()) { return s; } - auto metaindex_handle = footer.metaindex_handle(); - BlockContents metaindex_contents; - ReadOptions read_options; - read_options.verify_checksums = false; - PersistentCacheOptions cache_options; - - BlockFetcher block_fetcher( - file, prefetch_buffer, footer, read_options, metaindex_handle, - &metaindex_contents, ioptions, false /* decompress */, - false /*maybe_compressed*/, BlockType::kMetaIndex, - UncompressionDict::GetEmptyDict(), cache_options, memory_allocator); - s = block_fetcher.ReadBlockContents(); - if (!s.ok()) { - return s; - } - // property blocks are never compressed. Need to add uncompress logic if we - // are to compress it. - Block metaindex_block(std::move(metaindex_contents)); - std::unique_ptr meta_iter(metaindex_block.NewDataIterator( - BytewiseComparator(), kDisableGlobalSequenceNumber)); - - // -- Read property block - bool found_properties_block = true; - s = SeekToPropertiesBlock(meta_iter.get(), &found_properties_block); - if (!s.ok()) { - return s; - } - - TableProperties table_properties; - if (found_properties_block == true) { - s = ReadProperties( - read_options, meta_iter->value(), file, prefetch_buffer, footer, - ioptions, properties, false /* verify_checksum */, - nullptr /* ret_block_hanel */, nullptr /* ret_block_contents */, - compression_type_missing, memory_allocator); + if (!block_handle.IsNull()) { + s = ReadTablePropertiesHelper(ReadOptions(), block_handle, file, + prefetch_buffer, footer, ioptions, properties, + memory_allocator); } else { s = Status::NotFound(); } - return s; } +Status FindOptionalMetaBlock(InternalIterator* meta_index_iter, + const std::string& meta_block_name, + BlockHandle* block_handle) { + assert(block_handle != nullptr); + meta_index_iter->Seek(meta_block_name); + if (meta_index_iter->status().ok()) { + if (meta_index_iter->Valid() && meta_index_iter->key() == meta_block_name) { + Slice v = meta_index_iter->value(); + return block_handle->DecodeFrom(&v); + } else if (meta_block_name == kPropertiesBlock) { + // Have to try old name for compatibility + meta_index_iter->Seek(kPropertiesBlockOldName); + if (meta_index_iter->status().ok() && meta_index_iter->Valid() && + meta_index_iter->key() == kPropertiesBlockOldName) { + Slice v = meta_index_iter->value(); + return block_handle->DecodeFrom(&v); + } + } + } + // else + *block_handle = BlockHandle::NullBlockHandle(); + return meta_index_iter->status(); +} + Status FindMetaBlock(InternalIterator* meta_index_iter, const std::string& meta_block_name, BlockHandle* block_handle) { - meta_index_iter->Seek(meta_block_name); - if (meta_index_iter->status().ok() && meta_index_iter->Valid() && - meta_index_iter->key() == meta_block_name) { - Slice v = meta_index_iter->value(); - return block_handle->DecodeFrom(&v); - } else { + Status s = + FindOptionalMetaBlock(meta_index_iter, meta_block_name, block_handle); + if (s.ok() && block_handle->IsNull()) { return Status::Corruption("Cannot find the meta block", meta_block_name); + } else { + return s; } } -Status FindMetaBlock(RandomAccessFileReader* file, uint64_t file_size, - uint64_t table_magic_number, - const ImmutableOptions& ioptions, - const std::string& meta_block_name, - BlockHandle* block_handle, - bool /*compression_type_missing*/, - MemoryAllocator* memory_allocator) { +Status FindMetaBlockInFile(RandomAccessFileReader* file, uint64_t file_size, + uint64_t table_magic_number, + const ImmutableOptions& ioptions, + const std::string& meta_block_name, + BlockHandle* block_handle, + MemoryAllocator* memory_allocator, + FilePrefetchBuffer* prefetch_buffer, + Footer* footer_out) { Footer footer; IOOptions opts; - auto s = ReadFooterFromFile(opts, file, nullptr /* prefetch_buffer */, - file_size, &footer, table_magic_number); + auto s = ReadFooterFromFile(opts, file, prefetch_buffer, file_size, &footer, + table_magic_number); if (!s.ok()) { return s; } + if (footer_out) { + *footer_out = footer; + } auto metaindex_handle = footer.metaindex_handle(); BlockContents metaindex_contents; - ReadOptions read_options; - read_options.verify_checksums = false; - PersistentCacheOptions cache_options; - BlockFetcher block_fetcher( - file, nullptr /* prefetch_buffer */, footer, read_options, - metaindex_handle, &metaindex_contents, ioptions, - false /* do decompression */, false /*maybe_compressed*/, - BlockType::kMetaIndex, UncompressionDict::GetEmptyDict(), cache_options, - memory_allocator); - s = block_fetcher.ReadBlockContents(); + s = BlockFetcher(file, prefetch_buffer, footer, ReadOptions(), + metaindex_handle, &metaindex_contents, ioptions, + false /* do decompression */, false /*maybe_compressed*/, + BlockType::kMetaIndex, UncompressionDict::GetEmptyDict(), + PersistentCacheOptions::kEmpty, memory_allocator) + .ReadBlockContents(); if (!s.ok()) { return s; } @@ -507,56 +508,27 @@ Status ReadMetaBlock(RandomAccessFileReader* file, uint64_t table_magic_number, const ImmutableOptions& ioptions, const std::string& meta_block_name, BlockType block_type, - BlockContents* contents, bool /*compression_type_missing*/, + BlockContents* contents, MemoryAllocator* memory_allocator) { - Status status; - Footer footer; - IOOptions opts; - status = ReadFooterFromFile(opts, file, prefetch_buffer, file_size, &footer, - table_magic_number); - if (!status.ok()) { - return status; - } - - // Reading metaindex block - auto metaindex_handle = footer.metaindex_handle(); - BlockContents metaindex_contents; - ReadOptions read_options; - read_options.verify_checksums = false; - PersistentCacheOptions cache_options; - - BlockFetcher block_fetcher( - file, prefetch_buffer, footer, read_options, metaindex_handle, - &metaindex_contents, ioptions, false /* decompress */, - false /*maybe_compressed*/, BlockType::kMetaIndex, - UncompressionDict::GetEmptyDict(), cache_options, memory_allocator); - status = block_fetcher.ReadBlockContents(); - if (!status.ok()) { - return status; - } - // meta block is never compressed. Need to add uncompress logic if we are to - // compress it. - - // Finding metablock - Block metaindex_block(std::move(metaindex_contents)); - - std::unique_ptr meta_iter; - meta_iter.reset(metaindex_block.NewDataIterator( - BytewiseComparator(), kDisableGlobalSequenceNumber)); + // TableProperties requires special handling because of checksum issues. + // Call ReadTableProperties instead for that case. + assert(block_type != BlockType::kProperties); BlockHandle block_handle; - status = FindMetaBlock(meta_iter.get(), meta_block_name, &block_handle); - + Footer footer; + Status status = FindMetaBlockInFile( + file, file_size, table_magic_number, ioptions, meta_block_name, + &block_handle, memory_allocator, prefetch_buffer, &footer); if (!status.ok()) { return status; } - // Reading metablock - BlockFetcher block_fetcher2( - file, prefetch_buffer, footer, read_options, block_handle, contents, - ioptions, false /* decompress */, false /*maybe_compressed*/, block_type, - UncompressionDict::GetEmptyDict(), cache_options, memory_allocator); - return block_fetcher2.ReadBlockContents(); + return BlockFetcher(file, prefetch_buffer, footer, ReadOptions(), + block_handle, contents, ioptions, false /* decompress */, + false /*maybe_compressed*/, block_type, + UncompressionDict::GetEmptyDict(), + PersistentCacheOptions::kEmpty, memory_allocator) + .ReadBlockContents(); } } // namespace ROCKSDB_NAMESPACE diff --git a/table/meta_blocks.h b/table/meta_blocks.h index 01b56d57c..3d24d5c31 100644 --- a/table/meta_blocks.h +++ b/table/meta_blocks.h @@ -30,6 +30,12 @@ class Logger; class RandomAccessFile; struct TableProperties; +// Meta block names for metaindex +extern const std::string kPropertiesBlock; +extern const std::string kPropertiesBlockOldName; +extern const std::string kCompressionDictBlock; +extern const std::string kRangeDelBlock; + class MetaIndexBuilder { public: MetaIndexBuilder(const MetaIndexBuilder&) = delete; @@ -95,49 +101,49 @@ bool NotifyCollectTableCollectorsOnFinish( const std::vector>& collectors, Logger* info_log, PropertyBlockBuilder* builder); -// Read the properties from the table. +// Read table properties from a file using known BlockHandle. // @returns a status to indicate if the operation succeeded. On success, // *table_properties will point to a heap-allocated TableProperties // object, otherwise value of `table_properties` will not be modified. -Status ReadProperties(const ReadOptions& ro, const Slice& handle_value, - RandomAccessFileReader* file, - FilePrefetchBuffer* prefetch_buffer, const Footer& footer, - const ImmutableOptions& ioptions, - TableProperties** table_properties, bool verify_checksum, - BlockHandle* block_handle, - CacheAllocationPtr* verification_buf, - bool compression_type_missing = false, - MemoryAllocator* memory_allocator = nullptr); - -// Directly read the properties from the properties block of a plain table. +Status ReadTablePropertiesHelper( + const ReadOptions& ro, const BlockHandle& handle, + RandomAccessFileReader* file, FilePrefetchBuffer* prefetch_buffer, + const Footer& footer, const ImmutableOptions& ioptions, + std::unique_ptr* table_properties, + MemoryAllocator* memory_allocator = nullptr); + +// Read table properties from the properties block of a plain table. // @returns a status to indicate if the operation succeeded. On success, // *table_properties will point to a heap-allocated TableProperties // object, otherwise value of `table_properties` will not be modified. -// certain tables do not have compression_type byte setup properly for -// uncompressed blocks, caller can request to reset compression type by -// passing compression_type_missing = true, the same applies to -// `ReadProperties`, `FindMetaBlock`, and `ReadMetaBlock` Status ReadTableProperties(RandomAccessFileReader* file, uint64_t file_size, uint64_t table_magic_number, const ImmutableOptions& ioptions, - TableProperties** properties, - bool compression_type_missing = false, + std::unique_ptr* properties, MemoryAllocator* memory_allocator = nullptr, FilePrefetchBuffer* prefetch_buffer = nullptr); -// Find the meta block from the meta index block. +// Find the meta block from the meta index block. Returns OK and +// block_handle->IsNull() if not found. +Status FindOptionalMetaBlock(InternalIterator* meta_index_iter, + const std::string& meta_block_name, + BlockHandle* block_handle); + +// Find the meta block from the meta index block. Returns Corruption if not +// found. Status FindMetaBlock(InternalIterator* meta_index_iter, const std::string& meta_block_name, BlockHandle* block_handle); // Find the meta block -Status FindMetaBlock(RandomAccessFileReader* file, uint64_t file_size, - uint64_t table_magic_number, - const ImmutableOptions& ioptions, - const std::string& meta_block_name, - BlockHandle* block_handle, - bool compression_type_missing = false, - MemoryAllocator* memory_allocator = nullptr); +Status FindMetaBlockInFile(RandomAccessFileReader* file, uint64_t file_size, + uint64_t table_magic_number, + const ImmutableOptions& ioptions, + const std::string& meta_block_name, + BlockHandle* block_handle, + MemoryAllocator* memory_allocator = nullptr, + FilePrefetchBuffer* prefetch_buffer = nullptr, + Footer* footer_out = nullptr); // Read the specified meta block with name meta_block_name // from `file` and initialize `contents` with contents of this block. @@ -148,7 +154,6 @@ Status ReadMetaBlock(RandomAccessFileReader* file, const ImmutableOptions& ioptions, const std::string& meta_block_name, BlockType block_type, BlockContents* contents, - bool compression_type_missing = false, MemoryAllocator* memory_allocator = nullptr); } // namespace ROCKSDB_NAMESPACE diff --git a/table/persistent_cache_helper.cc b/table/persistent_cache_helper.cc index 21d3bf734..2dc0a1e9f 100644 --- a/table/persistent_cache_helper.cc +++ b/table/persistent_cache_helper.cc @@ -9,6 +9,8 @@ namespace ROCKSDB_NAMESPACE { +const PersistentCacheOptions PersistentCacheOptions::kEmpty; + void PersistentCacheHelper::InsertRawPage( const PersistentCacheOptions& cache_options, const BlockHandle& handle, const char* data, const size_t size) { @@ -70,7 +72,8 @@ Status PersistentCacheHelper::LookupRawPage( } // cache hit - assert(raw_data_size == handle.size() + kBlockTrailerSize); + // Block-based table is assumed + assert(raw_data_size == handle.size() + BlockBasedTable::kBlockTrailerSize); assert(size == raw_data_size); RecordTick(cache_options.statistics, PERSISTENT_CACHE_HIT); return Status::OK(); diff --git a/table/persistent_cache_options.h b/table/persistent_cache_options.h index 7c65a041a..ffbae2ac3 100644 --- a/table/persistent_cache_options.h +++ b/table/persistent_cache_options.h @@ -29,6 +29,8 @@ struct PersistentCacheOptions { std::shared_ptr persistent_cache; std::string key_prefix; Statistics* statistics = nullptr; + + static const PersistentCacheOptions kEmpty; }; } // namespace ROCKSDB_NAMESPACE diff --git a/table/plain/plain_table_reader.cc b/table/plain/plain_table_reader.cc index 9a9e21850..2d6902a2a 100644 --- a/table/plain/plain_table_reader.cc +++ b/table/plain/plain_table_reader.cc @@ -129,11 +129,9 @@ Status PlainTableReader::Open( return Status::NotSupported("File is too large for PlainTableReader!"); } - TableProperties* props_ptr = nullptr; + std::unique_ptr props; auto s = ReadTableProperties(file.get(), file_size, kPlainTableMagicNumber, - ioptions, &props_ptr, - true /* compression_type_missing */); - std::shared_ptr props(props_ptr); + ioptions, &props); if (!s.ok()) { return s; } @@ -186,7 +184,7 @@ Status PlainTableReader::Open( new_reader->full_scan_mode_ = true; } // PopulateIndex can add to the props, so don't store them until now - new_reader->table_properties_ = props; + new_reader->table_properties_ = std::move(props); if (immortal_table && new_reader->file_info_.is_mmap_mode) { new_reader->dummy_cleanable_.reset(new Cleanable()); @@ -308,8 +306,7 @@ Status PlainTableReader::PopulateIndex(TableProperties* props, Status s = ReadMetaBlock(file_info_.file.get(), nullptr /* prefetch_buffer */, file_size_, kPlainTableMagicNumber, ioptions_, PlainTableIndexBuilder::kPlainTableIndexBlock, - BlockType::kIndex, &index_block_contents, - true /* compression_type_missing */); + BlockType::kIndex, &index_block_contents); bool index_in_file = s.ok(); @@ -320,8 +317,7 @@ Status PlainTableReader::PopulateIndex(TableProperties* props, s = ReadMetaBlock(file_info_.file.get(), nullptr /* prefetch_buffer */, file_size_, kPlainTableMagicNumber, ioptions_, BloomBlockBuilder::kBloomBlock, BlockType::kFilter, - &bloom_block_contents, - true /* compression_type_missing */); + &bloom_block_contents); bloom_in_file = s.ok() && bloom_block_contents.data.size() > 0; } diff --git a/table/sst_file_dumper.cc b/table/sst_file_dumper.cc index d2ade5bfe..ff9c42f6c 100644 --- a/table/sst_file_dumper.cc +++ b/table/sst_file_dumper.cc @@ -332,18 +332,16 @@ Status SstFileDumper::ShowCompressionSize( return Status::OK(); } +// Reads TableProperties prior to opening table reader in order to set up +// options. Status SstFileDumper::ReadTableProperties(uint64_t table_magic_number, RandomAccessFileReader* file, uint64_t file_size, FilePrefetchBuffer* prefetch_buffer) { - TableProperties* table_properties = nullptr; Status s = ROCKSDB_NAMESPACE::ReadTableProperties( - file, file_size, table_magic_number, ioptions_, &table_properties, - /* compression_type_missing= */ false, + file, file_size, table_magic_number, ioptions_, &table_properties_, /* memory_allocator= */ nullptr, prefetch_buffer); - if (s.ok()) { - table_properties_.reset(table_properties); - } else { + if (!s.ok()) { if (!silent_) { fprintf(stdout, "Not able to read table properties\n"); } @@ -488,6 +486,7 @@ Status SstFileDumper::ReadSequential(bool print_kv, uint64_t read_num, return ret; } +// Provides TableProperties to API user Status SstFileDumper::ReadTableProperties( std::shared_ptr* table_properties) { if (!table_reader_) { diff --git a/table/table_properties.cc b/table/table_properties.cc index 1938f3342..83ff31151 100644 --- a/table/table_properties.cc +++ b/table/table_properties.cc @@ -7,12 +7,10 @@ #include "port/port.h" #include "rocksdb/env.h" -#include "rocksdb/iterator.h" #include "rocksdb/unique_id.h" -#include "table/block_based/block.h" -#include "table/internal_iterator.h" #include "table/table_properties_internal.h" #include "table/unique_id_impl.h" +#include "util/random.h" #include "util/string_util.h" namespace ROCKSDB_NAMESPACE { @@ -44,31 +42,6 @@ namespace { props, key, ToString(value), prop_delim, kv_delim ); } - - // Seek to the specified meta block. - // Return true if it successfully seeks to that block. - Status SeekToMetaBlock(InternalIterator* meta_iter, - const std::string& block_name, bool* is_found, - BlockHandle* block_handle = nullptr) { - if (block_handle != nullptr) { - *block_handle = BlockHandle::NullBlockHandle(); - } - *is_found = true; - meta_iter->Seek(block_name); - if (meta_iter->status().ok()) { - if (meta_iter->Valid() && meta_iter->key() == block_name) { - *is_found = true; - if (block_handle) { - Slice v = meta_iter->value(); - return block_handle->DecodeFrom(&v); - } - } else { - *is_found = false; - return Status::OK(); - } - } - return meta_iter->status(); - } } std::string TableProperties::ToString( @@ -306,12 +279,6 @@ const std::string TablePropertiesNames::kSlowCompressionEstimatedDataSize = const std::string TablePropertiesNames::kFastCompressionEstimatedDataSize = "rocksdb.sample_for_compression.fast.data.size"; -extern const std::string kPropertiesBlock = "rocksdb.properties"; -// Old property block name for backward compatibility -extern const std::string kPropertiesBlockOldName = "rocksdb.stats"; -extern const std::string kCompressionDictBlock = "rocksdb.compression_dict"; -extern const std::string kRangeDelBlock = "rocksdb.range_del"; - #ifndef NDEBUG void TEST_SetRandomTableProperties(TableProperties* props) { Random* r = Random::GetTLSInstance(); @@ -335,26 +302,4 @@ void TEST_SetRandomTableProperties(TableProperties* props) { } #endif -// Seek to the properties block. -// Return true if it successfully seeks to the properties block. -Status SeekToPropertiesBlock(InternalIterator* meta_iter, bool* is_found) { - Status status = SeekToMetaBlock(meta_iter, kPropertiesBlock, is_found); - if (!*is_found && status.ok()) { - status = SeekToMetaBlock(meta_iter, kPropertiesBlockOldName, is_found); - } - return status; -} - -// Seek to the compression dictionary block. -// Return true if it successfully seeks to that block. -Status SeekToCompressionDictBlock(InternalIterator* meta_iter, bool* is_found, - BlockHandle* block_handle) { - return SeekToMetaBlock(meta_iter, kCompressionDictBlock, is_found, block_handle); -} - -Status SeekToRangeDelBlock(InternalIterator* meta_iter, bool* is_found, - BlockHandle* block_handle = nullptr) { - return SeekToMetaBlock(meta_iter, kRangeDelBlock, is_found, block_handle); -} - } // namespace ROCKSDB_NAMESPACE diff --git a/table/table_properties_internal.h b/table/table_properties_internal.h index 25bc75f66..5c2a0cb9a 100644 --- a/table/table_properties_internal.h +++ b/table/table_properties_internal.h @@ -5,29 +5,9 @@ #pragma once -#include "rocksdb/status.h" #include "rocksdb/table_properties.h" -#include "table/internal_iterator.h" namespace ROCKSDB_NAMESPACE { - -class BlockHandle; - -// Seek to the properties block. -// If it successfully seeks to the properties block, "is_found" will be -// set to true. -Status SeekToPropertiesBlock(InternalIterator* meta_iter, bool* is_found); - -// Seek to the compression dictionary block. -// If it successfully seeks to the properties block, "is_found" will be -// set to true. -Status SeekToCompressionDictBlock(InternalIterator* meta_iter, bool* is_found, - BlockHandle* block_handle); - -// TODO(andrewkr) should not put all meta block in table_properties.h/cc -Status SeekToRangeDelBlock(InternalIterator* meta_iter, bool* is_found, - BlockHandle* block_handle); - #ifndef NDEBUG void TEST_SetRandomTableProperties(TableProperties* props); #endif diff --git a/table/table_test.cc b/table/table_test.cc index df8806cd6..b845b976f 100644 --- a/table/table_test.cc +++ b/table/table_test.cc @@ -1698,7 +1698,8 @@ TEST_P(BlockBasedTableTest, BasicBlockBasedTableProperties) { block_builder.Add(item.first, item.second); } Slice content = block_builder.Finish(); - ASSERT_EQ(content.size() + kBlockTrailerSize + diff_internal_user_bytes, + ASSERT_EQ(content.size() + BlockBasedTable::kBlockTrailerSize + + diff_internal_user_bytes, props.data_size); c.ResetTableReader(); } @@ -3836,11 +3837,9 @@ TEST_F(PlainTableTest, BasicPlainTableProperties) { std::unique_ptr file_reader( new RandomAccessFileReader(std::move(source), "test")); - TableProperties* props = nullptr; + std::unique_ptr props; auto s = ReadTableProperties(file_reader.get(), ss->contents().size(), - kPlainTableMagicNumber, ioptions, - &props, true /* compression_type_missing */); - std::unique_ptr props_guard(props); + kPlainTableMagicNumber, ioptions, &props); ASSERT_OK(s); ASSERT_EQ(0ul, props->index_size); @@ -4481,10 +4480,10 @@ TEST_P(BlockBasedTableTest, DISABLED_TableWithGlobalSeqno) { std::unique_ptr file_reader( new RandomAccessFileReader(std::move(source), "")); - TableProperties* props = nullptr; + std::unique_ptr props; ASSERT_OK(ReadTableProperties(file_reader.get(), ss_rw.contents().size(), kBlockBasedTableMagicNumber, ioptions, - &props, true /* compression_type_missing */)); + &props)); UserCollectedProperties user_props = props->user_collected_properties; version = DecodeFixed32( @@ -4493,8 +4492,6 @@ TEST_P(BlockBasedTableTest, DISABLED_TableWithGlobalSeqno) { user_props[ExternalSstFilePropertyNames::kGlobalSeqno].c_str()); global_seqno_offset = props->properties_offsets[ExternalSstFilePropertyNames::kGlobalSeqno]; - - delete props; }; // Helper function to update the value of the global seqno in the file @@ -4661,15 +4658,14 @@ TEST_P(BlockBasedTableTest, BlockAlignTest) { new RandomAccessFileReader(std::move(source), "test")); // Helper function to get version, global_seqno, global_seqno_offset std::function VerifyBlockAlignment = [&]() { - TableProperties* props = nullptr; + std::unique_ptr props; ASSERT_OK(ReadTableProperties(file_reader.get(), sink->contents().size(), - kBlockBasedTableMagicNumber, ioptions, &props, - true /* compression_type_missing */)); + kBlockBasedTableMagicNumber, ioptions, + &props)); uint64_t data_block_size = props->data_size / props->num_data_blocks; ASSERT_EQ(data_block_size, 4096); ASSERT_EQ(props->data_size, data_block_size * props->num_data_blocks); - delete props; }; VerifyBlockAlignment(); @@ -4788,16 +4784,13 @@ TEST_P(BlockBasedTableTest, PropertiesBlockRestartPointTest) { std::unique_ptr meta_iter(metaindex_block.NewDataIterator( BytewiseComparator(), kDisableGlobalSequenceNumber)); - bool found_properties_block = true; - ASSERT_OK(SeekToPropertiesBlock(meta_iter.get(), &found_properties_block)); - ASSERT_TRUE(found_properties_block); // -- Read properties block - Slice v = meta_iter->value(); BlockHandle properties_handle; - ASSERT_OK(properties_handle.DecodeFrom(&v)); + ASSERT_OK(FindOptionalMetaBlock(meta_iter.get(), kPropertiesBlock, + &properties_handle)); + ASSERT_FALSE(properties_handle.IsNull()); BlockContents properties_contents; - BlockFetchHelper(properties_handle, BlockType::kProperties, &properties_contents); Block properties_block(std::move(properties_contents));