Improve / clean up meta block code & integrity (#9163)

Summary:
* Checksums are now checked on meta blocks unless specifically
suppressed or not applicable (e.g. plain table). (Was other way around.)
This means a number of cases that were not checking checksums now are,
including direct read TableProperties in Version::GetTableProperties
(fixed in meta_blocks ReadTableProperties), reading any block from
PersistentCache (fixed in BlockFetcher), read TableProperties in
SstFileDumper (ldb/sst_dump/BackupEngine) before table reader open,
maybe more.
* For that to work, I moved the global_seqno+TableProperties checksum
logic to the shared table/ code, because that is used by many utilities
such as SstFileDumper.
* Also for that to work, we have to know when we're dealing with a block
that has a checksum (trailer), so added that capability to Footer based
on magic number, and from there BlockFetcher.
* Knowledge of trailer presence has also fixed a problem where other
table formats were reading blocks including bytes for a nonexistent
trailer--and awkwardly kind-of not using them, e.g. no shared code
checking checksums. (BlockFetcher compression type was populated
incorrectly.) Now we only read what is needed.
* Minimized code duplication and differing/incompatible/awkward
abstractions in meta_blocks.{cc,h} (e.g. SeekTo in metaindex block
without parsing block handle)
* Moved some meta block handling code from table_properties*.*
* Moved some code specific to block-based table from shared table/ code
to BlockBasedTable class. The checksum stuff means we can't completely
separate it, but things that don't need to be in shared table/ code
should not be.
* Use unique_ptr rather than raw ptr in more places. (Note: you can
std::move from unique_ptr to shared_ptr.)
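
The unique_ptr note above can be illustrated with a minimal standalone sketch. `Props`, `LoadProps` and `Share` are hypothetical names for illustration, not RocksDB APIs; the only thing shown is that a `std::unique_ptr<T>` rvalue converts to a `std::shared_ptr<const T>`, so a reader can hand back unique ownership and the caller can still store it as shared.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a properties struct; not a RocksDB type.
struct Props {
  std::string name;
  int num_entries = 0;
};

// Hypothetical loader that returns unique ownership of the result.
inline std::unique_ptr<Props> LoadProps(const std::string& name, int n) {
  auto p = std::make_unique<Props>();
  p->name = name;
  p->num_entries = n;
  return p;
}

// Moves unique ownership into a shared, read-only handle.
// After the move, the unique_ptr passed in is left empty.
inline std::shared_ptr<const Props> Share(std::unique_ptr<Props>&& p) {
  assert(p != nullptr);  // precondition of this sketch only
  std::shared_ptr<const Props> out = std::move(p);
  return out;
}
```

No raw pointer or manual guard object is involved at any point, which is the pattern the diff moves toward.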

Without enhancements to GetPropertiesOfAllTablesTest (see below),
net reduction of roughly 100 lines of code.
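
The trailer handling described in the summary can be sketched roughly as follows. This is a simplified illustration, not the BlockFetcher code: it assumes a 5-byte trailer (one compression-type byte plus a 4-byte checksum covering payload and type byte), and `ToyChecksum` is a hypothetical FNV-1a stand-in for the real checksum. All names are made up for the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Assumption: blocks with a trailer carry 1 type byte + 4 checksum bytes.
constexpr std::size_t kSketchTrailerSize = 5;

// Hypothetical stand-in checksum (FNV-1a) over the first n bytes of data.
inline std::uint32_t ToyChecksum(const std::string& data, std::size_t n) {
  assert(n <= data.size());
  std::uint32_t h = 2166136261u;
  for (std::size_t i = 0; i < n; ++i) {
    h ^= static_cast<unsigned char>(data[i]);
    h *= 16777619u;
  }
  return h;
}

// Bytes to read for a block: only include the trailer if the format has one.
inline std::size_t BytesToRead(std::size_t payload_size, bool has_trailer) {
  return payload_size + (has_trailer ? kSketchTrailerSize : 0);
}

// Appends the type byte and a little-endian checksum to the payload.
inline std::string AppendTrailer(std::string payload, char compression_type) {
  payload.push_back(compression_type);
  const std::uint32_t sum = ToyChecksum(payload, payload.size());
  for (int i = 0; i < 4; ++i) {
    payload.push_back(static_cast<char>((sum >> (8 * i)) & 0xffu));
  }
  return payload;
}

// Verifies the checksum when a trailer is present; otherwise nothing to check.
inline bool VerifyBlock(const std::string& raw, bool has_trailer) {
  if (!has_trailer) return true;  // e.g. a format without block trailers
  if (raw.size() < kSketchTrailerSize) return false;
  const std::size_t covered = raw.size() - 4;  // payload + type byte
  std::uint32_t stored = 0;
  for (int i = 0; i < 4; ++i) {
    stored |= static_cast<std::uint32_t>(
                  static_cast<unsigned char>(raw[covered + i]))
              << (8 * i);
  }
  return stored == ToyChecksum(raw, covered);
}
```

The design point is that the read size and the verification step are both driven by one "has trailer" fact, so formats without a trailer neither read extra bytes nor skip a check silently.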

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9163

Test Plan:
existing tests and
* Enhanced DBTablePropertiesTest.GetPropertiesOfAllTablesTest to verify that
checksums are now checked on direct read of table properties by TableCache
(new test would fail before this change)
* Also enhanced DBTablePropertiesTest.GetPropertiesOfAllTablesTest to test
putting table properties under old meta name
* Also generally enhanced that same test to actually test what it was
supposed to be testing already, by kicking things out of table cache when
we don't want them there.
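
The corrupt / un-corrupt step in the enhanced test relies on the corruption being reversible by applying it twice. A minimal sketch of that idea, assuming the corruption is a bit flip (the helper `FlipBytes` is hypothetical and is not the test utility used in the diff):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: flips the high bit of up to `len` bytes starting at
// `pos`. XOR with a fixed mask is its own inverse, so calling this twice
// with the same arguments restores the original contents.
inline void FlipBytes(std::string* contents, std::size_t pos,
                      std::size_t len) {
  assert(contents != nullptr);
  for (std::size_t i = pos; i < pos + len && i < contents->size(); ++i) {
    const unsigned char c = static_cast<unsigned char>((*contents)[i]);
    (*contents)[i] = static_cast<char>(c ^ 0x80u);
  }
}
```

Locating the byte to flip via a known string inside the block (the session id in the test) keeps the corruption inside the properties block without needing to parse block handles.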

Reviewed By: ajkr, mrambacher

Differential Revision: D32514757

Pulled By: pdillinger

fbshipit-source-id: 507964b9311d186ae8d1131182290cbd97a99fa9
Peter Dillinger 3 years ago committed by Facebook GitHub Bot
parent f4294669e0
commit 230660be73
 HISTORY.md                                     |   2
 Makefile                                       |   7
 db/corruption_test.cc                          |   2
 db/db_table_properties_test.cc                 |  79
 db/plain_table_db_test.cc                      |  34
 db/table_cache.cc                              |  10
 db/table_cache.h                               |   3
 db/table_properties_collector_test.cc          |  14
 db/version_edit_handler.cc                     |   6
 db/version_set.cc                              |   7
 include/rocksdb/table_properties.h             |   4
 table/block_based/block_based_table_builder.cc |  10
 table/block_based/block_based_table_reader.cc  | 131
 table/block_based/block_based_table_reader.h   |  27
 table/block_based/block_prefetcher.cc          |   9
 table/block_based/block_test.cc                |   7
 table/block_based/filter_policy.cc             |   3
 table/block_based/flush_block_policy.cc        |   3
 table/block_based/partitioned_filter_block.cc  |   3
 table/block_based/partitioned_index_reader.cc  |   4
 table/block_based/reader_common.cc             |   1
 table/block_fetcher.cc                         |  26
 table/block_fetcher.h                          |  24
 table/cuckoo/cuckoo_table_builder_test.cc      |   6
 table/cuckoo/cuckoo_table_reader.cc            |  14
 table/format.cc                                |  21
 table/format.h                                 |  38
 table/meta_blocks.cc                           | 256
 table/meta_blocks.h                            |  47
 table/persistent_cache_helper.cc               |   5
 table/persistent_cache_options.h               |   2
 table/plain/plain_table_reader.cc              |  14
 table/sst_file_dumper.cc                       |  11
 table/table_properties.cc                      |  57
 table/table_properties_internal.h              |  20
 table/table_test.cc                            |  31

@@ -17,6 +17,7 @@
 * Added input sanitization on negative bytes passed into `GenericRateLimiter::Request`.
 * Fixed an assertion failure in CompactionIterator when write-prepared transaction is used. We prove that certain operations can lead to a Delete being followed by a SingleDelete (same user key). We can drop the SingleDelete.
 * Fixed a bug of timestamp-based GC which can cause all versions of a key under full_history_ts_low to be dropped. This bug will be triggered when some of the ikeys' timestamps are lower than full_history_ts_low, while others are newer.
+* In some cases outside of the DB read and compaction paths, SST block checksums are now checked where they were not before.
 * Explicitly check for and disallow the `BlockBasedTableOptions` if insertion into one of {`block_cache`, `block_cache_compressed`, `persistent_cache`} can show up in another of these. (RocksDB expects to be able to use the same key for different physical data among tiers.)
 ### Behavior Changes
@@ -29,6 +30,7 @@
 * Made FileSystem extend the Customizable class and added a CreateFromString method. Implementations need to be registered with the ObjectRegistry and to implement a Name() method in order to be created via this method.
 * Clarified in API comments that RocksDB is not exception safe for callbacks and custom extensions. An exception propagating into RocksDB can lead to undefined behavior, including data loss, unreported corruption, deadlocks, and more.
 * Marked `WriteBufferManager` as `final` because it is not intended for extension.
+* Removed unimportant implementation details from table_properties.h
 * Add API `FSDirectory::FsyncWithDirOptions()`, which provides extra information like directory fsync reason in `DirFsyncOptions`. File system like btrfs is using that to skip directory fsync for creating a new file, or when renaming a file, fsync the target file instead of the directory, which improves the `DB::Open()` speed by ~20%.
 * `DB::Open()` is not going be blocked by obsolete file purge if `DBOptions::avoid_unnecessary_blocking_io` is set to true.
 * In builds where glibc provides `gettid()`, info log ("LOG" file) lines now print a system-wide thread ID from `gettid()` instead of the process-local `pthread_self()`. For all users, the thread ID format is changed from hexadecimal to decimal integer.

@@ -306,6 +306,13 @@ ifdef COMPILE_WITH_ASAN
 EXEC_LDFLAGS += -fsanitize=address
 PLATFORM_CCFLAGS += -fsanitize=address
 PLATFORM_CXXFLAGS += -fsanitize=address
+ifeq ($(LIB_MODE),shared)
+ifdef USE_CLANG
+# Fix false ODR violation; see https://github.com/google/sanitizers/issues/1017
+EXEC_LDFLAGS += -mllvm -asan-use-private-alias=1
+PLATFORM_CXXFLAGS += -mllvm -asan-use-private-alias=1
+endif
+endif
 endif
 # TSAN doesn't work well with jemalloc. If we're compiling with TSAN, we should use regular malloc.

@@ -544,7 +544,7 @@ TEST_F(CorruptionTest, RangeDeletionCorrupted) {
 fs->GetFileSize(filename, file_opts.io_options, &file_size, nullptr));
 BlockHandle range_del_handle;
-ASSERT_OK(FindMetaBlock(
+ASSERT_OK(FindMetaBlockInFile(
 file_reader.get(), file_size, kBlockBasedTableMagicNumber,
 ImmutableOptions(options_), kRangeDelBlock, &range_del_handle));

@@ -14,8 +14,10 @@
 #include "port/port.h"
 #include "port/stack_trace.h"
 #include "rocksdb/db.h"
+#include "rocksdb/types.h"
 #include "rocksdb/utilities/table_properties_collectors.h"
 #include "table/format.h"
+#include "table/meta_blocks.h"
 #include "test_util/testharness.h"
 #include "test_util/testutil.h"
 #include "util/random.h"
@@ -63,21 +65,49 @@ class DBTablePropertiesTest : public DBTestBase,
 TEST_F(DBTablePropertiesTest, GetPropertiesOfAllTablesTest) {
 Options options = CurrentOptions();
 options.level0_file_num_compaction_trigger = 8;
+// Part of strategy to prevent pinning table files
+options.max_open_files = 42;
 Reopen(options);
 // Create 4 tables
 for (int table = 0; table < 4; ++table) {
+// Use old meta name for table properties for one file
+if (table == 3) {
+SyncPoint::GetInstance()->SetCallBack(
+"BlockBasedTableBuilder::WritePropertiesBlock:Meta", [&](void* meta) {
+*reinterpret_cast<const std::string**>(meta) =
+&kPropertiesBlockOldName;
+});
+SyncPoint::GetInstance()->EnableProcessing();
+}
+// Build file
 for (int i = 0; i < 10 + table; ++i) {
 ASSERT_OK(db_->Put(WriteOptions(), ToString(table * 100 + i), "val"));
 }
 ASSERT_OK(db_->Flush(FlushOptions()));
 }
+SyncPoint::GetInstance()->DisableProcessing();
+std::string original_session_id;
+ASSERT_OK(db_->GetDbSessionId(original_session_id));
+// Part of strategy to prevent pinning table files
+SyncPoint::GetInstance()->SetCallBack(
+"VersionEditHandler::LoadTables:skip_load_table_files",
+[&](void* skip_load) { *reinterpret_cast<bool*>(skip_load) = true; });
+SyncPoint::GetInstance()->EnableProcessing();
 // 1. Read table properties directly from file
 Reopen(options);
+// Clear out auto-opened files
+dbfull()->TEST_table_cache()->EraseUnRefEntries();
+ASSERT_EQ(dbfull()->TEST_table_cache()->GetUsage(), 0U);
 VerifyTableProperties(db_, 10 + 11 + 12 + 13);
 // 2. Put two tables to table cache and
 Reopen(options);
+// Clear out auto-opened files
+dbfull()->TEST_table_cache()->EraseUnRefEntries();
+ASSERT_EQ(dbfull()->TEST_table_cache()->GetUsage(), 0U);
 // fetch key from 1st and 2nd table, which will internally place that table to
 // the table cache.
 for (int i = 0; i < 2; ++i) {
@@ -88,12 +118,57 @@ TEST_F(DBTablePropertiesTest, GetPropertiesOfAllTablesTest) {
 // 3. Put all tables to table cache
 Reopen(options);
-// fetch key from 1st and 2nd table, which will internally place that table to
-// the table cache.
+// fetch key from all tables, which will place them in table cache.
 for (int i = 0; i < 4; ++i) {
 Get(ToString(i * 100 + 0));
 }
 VerifyTableProperties(db_, 10 + 11 + 12 + 13);
+// 4. Try to read CORRUPT properties (a) directly from file, and (b)
+// through reader on Get
+// It's not practical to prevent table file read on Open, so we
+// corrupt after open and after purging table cache.
+for (bool direct : {true, false}) {
+Reopen(options);
+// Clear out auto-opened files
+dbfull()->TEST_table_cache()->EraseUnRefEntries();
+ASSERT_EQ(dbfull()->TEST_table_cache()->GetUsage(), 0U);
+TablePropertiesCollection props;
+ASSERT_OK(db_->GetPropertiesOfAllTables(&props));
+std::string sst_file = props.begin()->first;
+// Corrupt the file's TableProperties using session id
+std::string contents;
+ASSERT_OK(
+ReadFileToString(env_->GetFileSystem().get(), sst_file, &contents));
+size_t pos = contents.find(original_session_id);
+ASSERT_NE(pos, std::string::npos);
+ASSERT_OK(test::CorruptFile(env_, sst_file, static_cast<int>(pos), 1,
+/*verify checksum fails*/ false));
+// Try to read CORRUPT properties
+if (direct) {
+ASSERT_TRUE(db_->GetPropertiesOfAllTables(&props).IsCorruption());
+} else {
+bool found_corruption = false;
+for (int i = 0; i < 4; ++i) {
+std::string result = Get(ToString(i * 100 + 0));
+if (result.find_first_of("Corruption: block checksum mismatch") !=
+std::string::npos) {
+found_corruption = true;
+}
+}
+ASSERT_TRUE(found_corruption);
+}
+// UN-corrupt file for next iteration
+ASSERT_OK(test::CorruptFile(env_, sst_file, static_cast<int>(pos), 1,
+/*verify checksum fails*/ false));
+}
+SyncPoint::GetInstance()->DisableProcessing();
 }
 TEST_F(DBTablePropertiesTest, CreateOnDeletionCollectorFactory) {

@@ -266,24 +266,22 @@ class TestPlainTableReader : public PlainTableReader {
 const EnvOptions& env_options, const InternalKeyComparator& icomparator,
 EncodingType encoding_type, uint64_t file_size, int bloom_bits_per_key,
 double hash_table_ratio, size_t index_sparseness,
-const TableProperties* table_properties,
+std::unique_ptr<TableProperties>&& props,
 std::unique_ptr<RandomAccessFileReader>&& file,
 const ImmutableOptions& ioptions, const SliceTransform* prefix_extractor,
 bool* expect_bloom_not_match, bool store_index_in_file,
 uint32_t column_family_id, const std::string& column_family_name)
 : PlainTableReader(ioptions, std::move(file), env_options, icomparator,
-encoding_type, file_size, table_properties,
+encoding_type, file_size, props.get(),
 prefix_extractor),
 expect_bloom_not_match_(expect_bloom_not_match) {
 Status s = MmapDataIfNeeded();
 EXPECT_TRUE(s.ok());
-s = PopulateIndex(const_cast<TableProperties*>(table_properties),
-bloom_bits_per_key, hash_table_ratio, index_sparseness,
-2 * 1024 * 1024);
+s = PopulateIndex(props.get(), bloom_bits_per_key, hash_table_ratio,
+index_sparseness, 2 * 1024 * 1024);
 EXPECT_TRUE(s.ok());
-TableProperties* props = const_cast<TableProperties*>(table_properties);
 EXPECT_EQ(column_family_id, static_cast<uint32_t>(props->column_family_id));
 EXPECT_EQ(column_family_name, props->column_family_name);
 if (store_index_in_file) {
@@ -297,7 +295,7 @@ class TestPlainTableReader : public PlainTableReader {
 EXPECT_TRUE(num_blocks_ptr != props->user_collected_properties.end());
 }
 }
-table_properties_.reset(props);
+table_properties_ = std::move(props);
 }
 ~TestPlainTableReader() override {}
@@ -337,26 +335,24 @@ class TestPlainTableFactory : public PlainTableFactory {
 std::unique_ptr<RandomAccessFileReader>&& file, uint64_t file_size,
 std::unique_ptr<TableReader>* table,
 bool /*prefetch_index_and_filter_in_cache*/) const override {
-TableProperties* props = nullptr;
-auto s =
-ReadTableProperties(file.get(), file_size, kPlainTableMagicNumber,
-table_reader_options.ioptions, &props,
-true /* compression_type_missing */);
+std::unique_ptr<TableProperties> props;
+auto s = ReadTableProperties(file.get(), file_size, kPlainTableMagicNumber,
+table_reader_options.ioptions, &props);
 EXPECT_TRUE(s.ok());
 if (store_index_in_file_) {
 BlockHandle bloom_block_handle;
-s = FindMetaBlock(file.get(), file_size, kPlainTableMagicNumber,
+s = FindMetaBlockInFile(file.get(), file_size, kPlainTableMagicNumber,
 table_reader_options.ioptions,
-BloomBlockBuilder::kBloomBlock, &bloom_block_handle,
-/* compression_type_missing */ true);
+BloomBlockBuilder::kBloomBlock,
+&bloom_block_handle);
 EXPECT_TRUE(s.ok());
 BlockHandle index_block_handle;
-s = FindMetaBlock(file.get(), file_size, kPlainTableMagicNumber,
+s = FindMetaBlockInFile(file.get(), file_size, kPlainTableMagicNumber,
 table_reader_options.ioptions,
 PlainTableIndexBuilder::kPlainTableIndexBlock,
-&index_block_handle, /* compression_type_missing */ true);
+&index_block_handle);
 EXPECT_TRUE(s.ok());
 }
@@ -370,8 +366,8 @@ class TestPlainTableFactory : public PlainTableFactory {
 std::unique_ptr<PlainTableReader> new_reader(new TestPlainTableReader(
 table_reader_options.env_options,
 table_reader_options.internal_comparator, encoding_type, file_size,
-bloom_bits_per_key_, hash_table_ratio_, index_sparseness_, props,
-std::move(file), table_reader_options.ioptions,
+bloom_bits_per_key_, hash_table_ratio_, index_sparseness_,
+std::move(props), std::move(file), table_reader_options.ioptions,
 table_reader_options.prefix_extractor, expect_bloom_not_match_,
 store_index_in_file_, column_family_id_, column_family_name_));

@@ -655,6 +655,16 @@ size_t TableCache::GetMemoryUsageByTableReader(
 return ret;
 }
+bool TableCache::HasEntry(Cache* cache, uint64_t file_number) {
+Cache::Handle* handle = cache->Lookup(GetSliceForFileNumber(&file_number));
+if (handle) {
+cache->Release(handle);
+return true;
+} else {
+return false;
+}
+}
 void TableCache::Evict(Cache* cache, uint64_t file_number) {
 cache->Erase(GetSliceForFileNumber(&file_number));
 }

@@ -125,6 +125,9 @@ class TableCache {
 // Evict any entry for the specified file number
 static void Evict(Cache* cache, uint64_t file_number);
+// Query whether specified file number is currently in cache
+static bool HasEntry(Cache* cache, uint64_t file_number);
 // Clean table handle and erase it from the table cache
 // Used in DB close, or the file is not live anymore.
 void EraseHandle(const FileDescriptor& fd, Cache::Handle* handle);

@@ -291,11 +291,9 @@ void TestCustomizedTablePropertiesCollector(
 std::unique_ptr<RandomAccessFileReader> fake_file_reader(
 new RandomAccessFileReader(std::move(source), "test"));
-TableProperties* props;
+std::unique_ptr<TableProperties> props;
 Status s = ReadTableProperties(fake_file_reader.get(), fwf->contents().size(),
-magic_number, ioptions, &props,
-true /* compression_type_missing */);
-std::unique_ptr<TableProperties> props_guard(props);
+magic_number, ioptions, &props);
 ASSERT_OK(s);
 auto user_collected = props->user_collected_properties;
@@ -432,13 +430,11 @@ void TestInternalKeyPropertiesCollector(
 std::unique_ptr<RandomAccessFileReader> reader(
 new RandomAccessFileReader(std::move(source), "test"));
-TableProperties* props;
-Status s =
-ReadTableProperties(reader.get(), fwf->contents().size(), magic_number,
-ioptions, &props, true /* compression_type_missing */);
+std::unique_ptr<TableProperties> props;
+Status s = ReadTableProperties(reader.get(), fwf->contents().size(),
+magic_number, ioptions, &props);
 ASSERT_OK(s);
-std::unique_ptr<TableProperties> props_guard(props);
 auto user_collected = props->user_collected_properties;
 uint64_t deleted = GetDeletedKeys(user_collected);
 ASSERT_EQ(5u, deleted); // deletes + single-deletes

@@ -517,7 +517,11 @@ Status VersionEditHandler::MaybeCreateVersion(const VersionEdit& /*edit*/,
 Status VersionEditHandler::LoadTables(ColumnFamilyData* cfd,
 bool prefetch_index_and_filter_in_cache,
 bool is_initial_load) {
-if (skip_load_table_files_) {
+bool skip_load_table_files = skip_load_table_files_;
+TEST_SYNC_POINT_CALLBACK(
+"VersionEditHandler::LoadTables:skip_load_table_files",
+&skip_load_table_files);
+if (skip_load_table_files) {
 return Status::OK();
 }
 assert(cfd != nullptr);

@@ -1271,7 +1271,6 @@ Status Version::GetTableProperties(std::shared_ptr<const TableProperties>* tp,
 return s;
 }
-TableProperties* raw_table_properties;
 // By setting the magic number to kInvalidTableMagicNumber, we can by
 // pass the magic number check in the footer.
 std::unique_ptr<RandomAccessFileReader> file_reader(
@@ -1279,16 +1278,16 @@ Status Version::GetTableProperties(std::shared_ptr<const TableProperties>* tp,
 std::move(file), file_name, nullptr /* env */, io_tracer_,
 nullptr /* stats */, 0 /* hist_type */, nullptr /* file_read_hist */,
 nullptr /* rate_limiter */, ioptions->listeners));
+std::unique_ptr<TableProperties> props;
 s = ReadTableProperties(
 file_reader.get(), file_meta->fd.GetFileSize(),
 Footer::kInvalidTableMagicNumber /* table's magic number */, *ioptions,
-&raw_table_properties, false /* compression_type_missing */);
+&props);
 if (!s.ok()) {
 return s;
 }
+*tp = std::move(props);
 RecordTick(ioptions->stats, NUMBER_DIRECT_LOAD_TABLE_PROPERTIES);
-*tp = std::shared_ptr<const TableProperties>(raw_table_properties);
 return s;
 }

@@ -71,10 +71,6 @@ struct TablePropertiesNames {
 static const std::string kFastCompressionEstimatedDataSize;
 };
-extern const std::string kPropertiesBlock;
-extern const std::string kCompressionDictBlock;
-extern const std::string kRangeDelBlock;
 // `TablePropertiesCollector` provides the mechanism for users to collect
 // their own properties that they are interested in. This class is essentially
 // a collection of callback functions that will be invoked during table

@@ -46,6 +46,7 @@
 #include "table/block_based/full_filter_block.h"
 #include "table/block_based/partitioned_filter_block.h"
 #include "table/format.h"
+#include "table/meta_blocks.h"
 #include "table/table_builder.h"
 #include "util/coding.h"
 #include "util/compression.h"
@@ -62,6 +63,8 @@ extern const std::string kHashIndexPrefixesMetadataBlock;
 // Without anonymous namespace here, we fail the warning -Wmissing-prototypes
 namespace {
+constexpr size_t kBlockTrailerSize = BlockBasedTable::kBlockTrailerSize;
 // Create a filter block builder based on its type.
 FilterBlockBuilder* CreateFilterBlockBuilder(
 const ImmutableCFOptions& /*opt*/, const MutableCFOptions& mopt,
@@ -1722,7 +1725,12 @@ void BlockBasedTableBuilder::WritePropertiesBlock(
 &props_block_size);
 }
 #endif // !NDEBUG
-meta_index_builder->Add(kPropertiesBlock, properties_block_handle);
+const std::string* properties_block_meta = &kPropertiesBlock;
+TEST_SYNC_POINT_CALLBACK(
+"BlockBasedTableBuilder::WritePropertiesBlock:Meta",
+&properties_block_meta);
+meta_index_builder->Add(*properties_block_meta, properties_block_handle);
 }
 }

@ -17,6 +17,7 @@
#include "cache/cache_entry_roles.h" #include "cache/cache_entry_roles.h"
#include "cache/sharded_cache.h" #include "cache/sharded_cache.h"
#include "db/compaction/compaction_picker.h"
#include "db/dbformat.h" #include "db/dbformat.h"
#include "db/pinned_iterators_manager.h" #include "db/pinned_iterators_manager.h"
#include "file/file_prefetch_buffer.h" #include "file/file_prefetch_buffer.h"
@ -747,75 +748,27 @@ Status BlockBasedTable::PrefetchTail(
return s; return s;
} }
Status BlockBasedTable::TryReadPropertiesWithGlobalSeqno(
const ReadOptions& ro, FilePrefetchBuffer* prefetch_buffer,
const Slice& handle_value, TableProperties** table_properties) {
assert(table_properties != nullptr);
// If this is an external SST file ingested with write_global_seqno set to
// true, then we expect the checksum mismatch because checksum was written
// by SstFileWriter, but its global seqno in the properties block may have
// been changed during ingestion. In this case, we read the properties
// block, copy it to a memory buffer, change the global seqno to its
// original value, i.e. 0, and verify the checksum again.
BlockHandle props_block_handle;
CacheAllocationPtr tmp_buf;
Status s = ReadProperties(ro, handle_value, rep_->file.get(), prefetch_buffer,
rep_->footer, rep_->ioptions, table_properties,
false /* verify_checksum */, &props_block_handle,
&tmp_buf, false /* compression_type_missing */,
nullptr /* memory_allocator */);
if (s.ok() && tmp_buf) {
const auto seqno_pos_iter =
(*table_properties)
->properties_offsets.find(
ExternalSstFilePropertyNames::kGlobalSeqno);
size_t block_size = static_cast<size_t>(props_block_handle.size());
if (seqno_pos_iter != (*table_properties)->properties_offsets.end()) {
uint64_t global_seqno_offset = seqno_pos_iter->second;
EncodeFixed64(
tmp_buf.get() + global_seqno_offset - props_block_handle.offset(), 0);
}
s = ROCKSDB_NAMESPACE::VerifyBlockChecksum(
rep_->footer.checksum(), tmp_buf.get(), block_size,
rep_->file->file_name(), props_block_handle.offset());
}
return s;
}
Status BlockBasedTable::ReadPropertiesBlock( Status BlockBasedTable::ReadPropertiesBlock(
const ReadOptions& ro, FilePrefetchBuffer* prefetch_buffer, const ReadOptions& ro, FilePrefetchBuffer* prefetch_buffer,
InternalIterator* meta_iter, const SequenceNumber largest_seqno) { InternalIterator* meta_iter, const SequenceNumber largest_seqno) {
bool found_properties_block = true;
Status s; Status s;
s = SeekToPropertiesBlock(meta_iter, &found_properties_block); BlockHandle handle;
s = FindOptionalMetaBlock(meta_iter, kPropertiesBlock, &handle);
if (!s.ok()) { if (!s.ok()) {
ROCKS_LOG_WARN(rep_->ioptions.logger, ROCKS_LOG_WARN(rep_->ioptions.logger,
"Error when seeking to properties block from file: %s", "Error when seeking to properties block from file: %s",
s.ToString().c_str()); s.ToString().c_str());
} else if (found_properties_block) { } else if (!handle.IsNull()) {
s = meta_iter->status(); s = meta_iter->status();
TableProperties* table_properties = nullptr; std::unique_ptr<TableProperties> table_properties;
if (s.ok()) { if (s.ok()) {
s = ReadProperties( s = ReadTablePropertiesHelper(
ro, meta_iter->value(), rep_->file.get(), prefetch_buffer, ro, handle, rep_->file.get(), prefetch_buffer, rep_->footer,
rep_->footer, rep_->ioptions, &table_properties, rep_->ioptions, &table_properties, nullptr /* memory_allocator */);
true /* verify_checksum */, nullptr /* ret_block_handle */,
nullptr /* ret_block_contents */,
false /* compression_type_missing */, nullptr /* memory_allocator */);
} }
IGNORE_STATUS_IF_ERROR(s); IGNORE_STATUS_IF_ERROR(s);
if (s.IsCorruption()) {
s = TryReadPropertiesWithGlobalSeqno(
ro, prefetch_buffer, meta_iter->value(), &table_properties);
IGNORE_STATUS_IF_ERROR(s);
}
std::unique_ptr<TableProperties> props_guard;
if (table_properties != nullptr) {
props_guard.reset(table_properties);
}
if (!s.ok()) { if (!s.ok()) {
ROCKS_LOG_WARN(rep_->ioptions.logger, ROCKS_LOG_WARN(rep_->ioptions.logger,
"Encountered error while reading data from properties " "Encountered error while reading data from properties "
@ -823,7 +776,7 @@ Status BlockBasedTable::ReadPropertiesBlock(
s.ToString().c_str()); s.ToString().c_str());
} else { } else {
assert(table_properties != nullptr); assert(table_properties != nullptr);
rep_->table_properties.reset(props_guard.release()); rep_->table_properties = std::move(table_properties);
rep_->blocks_maybe_compressed = rep_->blocks_maybe_compressed =
rep_->table_properties->compression_name != rep_->table_properties->compression_name !=
CompressionTypeToString(kNoCompression); CompressionTypeToString(kNoCompression);
@@ -898,15 +851,14 @@ Status BlockBasedTable::ReadRangeDelBlock(
     const InternalKeyComparator& internal_comparator,
     BlockCacheLookupContext* lookup_context) {
   Status s;
-  bool found_range_del_block;
   BlockHandle range_del_handle;
-  s = SeekToRangeDelBlock(meta_iter, &found_range_del_block, &range_del_handle);
+  s = FindOptionalMetaBlock(meta_iter, kRangeDelBlock, &range_del_handle);
   if (!s.ok()) {
     ROCKS_LOG_WARN(
         rep_->ioptions.logger,
         "Error when seeking to range delete tombstones block from file: %s",
         s.ToString().c_str());
-  } else if (found_range_del_block && !range_del_handle.IsNull()) {
+  } else if (!range_del_handle.IsNull()) {
     std::unique_ptr<InternalIterator> iter(NewDataBlockIterator<DataBlockIter>(
         read_options, range_del_handle,
         /*input_iter=*/nullptr, BlockType::kRangeDeletion,
@@ -969,8 +921,7 @@ Status BlockBasedTable::PrefetchIndexAndFilterBlocks(
       rep_->index_type == BlockBasedTableOptions::kTwoLevelIndexSearch);
   // Find compression dictionary handle
-  bool found_compression_dict = false;
-  s = SeekToCompressionDictBlock(meta_iter, &found_compression_dict,
-                                 &rep_->compression_dict_handle);
+  s = FindOptionalMetaBlock(meta_iter, kCompressionDictBlock,
+                            &rep_->compression_dict_handle);
   if (!s.ok()) {
     return s;
@ -1243,7 +1194,7 @@ Status BlockBasedTable::GetDataBlockFromCache(
RecordTick(statistics, BLOCK_CACHE_COMPRESSED_HIT); RecordTick(statistics, BLOCK_CACHE_COMPRESSED_HIT);
compressed_block = reinterpret_cast<BlockContents*>( compressed_block = reinterpret_cast<BlockContents*>(
block_cache_compressed->Value(block_cache_compressed_handle)); block_cache_compressed->Value(block_cache_compressed_handle));
CompressionType compression_type = compressed_block->get_compression_type(); CompressionType compression_type = GetBlockCompressionType(*compressed_block);
assert(compression_type != kNoCompression); assert(compression_type != kNoCompression);
// Retrieve the uncompressed contents into a new buffer // Retrieve the uncompressed contents into a new buffer
@ -1523,8 +1474,9 @@ Status BlockBasedTable::MaybeReadBlockAndLoadToCache(
// Update the block details so that PrefetchBuffer can use the read // Update the block details so that PrefetchBuffer can use the read
// pattern to determine if reads are sequential or not for // pattern to determine if reads are sequential or not for
// prefetching. It should also take in account blocks read from cache. // prefetching. It should also take in account blocks read from cache.
prefetch_buffer->UpdateReadPattern( prefetch_buffer->UpdateReadPattern(handle.offset(),
handle.offset(), block_size(handle), ro.adaptive_readahead); BlockSizeWithTrailer(handle),
ro.adaptive_readahead);
} }
} }
} }
@ -1569,7 +1521,7 @@ Status BlockBasedTable::MaybeReadBlockAndLoadToCache(
} }
} }
} else { } else {
raw_block_comp_type = contents->get_compression_type(); raw_block_comp_type = GetBlockCompressionType(*contents);
} }
if (s.ok()) { if (s.ok()) {
@ -1731,7 +1683,7 @@ void BlockBasedTable::RetrieveMultipleBlocks(
if (use_shared_buffer && !file->use_direct_io() && if (use_shared_buffer && !file->use_direct_io() &&
prev_end == handle.offset()) { prev_end == handle.offset()) {
req_offset_for_block.emplace_back(prev_len); req_offset_for_block.emplace_back(prev_len);
prev_len += block_size(handle); prev_len += BlockSizeWithTrailer(handle);
} else { } else {
// No compression or current block and previous one is not adjacent: // No compression or current block and previous one is not adjacent:
// Step 1, create a new request for previous blocks // Step 1, create a new request for previous blocks
@ -1752,13 +1704,13 @@ void BlockBasedTable::RetrieveMultipleBlocks(
// Step 2, remeber the previous block info // Step 2, remeber the previous block info
prev_offset = handle.offset(); prev_offset = handle.offset();
prev_len = block_size(handle); prev_len = BlockSizeWithTrailer(handle);
req_offset_for_block.emplace_back(0); req_offset_for_block.emplace_back(0);
} }
req_idx_for_block.emplace_back(read_reqs.size()); req_idx_for_block.emplace_back(read_reqs.size());
PERF_COUNTER_ADD(block_read_count, 1); PERF_COUNTER_ADD(block_read_count, 1);
PERF_COUNTER_ADD(block_read_byte, block_size(handle)); PERF_COUNTER_ADD(block_read_byte, BlockSizeWithTrailer(handle));
} }
// Handle the last block and process the pending last request // Handle the last block and process the pending last request
if (prev_len != 0) { if (prev_len != 0) {
@ -1815,7 +1767,7 @@ void BlockBasedTable::RetrieveMultipleBlocks(
Status s = req.status; Status s = req.status;
if (s.ok()) { if (s.ok()) {
if ((req.result.size() != req.len) || if ((req.result.size() != req.len) ||
(req_offset + block_size(handle) > req.result.size())) { (req_offset + BlockSizeWithTrailer(handle) > req.result.size())) {
s = Status::Corruption( s = Status::Corruption(
"truncated block read from " + rep_->file->file_name() + "truncated block read from " + rep_->file->file_name() +
" offset " + ToString(handle.offset()) + ", expected " + " offset " + ToString(handle.offset()) + ", expected " +
@ -1829,7 +1781,7 @@ void BlockBasedTable::RetrieveMultipleBlocks(
// We allocated a buffer for this block. Give ownership of it to // We allocated a buffer for this block. Give ownership of it to
// BlockContents so it can free the memory // BlockContents so it can free the memory
assert(req.result.data() == req.scratch); assert(req.result.data() == req.scratch);
assert(req.result.size() == block_size(handle)); assert(req.result.size() == BlockSizeWithTrailer(handle));
assert(req_offset == 0); assert(req_offset == 0);
std::unique_ptr<char[]> raw_block(req.scratch); std::unique_ptr<char[]> raw_block(req.scratch);
raw_block_contents = BlockContents(std::move(raw_block), handle.size()); raw_block_contents = BlockContents(std::move(raw_block), handle.size());
@ -1852,9 +1804,9 @@ void BlockBasedTable::RetrieveMultipleBlocks(
// begin address of each read request, we need to add the offset // begin address of each read request, we need to add the offset
// in each read request. Checksum is stored in the block trailer, // in each read request. Checksum is stored in the block trailer,
// beyond the payload size. // beyond the payload size.
s = ROCKSDB_NAMESPACE::VerifyBlockChecksum( s = VerifyBlockChecksum(footer.checksum(), data + req_offset,
footer.checksum(), data + req_offset, handle.size(), handle.size(), rep_->file->file_name(),
rep_->file->file_name(), handle.offset()); handle.offset());
TEST_SYNC_POINT_CALLBACK("RetrieveMultipleBlocks:VerifyChecksum", &s); TEST_SYNC_POINT_CALLBACK("RetrieveMultipleBlocks:VerifyChecksum", &s);
} }
} else if (!use_shared_buffer) { } else if (!use_shared_buffer) {
@ -1875,11 +1827,12 @@ void BlockBasedTable::RetrieveMultipleBlocks(
// In all other cases, the raw block is either uncompressed into a heap // In all other cases, the raw block is either uncompressed into a heap
// buffer or there is no cache at all. // buffer or there is no cache at all.
CompressionType compression_type = CompressionType compression_type =
raw_block_contents.get_compression_type(); GetBlockCompressionType(raw_block_contents);
if (use_shared_buffer && (compression_type == kNoCompression || if (use_shared_buffer && (compression_type == kNoCompression ||
(compression_type != kNoCompression && (compression_type != kNoCompression &&
rep_->table_options.block_cache_compressed))) { rep_->table_options.block_cache_compressed))) {
Slice raw = Slice(req.result.data() + req_offset, block_size(handle)); Slice raw =
Slice(req.result.data() + req_offset, BlockSizeWithTrailer(handle));
raw_block_contents = BlockContents( raw_block_contents = BlockContents(
CopyBufferToHeap(GetMemoryAllocator(rep_->table_options), raw), CopyBufferToHeap(GetMemoryAllocator(rep_->table_options), raw),
handle.size()); handle.size());
@ -1913,7 +1866,7 @@ void BlockBasedTable::RetrieveMultipleBlocks(
} }
CompressionType compression_type = CompressionType compression_type =
raw_block_contents.get_compression_type(); GetBlockCompressionType(raw_block_contents);
BlockContents contents; BlockContents contents;
if (compression_type != kNoCompression) { if (compression_type != kNoCompression) {
UncompressionContext context(compression_type); UncompressionContext context(compression_type);
@ -2669,7 +2622,7 @@ void BlockBasedTable::MultiGet(const ReadOptions& read_options,
} }
} else { } else {
block_handles.emplace_back(handle); block_handles.emplace_back(handle);
total_len += block_size(handle); total_len += BlockSizeWithTrailer(handle);
} }
} }
@ -2690,7 +2643,7 @@ void BlockBasedTable::MultiGet(const ReadOptions& read_options,
// or a false positive. We need to read the data block from // or a false positive. We need to read the data block from
// the SST file // the SST file
results[i].Reset(); results[i].Reset();
total_len += block_size(block_handles[i]); total_len += BlockSizeWithTrailer(block_handles[i]);
} else { } else {
block_handles[i] = BlockHandle::NullBlockHandle(); block_handles[i] = BlockHandle::NullBlockHandle();
} }
@@ -3088,20 +3041,22 @@ Status BlockBasedTable::VerifyChecksumInMetaBlocks(
     s = handle.DecodeFrom(&input);
     BlockContents contents;
     const Slice meta_block_name = index_iter->key();
-    BlockFetcher block_fetcher(
-        rep_->file.get(), nullptr /* prefetch buffer */, rep_->footer,
-        ReadOptions(), handle, &contents, rep_->ioptions,
-        false /* decompress */, false /*maybe_compressed*/,
-        GetBlockTypeForMetaBlockByName(meta_block_name),
-        UncompressionDict::GetEmptyDict(), rep_->persistent_cache_options);
-    s = block_fetcher.ReadBlockContents();
-    if (s.IsCorruption() && meta_block_name == kPropertiesBlock) {
-      TableProperties* table_properties;
-      ReadOptions ro;
-      s = TryReadPropertiesWithGlobalSeqno(ro, nullptr /* prefetch_buffer */,
-                                           index_iter->value(),
-                                           &table_properties);
-      delete table_properties;
+    if (meta_block_name == kPropertiesBlock) {
+      // Unfortunate special handling for properties block checksum w/
+      // global seqno
+      std::unique_ptr<TableProperties> table_properties;
+      s = ReadTablePropertiesHelper(ReadOptions(), handle, rep_->file.get(),
+                                    nullptr /* prefetch_buffer */, rep_->footer,
+                                    rep_->ioptions, &table_properties,
+                                    nullptr /* memory_allocator */);
+    } else {
+      s = BlockFetcher(
+              rep_->file.get(), nullptr /* prefetch buffer */, rep_->footer,
+              ReadOptions(), handle, &contents, rep_->ioptions,
+              false /* decompress */, false /*maybe_compressed*/,
+              GetBlockTypeForMetaBlockByName(meta_block_name),
+              UncompressionDict::GetEmptyDict(), rep_->persistent_cache_options)
+              .ReadBlockContents();
     }
     if (!s.ok()) {
       break;
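Review note: the hunks in this file replace the `found_*` bool out-parameters with a check on the returned handle (`!handle.IsNull()`). A minimal sketch of that convention follows, assuming an optional lookup reports success with a null handle when the block is absent. `Handle` and `FindOptional` are hypothetical stand-ins written for this note, not the RocksDB `BlockHandle` or `FindOptionalMetaBlock`, and the metaindex is modeled as a plain `std::map`.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for a block handle; (0, 0) plays the role of "null".
struct Handle {
  uint64_t offset;
  uint64_t size;
  bool IsNull() const { return offset == 0 && size == 0; }
};

// Hypothetical lookup: absence is not an error. The caller tells "found"
// from "not found" by checking whether the returned handle is null.
inline bool FindOptional(const std::map<std::string, Handle>& metaindex,
                         const std::string& name, Handle* out) {
  assert(out != nullptr);
  auto it = metaindex.find(name);
  *out = (it == metaindex.end()) ? Handle{0, 0} : it->second;
  return true;  // in this sketch only a real read error would return false
}
```

With this shape the caller needs one status check and one `IsNull()` check, so there is no separate flag to keep in sync with the handle.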

@ -19,6 +19,7 @@
#include "table/block_based/cachable_entry.h" #include "table/block_based/cachable_entry.h"
#include "table/block_based/filter_block.h" #include "table/block_based/filter_block.h"
#include "table/block_based/uncompression_dict_reader.h" #include "table/block_based/uncompression_dict_reader.h"
#include "table/format.h"
#include "table/table_properties_internal.h" #include "table/table_properties_internal.h"
#include "table/table_reader.h" #include "table/table_reader.h"
#include "table/two_level_iterator.h" #include "table/two_level_iterator.h"
@@ -68,6 +69,9 @@ class BlockBasedTable : public TableReader {
   static const size_t kInitAutoReadaheadSize = 8 * 1024;
   static const int kMinNumFileReadsToStartAutoReadahead = 2;
+  // 1-byte compression type + 32-bit checksum
+  static constexpr size_t kBlockTrailerSize = 5;
   // Attempt to open the table that is stored in bytes [0..file_size)
   // of "file", and read the metadata entries necessary to allow
   // retrieving data from the table.
@@ -224,6 +228,25 @@ class BlockBasedTable : public TableReader {
                                bool redundant,
                                Statistics* const statistics);
+  // Get the size to read from storage for a BlockHandle. size_t because we
+  // are about to load into memory.
+  static inline size_t BlockSizeWithTrailer(const BlockHandle& handle) {
+    return static_cast<size_t>(handle.size() + kBlockTrailerSize);
+  }
+  // It's the caller's responsibility to make sure that this is
+  // for raw block contents, which contains the compression
+  // byte in the end.
+  static inline CompressionType GetBlockCompressionType(const char* block_data,
+                                                        size_t block_size) {
+    return static_cast<CompressionType>(block_data[block_size]);
+  }
+  static inline CompressionType GetBlockCompressionType(
+      const BlockContents& contents) {
+    assert(contents.is_raw_block);
+    return GetBlockCompressionType(contents.data.data(), contents.data.size());
+  }
   // Retrieve all key value pairs from data blocks in the table.
   // The key retrieved are internal keys.
   Status GetKVPairsFromDataBlocks(std::vector<KVPairBlock>* kv_pair_blocks);
@@ -431,10 +454,6 @@ class BlockBasedTable : public TableReader {
                             FilePrefetchBuffer* prefetch_buffer,
                             std::unique_ptr<Block>* metaindex_block,
                             std::unique_ptr<InternalIterator>* iter);
-  Status TryReadPropertiesWithGlobalSeqno(const ReadOptions& ro,
-                                          FilePrefetchBuffer* prefetch_buffer,
-                                          const Slice& handle_value,
-                                          TableProperties** table_properties);
   Status ReadPropertiesBlock(const ReadOptions& ro,
                              FilePrefetchBuffer* prefetch_buffer,
                              InternalIterator* meta_iter,
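Review note: the helpers added to this header depend on the block layout described by the new comment, a payload followed by a 1-byte compression type and a 32-bit checksum. The sketch below illustrates that layout only. `Compression`, `SizeWithTrailer`, `CompressionOf` and `MakeRawBlock` are hypothetical names introduced for this note, and the checksum bytes are zero placeholders rather than a real checksum.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for a compression enum; values are illustrative.
enum class Compression : uint8_t { kNone = 0x0, kSnappy = 0x1 };

// 1-byte compression type + 32-bit checksum, as in the comment in the diff.
constexpr size_t kTrailerSize = 5;

// Bytes to read from storage for a block whose payload is payload_size bytes.
inline size_t SizeWithTrailer(uint64_t payload_size) {
  return static_cast<size_t>(payload_size + kTrailerSize);
}

// The compression byte sits immediately after the payload, which is why the
// helper in the diff indexes block_data[block_size].
inline Compression CompressionOf(const std::string& raw, size_t payload_size) {
  assert(raw.size() >= payload_size + kTrailerSize);
  return static_cast<Compression>(static_cast<uint8_t>(raw[payload_size]));
}

// Builds payload + compression byte + 4 placeholder checksum bytes.
inline std::string MakeRawBlock(const std::string& payload, Compression c) {
  std::string raw = payload;
  raw.push_back(static_cast<char>(c));
  raw.append(4, '\0');  // placeholder only; not a real checksum
  return raw;
}
```

Because the handle size covers only the payload, every read-size computation in the diff adds the trailer explicitly, which is what the rename from `block_size(handle)` to `BlockSizeWithTrailer(handle)` makes visible at each call site.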

@ -8,6 +8,8 @@
// found in the LICENSE file. See the AUTHORS file for names of contributors. // found in the LICENSE file. See the AUTHORS file for names of contributors.
#include "table/block_based/block_prefetcher.h" #include "table/block_based/block_prefetcher.h"
#include "table/block_based/block_based_table_reader.h"
namespace ROCKSDB_NAMESPACE { namespace ROCKSDB_NAMESPACE {
void BlockPrefetcher::PrefetchIfNeeded(const BlockBasedTable::Rep* rep, void BlockPrefetcher::PrefetchIfNeeded(const BlockBasedTable::Rep* rep,
const BlockHandle& handle, const BlockHandle& handle,
@ -36,7 +38,7 @@ void BlockPrefetcher::PrefetchIfNeeded(const BlockBasedTable::Rep* rep,
return; return;
} }
size_t len = static_cast<size_t>(block_size(handle)); size_t len = BlockBasedTable::BlockSizeWithTrailer(handle);
size_t offset = handle.offset(); size_t offset = handle.offset();
// If FS supports prefetching (readahead_limit_ will be non zero in that case) // If FS supports prefetching (readahead_limit_ will be non zero in that case)
@ -80,8 +82,9 @@ void BlockPrefetcher::PrefetchIfNeeded(const BlockBasedTable::Rep* rep,
// If prefetch is not supported, fall back to use internal prefetch buffer. // If prefetch is not supported, fall back to use internal prefetch buffer.
// Discarding other return status of Prefetch calls intentionally, as // Discarding other return status of Prefetch calls intentionally, as
// we can fallback to reading from disk if Prefetch fails. // we can fallback to reading from disk if Prefetch fails.
Status s = rep->file->Prefetch(handle.offset(), Status s = rep->file->Prefetch(
block_size(handle) + readahead_size_); handle.offset(),
BlockBasedTable::BlockSizeWithTrailer(handle) + readahead_size_);
if (s.IsNotSupported()) { if (s.IsNotSupported()) {
rep->CreateFilePrefetchBufferIfNotExists(initial_auto_readahead_size_, rep->CreateFilePrefetchBufferIfNotExists(initial_auto_readahead_size_,
max_auto_readahead_size, max_auto_readahead_size,

@ -4,7 +4,10 @@
// (found in the LICENSE.Apache file in the root directory). // (found in the LICENSE.Apache file in the root directory).
// //
#include "table/block_based/block.h"
#include <stdio.h> #include <stdio.h>
#include <algorithm> #include <algorithm>
#include <set> #include <set>
#include <string> #include <string>
@ -20,7 +23,7 @@
#include "rocksdb/iterator.h" #include "rocksdb/iterator.h"
#include "rocksdb/slice_transform.h" #include "rocksdb/slice_transform.h"
#include "rocksdb/table.h" #include "rocksdb/table.h"
#include "table/block_based/block.h" #include "table/block_based/block_based_table_reader.h"
#include "table/block_based/block_builder.h" #include "table/block_based/block_builder.h"
#include "table/format.h" #include "table/format.h"
#include "test_util/testharness.h" #include "test_util/testharness.h"
@ -520,7 +523,7 @@ void GenerateRandomIndexEntries(std::vector<std::string> *separators,
separators->emplace_back(*it++); separators->emplace_back(*it++);
uint64_t size = rnd.Uniform(1024 * 16); uint64_t size = rnd.Uniform(1024 * 16);
BlockHandle handle(offset, size); BlockHandle handle(offset, size);
offset += size + kBlockTrailerSize; offset += size + BlockBasedTable::kBlockTrailerSize;
block_handles->emplace_back(handle); block_handles->emplace_back(handle);
} }
} }

@ -19,6 +19,7 @@
#include "logging/logging.h" #include "logging/logging.h"
#include "rocksdb/slice.h" #include "rocksdb/slice.h"
#include "table/block_based/block_based_filter_block.h" #include "table/block_based/block_based_filter_block.h"
#include "table/block_based/block_based_table_reader.h"
#include "table/block_based/filter_policy_internal.h" #include "table/block_based/filter_policy_internal.h"
#include "table/block_based/full_filter_block.h" #include "table/block_based/full_filter_block.h"
#include "third-party/folly/folly/ConstexprMath.h" #include "third-party/folly/folly/ConstexprMath.h"
@ -158,7 +159,7 @@ class XXPH3FilterBitsBuilder : public BuiltinFilterBitsBuilder {
// Filter blocks are loaded into block cache with their block trailer. // Filter blocks are loaded into block cache with their block trailer.
// We need to make sure that's accounted for in choosing a // We need to make sure that's accounted for in choosing a
// fragmentation-friendly size. // fragmentation-friendly size.
const size_t kExtraPadding = kBlockTrailerSize; const size_t kExtraPadding = BlockBasedTable::kBlockTrailerSize;
size_t requested = rv + kExtraPadding; size_t requested = rv + kExtraPadding;
// Allocate and get usable size // Allocate and get usable size

@ -11,6 +11,7 @@
#include "rocksdb/options.h" #include "rocksdb/options.h"
#include "rocksdb/slice.h" #include "rocksdb/slice.h"
#include "rocksdb/utilities/customizable_util.h" #include "rocksdb/utilities/customizable_util.h"
#include "table/block_based/block_based_table_reader.h"
#include "table/block_based/block_builder.h" #include "table/block_based/block_builder.h"
#include "table/block_based/flush_block_policy.h" #include "table/block_based/flush_block_policy.h"
#include "table/format.h" #include "table/format.h"
@ -62,7 +63,7 @@ class FlushBlockBySizePolicy : public FlushBlockPolicy {
data_block_builder_.EstimateSizeAfterKV(key, value); data_block_builder_.EstimateSizeAfterKV(key, value);
if (align_) { if (align_) {
estimated_size_after += kBlockTrailerSize; estimated_size_after += BlockBasedTable::kBlockTrailerSize;
return estimated_size_after > block_size_; return estimated_size_after > block_size_;
} }

@ -475,7 +475,8 @@ Status PartitionedFilterBlockReader::CacheDependencies(const ReadOptions& ro,
// Read the last block's offset // Read the last block's offset
biter.SeekToLast(); biter.SeekToLast();
handle = biter.value().handle; handle = biter.value().handle;
uint64_t last_off = handle.offset() + handle.size() + kBlockTrailerSize; uint64_t last_off =
handle.offset() + handle.size() + BlockBasedTable::kBlockTrailerSize;
uint64_t prefetch_len = last_off - prefetch_off; uint64_t prefetch_len = last_off - prefetch_off;
std::unique_ptr<FilePrefetchBuffer> prefetch_buffer; std::unique_ptr<FilePrefetchBuffer> prefetch_buffer;
rep->CreateFilePrefetchBuffer(0, 0, &prefetch_buffer, rep->CreateFilePrefetchBuffer(0, 0, &prefetch_buffer,

@ -9,6 +9,7 @@
#include "table/block_based/partitioned_index_reader.h" #include "table/block_based/partitioned_index_reader.h"
#include "file/random_access_file_reader.h" #include "file/random_access_file_reader.h"
#include "table/block_based/block_based_table_reader.h"
#include "table/block_based/partitioned_index_iterator.h" #include "table/block_based/partitioned_index_iterator.h"
namespace ROCKSDB_NAMESPACE { namespace ROCKSDB_NAMESPACE {
@ -146,7 +147,8 @@ Status PartitionIndexReader::CacheDependencies(const ReadOptions& ro,
return biter.status(); return biter.status();
} }
handle = biter.value().handle; handle = biter.value().handle;
uint64_t last_off = handle.offset() + block_size(handle); uint64_t last_off =
handle.offset() + BlockBasedTable::BlockSizeWithTrailer(handle);
uint64_t prefetch_len = last_off - prefetch_off; uint64_t prefetch_len = last_off - prefetch_off;
std::unique_ptr<FilePrefetchBuffer> prefetch_buffer; std::unique_ptr<FilePrefetchBuffer> prefetch_buffer;
rep->CreateFilePrefetchBuffer(0, 0, &prefetch_buffer, rep->CreateFilePrefetchBuffer(0, 0, &prefetch_buffer,

@ -22,6 +22,7 @@ void ForceReleaseCachedEntry(void* arg, void* h) {
cache->Release(handle, true /* force_erase */); cache->Release(handle, true /* force_erase */);
} }
// WART: this is specific to block-based table
Status VerifyBlockChecksum(ChecksumType type, const char* data, Status VerifyBlockChecksum(ChecksumType type, const char* data,
size_t block_size, const std::string& file_name, size_t block_size, const std::string& file_name,
uint64_t offset) { uint64_t offset) {

@ -15,9 +15,11 @@
#include "logging/logging.h" #include "logging/logging.h"
#include "memory/memory_allocator.h" #include "memory/memory_allocator.h"
#include "monitoring/perf_context_imp.h" #include "monitoring/perf_context_imp.h"
#include "rocksdb/compression_type.h"
#include "rocksdb/env.h" #include "rocksdb/env.h"
#include "table/block_based/block.h" #include "table/block_based/block.h"
#include "table/block_based/block_based_table_reader.h" #include "table/block_based/block_based_table_reader.h"
#include "table/block_based/block_type.h"
#include "table/block_based/reader_common.h" #include "table/block_based/reader_common.h"
#include "table/format.h" #include "table/format.h"
#include "table/persistent_cache_helper.h" #include "table/persistent_cache_helper.h"
@@ -26,12 +28,19 @@
 namespace ROCKSDB_NAMESPACE {
-inline void BlockFetcher::CheckBlockChecksum() {
-  // Check the crc of the type and the block contents
-  if (read_options_.verify_checksums) {
-    io_status_ = status_to_io_status(ROCKSDB_NAMESPACE::VerifyBlockChecksum(
-        footer_.checksum(), slice_.data(), block_size_, file_->file_name(),
-        handle_.offset()));
+inline void BlockFetcher::ProcessTrailerIfPresent() {
+  if (footer_.GetBlockTrailerSize() > 0) {
+    assert(footer_.GetBlockTrailerSize() == BlockBasedTable::kBlockTrailerSize);
+    if (read_options_.verify_checksums) {
+      io_status_ = status_to_io_status(
+          VerifyBlockChecksum(footer_.checksum(), slice_.data(), block_size_,
+                              file_->file_name(), handle_.offset()));
+    }
+    compression_type_ =
+        BlockBasedTable::GetBlockCompressionType(slice_.data(), block_size_);
+  } else {
+    // E.g. plain table or cuckoo table
+    compression_type_ = kNoCompression;
   }
 }
@ -63,7 +72,7 @@ inline bool BlockFetcher::TryGetFromPrefetchBuffer() {
if (io_s.ok() && prefetch_buffer_->TryReadFromCache( if (io_s.ok() && prefetch_buffer_->TryReadFromCache(
opts, handle_.offset(), block_size_with_trailer_, opts, handle_.offset(), block_size_with_trailer_,
&slice_, &io_s, for_compaction_)) { &slice_, &io_s, for_compaction_)) {
CheckBlockChecksum(); ProcessTrailerIfPresent();
if (!io_status_.ok()) { if (!io_status_.ok()) {
return true; return true;
} }
@ -88,6 +97,7 @@ inline bool BlockFetcher::TryGetCompressedBlockFromPersistentCache() {
heap_buf_ = CacheAllocationPtr(raw_data.release()); heap_buf_ = CacheAllocationPtr(raw_data.release());
used_buf_ = heap_buf_.get(); used_buf_ = heap_buf_.get();
slice_ = Slice(heap_buf_.get(), block_size_); slice_ = Slice(heap_buf_.get(), block_size_);
ProcessTrailerIfPresent();
return true; return true;
} else if (!io_status_.IsNotFound() && ioptions_.logger) { } else if (!io_status_.IsNotFound() && ioptions_.logger) {
assert(!io_status_.ok()); assert(!io_status_.ok());
@ -290,7 +300,7 @@ IOStatus BlockFetcher::ReadBlockContents() {
" bytes, got " + ToString(slice_.size())); " bytes, got " + ToString(slice_.size()));
} }
CheckBlockChecksum(); ProcessTrailerIfPresent();
if (io_status_.ok()) { if (io_status_.ok()) {
InsertCompressedBlockToPersistentCacheIfNeeded(); InsertCompressedBlockToPersistentCacheIfNeeded();
} else { } else {
@ -298,8 +308,6 @@ IOStatus BlockFetcher::ReadBlockContents() {
} }
} }
compression_type_ = get_block_compression_type(slice_.data(), block_size_);
if (do_uncompress_ && compression_type_ != kNoCompression) { if (do_uncompress_ && compression_type_ != kNoCompression) {
PERF_TIMER_GUARD(block_decompress_time); PERF_TIMER_GUARD(block_decompress_time);
// compressed page, uncompress, update cache // compressed page, uncompress, update cache
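Review note: the control flow of `ProcessTrailerIfPresent` can be sketched in isolation. The code below is an illustration under stated assumptions, not the RocksDB implementation: `ToyChecksum` is a 32-bit FNV-1a hash standing in for the real checksum selected by the footer, the checksum is assumed to cover the payload plus the compression byte (as the removed comment "Check the crc of the type and the block contents" suggests), and it is stored here as 4 little-endian bytes. `TrailerInfo`, `ProcessTrailer` and `AppendTrailer` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Toy checksum for illustration only (32-bit FNV-1a).
inline uint32_t ToyChecksum(const char* data, size_t n) {
  uint32_t h = 2166136261u;
  for (size_t i = 0; i < n; ++i) {
    h ^= static_cast<uint8_t>(data[i]);
    h *= 16777619u;
  }
  return h;
}

struct TrailerInfo {
  bool ok;              // false only if a checksum was verified and mismatched
  uint8_t compression;  // 0 means "no compression" in this sketch
};

// trailer_size == 0 models formats without a trailer (e.g. plain or cuckoo
// table): nothing to verify and the block is treated as uncompressed.
inline TrailerInfo ProcessTrailer(const std::string& raw, size_t payload_size,
                                  size_t trailer_size, bool verify) {
  if (trailer_size == 0) {
    return {true, 0};
  }
  assert(trailer_size == 5 && raw.size() >= payload_size + trailer_size);
  bool ok = true;
  if (verify) {
    uint32_t stored = 0;
    for (int i = 0; i < 4; ++i) {
      const uint8_t b = static_cast<uint8_t>(raw[payload_size + 1 + i]);
      stored |= static_cast<uint32_t>(b) << (8 * i);
    }
    ok = (stored == ToyChecksum(raw.data(), payload_size + 1));
  }
  return {ok, static_cast<uint8_t>(raw[payload_size])};
}

// Appends the compression byte and the toy checksum to a payload.
inline std::string AppendTrailer(std::string payload, uint8_t compression) {
  const size_t n = payload.size();
  payload.push_back(static_cast<char>(compression));
  const uint32_t sum = ToyChecksum(payload.data(), n + 1);
  for (int i = 0; i < 4; ++i) {
    payload.push_back(static_cast<char>((sum >> (8 * i)) & 0xff));
  }
  return payload;
}
```

The point the sketch tries to capture is that the compression type is derived in the same place as the checksum check, so every path that obtains raw bytes (prefetch buffer, persistent cache, direct read) ends up with both handled consistently.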

@ -37,12 +37,15 @@ namespace ROCKSDB_NAMESPACE {
class BlockFetcher { class BlockFetcher {
public: public:
BlockFetcher(RandomAccessFileReader* file, BlockFetcher(RandomAccessFileReader* file,
FilePrefetchBuffer* prefetch_buffer, const Footer& footer, FilePrefetchBuffer* prefetch_buffer,
const ReadOptions& read_options, const BlockHandle& handle, const Footer& footer /* ref retained */,
BlockContents* contents, const ImmutableOptions& ioptions, const ReadOptions& read_options,
const BlockHandle& handle /* ref retained */,
BlockContents* contents,
const ImmutableOptions& ioptions /* ref retained */,
bool do_uncompress, bool maybe_compressed, BlockType block_type, bool do_uncompress, bool maybe_compressed, BlockType block_type,
const UncompressionDict& uncompression_dict, const UncompressionDict& uncompression_dict /* ref retained */,
const PersistentCacheOptions& cache_options, const PersistentCacheOptions& cache_options /* ref retained */,
MemoryAllocator* memory_allocator = nullptr, MemoryAllocator* memory_allocator = nullptr,
MemoryAllocator* memory_allocator_compressed = nullptr, MemoryAllocator* memory_allocator_compressed = nullptr,
bool for_compaction = false) bool for_compaction = false)
@ -57,7 +60,7 @@ class BlockFetcher {
maybe_compressed_(maybe_compressed), maybe_compressed_(maybe_compressed),
block_type_(block_type), block_type_(block_type),
block_size_(static_cast<size_t>(handle_.size())), block_size_(static_cast<size_t>(handle_.size())),
block_size_with_trailer_(block_size(handle_)), block_size_with_trailer_(block_size_ + footer.GetBlockTrailerSize()),
uncompression_dict_(uncompression_dict), uncompression_dict_(uncompression_dict),
cache_options_(cache_options), cache_options_(cache_options),
memory_allocator_(memory_allocator), memory_allocator_(memory_allocator),
@ -67,7 +70,12 @@ class BlockFetcher {
} }
IOStatus ReadBlockContents(); IOStatus ReadBlockContents();
CompressionType get_compression_type() const { return compression_type_; } inline CompressionType get_compression_type() const {
return compression_type_;
}
inline size_t GetBlockSizeWithTrailer() const {
return block_size_with_trailer_;
}
#ifndef NDEBUG #ifndef NDEBUG
int TEST_GetNumStackBufMemcpy() const { return num_stack_buf_memcpy_; } int TEST_GetNumStackBufMemcpy() const { return num_stack_buf_memcpy_; }
@ -126,6 +134,6 @@ class BlockFetcher {
void GetBlockContents(); void GetBlockContents();
void InsertCompressedBlockToPersistentCacheIfNeeded(); void InsertCompressedBlockToPersistentCacheIfNeeded();
void InsertUncompressedBlockToPersistentCacheIfNeeded(); void InsertUncompressedBlockToPersistentCacheIfNeeded();
void CheckBlockChecksum(); void ProcessTrailerIfPresent();
}; };
} // namespace ROCKSDB_NAMESPACE } // namespace ROCKSDB_NAMESPACE

@@ -68,10 +68,9 @@ class CuckooBuilderTest : public testing::Test {
     ImmutableOptions ioptions(options);
     // Assert Table Properties.
-    TableProperties* props = nullptr;
+    std::unique_ptr<TableProperties> props;
     ASSERT_OK(ReadTableProperties(file_reader.get(), read_file_size,
-                                  kCuckooTableMagicNumber, ioptions,
-                                  &props, true /* compression_type_missing */));
+                                  kCuckooTableMagicNumber, ioptions, &props));
     // Check unused bucket.
     std::string unused_key = props->user_collected_properties[
         CuckooTablePropertyNames::kEmptyKey];
@@ -108,7 +107,6 @@ class CuckooBuilderTest : public testing::Test {
     ASSERT_EQ(props->raw_key_size, keys.size()*props->fixed_key_len);
     ASSERT_EQ(props->column_family_id, 0);
     ASSERT_EQ(props->column_family_name, kDefaultColumnFamilyName);
-    delete props;
     // Check contents of the bucket.
     std::vector<bool> keys_found(keys.size(), false);

@@ -58,14 +58,16 @@ CuckooTableReader::CuckooTableReader(
     status_ = Status::InvalidArgument("File is not mmaped");
     return;
   }
-  TableProperties* props = nullptr;
-  status_ = ReadTableProperties(file_.get(), file_size, kCuckooTableMagicNumber,
-      ioptions, &props, true /* compression_type_missing */);
-  if (!status_.ok()) {
-    return;
-  }
-  table_props_.reset(props);
-  auto& user_props = props->user_collected_properties;
+  {
+    std::unique_ptr<TableProperties> props;
+    status_ = ReadTableProperties(file_.get(), file_size,
+                                  kCuckooTableMagicNumber, ioptions, &props);
+    if (!status_.ok()) {
+      return;
+    }
+    table_props_ = std::move(props);
+  }
+  auto& user_props = table_props_->user_collected_properties;
   auto hash_funs = user_props.find(CuckooTablePropertyNames::kNumHashFunc);
   if (hash_funs == user_props.end()) {
     status_ = Status::Corruption("Number of hash functions not found");
@@ -79,7 +81,7 @@ CuckooTableReader::CuckooTableReader(
   }
   unused_key_ = unused_key->second;
-  key_length_ = static_cast<uint32_t>(props->fixed_key_len);
+  key_length_ = static_cast<uint32_t>(table_props_->fixed_key_len);
   auto user_key_len = user_props.find(CuckooTablePropertyNames::kUserKeyLength);
   if (user_key_len == user_props.end()) {
     status_ = Status::Corruption("User key length not found");

@ -97,8 +97,10 @@ const BlockHandle BlockHandle::kNullBlockHandle(0, 0);
void IndexValue::EncodeTo(std::string* dst, bool have_first_key, void IndexValue::EncodeTo(std::string* dst, bool have_first_key,
const BlockHandle* previous_handle) const { const BlockHandle* previous_handle) const {
if (previous_handle) { if (previous_handle) {
// WART: this is specific to Block-based table
assert(handle.offset() == previous_handle->offset() + assert(handle.offset() == previous_handle->offset() +
previous_handle->size() + kBlockTrailerSize); previous_handle->size() +
BlockBasedTable::kBlockTrailerSize);
PutVarsignedint64(dst, handle.size() - previous_handle->size()); PutVarsignedint64(dst, handle.size() - previous_handle->size());
} else { } else {
handle.EncodeTo(dst); handle.EncodeTo(dst);
@ -117,8 +119,9 @@ Status IndexValue::DecodeFrom(Slice* input, bool have_first_key,
if (!GetVarsignedint64(input, &delta)) { if (!GetVarsignedint64(input, &delta)) {
return Status::Corruption("bad delta-encoded index value"); return Status::Corruption("bad delta-encoded index value");
} }
handle = BlockHandle( // WART: this is specific to Block-based table
previous_handle->offset() + previous_handle->size() + kBlockTrailerSize, handle = BlockHandle(previous_handle->offset() + previous_handle->size() +
BlockBasedTable::kBlockTrailerSize,
previous_handle->size() + delta); previous_handle->size() + delta);
} else { } else {
Status s = handle.DecodeFrom(input); Status s = handle.DecodeFrom(input);
@ -163,6 +166,18 @@ inline uint64_t UpconvertLegacyFooterFormat(uint64_t magic_number) {
} }
} // namespace } // namespace
void Footer::set_table_magic_number(uint64_t magic_number) {
assert(!HasInitializedTableMagicNumber());
table_magic_number_ = magic_number;
if (magic_number == kBlockBasedTableMagicNumber ||
magic_number == kLegacyBlockBasedTableMagicNumber) {
block_trailer_size_ =
static_cast<uint8_t>(BlockBasedTable::kBlockTrailerSize);
} else {
block_trailer_size_ = 0;
}
}
// legacy footer format: // legacy footer format:
// metaindex handle (varint64 offset, varint64 size) // metaindex handle (varint64 offset, varint64 size)
// index handle (varint64 offset, varint64 size) // index handle (varint64 offset, varint64 size)
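
Reviewer note on the hunk above: Footer::set_table_magic_number now derives block_trailer_size_ from the magic number, so readers of other table formats no longer fetch five bytes that were never written. The following is a minimal sketch of that decision, not the actual Footer code. The kDemo* constants and Demo* functions are hypothetical names with placeholder values, chosen only so the example compiles on its own.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical stand-ins for the real table magic numbers (placeholder values).
constexpr uint64_t kDemoBlockBasedMagic = 0x1111;
constexpr uint64_t kDemoLegacyBlockBasedMagic = 0x2222;
constexpr uint64_t kDemoPlainTableMagic = 0x3333;

// 1-byte compression type + 32-bit checksum, as described in the diff.
constexpr size_t kDemoBlockTrailerSize = 5;

// Mirrors the branch in the diff: only block-based formats carry a trailer.
inline size_t DemoTrailerSizeForMagic(uint64_t magic) {
  if (magic == kDemoBlockBasedMagic || magic == kDemoLegacyBlockBasedMagic) {
    return kDemoBlockTrailerSize;
  }
  return 0;
}

// Number of bytes a reader would request for a block with this payload size.
inline uint64_t DemoBytesToRead(uint64_t payload_size, uint64_t magic) {
  assert(payload_size <=
         std::numeric_limits<uint64_t>::max() - kDemoBlockTrailerSize);
  return payload_size + DemoTrailerSizeForMagic(magic);
}
```

With these placeholder values, a 100-byte payload in a block-based file is read as 105 bytes, while the same payload in a plain table file is read as exactly 100.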

@ -185,12 +185,13 @@ class Footer {
// convert this object to a human readable form // convert this object to a human readable form
std::string ToString() const; std::string ToString() const;
// Block trailer size used by file with this footer (e.g. 5 for block-based
// table and 0 for plain table)
inline size_t GetBlockTrailerSize() const { return block_trailer_size_; }
private: private:
// REQUIRES: magic number wasn't initialized. // REQUIRES: magic number wasn't initialized.
void set_table_magic_number(uint64_t magic_number) { void set_table_magic_number(uint64_t magic_number);
assert(!HasInitializedTableMagicNumber());
table_magic_number_ = magic_number;
}
// return true if @table_magic_number_ is set to a value different // return true if @table_magic_number_ is set to a value different
// from @kInvalidTableMagicNumber. // from @kInvalidTableMagicNumber.
@ -200,6 +201,7 @@ class Footer {
uint32_t version_; uint32_t version_;
ChecksumType checksum_; ChecksumType checksum_;
uint8_t block_trailer_size_ = 0; // set based on magic number
BlockHandle metaindex_handle_; BlockHandle metaindex_handle_;
BlockHandle index_handle_; BlockHandle index_handle_;
uint64_t table_magic_number_ = 0; uint64_t table_magic_number_ = 0;
@ -213,19 +215,6 @@ Status ReadFooterFromFile(const IOOptions& opts, RandomAccessFileReader* file,
uint64_t file_size, Footer* footer, uint64_t file_size, Footer* footer,
uint64_t enforce_table_magic_number = 0); uint64_t enforce_table_magic_number = 0);
// 1-byte compression type + 32-bit checksum
static const size_t kBlockTrailerSize = 5;
// Make block size calculation for IO less error prone
inline uint64_t block_size(const BlockHandle& handle) {
return handle.size() + kBlockTrailerSize;
}
inline CompressionType get_block_compression_type(const char* block_data,
size_t block_size) {
return static_cast<CompressionType>(block_data[block_size]);
}
// Computes a checksum using the given ChecksumType. Sometimes we need to // Computes a checksum using the given ChecksumType. Sometimes we need to
// include one more input byte logically at the end but not part of the main // include one more input byte logically at the end but not part of the main
// data buffer. If data_size >= 1, then // data buffer. If data_size >= 1, then
@ -242,12 +231,13 @@ uint32_t ComputeBuiltinChecksumWithLastByte(ChecksumType type, const char* data,
// BlockContents objects representing data read from mmapped files only point // BlockContents objects representing data read from mmapped files only point
// into the mmapped region. // into the mmapped region.
struct BlockContents { struct BlockContents {
Slice data; // Actual contents of data // Points to block payload (without trailer)
Slice data;
CacheAllocationPtr allocation; CacheAllocationPtr allocation;
#ifndef NDEBUG #ifndef NDEBUG
// Whether the block is a raw block, which contains compression type // Whether there is a known trailer after what is pointed to by `data`.
// byte. It is only used for assertion. // See BlockBasedTable::GetCompressionType.
bool is_raw_block = false; bool is_raw_block = false;
#endif // NDEBUG #endif // NDEBUG
@ -269,14 +259,6 @@ struct BlockContents {
// Returns whether the object has ownership of the underlying data bytes. // Returns whether the object has ownership of the underlying data bytes.
bool own_bytes() const { return allocation.get() != nullptr; } bool own_bytes() const { return allocation.get() != nullptr; }
// It's the caller's responsibility to make sure that this is
// for raw block contents, which contains the compression
// byte in the end.
CompressionType get_compression_type() const {
assert(is_raw_block);
return get_block_compression_type(data.data(), data.size());
}
// The additional memory space taken by the block data. // The additional memory space taken by the block data.
size_t usable_size() const { size_t usable_size() const {
if (allocation.get() != nullptr) { if (allocation.get() != nullptr) {

@ -11,18 +11,27 @@
#include "db/table_properties_collector.h" #include "db/table_properties_collector.h"
#include "file/random_access_file_reader.h" #include "file/random_access_file_reader.h"
#include "logging/logging.h" #include "logging/logging.h"
#include "rocksdb/options.h"
#include "rocksdb/table.h" #include "rocksdb/table.h"
#include "rocksdb/table_properties.h" #include "rocksdb/table_properties.h"
#include "table/block_based/block.h" #include "table/block_based/block.h"
#include "table/block_based/reader_common.h"
#include "table/format.h" #include "table/format.h"
#include "table/internal_iterator.h" #include "table/internal_iterator.h"
#include "table/persistent_cache_helper.h" #include "table/persistent_cache_helper.h"
#include "table/sst_file_writer_collectors.h"
#include "table/table_properties_internal.h" #include "table/table_properties_internal.h"
#include "test_util/sync_point.h" #include "test_util/sync_point.h"
#include "util/coding.h" #include "util/coding.h"
namespace ROCKSDB_NAMESPACE { namespace ROCKSDB_NAMESPACE {
const std::string kPropertiesBlock = "rocksdb.properties";
// Old property block name for backward compatibility
const std::string kPropertiesBlockOldName = "rocksdb.stats";
const std::string kCompressionDictBlock = "rocksdb.compression_dict";
const std::string kRangeDelBlock = "rocksdb.range_del";
MetaIndexBuilder::MetaIndexBuilder() MetaIndexBuilder::MetaIndexBuilder()
: meta_index_block_(new BlockBuilder(1 /* restart interval */)) {} : meta_index_block_(new BlockBuilder(1 /* restart interval */)) {}
@ -211,40 +220,33 @@ bool NotifyCollectTableCollectorsOnFinish(
return all_succeeded; return all_succeeded;
} }
Status ReadProperties(const ReadOptions& read_options, // FIXME: should be a parameter for reading table properties to use persistent
const Slice& handle_value, RandomAccessFileReader* file, // cache?
FilePrefetchBuffer* prefetch_buffer, const Footer& footer, Status ReadTablePropertiesHelper(
const ImmutableOptions& ioptions, const ReadOptions& ro, const BlockHandle& handle,
TableProperties** table_properties, bool verify_checksum, RandomAccessFileReader* file, FilePrefetchBuffer* prefetch_buffer,
BlockHandle* ret_block_handle, const Footer& footer, const ImmutableOptions& ioptions,
CacheAllocationPtr* verification_buf, std::unique_ptr<TableProperties>* table_properties,
bool /*compression_type_missing*/,
MemoryAllocator* memory_allocator) { MemoryAllocator* memory_allocator) {
assert(table_properties); assert(table_properties);
Slice v = handle_value; // If this is an external SST file ingested with write_global_seqno set to
BlockHandle handle; // true, then we expect the checksum mismatch because checksum was written
if (!handle.DecodeFrom(&v).ok()) { // by SstFileWriter, but its global seqno in the properties block may have
return Status::InvalidArgument("Failed to decode properties block handle"); // been changed during ingestion. For this reason, we initially read
} // and process without checksum verification, then later try checksum
// verification so that if it fails, we can copy to a temporary buffer with
// global seqno set to its original value, i.e. 0, and attempt checksum
// verification again.
ReadOptions modified_ro = ro;
modified_ro.verify_checksums = false;
BlockContents block_contents; BlockContents block_contents;
Status s; BlockFetcher block_fetcher(file, prefetch_buffer, footer, modified_ro, handle,
// FIXME: should be a parameter for reading table properties to use persistent
// cache
PersistentCacheOptions cache_options;
ReadOptions ro = read_options;
ro.verify_checksums = verify_checksum;
BlockFetcher block_fetcher(file, prefetch_buffer, footer, ro, handle,
&block_contents, ioptions, false /* decompress */, &block_contents, ioptions, false /* decompress */,
false /*maybe_compressed*/, BlockType::kProperties, false /*maybe_compressed*/, BlockType::kProperties,
UncompressionDict::GetEmptyDict(), cache_options, UncompressionDict::GetEmptyDict(),
memory_allocator); PersistentCacheOptions::kEmpty, memory_allocator);
s = block_fetcher.ReadBlockContents(); Status s = block_fetcher.ReadBlockContents();
// property block is never compressed. Need to add uncompress logic if we are
// to compress it..
if (!s.ok()) { if (!s.ok()) {
return s; return s;
} }
@ -254,7 +256,7 @@ Status ReadProperties(const ReadOptions& read_options,
properties_block.NewDataIterator(BytewiseComparator(), properties_block.NewDataIterator(BytewiseComparator(),
kDisableGlobalSequenceNumber, &iter); kDisableGlobalSequenceNumber, &iter);
auto new_table_properties = new TableProperties(); std::unique_ptr<TableProperties> new_table_properties{new TableProperties};
// All pre-defined properties of type uint64_t // All pre-defined properties of type uint64_t
std::unordered_map<std::string, uint64_t*> predefined_uint64_properties = { std::unordered_map<std::string, uint64_t*> predefined_uint64_properties = {
{TablePropertiesNames::kOriginalFileNumber, {TablePropertiesNames::kOriginalFileNumber,
@ -370,21 +372,30 @@ Status ReadProperties(const ReadOptions& read_options,
{key, raw_val.ToString()}); {key, raw_val.ToString()});
} }
} }
if (s.ok()) {
*table_properties = new_table_properties; // Modified version of BlockFetcher checksum verification
if (ret_block_handle != nullptr) { // (See write_global_seqno comment above)
*ret_block_handle = handle; if (s.ok() && footer.GetBlockTrailerSize() > 0) {
s = VerifyBlockChecksum(footer.checksum(), properties_block.data(),
properties_block.size(), file->file_name(),
handle.offset());
if (s.IsCorruption()) {
const auto seqno_pos_iter = new_table_properties->properties_offsets.find(
ExternalSstFilePropertyNames::kGlobalSeqno);
if (seqno_pos_iter != new_table_properties->properties_offsets.end()) {
std::string tmp_buf(properties_block.data(),
block_fetcher.GetBlockSizeWithTrailer());
uint64_t global_seqno_offset = seqno_pos_iter->second - handle.offset();
EncodeFixed64(&tmp_buf[static_cast<size_t>(global_seqno_offset)], 0);
s = VerifyBlockChecksum(footer.checksum(), tmp_buf.data(),
properties_block.size(), file->file_name(),
handle.offset());
} }
if (verification_buf != nullptr) {
size_t len = static_cast<size_t>(handle.size() + kBlockTrailerSize);
*verification_buf =
ROCKSDB_NAMESPACE::AllocateBlock(len, memory_allocator);
if (verification_buf->get() != nullptr) {
memcpy(verification_buf->get(), block_contents.data.data(), len);
} }
} }
} else {
delete new_table_properties; if (s.ok()) {
*table_properties = std::move(new_table_properties);
} }
return s; return s;
@ -393,101 +404,91 @@ Status ReadProperties(const ReadOptions& read_options,
Status ReadTableProperties(RandomAccessFileReader* file, uint64_t file_size, Status ReadTableProperties(RandomAccessFileReader* file, uint64_t file_size,
uint64_t table_magic_number, uint64_t table_magic_number,
const ImmutableOptions& ioptions, const ImmutableOptions& ioptions,
TableProperties** properties, std::unique_ptr<TableProperties>* properties,
bool compression_type_missing,
MemoryAllocator* memory_allocator, MemoryAllocator* memory_allocator,
FilePrefetchBuffer* prefetch_buffer) { FilePrefetchBuffer* prefetch_buffer) {
// -- Read metaindex block BlockHandle block_handle;
Footer footer; Footer footer;
IOOptions opts; Status s = FindMetaBlockInFile(file, file_size, table_magic_number, ioptions,
auto s = ReadFooterFromFile(opts, file, prefetch_buffer, file_size, &footer, kPropertiesBlock, &block_handle,
table_magic_number); memory_allocator, prefetch_buffer, &footer);
if (!s.ok()) { if (!s.ok()) {
return s; return s;
} }
auto metaindex_handle = footer.metaindex_handle(); if (!block_handle.IsNull()) {
BlockContents metaindex_contents; s = ReadTablePropertiesHelper(ReadOptions(), block_handle, file,
ReadOptions read_options; prefetch_buffer, footer, ioptions, properties,
read_options.verify_checksums = false; memory_allocator);
PersistentCacheOptions cache_options;
BlockFetcher block_fetcher(
file, prefetch_buffer, footer, read_options, metaindex_handle,
&metaindex_contents, ioptions, false /* decompress */,
false /*maybe_compressed*/, BlockType::kMetaIndex,
UncompressionDict::GetEmptyDict(), cache_options, memory_allocator);
s = block_fetcher.ReadBlockContents();
if (!s.ok()) {
return s;
}
// property blocks are never compressed. Need to add uncompress logic if we
// are to compress it.
Block metaindex_block(std::move(metaindex_contents));
std::unique_ptr<InternalIterator> meta_iter(metaindex_block.NewDataIterator(
BytewiseComparator(), kDisableGlobalSequenceNumber));
// -- Read property block
bool found_properties_block = true;
s = SeekToPropertiesBlock(meta_iter.get(), &found_properties_block);
if (!s.ok()) {
return s;
}
TableProperties table_properties;
if (found_properties_block == true) {
s = ReadProperties(
read_options, meta_iter->value(), file, prefetch_buffer, footer,
ioptions, properties, false /* verify_checksum */,
nullptr /* ret_block_hanel */, nullptr /* ret_block_contents */,
compression_type_missing, memory_allocator);
} else { } else {
s = Status::NotFound(); s = Status::NotFound();
} }
return s; return s;
} }
Status FindMetaBlock(InternalIterator* meta_index_iter, Status FindOptionalMetaBlock(InternalIterator* meta_index_iter,
const std::string& meta_block_name, const std::string& meta_block_name,
BlockHandle* block_handle) { BlockHandle* block_handle) {
assert(block_handle != nullptr);
meta_index_iter->Seek(meta_block_name); meta_index_iter->Seek(meta_block_name);
if (meta_index_iter->status().ok()) {
if (meta_index_iter->Valid() && meta_index_iter->key() == meta_block_name) {
Slice v = meta_index_iter->value();
return block_handle->DecodeFrom(&v);
} else if (meta_block_name == kPropertiesBlock) {
// Have to try old name for compatibility
meta_index_iter->Seek(kPropertiesBlockOldName);
if (meta_index_iter->status().ok() && meta_index_iter->Valid() && if (meta_index_iter->status().ok() && meta_index_iter->Valid() &&
meta_index_iter->key() == meta_block_name) { meta_index_iter->key() == kPropertiesBlockOldName) {
Slice v = meta_index_iter->value(); Slice v = meta_index_iter->value();
return block_handle->DecodeFrom(&v); return block_handle->DecodeFrom(&v);
} else { }
}
}
// else
*block_handle = BlockHandle::NullBlockHandle();
return meta_index_iter->status();
}
Status FindMetaBlock(InternalIterator* meta_index_iter,
const std::string& meta_block_name,
BlockHandle* block_handle) {
Status s =
FindOptionalMetaBlock(meta_index_iter, meta_block_name, block_handle);
if (s.ok() && block_handle->IsNull()) {
return Status::Corruption("Cannot find the meta block", meta_block_name); return Status::Corruption("Cannot find the meta block", meta_block_name);
} else {
return s;
} }
} }
Status FindMetaBlock(RandomAccessFileReader* file, uint64_t file_size, Status FindMetaBlockInFile(RandomAccessFileReader* file, uint64_t file_size,
uint64_t table_magic_number, uint64_t table_magic_number,
const ImmutableOptions& ioptions, const ImmutableOptions& ioptions,
const std::string& meta_block_name, const std::string& meta_block_name,
BlockHandle* block_handle, BlockHandle* block_handle,
bool /*compression_type_missing*/, MemoryAllocator* memory_allocator,
MemoryAllocator* memory_allocator) { FilePrefetchBuffer* prefetch_buffer,
Footer* footer_out) {
Footer footer; Footer footer;
IOOptions opts; IOOptions opts;
auto s = ReadFooterFromFile(opts, file, nullptr /* prefetch_buffer */, auto s = ReadFooterFromFile(opts, file, prefetch_buffer, file_size, &footer,
file_size, &footer, table_magic_number); table_magic_number);
if (!s.ok()) { if (!s.ok()) {
return s; return s;
} }
if (footer_out) {
*footer_out = footer;
}
auto metaindex_handle = footer.metaindex_handle(); auto metaindex_handle = footer.metaindex_handle();
BlockContents metaindex_contents; BlockContents metaindex_contents;
ReadOptions read_options; s = BlockFetcher(file, prefetch_buffer, footer, ReadOptions(),
read_options.verify_checksums = false;
PersistentCacheOptions cache_options;
BlockFetcher block_fetcher(
file, nullptr /* prefetch_buffer */, footer, read_options,
metaindex_handle, &metaindex_contents, ioptions, metaindex_handle, &metaindex_contents, ioptions,
false /* do decompression */, false /*maybe_compressed*/, false /* do decompression */, false /*maybe_compressed*/,
BlockType::kMetaIndex, UncompressionDict::GetEmptyDict(), cache_options, BlockType::kMetaIndex, UncompressionDict::GetEmptyDict(),
memory_allocator); PersistentCacheOptions::kEmpty, memory_allocator)
s = block_fetcher.ReadBlockContents(); .ReadBlockContents();
if (!s.ok()) { if (!s.ok()) {
return s; return s;
} }
@ -507,56 +508,27 @@ Status ReadMetaBlock(RandomAccessFileReader* file,
uint64_t table_magic_number, uint64_t table_magic_number,
const ImmutableOptions& ioptions, const ImmutableOptions& ioptions,
const std::string& meta_block_name, BlockType block_type, const std::string& meta_block_name, BlockType block_type,
BlockContents* contents, bool /*compression_type_missing*/, BlockContents* contents,
MemoryAllocator* memory_allocator) { MemoryAllocator* memory_allocator) {
Status status; // TableProperties requires special handling because of checksum issues.
Footer footer; // Call ReadTableProperties instead for that case.
IOOptions opts; assert(block_type != BlockType::kProperties);
status = ReadFooterFromFile(opts, file, prefetch_buffer, file_size, &footer,
table_magic_number);
if (!status.ok()) {
return status;
}
// Reading metaindex block
auto metaindex_handle = footer.metaindex_handle();
BlockContents metaindex_contents;
ReadOptions read_options;
read_options.verify_checksums = false;
PersistentCacheOptions cache_options;
BlockFetcher block_fetcher(
file, prefetch_buffer, footer, read_options, metaindex_handle,
&metaindex_contents, ioptions, false /* decompress */,
false /*maybe_compressed*/, BlockType::kMetaIndex,
UncompressionDict::GetEmptyDict(), cache_options, memory_allocator);
status = block_fetcher.ReadBlockContents();
if (!status.ok()) {
return status;
}
// meta block is never compressed. Need to add uncompress logic if we are to
// compress it.
// Finding metablock
Block metaindex_block(std::move(metaindex_contents));
std::unique_ptr<InternalIterator> meta_iter;
meta_iter.reset(metaindex_block.NewDataIterator(
BytewiseComparator(), kDisableGlobalSequenceNumber));
BlockHandle block_handle; BlockHandle block_handle;
status = FindMetaBlock(meta_iter.get(), meta_block_name, &block_handle); Footer footer;
Status status = FindMetaBlockInFile(
file, file_size, table_magic_number, ioptions, meta_block_name,
&block_handle, memory_allocator, prefetch_buffer, &footer);
if (!status.ok()) { if (!status.ok()) {
return status; return status;
} }
// Reading metablock return BlockFetcher(file, prefetch_buffer, footer, ReadOptions(),
BlockFetcher block_fetcher2( block_handle, contents, ioptions, false /* decompress */,
file, prefetch_buffer, footer, read_options, block_handle, contents, false /*maybe_compressed*/, block_type,
ioptions, false /* decompress */, false /*maybe_compressed*/, block_type, UncompressionDict::GetEmptyDict(),
UncompressionDict::GetEmptyDict(), cache_options, memory_allocator); PersistentCacheOptions::kEmpty, memory_allocator)
return block_fetcher2.ReadBlockContents(); .ReadBlockContents();
} }
} // namespace ROCKSDB_NAMESPACE } // namespace ROCKSDB_NAMESPACE
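
Reviewer note on ReadTablePropertiesHelper above: the properties block is first read with verify_checksums disabled. The checksum is then verified, and on a mismatch it is verified again on a copy with the global seqno field reset to 0. The sketch below illustrates only that retry, under stated assumptions: DemoChecksum is a toy FNV-1a hash standing in for the real ChecksumType, the trailer's compression type byte is left out, and all Demo* names are hypothetical. In the diff, the offset comes from properties_offsets minus handle.offset(). Here it is passed in directly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Toy checksum (FNV-1a, 32-bit). An assumption for illustration only; the
// real code dispatches on the footer's ChecksumType.
inline uint32_t DemoChecksum(const std::string& data) {
  uint32_t h = 2166136261u;
  for (unsigned char c : data) {
    h ^= c;
    h *= 16777619u;
  }
  return h;
}

// Overwrites 8 bytes at `offset` with `value` in little-endian order.
inline void DemoEncodeFixed64(std::string* buf, size_t offset, uint64_t value) {
  assert(buf != nullptr && offset + 8 <= buf->size());
  for (int i = 0; i < 8; ++i) {
    (*buf)[offset + i] = static_cast<char>((value >> (8 * i)) & 0xff);
  }
}

// Returns true if `block` matches `expected` as-is, or matches after the
// global seqno field (when present) is reset to its original value, 0.
inline bool DemoVerifyWithSeqnoFallback(const std::string& block,
                                        uint32_t expected, bool has_seqno,
                                        size_t seqno_offset) {
  if (DemoChecksum(block) == expected) {
    return true;
  }
  if (!has_seqno) {
    return false;  // nothing to undo, so the mismatch is reported
  }
  std::string tmp = block;  // verify on a copy; the block itself is untouched
  DemoEncodeFixed64(&tmp, seqno_offset, 0);
  return DemoChecksum(tmp) == expected;
}
```

The copy matters because the caller has already parsed the properties from the original buffer, and only the verification needs the pre-ingestion bytes.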

@ -30,6 +30,12 @@ class Logger;
class RandomAccessFile; class RandomAccessFile;
struct TableProperties; struct TableProperties;
// Meta block names for metaindex
extern const std::string kPropertiesBlock;
extern const std::string kPropertiesBlockOldName;
extern const std::string kCompressionDictBlock;
extern const std::string kRangeDelBlock;
class MetaIndexBuilder { class MetaIndexBuilder {
public: public:
MetaIndexBuilder(const MetaIndexBuilder&) = delete; MetaIndexBuilder(const MetaIndexBuilder&) = delete;
@ -95,49 +101,49 @@ bool NotifyCollectTableCollectorsOnFinish(
const std::vector<std::unique_ptr<IntTblPropCollector>>& collectors, const std::vector<std::unique_ptr<IntTblPropCollector>>& collectors,
Logger* info_log, PropertyBlockBuilder* builder); Logger* info_log, PropertyBlockBuilder* builder);
// Read the properties from the table. // Read table properties from a file using known BlockHandle.
// @returns a status to indicate if the operation succeeded. On success, // @returns a status to indicate if the operation succeeded. On success,
// *table_properties will point to a heap-allocated TableProperties // *table_properties will point to a heap-allocated TableProperties
// object, otherwise value of `table_properties` will not be modified. // object, otherwise value of `table_properties` will not be modified.
Status ReadProperties(const ReadOptions& ro, const Slice& handle_value, Status ReadTablePropertiesHelper(
RandomAccessFileReader* file, const ReadOptions& ro, const BlockHandle& handle,
FilePrefetchBuffer* prefetch_buffer, const Footer& footer, RandomAccessFileReader* file, FilePrefetchBuffer* prefetch_buffer,
const ImmutableOptions& ioptions, const Footer& footer, const ImmutableOptions& ioptions,
TableProperties** table_properties, bool verify_checksum, std::unique_ptr<TableProperties>* table_properties,
BlockHandle* block_handle,
CacheAllocationPtr* verification_buf,
bool compression_type_missing = false,
MemoryAllocator* memory_allocator = nullptr); MemoryAllocator* memory_allocator = nullptr);
// Directly read the properties from the properties block of a plain table. // Read table properties from the properties block of a plain table.
// @returns a status to indicate if the operation succeeded. On success, // @returns a status to indicate if the operation succeeded. On success,
// *table_properties will point to a heap-allocated TableProperties // *table_properties will point to a heap-allocated TableProperties
// object, otherwise value of `table_properties` will not be modified. // object, otherwise value of `table_properties` will not be modified.
// certain tables do not have compression_type byte setup properly for
// uncompressed blocks, caller can request to reset compression type by
// passing compression_type_missing = true, the same applies to
// `ReadProperties`, `FindMetaBlock`, and `ReadMetaBlock`
Status ReadTableProperties(RandomAccessFileReader* file, uint64_t file_size, Status ReadTableProperties(RandomAccessFileReader* file, uint64_t file_size,
uint64_t table_magic_number, uint64_t table_magic_number,
const ImmutableOptions& ioptions, const ImmutableOptions& ioptions,
TableProperties** properties, std::unique_ptr<TableProperties>* properties,
bool compression_type_missing = false,
MemoryAllocator* memory_allocator = nullptr, MemoryAllocator* memory_allocator = nullptr,
FilePrefetchBuffer* prefetch_buffer = nullptr); FilePrefetchBuffer* prefetch_buffer = nullptr);
// Find the meta block from the meta index block. // Find the meta block from the meta index block. Returns OK and
// block_handle->IsNull() if not found.
Status FindOptionalMetaBlock(InternalIterator* meta_index_iter,
const std::string& meta_block_name,
BlockHandle* block_handle);
// Find the meta block from the meta index block. Returns Corruption if not
// found.
Status FindMetaBlock(InternalIterator* meta_index_iter, Status FindMetaBlock(InternalIterator* meta_index_iter,
const std::string& meta_block_name, const std::string& meta_block_name,
BlockHandle* block_handle); BlockHandle* block_handle);
// Find the meta block // Find the meta block
Status FindMetaBlock(RandomAccessFileReader* file, uint64_t file_size, Status FindMetaBlockInFile(RandomAccessFileReader* file, uint64_t file_size,
uint64_t table_magic_number, uint64_t table_magic_number,
const ImmutableOptions& ioptions, const ImmutableOptions& ioptions,
const std::string& meta_block_name, const std::string& meta_block_name,
BlockHandle* block_handle, BlockHandle* block_handle,
bool compression_type_missing = false, MemoryAllocator* memory_allocator = nullptr,
MemoryAllocator* memory_allocator = nullptr); FilePrefetchBuffer* prefetch_buffer = nullptr,
Footer* footer_out = nullptr);
// Read the specified meta block with name meta_block_name // Read the specified meta block with name meta_block_name
// from `file` and initialize `contents` with contents of this block. // from `file` and initialize `contents` with contents of this block.
@ -148,7 +154,6 @@ Status ReadMetaBlock(RandomAccessFileReader* file,
const ImmutableOptions& ioptions, const ImmutableOptions& ioptions,
const std::string& meta_block_name, BlockType block_type, const std::string& meta_block_name, BlockType block_type,
BlockContents* contents, BlockContents* contents,
bool compression_type_missing = false,
MemoryAllocator* memory_allocator = nullptr); MemoryAllocator* memory_allocator = nullptr);
} // namespace ROCKSDB_NAMESPACE } // namespace ROCKSDB_NAMESPACE
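
Reviewer note on the header above: FindOptionalMetaBlock returns OK with a null handle when the block is absent, and FindMetaBlock wraps it to turn that case into Corruption. A lookup of kPropertiesBlock also falls back to kPropertiesBlockOldName. A rough sketch of that contract follows, assuming a std::map in place of the metaindex iterator. The Demo* types and functions are hypothetical, not RocksDB declarations.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Simplified handle; (0, 0) plays the role of the null block handle.
struct DemoHandle {
  uint64_t offset;
  uint64_t size;
  bool IsNull() const { return offset == 0 && size == 0; }
};

enum class DemoStatus { kOk, kCorruption };

const std::string kDemoPropertiesBlock = "rocksdb.properties";
const std::string kDemoPropertiesBlockOldName = "rocksdb.stats";

using DemoMetaIndex = std::map<std::string, DemoHandle>;

// Absent block: OK status with a null handle. A properties lookup also
// tries the old block name for backward compatibility.
inline DemoStatus DemoFindOptionalMetaBlock(const DemoMetaIndex& index,
                                            const std::string& name,
                                            DemoHandle* out) {
  assert(out != nullptr);
  auto it = index.find(name);
  if (it == index.end() && name == kDemoPropertiesBlock) {
    it = index.find(kDemoPropertiesBlockOldName);
  }
  *out = (it != index.end()) ? it->second : DemoHandle{0, 0};
  return DemoStatus::kOk;
}

// Required block: a null handle is reported as corruption.
inline DemoStatus DemoFindMetaBlock(const DemoMetaIndex& index,
                                    const std::string& name, DemoHandle* out) {
  DemoStatus s = DemoFindOptionalMetaBlock(index, name, out);
  if (s == DemoStatus::kOk && out->IsNull()) {
    return DemoStatus::kCorruption;
  }
  return s;
}
```

Splitting the two lets a caller such as ReadTableProperties map a missing block to NotFound instead of Corruption, as the .cc hunk does.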

@ -9,6 +9,8 @@
namespace ROCKSDB_NAMESPACE { namespace ROCKSDB_NAMESPACE {
const PersistentCacheOptions PersistentCacheOptions::kEmpty;
void PersistentCacheHelper::InsertRawPage( void PersistentCacheHelper::InsertRawPage(
const PersistentCacheOptions& cache_options, const BlockHandle& handle, const PersistentCacheOptions& cache_options, const BlockHandle& handle,
const char* data, const size_t size) { const char* data, const size_t size) {
@ -70,7 +72,8 @@ Status PersistentCacheHelper::LookupRawPage(
} }
// cache hit // cache hit
assert(raw_data_size == handle.size() + kBlockTrailerSize); // Block-based table is assumed
assert(raw_data_size == handle.size() + BlockBasedTable::kBlockTrailerSize);
assert(size == raw_data_size); assert(size == raw_data_size);
RecordTick(cache_options.statistics, PERSISTENT_CACHE_HIT); RecordTick(cache_options.statistics, PERSISTENT_CACHE_HIT);
return Status::OK(); return Status::OK();

@ -29,6 +29,8 @@ struct PersistentCacheOptions {
std::shared_ptr<PersistentCache> persistent_cache; std::shared_ptr<PersistentCache> persistent_cache;
std::string key_prefix; std::string key_prefix;
Statistics* statistics = nullptr; Statistics* statistics = nullptr;
static const PersistentCacheOptions kEmpty;
}; };
} // namespace ROCKSDB_NAMESPACE } // namespace ROCKSDB_NAMESPACE

@ -129,11 +129,9 @@ Status PlainTableReader::Open(
return Status::NotSupported("File is too large for PlainTableReader!"); return Status::NotSupported("File is too large for PlainTableReader!");
} }
TableProperties* props_ptr = nullptr; std::unique_ptr<TableProperties> props;
auto s = ReadTableProperties(file.get(), file_size, kPlainTableMagicNumber, auto s = ReadTableProperties(file.get(), file_size, kPlainTableMagicNumber,
ioptions, &props_ptr, ioptions, &props);
true /* compression_type_missing */);
std::shared_ptr<TableProperties> props(props_ptr);
if (!s.ok()) { if (!s.ok()) {
return s; return s;
} }
@ -186,7 +184,7 @@ Status PlainTableReader::Open(
new_reader->full_scan_mode_ = true; new_reader->full_scan_mode_ = true;
} }
// PopulateIndex can add to the props, so don't store them until now // PopulateIndex can add to the props, so don't store them until now
new_reader->table_properties_ = props; new_reader->table_properties_ = std::move(props);
if (immortal_table && new_reader->file_info_.is_mmap_mode) { if (immortal_table && new_reader->file_info_.is_mmap_mode) {
new_reader->dummy_cleanable_.reset(new Cleanable()); new_reader->dummy_cleanable_.reset(new Cleanable());
@ -308,8 +306,7 @@ Status PlainTableReader::PopulateIndex(TableProperties* props,
Status s = ReadMetaBlock(file_info_.file.get(), nullptr /* prefetch_buffer */, Status s = ReadMetaBlock(file_info_.file.get(), nullptr /* prefetch_buffer */,
file_size_, kPlainTableMagicNumber, ioptions_, file_size_, kPlainTableMagicNumber, ioptions_,
PlainTableIndexBuilder::kPlainTableIndexBlock, PlainTableIndexBuilder::kPlainTableIndexBlock,
BlockType::kIndex, &index_block_contents, BlockType::kIndex, &index_block_contents);
true /* compression_type_missing */);
bool index_in_file = s.ok(); bool index_in_file = s.ok();
@ -320,8 +317,7 @@ Status PlainTableReader::PopulateIndex(TableProperties* props,
s = ReadMetaBlock(file_info_.file.get(), nullptr /* prefetch_buffer */, s = ReadMetaBlock(file_info_.file.get(), nullptr /* prefetch_buffer */,
file_size_, kPlainTableMagicNumber, ioptions_, file_size_, kPlainTableMagicNumber, ioptions_,
BloomBlockBuilder::kBloomBlock, BlockType::kFilter, BloomBlockBuilder::kBloomBlock, BlockType::kFilter,
&bloom_block_contents, &bloom_block_contents);
true /* compression_type_missing */);
bloom_in_file = s.ok() && bloom_block_contents.data.size() > 0; bloom_in_file = s.ok() && bloom_block_contents.data.size() > 0;
} }
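
Reviewer note on the PlainTableReader hunks above: the reader now receives a std::unique_ptr<TableProperties> and moves it into its stored member, relying on the unique_ptr to shared_ptr conversion mentioned in the summary. The sketch below shows that ownership hand-off in isolation. DemoProps, DemoReadProps, DemoReader and DemoOpen are hypothetical stand-ins, and the shared_ptr<const ...> member type is an assumption made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>

struct DemoProps {
  uint64_t fixed_key_len;
};

// Hypothetical reader that fills a unique_ptr out-parameter, in the style of
// the new ReadTableProperties signature.
inline bool DemoReadProps(std::unique_ptr<DemoProps>* out) {
  assert(out != nullptr);
  std::unique_ptr<DemoProps> props(new DemoProps{16});
  *out = std::move(props);
  return true;
}

struct DemoReader {
  std::shared_ptr<const DemoProps> table_properties;
};

inline DemoReader DemoOpen() {
  DemoReader reader;
  std::unique_ptr<DemoProps> props;
  if (DemoReadProps(&props)) {
    // Ownership transfers; no raw pointer is exposed and no delete is needed.
    reader.table_properties = std::move(props);
  }
  return reader;
}
```

On an early return before the move, the unique_ptr frees the object on its own, which is what makes the removed `delete props;` and `reset(props_ptr)` lines unnecessary.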

@@ -332,18 +332,16 @@ Status SstFileDumper::ShowCompressionSize(
 return Status::OK();
 }
+// Reads TableProperties prior to opening table reader in order to set up
+// options.
 Status SstFileDumper::ReadTableProperties(uint64_t table_magic_number,
 RandomAccessFileReader* file,
 uint64_t file_size,
 FilePrefetchBuffer* prefetch_buffer) {
-TableProperties* table_properties = nullptr;
 Status s = ROCKSDB_NAMESPACE::ReadTableProperties(
-file, file_size, table_magic_number, ioptions_, &table_properties,
-/* compression_type_missing= */ false,
+file, file_size, table_magic_number, ioptions_, &table_properties_,
 /* memory_allocator= */ nullptr, prefetch_buffer);
-if (s.ok()) {
-table_properties_.reset(table_properties);
-} else {
+if (!s.ok()) {
 if (!silent_) {
 fprintf(stdout, "Not able to read table properties\n");
 }
@@ -488,6 +486,7 @@ Status SstFileDumper::ReadSequential(bool print_kv, uint64_t read_num,
 return ret;
 }
+// Provides TableProperties to API user
 Status SstFileDumper::ReadTableProperties(
 std::shared_ptr<const TableProperties>* table_properties) {
 if (!table_reader_) {
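
The SstFileDumper hunks above replace a raw `TableProperties*` out-parameter with an owning smart pointer. The following is a minimal standalone sketch of that ownership pattern, not RocksDB code: `Props`, `ReadPropsRaw`, `ReadPropsOwned`, and `DemoOwnership` are hypothetical names invented for illustration. It also shows the point made in the summary that a `unique_ptr` can be moved into a `shared_ptr`.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <utility>

// Hypothetical stand-in for a properties struct; not RocksDB's TableProperties.
struct Props {
  uint64_t data_size = 0;
};

// Old style: callee allocates with new, caller must remember to delete.
bool ReadPropsRaw(uint64_t size, Props** out) {
  *out = new Props();
  (*out)->data_size = size;
  return true;
}

// New style: the out-parameter owns the result, so no manual delete is needed.
bool ReadPropsOwned(uint64_t size, std::unique_ptr<Props>* out) {
  out->reset(new Props());
  (*out)->data_size = size;
  return true;
}

// Reads via the owning overload, then hands ownership to a shared_ptr.
std::shared_ptr<const Props> DemoOwnership(uint64_t size) {
  std::unique_ptr<Props> owned;
  bool ok = ReadPropsOwned(size, &owned);
  assert(ok && owned != nullptr);
  (void)ok;
  // Moving a unique_ptr into a shared_ptr transfers ownership.
  std::shared_ptr<const Props> shared = std::move(owned);
  assert(owned == nullptr);
  return shared;
}
```

With the owning form there is no separate guard object or trailing `delete`, which is the same simplification visible in the table_test.cc hunks of this diff.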

@@ -7,12 +7,10 @@
 #include "port/port.h"
 #include "rocksdb/env.h"
-#include "rocksdb/iterator.h"
 #include "rocksdb/unique_id.h"
-#include "table/block_based/block.h"
-#include "table/internal_iterator.h"
 #include "table/table_properties_internal.h"
 #include "table/unique_id_impl.h"
+#include "util/random.h"
 #include "util/string_util.h"
 namespace ROCKSDB_NAMESPACE {
@@ -44,31 +42,6 @@ namespace {
 props, key, ToString(value), prop_delim, kv_delim
 );
 }
-// Seek to the specified meta block.
-// Return true if it successfully seeks to that block.
-Status SeekToMetaBlock(InternalIterator* meta_iter,
-const std::string& block_name, bool* is_found,
-BlockHandle* block_handle = nullptr) {
-if (block_handle != nullptr) {
-*block_handle = BlockHandle::NullBlockHandle();
-}
-*is_found = true;
-meta_iter->Seek(block_name);
-if (meta_iter->status().ok()) {
-if (meta_iter->Valid() && meta_iter->key() == block_name) {
-*is_found = true;
-if (block_handle) {
-Slice v = meta_iter->value();
-return block_handle->DecodeFrom(&v);
-}
-} else {
-*is_found = false;
-return Status::OK();
-}
-}
-return meta_iter->status();
-}
 }
 std::string TableProperties::ToString(
@@ -306,12 +279,6 @@ const std::string TablePropertiesNames::kSlowCompressionEstimatedDataSize =
 const std::string TablePropertiesNames::kFastCompressionEstimatedDataSize =
 "rocksdb.sample_for_compression.fast.data.size";
-extern const std::string kPropertiesBlock = "rocksdb.properties";
-// Old property block name for backward compatibility
-extern const std::string kPropertiesBlockOldName = "rocksdb.stats";
-extern const std::string kCompressionDictBlock = "rocksdb.compression_dict";
-extern const std::string kRangeDelBlock = "rocksdb.range_del";
 #ifndef NDEBUG
 void TEST_SetRandomTableProperties(TableProperties* props) {
 Random* r = Random::GetTLSInstance();
@@ -335,26 +302,4 @@ void TEST_SetRandomTableProperties(TableProperties* props) {
 }
 #endif
-// Seek to the properties block.
-// Return true if it successfully seeks to the properties block.
-Status SeekToPropertiesBlock(InternalIterator* meta_iter, bool* is_found) {
-Status status = SeekToMetaBlock(meta_iter, kPropertiesBlock, is_found);
-if (!*is_found && status.ok()) {
-status = SeekToMetaBlock(meta_iter, kPropertiesBlockOldName, is_found);
-}
-return status;
-}
-// Seek to the compression dictionary block.
-// Return true if it successfully seeks to that block.
-Status SeekToCompressionDictBlock(InternalIterator* meta_iter, bool* is_found,
-BlockHandle* block_handle) {
-return SeekToMetaBlock(meta_iter, kCompressionDictBlock, is_found, block_handle);
-}
-Status SeekToRangeDelBlock(InternalIterator* meta_iter, bool* is_found,
-BlockHandle* block_handle = nullptr) {
-return SeekToMetaBlock(meta_iter, kRangeDelBlock, is_found, block_handle);
-}
 } // namespace ROCKSDB_NAMESPACE

@@ -5,29 +5,9 @@
 #pragma once
-#include "rocksdb/status.h"
 #include "rocksdb/table_properties.h"
-#include "table/internal_iterator.h"
 namespace ROCKSDB_NAMESPACE {
-class BlockHandle;
-// Seek to the properties block.
-// If it successfully seeks to the properties block, "is_found" will be
-// set to true.
-Status SeekToPropertiesBlock(InternalIterator* meta_iter, bool* is_found);
-// Seek to the compression dictionary block.
-// If it successfully seeks to the properties block, "is_found" will be
-// set to true.
-Status SeekToCompressionDictBlock(InternalIterator* meta_iter, bool* is_found,
-BlockHandle* block_handle);
-// TODO(andrewkr) should not put all meta block in table_properties.h/cc
-Status SeekToRangeDelBlock(InternalIterator* meta_iter, bool* is_found,
-BlockHandle* block_handle);
 #ifndef NDEBUG
 void TEST_SetRandomTableProperties(TableProperties* props);
 #endif

@@ -1698,7 +1698,8 @@ TEST_P(BlockBasedTableTest, BasicBlockBasedTableProperties) {
 block_builder.Add(item.first, item.second);
 }
 Slice content = block_builder.Finish();
-ASSERT_EQ(content.size() + kBlockTrailerSize + diff_internal_user_bytes,
+ASSERT_EQ(content.size() + BlockBasedTable::kBlockTrailerSize +
+diff_internal_user_bytes,
 props.data_size);
 c.ResetTableReader();
 }
@@ -3836,11 +3837,9 @@ TEST_F(PlainTableTest, BasicPlainTableProperties) {
 std::unique_ptr<RandomAccessFileReader> file_reader(
 new RandomAccessFileReader(std::move(source), "test"));
-TableProperties* props = nullptr;
+std::unique_ptr<TableProperties> props;
 auto s = ReadTableProperties(file_reader.get(), ss->contents().size(),
-kPlainTableMagicNumber, ioptions,
-&props, true /* compression_type_missing */);
-std::unique_ptr<TableProperties> props_guard(props);
+kPlainTableMagicNumber, ioptions, &props);
 ASSERT_OK(s);
 ASSERT_EQ(0ul, props->index_size);
@@ -4481,10 +4480,10 @@ TEST_P(BlockBasedTableTest, DISABLED_TableWithGlobalSeqno) {
 std::unique_ptr<RandomAccessFileReader> file_reader(
 new RandomAccessFileReader(std::move(source), ""));
-TableProperties* props = nullptr;
+std::unique_ptr<TableProperties> props;
 ASSERT_OK(ReadTableProperties(file_reader.get(), ss_rw.contents().size(),
 kBlockBasedTableMagicNumber, ioptions,
-&props, true /* compression_type_missing */));
+&props));
 UserCollectedProperties user_props = props->user_collected_properties;
 version = DecodeFixed32(
@@ -4493,8 +4492,6 @@ TEST_P(BlockBasedTableTest, DISABLED_TableWithGlobalSeqno) {
 user_props[ExternalSstFilePropertyNames::kGlobalSeqno].c_str());
 global_seqno_offset =
 props->properties_offsets[ExternalSstFilePropertyNames::kGlobalSeqno];
-delete props;
 };
 // Helper function to update the value of the global seqno in the file
@@ -4661,15 +4658,14 @@ TEST_P(BlockBasedTableTest, BlockAlignTest) {
 new RandomAccessFileReader(std::move(source), "test"));
 // Helper function to get version, global_seqno, global_seqno_offset
 std::function<void()> VerifyBlockAlignment = [&]() {
-TableProperties* props = nullptr;
+std::unique_ptr<TableProperties> props;
 ASSERT_OK(ReadTableProperties(file_reader.get(), sink->contents().size(),
-kBlockBasedTableMagicNumber, ioptions, &props,
-true /* compression_type_missing */));
+kBlockBasedTableMagicNumber, ioptions,
+&props));
 uint64_t data_block_size = props->data_size / props->num_data_blocks;
 ASSERT_EQ(data_block_size, 4096);
 ASSERT_EQ(props->data_size, data_block_size * props->num_data_blocks);
-delete props;
 };
 VerifyBlockAlignment();
@@ -4788,16 +4784,13 @@ TEST_P(BlockBasedTableTest, PropertiesBlockRestartPointTest) {
 std::unique_ptr<InternalIterator> meta_iter(metaindex_block.NewDataIterator(
 BytewiseComparator(), kDisableGlobalSequenceNumber));
-bool found_properties_block = true;
-ASSERT_OK(SeekToPropertiesBlock(meta_iter.get(), &found_properties_block));
-ASSERT_TRUE(found_properties_block);
 // -- Read properties block
-Slice v = meta_iter->value();
 BlockHandle properties_handle;
-ASSERT_OK(properties_handle.DecodeFrom(&v));
+ASSERT_OK(FindOptionalMetaBlock(meta_iter.get(), kPropertiesBlock,
+&properties_handle));
+ASSERT_FALSE(properties_handle.IsNull());
 BlockContents properties_contents;
 BlockFetchHelper(properties_handle, BlockType::kProperties,
 &properties_contents);
 Block properties_block(std::move(properties_contents));
