format_version=6 and context-aware block checksums (#9058)

Summary:
## Context checksum
All RocksDB checksums currently use 32 bits of checking
power, which should be 1 in 4 billion false negative (FN) probability (failing to
detect corruption). This is true for random corruptions, and in some cases
small corruptions are guaranteed to be detected. But some possible
corruptions, such as in storage metadata rather than storage payload data,
would have a much higher FN rate. For example:
* Data larger than one SST block is replaced by data from elsewhere in
the same or another SST file. Especially with block_align=true, the
probability of exact block size match is probably around 1 in 100, making
the FN probability around that same. Without `block_align=true` the
probability of same block start location is probably around 1 in 10,000,
for FN probability around 1 in a million.

To solve this problem in new format_version=6, we add "context awareness"
to block checksum checks. The stored and expected checksum value is
modified based on the block's position in the file and which file it is in. The
modifications are cleverly chosen so that, for example
* blocks within about 4GB of each other are guaranteed to use different context
* blocks that are offset by exactly some multiple of 4GiB are guaranteed to use
different context
* files generated by the same process are guaranteed to use different context
for the same offsets, until wrap-around after 2^32 - 1 files

Thus, with format_version=6, if a valid SST block and checksum is misplaced,
its checksum FN probability should be essentially ideal, 1 in 4B.

## Footer checksum
This change also adds checksum protection to the SST footer (with
format_version=6), for the first time without relying on whole file checksum.
To prevent a corruption of the format_version in the footer (e.g. 6 -> 5) to
defeat the footer checksum, we change much of the footer data format
including an "extended magic number" in format_version 6 that would be
interpreted as empty index and metaindex block handles in older footer
versions. We also change the encoding of handles to free up space for
other new data in footer.

## More detail: making space in footer
In order to keep footer the same size in format_version=6 (avoid change to IO
patterns), we have to free up some space for new data. We do this two ways:
* Metaindex block handle is encoded down to 4 bytes (from 10) by assuming
it immediately precedes the footer, and by assuming it is < 4GB.
* Index block handle is moved into metaindex. (I don't know why it was
in footer to begin with.)

## Performance
In case of small performance penalty, I've made a "pay as you go" optimization
to compensate: replace `MutableCFOptions` in BlockBasedTableBuilder::Rep
with the only field used in that structure after construction: `prefix_extractor`.
This makes the PR an overall performance improvement (results below).

Nevertheless I'm seeing essentially no difference going from fv=5 to fv=6,
even including that improvement for both. That's based on extreme case table
write performance testing, many files with many blocks. This is relatively
checksum intensive (small blocks) and salt generation intensive (small files).

```
(for I in `seq 1 100`; do TEST_TMPDIR=/dev/shm/dbbench2 ./db_bench -benchmarks=fillseq -memtablerep=vector -disable_wal=1 -allow_concurrent_memtable_write=false -num=3000000 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -write_buffer_size=100000 -compression_type=none -block_size=1000; done) 2>&1 | grep micros/op | tee out
awk '{ tot += $5; n += 1; } END { print int(1.0 * tot / n) }' < out
```

Each value below is ops/s averaged over 100 runs, run simultaneously with competing
configuration for load fairness

Before -> after (both fv=5): 483530 -> 483673 (negligible)
Re-run 1: 480733 -> 485427 (1.0% faster)
Re-run 2: 483821 -> 484541 (0.1% faster)
Before (fv=5) -> after (fv=6): 482006 -> 485100 (0.6% faster)
Re-run 1: 482212 -> 485075 (0.6% faster)
Re-run 2: 483590 -> 484073 (0.1% faster)
After fv=5 -> after fv=6: 483878 -> 485542 (0.3% faster)
Re-run 1: 485331 -> 483385 (0.4% slower)
Re-run 2: 485283 -> 483435 (0.4% slower)
Re-run 3: 483647 -> 486109 (0.5% faster)

Pull Request resolved: https://github.com/facebook/rocksdb/pull/9058

Test Plan:
unit tests included (table_test, db_properties_test, salt in env_test). General DB tests
and crash test updated to test new format_version.

Also temporarily updated the default format version to 6 and saw some test failures. Almost all
were due to an inadvertent additional read in VerifyChecksum to verify the index block checksum,
though it's arguably a bug that VerifyChecksum does not appear to (re-)verify the index block
checksum, just assuming it was verified in opening the index reader (probably *usually* true but
probably not always true). Some other concerns about VerifyChecksum are left in FIXME
comments. The only remaining test failure on change of default (in block_fetcher_test) now
has a comment about how to upgrade the test.

The format compatibility test does not need updating because we have not updated the default
format_version.

Reviewed By: ajkr, mrambacher

Differential Revision: D33100915

Pulled By: pdillinger

fbshipit-source-id: 8679e3e572fa580181a737fd6d113ed53c5422ee
oxigraph-main
Peter Dillinger 1 year ago committed by Facebook GitHub Bot
parent b3c54186ab
commit 7a1b0207e6
  1. 1
      db/db_impl/db_impl.cc
  2. 3
      db/db_properties_test.cc
  3. 2
      env/env_test.cc
  4. 8
      include/rocksdb/table.h
  5. 71
      table/block_based/block_based_table_builder.cc
  6. 25
      table/block_based/block_based_table_reader.cc
  7. 7
      table/block_based/block_based_table_reader.h
  8. 5
      table/block_based/block_based_table_reader_sync_and_async.h
  9. 2
      table/block_based/index_reader_common.cc
  10. 18
      table/block_based/reader_common.cc
  11. 9
      table/block_based/reader_common.h
  12. 6
      table/block_fetcher.cc
  13. 3
      table/block_fetcher_test.cc
  14. 8
      table/cuckoo/cuckoo_table_builder.cc
  15. 176
      table/format.cc
  16. 73
      table/format.h
  17. 10
      table/meta_blocks.cc
  18. 1
      table/meta_blocks.h
  19. 15
      table/plain/plain_table_builder.cc
  20. 60
      table/table_test.cc
  21. 3
      test_util/testutil.cc
  22. 2
      tools/db_crashtest.py
  23. 1
      tools/sst_dump_test.cc
  24. 1
      unreleased_history/new_features/1_context_checksum.md

@ -5824,6 +5824,7 @@ Status DBImpl::VerifyChecksumInternal(const ReadOptions& read_options,
return s; return s;
} }
} }
// FIXME? What does it mean if read_options.verify_checksums == false?
// TODO: simplify using GetRefedColumnFamilySet? // TODO: simplify using GetRefedColumnFamilySet?
std::vector<ColumnFamilyData*> cfd_list; std::vector<ColumnFamilyData*> cfd_list;

@ -2348,6 +2348,9 @@ TEST_F(DBPropertiesTest, TableMetaIndexKeys) {
EXPECT_EQ("rocksdb.hashindex.prefixes", EXPECT_EQ("rocksdb.hashindex.prefixes",
PopMetaIndexKey(meta_iter.get())); PopMetaIndexKey(meta_iter.get()));
} }
if (bbto->format_version >= 6) {
EXPECT_EQ("rocksdb.index", PopMetaIndexKey(meta_iter.get()));
}
} }
EXPECT_EQ("rocksdb.properties", PopMetaIndexKey(meta_iter.get())); EXPECT_EQ("rocksdb.properties", PopMetaIndexKey(meta_iter.get()));
EXPECT_EQ("NOT_FOUND", PopMetaIndexKey(meta_iter.get())); EXPECT_EQ("NOT_FOUND", PopMetaIndexKey(meta_iter.get()));

2
env/env_test.cc vendored

@ -3054,7 +3054,7 @@ TEST_F(EnvTest, PortGenerateRfcUuid) {
VerifyRfcUuids(t.ids); VerifyRfcUuids(t.ids);
} }
// Test the atomic, linear generation of GenerateRawUuid // Test the atomic, linear generation of GenerateRawUniqueId
TEST_F(EnvTest, GenerateRawUniqueId) { TEST_F(EnvTest, GenerateRawUniqueId) {
struct MyStressTest struct MyStressTest
: public NoDuplicateMiniStressTest<uint64_pair_t, HashUint64Pair> { : public NoDuplicateMiniStressTest<uint64_pair_t, HashUint64Pair> {

@ -47,7 +47,10 @@ struct EnvOptions;
// Types of checksums to use for checking integrity of logical blocks within // Types of checksums to use for checking integrity of logical blocks within
// files. All checksums currently use 32 bits of checking power (1 in 4B // files. All checksums currently use 32 bits of checking power (1 in 4B
// chance of failing to detect random corruption). // chance of failing to detect random corruption). Traditionally, the actual
// checking power can be far from ideal if the corruption is due to misplaced
// data (e.g. physical blocks out of order in a file, or from another file),
// which is fixed in format_version=6 (see below).
enum ChecksumType : char { enum ChecksumType : char {
kNoChecksum = 0x0, kNoChecksum = 0x0,
kCRC32c = 0x1, kCRC32c = 0x1,
@ -512,6 +515,9 @@ struct BlockBasedTableOptions {
// 5 -- Can be read by RocksDB's versions since 6.6.0. Full and partitioned // 5 -- Can be read by RocksDB's versions since 6.6.0. Full and partitioned
// filters use a generally faster and more accurate Bloom filter // filters use a generally faster and more accurate Bloom filter
// implementation, with a different schema. // implementation, with a different schema.
// 6 -- Modified the file footer and checksum matching so that SST data
// misplaced within or between files is as likely to fail checksum
// verification as random corruption. Also checksum-protects SST footer.
uint32_t format_version = 5; uint32_t format_version = 5;
// Store index blocks on disk in compressed format. Changing this option to // Store index blocks on disk in compressed format. Changing this option to

@ -263,7 +263,9 @@ class BlockBasedTableBuilder::BlockBasedTablePropertiesCollector
struct BlockBasedTableBuilder::Rep { struct BlockBasedTableBuilder::Rep {
const ImmutableOptions ioptions; const ImmutableOptions ioptions;
const MutableCFOptions moptions; // BEGIN from MutableCFOptions
std::shared_ptr<const SliceTransform> prefix_extractor;
// END from MutableCFOptions
const BlockBasedTableOptions table_options; const BlockBasedTableOptions table_options;
const InternalKeyComparator& internal_comparator; const InternalKeyComparator& internal_comparator;
// Size in bytes for the user-defined timestamps. // Size in bytes for the user-defined timestamps.
@ -361,6 +363,9 @@ struct BlockBasedTableBuilder::Rep {
// all blocks after data blocks till the end of the SST file. // all blocks after data blocks till the end of the SST file.
uint64_t tail_size; uint64_t tail_size;
// See class Footer
uint32_t base_context_checksum;
uint64_t get_offset() { return offset.load(std::memory_order_relaxed); } uint64_t get_offset() { return offset.load(std::memory_order_relaxed); }
void set_offset(uint64_t o) { offset.store(o, std::memory_order_relaxed); } void set_offset(uint64_t o) { offset.store(o, std::memory_order_relaxed); }
@ -389,6 +394,12 @@ struct BlockBasedTableBuilder::Rep {
// to false, and this is ensured by io_status_mutex, so no special memory // to false, and this is ensured by io_status_mutex, so no special memory
// order for io_status_ok is required. // order for io_status_ok is required.
if (io_status_ok.load(std::memory_order_relaxed)) { if (io_status_ok.load(std::memory_order_relaxed)) {
#ifdef ROCKSDB_ASSERT_STATUS_CHECKED // Avoid unnecessary lock acquisition
auto ios = CopyIOStatus();
ios.PermitUncheckedError();
// Assume no races in unit tests
assert(ios.ok());
#endif // ROCKSDB_ASSERT_STATUS_CHECKED
return IOStatus::OK(); return IOStatus::OK();
} else { } else {
return CopyIOStatus(); return CopyIOStatus();
@ -429,7 +440,7 @@ struct BlockBasedTableBuilder::Rep {
Rep(const BlockBasedTableOptions& table_opt, const TableBuilderOptions& tbo, Rep(const BlockBasedTableOptions& table_opt, const TableBuilderOptions& tbo,
WritableFileWriter* f) WritableFileWriter* f)
: ioptions(tbo.ioptions), : ioptions(tbo.ioptions),
moptions(tbo.moptions), prefix_extractor(tbo.moptions.prefix_extractor),
table_options(table_opt), table_options(table_opt),
internal_comparator(tbo.internal_comparator), internal_comparator(tbo.internal_comparator),
ts_sz(tbo.internal_comparator.user_comparator()->timestamp_size()), ts_sz(tbo.internal_comparator.user_comparator()->timestamp_size()),
@ -456,7 +467,7 @@ struct BlockBasedTableBuilder::Rep {
BlockBasedTableOptions::kDataBlockBinarySearch /* index_type */, BlockBasedTableOptions::kDataBlockBinarySearch /* index_type */,
0.75 /* data_block_hash_table_util_ratio */, ts_sz, 0.75 /* data_block_hash_table_util_ratio */, ts_sz,
persist_user_defined_timestamps), persist_user_defined_timestamps),
internal_prefix_transform(tbo.moptions.prefix_extractor.get()), internal_prefix_transform(prefix_extractor.get()),
compression_type(tbo.compression_type), compression_type(tbo.compression_type),
sample_for_compression(tbo.moptions.sample_for_compression), sample_for_compression(tbo.moptions.sample_for_compression),
compressible_input_data_bytes(0), compressible_input_data_bytes(0),
@ -557,7 +568,7 @@ struct BlockBasedTableBuilder::Rep {
} }
filter_builder.reset(CreateFilterBlockBuilder( filter_builder.reset(CreateFilterBlockBuilder(
ioptions, moptions, filter_context, ioptions, tbo.moptions, filter_context,
use_delta_encoding_for_index_values, p_index_builder_, ts_sz, use_delta_encoding_for_index_values, p_index_builder_, ts_sz,
persist_user_defined_timestamps)); persist_user_defined_timestamps));
} }
@ -573,7 +584,7 @@ struct BlockBasedTableBuilder::Rep {
table_properties_collectors.emplace_back( table_properties_collectors.emplace_back(
new BlockBasedTablePropertiesCollector( new BlockBasedTablePropertiesCollector(
table_options.index_type, table_options.whole_key_filtering, table_options.index_type, table_options.whole_key_filtering,
moptions.prefix_extractor != nullptr)); prefix_extractor != nullptr));
if (ts_sz > 0 && persist_user_defined_timestamps) { if (ts_sz > 0 && persist_user_defined_timestamps) {
table_properties_collectors.emplace_back( table_properties_collectors.emplace_back(
new TimestampTablePropertiesCollector( new TimestampTablePropertiesCollector(
@ -597,6 +608,17 @@ struct BlockBasedTableBuilder::Rep {
if (!ReifyDbHostIdProperty(ioptions.env, &props.db_host_id).ok()) { if (!ReifyDbHostIdProperty(ioptions.env, &props.db_host_id).ok()) {
ROCKS_LOG_INFO(ioptions.logger, "db_host_id property will not be set"); ROCKS_LOG_INFO(ioptions.logger, "db_host_id property will not be set");
} }
if (FormatVersionUsesContextChecksum(table_options.format_version)) {
// Must be non-zero and semi- or quasi-random
// TODO: ideally guaranteed different for related files (e.g. use file
// number and db_session, for benefit of SstFileWriter)
do {
base_context_checksum = Random::GetTLSInstance()->Next();
} while (UNLIKELY(base_context_checksum == 0));
} else {
base_context_checksum = 0;
}
} }
Rep(const Rep&) = delete; Rep(const Rep&) = delete;
@ -1285,7 +1307,8 @@ void BlockBasedTableBuilder::WriteMaybeCompressedBlock(
bool is_data_block = block_type == BlockType::kData; bool is_data_block = block_type == BlockType::kData;
// Old, misleading name of this function: WriteRawBlock // Old, misleading name of this function: WriteRawBlock
StopWatch sw(r->ioptions.clock, r->ioptions.stats, WRITE_RAW_BLOCK_MICROS); StopWatch sw(r->ioptions.clock, r->ioptions.stats, WRITE_RAW_BLOCK_MICROS);
handle->set_offset(r->get_offset()); const uint64_t offset = r->get_offset();
handle->set_offset(offset);
handle->set_size(block_contents.size()); handle->set_size(block_contents.size());
assert(status().ok()); assert(status().ok());
assert(io_status().ok()); assert(io_status().ok());
@ -1307,6 +1330,7 @@ void BlockBasedTableBuilder::WriteMaybeCompressedBlock(
uint32_t checksum = ComputeBuiltinChecksumWithLastByte( uint32_t checksum = ComputeBuiltinChecksumWithLastByte(
r->table_options.checksum, block_contents.data(), block_contents.size(), r->table_options.checksum, block_contents.data(), block_contents.size(),
/*last_byte*/ comp_type); /*last_byte*/ comp_type);
checksum += ChecksumModifierForContext(r->base_context_checksum, offset);
if (block_type == BlockType::kFilter) { if (block_type == BlockType::kFilter) {
Status s = r->filter_builder->MaybePostVerifyFilter(block_contents); Status s = r->filter_builder->MaybePostVerifyFilter(block_contents);
@ -1600,6 +1624,11 @@ void BlockBasedTableBuilder::WriteIndexBlock(
// The last index_block_handle will be for the partition index block // The last index_block_handle will be for the partition index block
} }
} }
// If success and need to record in metaindex rather than footer...
if (!FormatVersionUsesIndexHandleInFooter(
rep_->table_options.format_version)) {
meta_index_builder->Add(kIndexBlockName, *index_block_handle);
}
} }
void BlockBasedTableBuilder::WritePropertiesBlock( void BlockBasedTableBuilder::WritePropertiesBlock(
@ -1625,9 +1654,7 @@ void BlockBasedTableBuilder::WritePropertiesBlock(
rep_->props.compression_options = rep_->props.compression_options =
CompressionOptionsToString(rep_->compression_opts); CompressionOptionsToString(rep_->compression_opts);
rep_->props.prefix_extractor_name = rep_->props.prefix_extractor_name =
rep_->moptions.prefix_extractor != nullptr rep_->prefix_extractor ? rep_->prefix_extractor->AsString() : "nullptr";
? rep_->moptions.prefix_extractor->AsString()
: "nullptr";
std::string property_collectors_names = "["; std::string property_collectors_names = "[";
for (size_t i = 0; for (size_t i = 0;
i < rep_->ioptions.table_properties_collector_factories.size(); ++i) { i < rep_->ioptions.table_properties_collector_factories.size(); ++i) {
@ -1746,16 +1773,20 @@ void BlockBasedTableBuilder::WriteRangeDelBlock(
void BlockBasedTableBuilder::WriteFooter(BlockHandle& metaindex_block_handle, void BlockBasedTableBuilder::WriteFooter(BlockHandle& metaindex_block_handle,
BlockHandle& index_block_handle) { BlockHandle& index_block_handle) {
assert(ok());
Rep* r = rep_; Rep* r = rep_;
// this is guaranteed by BlockBasedTableBuilder's constructor // this is guaranteed by BlockBasedTableBuilder's constructor
assert(r->table_options.checksum == kCRC32c || assert(r->table_options.checksum == kCRC32c ||
r->table_options.format_version != 0); r->table_options.format_version != 0);
assert(ok());
FooterBuilder footer; FooterBuilder footer;
footer.Build(kBlockBasedTableMagicNumber, r->table_options.format_version, Status s = footer.Build(kBlockBasedTableMagicNumber,
r->get_offset(), r->table_options.checksum, r->table_options.format_version, r->get_offset(),
metaindex_block_handle, index_block_handle); r->table_options.checksum, metaindex_block_handle,
index_block_handle, r->base_context_checksum);
if (!s.ok()) {
r->SetStatus(s);
return;
}
IOStatus ios = r->file->Append(footer.GetSlice()); IOStatus ios = r->file->Append(footer.GetSlice());
if (ios.ok()) { if (ios.ok()) {
r->set_offset(r->get_offset() + footer.GetSlice().size()); r->set_offset(r->get_offset() + footer.GetSlice().size());
@ -1970,10 +2001,14 @@ Status BlockBasedTableBuilder::Finish() {
WriteFooter(metaindex_block_handle, index_block_handle); WriteFooter(metaindex_block_handle, index_block_handle);
} }
r->state = Rep::State::kClosed; r->state = Rep::State::kClosed;
r->SetStatus(r->CopyIOStatus());
Status ret_status = r->CopyStatus();
assert(!ret_status.ok() || io_status().ok());
r->tail_size = r->offset - r->props.tail_start_offset; r->tail_size = r->offset - r->props.tail_start_offset;
Status ret_status = r->CopyStatus();
IOStatus ios = r->GetIOStatus();
if (!ios.ok() && ret_status.ok()) {
// Let io_status supersede ok status (otherwise status takes precedennce)
ret_status = ios;
}
return ret_status; return ret_status;
} }
@ -1983,8 +2018,10 @@ void BlockBasedTableBuilder::Abandon() {
StopParallelCompression(); StopParallelCompression();
} }
rep_->state = Rep::State::kClosed; rep_->state = Rep::State::kClosed;
#ifdef ROCKSDB_ASSERT_STATUS_CHECKED // Avoid unnecessary lock acquisition
rep_->CopyStatus().PermitUncheckedError(); rep_->CopyStatus().PermitUncheckedError();
rep_->CopyIOStatus().PermitUncheckedError(); rep_->CopyIOStatus().PermitUncheckedError();
#endif // ROCKSDB_ASSERT_STATUS_CHECKED
} }
uint64_t BlockBasedTableBuilder::NumEntries() const { uint64_t BlockBasedTableBuilder::NumEntries() const {

@ -771,6 +771,7 @@ Status BlockBasedTable::Open(
if (!s.ok()) { if (!s.ok()) {
return s; return s;
} }
rep->verify_checksum_set_on_open = ro.verify_checksums;
s = new_table->PrefetchIndexAndFilterBlocks( s = new_table->PrefetchIndexAndFilterBlocks(
ro, prefetch_buffer.get(), metaindex_iter.get(), new_table.get(), ro, prefetch_buffer.get(), metaindex_iter.get(), new_table.get(),
prefetch_all, table_options, level, file_size, prefetch_all, table_options, level, file_size,
@ -2454,6 +2455,10 @@ BlockType BlockBasedTable::GetBlockTypeForMetaBlockByName(
return BlockType::kHashIndexMetadata; return BlockType::kHashIndexMetadata;
} }
if (meta_block_name == kIndexBlockName) {
return BlockType::kIndex;
}
if (meta_block_name.starts_with(kObsoleteFilterBlockPrefix)) { if (meta_block_name.starts_with(kObsoleteFilterBlockPrefix)) {
// Obsolete but possible in old files // Obsolete but possible in old files
return BlockType::kInvalid; return BlockType::kInvalid;
@ -2474,6 +2479,9 @@ Status BlockBasedTable::VerifyChecksumInMetaBlocks(
BlockHandle handle; BlockHandle handle;
Slice input = index_iter->value(); Slice input = index_iter->value();
s = handle.DecodeFrom(&input); s = handle.DecodeFrom(&input);
if (!s.ok()) {
break;
}
BlockContents contents; BlockContents contents;
const Slice meta_block_name = index_iter->key(); const Slice meta_block_name = index_iter->key();
if (meta_block_name == kPropertiesBlockName) { if (meta_block_name == kPropertiesBlockName) {
@ -2484,7 +2492,13 @@ Status BlockBasedTable::VerifyChecksumInMetaBlocks(
nullptr /* prefetch_buffer */, rep_->footer, nullptr /* prefetch_buffer */, rep_->footer,
rep_->ioptions, &table_properties, rep_->ioptions, &table_properties,
nullptr /* memory_allocator */); nullptr /* memory_allocator */);
} else if (rep_->verify_checksum_set_on_open &&
meta_block_name == kIndexBlockName) {
// WART: For now, to maintain similar I/O behavior as before
// format_version=6, we skip verifying index block checksum--but only
// if it was checked on open.
} else { } else {
// FIXME? Need to verify checksums of index and filter partitions?
s = BlockFetcher( s = BlockFetcher(
rep_->file.get(), nullptr /* prefetch buffer */, rep_->footer, rep_->file.get(), nullptr /* prefetch buffer */, rep_->footer,
read_options, handle, &contents, rep_->ioptions, read_options, handle, &contents, rep_->ioptions,
@ -2542,6 +2556,15 @@ Status BlockBasedTable::CreateIndexReader(
InternalIterator* meta_iter, bool use_cache, bool prefetch, bool pin, InternalIterator* meta_iter, bool use_cache, bool prefetch, bool pin,
BlockCacheLookupContext* lookup_context, BlockCacheLookupContext* lookup_context,
std::unique_ptr<IndexReader>* index_reader) { std::unique_ptr<IndexReader>* index_reader) {
if (FormatVersionUsesIndexHandleInFooter(rep_->footer.format_version())) {
rep_->index_handle = rep_->footer.index_handle();
} else {
Status s = FindMetaBlock(meta_iter, kIndexBlockName, &rep_->index_handle);
if (!s.ok()) {
return s;
}
}
switch (rep_->index_type) { switch (rep_->index_type) {
case BlockBasedTableOptions::kTwoLevelIndexSearch: { case BlockBasedTableOptions::kTwoLevelIndexSearch: {
return PartitionIndexReader::Create(this, ro, prefetch_buffer, use_cache, return PartitionIndexReader::Create(this, ro, prefetch_buffer, use_cache,
@ -2709,7 +2732,7 @@ bool BlockBasedTable::TEST_FilterBlockInCache() const {
bool BlockBasedTable::TEST_IndexBlockInCache() const { bool BlockBasedTable::TEST_IndexBlockInCache() const {
assert(rep_ != nullptr); assert(rep_ != nullptr);
return TEST_BlockInCache(rep_->footer.index_handle()); return TEST_BlockInCache(rep_->index_handle);
} }
Status BlockBasedTable::GetKVPairsFromDataBlocks( Status BlockBasedTable::GetKVPairsFromDataBlocks(

@ -594,6 +594,7 @@ struct BlockBasedTable::Rep {
BlockHandle compression_dict_handle; BlockHandle compression_dict_handle;
std::shared_ptr<const TableProperties> table_properties; std::shared_ptr<const TableProperties> table_properties;
BlockHandle index_handle;
BlockBasedTableOptions::IndexType index_type; BlockBasedTableOptions::IndexType index_type;
bool whole_key_filtering; bool whole_key_filtering;
bool prefix_filtering; bool prefix_filtering;
@ -637,6 +638,12 @@ struct BlockBasedTable::Rep {
bool index_key_includes_seq = true; bool index_key_includes_seq = true;
bool index_value_is_full = true; bool index_value_is_full = true;
// Whether block checksums in metadata blocks were verified on open.
// This is only to mostly maintain current dubious behavior of VerifyChecksum
// with respect to index blocks, but only when the checksum was previously
// verified.
bool verify_checksum_set_on_open = false;
const bool immortal_table; const bool immortal_table;
// Whether the user key contains user-defined timestamps. If this is false and // Whether the user key contains user-defined timestamps. If this is false and
// the running user comparator has a non-zero timestamp size, a min timestamp // the running user comparator has a non-zero timestamp size, a min timestamp

@ -222,9 +222,8 @@ DEFINE_SYNC_AND_ASYNC(void, BlockBasedTable::RetrieveMultipleBlocks)
// begin address of each read request, we need to add the offset // begin address of each read request, we need to add the offset
// in each read request. Checksum is stored in the block trailer, // in each read request. Checksum is stored in the block trailer,
// beyond the payload size. // beyond the payload size.
s = VerifyBlockChecksum(footer.checksum_type(), data + req_offset, s = VerifyBlockChecksum(footer, data + req_offset, handle.size(),
handle.size(), rep_->file->file_name(), rep_->file->file_name(), handle.offset());
handle.offset());
TEST_SYNC_POINT_CALLBACK("RetrieveMultipleBlocks:VerifyChecksum", &s); TEST_SYNC_POINT_CALLBACK("RetrieveMultipleBlocks:VerifyChecksum", &s);
} }
} else if (!use_shared_buffer) { } else if (!use_shared_buffer) {

@ -26,7 +26,7 @@ Status BlockBasedTable::IndexReaderCommon::ReadIndexBlock(
assert(rep != nullptr); assert(rep != nullptr);
const Status s = table->RetrieveBlock( const Status s = table->RetrieveBlock(
prefetch_buffer, read_options, rep->footer.index_handle(), prefetch_buffer, read_options, rep->index_handle,
UncompressionDict::GetEmptyDict(), &index_block->As<Block_kIndex>(), UncompressionDict::GetEmptyDict(), &index_block->As<Block_kIndex>(),
get_context, lookup_context, /* for_compaction */ false, use_cache, get_context, lookup_context, /* for_compaction */ false, use_cache,
/* async_read */ false); /* async_read */ false);

@ -23,10 +23,14 @@ void ForceReleaseCachedEntry(void* arg, void* h) {
} }
// WART: this is specific to block-based table // WART: this is specific to block-based table
Status VerifyBlockChecksum(ChecksumType type, const char* data, Status VerifyBlockChecksum(const Footer& footer, const char* data,
size_t block_size, const std::string& file_name, size_t block_size, const std::string& file_name,
uint64_t offset) { uint64_t offset) {
PERF_TIMER_GUARD(block_checksum_time); PERF_TIMER_GUARD(block_checksum_time);
assert(footer.GetBlockTrailerSize() == 5);
ChecksumType type = footer.checksum_type();
// After block_size bytes is compression type (1 byte), which is part of // After block_size bytes is compression type (1 byte), which is part of
// the checksummed section. // the checksummed section.
size_t len = block_size + 1; size_t len = block_size + 1;
@ -34,6 +38,13 @@ Status VerifyBlockChecksum(ChecksumType type, const char* data,
uint32_t stored = DecodeFixed32(data + len); uint32_t stored = DecodeFixed32(data + len);
uint32_t computed = ComputeBuiltinChecksum(type, data, len); uint32_t computed = ComputeBuiltinChecksum(type, data, len);
// Unapply context to 'stored' rather than apply to 'computed, for people
// who might look for reference crc value in error message
uint32_t modifier =
ChecksumModifierForContext(footer.base_context_checksum(), offset);
stored -= modifier;
if (stored == computed) { if (stored == computed) {
return Status::OK(); return Status::OK();
} else { } else {
@ -43,8 +54,9 @@ Status VerifyBlockChecksum(ChecksumType type, const char* data,
computed = crc32c::Unmask(computed); computed = crc32c::Unmask(computed);
} }
return Status::Corruption( return Status::Corruption(
"block checksum mismatch: stored = " + std::to_string(stored) + "block checksum mismatch: stored" +
", computed = " + std::to_string(computed) + std::string(modifier ? "(context removed)" : "") + " = " +
std::to_string(stored) + ", computed = " + std::to_string(computed) +
", type = " + std::to_string(type) + " in " + file_name + " offset " + ", type = " + std::to_string(type) + " in " + file_name + " offset " +
std::to_string(offset) + " size " + std::to_string(block_size)); std::to_string(offset) + " size " + std::to_string(block_size));
} }

@ -12,6 +12,8 @@
#include "rocksdb/table.h" #include "rocksdb/table.h"
namespace ROCKSDB_NAMESPACE { namespace ROCKSDB_NAMESPACE {
class Footer;
// Release the cached entry and decrement its ref count. // Release the cached entry and decrement its ref count.
extern void ForceReleaseCachedEntry(void* arg, void* h); extern void ForceReleaseCachedEntry(void* arg, void* h);
@ -22,12 +24,13 @@ inline MemoryAllocator* GetMemoryAllocator(
: nullptr; : nullptr;
} }
// Assumes block has a trailer as in format.h. file_name and offset provided // Assumes block has a trailer past `data + block_size` as in format.h.
// for generating a diagnostic message in returned status. // `file_name` provided for generating diagnostic message in returned status.
// `offset` might be required for proper verification (also used for message).
// //
// Returns Status::OK() on checksum match, or Status::Corruption() on checksum // Returns Status::OK() on checksum match, or Status::Corruption() on checksum
// mismatch. // mismatch.
extern Status VerifyBlockChecksum(ChecksumType type, const char* data, extern Status VerifyBlockChecksum(const Footer& footer, const char* data,
size_t block_size, size_t block_size,
const std::string& file_name, const std::string& file_name,
uint64_t offset); uint64_t offset);

@ -33,9 +33,9 @@ inline void BlockFetcher::ProcessTrailerIfPresent() {
if (footer_.GetBlockTrailerSize() > 0) { if (footer_.GetBlockTrailerSize() > 0) {
assert(footer_.GetBlockTrailerSize() == BlockBasedTable::kBlockTrailerSize); assert(footer_.GetBlockTrailerSize() == BlockBasedTable::kBlockTrailerSize);
if (read_options_.verify_checksums) { if (read_options_.verify_checksums) {
io_status_ = status_to_io_status(VerifyBlockChecksum( io_status_ = status_to_io_status(
footer_.checksum_type(), slice_.data(), block_size_, VerifyBlockChecksum(footer_, slice_.data(), block_size_,
file_->file_name(), handle_.offset())); file_->file_name(), handle_.offset()));
RecordTick(ioptions_.stats, BLOCK_CHECKSUM_COMPUTE_COUNT); RecordTick(ioptions_.stats, BLOCK_CHECKSUM_COMPUTE_COUNT);
if (!io_status_.ok()) { if (!io_status_.ok()) {
assert(io_status_.IsCorruption()); assert(io_status_.IsCorruption());

@ -107,6 +107,9 @@ class BlockFetcherTest : public testing::Test {
Footer footer; Footer footer;
ReadFooter(file.get(), &footer); ReadFooter(file.get(), &footer);
const BlockHandle& index_handle = footer.index_handle(); const BlockHandle& index_handle = footer.index_handle();
// FIXME: index handle will need to come from metaindex for
// format_version >= 6 when that becomes the default
ASSERT_FALSE(index_handle.IsNull());
CompressionType compression_type; CompressionType compression_type;
FetchBlock(file.get(), index_handle, BlockType::kIndex, FetchBlock(file.get(), index_handle, BlockType::kIndex,

@ -403,8 +403,12 @@ Status CuckooTableBuilder::Finish() {
} }
FooterBuilder footer; FooterBuilder footer;
footer.Build(kCuckooTableMagicNumber, /* format_version */ 1, offset, Status s = footer.Build(kCuckooTableMagicNumber, /* format_version */ 1,
kNoChecksum, meta_index_block_handle); offset, kNoChecksum, meta_index_block_handle);
if (!s.ok()) {
status_ = s;
return status_;
}
io_status_ = file_->Append(footer.GetSlice()); io_status_ = file_->Append(footer.GetSlice());
status_ = io_status_; status_ = io_status_;
return status_; return status_;

@ -10,6 +10,7 @@
#include "table/format.h" #include "table/format.h"
#include <cinttypes> #include <cinttypes>
#include <cstdint>
#include <string> #include <string>
#include "block_fetcher.h" #include "block_fetcher.h"
@ -18,12 +19,14 @@
#include "monitoring/perf_context_imp.h" #include "monitoring/perf_context_imp.h"
#include "monitoring/statistics_impl.h" #include "monitoring/statistics_impl.h"
#include "options/options_helper.h" #include "options/options_helper.h"
#include "port/likely.h"
#include "rocksdb/env.h" #include "rocksdb/env.h"
#include "rocksdb/options.h" #include "rocksdb/options.h"
#include "rocksdb/table.h" #include "rocksdb/table.h"
#include "table/block_based/block.h" #include "table/block_based/block.h"
#include "table/block_based/block_based_table_reader.h" #include "table/block_based/block_based_table_reader.h"
#include "table/persistent_cache_helper.h" #include "table/persistent_cache_helper.h"
#include "unique_id_impl.h"
#include "util/cast_util.h" #include "util/cast_util.h"
#include "util/coding.h" #include "util/coding.h"
#include "util/compression.h" #include "util/compression.h"
@ -195,25 +198,41 @@ inline uint8_t BlockTrailerSizeForMagicNumber(uint64_t magic_number) {
// -> format_version >= 1 // -> format_version >= 1
// checksum type (char, 1 byte) // checksum type (char, 1 byte)
// * Part2 // * Part2
// -> format_version <= 5
// metaindex handle (varint64 offset, varint64 size) // metaindex handle (varint64 offset, varint64 size)
// index handle (varint64 offset, varint64 size) // index handle (varint64 offset, varint64 size)
// <zero padding> for part2 size = 2 * BlockHandle::kMaxEncodedLength = 40 // <zero padding> for part2 size = 2 * BlockHandle::kMaxEncodedLength = 40
// - This padding is unchecked/ignored
// -> format_version >= 6
// extended magic number (4 bytes) = 0x3e 0x00 0x7a 0x00
// - Also surely invalid (size 0) handles if interpreted as older version
// - (Helps ensure a corrupted format_version doesn't get us far with no
// footer checksum.)
// footer_checksum (uint32LE, 4 bytes)
// - Checksum of above checksum type of whole footer, with this field
// set to all zeros.
// base_context_checksum (uint32LE, 4 bytes)
// metaindex block size (uint32LE, 4 bytes)
// - Assumed to be immediately before footer, < 4GB
// <zero padding> (24 bytes, reserved for future use)
// - Brings part2 size also to 40 bytes
// - Checked that last eight bytes == 0, so reserved for a future
// incompatible feature (but under format_version=6)
// * Part3 // * Part3
// -> format_version == 0 (inferred from legacy magic number) // -> format_version == 0 (inferred from legacy magic number)
// legacy magic number (8 bytes) // legacy magic number (8 bytes)
// -> format_version >= 1 (inferred from NOT legacy magic number) // -> format_version >= 1 (inferred from NOT legacy magic number)
// format_version (uint32LE, 4 bytes), also called "footer version" // format_version (uint32LE, 4 bytes), also called "footer version"
// newer magic number (8 bytes) // newer magic number (8 bytes)
const std::array<char, 4> kExtendedMagic{{0x3e, 0x00, 0x7a, 0x00}};
constexpr size_t kFooterPart2Size = 2 * BlockHandle::kMaxEncodedLength; constexpr size_t kFooterPart2Size = 2 * BlockHandle::kMaxEncodedLength;
} // namespace } // namespace
void FooterBuilder::Build(uint64_t magic_number, uint32_t format_version, Status FooterBuilder::Build(uint64_t magic_number, uint32_t format_version,
uint64_t footer_offset, ChecksumType checksum_type, uint64_t footer_offset, ChecksumType checksum_type,
const BlockHandle& metaindex_handle, const BlockHandle& metaindex_handle,
const BlockHandle& index_handle) { const BlockHandle& index_handle,
(void)footer_offset; // Future use uint32_t base_context_checksum) {
assert(magic_number != Footer::kNullTableMagicNumber); assert(magic_number != Footer::kNullTableMagicNumber);
assert(IsSupportedFormatVersion(format_version)); assert(IsSupportedFormatVersion(format_version));
@ -249,19 +268,71 @@ void FooterBuilder::Build(uint64_t magic_number, uint32_t format_version,
assert(cur + 8 == slice_.data() + slice_.size()); assert(cur + 8 == slice_.data() + slice_.size());
} }
{ if (format_version >= 6) {
if (BlockTrailerSizeForMagicNumber(magic_number) != 0) {
// base context checksum required for table formats with block checksums
assert(base_context_checksum != 0);
assert(ChecksumModifierForContext(base_context_checksum, 0) != 0);
} else {
// base context checksum not used
assert(base_context_checksum == 0);
assert(ChecksumModifierForContext(base_context_checksum, 0) == 0);
}
// Start populating Part 2
char* cur = data_.data() + /* part 1 size */ 1;
// Set extended magic of part2
std::copy(kExtendedMagic.begin(), kExtendedMagic.end(), cur);
cur += kExtendedMagic.size();
// Fill checksum data with zeros (for later computing checksum)
char* checksum_data = cur;
EncodeFixed32(cur, 0);
cur += 4;
// Save base context checksum
EncodeFixed32(cur, base_context_checksum);
cur += 4;
// Compute and save metaindex size
uint32_t metaindex_size = static_cast<uint32_t>(metaindex_handle.size());
if (metaindex_size != metaindex_handle.size()) {
return Status::NotSupported("Metaindex block size > 4GB");
}
// Metaindex must be adjacent to footer
assert(metaindex_size == 0 ||
metaindex_handle.offset() + metaindex_handle.size() ==
footer_offset - BlockTrailerSizeForMagicNumber(magic_number));
EncodeFixed32(cur, metaindex_size);
cur += 4;
// Zero pad remainder (for future use)
std::fill_n(cur, 24U, char{0});
assert(cur + 24 == part3);
// Compute checksum, add context
uint32_t checksum = ComputeBuiltinChecksum(
checksum_type, data_.data(), Footer::kNewVersionsEncodedLength);
checksum +=
ChecksumModifierForContext(base_context_checksum, footer_offset);
// Store it
EncodeFixed32(checksum_data, checksum);
} else {
// Base context checksum not used
assert(!FormatVersionUsesContextChecksum(format_version));
// Should be left empty
assert(base_context_checksum == 0);
assert(ChecksumModifierForContext(base_context_checksum, 0) == 0);
// Populate all of part 2
char* cur = part2; char* cur = part2;
cur = metaindex_handle.EncodeTo(cur); cur = metaindex_handle.EncodeTo(cur);
cur = index_handle.EncodeTo(cur); cur = index_handle.EncodeTo(cur);
// Zero pad remainder // Zero pad remainder
std::fill(cur, part3, char{0}); std::fill(cur, part3, char{0});
} }
return Status::OK();
} }
Status Footer::DecodeFrom(Slice input, uint64_t input_offset, Status Footer::DecodeFrom(Slice input, uint64_t input_offset,
uint64_t enforce_table_magic_number) { uint64_t enforce_table_magic_number) {
(void)input_offset; // Future use
// Only decode to unused Footer // Only decode to unused Footer
assert(table_magic_number_ == kNullTableMagicNumber); assert(table_magic_number_ == kNullTableMagicNumber);
assert(input != nullptr); assert(input != nullptr);
@ -284,6 +355,9 @@ Status Footer::DecodeFrom(Slice input, uint64_t input_offset,
block_trailer_size_ = BlockTrailerSizeForMagicNumber(magic); block_trailer_size_ = BlockTrailerSizeForMagicNumber(magic);
// Parse Part3 // Parse Part3
const char* part3_ptr = magic_ptr;
uint32_t computed_checksum = 0;
uint64_t footer_offset = 0;
if (legacy) { if (legacy) {
// The size is already asserted to be at least kMinEncodedLength // The size is already asserted to be at least kMinEncodedLength
// at the beginning of the function // at the beginning of the function
@ -291,37 +365,101 @@ Status Footer::DecodeFrom(Slice input, uint64_t input_offset,
format_version_ = 0 /* legacy */; format_version_ = 0 /* legacy */;
checksum_type_ = kCRC32c; checksum_type_ = kCRC32c;
} else { } else {
const char* part3_ptr = magic_ptr - 4; part3_ptr = magic_ptr - 4;
format_version_ = DecodeFixed32(part3_ptr); format_version_ = DecodeFixed32(part3_ptr);
if (!IsSupportedFormatVersion(format_version_)) { if (UNLIKELY(!IsSupportedFormatVersion(format_version_))) {
return Status::Corruption("Corrupt or unsupported format_version: " + return Status::Corruption("Corrupt or unsupported format_version: " +
std::to_string(format_version_)); std::to_string(format_version_));
} }
// All known format versions >= 1 occupy exactly this many bytes. // All known format versions >= 1 occupy exactly this many bytes.
if (input.size() < kNewVersionsEncodedLength) { if (UNLIKELY(input.size() < kNewVersionsEncodedLength)) {
return Status::Corruption("Input is too short to be an SST file"); return Status::Corruption("Input is too short to be an SST file");
} }
uint64_t adjustment = input.size() - kNewVersionsEncodedLength; uint64_t adjustment = input.size() - kNewVersionsEncodedLength;
input.remove_prefix(adjustment); input.remove_prefix(adjustment);
footer_offset = input_offset + adjustment;
// Parse Part1 // Parse Part1
char chksum = input.data()[0]; char chksum = input.data()[0];
checksum_type_ = lossless_cast<ChecksumType>(chksum); checksum_type_ = lossless_cast<ChecksumType>(chksum);
if (!IsSupportedChecksumType(checksum_type())) { if (UNLIKELY(!IsSupportedChecksumType(checksum_type()))) {
return Status::Corruption("Corrupt or unsupported checksum type: " + return Status::Corruption("Corrupt or unsupported checksum type: " +
std::to_string(lossless_cast<uint8_t>(chksum))); std::to_string(lossless_cast<uint8_t>(chksum)));
} }
// This is the most convenient place to compute the checksum
if (checksum_type_ != kNoChecksum && format_version_ >= 6) {
std::array<char, kNewVersionsEncodedLength> copy_without_checksum;
std::copy_n(input.data(), kNewVersionsEncodedLength,
&copy_without_checksum[0]);
EncodeFixed32(&copy_without_checksum[5], 0); // Clear embedded checksum
computed_checksum =
ComputeBuiltinChecksum(checksum_type(), copy_without_checksum.data(),
kNewVersionsEncodedLength);
}
// Consume checksum type field // Consume checksum type field
input.remove_prefix(1); input.remove_prefix(1);
} }
// Parse Part2 // Parse Part2
Status result = metaindex_handle_.DecodeFrom(&input); if (format_version_ >= 6) {
if (result.ok()) { Slice ext_magic(input.data(), 4);
result = index_handle_.DecodeFrom(&input); if (UNLIKELY(ext_magic.compare(Slice(kExtendedMagic.data(),
kExtendedMagic.size())) != 0)) {
return Status::Corruption("Bad extended magic number: 0x" +
ext_magic.ToString(/*hex*/ true));
}
input.remove_prefix(4);
uint32_t stored_checksum = 0, metaindex_size = 0;
bool success;
success = GetFixed32(&input, &stored_checksum);
assert(success);
success = GetFixed32(&input, &base_context_checksum_);
assert(success);
if (UNLIKELY(ChecksumModifierForContext(base_context_checksum_, 0) == 0)) {
return Status::Corruption("Invalid base context checksum");
}
computed_checksum +=
ChecksumModifierForContext(base_context_checksum_, footer_offset);
if (UNLIKELY(computed_checksum != stored_checksum)) {
return Status::Corruption("Footer at " + std::to_string(footer_offset) +
" checksum mismatch");
}
success = GetFixed32(&input, &metaindex_size);
assert(success);
(void)success;
uint64_t metaindex_end = footer_offset - GetBlockTrailerSize();
metaindex_handle_ =
BlockHandle(metaindex_end - metaindex_size, metaindex_size);
// Mark unpopulated
index_handle_ = BlockHandle::NullBlockHandle();
// 16 bytes of unchecked reserved padding
input.remove_prefix(16U);
// 8 bytes of checked reserved padding (expected to be zero unless using a
// future feature).
uint64_t reserved = 0;
success = GetFixed64(&input, &reserved);
assert(success);
if (UNLIKELY(reserved != 0)) {
return Status::NotSupported(
"File uses a future feature not supported in this version");
}
// End of part 2
assert(input.data() == part3_ptr);
} else {
// format_version_ < 6
Status result = metaindex_handle_.DecodeFrom(&input);
if (result.ok()) {
result = index_handle_.DecodeFrom(&input);
}
if (!result.ok()) {
return result;
}
// Padding in part2 is ignored
} }
return result; return Status::OK();
// Padding in part2 is ignored
} }
std::string Footer::ToString() const { std::string Footer::ToString() const {

@ -111,6 +111,40 @@ struct IndexValue {
std::string ToString(bool hex, bool have_first_key) const; std::string ToString(bool hex, bool have_first_key) const;
}; };
// Given a file's base_context_checksum and an offset of a block within that
// file, choose a 32-bit value that is as unique as possible. This value will
// be added to the standard checksum to get a checksum "with context," or can
// be subtracted to "remove" context. Returns zero (no modifier) if feature is
// disabled with base_context_checksum == 0.
inline uint32_t ChecksumModifierForContext(uint32_t base_context_checksum,
uint64_t offset) {
// To disable on base_context_checksum == 0, we could write
// `if (base_context_checksum == 0) return 0;` but benchmarking shows
// measurable performance penalty vs. this: compute the modifier
// unconditionally and use an "all or nothing" bit mask to enable
// or disable.
uint32_t all_or_nothing = uint32_t{0} - (base_context_checksum != 0);
// Desired properties:
// (call this function f(b, o) where b = base and o = offset)
// 1. Fast
// 2. f(b1, o) == f(b2, o) iff b1 == b2
// (Perfectly preserve base entropy)
// 3. f(b, o1) == f(b, o2) only if o1 == o2 or |o1-o2| >= 4 billion
// (Guaranteed uniqueness for nearby offsets)
// 3. f(b, o + j * 2**32) == f(b, o + k * 2**32) only if j == k
// (Upper bits matter, and *aligned* misplacement fails check)
// 4. f(b1, o) == f(b2, o + x) then preferably not
// f(b1, o + y) == f(b2, o + x + y)
// (Avoid linearly correlated matches)
// 5. f(b, o) == 0 depends on both b and o
// (No predictable overlap with non-context checksums)
uint32_t modifier =
base_context_checksum ^ (Lower32of64(offset) + Upper32of64(offset));
return modifier & all_or_nothing;
}
inline uint32_t GetCompressFormatForVersion(uint32_t format_version) { inline uint32_t GetCompressFormatForVersion(uint32_t format_version) {
// As of format_version 2, we encode compressed block with // As of format_version 2, we encode compressed block with
// compress_format_version == 2. Before that, the version is 1. // compress_format_version == 2. Before that, the version is 1.
@ -118,18 +152,27 @@ inline uint32_t GetCompressFormatForVersion(uint32_t format_version) {
return format_version >= 2 ? 2 : 1; return format_version >= 2 ? 2 : 1;
} }
constexpr uint32_t kLatestFormatVersion = 5; constexpr uint32_t kLatestFormatVersion = 6;
inline bool IsSupportedFormatVersion(uint32_t version) { inline bool IsSupportedFormatVersion(uint32_t version) {
return version <= kLatestFormatVersion; return version <= kLatestFormatVersion;
} }
// Same as having a unique id in footer.
inline bool FormatVersionUsesContextChecksum(uint32_t version) {
return version >= 6;
}
inline bool FormatVersionUsesIndexHandleInFooter(uint32_t version) {
return version < 6;
}
// Footer encapsulates the fixed information stored at the tail end of every // Footer encapsulates the fixed information stored at the tail end of every
// SST file. In general, it should only include things that cannot go // SST file. In general, it should only include things that cannot go
// elsewhere under the metaindex block. For example, checksum_type is // elsewhere under the metaindex block. For example, checksum_type is
// required for verifying metaindex block checksum (when applicable), but // required for verifying metaindex block checksum (when applicable), but
// index block handle can easily go in metaindex block (possible future). // index block handle can easily go in metaindex block. See also FooterBuilder
// See also FooterBuilder below. // below.
class Footer { class Footer {
public: public:
// Create empty. Populate using DecodeFrom. // Create empty. Populate using DecodeFrom.
@ -137,7 +180,7 @@ class Footer {
// Deserialize a footer (populate fields) from `input` and check for various // Deserialize a footer (populate fields) from `input` and check for various
// corruptions. `input_offset` is the offset within the target file of // corruptions. `input_offset` is the offset within the target file of
// `input` buffer (future use). // `input` buffer, which is needed for verifying format_version >= 6 footer.
// If enforce_table_magic_number != 0, will return corruption if table magic // If enforce_table_magic_number != 0, will return corruption if table magic
// number is not equal to enforce_table_magic_number. // number is not equal to enforce_table_magic_number.
Status DecodeFrom(Slice input, uint64_t input_offset, Status DecodeFrom(Slice input, uint64_t input_offset,
@ -152,13 +195,17 @@ class Footer {
// BBTO::format_version.) // BBTO::format_version.)
uint32_t format_version() const { return format_version_; } uint32_t format_version() const { return format_version_; }
// See ChecksumModifierForContext()
uint32_t base_context_checksum() const { return base_context_checksum_; }
// Block handle for metaindex block. // Block handle for metaindex block.
const BlockHandle& metaindex_handle() const { return metaindex_handle_; } const BlockHandle& metaindex_handle() const { return metaindex_handle_; }
// Block handle for (top-level) index block. // Block handle for (top-level) index block.
// TODO? remove from this struct and only read on decode for legacy cases
const BlockHandle& index_handle() const { return index_handle_; } const BlockHandle& index_handle() const { return index_handle_; }
// Checksum type used in the file. // Checksum type used in the file, including footer for format version >= 6.
ChecksumType checksum_type() const { ChecksumType checksum_type() const {
return static_cast<ChecksumType>(checksum_type_); return static_cast<ChecksumType>(checksum_type_);
} }
@ -198,6 +245,7 @@ class Footer {
uint64_t table_magic_number_ = kNullTableMagicNumber; uint64_t table_magic_number_ = kNullTableMagicNumber;
uint32_t format_version_ = kInvalidFormatVersion; uint32_t format_version_ = kInvalidFormatVersion;
uint32_t base_context_checksum_ = 0;
BlockHandle metaindex_handle_; BlockHandle metaindex_handle_;
BlockHandle index_handle_; BlockHandle index_handle_;
int checksum_type_ = kInvalidChecksumType; int checksum_type_ = kInvalidChecksumType;
@ -219,11 +267,16 @@ class FooterBuilder {
// * footer_offset is the file offset where the footer will be written // * footer_offset is the file offset where the footer will be written
// (for future use). // (for future use).
// * checksum_type is for formats using block checksums. // * checksum_type is for formats using block checksums.
// * index_handle is optional for some kinds of SST files. // * index_handle is optional for some SST kinds and (for caller convenience)
void Build(uint64_t table_magic_number, uint32_t format_version, // ignored when format_version >= 6. (Must be added to metaindex in that
uint64_t footer_offset, ChecksumType checksum_type, // case.)
const BlockHandle& metaindex_handle, // * unique_id must be specified if format_vesion >= 6 and SST uses block
const BlockHandle& index_handle = BlockHandle::NullBlockHandle()); // checksums with context. Otherwise, auto-generated if format_vesion >= 6.
Status Build(uint64_t table_magic_number, uint32_t format_version,
uint64_t footer_offset, ChecksumType checksum_type,
const BlockHandle& metaindex_handle,
const BlockHandle& index_handle = BlockHandle::NullBlockHandle(),
uint32_t base_context_checksum = 0);
// After Builder, get a Slice for the serialized Footer, backed by this // After Builder, get a Slice for the serialized Footer, backed by this
// FooterBuilder. // FooterBuilder.

@ -27,6 +27,8 @@
namespace ROCKSDB_NAMESPACE { namespace ROCKSDB_NAMESPACE {
const std::string kPropertiesBlockName = "rocksdb.properties"; const std::string kPropertiesBlockName = "rocksdb.properties";
// NB: only used with format_version >= 6
const std::string kIndexBlockName = "rocksdb.index";
// Old property block name for backward compatibility // Old property block name for backward compatibility
const std::string kPropertiesBlockOldName = "rocksdb.stats"; const std::string kPropertiesBlockOldName = "rocksdb.stats";
const std::string kCompressionDictBlockName = "rocksdb.compression_dict"; const std::string kCompressionDictBlockName = "rocksdb.compression_dict";
@ -395,8 +397,8 @@ Status ReadTablePropertiesHelper(
// Modified version of BlockFetcher checksum verification // Modified version of BlockFetcher checksum verification
// (See write_global_seqno comment above) // (See write_global_seqno comment above)
if (s.ok() && footer.GetBlockTrailerSize() > 0) { if (s.ok() && footer.GetBlockTrailerSize() > 0) {
s = VerifyBlockChecksum(footer.checksum_type(), properties_block.data(), s = VerifyBlockChecksum(footer, properties_block.data(), block_size,
block_size, file->file_name(), handle.offset()); file->file_name(), handle.offset());
if (s.IsCorruption()) { if (s.IsCorruption()) {
if (new_table_properties->external_sst_file_global_seqno_offset != 0) { if (new_table_properties->external_sst_file_global_seqno_offset != 0) {
std::string tmp_buf(properties_block.data(), std::string tmp_buf(properties_block.data(),
@ -405,8 +407,8 @@ Status ReadTablePropertiesHelper(
new_table_properties->external_sst_file_global_seqno_offset - new_table_properties->external_sst_file_global_seqno_offset -
handle.offset(); handle.offset();
EncodeFixed64(&tmp_buf[static_cast<size_t>(global_seqno_offset)], 0); EncodeFixed64(&tmp_buf[static_cast<size_t>(global_seqno_offset)], 0);
s = VerifyBlockChecksum(footer.checksum_type(), tmp_buf.data(), s = VerifyBlockChecksum(footer, tmp_buf.data(), block_size,
block_size, file->file_name(), handle.offset()); file->file_name(), handle.offset());
} }
} }
} }

@ -32,6 +32,7 @@ struct TableProperties;
// Meta block names for metaindex // Meta block names for metaindex
extern const std::string kPropertiesBlockName; extern const std::string kPropertiesBlockName;
extern const std::string kIndexBlockName;
extern const std::string kPropertiesBlockOldName; extern const std::string kPropertiesBlockOldName;
extern const std::string kCompressionDictBlockName; extern const std::string kCompressionDictBlockName;
extern const std::string kRangeDelBlockName; extern const std::string kRangeDelBlockName;

@ -274,10 +274,11 @@ Status PlainTableBuilder::Finish() {
// -- Write property block // -- Write property block
BlockHandle property_block_handle; BlockHandle property_block_handle;
IOStatus s = WriteBlock(property_block_builder.Finish(), file_, &offset_, io_status_ = WriteBlock(property_block_builder.Finish(), file_, &offset_,
&property_block_handle); &property_block_handle);
if (!s.ok()) { if (!io_status_.ok()) {
return static_cast<Status>(s); status_ = io_status_;
return status_;
} }
meta_index_builer.Add(kPropertiesBlockName, property_block_handle); meta_index_builer.Add(kPropertiesBlockName, property_block_handle);
@ -293,8 +294,12 @@ Status PlainTableBuilder::Finish() {
// Write Footer // Write Footer
// no need to write out new footer if we're using default checksum // no need to write out new footer if we're using default checksum
FooterBuilder footer; FooterBuilder footer;
footer.Build(kPlainTableMagicNumber, /* format_version */ 0, offset_, Status s = footer.Build(kPlainTableMagicNumber, /* format_version */ 0,
kNoChecksum, metaindex_block_handle); offset_, kNoChecksum, metaindex_block_handle);
if (!s.ok()) {
status_ = s;
return status_;
}
io_status_ = file_->Append(footer.GetSlice()); io_status_ = file_->Append(footer.GetSlice());
if (io_status_.ok()) { if (io_status_.ok()) {
offset_ += footer.GetSlice().size(); offset_ += footer.GetSlice().size();

@ -4472,11 +4472,12 @@ TEST(TableTest, FooterTests) {
BlockHandle index(data_size + 5, index_size); BlockHandle index(data_size + 5, index_size);
BlockHandle meta_index(data_size + index_size + 2 * 5, metaindex_size); BlockHandle meta_index(data_size + index_size + 2 * 5, metaindex_size);
uint64_t footer_offset = data_size + metaindex_size + index_size + 3 * 5; uint64_t footer_offset = data_size + metaindex_size + index_size + 3 * 5;
uint32_t base_context_checksum = 123456789;
{ {
// legacy block based // legacy block based
FooterBuilder footer; FooterBuilder footer;
footer.Build(kBlockBasedTableMagicNumber, /* format_version */ 0, ASSERT_OK(footer.Build(kBlockBasedTableMagicNumber, /* format_version */ 0,
footer_offset, kCRC32c, meta_index, index); footer_offset, kCRC32c, meta_index, index));
Footer decoded_footer; Footer decoded_footer;
ASSERT_OK(decoded_footer.DecodeFrom(footer.GetSlice(), footer_offset)); ASSERT_OK(decoded_footer.DecodeFrom(footer.GetSlice(), footer_offset));
ASSERT_EQ(decoded_footer.table_magic_number(), kBlockBasedTableMagicNumber); ASSERT_EQ(decoded_footer.table_magic_number(), kBlockBasedTableMagicNumber);
@ -4486,6 +4487,7 @@ TEST(TableTest, FooterTests) {
ASSERT_EQ(decoded_footer.index_handle().offset(), index.offset()); ASSERT_EQ(decoded_footer.index_handle().offset(), index.offset());
ASSERT_EQ(decoded_footer.index_handle().size(), index.size()); ASSERT_EQ(decoded_footer.index_handle().size(), index.size());
ASSERT_EQ(decoded_footer.format_version(), 0U); ASSERT_EQ(decoded_footer.format_version(), 0U);
ASSERT_EQ(decoded_footer.base_context_checksum(), 0U);
ASSERT_EQ(decoded_footer.GetBlockTrailerSize(), 5U); ASSERT_EQ(decoded_footer.GetBlockTrailerSize(), 5U);
// Ensure serialized with legacy magic // Ensure serialized with legacy magic
ASSERT_EQ( ASSERT_EQ(
@ -4495,9 +4497,11 @@ TEST(TableTest, FooterTests) {
// block based, various checksums, various versions // block based, various checksums, various versions
for (auto t : GetSupportedChecksums()) { for (auto t : GetSupportedChecksums()) {
for (uint32_t fv = 1; IsSupportedFormatVersion(fv); ++fv) { for (uint32_t fv = 1; IsSupportedFormatVersion(fv); ++fv) {
uint32_t maybe_bcc =
FormatVersionUsesContextChecksum(fv) ? base_context_checksum : 0U;
FooterBuilder footer; FooterBuilder footer;
footer.Build(kBlockBasedTableMagicNumber, fv, footer_offset, t, ASSERT_OK(footer.Build(kBlockBasedTableMagicNumber, fv, footer_offset, t,
meta_index, index); meta_index, index, maybe_bcc));
Footer decoded_footer; Footer decoded_footer;
ASSERT_OK(decoded_footer.DecodeFrom(footer.GetSlice(), footer_offset)); ASSERT_OK(decoded_footer.DecodeFrom(footer.GetSlice(), footer_offset));
ASSERT_EQ(decoded_footer.table_magic_number(), ASSERT_EQ(decoded_footer.table_magic_number(),
@ -4506,18 +4510,44 @@ TEST(TableTest, FooterTests) {
ASSERT_EQ(decoded_footer.metaindex_handle().offset(), ASSERT_EQ(decoded_footer.metaindex_handle().offset(),
meta_index.offset()); meta_index.offset());
ASSERT_EQ(decoded_footer.metaindex_handle().size(), meta_index.size()); ASSERT_EQ(decoded_footer.metaindex_handle().size(), meta_index.size());
ASSERT_EQ(decoded_footer.index_handle().offset(), index.offset()); if (FormatVersionUsesIndexHandleInFooter(fv)) {
ASSERT_EQ(decoded_footer.index_handle().size(), index.size()); ASSERT_EQ(decoded_footer.index_handle().offset(), index.offset());
ASSERT_EQ(decoded_footer.index_handle().size(), index.size());
}
ASSERT_EQ(decoded_footer.format_version(), fv); ASSERT_EQ(decoded_footer.format_version(), fv);
ASSERT_EQ(decoded_footer.GetBlockTrailerSize(), 5U); ASSERT_EQ(decoded_footer.GetBlockTrailerSize(), 5U);
if (FormatVersionUsesContextChecksum(fv)) {
ASSERT_EQ(decoded_footer.base_context_checksum(),
base_context_checksum);
// Bad offset should fail footer checksum
decoded_footer = Footer();
ASSERT_NOK(
decoded_footer.DecodeFrom(footer.GetSlice(), footer_offset - 1));
} else {
ASSERT_EQ(decoded_footer.base_context_checksum(), 0U);
}
// Too big metaindex size should also fail encoding only in new footer
uint64_t big_metaindex_size = 0x100000007U;
uint64_t big_footer_offset =
data_size + big_metaindex_size + index_size + 3 * 5;
BlockHandle big_metaindex =
BlockHandle(data_size + index_size + 2 * 5, big_metaindex_size);
ASSERT_NE(footer
.Build(kBlockBasedTableMagicNumber, fv, big_footer_offset,
t, big_metaindex, index, maybe_bcc)
.ok(),
FormatVersionUsesContextChecksum(fv));
} }
} }
{ {
// legacy plain table // legacy plain table
FooterBuilder footer; FooterBuilder footer;
footer.Build(kPlainTableMagicNumber, /* format_version */ 0, footer_offset, ASSERT_OK(footer.Build(kPlainTableMagicNumber, /* format_version */ 0,
kNoChecksum, meta_index); footer_offset, kNoChecksum, meta_index));
Footer decoded_footer; Footer decoded_footer;
ASSERT_OK(decoded_footer.DecodeFrom(footer.GetSlice(), footer_offset)); ASSERT_OK(decoded_footer.DecodeFrom(footer.GetSlice(), footer_offset));
ASSERT_EQ(decoded_footer.table_magic_number(), kPlainTableMagicNumber); ASSERT_EQ(decoded_footer.table_magic_number(), kPlainTableMagicNumber);
@ -4536,8 +4566,8 @@ TEST(TableTest, FooterTests) {
{ {
// xxhash plain table (not currently used) // xxhash plain table (not currently used)
FooterBuilder footer; FooterBuilder footer;
footer.Build(kPlainTableMagicNumber, /* format_version */ 1, footer_offset, ASSERT_OK(footer.Build(kPlainTableMagicNumber, /* format_version */ 1,
kxxHash, meta_index); footer_offset, kxxHash, meta_index));
Footer decoded_footer; Footer decoded_footer;
ASSERT_OK(decoded_footer.DecodeFrom(footer.GetSlice(), footer_offset)); ASSERT_OK(decoded_footer.DecodeFrom(footer.GetSlice(), footer_offset));
ASSERT_EQ(decoded_footer.table_magic_number(), kPlainTableMagicNumber); ASSERT_EQ(decoded_footer.table_magic_number(), kPlainTableMagicNumber);
@ -5211,9 +5241,13 @@ TEST_P(BlockBasedTableTest, PropertiesMetaBlockLast) {
} }
} }
ASSERT_EQ(kPropertiesBlockName, key_at_max_offset); ASSERT_EQ(kPropertiesBlockName, key_at_max_offset);
// index handle is stored in footer rather than metaindex block, so need if (FormatVersionUsesIndexHandleInFooter(footer.format_version())) {
// separate logic to verify it comes before properties block. // If index handle is stored in footer rather than metaindex block,
ASSERT_GT(max_offset, footer.index_handle().offset()); // need separate logic to verify it comes before properties block.
ASSERT_GT(max_offset, footer.index_handle().offset());
} else {
ASSERT_TRUE(footer.index_handle().IsNull());
}
c.ResetTableReader(); c.ResetTableReader();
} }

@ -39,7 +39,10 @@ namespace test {
const uint32_t kDefaultFormatVersion = BlockBasedTableOptions().format_version; const uint32_t kDefaultFormatVersion = BlockBasedTableOptions().format_version;
const std::set<uint32_t> kFooterFormatVersionsToTest{ const std::set<uint32_t> kFooterFormatVersionsToTest{
// Non-legacy, before big footer changes
5U, 5U,
// After big footer changes
6U,
// In case any interesting future changes // In case any interesting future changes
kDefaultFormatVersion, kDefaultFormatVersion,
kLatestFormatVersion, kLatestFormatVersion,

@ -134,7 +134,7 @@ default_params = {
"verify_checksum": 1, "verify_checksum": 1,
"write_buffer_size": 4 * 1024 * 1024, "write_buffer_size": 4 * 1024 * 1024,
"writepercent": 35, "writepercent": 35,
"format_version": lambda: random.choice([2, 3, 4, 5, 5]), "format_version": lambda: random.choice([2, 3, 4, 5, 6, 6]),
"index_block_restart_interval": lambda: random.choice(range(1, 16)), "index_block_restart_interval": lambda: random.choice(range(1, 16)),
"use_multiget": lambda: random.randint(0, 1), "use_multiget": lambda: random.randint(0, 1),
"use_get_entity": lambda: random.choice([0] * 7 + [1]), "use_get_entity": lambda: random.choice([0] * 7 + [1]),

@ -468,4 +468,3 @@ int main(int argc, char** argv) {
RegisterCustomObjects(argc, argv); RegisterCustomObjects(argc, argv);
return RUN_ALL_TESTS(); return RUN_ALL_TESTS();
} }

@ -0,0 +1 @@
* Added enhanced data integrity checking on SST files with new format_version=6. Performance impact is very small or negligible. Previously if SST data was misplaced or re-arranged by the storage layer, it could pass block checksum with higher than 1 in 4 billion probability. With format_version=6, block checksums depend on what file they are in and location within the file. This way, misplaced SST data is no more likely to pass checksum verification than randomly corrupted data. Also in format_version=6, SST footers are checksum-protected.
Loading…
Cancel
Save