You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
rocksdb/db/db_bloom_filter_test.cc

2117 lines
80 KiB

// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
// This source code is licensed under both the GPLv2 (found in the
// COPYING file in the root directory) and Apache 2.0 License
// (found in the LICENSE.Apache file in the root directory).
//
// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file. See the AUTHORS file for names of contributors.
#include "db/db_test_util.h"
#include "options/options_helper.h"
#include "port/stack_trace.h"
#include "rocksdb/perf_context.h"
#include "table/block_based/filter_policy_internal.h"
namespace ROCKSDB_NAMESPACE {
namespace {
using BFP = BloomFilterPolicy;
} // namespace
// DB tests related to bloom filter.
class DBBloomFilterTest : public DBTestBase {
public:
DBBloomFilterTest() : DBTestBase("/db_bloom_filter_test") {}
};
class DBBloomFilterTestWithParam : public DBTestBase,
public testing::WithParamInterface<
New Bloom filter implementation for full and partitioned filters (#6007) Summary: Adds an improved, replacement Bloom filter implementation (FastLocalBloom) for full and partitioned filters in the block-based table. This replacement is faster and more accurate, especially for high bits per key or millions of keys in a single filter. Speed The improved speed, at least on recent x86_64, comes from * Using fastrange instead of modulo (%) * Using our new hash function (XXH3 preview, added in a previous commit), which is much faster for large keys and only *slightly* slower on keys around 12 bytes if hashing the same size many thousands of times in a row. * Optimizing the Bloom filter queries with AVX2 SIMD operations. (Added AVX2 to the USE_SSE=1 build.) Careful design was required to support (a) SIMD-optimized queries, (b) compatible non-SIMD code that's simple and efficient, (c) flexible choice of number of probes, and (d) essentially maximized accuracy for a cache-local Bloom filter. Probes are made eight at a time, so any number of probes up to 8 is the same speed, then up to 16, etc. * Prefetching cache lines when building the filter. Although this optimization could be applied to the old structure as well, it seems to balance out the small added cost of accumulating 64 bit hashes for adding to the filter rather than 32 bit hashes. Here's nominal speed data from filter_bench (200MB in filters, about 10k keys each, 10 bits filter data / key, 6 probes, avg key size 24 bytes, includes hashing time) on Skylake DE (relatively low clock speed): $ ./filter_bench -quick -impl=2 -net_includes_hashing # New Bloom filter Build avg ns/key: 47.7135 Mixed inside/outside queries... Single filter net ns/op: 26.2825 Random filter net ns/op: 150.459 Average FP rate %: 0.954651 $ ./filter_bench -quick -impl=0 -net_includes_hashing # Old Bloom filter Build avg ns/key: 47.2245 Mixed inside/outside queries... Single filter net ns/op: 63.2978 Random filter net ns/op: 188.038 Average FP rate %: 1.13823 Similar build time but dramatically faster query times on hot data (63 ns to 26 ns), and somewhat faster on stale data (188 ns to 150 ns). Performance differences on batched and skewed query loads are between these extremes as expected. The only other interesting thing about speed is "inside" (query key was added to filter) vs. "outside" (query key was not added to filter) query times. The non-SIMD implementations are substantially slower when most queries are "outside" vs. "inside". This goes against what one might expect or would have observed years ago, as "outside" queries only need about two probes on average, due to short-circuiting, while "inside" always have num_probes (say 6). The problem is probably the nastily unpredictable branch. The SIMD implementation has few branches (very predictable) and has pretty consistent running time regardless of query outcome. Accuracy The generally improved accuracy (re: Issue https://github.com/facebook/rocksdb/issues/5857) comes from a better design for probing indices within a cache line (re: Issue https://github.com/facebook/rocksdb/issues/4120) and improved accuracy for millions of keys in a single filter from using a 64-bit hash function (XXH3p). Design details in code comments. Accuracy data (generalizes, except old impl gets worse with millions of keys): Memory bits per key: FP rate percent old impl -> FP rate percent new impl 6: 5.70953 -> 5.69888 8: 2.45766 -> 2.29709 10: 1.13977 -> 0.959254 12: 0.662498 -> 0.411593 16: 0.353023 -> 0.0873754 24: 0.261552 -> 0.0060971 50: 0.225453 -> ~0.00003 (less than 1 in a million queries are FP) Fixes https://github.com/facebook/rocksdb/issues/5857 Fixes https://github.com/facebook/rocksdb/issues/4120 Unlike the old implementation, this implementation has a fixed cache line size (64 bytes). At 10 bits per key, the accuracy of this new implementation is very close to the old implementation with 128-byte cache line size. If there's sufficient demand, this implementation could be generalized. Compatibility Although old releases would see the new structure as corrupt filter data and read the table as if there's no filter, we've decided only to enable the new Bloom filter with new format_version=5. This provides a smooth path for automatic adoption over time, with an option for early opt-in. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6007 Test Plan: filter_bench has been used thoroughly to validate speed, accuracy, and correctness. Unit tests have been carefully updated to exercise new and old implementations, as well as the logic to select an implementation based on context (format_version). Differential Revision: D18294749 Pulled By: pdillinger fbshipit-source-id: d44c9db3696e4d0a17caaec47075b7755c262c5f
5 years ago
std::tuple<BFP::Mode, bool, uint32_t>> {
// public testing::WithParamInterface<bool> {
protected:
New Bloom filter implementation for full and partitioned filters (#6007) Summary: Adds an improved, replacement Bloom filter implementation (FastLocalBloom) for full and partitioned filters in the block-based table. This replacement is faster and more accurate, especially for high bits per key or millions of keys in a single filter. Speed The improved speed, at least on recent x86_64, comes from * Using fastrange instead of modulo (%) * Using our new hash function (XXH3 preview, added in a previous commit), which is much faster for large keys and only *slightly* slower on keys around 12 bytes if hashing the same size many thousands of times in a row. * Optimizing the Bloom filter queries with AVX2 SIMD operations. (Added AVX2 to the USE_SSE=1 build.) Careful design was required to support (a) SIMD-optimized queries, (b) compatible non-SIMD code that's simple and efficient, (c) flexible choice of number of probes, and (d) essentially maximized accuracy for a cache-local Bloom filter. Probes are made eight at a time, so any number of probes up to 8 is the same speed, then up to 16, etc. * Prefetching cache lines when building the filter. Although this optimization could be applied to the old structure as well, it seems to balance out the small added cost of accumulating 64 bit hashes for adding to the filter rather than 32 bit hashes. Here's nominal speed data from filter_bench (200MB in filters, about 10k keys each, 10 bits filter data / key, 6 probes, avg key size 24 bytes, includes hashing time) on Skylake DE (relatively low clock speed): $ ./filter_bench -quick -impl=2 -net_includes_hashing # New Bloom filter Build avg ns/key: 47.7135 Mixed inside/outside queries... Single filter net ns/op: 26.2825 Random filter net ns/op: 150.459 Average FP rate %: 0.954651 $ ./filter_bench -quick -impl=0 -net_includes_hashing # Old Bloom filter Build avg ns/key: 47.2245 Mixed inside/outside queries... Single filter net ns/op: 63.2978 Random filter net ns/op: 188.038 Average FP rate %: 1.13823 Similar build time but dramatically faster query times on hot data (63 ns to 26 ns), and somewhat faster on stale data (188 ns to 150 ns). Performance differences on batched and skewed query loads are between these extremes as expected. The only other interesting thing about speed is "inside" (query key was added to filter) vs. "outside" (query key was not added to filter) query times. The non-SIMD implementations are substantially slower when most queries are "outside" vs. "inside". This goes against what one might expect or would have observed years ago, as "outside" queries only need about two probes on average, due to short-circuiting, while "inside" always have num_probes (say 6). The problem is probably the nastily unpredictable branch. The SIMD implementation has few branches (very predictable) and has pretty consistent running time regardless of query outcome. Accuracy The generally improved accuracy (re: Issue https://github.com/facebook/rocksdb/issues/5857) comes from a better design for probing indices within a cache line (re: Issue https://github.com/facebook/rocksdb/issues/4120) and improved accuracy for millions of keys in a single filter from using a 64-bit hash function (XXH3p). Design details in code comments. Accuracy data (generalizes, except old impl gets worse with millions of keys): Memory bits per key: FP rate percent old impl -> FP rate percent new impl 6: 5.70953 -> 5.69888 8: 2.45766 -> 2.29709 10: 1.13977 -> 0.959254 12: 0.662498 -> 0.411593 16: 0.353023 -> 0.0873754 24: 0.261552 -> 0.0060971 50: 0.225453 -> ~0.00003 (less than 1 in a million queries are FP) Fixes https://github.com/facebook/rocksdb/issues/5857 Fixes https://github.com/facebook/rocksdb/issues/4120 Unlike the old implementation, this implementation has a fixed cache line size (64 bytes). At 10 bits per key, the accuracy of this new implementation is very close to the old implementation with 128-byte cache line size. If there's sufficient demand, this implementation could be generalized. Compatibility Although old releases would see the new structure as corrupt filter data and read the table as if there's no filter, we've decided only to enable the new Bloom filter with new format_version=5. This provides a smooth path for automatic adoption over time, with an option for early opt-in. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6007 Test Plan: filter_bench has been used thoroughly to validate speed, accuracy, and correctness. Unit tests have been carefully updated to exercise new and old implementations, as well as the logic to select an implementation based on context (format_version). Differential Revision: D18294749 Pulled By: pdillinger fbshipit-source-id: d44c9db3696e4d0a17caaec47075b7755c262c5f
5 years ago
BFP::Mode bfp_impl_;
bool partition_filters_;
uint32_t format_version_;
public:
DBBloomFilterTestWithParam() : DBTestBase("/db_bloom_filter_tests") {}
~DBBloomFilterTestWithParam() override {}
void SetUp() override {
bfp_impl_ = std::get<0>(GetParam());
partition_filters_ = std::get<1>(GetParam());
format_version_ = std::get<2>(GetParam());
}
};
class DBBloomFilterTestDefFormatVersion : public DBBloomFilterTestWithParam {};
class SliceTransformLimitedDomainGeneric : public SliceTransform {
const char* Name() const override {
return "SliceTransformLimitedDomainGeneric";
}
Slice Transform(const Slice& src) const override {
return Slice(src.data(), 5);
}
bool InDomain(const Slice& src) const override {
// prefix will be x????
return src.size() >= 5;
}
bool InRange(const Slice& dst) const override {
// prefix will be x????
return dst.size() == 5;
}
};
// KeyMayExist can lead to a few false positives, but not false negatives.
// To make test deterministic, use a much larger number of bits per key-20 than
// bits in the key, so that false positives are eliminated
TEST_P(DBBloomFilterTestDefFormatVersion, KeyMayExist) {
do {
ReadOptions ropts;
std::string value;
anon::OptionsOverride options_override;
options_override.filter_policy.reset(new BFP(20, bfp_impl_));
options_override.partition_filters = partition_filters_;
options_override.metadata_block_size = 32;
Options options = CurrentOptions(options_override);
if (partition_filters_ &&
static_cast<BlockBasedTableOptions*>(
options.table_factory->GetOptions())
->index_type != BlockBasedTableOptions::kTwoLevelIndexSearch) {
// In the current implementation partitioned filters depend on partitioned
// indexes
continue;
}
options.statistics = ROCKSDB_NAMESPACE::CreateDBStatistics();
CreateAndReopenWithCF({"pikachu"}, options);
ASSERT_TRUE(!db_->KeyMayExist(ropts, handles_[1], "a", &value));
ASSERT_OK(Put(1, "a", "b"));
bool value_found = false;
ASSERT_TRUE(
db_->KeyMayExist(ropts, handles_[1], "a", &value, &value_found));
ASSERT_TRUE(value_found);
ASSERT_EQ("b", value);
ASSERT_OK(Flush(1));
value.clear();
uint64_t numopen = TestGetTickerCount(options, NO_FILE_OPENS);
uint64_t cache_added = TestGetTickerCount(options, BLOCK_CACHE_ADD);
ASSERT_TRUE(
db_->KeyMayExist(ropts, handles_[1], "a", &value, &value_found));
ASSERT_TRUE(!value_found);
// assert that no new files were opened and no new blocks were
// read into block cache.
ASSERT_EQ(numopen, TestGetTickerCount(options, NO_FILE_OPENS));
ASSERT_EQ(cache_added, TestGetTickerCount(options, BLOCK_CACHE_ADD));
ASSERT_OK(Delete(1, "a"));
numopen = TestGetTickerCount(options, NO_FILE_OPENS);
cache_added = TestGetTickerCount(options, BLOCK_CACHE_ADD);
ASSERT_TRUE(!db_->KeyMayExist(ropts, handles_[1], "a", &value));
ASSERT_EQ(numopen, TestGetTickerCount(options, NO_FILE_OPENS));
ASSERT_EQ(cache_added, TestGetTickerCount(options, BLOCK_CACHE_ADD));
ASSERT_OK(Flush(1));
dbfull()->TEST_CompactRange(0, nullptr, nullptr, handles_[1],
true /* disallow trivial move */);
numopen = TestGetTickerCount(options, NO_FILE_OPENS);
cache_added = TestGetTickerCount(options, BLOCK_CACHE_ADD);
ASSERT_TRUE(!db_->KeyMayExist(ropts, handles_[1], "a", &value));
ASSERT_EQ(numopen, TestGetTickerCount(options, NO_FILE_OPENS));
ASSERT_EQ(cache_added, TestGetTickerCount(options, BLOCK_CACHE_ADD));
ASSERT_OK(Delete(1, "c"));
numopen = TestGetTickerCount(options, NO_FILE_OPENS);
cache_added = TestGetTickerCount(options, BLOCK_CACHE_ADD);
ASSERT_TRUE(!db_->KeyMayExist(ropts, handles_[1], "c", &value));
ASSERT_EQ(numopen, TestGetTickerCount(options, NO_FILE_OPENS));
ASSERT_EQ(cache_added, TestGetTickerCount(options, BLOCK_CACHE_ADD));
// KeyMayExist function only checks data in block caches, which is not used
// by plain table format.
} while (
ChangeOptions(kSkipPlainTable | kSkipHashIndex | kSkipFIFOCompaction));
}
TEST_F(DBBloomFilterTest, GetFilterByPrefixBloomCustomPrefixExtractor) {
for (bool partition_filters : {true, false}) {
Options options = last_options_;
options.prefix_extractor =
std::make_shared<SliceTransformLimitedDomainGeneric>();
options.statistics = ROCKSDB_NAMESPACE::CreateDBStatistics();
get_perf_context()->EnablePerLevelPerfContext();
BlockBasedTableOptions bbto;
bbto.filter_policy.reset(NewBloomFilterPolicy(10, false));
if (partition_filters) {
bbto.partition_filters = true;
bbto.index_type = BlockBasedTableOptions::IndexType::kTwoLevelIndexSearch;
}
bbto.whole_key_filtering = false;
options.table_factory.reset(NewBlockBasedTableFactory(bbto));
DestroyAndReopen(options);
WriteOptions wo;
ReadOptions ro;
FlushOptions fo;
fo.wait = true;
std::string value;
ASSERT_OK(dbfull()->Put(wo, "barbarbar", "foo"));
ASSERT_OK(dbfull()->Put(wo, "barbarbar2", "foo2"));
ASSERT_OK(dbfull()->Put(wo, "foofoofoo", "bar"));
dbfull()->Flush(fo);
ASSERT_EQ("foo", Get("barbarbar"));
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 0);
ASSERT_EQ(
0,
(*(get_perf_context()->level_to_perf_context))[0].bloom_filter_useful);
ASSERT_EQ("foo2", Get("barbarbar2"));
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 0);
ASSERT_EQ(
0,
(*(get_perf_context()->level_to_perf_context))[0].bloom_filter_useful);
ASSERT_EQ("NOT_FOUND", Get("barbarbar3"));
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 0);
ASSERT_EQ(
0,
(*(get_perf_context()->level_to_perf_context))[0].bloom_filter_useful);
ASSERT_EQ("NOT_FOUND", Get("barfoofoo"));
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 1);
ASSERT_EQ(
1,
(*(get_perf_context()->level_to_perf_context))[0].bloom_filter_useful);
ASSERT_EQ("NOT_FOUND", Get("foobarbar"));
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 2);
ASSERT_EQ(
2,
(*(get_perf_context()->level_to_perf_context))[0].bloom_filter_useful);
ro.total_order_seek = true;
ASSERT_TRUE(db_->Get(ro, "foobarbar", &value).IsNotFound());
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 2);
ASSERT_EQ(
2,
(*(get_perf_context()->level_to_perf_context))[0].bloom_filter_useful);
get_perf_context()->Reset();
}
}
TEST_F(DBBloomFilterTest, GetFilterByPrefixBloom) {
for (bool partition_filters : {true, false}) {
Options options = last_options_;
options.prefix_extractor.reset(NewFixedPrefixTransform(8));
options.statistics = ROCKSDB_NAMESPACE::CreateDBStatistics();
get_perf_context()->EnablePerLevelPerfContext();
BlockBasedTableOptions bbto;
bbto.filter_policy.reset(NewBloomFilterPolicy(10, false));
if (partition_filters) {
bbto.partition_filters = true;
bbto.index_type = BlockBasedTableOptions::IndexType::kTwoLevelIndexSearch;
}
bbto.whole_key_filtering = false;
options.table_factory.reset(NewBlockBasedTableFactory(bbto));
DestroyAndReopen(options);
WriteOptions wo;
ReadOptions ro;
FlushOptions fo;
fo.wait = true;
std::string value;
ASSERT_OK(dbfull()->Put(wo, "barbarbar", "foo"));
ASSERT_OK(dbfull()->Put(wo, "barbarbar2", "foo2"));
ASSERT_OK(dbfull()->Put(wo, "foofoofoo", "bar"));
dbfull()->Flush(fo);
ASSERT_EQ("foo", Get("barbarbar"));
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 0);
ASSERT_EQ("foo2", Get("barbarbar2"));
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 0);
ASSERT_EQ("NOT_FOUND", Get("barbarbar3"));
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 0);
ASSERT_EQ("NOT_FOUND", Get("barfoofoo"));
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 1);
ASSERT_EQ("NOT_FOUND", Get("foobarbar"));
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 2);
ro.total_order_seek = true;
ASSERT_TRUE(db_->Get(ro, "foobarbar", &value).IsNotFound());
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 2);
ASSERT_EQ(
2,
(*(get_perf_context()->level_to_perf_context))[0].bloom_filter_useful);
get_perf_context()->Reset();
}
}
TEST_F(DBBloomFilterTest, WholeKeyFilterProp) {
for (bool partition_filters : {true, false}) {
Options options = last_options_;
options.prefix_extractor.reset(NewFixedPrefixTransform(3));
options.statistics = ROCKSDB_NAMESPACE::CreateDBStatistics();
get_perf_context()->EnablePerLevelPerfContext();
BlockBasedTableOptions bbto;
bbto.filter_policy.reset(NewBloomFilterPolicy(10, false));
bbto.whole_key_filtering = false;
if (partition_filters) {
bbto.partition_filters = true;
bbto.index_type = BlockBasedTableOptions::IndexType::kTwoLevelIndexSearch;
}
options.table_factory.reset(NewBlockBasedTableFactory(bbto));
DestroyAndReopen(options);
WriteOptions wo;
ReadOptions ro;
FlushOptions fo;
fo.wait = true;
std::string value;
ASSERT_OK(dbfull()->Put(wo, "foobar", "foo"));
// Needs insert some keys to make sure files are not filtered out by key
// ranges.
ASSERT_OK(dbfull()->Put(wo, "aaa", ""));
ASSERT_OK(dbfull()->Put(wo, "zzz", ""));
dbfull()->Flush(fo);
Reopen(options);
ASSERT_EQ("NOT_FOUND", Get("foo"));
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 0);
ASSERT_EQ("NOT_FOUND", Get("bar"));
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 1);
ASSERT_EQ("foo", Get("foobar"));
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 1);
// Reopen with whole key filtering enabled and prefix extractor
// NULL. Bloom filter should be off for both of whole key and
// prefix bloom.
bbto.whole_key_filtering = true;
options.table_factory.reset(NewBlockBasedTableFactory(bbto));
options.prefix_extractor.reset();
Reopen(options);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 1);
ASSERT_EQ("NOT_FOUND", Get("foo"));
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 1);
ASSERT_EQ("NOT_FOUND", Get("bar"));
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 1);
ASSERT_EQ("foo", Get("foobar"));
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 1);
// Write DB with only full key filtering.
ASSERT_OK(dbfull()->Put(wo, "foobar", "foo"));
// Needs insert some keys to make sure files are not filtered out by key
// ranges.
ASSERT_OK(dbfull()->Put(wo, "aaa", ""));
ASSERT_OK(dbfull()->Put(wo, "zzz", ""));
db_->CompactRange(CompactRangeOptions(), nullptr, nullptr);
// Reopen with both of whole key off and prefix extractor enabled.
// Still no bloom filter should be used.
options.prefix_extractor.reset(NewFixedPrefixTransform(3));
bbto.whole_key_filtering = false;
options.table_factory.reset(NewBlockBasedTableFactory(bbto));
Reopen(options);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 1);
ASSERT_EQ("NOT_FOUND", Get("foo"));
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 1);
ASSERT_EQ("NOT_FOUND", Get("bar"));
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 1);
ASSERT_EQ("foo", Get("foobar"));
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 1);
// Try to create a DB with mixed files:
ASSERT_OK(dbfull()->Put(wo, "foobar", "foo"));
// Needs insert some keys to make sure files are not filtered out by key
// ranges.
ASSERT_OK(dbfull()->Put(wo, "aaa", ""));
ASSERT_OK(dbfull()->Put(wo, "zzz", ""));
db_->CompactRange(CompactRangeOptions(), nullptr, nullptr);
options.prefix_extractor.reset();
bbto.whole_key_filtering = true;
options.table_factory.reset(NewBlockBasedTableFactory(bbto));
Reopen(options);
// Try to create a DB with mixed files.
ASSERT_OK(dbfull()->Put(wo, "barfoo", "bar"));
// In this case needs insert some keys to make sure files are
// not filtered out by key ranges.
ASSERT_OK(dbfull()->Put(wo, "aaa", ""));
ASSERT_OK(dbfull()->Put(wo, "zzz", ""));
Flush();
// Now we have two files:
// File 1: An older file with prefix bloom.
// File 2: A newer file with whole bloom filter.
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 1);
ASSERT_EQ("NOT_FOUND", Get("foo"));
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 2);
ASSERT_EQ("NOT_FOUND", Get("bar"));
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 3);
ASSERT_EQ("foo", Get("foobar"));
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 4);
ASSERT_EQ("bar", Get("barfoo"));
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 4);
// Reopen with the same setting: only whole key is used
Reopen(options);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 4);
ASSERT_EQ("NOT_FOUND", Get("foo"));
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 5);
ASSERT_EQ("NOT_FOUND", Get("bar"));
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 6);
ASSERT_EQ("foo", Get("foobar"));
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 7);
ASSERT_EQ("bar", Get("barfoo"));
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 7);
// Restart with both filters are allowed
options.prefix_extractor.reset(NewFixedPrefixTransform(3));
bbto.whole_key_filtering = true;
options.table_factory.reset(NewBlockBasedTableFactory(bbto));
Reopen(options);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 7);
// File 1 will has it filtered out.
// File 2 will not, as prefix `foo` exists in the file.
ASSERT_EQ("NOT_FOUND", Get("foo"));
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 8);
ASSERT_EQ("NOT_FOUND", Get("bar"));
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 10);
ASSERT_EQ("foo", Get("foobar"));
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 11);
ASSERT_EQ("bar", Get("barfoo"));
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 11);
// Restart with only prefix bloom is allowed.
options.prefix_extractor.reset(NewFixedPrefixTransform(3));
bbto.whole_key_filtering = false;
options.table_factory.reset(NewBlockBasedTableFactory(bbto));
Reopen(options);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 11);
ASSERT_EQ("NOT_FOUND", Get("foo"));
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 11);
ASSERT_EQ("NOT_FOUND", Get("bar"));
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 12);
ASSERT_EQ("foo", Get("foobar"));
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 12);
ASSERT_EQ("bar", Get("barfoo"));
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 12);
uint64_t bloom_filter_useful_all_levels = 0;
for (auto& kv : (*(get_perf_context()->level_to_perf_context))) {
if (kv.second.bloom_filter_useful > 0) {
bloom_filter_useful_all_levels += kv.second.bloom_filter_useful;
}
}
ASSERT_EQ(12, bloom_filter_useful_all_levels);
get_perf_context()->Reset();
}
}
TEST_P(DBBloomFilterTestWithParam, BloomFilter) {
do {
Options options = CurrentOptions();
env_->count_random_reads_ = true;
options.env = env_;
// ChangeCompactOptions() only changes compaction style, which does not
// trigger reset of table_factory
BlockBasedTableOptions table_options;
table_options.no_block_cache = true;
table_options.filter_policy.reset(new BFP(10, bfp_impl_));
table_options.partition_filters = partition_filters_;
if (partition_filters_) {
table_options.index_type =
BlockBasedTableOptions::IndexType::kTwoLevelIndexSearch;
}
table_options.format_version = format_version_;
if (format_version_ >= 4) {
// value delta encoding challenged more with index interval > 1
table_options.index_block_restart_interval = 8;
}
table_options.metadata_block_size = 32;
options.table_factory.reset(NewBlockBasedTableFactory(table_options));
CreateAndReopenWithCF({"pikachu"}, options);
// Populate multiple layers
const int N = 10000;
for (int i = 0; i < N; i++) {
ASSERT_OK(Put(1, Key(i), Key(i)));
}
Compact(1, "a", "z");
for (int i = 0; i < N; i += 100) {
ASSERT_OK(Put(1, Key(i), Key(i)));
}
Flush(1);
// Prevent auto compactions triggered by seeks
env_->delay_sstable_sync_.store(true, std::memory_order_release);
// Lookup present keys. Should rarely read from small sstable.
env_->random_read_counter_.Reset();
for (int i = 0; i < N; i++) {
ASSERT_EQ(Key(i), Get(1, Key(i)));
}
int reads = env_->random_read_counter_.Read();
fprintf(stderr, "%d present => %d reads\n", N, reads);
ASSERT_GE(reads, N);
if (partition_filters_) {
// Without block cache, we read an extra partition filter per each
// level*read and a partition index per each read
ASSERT_LE(reads, 4 * N + 2 * N / 100);
} else {
ASSERT_LE(reads, N + 2 * N / 100);
}
// Lookup present keys. Should rarely read from either sstable.
env_->random_read_counter_.Reset();
for (int i = 0; i < N; i++) {
ASSERT_EQ("NOT_FOUND", Get(1, Key(i) + ".missing"));
}
reads = env_->random_read_counter_.Read();
fprintf(stderr, "%d missing => %d reads\n", N, reads);
if (partition_filters_) {
// With partitioned filter we read one extra filter per level per each
// missed read.
ASSERT_LE(reads, 2 * N + 3 * N / 100);
} else {
ASSERT_LE(reads, 3 * N / 100);
}
env_->delay_sstable_sync_.store(false, std::memory_order_release);
Close();
} while (ChangeCompactOptions());
}
#ifndef ROCKSDB_VALGRIND_RUN
INSTANTIATE_TEST_SUITE_P(
FormatDef, DBBloomFilterTestDefFormatVersion,
::testing::Values(
New Bloom filter implementation for full and partitioned filters (#6007) Summary: Adds an improved, replacement Bloom filter implementation (FastLocalBloom) for full and partitioned filters in the block-based table. This replacement is faster and more accurate, especially for high bits per key or millions of keys in a single filter. Speed The improved speed, at least on recent x86_64, comes from * Using fastrange instead of modulo (%) * Using our new hash function (XXH3 preview, added in a previous commit), which is much faster for large keys and only *slightly* slower on keys around 12 bytes if hashing the same size many thousands of times in a row. * Optimizing the Bloom filter queries with AVX2 SIMD operations. (Added AVX2 to the USE_SSE=1 build.) Careful design was required to support (a) SIMD-optimized queries, (b) compatible non-SIMD code that's simple and efficient, (c) flexible choice of number of probes, and (d) essentially maximized accuracy for a cache-local Bloom filter. Probes are made eight at a time, so any number of probes up to 8 is the same speed, then up to 16, etc. * Prefetching cache lines when building the filter. Although this optimization could be applied to the old structure as well, it seems to balance out the small added cost of accumulating 64 bit hashes for adding to the filter rather than 32 bit hashes. Here's nominal speed data from filter_bench (200MB in filters, about 10k keys each, 10 bits filter data / key, 6 probes, avg key size 24 bytes, includes hashing time) on Skylake DE (relatively low clock speed): $ ./filter_bench -quick -impl=2 -net_includes_hashing # New Bloom filter Build avg ns/key: 47.7135 Mixed inside/outside queries... Single filter net ns/op: 26.2825 Random filter net ns/op: 150.459 Average FP rate %: 0.954651 $ ./filter_bench -quick -impl=0 -net_includes_hashing # Old Bloom filter Build avg ns/key: 47.2245 Mixed inside/outside queries... Single filter net ns/op: 63.2978 Random filter net ns/op: 188.038 Average FP rate %: 1.13823 Similar build time but dramatically faster query times on hot data (63 ns to 26 ns), and somewhat faster on stale data (188 ns to 150 ns). Performance differences on batched and skewed query loads are between these extremes as expected. The only other interesting thing about speed is "inside" (query key was added to filter) vs. "outside" (query key was not added to filter) query times. The non-SIMD implementations are substantially slower when most queries are "outside" vs. "inside". This goes against what one might expect or would have observed years ago, as "outside" queries only need about two probes on average, due to short-circuiting, while "inside" always have num_probes (say 6). The problem is probably the nastily unpredictable branch. The SIMD implementation has few branches (very predictable) and has pretty consistent running time regardless of query outcome. Accuracy The generally improved accuracy (re: Issue https://github.com/facebook/rocksdb/issues/5857) comes from a better design for probing indices within a cache line (re: Issue https://github.com/facebook/rocksdb/issues/4120) and improved accuracy for millions of keys in a single filter from using a 64-bit hash function (XXH3p). Design details in code comments. Accuracy data (generalizes, except old impl gets worse with millions of keys): Memory bits per key: FP rate percent old impl -> FP rate percent new impl 6: 5.70953 -> 5.69888 8: 2.45766 -> 2.29709 10: 1.13977 -> 0.959254 12: 0.662498 -> 0.411593 16: 0.353023 -> 0.0873754 24: 0.261552 -> 0.0060971 50: 0.225453 -> ~0.00003 (less than 1 in a million queries are FP) Fixes https://github.com/facebook/rocksdb/issues/5857 Fixes https://github.com/facebook/rocksdb/issues/4120 Unlike the old implementation, this implementation has a fixed cache line size (64 bytes). At 10 bits per key, the accuracy of this new implementation is very close to the old implementation with 128-byte cache line size. If there's sufficient demand, this implementation could be generalized. Compatibility Although old releases would see the new structure as corrupt filter data and read the table as if there's no filter, we've decided only to enable the new Bloom filter with new format_version=5. This provides a smooth path for automatic adoption over time, with an option for early opt-in. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6007 Test Plan: filter_bench has been used thoroughly to validate speed, accuracy, and correctness. Unit tests have been carefully updated to exercise new and old implementations, as well as the logic to select an implementation based on context (format_version). Differential Revision: D18294749 Pulled By: pdillinger fbshipit-source-id: d44c9db3696e4d0a17caaec47075b7755c262c5f
5 years ago
std::make_tuple(BFP::kDeprecatedBlock, false,
test::kDefaultFormatVersion),
std::make_tuple(BFP::kAuto, true, test::kDefaultFormatVersion),
std::make_tuple(BFP::kAuto, false, test::kDefaultFormatVersion)));
INSTANTIATE_TEST_SUITE_P(
FormatDef, DBBloomFilterTestWithParam,
::testing::Values(
New Bloom filter implementation for full and partitioned filters (#6007) Summary: Adds an improved, replacement Bloom filter implementation (FastLocalBloom) for full and partitioned filters in the block-based table. This replacement is faster and more accurate, especially for high bits per key or millions of keys in a single filter. Speed The improved speed, at least on recent x86_64, comes from * Using fastrange instead of modulo (%) * Using our new hash function (XXH3 preview, added in a previous commit), which is much faster for large keys and only *slightly* slower on keys around 12 bytes if hashing the same size many thousands of times in a row. * Optimizing the Bloom filter queries with AVX2 SIMD operations. (Added AVX2 to the USE_SSE=1 build.) Careful design was required to support (a) SIMD-optimized queries, (b) compatible non-SIMD code that's simple and efficient, (c) flexible choice of number of probes, and (d) essentially maximized accuracy for a cache-local Bloom filter. Probes are made eight at a time, so any number of probes up to 8 is the same speed, then up to 16, etc. * Prefetching cache lines when building the filter. Although this optimization could be applied to the old structure as well, it seems to balance out the small added cost of accumulating 64 bit hashes for adding to the filter rather than 32 bit hashes. Here's nominal speed data from filter_bench (200MB in filters, about 10k keys each, 10 bits filter data / key, 6 probes, avg key size 24 bytes, includes hashing time) on Skylake DE (relatively low clock speed): $ ./filter_bench -quick -impl=2 -net_includes_hashing # New Bloom filter Build avg ns/key: 47.7135 Mixed inside/outside queries... Single filter net ns/op: 26.2825 Random filter net ns/op: 150.459 Average FP rate %: 0.954651 $ ./filter_bench -quick -impl=0 -net_includes_hashing # Old Bloom filter Build avg ns/key: 47.2245 Mixed inside/outside queries... Single filter net ns/op: 63.2978 Random filter net ns/op: 188.038 Average FP rate %: 1.13823 Similar build time but dramatically faster query times on hot data (63 ns to 26 ns), and somewhat faster on stale data (188 ns to 150 ns). Performance differences on batched and skewed query loads are between these extremes as expected. The only other interesting thing about speed is "inside" (query key was added to filter) vs. "outside" (query key was not added to filter) query times. The non-SIMD implementations are substantially slower when most queries are "outside" vs. "inside". This goes against what one might expect or would have observed years ago, as "outside" queries only need about two probes on average, due to short-circuiting, while "inside" always have num_probes (say 6). The problem is probably the nastily unpredictable branch. The SIMD implementation has few branches (very predictable) and has pretty consistent running time regardless of query outcome. Accuracy The generally improved accuracy (re: Issue https://github.com/facebook/rocksdb/issues/5857) comes from a better design for probing indices within a cache line (re: Issue https://github.com/facebook/rocksdb/issues/4120) and improved accuracy for millions of keys in a single filter from using a 64-bit hash function (XXH3p). Design details in code comments. Accuracy data (generalizes, except old impl gets worse with millions of keys): Memory bits per key: FP rate percent old impl -> FP rate percent new impl 6: 5.70953 -> 5.69888 8: 2.45766 -> 2.29709 10: 1.13977 -> 0.959254 12: 0.662498 -> 0.411593 16: 0.353023 -> 0.0873754 24: 0.261552 -> 0.0060971 50: 0.225453 -> ~0.00003 (less than 1 in a million queries are FP) Fixes https://github.com/facebook/rocksdb/issues/5857 Fixes https://github.com/facebook/rocksdb/issues/4120 Unlike the old implementation, this implementation has a fixed cache line size (64 bytes). At 10 bits per key, the accuracy of this new implementation is very close to the old implementation with 128-byte cache line size. If there's sufficient demand, this implementation could be generalized. Compatibility Although old releases would see the new structure as corrupt filter data and read the table as if there's no filter, we've decided only to enable the new Bloom filter with new format_version=5. This provides a smooth path for automatic adoption over time, with an option for early opt-in. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6007 Test Plan: filter_bench has been used thoroughly to validate speed, accuracy, and correctness. Unit tests have been carefully updated to exercise new and old implementations, as well as the logic to select an implementation based on context (format_version). Differential Revision: D18294749 Pulled By: pdillinger fbshipit-source-id: d44c9db3696e4d0a17caaec47075b7755c262c5f
5 years ago
std::make_tuple(BFP::kDeprecatedBlock, false,
test::kDefaultFormatVersion),
std::make_tuple(BFP::kAuto, true, test::kDefaultFormatVersion),
std::make_tuple(BFP::kAuto, false, test::kDefaultFormatVersion)));
INSTANTIATE_TEST_SUITE_P(
FormatLatest, DBBloomFilterTestWithParam,
::testing::Values(
New Bloom filter implementation for full and partitioned filters (#6007) Summary: Adds an improved, replacement Bloom filter implementation (FastLocalBloom) for full and partitioned filters in the block-based table. This replacement is faster and more accurate, especially for high bits per key or millions of keys in a single filter. Speed The improved speed, at least on recent x86_64, comes from * Using fastrange instead of modulo (%) * Using our new hash function (XXH3 preview, added in a previous commit), which is much faster for large keys and only *slightly* slower on keys around 12 bytes if hashing the same size many thousands of times in a row. * Optimizing the Bloom filter queries with AVX2 SIMD operations. (Added AVX2 to the USE_SSE=1 build.) Careful design was required to support (a) SIMD-optimized queries, (b) compatible non-SIMD code that's simple and efficient, (c) flexible choice of number of probes, and (d) essentially maximized accuracy for a cache-local Bloom filter. Probes are made eight at a time, so any number of probes up to 8 is the same speed, then up to 16, etc. * Prefetching cache lines when building the filter. Although this optimization could be applied to the old structure as well, it seems to balance out the small added cost of accumulating 64 bit hashes for adding to the filter rather than 32 bit hashes. Here's nominal speed data from filter_bench (200MB in filters, about 10k keys each, 10 bits filter data / key, 6 probes, avg key size 24 bytes, includes hashing time) on Skylake DE (relatively low clock speed): $ ./filter_bench -quick -impl=2 -net_includes_hashing # New Bloom filter Build avg ns/key: 47.7135 Mixed inside/outside queries... Single filter net ns/op: 26.2825 Random filter net ns/op: 150.459 Average FP rate %: 0.954651 $ ./filter_bench -quick -impl=0 -net_includes_hashing # Old Bloom filter Build avg ns/key: 47.2245 Mixed inside/outside queries... Single filter net ns/op: 63.2978 Random filter net ns/op: 188.038 Average FP rate %: 1.13823 Similar build time but dramatically faster query times on hot data (63 ns to 26 ns), and somewhat faster on stale data (188 ns to 150 ns). Performance differences on batched and skewed query loads are between these extremes as expected. The only other interesting thing about speed is "inside" (query key was added to filter) vs. "outside" (query key was not added to filter) query times. The non-SIMD implementations are substantially slower when most queries are "outside" vs. "inside". This goes against what one might expect or would have observed years ago, as "outside" queries only need about two probes on average, due to short-circuiting, while "inside" always have num_probes (say 6). The problem is probably the nastily unpredictable branch. The SIMD implementation has few branches (very predictable) and has pretty consistent running time regardless of query outcome. Accuracy The generally improved accuracy (re: Issue https://github.com/facebook/rocksdb/issues/5857) comes from a better design for probing indices within a cache line (re: Issue https://github.com/facebook/rocksdb/issues/4120) and improved accuracy for millions of keys in a single filter from using a 64-bit hash function (XXH3p). Design details in code comments. Accuracy data (generalizes, except old impl gets worse with millions of keys): Memory bits per key: FP rate percent old impl -> FP rate percent new impl 6: 5.70953 -> 5.69888 8: 2.45766 -> 2.29709 10: 1.13977 -> 0.959254 12: 0.662498 -> 0.411593 16: 0.353023 -> 0.0873754 24: 0.261552 -> 0.0060971 50: 0.225453 -> ~0.00003 (less than 1 in a million queries are FP) Fixes https://github.com/facebook/rocksdb/issues/5857 Fixes https://github.com/facebook/rocksdb/issues/4120 Unlike the old implementation, this implementation has a fixed cache line size (64 bytes). At 10 bits per key, the accuracy of this new implementation is very close to the old implementation with 128-byte cache line size. If there's sufficient demand, this implementation could be generalized. Compatibility Although old releases would see the new structure as corrupt filter data and read the table as if there's no filter, we've decided only to enable the new Bloom filter with new format_version=5. This provides a smooth path for automatic adoption over time, with an option for early opt-in. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6007 Test Plan: filter_bench has been used thoroughly to validate speed, accuracy, and correctness. Unit tests have been carefully updated to exercise new and old implementations, as well as the logic to select an implementation based on context (format_version). Differential Revision: D18294749 Pulled By: pdillinger fbshipit-source-id: d44c9db3696e4d0a17caaec47075b7755c262c5f
5 years ago
std::make_tuple(BFP::kDeprecatedBlock, false,
test::kLatestFormatVersion),
std::make_tuple(BFP::kAuto, true, test::kLatestFormatVersion),
std::make_tuple(BFP::kAuto, false, test::kLatestFormatVersion)));
#endif // ROCKSDB_VALGRIND_RUN
TEST_F(DBBloomFilterTest, BloomFilterRate) {
while (ChangeFilterOptions()) {
Options options = CurrentOptions();
options.statistics = ROCKSDB_NAMESPACE::CreateDBStatistics();
get_perf_context()->EnablePerLevelPerfContext();
CreateAndReopenWithCF({"pikachu"}, options);
const int maxKey = 10000;
for (int i = 0; i < maxKey; i++) {
ASSERT_OK(Put(1, Key(i), Key(i)));
}
// Add a large key to make the file contain wide range
ASSERT_OK(Put(1, Key(maxKey + 55555), Key(maxKey + 55555)));
Flush(1);
// Check if they can be found
for (int i = 0; i < maxKey; i++) {
ASSERT_EQ(Key(i), Get(1, Key(i)));
}
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 0);
// Check if filter is useful
for (int i = 0; i < maxKey; i++) {
ASSERT_EQ("NOT_FOUND", Get(1, Key(i + 33333)));
}
ASSERT_GE(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), maxKey * 0.98);
ASSERT_GE(
(*(get_perf_context()->level_to_perf_context))[0].bloom_filter_useful,
maxKey * 0.98);
get_perf_context()->Reset();
}
}
TEST_F(DBBloomFilterTest, BloomFilterCompatibility) {
Options options = CurrentOptions();
options.statistics = ROCKSDB_NAMESPACE::CreateDBStatistics();
BlockBasedTableOptions table_options;
table_options.filter_policy.reset(NewBloomFilterPolicy(10, true));
options.table_factory.reset(NewBlockBasedTableFactory(table_options));
// Create with block based filter
CreateAndReopenWithCF({"pikachu"}, options);
const int maxKey = 10000;
for (int i = 0; i < maxKey; i++) {
ASSERT_OK(Put(1, Key(i), Key(i)));
}
ASSERT_OK(Put(1, Key(maxKey + 55555), Key(maxKey + 55555)));
Flush(1);
// Check db with full filter
table_options.filter_policy.reset(NewBloomFilterPolicy(10, false));
options.table_factory.reset(NewBlockBasedTableFactory(table_options));
ReopenWithColumnFamilies({"default", "pikachu"}, options);
// Check if they can be found
for (int i = 0; i < maxKey; i++) {
ASSERT_EQ(Key(i), Get(1, Key(i)));
}
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 0);
// Check db with partitioned full filter
table_options.partition_filters = true;
table_options.index_type =
BlockBasedTableOptions::IndexType::kTwoLevelIndexSearch;
table_options.filter_policy.reset(NewBloomFilterPolicy(10, false));
options.table_factory.reset(NewBlockBasedTableFactory(table_options));
ReopenWithColumnFamilies({"default", "pikachu"}, options);
// Check if they can be found
for (int i = 0; i < maxKey; i++) {
ASSERT_EQ(Key(i), Get(1, Key(i)));
}
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 0);
}
TEST_F(DBBloomFilterTest, BloomFilterReverseCompatibility) {
for (bool partition_filters : {true, false}) {
Options options = CurrentOptions();
options.statistics = ROCKSDB_NAMESPACE::CreateDBStatistics();
BlockBasedTableOptions table_options;
if (partition_filters) {
table_options.partition_filters = true;
table_options.index_type =
BlockBasedTableOptions::IndexType::kTwoLevelIndexSearch;
}
table_options.filter_policy.reset(NewBloomFilterPolicy(10, false));
options.table_factory.reset(NewBlockBasedTableFactory(table_options));
DestroyAndReopen(options);
// Create with full filter
CreateAndReopenWithCF({"pikachu"}, options);
const int maxKey = 10000;
for (int i = 0; i < maxKey; i++) {
ASSERT_OK(Put(1, Key(i), Key(i)));
}
ASSERT_OK(Put(1, Key(maxKey + 55555), Key(maxKey + 55555)));
Flush(1);
// Check db with block_based filter
table_options.filter_policy.reset(NewBloomFilterPolicy(10, true));
options.table_factory.reset(NewBlockBasedTableFactory(table_options));
ReopenWithColumnFamilies({"default", "pikachu"}, options);
// Check if they can be found
for (int i = 0; i < maxKey; i++) {
ASSERT_EQ(Key(i), Get(1, Key(i)));
}
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 0);
}
}
namespace {
// A wrapped bloom over block-based FilterPolicy
class TestingWrappedBlockBasedFilterPolicy : public FilterPolicy {
public:
explicit TestingWrappedBlockBasedFilterPolicy(int bits_per_key)
: filter_(NewBloomFilterPolicy(bits_per_key, true)), counter_(0) {}
~TestingWrappedBlockBasedFilterPolicy() override { delete filter_; }
const char* Name() const override {
return "TestingWrappedBlockBasedFilterPolicy";
}
void CreateFilter(const ROCKSDB_NAMESPACE::Slice* keys, int n,
std::string* dst) const override {
std::unique_ptr<ROCKSDB_NAMESPACE::Slice[]> user_keys(
new ROCKSDB_NAMESPACE::Slice[n]);
for (int i = 0; i < n; ++i) {
user_keys[i] = convertKey(keys[i]);
}
return filter_->CreateFilter(user_keys.get(), n, dst);
}
bool KeyMayMatch(const ROCKSDB_NAMESPACE::Slice& key,
const ROCKSDB_NAMESPACE::Slice& filter) const override {
counter_++;
return filter_->KeyMayMatch(convertKey(key), filter);
}
uint32_t GetCounter() { return counter_; }
private:
const FilterPolicy* filter_;
mutable uint32_t counter_;
ROCKSDB_NAMESPACE::Slice convertKey(
const ROCKSDB_NAMESPACE::Slice& key) const {
return key;
}
};
} // namespace
TEST_F(DBBloomFilterTest, WrappedBlockBasedFilterPolicy) {
Options options = CurrentOptions();
options.statistics = ROCKSDB_NAMESPACE::CreateDBStatistics();
BlockBasedTableOptions table_options;
TestingWrappedBlockBasedFilterPolicy* policy =
new TestingWrappedBlockBasedFilterPolicy(10);
table_options.filter_policy.reset(policy);
options.table_factory.reset(NewBlockBasedTableFactory(table_options));
CreateAndReopenWithCF({"pikachu"}, options);
const int maxKey = 10000;
for (int i = 0; i < maxKey; i++) {
ASSERT_OK(Put(1, Key(i), Key(i)));
}
// Add a large key to make the file contain wide range
ASSERT_OK(Put(1, Key(maxKey + 55555), Key(maxKey + 55555)));
ASSERT_EQ(0U, policy->GetCounter());
Flush(1);
// Check if they can be found
for (int i = 0; i < maxKey; i++) {
ASSERT_EQ(Key(i), Get(1, Key(i)));
}
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 0);
ASSERT_EQ(1U * maxKey, policy->GetCounter());
// Check if filter is useful
for (int i = 0; i < maxKey; i++) {
ASSERT_EQ("NOT_FOUND", Get(1, Key(i + 33333)));
}
ASSERT_GE(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), maxKey * 0.98);
ASSERT_EQ(2U * maxKey, policy->GetCounter());
}
namespace {
// NOTE: This class is referenced by HISTORY.md as a model for a wrapper
// FilterPolicy selecting among configurations based on context.
class LevelAndStyleCustomFilterPolicy : public FilterPolicy {
public:
explicit LevelAndStyleCustomFilterPolicy(int bpk_fifo, int bpk_l0_other,
int bpk_otherwise)
: policy_fifo_(NewBloomFilterPolicy(bpk_fifo)),
policy_l0_other_(NewBloomFilterPolicy(bpk_l0_other)),
policy_otherwise_(NewBloomFilterPolicy(bpk_otherwise)) {}
// OK to use built-in policy name because we are deferring to a
// built-in builder. We aren't changing the serialized format.
const char* Name() const override { return policy_fifo_->Name(); }
FilterBitsBuilder* GetBuilderWithContext(
const FilterBuildingContext& context) const override {
if (context.compaction_style == kCompactionStyleFIFO) {
return policy_fifo_->GetBuilderWithContext(context);
} else if (context.level_at_creation == 0) {
return policy_l0_other_->GetBuilderWithContext(context);
} else {
return policy_otherwise_->GetBuilderWithContext(context);
}
}
FilterBitsReader* GetFilterBitsReader(const Slice& contents) const override {
// OK to defer to any of them; they all can parse built-in filters
// from any settings.
return policy_fifo_->GetFilterBitsReader(contents);
}
// Defer just in case configuration uses block-based filter
void CreateFilter(const Slice* keys, int n, std::string* dst) const override {
policy_otherwise_->CreateFilter(keys, n, dst);
}
bool KeyMayMatch(const Slice& key, const Slice& filter) const override {
return policy_otherwise_->KeyMayMatch(key, filter);
}
private:
const std::unique_ptr<const FilterPolicy> policy_fifo_;
const std::unique_ptr<const FilterPolicy> policy_l0_other_;
const std::unique_ptr<const FilterPolicy> policy_otherwise_;
};
class TestingContextCustomFilterPolicy
: public LevelAndStyleCustomFilterPolicy {
public:
explicit TestingContextCustomFilterPolicy(int bpk_fifo, int bpk_l0_other,
int bpk_otherwise)
: LevelAndStyleCustomFilterPolicy(bpk_fifo, bpk_l0_other, bpk_otherwise) {
}
FilterBitsBuilder* GetBuilderWithContext(
const FilterBuildingContext& context) const override {
test_report_ += "cf=";
test_report_ += context.column_family_name;
test_report_ += ",cs=";
test_report_ +=
OptionsHelper::compaction_style_to_string[context.compaction_style];
test_report_ += ",lv=";
test_report_ += std::to_string(context.level_at_creation);
test_report_ += "\n";
return LevelAndStyleCustomFilterPolicy::GetBuilderWithContext(context);
}
std::string DumpTestReport() {
std::string rv;
std::swap(rv, test_report_);
return rv;
}
private:
mutable std::string test_report_;
};
} // namespace
TEST_F(DBBloomFilterTest, ContextCustomFilterPolicy) {
for (bool fifo : {true, false}) {
Options options = CurrentOptions();
options.statistics = ROCKSDB_NAMESPACE::CreateDBStatistics();
options.compaction_style =
fifo ? kCompactionStyleFIFO : kCompactionStyleLevel;
BlockBasedTableOptions table_options;
auto policy = std::make_shared<TestingContextCustomFilterPolicy>(15, 8, 5);
table_options.filter_policy = policy;
table_options.format_version = 5;
options.table_factory.reset(NewBlockBasedTableFactory(table_options));
CreateAndReopenWithCF({fifo ? "abe" : "bob"}, options);
const int maxKey = 10000;
for (int i = 0; i < maxKey / 2; i++) {
ASSERT_OK(Put(1, Key(i), Key(i)));
}
// Add a large key to make the file contain wide range
ASSERT_OK(Put(1, Key(maxKey + 55555), Key(maxKey + 55555)));
Flush(1);
EXPECT_EQ(policy->DumpTestReport(),
fifo ? "cf=abe,cs=kCompactionStyleFIFO,lv=0\n"
: "cf=bob,cs=kCompactionStyleLevel,lv=0\n");
for (int i = maxKey / 2; i < maxKey; i++) {
ASSERT_OK(Put(1, Key(i), Key(i)));
}
Flush(1);
EXPECT_EQ(policy->DumpTestReport(),
fifo ? "cf=abe,cs=kCompactionStyleFIFO,lv=0\n"
: "cf=bob,cs=kCompactionStyleLevel,lv=0\n");
// Check that they can be found
for (int i = 0; i < maxKey; i++) {
ASSERT_EQ(Key(i), Get(1, Key(i)));
}
// Since we have two tables / two filters, we might have Bloom checks on
// our queries, but no more than one "useful" per query on a found key.
EXPECT_LE(TestGetAndResetTickerCount(options, BLOOM_FILTER_USEFUL), maxKey);
// Check that we have two filters, each about
// fifo: 0.12% FP rate (15 bits per key)
// level: 2.3% FP rate (8 bits per key)
for (int i = 0; i < maxKey; i++) {
ASSERT_EQ("NOT_FOUND", Get(1, Key(i + 33333)));
}
{
auto useful_count =
TestGetAndResetTickerCount(options, BLOOM_FILTER_USEFUL);
EXPECT_GE(useful_count, maxKey * 2 * (fifo ? 0.9980 : 0.975));
EXPECT_LE(useful_count, maxKey * 2 * (fifo ? 0.9995 : 0.98));
}
if (!fifo) { // FIFO only has L0
// Full compaction
ASSERT_OK(db_->CompactRange(CompactRangeOptions(), handles_[1], nullptr,
nullptr));
EXPECT_EQ(policy->DumpTestReport(),
"cf=bob,cs=kCompactionStyleLevel,lv=1\n");
// Check that we now have one filter, about 9.2% FP rate (5 bits per key)
for (int i = 0; i < maxKey; i++) {
ASSERT_EQ("NOT_FOUND", Get(1, Key(i + 33333)));
}
{
auto useful_count =
TestGetAndResetTickerCount(options, BLOOM_FILTER_USEFUL);
EXPECT_GE(useful_count, maxKey * 0.90);
EXPECT_LE(useful_count, maxKey * 0.91);
}
}
// Destroy
ASSERT_OK(dbfull()->DropColumnFamily(handles_[1]));
dbfull()->DestroyColumnFamilyHandle(handles_[1]);
handles_[1] = nullptr;
}
}
class SliceTransformLimitedDomain : public SliceTransform {
const char* Name() const override { return "SliceTransformLimitedDomain"; }
Slice Transform(const Slice& src) const override {
return Slice(src.data(), 5);
}
bool InDomain(const Slice& src) const override {
// prefix will be x????
return src.size() >= 5 && src[0] == 'x';
}
bool InRange(const Slice& dst) const override {
// prefix will be x????
return dst.size() == 5 && dst[0] == 'x';
}
};
TEST_F(DBBloomFilterTest, PrefixExtractorFullFilter) {
BlockBasedTableOptions bbto;
// Full Filter Block
bbto.filter_policy.reset(ROCKSDB_NAMESPACE::NewBloomFilterPolicy(10, false));
bbto.whole_key_filtering = false;
Options options = CurrentOptions();
options.prefix_extractor = std::make_shared<SliceTransformLimitedDomain>();
options.table_factory.reset(NewBlockBasedTableFactory(bbto));
DestroyAndReopen(options);
ASSERT_OK(Put("x1111_AAAA", "val1"));
ASSERT_OK(Put("x1112_AAAA", "val2"));
ASSERT_OK(Put("x1113_AAAA", "val3"));
ASSERT_OK(Put("x1114_AAAA", "val4"));
// Not in domain, wont be added to filter
ASSERT_OK(Put("zzzzz_AAAA", "val5"));
ASSERT_OK(Flush());
ASSERT_EQ(Get("x1111_AAAA"), "val1");
ASSERT_EQ(Get("x1112_AAAA"), "val2");
ASSERT_EQ(Get("x1113_AAAA"), "val3");
ASSERT_EQ(Get("x1114_AAAA"), "val4");
// Was not added to filter but rocksdb will try to read it from the filter
ASSERT_EQ(Get("zzzzz_AAAA"), "val5");
}
TEST_F(DBBloomFilterTest, PrefixExtractorBlockFilter) {
BlockBasedTableOptions bbto;
// Block Filter Block
bbto.filter_policy.reset(ROCKSDB_NAMESPACE::NewBloomFilterPolicy(10, true));
Options options = CurrentOptions();
options.prefix_extractor = std::make_shared<SliceTransformLimitedDomain>();
options.table_factory.reset(NewBlockBasedTableFactory(bbto));
DestroyAndReopen(options);
ASSERT_OK(Put("x1113_AAAA", "val3"));
ASSERT_OK(Put("x1114_AAAA", "val4"));
// Not in domain, wont be added to filter
ASSERT_OK(Put("zzzzz_AAAA", "val1"));
ASSERT_OK(Put("zzzzz_AAAB", "val2"));
ASSERT_OK(Put("zzzzz_AAAC", "val3"));
ASSERT_OK(Put("zzzzz_AAAD", "val4"));
ASSERT_OK(Flush());
std::vector<std::string> iter_res;
auto iter = db_->NewIterator(ReadOptions());
// Seek to a key that was not in Domain
for (iter->Seek("zzzzz_AAAA"); iter->Valid(); iter->Next()) {
iter_res.emplace_back(iter->value().ToString());
}
std::vector<std::string> expected_res = {"val1", "val2", "val3", "val4"};
ASSERT_EQ(iter_res, expected_res);
delete iter;
}
TEST_F(DBBloomFilterTest, MemtableWholeKeyBloomFilter) {
// regression test for #2743. the range delete tombstones in memtable should
// be added even when Get() skips searching due to its prefix bloom filter
const int kMemtableSize = 1 << 20; // 1MB
const int kMemtablePrefixFilterSize = 1 << 13; // 8KB
const int kPrefixLen = 4;
Options options = CurrentOptions();
options.memtable_prefix_bloom_size_ratio =
static_cast<double>(kMemtablePrefixFilterSize) / kMemtableSize;
options.prefix_extractor.reset(
ROCKSDB_NAMESPACE::NewFixedPrefixTransform(kPrefixLen));
options.write_buffer_size = kMemtableSize;
options.memtable_whole_key_filtering = false;
Reopen(options);
std::string key1("AAAABBBB");
std::string key2("AAAACCCC"); // not in DB
std::string key3("AAAADDDD");
std::string key4("AAAAEEEE");
std::string value1("Value1");
std::string value3("Value3");
std::string value4("Value4");
ASSERT_OK(Put(key1, value1, WriteOptions()));
// check memtable bloom stats
ASSERT_EQ("NOT_FOUND", Get(key2));
ASSERT_EQ(0, get_perf_context()->bloom_memtable_miss_count);
// same prefix, bloom filter false positive
ASSERT_EQ(1, get_perf_context()->bloom_memtable_hit_count);
// enable whole key bloom filter
options.memtable_whole_key_filtering = true;
Reopen(options);
// check memtable bloom stats
ASSERT_OK(Put(key3, value3, WriteOptions()));
ASSERT_EQ("NOT_FOUND", Get(key2));
// whole key bloom filter kicks in and determines it's a miss
ASSERT_EQ(1, get_perf_context()->bloom_memtable_miss_count);
ASSERT_EQ(1, get_perf_context()->bloom_memtable_hit_count);
// verify whole key filtering does not depend on prefix_extractor
options.prefix_extractor.reset();
Reopen(options);
// check memtable bloom stats
ASSERT_OK(Put(key4, value4, WriteOptions()));
ASSERT_EQ("NOT_FOUND", Get(key2));
// whole key bloom filter kicks in and determines it's a miss
ASSERT_EQ(2, get_perf_context()->bloom_memtable_miss_count);
ASSERT_EQ(1, get_perf_context()->bloom_memtable_hit_count);
}
TEST_F(DBBloomFilterTest, MemtablePrefixBloomOutOfDomain) {
constexpr size_t kPrefixSize = 8;
const std::string kKey = "key";
assert(kKey.size() < kPrefixSize);
Options options = CurrentOptions();
options.prefix_extractor.reset(NewFixedPrefixTransform(kPrefixSize));
options.memtable_prefix_bloom_size_ratio = 0.25;
Reopen(options);
ASSERT_OK(Put(kKey, "v"));
ASSERT_EQ("v", Get(kKey));
std::unique_ptr<Iterator> iter(dbfull()->NewIterator(ReadOptions()));
iter->Seek(kKey);
ASSERT_TRUE(iter->Valid());
ASSERT_EQ(kKey, iter->key());
iter->SeekForPrev(kKey);
ASSERT_TRUE(iter->Valid());
ASSERT_EQ(kKey, iter->key());
}
Basic MultiGet support for partitioned filters (#6757) Summary: In MultiGet, access each applicable filter partition only once per batch, rather than for each applicable key. Also, * Fix Bloom stats for MultiGet * Fix/refactor MultiGetContext::Range::KeysLeft, including * Add efficient BitsSetToOne implementation * Assert that MultiGetContext::Range does not go beyond shift range Performance test: Generate db: $ ./db_bench --benchmarks=fillrandom --num=15000000 --cache_index_and_filter_blocks -bloom_bits=10 -partition_index_and_filters=true ... Before (middle performing run of three; note some missing Bloom stats): $ ./db_bench --use-existing-db --benchmarks=multireadrandom --num=15000000 --cache_index_and_filter_blocks --bloom_bits=10 --threads=16 --cache_size=20000000 -partition_index_and_filters -batch_size=32 -multiread_batched -statistics --duration=20 2>&1 | egrep 'micros/op|block.cache.filter.hit|bloom.filter.(full|use)|number.multiget' multireadrandom : 26.403 micros/op 597517 ops/sec; (548427 of 671968 found) rocksdb.block.cache.filter.hit COUNT : 83443275 rocksdb.bloom.filter.useful COUNT : 0 rocksdb.bloom.filter.full.positive COUNT : 0 rocksdb.bloom.filter.full.true.positive COUNT : 7931450 rocksdb.number.multiget.get COUNT : 385984 rocksdb.number.multiget.keys.read COUNT : 12351488 rocksdb.number.multiget.bytes.read COUNT : 793145000 rocksdb.number.multiget.keys.found COUNT : 7931450 After (middle performing run of three): $ ./db_bench_new --use-existing-db --benchmarks=multireadrandom --num=15000000 --cache_index_and_filter_blocks --bloom_bits=10 --threads=16 --cache_size=20000000 -partition_index_and_filters -batch_size=32 -multiread_batched -statistics --duration=20 2>&1 | egrep 'micros/op|block.cache.filter.hit|bloom.filter.(full|use)|number.multiget' multireadrandom : 21.024 micros/op 752963 ops/sec; (705188 of 863968 found) rocksdb.block.cache.filter.hit COUNT : 49856682 rocksdb.bloom.filter.useful COUNT : 45684579 rocksdb.bloom.filter.full.positive COUNT : 10395458 rocksdb.bloom.filter.full.true.positive COUNT : 9908456 rocksdb.number.multiget.get COUNT : 481984 rocksdb.number.multiget.keys.read COUNT : 15423488 rocksdb.number.multiget.bytes.read COUNT : 990845600 rocksdb.number.multiget.keys.found COUNT : 9908456 So that's about 25% higher throughput even for random keys Pull Request resolved: https://github.com/facebook/rocksdb/pull/6757 Test Plan: unit test included Reviewed By: anand1976 Differential Revision: D21243256 Pulled By: pdillinger fbshipit-source-id: 5644a1468d9e8c8575be02f4e04bc5d62dbbb57f
4 years ago
class DBBloomFilterTestVaryPrefixAndFormatVer
: public DBTestBase,
public testing::WithParamInterface<std::tuple<bool, uint32_t>> {
protected:
bool use_prefix_;
uint32_t format_version_;
public:
DBBloomFilterTestVaryPrefixAndFormatVer()
: DBTestBase("/db_bloom_filter_tests") {}
~DBBloomFilterTestVaryPrefixAndFormatVer() override {}
void SetUp() override {
use_prefix_ = std::get<0>(GetParam());
format_version_ = std::get<1>(GetParam());
}
static std::string UKey(uint32_t i) { return Key(static_cast<int>(i)); }
};
TEST_P(DBBloomFilterTestVaryPrefixAndFormatVer, PartitionedMultiGet) {
Options options = CurrentOptions();
if (use_prefix_) {
// Entire key from UKey()
options.prefix_extractor.reset(NewCappedPrefixTransform(9));
}
options.statistics = ROCKSDB_NAMESPACE::CreateDBStatistics();
BlockBasedTableOptions bbto;
bbto.filter_policy.reset(NewBloomFilterPolicy(20));
bbto.partition_filters = true;
bbto.index_type = BlockBasedTableOptions::IndexType::kTwoLevelIndexSearch;
bbto.whole_key_filtering = !use_prefix_;
if (use_prefix_) { // (not related to prefix, just alternating between)
// Make sure code appropriately deals with metadata block size setting
// that is "too small" (smaller than minimum size for filter builder)
bbto.metadata_block_size = 63;
} else {
// Make sure the test will work even on platforms with large minimum
// filter size, due to large cache line size.
// (Largest cache line size + 10+% overhead.)
bbto.metadata_block_size = 290;
}
Basic MultiGet support for partitioned filters (#6757) Summary: In MultiGet, access each applicable filter partition only once per batch, rather than for each applicable key. Also, * Fix Bloom stats for MultiGet * Fix/refactor MultiGetContext::Range::KeysLeft, including * Add efficient BitsSetToOne implementation * Assert that MultiGetContext::Range does not go beyond shift range Performance test: Generate db: $ ./db_bench --benchmarks=fillrandom --num=15000000 --cache_index_and_filter_blocks -bloom_bits=10 -partition_index_and_filters=true ... Before (middle performing run of three; note some missing Bloom stats): $ ./db_bench --use-existing-db --benchmarks=multireadrandom --num=15000000 --cache_index_and_filter_blocks --bloom_bits=10 --threads=16 --cache_size=20000000 -partition_index_and_filters -batch_size=32 -multiread_batched -statistics --duration=20 2>&1 | egrep 'micros/op|block.cache.filter.hit|bloom.filter.(full|use)|number.multiget' multireadrandom : 26.403 micros/op 597517 ops/sec; (548427 of 671968 found) rocksdb.block.cache.filter.hit COUNT : 83443275 rocksdb.bloom.filter.useful COUNT : 0 rocksdb.bloom.filter.full.positive COUNT : 0 rocksdb.bloom.filter.full.true.positive COUNT : 7931450 rocksdb.number.multiget.get COUNT : 385984 rocksdb.number.multiget.keys.read COUNT : 12351488 rocksdb.number.multiget.bytes.read COUNT : 793145000 rocksdb.number.multiget.keys.found COUNT : 7931450 After (middle performing run of three): $ ./db_bench_new --use-existing-db --benchmarks=multireadrandom --num=15000000 --cache_index_and_filter_blocks --bloom_bits=10 --threads=16 --cache_size=20000000 -partition_index_and_filters -batch_size=32 -multiread_batched -statistics --duration=20 2>&1 | egrep 'micros/op|block.cache.filter.hit|bloom.filter.(full|use)|number.multiget' multireadrandom : 21.024 micros/op 752963 ops/sec; (705188 of 863968 found) rocksdb.block.cache.filter.hit COUNT : 49856682 rocksdb.bloom.filter.useful COUNT : 45684579 rocksdb.bloom.filter.full.positive COUNT : 10395458 rocksdb.bloom.filter.full.true.positive COUNT : 9908456 rocksdb.number.multiget.get COUNT : 481984 rocksdb.number.multiget.keys.read COUNT : 15423488 rocksdb.number.multiget.bytes.read COUNT : 990845600 rocksdb.number.multiget.keys.found COUNT : 9908456 So that's about 25% higher throughput even for random keys Pull Request resolved: https://github.com/facebook/rocksdb/pull/6757 Test Plan: unit test included Reviewed By: anand1976 Differential Revision: D21243256 Pulled By: pdillinger fbshipit-source-id: 5644a1468d9e8c8575be02f4e04bc5d62dbbb57f
4 years ago
options.table_factory.reset(NewBlockBasedTableFactory(bbto));
DestroyAndReopen(options);
ReadOptions ropts;
constexpr uint32_t N = 12000;
Basic MultiGet support for partitioned filters (#6757) Summary: In MultiGet, access each applicable filter partition only once per batch, rather than for each applicable key. Also, * Fix Bloom stats for MultiGet * Fix/refactor MultiGetContext::Range::KeysLeft, including * Add efficient BitsSetToOne implementation * Assert that MultiGetContext::Range does not go beyond shift range Performance test: Generate db: $ ./db_bench --benchmarks=fillrandom --num=15000000 --cache_index_and_filter_blocks -bloom_bits=10 -partition_index_and_filters=true ... Before (middle performing run of three; note some missing Bloom stats): $ ./db_bench --use-existing-db --benchmarks=multireadrandom --num=15000000 --cache_index_and_filter_blocks --bloom_bits=10 --threads=16 --cache_size=20000000 -partition_index_and_filters -batch_size=32 -multiread_batched -statistics --duration=20 2>&1 | egrep 'micros/op|block.cache.filter.hit|bloom.filter.(full|use)|number.multiget' multireadrandom : 26.403 micros/op 597517 ops/sec; (548427 of 671968 found) rocksdb.block.cache.filter.hit COUNT : 83443275 rocksdb.bloom.filter.useful COUNT : 0 rocksdb.bloom.filter.full.positive COUNT : 0 rocksdb.bloom.filter.full.true.positive COUNT : 7931450 rocksdb.number.multiget.get COUNT : 385984 rocksdb.number.multiget.keys.read COUNT : 12351488 rocksdb.number.multiget.bytes.read COUNT : 793145000 rocksdb.number.multiget.keys.found COUNT : 7931450 After (middle performing run of three): $ ./db_bench_new --use-existing-db --benchmarks=multireadrandom --num=15000000 --cache_index_and_filter_blocks --bloom_bits=10 --threads=16 --cache_size=20000000 -partition_index_and_filters -batch_size=32 -multiread_batched -statistics --duration=20 2>&1 | egrep 'micros/op|block.cache.filter.hit|bloom.filter.(full|use)|number.multiget' multireadrandom : 21.024 micros/op 752963 ops/sec; (705188 of 863968 found) rocksdb.block.cache.filter.hit COUNT : 49856682 rocksdb.bloom.filter.useful COUNT : 45684579 rocksdb.bloom.filter.full.positive COUNT : 10395458 rocksdb.bloom.filter.full.true.positive COUNT : 9908456 rocksdb.number.multiget.get COUNT : 481984 rocksdb.number.multiget.keys.read COUNT : 15423488 rocksdb.number.multiget.bytes.read COUNT : 990845600 rocksdb.number.multiget.keys.found COUNT : 9908456 So that's about 25% higher throughput even for random keys Pull Request resolved: https://github.com/facebook/rocksdb/pull/6757 Test Plan: unit test included Reviewed By: anand1976 Differential Revision: D21243256 Pulled By: pdillinger fbshipit-source-id: 5644a1468d9e8c8575be02f4e04bc5d62dbbb57f
4 years ago
// Add N/2 evens
for (uint32_t i = 0; i < N; i += 2) {
ASSERT_OK(Put(UKey(i), UKey(i)));
}
ASSERT_OK(Flush());
#ifndef ROCKSDB_LITE
Basic MultiGet support for partitioned filters (#6757) Summary: In MultiGet, access each applicable filter partition only once per batch, rather than for each applicable key. Also, * Fix Bloom stats for MultiGet * Fix/refactor MultiGetContext::Range::KeysLeft, including * Add efficient BitsSetToOne implementation * Assert that MultiGetContext::Range does not go beyond shift range Performance test: Generate db: $ ./db_bench --benchmarks=fillrandom --num=15000000 --cache_index_and_filter_blocks -bloom_bits=10 -partition_index_and_filters=true ... Before (middle performing run of three; note some missing Bloom stats): $ ./db_bench --use-existing-db --benchmarks=multireadrandom --num=15000000 --cache_index_and_filter_blocks --bloom_bits=10 --threads=16 --cache_size=20000000 -partition_index_and_filters -batch_size=32 -multiread_batched -statistics --duration=20 2>&1 | egrep 'micros/op|block.cache.filter.hit|bloom.filter.(full|use)|number.multiget' multireadrandom : 26.403 micros/op 597517 ops/sec; (548427 of 671968 found) rocksdb.block.cache.filter.hit COUNT : 83443275 rocksdb.bloom.filter.useful COUNT : 0 rocksdb.bloom.filter.full.positive COUNT : 0 rocksdb.bloom.filter.full.true.positive COUNT : 7931450 rocksdb.number.multiget.get COUNT : 385984 rocksdb.number.multiget.keys.read COUNT : 12351488 rocksdb.number.multiget.bytes.read COUNT : 793145000 rocksdb.number.multiget.keys.found COUNT : 7931450 After (middle performing run of three): $ ./db_bench_new --use-existing-db --benchmarks=multireadrandom --num=15000000 --cache_index_and_filter_blocks --bloom_bits=10 --threads=16 --cache_size=20000000 -partition_index_and_filters -batch_size=32 -multiread_batched -statistics --duration=20 2>&1 | egrep 'micros/op|block.cache.filter.hit|bloom.filter.(full|use)|number.multiget' multireadrandom : 21.024 micros/op 752963 ops/sec; (705188 of 863968 found) rocksdb.block.cache.filter.hit COUNT : 49856682 rocksdb.bloom.filter.useful COUNT : 45684579 rocksdb.bloom.filter.full.positive COUNT : 10395458 rocksdb.bloom.filter.full.true.positive COUNT : 9908456 rocksdb.number.multiget.get COUNT : 481984 rocksdb.number.multiget.keys.read COUNT : 15423488 rocksdb.number.multiget.bytes.read COUNT : 990845600 rocksdb.number.multiget.keys.found COUNT : 9908456 So that's about 25% higher throughput even for random keys Pull Request resolved: https://github.com/facebook/rocksdb/pull/6757 Test Plan: unit test included Reviewed By: anand1976 Differential Revision: D21243256 Pulled By: pdillinger fbshipit-source-id: 5644a1468d9e8c8575be02f4e04bc5d62dbbb57f
4 years ago
ASSERT_EQ(TotalTableFiles(), 1);
#endif
Basic MultiGet support for partitioned filters (#6757) Summary: In MultiGet, access each applicable filter partition only once per batch, rather than for each applicable key. Also, * Fix Bloom stats for MultiGet * Fix/refactor MultiGetContext::Range::KeysLeft, including * Add efficient BitsSetToOne implementation * Assert that MultiGetContext::Range does not go beyond shift range Performance test: Generate db: $ ./db_bench --benchmarks=fillrandom --num=15000000 --cache_index_and_filter_blocks -bloom_bits=10 -partition_index_and_filters=true ... Before (middle performing run of three; note some missing Bloom stats): $ ./db_bench --use-existing-db --benchmarks=multireadrandom --num=15000000 --cache_index_and_filter_blocks --bloom_bits=10 --threads=16 --cache_size=20000000 -partition_index_and_filters -batch_size=32 -multiread_batched -statistics --duration=20 2>&1 | egrep 'micros/op|block.cache.filter.hit|bloom.filter.(full|use)|number.multiget' multireadrandom : 26.403 micros/op 597517 ops/sec; (548427 of 671968 found) rocksdb.block.cache.filter.hit COUNT : 83443275 rocksdb.bloom.filter.useful COUNT : 0 rocksdb.bloom.filter.full.positive COUNT : 0 rocksdb.bloom.filter.full.true.positive COUNT : 7931450 rocksdb.number.multiget.get COUNT : 385984 rocksdb.number.multiget.keys.read COUNT : 12351488 rocksdb.number.multiget.bytes.read COUNT : 793145000 rocksdb.number.multiget.keys.found COUNT : 7931450 After (middle performing run of three): $ ./db_bench_new --use-existing-db --benchmarks=multireadrandom --num=15000000 --cache_index_and_filter_blocks --bloom_bits=10 --threads=16 --cache_size=20000000 -partition_index_and_filters -batch_size=32 -multiread_batched -statistics --duration=20 2>&1 | egrep 'micros/op|block.cache.filter.hit|bloom.filter.(full|use)|number.multiget' multireadrandom : 21.024 micros/op 752963 ops/sec; (705188 of 863968 found) rocksdb.block.cache.filter.hit COUNT : 49856682 rocksdb.bloom.filter.useful COUNT : 45684579 rocksdb.bloom.filter.full.positive COUNT : 10395458 rocksdb.bloom.filter.full.true.positive COUNT : 9908456 rocksdb.number.multiget.get COUNT : 481984 rocksdb.number.multiget.keys.read COUNT : 15423488 rocksdb.number.multiget.bytes.read COUNT : 990845600 rocksdb.number.multiget.keys.found COUNT : 9908456 So that's about 25% higher throughput even for random keys Pull Request resolved: https://github.com/facebook/rocksdb/pull/6757 Test Plan: unit test included Reviewed By: anand1976 Differential Revision: D21243256 Pulled By: pdillinger fbshipit-source-id: 5644a1468d9e8c8575be02f4e04bc5d62dbbb57f
4 years ago
constexpr uint32_t Q = 29;
// MultiGet In
std::array<std::string, Q> keys;
std::array<Slice, Q> key_slices;
std::array<ColumnFamilyHandle*, Q> column_families;
// MultiGet Out
std::array<Status, Q> statuses;
std::array<PinnableSlice, Q> values;
TestGetAndResetTickerCount(options, BLOCK_CACHE_FILTER_HIT);
TestGetAndResetTickerCount(options, BLOCK_CACHE_FILTER_MISS);
TestGetAndResetTickerCount(options, BLOOM_FILTER_PREFIX_USEFUL);
TestGetAndResetTickerCount(options, BLOOM_FILTER_USEFUL);
TestGetAndResetTickerCount(options, BLOOM_FILTER_PREFIX_CHECKED);
TestGetAndResetTickerCount(options, BLOOM_FILTER_FULL_POSITIVE);
TestGetAndResetTickerCount(options, BLOOM_FILTER_FULL_TRUE_POSITIVE);
// Check that initial clump of keys only loads one partition filter from
// block cache.
// And that spread out keys load many partition filters.
// In both cases, mix present vs. not present keys.
for (uint32_t stride : {uint32_t{1}, (N / Q) | 1}) {
for (uint32_t i = 0; i < Q; ++i) {
keys[i] = UKey(i * stride);
key_slices[i] = Slice(keys[i]);
column_families[i] = db_->DefaultColumnFamily();
statuses[i] = Status();
values[i] = PinnableSlice();
}
db_->MultiGet(ropts, Q, &column_families[0], &key_slices[0], &values[0],
/*timestamps=*/nullptr, &statuses[0], true);
// Confirm correct status results
uint32_t number_not_found = 0;
for (uint32_t i = 0; i < Q; ++i) {
if ((i * stride % 2) == 0) {
ASSERT_OK(statuses[i]);
} else {
ASSERT_TRUE(statuses[i].IsNotFound());
++number_not_found;
}
}
// Confirm correct Bloom stats (no FPs)
uint64_t filter_useful = TestGetAndResetTickerCount(
options,
use_prefix_ ? BLOOM_FILTER_PREFIX_USEFUL : BLOOM_FILTER_USEFUL);
uint64_t filter_checked =
TestGetAndResetTickerCount(options, use_prefix_
? BLOOM_FILTER_PREFIX_CHECKED
: BLOOM_FILTER_FULL_POSITIVE) +
(use_prefix_ ? 0 : filter_useful);
EXPECT_EQ(filter_useful, number_not_found);
EXPECT_EQ(filter_checked, Q);
if (!use_prefix_) {
EXPECT_EQ(
TestGetAndResetTickerCount(options, BLOOM_FILTER_FULL_TRUE_POSITIVE),
Q - number_not_found);
}
// Confirm no duplicate loading same filter partition
uint64_t filter_accesses =
TestGetAndResetTickerCount(options, BLOCK_CACHE_FILTER_HIT) +
TestGetAndResetTickerCount(options, BLOCK_CACHE_FILTER_MISS);
if (stride == 1) {
EXPECT_EQ(filter_accesses, 1);
} else {
// for large stride
EXPECT_GE(filter_accesses, Q / 2 + 1);
}
}
// Check that a clump of keys (present and not) works when spanning
// two partitions
int found_spanning = 0;
for (uint32_t start = 0; start < N / 2;) {
for (uint32_t i = 0; i < Q; ++i) {
keys[i] = UKey(start + i);
key_slices[i] = Slice(keys[i]);
column_families[i] = db_->DefaultColumnFamily();
statuses[i] = Status();
values[i] = PinnableSlice();
}
db_->MultiGet(ropts, Q, &column_families[0], &key_slices[0], &values[0],
/*timestamps=*/nullptr, &statuses[0], true);
// Confirm correct status results
uint32_t number_not_found = 0;
for (uint32_t i = 0; i < Q; ++i) {
if (((start + i) % 2) == 0) {
ASSERT_OK(statuses[i]);
} else {
ASSERT_TRUE(statuses[i].IsNotFound());
++number_not_found;
}
}
// Confirm correct Bloom stats (might see some FPs)
uint64_t filter_useful = TestGetAndResetTickerCount(
options,
use_prefix_ ? BLOOM_FILTER_PREFIX_USEFUL : BLOOM_FILTER_USEFUL);
uint64_t filter_checked =
TestGetAndResetTickerCount(options, use_prefix_
? BLOOM_FILTER_PREFIX_CHECKED
: BLOOM_FILTER_FULL_POSITIVE) +
(use_prefix_ ? 0 : filter_useful);
EXPECT_GE(filter_useful, number_not_found - 2); // possible FP
EXPECT_EQ(filter_checked, Q);
if (!use_prefix_) {
EXPECT_EQ(
TestGetAndResetTickerCount(options, BLOOM_FILTER_FULL_TRUE_POSITIVE),
Q - number_not_found);
}
// Confirm no duplicate loading of same filter partition
uint64_t filter_accesses =
TestGetAndResetTickerCount(options, BLOCK_CACHE_FILTER_HIT) +
TestGetAndResetTickerCount(options, BLOCK_CACHE_FILTER_MISS);
if (filter_accesses == 2) {
// Spanned across partitions.
++found_spanning;
if (found_spanning >= 2) {
break;
} else {
// Ensure that at least once we have at least one present and
// one non-present key on both sides of partition boundary.
start += 2;
}
} else {
EXPECT_EQ(filter_accesses, 1);
// See explanation at "start += 2"
start += Q - 4;
}
}
EXPECT_TRUE(found_spanning >= 2);
}
INSTANTIATE_TEST_SUITE_P(
DBBloomFilterTestVaryPrefixAndFormatVer,
DBBloomFilterTestVaryPrefixAndFormatVer,
::testing::Values(
// (use_prefix, format_version)
std::make_tuple(false, 2), std::make_tuple(false, 3),
std::make_tuple(false, 4), std::make_tuple(false, 5),
std::make_tuple(true, 2), std::make_tuple(true, 3),
std::make_tuple(true, 4), std::make_tuple(true, 5)));
Basic MultiGet support for partitioned filters (#6757) Summary: In MultiGet, access each applicable filter partition only once per batch, rather than for each applicable key. Also, * Fix Bloom stats for MultiGet * Fix/refactor MultiGetContext::Range::KeysLeft, including * Add efficient BitsSetToOne implementation * Assert that MultiGetContext::Range does not go beyond shift range Performance test: Generate db: $ ./db_bench --benchmarks=fillrandom --num=15000000 --cache_index_and_filter_blocks -bloom_bits=10 -partition_index_and_filters=true ... Before (middle performing run of three; note some missing Bloom stats): $ ./db_bench --use-existing-db --benchmarks=multireadrandom --num=15000000 --cache_index_and_filter_blocks --bloom_bits=10 --threads=16 --cache_size=20000000 -partition_index_and_filters -batch_size=32 -multiread_batched -statistics --duration=20 2>&1 | egrep 'micros/op|block.cache.filter.hit|bloom.filter.(full|use)|number.multiget' multireadrandom : 26.403 micros/op 597517 ops/sec; (548427 of 671968 found) rocksdb.block.cache.filter.hit COUNT : 83443275 rocksdb.bloom.filter.useful COUNT : 0 rocksdb.bloom.filter.full.positive COUNT : 0 rocksdb.bloom.filter.full.true.positive COUNT : 7931450 rocksdb.number.multiget.get COUNT : 385984 rocksdb.number.multiget.keys.read COUNT : 12351488 rocksdb.number.multiget.bytes.read COUNT : 793145000 rocksdb.number.multiget.keys.found COUNT : 7931450 After (middle performing run of three): $ ./db_bench_new --use-existing-db --benchmarks=multireadrandom --num=15000000 --cache_index_and_filter_blocks --bloom_bits=10 --threads=16 --cache_size=20000000 -partition_index_and_filters -batch_size=32 -multiread_batched -statistics --duration=20 2>&1 | egrep 'micros/op|block.cache.filter.hit|bloom.filter.(full|use)|number.multiget' multireadrandom : 21.024 micros/op 752963 ops/sec; (705188 of 863968 found) rocksdb.block.cache.filter.hit COUNT : 49856682 rocksdb.bloom.filter.useful COUNT : 45684579 rocksdb.bloom.filter.full.positive COUNT : 10395458 rocksdb.bloom.filter.full.true.positive COUNT : 9908456 rocksdb.number.multiget.get COUNT : 481984 rocksdb.number.multiget.keys.read COUNT : 15423488 rocksdb.number.multiget.bytes.read COUNT : 990845600 rocksdb.number.multiget.keys.found COUNT : 9908456 So that's about 25% higher throughput even for random keys Pull Request resolved: https://github.com/facebook/rocksdb/pull/6757 Test Plan: unit test included Reviewed By: anand1976 Differential Revision: D21243256 Pulled By: pdillinger fbshipit-source-id: 5644a1468d9e8c8575be02f4e04bc5d62dbbb57f
4 years ago
#ifndef ROCKSDB_LITE
namespace {
namespace BFP2 {
// Extends BFP::Mode with option to use Plain table
using PseudoMode = int;
static constexpr PseudoMode kPlainTable = -1;
} // namespace BFP2
} // namespace
class BloomStatsTestWithParam
: public DBBloomFilterTest,
public testing::WithParamInterface<std::tuple<BFP2::PseudoMode, bool>> {
public:
BloomStatsTestWithParam() {
New Bloom filter implementation for full and partitioned filters (#6007) Summary: Adds an improved, replacement Bloom filter implementation (FastLocalBloom) for full and partitioned filters in the block-based table. This replacement is faster and more accurate, especially for high bits per key or millions of keys in a single filter. Speed The improved speed, at least on recent x86_64, comes from * Using fastrange instead of modulo (%) * Using our new hash function (XXH3 preview, added in a previous commit), which is much faster for large keys and only *slightly* slower on keys around 12 bytes if hashing the same size many thousands of times in a row. * Optimizing the Bloom filter queries with AVX2 SIMD operations. (Added AVX2 to the USE_SSE=1 build.) Careful design was required to support (a) SIMD-optimized queries, (b) compatible non-SIMD code that's simple and efficient, (c) flexible choice of number of probes, and (d) essentially maximized accuracy for a cache-local Bloom filter. Probes are made eight at a time, so any number of probes up to 8 is the same speed, then up to 16, etc. * Prefetching cache lines when building the filter. Although this optimization could be applied to the old structure as well, it seems to balance out the small added cost of accumulating 64 bit hashes for adding to the filter rather than 32 bit hashes. Here's nominal speed data from filter_bench (200MB in filters, about 10k keys each, 10 bits filter data / key, 6 probes, avg key size 24 bytes, includes hashing time) on Skylake DE (relatively low clock speed): $ ./filter_bench -quick -impl=2 -net_includes_hashing # New Bloom filter Build avg ns/key: 47.7135 Mixed inside/outside queries... Single filter net ns/op: 26.2825 Random filter net ns/op: 150.459 Average FP rate %: 0.954651 $ ./filter_bench -quick -impl=0 -net_includes_hashing # Old Bloom filter Build avg ns/key: 47.2245 Mixed inside/outside queries... Single filter net ns/op: 63.2978 Random filter net ns/op: 188.038 Average FP rate %: 1.13823 Similar build time but dramatically faster query times on hot data (63 ns to 26 ns), and somewhat faster on stale data (188 ns to 150 ns). Performance differences on batched and skewed query loads are between these extremes as expected. The only other interesting thing about speed is "inside" (query key was added to filter) vs. "outside" (query key was not added to filter) query times. The non-SIMD implementations are substantially slower when most queries are "outside" vs. "inside". This goes against what one might expect or would have observed years ago, as "outside" queries only need about two probes on average, due to short-circuiting, while "inside" always have num_probes (say 6). The problem is probably the nastily unpredictable branch. The SIMD implementation has few branches (very predictable) and has pretty consistent running time regardless of query outcome. Accuracy The generally improved accuracy (re: Issue https://github.com/facebook/rocksdb/issues/5857) comes from a better design for probing indices within a cache line (re: Issue https://github.com/facebook/rocksdb/issues/4120) and improved accuracy for millions of keys in a single filter from using a 64-bit hash function (XXH3p). Design details in code comments. Accuracy data (generalizes, except old impl gets worse with millions of keys): Memory bits per key: FP rate percent old impl -> FP rate percent new impl 6: 5.70953 -> 5.69888 8: 2.45766 -> 2.29709 10: 1.13977 -> 0.959254 12: 0.662498 -> 0.411593 16: 0.353023 -> 0.0873754 24: 0.261552 -> 0.0060971 50: 0.225453 -> ~0.00003 (less than 1 in a million queries are FP) Fixes https://github.com/facebook/rocksdb/issues/5857 Fixes https://github.com/facebook/rocksdb/issues/4120 Unlike the old implementation, this implementation has a fixed cache line size (64 bytes). At 10 bits per key, the accuracy of this new implementation is very close to the old implementation with 128-byte cache line size. If there's sufficient demand, this implementation could be generalized. Compatibility Although old releases would see the new structure as corrupt filter data and read the table as if there's no filter, we've decided only to enable the new Bloom filter with new format_version=5. This provides a smooth path for automatic adoption over time, with an option for early opt-in. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6007 Test Plan: filter_bench has been used thoroughly to validate speed, accuracy, and correctness. Unit tests have been carefully updated to exercise new and old implementations, as well as the logic to select an implementation based on context (format_version). Differential Revision: D18294749 Pulled By: pdillinger fbshipit-source-id: d44c9db3696e4d0a17caaec47075b7755c262c5f
5 years ago
bfp_impl_ = std::get<0>(GetParam());
partition_filters_ = std::get<1>(GetParam());
options_.create_if_missing = true;
options_.prefix_extractor.reset(
ROCKSDB_NAMESPACE::NewFixedPrefixTransform(4));
options_.memtable_prefix_bloom_size_ratio =
8.0 * 1024.0 / static_cast<double>(options_.write_buffer_size);
if (bfp_impl_ == BFP2::kPlainTable) {
assert(!partition_filters_); // not supported in plain table
PlainTableOptions table_options;
options_.table_factory.reset(NewPlainTableFactory(table_options));
} else {
BlockBasedTableOptions table_options;
table_options.hash_index_allow_collision = false;
if (partition_filters_) {
assert(bfp_impl_ != BFP::kDeprecatedBlock);
table_options.partition_filters = partition_filters_;
table_options.index_type =
BlockBasedTableOptions::IndexType::kTwoLevelIndexSearch;
}
table_options.filter_policy.reset(
new BFP(10, static_cast<BFP::Mode>(bfp_impl_)));
options_.table_factory.reset(NewBlockBasedTableFactory(table_options));
}
options_.env = env_;
get_perf_context()->Reset();
DestroyAndReopen(options_);
}
~BloomStatsTestWithParam() override {
get_perf_context()->Reset();
Destroy(options_);
}
// Required if inheriting from testing::WithParamInterface<>
static void SetUpTestCase() {}
static void TearDownTestCase() {}
BFP2::PseudoMode bfp_impl_;
bool partition_filters_;
Options options_;
};
// 1 Insert 2 K-V pairs into DB
// 2 Call Get() for both keys - expext memtable bloom hit stat to be 2
// 3 Call Get() for nonexisting key - expect memtable bloom miss stat to be 1
// 4 Call Flush() to create SST
// 5 Call Get() for both keys - expext SST bloom hit stat to be 2
// 6 Call Get() for nonexisting key - expect SST bloom miss stat to be 1
// Test both: block and plain SST
TEST_P(BloomStatsTestWithParam, BloomStatsTest) {
std::string key1("AAAA");
std::string key2("RXDB"); // not in DB
std::string key3("ZBRA");
std::string value1("Value1");
std::string value3("Value3");
ASSERT_OK(Put(key1, value1, WriteOptions()));
ASSERT_OK(Put(key3, value3, WriteOptions()));
// check memtable bloom stats
ASSERT_EQ(value1, Get(key1));
ASSERT_EQ(1, get_perf_context()->bloom_memtable_hit_count);
ASSERT_EQ(value3, Get(key3));
ASSERT_EQ(2, get_perf_context()->bloom_memtable_hit_count);
ASSERT_EQ(0, get_perf_context()->bloom_memtable_miss_count);
ASSERT_EQ("NOT_FOUND", Get(key2));
ASSERT_EQ(1, get_perf_context()->bloom_memtable_miss_count);
ASSERT_EQ(2, get_perf_context()->bloom_memtable_hit_count);
// sanity checks
ASSERT_EQ(0, get_perf_context()->bloom_sst_hit_count);
ASSERT_EQ(0, get_perf_context()->bloom_sst_miss_count);
Flush();
// sanity checks
ASSERT_EQ(0, get_perf_context()->bloom_sst_hit_count);
ASSERT_EQ(0, get_perf_context()->bloom_sst_miss_count);
// check SST bloom stats
ASSERT_EQ(value1, Get(key1));
ASSERT_EQ(1, get_perf_context()->bloom_sst_hit_count);
ASSERT_EQ(value3, Get(key3));
ASSERT_EQ(2, get_perf_context()->bloom_sst_hit_count);
ASSERT_EQ("NOT_FOUND", Get(key2));
ASSERT_EQ(1, get_perf_context()->bloom_sst_miss_count);
}
// Same scenario as in BloomStatsTest but using an iterator
TEST_P(BloomStatsTestWithParam, BloomStatsTestWithIter) {
std::string key1("AAAA");
std::string key2("RXDB"); // not in DB
std::string key3("ZBRA");
std::string value1("Value1");
std::string value3("Value3");
ASSERT_OK(Put(key1, value1, WriteOptions()));
ASSERT_OK(Put(key3, value3, WriteOptions()));
std::unique_ptr<Iterator> iter(dbfull()->NewIterator(ReadOptions()));
// check memtable bloom stats
iter->Seek(key1);
ASSERT_OK(iter->status());
ASSERT_TRUE(iter->Valid());
ASSERT_EQ(value1, iter->value().ToString());
ASSERT_EQ(1, get_perf_context()->bloom_memtable_hit_count);
ASSERT_EQ(0, get_perf_context()->bloom_memtable_miss_count);
iter->Seek(key3);
ASSERT_OK(iter->status());
ASSERT_TRUE(iter->Valid());
ASSERT_EQ(value3, iter->value().ToString());
ASSERT_EQ(2, get_perf_context()->bloom_memtable_hit_count);
ASSERT_EQ(0, get_perf_context()->bloom_memtable_miss_count);
iter->Seek(key2);
ASSERT_OK(iter->status());
ASSERT_TRUE(!iter->Valid());
ASSERT_EQ(1, get_perf_context()->bloom_memtable_miss_count);
ASSERT_EQ(2, get_perf_context()->bloom_memtable_hit_count);
Flush();
iter.reset(dbfull()->NewIterator(ReadOptions()));
// Check SST bloom stats
iter->Seek(key1);
ASSERT_OK(iter->status());
ASSERT_TRUE(iter->Valid());
ASSERT_EQ(value1, iter->value().ToString());
ASSERT_EQ(1, get_perf_context()->bloom_sst_hit_count);
iter->Seek(key3);
ASSERT_OK(iter->status());
ASSERT_TRUE(iter->Valid());
ASSERT_EQ(value3, iter->value().ToString());
// The seek doesn't check block-based bloom filter because last index key
// starts with the same prefix we're seeking to.
New Bloom filter implementation for full and partitioned filters (#6007) Summary: Adds an improved, replacement Bloom filter implementation (FastLocalBloom) for full and partitioned filters in the block-based table. This replacement is faster and more accurate, especially for high bits per key or millions of keys in a single filter. Speed The improved speed, at least on recent x86_64, comes from * Using fastrange instead of modulo (%) * Using our new hash function (XXH3 preview, added in a previous commit), which is much faster for large keys and only *slightly* slower on keys around 12 bytes if hashing the same size many thousands of times in a row. * Optimizing the Bloom filter queries with AVX2 SIMD operations. (Added AVX2 to the USE_SSE=1 build.) Careful design was required to support (a) SIMD-optimized queries, (b) compatible non-SIMD code that's simple and efficient, (c) flexible choice of number of probes, and (d) essentially maximized accuracy for a cache-local Bloom filter. Probes are made eight at a time, so any number of probes up to 8 is the same speed, then up to 16, etc. * Prefetching cache lines when building the filter. Although this optimization could be applied to the old structure as well, it seems to balance out the small added cost of accumulating 64 bit hashes for adding to the filter rather than 32 bit hashes. Here's nominal speed data from filter_bench (200MB in filters, about 10k keys each, 10 bits filter data / key, 6 probes, avg key size 24 bytes, includes hashing time) on Skylake DE (relatively low clock speed): $ ./filter_bench -quick -impl=2 -net_includes_hashing # New Bloom filter Build avg ns/key: 47.7135 Mixed inside/outside queries... Single filter net ns/op: 26.2825 Random filter net ns/op: 150.459 Average FP rate %: 0.954651 $ ./filter_bench -quick -impl=0 -net_includes_hashing # Old Bloom filter Build avg ns/key: 47.2245 Mixed inside/outside queries... Single filter net ns/op: 63.2978 Random filter net ns/op: 188.038 Average FP rate %: 1.13823 Similar build time but dramatically faster query times on hot data (63 ns to 26 ns), and somewhat faster on stale data (188 ns to 150 ns). Performance differences on batched and skewed query loads are between these extremes as expected. The only other interesting thing about speed is "inside" (query key was added to filter) vs. "outside" (query key was not added to filter) query times. The non-SIMD implementations are substantially slower when most queries are "outside" vs. "inside". This goes against what one might expect or would have observed years ago, as "outside" queries only need about two probes on average, due to short-circuiting, while "inside" always have num_probes (say 6). The problem is probably the nastily unpredictable branch. The SIMD implementation has few branches (very predictable) and has pretty consistent running time regardless of query outcome. Accuracy The generally improved accuracy (re: Issue https://github.com/facebook/rocksdb/issues/5857) comes from a better design for probing indices within a cache line (re: Issue https://github.com/facebook/rocksdb/issues/4120) and improved accuracy for millions of keys in a single filter from using a 64-bit hash function (XXH3p). Design details in code comments. Accuracy data (generalizes, except old impl gets worse with millions of keys): Memory bits per key: FP rate percent old impl -> FP rate percent new impl 6: 5.70953 -> 5.69888 8: 2.45766 -> 2.29709 10: 1.13977 -> 0.959254 12: 0.662498 -> 0.411593 16: 0.353023 -> 0.0873754 24: 0.261552 -> 0.0060971 50: 0.225453 -> ~0.00003 (less than 1 in a million queries are FP) Fixes https://github.com/facebook/rocksdb/issues/5857 Fixes https://github.com/facebook/rocksdb/issues/4120 Unlike the old implementation, this implementation has a fixed cache line size (64 bytes). At 10 bits per key, the accuracy of this new implementation is very close to the old implementation with 128-byte cache line size. If there's sufficient demand, this implementation could be generalized. Compatibility Although old releases would see the new structure as corrupt filter data and read the table as if there's no filter, we've decided only to enable the new Bloom filter with new format_version=5. This provides a smooth path for automatic adoption over time, with an option for early opt-in. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6007 Test Plan: filter_bench has been used thoroughly to validate speed, accuracy, and correctness. Unit tests have been carefully updated to exercise new and old implementations, as well as the logic to select an implementation based on context (format_version). Differential Revision: D18294749 Pulled By: pdillinger fbshipit-source-id: d44c9db3696e4d0a17caaec47075b7755c262c5f
5 years ago
uint64_t expected_hits = bfp_impl_ == BFP::kDeprecatedBlock ? 1 : 2;
ASSERT_EQ(expected_hits, get_perf_context()->bloom_sst_hit_count);
iter->Seek(key2);
ASSERT_OK(iter->status());
ASSERT_TRUE(!iter->Valid());
ASSERT_EQ(1, get_perf_context()->bloom_sst_miss_count);
ASSERT_EQ(expected_hits, get_perf_context()->bloom_sst_hit_count);
}
INSTANTIATE_TEST_SUITE_P(
BloomStatsTestWithParam, BloomStatsTestWithParam,
New Bloom filter implementation for full and partitioned filters (#6007) Summary: Adds an improved, replacement Bloom filter implementation (FastLocalBloom) for full and partitioned filters in the block-based table. This replacement is faster and more accurate, especially for high bits per key or millions of keys in a single filter. Speed The improved speed, at least on recent x86_64, comes from * Using fastrange instead of modulo (%) * Using our new hash function (XXH3 preview, added in a previous commit), which is much faster for large keys and only *slightly* slower on keys around 12 bytes if hashing the same size many thousands of times in a row. * Optimizing the Bloom filter queries with AVX2 SIMD operations. (Added AVX2 to the USE_SSE=1 build.) Careful design was required to support (a) SIMD-optimized queries, (b) compatible non-SIMD code that's simple and efficient, (c) flexible choice of number of probes, and (d) essentially maximized accuracy for a cache-local Bloom filter. Probes are made eight at a time, so any number of probes up to 8 is the same speed, then up to 16, etc. * Prefetching cache lines when building the filter. Although this optimization could be applied to the old structure as well, it seems to balance out the small added cost of accumulating 64 bit hashes for adding to the filter rather than 32 bit hashes. Here's nominal speed data from filter_bench (200MB in filters, about 10k keys each, 10 bits filter data / key, 6 probes, avg key size 24 bytes, includes hashing time) on Skylake DE (relatively low clock speed): $ ./filter_bench -quick -impl=2 -net_includes_hashing # New Bloom filter Build avg ns/key: 47.7135 Mixed inside/outside queries... Single filter net ns/op: 26.2825 Random filter net ns/op: 150.459 Average FP rate %: 0.954651 $ ./filter_bench -quick -impl=0 -net_includes_hashing # Old Bloom filter Build avg ns/key: 47.2245 Mixed inside/outside queries... Single filter net ns/op: 63.2978 Random filter net ns/op: 188.038 Average FP rate %: 1.13823 Similar build time but dramatically faster query times on hot data (63 ns to 26 ns), and somewhat faster on stale data (188 ns to 150 ns). Performance differences on batched and skewed query loads are between these extremes as expected. The only other interesting thing about speed is "inside" (query key was added to filter) vs. "outside" (query key was not added to filter) query times. The non-SIMD implementations are substantially slower when most queries are "outside" vs. "inside". This goes against what one might expect or would have observed years ago, as "outside" queries only need about two probes on average, due to short-circuiting, while "inside" always have num_probes (say 6). The problem is probably the nastily unpredictable branch. The SIMD implementation has few branches (very predictable) and has pretty consistent running time regardless of query outcome. Accuracy The generally improved accuracy (re: Issue https://github.com/facebook/rocksdb/issues/5857) comes from a better design for probing indices within a cache line (re: Issue https://github.com/facebook/rocksdb/issues/4120) and improved accuracy for millions of keys in a single filter from using a 64-bit hash function (XXH3p). Design details in code comments. Accuracy data (generalizes, except old impl gets worse with millions of keys): Memory bits per key: FP rate percent old impl -> FP rate percent new impl 6: 5.70953 -> 5.69888 8: 2.45766 -> 2.29709 10: 1.13977 -> 0.959254 12: 0.662498 -> 0.411593 16: 0.353023 -> 0.0873754 24: 0.261552 -> 0.0060971 50: 0.225453 -> ~0.00003 (less than 1 in a million queries are FP) Fixes https://github.com/facebook/rocksdb/issues/5857 Fixes https://github.com/facebook/rocksdb/issues/4120 Unlike the old implementation, this implementation has a fixed cache line size (64 bytes). At 10 bits per key, the accuracy of this new implementation is very close to the old implementation with 128-byte cache line size. If there's sufficient demand, this implementation could be generalized. Compatibility Although old releases would see the new structure as corrupt filter data and read the table as if there's no filter, we've decided only to enable the new Bloom filter with new format_version=5. This provides a smooth path for automatic adoption over time, with an option for early opt-in. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6007 Test Plan: filter_bench has been used thoroughly to validate speed, accuracy, and correctness. Unit tests have been carefully updated to exercise new and old implementations, as well as the logic to select an implementation based on context (format_version). Differential Revision: D18294749 Pulled By: pdillinger fbshipit-source-id: d44c9db3696e4d0a17caaec47075b7755c262c5f
5 years ago
::testing::Values(std::make_tuple(BFP::kDeprecatedBlock, false),
std::make_tuple(BFP::kLegacyBloom, false),
std::make_tuple(BFP::kLegacyBloom, true),
std::make_tuple(BFP::kFastLocalBloom, false),
std::make_tuple(BFP::kFastLocalBloom, true),
std::make_tuple(BFP2::kPlainTable, false)));
namespace {
void PrefixScanInit(DBBloomFilterTest* dbtest) {
char buf[100];
std::string keystr;
const int small_range_sstfiles = 5;
const int big_range_sstfiles = 5;
// Generate 11 sst files with the following prefix ranges.
// GROUP 0: [0,10] (level 1)
// GROUP 1: [1,2], [2,3], [3,4], [4,5], [5, 6] (level 0)
// GROUP 2: [0,6], [0,7], [0,8], [0,9], [0,10] (level 0)
//
// A seek with the previous API would do 11 random I/Os (to all the
// files). With the new API and a prefix filter enabled, we should
// only do 2 random I/O, to the 2 files containing the key.
// GROUP 0
snprintf(buf, sizeof(buf), "%02d______:start", 0);
keystr = std::string(buf);
ASSERT_OK(dbtest->Put(keystr, keystr));
snprintf(buf, sizeof(buf), "%02d______:end", 10);
keystr = std::string(buf);
ASSERT_OK(dbtest->Put(keystr, keystr));
dbtest->Flush();
dbtest->dbfull()->CompactRange(CompactRangeOptions(), nullptr,
nullptr); // move to level 1
// GROUP 1
for (int i = 1; i <= small_range_sstfiles; i++) {
snprintf(buf, sizeof(buf), "%02d______:start", i);
keystr = std::string(buf);
ASSERT_OK(dbtest->Put(keystr, keystr));
snprintf(buf, sizeof(buf), "%02d______:end", i + 1);
keystr = std::string(buf);
ASSERT_OK(dbtest->Put(keystr, keystr));
dbtest->Flush();
}
// GROUP 2
for (int i = 1; i <= big_range_sstfiles; i++) {
snprintf(buf, sizeof(buf), "%02d______:start", 0);
keystr = std::string(buf);
ASSERT_OK(dbtest->Put(keystr, keystr));
snprintf(buf, sizeof(buf), "%02d______:end", small_range_sstfiles + i + 1);
keystr = std::string(buf);
ASSERT_OK(dbtest->Put(keystr, keystr));
dbtest->Flush();
}
}
} // namespace
TEST_F(DBBloomFilterTest, PrefixScan) {
while (ChangeFilterOptions()) {
int count;
Slice prefix;
Slice key;
char buf[100];
Iterator* iter;
snprintf(buf, sizeof(buf), "03______:");
prefix = Slice(buf, 8);
key = Slice(buf, 9);
ASSERT_EQ(key.difference_offset(prefix), 8);
ASSERT_EQ(prefix.difference_offset(key), 8);
// db configs
env_->count_random_reads_ = true;
Options options = CurrentOptions();
options.env = env_;
options.prefix_extractor.reset(NewFixedPrefixTransform(8));
options.disable_auto_compactions = true;
options.max_background_compactions = 2;
options.create_if_missing = true;
options.memtable_factory.reset(NewHashSkipListRepFactory(16));
Unordered Writes (#5218) Summary: Performing unordered writes in rocksdb when unordered_write option is set to true. When enabled the writes to memtable are done without joining any write thread. This offers much higher write throughput since the upcoming writes would not have to wait for the slowest memtable write to finish. The tradeoff is that the writes visible to a snapshot might change over time. If the application cannot tolerate that, it should implement its own mechanisms to work around that. Using TransactionDB with WRITE_PREPARED write policy is one way to achieve that. Doing so increases the max throughput by 2.2x without however compromising the snapshot guarantees. The patch is prepared based on an original by siying Existing unit tests are extended to include unordered_write option. Benchmark Results: ``` TEST_TMPDIR=/dev/shm/ ./db_bench_unordered --benchmarks=fillrandom --threads=32 --num=10000000 -max_write_buffer_number=16 --max_background_jobs=64 --batch_size=8 --writes=3000000 -level0_file_num_compaction_trigger=99999 --level0_slowdown_writes_trigger=99999 --level0_stop_writes_trigger=99999 -enable_pipelined_write=false -disable_auto_compactions --unordered_write=1 ``` With WAL - Vanilla RocksDB: 78.6 MB/s - WRITER_PREPARED with unordered_write: 177.8 MB/s (2.2x) - unordered_write: 368.9 MB/s (4.7x with relaxed snapshot guarantees) Without WAL - Vanilla RocksDB: 111.3 MB/s - WRITER_PREPARED with unordered_write: 259.3 MB/s MB/s (2.3x) - unordered_write: 645.6 MB/s (5.8x with relaxed snapshot guarantees) - WRITER_PREPARED with unordered_write disable concurrency control: 185.3 MB/s MB/s (2.35x) Limitations: - The feature is not yet extended to `max_successive_merges` > 0. The feature is also incompatible with `enable_pipelined_write` = true as well as with `allow_concurrent_memtable_write` = false. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5218 Differential Revision: D15219029 Pulled By: maysamyabandeh fbshipit-source-id: 38f2abc4af8780148c6128acdba2b3227bc81759
5 years ago
assert(!options.unordered_write);
// It is incompatible with allow_concurrent_memtable_write=false
options.allow_concurrent_memtable_write = false;
BlockBasedTableOptions table_options;
table_options.no_block_cache = true;
table_options.filter_policy.reset(NewBloomFilterPolicy(10));
table_options.whole_key_filtering = false;
options.table_factory.reset(NewBlockBasedTableFactory(table_options));
// 11 RAND I/Os
DestroyAndReopen(options);
PrefixScanInit(this);
count = 0;
env_->random_read_counter_.Reset();
iter = db_->NewIterator(ReadOptions());
for (iter->Seek(prefix); iter->Valid(); iter->Next()) {
if (!iter->key().starts_with(prefix)) {
break;
}
count++;
}
ASSERT_OK(iter->status());
delete iter;
ASSERT_EQ(count, 2);
ASSERT_EQ(env_->random_read_counter_.Read(), 2);
Close();
} // end of while
}
TEST_F(DBBloomFilterTest, OptimizeFiltersForHits) {
Options options = CurrentOptions();
options.write_buffer_size = 64 * 1024;
options.arena_block_size = 4 * 1024;
options.target_file_size_base = 64 * 1024;
options.level0_file_num_compaction_trigger = 2;
options.level0_slowdown_writes_trigger = 2;
options.level0_stop_writes_trigger = 4;
options.max_bytes_for_level_base = 256 * 1024;
options.max_write_buffer_number = 2;
options.max_background_compactions = 8;
options.max_background_flushes = 8;
options.compression = kNoCompression;
options.compaction_style = kCompactionStyleLevel;
options.level_compaction_dynamic_level_bytes = true;
BlockBasedTableOptions bbto;
bbto.cache_index_and_filter_blocks = true;
bbto.filter_policy.reset(NewBloomFilterPolicy(10, true));
bbto.whole_key_filtering = true;
options.table_factory.reset(NewBlockBasedTableFactory(bbto));
options.optimize_filters_for_hits = true;
options.statistics = ROCKSDB_NAMESPACE::CreateDBStatistics();
get_perf_context()->Reset();
get_perf_context()->EnablePerLevelPerfContext();
CreateAndReopenWithCF({"mypikachu"}, options);
int numkeys = 200000;
// Generate randomly shuffled keys, so the updates are almost
// random.
std::vector<int> keys;
keys.reserve(numkeys);
for (int i = 0; i < numkeys; i += 2) {
keys.push_back(i);
}
RandomShuffle(std::begin(keys), std::end(keys));
int num_inserted = 0;
for (int key : keys) {
ASSERT_OK(Put(1, Key(key), "val"));
if (++num_inserted % 1000 == 0) {
dbfull()->TEST_WaitForFlushMemTable();
dbfull()->TEST_WaitForCompact();
}
}
ASSERT_OK(Put(1, Key(0), "val"));
ASSERT_OK(Put(1, Key(numkeys), "val"));
ASSERT_OK(Flush(1));
dbfull()->TEST_WaitForCompact();
if (NumTableFilesAtLevel(0, 1) == 0) {
// No Level 0 file. Create one.
ASSERT_OK(Put(1, Key(0), "val"));
ASSERT_OK(Put(1, Key(numkeys), "val"));
ASSERT_OK(Flush(1));
dbfull()->TEST_WaitForCompact();
}
for (int i = 1; i < numkeys; i += 2) {
ASSERT_EQ(Get(1, Key(i)), "NOT_FOUND");
}
ASSERT_EQ(0, TestGetTickerCount(options, GET_HIT_L0));
ASSERT_EQ(0, TestGetTickerCount(options, GET_HIT_L1));
ASSERT_EQ(0, TestGetTickerCount(options, GET_HIT_L2_AND_UP));
// Now we have three sorted run, L0, L5 and L6 with most files in L6 have
// no bloom filter. Most keys be checked bloom filters twice.
ASSERT_GT(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 65000 * 2);
ASSERT_LT(TestGetTickerCount(options, BLOOM_FILTER_USEFUL), 120000 * 2);
uint64_t bloom_filter_useful_all_levels = 0;
for (auto& kv : (*(get_perf_context()->level_to_perf_context))) {
if (kv.second.bloom_filter_useful > 0) {
bloom_filter_useful_all_levels += kv.second.bloom_filter_useful;
}
}
ASSERT_GT(bloom_filter_useful_all_levels, 65000 * 2);
ASSERT_LT(bloom_filter_useful_all_levels, 120000 * 2);
for (int i = 0; i < numkeys; i += 2) {
ASSERT_EQ(Get(1, Key(i)), "val");
}
// Part 2 (read path): rewrite last level with blooms, then verify they get
// cached only if !optimize_filters_for_hits
options.disable_auto_compactions = true;
options.num_levels = 9;
options.optimize_filters_for_hits = false;
options.statistics = CreateDBStatistics();
bbto.block_cache.reset();
options.table_factory.reset(NewBlockBasedTableFactory(bbto));
ReopenWithColumnFamilies({"default", "mypikachu"}, options);
MoveFilesToLevel(7 /* level */, 1 /* column family index */);
std::string value = Get(1, Key(0));
uint64_t prev_cache_filter_hits =
TestGetTickerCount(options, BLOCK_CACHE_FILTER_HIT);
value = Get(1, Key(0));
ASSERT_EQ(prev_cache_filter_hits + 1,
TestGetTickerCount(options, BLOCK_CACHE_FILTER_HIT));
// Now that we know the filter blocks exist in the last level files, see if
// filter caching is skipped for this optimization
options.optimize_filters_for_hits = true;
options.statistics = CreateDBStatistics();
bbto.block_cache.reset();
options.table_factory.reset(NewBlockBasedTableFactory(bbto));
ReopenWithColumnFamilies({"default", "mypikachu"}, options);
value = Get(1, Key(0));
ASSERT_EQ(0, TestGetTickerCount(options, BLOCK_CACHE_FILTER_MISS));
ASSERT_EQ(0, TestGetTickerCount(options, BLOCK_CACHE_FILTER_HIT));
ASSERT_EQ(2 /* index and data block */,
TestGetTickerCount(options, BLOCK_CACHE_ADD));
// Check filter block ignored for files preloaded during DB::Open()
options.max_open_files = -1;
options.statistics = CreateDBStatistics();
bbto.block_cache.reset();
options.table_factory.reset(NewBlockBasedTableFactory(bbto));
ReopenWithColumnFamilies({"default", "mypikachu"}, options);
uint64_t prev_cache_filter_misses =
TestGetTickerCount(options, BLOCK_CACHE_FILTER_MISS);
prev_cache_filter_hits = TestGetTickerCount(options, BLOCK_CACHE_FILTER_HIT);
Get(1, Key(0));
ASSERT_EQ(prev_cache_filter_misses,
TestGetTickerCount(options, BLOCK_CACHE_FILTER_MISS));
ASSERT_EQ(prev_cache_filter_hits,
TestGetTickerCount(options, BLOCK_CACHE_FILTER_HIT));
// Check filter block ignored for file trivially-moved to bottom level
bbto.block_cache.reset();
options.max_open_files = 100; // setting > -1 makes it not preload all files
options.statistics = CreateDBStatistics();
options.table_factory.reset(NewBlockBasedTableFactory(bbto));
ReopenWithColumnFamilies({"default", "mypikachu"}, options);
ASSERT_OK(Put(1, Key(numkeys + 1), "val"));
ASSERT_OK(Flush(1));
int32_t trivial_move = 0;
int32_t non_trivial_move = 0;
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
"DBImpl::BackgroundCompaction:TrivialMove",
[&](void* /*arg*/) { trivial_move++; });
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->SetCallBack(
"DBImpl::BackgroundCompaction:NonTrivial",
[&](void* /*arg*/) { non_trivial_move++; });
ROCKSDB_NAMESPACE::SyncPoint::GetInstance()->EnableProcessing();
CompactRangeOptions compact_options;
compact_options.bottommost_level_compaction =
BottommostLevelCompaction::kSkip;
compact_options.change_level = true;
compact_options.target_level = 7;
db_->CompactRange(compact_options, handles_[1], nullptr, nullptr);
ASSERT_EQ(trivial_move, 1);
ASSERT_EQ(non_trivial_move, 0);
prev_cache_filter_hits = TestGetTickerCount(options, BLOCK_CACHE_FILTER_HIT);
prev_cache_filter_misses =
TestGetTickerCount(options, BLOCK_CACHE_FILTER_MISS);
value = Get(1, Key(numkeys + 1));
ASSERT_EQ(prev_cache_filter_hits,
TestGetTickerCount(options, BLOCK_CACHE_FILTER_HIT));
ASSERT_EQ(prev_cache_filter_misses,
TestGetTickerCount(options, BLOCK_CACHE_FILTER_MISS));
// Check filter block not cached for iterator
bbto.block_cache.reset();
options.statistics = CreateDBStatistics();
options.table_factory.reset(NewBlockBasedTableFactory(bbto));
ReopenWithColumnFamilies({"default", "mypikachu"}, options);
std::unique_ptr<Iterator> iter(db_->NewIterator(ReadOptions(), handles_[1]));
iter->SeekToFirst();
ASSERT_EQ(0, TestGetTickerCount(options, BLOCK_CACHE_FILTER_MISS));
ASSERT_EQ(0, TestGetTickerCount(options, BLOCK_CACHE_FILTER_HIT));
ASSERT_EQ(2 /* index and data block */,
TestGetTickerCount(options, BLOCK_CACHE_ADD));
get_perf_context()->Reset();
}
int CountIter(std::unique_ptr<Iterator>& iter, const Slice& key) {
int count = 0;
for (iter->Seek(key); iter->Valid() && iter->status() == Status::OK();
iter->Next()) {
count++;
}
return count;
}
// use iterate_upper_bound to hint compatiability of existing bloom filters.
// The BF is considered compatible if 1) upper bound and seek key transform
// into the same string, or 2) the transformed seek key is of the same length
// as the upper bound and two keys are adjacent according to the comparator.
TEST_F(DBBloomFilterTest, DynamicBloomFilterUpperBound) {
New Bloom filter implementation for full and partitioned filters (#6007) Summary: Adds an improved, replacement Bloom filter implementation (FastLocalBloom) for full and partitioned filters in the block-based table. This replacement is faster and more accurate, especially for high bits per key or millions of keys in a single filter. Speed The improved speed, at least on recent x86_64, comes from * Using fastrange instead of modulo (%) * Using our new hash function (XXH3 preview, added in a previous commit), which is much faster for large keys and only *slightly* slower on keys around 12 bytes if hashing the same size many thousands of times in a row. * Optimizing the Bloom filter queries with AVX2 SIMD operations. (Added AVX2 to the USE_SSE=1 build.) Careful design was required to support (a) SIMD-optimized queries, (b) compatible non-SIMD code that's simple and efficient, (c) flexible choice of number of probes, and (d) essentially maximized accuracy for a cache-local Bloom filter. Probes are made eight at a time, so any number of probes up to 8 is the same speed, then up to 16, etc. * Prefetching cache lines when building the filter. Although this optimization could be applied to the old structure as well, it seems to balance out the small added cost of accumulating 64 bit hashes for adding to the filter rather than 32 bit hashes. Here's nominal speed data from filter_bench (200MB in filters, about 10k keys each, 10 bits filter data / key, 6 probes, avg key size 24 bytes, includes hashing time) on Skylake DE (relatively low clock speed): $ ./filter_bench -quick -impl=2 -net_includes_hashing # New Bloom filter Build avg ns/key: 47.7135 Mixed inside/outside queries... Single filter net ns/op: 26.2825 Random filter net ns/op: 150.459 Average FP rate %: 0.954651 $ ./filter_bench -quick -impl=0 -net_includes_hashing # Old Bloom filter Build avg ns/key: 47.2245 Mixed inside/outside queries... Single filter net ns/op: 63.2978 Random filter net ns/op: 188.038 Average FP rate %: 1.13823 Similar build time but dramatically faster query times on hot data (63 ns to 26 ns), and somewhat faster on stale data (188 ns to 150 ns). Performance differences on batched and skewed query loads are between these extremes as expected. The only other interesting thing about speed is "inside" (query key was added to filter) vs. "outside" (query key was not added to filter) query times. The non-SIMD implementations are substantially slower when most queries are "outside" vs. "inside". This goes against what one might expect or would have observed years ago, as "outside" queries only need about two probes on average, due to short-circuiting, while "inside" always have num_probes (say 6). The problem is probably the nastily unpredictable branch. The SIMD implementation has few branches (very predictable) and has pretty consistent running time regardless of query outcome. Accuracy The generally improved accuracy (re: Issue https://github.com/facebook/rocksdb/issues/5857) comes from a better design for probing indices within a cache line (re: Issue https://github.com/facebook/rocksdb/issues/4120) and improved accuracy for millions of keys in a single filter from using a 64-bit hash function (XXH3p). Design details in code comments. Accuracy data (generalizes, except old impl gets worse with millions of keys): Memory bits per key: FP rate percent old impl -> FP rate percent new impl 6: 5.70953 -> 5.69888 8: 2.45766 -> 2.29709 10: 1.13977 -> 0.959254 12: 0.662498 -> 0.411593 16: 0.353023 -> 0.0873754 24: 0.261552 -> 0.0060971 50: 0.225453 -> ~0.00003 (less than 1 in a million queries are FP) Fixes https://github.com/facebook/rocksdb/issues/5857 Fixes https://github.com/facebook/rocksdb/issues/4120 Unlike the old implementation, this implementation has a fixed cache line size (64 bytes). At 10 bits per key, the accuracy of this new implementation is very close to the old implementation with 128-byte cache line size. If there's sufficient demand, this implementation could be generalized. Compatibility Although old releases would see the new structure as corrupt filter data and read the table as if there's no filter, we've decided only to enable the new Bloom filter with new format_version=5. This provides a smooth path for automatic adoption over time, with an option for early opt-in. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6007 Test Plan: filter_bench has been used thoroughly to validate speed, accuracy, and correctness. Unit tests have been carefully updated to exercise new and old implementations, as well as the logic to select an implementation based on context (format_version). Differential Revision: D18294749 Pulled By: pdillinger fbshipit-source-id: d44c9db3696e4d0a17caaec47075b7755c262c5f
5 years ago
for (auto bfp_impl : BFP::kAllFixedImpls) {
int using_full_builder = bfp_impl != BFP::kDeprecatedBlock;
Options options;
options.create_if_missing = true;
options.prefix_extractor.reset(NewCappedPrefixTransform(4));
options.disable_auto_compactions = true;
options.statistics = CreateDBStatistics();
// Enable prefix bloom for SST files
BlockBasedTableOptions table_options;
table_options.cache_index_and_filter_blocks = true;
table_options.filter_policy.reset(new BFP(10, bfp_impl));
table_options.index_shortening = BlockBasedTableOptions::
IndexShorteningMode::kShortenSeparatorsAndSuccessor;
options.table_factory.reset(NewBlockBasedTableFactory(table_options));
DestroyAndReopen(options);
ASSERT_OK(Put("abcdxxx0", "val1"));
ASSERT_OK(Put("abcdxxx1", "val2"));
ASSERT_OK(Put("abcdxxx2", "val3"));
ASSERT_OK(Put("abcdxxx3", "val4"));
dbfull()->Flush(FlushOptions());
{
// prefix_extractor has not changed, BF will always be read
Slice upper_bound("abce");
ReadOptions read_options;
read_options.prefix_same_as_start = true;
read_options.iterate_upper_bound = &upper_bound;
std::unique_ptr<Iterator> iter(db_->NewIterator(read_options));
ASSERT_EQ(CountIter(iter, "abcd0000"), 4);
}
{
Slice upper_bound("abcdzzzz");
ReadOptions read_options;
read_options.prefix_same_as_start = true;
read_options.iterate_upper_bound = &upper_bound;
std::unique_ptr<Iterator> iter(db_->NewIterator(read_options));
ASSERT_EQ(CountIter(iter, "abcd0000"), 4);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_CHECKED), 2);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_USEFUL), 0);
}
ASSERT_OK(dbfull()->SetOptions({{"prefix_extractor", "fixed:5"}}));
ASSERT_EQ(0, strcmp(dbfull()->GetOptions().prefix_extractor->Name(),
"rocksdb.FixedPrefix.5"));
{
// BF changed, [abcdxx00, abce) is a valid bound, will trigger BF read
Slice upper_bound("abce");
ReadOptions read_options;
read_options.prefix_same_as_start = true;
read_options.iterate_upper_bound = &upper_bound;
std::unique_ptr<Iterator> iter(db_->NewIterator(read_options));
ASSERT_EQ(CountIter(iter, "abcdxx00"), 4);
// should check bloom filter since upper bound meets requirement
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_CHECKED),
2 + using_full_builder);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_USEFUL), 0);
}
{
// [abcdxx01, abcey) is not valid bound since upper bound is too long for
// the BF in SST (capped:4)
Slice upper_bound("abcey");
ReadOptions read_options;
read_options.prefix_same_as_start = true;
read_options.iterate_upper_bound = &upper_bound;
std::unique_ptr<Iterator> iter(db_->NewIterator(read_options));
ASSERT_EQ(CountIter(iter, "abcdxx01"), 4);
// should skip bloom filter since upper bound is too long
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_CHECKED),
2 + using_full_builder);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_USEFUL), 0);
}
{
// [abcdxx02, abcdy) is a valid bound since the prefix is the same
Slice upper_bound("abcdy");
ReadOptions read_options;
read_options.prefix_same_as_start = true;
read_options.iterate_upper_bound = &upper_bound;
std::unique_ptr<Iterator> iter(db_->NewIterator(read_options));
ASSERT_EQ(CountIter(iter, "abcdxx02"), 4);
// should check bloom filter since upper bound matches transformed seek
// key
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_CHECKED),
2 + using_full_builder * 2);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_USEFUL), 0);
}
{
// [aaaaaaaa, abce) is not a valid bound since 1) they don't share the
// same prefix, 2) the prefixes are not consecutive
Slice upper_bound("abce");
ReadOptions read_options;
read_options.prefix_same_as_start = true;
read_options.iterate_upper_bound = &upper_bound;
std::unique_ptr<Iterator> iter(db_->NewIterator(read_options));
ASSERT_EQ(CountIter(iter, "aaaaaaaa"), 0);
// should skip bloom filter since mismatch is found
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_CHECKED),
2 + using_full_builder * 2);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_USEFUL), 0);
}
ASSERT_OK(dbfull()->SetOptions({{"prefix_extractor", "fixed:3"}}));
{
// [abc, abd) is not a valid bound since the upper bound is too short
// for BF (capped:4)
Slice upper_bound("abd");
ReadOptions read_options;
read_options.prefix_same_as_start = true;
read_options.iterate_upper_bound = &upper_bound;
std::unique_ptr<Iterator> iter(db_->NewIterator(read_options));
ASSERT_EQ(CountIter(iter, "abc"), 4);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_CHECKED),
2 + using_full_builder * 2);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_USEFUL), 0);
}
ASSERT_OK(dbfull()->SetOptions({{"prefix_extractor", "capped:4"}}));
{
// set back to capped:4 and verify BF is always read
Slice upper_bound("abd");
ReadOptions read_options;
read_options.prefix_same_as_start = true;
read_options.iterate_upper_bound = &upper_bound;
std::unique_ptr<Iterator> iter(db_->NewIterator(read_options));
ASSERT_EQ(CountIter(iter, "abc"), 0);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_CHECKED),
3 + using_full_builder * 2);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_USEFUL), 1);
}
}
}
// Create multiple SST files each with a different prefix_extractor config,
// verify iterators can read all SST files using the latest config.
TEST_F(DBBloomFilterTest, DynamicBloomFilterMultipleSST) {
New Bloom filter implementation for full and partitioned filters (#6007) Summary: Adds an improved, replacement Bloom filter implementation (FastLocalBloom) for full and partitioned filters in the block-based table. This replacement is faster and more accurate, especially for high bits per key or millions of keys in a single filter. Speed The improved speed, at least on recent x86_64, comes from * Using fastrange instead of modulo (%) * Using our new hash function (XXH3 preview, added in a previous commit), which is much faster for large keys and only *slightly* slower on keys around 12 bytes if hashing the same size many thousands of times in a row. * Optimizing the Bloom filter queries with AVX2 SIMD operations. (Added AVX2 to the USE_SSE=1 build.) Careful design was required to support (a) SIMD-optimized queries, (b) compatible non-SIMD code that's simple and efficient, (c) flexible choice of number of probes, and (d) essentially maximized accuracy for a cache-local Bloom filter. Probes are made eight at a time, so any number of probes up to 8 is the same speed, then up to 16, etc. * Prefetching cache lines when building the filter. Although this optimization could be applied to the old structure as well, it seems to balance out the small added cost of accumulating 64 bit hashes for adding to the filter rather than 32 bit hashes. Here's nominal speed data from filter_bench (200MB in filters, about 10k keys each, 10 bits filter data / key, 6 probes, avg key size 24 bytes, includes hashing time) on Skylake DE (relatively low clock speed): $ ./filter_bench -quick -impl=2 -net_includes_hashing # New Bloom filter Build avg ns/key: 47.7135 Mixed inside/outside queries... Single filter net ns/op: 26.2825 Random filter net ns/op: 150.459 Average FP rate %: 0.954651 $ ./filter_bench -quick -impl=0 -net_includes_hashing # Old Bloom filter Build avg ns/key: 47.2245 Mixed inside/outside queries... Single filter net ns/op: 63.2978 Random filter net ns/op: 188.038 Average FP rate %: 1.13823 Similar build time but dramatically faster query times on hot data (63 ns to 26 ns), and somewhat faster on stale data (188 ns to 150 ns). Performance differences on batched and skewed query loads are between these extremes as expected. The only other interesting thing about speed is "inside" (query key was added to filter) vs. "outside" (query key was not added to filter) query times. The non-SIMD implementations are substantially slower when most queries are "outside" vs. "inside". This goes against what one might expect or would have observed years ago, as "outside" queries only need about two probes on average, due to short-circuiting, while "inside" always have num_probes (say 6). The problem is probably the nastily unpredictable branch. The SIMD implementation has few branches (very predictable) and has pretty consistent running time regardless of query outcome. Accuracy The generally improved accuracy (re: Issue https://github.com/facebook/rocksdb/issues/5857) comes from a better design for probing indices within a cache line (re: Issue https://github.com/facebook/rocksdb/issues/4120) and improved accuracy for millions of keys in a single filter from using a 64-bit hash function (XXH3p). Design details in code comments. Accuracy data (generalizes, except old impl gets worse with millions of keys): Memory bits per key: FP rate percent old impl -> FP rate percent new impl 6: 5.70953 -> 5.69888 8: 2.45766 -> 2.29709 10: 1.13977 -> 0.959254 12: 0.662498 -> 0.411593 16: 0.353023 -> 0.0873754 24: 0.261552 -> 0.0060971 50: 0.225453 -> ~0.00003 (less than 1 in a million queries are FP) Fixes https://github.com/facebook/rocksdb/issues/5857 Fixes https://github.com/facebook/rocksdb/issues/4120 Unlike the old implementation, this implementation has a fixed cache line size (64 bytes). At 10 bits per key, the accuracy of this new implementation is very close to the old implementation with 128-byte cache line size. If there's sufficient demand, this implementation could be generalized. Compatibility Although old releases would see the new structure as corrupt filter data and read the table as if there's no filter, we've decided only to enable the new Bloom filter with new format_version=5. This provides a smooth path for automatic adoption over time, with an option for early opt-in. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6007 Test Plan: filter_bench has been used thoroughly to validate speed, accuracy, and correctness. Unit tests have been carefully updated to exercise new and old implementations, as well as the logic to select an implementation based on context (format_version). Differential Revision: D18294749 Pulled By: pdillinger fbshipit-source-id: d44c9db3696e4d0a17caaec47075b7755c262c5f
5 years ago
for (auto bfp_impl : BFP::kAllFixedImpls) {
int using_full_builder = bfp_impl != BFP::kDeprecatedBlock;
Options options;
options.create_if_missing = true;
options.prefix_extractor.reset(NewFixedPrefixTransform(1));
options.disable_auto_compactions = true;
options.statistics = CreateDBStatistics();
// Enable prefix bloom for SST files
BlockBasedTableOptions table_options;
table_options.filter_policy.reset(new BFP(10, bfp_impl));
table_options.cache_index_and_filter_blocks = true;
options.table_factory.reset(NewBlockBasedTableFactory(table_options));
DestroyAndReopen(options);
Slice upper_bound("foz90000");
ReadOptions read_options;
read_options.prefix_same_as_start = true;
// first SST with fixed:1 BF
ASSERT_OK(Put("foo2", "bar2"));
ASSERT_OK(Put("foo", "bar"));
ASSERT_OK(Put("foq1", "bar1"));
ASSERT_OK(Put("fpa", "0"));
dbfull()->Flush(FlushOptions());
std::unique_ptr<Iterator> iter_old(db_->NewIterator(read_options));
ASSERT_EQ(CountIter(iter_old, "foo"), 4);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_CHECKED), 1);
ASSERT_OK(dbfull()->SetOptions({{"prefix_extractor", "capped:3"}}));
ASSERT_EQ(0, strcmp(dbfull()->GetOptions().prefix_extractor->Name(),
"rocksdb.CappedPrefix.3"));
read_options.iterate_upper_bound = &upper_bound;
std::unique_ptr<Iterator> iter(db_->NewIterator(read_options));
ASSERT_EQ(CountIter(iter, "foo"), 2);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_CHECKED),
1 + using_full_builder);
ASSERT_EQ(CountIter(iter, "gpk"), 0);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_CHECKED),
1 + using_full_builder);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_USEFUL), 0);
// second SST with capped:3 BF
ASSERT_OK(Put("foo3", "bar3"));
ASSERT_OK(Put("foo4", "bar4"));
ASSERT_OK(Put("foq5", "bar5"));
ASSERT_OK(Put("fpb", "1"));
dbfull()->Flush(FlushOptions());
{
// BF is cappped:3 now
std::unique_ptr<Iterator> iter_tmp(db_->NewIterator(read_options));
ASSERT_EQ(CountIter(iter_tmp, "foo"), 4);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_CHECKED),
2 + using_full_builder * 2);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_USEFUL), 0);
ASSERT_EQ(CountIter(iter_tmp, "gpk"), 0);
// both counters are incremented because BF is "not changed" for 1 of the
// 2 SST files, so filter is checked once and found no match.
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_CHECKED),
3 + using_full_builder * 2);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_USEFUL), 1);
}
ASSERT_OK(dbfull()->SetOptions({{"prefix_extractor", "fixed:2"}}));
ASSERT_EQ(0, strcmp(dbfull()->GetOptions().prefix_extractor->Name(),
"rocksdb.FixedPrefix.2"));
// third SST with fixed:2 BF
ASSERT_OK(Put("foo6", "bar6"));
ASSERT_OK(Put("foo7", "bar7"));
ASSERT_OK(Put("foq8", "bar8"));
ASSERT_OK(Put("fpc", "2"));
dbfull()->Flush(FlushOptions());
{
// BF is fixed:2 now
std::unique_ptr<Iterator> iter_tmp(db_->NewIterator(read_options));
ASSERT_EQ(CountIter(iter_tmp, "foo"), 9);
// the first and last BF are checked
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_CHECKED),
4 + using_full_builder * 3);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_USEFUL), 1);
ASSERT_EQ(CountIter(iter_tmp, "gpk"), 0);
// only last BF is checked and not found
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_CHECKED),
5 + using_full_builder * 3);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_USEFUL), 2);
}
// iter_old can only see the first SST, so checked plus 1
ASSERT_EQ(CountIter(iter_old, "foo"), 4);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_CHECKED),
6 + using_full_builder * 3);
// iter was created after the first setoptions call so only full filter
// will check the filter
ASSERT_EQ(CountIter(iter, "foo"), 2);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_CHECKED),
6 + using_full_builder * 4);
{
// keys in all three SSTs are visible to iterator
// The range of [foo, foz90000] is compatible with (fixed:1) and (fixed:2)
// so +2 for checked counter
std::unique_ptr<Iterator> iter_all(db_->NewIterator(read_options));
ASSERT_EQ(CountIter(iter_all, "foo"), 9);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_CHECKED),
7 + using_full_builder * 5);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_USEFUL), 2);
ASSERT_EQ(CountIter(iter_all, "gpk"), 0);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_CHECKED),
8 + using_full_builder * 5);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_USEFUL), 3);
}
ASSERT_OK(dbfull()->SetOptions({{"prefix_extractor", "capped:3"}}));
ASSERT_EQ(0, strcmp(dbfull()->GetOptions().prefix_extractor->Name(),
"rocksdb.CappedPrefix.3"));
{
std::unique_ptr<Iterator> iter_all(db_->NewIterator(read_options));
ASSERT_EQ(CountIter(iter_all, "foo"), 6);
// all three SST are checked because the current options has the same as
// the remaining SST (capped:3)
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_CHECKED),
9 + using_full_builder * 7);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_USEFUL), 3);
ASSERT_EQ(CountIter(iter_all, "gpk"), 0);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_CHECKED),
10 + using_full_builder * 7);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_USEFUL), 4);
}
// TODO(Zhongyi): Maybe also need to add Get calls to test point look up?
}
}
// Create a new column family in a running DB, change prefix_extractor
// dynamically, verify the iterator created on the new column family behaves
// as expected
TEST_F(DBBloomFilterTest, DynamicBloomFilterNewColumnFamily) {
int iteration = 0;
New Bloom filter implementation for full and partitioned filters (#6007) Summary: Adds an improved, replacement Bloom filter implementation (FastLocalBloom) for full and partitioned filters in the block-based table. This replacement is faster and more accurate, especially for high bits per key or millions of keys in a single filter. Speed The improved speed, at least on recent x86_64, comes from * Using fastrange instead of modulo (%) * Using our new hash function (XXH3 preview, added in a previous commit), which is much faster for large keys and only *slightly* slower on keys around 12 bytes if hashing the same size many thousands of times in a row. * Optimizing the Bloom filter queries with AVX2 SIMD operations. (Added AVX2 to the USE_SSE=1 build.) Careful design was required to support (a) SIMD-optimized queries, (b) compatible non-SIMD code that's simple and efficient, (c) flexible choice of number of probes, and (d) essentially maximized accuracy for a cache-local Bloom filter. Probes are made eight at a time, so any number of probes up to 8 is the same speed, then up to 16, etc. * Prefetching cache lines when building the filter. Although this optimization could be applied to the old structure as well, it seems to balance out the small added cost of accumulating 64 bit hashes for adding to the filter rather than 32 bit hashes. Here's nominal speed data from filter_bench (200MB in filters, about 10k keys each, 10 bits filter data / key, 6 probes, avg key size 24 bytes, includes hashing time) on Skylake DE (relatively low clock speed): $ ./filter_bench -quick -impl=2 -net_includes_hashing # New Bloom filter Build avg ns/key: 47.7135 Mixed inside/outside queries... Single filter net ns/op: 26.2825 Random filter net ns/op: 150.459 Average FP rate %: 0.954651 $ ./filter_bench -quick -impl=0 -net_includes_hashing # Old Bloom filter Build avg ns/key: 47.2245 Mixed inside/outside queries... Single filter net ns/op: 63.2978 Random filter net ns/op: 188.038 Average FP rate %: 1.13823 Similar build time but dramatically faster query times on hot data (63 ns to 26 ns), and somewhat faster on stale data (188 ns to 150 ns). Performance differences on batched and skewed query loads are between these extremes as expected. The only other interesting thing about speed is "inside" (query key was added to filter) vs. "outside" (query key was not added to filter) query times. The non-SIMD implementations are substantially slower when most queries are "outside" vs. "inside". This goes against what one might expect or would have observed years ago, as "outside" queries only need about two probes on average, due to short-circuiting, while "inside" always have num_probes (say 6). The problem is probably the nastily unpredictable branch. The SIMD implementation has few branches (very predictable) and has pretty consistent running time regardless of query outcome. Accuracy The generally improved accuracy (re: Issue https://github.com/facebook/rocksdb/issues/5857) comes from a better design for probing indices within a cache line (re: Issue https://github.com/facebook/rocksdb/issues/4120) and improved accuracy for millions of keys in a single filter from using a 64-bit hash function (XXH3p). Design details in code comments. Accuracy data (generalizes, except old impl gets worse with millions of keys): Memory bits per key: FP rate percent old impl -> FP rate percent new impl 6: 5.70953 -> 5.69888 8: 2.45766 -> 2.29709 10: 1.13977 -> 0.959254 12: 0.662498 -> 0.411593 16: 0.353023 -> 0.0873754 24: 0.261552 -> 0.0060971 50: 0.225453 -> ~0.00003 (less than 1 in a million queries are FP) Fixes https://github.com/facebook/rocksdb/issues/5857 Fixes https://github.com/facebook/rocksdb/issues/4120 Unlike the old implementation, this implementation has a fixed cache line size (64 bytes). At 10 bits per key, the accuracy of this new implementation is very close to the old implementation with 128-byte cache line size. If there's sufficient demand, this implementation could be generalized. Compatibility Although old releases would see the new structure as corrupt filter data and read the table as if there's no filter, we've decided only to enable the new Bloom filter with new format_version=5. This provides a smooth path for automatic adoption over time, with an option for early opt-in. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6007 Test Plan: filter_bench has been used thoroughly to validate speed, accuracy, and correctness. Unit tests have been carefully updated to exercise new and old implementations, as well as the logic to select an implementation based on context (format_version). Differential Revision: D18294749 Pulled By: pdillinger fbshipit-source-id: d44c9db3696e4d0a17caaec47075b7755c262c5f
5 years ago
for (auto bfp_impl : BFP::kAllFixedImpls) {
Options options = CurrentOptions();
options.create_if_missing = true;
options.prefix_extractor.reset(NewFixedPrefixTransform(1));
options.disable_auto_compactions = true;
options.statistics = CreateDBStatistics();
// Enable prefix bloom for SST files
BlockBasedTableOptions table_options;
table_options.cache_index_and_filter_blocks = true;
table_options.filter_policy.reset(new BFP(10, bfp_impl));
options.table_factory.reset(NewBlockBasedTableFactory(table_options));
CreateAndReopenWithCF({"pikachu" + std::to_string(iteration)}, options);
ReadOptions read_options;
read_options.prefix_same_as_start = true;
// create a new CF and set prefix_extractor dynamically
options.prefix_extractor.reset(NewCappedPrefixTransform(3));
CreateColumnFamilies({"ramen_dojo_" + std::to_string(iteration)}, options);
ASSERT_EQ(0,
strcmp(dbfull()->GetOptions(handles_[2]).prefix_extractor->Name(),
"rocksdb.CappedPrefix.3"));
ASSERT_OK(Put(2, "foo3", "bar3"));
ASSERT_OK(Put(2, "foo4", "bar4"));
ASSERT_OK(Put(2, "foo5", "bar5"));
ASSERT_OK(Put(2, "foq6", "bar6"));
ASSERT_OK(Put(2, "fpq7", "bar7"));
dbfull()->Flush(FlushOptions());
{
std::unique_ptr<Iterator> iter(
db_->NewIterator(read_options, handles_[2]));
ASSERT_EQ(CountIter(iter, "foo"), 3);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_CHECKED), 0);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_USEFUL), 0);
}
ASSERT_OK(
dbfull()->SetOptions(handles_[2], {{"prefix_extractor", "fixed:2"}}));
ASSERT_EQ(0,
strcmp(dbfull()->GetOptions(handles_[2]).prefix_extractor->Name(),
"rocksdb.FixedPrefix.2"));
{
std::unique_ptr<Iterator> iter(
db_->NewIterator(read_options, handles_[2]));
ASSERT_EQ(CountIter(iter, "foo"), 4);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_CHECKED), 0);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_USEFUL), 0);
}
ASSERT_OK(dbfull()->DropColumnFamily(handles_[2]));
dbfull()->DestroyColumnFamilyHandle(handles_[2]);
handles_[2] = nullptr;
ASSERT_OK(dbfull()->DropColumnFamily(handles_[1]));
dbfull()->DestroyColumnFamilyHandle(handles_[1]);
handles_[1] = nullptr;
iteration++;
}
}
// Verify it's possible to change prefix_extractor at runtime and iterators
// behaves as expected
TEST_F(DBBloomFilterTest, DynamicBloomFilterOptions) {
New Bloom filter implementation for full and partitioned filters (#6007) Summary: Adds an improved, replacement Bloom filter implementation (FastLocalBloom) for full and partitioned filters in the block-based table. This replacement is faster and more accurate, especially for high bits per key or millions of keys in a single filter. Speed The improved speed, at least on recent x86_64, comes from * Using fastrange instead of modulo (%) * Using our new hash function (XXH3 preview, added in a previous commit), which is much faster for large keys and only *slightly* slower on keys around 12 bytes if hashing the same size many thousands of times in a row. * Optimizing the Bloom filter queries with AVX2 SIMD operations. (Added AVX2 to the USE_SSE=1 build.) Careful design was required to support (a) SIMD-optimized queries, (b) compatible non-SIMD code that's simple and efficient, (c) flexible choice of number of probes, and (d) essentially maximized accuracy for a cache-local Bloom filter. Probes are made eight at a time, so any number of probes up to 8 is the same speed, then up to 16, etc. * Prefetching cache lines when building the filter. Although this optimization could be applied to the old structure as well, it seems to balance out the small added cost of accumulating 64 bit hashes for adding to the filter rather than 32 bit hashes. Here's nominal speed data from filter_bench (200MB in filters, about 10k keys each, 10 bits filter data / key, 6 probes, avg key size 24 bytes, includes hashing time) on Skylake DE (relatively low clock speed): $ ./filter_bench -quick -impl=2 -net_includes_hashing # New Bloom filter Build avg ns/key: 47.7135 Mixed inside/outside queries... Single filter net ns/op: 26.2825 Random filter net ns/op: 150.459 Average FP rate %: 0.954651 $ ./filter_bench -quick -impl=0 -net_includes_hashing # Old Bloom filter Build avg ns/key: 47.2245 Mixed inside/outside queries... Single filter net ns/op: 63.2978 Random filter net ns/op: 188.038 Average FP rate %: 1.13823 Similar build time but dramatically faster query times on hot data (63 ns to 26 ns), and somewhat faster on stale data (188 ns to 150 ns). Performance differences on batched and skewed query loads are between these extremes as expected. The only other interesting thing about speed is "inside" (query key was added to filter) vs. "outside" (query key was not added to filter) query times. The non-SIMD implementations are substantially slower when most queries are "outside" vs. "inside". This goes against what one might expect or would have observed years ago, as "outside" queries only need about two probes on average, due to short-circuiting, while "inside" always have num_probes (say 6). The problem is probably the nastily unpredictable branch. The SIMD implementation has few branches (very predictable) and has pretty consistent running time regardless of query outcome. Accuracy The generally improved accuracy (re: Issue https://github.com/facebook/rocksdb/issues/5857) comes from a better design for probing indices within a cache line (re: Issue https://github.com/facebook/rocksdb/issues/4120) and improved accuracy for millions of keys in a single filter from using a 64-bit hash function (XXH3p). Design details in code comments. Accuracy data (generalizes, except old impl gets worse with millions of keys): Memory bits per key: FP rate percent old impl -> FP rate percent new impl 6: 5.70953 -> 5.69888 8: 2.45766 -> 2.29709 10: 1.13977 -> 0.959254 12: 0.662498 -> 0.411593 16: 0.353023 -> 0.0873754 24: 0.261552 -> 0.0060971 50: 0.225453 -> ~0.00003 (less than 1 in a million queries are FP) Fixes https://github.com/facebook/rocksdb/issues/5857 Fixes https://github.com/facebook/rocksdb/issues/4120 Unlike the old implementation, this implementation has a fixed cache line size (64 bytes). At 10 bits per key, the accuracy of this new implementation is very close to the old implementation with 128-byte cache line size. If there's sufficient demand, this implementation could be generalized. Compatibility Although old releases would see the new structure as corrupt filter data and read the table as if there's no filter, we've decided only to enable the new Bloom filter with new format_version=5. This provides a smooth path for automatic adoption over time, with an option for early opt-in. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6007 Test Plan: filter_bench has been used thoroughly to validate speed, accuracy, and correctness. Unit tests have been carefully updated to exercise new and old implementations, as well as the logic to select an implementation based on context (format_version). Differential Revision: D18294749 Pulled By: pdillinger fbshipit-source-id: d44c9db3696e4d0a17caaec47075b7755c262c5f
5 years ago
for (auto bfp_impl : BFP::kAllFixedImpls) {
Options options;
options.create_if_missing = true;
options.prefix_extractor.reset(NewFixedPrefixTransform(1));
options.disable_auto_compactions = true;
options.statistics = CreateDBStatistics();
// Enable prefix bloom for SST files
BlockBasedTableOptions table_options;
table_options.cache_index_and_filter_blocks = true;
table_options.filter_policy.reset(new BFP(10, bfp_impl));
options.table_factory.reset(NewBlockBasedTableFactory(table_options));
DestroyAndReopen(options);
ASSERT_OK(Put("foo2", "bar2"));
ASSERT_OK(Put("foo", "bar"));
ASSERT_OK(Put("foo1", "bar1"));
ASSERT_OK(Put("fpa", "0"));
dbfull()->Flush(FlushOptions());
ASSERT_OK(Put("foo3", "bar3"));
ASSERT_OK(Put("foo4", "bar4"));
ASSERT_OK(Put("foo5", "bar5"));
ASSERT_OK(Put("fpb", "1"));
dbfull()->Flush(FlushOptions());
ASSERT_OK(Put("foo6", "bar6"));
ASSERT_OK(Put("foo7", "bar7"));
ASSERT_OK(Put("foo8", "bar8"));
ASSERT_OK(Put("fpc", "2"));
dbfull()->Flush(FlushOptions());
ReadOptions read_options;
read_options.prefix_same_as_start = true;
{
std::unique_ptr<Iterator> iter(db_->NewIterator(read_options));
ASSERT_EQ(CountIter(iter, "foo"), 12);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_CHECKED), 3);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_USEFUL), 0);
}
std::unique_ptr<Iterator> iter_old(db_->NewIterator(read_options));
ASSERT_EQ(CountIter(iter_old, "foo"), 12);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_CHECKED), 6);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_USEFUL), 0);
ASSERT_OK(dbfull()->SetOptions({{"prefix_extractor", "capped:3"}}));
ASSERT_EQ(0, strcmp(dbfull()->GetOptions().prefix_extractor->Name(),
"rocksdb.CappedPrefix.3"));
{
std::unique_ptr<Iterator> iter(db_->NewIterator(read_options));
// "fp*" should be skipped
ASSERT_EQ(CountIter(iter, "foo"), 9);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_CHECKED), 6);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_USEFUL), 0);
}
// iterator created before should not be affected and see all keys
ASSERT_EQ(CountIter(iter_old, "foo"), 12);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_CHECKED), 9);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_USEFUL), 0);
ASSERT_EQ(CountIter(iter_old, "abc"), 0);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_CHECKED), 12);
ASSERT_EQ(TestGetTickerCount(options, BLOOM_FILTER_PREFIX_USEFUL), 3);
}
}
#endif // ROCKSDB_LITE
} // namespace ROCKSDB_NAMESPACE
int main(int argc, char** argv) {
ROCKSDB_NAMESPACE::port::InstallStackTraceHandler();
::testing::InitGoogleTest(&argc, argv);
return RUN_ALL_TESTS();
}