Document design/specification bugs with auto_prefix_mode (#10144)

Summary:
auto_prefix_mode is designed to use prefix filtering in a
particular "safe" set of cases where the upper bound and the seek key
have different prefixes: where the upper bound is the "same length
immediate successor". These conditions are not sufficient to guarantee
the same iteration results as total_order_seek if the DB contains
"short" keys, less than the "full" (maximum) prefix length.

We are not simply disabling the optimization in these successor cases
because it is likely that users are essentially getting what they want
out of existing usage. Especially if users are constructing successor
bounds with the intention of doing a prefix-bounded seek, the existing
behavior is more expected than the total_order_seek behavior.
Consequently, for now we reconcile the bad specification of behavior by
documenting the existing mismatch with total_order_seek.

A closely related issue affects hypothetical comparators like
ReverseBytewiseComparator: if they "correctly" implement
IsSameLengthImmediateSuccessor, auto_prefix_mode could omit more
entries (other than "short" keys noted above). Luckily, the built-in
ReverseBytewiseComparator has an "incorrect" implementation of
IsSameLengthImmediateSuccessor that effectively prevents prefix
optimization and, thus, the bug. This is now documented as a new
constraint on IsSameLengthImmediateSuccessor, and the implementation
tweaked to be simply "safe" rather than "incorrect".

This change also includes unit test updates to demonstrate the above
issues. (Test was cleaned up for readability and simplicity.)

Intended follow-up:
* Tweak documented axioms for prefix_extractor (more details then)
* Consider some sort of fix for this case. I don't know what that would
look like without breaking the performance of existing code. Perhaps
if all keys in an SST file have prefixes that are "full length," we can track
that fact and use it to allow optimization with the "same length
immediate successor", but that would only apply to new files.
* Consider a better system of specifying prefix bounds

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10144

Test Plan: test updates included

Reviewed By: siying

Differential Revision: D37052710

Pulled By: pdillinger

fbshipit-source-id: 5f63b7d65f3f214e4b143e0f9aa1749527c587db
main
Peter Dillinger 2 years ago committed by Facebook GitHub Bot
parent 8273435c22
commit ad135f3ffd
  1. 2
      HISTORY.md
  2. 46
      db/comparator_db_test.cc
  3. 189
      db/db_test2.cc
  4. 4
      include/rocksdb/comparator.h
  5. 11
      include/rocksdb/options.h
  6. 5
      include/rocksdb/slice_transform.h
  7. 10
      util/comparator.cc

@ -22,6 +22,8 @@
* `rocksdb_comparator_with_ts_create` to create timestamp aware comparator * `rocksdb_comparator_with_ts_create` to create timestamp aware comparator
* Put, Get, Delete, SingleDelete, MultiGet APIs has corresponding timestamp aware APIs with suffix `with_ts` * Put, Get, Delete, SingleDelete, MultiGet APIs has corresponding timestamp aware APIs with suffix `with_ts`
* And Add C API's for Transaction, SstFileWriter, Compaction as mentioned [here](https://github.com/facebook/rocksdb/wiki/User-defined-Timestamp-(Experimental)) * And Add C API's for Transaction, SstFileWriter, Compaction as mentioned [here](https://github.com/facebook/rocksdb/wiki/User-defined-Timestamp-(Experimental))
* The contract for implementations of Comparator::IsSameLengthImmediateSuccessor has been updated to work around a design bug in `auto_prefix_mode`.
* The API documentation for `auto_prefix_mode` now notes some corner cases in which it returns different results than `total_order_seek`, due to design bugs that are not easily fixed. Users using built-in comparators and keys at least the size of a fixed prefix length are not affected.
### New Features ### New Features
* Add FileSystem::ReadAsync API in io_tracing * Add FileSystem::ReadAsync API in io_tracing

@ -449,67 +449,85 @@ TEST_P(ComparatorDBTest, TwoStrComparator) {
} }
} }
namespace {
void VerifyNotSuccessor(const Slice& s, const Slice& t) {
auto bc = BytewiseComparator();
auto rbc = ReverseBytewiseComparator();
ASSERT_FALSE(bc->IsSameLengthImmediateSuccessor(s, t));
ASSERT_FALSE(rbc->IsSameLengthImmediateSuccessor(s, t));
ASSERT_FALSE(bc->IsSameLengthImmediateSuccessor(t, s));
ASSERT_FALSE(rbc->IsSameLengthImmediateSuccessor(t, s));
}
void VerifySuccessor(const Slice& s, const Slice& t) {
auto bc = BytewiseComparator();
auto rbc = ReverseBytewiseComparator();
ASSERT_TRUE(bc->IsSameLengthImmediateSuccessor(s, t));
ASSERT_FALSE(rbc->IsSameLengthImmediateSuccessor(s, t));
ASSERT_FALSE(bc->IsSameLengthImmediateSuccessor(t, s));
// Should be true but that increases exposure to a design bug in
// auto_prefix_mode, so currently set to FALSE
ASSERT_FALSE(rbc->IsSameLengthImmediateSuccessor(t, s));
}
} // namespace
TEST_P(ComparatorDBTest, IsSameLengthImmediateSuccessor) { TEST_P(ComparatorDBTest, IsSameLengthImmediateSuccessor) {
{ {
// different length // different length
Slice s("abcxy"); Slice s("abcxy");
Slice t("abcxyz"); Slice t("abcxyz");
ASSERT_FALSE(BytewiseComparator()->IsSameLengthImmediateSuccessor(s, t)); VerifyNotSuccessor(s, t);
} }
{ {
Slice s("abcxyz"); Slice s("abcxyz");
Slice t("abcxy"); Slice t("abcxy");
ASSERT_FALSE(BytewiseComparator()->IsSameLengthImmediateSuccessor(s, t)); VerifyNotSuccessor(s, t);
} }
{ {
// not last byte different // not last byte different
Slice s("abc1xyz"); Slice s("abc1xyz");
Slice t("abc2xyz"); Slice t("abc2xyz");
ASSERT_FALSE(BytewiseComparator()->IsSameLengthImmediateSuccessor(s, t)); VerifyNotSuccessor(s, t);
} }
{ {
// same string // same string
Slice s("abcxyz"); Slice s("abcxyz");
Slice t("abcxyz"); Slice t("abcxyz");
ASSERT_FALSE(BytewiseComparator()->IsSameLengthImmediateSuccessor(s, t)); VerifyNotSuccessor(s, t);
} }
{ {
Slice s("abcxy"); Slice s("abcxy");
Slice t("abcxz"); Slice t("abcxz");
ASSERT_TRUE(BytewiseComparator()->IsSameLengthImmediateSuccessor(s, t)); VerifySuccessor(s, t);
}
{
Slice s("abcxz");
Slice t("abcxy");
ASSERT_FALSE(BytewiseComparator()->IsSameLengthImmediateSuccessor(s, t));
} }
{ {
const char s_array[] = "\x50\x8a\xac"; const char s_array[] = "\x50\x8a\xac";
const char t_array[] = "\x50\x8a\xad"; const char t_array[] = "\x50\x8a\xad";
Slice s(s_array); Slice s(s_array);
Slice t(t_array); Slice t(t_array);
ASSERT_TRUE(BytewiseComparator()->IsSameLengthImmediateSuccessor(s, t)); VerifySuccessor(s, t);
} }
{ {
const char s_array[] = "\x50\x8a\xff"; const char s_array[] = "\x50\x8a\xff";
const char t_array[] = "\x50\x8b\x00"; const char t_array[] = "\x50\x8b\x00";
Slice s(s_array, 3); Slice s(s_array, 3);
Slice t(t_array, 3); Slice t(t_array, 3);
ASSERT_TRUE(BytewiseComparator()->IsSameLengthImmediateSuccessor(s, t)); VerifySuccessor(s, t);
} }
{ {
const char s_array[] = "\x50\x8a\xff\xff"; const char s_array[] = "\x50\x8a\xff\xff";
const char t_array[] = "\x50\x8b\x00\x00"; const char t_array[] = "\x50\x8b\x00\x00";
Slice s(s_array, 4); Slice s(s_array, 4);
Slice t(t_array, 4); Slice t(t_array, 4);
ASSERT_TRUE(BytewiseComparator()->IsSameLengthImmediateSuccessor(s, t)); VerifySuccessor(s, t);
} }
{ {
const char s_array[] = "\x50\x8a\xff\xff"; const char s_array[] = "\x50\x8a\xff\xff";
const char t_array[] = "\x50\x8b\x00\x01"; const char t_array[] = "\x50\x8b\x00\x01";
Slice s(s_array, 4); Slice s(s_array, 4);
Slice t(t_array, 4); Slice t(t_array, 4);
ASSERT_FALSE(BytewiseComparator()->IsSameLengthImmediateSuccessor(s, t)); VerifyNotSuccessor(s, t);
} }
} }

@ -6376,86 +6376,85 @@ TEST_F(DBTest2, AutoPrefixMode1) {
ReadOptions ro; ReadOptions ro;
ro.total_order_seek = false; ro.total_order_seek = false;
ro.auto_prefix_mode = true; ro.auto_prefix_mode = true;
const auto stat = BLOOM_FILTER_PREFIX_CHECKED;
{ {
std::unique_ptr<Iterator> iterator(db_->NewIterator(ro)); std::unique_ptr<Iterator> iterator(db_->NewIterator(ro));
iterator->Seek("b1"); iterator->Seek("b1");
ASSERT_TRUE(iterator->Valid()); ASSERT_TRUE(iterator->Valid());
ASSERT_EQ("x1", iterator->key().ToString()); ASSERT_EQ("x1", iterator->key().ToString());
ASSERT_EQ(0, TestGetTickerCount(options, BLOOM_FILTER_PREFIX_CHECKED)); EXPECT_EQ(0, TestGetAndResetTickerCount(options, stat));
ASSERT_OK(iterator->status()); ASSERT_OK(iterator->status());
} }
std::string ub_str = "b9"; Slice ub;
Slice ub(ub_str);
ro.iterate_upper_bound = &ub; ro.iterate_upper_bound = &ub;
ub = "b9";
{ {
std::unique_ptr<Iterator> iterator(db_->NewIterator(ro)); std::unique_ptr<Iterator> iterator(db_->NewIterator(ro));
iterator->Seek("b1"); iterator->Seek("b1");
ASSERT_FALSE(iterator->Valid()); ASSERT_FALSE(iterator->Valid());
ASSERT_EQ(1, TestGetTickerCount(options, BLOOM_FILTER_PREFIX_CHECKED)); EXPECT_EQ(1, TestGetAndResetTickerCount(options, stat));
ASSERT_OK(iterator->status()); ASSERT_OK(iterator->status());
} }
ub_str = "z"; ub = "z";
ub = Slice(ub_str);
{ {
std::unique_ptr<Iterator> iterator(db_->NewIterator(ro)); std::unique_ptr<Iterator> iterator(db_->NewIterator(ro));
iterator->Seek("b1"); iterator->Seek("b1");
ASSERT_TRUE(iterator->Valid()); ASSERT_TRUE(iterator->Valid());
ASSERT_EQ("x1", iterator->key().ToString()); ASSERT_EQ("x1", iterator->key().ToString());
ASSERT_EQ(1, TestGetTickerCount(options, BLOOM_FILTER_PREFIX_CHECKED)); EXPECT_EQ(0, TestGetAndResetTickerCount(options, stat));
ASSERT_OK(iterator->status()); ASSERT_OK(iterator->status());
} }
ub_str = "c"; ub = "c";
ub = Slice(ub_str);
{ {
std::unique_ptr<Iterator> iterator(db_->NewIterator(ro)); std::unique_ptr<Iterator> iterator(db_->NewIterator(ro));
iterator->Seek("b1"); iterator->Seek("b1");
ASSERT_FALSE(iterator->Valid()); ASSERT_FALSE(iterator->Valid());
ASSERT_EQ(2, TestGetTickerCount(options, BLOOM_FILTER_PREFIX_CHECKED)); EXPECT_EQ(1, TestGetAndResetTickerCount(options, stat));
ASSERT_OK(iterator->status()); ASSERT_OK(iterator->status());
} }
// The same queries without recreating iterator ub = "c1";
{ {
ub_str = "b9";
ub = Slice(ub_str);
ro.iterate_upper_bound = &ub;
std::unique_ptr<Iterator> iterator(db_->NewIterator(ro)); std::unique_ptr<Iterator> iterator(db_->NewIterator(ro));
iterator->Seek("b1"); iterator->Seek("b1");
ASSERT_FALSE(iterator->Valid()); ASSERT_FALSE(iterator->Valid());
ASSERT_EQ(3, TestGetTickerCount(options, BLOOM_FILTER_PREFIX_CHECKED)); EXPECT_EQ(0, TestGetAndResetTickerCount(options, stat));
ASSERT_OK(iterator->status()); ASSERT_OK(iterator->status());
}
ub_str = "z"; // The same queries without recreating iterator
ub = Slice(ub_str); {
std::unique_ptr<Iterator> iterator(db_->NewIterator(ro));
ub = "b9";
iterator->Seek("b1");
ASSERT_FALSE(iterator->Valid());
EXPECT_EQ(1, TestGetAndResetTickerCount(options, stat));
ASSERT_OK(iterator->status());
ub = "z";
iterator->Seek("b1"); iterator->Seek("b1");
ASSERT_TRUE(iterator->Valid()); ASSERT_TRUE(iterator->Valid());
ASSERT_EQ("x1", iterator->key().ToString()); ASSERT_EQ("x1", iterator->key().ToString());
ASSERT_EQ(3, TestGetTickerCount(options, BLOOM_FILTER_PREFIX_CHECKED)); EXPECT_EQ(0, TestGetAndResetTickerCount(options, stat));
ub_str = "c";
ub = Slice(ub_str);
ub = "c";
iterator->Seek("b1"); iterator->Seek("b1");
ASSERT_FALSE(iterator->Valid()); ASSERT_FALSE(iterator->Valid());
ASSERT_EQ(4, TestGetTickerCount(options, BLOOM_FILTER_PREFIX_CHECKED)); EXPECT_EQ(1, TestGetAndResetTickerCount(options, stat));
ub_str = "b9"; ub = "b9";
ub = Slice(ub_str);
ro.iterate_upper_bound = &ub;
iterator->SeekForPrev("b1"); iterator->SeekForPrev("b1");
ASSERT_TRUE(iterator->Valid()); ASSERT_TRUE(iterator->Valid());
ASSERT_EQ("a1", iterator->key().ToString()); ASSERT_EQ("a1", iterator->key().ToString());
ASSERT_EQ(4, TestGetTickerCount(options, BLOOM_FILTER_PREFIX_CHECKED)); EXPECT_EQ(0, TestGetAndResetTickerCount(options, stat));
ub_str = "zz"; ub = "zz";
ub = Slice(ub_str);
ro.iterate_upper_bound = &ub;
iterator->SeekToLast(); iterator->SeekToLast();
ASSERT_TRUE(iterator->Valid()); ASSERT_TRUE(iterator->Valid());
ASSERT_EQ("y1", iterator->key().ToString()); ASSERT_EQ("y1", iterator->key().ToString());
@ -6464,6 +6463,136 @@ TEST_F(DBTest2, AutoPrefixMode1) {
ASSERT_TRUE(iterator->Valid()); ASSERT_TRUE(iterator->Valid());
ASSERT_EQ("a1", iterator->key().ToString()); ASSERT_EQ("a1", iterator->key().ToString());
} }
// Similar, now with reverse comparator
// Technically, we are violating axiom 2 of prefix_extractors, but
// it should be revised because of major use-cases using
// ReverseBytewiseComparator with capped/fixed prefix Seek. (FIXME)
options.comparator = ReverseBytewiseComparator();
options.prefix_extractor.reset(NewFixedPrefixTransform(1));
DestroyAndReopen(options);
ASSERT_OK(Put("a1", large_value));
ASSERT_OK(Put("x1", large_value));
ASSERT_OK(Put("y1", large_value));
ASSERT_OK(Flush());
{
std::unique_ptr<Iterator> iterator(db_->NewIterator(ro));
ub = "b1";
iterator->Seek("b9");
ASSERT_FALSE(iterator->Valid());
EXPECT_EQ(1, TestGetAndResetTickerCount(options, stat));
ASSERT_OK(iterator->status());
ub = "b1";
iterator->Seek("z");
ASSERT_TRUE(iterator->Valid());
ASSERT_EQ("y1", iterator->key().ToString());
EXPECT_EQ(0, TestGetAndResetTickerCount(options, stat));
ub = "b1";
iterator->Seek("c");
ASSERT_FALSE(iterator->Valid());
EXPECT_EQ(0, TestGetAndResetTickerCount(options, stat));
ub = "b";
iterator->Seek("c9");
ASSERT_FALSE(iterator->Valid());
// Fails if ReverseBytewiseComparator::IsSameLengthImmediateSuccessor
// is "correctly" implemented.
EXPECT_EQ(0, TestGetAndResetTickerCount(options, stat));
ub = "a";
iterator->Seek("b9");
// Fails if ReverseBytewiseComparator::IsSameLengthImmediateSuccessor
// is "correctly" implemented.
ASSERT_TRUE(iterator->Valid());
ASSERT_EQ("a1", iterator->key().ToString());
EXPECT_EQ(0, TestGetAndResetTickerCount(options, stat));
ub = "b";
iterator->Seek("a");
ASSERT_FALSE(iterator->Valid());
// Fails if ReverseBytewiseComparator::IsSameLengthImmediateSuccessor
// matches BytewiseComparator::IsSameLengthImmediateSuccessor. Upper
// comparing before seek key prevents a real bug from surfacing.
EXPECT_EQ(0, TestGetAndResetTickerCount(options, stat));
ub = "b1";
iterator->SeekForPrev("b9");
ASSERT_TRUE(iterator->Valid());
// Fails if ReverseBytewiseComparator::IsSameLengthImmediateSuccessor
// is "correctly" implemented.
ASSERT_EQ("x1", iterator->key().ToString());
EXPECT_EQ(0, TestGetAndResetTickerCount(options, stat));
ub = "a";
iterator->SeekToLast();
ASSERT_TRUE(iterator->Valid());
ASSERT_EQ("a1", iterator->key().ToString());
iterator->SeekToFirst();
ASSERT_TRUE(iterator->Valid());
ASSERT_EQ("y1", iterator->key().ToString());
}
// Now something a bit different, related to "short" keys that
// auto_prefix_mode can omit. See "BUG" section of auto_prefix_mode.
options.comparator = BytewiseComparator();
for (const auto config : {"fixed:2", "capped:2"}) {
ASSERT_OK(SliceTransform::CreateFromString(ConfigOptions(), config,
&options.prefix_extractor));
// FIXME: kHashSearch, etc. requires all keys be InDomain
if (StartsWith(config, "fixed") &&
(table_options.index_type == BlockBasedTableOptions::kHashSearch ||
StartsWith(options.memtable_factory->Name(), "Hash"))) {
continue;
}
DestroyAndReopen(options);
const char* a_end_stuff = "a\xffXYZ";
const char* b_begin_stuff = "b\x00XYZ";
ASSERT_OK(Put("a", large_value));
ASSERT_OK(Put("b", large_value));
ASSERT_OK(Put(Slice(b_begin_stuff, 3), large_value));
ASSERT_OK(Put("c", large_value));
ASSERT_OK(Flush());
// control showing valid optimization with auto_prefix mode
ub = Slice(a_end_stuff, 4);
ro.iterate_upper_bound = &ub;
std::unique_ptr<Iterator> iterator(db_->NewIterator(ro));
iterator->Seek(Slice(a_end_stuff, 2));
ASSERT_FALSE(iterator->Valid());
EXPECT_EQ(1, TestGetAndResetTickerCount(options, stat));
ASSERT_OK(iterator->status());
// test, cannot be validly optimized with auto_prefix_mode
ub = Slice(b_begin_stuff, 2);
ro.iterate_upper_bound = &ub;
iterator->Seek(Slice(a_end_stuff, 2));
// !!! BUG !!! See "BUG" section of auto_prefix_mode.
ASSERT_FALSE(iterator->Valid());
EXPECT_EQ(1, TestGetAndResetTickerCount(options, stat));
ASSERT_OK(iterator->status());
// To prove that is the wrong result, now use total order seek
ReadOptions tos_ro = ro;
tos_ro.total_order_seek = true;
tos_ro.auto_prefix_mode = false;
iterator.reset(db_->NewIterator(tos_ro));
iterator->Seek(Slice(a_end_stuff, 2));
ASSERT_TRUE(iterator->Valid());
ASSERT_EQ("b", iterator->key().ToString());
EXPECT_EQ(0, TestGetAndResetTickerCount(options, stat));
ASSERT_OK(iterator->status());
}
} while (ChangeOptions(kSkipPlainTable)); } while (ChangeOptions(kSkipPlainTable));
} }

@ -96,6 +96,10 @@ class Comparator : public Customizable {
virtual const Comparator* GetRootComparator() const { return this; } virtual const Comparator* GetRootComparator() const { return this; }
// given two keys, determine if t is the successor of s // given two keys, determine if t is the successor of s
// BUG: only return true if no other keys starting with `t` are ordered
// before `t`. Otherwise, the auto_prefix_mode can omit entries within
// iterator bounds that have same prefix as upper bound but different
// prefix from seek key.
virtual bool IsSameLengthImmediateSuccessor(const Slice& /*s*/, virtual bool IsSameLengthImmediateSuccessor(const Slice& /*s*/,
const Slice& /*t*/) const { const Slice& /*t*/) const {
return false; return false;

@ -1530,6 +1530,17 @@ struct ReadOptions {
// from total_order_seek, based on seek key, and iterator upper bound. // from total_order_seek, based on seek key, and iterator upper bound.
// Not supported in ROCKSDB_LITE mode, in the way that even with value true // Not supported in ROCKSDB_LITE mode, in the way that even with value true
// prefix mode is not used. // prefix mode is not used.
// BUG: Using Comparator::IsSameLengthImmediateSuccessor and
// SliceTransform::FullLengthEnabled to enable prefix mode in cases where
// prefix of upper bound differs from prefix of seek key has a flaw.
// If present in the DB, "short keys" (shorter than "full length" prefix)
// can be omitted from auto_prefix_mode iteration when they would be present
// in total_order_seek iteration, regardless of whether the short keys are
// "in domain" of the prefix extractor. This is not an issue if no short
// keys are added to DB or are not expected to be returned by such
// iterators. (We are also assuming the new condition on
// IsSameLengthImmediateSuccessor is satisfied; see its BUG section).
// A bug example is in DBTest2::AutoPrefixMode1, search for "BUG".
// Default: false // Default: false
bool auto_prefix_mode; bool auto_prefix_mode;

@ -85,7 +85,10 @@ class SliceTransform : public Customizable {
// Otherwise (including FullLengthEnabled returns false, or prefix length is // Otherwise (including FullLengthEnabled returns false, or prefix length is
// less than maximum), Seek with auto_prefix_mode is only optimized if the // less than maximum), Seek with auto_prefix_mode is only optimized if the
// iterate_upper_bound and seek key have the same prefix. // iterate_upper_bound and seek key have the same prefix.
// NOTE/TODO: We hope to revise this requirement in the future. // BUG: Despite all these conditions and even with the extra condition on
// IsSameLengthImmediateSuccessor (see it's "BUG" section), it is not
// sufficient to ensure auto_prefix_mode returns all entries that
// total_order_seek would return. See auto_prefix_mode "BUG" section.
virtual bool FullLengthEnabled(size_t* /*len*/) const { return false; } virtual bool FullLengthEnabled(size_t* /*len*/) const { return false; }
// Transform(s)=Transform(`prefix`) for any s with `prefix` as a prefix. // Transform(s)=Transform(`prefix`) for any s with `prefix` as a prefix.

@ -209,6 +209,16 @@ class ReverseBytewiseComparatorImpl : public BytewiseComparatorImpl {
// Don't do anything for simplicity. // Don't do anything for simplicity.
} }
bool IsSameLengthImmediateSuccessor(const Slice& s,
const Slice& t) const override {
// Always returning false to prevent surfacing design flaws in
// auto_prefix_mode
(void)s, (void)t;
return false;
// "Correct" implementation:
// return BytewiseComparatorImpl::IsSameLengthImmediateSuccessor(t, s);
}
bool CanKeysWithDifferentByteContentsBeEqual() const override { bool CanKeysWithDifferentByteContentsBeEqual() const override {
return false; return false;
} }

Loading…
Cancel
Save