From 30bc495c03d06d5c89a54925d7e1b019c9ce547d Mon Sep 17 00:00:00 2001
From: Changyu Bi
Date: Fri, 2 Sep 2022 09:51:19 -0700
Subject: [PATCH] Skip swaths of range tombstone covered keys in merging iterator (2022 edition) (#10449)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Summary:
Delete range logic is moved from `DBIter` to `MergingIterator`, and `MergingIterator` will seek to the end of a range deletion when possible, instead of scanning through each key and checking it against `RangeDelAggregator`.

With the invariant that a key in level L (consider the memtable as the first level, and each immutable memtable and each L0 file as a separate level) has a larger sequence number than all keys in any level >L, a range tombstone `[start, end)` from level L covers all keys in its range in any level >L. This property motivates the following optimizations in the iterator:
- in `Seek(target)`, if level L has a range tombstone `[start, end)` that covers `target.UserKey`, then for all levels > L, we can do `Seek()` on `end` instead of `target`, skipping a swath of range tombstone covered keys.
- in `Next()`/`Prev()`, if the current key is covered by a range tombstone `[start, end)` from level L, we can similarly reseek all levels > L past the range covered by the tombstone (toward `end` when iterating forward, toward `start` when iterating backward).

This PR implements the above optimizations in `MergingIterator`. Since all range tombstone covered keys are now skipped in `MergingIterator`, the range tombstone logic is removed from `DBIter`. The idea in this PR is similar to https://github.com/facebook/rocksdb/issues/7317, but this PR leaves the `InternalIterator` interface mostly unchanged.

**Credit**: the cascading seek optimization and the sentinel key (discussed below) are inspired by [Pebble](https://github.com/cockroachdb/pebble/blob/master/merging_iter.go) and were suggested by ajkr in https://github.com/facebook/rocksdb/issues/7317.

The two optimizations are mostly implemented in `SeekImpl()`/`SeekForPrevImpl()` and `IsNextDeleted()`/`IsPrevDeleted()` in `merging_iterator.cc`. See the comments on each method for more detail.

One notable change is that the minHeap/maxHeap used by `MergingIterator` now contains range tombstone start/end keys in addition to point key iterators. This helps reduce the number of key comparisons. For example, for a range tombstone `[start, end)`, a `start` `HeapItem` and an `end` `HeapItem` are inserted into the heap. When the `HeapItem` for a range tombstone's start key is popped from the minHeap, we know this range tombstone becomes "active", in the sense that, until the range tombstone's end key is popped from the minHeap, all keys popped from the heap are covered by the range tombstone's internal key range `[start, end)`.

Another major change, the *delete range sentinel key*, is made to `LevelIterator`. Before this PR, once all point keys in an SST file had been iterated through in `MergingIterator`, the level iterator would advance to the next SST file in its level. When an SST file has a range tombstone that covers keys beyond the SST file's last point key, advancing to the next SST file would lose this range tombstone, and `MergingIterator` could then return keys that should have been deleted by it. We prevent this by treating each SST file's boundary as a sentinel key: a `LevelIterator` now only advances the file iterator once the sentinel key has been processed.
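To make the cascading seek idea concrete, below is a minimal, self-contained sketch. It is not the actual `MergingIterator` code: `Level`, `Tombstone`, `Covering()`, and `CascadingSeek()` are made-up names for illustration, and the sketch ignores sequence numbers, the heap, and sentinel keys. It only shows how a covering tombstone found at a newer level moves the seek target for all older levels to the tombstone's end key.

```
#include <optional>
#include <string>
#include <vector>

// Stand-in types for illustration only; these are not RocksDB types.
struct Tombstone {
  std::string start, end;  // covers user keys in [start, end)
  bool Covers(const std::string& k) const { return start <= k && k < end; }
};

struct Level {
  std::vector<std::string> keys;      // sorted point keys in this level
  std::vector<Tombstone> tombstones;  // range tombstones in this level
  size_t pos = 0;

  // Position at the first key >= target.
  void Seek(const std::string& target) {
    pos = 0;
    while (pos < keys.size() && keys[pos] < target) ++pos;
  }

  // Return a tombstone from *this* level that covers k, if any.
  std::optional<Tombstone> Covering(const std::string& k) const {
    for (const auto& t : tombstones) {
      if (t.Covers(k)) return t;
    }
    return std::nullopt;
  }
};

// Cascading seek: visit levels newest-first. If level L has a tombstone
// covering the current seek target, every older level (> L) seeks to the
// tombstone's end key instead, skipping all keys the tombstone covers.
void CascadingSeek(std::vector<Level>& levels, std::string target) {
  for (auto& level : levels) {  // levels[0] is the newest
    level.Seek(target);
    if (auto t = level.Covering(target)) {
      target = t->end;  // older levels seek past the covered range
    }
  }
}

int main() {
  // levels[0] (newest) has tombstone [b, d); levels[1] holds point keys.
  std::vector<Level> levels = {
      {{}, {{"b", "d"}}},
      {{"a", "b", "c", "e"}, {}},
  };
  CascadingSeek(levels, "b");
  // levels[1] was reseeked to "d", so it now points at "e" instead of "b".
  return levels[1].keys[levels[1].pos] == "e" ? 0 : 1;
}
```

In the actual `MergingIterator`, this target adjustment is done inside `SeekImpl()` (and mirrored in `SeekForPrevImpl()`), and the new perf context counter `internal_range_del_reseek_count`, checked heavily in the tests below, counts how many levels are reseeked this way.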
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10449

Test Plan:
- Added many unit tests in db_range_del_test.
- Stress test: `./db_stress --readpercent=5 --prefixpercent=19 --writepercent=20 -delpercent=10 --iterpercent=44 --delrangepercent=2`.
- An additional iterator stress test was added to verify iterators against the expected state: https://github.com/facebook/rocksdb/issues/10538. This is based on ajkr's previous attempt https://github.com/facebook/rocksdb/pull/5506#issuecomment-506021913.
```
python3 ./tools/db_crashtest.py blackbox --simple --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152 --compression_type=none --max_background_compactions=8 --value_size_mult=33 --max_key=5000000 --interval=10 --duration=7200 --delrangepercent=3 --delpercent=9 --iterpercent=25 --writepercent=60 --readpercent=3 --prefixpercent=0 --num_iterations=1000 --range_deletion_width=100 --verify_iterator_with_expected_state_one_in=1
```
- Performance benchmark: I used a setup similar to the one in the blog [post](http://rocksdb.org/blog/2018/11/21/delete-range.html) that introduced DeleteRange: "a database with 5 million data keys, and 10000 range tombstones (ignoring those dropped during compaction) that were written in regular intervals after 4.5 million data keys were written". As expected, the performance with this PR depends on the range tombstone width.
```
# Setup:
TEST_TMPDIR=/dev/shm ./db_bench_main --benchmarks=fillrandom --writes=4500000 --num=5000000
TEST_TMPDIR=/dev/shm ./db_bench_main --benchmarks=overwrite --writes=500000 --num=5000000 --use_existing_db=true --writes_per_range_tombstone=50
# Scan entire DB
TEST_TMPDIR=/dev/shm ./db_bench_main --benchmarks=readseq[-X5] --use_existing_db=true --num=5000000 --disable_auto_compactions=true
# Short range scan (10 Next())
TEST_TMPDIR=/dev/shm/width-100/ ./db_bench_main --benchmarks=seekrandom[-X5] --use_existing_db=true --num=500000 --reads=100000 --seek_nexts=10 --disable_auto_compactions=true
# Long range scan (1000 Next())
TEST_TMPDIR=/dev/shm/width-100/ ./db_bench_main --benchmarks=seekrandom[-X5] --use_existing_db=true --num=500000 --reads=2500 --seek_nexts=1000 --disable_auto_compactions=true
```
Average over 10 runs (some slower tests had fewer runs). In the first column (tombstone width), 0 means no range tombstone, 100-10000 is the width of each of the 10k range tombstones, and 1 means a single range tombstone in the entire DB (of width 1000). The 1-tombstone case tests for regressions when there are very few range tombstones in the DB, since having no range tombstones at all likely takes a different code path than having some.
- Scan entire DB

| tombstone width | Pre-PR ops/sec | Post-PR ops/sec | ±% |
| ------------- | ------------- | ------------- | ------------- |
| 0 range tombstone | 2525600 (± 43564) | 2486917 (± 33698) | -1.53% |
| 100 | 1853835 (± 24736) | 2073884 (± 32176) | +11.87% |
| 1000 | 422415 (± 7466) | 1115801 (± 22781) | +164.15% |
| 10000 | 22384 (± 227) | 227919 (± 6647) | +918.22% |
| 1 range tombstone | 2176540 (± 39050) | 2434954 (± 24563) | +11.87% |

- Short range scan

| tombstone width | Pre-PR ops/sec | Post-PR ops/sec | ±% |
| ------------- | ------------- | ------------- | ------------- |
| 0 range tombstone | 35398 (± 533) | 35338 (± 569) | -0.17% |
| 100 | 28276 (± 664) | 31684 (± 331) | +12.05% |
| 1000 | 7637 (± 77) | 25422 (± 277) | +232.88% |
| 10000 | 1367 | 28667 | +1997.07% |
| 1 range tombstone | 32618 (± 581) | 32748 (± 506) | +0.4% |

- Long range scan

| tombstone width | Pre-PR ops/sec | Post-PR ops/sec | ±% |
| ------------- | ------------- | ------------- | ------------- |
| 0 range tombstone | 2262 (± 33) | 2353 (± 20) | +4.02% |
| 100 | 1696 (± 26) | 1926 (± 18) | +13.56% |
| 1000 | 410 (± 6) | 1255 (± 29) | +206.1% |
| 10000 | 25 | 414 | +1556.0% |
| 1 range tombstone | 1957 (± 30) | 2185 (± 44) | +11.65% |

- Microbench does not show significant regression: https://gist.github.com/cbi42/59f280f85a59b678e7e5d8561e693b61

Reviewed By: ajkr
Differential Revision: D38450331
Pulled By: cbi42
fbshipit-source-id: b5ef12e8d8c289ed2e163ccdf277f5039b511fca
---
 HISTORY.md                      |    3 +
 db/arena_wrapped_db_iter.cc     |   27 +-
 db/arena_wrapped_db_iter.h      |    9 +-
 db/c.cc                         |    2 +
 db/db_compaction_filter_test.cc |   14 +-
 db/db_impl/db_impl.cc           |   45 +-
 db/db_impl/db_impl.h            |   29 +-
 db/db_impl/db_impl_readonly.cc  |    8 +-
 db/db_impl/db_impl_secondary.cc |    3 +-
 db/db_iter.cc                   |  101 +--
 db/db_iter.h                    |    2 -
 db/db_range_del_test.cc         |  932 +++++++++++++++++++++++++++
 db/db_test_util.cc              |   20 +-
 db/dbformat.h                   |    3 +-
 db/memtable_list.cc             |   24 +
 db/memtable_list.h              |    3 +
 db/range_del_aggregator.cc      |    5 +-
 db/range_del_aggregator.h       |    1 +
 db/range_del_aggregator_test.cc |   20 +-
 db/table_cache.cc               |   50 +-
 db/table_cache.h                |    8 +-
 db/version_set.cc               |  352 +++++++++--
 db/version_set.h                |    8 +-
 include/rocksdb/c.h             |    3 +-
 include/rocksdb/perf_context.h  |    4 +
 monitoring/perf_context.cc      |    5 +
 table/internal_iterator.h       |    7 +
 table/iterator_wrapper.h        |    4 +
 table/merging_iterator.cc       | 1037 +++++++++++++++++++++++++++----
 table/merging_iterator.h        |   23 +-
 utilities/debug.cc              |    6 +-
 31 files changed, 2401 insertions(+), 357 deletions(-)

diff --git a/HISTORY.md b/HISTORY.md
index 5b69bfb7f..75a86e27c 100644
--- a/HISTORY.md
+++ b/HISTORY.md
@@ -22,6 +22,9 @@
 ### New Features
 * RocksDB does internal auto prefetching if it notices 2 sequential reads if readahead_size is not specified. New option `num_file_reads_for_auto_readahead` is added in BlockBasedTableOptions which indicates after how many sequential reads internal auto prefetching should be start (default is 2).
+### Performance Improvements
+* Iterator performance is improved for `DeleteRange()` users. Internally, iterator will skip to the end of a range tombstone when possible, instead of looping through each key and check individually if a key is range deleted.
+
 ## 7.6.0 (08/19/2022)
 ### New Features
 * Added `prepopulate_blob_cache` to ColumnFamilyOptions. If enabled, prepopulate warm/hot blobs which are already in memory into blob cache at the time of flush. On a flush, the blob that is in memory (in memtables) get flushed to the device.
If using Direct IO, additional IO is incurred to read this blob back into memory again, which is avoided by enabling this option. This further helps if the workload exhibits high temporal locality, where most of the reads go to recently written data. This also helps in case of the remote file system since it involves network traffic and higher latencies. diff --git a/db/arena_wrapped_db_iter.cc b/db/arena_wrapped_db_iter.cc index c14217f62..b7666c3fe 100644 --- a/db/arena_wrapped_db_iter.cc +++ b/db/arena_wrapped_db_iter.cc @@ -77,22 +77,29 @@ Status ArenaWrappedDBIter::Refresh() { allow_refresh_); InternalIterator* internal_iter = db_impl_->NewInternalIterator( - read_options_, cfd_, sv, &arena_, db_iter_->GetRangeDelAggregator(), - latest_seq, /* allow_unprepared_value */ true); + read_options_, cfd_, sv, &arena_, latest_seq, + /* allow_unprepared_value */ true, /* db_iter */ this); SetIterUnderDBIter(internal_iter); break; } else { SequenceNumber latest_seq = db_impl_->GetLatestSequenceNumber(); // Refresh range-tombstones in MemTable if (!read_options_.ignore_range_deletions) { - SuperVersion* sv = cfd_->GetThreadLocalSuperVersion(db_impl_); - ReadRangeDelAggregator* range_del_agg = - db_iter_->GetRangeDelAggregator(); - std::unique_ptr range_del_iter; - range_del_iter.reset(sv->mem->NewRangeTombstoneIterator( - read_options_, latest_seq, false /* immutable_memtable */)); - range_del_agg->AddTombstones(std::move(range_del_iter)); - cfd_->ReturnThreadLocalSuperVersion(sv); + assert(memtable_range_tombstone_iter_ != nullptr); + if (memtable_range_tombstone_iter_ != nullptr) { + SuperVersion* sv = cfd_->GetThreadLocalSuperVersion(db_impl_); + auto t = sv->mem->NewRangeTombstoneIterator( + read_options_, latest_seq, false /* immutable_memtable */); + delete *memtable_range_tombstone_iter_; + if (t == nullptr || t->empty()) { + *memtable_range_tombstone_iter_ = nullptr; + } else { + *memtable_range_tombstone_iter_ = new TruncatedRangeDelIterator( + std::unique_ptr(t), + &cfd_->internal_comparator(), nullptr, nullptr); + } + cfd_->ReturnThreadLocalSuperVersion(sv); + } } // Refresh latest sequence number db_iter_->set_sequence(latest_seq); diff --git a/db/arena_wrapped_db_iter.h b/db/arena_wrapped_db_iter.h index 5b8645c90..1c0f77a1d 100644 --- a/db/arena_wrapped_db_iter.h +++ b/db/arena_wrapped_db_iter.h @@ -44,9 +44,7 @@ class ArenaWrappedDBIter : public Iterator { // Get the arena to be used to allocate memory for DBIter to be wrapped, // as well as child iterators in it. virtual Arena* GetArena() { return &arena_; } - virtual ReadRangeDelAggregator* GetRangeDelAggregator() { - return db_iter_->GetRangeDelAggregator(); - } + const ReadOptions& GetReadOptions() { return read_options_; } // Set the internal iterator wrapped inside the DB Iterator. Usually it is @@ -55,6 +53,10 @@ class ArenaWrappedDBIter : public Iterator { db_iter_->SetIter(iter); } + void SetMemtableRangetombstoneIter(TruncatedRangeDelIterator** iter) { + memtable_range_tombstone_iter_ = iter; + } + bool Valid() const override { return db_iter_->Valid(); } void SeekToFirst() override { db_iter_->SeekToFirst(); } void SeekToLast() override { db_iter_->SeekToLast(); } @@ -104,6 +106,7 @@ class ArenaWrappedDBIter : public Iterator { ReadCallback* read_callback_; bool expose_blob_index_ = false; bool allow_refresh_ = true; + TruncatedRangeDelIterator** memtable_range_tombstone_iter_ = nullptr; }; // Generate the arena wrapped iterator class. 
diff --git a/db/c.cc b/db/c.cc index e3a8fbb5e..91499f8b5 100644 --- a/db/c.cc +++ b/db/c.cc @@ -4116,6 +4116,8 @@ uint64_t rocksdb_perfcontext_metric(rocksdb_perfcontext_t* context, return rep->blob_checksum_time; case rocksdb_blob_decompress_time: return rep->blob_decompress_time; + case rocksdb_internal_range_del_reseek_count: + return rep->internal_range_del_reseek_count; default: break; } diff --git a/db/db_compaction_filter_test.cc b/db/db_compaction_filter_test.cc index 9ea8a350f..be863d4f6 100644 --- a/db/db_compaction_filter_test.cc +++ b/db/db_compaction_filter_test.cc @@ -328,11 +328,9 @@ TEST_F(DBTestCompactionFilter, CompactionFilter) { Arena arena; { InternalKeyComparator icmp(options.comparator); - ReadRangeDelAggregator range_del_agg(&icmp, - kMaxSequenceNumber /* upper_bound */); ReadOptions read_options; ScopedArenaIterator iter(dbfull()->NewInternalIterator( - read_options, &arena, &range_del_agg, kMaxSequenceNumber, handles_[1])); + read_options, &arena, kMaxSequenceNumber, handles_[1])); iter->SeekToFirst(); ASSERT_OK(iter->status()); while (iter->Valid()) { @@ -422,11 +420,9 @@ TEST_F(DBTestCompactionFilter, CompactionFilter) { count = 0; { InternalKeyComparator icmp(options.comparator); - ReadRangeDelAggregator range_del_agg(&icmp, - kMaxSequenceNumber /* upper_bound */); ReadOptions read_options; ScopedArenaIterator iter(dbfull()->NewInternalIterator( - read_options, &arena, &range_del_agg, kMaxSequenceNumber, handles_[1])); + read_options, &arena, kMaxSequenceNumber, handles_[1])); iter->SeekToFirst(); ASSERT_OK(iter->status()); while (iter->Valid()) { @@ -701,11 +697,9 @@ TEST_F(DBTestCompactionFilter, CompactionFilterContextManual) { int total = 0; Arena arena; InternalKeyComparator icmp(options.comparator); - ReadRangeDelAggregator range_del_agg(&icmp, - kMaxSequenceNumber /* snapshots */); ReadOptions read_options; - ScopedArenaIterator iter(dbfull()->NewInternalIterator( - read_options, &arena, &range_del_agg, kMaxSequenceNumber)); + ScopedArenaIterator iter(dbfull()->NewInternalIterator(read_options, &arena, + kMaxSequenceNumber)); iter->SeekToFirst(); ASSERT_OK(iter->status()); while (iter->Valid()) { diff --git a/db/db_impl/db_impl.cc b/db/db_impl/db_impl.cc index 755e67d9e..6ec257836 100644 --- a/db/db_impl/db_impl.cc +++ b/db/db_impl/db_impl.cc @@ -1641,7 +1641,6 @@ Status DBImpl::GetFullHistoryTsLow(ColumnFamilyHandle* column_family, InternalIterator* DBImpl::NewInternalIterator(const ReadOptions& read_options, Arena* arena, - RangeDelAggregator* range_del_agg, SequenceNumber sequence, ColumnFamilyHandle* column_family, bool allow_unprepared_value) { @@ -1656,8 +1655,8 @@ InternalIterator* DBImpl::NewInternalIterator(const ReadOptions& read_options, mutex_.Lock(); SuperVersion* super_version = cfd->GetSuperVersion()->Ref(); mutex_.Unlock(); - return NewInternalIterator(read_options, cfd, super_version, arena, - range_del_agg, sequence, allow_unprepared_value); + return NewInternalIterator(read_options, cfd, super_version, arena, sequence, + allow_unprepared_value); } void DBImpl::SchedulePurge() { @@ -1788,16 +1787,12 @@ static void CleanupGetMergeOperandsState(void* arg1, void* /*arg2*/) { } // namespace -InternalIterator* DBImpl::NewInternalIterator(const ReadOptions& read_options, - ColumnFamilyData* cfd, - SuperVersion* super_version, - Arena* arena, - RangeDelAggregator* range_del_agg, - SequenceNumber sequence, - bool allow_unprepared_value) { +InternalIterator* DBImpl::NewInternalIterator( + const ReadOptions& read_options, ColumnFamilyData* cfd, 
+ SuperVersion* super_version, Arena* arena, SequenceNumber sequence, + bool allow_unprepared_value, ArenaWrappedDBIter* db_iter) { InternalIterator* internal_iter; assert(arena != nullptr); - assert(range_del_agg != nullptr); // Need to create internal iterator from the arena. MergeIteratorBuilder merge_iter_builder( &cfd->internal_comparator(), arena, @@ -1806,19 +1801,27 @@ InternalIterator* DBImpl::NewInternalIterator(const ReadOptions& read_options, // Collect iterator for mutable mem merge_iter_builder.AddIterator( super_version->mem->NewIterator(read_options, arena)); - std::unique_ptr range_del_iter; Status s; if (!read_options.ignore_range_deletions) { - range_del_iter.reset(super_version->mem->NewRangeTombstoneIterator( - read_options, sequence, false /* immutable_memtable */)); - range_del_agg->AddTombstones(std::move(range_del_iter)); + auto range_del_iter = super_version->mem->NewRangeTombstoneIterator( + read_options, sequence, false /* immutable_memtable */); + if (range_del_iter == nullptr || range_del_iter->empty()) { + delete range_del_iter; + merge_iter_builder.AddRangeTombstoneIterator(nullptr); + } else { + merge_iter_builder.AddRangeTombstoneIterator( + new TruncatedRangeDelIterator( + std::unique_ptr(range_del_iter), + &cfd->ioptions()->internal_comparator, nullptr /* smallest */, + nullptr /* largest */)); + } } // Collect all needed child iterators for immutable memtables if (s.ok()) { super_version->imm->AddIterators(read_options, &merge_iter_builder); if (!read_options.ignore_range_deletions) { s = super_version->imm->AddRangeTombstoneIterators(read_options, arena, - range_del_agg); + merge_iter_builder); } } TEST_SYNC_POINT_CALLBACK("DBImpl::NewInternalIterator:StatusCallback", &s); @@ -1826,10 +1829,11 @@ InternalIterator* DBImpl::NewInternalIterator(const ReadOptions& read_options, // Collect iterators for files in L0 - Ln if (read_options.read_tier != kMemtableTier) { super_version->current->AddIterators(read_options, file_options_, - &merge_iter_builder, range_del_agg, + &merge_iter_builder, allow_unprepared_value); } - internal_iter = merge_iter_builder.Finish(); + internal_iter = merge_iter_builder.Finish( + read_options.ignore_range_deletions ? nullptr : db_iter); SuperVersionHandle* cleanup = new SuperVersionHandle( this, &mutex_, super_version, read_options.background_purge_on_iterator_cleanup || @@ -3354,9 +3358,8 @@ ArenaWrappedDBIter* DBImpl::NewIteratorImpl(const ReadOptions& read_options, read_options.snapshot != nullptr ? false : allow_refresh); InternalIterator* internal_iter = NewInternalIterator( - db_iter->GetReadOptions(), cfd, sv, db_iter->GetArena(), - db_iter->GetRangeDelAggregator(), snapshot, - /* allow_unprepared_value */ true); + db_iter->GetReadOptions(), cfd, sv, db_iter->GetArena(), snapshot, + /* allow_unprepared_value */ true, db_iter); db_iter->SetIterUnderDBIter(internal_iter); return db_iter; diff --git a/db/db_impl/db_impl.h b/db/db_impl/db_impl.h index 7cc9bfbf2..6c7fe5a75 100644 --- a/db/db_impl/db_impl.h +++ b/db/db_impl/db_impl.h @@ -739,13 +739,29 @@ class DBImpl : public DB { // the value and so will require PrepareValue() to be called before value(); // allow_unprepared_value = false is convenient when this optimization is not // useful, e.g. when reading the whole column family. + // + // read_options.ignore_range_deletions determines whether range tombstones are + // processed in the returned interator internally, i.e., whether range + // tombstone covered keys are in this iterator's output. 
// @param read_options Must outlive the returned iterator. InternalIterator* NewInternalIterator( - const ReadOptions& read_options, Arena* arena, - RangeDelAggregator* range_del_agg, SequenceNumber sequence, + const ReadOptions& read_options, Arena* arena, SequenceNumber sequence, ColumnFamilyHandle* column_family = nullptr, bool allow_unprepared_value = false); + // Note: to support DB iterator refresh, memtable range tombstones in the + // underlying merging iterator needs to be refreshed. If db_iter is not + // nullptr, db_iter->SetMemtableRangetombstoneIter() is called with the + // memtable range tombstone iterator used by the underlying merging iterator. + // This range tombstone iterator can be refreshed later by db_iter. + // @param read_options Must outlive the returned iterator. + InternalIterator* NewInternalIterator(const ReadOptions& read_options, + ColumnFamilyData* cfd, + SuperVersion* super_version, + Arena* arena, SequenceNumber sequence, + bool allow_unprepared_value, + ArenaWrappedDBIter* db_iter = nullptr); + LogsWithPrepTracker* logs_with_prep_tracker() { return &logs_with_prep_tracker_; } @@ -868,15 +884,6 @@ class DBImpl : public DB { const WriteController& write_controller() { return write_controller_; } - // @param read_options Must outlive the returned iterator. - InternalIterator* NewInternalIterator(const ReadOptions& read_options, - ColumnFamilyData* cfd, - SuperVersion* super_version, - Arena* arena, - RangeDelAggregator* range_del_agg, - SequenceNumber sequence, - bool allow_unprepared_value); - // hollow transactions shell used for recovery. // these will then be passed to TransactionDB so that // locks can be reacquired before writing can resume. diff --git a/db/db_impl/db_impl_readonly.cc b/db/db_impl/db_impl_readonly.cc index 4b1103d4b..0f10baf24 100644 --- a/db/db_impl/db_impl_readonly.cc +++ b/db/db_impl/db_impl_readonly.cc @@ -143,8 +143,7 @@ Iterator* DBImplReadOnly::NewIterator(const ReadOptions& read_options, super_version->version_number, read_callback); auto internal_iter = NewInternalIterator( db_iter->GetReadOptions(), cfd, super_version, db_iter->GetArena(), - db_iter->GetRangeDelAggregator(), read_seq, - /* allow_unprepared_value */ true); + read_seq, /* allow_unprepared_value */ true, db_iter); db_iter->SetIterUnderDBIter(internal_iter); return db_iter; } @@ -194,9 +193,8 @@ Status DBImplReadOnly::NewIterators( sv->mutable_cf_options.max_sequential_skip_in_iterations, sv->version_number, read_callback); auto* internal_iter = NewInternalIterator( - db_iter->GetReadOptions(), cfd, sv, db_iter->GetArena(), - db_iter->GetRangeDelAggregator(), read_seq, - /* allow_unprepared_value */ true); + db_iter->GetReadOptions(), cfd, sv, db_iter->GetArena(), read_seq, + /* allow_unprepared_value */ true, db_iter); db_iter->SetIterUnderDBIter(internal_iter); iterators->push_back(db_iter); } diff --git a/db/db_impl/db_impl_secondary.cc b/db/db_impl/db_impl_secondary.cc index bcc98f131..40e85ee85 100644 --- a/db/db_impl/db_impl_secondary.cc +++ b/db/db_impl/db_impl_secondary.cc @@ -493,8 +493,7 @@ ArenaWrappedDBIter* DBImplSecondary::NewIteratorImpl( expose_blob_index, read_options.snapshot ? 
false : allow_refresh); auto internal_iter = NewInternalIterator( db_iter->GetReadOptions(), cfd, super_version, db_iter->GetArena(), - db_iter->GetRangeDelAggregator(), snapshot, - /* allow_unprepared_value */ true); + snapshot, /* allow_unprepared_value */ true, db_iter); db_iter->SetIterUnderDBIter(internal_iter); return db_iter; } diff --git a/db/db_iter.cc b/db/db_iter.cc index c9fdfab45..990311257 100644 --- a/db/db_iter.cc +++ b/db/db_iter.cc @@ -78,7 +78,6 @@ DBIter::DBIter(Env* _env, const ReadOptions& read_options, is_blob_(false), is_wide_(false), arena_mode_(arena_mode), - range_del_agg_(&ioptions.internal_comparator, s), db_impl_(db_impl), cfd_(cfd), timestamp_ub_(read_options.timestamp), @@ -394,49 +393,27 @@ bool DBIter::FindNextUserEntryInternal(bool skipping_saved_key, saved_key_.SetUserKey( ikey_.user_key, !pin_thru_lifetime_ || !iter_.iter()->IsKeyPinned() /* copy */); - if (range_del_agg_.ShouldDelete( - ikey_, RangeDelPositioningMode::kForwardTraversal)) { - // Arrange to skip all upcoming entries for this key since - // they are hidden by this deletion. - skipping_saved_key = true; - num_skipped = 0; - reseek_done = false; - PERF_COUNTER_ADD(internal_delete_skipped_count, 1); - } else { - if (ikey_.type == kTypeBlobIndex) { - if (!SetBlobValueIfNeeded(ikey_.user_key, iter_.value())) { - return false; - } - } else if (ikey_.type == kTypeWideColumnEntity) { - if (!SetWideColumnValueIfNeeded(iter_.value())) { - return false; - } + if (ikey_.type == kTypeBlobIndex) { + if (!SetBlobValueIfNeeded(ikey_.user_key, iter_.value())) { + return false; + } + } else if (ikey_.type == kTypeWideColumnEntity) { + if (!SetWideColumnValueIfNeeded(iter_.value())) { + return false; } - - valid_ = true; - return true; } + valid_ = true; + return true; } break; case kTypeMerge: saved_key_.SetUserKey( ikey_.user_key, !pin_thru_lifetime_ || !iter_.iter()->IsKeyPinned() /* copy */); - if (range_del_agg_.ShouldDelete( - ikey_, RangeDelPositioningMode::kForwardTraversal)) { - // Arrange to skip all upcoming entries for this key since - // they are hidden by this deletion. 
- skipping_saved_key = true; - num_skipped = 0; - reseek_done = false; - PERF_COUNTER_ADD(internal_delete_skipped_count, 1); - } else { - // By now, we are sure the current ikey is going to yield a - // value - current_entry_is_merged_ = true; - valid_ = true; - return MergeValuesNewToOld(); // Go to a different state machine - } + // By now, we are sure the current ikey is going to yield a value + current_entry_is_merged_ = true; + valid_ = true; + return MergeValuesNewToOld(); // Go to a different state machine break; default: valid_ = false; @@ -562,9 +539,7 @@ bool DBIter::MergeValuesNewToOld() { // hit the next user key, stop right here break; } - if (kTypeDeletion == ikey.type || kTypeSingleDeletion == ikey.type || - range_del_agg_.ShouldDelete( - ikey, RangeDelPositioningMode::kForwardTraversal)) { + if (kTypeDeletion == ikey.type || kTypeSingleDeletion == ikey.type) { // hit a delete with the same user key, stop right here // iter_ is positioned after delete iter_.Next(); @@ -913,11 +888,7 @@ bool DBIter::FindValueForCurrentKey() { case kTypeValue: case kTypeBlobIndex: case kTypeWideColumnEntity: - if (range_del_agg_.ShouldDelete( - ikey, RangeDelPositioningMode::kBackwardTraversal)) { - last_key_entry_type = kTypeRangeDeletion; - PERF_COUNTER_ADD(internal_delete_skipped_count, 1); - } else if (iter_.iter()->IsValuePinned()) { + if (iter_.iter()->IsValuePinned()) { pinned_value_ = iter_.value(); } else { valid_ = false; @@ -938,21 +909,12 @@ bool DBIter::FindValueForCurrentKey() { last_not_merge_type = last_key_entry_type; PERF_COUNTER_ADD(internal_delete_skipped_count, 1); break; - case kTypeMerge: - if (range_del_agg_.ShouldDelete( - ikey, RangeDelPositioningMode::kBackwardTraversal)) { - merge_context_.Clear(); - last_key_entry_type = kTypeRangeDeletion; - last_not_merge_type = last_key_entry_type; - PERF_COUNTER_ADD(internal_delete_skipped_count, 1); - } else { - assert(merge_operator_ != nullptr); - merge_context_.PushOperandBack( - iter_.value(), - iter_.iter()->IsValuePinned() /* operand_pinned */); - PERF_COUNTER_ADD(internal_merge_count, 1); - } - break; + case kTypeMerge: { + assert(merge_operator_ != nullptr); + merge_context_.PushOperandBack( + iter_.value(), iter_.iter()->IsValuePinned() /* operand_pinned */); + PERF_COUNTER_ADD(internal_merge_count, 1); + } break; default: valid_ = false; status_ = Status::Corruption( @@ -989,8 +951,7 @@ bool DBIter::FindValueForCurrentKey() { } if (timestamp_lb_ != nullptr) { - assert(last_key_entry_type == ikey_.type || - last_key_entry_type == kTypeRangeDeletion); + assert(last_key_entry_type == ikey_.type); } Status s; @@ -1005,7 +966,6 @@ bool DBIter::FindValueForCurrentKey() { case kTypeDeletion: case kTypeDeletionWithTimestamp: case kTypeSingleDeletion: - case kTypeRangeDeletion: if (timestamp_lb_ == nullptr) { valid_ = false; } else { @@ -1016,8 +976,7 @@ bool DBIter::FindValueForCurrentKey() { case kTypeMerge: current_entry_is_merged_ = true; if (last_not_merge_type == kTypeDeletion || - last_not_merge_type == kTypeSingleDeletion || - last_not_merge_type == kTypeRangeDeletion) { + last_not_merge_type == kTypeSingleDeletion) { s = Merge(nullptr, saved_key_.GetUserKey()); if (!s.ok()) { return false; @@ -1157,8 +1116,6 @@ bool DBIter::FindValueForCurrentKeyUsingSeek() { } if (ikey.type == kTypeDeletion || ikey.type == kTypeSingleDeletion || - range_del_agg_.ShouldDelete( - ikey, RangeDelPositioningMode::kBackwardTraversal) || kTypeDeletionWithTimestamp == ikey.type) { if (timestamp_lb_ == nullptr) { valid_ = false; @@ -1221,9 
+1178,7 @@ bool DBIter::FindValueForCurrentKeyUsingSeek() { if (!user_comparator_.Equal(ikey.user_key, saved_key_.GetUserKey())) { break; } - if (ikey.type == kTypeDeletion || ikey.type == kTypeSingleDeletion || - range_del_agg_.ShouldDelete( - ikey, RangeDelPositioningMode::kForwardTraversal)) { + if (ikey.type == kTypeDeletion || ikey.type == kTypeSingleDeletion) { break; } if (!iter_.PrepareValue()) { @@ -1498,7 +1453,6 @@ void DBIter::Seek(const Slice& target) { SetSavedKeyToSeekTarget(target); iter_.Seek(saved_key_.GetInternalKey()); - range_del_agg_.InvalidateRangeDelMapPositions(); RecordTick(statistics_, NUMBER_DB_SEEK); } if (!iter_.Valid()) { @@ -1574,7 +1528,6 @@ void DBIter::SeekForPrev(const Slice& target) { PERF_TIMER_GUARD(seek_internal_seek_time); SetSavedKeyToSeekForPrevTarget(target); iter_.SeekForPrev(saved_key_.GetInternalKey()); - range_del_agg_.InvalidateRangeDelMapPositions(); RecordTick(statistics_, NUMBER_DB_SEEK); } if (!iter_.Valid()) { @@ -1622,6 +1575,8 @@ void DBIter::SeekToFirst() { max_skip_ = std::numeric_limits::max(); } status_ = Status::OK(); + // if iterator is empty, this status_ could be unchecked. + status_.PermitUncheckedError(); direction_ = kForward; ReleaseTempPinnedData(); ResetBlobValue(); @@ -1633,7 +1588,6 @@ void DBIter::SeekToFirst() { { PERF_TIMER_GUARD(seek_internal_seek_time); iter_.SeekToFirst(); - range_del_agg_.InvalidateRangeDelMapPositions(); } RecordTick(statistics_, NUMBER_DB_SEEK); @@ -1692,6 +1646,8 @@ void DBIter::SeekToLast() { max_skip_ = std::numeric_limits::max(); } status_ = Status::OK(); + // if iterator is empty, this status_ could be unchecked. + status_.PermitUncheckedError(); direction_ = kReverse; ReleaseTempPinnedData(); ResetBlobValue(); @@ -1703,7 +1659,6 @@ void DBIter::SeekToLast() { { PERF_TIMER_GUARD(seek_internal_seek_time); iter_.SeekToLast(); - range_del_agg_.InvalidateRangeDelMapPositions(); } PrevInternal(nullptr); if (statistics_ != nullptr) { diff --git a/db/db_iter.h b/db/db_iter.h index 998486fd4..d7314e3d9 100644 --- a/db/db_iter.h +++ b/db/db_iter.h @@ -139,7 +139,6 @@ class DBIter final : public Iterator { iter_.Set(iter); iter_.iter()->SetPinnedItersMgr(&pinned_iters_mgr_); } - ReadRangeDelAggregator* GetRangeDelAggregator() { return &range_del_agg_; } bool Valid() const override { #ifdef ROCKSDB_ASSERT_STATUS_CHECKED @@ -380,7 +379,6 @@ class DBIter final : public Iterator { bool arena_mode_; // List of operands for merge operator. MergeContext merge_context_; - ReadRangeDelAggregator range_del_agg_; LocalStatistics local_stats_; PinnedIteratorsManager pinned_iters_mgr_; #ifdef ROCKSDB_LITE diff --git a/db/db_range_del_test.cc b/db/db_range_del_test.cc index 7ecf3e4bd..bfa3b0bae 100644 --- a/db/db_range_del_test.cc +++ b/db/db_range_del_test.cc @@ -4,6 +4,7 @@ // (found in the LICENSE.Apache file in the root directory). #include "db/db_test_util.h" +#include "db/version_set.h" #include "port/stack_trace.h" #include "rocksdb/utilities/write_batch_with_index.h" #include "test_util/testutil.h" @@ -1756,6 +1757,937 @@ TEST_F(DBRangeDelTest, IteratorRefresh) { } } +void VerifyIteratorReachesEnd(InternalIterator* iter) { + ASSERT_TRUE(!iter->Valid() && iter->status().ok()); +} + +void VerifyIteratorReachesEnd(Iterator* iter) { + ASSERT_TRUE(!iter->Valid() && iter->status().ok()); +} + +TEST_F(DBRangeDelTest, IteratorReseek) { + // Range tombstone triggers reseek (seeking to a range tombstone end key) in + // merging iterator. 
Test set up: + // one memtable: range tombstone [0, 1) + // one immutable memtable: range tombstone [1, 2) + // one L0 file with range tombstone [2, 3) + // one L1 file with range tombstone [3, 4) + // Seek(0) should trigger cascading reseeks at all levels below memtable. + // Seek(1) should trigger cascading reseeks at all levels below immutable + // memtable. SeekToFirst and SeekToLast trigger no reseek. + Options options = CurrentOptions(); + options.compression = kNoCompression; + options.disable_auto_compactions = true; + + DestroyAndReopen(options); + // L1 + ASSERT_OK(db_->DeleteRange(WriteOptions(), db_->DefaultColumnFamily(), Key(3), + Key(4))); + ASSERT_OK(db_->Flush(FlushOptions())); + MoveFilesToLevel(1); + ASSERT_EQ(1, NumTableFilesAtLevel(1)); + // L0 + ASSERT_OK(db_->DeleteRange(WriteOptions(), db_->DefaultColumnFamily(), Key(2), + Key(3))); + ASSERT_OK(db_->Flush(FlushOptions())); + ASSERT_EQ(1, NumTableFilesAtLevel(0)); + // Immutable memtable + ASSERT_OK(db_->DeleteRange(WriteOptions(), db_->DefaultColumnFamily(), Key(1), + Key(2))); + ASSERT_OK(static_cast_with_check(db_)->TEST_SwitchMemtable()); + std::string value; + ASSERT_TRUE(dbfull()->GetProperty(db_->DefaultColumnFamily(), + "rocksdb.num-immutable-mem-table", &value)); + ASSERT_EQ(1, std::stoi(value)); + // live memtable + ASSERT_OK(db_->DeleteRange(WriteOptions(), db_->DefaultColumnFamily(), Key(0), + Key(1))); + // this memtable is still active + ASSERT_TRUE(dbfull()->GetProperty(db_->DefaultColumnFamily(), + "rocksdb.num-immutable-mem-table", &value)); + ASSERT_EQ(1, std::stoi(value)); + + auto iter = db_->NewIterator(ReadOptions()); + get_perf_context()->Reset(); + iter->Seek(Key(0)); + // Reseeked immutable memtable, L0 and L1 + ASSERT_EQ(get_perf_context()->internal_range_del_reseek_count, 3); + VerifyIteratorReachesEnd(iter); + get_perf_context()->Reset(); + iter->SeekForPrev(Key(1)); + // Reseeked L0 and L1 + ASSERT_EQ(get_perf_context()->internal_range_del_reseek_count, 2); + VerifyIteratorReachesEnd(iter); + get_perf_context()->Reset(); + iter->SeekToFirst(); + ASSERT_EQ(get_perf_context()->internal_range_del_reseek_count, 0); + VerifyIteratorReachesEnd(iter); + iter->SeekToLast(); + ASSERT_EQ(get_perf_context()->internal_range_del_reseek_count, 0); + VerifyIteratorReachesEnd(iter); + delete iter; +} + +TEST_F(DBRangeDelTest, ReseekDuringNextAndPrev) { + // Range tombstone triggers reseek during Next()/Prev() in merging iterator. + // Test set up: + // memtable has: [0, 1) [2, 3) + // L0 has: 2 + // L1 has: 1, 2, 3 + // Seek(0) will reseek to 1 for L0 and L1. Seek(1) will not trigger any + // reseek. Then Next() determines 2 is covered by [2, 3), it will try to + // reseek to 3 for L0 and L1. Similar story for Prev() and SeekForPrev() is + // tested. 
+ Options options = CurrentOptions(); + options.compression = kNoCompression; + options.disable_auto_compactions = true; + + DestroyAndReopen(options); + // L1 + ASSERT_OK(db_->Put(WriteOptions(), Key(1), "foo")); + ASSERT_OK(db_->Put(WriteOptions(), Key(2), "foo")); + ASSERT_OK(db_->Put(WriteOptions(), Key(3), "foo")); + ASSERT_OK(db_->Flush(FlushOptions())); + MoveFilesToLevel(1); + ASSERT_EQ(1, NumTableFilesAtLevel(1)); + + // L0 + ASSERT_OK(db_->Put(WriteOptions(), Key(2), "foo")); + ASSERT_OK(db_->Flush(FlushOptions())); + ASSERT_EQ(1, NumTableFilesAtLevel(0)); + + // Memtable + ASSERT_OK(db_->DeleteRange(WriteOptions(), db_->DefaultColumnFamily(), Key(0), + Key(1))); + ASSERT_OK(db_->DeleteRange(WriteOptions(), db_->DefaultColumnFamily(), Key(2), + Key(3))); + + auto iter = db_->NewIterator(ReadOptions()); + auto iter_test_forward = [&] { + ASSERT_TRUE(iter->Valid()); + ASSERT_EQ(iter->key().ToString(), Key(1)); + + get_perf_context()->Reset(); + iter->Next(); + ASSERT_TRUE(iter->Valid()); + ASSERT_EQ(iter->key().ToString(), Key(3)); + // Reseeked L0 and L1 + ASSERT_EQ(get_perf_context()->internal_range_del_reseek_count, 2); + + // Next to Prev + get_perf_context()->Reset(); + iter->Prev(); + ASSERT_TRUE(iter->Valid()); + ASSERT_EQ(iter->key().ToString(), Key(1)); + // Reseeked L0 and L1 + ASSERT_EQ(get_perf_context()->internal_range_del_reseek_count, 2); + + // Prev to Next + get_perf_context()->Reset(); + iter->Next(); + ASSERT_TRUE(iter->Valid()); + ASSERT_EQ(iter->key().ToString(), Key(3)); + // Reseeked L0 and L1 + ASSERT_EQ(get_perf_context()->internal_range_del_reseek_count, 2); + + iter->Next(); + VerifyIteratorReachesEnd(iter); + }; + + get_perf_context()->Reset(); + iter->Seek(Key(0)); + // Reseeked L0 and L1 + ASSERT_EQ(get_perf_context()->internal_range_del_reseek_count, 2); + iter_test_forward(); + get_perf_context()->Reset(); + iter->Seek(Key(1)); + ASSERT_EQ(get_perf_context()->internal_range_del_reseek_count, 0); + iter_test_forward(); + + get_perf_context()->Reset(); + iter->SeekForPrev(Key(2)); + // Reseeked L0 and L1 + ASSERT_EQ(get_perf_context()->internal_range_del_reseek_count, 2); + iter_test_forward(); + get_perf_context()->Reset(); + iter->SeekForPrev(Key(1)); + ASSERT_EQ(get_perf_context()->internal_range_del_reseek_count, 0); + iter_test_forward(); + + get_perf_context()->Reset(); + iter->SeekToFirst(); + ASSERT_EQ(get_perf_context()->internal_range_del_reseek_count, 0); + iter_test_forward(); + + iter->SeekToLast(); + iter->Prev(); + iter_test_forward(); + delete iter; +} + +TEST_F(DBRangeDelTest, TombstoneFromCurrentLevel) { + // Range tombstone triggers reseek when covering key from the same level. + // in merging iterator. Test set up: + // memtable has: [0, 1) + // L0 has: [2, 3), 2 + // L1 has: 1, 2, 3 + // Seek(0) will reseek to 1 for L0 and L1. + // Then Next() will reseek to 3 for L1 since 2 in L0 is covered by [2, 3) in + // L0. 
+ Options options = CurrentOptions(); + options.compression = kNoCompression; + options.disable_auto_compactions = true; + + DestroyAndReopen(options); + // L1 + ASSERT_OK(db_->Put(WriteOptions(), Key(1), "foo")); + ASSERT_OK(db_->Put(WriteOptions(), Key(2), "foo")); + ASSERT_OK(db_->Put(WriteOptions(), Key(3), "foo")); + ASSERT_OK(db_->Flush(FlushOptions())); + MoveFilesToLevel(1); + ASSERT_EQ(1, NumTableFilesAtLevel(1)); + + // L0 + ASSERT_OK(db_->Put(WriteOptions(), Key(2), "foo")); + ASSERT_OK(db_->DeleteRange(WriteOptions(), db_->DefaultColumnFamily(), Key(2), + Key(3))); + ASSERT_OK(db_->Flush(FlushOptions())); + ASSERT_EQ(1, NumTableFilesAtLevel(0)); + + // Memtable + ASSERT_OK(db_->DeleteRange(WriteOptions(), db_->DefaultColumnFamily(), Key(0), + Key(1))); + + auto iter = db_->NewIterator(ReadOptions()); + get_perf_context()->Reset(); + iter->Seek(Key(0)); + ASSERT_TRUE(iter->Valid()); + ASSERT_EQ(iter->key().ToString(), Key(1)); + ASSERT_EQ(get_perf_context()->internal_range_del_reseek_count, 2); + + get_perf_context()->Reset(); + iter->Next(); + ASSERT_TRUE(iter->Valid()); + ASSERT_EQ(iter->key().ToString(), Key(3)); + ASSERT_EQ(get_perf_context()->internal_range_del_reseek_count, 1); + + delete iter; +} + +TEST_F(DBRangeDelTest, TombstoneAcrossFileBoundary) { + // Verify that a range tombstone across file boundary covers keys from older + // levels. Test set up: + // L1_0: 1, 3, [2, 6) L1_1: 5, 7, [2, 6) ([2, 6) is from compaction with + // L1_0) L2 has: 5 + // Seek(1) and then Next() should move the L1 level iterator to + // L1_1. Check if 5 is returned after Next(). + Options options = CurrentOptions(); + options.compression = kNoCompression; + options.disable_auto_compactions = true; + options.target_file_size_base = 2 * 1024; + options.max_compaction_bytes = 2 * 1024; + + DestroyAndReopen(options); + + Random rnd(301); + // L2 + ASSERT_OK(db_->Put(WriteOptions(), Key(5), rnd.RandomString(1 << 10))); + ASSERT_OK(db_->Flush(FlushOptions())); + MoveFilesToLevel(2); + ASSERT_EQ(1, NumTableFilesAtLevel(2)); + + // L1_1 + ASSERT_OK(db_->Put(WriteOptions(), Key(5), rnd.RandomString(1 << 10))); + ASSERT_OK(db_->Put(WriteOptions(), Key(7), rnd.RandomString(1 << 10))); + ASSERT_OK(db_->Flush(FlushOptions())); + ASSERT_EQ(1, NumTableFilesAtLevel(0)); + + // L1_0 + ASSERT_OK(db_->Put(WriteOptions(), Key(1), rnd.RandomString(1 << 10))); + ASSERT_OK(db_->Put(WriteOptions(), Key(3), rnd.RandomString(1 << 10))); + // Prevent keys being compacted away + const Snapshot* snapshot = db_->GetSnapshot(); + ASSERT_OK(db_->DeleteRange(WriteOptions(), db_->DefaultColumnFamily(), Key(2), + Key(6))); + ASSERT_OK(db_->Flush(FlushOptions())); + ASSERT_EQ(2, NumTableFilesAtLevel(0)); + MoveFilesToLevel(1); + ASSERT_EQ(2, NumTableFilesAtLevel(1)); + + auto iter = db_->NewIterator(ReadOptions()); + get_perf_context()->Reset(); + iter->Seek(Key(1)); + ASSERT_TRUE(iter->Valid()); + ASSERT_EQ(iter->key().ToString(), Key(1)); + iter->Next(); + ASSERT_TRUE(iter->Valid()); + ASSERT_EQ(iter->key().ToString(), Key(7)); + // 1 reseek into L2 when key 5 in L2 is covered by [2, 6) from L1 + ASSERT_EQ(get_perf_context()->internal_range_del_reseek_count, 1); + + delete iter; + db_->ReleaseSnapshot(snapshot); +} + +TEST_F(DBRangeDelTest, NonOverlappingTombstonAtBoundary) { + // Verify that a range tombstone across file boundary covers keys from older + // levels. 
+ // Test set up: + // L1_0: 1, 3, [4, 7) L1_1: 6, 8, [4, 7) + // L2: 5 + // Note that [4, 7) is at end of L1_0 and not overlapping with any point key + // in L1_0. [4, 7) from L1_0 should cover 5 is sentinel works + Options options = CurrentOptions(); + options.compression = kNoCompression; + options.disable_auto_compactions = true; + options.target_file_size_base = 2 * 1024; + DestroyAndReopen(options); + + Random rnd(301); + // L2 + ASSERT_OK(db_->Put(WriteOptions(), Key(5), rnd.RandomString(4 << 10))); + ASSERT_OK(db_->Flush(FlushOptions())); + MoveFilesToLevel(2); + ASSERT_EQ(1, NumTableFilesAtLevel(2)); + + // L1_1 + ASSERT_OK(db_->Put(WriteOptions(), Key(6), rnd.RandomString(4 << 10))); + ASSERT_OK(db_->Put(WriteOptions(), Key(8), rnd.RandomString(4 << 10))); + ASSERT_OK(db_->Flush(FlushOptions())); + ASSERT_EQ(1, NumTableFilesAtLevel(0)); + + // L1_0 + ASSERT_OK(db_->Put(WriteOptions(), Key(1), rnd.RandomString(4 << 10))); + ASSERT_OK(db_->Put(WriteOptions(), Key(3), rnd.RandomString(4 << 10))); + // Prevent keys being compacted away + ASSERT_OK(db_->DeleteRange(WriteOptions(), db_->DefaultColumnFamily(), Key(4), + Key(7))); + ASSERT_OK(db_->Flush(FlushOptions())); + ASSERT_EQ(2, NumTableFilesAtLevel(0)); + MoveFilesToLevel(1); + ASSERT_EQ(2, NumTableFilesAtLevel(1)); + + auto iter = db_->NewIterator(ReadOptions()); + iter->Seek(Key(3)); + ASSERT_TRUE(iter->Valid()); + ASSERT_EQ(iter->key(), Key(3)); + get_perf_context()->Reset(); + iter->Next(); + ASSERT_TRUE(iter->Valid()); + ASSERT_EQ(iter->key().ToString(), Key(8)); + // 1 reseek into L1 since 5 from L2 is covered by [4, 7) from L1 + ASSERT_EQ(get_perf_context()->internal_range_del_reseek_count, 1); + for (auto& k : {4, 5, 6}) { + get_perf_context()->Reset(); + iter->Seek(Key(k)); + ASSERT_TRUE(iter->Valid()); + ASSERT_EQ(iter->key().ToString(), Key(8)); + // 1 reseek into L1 + ASSERT_EQ(get_perf_context()->internal_range_del_reseek_count, 1); + } + delete iter; +} + +TEST_F(DBRangeDelTest, OlderLevelHasNewerData) { + // L1_0: 1, 3, [2, 7) L1_1: 5, 6 at a newer sequence number than [2, 7) + // Compact L1_1 to L2. Seek(3) should not skip 5 or 6. 
+ Options options = CurrentOptions(); + options.compression = kNoCompression; + options.disable_auto_compactions = true; + options.target_file_size_base = 3 * 1024; + DestroyAndReopen(options); + + Random rnd(301); + // L1_0 + ASSERT_OK(db_->Put(WriteOptions(), Key(1), rnd.RandomString(4 << 10))); + ASSERT_OK(db_->Put(WriteOptions(), Key(3), rnd.RandomString(4 << 10))); + const Snapshot* snapshot = db_->GetSnapshot(); + ASSERT_OK(db_->DeleteRange(WriteOptions(), db_->DefaultColumnFamily(), Key(2), + Key(7))); + ASSERT_OK(db_->Flush(FlushOptions())); + MoveFilesToLevel(1); + ASSERT_EQ(1, NumTableFilesAtLevel(1)); + + // L1_1 + ASSERT_OK(db_->Put(WriteOptions(), Key(5), rnd.RandomString(4 << 10))); + ASSERT_OK(db_->Put(WriteOptions(), Key(6), rnd.RandomString(4 << 10))); + ASSERT_OK(db_->Flush(FlushOptions())); + ASSERT_EQ(1, NumTableFilesAtLevel(0)); + MoveFilesToLevel(1); + ASSERT_EQ(2, NumTableFilesAtLevel(1)); + + auto key = Key(6); + Slice begin(key); + EXPECT_OK(dbfull()->TEST_CompactRange(1, &begin, nullptr)); + ASSERT_EQ(1, NumTableFilesAtLevel(1)); + ASSERT_EQ(1, NumTableFilesAtLevel(2)); + + auto iter = db_->NewIterator(ReadOptions()); + iter->Seek(Key(3)); + ASSERT_TRUE(iter->Valid()); + ASSERT_EQ(iter->key().ToString(), Key(5)); + iter->Next(); + ASSERT_TRUE(iter->Valid()); + ASSERT_EQ(iter->key().ToString(), Key(6)); + delete iter; + db_->ReleaseSnapshot(snapshot); +} + +TEST_F(DBRangeDelTest, LevelBoundaryDefinedByTombstone) { + // L1 has: 1, 2, [4, 5) + // L2 has: 4 + // Seek(3), which is over all points keys in L1, check whether + // sentinel key from L1 works in this case. + Options options = CurrentOptions(); + options.compression = kNoCompression; + options.disable_auto_compactions = true; + options.target_file_size_base = 3 * 1024; + DestroyAndReopen(options); + Random rnd(301); + // L2 + ASSERT_OK(db_->Put(WriteOptions(), Key(4), "foo")); + ASSERT_OK(db_->Flush(FlushOptions())); + const Snapshot* snapshot = db_->GetSnapshot(); + MoveFilesToLevel(2); + ASSERT_EQ(1, NumTableFilesAtLevel(2)); + + // L1_0 + ASSERT_OK(db_->Put(WriteOptions(), Key(1), rnd.RandomString(4 << 10))); + ASSERT_OK(db_->Put(WriteOptions(), Key(2), rnd.RandomString(4 << 10))); + ASSERT_OK(db_->DeleteRange(WriteOptions(), db_->DefaultColumnFamily(), Key(4), + Key(5))); + ASSERT_OK(db_->Flush(FlushOptions())); + MoveFilesToLevel(1); + ASSERT_EQ(1, NumTableFilesAtLevel(1)); + ASSERT_EQ(1, NumTableFilesAtLevel(2)); + + auto iter = db_->NewIterator(ReadOptions()); + iter->Seek(Key(3)); + ASSERT_TRUE(!iter->Valid()); + ASSERT_OK(iter->status()); + + get_perf_context()->Reset(); + iter->SeekForPrev(Key(5)); + ASSERT_TRUE(iter->Valid()); + ASSERT_EQ(iter->key(), Key(2)); + db_->ReleaseSnapshot(snapshot); + delete iter; +} + +TEST_F(DBRangeDelTest, TombstoneOnlyFile) { + // L1_0: 1, 2, L1_1: [3, 5) + // L2: 3 + // Seek(2) then Next() should advance L1 iterator into L1_1. + // If sentinel works with tombstone only file, it should cover the key in L2. + // Similar story for SeekForPrev(4). 
+ Options options = CurrentOptions(); + options.compression = kNoCompression; + options.disable_auto_compactions = true; + options.target_file_size_base = 3 * 1024; + + DestroyAndReopen(options); + Random rnd(301); + // L2 + ASSERT_OK(db_->Put(WriteOptions(), Key(3), "foo")); + ASSERT_OK(db_->Flush(FlushOptions())); + MoveFilesToLevel(2); + ASSERT_EQ(1, NumTableFilesAtLevel(2)); + + // L1_0 + ASSERT_OK(db_->Put(WriteOptions(), Key(1), rnd.RandomString(4 << 10))); + ASSERT_OK(db_->Put(WriteOptions(), Key(2), rnd.RandomString(4 << 10))); + ASSERT_OK(db_->Flush(FlushOptions())); + MoveFilesToLevel(1); + ASSERT_EQ(1, NumTableFilesAtLevel(1)); + ASSERT_EQ(1, NumTableFilesAtLevel(2)); + + // L1_1 + ASSERT_OK(db_->DeleteRange(WriteOptions(), db_->DefaultColumnFamily(), Key(3), + Key(5))); + ASSERT_OK(db_->Flush(FlushOptions())); + MoveFilesToLevel(1); + ASSERT_EQ(2, NumTableFilesAtLevel(1)); + ASSERT_EQ(1, NumTableFilesAtLevel(2)); + + auto iter = db_->NewIterator(ReadOptions()); + iter->Seek(Key(2)); + ASSERT_TRUE(iter->Valid()); + ASSERT_EQ(iter->key(), Key(2)); + iter->Next(); + VerifyIteratorReachesEnd(iter); + iter->SeekForPrev(Key(4)); + ASSERT_TRUE(iter->Valid()); + ASSERT_EQ(iter->key(), Key(2)); + iter->Next(); + VerifyIteratorReachesEnd(iter); + delete iter; +} + +void VerifyIteratorKey(InternalIterator* iter, + const std::vector& expected_keys, + bool forward = true) { + for (auto& key : expected_keys) { + ASSERT_TRUE(iter->Valid()); + ASSERT_EQ(iter->user_key(), key); + if (forward) { + iter->Next(); + } else { + iter->Prev(); + } + } +} + +TEST_F(DBRangeDelTest, TombstoneOnlyLevel) { + // L1 [3, 5) + // L2 has: 3, 4 + // Any kind of iterator seek should skip 3 and 4 in L2. + // L1 level iterator should produce sentinel key. + Options options = CurrentOptions(); + options.compression = kNoCompression; + options.disable_auto_compactions = true; + options.target_file_size_base = 3 * 1024; + + DestroyAndReopen(options); + // L2 + ASSERT_OK(db_->Put(WriteOptions(), Key(3), "foo")); + ASSERT_OK(db_->Put(WriteOptions(), Key(4), "bar")); + ASSERT_OK(db_->Flush(FlushOptions())); + MoveFilesToLevel(2); + ASSERT_EQ(1, NumTableFilesAtLevel(2)); + + // L1 + ASSERT_OK(db_->DeleteRange(WriteOptions(), db_->DefaultColumnFamily(), Key(3), + Key(5))); + ASSERT_OK(db_->Flush(FlushOptions())); + MoveFilesToLevel(1); + ASSERT_EQ(1, NumTableFilesAtLevel(1)); + + auto iter = db_->NewIterator(ReadOptions()); + get_perf_context()->Reset(); + uint64_t expected_reseek = 0; + for (auto i = 0; i < 7; ++i) { + iter->Seek(Key(i)); + VerifyIteratorReachesEnd(iter); + if (i < 5) { + ++expected_reseek; + } + ASSERT_EQ(get_perf_context()->internal_range_del_reseek_count, + expected_reseek); + iter->SeekForPrev(Key(i)); + VerifyIteratorReachesEnd(iter); + if (i > 2) { + ++expected_reseek; + } + ASSERT_EQ(get_perf_context()->internal_range_del_reseek_count, + expected_reseek); + iter->SeekToFirst(); + VerifyIteratorReachesEnd(iter); + ASSERT_EQ(get_perf_context()->internal_range_del_reseek_count, + ++expected_reseek); + iter->SeekToLast(); + VerifyIteratorReachesEnd(iter); + ASSERT_EQ(get_perf_context()->internal_range_del_reseek_count, + ++expected_reseek); + } + delete iter; + + // Check L1 LevelIterator behavior + ColumnFamilyData* cfd = + static_cast_with_check(db_->DefaultColumnFamily()) + ->cfd(); + SuperVersion* sv = cfd->GetSuperVersion(); + Arena arena; + ReadOptions read_options; + MergeIteratorBuilder merge_iter_builder(&cfd->internal_comparator(), &arena, + false /* prefix seek */); + InternalIterator* 
level_iter = sv->current->TEST_GetLevelIterator( + read_options, &merge_iter_builder, 1 /* level */, true); + // This is needed to make LevelIterator range tombstone aware + merge_iter_builder.AddIterator(level_iter); + auto miter = merge_iter_builder.Finish(); + auto k = Key(3); + IterKey target; + target.SetInternalKey(k, kMaxSequenceNumber, kValueTypeForSeek); + level_iter->Seek(target.GetInternalKey()); + // sentinel key (file boundary as a fake key) + VerifyIteratorKey(level_iter, {Key(5)}); + VerifyIteratorReachesEnd(level_iter); + + k = Key(5); + target.SetInternalKey(k, 0, kValueTypeForSeekForPrev); + level_iter->SeekForPrev(target.GetInternalKey()); + VerifyIteratorKey(level_iter, {Key(3)}, false); + VerifyIteratorReachesEnd(level_iter); + + level_iter->SeekToFirst(); + VerifyIteratorKey(level_iter, {Key(5)}); + VerifyIteratorReachesEnd(level_iter); + + level_iter->SeekToLast(); + VerifyIteratorKey(level_iter, {Key(3)}, false); + VerifyIteratorReachesEnd(level_iter); + + miter->~InternalIterator(); +} + +TEST_F(DBRangeDelTest, TombstoneOnlyWithOlderVisibleKey) { + // L1: [3, 5) + // L2: 2, 4, 5 + // 2 and 5 should be visible + Options options = CurrentOptions(); + options.compression = kNoCompression; + options.disable_auto_compactions = true; + options.target_file_size_base = 3 * 1024; + + DestroyAndReopen(options); + // L2 + ASSERT_OK(db_->Put(WriteOptions(), Key(2), "foo")); + ASSERT_OK(db_->Put(WriteOptions(), Key(4), "bar")); + ASSERT_OK(db_->Put(WriteOptions(), Key(5), "foobar")); + ASSERT_OK(db_->Flush(FlushOptions())); + MoveFilesToLevel(2); + ASSERT_EQ(1, NumTableFilesAtLevel(2)); + + // l1 + ASSERT_OK(db_->DeleteRange(WriteOptions(), db_->DefaultColumnFamily(), Key(3), + Key(5))); + ASSERT_OK(db_->Flush(FlushOptions())); + MoveFilesToLevel(1); + ASSERT_EQ(1, NumTableFilesAtLevel(1)); + + auto iter = db_->NewIterator(ReadOptions()); + auto iter_test_backward = [&] { + ASSERT_TRUE(iter->Valid()); + ASSERT_EQ(iter->key(), Key(5)); + iter->Prev(); + ASSERT_TRUE(iter->Valid()); + ASSERT_EQ(iter->key(), Key(2)); + iter->Prev(); + VerifyIteratorReachesEnd(iter); + }; + auto iter_test_forward = [&] { + ASSERT_TRUE(iter->Valid()); + ASSERT_EQ(iter->key(), Key(2)); + iter->Next(); + ASSERT_TRUE(iter->Valid()); + ASSERT_EQ(iter->key(), Key(5)); + iter->Next(); + VerifyIteratorReachesEnd(iter); + }; + iter->Seek(Key(4)); + iter_test_backward(); + iter->SeekForPrev(Key(4)); + iter->Next(); + iter_test_backward(); + + iter->Seek(Key(4)); + iter->Prev(); + iter_test_forward(); + iter->SeekForPrev(Key(4)); + iter_test_forward(); + + iter->SeekToFirst(); + iter_test_forward(); + iter->SeekToLast(); + iter_test_backward(); + + delete iter; +} + +TEST_F(DBRangeDelTest, TombstoneSentinelDirectionChange) { + // L1: 7 + // L2: [4, 6) + // L3: 4 + // Seek(5) will have sentinel key 6 at the top of minHeap in merging iterator. + // then do a prev, how would sentinel work? + // Redo the test after Put(5) into L1 so that there is a visible key in range + // [4, 6). 
+ Options options = CurrentOptions(); + options.compression = kNoCompression; + options.disable_auto_compactions = true; + options.target_file_size_base = 3 * 1024; + + DestroyAndReopen(options); + // L3 + ASSERT_OK(db_->Put(WriteOptions(), Key(4), "bar")); + ASSERT_OK(db_->Flush(FlushOptions())); + MoveFilesToLevel(3); + ASSERT_EQ(1, NumTableFilesAtLevel(3)); + // L2 + ASSERT_OK(db_->DeleteRange(WriteOptions(), db_->DefaultColumnFamily(), Key(4), + Key(6))); + ASSERT_OK(db_->Flush(FlushOptions())); + MoveFilesToLevel(2); + ASSERT_EQ(1, NumTableFilesAtLevel(2)); + + // L1 + ASSERT_OK(db_->Put(WriteOptions(), Key(7), "foobar")); + ASSERT_OK(db_->Flush(FlushOptions())); + MoveFilesToLevel(1); + ASSERT_EQ(1, NumTableFilesAtLevel(1)); + + auto iter = db_->NewIterator(ReadOptions()); + iter->Seek(Key(5)); + ASSERT_TRUE(iter->Valid()); + ASSERT_EQ(iter->key(), Key(7)); + iter->Prev(); + ASSERT_TRUE(!iter->Valid() && iter->status().ok()); + delete iter; + + ASSERT_OK(db_->Put(WriteOptions(), Key(5), "foobar")); + ASSERT_OK(db_->Flush(FlushOptions())); + MoveFilesToLevel(1); + ASSERT_EQ(2, NumTableFilesAtLevel(1)); + + iter = db_->NewIterator(ReadOptions()); + iter->Seek(Key(5)); + ASSERT_TRUE(iter->Valid()); + ASSERT_EQ(iter->key(), Key(5)); + iter->Prev(); + ASSERT_TRUE(!iter->Valid() && iter->status().ok()); + delete iter; +} + +// Right sentinel tested in many test cases above +TEST_F(DBRangeDelTest, LeftSentinelKeyTest) { + // L1_0: 0, 1 L1_1: [2, 3), 5 + // L2: 2 + // SeekForPrev(4) should give 1 due to sentinel key keeping [2, 3) alive. + Options options = CurrentOptions(); + options.compression = kNoCompression; + options.disable_auto_compactions = true; + options.target_file_size_base = 3 * 1024; + options.max_compaction_bytes = 1024; + + DestroyAndReopen(options); + // L2 + ASSERT_OK(db_->Put(WriteOptions(), Key(2), "foo")); + ASSERT_OK(db_->Flush(FlushOptions())); + MoveFilesToLevel(2); + ASSERT_EQ(1, NumTableFilesAtLevel(2)); + + // L1_0 + Random rnd(301); + ASSERT_OK(db_->Put(WriteOptions(), Key(0), rnd.RandomString(4 << 10))); + ASSERT_OK(db_->Put(WriteOptions(), Key(1), rnd.RandomString(4 << 10))); + ASSERT_OK(db_->Flush(FlushOptions())); + MoveFilesToLevel(1); + ASSERT_EQ(1, NumTableFilesAtLevel(1)); + + // L1_1 + ASSERT_OK(db_->Put(WriteOptions(), Key(5), "bar")); + ASSERT_OK(db_->DeleteRange(WriteOptions(), db_->DefaultColumnFamily(), Key(2), + Key(3))); + ASSERT_OK(db_->Flush(FlushOptions())); + MoveFilesToLevel(1); + ASSERT_EQ(2, NumTableFilesAtLevel(1)); + + auto iter = db_->NewIterator(ReadOptions()); + iter->SeekForPrev(Key(4)); + ASSERT_TRUE(iter->Valid()); + ASSERT_EQ(iter->key(), Key(1)); + iter->Prev(); + ASSERT_TRUE(iter->Valid()); + ASSERT_EQ(iter->key(), Key(0)); + iter->Prev(); + ASSERT_TRUE(!iter->Valid()); + ASSERT_OK(iter->status()); + delete iter; +} + +TEST_F(DBRangeDelTest, LeftSentinelKeyTestWithNewerKey) { + // L1_0: 1, 2 newer than L1_1, L1_1: [2, 4), 5 + // L2: 3 + // SeekForPrev(4) then Prev() should give 2 and then 1. 
+ Options options = CurrentOptions(); + options.compression = kNoCompression; + options.disable_auto_compactions = true; + options.target_file_size_base = 3 * 1024; + options.max_compaction_bytes = 1024; + + DestroyAndReopen(options); + // L2 + ASSERT_OK(db_->Put(WriteOptions(), Key(3), "foo")); + ASSERT_OK(db_->Flush(FlushOptions())); + MoveFilesToLevel(2); + ASSERT_EQ(1, NumTableFilesAtLevel(2)); + + // L1_1 + ASSERT_OK(db_->Put(WriteOptions(), Key(5), "bar")); + ASSERT_OK(db_->DeleteRange(WriteOptions(), db_->DefaultColumnFamily(), Key(2), + Key(4))); + ASSERT_OK(db_->Flush(FlushOptions())); + MoveFilesToLevel(1); + ASSERT_EQ(1, NumTableFilesAtLevel(1)); + + // L1_0 + Random rnd(301); + ASSERT_OK(db_->Put(WriteOptions(), Key(1), rnd.RandomString(4 << 10))); + ASSERT_OK(db_->Put(WriteOptions(), Key(2), rnd.RandomString(4 << 10))); + // Used to verify sequence number of iterator key later. + auto seq = dbfull()->TEST_GetLastVisibleSequence(); + ASSERT_OK(db_->Flush(FlushOptions())); + MoveFilesToLevel(1); + ASSERT_EQ(2, NumTableFilesAtLevel(1)); + + Arena arena; + InternalKeyComparator icmp(options.comparator); + ReadOptions read_options; + ScopedArenaIterator iter; + iter.set( + dbfull()->NewInternalIterator(read_options, &arena, kMaxSequenceNumber)); + + auto k = Key(4); + IterKey target; + target.SetInternalKey(k, 0 /* sequence_number */, kValueTypeForSeekForPrev); + iter->SeekForPrev(target.GetInternalKey()); + ASSERT_TRUE(iter->Valid()); + ASSERT_EQ(iter->user_key(), Key(2)); + SequenceNumber actual_seq; + ValueType type; + UnPackSequenceAndType(ExtractInternalKeyFooter(iter->key()), &actual_seq, + &type); + ASSERT_EQ(seq, actual_seq); + // might as well check type + ASSERT_EQ(type, kTypeValue); + + iter->Prev(); + ASSERT_TRUE(iter->Valid()); + ASSERT_EQ(iter->user_key(), Key(1)); + iter->Prev(); + ASSERT_TRUE(!iter->Valid()); + ASSERT_OK(iter->status()); +} + +TEST_F(DBRangeDelTest, SentinelKeyCommonCaseTest) { + // L1 has 3 files + // L1_0: 1, 2 L1_1: [3, 4) 5, 6, [7, 8) L1_2: 9 + // Check iterator operations on LevelIterator. 
+ Options options = CurrentOptions(); + options.compression = kNoCompression; + options.disable_auto_compactions = true; + options.target_file_size_base = 3 * 1024; + + DestroyAndReopen(options); + Random rnd(301); + // L1_0 + ASSERT_OK(db_->Put(WriteOptions(), Key(1), rnd.RandomString(4 << 10))); + ASSERT_OK(db_->Put(WriteOptions(), Key(2), rnd.RandomString(4 << 10))); + ASSERT_OK(db_->Flush(FlushOptions())); + MoveFilesToLevel(1); + ASSERT_EQ(1, NumTableFilesAtLevel(1)); + + // L1_1 + ASSERT_OK(db_->DeleteRange(WriteOptions(), db_->DefaultColumnFamily(), Key(3), + Key(4))); + ASSERT_OK(db_->Put(WriteOptions(), Key(5), rnd.RandomString(4 << 10))); + ASSERT_OK(db_->Put(WriteOptions(), Key(6), rnd.RandomString(4 << 10))); + ASSERT_OK(db_->DeleteRange(WriteOptions(), db_->DefaultColumnFamily(), Key(7), + Key(8))); + ASSERT_OK(db_->Flush(FlushOptions())); + MoveFilesToLevel(1); + ASSERT_EQ(2, NumTableFilesAtLevel(1)); + + // L1_2 + ASSERT_OK(db_->Put(WriteOptions(), Key(9), rnd.RandomString(4 << 10))); + ASSERT_OK(db_->Flush(FlushOptions())); + MoveFilesToLevel(1); + ASSERT_EQ(3, NumTableFilesAtLevel(1)); + + ColumnFamilyData* cfd = + static_cast_with_check(db_->DefaultColumnFamily()) + ->cfd(); + SuperVersion* sv = cfd->GetSuperVersion(); + Arena arena; + ReadOptions read_options; + MergeIteratorBuilder merge_iter_builder(&cfd->internal_comparator(), &arena, + false /* prefix seek */); + InternalIterator* level_iter = sv->current->TEST_GetLevelIterator( + read_options, &merge_iter_builder, 1 /* level */, true); + // This is needed to make LevelIterator range tombstone aware + auto miter = merge_iter_builder.Finish(); + auto k = Key(7); + IterKey target; + target.SetInternalKey(k, kMaxSequenceNumber, kValueTypeForSeek); + level_iter->Seek(target.GetInternalKey()); + // The last Key(9) is a sentinel key. + VerifyIteratorKey(level_iter, {Key(8), Key(9), Key(9)}); + ASSERT_TRUE(!level_iter->Valid() && level_iter->status().ok()); + + k = Key(6); + target.SetInternalKey(k, kMaxSequenceNumber, kValueTypeForSeek); + level_iter->Seek(target.GetInternalKey()); + VerifyIteratorKey(level_iter, {Key(6), Key(8), Key(9), Key(9)}); + ASSERT_TRUE(!level_iter->Valid() && level_iter->status().ok()); + + k = Key(4); + target.SetInternalKey(k, 0, kValueTypeForSeekForPrev); + level_iter->SeekForPrev(target.GetInternalKey()); + VerifyIteratorKey(level_iter, {Key(3), Key(2), Key(1), Key(1)}, false); + ASSERT_TRUE(!level_iter->Valid() && level_iter->status().ok()); + + k = Key(5); + target.SetInternalKey(k, 0, kValueTypeForSeekForPrev); + level_iter->SeekForPrev(target.GetInternalKey()); + VerifyIteratorKey(level_iter, {Key(5), Key(3), Key(2), Key(1), Key(1)}, + false); + + level_iter->SeekToFirst(); + VerifyIteratorKey(level_iter, {Key(1), Key(2), Key(2), Key(5), Key(6), Key(8), + Key(9), Key(9)}); + ASSERT_TRUE(!level_iter->Valid() && level_iter->status().ok()); + + level_iter->SeekToLast(); + VerifyIteratorKey( + level_iter, + {Key(9), Key(9), Key(6), Key(5), Key(3), Key(2), Key(1), Key(1)}, false); + ASSERT_TRUE(!level_iter->Valid() && level_iter->status().ok()); + + miter->~InternalIterator(); +} + +TEST_F(DBRangeDelTest, PrefixSentinelKey) { + // L1: ['aaaa', 'aaad'), 'bbbb' + // L2: 'aaac', 'aaae' + // Prefix extracts first 3 chars + // Seek('aaab') should give 'aaae' as first key. + // This is to test a previous bug where prefix seek sees there is no prefix in + // the SST file, and will just set file iter to null in LevelIterator and may + // just skip to the next SST file. 
But in this case, we should keep the file's + // tombstone alive. + Options options = CurrentOptions(); + options.compression = kNoCompression; + options.disable_auto_compactions = true; + options.prefix_extractor.reset(NewFixedPrefixTransform(3)); + BlockBasedTableOptions table_options; + table_options.filter_policy.reset(NewBloomFilterPolicy(10, false)); + table_options.whole_key_filtering = false; + options.table_factory.reset(NewBlockBasedTableFactory(table_options)); + DestroyAndReopen(options); + Random rnd(301); + + // L2: + ASSERT_OK(db_->Put(WriteOptions(), "aaac", rnd.RandomString(10))); + ASSERT_OK(db_->Put(WriteOptions(), "aaae", rnd.RandomString(10))); + ASSERT_OK(db_->Flush(FlushOptions())); + MoveFilesToLevel(2); + ASSERT_EQ(1, NumTableFilesAtLevel(2)); + + // L1 + ASSERT_OK(db_->DeleteRange(WriteOptions(), db_->DefaultColumnFamily(), "aaaa", + "aaad")); + ASSERT_OK(db_->Put(WriteOptions(), "bbbb", rnd.RandomString(10))); + ASSERT_OK(db_->Flush(FlushOptions())); + MoveFilesToLevel(1); + ASSERT_EQ(1, NumTableFilesAtLevel(1)); + + auto iter = db_->NewIterator(ReadOptions()); + iter->Seek("aaab"); + ASSERT_TRUE(iter->Valid()); + ASSERT_EQ(iter->key(), "aaae"); + delete iter; +} + #endif // ROCKSDB_LITE } // namespace ROCKSDB_NAMESPACE diff --git a/db/db_test_util.cc b/db/db_test_util.cc index 2c44e1e09..f94acec2f 100644 --- a/db/db_test_util.cc +++ b/db/db_test_util.cc @@ -967,15 +967,13 @@ std::string DBTestBase::AllEntriesFor(const Slice& user_key, int cf) { Arena arena; auto options = CurrentOptions(); InternalKeyComparator icmp(options.comparator); - ReadRangeDelAggregator range_del_agg(&icmp, - kMaxSequenceNumber /* upper_bound */); ReadOptions read_options; ScopedArenaIterator iter; if (cf == 0) { - iter.set(dbfull()->NewInternalIterator(read_options, &arena, &range_del_agg, + iter.set(dbfull()->NewInternalIterator(read_options, &arena, kMaxSequenceNumber)); } else { - iter.set(dbfull()->NewInternalIterator(read_options, &arena, &range_del_agg, + iter.set(dbfull()->NewInternalIterator(read_options, &arena, kMaxSequenceNumber, handles_[cf])); } InternalKey target(user_key, kMaxSequenceNumber, kTypeValue); @@ -1431,17 +1429,13 @@ void DBTestBase::validateNumberOfEntries(int numValues, int cf) { Arena arena; auto options = CurrentOptions(); InternalKeyComparator icmp(options.comparator); - ReadRangeDelAggregator range_del_agg(&icmp, - kMaxSequenceNumber /* upper_bound */); - // This should be defined after range_del_agg so that it destructs the - // assigned iterator before it range_del_agg is already destructed. 
ReadOptions read_options; ScopedArenaIterator iter; if (cf != 0) { - iter.set(dbfull()->NewInternalIterator(read_options, &arena, &range_del_agg, + iter.set(dbfull()->NewInternalIterator(read_options, &arena, kMaxSequenceNumber, handles_[cf])); } else { - iter.set(dbfull()->NewInternalIterator(read_options, &arena, &range_del_agg, + iter.set(dbfull()->NewInternalIterator(read_options, &arena, kMaxSequenceNumber)); } iter->SeekToFirst(); @@ -1646,11 +1640,9 @@ void DBTestBase::VerifyDBInternal( std::vector> true_data) { Arena arena; InternalKeyComparator icmp(last_options_.comparator); - ReadRangeDelAggregator range_del_agg(&icmp, - kMaxSequenceNumber /* upper_bound */); ReadOptions read_options; - auto iter = dbfull()->NewInternalIterator(read_options, &arena, - &range_del_agg, kMaxSequenceNumber); + auto iter = + dbfull()->NewInternalIterator(read_options, &arena, kMaxSequenceNumber); iter->SeekToFirst(); for (auto p : true_data) { ASSERT_TRUE(iter->Valid()); diff --git a/db/dbformat.h b/db/dbformat.h index 51986d6af..aa3d07438 100644 --- a/db/dbformat.h +++ b/db/dbformat.h @@ -139,7 +139,8 @@ inline size_t InternalKeyEncodingLength(const ParsedInternalKey& key) { // Pack a sequence number and a ValueType into a uint64_t inline uint64_t PackSequenceAndType(uint64_t seq, ValueType t) { assert(seq <= kMaxSequenceNumber); - assert(IsExtendedValueType(t)); + // kTypeMaxValid is used in TruncatedRangeDelIterator, see its constructor. + assert(IsExtendedValueType(t) || t == kTypeMaxValid); return (seq << 8) | t; } diff --git a/db/memtable_list.cc b/db/memtable_list.cc index cb076b595..2a4879b8d 100644 --- a/db/memtable_list.cc +++ b/db/memtable_list.cc @@ -210,6 +210,30 @@ Status MemTableListVersion::AddRangeTombstoneIterators( return Status::OK(); } +Status MemTableListVersion::AddRangeTombstoneIterators( + const ReadOptions& read_opts, Arena* /*arena*/, + MergeIteratorBuilder& builder) { + // Except for snapshot read, using kMaxSequenceNumber is OK because these + // are immutable memtables. + SequenceNumber read_seq = read_opts.snapshot != nullptr + ? 
read_opts.snapshot->GetSequenceNumber() + : kMaxSequenceNumber; + for (auto& m : memlist_) { + auto range_del_iter = m->NewRangeTombstoneIterator( + read_opts, read_seq, true /* immutale_memtable */); + if (range_del_iter == nullptr || range_del_iter->empty()) { + delete range_del_iter; + builder.AddRangeTombstoneIterator(nullptr); + } else { + builder.AddRangeTombstoneIterator(new TruncatedRangeDelIterator( + std::unique_ptr(range_del_iter), + &m->GetInternalKeyComparator(), nullptr /* smallest */, + nullptr /* largest */)); + } + } + return Status::OK(); +} + void MemTableListVersion::AddIterators( const ReadOptions& options, std::vector* iterator_list, Arena* arena) { diff --git a/db/memtable_list.h b/db/memtable_list.h index 18de614e1..a919f6ad5 100644 --- a/db/memtable_list.h +++ b/db/memtable_list.h @@ -111,6 +111,9 @@ class MemTableListVersion { Status AddRangeTombstoneIterators(const ReadOptions& read_opts, Arena* arena, RangeDelAggregator* range_del_agg); + Status AddRangeTombstoneIterators(const ReadOptions& read_opts, Arena* arena, + MergeIteratorBuilder& builder); + void AddIterators(const ReadOptions& options, std::vector* iterator_list, Arena* arena); diff --git a/db/range_del_aggregator.cc b/db/range_del_aggregator.cc index c175b9cee..5a2fff0c2 100644 --- a/db/range_del_aggregator.cc +++ b/db/range_del_aggregator.cc @@ -37,7 +37,6 @@ TruncatedRangeDelIterator::TruncatedRangeDelIterator( false /* log_err_key */); // TODO pik_status.PermitUncheckedError(); assert(pik_status.ok()); - smallest_ = &parsed_smallest; } if (largest != nullptr) { @@ -69,12 +68,16 @@ TruncatedRangeDelIterator::TruncatedRangeDelIterator( // the truncated end key can cover the largest key in this sstable, reduce // its sequence number by 1. parsed_largest.sequence -= 1; + // This line is not needed for correctness, but it ensures that the + // truncated end key is not covering keys from the next SST file. + parsed_largest.type = kValueTypeForSeek; } largest_ = &parsed_largest; } } bool TruncatedRangeDelIterator::Valid() const { + assert(iter_ != nullptr); return iter_->Valid() && (smallest_ == nullptr || icmp_->Compare(*smallest_, iter_->parsed_end_key()) < 0) && diff --git a/db/range_del_aggregator.h b/db/range_del_aggregator.h index 64f549590..43e34565e 100644 --- a/db/range_del_aggregator.h +++ b/db/range_del_aggregator.h @@ -46,6 +46,7 @@ class TruncatedRangeDelIterator { // Seeks to the tombstone with the highest visible sequence number that covers // target (a user key). If no such tombstone exists, the position will be at // the earliest tombstone that ends after target. + // REQUIRES: target is a user key. 
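To make the Seek() contract documented in the comment above concrete, here is a small standalone toy (not the RocksDB implementation) that models the positioning rule over non-overlapping tombstone fragments sorted by start key, assuming one sequence number per fragment and ignoring truncation bounds and snapshot visibility:

```
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Toy fragment: covers user keys in [start, end) at sequence number `seq`.
struct ToyTombstone {
  std::string start, end;
  uint64_t seq;
};

// Position at the fragment covering `target`, or at the earliest fragment
// that ends after `target`; returns fragments.size() if no such fragment.
size_t ToySeek(const std::vector<ToyTombstone>& fragments,
               const std::string& target) {
  for (size_t i = 0; i < fragments.size(); ++i) {
    if (fragments[i].end > target) {
      return i;
    }
  }
  return fragments.size();
}

int main() {
  std::vector<ToyTombstone> ts = {{"c", "e", 10}, {"g", "j", 8}};
  assert(ToySeek(ts, "d") == 0);  // covered by [c, e)
  assert(ToySeek(ts, "f") == 1);  // not covered; earliest fragment ending after "f"
  assert(ToySeek(ts, "k") == 2);  // past all fragments: invalid position
  return 0;
}
```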
void Seek(const Slice& target); // Seeks to the tombstone with the highest visible sequence number that covers diff --git a/db/range_del_aggregator_test.cc b/db/range_del_aggregator_test.cc index c1d739753..99f8bc8f0 100644 --- a/db/range_del_aggregator_test.cc +++ b/db/range_del_aggregator_test.cc @@ -76,8 +76,9 @@ ParsedInternalKey UncutEndpoint(const Slice& s) { return ParsedInternalKey(s, kMaxSequenceNumber, kTypeRangeDeletion); } -ParsedInternalKey InternalValue(const Slice& key, SequenceNumber seq) { - return ParsedInternalKey(key, seq, kTypeValue); +ParsedInternalKey InternalValue(const Slice& key, SequenceNumber seq, + ValueType type = kTypeValue) { + return ParsedInternalKey(key, seq, type); } void VerifyIterator( @@ -292,16 +293,18 @@ TEST_F(RangeDelAggregatorTest, TruncatedIterPartiallyCutTombstones) { TruncatedRangeDelIterator iter(std::move(input_iter), &bytewise_icmp, &smallest, &largest); - VerifyIterator(&iter, bytewise_icmp, - {{InternalValue("d", 7), UncutEndpoint("e"), 10}, - {UncutEndpoint("e"), UncutEndpoint("g"), 8}, - {UncutEndpoint("j"), InternalValue("m", 8), 4}}); + VerifyIterator( + &iter, bytewise_icmp, + {{InternalValue("d", 7), UncutEndpoint("e"), 10}, + {UncutEndpoint("e"), UncutEndpoint("g"), 8}, + {UncutEndpoint("j"), InternalValue("m", 8, kValueTypeForSeek), 4}}); VerifySeek( &iter, bytewise_icmp, {{"d", InternalValue("d", 7), UncutEndpoint("e"), 10}, {"e", UncutEndpoint("e"), UncutEndpoint("g"), 8}, - {"ia", UncutEndpoint("j"), InternalValue("m", 8), 4}, + {"ia", UncutEndpoint("j"), InternalValue("m", 8, kValueTypeForSeek), 4, + false /* invalid */}, {"n", UncutEndpoint(""), UncutEndpoint(""), 0, true /* invalid */}, {"", InternalValue("d", 7), UncutEndpoint("e"), 10}}); @@ -310,7 +313,8 @@ TEST_F(RangeDelAggregatorTest, TruncatedIterPartiallyCutTombstones) { {{"d", InternalValue("d", 7), UncutEndpoint("e"), 10}, {"e", UncutEndpoint("e"), UncutEndpoint("g"), 8}, {"ia", UncutEndpoint("e"), UncutEndpoint("g"), 8}, - {"n", UncutEndpoint("j"), InternalValue("m", 8), 4}, + {"n", UncutEndpoint("j"), InternalValue("m", 8, kValueTypeForSeek), 4, + false /* invalid */}, {"", UncutEndpoint(""), UncutEndpoint(""), 0, true /* invalid */}}); } diff --git a/db/table_cache.cc b/db/table_cache.cc index 2dca38c01..6157a65c0 100644 --- a/db/table_cache.cc +++ b/db/table_cache.cc @@ -236,7 +236,8 @@ InternalIterator* TableCache::NewIterator( TableReaderCaller caller, Arena* arena, bool skip_filters, int level, size_t max_file_size_for_l0_meta_pin, const InternalKey* smallest_compaction_key, - const InternalKey* largest_compaction_key, bool allow_unprepared_value) { + const InternalKey* largest_compaction_key, bool allow_unprepared_value, + TruncatedRangeDelIterator** range_del_iter) { PERF_TIMER_GUARD(new_table_iterator_nanos); Status s; @@ -281,25 +282,40 @@ InternalIterator* TableCache::NewIterator( *table_reader_ptr = table_reader; } } - if (s.ok() && range_del_agg != nullptr && !options.ignore_range_deletions) { - if (range_del_agg->AddFile(fd.GetNumber())) { - std::unique_ptr range_del_iter( - static_cast( - table_reader->NewRangeTombstoneIterator(options))); - if (range_del_iter != nullptr) { - s = range_del_iter->status(); + if (s.ok() && !options.ignore_range_deletions) { + if (range_del_iter != nullptr) { + auto new_range_del_iter = + table_reader->NewRangeTombstoneIterator(options); + if (new_range_del_iter == nullptr || new_range_del_iter->empty()) { + delete new_range_del_iter; + *range_del_iter = nullptr; + } else { + *range_del_iter = new 
TruncatedRangeDelIterator( + std::unique_ptr( + new_range_del_iter), + &icomparator, &file_meta.smallest, &file_meta.largest); } - if (s.ok()) { - const InternalKey* smallest = &file_meta.smallest; - const InternalKey* largest = &file_meta.largest; - if (smallest_compaction_key != nullptr) { - smallest = smallest_compaction_key; + } + if (range_del_agg != nullptr) { + if (range_del_agg->AddFile(fd.GetNumber())) { + std::unique_ptr new_range_del_iter( + static_cast( + table_reader->NewRangeTombstoneIterator(options))); + if (new_range_del_iter != nullptr) { + s = new_range_del_iter->status(); } - if (largest_compaction_key != nullptr) { - largest = largest_compaction_key; + if (s.ok()) { + const InternalKey* smallest = &file_meta.smallest; + const InternalKey* largest = &file_meta.largest; + if (smallest_compaction_key != nullptr) { + smallest = smallest_compaction_key; + } + if (largest_compaction_key != nullptr) { + largest = largest_compaction_key; + } + range_del_agg->AddTombstones(std::move(new_range_del_iter), smallest, + largest); } - range_del_agg->AddTombstones(std::move(range_del_iter), smallest, - largest); } } } diff --git a/db/table_cache.h b/db/table_cache.h index 2b0205b3e..ae711164f 100644 --- a/db/table_cache.h +++ b/db/table_cache.h @@ -63,6 +63,11 @@ class TableCache { // the returned iterator. The returned "*table_reader_ptr" object is owned // by the cache and should not be deleted, and is valid for as long as the // returned iterator is live. + // If !options.ignore_range_deletions, and range_del_iter is non-nullptr, + // then range_del_iter is set to a TruncatedRangeDelIterator for range + // tombstones in the SST file corresponding to the specified file number. The + // upper/lower bounds for the TruncatedRangeDelIterator are set to the SST + // file's boundary. // @param options Must outlive the returned iterator. // @param range_del_agg If non-nullptr, adds range deletions to the // aggregator. If an error occurs, returns it in a NewErrorInternalIterator @@ -79,7 +84,8 @@ class TableCache { TableReaderCaller caller, Arena* arena, bool skip_filters, int level, size_t max_file_size_for_l0_meta_pin, const InternalKey* smallest_compaction_key, - const InternalKey* largest_compaction_key, bool allow_unprepared_value); + const InternalKey* largest_compaction_key, bool allow_unprepared_value, + TruncatedRangeDelIterator** range_del_iter = nullptr); // If a seek to internal key "k" in specified file finds an entry, // call get_context->SaveValue() repeatedly until diff --git a/db/version_set.cc b/db/version_set.cc index 5e631556b..4e665ab4e 100644 --- a/db/version_set.cc +++ b/db/version_set.cc @@ -950,7 +950,8 @@ class LevelIterator final : public InternalIterator { RangeDelAggregator* range_del_agg, const std::vector* compaction_boundaries = nullptr, - bool allow_unprepared_value = false) + bool allow_unprepared_value = false, + MergeIteratorBuilder* merge_iter_builder = nullptr) : table_cache_(table_cache), read_options_(read_options), file_options_(file_options), @@ -968,13 +969,25 @@ class LevelIterator final : public InternalIterator { range_del_agg_(range_del_agg), pinned_iters_mgr_(nullptr), compaction_boundaries_(compaction_boundaries), - is_next_read_sequential_(false) { + is_next_read_sequential_(false), + range_tombstone_iter_(nullptr), + to_return_sentinel_(false) { // Empty level is not supported. 
assert(flevel_ != nullptr && flevel_->num_files > 0); + if (merge_iter_builder && !read_options.ignore_range_deletions) { + // lazily initialize range_tombstone_iter_ together with file_iter_ + merge_iter_builder->AddRangeTombstoneIterator(nullptr, + &range_tombstone_iter_); + } } ~LevelIterator() override { delete file_iter_.Set(nullptr); } + // Seek to the first file with a key >= target. + // If range_tombstone_iter_ is not nullptr, then we pretend that file + // boundaries are fake keys (sentinel keys). These keys are used to keep range + // tombstones alive even when all point keys in an SST file are exhausted. + // These sentinel keys will be skipped in merging iterator. void Seek(const Slice& target) override; void SeekForPrev(const Slice& target) override; void SeekToFirst() override; @@ -983,14 +996,29 @@ class LevelIterator final : public InternalIterator { bool NextAndGetResult(IterateResult* result) override; void Prev() override; - bool Valid() const override { return file_iter_.Valid(); } + // In addition to valid and invalid state (!file_iter.Valid() and + // status.ok()), a third state of the iterator is when !file_iter_.Valid() and + // to_return_sentinel_. This means we are at the end of a file, and a sentinel + // key (the file boundary that we pretend as a key) is to be returned next. + // file_iter_.Valid() and to_return_sentinel_ should not both be true. + bool Valid() const override { + assert(!(file_iter_.Valid() && to_return_sentinel_)); + return file_iter_.Valid() || to_return_sentinel_; + } Slice key() const override { assert(Valid()); + if (to_return_sentinel_) { + // Sentinel should be returned after file_iter_ reaches the end of the + // file + assert(!file_iter_.Valid()); + return sentinel_; + } return file_iter_.key(); } Slice value() const override { assert(Valid()); + assert(!to_return_sentinel_); return file_iter_.value(); } @@ -1032,6 +1060,8 @@ class LevelIterator final : public InternalIterator { file_iter_.iter() && file_iter_.IsValuePinned(); } + bool IsDeleteRangeSentinelKey() const override { return to_return_sentinel_; } + private: // Return true if at least one invalid file is seen and skipped. bool SkipEmptyFileForward(); @@ -1044,6 +1074,11 @@ class LevelIterator final : public InternalIterator { return flevel_->files[file_index].smallest_key; } + const Slice& file_largest_key(size_t file_index) { + assert(file_index < flevel_->num_files); + return flevel_->files[file_index].largest_key; + } + bool KeyReachedUpperBound(const Slice& internal_key) { return read_options_.iterate_upper_bound != nullptr && user_comparator_.CompareWithoutTimestamp( @@ -1051,6 +1086,16 @@ class LevelIterator final : public InternalIterator { *read_options_.iterate_upper_bound, /*b_has_ts=*/false) >= 0; } + void ClearRangeTombstoneIter() { + if (range_tombstone_iter_ && *range_tombstone_iter_) { + delete *range_tombstone_iter_; + *range_tombstone_iter_ = nullptr; + } + } + + // Move file_iter_ to the file at file_index_. + // range_tombstone_iter_ is updated with a range tombstone iterator + // into the new file. Old range tombstone iterator is cleared. 
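The sentinel behavior documented in the comments above (three-state Valid(), key() returning the file boundary, IsDeleteRangeSentinelKey()) can be illustrated with a standalone toy over plain std::string keys, not the LevelIterator implementation: once the wrapped "file" runs out of point keys, its largest boundary key is surfaced once as a sentinel, so a consumer such as a merging iterator does not move past the file while its range tombstones may still be needed.

```
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

class ToyFileIter {
 public:
  ToyFileIter(std::vector<std::string> keys, std::string largest_boundary)
      : keys_(std::move(keys)), boundary_(std::move(largest_boundary)) {}

  void SeekToFirst() {
    pos_ = 0;
    at_sentinel_ = keys_.empty();  // an empty file still exposes its boundary
  }
  bool Valid() const { return pos_ < keys_.size() || at_sentinel_; }
  bool IsDeleteRangeSentinelKey() const { return at_sentinel_; }
  const std::string& key() const { return at_sentinel_ ? boundary_ : keys_[pos_]; }
  void Next() {
    if (at_sentinel_) {
      at_sentinel_ = false;  // consuming the sentinel ends this file
    } else if (++pos_ == keys_.size()) {
      at_sentinel_ = true;   // point keys exhausted: emit the boundary next
    }
  }

 private:
  std::vector<std::string> keys_;
  std::string boundary_;
  size_t pos_ = 0;
  bool at_sentinel_ = false;
};

int main() {
  ToyFileIter it({"a", "b"}, "d");  // file holds keys a, b; largest boundary d
  it.SeekToFirst();
  std::string seen;
  for (; it.Valid(); it.Next()) {
    seen += it.key();
    seen += it.IsDeleteRangeSentinelKey() ? "*" : " ";
  }
  assert(seen == "a b d*");  // boundary "d" surfaces once, flagged as sentinel
  return 0;
}
```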
InternalIterator* NewFileIterator() { assert(file_index_ < flevel_->num_files); auto file_meta = flevel_->files[file_index_]; @@ -1065,13 +1110,14 @@ class LevelIterator final : public InternalIterator { largest_compaction_key = (*compaction_boundaries_)[file_index_].largest; } CheckMayBeOutOfLowerBound(); + ClearRangeTombstoneIter(); return table_cache_->NewIterator( read_options_, file_options_, icomparator_, *file_meta.file_metadata, range_del_agg_, prefix_extractor_, nullptr /* don't need reference to table */, file_read_hist_, caller_, /*arena=*/nullptr, skip_filters_, level_, /*max_file_size_for_l0_meta_pin=*/0, smallest_compaction_key, - largest_compaction_key, allow_unprepared_value_); + largest_compaction_key, allow_unprepared_value_, range_tombstone_iter_); } // Check if current file being fully within iterate_lower_bound. @@ -1117,9 +1163,51 @@ class LevelIterator final : public InternalIterator { const std::vector* compaction_boundaries_; bool is_next_read_sequential_; + + // This is set when this level iterator is used under a merging iterator + // that processes range tombstones. range_tombstone_iter_ points to where the + // merging iterator stores the range tombstones iterator for this level. When + // this level iterator moves to a new SST file, it updates the range + // tombstones accordingly through this pointer. So the merging iterator always + // has access to the current SST file's range tombstones. + // + // The level iterator treats file boundary as fake keys (sentinel keys) to + // keep range tombstones alive if needed and make upper level, i.e. merging + // iterator, aware of file changes (when level iterator moves to a new SST + // file, there is some bookkeeping work that needs to be done at merging + // iterator end). + // + // *range_tombstone_iter_ points to range tombstones of the current SST file + TruncatedRangeDelIterator** range_tombstone_iter_; + + // Whether next/prev key is a sentinel key. + bool to_return_sentinel_ = false; + // The sentinel key to be returned + Slice sentinel_; + // Sets flags for if we should return the sentinel key next. + // The condition for returning sentinel is reaching the end of current + // file_iter_: !Valid() && status.().ok(). + void TrySetDeleteRangeSentinel(const Slice& boundary_key); + void ClearSentinel() { to_return_sentinel_ = false; } + + // Set in Seek() when a prefix seek reaches end of the current file, + // and the next file has a different prefix. SkipEmptyFileForward() + // will not move to next file when this flag is set. + bool prefix_exhausted_ = false; }; +void LevelIterator::TrySetDeleteRangeSentinel(const Slice& boundary_key) { + assert(range_tombstone_iter_); + if (file_iter_.iter() != nullptr && !file_iter_.Valid() && + file_iter_.status().ok()) { + to_return_sentinel_ = true; + sentinel_ = boundary_key; + } +} + void LevelIterator::Seek(const Slice& target) { + prefix_exhausted_ = false; + ClearSentinel(); // Check whether the seek key fall under the same file bool need_to_reseek = true; if (file_iter_.iter() != nullptr && file_index_ < flevel_->num_files) { @@ -1148,44 +1236,82 @@ void LevelIterator::Seek(const Slice& target) { if (file_iter_.status() == Status::TryAgain()) { return; } - } - - if (SkipEmptyFileForward() && prefix_extractor_ != nullptr && - !read_options_.total_order_seek && !read_options_.auto_prefix_mode && - file_iter_.iter() != nullptr && file_iter_.Valid()) { - // We've skipped the file we initially positioned to. 
In the prefix - // seek case, it is likely that the file is skipped because of - // prefix bloom or hash, where more keys are skipped. We then check - // the current key and invalidate the iterator if the prefix is - // already passed. - // When doing prefix iterator seek, when keys for one prefix have - // been exhausted, it can jump to any key that is larger. Here we are - // enforcing a stricter contract than that, in order to make it easier for - // higher layers (merging and DB iterator) to reason the correctness: - // 1. Within the prefix, the result should be accurate. - // 2. If keys for the prefix is exhausted, it is either positioned to the - // next key after the prefix, or make the iterator invalid. - // A side benefit will be that it invalidates the iterator earlier so that - // the upper level merging iterator can merge fewer child iterators. - size_t ts_sz = user_comparator_.timestamp_size(); - Slice target_user_key_without_ts = - ExtractUserKeyAndStripTimestamp(target, ts_sz); - Slice file_user_key_without_ts = - ExtractUserKeyAndStripTimestamp(file_iter_.key(), ts_sz); - if (prefix_extractor_->InDomain(target_user_key_without_ts) && - (!prefix_extractor_->InDomain(file_user_key_without_ts) || - user_comparator_.CompareWithoutTimestamp( - prefix_extractor_->Transform(target_user_key_without_ts), false, - prefix_extractor_->Transform(file_user_key_without_ts), - false) != 0)) { - SetFileIterator(nullptr); + if (!file_iter_.Valid() && file_iter_.status().ok() && + prefix_extractor_ != nullptr && !read_options_.total_order_seek && + !read_options_.auto_prefix_mode && + file_index_ < flevel_->num_files - 1) { + size_t ts_sz = user_comparator_.timestamp_size(); + Slice target_user_key_without_ts = + ExtractUserKeyAndStripTimestamp(target, ts_sz); + Slice next_file_first_user_key_without_ts = + ExtractUserKeyAndStripTimestamp(file_smallest_key(file_index_ + 1), + ts_sz); + if (prefix_extractor_->InDomain(target_user_key_without_ts) && + (!prefix_extractor_->InDomain(next_file_first_user_key_without_ts) || + user_comparator_.CompareWithoutTimestamp( + prefix_extractor_->Transform(target_user_key_without_ts), false, + prefix_extractor_->Transform( + next_file_first_user_key_without_ts), + false) != 0)) { + // SkipEmptyFileForward() will not advance to next file when this flag + // is set for reason detailed below. + // + // The file we initially positioned to has no keys under the target + // prefix, and the next file's smallest key has a different prefix than + // target. When doing prefix iterator seek, when keys for one prefix + // have been exhausted, it can jump to any key that is larger. Here we + // are enforcing a stricter contract than that, in order to make it + // easier for higher layers (merging and DB iterator) to reason the + // correctness: + // 1. Within the prefix, the result should be accurate. + // 2. If keys for the prefix is exhausted, it is either positioned to + // the next key after the prefix, or make the iterator invalid. + // A side benefit will be that it invalidates the iterator earlier so + // that the upper level merging iterator can merge fewer child + // iterators. + // + // The flag is cleared in Seek*() calls. There is no need to clear the + // flag in Prev() since Prev() will not be called when the flag is set + // for reasons explained below. If range_tombstone_iter_ is nullptr, + // then there is no file boundary sentinel key. 
Since + // !file_iter_.Valid() from the if condition above, this level iterator + // is !Valid(), so Prev() will not be called. If range_tombstone_iter_ + // is not nullptr, there are two cases depending on if this level + // iterator reaches top of the heap in merging iterator (the upper + // layer). + // If so, merging iterator will see the sentinel key, call + // NextAndGetResult() and the call to NextAndGetResult() will skip the + // sentinel key and makes this level iterator invalid. If not, then it + // could be because the upper layer is done before any method of this + // level iterator is called or another Seek*() call is invoked. Either + // way, Prev() is never called before Seek*(). + // The flag should not be cleared at the beginning of + // Next/NextAndGetResult() since it is used in SkipEmptyFileForward() + // called in Next/NextAndGetResult(). + prefix_exhausted_ = true; + } + } + + if (range_tombstone_iter_) { + TrySetDeleteRangeSentinel(file_largest_key(file_index_)); } } + SkipEmptyFileForward(); CheckMayBeOutOfLowerBound(); } void LevelIterator::SeekForPrev(const Slice& target) { + prefix_exhausted_ = false; + ClearSentinel(); size_t new_file_index = FindFile(icomparator_, *flevel_, target); + // Seek beyond this level's smallest key + if (new_file_index == 0 && + icomparator_.Compare(target, file_smallest_key(0)) < 0) { + SetFileIterator(nullptr); + ClearRangeTombstoneIter(); + CheckMayBeOutOfLowerBound(); + return; + } if (new_file_index >= flevel_->num_files) { new_file_index = flevel_->num_files - 1; } @@ -1193,24 +1319,47 @@ void LevelIterator::SeekForPrev(const Slice& target) { InitFileIterator(new_file_index); if (file_iter_.iter() != nullptr) { file_iter_.SeekForPrev(target); + if (range_tombstone_iter_ && + icomparator_.Compare(target, file_smallest_key(file_index_)) >= 0) { + // In SeekForPrev() case, it is possible that the target is less than + // file's lower boundary since largest key is used to determine file index + // (FindFile()). When target is less than file's lower boundary, sentinel + // key should not be set so that SeekForPrev() does not result in a key + // larger than target. This is correct in that there is no need to keep + // the range tombstones in this file alive as they only cover keys + // starting from the file's lower boundary, which is after `target`. + TrySetDeleteRangeSentinel(file_smallest_key(file_index_)); + } SkipEmptyFileBackward(); } CheckMayBeOutOfLowerBound(); } void LevelIterator::SeekToFirst() { + prefix_exhausted_ = false; + ClearSentinel(); InitFileIterator(0); if (file_iter_.iter() != nullptr) { file_iter_.SeekToFirst(); + if (range_tombstone_iter_) { + // We do this in SeekToFirst() and SeekToLast() since + // we could have an empty file with only range tombstones. 
+ TrySetDeleteRangeSentinel(file_largest_key(file_index_)); + } } SkipEmptyFileForward(); CheckMayBeOutOfLowerBound(); } void LevelIterator::SeekToLast() { + prefix_exhausted_ = false; + ClearSentinel(); InitFileIterator(flevel_->num_files - 1); if (file_iter_.iter() != nullptr) { file_iter_.SeekToLast(); + if (range_tombstone_iter_) { + TrySetDeleteRangeSentinel(file_smallest_key(file_index_)); + } } SkipEmptyFileBackward(); CheckMayBeOutOfLowerBound(); @@ -1218,25 +1367,47 @@ void LevelIterator::SeekToLast() { void LevelIterator::Next() { assert(Valid()); - file_iter_.Next(); + if (to_return_sentinel_) { + // file_iter_ is at EOF already when to_return_sentinel_ + ClearSentinel(); + } else { + file_iter_.Next(); + if (range_tombstone_iter_) { + TrySetDeleteRangeSentinel(file_largest_key(file_index_)); + } + } SkipEmptyFileForward(); } bool LevelIterator::NextAndGetResult(IterateResult* result) { assert(Valid()); - bool is_valid = file_iter_.NextAndGetResult(result); + // file_iter_ is at EOF already when to_return_sentinel_ + bool is_valid = !to_return_sentinel_ && file_iter_.NextAndGetResult(result); if (!is_valid) { + if (to_return_sentinel_) { + ClearSentinel(); + } else if (range_tombstone_iter_) { + TrySetDeleteRangeSentinel(file_largest_key(file_index_)); + } is_next_read_sequential_ = true; SkipEmptyFileForward(); is_next_read_sequential_ = false; is_valid = Valid(); if (is_valid) { - result->key = key(); - result->bound_check_result = file_iter_.UpperBoundCheckResult(); - // Ideally, we should return the real file_iter_.value_prepared but the - // information is not here. It would casue an extra PrepareValue() - // for the first key of a file. - result->value_prepared = !allow_unprepared_value_; + // This could be set in TrySetDeleteRangeSentinel() or + // SkipEmptyFileForward() above. + if (to_return_sentinel_) { + result->key = sentinel_; + result->bound_check_result = IterBoundCheck::kUnknown; + result->value_prepared = true; + } else { + result->key = key(); + result->bound_check_result = file_iter_.UpperBoundCheckResult(); + // Ideally, we should return the real file_iter_.value_prepared but the + // information is not here. It would casue an extra PrepareValue() + // for the first key of a file. 
+ result->value_prepared = !allow_unprepared_value_; + } } } return is_valid; @@ -1244,47 +1415,81 @@ bool LevelIterator::NextAndGetResult(IterateResult* result) { void LevelIterator::Prev() { assert(Valid()); - file_iter_.Prev(); + if (to_return_sentinel_) { + ClearSentinel(); + } else { + file_iter_.Prev(); + if (range_tombstone_iter_) { + TrySetDeleteRangeSentinel(file_smallest_key(file_index_)); + } + } SkipEmptyFileBackward(); } bool LevelIterator::SkipEmptyFileForward() { bool seen_empty_file = false; - while (file_iter_.iter() == nullptr || - (!file_iter_.Valid() && file_iter_.status().ok() && - file_iter_.iter()->UpperBoundCheckResult() != - IterBoundCheck::kOutOfBound)) { + // Pause at sentinel key + while (!to_return_sentinel_ && + (file_iter_.iter() == nullptr || + (!file_iter_.Valid() && file_iter_.status().ok() && + file_iter_.iter()->UpperBoundCheckResult() != + IterBoundCheck::kOutOfBound))) { seen_empty_file = true; // Move to next file - if (file_index_ >= flevel_->num_files - 1) { - // Already at the last file - SetFileIterator(nullptr); - break; - } - if (KeyReachedUpperBound(file_smallest_key(file_index_ + 1))) { + if (file_index_ >= flevel_->num_files - 1 || + KeyReachedUpperBound(file_smallest_key(file_index_ + 1)) || + prefix_exhausted_) { SetFileIterator(nullptr); + ClearRangeTombstoneIter(); break; } + // may init a new *range_tombstone_iter InitFileIterator(file_index_ + 1); + // We moved to a new SST file + // Seek range_tombstone_iter_ to reset its !Valid() default state. + // We do not need to call range_tombstone_iter_.Seek* in + // LevelIterator::Seek* since when the merging iterator calls + // LevelIterator::Seek*, it should also call Seek* into the corresponding + // range tombstone iterator. if (file_iter_.iter() != nullptr) { file_iter_.SeekToFirst(); + if (range_tombstone_iter_) { + if (*range_tombstone_iter_) { + (*range_tombstone_iter_)->SeekToFirst(); + } + TrySetDeleteRangeSentinel(file_largest_key(file_index_)); + } } } return seen_empty_file; } void LevelIterator::SkipEmptyFileBackward() { - while (file_iter_.iter() == nullptr || - (!file_iter_.Valid() && file_iter_.status().ok())) { + // Pause at sentinel key + while (!to_return_sentinel_ && + (file_iter_.iter() == nullptr || + (!file_iter_.Valid() && file_iter_.status().ok()))) { // Move to previous file if (file_index_ == 0) { // Already the first file SetFileIterator(nullptr); + ClearRangeTombstoneIter(); return; } InitFileIterator(file_index_ - 1); + // We moved to a new SST file + // Seek range_tombstone_iter_ to reset its !Valid() default state. 
if (file_iter_.iter() != nullptr) { file_iter_.SeekToLast(); + if (range_tombstone_iter_) { + if (*range_tombstone_iter_) { + (*range_tombstone_iter_)->SeekToLast(); + } + TrySetDeleteRangeSentinel(file_smallest_key(file_index_)); + if (to_return_sentinel_) { + break; + } + } } } } @@ -1312,6 +1517,7 @@ void LevelIterator::InitFileIterator(size_t new_file_index) { if (new_file_index >= flevel_->num_files) { file_index_ = new_file_index; SetFileIterator(nullptr); + ClearRangeTombstoneIter(); return; } else { // If the file iterator shows incomplete, we try it again if users seek @@ -1661,6 +1867,21 @@ Status Version::VerifySstUniqueIds() const { return Status::OK(); } +InternalIterator* Version::TEST_GetLevelIterator( + const ReadOptions& read_options, MergeIteratorBuilder* merge_iter_builder, + int level, bool allow_unprepared_value) { + auto* arena = merge_iter_builder->GetArena(); + auto* mem = arena->AllocateAligned(sizeof(LevelIterator)); + return new (mem) LevelIterator( + cfd_->table_cache(), read_options, file_options_, + cfd_->internal_comparator(), &storage_info_.LevelFilesBrief(level), + mutable_cf_options_.prefix_extractor, should_sample_file_read(), + cfd_->internal_stats()->GetFileReadHist(level), + TableReaderCaller::kUserIterator, IsFilterSkipped(level), level, + nullptr /* range_del_agg */, nullptr /* compaction_boundaries */, + allow_unprepared_value, merge_iter_builder); +} + uint64_t VersionStorageInfo::GetEstimatedActiveKeys() const { // Estimation will be inaccurate when: // (1) there exist merge keys @@ -1711,22 +1932,19 @@ double VersionStorageInfo::GetEstimatedCompressionRatioAtLevel( void Version::AddIterators(const ReadOptions& read_options, const FileOptions& soptions, MergeIteratorBuilder* merge_iter_builder, - RangeDelAggregator* range_del_agg, bool allow_unprepared_value) { assert(storage_info_.finalized_); for (int level = 0; level < storage_info_.num_non_empty_levels(); level++) { AddIteratorsForLevel(read_options, soptions, merge_iter_builder, level, - range_del_agg, allow_unprepared_value); + allow_unprepared_value); } } void Version::AddIteratorsForLevel(const ReadOptions& read_options, const FileOptions& soptions, MergeIteratorBuilder* merge_iter_builder, - int level, - RangeDelAggregator* range_del_agg, - bool allow_unprepared_value) { + int level, bool allow_unprepared_value) { assert(storage_info_.finalized_); if (level >= storage_info_.num_non_empty_levels()) { // This is an empty level @@ -1741,17 +1959,21 @@ void Version::AddIteratorsForLevel(const ReadOptions& read_options, auto* arena = merge_iter_builder->GetArena(); if (level == 0) { // Merge all level zero files together since they may overlap + TruncatedRangeDelIterator* iter = nullptr; for (size_t i = 0; i < storage_info_.LevelFilesBrief(0).num_files; i++) { const auto& file = storage_info_.LevelFilesBrief(0).files[i]; merge_iter_builder->AddIterator(cfd_->table_cache()->NewIterator( read_options, soptions, cfd_->internal_comparator(), - *file.file_metadata, range_del_agg, + *file.file_metadata, /*range_del_agg=*/nullptr, mutable_cf_options_.prefix_extractor, nullptr, cfd_->internal_stats()->GetFileReadHist(0), TableReaderCaller::kUserIterator, arena, /*skip_filters=*/false, /*level=*/0, max_file_size_for_l0_meta_pin_, /*smallest_compaction_key=*/nullptr, - /*largest_compaction_key=*/nullptr, allow_unprepared_value)); + /*largest_compaction_key=*/nullptr, allow_unprepared_value, &iter)); + if (!read_options.ignore_range_deletions) { + merge_iter_builder->AddRangeTombstoneIterator(iter); + } 
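A minimal sketch of reading the new counter through the public perf-context API (the counter name comes from the declaration that follows); it assumes `db` is an open `rocksdb::DB*` whose data includes range tombstones:

```
#include <cstdio>
#include <memory>

#include "rocksdb/db.h"
#include "rocksdb/perf_context.h"
#include "rocksdb/perf_level.h"

// Scan the whole DB and report how often the merging iterator reseeked to
// skip over spans covered by range tombstones.
void ReportRangeDelReseeks(rocksdb::DB* db) {
  rocksdb::SetPerfLevel(rocksdb::PerfLevel::kEnableCount);
  rocksdb::get_perf_context()->Reset();

  std::unique_ptr<rocksdb::Iterator> it(db->NewIterator(rocksdb::ReadOptions()));
  for (it->SeekToFirst(); it->Valid(); it->Next()) {
    // Iterate; skipping tombstone-covered key ranges may trigger reseeks.
  }

  std::printf("internal_range_del_reseek_count = %llu\n",
              static_cast<unsigned long long>(
                  rocksdb::get_perf_context()->internal_range_del_reseek_count));
  rocksdb::SetPerfLevel(rocksdb::PerfLevel::kDisable);
}
```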
} if (should_sample) { // Count ones for every L0 files. This is done per iterator creation @@ -1773,8 +1995,8 @@ void Version::AddIteratorsForLevel(const ReadOptions& read_options, mutable_cf_options_.prefix_extractor, should_sample_file_read(), cfd_->internal_stats()->GetFileReadHist(level), TableReaderCaller::kUserIterator, IsFilterSkipped(level), level, - range_del_agg, - /*compaction_boundaries=*/nullptr, allow_unprepared_value)); + /*range_del_agg=*/nullptr, /*compaction_boundaries=*/nullptr, + allow_unprepared_value, merge_iter_builder)); } } diff --git a/db/version_set.h b/db/version_set.h index e6002f61c..67fb9cd5a 100644 --- a/db/version_set.h +++ b/db/version_set.h @@ -803,7 +803,6 @@ class Version { void AddIterators(const ReadOptions& read_options, const FileOptions& soptions, MergeIteratorBuilder* merger_iter_builder, - RangeDelAggregator* range_del_agg, bool allow_unprepared_value); // @param read_options Must outlive any iterator built by @@ -811,8 +810,7 @@ class Version { void AddIteratorsForLevel(const ReadOptions& read_options, const FileOptions& soptions, MergeIteratorBuilder* merger_iter_builder, - int level, RangeDelAggregator* range_del_agg, - bool allow_unprepared_value); + int level, bool allow_unprepared_value); Status OverlapWithLevelIterator(const ReadOptions&, const FileOptions&, const Slice& smallest_user_key, @@ -963,6 +961,10 @@ class Version { Status VerifySstUniqueIds() const; + InternalIterator* TEST_GetLevelIterator( + const ReadOptions& read_options, MergeIteratorBuilder* merge_iter_builder, + int level, bool allow_unprepared_value); + private: Env* env_; SystemClock* clock_; diff --git a/include/rocksdb/c.h b/include/rocksdb/c.h index 43f1e49c5..d55b79379 100644 --- a/include/rocksdb/c.h +++ b/include/rocksdb/c.h @@ -1715,7 +1715,8 @@ enum { rocksdb_blob_read_time, rocksdb_blob_checksum_time, rocksdb_blob_decompress_time, - rocksdb_total_metric_count = 77 + rocksdb_internal_range_del_reseek_count, + rocksdb_total_metric_count = 78 }; extern ROCKSDB_LIBRARY_API void rocksdb_set_perf_level(int); diff --git a/include/rocksdb/perf_context.h b/include/rocksdb/perf_context.h index a474cd8d9..2403c455d 100644 --- a/include/rocksdb/perf_context.h +++ b/include/rocksdb/perf_context.h @@ -124,6 +124,10 @@ struct PerfContext { // How many values were fed into merge operator by iterators. // uint64_t internal_merge_count; + // Number of times we reseeked inside a merging iterator, specifically to skip + // after or before a range of keys covered by a range deletion in a newer LSM + // component. 
+ uint64_t internal_range_del_reseek_count; uint64_t get_snapshot_time; // total nanos spent on getting snapshot uint64_t get_from_memtable_time; // total nanos spent on querying memtables diff --git a/monitoring/perf_context.cc b/monitoring/perf_context.cc index e104475b9..20240eb59 100644 --- a/monitoring/perf_context.cc +++ b/monitoring/perf_context.cc @@ -59,6 +59,7 @@ PerfContext::PerfContext(const PerfContext& other) { internal_delete_skipped_count = other.internal_delete_skipped_count; internal_recent_skipped_count = other.internal_recent_skipped_count; internal_merge_count = other.internal_merge_count; + internal_range_del_reseek_count = other.internal_range_del_reseek_count; write_wal_time = other.write_wal_time; get_snapshot_time = other.get_snapshot_time; get_from_memtable_time = other.get_from_memtable_time; @@ -166,6 +167,7 @@ PerfContext::PerfContext(PerfContext&& other) noexcept { internal_delete_skipped_count = other.internal_delete_skipped_count; internal_recent_skipped_count = other.internal_recent_skipped_count; internal_merge_count = other.internal_merge_count; + internal_range_del_reseek_count = other.internal_range_del_reseek_count; write_wal_time = other.write_wal_time; get_snapshot_time = other.get_snapshot_time; get_from_memtable_time = other.get_from_memtable_time; @@ -275,6 +277,7 @@ PerfContext& PerfContext::operator=(const PerfContext& other) { internal_delete_skipped_count = other.internal_delete_skipped_count; internal_recent_skipped_count = other.internal_recent_skipped_count; internal_merge_count = other.internal_merge_count; + internal_range_del_reseek_count = other.internal_range_del_reseek_count; write_wal_time = other.write_wal_time; get_snapshot_time = other.get_snapshot_time; get_from_memtable_time = other.get_from_memtable_time; @@ -381,6 +384,7 @@ void PerfContext::Reset() { internal_delete_skipped_count = 0; internal_recent_skipped_count = 0; internal_merge_count = 0; + internal_range_del_reseek_count = 0; write_wal_time = 0; get_snapshot_time = 0; @@ -509,6 +513,7 @@ std::string PerfContext::ToString(bool exclude_zero_counters) const { PERF_CONTEXT_OUTPUT(internal_delete_skipped_count); PERF_CONTEXT_OUTPUT(internal_recent_skipped_count); PERF_CONTEXT_OUTPUT(internal_merge_count); + PERF_CONTEXT_OUTPUT(internal_range_del_reseek_count); PERF_CONTEXT_OUTPUT(write_wal_time); PERF_CONTEXT_OUTPUT(get_snapshot_time); PERF_CONTEXT_OUTPUT(get_from_memtable_time); diff --git a/table/internal_iterator.h b/table/internal_iterator.h index d2f88b865..945dec806 100644 --- a/table/internal_iterator.h +++ b/table/internal_iterator.h @@ -186,6 +186,13 @@ class InternalIteratorBase : public Cleanable { // Default implementation is no-op and its implemented by iterators. virtual void SetReadaheadState(ReadaheadFileInfo* /*readahead_file_info*/) {} + // When used under merging iterator, LevelIterator treats file boundaries + // as sentinel keys to prevent it from moving to next SST file before range + // tombstones in the current SST file are no longer needed. This method makes + // it cheap to check if the current key is a sentinel key. This should only be + // used by MergingIterator and LevelIterator for now. 
+ virtual bool IsDeleteRangeSentinelKey() const { return false; } + protected: void SeekForPrevImpl(const Slice& target, const CompareInterface* cmp) { Seek(target); diff --git a/table/iterator_wrapper.h b/table/iterator_wrapper.h index 134d93d08..d10330e06 100644 --- a/table/iterator_wrapper.h +++ b/table/iterator_wrapper.h @@ -162,6 +162,10 @@ class IteratorWrapperBase { } } + bool IsDeleteRangeSentinelKey() const { + return iter_->IsDeleteRangeSentinelKey(); + } + private: void Update() { valid_ = iter_->Valid(); diff --git a/table/merging_iterator.cc b/table/merging_iterator.cc index 10dda3c66..bcd66492f 100644 --- a/table/merging_iterator.cc +++ b/table/merging_iterator.cc @@ -8,8 +8,8 @@ // found in the LICENSE file. See the AUTHORS file for names of contributors. #include "table/merging_iterator.h" -#include -#include + +#include "db/arena_wrapped_db_iter.h" #include "db/dbformat.h" #include "db/pinned_iterators_manager.h" #include "memory/arena.h" @@ -26,14 +26,93 @@ #include "util/stop_watch.h" namespace ROCKSDB_NAMESPACE { +// For merging iterator to process range tombstones, we treat the start and end +// keys of a range tombstone as point keys and put them into the minHeap/maxHeap +// used in merging iterator. Take minHeap for example, we are able to keep track +// of currently "active" range tombstones (the ones whose start keys are popped +// but end keys are still in the heap) in `active_`. This `active_` set of range +// tombstones is then used to quickly determine whether the point key at heap +// top is deleted (by heap property, the point key at heap top must be within +// internal key range of active range tombstones). +// +// The HeapItem struct represents 3 types of elements in the minHeap/maxHeap: +// point key and the start and end keys of a range tombstone. +struct HeapItem { + HeapItem() = default; + + enum Type { ITERATOR, DELETE_RANGE_START, DELETE_RANGE_END }; + IteratorWrapper iter; + size_t level = 0; + std::string pinned_key; + // Will be overwritten before use, initialize here so compiler does not + // complain. + Type type = ITERATOR; + + explicit HeapItem(size_t _level, InternalIteratorBase* _iter) + : level(_level), type(Type::ITERATOR) { + iter.Set(_iter); + } + + void SetTombstoneKey(ParsedInternalKey&& pik) { + pinned_key.clear(); + // Range tombstone end key is exclusive. If a point internal key has the + // same user key and sequence number as the start or end key of a range + // tombstone, the order will be start < end key < internal key with the + // following op_type change. This is helpful to ensure keys popped from + // heap are in expected order since range tombstone start/end keys will + // be distinct from point internal keys. Strictly speaking, this is only + // needed for tombstone end points that are truncated in + // TruncatedRangeDelIterator since untruncated tombstone end points always + // have kMaxSequenceNumber and kTypeRangeDeletion (see + // TruncatedRangeDelIterator::start_key()/end_key()). 
+ ParsedInternalKey p(pik.user_key, pik.sequence, kTypeMaxValid); + AppendInternalKey(&pinned_key, p); + } + + Slice key() const { + if (type == Type::ITERATOR) { + return iter.key(); + } + return pinned_key; + } + + bool IsDeleteRangeSentinelKey() const { + if (type == Type::ITERATOR) { + return iter.IsDeleteRangeSentinelKey(); + } + return false; + } +}; + +class MinHeapItemComparator { + public: + MinHeapItemComparator(const InternalKeyComparator* comparator) + : comparator_(comparator) {} + bool operator()(HeapItem* a, HeapItem* b) const { + return comparator_->Compare(a->key(), b->key()) > 0; + } + + private: + const InternalKeyComparator* comparator_; +}; + +class MaxHeapItemComparator { + public: + MaxHeapItemComparator(const InternalKeyComparator* comparator) + : comparator_(comparator) {} + bool operator()(HeapItem* a, HeapItem* b) const { + return comparator_->Compare(a->key(), b->key()) < 0; + } + + private: + const InternalKeyComparator* comparator_; +}; // Without anonymous namespace here, we fail the warning -Wmissing-prototypes namespace { -using MergerMaxIterHeap = BinaryHeap; -using MergerMinIterHeap = BinaryHeap; +using MergerMinIterHeap = BinaryHeap; +using MergerMaxIterHeap = BinaryHeap; } // namespace -const size_t kNumIterReserve = 4; - class MergingIterator : public InternalIterator { public: MergingIterator(const InternalKeyComparator* comparator, @@ -48,7 +127,8 @@ class MergingIterator : public InternalIterator { pinned_iters_mgr_(nullptr) { children_.resize(n); for (int i = 0; i < n; i++) { - children_[i].Set(children[i]); + children_[i].level = i; + children_[i].iter.Set(children[i]); } } @@ -59,7 +139,7 @@ class MergingIterator : public InternalIterator { } virtual void AddIterator(InternalIterator* iter) { - children_.emplace_back(iter); + children_.emplace_back(children_.size(), iter); if (pinned_iters_mgr_) { iter->SetPinnedItersMgr(pinned_iters_mgr_); } @@ -68,9 +148,44 @@ class MergingIterator : public InternalIterator { current_ = nullptr; } + // Merging iterator can optionally process range tombstones: if a key is + // covered by a range tombstone, the merging iterator will not output it but + // skip it. + // + // Add the next range tombstone iterator to this merging iterator. + // There must be either no range tombstone iterator, or same number of + // range tombstone iterators as point iterators after all range tombstone + // iters are added. The i-th added range tombstone iterator and the i-th point + // iterator must point to the same sorted run. + // Merging iterator takes ownership of the range tombstone iterator and + // is responsible for freeing it. Note that during Iterator::Refresh() + // and when a level iterator moves to a different SST file, the range + // tombstone iterator could be updated. In that case, the merging iterator + // is only responsible to freeing the new range tombstone iterator + // that it has pointers to in range_tombstone_iters_. + void AddRangeTombstoneIterator(TruncatedRangeDelIterator* iter) { + range_tombstone_iters_.emplace_back(iter); + } + + // Called by MergingIteratorBuilder when all point iterators and range + // tombstone iterators are added. Initializes HeapItems for range tombstone + // iterators so that no further allocation is needed for HeapItem. 
+ void Finish() { + if (!range_tombstone_iters_.empty()) { + pinned_heap_item_.resize(range_tombstone_iters_.size()); + for (size_t i = 0; i < range_tombstone_iters_.size(); ++i) { + pinned_heap_item_[i].level = i; + } + } + } + ~MergingIterator() override { + for (auto child : range_tombstone_iters_) { + delete child; + } + for (auto& child : children_) { - child.DeleteIter(is_arena_mode_); + child.iter.DeleteIter(is_arena_mode_); } status_.PermitUncheckedError(); } @@ -79,13 +194,92 @@ class MergingIterator : public InternalIterator { Status status() const override { return status_; } + // Add range_tombstone_iters_[level] into min heap. + // Updates active_ if the end key of a range tombstone is inserted. + // @param start_key specifies which end point of the range tombstone to add. + void InsertRangeTombstoneToMinHeap(size_t level, bool start_key = true) { + assert(!range_tombstone_iters_.empty() && + range_tombstone_iters_[level]->Valid()); + if (start_key) { + pinned_heap_item_[level].SetTombstoneKey( + range_tombstone_iters_[level]->start_key()); + pinned_heap_item_[level].type = HeapItem::DELETE_RANGE_START; + assert(active_.count(level) == 0); + } else { + pinned_heap_item_[level].SetTombstoneKey( + range_tombstone_iters_[level]->end_key()); + pinned_heap_item_[level].type = HeapItem::DELETE_RANGE_END; + active_.insert(level); + } + minHeap_.push(&pinned_heap_item_[level]); + } + + // Add range_tombstone_iters_[level] into max heap. + // Updates active_ if the start key of a range tombstone is inserted. + // @param end_key specifies which end point of the range tombstone to add. + void InsertRangeTombstoneToMaxHeap(size_t level, bool end_key = true) { + assert(!range_tombstone_iters_.empty() && + range_tombstone_iters_[level]->Valid()); + if (end_key) { + pinned_heap_item_[level].SetTombstoneKey( + range_tombstone_iters_[level]->end_key()); + pinned_heap_item_[level].type = HeapItem::DELETE_RANGE_END; + assert(active_.count(level) == 0); + } else { + pinned_heap_item_[level].SetTombstoneKey( + range_tombstone_iters_[level]->start_key()); + pinned_heap_item_[level].type = HeapItem::DELETE_RANGE_START; + active_.insert(level); + } + maxHeap_->push(&pinned_heap_item_[level]); + } + + // Remove HeapItems from top of minHeap_ that are of type DELETE_RANGE_START + // until minHeap_ is empty or the top of the minHeap_ is not of type + // DELETE_RANGE_START. Each such item means a range tombstone becomes active, + // so `active_` is updated accordingly. + void PopDeleteRangeStart() { + while (!minHeap_.empty() && + minHeap_.top()->type == HeapItem::DELETE_RANGE_START) { + auto level = minHeap_.top()->level; + minHeap_.pop(); + // insert end key of this range tombstone and updates active_ + InsertRangeTombstoneToMinHeap(level, false /* start_key */); + } + } + + // Remove HeapItems from top of maxHeap_ that are of type DELETE_RANGE_END + // until maxHeap_ is empty or the top of the maxHeap_ is not of type + // DELETE_RANGE_END. Each such item means a range tombstone becomes active, + // so `active_` is updated accordingly. 
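A toy model of the bookkeeping done by the Insert*/PopDeleteRange* helpers above (plain std::string keys in a single min-heap; levels, sequence numbers, and the op-type tie-breaking of the real HeapItem are ignored): a tombstone first contributes its start key to the heap; when that start key is popped, the end key is pushed and the tombstone's level is marked active, so any point key popped before the end key is known to lie inside the tombstone's range.

```
#include <cassert>
#include <cstddef>
#include <queue>
#include <set>
#include <string>
#include <vector>

enum class Kind { kPoint, kRangeStart, kRangeEnd };

struct Item {
  std::string key;
  Kind kind;
  size_t level;
};

struct MinCmp {  // std::priority_queue is a max-heap, so invert the comparison
  bool operator()(const Item& a, const Item& b) const { return a.key > b.key; }
};

int main() {
  std::priority_queue<Item, std::vector<Item>, MinCmp> heap;
  std::set<size_t> active;  // levels whose tombstone end key is still in the heap

  // Level 0 has a tombstone [b, d); level 1 has point keys a, c, e.
  heap.push({"b", Kind::kRangeStart, 0});
  for (const char* k : {"a", "c", "e"}) heap.push({k, Kind::kPoint, 1});

  std::string visible;
  while (!heap.empty()) {
    Item top = heap.top();
    heap.pop();
    if (top.kind == Kind::kRangeStart) {
      active.insert(top.level);                      // tombstone becomes active...
      heap.push({"d", Kind::kRangeEnd, top.level});  // ...until its end key pops
    } else if (top.kind == Kind::kRangeEnd) {
      active.erase(top.level);
    } else if (active.empty()) {  // real check also compares levels and seqnos
      visible += top.key;
    }
  }
  assert(visible == "ae");  // "c" fell inside the active tombstone [b, d)
  return 0;
}
```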
+ void PopDeleteRangeEnd() { + while (!maxHeap_->empty() && + maxHeap_->top()->type == HeapItem::DELETE_RANGE_END) { + auto level = maxHeap_->top()->level; + maxHeap_->pop(); + // insert start key of this range tombstone and updates active_ + InsertRangeTombstoneToMaxHeap(level, false /* end_key */); + } + } + void SeekToFirst() override { ClearHeaps(); status_ = Status::OK(); for (auto& child : children_) { - child.SeekToFirst(); + child.iter.SeekToFirst(); AddToMinHeapOrCheckStatus(&child); } + + for (size_t i = 0; i < range_tombstone_iters_.size(); ++i) { + if (range_tombstone_iters_[i]) { + range_tombstone_iters_[i]->SeekToFirst(); + if (range_tombstone_iters_[i]->Valid()) { + // It is possible to be invalid due to snapshots. + InsertRangeTombstoneToMinHeap(i); + } + } + } + FindNextVisibleKey(); direction_ = kForward; current_ = CurrentForward(); } @@ -95,49 +289,67 @@ class MergingIterator : public InternalIterator { InitMaxHeap(); status_ = Status::OK(); for (auto& child : children_) { - child.SeekToLast(); + child.iter.SeekToLast(); AddToMaxHeapOrCheckStatus(&child); } + + for (size_t i = 0; i < range_tombstone_iters_.size(); ++i) { + if (range_tombstone_iters_[i]) { + range_tombstone_iters_[i]->SeekToLast(); + if (range_tombstone_iters_[i]->Valid()) { + // It is possible to be invalid due to snapshots. + InsertRangeTombstoneToMaxHeap(i); + } + } + } + FindPrevVisibleKey(); direction_ = kReverse; current_ = CurrentReverse(); } + // Position this merging iterator at the first key >= target (internal key). + // If range tombstones are present, keys covered by range tombstones are + // skipped, and this merging iter points to the first non-range-deleted key >= + // target after Seek(). If !Valid() and status().ok() then end of the iterator + // is reached. + // + // Internally, this involves positioning all child iterators at the first key + // >= target. If range tombstones are present, we apply a similar + // optimization, cascading seek, as in Pebble + // (https://github.com/cockroachdb/pebble). Specifically, if there is a range + // tombstone [start, end) that covers the target user key at level L, then + // this range tombstone must cover the range [target key, end) in all levels > + // L. So for all levels > L, we can pretend the target key is `end`. This + // optimization is applied at each level and hence the name "cascading seek". + // After a round of (cascading) seeks, the top of the heap is checked to see + // if it is covered by a range tombstone (see FindNextVisibleKey() for more + // detail), and advanced if so. The process is repeated until a + // non-range-deleted key is at the top of the heap, or heap becomes empty. + // + // As mentioned in comments above HeapItem, to make the checking of whether + // top of the heap is covered by some range tombstone efficient, we treat each + // range deletion [start, end) as two point keys and insert them into the same + // min/maxHeap_ where point iterators are. The set `active_` tracks the levels + // that have active range tombstones. If level L is in `active_`, and the + // point key at top of the heap is from level >= L, then the point key is + // within the internal key range of the range tombstone that + // range_tombstone_iters_[L] currently points to. For correctness reasoning, + // one invariant that Seek() (and every other public APIs Seek*(), + // Next/Prev()) guarantees is as follows. After Seek(), suppose `k` is the + // current key of level L's point iterator. 
Then for each range tombstone + // iterator at level <= L, it is at or before the first range tombstone with + // end key > `k`. This ensures that when level L's point iterator reaches top + // of the heap, `active_` is calculated correctly (it contains the covering + // range tombstone's level if there is one), since no range tombstone iterator + // was skipped beyond that point iterator's current key during Seek(). + // Next()/Prev() maintains a stronger version of this invariant where all + // range tombstone iterators from level <= L are *at* the first range + // tombstone with end key > `k`. void Seek(const Slice& target) override { - ClearHeaps(); - status_ = Status::OK(); - for (auto& child : children_) { - { - PERF_TIMER_GUARD(seek_child_seek_time); - child.Seek(target); - } - - PERF_COUNTER_ADD(seek_child_seek_count, 1); - - // child.status() is set to Status::TryAgain indicating asynchronous - // request for retrieval of data blocks has been submitted. So it should - // return at this point and Seek should be called again to retrieve the - // requested block and add the child to min heap. - if (child.status() == Status::TryAgain()) { - continue; - } - { - // Strictly, we timed slightly more than min heap operation, - // but these operations are very cheap. - PERF_TIMER_GUARD(seek_min_heap_time); - AddToMinHeapOrCheckStatus(&child); - } - } - - for (auto& child : children_) { - if (child.status() == Status::TryAgain()) { - child.Seek(target); - { - PERF_TIMER_GUARD(seek_min_heap_time); - AddToMinHeapOrCheckStatus(&child); - } - PERF_COUNTER_ADD(number_async_seek, 1); - } - } + assert(range_tombstone_iters_.empty() || + range_tombstone_iters_.size() == children_.size()); + SeekImpl(target); + FindNextVisibleKey(); direction_ = kForward; { @@ -147,22 +359,11 @@ class MergingIterator : public InternalIterator { } void SeekForPrev(const Slice& target) override { - ClearHeaps(); - InitMaxHeap(); - status_ = Status::OK(); + assert(range_tombstone_iters_.empty() || + range_tombstone_iters_.size() == children_.size()); + SeekForPrevImpl(target); + FindPrevVisibleKey(); - for (auto& child : children_) { - { - PERF_TIMER_GUARD(seek_child_seek_time); - child.SeekForPrev(target); - } - PERF_COUNTER_ADD(seek_child_seek_count, 1); - - { - PERF_TIMER_GUARD(seek_max_heap_time); - AddToMaxHeapOrCheckStatus(&child); - } - } direction_ = kReverse; { PERF_TIMER_GUARD(seek_max_heap_time); @@ -172,21 +373,19 @@ class MergingIterator : public InternalIterator { void Next() override { assert(Valid()); - // Ensure that all children are positioned after key(). // If we are moving in the forward direction, it is already // true for all of the non-current children since current_ is // the smallest child and key() == current_->key(). if (direction_ != kForward) { - SwitchToForward(); // The loop advanced all non-current children to be > key() so current_ // should still be strictly the smallest key. + SwitchToForward(); } // For the heap modifications below to be correct, current_ must be the // current top of the heap. assert(current_ == CurrentForward()); - // as the current points to the current record. move the iterator forward. current_->Next(); if (current_->Valid()) { @@ -194,12 +393,13 @@ class MergingIterator : public InternalIterator { // replace_top() to restore the heap property. When the same child // iterator yields a sequence of keys, this is cheap. 
assert(current_->status().ok()); - minHeap_.replace_top(current_); + minHeap_.replace_top(minHeap_.top()); } else { // current stopped being valid, remove it from the heap. considerStatus(current_->status()); minHeap_.pop(); } + FindNextVisibleKey(); current_ = CurrentForward(); } @@ -229,19 +429,19 @@ // For the heap modifications below to be correct, current_ must be the // current top of the heap. assert(current_ == CurrentReverse()); - current_->Prev(); if (current_->Valid()) { // current is still valid after the Prev() call above. Call // replace_top() to restore the heap property. When the same child // iterator yields a sequence of keys, this is cheap. assert(current_->status().ok()); - maxHeap_->replace_top(current_); + maxHeap_->replace_top(maxHeap_->top()); } else { // current stopped being valid, remove it from the heap. considerStatus(current_->status()); maxHeap_->pop(); } + FindPrevVisibleKey(); current_ = CurrentReverse(); } @@ -283,7 +483,7 @@ void SetPinnedItersMgr(PinnedIteratorsManager* pinned_iters_mgr) override { pinned_iters_mgr_ = pinned_iters_mgr; for (auto& child : children_) { - child.SetPinnedItersMgr(pinned_iters_mgr); + child.iter.SetPinnedItersMgr(pinned_iters_mgr); } } @@ -300,19 +500,57 @@ } private: + friend class MergeIteratorBuilder; // Clears heaps for both directions, used when changing direction or seeking - void ClearHeaps(); + void ClearHeaps(bool clear_active = true); // Ensures that maxHeap_ is initialized when starting to go in the reverse // direction void InitMaxHeap(); + // Advance this merging iterator until the current key (top of min heap) is + // not covered by any range tombstone or there are no more keys (heap is + // empty). After this call, if Valid(), current_ points to the next key that + // is not covered by any range tombstone. + void FindNextVisibleKey(); + void FindPrevVisibleKey(); + + void SeekImpl(const Slice& target, size_t starting_level = 0, + bool range_tombstone_reseek = false); + + // Seek to first key <= target key (internal key) for + // children_[starting_level:]. + void SeekForPrevImpl(const Slice& target, size_t starting_level = 0, + bool range_tombstone_reseek = false); + bool is_arena_mode_; bool prefix_seek_mode_; // Which direction is the iterator moving? enum Direction : uint8_t { kForward, kReverse }; Direction direction_; const InternalKeyComparator* comparator_; - autovector<IteratorWrapper, kNumIterReserve> children_; + // We could also use an autovector with a larger reserved size. + // HeapItem for all child point iterators. + std::vector<HeapItem> children_; + // HeapItem for range tombstone start and end keys. Each range tombstone + // iterator will have at most one side (start key or end key) in a heap + // at the same time, so this vector will be of size children_.size(); + // pinned_heap_item_[i] corresponds to the start key and end key HeapItem + // for range_tombstone_iters_[i]. + std::vector<HeapItem> pinned_heap_item_; + // range_tombstone_iters_[i] contains range tombstones in the sorted run that + // corresponds to children_[i]. range_tombstone_iters_.empty() means not + // handling range tombstones in merging iterator. range_tombstone_iters_[i] == + // nullptr means the sorted run of children_[i] does not have range + // tombstones. + std::vector<TruncatedRangeDelIterator*> range_tombstone_iters_; + + // Levels (indices into range_tombstone_iters_/children_) that currently have + // "active" range tombstones.
See comments above Seek() for meaning of + // "active". + std::set<size_t> active_; + + bool SkipNextDeleted(); + bool SkipPrevDeleted(); // Cached pointer to child iterator with the current key, or nullptr if no // child iterators are valid. This is the top of minHeap_ or maxHeap_ @@ -329,11 +567,11 @@ // In forward direction, process a child that is not in the min heap. // If valid, add to the min heap. Otherwise, check status. - void AddToMinHeapOrCheckStatus(IteratorWrapper*); + void AddToMinHeapOrCheckStatus(HeapItem*); // In backward direction, process a child that is not in the max heap. // If valid, add to the min heap. Otherwise, check status. - void AddToMaxHeapOrCheckStatus(IteratorWrapper*); + void AddToMaxHeapOrCheckStatus(HeapItem*); void SwitchToForward(); @@ -343,86 +581,619 @@ IteratorWrapper* CurrentForward() const { assert(direction_ == kForward); - return !minHeap_.empty() ? minHeap_.top() : nullptr; + assert(minHeap_.empty() || minHeap_.top()->type == HeapItem::ITERATOR); + return !minHeap_.empty() ? &minHeap_.top()->iter : nullptr; } IteratorWrapper* CurrentReverse() const { assert(direction_ == kReverse); assert(maxHeap_); - return !maxHeap_->empty() ? maxHeap_->top() : nullptr; + assert(maxHeap_->empty() || maxHeap_->top()->type == HeapItem::ITERATOR); + return !maxHeap_->empty() ? &maxHeap_->top()->iter : nullptr; } }; -void MergingIterator::AddToMinHeapOrCheckStatus(IteratorWrapper* child) { - if (child->Valid()) { - assert(child->status().ok()); +// Seek to first key >= target key (internal key) for children_[starting_level:]. +// Cascading seek optimizations are applied if range tombstones are present (see +// comment above Seek() for more). +// +// @param range_tombstone_reseek Whether target is some range tombstone +// end, i.e., whether this SeekImpl() call is a part of a "cascading seek". This +// is used only for recording relevant perf_context. +void MergingIterator::SeekImpl(const Slice& target, size_t starting_level, + bool range_tombstone_reseek) { + // active range tombstones before `starting_level` remain active + ClearHeaps(false /* clear_active */); + ParsedInternalKey pik; + if (!range_tombstone_iters_.empty()) { + // pik is only used in InsertRangeTombstoneToMinHeap(). + ParseInternalKey(target, &pik, false).PermitUncheckedError(); + } + + // TODO: perhaps we could save some upheap cost by adding all child iters first + // and then do a single heapify. + for (size_t level = 0; level < starting_level; ++level) { + PERF_TIMER_GUARD(seek_min_heap_time); + AddToMinHeapOrCheckStatus(&children_[level]); + } + if (!range_tombstone_iters_.empty()) { + // Add range tombstones from levels < starting_level. We can insert from + // pinned_heap_item_ for the following reasons: + // - pinned_heap_item_[level] is in minHeap_ iff + // range_tombstone_iters_[level]->Valid(). + // - If `level` is in active_, then range_tombstone_iters_[level]->Valid() + // and pinned_heap_item_[level] is of type DELETE_RANGE_END. + for (size_t level = 0; level < starting_level; ++level) { + if (range_tombstone_iters_[level] && + range_tombstone_iters_[level]->Valid()) { + assert(static_cast<bool>(active_.count(level)) == + (pinned_heap_item_[level].type == HeapItem::DELETE_RANGE_END)); + minHeap_.push(&pinned_heap_item_[level]); + } else { + assert(!active_.count(level)); + } + } + // levels >= starting_level will be reseeked below, so clear their active + // state here.
+ active_.erase(active_.lower_bound(starting_level), active_.end()); + } + + status_ = Status::OK(); + IterKey current_search_key; + current_search_key.SetInternalKey(target, false /* copy */); + // Seek target might change to some range tombstone end key, so + // we need to remember them for async requests. + // (level, target) pairs + autovector<std::pair<size_t, std::string>> prefetched_target; + for (auto level = starting_level; level < children_.size(); ++level) { + { + PERF_TIMER_GUARD(seek_child_seek_time); + children_[level].iter.Seek(current_search_key.GetInternalKey()); + } + + PERF_COUNTER_ADD(seek_child_seek_count, 1); + + if (!range_tombstone_iters_.empty()) { + if (range_tombstone_reseek) { + // This seek is to some range tombstone end key. + // Should only happen when there are range tombstones. + PERF_COUNTER_ADD(internal_range_del_reseek_count, 1); + } + if (children_[level].iter.status().IsTryAgain()) { + prefetched_target.emplace_back( + level, current_search_key.GetInternalKey().ToString()); + } + auto range_tombstone_iter = range_tombstone_iters_[level]; + if (range_tombstone_iter) { + range_tombstone_iter->Seek(current_search_key.GetUserKey()); + if (range_tombstone_iter->Valid()) { + // insert the range tombstone end that is closer to and >= + // current_search_key. Strictly speaking, since the Seek() call above + // is on user key, it is possible that range_tombstone_iter->end_key() + // < current_search_key. This can happen when range_tombstone_iter is + // truncated and range_tombstone_iter.largest_ has the same user key + // as current_search_key.GetUserKey() but with a larger sequence + // number than current_search_key. Correctness is not affected as this + // tombstone end key will be popped during FindNextVisibleKey(). + InsertRangeTombstoneToMinHeap( + level, comparator_->Compare(range_tombstone_iter->start_key(), + pik) > 0 /* start_key */); + // current_search_key < end_key guaranteed by the Seek() and Valid() + // calls above. Only interested in user key coverage since older + // sorted runs must have smaller sequence numbers than this range + // tombstone. + // + // TODO: range_tombstone_iter->Seek() finds the max covering + // sequence number, can make it cheaper by not looking for max. + if (comparator_->user_comparator()->Compare( + range_tombstone_iter->start_key().user_key, + current_search_key.GetUserKey()) <= 0) { + // Since range_tombstone_iter->Valid(), seqno should be valid, so + // there is no need to check it. + range_tombstone_reseek = true; + // Current target user key is covered by this range tombstone. + // All older sorted runs will seek to range tombstone end key. + // Note that for prefix seek case, it is possible that the prefix + // is not the same as the original target; it should not affect + // correctness. Besides, in most cases, range tombstone start and + // end key should have the same prefix? + current_search_key.SetInternalKey( + range_tombstone_iter->end_key().user_key, kMaxSequenceNumber); + } + } + } + } + // child.iter.status() is set to Status::TryAgain indicating asynchronous + // request for retrieval of data blocks has been submitted. So it should + // return at this point and Seek should be called again to retrieve the + // requested block and add the child to min heap. + if (children_[level].iter.status().IsTryAgain()) { + continue; + } + { + // Strictly, we timed slightly more than min heap operation, + // but these operations are very cheap.
+ PERF_TIMER_GUARD(seek_min_heap_time); + AddToMinHeapOrCheckStatus(&children_[level]); + } + } + + if (range_tombstone_iters_.empty()) { + for (auto& child : children_) { + if (child.iter.status().IsTryAgain()) { + child.iter.Seek(target); + { + PERF_TIMER_GUARD(seek_min_heap_time); + AddToMinHeapOrCheckStatus(&child); + } + PERF_COUNTER_ADD(number_async_seek, 1); + } + } + } else { + for (auto& prefetch : prefetched_target) { + // (level, target) pairs + children_[prefetch.first].iter.Seek(prefetch.second); + { + PERF_TIMER_GUARD(seek_min_heap_time); + AddToMinHeapOrCheckStatus(&children_[prefetch.first]); + } + PERF_COUNTER_ADD(number_async_seek, 1); + } + } +} + +// Returns true iff the current key (min heap top) should not be returned +// to user (of the merging iterator). This can be because the current key +// is deleted by some range tombstone, the current key is some fake file +// boundary sentinel key, or the current key is an end point of a range +// tombstone. Advance the iterator at heap top if needed. Heap order is restored +// and `active_` is updated accordingly. +// See FindNextVisibleKey() for more detail on internal implementation +// of advancing child iters. +// +// REQUIRES: +// - min heap is currently not empty, and iter is in kForward direction. +// - minHeap_ top is not DELETE_RANGE_START (so that `active_` is current). +bool MergingIterator::SkipNextDeleted() { + // 3 types of keys: + // - point key + // - file boundary sentinel keys + // - range deletion end key + auto current = minHeap_.top(); + if (current->type == HeapItem::DELETE_RANGE_END) { + minHeap_.pop(); + active_.erase(current->level); + assert(range_tombstone_iters_[current->level] && + range_tombstone_iters_[current->level]->Valid()); + range_tombstone_iters_[current->level]->Next(); + if (range_tombstone_iters_[current->level]->Valid()) { + InsertRangeTombstoneToMinHeap(current->level); + } + return true /* current key deleted */; + } + if (current->iter.IsDeleteRangeSentinelKey()) { + // If the file boundary is defined by a range deletion, the range + // tombstone's end key must come before this sentinel key (see op_type in + // SetTombstoneKey()). + assert(ExtractValueType(current->iter.key()) != kTypeRangeDeletion || + active_.count(current->level) == 0); + // LevelIterator enters a new SST file + current->iter.Next(); + if (current->iter.Valid()) { + assert(current->iter.status().ok()); + minHeap_.replace_top(current); + } else { + minHeap_.pop(); + } + // Remove last SST file's range tombstone end key if there is one. + // This means file boundary is before range tombstone end key, + // which could happen when a range tombstone and a user key + // straddle two SST files. Note that in TruncatedRangeDelIterator + // constructor, parsed_largest.sequence is decremented 1 in this case. + if (!minHeap_.empty() && minHeap_.top()->level == current->level && + minHeap_.top()->type == HeapItem::DELETE_RANGE_END) { + minHeap_.pop(); + active_.erase(current->level); + } + if (range_tombstone_iters_[current->level] && + range_tombstone_iters_[current->level]->Valid()) { + InsertRangeTombstoneToMinHeap(current->level); + } + return true /* current key deleted */; + } + assert(current->type == HeapItem::ITERATOR); + // Point key case: check active_ for range tombstone coverage. 
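As a much simplified illustration of the coverage check referenced above, the snippet below tests one point key against one tombstone using toy types and hypothetical keys; in the patch the sequence-number comparison is only needed when the tombstone comes from the same level, since a tombstone from a newer level covers by the level invariant.

```
#include <cstdint>
#include <iostream>
#include <string>

// Hypothetical, simplified types; a larger seq means a newer write.
struct ToyKey {
  std::string user_key;
  uint64_t seq;
};
struct ToyTombstone {
  std::string start;  // inclusive user key
  std::string end;    // exclusive user key
  uint64_t seq;       // sequence number the tombstone was written at
};

// A point key is deleted by the tombstone iff its user key falls inside
// [start, end) and the key is older (smaller seq) than the tombstone.
bool Covers(const ToyTombstone& t, const ToyKey& k) {
  return k.user_key >= t.start && k.user_key < t.end && k.seq < t.seq;
}

int main() {
  ToyTombstone t{"b", "e", 100};
  std::cout << std::boolalpha
            << Covers(t, {"c", 50}) << "\n"    // true: in range and older
            << Covers(t, {"c", 150}) << "\n"   // false: in range but newer
            << Covers(t, {"e", 50}) << "\n";   // false: end key is exclusive
  return 0;
}
```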
+ ParsedInternalKey pik; + ParseInternalKey(current->iter.key(), &pik, false).PermitUncheckedError(); + for (auto& i : active_) { + if (i < current->level) { + // range tombstone is from a newer level, definitely covers + assert(comparator_->Compare(range_tombstone_iters_[i]->start_key(), + pik) <= 0); + assert(comparator_->Compare(pik, range_tombstone_iters_[i]->end_key()) < + 0); + std::string target; + AppendInternalKey(&target, range_tombstone_iters_[i]->end_key()); + SeekImpl(target, current->level, true); + return true /* current key deleted */; + } else if (i == current->level) { + // range tombstone is from the same level as current, check sequence + // number. By `active_` we know current key is between start key and end + // key. + assert(comparator_->Compare(range_tombstone_iters_[i]->start_key(), + pik) <= 0); + assert(comparator_->Compare(pik, range_tombstone_iters_[i]->end_key()) < + 0); + if (pik.sequence < range_tombstone_iters_[current->level]->seq()) { + // covered by range tombstone + current->iter.Next(); + if (current->iter.Valid()) { + minHeap_.replace_top(current); + } else { + minHeap_.pop(); + } + return true /* current key deleted */; + } else { + return false /* current key not deleted */; + } + } else { + // range tombstone from an older sorted run with current key < end key. + // current key is not deleted and the older sorted run will have its range + // tombstone updated when the range tombstone's end key is popped from + // minHeap_. + return false /* current key not deleted */; + } + } + // we can reach here only if active_ is empty + assert(active_.empty()); + assert(minHeap_.top()->type == HeapItem::ITERATOR); + return false /* current key not deleted */; +} + +void MergingIterator::SeekForPrevImpl(const Slice& target, + size_t starting_level, + bool range_tombstone_reseek) { + // active range tombstones before `starting_level` remain active + ClearHeaps(false /* clear_active */); + InitMaxHeap(); + ParsedInternalKey pik; + if (!range_tombstone_iters_.empty()) { + ParseInternalKey(target, &pik, false).PermitUncheckedError(); + } + for (size_t level = 0; level < starting_level; ++level) { + PERF_TIMER_GUARD(seek_max_heap_time); + AddToMaxHeapOrCheckStatus(&children_[level]); + } + if (!range_tombstone_iters_.empty()) { + // Add range tombstones before starting_level. + for (size_t level = 0; level < starting_level; ++level) { + if (range_tombstone_iters_[level] && + range_tombstone_iters_[level]->Valid()) { + assert(static_cast<bool>(active_.count(level)) == + (pinned_heap_item_[level].type == HeapItem::DELETE_RANGE_START)); + maxHeap_->push(&pinned_heap_item_[level]); + } else { + assert(!active_.count(level)); + } + } + // levels >= starting_level will be reseeked below, so clear their active + // state here. + active_.erase(active_.lower_bound(starting_level), active_.end()); + } + + status_ = Status::OK(); + IterKey current_search_key; + current_search_key.SetInternalKey(target, false /* copy */); + // Seek target might change to some range tombstone start key, so + // we need to remember them for async requests. + // (level, target) pairs + autovector<std::pair<size_t, std::string>> prefetched_target; + for (auto level = starting_level; level < children_.size(); ++level) { + { + PERF_TIMER_GUARD(seek_child_seek_time); + children_[level].iter.SeekForPrev(current_search_key.GetInternalKey()); + } + + PERF_COUNTER_ADD(seek_child_seek_count, 1); + + if (!range_tombstone_iters_.empty()) { + if (range_tombstone_reseek) { + // This seek is to some range tombstone start key. + // Should only happen when there are range tombstones.
+ PERF_COUNTER_ADD(internal_range_del_reseek_count, 1); + } + if (children_[level].iter.status().IsTryAgain()) { + prefetched_target.emplace_back( + level, current_search_key.GetInternalKey().ToString()); + } + auto range_tombstone_iter = range_tombstone_iters_[level]; + if (range_tombstone_iter) { + range_tombstone_iter->SeekForPrev(current_search_key.GetUserKey()); + if (range_tombstone_iter->Valid()) { + InsertRangeTombstoneToMaxHeap( + level, comparator_->Compare(range_tombstone_iter->end_key(), + pik) <= 0 /* end_key */); + // start key <= current_search_key guaranteed by the Seek() call above + // Only interested in user key coverage since older sorted runs must + // have smaller sequence numbers than this tombstone. + if (comparator_->user_comparator()->Compare( + current_search_key.GetUserKey(), + range_tombstone_iter->end_key().user_key) < 0) { + range_tombstone_reseek = true; + // covered by this range tombstone + current_search_key.SetInternalKey( + range_tombstone_iter->start_key().user_key, kMaxSequenceNumber, + kValueTypeForSeekForPrev); + } + } + } + } + // child.iter.status() is set to Status::TryAgain indicating asynchronous + // request for retrieval of data blocks has been submitted. So it should + // return at this point and Seek should be called again to retrieve the + // requested block and add the child to min heap. + if (children_[level].iter.status().IsTryAgain()) { + continue; + } + { + // Strictly, we timed slightly more than min heap operation, + // but these operations are very cheap. + PERF_TIMER_GUARD(seek_max_heap_time); + AddToMaxHeapOrCheckStatus(&children_[level]); + } + } + + if (range_tombstone_iters_.empty()) { + for (auto& child : children_) { + if (child.iter.status().IsTryAgain()) { + child.iter.SeekForPrev(target); + { + PERF_TIMER_GUARD(seek_min_heap_time); + AddToMaxHeapOrCheckStatus(&child); + } + PERF_COUNTER_ADD(number_async_seek, 1); + } + } + } else { + for (auto& prefetch : prefetched_target) { + // (level, target) pairs + children_[prefetch.first].iter.SeekForPrev(prefetch.second); + { + PERF_TIMER_GUARD(seek_max_heap_time); + AddToMaxHeapOrCheckStatus(&children_[prefetch.first]); + } + PERF_COUNTER_ADD(number_async_seek, 1); + } + } +} + +// See more in comments above SkipNextDeleted(). +// REQUIRES: +// - max heap is currently not empty, and iter is in kReverse direction. +// - maxHeap_ top is not DELETE_RANGE_END (so that `active_` is current). +bool MergingIterator::SkipPrevDeleted() { + // 3 types of keys: + // - point key + // - file boundary sentinel keys + // - range deletion start key + auto current = maxHeap_->top(); + if (current->type == HeapItem::DELETE_RANGE_START) { + maxHeap_->pop(); + active_.erase(current->level); + assert(range_tombstone_iters_[current->level] && + range_tombstone_iters_[current->level]->Valid()); + range_tombstone_iters_[current->level]->Prev(); + if (range_tombstone_iters_[current->level]->Valid()) { + InsertRangeTombstoneToMaxHeap(current->level); + } + return true /* current key deleted */; + } + if (current->iter.IsDeleteRangeSentinelKey()) { + // Different from SkipNextDeleted(): range tombstone start key is before + // file boundary due to op_type set in SetTombstoneKey(). 
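The ordering relied on here can be sketched with a toy comparator: internal keys sort by user key ascending and then by sequence number descending, so a newer endpoint at the same user key is seen first; exact ties (same user key and sequence number) are what the op_type tie-break mentioned above resolves. The struct and keys below are hypothetical.

```
#include <cstdint>
#include <iostream>
#include <string>
#include <tuple>

// Toy internal-key ordering: user key ascending, then sequence number
// descending (newer first). Not RocksDB's InternalKeyComparator; the patch
// additionally breaks exact ties using the key's op_type.
struct ToyInternalKey {
  std::string user_key;
  uint64_t seq;
};

bool Before(const ToyInternalKey& a, const ToyInternalKey& b) {
  // Comparing (a.user_key, b.seq) against (b.user_key, a.seq) sorts equal
  // user keys by descending sequence number.
  return std::tie(a.user_key, b.seq) < std::tie(b.user_key, a.seq);
}

int main() {
  ToyInternalKey boundary{"d", 7};         // file's largest key, e.g. d@7
  ToyInternalKey tombstone_start{"d", 9};  // tombstone endpoint at d@9 (newer)
  // With these toy keys the newer entry at the same user key sorts first, so
  // the tombstone endpoint would be processed before the boundary sentinel.
  std::cout << std::boolalpha
            << Before(tombstone_start, boundary) << "\n"   // true
            << Before(boundary, tombstone_start) << "\n";  // false
  return 0;
}
```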
+ assert(ExtractValueType(current->iter.key()) != kTypeRangeDeletion || + active_.count(current->level)); + // LevelIterator enters a new SST file + current->iter.Prev(); + if (current->iter.Valid()) { + assert(current->iter.status().ok()); + maxHeap_->replace_top(current); + } else { + maxHeap_->pop(); + } + if (!maxHeap_->empty() && maxHeap_->top()->level == current->level && + maxHeap_->top()->type == HeapItem::DELETE_RANGE_START) { + maxHeap_->pop(); + active_.erase(current->level); + } + if (range_tombstone_iters_[current->level] && + range_tombstone_iters_[current->level]->Valid()) { + InsertRangeTombstoneToMaxHeap(current->level); + } + return true /* current key deleted */; + } + assert(current->type == HeapItem::ITERATOR); + // Point key case: check active_ for range tombstone coverage. + ParsedInternalKey pik; + ParseInternalKey(current->iter.key(), &pik, false).PermitUncheckedError(); + for (auto& i : active_) { + if (i < current->level) { + // range tombstone is from a newer level, definitely covers + assert(comparator_->Compare(range_tombstone_iters_[i]->start_key(), + pik) <= 0); + assert(comparator_->Compare(pik, range_tombstone_iters_[i]->end_key()) < + 0); + std::string target; + AppendInternalKey(&target, range_tombstone_iters_[i]->start_key()); + // This is different from SkipNextDeleted() which does reseek at sorted + // runs >= level (instead of i+1 here). With min heap, if level L is at top + // of the heap, then levels <L all have internal keys > level L's current + // internal key, which means levels <L are already at a different user key. + // With max heap, if level L is at top of the heap, then levels <L all have + // internal keys < level L's current internal key, which means levels <L may + // still be at the same user key. + SeekForPrevImpl(target, i + 1, true); + return true /* current key deleted */; + } else if (i == current->level) { + // By `active_` we know current key is between start key and end key. + assert(comparator_->Compare(range_tombstone_iters_[i]->start_key(), + pik) <= 0); + assert(comparator_->Compare(pik, range_tombstone_iters_[i]->end_key()) < + 0); + if (pik.sequence < range_tombstone_iters_[current->level]->seq()) { + current->iter.Prev(); + if (current->iter.Valid()) { + maxHeap_->replace_top(current); + } else { + maxHeap_->pop(); + } + return true /* current key deleted */; + } else { + return false /* current key not deleted */; + } + } else { + return false /* current key not deleted */; + } + } + assert(active_.empty()); + assert(maxHeap_->top()->type == HeapItem::ITERATOR); + return false /* current key not deleted */; +} + +void MergingIterator::AddToMinHeapOrCheckStatus(HeapItem* child) { + if (child->iter.Valid()) { + assert(child->iter.status().ok()); minHeap_.push(child); } else { - considerStatus(child->status()); + considerStatus(child->iter.status()); } } -void MergingIterator::AddToMaxHeapOrCheckStatus(IteratorWrapper* child) { - if (child->Valid()) { - assert(child->status().ok()); +void MergingIterator::AddToMaxHeapOrCheckStatus(HeapItem* child) { + if (child->iter.Valid()) { + assert(child->iter.status().ok()); maxHeap_->push(child); } else { - considerStatus(child->status()); + considerStatus(child->iter.status()); } } +// Advance all non-current_ children to > current_.key(). +// We advance current_ after this function call as it does not require +// Seek(). +// Advance all range tombstone iters, including the one corresponding to +// current_, to the first tombstone with end_key > current_.key(). +// TODO: potentially do cascading seek here too void MergingIterator::SwitchToForward() { - // Otherwise, advance the non-current children. We advance current_ - // just after the if-block.
ClearHeaps(); Slice target = key(); for (auto& child : children_) { - if (&child != current_) { - child.Seek(target); - // child.status() is set to Status::TryAgain indicating asynchronous + if (&child.iter != current_) { + child.iter.Seek(target); + // child.iter.status() is set to Status::TryAgain indicating asynchronous // request for retrieval of data blocks has been submitted. So it should // return at this point and Seek should be called again to retrieve the // requested block and add the child to min heap. - if (child.status() == Status::TryAgain()) { + if (child.iter.status() == Status::TryAgain()) { continue; } - if (child.Valid() && comparator_->Equal(target, child.key())) { - assert(child.status().ok()); - child.Next(); + if (child.iter.Valid() && comparator_->Equal(target, child.key())) { + assert(child.iter.status().ok()); + child.iter.Next(); } } AddToMinHeapOrCheckStatus(&child); } for (auto& child : children_) { - if (child.status() == Status::TryAgain()) { - child.Seek(target); - if (child.Valid() && comparator_->Equal(target, child.key())) { - assert(child.status().ok()); - child.Next(); + if (child.iter.status() == Status::TryAgain()) { + child.iter.Seek(target); + if (child.iter.Valid() && comparator_->Equal(target, child.key())) { + assert(child.iter.status().ok()); + child.iter.Next(); } AddToMinHeapOrCheckStatus(&child); } } + // Current range tombstone iter also needs to seek for the following case: + // Previous direction is backward, so range tombstone iter may point to a + // tombstone before current_. If there is no such tombstone, then the range + // tombstone iter is !Valid(). Need to reseek here to make it valid again. + if (!range_tombstone_iters_.empty()) { + ParsedInternalKey pik; + ParseInternalKey(target, &pik, false /* log_err_key */) + .PermitUncheckedError(); + for (size_t i = 0; i < range_tombstone_iters_.size(); ++i) { + auto iter = range_tombstone_iters_[i]; + if (iter) { + iter->Seek(pik.user_key); + // The while loop is needed as the Seek() call above is only for user + // key. We could have a range tombstone with end_key covering user_key, + // but still is smaller than target. This happens when the range + // tombstone is truncated at iter.largest_. + while (iter->Valid() && + comparator_->Compare(iter->end_key(), pik) <= 0) { + iter->Next(); + } + if (range_tombstone_iters_[i]->Valid()) { + InsertRangeTombstoneToMinHeap( + i, comparator_->Compare(range_tombstone_iters_[i]->start_key(), + pik) > 0 /* start_key */); + } + } + } + } + direction_ = kForward; + assert(current_ == CurrentForward()); } +// Advance all range tombstones iters, including the one corresponding to +// current_, to the first tombstone with start_key <= current_.key(). 
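A minimal sketch of the repositioning described in the comment above, done against a toy sorted list of tombstones instead of the patch's TruncatedRangeDelIterator; the helper name and keys are invented.

```
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Toy repositioning for the reverse direction: find the last tombstone whose
// start key is <= the target user key.
struct ToyTombstone {
  std::string start;  // inclusive
  std::string end;    // exclusive
};

// Returns the index of the last tombstone with start <= target, or -1.
int SeekForPrevTombstone(const std::vector<ToyTombstone>& ts,
                         const std::string& target) {
  auto it = std::upper_bound(
      ts.begin(), ts.end(), target,
      [](const std::string& t, const ToyTombstone& x) { return t < x.start; });
  return static_cast<int>(it - ts.begin()) - 1;
}

int main() {
  std::vector<ToyTombstone> ts = {{"b", "d"}, {"f", "h"}, {"k", "m"}};
  std::cout << SeekForPrevTombstone(ts, "g") << "\n";  // 1: [f, h) starts <= g
  std::cout << SeekForPrevTombstone(ts, "e") << "\n";  // 0: [b, d) is the last
  std::cout << SeekForPrevTombstone(ts, "a") << "\n";  // -1: none starts <= a
  return 0;
}
```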
void MergingIterator::SwitchToBackward() { ClearHeaps(); InitMaxHeap(); Slice target = key(); for (auto& child : children_) { - if (&child != current_) { - child.SeekForPrev(target); + if (&child.iter != current_) { + child.iter.SeekForPrev(target); TEST_SYNC_POINT_CALLBACK("MergeIterator::Prev:BeforePrev", &child); - if (child.Valid() && comparator_->Equal(target, child.key())) { - assert(child.status().ok()); - child.Prev(); + if (child.iter.Valid() && comparator_->Equal(target, child.key())) { + assert(child.iter.status().ok()); + child.iter.Prev(); } } AddToMaxHeapOrCheckStatus(&child); } + + ParsedInternalKey pik; + ParseInternalKey(target, &pik, false /* log_err_key */) + .PermitUncheckedError(); + for (size_t i = 0; i < range_tombstone_iters_.size(); ++i) { + auto iter = range_tombstone_iters_[i]; + if (iter) { + iter->SeekForPrev(pik.user_key); + // Since the SeekForPrev() call above is only for user key, + // we may end up with some range tombstone with start key having the + // same user key as current_, but with a smaller sequence number. This + // makes current_ not at maxHeap_ top for the CurrentReverse() call + // below. If there is a range tombstone start key with the same user + // key and the same sequence number as current_.key(), it will be fine as + // in InsertRangeTombstoneToMaxHeap() we change op_type to be the smallest + // op_type. + while (iter->Valid() && + comparator_->Compare(iter->start_key(), pik) > 0) { + iter->Prev(); + } + if (iter->Valid()) { + InsertRangeTombstoneToMaxHeap( + i, comparator_->Compare(range_tombstone_iters_[i]->end_key(), + pik) <= 0 /* end_key */); + } + } + } + direction_ = kReverse; if (!prefix_seek_mode_) { // Note that we don't do assert(current_ == CurrentReverse()) here @@ -434,16 +1205,45 @@ assert(current_ == CurrentReverse()); } -void MergingIterator::ClearHeaps() { +void MergingIterator::ClearHeaps(bool clear_active) { minHeap_.clear(); if (maxHeap_) { maxHeap_->clear(); } + if (clear_active) { + active_.clear(); + } } void MergingIterator::InitMaxHeap() { if (!maxHeap_) { - maxHeap_.reset(new MergerMaxIterHeap(comparator_)); + maxHeap_ = std::make_unique<MergerMaxIterHeap>(comparator_); + } +} + +// Repeatedly check and remove the heap top key until it is a point key +// that is not covered by any range tombstone. SeekImpl() is called to seek to end +// of a range tombstone if the heap top is a point key covered by some range +// tombstone from a newer sorted run. If the covering tombstone is from current +// key's level, then the current child iterator is simply advanced to its next +// key without reseeking. +inline void MergingIterator::FindNextVisibleKey() { + // When active_ is empty, we know heap top cannot be a range tombstone end + // key. It cannot be a range tombstone start key per PopDeleteRangeStart().
+ PopDeleteRangeStart(); + while (!minHeap_.empty() && + (!active_.empty() || minHeap_.top()->IsDeleteRangeSentinelKey()) && + SkipNextDeleted()) { + PopDeleteRangeStart(); + } +} + +inline void MergingIterator::FindPrevVisibleKey() { + PopDeleteRangeEnd(); + while (!maxHeap_->empty() && + (!active_.empty() || maxHeap_->top()->IsDeleteRangeSentinelKey()) && + SkipPrevDeleted()) { + PopDeleteRangeEnd(); } } @@ -495,12 +1295,41 @@ void MergeIteratorBuilder::AddIterator(InternalIterator* iter) { } } -InternalIterator* MergeIteratorBuilder::Finish() { +void MergeIteratorBuilder::AddRangeTombstoneIterator( + TruncatedRangeDelIterator* iter, TruncatedRangeDelIterator*** iter_ptr) { + if (!use_merging_iter) { + use_merging_iter = true; + if (first_iter) { + merge_iter->AddIterator(first_iter); + first_iter = nullptr; + } + } + merge_iter->AddRangeTombstoneIterator(iter); + if (iter_ptr) { + // This is needed instead of setting to &range_tombstone_iters_[i] directly + // here since the memory address of range_tombstone_iters_[i] might change + // during vector resizing. + range_del_iter_ptrs_.emplace_back( + merge_iter->range_tombstone_iters_.size() - 1, iter_ptr); + } +} + +InternalIterator* MergeIteratorBuilder::Finish(ArenaWrappedDBIter* db_iter) { InternalIterator* ret = nullptr; if (!use_merging_iter) { ret = first_iter; first_iter = nullptr; } else { + for (auto& p : range_del_iter_ptrs_) { + *(p.second) = &(merge_iter->range_tombstone_iters_[p.first]); + } + if (db_iter) { + assert(!merge_iter->range_tombstone_iters_.empty()); + // memtable is always the first level + db_iter->SetMemtableRangetombstoneIter( + &merge_iter->range_tombstone_iters_.front()); + } + merge_iter->Finish(); ret = merge_iter; merge_iter = nullptr; } diff --git a/table/merging_iterator.h b/table/merging_iterator.h index 050892292..35cd95c24 100644 --- a/table/merging_iterator.h +++ b/table/merging_iterator.h @@ -9,12 +9,14 @@ #pragma once +#include "db/range_del_aggregator.h" #include "rocksdb/slice.h" #include "rocksdb/types.h" namespace ROCKSDB_NAMESPACE { class Arena; +class ArenaWrappedDBIter; class InternalKeyComparator; template @@ -47,18 +49,37 @@ class MergeIteratorBuilder { // Add iter to the merging iterator. void AddIterator(InternalIterator* iter); + // Add a range tombstone iterator to underlying merge iterator. + // See MergingIterator::AddRangeTombstoneIterator() for more detail. + // + // If `iter_ptr` is not nullptr, *iter_ptr will be set to where the merging + // iterator stores `iter` when MergeIteratorBuilder::Finish() is called. This + // is used by level iterator to update range tombstone iters when switching to + // a different SST file. + void AddRangeTombstoneIterator( + TruncatedRangeDelIterator* iter, + TruncatedRangeDelIterator*** iter_ptr = nullptr); + // Get arena used to build the merging iterator. It is called one a child // iterator needs to be allocated. Arena* GetArena() { return arena; } // Return the result merging iterator. - InternalIterator* Finish(); + // If db_iter is not nullptr, then db_iter->SetMemtableRangetombstoneIter() + // will be called with pointer to where the merging iterator + // stores the memtable range tombstone iterator. + // This is used for DB iterator to refresh memtable range tombstones. + InternalIterator* Finish(ArenaWrappedDBIter* db_iter = nullptr); private: MergingIterator* merge_iter; InternalIterator* first_iter; bool use_merging_iter; Arena* arena; + // Used to set LevelIterator.range_tombstone_iter_. 
+ // See AddRangeTombstoneIterator() implementation for more detail. + std::vector<std::pair<size_t, TruncatedRangeDelIterator***>> + range_del_iter_ptrs_; }; } // namespace ROCKSDB_NAMESPACE diff --git a/utilities/debug.cc b/utilities/debug.cc index a33c95a6c..f2c3bb513 100644 --- a/utilities/debug.cc +++ b/utilities/debug.cc @@ -78,11 +78,9 @@ Status GetAllKeyVersions(DB* db, ColumnFamilyHandle* cfh, Slice begin_key, DBImpl* idb = static_cast<DBImpl*>(db->GetRootDB()); auto icmp = InternalKeyComparator(idb->GetOptions(cfh).comparator); ReadOptions read_options; - ReadRangeDelAggregator range_del_agg(&icmp, - kMaxSequenceNumber /* upper_bound */); Arena arena; - ScopedArenaIterator iter(idb->NewInternalIterator( - read_options, &arena, &range_del_agg, kMaxSequenceNumber, cfh)); + ScopedArenaIterator iter( + idb->NewInternalIterator(read_options, &arena, kMaxSequenceNumber, cfh)); if (!begin_key.empty()) { InternalKey ikey;