You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
rocksdb/db/range_tombstone_fragmenter.h

104 lines
3.6 KiB

Use only "local" range tombstones during Get (#4449) Summary: Previously, range tombstones were accumulated from every level, which was necessary if a range tombstone in a higher level covered a key in a lower level. However, RangeDelAggregator::AddTombstones's complexity is based on the number of tombstones that are currently stored in it, which is wasteful in the Get case, where we only need to know the highest sequence number of range tombstones that cover the key from higher levels, and compute the highest covering sequence number at the current level. This change introduces this optimization, and removes the use of RangeDelAggregator from the Get path. In the benchmark results, the following command was used to initialize the database: ``` ./db_bench -db=/dev/shm/5k-rts -use_existing_db=false -benchmarks=filluniquerandom -write_buffer_size=1048576 -compression_type=lz4 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -value_size=112 -key_size=16 -block_size=4096 -level_compaction_dynamic_level_bytes=true -num=5000000 -max_background_jobs=12 -benchmark_write_rate_limit=20971520 -range_tombstone_width=100 -writes_per_range_tombstone=100 -max_num_range_tombstones=50000 -bloom_bits=8 ``` ...and the following command was used to measure read throughput: ``` ./db_bench -db=/dev/shm/5k-rts/ -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=5000000 -reads=100000 -threads=32 ``` The filluniquerandom command was only run once, and the resulting database was used to measure read performance before and after the PR. Both binaries were compiled with `DEBUG_LEVEL=0`. Readrandom results before PR: ``` readrandom : 4.544 micros/op 220090 ops/sec; 16.9 MB/s (63103 of 100000 found) ``` Readrandom results after PR: ``` readrandom : 11.147 micros/op 89707 ops/sec; 6.9 MB/s (63103 of 100000 found) ``` So it's actually slower right now, but this PR paves the way for future optimizations (see #4493). ---- Pull Request resolved: https://github.com/facebook/rocksdb/pull/4449 Differential Revision: D10370575 Pulled By: abhimadan fbshipit-source-id: 9a2e152be1ef36969055c0e9eb4beb0d96c11f4d
6 years ago
// Copyright (c) 2018-present, Facebook, Inc. All rights reserved.
// This source code is licensed under both the GPLv2 (found in the
// COPYING file in the root directory) and Apache 2.0 License
// (found in the LICENSE.Apache file in the root directory).
#pragma once
#include <list>
#include <memory>
#include <string>
#include <vector>
#include "db/dbformat.h"
#include "db/pinned_iterators_manager.h"
#include "rocksdb/status.h"
#include "table/internal_iterator.h"
namespace rocksdb {
// FragmentedRangeTombstoneIterator converts an InternalIterator of a range-del
// meta block into an iterator over non-overlapping tombstone fragments. The
// tombstone fragmentation process should be more efficient than the range
// tombstone collapsing algorithm in RangeDelAggregator because this leverages
// the internal key ordering already provided by the input iterator, if
// applicable (when the iterator is unsorted, a new sorted iterator is created
// before proceeding). If there are few overlaps, creating a
// FragmentedRangeTombstoneIterator should be O(n), while the RangeDelAggregator
// tombstone collapsing is always O(n log n).
class FragmentedRangeTombstoneIterator : public InternalIterator {
public:
FragmentedRangeTombstoneIterator(
std::unique_ptr<InternalIterator> unfragmented_tombstones,
const InternalKeyComparator& icmp, SequenceNumber snapshot);
void SeekToFirst() override;
void SeekToLast() override;
void Seek(const Slice& target) override;
void SeekForPrev(const Slice& target) override;
void Next() override;
void Prev() override;
bool Valid() const override;
Slice key() const override {
MaybePinKey();
return current_start_key_.Encode();
}
Slice value() const override { return pos_->end_key_; }
bool IsKeyPinned() const override { return false; }
bool IsValuePinned() const override { return true; }
Status status() const override { return Status::OK(); }
Slice user_key() const { return pos_->start_key_; }
SequenceNumber seq() const { return pos_->seq_; }
private:
struct FragmentedRangeTombstoneComparator {
explicit FragmentedRangeTombstoneComparator(const Comparator* c) : cmp(c) {}
bool operator()(const RangeTombstone& a, const RangeTombstone& b) const {
int user_key_cmp = cmp->Compare(a.start_key_, b.start_key_);
if (user_key_cmp != 0) {
return user_key_cmp < 0;
}
return a.seq_ > b.seq_;
}
const Comparator* cmp;
};
void MaybePinKey() const {
if (pos_ != tombstones_.end() && pinned_pos_ != pos_) {
current_start_key_.Set(pos_->start_key_, pos_->seq_, kTypeRangeDeletion);
pinned_pos_ = pos_;
}
}
void ParseKey(ParsedInternalKey* parsed) const {
parsed->user_key = pos_->start_key_;
parsed->sequence = pos_->seq_;
parsed->type = kTypeRangeDeletion;
}
// Given an ordered range tombstone iterator unfragmented_tombstones,
// "fragment" the tombstones into non-overlapping pieces, and store them in
// tombstones_.
void FragmentTombstones(
std::unique_ptr<InternalIterator> unfragmented_tombstones,
SequenceNumber snapshot);
const FragmentedRangeTombstoneComparator tombstone_cmp_;
const InternalKeyComparator* icmp_;
const Comparator* ucmp_;
std::vector<RangeTombstone> tombstones_;
std::list<std::string> pinned_slices_;
std::vector<RangeTombstone>::const_iterator pos_;
mutable std::vector<RangeTombstone>::const_iterator pinned_pos_;
mutable InternalKey current_start_key_;
PinnedIteratorsManager pinned_iters_mgr_;
};
SequenceNumber MaxCoveringTombstoneSeqnum(
FragmentedRangeTombstoneIterator* tombstone_iter, const Slice& key,
const Comparator* ucmp);
} // namespace rocksdb