Improve performance of long range scans with readahead

Summary:
This change improves the performance of iterators doing long range scans (e.g., big/full table scans in MyRocks) by using readahead to prefetch additional data on each disk IO. Prefetching is automatically enabled once more than 2 IOs are observed for the same table file during an iteration. The readahead size starts at 8 KB and is doubled on each additional sequential IO, up to a maximum of 256 KB. This cuts down the number of IOs needed to complete the range scan.
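
For illustration only, the ramp-up policy can be sketched in isolation as below. This is a minimal standalone sketch, not the RocksDB code (the real change lives in `BlockEntryIteratorState`, shown in the diff further down); the `ReadaheadPolicy` class, the `OnBlockRead` hook, and the simulated scan in `main` are hypothetical stand-ins.
```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Hypothetical stand-in for the per-iterator readahead state that the diff
// below adds to BlockEntryIteratorState; the constants mirror the real change.
class ReadaheadPolicy {
 public:
  static constexpr size_t kInitReadaheadSize = 8 * 1024;   // start at 8 KB
  static constexpr size_t kMaxReadaheadSize = 256 * 1024;  // cap at 256 KB

  // Called once per data-block read. Returns how many bytes to prefetch
  // starting at `offset`, or 0 if no readahead should be issued.
  size_t OnBlockRead(uint64_t offset, size_t block_size) {
    if (num_file_reads_ < 2) {
      // Do nothing for the first two reads, so short scans pay no cost.
      num_file_reads_++;
      return 0;
    }
    if (offset + block_size <= readahead_limit_) {
      // Still covered by a previous prefetch; nothing to do.
      return 0;
    }
    num_file_reads_++;
    // Never request more than kMaxReadaheadSize at once.
    readahead_size_ = std::min(kMaxReadaheadSize, readahead_size_);
    size_t request = readahead_size_;
    readahead_limit_ = offset + request;
    // Double the next request: 8 KB -> 16 KB -> ... -> 256 KB.
    readahead_size_ *= 2;
    return request;
  }

 private:
  size_t readahead_size_ = kInitReadaheadSize;
  uint64_t readahead_limit_ = 0;
  int num_file_reads_ = 0;
};

int main() {
  ReadaheadPolicy policy;
  // Simulate a sequential scan over 4 KB blocks and print each prefetch.
  for (uint64_t offset = 0; offset < 2 * 1024 * 1024; offset += 4 * 1024) {
    size_t n = policy.OnBlockRead(offset, 4 * 1024);
    if (n > 0) {
      std::printf("prefetch(offset=%llu, size=%zu)\n",
                  static_cast<unsigned long long>(offset), n);
    }
  }
  return 0;
}
```
On this simulated scan the printed prefetch requests grow 8 KB, 16 KB, 32 KB, ... and then stay at 256 KB once the cap is reached, which is the progression the summary describes.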

Constraints:
- The prefetched data is stored by the OS in the page cache, so this currently works only for non-direct-read use cases, i.e., applications that use the page cache. (Direct I/O support will be enabled in a later PR.)
- This is currently enabled only when ReadOptions.readahead_size = 0 (the default value); see the usage sketch below.
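
To make the second constraint concrete, here is a small usage sketch assuming the standard RocksDB C++ API; the database path and the commented-out value are placeholders.
```cpp
#include <cassert>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
  rocksdb::DB* db = nullptr;
  rocksdb::Options options;
  options.create_if_missing = true;
  // Placeholder path for illustration.
  rocksdb::Status s =
      rocksdb::DB::Open(options, "/tmp/readahead_example", &db);
  assert(s.ok());

  rocksdb::ReadOptions read_options;
  // readahead_size defaults to 0, so the automatic readahead described above
  // kicks in once an iterator does more than 2 IOs on the same table file.
  // Setting it to a non-zero value disables the automatic ramp-up:
  // read_options.readahead_size = 2 * 1024 * 1024;  // example fixed size

  rocksdb::Iterator* it = db->NewIterator(read_options);
  for (it->SeekToFirst(); it->Valid(); it->Next()) {
    // Long range scan; sequential data-block reads benefit from readahead.
  }
  assert(it->status().ok());

  delete it;
  delete db;
  return 0;
}
```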

Thanks to siying for the original idea and implementation.

**Benchmarks:**
Data fill:
```
TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=fillrandom -num=1000000000 -compression_type="none" -level_compaction_dynamic_level_bytes
```
Do a long range scan: seekrandom with a large number of nexts
```
TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=seekrandom -duration=60 -num=1000000000 -use_existing_db -seek_nexts=10000 -statistics -histogram
```

Page cache was cleared before each experiment with the command:
```
sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
```
```
Before:
seekrandom   :   34020.945 micros/op 29 ops/sec;   32.5 MB/s (1636 of 1999 found)
With this change:
seekrandom   :    8726.912 micros/op 114 ops/sec;  126.8 MB/s (5702 of 6999 found)
```
~3.9x performance improvement (34020.945 / 8726.912 ≈ 3.9 in per-op latency, matching the 126.8 / 32.5 ≈ 3.9 gain in throughput).

Also verified with strace and gdb that the readahead size increases as expected:
```
strace -e readahead -f -T -t -p <db_bench process pid>
```
Closes https://github.com/facebook/rocksdb/pull/3282

Differential Revision: D6586477

Pulled By: sagar0

fbshipit-source-id: 8a118a0ed4594fbb7f5b1cafb242d7a4033cb58c
Commit d938226af4 (parent 65d431639b), authored by Sagar Vemuri, committed by Facebook GitHub Bot.

Files changed:
- HISTORY.md (4 lines)
- table/block_based_table_reader.cc (25 lines)
- table/block_based_table_reader.h (8 lines)

```diff
diff --git a/HISTORY.md b/HISTORY.md
@@ -2,6 +2,10 @@
 ## Unreleased
 ### Public API Change
 * Iterator::SeekForPrev is now a pure virtual method. This is to prevent user who implement the Iterator interface fail to implement SeekForPrev by mistake.
+### New Features
+* Improve the performance of iterators doing long range scans by using readahead.
 ### Bug Fixes
 * Fix `DisableFileDeletions()` followed by `GetSortedWalFiles()` to not return obsolete WAL files that `PurgeObsoleteFiles()` is going to delete.
 * Fix DB::Flush() keep waiting after flush finish under certain condition.
```

```diff
diff --git a/table/block_based_table_reader.cc b/table/block_based_table_reader.cc
@@ -1594,6 +1594,9 @@ BlockBasedTable::BlockEntryIteratorState::BlockEntryIteratorState(
       is_index_(is_index),
       block_map_(block_map) {}

+const size_t BlockBasedTable::BlockEntryIteratorState::kMaxReadaheadSize =
+    256 * 1024;
+
 InternalIterator*
 BlockBasedTable::BlockEntryIteratorState::NewSecondaryIterator(
     const Slice& index_value) {
@@ -1618,6 +1621,28 @@ BlockBasedTable::BlockEntryIteratorState::NewSecondaryIterator(
           &rep->internal_comparator, nullptr, true, rep->ioptions.statistics);
     }
   }
+
+  // Automatically prefetch additional data when a range scan (iterator) does
+  // more than 2 sequential IOs. This is enabled only when
+  // ReadOptions.readahead_size is 0.
+  if (read_options_.readahead_size == 0) {
+    if (num_file_reads_ < 2) {
+      num_file_reads_++;
+    } else if (handle.offset() + static_cast<size_t>(handle.size()) +
+                   kBlockTrailerSize >
+               readahead_limit_) {
+      num_file_reads_++;
+      // Do not readahead more than kMaxReadaheadSize.
+      readahead_size_ =
+          std::min(BlockBasedTable::BlockEntryIteratorState::kMaxReadaheadSize,
+                   readahead_size_);
+      table_->rep_->file->Prefetch(handle.offset(), readahead_size_);
+      readahead_limit_ = handle.offset() + readahead_size_;
+      // Keep exponentially increasing readahead size until kMaxReadaheadSize.
+      readahead_size_ *= 2;
+    }
+  }
+
   return NewDataBlockIterator(rep, read_options_, handle,
                               /* input_iter */ nullptr, is_index_,
                               /* get_context */ nullptr, s);
```

```diff
diff --git a/table/block_based_table_reader.h b/table/block_based_table_reader.h
@@ -376,6 +376,14 @@ class BlockBasedTable::BlockEntryIteratorState : public TwoLevelIteratorState {
   bool is_index_;
   std::unordered_map<uint64_t, CachableEntry<Block>>* block_map_;
   port::RWMutex cleaner_mu;
+
+  static const size_t kInitReadaheadSize = 8 * 1024;
+  // Found that 256 KB readahead size provides the best performance, based on
+  // experiments.
+  static const size_t kMaxReadaheadSize;
+  size_t readahead_size_ = kInitReadaheadSize;
+  size_t readahead_limit_ = 0;
+  int num_file_reads_ = 0;
 };

 // CachableEntry represents the entries that *may* be fetched from block cache.
```
