Fix higher read qps during db open caused by pr 11406 (#11516)

Summary:
**Context:**
[PR11406](https://github.com/facebook/rocksdb/pull/11406/) caused more frequent read during db open reading files with no `tail_size` in the manifest as part of the upgrade to 11406. This is due to that PR introduced
- [smaller](https://github.com/facebook/rocksdb/pull/11406/files#diff-57ed8c49db2bdd4db7618646a177397674bbf25beacacecb104070071d30129fR833) prefetch tail buffer size compared to pre-11406 for small files (< 52 MB) when `tail_prefetch_stats` infers tail size to be 0 (usually happens when the stats does not have much historical data to infer early on)
-  more read (up to # of partitioned filter/index) when such small prefetch tail buffer does not contain all the partitioned filter/index needed in CacheDependencies() since the [fallback logic](https://github.com/facebook/rocksdb/pull/11406/files#diff-d98f1a83de24412ad7f3527725dae7e28851c7222622c3cdb832d3cdf24bbf9fR165-R179) that prefetches all partitions at once will be [skipped](url) when such a small prefetch tail buffer is passed in

**Summary:**
- Revert the fallback prefetch buffer size change to preserve existing behavior fully during upgrading in `BlockBasedTable::PrefetchTail()`
- Use passed-in prefetch tail buffer in `CacheDependencies()` only if it has a smaller offset than the the offset of first partition filter/index, that is, at least as good as the existing prefetching behavior

Pull Request resolved: https://github.com/facebook/rocksdb/pull/11516

Test Plan:
- db bench

Create db with small files prior to PR 11406
```
./db_bench -db=/tmp/testdb/ --partition_index_and_filters=1 --statistics=1 -benchmarks=fillseq -key_size=3200 -value_size=5 -num=1000000 -write_buffer_size=6550000 -disable_auto_compactions=true -compression_type=zstd`
```
Read db to see if post-pr has lower read qps (i.e, rocksdb.file.read.db.open.micros count) during db open.
```
./db_bench -use_direct_reads=1 --file_opening_threads=1 --threads=1 --use_existing_db=1 --seed=1682546046158958 --partition_index_and_filters=1 --statistics=1 --db=/tmp/testdb/ --benchmarks=readrandom --key_size=3200 --value_size=5 --num=100 --disable_auto_compactions=true --compression_type=zstd
```
Pre-PR:
```
rocksdb.file.read.db.open.micros P50 : 3.399023 P95 : 5.924468 P99 : 12.408333 P100 : 29.000000 COUNT : 611 SUM : 2539
```

Post-PR:
```
rocksdb.file.read.db.open.micros P50 : 593.736842 P95 : 861.605263 P99 : 1212.868421 P100 : 2663.000000 COUNT : 585 SUM : 345349
```

_Note: To control the starting offset of the prefetch tail buffer easier, I manually override the following to eliminate the effect of alignment_
```
class PosixRandomAccessFile : public FSRandomAccessFile {
virtual size_t GetRequiredBufferAlignment() const override {
-    return logical_sector_size_;
+    return 1;
  }
 ```

- CI

Reviewed By: pdillinger

Differential Revision: D46472566

Pulled By: hx235

fbshipit-source-id: 2fe14ac8d489d15b0e08e6f8fe4f46d5f110978e
oxigraph-main
Hui Xiao 2 years ago committed by Facebook GitHub Bot
parent 2e8cc98ab2
commit 3093d98c78
  1. 2
      file/file_prefetch_buffer.h
  2. 5
      table/block_based/block_based_table_reader.cc
  3. 3
      table/block_based/partitioned_filter_block.cc
  4. 3
      table/block_based/partitioned_index_reader.cc

@ -234,6 +234,8 @@ class FilePrefetchBuffer {
// tracked if track_min_offset = true.
size_t min_offset_read() const { return min_offset_read_; }
size_t GetPrefetchOffset() const { return bufs_[curr_].offset_; }
// Called in case of implicit auto prefetching.
void UpdateReadPattern(const uint64_t& offset, const size_t& len,
bool decrease_readaheadsize) {

@ -830,10 +830,7 @@ Status BlockBasedTable::PrefetchTail(
// index/filter is enabled and top-level partition pinning is enabled.
// That's because we need to issue readahead before we read the
// properties, at which point we don't yet know the index type.
tail_prefetch_size =
prefetch_all || preload_all
? static_cast<size_t>(4 * 1024 + 0.01 * file_size)
: 4 * 1024;
tail_prefetch_size = prefetch_all || preload_all ? 512 * 1024 : 4 * 1024;
ROCKS_LOG_WARN(logger,
"Tail prefetch size %zu is calculated based on heuristics",

@ -493,7 +493,8 @@ Status PartitionedFilterBlockReader::CacheDependencies(
handle.offset() + handle.size() + BlockBasedTable::kBlockTrailerSize;
uint64_t prefetch_len = last_off - prefetch_off;
std::unique_ptr<FilePrefetchBuffer> prefetch_buffer;
if (tail_prefetch_buffer == nullptr || !tail_prefetch_buffer->Enabled()) {
if (tail_prefetch_buffer == nullptr || !tail_prefetch_buffer->Enabled() ||
tail_prefetch_buffer->GetPrefetchOffset() > prefetch_off) {
rep->CreateFilePrefetchBuffer(
0, 0, &prefetch_buffer, false /* Implicit autoreadahead */,
0 /*num_reads_*/, 0 /*num_file_reads_for_auto_readahead*/);

@ -165,7 +165,8 @@ Status PartitionIndexReader::CacheDependencies(
handle.offset() + BlockBasedTable::BlockSizeWithTrailer(handle);
uint64_t prefetch_len = last_off - prefetch_off;
std::unique_ptr<FilePrefetchBuffer> prefetch_buffer;
if (tail_prefetch_buffer == nullptr || !tail_prefetch_buffer->Enabled()) {
if (tail_prefetch_buffer == nullptr || !tail_prefetch_buffer->Enabled() ||
tail_prefetch_buffer->GetPrefetchOffset() > prefetch_off) {
rep->CreateFilePrefetchBuffer(
0, 0, &prefetch_buffer, false /*Implicit auto readahead*/,
0 /*num_reads_*/, 0 /*num_file_reads_for_auto_readahead*/);

Loading…
Cancel
Save