From f291eefb022b5e3f6906bf4dd1ab1229839eb4e1 Mon Sep 17 00:00:00 2001
From: Changyu Bi
Date: Tue, 13 Sep 2022 20:07:28 -0700
Subject: [PATCH] Cache fragmented range tombstone list for mutable memtables (#10547)

Summary:
Each read from a memtable used to read and fragment all of its range tombstones into a `FragmentedRangeTombstoneList`. https://github.com/facebook/rocksdb/issues/10380 improved this inefficiency by caching a `FragmentedRangeTombstoneList` with each immutable memtable. This PR extends the caching to mutable memtables. The fragmented range tombstone list can be constructed in either the read path (this PR) or the write path (https://github.com/facebook/rocksdb/issues/10584). With both implementations, each `DeleteRange()` invalidates the cache; the difference is where the cache is re-constructed. `CoreLocalArray` is used to store the cache with each memtable so that multi-threaded reads can be efficient. More specifically, each core has a shared_ptr to a shared_ptr pointing to the current cache. Each read thread only updates the reference count in its core-local shared_ptr, and this is only needed when reading from mutable memtables (a simplified sketch of this scheme follows the commit message below).

The choice between write path and read path is not an easy one: both are improvements over the current implementation, which has no caching, but each favors one operation and could regress the other (read vs. write). The write path caching in https://github.com/facebook/rocksdb/issues/10584 leads to a cleaner implementation, but I chose the read path caching here to avoid a significant regression in write performance when there is a considerable number of range tombstones in a single memtable (the benchmark below suggests >1000 with concurrent writers). Note that even though the fragmented range tombstone list is only constructed in `DeleteRange()` operations, its construction could block other writes from proceeding and hence affects overall write performance.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10547

Test Plan:
- TestGet() in stress test is updated in https://github.com/facebook/rocksdb/issues/10553 to compare Get() result against expected state: `./db_stress_branch --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4`
- Perf benchmark: tested read and write performance where a memtable has 0, 1, 10, 100 and 1000 range tombstones.
```
./db_bench --benchmarks=fillrandom,readrandom --writes_per_range_tombstone=200 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=200000 --reads=100000 --disable_auto_compactions --max_num_range_tombstones=1000
```
Write perf regressed since the cost of constructing the fragmented range tombstone list is shifted from every read to a single write. 6cbe5d8e172dc5f1ef65c9d0a6eedbd9987b2c72 is included in the last column as a reference to see the performance impact on multi-threaded reads if `CoreLocalArray` is not used.

micros/op averaged over 5 runs: the first 4 result columns are for fillrandom, the last 4 are for readrandom.
| # range tombstones | fillrandom main | write path caching | read path caching | memtable V3 (https://github.com/facebook/rocksdb/issues/10308) | readrandom main | write path caching | read path caching | memtable V3 |
|---|---|---|---|---|---|---|---|---|
| 0 | 6.35 | 6.15 | 5.82 | 6.12 | 2.24 | 2.26 | 2.03 | 2.07 |
| 1 | 5.99 | 5.88 | 5.77 | 6.28 | 2.65 | 2.27 | 2.24 | 2.5 |
| 10 | 6.15 | 6.02 | 5.92 | 5.95 | 5.15 | 2.61 | 2.31 | 2.53 |
| 100 | 5.95 | 5.78 | 5.88 | 6.23 | 28.31 | 2.34 | 2.45 | 2.94 |
| 100 (25 threads) | 52.01 | 45.85 | 46.18 | 47.52 | 35.97 | 3.34 | 3.34 | 3.56 |
| 1000 | 6.0 | 7.07 | 5.98 | 6.08 | 333.18 | 2.86 | 2.7 | 3.6 |
| 1000 (25 threads) | 52.6 | 148.86 | 79.06 | 45.52 | 473.49 | 3.66 | 3.48 | 4.38 |

- Benchmark performance of `readwhilewriting` from https://github.com/facebook/rocksdb/issues/10552, 100 range tombstones are written: `./db_bench --benchmarks=readwhilewriting --writes_per_range_tombstone=500 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=100000 --reads=500000 --disable_auto_compactions --max_num_range_tombstones=10000 --finish_after_writes`

readrandom micros/op:

| | main | write path caching | read path caching | memtable V3 |
|---|---|---|---|---|
| single thread | 48.28 | 1.55 | 1.52 | 1.96 |
| 25 threads | 64.3 | 2.55 | 2.67 | 2.64 |

Reviewed By: ajkr

Differential Revision: D38895410

Pulled By: cbi42

fbshipit-source-id: 930bfc309dd1b2f4e8e9042f5126785bba577559
---
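For readers unfamiliar with the per-core double-`shared_ptr` trick, below is a minimal, self-contained sketch of the caching scheme described in the summary. The names `TombstoneList`, `TombstoneCache`, `PerCoreSlots`, `InvalidateCache`, `GetOrBuild`, and `build_list` are illustrative stand-ins only, not the actual RocksDB types (those are `FragmentedRangeTombstoneList`, `FragmentedRangeTombstoneListCache`, and `CoreLocalArray`), and the per-core array is modeled as a plain vector with one slot per core.

```
#include <atomic>
#include <functional>
#include <memory>
#include <mutex>
#include <vector>

struct TombstoneList {};  // stands in for FragmentedRangeTombstoneList

// One cache object shared by all cores (stands in for
// FragmentedRangeTombstoneListCache).
struct TombstoneCache {
  std::mutex reader_mutex;                    // serializes first-reader construction
  std::unique_ptr<TombstoneList> tombstones;  // lazily fragmented list
  std::atomic<bool> initialized{false};       // fast-path check before locking
};

// Stand-in for CoreLocalArray<std::shared_ptr<TombstoneCache>>: one slot per core.
using PerCoreSlots = std::vector<std::shared_ptr<TombstoneCache>>;

// Write path (DeleteRange): publish a fresh, empty cache to every core.
void InvalidateCache(PerCoreSlots& slots) {
  auto new_cache = std::make_shared<TombstoneCache>();
  for (auto& slot : slots) {
    // Each core gets its own control block (new_local_ref) that keeps
    // new_cache alive; the aliasing constructor makes the slot point at the
    // shared cache object while ref-counting on the core-local block.
    auto new_local_ref =
        std::make_shared<const std::shared_ptr<TombstoneCache>>(new_cache);
    std::atomic_store_explicit(
        &slot,
        std::shared_ptr<TombstoneCache>(new_local_ref, new_cache.get()),
        std::memory_order_relaxed);
  }
}

// Read path (Get/iterator): only the first reader after an invalidation pays
// the cost of fragmenting the tombstones; later readers reuse the result.
std::shared_ptr<TombstoneCache> GetOrBuild(
    std::shared_ptr<TombstoneCache>& my_core_slot,
    const std::function<std::unique_ptr<TombstoneList>()>& build_list) {
  // Copying the slot bumps only the core-local control block's ref count.
  std::shared_ptr<TombstoneCache> cache =
      std::atomic_load_explicit(&my_core_slot, std::memory_order_relaxed);
  if (!cache->initialized.load(std::memory_order_acquire)) {
    std::lock_guard<std::mutex> guard(cache->reader_mutex);
    if (!cache->tombstones) {
      cache->tombstones = build_list();  // fragment the range tombstones
      cache->initialized.store(true, std::memory_order_release);
    }
  }
  return cache;  // holding this keeps cache->tombstones alive
}
```

The aliasing constructor gives each core its own control block, so concurrent readers on different cores bump different reference counts while still observing the same cache object; the slots are assumed to have been populated once (for example, by running `InvalidateCache` at construction time) before any reader runs.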
 HISTORY.md                       |  1 +
 db/memtable.cc                   | 56 ++++++++++++++++++++++++++------
 db/memtable.h                    |  5 +++
 db/range_tombstone_fragmenter.cc | 16 +++++++++
 db/range_tombstone_fragmenter.h  | 14 ++++++++
 include/rocksdb/db.h             |  2 +-
 6 files changed, 83 insertions(+), 11 deletions(-)

diff --git a/HISTORY.md b/HISTORY.md
index 8bfd73876..005e6f481 100644
--- a/HISTORY.md
+++ b/HISTORY.md
@@ -35,6 +35,7 @@
 * Iterator performance is improved for `DeleteRange()` users. Internally, iterator will skip to the end of a range tombstone when possible, instead of looping through each key and check individually if a key is range deleted.
 * Eliminated some allocations and copies in the blob read path. Also, `PinnableSlice` now only points to the blob value and pins the backing resource (cache entry or buffer) in all cases, instead of containing a copy of the blob value. See #10625 and #10647.
 * In case of scans with async_io enabled, few optimizations have been added to issue more asynchronous requests in parallel in order to avoid synchronous prefetching.
+* `DeleteRange()` users should see improvement in get/iterator performance from mutable memtable (see #10547).
 
 ## 7.6.0 (08/19/2022)
 ### New Features
diff --git a/db/memtable.cc b/db/memtable.cc
index 53f40bb98..2acf9b6da 100644
--- a/db/memtable.cc
+++ b/db/memtable.cc
@@ -560,16 +560,28 @@ FragmentedRangeTombstoneIterator* MemTable::NewRangeTombstoneIteratorInternal(
         read_seq);
   }
 
-  auto* unfragmented_iter = new MemTableIterator(
-      *this, read_options, nullptr /* arena */, true /* use_range_del_table */);
-  auto fragmented_tombstone_list =
-      std::make_shared<FragmentedRangeTombstoneList>(
-          std::unique_ptr<InternalIterator>(unfragmented_iter),
-          comparator_.comparator);
-
-  auto* fragmented_iter = new FragmentedRangeTombstoneIterator(
-      fragmented_tombstone_list, comparator_.comparator, read_seq);
-  return fragmented_iter;
+  // takes current cache
+  std::shared_ptr<FragmentedRangeTombstoneListCache> cache =
+      std::atomic_load_explicit(cached_range_tombstone_.Access(),
+                                std::memory_order_relaxed);
+  // construct fragmented tombstone list if necessary
+  if (!cache->initialized.load(std::memory_order_acquire)) {
+    cache->reader_mutex.lock();
+    if (!cache->tombstones) {
+      auto* unfragmented_iter =
+          new MemTableIterator(*this, read_options, nullptr /* arena */,
+                               true /* use_range_del_table */);
+      cache->tombstones = std::make_unique<FragmentedRangeTombstoneList>(
+          FragmentedRangeTombstoneList(
+              std::unique_ptr<InternalIterator>(unfragmented_iter),
+              comparator_.comparator));
+      cache->initialized.store(true, std::memory_order_release);
+    }
+    cache->reader_mutex.unlock();
+  }
+
+  return new FragmentedRangeTombstoneIterator(cache, comparator_.comparator,
+                                              read_seq);
 }
 
 void MemTable::ConstructFragmentedRangeTombstones() {
@@ -819,6 +831,30 @@ Status MemTable::Add(SequenceNumber s, ValueType type,
     }
   }
   if (type == kTypeRangeDeletion) {
+    auto new_cache = std::make_shared<FragmentedRangeTombstoneListCache>();
+    size_t size = cached_range_tombstone_.Size();
+    if (allow_concurrent) {
+      range_del_mutex_.lock();
+    }
+    for (size_t i = 0; i < size; ++i) {
+      std::shared_ptr<FragmentedRangeTombstoneListCache>* local_cache_ref_ptr =
+          cached_range_tombstone_.AccessAtCore(i);
+      auto new_local_cache_ref = std::make_shared<
+          const std::shared_ptr<FragmentedRangeTombstoneListCache>>(new_cache);
+      // It is okay for some reader to load old cache during invalidation as
+      // the new sequence number is not published yet.
+      // Each core will have a shared_ptr to a shared_ptr to the cached
+      // fragmented range tombstones, so that ref count is maintained locally
+      // per-core using the per-core shared_ptr.
+      std::atomic_store_explicit(
+          local_cache_ref_ptr,
+          std::shared_ptr<FragmentedRangeTombstoneListCache>(
+              new_local_cache_ref, new_cache.get()),
+          std::memory_order_relaxed);
+    }
+    if (allow_concurrent) {
+      range_del_mutex_.unlock();
+    }
     is_range_del_table_empty_.store(false, std::memory_order_relaxed);
   }
   UpdateOldestKeyTime();
diff --git a/db/memtable.h b/db/memtable.h
index 74b3c64bb..86c2160af 100644
--- a/db/memtable.h
+++ b/db/memtable.h
@@ -536,6 +536,11 @@ class MemTable {
                                     size_t protection_bytes_per_key,
                                     bool allow_data_in_errors = false);
 
+  // makes sure there is a single range tombstone writer to invalidate cache
+  std::mutex range_del_mutex_;
+  CoreLocalArray<std::shared_ptr<FragmentedRangeTombstoneListCache>>
+      cached_range_tombstone_;
+
  private:
   enum FlushStateEnum { FLUSH_NOT_REQUESTED, FLUSH_REQUESTED, FLUSH_SCHEDULED };
 
diff --git a/db/range_tombstone_fragmenter.cc b/db/range_tombstone_fragmenter.cc
index 8c90ff5b6..356eee181 100644
--- a/db/range_tombstone_fragmenter.cc
+++ b/db/range_tombstone_fragmenter.cc
@@ -251,6 +251,22 @@ FragmentedRangeTombstoneIterator::FragmentedRangeTombstoneIterator(
   Invalidate();
 }
 
+FragmentedRangeTombstoneIterator::FragmentedRangeTombstoneIterator(
+    const std::shared_ptr<FragmentedRangeTombstoneListCache>& tombstones_cache,
+    const InternalKeyComparator& icmp, SequenceNumber _upper_bound,
+    SequenceNumber _lower_bound)
+    : tombstone_start_cmp_(icmp.user_comparator()),
+      tombstone_end_cmp_(icmp.user_comparator()),
+      icmp_(&icmp),
+      ucmp_(icmp.user_comparator()),
+      tombstones_cache_ref_(tombstones_cache),
+      tombstones_(tombstones_cache_ref_->tombstones.get()),
+      upper_bound_(_upper_bound),
+      lower_bound_(_lower_bound) {
+  assert(tombstones_ != nullptr);
+  Invalidate();
+}
+
 void FragmentedRangeTombstoneIterator::SeekToFirst() {
   pos_ = tombstones_->begin();
   seq_pos_ = tombstones_->seq_begin();
diff --git a/db/range_tombstone_fragmenter.h b/db/range_tombstone_fragmenter.h
index f323db5d7..0c8cbf181 100644
--- a/db/range_tombstone_fragmenter.h
+++ b/db/range_tombstone_fragmenter.h
@@ -17,6 +17,15 @@
 #include "table/internal_iterator.h"
 
 namespace ROCKSDB_NAMESPACE {
+struct FragmentedRangeTombstoneList;
+
+struct FragmentedRangeTombstoneListCache {
+  // ensure only the first reader needs to initialize the list
+  std::mutex reader_mutex;
+  std::unique_ptr<FragmentedRangeTombstoneList> tombstones = nullptr;
+  // readers will first check this bool to avoid locking reader_mutex
+  std::atomic<bool> initialized = false;
+};
 
 struct FragmentedRangeTombstoneList {
  public:
@@ -113,6 +122,10 @@ class FragmentedRangeTombstoneIterator : public InternalIterator {
       const std::shared_ptr<const FragmentedRangeTombstoneList>& tombstones,
       const InternalKeyComparator& icmp, SequenceNumber upper_bound,
       SequenceNumber lower_bound = 0);
+  FragmentedRangeTombstoneIterator(
+      const std::shared_ptr<FragmentedRangeTombstoneListCache>& tombstones,
+      const InternalKeyComparator& icmp, SequenceNumber upper_bound,
+      SequenceNumber lower_bound = 0);
 
   void SeekToFirst() override;
   void SeekToLast() override;
@@ -260,6 +273,7 @@ class FragmentedRangeTombstoneIterator : public InternalIterator {
   const InternalKeyComparator* icmp_;
   const Comparator* ucmp_;
   std::shared_ptr<const FragmentedRangeTombstoneList> tombstones_ref_;
+  std::shared_ptr<FragmentedRangeTombstoneListCache> tombstones_cache_ref_;
   const FragmentedRangeTombstoneList* tombstones_;
   SequenceNumber upper_bound_;
   SequenceNumber lower_bound_;
diff --git a/include/rocksdb/db.h b/include/rocksdb/db.h
index b7fcdd786..1e7c19044 100644
--- a/include/rocksdb/db.h
+++ b/include/rocksdb/db.h
@@ -469,7 +469,7 @@ class DB {
   // a `Status::InvalidArgument` is returned.
   //
   // This feature is now usable in production, with the following caveats:
-  // 1) Accumulating many range tombstones in the memtable will degrade read
+  // 1) Accumulating too many range tombstones in the memtable will degrade read
   // performance; this can be avoided by manually flushing occasionally.
   // 2) Limiting the maximum number of open files in the presence of range
   // tombstones can degrade read performance. To avoid this problem, set
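As a usage note for caveat 1) in the `DeleteRange()` documentation above, the following is a rough sketch of the manual-flush workaround using only public RocksDB API calls; `DeleteRangesWithPeriodicFlush` and the threshold of 1000 tombstones per memtable are hypothetical and only for illustration (the crossover point depends on the workload, as the benchmarks above suggest).

```
#include <string>
#include <utility>
#include <vector>

#include "rocksdb/db.h"

// Apply a batch of range deletions, flushing every so often so that range
// tombstones do not pile up in a single (mutable) memtable.
rocksdb::Status DeleteRangesWithPeriodicFlush(
    rocksdb::DB* db,
    const std::vector<std::pair<std::string, std::string>>& ranges,
    size_t flush_every_n = 1000 /* illustrative threshold */) {
  size_t tombstones_in_memtable = 0;
  for (const auto& range : ranges) {
    rocksdb::Status s =
        db->DeleteRange(rocksdb::WriteOptions(), db->DefaultColumnFamily(),
                        range.first, range.second);
    if (!s.ok()) {
      return s;
    }
    if (++tombstones_in_memtable >= flush_every_n) {
      // Manual flush moves the accumulated tombstones out of the memtable.
      s = db->Flush(rocksdb::FlushOptions(), db->DefaultColumnFamily());
      if (!s.ok()) {
        return s;
      }
      tombstones_in_memtable = 0;
    }
  }
  return rocksdb::Status::OK();
}
```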