Add an option to put first key of each sst block in the index (#5289)

Summary: The first key is used to defer reading the data block until this file gets to the top of the merging iterator's heap. For short range scans, most files never make it to the top of the heap, so this change can sometimes reduce read amplification substantially.

Consider the following workload. There are a few data streams (we'll be calling them "logs"), each stream consisting of a sequence of blobs (we'll be calling them "records"). Each record is identified by a log ID and a sequence number within the log. The RocksDB key is the concatenation of the log ID and the sequence number (big endian). Reads are mostly relatively short range scans, each within a single log. Writes are mostly sequential for each log, but writes to different logs are randomly interleaved. Compactions are disabled; instead, when we accumulate a few tens of sst files, we create a new column family and start writing to it.

So, a typical sst file consists of a few ranges of blocks, each range corresponding to one log ID (we use FlushBlockPolicy to cut blocks at log boundaries). A typical read would go like this. First, iterator Seek() reads one block from each sst file. Then a series of Next()s moves through one sst file (since writes to each log are mostly sequential) until the subiterator reaches the end of this log in this sst file; then Next() switches to the next sst file and reads sequentially from that, and so on. Often a range scan will only return records from a small number of blocks in a small number of sst files; in this case, the cost of the initial Seek() reading one block from each file may be bigger than the cost of reading the actually useful blocks.

Neither iterate_upper_bound nor bloom filters can prevent reading one block from each file in Seek(). But this PR can: if the index contains the first key from each block, we don't have to read the block until it actually makes it to the top of the merging iterator's heap, so for short range scans we won't read any blocks from most of the sst files.
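To make the read-amplification argument concrete, here is a minimal, self-contained C++ sketch of the idea. It is a toy model and not RocksDB code: `ToyFile`, `Cursor`, `ScanCost` and `ShortScan` are hypothetical names, a "block read" is only a counter, and the index lookup is a linear scan over blocks.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <string>
#include <vector>

// Toy stand-in for an sst file: sorted, non-overlapping, non-empty blocks.
struct ToyFile {
  std::vector<std::vector<std::string>> blocks;
};

struct ScanCost {
  std::vector<std::string> keys;  // keys returned by the scan
  int block_reads = 0;            // number of data blocks loaded
};

struct Cursor {
  std::size_t file, block, pos;
  bool loaded;      // false while the key is known from the index only
  std::string key;  // current key of this file's sub-iterator
};

// Orders the priority queue so that the smallest key is on top.
struct ByKeyDesc {
  bool operator()(const Cursor& a, const Cursor& b) const {
    return a.key > b.key;
  }
};

// Returns up to `limit` keys >= `target`, merged across all files.
ScanCost ShortScan(const std::vector<ToyFile>& files,
                   const std::string& target, std::size_t limit,
                   bool index_has_first_key) {
  ScanCost out;
  if (limit == 0) return out;
  std::priority_queue<Cursor, std::vector<Cursor>, ByKeyDesc> heap;

  // Seek(): position every file at its first key >= target.
  for (std::size_t f = 0; f < files.size(); ++f) {
    const auto& blocks = files[f].blocks;
    std::size_t b = 0;
    while (b < blocks.size()) {  // index lookup, no data block IO
      assert(!blocks[b].empty());
      if (blocks[b].back() >= target) break;
      ++b;
    }
    if (b == blocks.size()) continue;  // file has nothing >= target
    if (index_has_first_key && blocks[b].front() >= target) {
      // The first key from the index is the seek result; defer the read.
      heap.push(Cursor{f, b, 0, false, blocks[b].front()});
    } else {
      ++out.block_reads;  // must read the block to find the key
      std::size_t p = 0;
      while (blocks[b][p] < target) ++p;
      heap.push(Cursor{f, b, p, true, blocks[b][p]});
    }
  }

  while (!heap.empty()) {
    Cursor c = heap.top();
    heap.pop();
    if (!c.loaded) {  // the file reached the top of the heap: load now
      ++out.block_reads;
      c.loaded = true;
    }
    out.keys.push_back(c.key);
    if (out.keys.size() == limit) break;

    // Next(): advance within the block, or move to the next block.
    const auto& blocks = files[c.file].blocks;
    if (++c.pos == blocks[c.block].size()) {
      if (++c.block == blocks.size()) continue;  // file exhausted
      c.pos = 0;
      c.loaded = !index_has_first_key;
      if (c.loaded) ++out.block_reads;
    }
    c.key = blocks[c.block][c.pos];
    heap.push(c);
  }
  return out;
}
```

Under these assumptions, take three files that each hold one two-record block per log (A, B, C). A scan of two records starting at the first B key loads three blocks when the index has no first keys (one per file during Seek()), and one block when it does.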
This PR does the deferred block loading inside value() call. This is not ideal: there's no good way to report an IO error from inside value(). As discussed with siying offline, it would probably be better to change InternalIterator's interface to explicitly fetch deferred value and get status. I'll do it in a separate PR.

Pull Request resolved: https://github.com/facebook/rocksdb/pull/5289
Differential Revision: D15256423
Pulled By: al13n321
fbshipit-source-id: 750e4c39ce88e8d41662f701cf6275d9388ba46a
6 years ago · b4d7209428
parent 554a6456aa
commit b4d7209428
25 changed files with 1362 additions and 588 deletions
--- a/HISTORY.md
+++ b/HISTORY.md
@ -41,6 +41,7 @@
 * Block-based table index now contains exact highest key in the file, rather than an upper bound. This may improve Get() and iterator Seek() performance in some situations, especially when direct IO is enabled and block cache is disabled. A setting BlockBasedTableOptions::index_shortening is introduced to control this behavior. Set it to kShortenSeparatorsAndSuccessor to get the old behavior.
 * When reading from option file/string/map, customized envs can be filled according to object registry.
 * Improve range scan performance when using explicit user readahead by not creating new table readers for every iterator.
 * Add index type BlockBasedTableOptions::IndexType::kBinarySearchWithFirstKey. It significantly reduces read amplification in some setups, especially for iterator seeks. It's not fully implemented yet: IO errors are not handled right.
 ### Public API Change
 * Change the behavior of OptimizeForPointLookup(): move away from hash-based block-based-table index, and use whole key memtable filtering.
--- a/db/db_iterator_test.cc
+++ b/db/db_iterator_test.cc
@ -1049,6 +1049,148 @@ TEST_P(DBIteratorTest, DBIteratorBoundOptimizationTest) {
    ASSERT_EQ(upper_bound_hits, 1);
  }
 }
 // Enable kBinarySearchWithFirstKey, do some iterator operations and check that
 // they don't do unnecessary block reads.
 TEST_P(DBIteratorTest, IndexWithFirstKey) {
  for (int tailing = 0; tailing < 2; ++tailing) {
    SCOPED_TRACE("tailing = " + std::to_string(tailing));
    Options options = CurrentOptions();
    options.env = env_;
    options.create_if_missing = true;
    options.prefix_extractor = nullptr;
    options.merge_operator = MergeOperators::CreateStringAppendOperator();
    options.statistics = rocksdb::CreateDBStatistics();
    Statistics* stats = options.statistics.get();
    BlockBasedTableOptions table_options;
    table_options.index_type =
        BlockBasedTableOptions::IndexType::kBinarySearchWithFirstKey;
    table_options.index_shortening =
        BlockBasedTableOptions::IndexShorteningMode::kNoShortening;
    table_options.flush_block_policy_factory =
        std::make_shared<FlushBlockEveryKeyPolicyFactory>();
    table_options.block_cache = NewLRUCache(1000);  // fits all blocks
    options.table_factory.reset(NewBlockBasedTableFactory(table_options));
    DestroyAndReopen(options);
    ASSERT_OK(Merge("a1", "x1"));
    ASSERT_OK(Merge("b1", "y1"));
    ASSERT_OK(Merge("c0", "z1"));
    ASSERT_OK(Flush());
    ASSERT_OK(Merge("a2", "x2"));
    ASSERT_OK(Merge("b2", "y2"));
    ASSERT_OK(Merge("c0", "z2"));
    ASSERT_OK(Flush());
    ASSERT_OK(Merge("a3", "x3"));
    ASSERT_OK(Merge("b3", "y3"));
    ASSERT_OK(Merge("c3", "z3"));
    ASSERT_OK(Flush());
    // Block cache is not important for this test.
    // We use BLOCK_CACHE_DATA_* counters just because they're the most readily
    // available way of counting block accesses.
    ReadOptions ropt;
    ropt.tailing = tailing;
    std::unique_ptr<Iterator> iter(NewIterator(ropt));
    iter->Seek("b10");
    ASSERT_TRUE(iter->Valid());
    EXPECT_EQ("b2", iter->key().ToString());
    EXPECT_EQ("y2", iter->value().ToString());
    EXPECT_EQ(1, stats->getTickerCount(BLOCK_CACHE_DATA_MISS));
    iter->Next();
    ASSERT_TRUE(iter->Valid());
    EXPECT_EQ("b3", iter->key().ToString());
    EXPECT_EQ("y3", iter->value().ToString());
    EXPECT_EQ(2, stats->getTickerCount(BLOCK_CACHE_DATA_MISS));
    EXPECT_EQ(0, stats->getTickerCount(BLOCK_CACHE_DATA_HIT));
    iter->Seek("c0");
    ASSERT_TRUE(iter->Valid());
    EXPECT_EQ("c0", iter->key().ToString());
    EXPECT_EQ("z1,z2", iter->value().ToString());
    EXPECT_EQ(0, stats->getTickerCount(BLOCK_CACHE_DATA_HIT));
    EXPECT_EQ(4, stats->getTickerCount(BLOCK_CACHE_DATA_MISS));
    iter->Next();
    ASSERT_TRUE(iter->Valid());
    EXPECT_EQ("c3", iter->key().ToString());
    EXPECT_EQ("z3", iter->value().ToString());
    EXPECT_EQ(0, stats->getTickerCount(BLOCK_CACHE_DATA_HIT));
    EXPECT_EQ(5, stats->getTickerCount(BLOCK_CACHE_DATA_MISS));
    iter.reset();
    // Enable iterate_upper_bound and check that iterator is not trying to read
    // blocks that are fully above upper bound.
    std::string ub = "b3";
    Slice ub_slice(ub);
    ropt.iterate_upper_bound = &ub_slice;
    iter.reset(NewIterator(ropt));
    iter->Seek("b2");
    ASSERT_TRUE(iter->Valid());
    EXPECT_EQ("b2", iter->key().ToString());
    EXPECT_EQ("y2", iter->value().ToString());
    EXPECT_EQ(1, stats->getTickerCount(BLOCK_CACHE_DATA_HIT));
    EXPECT_EQ(5, stats->getTickerCount(BLOCK_CACHE_DATA_MISS));
    iter->Next();
    ASSERT_FALSE(iter->Valid());
    EXPECT_EQ(1, stats->getTickerCount(BLOCK_CACHE_DATA_HIT));
    EXPECT_EQ(5, stats->getTickerCount(BLOCK_CACHE_DATA_MISS));
  }
 }
 TEST_P(DBIteratorTest, IndexWithFirstKeyGet) {
  Options options = CurrentOptions();
  options.env = env_;
  options.create_if_missing = true;
  options.prefix_extractor = nullptr;
  options.merge_operator = MergeOperators::CreateStringAppendOperator();
  options.statistics = rocksdb::CreateDBStatistics();
  Statistics* stats = options.statistics.get();
  BlockBasedTableOptions table_options;
  table_options.index_type =
      BlockBasedTableOptions::IndexType::kBinarySearchWithFirstKey;
  table_options.index_shortening =
      BlockBasedTableOptions::IndexShorteningMode::kNoShortening;
  table_options.flush_block_policy_factory =
      std::make_shared<FlushBlockEveryKeyPolicyFactory>();
  table_options.block_cache = NewLRUCache(1000);  // fits all blocks
  options.table_factory.reset(NewBlockBasedTableFactory(table_options));
  DestroyAndReopen(options);
  ASSERT_OK(Merge("a", "x1"));
  ASSERT_OK(Merge("c", "y1"));
  ASSERT_OK(Merge("e", "z1"));
  ASSERT_OK(Flush());
  ASSERT_OK(Merge("c", "y2"));
  ASSERT_OK(Merge("e", "z2"));
  ASSERT_OK(Flush());
  // Get() between blocks shouldn't read any blocks.
  ASSERT_EQ("NOT_FOUND", Get("b"));
  EXPECT_EQ(0, stats->getTickerCount(BLOCK_CACHE_DATA_MISS));
  EXPECT_EQ(0, stats->getTickerCount(BLOCK_CACHE_DATA_HIT));
  // Get() of an existing key shouldn't read any unnecessary blocks when there's
  // only one key per block.
  ASSERT_EQ("y1,y2", Get("c"));
  EXPECT_EQ(2, stats->getTickerCount(BLOCK_CACHE_DATA_MISS));
  EXPECT_EQ(0, stats->getTickerCount(BLOCK_CACHE_DATA_HIT));
  ASSERT_EQ("x1", Get("a"));
  EXPECT_EQ(3, stats->getTickerCount(BLOCK_CACHE_DATA_MISS));
  EXPECT_EQ(0, stats->getTickerCount(BLOCK_CACHE_DATA_HIT));
  EXPECT_EQ(std::vector<std::string>({"NOT_FOUND", "z1,z2"}),
            MultiGet({"b", "e"}));
 }
 // TODO(3.13): fix the issue of Seek() + Prev() which might not necessary
 //             return the biggest key which is smaller than the seek key.
 TEST_P(DBIteratorTest, PrevAfterAndNextAfterMerge) {
--- a/include/rocksdb/table.h
+++ b/include/rocksdb/table.h
@ -93,14 +93,32 @@ struct BlockBasedTableOptions {
  enum IndexType : char {
    // A space efficient index block that is optimized for
    // binary-search-based index.
-    kBinarySearch,
+    kBinarySearch = 0x00,
    // The hash index, if enabled, will do the hash lookup when
    // `Options.prefix_extractor` is provided.
-    kHashSearch,
+    kHashSearch = 0x01,
    // A two-level index implementation. Both levels are binary search indexes.
-    kTwoLevelIndexSearch,
+    kTwoLevelIndexSearch = 0x02,
    // Like kBinarySearch, but index also contains first key of each block.
    // This allows iterators to defer reading the block until it's actually
    // needed. May significantly reduce read amplification of short range scans.
    // Without it, iterator seek usually reads one block from each level-0 file
    // and from each level, which may be expensive.
    // Works best in combination with:
    //  - IndexShorteningMode::kNoShortening,
    //  - custom FlushBlockPolicy to cut blocks at some meaningful boundaries,
    //    e.g. when prefix changes.
    // Makes the index significantly bigger (2x or more), especially when keys
    // are long.
    //
    // IO errors are not handled correctly in this mode right now: if an error
    // happens when lazily reading a block in value(), value() returns empty
    // slice, and you need to call Valid()/status() afterwards.
    // TODO(kolmike): Fix it.
    kBinarySearchWithFirstKey = 0x03,
  };
  IndexType index_type = kBinarySearch;
--- a/java/rocksjni/portal.h
+++ b/java/rocksjni/portal.h
@ -5902,8 +5902,10 @@ class IndexTypeJni {
       return 0x0;
     case rocksdb::BlockBasedTableOptions::IndexType::kHashSearch:
       return 0x1;
-    case rocksdb::BlockBasedTableOptions::IndexType::kTwoLevelIndexSearch:
+     case rocksdb::BlockBasedTableOptions::IndexType::kTwoLevelIndexSearch:
       return 0x2;
     case rocksdb::BlockBasedTableOptions::IndexType::kBinarySearchWithFirstKey:
       return 0x3;
     default:
       return 0x7F;  // undefined
   }
@ -5920,6 +5922,9 @@ class IndexTypeJni {
       return rocksdb::BlockBasedTableOptions::IndexType::kHashSearch;
     case 0x2:
       return rocksdb::BlockBasedTableOptions::IndexType::kTwoLevelIndexSearch;
     case 0x3:
       return rocksdb::BlockBasedTableOptions::IndexType::
           kBinarySearchWithFirstKey;
     default:
       // undefined/default
       return rocksdb::BlockBasedTableOptions::IndexType::kBinarySearch;
--- a/options/options_helper.cc
+++ b/options/options_helper.cc
@ -1671,7 +1671,9 @@ std::unordered_map<std::string, BlockBasedTableOptions::IndexType>
        {"kBinarySearch", BlockBasedTableOptions::IndexType::kBinarySearch},
        {"kHashSearch", BlockBasedTableOptions::IndexType::kHashSearch},
        {"kTwoLevelIndexSearch",
-         BlockBasedTableOptions::IndexType::kTwoLevelIndexSearch}};
+         BlockBasedTableOptions::IndexType::kTwoLevelIndexSearch},
+        {"kBinarySearchWithFirstKey",
+         BlockBasedTableOptions::IndexType::kBinarySearchWithFirstKey}};
 std::unordered_map<std::string, BlockBasedTableOptions::DataBlockIndexType>
    OptionsHelper::block_base_table_data_block_index_type_string_map = {
--- a/table/block_based/block.cc
+++ b/table/block_based/block.cc
@ -608,8 +608,7 @@ bool IndexBlockIter::ParseNextIndexKey() {
  }
  // else we are in the middle of a restart interval and the restart_index_
  // thus has not changed
-  if (value_delta_encoded_) {
+  if (value_delta_encoded_ || global_seqno_state_ != nullptr) {
    assert(value_length == 0);
    DecodeCurrentValue(shared);
  }
  return true;
@ -627,24 +626,32 @@ bool IndexBlockIter::ParseNextIndexKey() {
 // Otherwise the format is delta-size = block handle size - size of last block
 // handle.
 void IndexBlockIter::DecodeCurrentValue(uint32_t shared) {
-  assert(value_delta_encoded_);
-  const char* limit = data_ + restarts_;
-  if (shared == 0) {
-    uint64_t o, s;
-    const char* newp = GetVarint64Ptr(value_.data(), limit, &o);
-    assert(newp);
-    newp = GetVarint64Ptr(newp, limit, &s);
-    assert(newp);
-    decoded_value_ = BlockHandle(o, s);
-    value_ = Slice(value_.data(), newp - value_.data());
-  } else {
-    uint64_t next_value_base =
-        decoded_value_.offset() + decoded_value_.size() + kBlockTrailerSize;
-    int64_t delta;
-    const char* newp = GetVarsignedint64Ptr(value_.data(), limit, &delta);
-    decoded_value_ =
-        BlockHandle(next_value_base, decoded_value_.size() + delta);
-    value_ = Slice(value_.data(), newp - value_.data());
+  Slice v(value_.data(), data_ + restarts_ - value_.data());
+  // Delta encoding is used if `shared` != 0.
+  Status decode_s __attribute__((__unused__)) = decoded_value_.DecodeFrom(
+      &v, have_first_key_,
+      (value_delta_encoded_ && shared) ? &decoded_value_.handle : nullptr);
+  assert(decode_s.ok());
+  value_ = Slice(value_.data(), v.data() - value_.data());
+
+  if (global_seqno_state_ != nullptr) {
+    // Overwrite sequence number the same way as in DataBlockIter.
+
+    IterKey& first_internal_key = global_seqno_state_->first_internal_key;
+    first_internal_key.SetInternalKey(decoded_value_.first_internal_key,
+                                      /* copy */ true);
+
+    assert(GetInternalKeySeqno(first_internal_key.GetInternalKey()) == 0);
+
+    ValueType value_type = ExtractValueType(first_internal_key.GetKey());
+    assert(value_type == ValueType::kTypeValue ||
+           value_type == ValueType::kTypeMerge ||
+           value_type == ValueType::kTypeDeletion ||
+           value_type == ValueType::kTypeRangeDeletion);
+    first_internal_key.UpdateInternalKey(global_seqno_state_->global_seqno,
+                                         value_type);
+    decoded_value_.first_internal_key = first_internal_key.GetKey();
  }
 }
@ -875,14 +882,10 @@ Block::Block(BlockContents&& contents, SequenceNumber _global_seqno,
  }
 }
-template <>
-DataBlockIter* Block::NewIterator(const Comparator* cmp, const Comparator* ucmp,
-                                  DataBlockIter* iter, Statistics* stats,
-                                  bool /*total_order_seek*/,
-                                  bool /*key_includes_seq*/,
-                                  bool /*value_is_full*/,
-                                  bool block_contents_pinned,
-                                  BlockPrefixIndex* /*prefix_index*/) {
+DataBlockIter* Block::NewDataIterator(const Comparator* cmp,
+                                      const Comparator* ucmp,
+                                      DataBlockIter* iter, Statistics* stats,
+                                      bool block_contents_pinned) {
  DataBlockIter* ret_iter;
  if (iter != nullptr) {
    ret_iter = iter;
@ -913,13 +916,11 @@ DataBlockIter* Block::NewIterator(const Comparator* cmp, const Comparator* ucmp,
  return ret_iter;
 }
-template <>
-IndexBlockIter* Block::NewIterator(const Comparator* cmp,
-                                   const Comparator* ucmp, IndexBlockIter* iter,
-                                   Statistics* /*stats*/, bool total_order_seek,
-                                   bool key_includes_seq, bool value_is_full,
-                                   bool block_contents_pinned,
-                                   BlockPrefixIndex* prefix_index) {
+IndexBlockIter* Block::NewIndexIterator(
+    const Comparator* cmp, const Comparator* ucmp, IndexBlockIter* iter,
+    Statistics* /*stats*/, bool total_order_seek, bool have_first_key,
+    bool key_includes_seq, bool value_is_full, bool block_contents_pinned,
+    BlockPrefixIndex* prefix_index) {
  IndexBlockIter* ret_iter;
  if (iter != nullptr) {
    ret_iter = iter;
@ -938,9 +939,9 @@ IndexBlockIter* Block::NewIterator(const Comparator* cmp,
    BlockPrefixIndex* prefix_index_ptr =
        total_order_seek ? nullptr : prefix_index;
    ret_iter->Initialize(cmp, ucmp, data_, restart_offset_, num_restarts_,
-                         prefix_index_ptr, key_includes_seq, value_is_full,
-                         block_contents_pinned,
-                         nullptr /* data_block_hash_index */);
+                         global_seqno_, prefix_index_ptr, have_first_key,
+                         key_includes_seq, value_is_full,
+                         block_contents_pinned);
  }
  return ret_iter;
--- a/table/block_based/block.h
+++ b/table/block_based/block.h
@ -165,17 +165,7 @@ class Block {
  // If iter is null, return new Iterator
  // If iter is not null, update this one and return it as Iterator*
  //
-  // key_includes_seq, default true, means that the keys are in internal key
-  // format.
-  // value_is_full, default true, means that no delta encoding is
-  // applied to values.
-  //
-  // NewIterator<DataBlockIter>
-  // Same as above but also updates read_amp_bitmap_ if it is not nullptr.
-  //
-  // NewIterator<IndexBlockIter>
-  // If `prefix_index` is not nullptr this block will do hash lookup for the key
-  // prefix. If total_order_seek is true, prefix_index_ is ignored.
+  // Updates read_amp_bitmap_ if it is not nullptr.
  //
  // If `block_contents_pinned` is true, the caller will guarantee that when
  // the cleanup functions are transferred from the iterator to other
@ -188,13 +178,32 @@ class Block {
  // NOTE: for the hash based lookup, if a key prefix doesn't match any key,
  // the iterator will simply be set as "invalid", rather than returning
  // the key that is just pass the target key.
-  template <typename TBlockIter>
-  TBlockIter* NewIterator(
-      const Comparator* comparator, const Comparator* user_comparator,
-      TBlockIter* iter = nullptr, Statistics* stats = nullptr,
-      bool total_order_seek = true, bool key_includes_seq = true,
-      bool value_is_full = true, bool block_contents_pinned = false,
-      BlockPrefixIndex* prefix_index = nullptr);
+
+  DataBlockIter* NewDataIterator(const Comparator* comparator,
+                                 const Comparator* user_comparator,
+                                 DataBlockIter* iter = nullptr,
+                                 Statistics* stats = nullptr,
+                                 bool block_contents_pinned = false);
+
+  // key_includes_seq, default true, means that the keys are in internal key
+  // format.
+  // value_is_full, default true, means that no delta encoding is
+  // applied to values.
+  //
+  // If `prefix_index` is not nullptr this block will do hash lookup for the key
+  // prefix. If total_order_seek is true, prefix_index_ is ignored.
+  //
+  // `have_first_key` controls whether IndexValue will contain
+  // first_internal_key. It affects data serialization format, so the same value
+  // have_first_key must be used when writing and reading index.
+  // It is determined by IndexType property of the table.
+  IndexBlockIter* NewIndexIterator(const Comparator* comparator,
+                                   const Comparator* user_comparator,
+                                   IndexBlockIter* iter, Statistics* stats,
+                                   bool total_order_seek, bool have_first_key,
+                                   bool key_includes_seq, bool value_is_full,
+                                   bool block_contents_pinned = false,
+                                   BlockPrefixIndex* prefix_index = nullptr);
  // Report an approximation of how much memory has been used.
  size_t ApproximateMemoryUsage() const;
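The comment above notes that `have_first_key` changes the serialized form of index values, so writer and reader must agree on it. The following self-contained C++ sketch illustrates why, under loud assumptions: `ToyIndexValue`, `Encode` and `Decode` are hypothetical, the layout (varint offset, varint size, optional length-prefixed first key) is a simplification, and RocksDB's actual IndexValue encoding, including delta-encoded handles, is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for an index entry; not RocksDB's IndexValue.
struct ToyIndexValue {
  uint64_t offset = 0;
  uint64_t size = 0;
  std::string first_key;  // serialized only when have_first_key is true
};

// Appends v as a little-endian base-128 varint.
void PutVarint64(std::string* dst, uint64_t v) {
  while (v >= 0x80) {
    dst->push_back(static_cast<char>((v & 0x7F) | 0x80));
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));
}

// Reads a varint starting at *pos; returns false on truncated input.
bool GetVarint64(const std::string& src, std::size_t* pos, uint64_t* v) {
  *v = 0;
  for (int shift = 0; shift <= 63 && *pos < src.size(); shift += 7) {
    uint64_t byte = static_cast<unsigned char>(src[(*pos)++]);
    *v |= (byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) return true;
  }
  return false;
}

std::string Encode(const ToyIndexValue& e, bool have_first_key) {
  std::string out;
  PutVarint64(&out, e.offset);
  PutVarint64(&out, e.size);
  if (have_first_key) {
    PutVarint64(&out, e.first_key.size());
    out += e.first_key;
  }
  return out;
}

// Returns false if `in` does not parse under the given flag. Unlike a
// streaming decoder, this toy version also rejects trailing bytes.
bool Decode(const std::string& in, bool have_first_key, ToyIndexValue* out) {
  assert(out != nullptr);
  ToyIndexValue e;
  std::size_t pos = 0;
  if (!GetVarint64(in, &pos, &e.offset)) return false;
  if (!GetVarint64(in, &pos, &e.size)) return false;
  if (have_first_key) {
    uint64_t len = 0;
    if (!GetVarint64(in, &pos, &len)) return false;
    if (len > in.size() - pos) return false;
    e.first_key = in.substr(pos, static_cast<std::size_t>(len));
    pos += static_cast<std::size_t>(len);
  }
  if (pos != in.size()) return false;
  *out = e;
  return true;
}
```

In this sketch, bytes written with `have_first_key = true` fail to decode with `false` (leftover key bytes), and bytes written with `false` fail to decode with `true` (missing key length), which is the mismatch the comment warns about.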
@ -471,7 +480,7 @@ class DataBlockIter final : public BlockIter<Slice> {
  bool SeekForGetImpl(const Slice& target);
 };
-class IndexBlockIter final : public BlockIter<BlockHandle> {
+class IndexBlockIter final : public BlockIter<IndexValue> {
 public:
  IndexBlockIter() : BlockIter(), prefix_index_(nullptr) {}
@ -483,23 +492,12 @@ class IndexBlockIter final : public BlockIter<BlockHandle> {
  // format.
  // value_is_full, default true, means that no delta encoding is
  // applied to values.
-  IndexBlockIter(const Comparator* comparator,
-                 const Comparator* user_comparator, const char* data,
-                 uint32_t restarts, uint32_t num_restarts,
-                 BlockPrefixIndex* prefix_index, bool key_includes_seq,
-                 bool value_is_full, bool block_contents_pinned)
-      : IndexBlockIter() {
-    Initialize(comparator, user_comparator, data, restarts, num_restarts,
-               prefix_index, key_includes_seq, block_contents_pinned,
-               value_is_full, nullptr /* data_block_hash_index */);
-  }
  void Initialize(const Comparator* comparator,
                  const Comparator* user_comparator, const char* data,
                  uint32_t restarts, uint32_t num_restarts,
-                  BlockPrefixIndex* prefix_index, bool key_includes_seq,
-                  bool value_is_full, bool block_contents_pinned,
-                  DataBlockHashIndex* /*data_block_hash_index*/) {
+                  SequenceNumber global_seqno, BlockPrefixIndex* prefix_index,
+                  bool have_first_key, bool key_includes_seq,
+                  bool value_is_full, bool block_contents_pinned) {
    InitializeBase(key_includes_seq ? comparator : user_comparator, data,
                   restarts, num_restarts, kDisableGlobalSequenceNumber,
                   block_contents_pinned);
@ -507,6 +505,12 @@ class IndexBlockIter final : public BlockIter<BlockHandle> {
    key_.SetIsUserKey(!key_includes_seq_);
    prefix_index_ = prefix_index;
    value_delta_encoded_ = !value_is_full;
    have_first_key_ = have_first_key;
    if (have_first_key_ && global_seqno != kDisableGlobalSequenceNumber) {
      global_seqno_state_.reset(new GlobalSeqnoState(global_seqno));
    } else {
      global_seqno_state_.reset();
    }
  }
  Slice user_key() const override {
@ -516,16 +520,17 @@ class IndexBlockIter final : public BlockIter<BlockHandle> {
    return key();
  }
-  virtual BlockHandle value() const override {
+  virtual IndexValue value() const override {
    assert(Valid());
-    if (value_delta_encoded_) {
+    if (value_delta_encoded_ || global_seqno_state_ != nullptr) {
      return decoded_value_;
    } else {
-      BlockHandle handle;
+      IndexValue entry;
      Slice v = value_;
-      Status decode_s __attribute__((__unused__)) = handle.DecodeFrom(&v);
+      Status decode_s __attribute__((__unused__)) =
+          entry.DecodeFrom(&v, have_first_key_, nullptr);
      assert(decode_s.ok());
-      return handle;
+      return entry;
    }
  }
@ -552,10 +557,15 @@ class IndexBlockIter final : public BlockIter<BlockHandle> {
  void Invalidate(Status s) { InvalidateBase(s); }
  bool IsValuePinned() const override {
    return global_seqno_state_ != nullptr ? false : BlockIter::IsValuePinned();
  }
 private:
  // Key is in InternalKey format
  bool key_includes_seq_;
  bool value_delta_encoded_;
  bool have_first_key_;  // value includes first_internal_key
  BlockPrefixIndex* prefix_index_;
  // Whether the value is delta encoded. In that case the value is assumed to be
  // BlockHandle. The first value in each restart interval is the full encoded
@ -563,7 +573,22 @@ class IndexBlockIter final : public BlockIter<BlockHandle> {
  // offset of delta encoded BlockHandles is computed by adding the size of
  // previous delta encoded values in the same restart interval to the offset of
  // the first value in that restart interval.
-  BlockHandle decoded_value_;
+  IndexValue decoded_value_;
  // When sequence number overwriting is enabled, this struct contains the seqno
  // to overwrite with, and current first_internal_key with overwritten seqno.
  // This is rarely used, so we put it behind a pointer and only allocate when
  // needed.
  struct GlobalSeqnoState {
    // First internal key according to current index entry, but with sequence
    // number overwritten to global_seqno.
    IterKey first_internal_key;
    SequenceNumber global_seqno;
    explicit GlobalSeqnoState(SequenceNumber seqno) : global_seqno(seqno) {}
  };
  std::unique_ptr<GlobalSeqnoState> global_seqno_state_;
  bool PrefixSeek(const Slice& target, uint32_t* index);
  bool BinaryBlockIndexSeek(const Slice& target, uint32_t* block_ids,
--- a/table/block_based/block_based_table_reader.cc
+++ b/table/block_based/block_based_table_reader.cc
--- a/table/block_based/block_based_table_reader.h
+++ b/table/block_based/block_based_table_reader.h
@ -43,7 +43,6 @@
 namespace rocksdb {
 class BlockHandle;
 class Cache;
 class FilterBlockReader;
 class BlockBasedFilterBlockReader;
@ -198,7 +197,7 @@ class BlockBasedTable : public TableReader {
    // wraps the passed iter. In the latter case the return value points
    // to a different object then iter, and the callee has the ownership of the
    // returned object.
-    virtual InternalIteratorBase<BlockHandle>* NewIterator(
+    virtual InternalIteratorBase<IndexValue>* NewIterator(
        const ReadOptions& read_options, bool disable_prefix_seek,
        IndexBlockIter* iter, GetContext* get_context,
        BlockCacheLookupContext* lookup_context) = 0;
@ -230,8 +229,7 @@ class BlockBasedTable : public TableReader {
  template <typename TBlockIter>
  TBlockIter* NewDataBlockIterator(
      const ReadOptions& ro, const BlockHandle& block_handle,
-      TBlockIter* input_iter, BlockType block_type, bool key_includes_seq,
-      bool index_key_is_full, GetContext* get_context,
+      TBlockIter* input_iter, BlockType block_type, GetContext* get_context,
      BlockCacheLookupContext* lookup_context, Status s,
      FilePrefetchBuffer* prefetch_buffer, bool for_compaction = false) const;
@ -259,6 +257,12 @@ class BlockBasedTable : public TableReader {
                                   BlockType block_type,
                                   GetContext* get_context) const;
  // Either Block::NewDataIterator() or Block::NewIndexIterator().
  template <typename TBlockIter>
  static TBlockIter* InitBlockIterator(const Rep* rep, Block* block,
                                       TBlockIter* input_iter,
                                       bool block_contents_pinned);
  // If block cache enabled (compressed or uncompressed), looks for the block
  // identified by handle in (1) uncompressed cache, (2) compressed cache, and
  // then (3) file. If found, inserts into the cache(s) that were searched
@ -312,7 +316,7 @@ class BlockBasedTable : public TableReader {
  //  2. index is not present in block cache.
  //  3. We disallowed any io to be performed, that is, read_options ==
  //     kBlockCacheTier
-  InternalIteratorBase<BlockHandle>* NewIndexIterator(
+  InternalIteratorBase<IndexValue>* NewIndexIterator(
      const ReadOptions& read_options, bool need_upper_bound_check,
      IndexBlockIter* input_iter, GetContext* get_context,
      BlockCacheLookupContext* lookup_context) const;
@ -355,9 +359,6 @@ class BlockBasedTable : public TableReader {
  friend class TableCache;
  friend class BlockBasedTableBuilder;
-  // Figure the index type, update it in rep_, and also return it.
-  BlockBasedTableOptions::IndexType UpdateIndexType();
  // Create a index reader based on the index type stored in the table.
  // Optionally, user can pass a preloaded meta_index_iter for the index that
  // need to access extra meta blocks for index construction. This parameter
@ -410,7 +411,7 @@ class BlockBasedTable : public TableReader {
  static BlockType GetBlockTypeForMetaBlockByName(const Slice& meta_block_name);
  Status VerifyChecksumInMetaBlocks(InternalIteratorBase<Slice>* index_iter);
-  Status VerifyChecksumInBlocks(InternalIteratorBase<BlockHandle>* index_iter);
+  Status VerifyChecksumInBlocks(InternalIteratorBase<IndexValue>* index_iter);
  // Create the filter from the filter block.
  virtual FilterBlockReader* ReadFilter(
@ -446,17 +447,14 @@ class BlockBasedTable::PartitionedIndexIteratorState
 public:
  PartitionedIndexIteratorState(
      const BlockBasedTable* table,
-      std::unordered_map<uint64_t, CachableEntry<Block>>* block_map,
-      const bool index_key_includes_seq, const bool index_key_is_full);
-  InternalIteratorBase<BlockHandle>* NewSecondaryIterator(
+      std::unordered_map<uint64_t, CachableEntry<Block>>* block_map);
+  InternalIteratorBase<IndexValue>* NewSecondaryIterator(
      const BlockHandle& index_value) override;
 private:
  // Don't own table_
  const BlockBasedTable* table_;
  std::unordered_map<uint64_t, CachableEntry<Block>>* block_map_;
-  bool index_key_includes_seq_;
-  bool index_key_is_full_;
 };
 // Stores all the properties associated with a BlockBasedTable.
@ -564,12 +562,16 @@ struct BlockBasedTable::Rep {
  // still work, just not as quickly.
  bool blocks_definitely_zstd_compressed = false;
  // These describe how index is encoded.
  bool index_has_first_key = false;
  bool index_key_includes_seq = true;
  bool index_value_is_full = true;
  bool closed = false;
  const bool immortal_table;
  SequenceNumber get_global_seqno(BlockType block_type) const {
    return (block_type == BlockType::kFilter ||
            block_type == BlockType::kIndex ||
            block_type == BlockType::kCompressionDictionary)
               ? kDisableGlobalSequenceNumber
               : global_seqno;
@ -602,11 +604,10 @@ class BlockBasedTableIterator : public InternalIteratorBase<TValue> {
  BlockBasedTableIterator(const BlockBasedTable* table,
                          const ReadOptions& read_options,
                          const InternalKeyComparator& icomp,
-                          InternalIteratorBase<BlockHandle>* index_iter,
+                          InternalIteratorBase<IndexValue>* index_iter,
                          bool check_filter, bool need_upper_bound_check,
                          const SliceTransform* prefix_extractor,
-                          BlockType block_type, bool key_includes_seq,
-                          bool index_key_is_full, TableReaderCaller caller,
+                          BlockType block_type, TableReaderCaller caller,
                          size_t compaction_readahead_size = 0)
      : InternalIteratorBase<TValue>(false),
        table_(table),
@ -620,8 +621,6 @@ class BlockBasedTableIterator : public InternalIteratorBase<TValue> {
        need_upper_bound_check_(need_upper_bound_check),
        prefix_extractor_(prefix_extractor),
        block_type_(block_type),
-        key_includes_seq_(key_includes_seq),
-        index_key_is_full_(index_key_is_full),
        lookup_context_(caller),
        compaction_readahead_size_(compaction_readahead_size) {}
@ -635,19 +634,38 @@ class BlockBasedTableIterator : public InternalIteratorBase<TValue> {
  bool NextAndGetResult(Slice* ret_key) override;
  void Prev() override;
  bool Valid() const override {
-    return !is_out_of_bound_ && block_iter_points_to_real_block_ &&
-           block_iter_.Valid();
+    return !is_out_of_bound_ &&
+           (is_at_first_key_from_index_ ||
+            (block_iter_points_to_real_block_ && block_iter_.Valid()));
  }
  Slice key() const override {
    assert(Valid());
-    return block_iter_.key();
+    if (is_at_first_key_from_index_) {
+      return index_iter_->value().first_internal_key;
+    } else {
+      return block_iter_.key();
+    }
  }
  Slice user_key() const override {
    assert(Valid());
-    return block_iter_.user_key();
+    if (is_at_first_key_from_index_) {
+      return ExtractUserKey(index_iter_->value().first_internal_key);
+    } else {
+      return block_iter_.user_key();
+    }
  }
  TValue value() const override {
    assert(Valid());
    // Load current block if not loaded.
    if (is_at_first_key_from_index_ &&
        !const_cast<BlockBasedTableIterator*>(this)
             ->MaterializeCurrentBlock()) {
      // Oops, index is not consistent with block contents, but we have
      // no good way to report error at this point. Let's return empty value.
      return TValue();
    }
    return block_iter_.value();
  }
  Status status() const override {
@ -667,10 +685,17 @@ class BlockBasedTableIterator : public InternalIteratorBase<TValue> {
    pinned_iters_mgr_ = pinned_iters_mgr;
  }
  bool IsKeyPinned() const override {
    // Our key comes either from block_iter_'s current key
    // or index_iter_'s current *value*.
    return pinned_iters_mgr_ && pinned_iters_mgr_->PinningEnabled() &&
-           block_iter_points_to_real_block_ && block_iter_.IsKeyPinned();
+           ((is_at_first_key_from_index_ && index_iter_->IsValuePinned()) ||
+            (block_iter_points_to_real_block_ && block_iter_.IsKeyPinned()));
  }
  bool IsValuePinned() const override {
    // Load current block if not loaded.
    if (is_at_first_key_from_index_) {
      const_cast<BlockBasedTableIterator*>(this)->MaterializeCurrentBlock();
    }
    // BlockIter::IsValuePinned() is always true. No need to check
    return pinned_iters_mgr_ && pinned_iters_mgr_->PinningEnabled() &&
           block_iter_points_to_real_block_;
@ -704,35 +729,33 @@ class BlockBasedTableIterator : public InternalIteratorBase<TValue> {
    if (block_iter_points_to_real_block_) {
      // Reseek. If they end up with the same data block, we shouldn't re-fetch
      // the same data block.
-      prev_index_value_ = index_iter_->value();
+      prev_block_offset_ = index_iter_->value().handle.offset();
    }
  }
-  void InitDataBlock();
-  inline void FindKeyForward();
-  void FindBlockForward();
-  void FindKeyBackward();
-  void CheckOutOfBound();
 private:
  const BlockBasedTable* table_;
  const ReadOptions read_options_;
  const InternalKeyComparator& icomp_;
  UserComparatorWrapper user_comparator_;
-  InternalIteratorBase<BlockHandle>* index_iter_;
+  InternalIteratorBase<IndexValue>* index_iter_;
  PinnedIteratorsManager* pinned_iters_mgr_;
  TBlockIter block_iter_;
  // True if block_iter_ is initialized and points to the same block
  // as index iterator.
  bool block_iter_points_to_real_block_;
  // See InternalIteratorBase::IsOutOfBound().
  bool is_out_of_bound_ = false;
  // True if we're standing at the first key of a block, and we haven't loaded
  // that block yet. A call to value() will trigger loading the block.
  bool is_at_first_key_from_index_ = false;
  bool check_filter_;
  // TODO(Zhongyi): pick a better name
  bool need_upper_bound_check_;
  const SliceTransform* prefix_extractor_;
  BlockType block_type_;
-  // If the keys in the blocks over which we iterate include 8 byte sequence
-  bool key_includes_seq_;
-  bool index_key_is_full_;
-  BlockHandle prev_index_value_;
+  uint64_t prev_block_offset_;
  BlockCacheLookupContext lookup_context_;
  // Readahead size used in compaction, its value is used only if
  // lookup_context_.caller = kCompaction.
@ -748,6 +771,16 @@ class BlockBasedTableIterator : public InternalIteratorBase<TValue> {
  size_t readahead_limit_ = 0;
  int64_t num_file_reads_ = 0;
  std::unique_ptr<FilePrefetchBuffer> prefetch_buffer_;
  // If `target` is null, seek to first.
  void SeekImpl(const Slice* target);
  void InitDataBlock();
  bool MaterializeCurrentBlock();
  void FindKeyForward();
  void FindBlockForward();
  void FindKeyBackward();
  void CheckOutOfBound();
 };
 }  // namespace rocksdb
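Reviewer note: the deferred-load behaviour added to `BlockBasedTableIterator` above can be hard to follow from the diff alone, so here is a minimal standalone sketch of the idea. It is a toy model, not RocksDB code: `ToyBlock`, `ToyIndexEntry`, `ToyLazyIterator` and the `blocks_loaded()` counter are hypothetical names invented for illustration, blocks are assumed to be non-empty, and bounds checks, filters and error handling are left out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy model only: none of these types exist in RocksDB.
struct ToyBlock {
  // Sorted key/value pairs; assumed non-empty.
  std::vector<std::pair<std::string, std::string>> entries;
};

struct ToyIndexEntry {
  std::string first_key;  // copy of the first key stored in the block
  std::size_t block_id;   // stands in for a BlockHandle
};

class ToyLazyIterator {
 public:
  ToyLazyIterator(const std::vector<ToyIndexEntry>* index,
                  const std::vector<ToyBlock>* storage)
      : index_(index), storage_(storage) {}

  // Position at the first key of the first block without reading the block.
  void SeekToFirst() {
    index_pos_ = 0;
    entry_pos_ = 0;
    block_ = nullptr;
    at_first_key_from_index_ = !index_->empty();
  }

  bool Valid() const {
    return at_first_key_from_index_ ||
           (block_ != nullptr && entry_pos_ < block_->entries.size());
  }

  // Served from the index entry while the block is still unloaded.
  const std::string& key() const {
    assert(Valid());
    if (at_first_key_from_index_) return (*index_)[index_pos_].first_key;
    return block_->entries[entry_pos_].first;
  }

  // The first access to the value forces the block to be loaded.
  const std::string& value() {
    assert(Valid());
    Materialize();
    return block_->entries[entry_pos_].second;
  }

  void Next() {
    assert(Valid());
    Materialize();  // moving past the first key needs the block contents
    if (++entry_pos_ < block_->entries.size()) return;
    // Block exhausted: stand on the next block's first key, still unloaded.
    block_ = nullptr;
    entry_pos_ = 0;
    ++index_pos_;
    at_first_key_from_index_ = index_pos_ < index_->size();
  }

  int blocks_loaded() const { return blocks_loaded_; }

 private:
  // Loads the current block if we are still standing on the index's copy
  // of its first key; otherwise does nothing.
  void Materialize() {
    if (!at_first_key_from_index_) return;
    block_ = &(*storage_)[(*index_)[index_pos_].block_id];
    ++blocks_loaded_;
    at_first_key_from_index_ = false;
  }

  const std::vector<ToyIndexEntry>* index_;
  const std::vector<ToyBlock>* storage_;
  const ToyBlock* block_ = nullptr;
  std::size_t index_pos_ = 0;
  std::size_t entry_pos_ = 0;
  bool at_first_key_from_index_ = false;
  int blocks_loaded_ = 0;
};
```

In this sketch a caller that only compares `key()` values, as a merging iterator does while the file is not at the top of its heap, never triggers a block load; the load happens on the first `value()` or `Next()`.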
--- a/table/block_based/block_test.cc
+++ b/table/block_based/block_test.cc
@ -69,37 +69,12 @@ void GenerateRandomKVs(std::vector<std::string> *keys,
  }
 }
-// Same as GenerateRandomKVs but the values are BlockHandle
-void GenerateRandomKBHs(std::vector<std::string> *keys,
-                        std::vector<BlockHandle> *values, const int from,
-                        const int len, const int step = 1,
-                        const int padding_size = 0,
-                        const int keys_share_prefix = 1) {
-  Random rnd(302);
-  uint64_t offset = 0;
-  // generate different prefix
-  for (int i = from; i < from + len; i += step) {
-    // generate keys that shares the prefix
-    for (int j = 0; j < keys_share_prefix; ++j) {
-      keys->emplace_back(GenerateKey(i, j, padding_size, &rnd));
-      uint64_t size = rnd.Uniform(1024 * 16);
-      BlockHandle handle(offset, size);
-      offset += size + kBlockTrailerSize;
-      values->emplace_back(handle);
-    }
-  }
-}
 class BlockTest : public testing::Test {};
 // block test
 TEST_F(BlockTest, SimpleTest) {
  Random rnd(301);
  Options options = Options();
-  std::unique_ptr<InternalKeyComparator> ic;
-  ic.reset(new test::PlainInternalKeyComparator(options.comparator));
  std::vector<std::string> keys;
  std::vector<std::string> values;
@ -123,7 +98,7 @@ TEST_F(BlockTest, SimpleTest) {
  // read contents of block sequentially
  int count = 0;
  InternalIterator *iter =
-      reader.NewIterator<DataBlockIter>(options.comparator, options.comparator);
+      reader.NewDataIterator(options.comparator, options.comparator);
  for (iter->SeekToFirst(); iter->Valid(); count++, iter->Next()) {
    // read kv from block
    Slice k = iter->key();
@ -136,8 +111,7 @@ TEST_F(BlockTest, SimpleTest) {
  delete iter;
  // read block contents randomly
-  iter =
-      reader.NewIterator<DataBlockIter>(options.comparator, options.comparator);
+  iter = reader.NewDataIterator(options.comparator, options.comparator);
  for (int i = 0; i < num_records; i++) {
    // find a random key in the lookaside array
    int index = rnd.Uniform(num_records);
@ -152,83 +126,6 @@ TEST_F(BlockTest, SimpleTest) {
  delete iter;
 }
-TEST_F(BlockTest, ValueDeltaEncodingTest) {
-  Random rnd(301);
-  Options options = Options();
-  std::unique_ptr<InternalKeyComparator> ic;
-  ic.reset(new test::PlainInternalKeyComparator(options.comparator));
-  std::vector<std::string> keys;
-  std::vector<BlockHandle> values;
-  const bool kUseDeltaEncoding = true;
-  const bool kUseValueDeltaEncoding = true;
-  BlockBuilder builder(16, kUseDeltaEncoding, kUseValueDeltaEncoding);
-  int num_records = 100;
-  GenerateRandomKBHs(&keys, &values, 0, num_records);
-  // add a bunch of records to a block
-  BlockHandle last_encoded_handle;
-  for (int i = 0; i < num_records; i++) {
-    auto block_handle = values[i];
-    std::string handle_encoding;
-    block_handle.EncodeTo(&handle_encoding);
-    std::string handle_delta_encoding;
-    PutVarsignedint64(&handle_delta_encoding,
-                      block_handle.size() - last_encoded_handle.size());
-    last_encoded_handle = block_handle;
-    const Slice handle_delta_encoding_slice(handle_delta_encoding);
-    builder.Add(keys[i], handle_encoding, &handle_delta_encoding_slice);
-  }
-  // read serialized contents of the block
-  Slice rawblock = builder.Finish();
-  // create block reader
-  BlockContents contents;
-  contents.data = rawblock;
-  Block reader(std::move(contents), kDisableGlobalSequenceNumber);
-  const bool kTotalOrderSeek = true;
-  const bool kIncludesSeq = true;
-  const bool kValueIsFull = !kUseValueDeltaEncoding;
-  IndexBlockIter *kNullIter = nullptr;
-  Statistics *kNullStats = nullptr;
-  // read contents of block sequentially
-  int count = 0;
-  InternalIteratorBase<BlockHandle> *iter = reader.NewIterator<IndexBlockIter>(
-      options.comparator, options.comparator, kNullIter, kNullStats,
-      kTotalOrderSeek, kIncludesSeq, kValueIsFull);
-  for (iter->SeekToFirst(); iter->Valid(); count++, iter->Next()) {
-    // read kv from block
-    Slice k = iter->key();
-    BlockHandle handle = iter->value();
-    // compare with lookaside array
-    ASSERT_EQ(k.ToString().compare(keys[count]), 0);
-    ASSERT_EQ(values[count].offset(), handle.offset());
-    ASSERT_EQ(values[count].size(), handle.size());
-  }
-  delete iter;
-  // read block contents randomly
-  iter = reader.NewIterator<IndexBlockIter>(
-      options.comparator, options.comparator, kNullIter, kNullStats,
-      kTotalOrderSeek, kIncludesSeq, kValueIsFull);
-  for (int i = 0; i < num_records; i++) {
-    // find a random key in the lookaside array
-    int index = rnd.Uniform(num_records);
-    Slice k(keys[index]);
-    // search in block for this key
-    iter->Seek(k);
-    ASSERT_TRUE(iter->Valid());
-    BlockHandle handle = iter->value();
-    ASSERT_EQ(values[index].offset(), handle.offset());
-    ASSERT_EQ(values[index].size(), handle.size());
-  }
-  delete iter;
-}
 // return the block contents
 BlockContents GetBlockContents(std::unique_ptr<BlockBuilder> *builder,
                               const std::vector<std::string> &keys,
@ -261,8 +158,7 @@ void CheckBlockContents(BlockContents contents, const int max_key,
      NewFixedPrefixTransform(prefix_size));
  std::unique_ptr<InternalIterator> regular_iter(
-      reader2.NewIterator<DataBlockIter>(BytewiseComparator(),
-                                         BytewiseComparator()));
+      reader2.NewDataIterator(BytewiseComparator(), BytewiseComparator()));
  // Seek existent keys
  for (size_t i = 0; i < keys.size(); i++) {
@ -457,8 +353,6 @@ TEST_F(BlockTest, BlockReadAmpBitmap) {
 TEST_F(BlockTest, BlockWithReadAmpBitmap) {
  Random rnd(301);
  Options options = Options();
-  std::unique_ptr<InternalKeyComparator> ic;
-  ic.reset(new test::PlainInternalKeyComparator(options.comparator));
  std::vector<std::string> keys;
  std::vector<std::string> values;
@ -486,9 +380,8 @@ TEST_F(BlockTest, BlockWithReadAmpBitmap) {
    // read contents of block sequentially
    size_t read_bytes = 0;
-    DataBlockIter *iter =
-        static_cast<DataBlockIter *>(reader.NewIterator<DataBlockIter>(
-            options.comparator, options.comparator, nullptr, stats.get()));
+    DataBlockIter *iter = reader.NewDataIterator(
+        options.comparator, options.comparator, nullptr, stats.get());
    for (iter->SeekToFirst(); iter->Valid(); iter->Next()) {
      iter->value();
      read_bytes += iter->TEST_CurrentEntrySize();
@ -519,9 +412,8 @@ TEST_F(BlockTest, BlockWithReadAmpBitmap) {
                 kBytesPerBit, stats.get());
    size_t read_bytes = 0;
-    DataBlockIter *iter =
-        static_cast<DataBlockIter *>(reader.NewIterator<DataBlockIter>(
-            options.comparator, options.comparator, nullptr, stats.get()));
+    DataBlockIter *iter = reader.NewDataIterator(
+        options.comparator, options.comparator, nullptr, stats.get());
    for (int i = 0; i < num_records; i++) {
      Slice k(keys[i]);
@ -555,9 +447,8 @@ TEST_F(BlockTest, BlockWithReadAmpBitmap) {
                 kBytesPerBit, stats.get());
    size_t read_bytes = 0;
-    DataBlockIter *iter =
-        static_cast<DataBlockIter *>(reader.NewIterator<DataBlockIter>(
-            options.comparator, options.comparator, nullptr, stats.get()));
+    DataBlockIter *iter = reader.NewDataIterator(
+        options.comparator, options.comparator, nullptr, stats.get());
    std::unordered_set<int> read_keys;
    for (int i = 0; i < num_records; i++) {
      int index = rnd.Uniform(num_records);
@ -602,6 +493,132 @@ TEST_F(BlockTest, ReadAmpBitmapPow2) {
  ASSERT_EQ(BlockReadAmpBitmap(100, 35, stats.get()).GetBytesPerBit(), 32);
 }
 class IndexBlockTest
    : public testing::Test,
      public testing::WithParamInterface<std::tuple<bool, bool>> {
 public:
  IndexBlockTest() = default;
  bool useValueDeltaEncoding() const { return std::get<0>(GetParam()); }
  bool includeFirstKey() const { return std::get<1>(GetParam()); }
 };
 // Similar to GenerateRandomKVs but for index block contents.
 void GenerateRandomIndexEntries(std::vector<std::string> *separators,
                                std::vector<BlockHandle> *block_handles,
                                std::vector<std::string> *first_keys,
                                const int len) {
  Random rnd(42);
  // For each of `len` blocks, we need to generate a first and last key.
  // Let's generate n*2 random keys, sort them, group into consecutive pairs.
  std::set<std::string> keys;
  while ((int)keys.size() < len * 2) {
    // Keys need to be at least 8 bytes long to look like internal keys.
    keys.insert(test::RandomKey(&rnd, 12));
  }
  uint64_t offset = 0;
  for (auto it = keys.begin(); it != keys.end();) {
    first_keys->emplace_back(*it++);
    separators->emplace_back(*it++);
    uint64_t size = rnd.Uniform(1024 * 16);
    BlockHandle handle(offset, size);
    offset += size + kBlockTrailerSize;
    block_handles->emplace_back(handle);
  }
 }
 TEST_P(IndexBlockTest, IndexValueEncodingTest) {
  Random rnd(301);
  Options options = Options();
  std::vector<std::string> separators;
  std::vector<BlockHandle> block_handles;
  std::vector<std::string> first_keys;
  const bool kUseDeltaEncoding = true;
  BlockBuilder builder(16, kUseDeltaEncoding, useValueDeltaEncoding());
  int num_records = 100;
  GenerateRandomIndexEntries(&separators, &block_handles, &first_keys,
                             num_records);
  BlockHandle last_encoded_handle;
  for (int i = 0; i < num_records; i++) {
    IndexValue entry(block_handles[i], first_keys[i]);
    std::string encoded_entry;
    std::string delta_encoded_entry;
    entry.EncodeTo(&encoded_entry, includeFirstKey(), nullptr);
    if (useValueDeltaEncoding() && i > 0) {
      entry.EncodeTo(&delta_encoded_entry, includeFirstKey(),
                     &last_encoded_handle);
    }
    last_encoded_handle = entry.handle;
    const Slice delta_encoded_entry_slice(delta_encoded_entry);
    builder.Add(separators[i], encoded_entry, &delta_encoded_entry_slice);
  }
  // read serialized contents of the block
  Slice rawblock = builder.Finish();
  // create block reader
  BlockContents contents;
  contents.data = rawblock;
  Block reader(std::move(contents), kDisableGlobalSequenceNumber);
  const bool kTotalOrderSeek = true;
  const bool kIncludesSeq = true;
  const bool kValueIsFull = !useValueDeltaEncoding();
  IndexBlockIter *kNullIter = nullptr;
  Statistics *kNullStats = nullptr;
  // read contents of block sequentially
  InternalIteratorBase<IndexValue> *iter = reader.NewIndexIterator(
      options.comparator, options.comparator, kNullIter, kNullStats,
      kTotalOrderSeek, includeFirstKey(), kIncludesSeq, kValueIsFull);
  iter->SeekToFirst();
  for (int index = 0; index < num_records; ++index) {
    ASSERT_TRUE(iter->Valid());
    Slice k = iter->key();
    IndexValue v = iter->value();
    EXPECT_EQ(separators[index], k.ToString());
    EXPECT_EQ(block_handles[index].offset(), v.handle.offset());
    EXPECT_EQ(block_handles[index].size(), v.handle.size());
    EXPECT_EQ(includeFirstKey() ? first_keys[index] : "",
              v.first_internal_key.ToString());
    iter->Next();
  }
  delete iter;
  // read block contents randomly
  iter = reader.NewIndexIterator(options.comparator, options.comparator,
                                 kNullIter, kNullStats, kTotalOrderSeek,
                                 includeFirstKey(), kIncludesSeq, kValueIsFull);
  for (int i = 0; i < num_records * 2; i++) {
    // find a random key in the lookaside array
    int index = rnd.Uniform(num_records);
    Slice k(separators[index]);
    // search in block for this key
    iter->Seek(k);
    ASSERT_TRUE(iter->Valid());
    IndexValue v = iter->value();
    EXPECT_EQ(separators[index], iter->key().ToString());
    EXPECT_EQ(block_handles[index].offset(), v.handle.offset());
    EXPECT_EQ(block_handles[index].size(), v.handle.size());
    EXPECT_EQ(includeFirstKey() ? first_keys[index] : "",
              v.first_internal_key.ToString());
  }
  delete iter;
 }
 INSTANTIATE_TEST_CASE_P(P, IndexBlockTest,
                        ::testing::Values(std::make_tuple(false, false),
                                          std::make_tuple(false, true),
                                          std::make_tuple(true, false),
                                          std::make_tuple(true, true)));
 }  // namespace rocksdb
 int main(int argc, char **argv) {
--- a/table/block_based/data_block_hash_index_test.cc
+++ b/table/block_based/data_block_hash_index_test.cc
@ -391,7 +391,7 @@ TEST(DataBlockHashIndex, BlockTestSingleKey) {
  Block reader(std::move(contents), kDisableGlobalSequenceNumber);
  const InternalKeyComparator icmp(BytewiseComparator());
-  auto iter = reader.NewIterator<DataBlockIter>(&icmp, icmp.user_comparator());
+  auto iter = reader.NewDataIterator(&icmp, icmp.user_comparator());
  bool may_exist;
  // search in block for the key just inserted
  {
@ -474,8 +474,7 @@ TEST(DataBlockHashIndex, BlockTestLarge) {
  // random seek existent keys
  for (int i = 0; i < num_records; i++) {
-    auto iter =
-        reader.NewIterator<DataBlockIter>(&icmp, icmp.user_comparator());
+    auto iter = reader.NewDataIterator(&icmp, icmp.user_comparator());
    // find a random key in the lookaside array
    int index = rnd.Uniform(num_records);
    std::string ukey(keys[index] + "1" /* existing key marker */);
@ -512,8 +511,7 @@ TEST(DataBlockHashIndex, BlockTestLarge) {
  //     C         true          false
  for (int i = 0; i < num_records; i++) {
-    auto iter =
-        reader.NewIterator<DataBlockIter>(&icmp, icmp.user_comparator());
+    auto iter = reader.NewDataIterator(&icmp, icmp.user_comparator());
    // find a random key in the lookaside array
    int index = rnd.Uniform(num_records);
    std::string ukey(keys[index] + "0" /* non-existing key marker */);
--- a/table/block_based/index_builder.cc
+++ b/table/block_based/index_builder.cc
@ -36,7 +36,7 @@ IndexBuilder* IndexBuilder::CreateIndexBuilder(
      result = new ShortenedIndexBuilder(
          comparator, table_opt.index_block_restart_interval,
          table_opt.format_version, use_value_delta_encoding,
-          table_opt.index_shortening);
+          table_opt.index_shortening, /* include_first_key */ false);
    } break;
    case BlockBasedTableOptions::kHashSearch: {
      result = new HashIndexBuilder(
@ -48,6 +48,12 @@ IndexBuilder* IndexBuilder::CreateIndexBuilder(
      result = PartitionedIndexBuilder::CreateIndexBuilder(
          comparator, use_value_delta_encoding, table_opt);
    } break;
    case BlockBasedTableOptions::kBinarySearchWithFirstKey: {
      result = new ShortenedIndexBuilder(
          comparator, table_opt.index_block_restart_interval,
          table_opt.format_version, use_value_delta_encoding,
          table_opt.index_shortening, /* include_first_key */ true);
    } break;
    default: {
      assert(!"Do not recognize the index type ");
    } break;
@ -94,7 +100,7 @@ void PartitionedIndexBuilder::MakeNewSubIndexBuilder() {
  sub_index_builder_ = new ShortenedIndexBuilder(
      comparator_, table_opt_.index_block_restart_interval,
      table_opt_.format_version, use_value_delta_encoding_,
-      table_opt_.index_shortening);
+      table_opt_.index_shortening, /* include_first_key */ false);
  flush_policy_.reset(FlushBlockBySizePolicyFactory::NewFlushBlockPolicy(
      table_opt_.metadata_block_size, table_opt_.block_size_deviation,
      // Note: this is sub-optimal since sub_index_builder_ could later reset
--- a/table/block_based/index_builder.h
+++ b/table/block_based/index_builder.h
@ -58,6 +58,7 @@ class IndexBuilder {
  // To allow further optimization, we provide `last_key_in_current_block` and
  // `first_key_in_next_block`, based on which the specific implementation can
  // determine the best index key to be used for the index block.
  // Called before the OnKeyAdded() call for first_key_in_next_block.
  // @last_key_in_current_block: this parameter maybe overridden with the value
  //                             "substitute key".
  // @first_key_in_next_block: it will be nullptr if the entry being added is
@ -123,7 +124,8 @@ class ShortenedIndexBuilder : public IndexBuilder {
      const InternalKeyComparator* comparator,
      const int index_block_restart_interval, const uint32_t format_version,
      const bool use_value_delta_encoding,
-      BlockBasedTableOptions::IndexShorteningMode shortening_mode)
+      BlockBasedTableOptions::IndexShorteningMode shortening_mode,
+      bool include_first_key)
      : IndexBuilder(comparator),
        index_block_builder_(index_block_restart_interval,
                             true /*use_delta_encoding*/,
@ -131,11 +133,19 @@ class ShortenedIndexBuilder : public IndexBuilder {
        index_block_builder_without_seq_(index_block_restart_interval,
                                         true /*use_delta_encoding*/,
                                         use_value_delta_encoding),
        use_value_delta_encoding_(use_value_delta_encoding),
        include_first_key_(include_first_key),
        shortening_mode_(shortening_mode) {
    // Making the default true will disable the feature for old versions
    seperator_is_key_plus_seq_ = (format_version <= 2);
  }
  virtual void OnKeyAdded(const Slice& key) override {
    if (include_first_key_ && current_block_first_internal_key_.empty()) {
      current_block_first_internal_key_.assign(key.data(), key.size());
    }
  }
  virtual void AddIndexEntry(std::string* last_key_in_current_block,
                             const Slice* first_key_in_next_block,
                             const BlockHandle& block_handle) override {
@ -159,20 +169,27 @@ class ShortenedIndexBuilder : public IndexBuilder {
    }
    auto sep = Slice(*last_key_in_current_block);
-    std::string handle_encoding;
-    block_handle.EncodeTo(&handle_encoding);
-    std::string handle_delta_encoding;
-    PutVarsignedint64(&handle_delta_encoding,
-                      block_handle.size() - last_encoded_handle_.size());
-    assert(handle_delta_encoding.size() != 0);
+    assert(!include_first_key_ || !current_block_first_internal_key_.empty());
+    IndexValue entry(block_handle, current_block_first_internal_key_);
+    std::string encoded_entry;
+    std::string delta_encoded_entry;
+    entry.EncodeTo(&encoded_entry, include_first_key_, nullptr);
+    if (use_value_delta_encoding_ && !last_encoded_handle_.IsNull()) {
+      entry.EncodeTo(&delta_encoded_entry, include_first_key_,
+                     &last_encoded_handle_);
+    } else {
+      // If it's the first block, or delta encoding is disabled,
+      // BlockBuilder::Add() below won't use delta-encoded slice.
+    }
    last_encoded_handle_ = block_handle;
-    const Slice handle_delta_encoding_slice(handle_delta_encoding);
-    index_block_builder_.Add(sep, handle_encoding,
-                             &handle_delta_encoding_slice);
+    const Slice delta_encoded_entry_slice(delta_encoded_entry);
+    index_block_builder_.Add(sep, encoded_entry, &delta_encoded_entry_slice);
    if (!seperator_is_key_plus_seq_) {
-      index_block_builder_without_seq_.Add(ExtractUserKey(sep), handle_encoding,
-                                           &handle_delta_encoding_slice);
+      index_block_builder_without_seq_.Add(ExtractUserKey(sep), encoded_entry,
+                                           &delta_encoded_entry_slice);
    }
+    current_block_first_internal_key_.clear();
  }
  using IndexBuilder::Finish;
@ -200,9 +217,12 @@ class ShortenedIndexBuilder : public IndexBuilder {
 private:
  BlockBuilder index_block_builder_;
  BlockBuilder index_block_builder_without_seq_;
  const bool use_value_delta_encoding_;
  bool seperator_is_key_plus_seq_;
  const bool include_first_key_;
  BlockBasedTableOptions::IndexShorteningMode shortening_mode_;
-  BlockHandle last_encoded_handle_;
+  BlockHandle last_encoded_handle_ = BlockHandle::NullBlockHandle();
  std::string current_block_first_internal_key_;
 };
 // HashIndexBuilder contains a binary-searchable primary index and the
@ -243,7 +263,7 @@ class HashIndexBuilder : public IndexBuilder {
      : IndexBuilder(comparator),
        primary_index_builder_(comparator, index_block_restart_interval,
                               format_version, use_value_delta_encoding,
-                               shortening_mode),
+                               shortening_mode, /* include_first_key */ false),
        hash_key_extractor_(hash_key_extractor) {}
  virtual void AddIndexEntry(std::string* last_key_in_current_block,
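Reviewer note: the interplay between `OnKeyAdded()` and `AddIndexEntry()` in `ShortenedIndexBuilder` above boils down to "remember the first key seen since the last flush, emit it with the index entry, then forget it". A minimal sketch of that pattern follows. It is a toy under stated assumptions, not the real builder: `ToyIndexRow` and `ToyIndexBuilder` are hypothetical names, keys are plain `std::string`s rather than internal keys, and separator shortening and value encoding are omitted.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Toy model only: these types are illustrative and do not exist in RocksDB.
struct ToyIndexRow {
  std::string separator;       // key >= last key of the block
  std::uint64_t block_offset;  // stands in for a BlockHandle
  std::string first_key;       // empty when the option is off
};

class ToyIndexBuilder {
 public:
  explicit ToyIndexBuilder(bool include_first_key)
      : include_first_key_(include_first_key) {}

  // Called for every key written to the current data block.
  // Only the first key after a flush is remembered.
  void OnKeyAdded(const std::string& key) {
    if (include_first_key_ && current_first_key_.empty()) {
      current_first_key_ = key;
    }
  }

  // Called once when the current data block is flushed.
  void AddIndexEntry(const std::string& separator,
                     std::uint64_t block_offset) {
    // With the option on, at least one key must have been added.
    assert(!include_first_key_ || !current_first_key_.empty());
    rows_.push_back({separator, block_offset, current_first_key_});
    // Reset so the next OnKeyAdded() call captures the next block's first key.
    current_first_key_.clear();
  }

  const std::vector<ToyIndexRow>& rows() const { return rows_; }

 private:
  const bool include_first_key_;
  std::string current_first_key_;
  std::vector<ToyIndexRow> rows_;
};
```

Using an empty string as the "nothing captured yet" marker mirrors the diff, and it relies on keys never being empty, which holds for internal keys because they always carry an 8-byte trailer.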
--- a/table/block_based/partitioned_filter_block.cc
+++ b/table/block_based/partitioned_filter_block.cc
@ -147,12 +147,13 @@ PartitionedFilterBlockReader::~PartitionedFilterBlockReader() {
  IndexBlockIter biter;
  BlockHandle handle;
  Statistics* kNullStats = nullptr;
-  idx_on_fltr_blk_->NewIterator<IndexBlockIter>(
+  idx_on_fltr_blk_->NewIndexIterator(
      &comparator_, comparator_.user_comparator(), &biter, kNullStats, true,
-      index_key_includes_seq_, index_value_is_full_);
+      /* have_first_key */ false, index_key_includes_seq_,
+      index_value_is_full_);
  biter.SeekToFirst();
  for (; biter.Valid(); biter.Next()) {
-    handle = biter.value();
+    handle = biter.value().handle;
    auto key = BlockBasedTable::GetCacheKey(table_->rep_->cache_key_prefix,
                                            table_->rep_->cache_key_prefix_size,
                                            handle, cache_key);
@ -221,15 +222,16 @@ BlockHandle PartitionedFilterBlockReader::GetFilterPartitionHandle(
    const Slice& entry) {
  IndexBlockIter iter;
  Statistics* kNullStats = nullptr;
-  idx_on_fltr_blk_->NewIterator<IndexBlockIter>(
+  idx_on_fltr_blk_->NewIndexIterator(
      &comparator_, comparator_.user_comparator(), &iter, kNullStats, true,
-      index_key_includes_seq_, index_value_is_full_);
+      /* have_first_key */ false, index_key_includes_seq_,
+      index_value_is_full_);
  iter.Seek(entry);
  if (UNLIKELY(!iter.Valid())) {
    return BlockHandle(0, 0);
  }
  assert(iter.Valid());
-  BlockHandle fltr_blk_handle = iter.value();
+  BlockHandle fltr_blk_handle = iter.value().handle;
  return fltr_blk_handle;
 }
@ -280,18 +282,19 @@ void PartitionedFilterBlockReader::CacheDependencies(
  BlockCacheLookupContext lookup_context{TableReaderCaller::kPrefetch};
  IndexBlockIter biter;
  Statistics* kNullStats = nullptr;
-  idx_on_fltr_blk_->NewIterator<IndexBlockIter>(
+  idx_on_fltr_blk_->NewIndexIterator(
      &comparator_, comparator_.user_comparator(), &biter, kNullStats, true,
-      index_key_includes_seq_, index_value_is_full_);
+      /* have_first_key */ false, index_key_includes_seq_,
+      index_value_is_full_);
  // Index partitions are assumed to be consecuitive. Prefetch them all.
  // Read the first block offset
  biter.SeekToFirst();
-  BlockHandle handle = biter.value();
+  BlockHandle handle = biter.value().handle;
  uint64_t prefetch_off = handle.offset();
  // Read the last block's offset
  biter.SeekToLast();
-  handle = biter.value();
+  handle = biter.value().handle;
  uint64_t last_off = handle.offset() + handle.size() + kBlockTrailerSize;
  uint64_t prefetch_len = last_off - prefetch_off;
  std::unique_ptr<FilePrefetchBuffer> prefetch_buffer;
@ -304,7 +307,7 @@ void PartitionedFilterBlockReader::CacheDependencies(
  // After prefetch, read the partitions one by one
  biter.SeekToFirst();
  for (; biter.Valid(); biter.Next()) {
-    handle = biter.value();
+    handle = biter.value().handle;
    const bool no_io = true;
    const bool is_a_filter_partition = true;
    auto filter = table_->GetFilter(
--- a/table/block_fetcher.cc
+++ b/table/block_fetcher.cc
@ -15,7 +15,6 @@
 #include "logging/logging.h"
 #include "memory/memory_allocator.h"
 #include "monitoring/perf_context_imp.h"
 #include "monitoring/statistics.h"
 #include "rocksdb/env.h"
 #include "table/block_based/block.h"
 #include "table/block_based/block_based_table_reader.h"
--- a/table/format.cc
+++ b/table/format.cc
@ -91,6 +91,58 @@ std::string BlockHandle::ToString(bool hex) const {
 const BlockHandle BlockHandle::kNullBlockHandle(0, 0);
 void IndexValue::EncodeTo(std::string* dst, bool have_first_key,
                          const BlockHandle* previous_handle) const {
  if (previous_handle) {
    assert(handle.offset() == previous_handle->offset() +
                                  previous_handle->size() + kBlockTrailerSize);
    PutVarsignedint64(dst, handle.size() - previous_handle->size());
  } else {
    handle.EncodeTo(dst);
  }
  assert(dst->size() != 0);
  if (have_first_key) {
    PutLengthPrefixedSlice(dst, first_internal_key);
  }
 }
 Status IndexValue::DecodeFrom(Slice* input, bool have_first_key,
                              const BlockHandle* previous_handle) {
  if (previous_handle) {
    int64_t delta;
    if (!GetVarsignedint64(input, &delta)) {
      return Status::Corruption("bad delta-encoded index value");
    }
    handle = BlockHandle(
        previous_handle->offset() + previous_handle->size() + kBlockTrailerSize,
        previous_handle->size() + delta);
  } else {
    Status s = handle.DecodeFrom(input);
    if (!s.ok()) {
      return s;
    }
  }
  if (!have_first_key) {
    first_internal_key = Slice();
  } else if (!GetLengthPrefixedSlice(input, &first_internal_key)) {
    return Status::Corruption("bad first key in block info");
  }
  return Status::OK();
 }
 std::string IndexValue::ToString(bool hex, bool have_first_key) const {
  std::string s;
  EncodeTo(&s, have_first_key, nullptr);
  if (hex) {
    return Slice(s).ToString(true);
  } else {
    return s;
  }
 }
 namespace {
 inline bool IsLegacyFooterFormat(uint64_t magic_number) {
  return magic_number == kLegacyBlockBasedTableMagicNumber ||
--- a/table/format.h
+++ b/table/format.h
@ -76,6 +76,35 @@ class BlockHandle {
  static const BlockHandle kNullBlockHandle;
 };
 // Value in block-based table file index.
 //
 // The index entry for block n is: y -> h, [x],
 // where: y is some key between the last key of block n (inclusive) and the
 // first key of block n+1 (exclusive); h is BlockHandle pointing to block n;
 // x, if present, is the first key of block n (unshortened).
 // This struct represents the "h, [x]" part.
 struct IndexValue {
  BlockHandle handle;
  // Empty means unknown.
  Slice first_internal_key;
  IndexValue() = default;
  IndexValue(BlockHandle _handle, Slice _first_internal_key)
      : handle(_handle), first_internal_key(_first_internal_key) {}
  // have_first_key indicates whether the `first_internal_key` is used.
  // If previous_handle is not null, delta encoding is used;
  // in this case, the two handles must point to consecutive blocks:
  // handle.offset() ==
  //     previous_handle->offset() + previous_handle->size() + kBlockTrailerSize
  void EncodeTo(std::string* dst, bool have_first_key,
                const BlockHandle* previous_handle) const;
  Status DecodeFrom(Slice* input, bool have_first_key,
                    const BlockHandle* previous_handle);
  std::string ToString(bool hex, bool have_first_key) const;
 };
 inline uint32_t GetCompressFormatForVersion(CompressionType compression_type,
                                            uint32_t version) {
 #ifdef NDEBUG
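Reviewer note: the `IndexValue` encoding described above (full handle, or a size delta when the previous handle is known, optionally followed by a length-prefixed first key) can be tried out in isolation with the sketch below. Everything in it is an assumption made for illustration: `ToyHandle`, `ToyEncode`, `ToyDecode` and the varint helpers are hypothetical, the signed delta is written as a zigzag varint (assumed to resemble what `PutVarsignedint64` does), and `kToyTrailerSize = 5` stands in for `kBlockTrailerSize`. It is not byte-compatible with RocksDB files.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Assumed stand-in for kBlockTrailerSize.
constexpr std::uint64_t kToyTrailerSize = 5;

struct ToyHandle {
  std::uint64_t offset = 0;
  std::uint64_t size = 0;
};

// Little-endian base-128 varint.
inline void PutVarint(std::string* dst, std::uint64_t v) {
  while (v >= 0x80) {
    dst->push_back(static_cast<char>((v & 0x7f) | 0x80));
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));
}

inline bool GetVarint(const std::string& src, std::size_t* pos,
                      std::uint64_t* v) {
  std::uint64_t result = 0;
  for (int shift = 0; shift <= 63 && *pos < src.size(); shift += 7) {
    const std::uint64_t byte = static_cast<unsigned char>(src[(*pos)++]);
    result |= (byte & 0x7f) << shift;
    if ((byte & 0x80) == 0) {
      *v = result;
      return true;
    }
  }
  return false;  // truncated or overlong input
}

// Zigzag maps small negative numbers to small unsigned numbers.
inline std::uint64_t ZigZag(std::int64_t v) {
  return (static_cast<std::uint64_t>(v) << 1) ^
         static_cast<std::uint64_t>(v >> 63);
}
inline std::int64_t UnZigZag(std::uint64_t v) {
  return static_cast<std::int64_t>(v >> 1) ^
         -static_cast<std::int64_t>(v & 1);
}

// With `prev`, only the size delta is stored; the offset is implied because
// the two blocks must be adjacent in the file.
inline void ToyEncode(std::string* dst, const ToyHandle& h,
                      const std::string& first_key, bool have_first_key,
                      const ToyHandle* prev) {
  if (prev != nullptr) {
    assert(h.offset == prev->offset + prev->size + kToyTrailerSize);
    PutVarint(dst, ZigZag(static_cast<std::int64_t>(h.size - prev->size)));
  } else {
    PutVarint(dst, h.offset);
    PutVarint(dst, h.size);
  }
  if (have_first_key) {
    PutVarint(dst, first_key.size());
    dst->append(first_key);
  }
}

inline bool ToyDecode(const std::string& src, std::size_t* pos,
                      bool have_first_key, const ToyHandle* prev,
                      ToyHandle* h, std::string* first_key) {
  std::uint64_t a = 0, b = 0;
  if (prev != nullptr) {
    if (!GetVarint(src, pos, &a)) return false;
    const std::uint64_t offset = prev->offset + prev->size + kToyTrailerSize;
    const std::uint64_t size =
        prev->size + static_cast<std::uint64_t>(UnZigZag(a));
    h->offset = offset;
    h->size = size;
  } else {
    if (!GetVarint(src, pos, &a) || !GetVarint(src, pos, &b)) return false;
    h->offset = a;
    h->size = b;
  }
  first_key->clear();
  if (have_first_key) {
    if (!GetVarint(src, pos, &a) || a > src.size() - *pos) return false;
    first_key->assign(src, *pos, static_cast<std::size_t>(a));
    *pos += static_cast<std::size_t>(a);
  }
  return true;
}
```

The decoder has to be told both `have_first_key` and whether a previous handle applies, since neither is recorded in the bytes themselves; in the diff this is why the table keeps `index_has_first_key` and `index_value_is_full` alongside the index.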
--- a/table/internal_iterator.h
+++ b/table/internal_iterator.h
@@ -90,8 +90,11 @@ class InternalIteratorBase : public Cleanable {
  // satisfied without doing some IO, then this returns Status::Incomplete().
  virtual Status status() const = 0;
-  // True if the iterator is invalidated because it is out of the iterator
+  // True if the iterator is invalidated because it reached a key that is above
-  // upper bound
+  // the iterator upper bound. Used by LevelIterator to decide whether it should
  // stop or move on to the next file.
  // Important: if iterator reached the end of the file without encountering any
  // keys above the upper bound, IsOutOfBound() must return false.
  virtual bool IsOutOfBound() { return false; }
  // Pass the PinnedIteratorsManager to the Iterator, most Iterators dont
--- a/table/iterator.cc
+++ b/table/iterator.cc
@@ -167,7 +167,7 @@ template <class TValue>
 InternalIteratorBase<TValue>* NewErrorInternalIterator(const Status& status) {
  return new EmptyInternalIterator<TValue>(status);
 }
-template InternalIteratorBase<BlockHandle>* NewErrorInternalIterator(
+template InternalIteratorBase<IndexValue>* NewErrorInternalIterator(
    const Status& status);
 template InternalIteratorBase<Slice>* NewErrorInternalIterator(
    const Status& status);
@@ -182,7 +182,7 @@ InternalIteratorBase<TValue>* NewErrorInternalIterator(const Status& status,
    return new (mem) EmptyInternalIterator<TValue>(status);
  }
 }
-template InternalIteratorBase<BlockHandle>* NewErrorInternalIterator(
+template InternalIteratorBase<IndexValue>* NewErrorInternalIterator(
    const Status& status, Arena* arena);
 template InternalIteratorBase<Slice>* NewErrorInternalIterator(
    const Status& status, Arena* arena);
@@ -191,7 +191,7 @@ template <class TValue>
 InternalIteratorBase<TValue>* NewEmptyInternalIterator() {
  return new EmptyInternalIterator<TValue>(Status::OK());
 }
-template InternalIteratorBase<BlockHandle>* NewEmptyInternalIterator();
+template InternalIteratorBase<IndexValue>* NewEmptyInternalIterator();
 template InternalIteratorBase<Slice>* NewEmptyInternalIterator();
 template <class TValue>
@@ -203,7 +203,7 @@ InternalIteratorBase<TValue>* NewEmptyInternalIterator(Arena* arena) {
    return new (mem) EmptyInternalIterator<TValue>(Status::OK());
  }
 }
-template InternalIteratorBase<BlockHandle>* NewEmptyInternalIterator(
+template InternalIteratorBase<IndexValue>* NewEmptyInternalIterator(
    Arena* arena);
 template InternalIteratorBase<Slice>* NewEmptyInternalIterator(Arena* arena);
--- a/table/meta_blocks.cc
+++ b/table/meta_blocks.cc
@@ -229,8 +229,8 @@ Status ReadProperties(const Slice& handle_value, RandomAccessFileReader* file,
  Block properties_block(std::move(block_contents),
                         kDisableGlobalSequenceNumber);
  DataBlockIter iter;
-  properties_block.NewIterator<DataBlockIter>(BytewiseComparator(),
+  properties_block.NewDataIterator(BytewiseComparator(), BytewiseComparator(),
-                                              BytewiseComparator(), &iter);
+                                   &iter);
  auto new_table_properties = new TableProperties();
  // All pre-defined properties of type uint64_t
@@ -386,9 +386,8 @@ Status ReadTableProperties(RandomAccessFileReader* file, uint64_t file_size,
  // are to compress it.
  Block metaindex_block(std::move(metaindex_contents),
                        kDisableGlobalSequenceNumber);
-  std::unique_ptr<InternalIterator> meta_iter(
+  std::unique_ptr<InternalIterator> meta_iter(metaindex_block.NewDataIterator(
-      metaindex_block.NewIterator<DataBlockIter>(BytewiseComparator(),
+      BytewiseComparator(), BytewiseComparator()));
-                                                 BytewiseComparator()));
  // -- Read property block
  bool found_properties_block = true;
@@ -459,8 +458,8 @@ Status FindMetaBlock(RandomAccessFileReader* file, uint64_t file_size,
                        kDisableGlobalSequenceNumber);
  std::unique_ptr<InternalIterator> meta_iter;
-  meta_iter.reset(metaindex_block.NewIterator<DataBlockIter>(
+  meta_iter.reset(metaindex_block.NewDataIterator(BytewiseComparator(),
-      BytewiseComparator(), BytewiseComparator()));
+                                                  BytewiseComparator()));
  return FindMetaBlock(meta_iter.get(), meta_block_name, block_handle);
 }
@@ -504,8 +503,8 @@ Status ReadMetaBlock(RandomAccessFileReader* file,
                        kDisableGlobalSequenceNumber);
  std::unique_ptr<InternalIterator> meta_iter;
-  meta_iter.reset(metaindex_block.NewIterator<DataBlockIter>(
+  meta_iter.reset(metaindex_block.NewDataIterator(BytewiseComparator(),
-      BytewiseComparator(), BytewiseComparator()));
+                                                  BytewiseComparator()));
  BlockHandle block_handle;
  status = FindMetaBlock(meta_iter.get(), meta_block_name, &block_handle);
--- a/table/table_test.cc
+++ b/table/table_test.cc
@@ -236,7 +236,7 @@ class BlockConstructor: public Constructor {
  }
  InternalIterator* NewIterator(
      const SliceTransform* /*prefix_extractor*/) const override {
-    return block_->NewIterator<DataBlockIter>(comparator_, comparator_);
+    return block_->NewDataIterator(comparator_, comparator_);
  }
 private:
@@ -308,8 +308,9 @@ class TableConstructor: public Constructor {
 public:
  explicit TableConstructor(const Comparator* cmp,
                            bool convert_to_internal_key = false,
-                            int level = -1)
+                            int level = -1, SequenceNumber largest_seqno = 0)
      : Constructor(cmp),
        largest_seqno_(largest_seqno),
        convert_to_internal_key_(convert_to_internal_key),
        level_(level) {}
  ~TableConstructor() override { Reset(); }
@@ -326,6 +327,14 @@ class TableConstructor: public Constructor {
    std::unique_ptr<TableBuilder> builder;
    std::vector<std::unique_ptr<IntTblPropCollectorFactory>>
        int_tbl_prop_collector_factories;
    if (largest_seqno_ != 0) {
      // Pretend that it's an external file written by SstFileWriter.
      int_tbl_prop_collector_factories.emplace_back(
          new SstFileWriterPropertiesCollectorFactory(2 /* version */,
                                                      0 /* global_seqno*/));
    }
    std::string column_family_name;
    builder.reset(ioptions.table_factory->NewTableBuilder(
        TableBuilderOptions(ioptions, moptions, internal_comparator,
@@ -362,7 +371,7 @@ class TableConstructor: public Constructor {
    return ioptions.table_factory->NewTableReader(
        TableReaderOptions(ioptions, moptions.prefix_extractor.get(), soptions,
                           internal_comparator, !kSkipFilters, !kImmortal,
-                           level_),
+                           level_, largest_seqno_, nullptr),
        std::move(file_reader_), TEST_GetSink()->contents().size(),
        &table_reader_);
  }
@@ -428,6 +437,7 @@ class TableConstructor: public Constructor {
  std::unique_ptr<WritableFileWriter> file_writer_;
  std::unique_ptr<RandomAccessFileReader> file_reader_;
  std::unique_ptr<TableReader> table_reader_;
  SequenceNumber largest_seqno_;
  bool convert_to_internal_key_;
  int level_;
@@ -1484,7 +1494,7 @@ TEST_P(BlockBasedTableTest, PrefetchTest) {
 TEST_P(BlockBasedTableTest, TotalOrderSeekOnHashIndex) {
  BlockBasedTableOptions table_options = GetBlockBasedTableOptions();
-  for (int i = 0; i < 4; ++i) {
+  for (int i = 0; i <= 5; ++i) {
    Options options;
    // Make each key/value an individual block
    table_options.block_size = 64;
@@ -1515,11 +1525,16 @@ TEST_P(BlockBasedTableTest, TotalOrderSeekOnHashIndex) {
      options.prefix_extractor.reset(NewFixedPrefixTransform(4));
      break;
    case 4:
-    default:
+      // Two-level index
-      // Binary search index
      table_options.index_type = BlockBasedTableOptions::kTwoLevelIndexSearch;
      options.table_factory.reset(new BlockBasedTableFactory(table_options));
      break;
    case 5:
      // Binary search with first key
      table_options.index_type =
          BlockBasedTableOptions::kBinarySearchWithFirstKey;
      options.table_factory.reset(new BlockBasedTableFactory(table_options));
      break;
    }
    TableConstructor c(BytewiseComparator(),
@@ -1663,10 +1678,10 @@ static std::string RandomString(Random* rnd, int len) {
 }
 void AddInternalKey(TableConstructor* c, const std::string& prefix,
-                    int /*suffix_len*/ = 800) {
+                    std::string value = "v", int /*suffix_len*/ = 800) {
  static Random rnd(1023);
  InternalKey k(prefix + RandomString(&rnd, 800), 0, kTypeValue);
-  c->Add(k.Encode().ToString(), "v");
+  c->Add(k.Encode().ToString(), value);
 }
 void TableTest::IndexTest(BlockBasedTableOptions table_options) {
@@ -1845,6 +1860,286 @@ TEST_P(BlockBasedTableTest, IndexSeekOptimizationIncomplete) {
  ASSERT_TRUE(iter->status().IsIncomplete());
 }
 TEST_P(BlockBasedTableTest, BinaryIndexWithFirstKey1) {
  BlockBasedTableOptions table_options = GetBlockBasedTableOptions();
  table_options.index_type = BlockBasedTableOptions::kBinarySearchWithFirstKey;
  IndexTest(table_options);
 }
 class CustomFlushBlockPolicy : public FlushBlockPolicyFactory,
                               public FlushBlockPolicy {
 public:
  explicit CustomFlushBlockPolicy(std::vector<int> keys_per_block)
      : keys_per_block_(keys_per_block) {}
  const char* Name() const override { return "table_test"; }
  FlushBlockPolicy* NewFlushBlockPolicy(const BlockBasedTableOptions&,
                                        const BlockBuilder&) const override {
    return new CustomFlushBlockPolicy(keys_per_block_);
  }
  bool Update(const Slice&, const Slice&) override {
    if (keys_in_current_block_ >= keys_per_block_.at(current_block_idx_)) {
      ++current_block_idx_;
      keys_in_current_block_ = 1;
      return true;
    }
    ++keys_in_current_block_;
    return false;
  }
  std::vector<int> keys_per_block_;
  int current_block_idx_ = 0;
  int keys_in_current_block_ = 0;
 };
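To see how `CustomFlushBlockPolicy` turns a `keys_per_block` list into block boundaries, the counter logic of `Update()` can be replayed on its own. This is a minimal sketch, assuming the table builder calls `Update()` once per key before appending it and cuts the block whenever it returns true. `SimulateBlockCuts` is a hypothetical helper, not part of the test.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: replays the Update() counter logic for `num_keys`
// keys and returns how many keys land in each block.
inline std::vector<int> SimulateBlockCuts(
    const std::vector<int>& keys_per_block, int num_keys) {
  assert(num_keys >= 0);
  std::vector<int> blocks;
  std::size_t block_idx = 0;
  int keys_in_block = 0;
  for (int i = 0; i < num_keys; ++i) {
    // Same condition as Update(): the block is full, so cut before this key.
    if (keys_in_block >= keys_per_block.at(block_idx)) {
      blocks.push_back(keys_in_block);
      ++block_idx;
      keys_in_block = 1;  // the current key opens the new block
    } else {
      ++keys_in_block;
    }
  }
  if (keys_in_block > 0) {
    blocks.push_back(keys_in_block);  // final flush when the file is finished
  }
  return blocks;
}
```

With `{2, 1, 3, 2}` and eight keys this yields blocks of 2, 1, 3 and 2 keys, matching the "Block 0" to "Block 3" layout in the test that follows. As in the original class, `at()` throws if a cut is needed beyond the last listed block, so the list has to cover every block that gets filled.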
 TEST_P(BlockBasedTableTest, BinaryIndexWithFirstKey2) {
  for (int use_first_key = 0; use_first_key < 2; ++use_first_key) {
    SCOPED_TRACE("use_first_key = " + std::to_string(use_first_key));
    BlockBasedTableOptions table_options = GetBlockBasedTableOptions();
    table_options.index_type =
        use_first_key ? BlockBasedTableOptions::kBinarySearchWithFirstKey
                      : BlockBasedTableOptions::kBinarySearch;
    table_options.block_cache = NewLRUCache(10000);  // fits all blocks
    table_options.index_shortening =
        BlockBasedTableOptions::IndexShorteningMode::kNoShortening;
    table_options.flush_block_policy_factory =
        std::make_shared<CustomFlushBlockPolicy>(std::vector<int>{2, 1, 3, 2});
    Options options;
    options.table_factory.reset(NewBlockBasedTableFactory(table_options));
    options.statistics = CreateDBStatistics();
    Statistics* stats = options.statistics.get();
    std::unique_ptr<InternalKeyComparator> comparator(
        new InternalKeyComparator(BytewiseComparator()));
    const ImmutableCFOptions ioptions(options);
    const MutableCFOptions moptions(options);
    TableConstructor c(BytewiseComparator());
    // Block 0.
    AddInternalKey(&c, "aaaa", "v0");
    AddInternalKey(&c, "aaac", "v1");
    // Block 1.
    AddInternalKey(&c, "aaca", "v2");
    // Block 2.
    AddInternalKey(&c, "caaa", "v3");
    AddInternalKey(&c, "caac", "v4");
    AddInternalKey(&c, "caae", "v5");
    // Block 3.
    AddInternalKey(&c, "ccaa", "v6");
    AddInternalKey(&c, "ccac", "v7");
    // Write the file.
    std::vector<std::string> keys;
    stl_wrappers::KVMap kvmap;
    c.Finish(options, ioptions, moptions, table_options, *comparator, &keys,
             &kvmap);
    ASSERT_EQ(8, keys.size());
    auto reader = c.GetTableReader();
    auto props = reader->GetTableProperties();
    ASSERT_EQ(4u, props->num_data_blocks);
    std::unique_ptr<InternalIterator> iter(reader->NewIterator(
        ReadOptions(), /*prefix_extractor=*/nullptr, /*arena=*/nullptr,
        /*skip_filters=*/false, TableReaderCaller::kUncategorized));
    // Shouldn't have read data blocks before iterator is seeked.
    EXPECT_EQ(0, stats->getTickerCount(BLOCK_CACHE_DATA_MISS));
    EXPECT_EQ(0, stats->getTickerCount(BLOCK_CACHE_DATA_HIT));
    auto ikey = [](Slice user_key) {
      return InternalKey(user_key, 0, kTypeValue).Encode().ToString();
    };
    // Seek to a key between blocks. If index contains first key, we shouldn't
    // read any data blocks until value is requested.
    iter->Seek(ikey("aaba"));
    ASSERT_TRUE(iter->Valid());
    EXPECT_EQ(keys[2], iter->key().ToString());
    EXPECT_EQ(use_first_key ? 0 : 1,
              stats->getTickerCount(BLOCK_CACHE_DATA_MISS));
    EXPECT_EQ("v2", iter->value().ToString());
    EXPECT_EQ(1, stats->getTickerCount(BLOCK_CACHE_DATA_MISS));
    EXPECT_EQ(0, stats->getTickerCount(BLOCK_CACHE_DATA_HIT));
    // Seek to the middle of a block. The block should be read right away.
    iter->Seek(ikey("caab"));
    ASSERT_TRUE(iter->Valid());
    EXPECT_EQ(keys[4], iter->key().ToString());
    EXPECT_EQ(2, stats->getTickerCount(BLOCK_CACHE_DATA_MISS));
    EXPECT_EQ(0, stats->getTickerCount(BLOCK_CACHE_DATA_HIT));
    EXPECT_EQ("v4", iter->value().ToString());
    EXPECT_EQ(0, stats->getTickerCount(BLOCK_CACHE_DATA_HIT));
    // Seek to just before the same block and don't access value.
    // The iterator should keep pinning the block contents.
    iter->Seek(ikey("baaa"));
    ASSERT_TRUE(iter->Valid());
    EXPECT_EQ(keys[3], iter->key().ToString());
    EXPECT_EQ(0, stats->getTickerCount(BLOCK_CACHE_DATA_HIT));
    // Seek to the same block again to check that the block is still pinned.
    iter->Seek(ikey("caae"));
    ASSERT_TRUE(iter->Valid());
    EXPECT_EQ(keys[5], iter->key().ToString());
    EXPECT_EQ(0, stats->getTickerCount(BLOCK_CACHE_DATA_HIT));
    EXPECT_EQ("v5", iter->value().ToString());
    EXPECT_EQ(2, stats->getTickerCount(BLOCK_CACHE_DATA_MISS));
    EXPECT_EQ(0, stats->getTickerCount(BLOCK_CACHE_DATA_HIT));
    // Step forward and fall through to the next block. Don't access value.
    iter->Next();
    ASSERT_TRUE(iter->Valid());
    EXPECT_EQ(keys[6], iter->key().ToString());
    EXPECT_EQ(use_first_key ? 2 : 3,
              stats->getTickerCount(BLOCK_CACHE_DATA_MISS));
    EXPECT_EQ(0, stats->getTickerCount(BLOCK_CACHE_DATA_HIT));
    // Step forward again. Block should be read.
    iter->Next();
    ASSERT_TRUE(iter->Valid());
    EXPECT_EQ(keys[7], iter->key().ToString());
    EXPECT_EQ(3, stats->getTickerCount(BLOCK_CACHE_DATA_MISS));
    EXPECT_EQ("v7", iter->value().ToString());
    EXPECT_EQ(0, stats->getTickerCount(BLOCK_CACHE_DATA_HIT));
    // Step forward and reach the end.
    iter->Next();
    EXPECT_FALSE(iter->Valid());
    EXPECT_EQ(3, stats->getTickerCount(BLOCK_CACHE_DATA_MISS));
    EXPECT_EQ(0, stats->getTickerCount(BLOCK_CACHE_DATA_HIT));
    // Seek to a single-key block and step forward without accessing value.
    iter->Seek(ikey("aaca"));
    ASSERT_TRUE(iter->Valid());
    EXPECT_EQ(keys[2], iter->key().ToString());
    EXPECT_EQ(use_first_key ? 0 : 1,
              stats->getTickerCount(BLOCK_CACHE_DATA_HIT));
    iter->Next();
    ASSERT_TRUE(iter->Valid());
    EXPECT_EQ(keys[3], iter->key().ToString());
    EXPECT_EQ(use_first_key ? 1 : 2,
              stats->getTickerCount(BLOCK_CACHE_DATA_HIT));
    EXPECT_EQ("v3", iter->value().ToString());
    EXPECT_EQ(2, stats->getTickerCount(BLOCK_CACHE_DATA_HIT));
    EXPECT_EQ(3, stats->getTickerCount(BLOCK_CACHE_DATA_MISS));
    // Seek between blocks and step back without accessing value.
    iter->Seek(ikey("aaca"));
    ASSERT_TRUE(iter->Valid());
    EXPECT_EQ(keys[2], iter->key().ToString());
    EXPECT_EQ(use_first_key ? 2 : 3,
              stats->getTickerCount(BLOCK_CACHE_DATA_HIT));
    EXPECT_EQ(3, stats->getTickerCount(BLOCK_CACHE_DATA_MISS));
    iter->Prev();
    ASSERT_TRUE(iter->Valid());
    EXPECT_EQ(keys[1], iter->key().ToString());
    EXPECT_EQ(use_first_key ? 2 : 3,
              stats->getTickerCount(BLOCK_CACHE_DATA_HIT));
    // All blocks are in cache now, there'll be no more misses ever.
    EXPECT_EQ(4, stats->getTickerCount(BLOCK_CACHE_DATA_MISS));
    EXPECT_EQ("v1", iter->value().ToString());
    // Next into the next block again.
    iter->Next();
    ASSERT_TRUE(iter->Valid());
    EXPECT_EQ(keys[2], iter->key().ToString());
    EXPECT_EQ(use_first_key ? 2 : 4,
              stats->getTickerCount(BLOCK_CACHE_DATA_HIT));
    // Seek to first and step back without accessing value.
    iter->SeekToFirst();
    ASSERT_TRUE(iter->Valid());
    EXPECT_EQ(keys[0], iter->key().ToString());
    EXPECT_EQ(use_first_key ? 2 : 5,
              stats->getTickerCount(BLOCK_CACHE_DATA_HIT));
    iter->Prev();
    EXPECT_FALSE(iter->Valid());
    EXPECT_EQ(use_first_key ? 2 : 5,
              stats->getTickerCount(BLOCK_CACHE_DATA_HIT));
    // Do some SeekForPrev() and SeekToLast() just to cover all methods.
    iter->SeekForPrev(ikey("caad"));
    ASSERT_TRUE(iter->Valid());
    EXPECT_EQ(keys[4], iter->key().ToString());
    EXPECT_EQ(use_first_key ? 3 : 6,
              stats->getTickerCount(BLOCK_CACHE_DATA_HIT));
    EXPECT_EQ("v4", iter->value().ToString());
    EXPECT_EQ(use_first_key ? 3 : 6,
              stats->getTickerCount(BLOCK_CACHE_DATA_HIT));
    iter->SeekToLast();
    ASSERT_TRUE(iter->Valid());
    EXPECT_EQ(keys[7], iter->key().ToString());
    EXPECT_EQ(use_first_key ? 4 : 7,
              stats->getTickerCount(BLOCK_CACHE_DATA_HIT));
    EXPECT_EQ("v7", iter->value().ToString());
    EXPECT_EQ(use_first_key ? 4 : 7,
              stats->getTickerCount(BLOCK_CACHE_DATA_HIT));
    EXPECT_EQ(4, stats->getTickerCount(BLOCK_CACHE_DATA_MISS));
    c.ResetTableReader();
  }
 }
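The behaviour exercised by `BinaryIndexWithFirstKey2` can be modelled without any RocksDB types. The sketch below is a toy in-memory model, not the actual iterator: `Block`, `IndexEntry` and `LazyBlockIter` are hypothetical names, `last_key` is assumed to be the block's real last key (as with no index shortening), and every load counts as a read because the model does not pin or cache blocks.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical in-memory model; none of these names are RocksDB APIs.
using Block = std::vector<std::pair<std::string, std::string>>;  // sorted

struct IndexEntry {
  std::string first_key;  // first key of the block, kept in the index
  std::string last_key;   // assumed to be the block's real last key
};

class LazyBlockIter {
 public:
  LazyBlockIter(std::vector<Block> blocks, std::vector<IndexEntry> index)
      : blocks_(std::move(blocks)), index_(std::move(index)) {
    assert(blocks_.size() == index_.size());
  }

  void Seek(const std::string& target) {
    valid_ = false;
    for (std::size_t b = 0; b < index_.size(); ++b) {
      if (index_[b].last_key < target) continue;  // block is before target
      block_ = b;
      valid_ = true;
      if (target <= index_[b].first_key) {
        deferred_ = true;  // the index alone gives the key; skip the read
      } else {
        Load();
        // Stops in range because last_key >= target is a real key.
        while (blocks_[block_][pos_].first < target) ++pos_;
      }
      return;
    }
  }

  bool Valid() const { return valid_; }

  const std::string& key() const {
    assert(valid_);
    return deferred_ ? index_[block_].first_key : blocks_[block_][pos_].first;
  }

  const std::string& value() {
    assert(valid_);
    if (deferred_) Load();  // first access to the block contents
    return blocks_[block_][pos_].second;
  }

  int block_reads() const { return block_reads_; }

 private:
  void Load() {
    ++block_reads_;  // stands in for a block cache lookup or file read
    deferred_ = false;
    pos_ = 0;
  }

  std::vector<Block> blocks_;
  std::vector<IndexEntry> index_;
  std::size_t block_ = 0;
  std::size_t pos_ = 0;
  bool valid_ = false;
  bool deferred_ = false;
  int block_reads_ = 0;
};
```

Seeking to a key that falls between two blocks leaves `block_reads()` unchanged until `value()` is called, while seeking into the middle of a block loads it immediately. That mirrors the first two seek cases of the test above.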
 TEST_P(BlockBasedTableTest, BinaryIndexWithFirstKeyGlobalSeqno) {
  BlockBasedTableOptions table_options = GetBlockBasedTableOptions();
  table_options.index_type = BlockBasedTableOptions::kBinarySearchWithFirstKey;
  table_options.block_cache = NewLRUCache(10000);
  Options options;
  options.statistics = CreateDBStatistics();
  Statistics* stats = options.statistics.get();
  options.table_factory.reset(NewBlockBasedTableFactory(table_options));
  std::unique_ptr<InternalKeyComparator> comparator(
      new InternalKeyComparator(BytewiseComparator()));
  const ImmutableCFOptions ioptions(options);
  const MutableCFOptions moptions(options);
  TableConstructor c(BytewiseComparator(), /* convert_to_internal_key */ false,
                     /* level */ -1, /* largest_seqno */ 42);
  c.Add(InternalKey("b", 0, kTypeValue).Encode().ToString(), "x");
  c.Add(InternalKey("c", 0, kTypeValue).Encode().ToString(), "y");
  std::vector<std::string> keys;
  stl_wrappers::KVMap kvmap;
  c.Finish(options, ioptions, moptions, table_options, *comparator, &keys,
           &kvmap);
  ASSERT_EQ(2, keys.size());
  auto reader = c.GetTableReader();
  auto props = reader->GetTableProperties();
  ASSERT_EQ(1u, props->num_data_blocks);
  std::unique_ptr<InternalIterator> iter(reader->NewIterator(
      ReadOptions(), /*prefix_extractor=*/nullptr, /*arena=*/nullptr,
      /*skip_filters=*/false, TableReaderCaller::kUncategorized));
  iter->Seek(InternalKey("a", 0, kTypeValue).Encode().ToString());
  ASSERT_TRUE(iter->Valid());
  EXPECT_EQ(InternalKey("b", 42, kTypeValue).Encode().ToString(),
            iter->key().ToString());
  EXPECT_NE(keys[0], iter->key().ToString());
  // Key should have been served from index, without reading data blocks.
  EXPECT_EQ(0, stats->getTickerCount(BLOCK_CACHE_DATA_MISS));
  EXPECT_EQ("x", iter->value().ToString());
  EXPECT_EQ(1, stats->getTickerCount(BLOCK_CACHE_DATA_MISS));
  EXPECT_EQ(0, stats->getTickerCount(BLOCK_CACHE_DATA_HIT));
  EXPECT_EQ(InternalKey("b", 42, kTypeValue).Encode().ToString(),
            iter->key().ToString());
  c.ResetTableReader();
 }
 // It's very hard to figure out the index block size of a block accurately.
 // To make sure we get the index size, we just make sure as key number
 // grows, the filter block size also grows.
@@ -3606,9 +3901,8 @@ TEST_P(BlockBasedTableTest, PropertiesBlockRestartPointTest) {
    Block metaindex_block(std::move(metaindex_contents),
                          kDisableGlobalSequenceNumber);
-    std::unique_ptr<InternalIterator> meta_iter(
+    std::unique_ptr<InternalIterator> meta_iter(metaindex_block.NewDataIterator(
-        metaindex_block.NewIterator<DataBlockIter>(BytewiseComparator(),
+        BytewiseComparator(), BytewiseComparator()));
-                                                   BytewiseComparator()));
    bool found_properties_block = true;
    ASSERT_OK(SeekToPropertiesBlock(meta_iter.get(), &found_properties_block));
    ASSERT_TRUE(found_properties_block);
@@ -3688,8 +3982,7 @@ TEST_P(BlockBasedTableTest, PropertiesMetaBlockLast) {
  // verify properties block comes last
  std::unique_ptr<InternalIterator> metaindex_iter{
-      metaindex_block.NewIterator<DataBlockIter>(options.comparator,
+      metaindex_block.NewDataIterator(options.comparator, options.comparator)};
-                                                 options.comparator)};
  uint64_t max_offset = 0;
  std::string key_at_max_offset;
  for (metaindex_iter->SeekToFirst(); metaindex_iter->Valid();
--- a/table/two_level_iterator.cc
+++ b/table/two_level_iterator.cc
@@ -19,11 +19,11 @@ namespace rocksdb {
 namespace {
-class TwoLevelIndexIterator : public InternalIteratorBase<BlockHandle> {
+class TwoLevelIndexIterator : public InternalIteratorBase<IndexValue> {
 public:
  explicit TwoLevelIndexIterator(
      TwoLevelIteratorState* state,
-      InternalIteratorBase<BlockHandle>* first_level_iter);
+      InternalIteratorBase<IndexValue>* first_level_iter);
  ~TwoLevelIndexIterator() override {
    first_level_iter_.DeleteIter(false /* is_arena_mode */);
@@ -43,7 +43,7 @@ class TwoLevelIndexIterator : public InternalIteratorBase<BlockHandle> {
    assert(Valid());
    return second_level_iter_.key();
  }
-  BlockHandle value() const override {
+  IndexValue value() const override {
    assert(Valid());
    return second_level_iter_.value();
  }
@@ -69,12 +69,12 @@ class TwoLevelIndexIterator : public InternalIteratorBase<BlockHandle> {
  }
  void SkipEmptyDataBlocksForward();
  void SkipEmptyDataBlocksBackward();
-  void SetSecondLevelIterator(InternalIteratorBase<BlockHandle>* iter);
+  void SetSecondLevelIterator(InternalIteratorBase<IndexValue>* iter);
  void InitDataBlock();
  TwoLevelIteratorState* state_;
-  IteratorWrapperBase<BlockHandle> first_level_iter_;
+  IteratorWrapperBase<IndexValue> first_level_iter_;
-  IteratorWrapperBase<BlockHandle> second_level_iter_;  // May be nullptr
+  IteratorWrapperBase<IndexValue> second_level_iter_;  // May be nullptr
  Status status_;
  // If second_level_iter is non-nullptr, then "data_block_handle_" holds the
  // "index_value" passed to block_function_ to create the second_level_iter.
@@ -83,7 +83,7 @@ class TwoLevelIndexIterator : public InternalIteratorBase<BlockHandle> {
 TwoLevelIndexIterator::TwoLevelIndexIterator(
    TwoLevelIteratorState* state,
-    InternalIteratorBase<BlockHandle>* first_level_iter)
+    InternalIteratorBase<IndexValue>* first_level_iter)
    : state_(state), first_level_iter_(first_level_iter) {}
 void TwoLevelIndexIterator::Seek(const Slice& target) {
@@ -177,8 +177,8 @@ void TwoLevelIndexIterator::SkipEmptyDataBlocksBackward() {
 }
 void TwoLevelIndexIterator::SetSecondLevelIterator(
-    InternalIteratorBase<BlockHandle>* iter) {
+    InternalIteratorBase<IndexValue>* iter) {
-  InternalIteratorBase<BlockHandle>* old_iter = second_level_iter_.Set(iter);
+  InternalIteratorBase<IndexValue>* old_iter = second_level_iter_.Set(iter);
  delete old_iter;
 }
@@ -186,14 +186,14 @@ void TwoLevelIndexIterator::InitDataBlock() {
  if (!first_level_iter_.Valid()) {
    SetSecondLevelIterator(nullptr);
  } else {
-    BlockHandle handle = first_level_iter_.value();
+    BlockHandle handle = first_level_iter_.value().handle;
    if (second_level_iter_.iter() != nullptr &&
        !second_level_iter_.status().IsIncomplete() &&
        handle.offset() == data_block_handle_.offset()) {
      // second_level_iter is already constructed with this iterator, so
      // no need to change anything
    } else {
-      InternalIteratorBase<BlockHandle>* iter =
+      InternalIteratorBase<IndexValue>* iter =
          state_->NewSecondaryIterator(handle);
      data_block_handle_ = handle;
      SetSecondLevelIterator(iter);
@@ -203,9 +203,9 @@ void TwoLevelIndexIterator::InitDataBlock() {
 }  // namespace
-InternalIteratorBase<BlockHandle>* NewTwoLevelIterator(
+InternalIteratorBase<IndexValue>* NewTwoLevelIterator(
    TwoLevelIteratorState* state,
-    InternalIteratorBase<BlockHandle>* first_level_iter) {
+    InternalIteratorBase<IndexValue>* first_level_iter) {
  return new TwoLevelIndexIterator(state, first_level_iter);
 }
 }  // namespace rocksdb
--- a/table/two_level_iterator.h
+++ b/table/two_level_iterator.h
@@ -22,11 +22,10 @@ struct TwoLevelIteratorState {
  TwoLevelIteratorState() {}
  virtual ~TwoLevelIteratorState() {}
-  virtual InternalIteratorBase<BlockHandle>* NewSecondaryIterator(
+  virtual InternalIteratorBase<IndexValue>* NewSecondaryIterator(
      const BlockHandle& handle) = 0;
 };
 // Return a new two level iterator.  A two-level iterator contains an
 // index iterator whose values point to a sequence of blocks where
 // each block is itself a sequence of key,value pairs.  The returned
@@ -37,8 +36,8 @@ struct TwoLevelIteratorState {
 // Uses a supplied function to convert an index_iter value into
 // an iterator over the contents of the corresponding block.
 // Note: this function expects first_level_iter was not created using the arena
-extern InternalIteratorBase<BlockHandle>* NewTwoLevelIterator(
+extern InternalIteratorBase<IndexValue>* NewTwoLevelIterator(
    TwoLevelIteratorState* state,
-    InternalIteratorBase<BlockHandle>* first_level_iter);
+    InternalIteratorBase<IndexValue>* first_level_iter);
 }  // namespace rocksdb
--- a/test_util/testutil.cc
+++ b/test_util/testutil.cc
@@ -9,6 +9,7 @@
 #include "test_util/testutil.h"
 #include <array>
 #include <cctype>
 #include <sstream>
@@ -197,8 +198,12 @@ BlockBasedTableOptions RandomBlockBasedTableOptions(Random* rnd) {
  opt.cache_index_and_filter_blocks = rnd->Uniform(2);
  opt.pin_l0_filter_and_index_blocks_in_cache = rnd->Uniform(2);
  opt.pin_top_level_index_and_filter = rnd->Uniform(2);
-  opt.index_type = rnd->Uniform(2) ? BlockBasedTableOptions::kBinarySearch
+  using IndexType = BlockBasedTableOptions::IndexType;
-                                   : BlockBasedTableOptions::kHashSearch;
+  const std::array<IndexType, 4> index_types = {
      {IndexType::kBinarySearch, IndexType::kHashSearch,
       IndexType::kTwoLevelIndexSearch, IndexType::kBinarySearchWithFirstKey}};
  opt.index_type =
      index_types[rnd->Uniform(static_cast<int>(index_types.size()))];
  opt.hash_index_allow_collision = rnd->Uniform(2);
  opt.checksum = static_cast<ChecksumType>(rnd->Uniform(3));
  opt.block_size = rnd->Uniform(10000000);
--- a/util/coding.h
+++ b/util/coding.h
@@ -58,6 +58,7 @@ extern bool GetFixed32(Slice* input, uint32_t* value);
 extern bool GetFixed16(Slice* input, uint16_t* value);
 extern bool GetVarint32(Slice* input, uint32_t* value);
 extern bool GetVarint64(Slice* input, uint64_t* value);
 extern bool GetVarsignedint64(Slice* input, int64_t* value);
 extern bool GetLengthPrefixedSlice(Slice* input, Slice* result);
 // This function assumes data is well-formed.
 extern Slice GetLengthPrefixedSlice(const char* data);
@@ -377,6 +378,18 @@ inline bool GetVarint64(Slice* input, uint64_t* value) {
  }
 }
 inline bool GetVarsignedint64(Slice* input, int64_t* value) {
  const char* p = input->data();
  const char* limit = p + input->size();
  const char* q = GetVarsignedint64Ptr(p, limit, value);
  if (q == nullptr) {
    return false;
  } else {
    *input = Slice(q, static_cast<size_t>(limit - q));
    return true;
  }
 }
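`GetVarsignedint64` above delegates to `GetVarsignedint64Ptr`, whose body is not part of this diff. For readers unfamiliar with signed varints, here is a minimal sketch of one common scheme, zigzag over a base-128 varint. Treat it as an assumption about the general technique rather than a copy of the RocksDB code. `ZigzagEncode`, `ZigzagDecode`, `AppendVarsigned64` and `ReadVarsigned64` are hypothetical names, and the reader works on a `std::string` plus a position instead of a `Slice`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Zigzag maps small-magnitude signed values to small unsigned values:
// 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
inline uint64_t ZigzagEncode(int64_t v) {
  const uint64_t sign_mask = v < 0 ? ~uint64_t{0} : uint64_t{0};
  return (static_cast<uint64_t>(v) << 1) ^ sign_mask;
}

inline int64_t ZigzagDecode(uint64_t z) {
  return static_cast<int64_t>((z >> 1) ^ (uint64_t{0} - (z & 1)));
}

// Appends `v` as a zigzag value in 7-bit groups, low bits first.
inline void AppendVarsigned64(std::string* dst, int64_t v) {
  assert(dst != nullptr);
  uint64_t z = ZigzagEncode(v);
  while (z >= 0x80) {
    dst->push_back(static_cast<char>((z & 0x7f) | 0x80));
    z >>= 7;
  }
  dst->push_back(static_cast<char>(z));
}

// Same contract shape as the Slice-based getter: on success, advances *pos
// and returns true; on truncated or overlong input, returns false and
// leaves *pos untouched.
inline bool ReadVarsigned64(const std::string& src, std::size_t* pos,
                            int64_t* out) {
  uint64_t z = 0;
  std::size_t p = *pos;
  for (unsigned shift = 0; shift <= 63 && p < src.size(); shift += 7) {
    const uint64_t byte = static_cast<unsigned char>(src[p++]);
    z |= (byte & 0x7f) << shift;
    if ((byte & 0x80) == 0) {
      *out = ZigzagDecode(z);
      *pos = p;
      return true;
    }
  }
  return false;
}
```

The point of zigzag here is that a small negative number, such as the difference between two similar block sizes, still encodes in one or two bytes instead of ten.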
 // Provide an interface for platform independent endianness transformation
 inline uint64_t EndianTransform(uint64_t input, size_t size) {
  char* pos = reinterpret_cast<char*>(&input);