rocksdb/docs/_posts/2014-09-12-cuckoo.markdown

4.4 KiB

title layout author category redirect_from
Cuckoo Hashing Table Format post radheshyam blog [/blog/1427/new-bloom-filter-format/]

Introduction

We recently introduced a new Cuckoo Hashing based SST file format which is optimized for fast point lookups. The new format was built for applications which require very high point lookup rates (~4Mqps) in read only mode but do not use operations like range scan, merge operator, etc. But, the existing RocksDB file formats were built to support range scan and other operations and the current best point lookup in RocksDB is 1.2 Mqps given by PlainTable format. This prompted a hashing based file format, which we present here. The new table format uses a cache friendly version of Cuckoo Hashing algorithm with only 1 or 2 memory accesses per lookup.

Goals:

  • Reduce memory accesses per lookup to 1 or 2

  • Get an end to end point lookup rate of at least 4 Mqps

  • Minimize database size

Assumptions:

  • Key length and value length are fixed

  • The database is operated in read only mode

Non-goals:

  • While optimizing the performance of Get() operation was our primary goal, compaction and build times were secondary. We may work on improving them in future.

Details for setting up the table format can be found in GitHub.

Cuckoo Hashing Algorithm

In order to achieve high lookup speeds, we did multiple optimizations, including a cache friendly cuckoo hash algorithm. Cuckoo Hashing uses multiple hash functions, _h1, ..., _hn.

Original Cuckoo Hashing

To insert any new key k, we compute hashes of the key h1(k), ..., hn(k). We insert the key in the first hash location that is free. If all the locations are blocked, we try to move one of the colliding keys to a different location by trying to re-insert it.

Finding smallest set of keys to displace in order to accommodate the new key is naturally a shortest path problem in a directed graph where nodes are buckets of hash table and there is an edge from bucket A to bucket B if the element stored in bucket A can be accommodated in bucket B using one of the hash functions. The source nodes are the possible hash locations for the given key k and destination is any one of the empty buckets. We use this algorithm to handle collision.

To retrieve a key k, we compute hashes, h1(k), ..., hn(k) and the key must be present in one of these locations.

Our goal is to minimize average (and maximum) number of hash functions required and hence the number of memory accesses. In our experiments, with a hash utilization of 90%, we found that the average number of lookups is 1.8 and maximum is 3. Around 44% of keys are accommodated in first hash location and 33% in second location.

Cache Friendly Cuckoo Hashing

We noticed the following two sub-optimal properties in original Cuckoo implementation:

  • If the key is not present in first hash location, we jump to second hash location which may not be in cache. This results in many cache misses.

  • Because only 44% of keys are located in first cuckoo block, we couldn't have an optimal prefetching strategy - prefetching all hash locations for a key is wasteful. But prefetching only the first hash location helps only 44% of cases.

The solution is to insert more keys near first location. In case of collision in the first hash location - h1(k), we try to insert it in next few buckets, _h1(k)+1, _h1(k)+2, ..., h1(k)+t-1. If all of these t locations are occupied, we skip over to next hash function h2 and repeat the process. We call the set of t buckets as a Cuckoo Block. We chose t such that size of a block is not bigger than a cache line and we prefetch the first cuckoo block.

With the new algorithm, for 90% hash utilization, we found that 85% of keys are accommodated in first Cuckoo Block. Prefetching the first cuckoo block yields best results. For a database of 100 million keys with key length 8 and value length 4, the hash algorithm alone can achieve 9.6 Mqps and we are working on improving it further. End to end RocksDB performance results can be found here.