|
|
|
// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
|
|
|
|
// This source code is licensed under both the GPLv2 (found in the
|
|
|
|
// COPYING file in the root directory) and Apache 2.0 License
|
|
|
|
// (found in the LICENSE.Apache file in the root directory).
|
|
|
|
//
|
|
|
|
// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
|
|
|
|
// Use of this source code is governed by a BSD-style license that can be
|
|
|
|
// found in the LICENSE file. See the AUTHORS file for names of contributors.
|
|
|
|
|
|
|
|
#pragma once
|
|
|
|
|
|
|
|
#include <array>
|
|
|
|
#include <atomic>
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
#include <cstddef>
|
|
|
|
#include <cstdint>
|
|
|
|
#include <memory>
|
|
|
|
#include <string>
|
|
|
|
|
|
|
|
#include "cache/cache_key.h"
|
|
|
|
#include "cache/sharded_cache.h"
|
|
|
|
#include "port/lang.h"
|
|
|
|
#include "port/malloc.h"
|
|
|
|
#include "port/port.h"
|
|
|
|
#include "rocksdb/cache.h"
|
|
|
|
#include "rocksdb/secondary_cache.h"
|
|
|
|
#include "util/autovector.h"
|
|
|
|
#include "util/math.h"
|
|
|
|
|
|
|
|
namespace ROCKSDB_NAMESPACE {
|
|
|
|
|
|
|
|
namespace clock_cache {
|
|
|
|
|
|
|
|
// Forward declaration of friend class.
|
|
|
|
class ClockCacheTest;
|
|
|
|
|
|
|
|
// HyperClockCache is an alternative to LRUCache specifically tailored for
|
|
|
|
// use as BlockBasedTableOptions::block_cache
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
//
|
|
|
|
// Benefits
|
|
|
|
// --------
|
|
|
|
// * Fully lock free (no waits or spins) for efficiency under high concurrency
|
|
|
|
// * Optimized for hot path reads. For concurrency control, most Lookup() and
|
|
|
|
// essentially all Release() are a single atomic add operation.
|
|
|
|
// * Eviction on insertion is fully parallel and lock-free.
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
// * Uses a generalized + aging variant of CLOCK eviction that might outperform
|
|
|
|
// LRU in some cases. (For background, see
|
|
|
|
// https://en.wikipedia.org/wiki/Page_replacement_algorithm)
|
|
|
|
//
|
|
|
|
// Costs
|
|
|
|
// -----
|
|
|
|
// * Hash table is not resizable (for lock-free efficiency) so capacity is not
|
|
|
|
// dynamically changeable. Rely on an estimated average value (block) size for
|
|
|
|
// space+time efficiency. (See estimated_entry_charge option details.)
|
|
|
|
// * Insert usually does not (but might) overwrite a previous entry associated
|
|
|
|
// with a cache key. This is OK for RocksDB uses of Cache.
|
|
|
|
// * Only supports keys of exactly 16 bytes, which is what RocksDB uses for
|
|
|
|
// block cache (not row cache or table cache).
|
|
|
|
// * SecondaryCache is not supported.
|
|
|
|
// * Cache priorities are less aggressively enforced. Unlike LRUCache, enough
|
|
|
|
// transient LOW or BOTTOM priority items can evict HIGH priority entries that
|
|
|
|
// are not referenced recently (or often) enough.
|
|
|
|
// * If pinned entries leave little or nothing eligible for eviction,
|
|
|
|
// performance can degrade substantially, because of clock eviction eating
|
|
|
|
// CPU looking for evictable entries and because Release does not
|
|
|
|
// pro-actively delete unreferenced entries when the cache is over-full.
|
|
|
|
// Specifically, this makes this implementation more susceptible to the
|
|
|
|
// following combination:
|
|
|
|
// * num_shard_bits is high (e.g. 6)
|
|
|
|
// * capacity small (e.g. some MBs)
|
|
|
|
// * some large individual entries (e.g. non-partitioned filters)
|
|
|
|
// where individual entries occupy a large portion of their shard capacity.
|
|
|
|
// This should be mostly mitigated by the implementation picking a lower
|
|
|
|
// number of cache shards than LRUCache for a given capacity (when
|
|
|
|
// num_shard_bits is not overridden; see calls to GetDefaultCacheShardBits()).
|
|
|
|
// * With strict_capacity_limit=false, respecting the capacity limit is not as
|
|
|
|
// aggressive as LRUCache. The limit might be transiently exceeded by a very
|
|
|
|
// small number of entries even when not strictly necessary, and slower to
|
|
|
|
// recover after pinning forces limit to be substantially exceeded. (Even with
|
|
|
|
// strict_capacity_limit=true, RocksDB will nevertheless transiently allocate
|
|
|
|
// memory before discovering it is over the block cache capacity, so this
|
|
|
|
// should not be a detectable regression in respecting memory limits, except
|
|
|
|
// on exceptionally small caches.)
|
|
|
|
// * In some cases, erased or duplicated entries might not be freed
|
|
|
|
// immediately. They will eventually be freed by eviction from further Inserts.
|
|
|
|
// * Internal metadata can overflow if the number of simultaneous references
|
|
|
|
// to a cache handle reaches many millions.
|
|
|
|
//
|
|
|
|
// High-level eviction algorithm
|
|
|
|
// -----------------------------
|
|
|
|
// A score (or "countdown") is maintained for each entry, initially determined
|
|
|
|
// by priority. The score is incremented on each Lookup, up to a max of 3,
|
|
|
|
// though is easily returned to previous state if useful=false with Release.
|
|
|
|
// During CLOCK-style eviction iteration, entries with score > 0 are
|
|
|
|
// decremented if currently unreferenced and entries with score == 0 are
|
|
|
|
// evicted if currently unreferenced. Note that scoring might not be perfect
|
|
|
|
// because entries can be referenced transiently within the cache even when
|
|
|
|
// there are no outside references to the entry.
|
|
|
|
//
|
|
|
|
// Cache sharding like LRUCache is used to reduce contention on usage+eviction
|
|
|
|
// state, though here the performance improvement from more shards is small,
|
|
|
|
// and (as noted above) potentially detrimental if shard capacity is too close
|
|
|
|
// to largest entry size. Here cache sharding mostly only affects cache update
|
|
|
|
// (Insert / Erase) performance, not read performance.
|
|
|
|
//
|
|
|
|
// Read efficiency (hot path)
|
|
|
|
// --------------------------
|
|
|
|
// Mostly to minimize the cost of accessing metadata blocks with
|
|
|
|
// cache_index_and_filter_blocks=true, we focus on optimizing Lookup and
|
|
|
|
// Release. In terms of concurrency, at a minimum, these operations have
|
|
|
|
// to do reference counting (and Lookup has to compare full keys in a safe
|
|
|
|
// way). Can we fold in all the other metadata tracking *for free* with
|
|
|
|
// Lookup and Release doing a simple atomic fetch_add/fetch_sub? (Assume
|
|
|
|
// for the moment that Lookup succeeds on the first probe.)
|
|
|
|
//
|
|
|
|
// We have a clever way of encoding an entry's reference count and countdown
|
|
|
|
// clock so that Lookup and Release are each usually a single atomic addition.
|
|
|
|
// In a single metadata word we have both an "acquire" count, incremented by
|
|
|
|
// Lookup, and a "release" count, incremented by Release. If useful=false,
|
|
|
|
// Release can instead decrement the acquire count. Thus the current ref
|
|
|
|
// count is (acquires - releases), and the countdown clock is min(3, acquires).
|
|
|
|
// Note that only unreferenced entries (acquires == releases) are eligible
|
|
|
|
// for CLOCK manipulation and eviction. We tolerate use of more expensive
|
|
|
|
// compare_exchange operations for cache writes (insertions and erasures).
|
|
|
|
//
|
|
|
|
// In a cache receiving many reads and little or no writes, it is possible
|
|
|
|
// for the acquire and release counters to overflow. Assuming the *current*
|
|
|
|
// refcount never reaches to many millions, we only have to correct for
|
|
|
|
// overflow in both counters in Release, not in Lookup. The overflow check
|
|
|
|
// should be only 1-2 CPU cycles per Release because it is a predictable
|
|
|
|
// branch on a simple condition on data already in registers.
|
|
|
|
//
|
|
|
|
// Slot states
|
|
|
|
// -----------
|
|
|
|
// We encode a state indicator into the same metadata word with the
|
|
|
|
// acquire and release counters. This allows bigger state transitions to
|
|
|
|
// be atomic. States:
|
|
|
|
//
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
// * Empty - slot is not in use and unowned. All other metadata and data is
|
|
|
|
// in an undefined state.
|
|
|
|
// * Construction - slot is exclusively owned by one thread, the thread
|
|
|
|
// successfully entering this state, for populating or freeing data.
|
|
|
|
// * Shareable (group) - slot holds an entry with counted references for
|
|
|
|
// pinning and reading, including
|
|
|
|
// * Visible - slot holds an entry that can be returned by Lookup
|
|
|
|
// * Invisible - slot holds an entry that is not visible to Lookup
|
|
|
|
// (erased by user) but can be read by existing references, and ref count
|
|
|
|
// changed by Ref and Release.
|
|
|
|
//
|
|
|
|
// A special case is "standalone" entries, which are heap-allocated handles
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
// not in the table. They are always Invisible and freed on zero refs.
|
|
|
|
//
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
// State transitions:
|
|
|
|
// Empty -> Construction (in Insert): The encoding of state enables Insert to
|
|
|
|
// perform an optimistic atomic bitwise-or to take ownership if a slot is
|
|
|
|
// empty, or otherwise make no state change.
|
|
|
|
//
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
// Construction -> Visible (in Insert): This can be a simple assignment to the
|
|
|
|
// metadata word because the current thread has exclusive ownership and other
|
|
|
|
// metadata is meaningless.
|
|
|
|
//
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
// Visible -> Invisible (in Erase): This can be a bitwise-and while holding
|
|
|
|
// a shared reference, which is safe because the change is idempotent (in case
|
|
|
|
// of parallel Erase). By the way, we never go Invisible->Visible.
|
|
|
|
//
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
// Shareable -> Construction (in Evict part of Insert, in Erase, and in
|
|
|
|
// Release if Invisible): This is for starting to freeing/deleting an
|
|
|
|
// unreferenced entry. We have to use compare_exchange to ensure we only make
|
|
|
|
// this transition when there are zero refs.
|
|
|
|
//
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
// Construction -> Empty (in same places): This is for completing free/delete
|
|
|
|
// of an entry. A "release" atomic store suffices, as we have exclusive
|
|
|
|
// ownership of the slot but have to ensure none of the data member reads are
|
|
|
|
// re-ordered after committing the state transition.
|
|
|
|
//
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
// Insert
|
|
|
|
// ------
|
|
|
|
// If Insert were to guarantee replacing an existing entry for a key, there
|
|
|
|
// would be complications for concurrency and efficiency. First, consider how
|
|
|
|
// many probes to get to an entry. To ensure Lookup never waits and
|
|
|
|
// availability of a key is uninterrupted, we would need to use a different
|
|
|
|
// slot for a new entry for the same key. This means it is most likely in a
|
|
|
|
// later probing position than the old version, which should soon be removed.
|
|
|
|
// (Also, an entry is too big to replace atomically, even if no current refs.)
|
|
|
|
//
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
// However, overwrite capability is not really needed by RocksDB. Also, we
|
|
|
|
// know from our "redundant" stats that overwrites are very rare for the block
|
|
|
|
// cache, so we should not spend much to make them effective.
|
|
|
|
//
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
// So instead we Insert as soon as we find an empty slot in the probing
|
|
|
|
// sequence without seeing an existing (visible) entry for the same key. This
|
|
|
|
// way we only insert if we can improve the probing performance, and we don't
|
|
|
|
// need to probe beyond our insert position, assuming we are willing to let
|
|
|
|
// the previous entry for the same key die of old age (eventual eviction from
|
|
|
|
// not being used). We can reach a similar state with concurrent insertions,
|
|
|
|
// where one will pass over the other while it is "under construction."
|
|
|
|
// This temporary duplication is acceptable for RocksDB block cache because
|
|
|
|
// we know redundant insertion is rare.
|
|
|
|
//
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
// Another problem to solve is what to return to the caller when we find an
|
|
|
|
// existing entry whose probing position we cannot improve on, or when the
|
|
|
|
// table occupancy limit has been reached. If strict_capacity_limit=false,
|
|
|
|
// we must never fail Insert, and if a Handle* is provided, we have to return
|
|
|
|
// a usable Cache handle on success. The solution to this (typically rare)
|
|
|
|
// problem is "standalone" handles, which are usable by the caller but not
|
|
|
|
// actually available for Lookup in the Cache. Standalone handles are allocated
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
// independently on the heap and specially marked so that they are freed on
|
|
|
|
// the heap when their last reference is released.
|
|
|
|
//
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
// Usage on capacity
|
|
|
|
// -----------------
|
|
|
|
// Insert takes different approaches to usage tracking depending on
|
|
|
|
// strict_capacity_limit setting. If true, we enforce a kind of strong
|
|
|
|
// consistency where compare-exchange is used to ensure the usage number never
|
|
|
|
// exceeds its limit, and provide threads with an authoritative signal on how
|
|
|
|
// much "usage" they have taken ownership of. With strict_capacity_limit=false,
|
|
|
|
// we use a kind of "eventual consistency" where all threads Inserting to the
|
|
|
|
// same cache shard might race on reserving the same space, but the
|
|
|
|
// over-commitment will be worked out in later insertions. It is kind of a
|
|
|
|
// dance because we don't want threads racing each other too much on paying
|
|
|
|
// down the over-commitment (with eviction) either.
|
|
|
|
//
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
// Eviction
|
|
|
|
// --------
|
|
|
|
// A key part of Insert is evicting some entries currently unreferenced to
|
|
|
|
// make room for new entries. The high-level eviction algorithm is described
|
|
|
|
// above, but the details are also interesting. A key part is parallelizing
|
|
|
|
// eviction with a single CLOCK pointer. This works by each thread working on
|
|
|
|
// eviction pre-emptively incrementing the CLOCK pointer, and then CLOCK-
|
|
|
|
// updating or evicting the incremented-over slot(s). To reduce contention at
|
|
|
|
// the cost of possibly evicting too much, each thread increments the clock
|
|
|
|
// pointer by 4, so commits to updating at least 4 slots per batch. As
|
|
|
|
// described above, a CLOCK update will decrement the "countdown" of
|
|
|
|
// unreferenced entries, or evict unreferenced entries with zero countdown.
|
|
|
|
// Referenced entries are not updated, because we (presumably) don't want
|
|
|
|
// long-referenced entries to age while referenced. Note however that we
|
|
|
|
// cannot distinguish transiently referenced entries from cache user
|
|
|
|
// references, so some CLOCK updates might be somewhat arbitrarily skipped.
|
|
|
|
// This is OK as long as it is rare enough that eviction order is still
|
|
|
|
// pretty good.
|
|
|
|
//
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
// There is no synchronization on the completion of the CLOCK updates, so it
|
|
|
|
// is theoretically possible for another thread to cycle back around and have
|
|
|
|
// two threads racing on CLOCK updates to the same slot. Thus, we cannot rely
|
|
|
|
// on any implied exclusivity to make the updates or eviction more efficient.
|
|
|
|
// These updates use an opportunistic compare-exchange (no loop), where a
|
|
|
|
// racing thread might cause the update to be skipped without retry, but in
|
|
|
|
// such case the update is likely not needed because the most likely update
|
|
|
|
// to an entry is that it has become referenced. (TODO: test efficiency of
|
|
|
|
// avoiding compare-exchange loop)
|
|
|
|
//
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
// Release
|
|
|
|
// -------
|
|
|
|
// In the common case, Release is a simple atomic increment of the release
|
|
|
|
// counter. There is a simple overflow check that only does another atomic
|
|
|
|
// update in extremely rare cases, so costs almost nothing.
|
|
|
|
//
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
// If the Release specifies "not useful", we can instead decrement the
|
|
|
|
// acquire counter, which returns to the same CLOCK state as before Lookup
|
|
|
|
// or Ref.
|
|
|
|
//
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
// Adding a check for over-full cache on every release to zero-refs would
|
|
|
|
// likely be somewhat expensive, increasing read contention on cache shard
|
|
|
|
// metadata. Instead we are less aggressive about deleting entries right
|
|
|
|
// away in those cases.
|
|
|
|
//
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
// However Release tries to immediately delete entries reaching zero refs
|
|
|
|
// if (a) erase_if_last_ref is set by the caller, or (b) the entry is already
|
|
|
|
// marked invisible. Both of these are checks on values already in CPU
|
|
|
|
// registers so do not increase cross-CPU contention when not applicable.
|
|
|
|
// When applicable, they use a compare-exchange loop to take exclusive
|
|
|
|
// ownership of the slot for freeing the entry. These are rare cases
|
|
|
|
// that should not usually affect performance.
|
|
|
|
//
|
|
|
|
// Erase
|
|
|
|
// -----
|
|
|
|
// Searches for an entry like Lookup but moves it to Invisible state if found.
|
|
|
|
// This state transition is with bit operations so is idempotent and safely
|
|
|
|
// done while only holding a shared "read" reference. Like Release, it makes
|
|
|
|
// a best effort to immediately release an Invisible entry that reaches zero
|
|
|
|
// refs, but there are some corner cases where it will only be freed by the
|
|
|
|
// clock eviction process.
|
|
|
|
|
|
|
|
// ----------------------------------------------------------------------- //
|
|
|
|
|
|
|
|
// The load factor p is a real number in (0, 1) such that at all
|
|
|
|
// times at most a fraction p of all slots, without counting tombstones,
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
// are occupied by elements. This means that the probability that a random
|
|
|
|
// probe hits an occupied slot is at most p, and thus at most 1/p probes
|
|
|
|
// are required on average. For example, p = 70% implies that between 1 and 2
|
|
|
|
// probes are needed on average (bear in mind that this reasoning doesn't
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
// consider the effects of clustering over time, which should be negligible
|
|
|
|
// with double hashing).
|
|
|
|
// Because the size of the hash table is always rounded up to the next
|
|
|
|
// power of 2, p is really an upper bound on the actual load factor---the
|
|
|
|
// actual load factor is anywhere between p/2 and p. This is a bit wasteful,
|
|
|
|
// but bear in mind that slots only hold metadata, not actual values.
|
|
|
|
// Since space cost is dominated by the values (the LSM blocks),
|
|
|
|
// overprovisioning the table with metadata only increases the total cache space
|
|
|
|
// usage by a tiny fraction.
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
constexpr double kLoadFactor = 0.7;
|
|
|
|
|
|
|
|
// The user can exceed kLoadFactor if the sizes of the inserted values don't
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
// match estimated_value_size, or in some rare cases with
|
|
|
|
// strict_capacity_limit == false. To avoid degenerate performance, we set a
|
|
|
|
// strict upper bound on the load factor.
|
|
|
|
constexpr double kStrictLoadFactor = 0.84;
|
|
|
|
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
struct ClockHandleBasicData {
|
Major Cache refactoring, CPU efficiency improvement (#10975)
Summary:
This is several refactorings bundled into one to avoid having to incrementally re-modify uses of Cache several times. Overall, there are breaking changes to Cache class, and it becomes more of low-level interface for implementing caches, especially block cache. New internal APIs make using Cache cleaner than before, and more insulated from block cache evolution. Hopefully, this is the last really big block cache refactoring, because of rather effectively decoupling the implementations from the uses. This change also removes the EXPERIMENTAL designation on the SecondaryCache support in Cache. It seems reasonably mature at this point but still subject to change/evolution (as I warn in the API docs for Cache).
The high-level motivation for this refactoring is to minimize code duplication / compounding complexity in adding SecondaryCache support to HyperClockCache (in a later PR). Other benefits listed below.
* static_cast lines of code +29 -35 (net removed 6)
* reinterpret_cast lines of code +6 -32 (net removed 26)
## cache.h and secondary_cache.h
* Always use CacheItemHelper with entries instead of just a Deleter. There are several motivations / justifications:
* Simpler for implementations to deal with just one Insert and one Lookup.
* Simpler and more efficient implementation because we don't have to track which entries are using helpers and which are using deleters
* Gets rid of hack to classify cache entries by their deleter. Instead, the CacheItemHelper includes a CacheEntryRole. This simplifies a lot of code (cache_entry_roles.h almost eliminated). Fixes https://github.com/facebook/rocksdb/issues/9428.
* Makes it trivial to adjust SecondaryCache behavior based on kind of block (e.g. don't re-compress filter blocks).
* It is arguably less convenient for many direct users of Cache, but direct users of Cache are now rare with introduction of typed_cache.h (below).
* I considered and rejected an alternative approach in which we reduce customizability by assuming each secondary cache compatible value starts with a Slice referencing the uncompressed block contents (already true or mostly true), but we apparently intend to stack secondary caches. Saving an entry from a compressed secondary to a lower tier requires custom handling offered by SaveToCallback, etc.
* Make CreateCallback part of the helper and introduce CreateContext to work with it (alternative to https://github.com/facebook/rocksdb/issues/10562). This cleans up the interface while still allowing context to be provided for loading/parsing values into primary cache. This model works for async lookup in BlockBasedTable reader (reader owns a CreateContext) under the assumption that it always waits on secondary cache operations to finish. (Otherwise, the CreateContext could be destroyed while async operation depending on it continues.) This likely contributes most to the observed performance improvement because it saves an std::function backed by a heap allocation.
* Use char* for serialized data, e.g. in SaveToCallback, where void* was confusingly used. (We use `char*` for serialized byte data all over RocksDB, with many advantages over `void*`. `memcpy` etc. are legacy APIs that should not be mimicked.)
* Add a type alias Cache::ObjectPtr = void*, so that we can better indicate the intent of the void* when it is to be the object associated with a Cache entry. Related: started (but did not complete) a refactoring to move away from "value" of a cache entry toward "object" or "obj". (It is confusing to call Cache a key-value store (like DB) when it is really storing arbitrary in-memory objects, not byte strings.)
* Remove unnecessary key param from DeleterFn. This is good for efficiency in HyperClockCache, which does not directly store the cache key in memory. (Alternative to https://github.com/facebook/rocksdb/issues/10774)
* Add allocator to Cache DeleterFn. This is a kind of future-proofing change in case we get more serious about using the Cache allocator for memory tracked by the Cache. Right now, only the uncompressed block contents are allocated using the allocator, and a pointer to that allocator is saved as part of the cached object so that the deleter can use it. (See CacheAllocationPtr.) If in the future we are able to "flatten out" our Cache objects some more, it would be good not to have to track the allocator as part of each object.
* Removes legacy `ApplyToAllCacheEntries` and changes `ApplyToAllEntries` signature for Deleter->CacheItemHelper change.
## typed_cache.h
Adds various "typed" interfaces to the Cache as internal APIs, so that most uses of Cache can use simple type safe code without casting and without explicit deleters, etc. Almost all of the non-test, non-glue code uses of Cache have been migrated. (Follow-up work: CompressedSecondaryCache deserves deeper attention to migrate.) This change expands RocksDB's internal usage of metaprogramming and SFINAE (https://en.cppreference.com/w/cpp/language/sfinae).
The existing usages of Cache are divided up at a high level into these new interfaces. See updated existing uses of Cache for examples of how these are used.
* PlaceholderCacheInterface - Used for making cache reservations, with entries that have a charge but no value.
* BasicTypedCacheInterface<TValue> - Used for primary cache storage of objects of type TValue, which can be cleaned up with std::default_delete<TValue>. The role is provided by TValue::kCacheEntryRole or given in an optional template parameter.
* FullTypedCacheInterface<TValue, TCreateContext> - Used for secondary cache compatible storage of objects of type TValue. In addition to BasicTypedCacheInterface constraints, we require TValue::ContentSlice() to return persistable data. This simplifies usage for the normal case of simple secondary cache compatibility (can give you a Slice to the data already in memory). In addition to TCreateContext performing the role of Cache::CreateContext, it is also expected to provide a factory function for creating TValue.
* For each of these, there's a "Shared" version (e.g. FullTypedSharedCacheInterface) that holds a shared_ptr to the Cache, rather than assuming external ownership by holding only a raw `Cache*`.
These interfaces introduce specific handle types for each interface instantiation, so that it's easy to see what kind of object is controlled by a handle. (Ultimately, this might not be worth the extra complexity, but it seems OK so far.)
Note: I attempted to make the cache 'charge' automatically inferred from the cache object type, such as by expecting an ApproximateMemoryUsage() function, but this is not so clean because there are cases where we need to compute the charge ahead of time and don't want to re-compute it.
## block_cache.h
This header is essentially the replacement for the old block_like_traits.h. It includes various things to support block cache access with typed_cache.h for block-based table.
## block_based_table_reader.cc
Before this change, accessing the block cache here was an awkward mix of static polymorphism (template TBlocklike) and switch-case on a dynamic BlockType value. This change mostly unifies on static polymorphism, relying on minor hacks in block_cache.h to distinguish variants of Block. We still check BlockType in some places (especially for stats, which could be improved in follow-up work) but at least the BlockType is a static constant from the template parameter. (No more awkward partial redundancy between static and dynamic info.) This likely contributes to the overall performance improvement, but hasn't been tested in isolation.
The other key source of simplification here is a more unified system of creating block cache objects: for directly populating from primary cache and for promotion from secondary cache. Both use BlockCreateContext, for context and for factory functions.
## block_based_table_builder.cc, cache_dump_load_impl.cc
Before this change, warming caches was super ugly code. Both of these source files had switch statements to basically transition from the dynamic BlockType world to the static TBlocklike world. None of that mess is needed anymore as there's a new, untyped WarmInCache function that handles all the details just as promotion from SecondaryCache would. (Fixes `TODO akanksha: Dedup below code` in block_based_table_builder.cc.)
## Everything else
Mostly just updating Cache users to use new typed APIs when reasonably possible, or changed Cache APIs when not.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10975
Test Plan:
tests updated
Performance test setup similar to https://github.com/facebook/rocksdb/issues/10626 (by cache size, LRUCache when not "hyper" for HyperClockCache):
34MB 1thread base.hyper -> kops/s: 0.745 io_bytes/op: 2.52504e+06 miss_ratio: 0.140906 max_rss_mb: 76.4844
34MB 1thread new.hyper -> kops/s: 0.751 io_bytes/op: 2.5123e+06 miss_ratio: 0.140161 max_rss_mb: 79.3594
34MB 1thread base -> kops/s: 0.254 io_bytes/op: 1.36073e+07 miss_ratio: 0.918818 max_rss_mb: 45.9297
34MB 1thread new -> kops/s: 0.252 io_bytes/op: 1.36157e+07 miss_ratio: 0.918999 max_rss_mb: 44.1523
34MB 32thread base.hyper -> kops/s: 7.272 io_bytes/op: 2.88323e+06 miss_ratio: 0.162532 max_rss_mb: 516.602
34MB 32thread new.hyper -> kops/s: 7.214 io_bytes/op: 2.99046e+06 miss_ratio: 0.168818 max_rss_mb: 518.293
34MB 32thread base -> kops/s: 3.528 io_bytes/op: 1.35722e+07 miss_ratio: 0.914691 max_rss_mb: 264.926
34MB 32thread new -> kops/s: 3.604 io_bytes/op: 1.35744e+07 miss_ratio: 0.915054 max_rss_mb: 264.488
233MB 1thread base.hyper -> kops/s: 53.909 io_bytes/op: 2552.35 miss_ratio: 0.0440566 max_rss_mb: 241.984
233MB 1thread new.hyper -> kops/s: 62.792 io_bytes/op: 2549.79 miss_ratio: 0.044043 max_rss_mb: 241.922
233MB 1thread base -> kops/s: 1.197 io_bytes/op: 2.75173e+06 miss_ratio: 0.103093 max_rss_mb: 241.559
233MB 1thread new -> kops/s: 1.199 io_bytes/op: 2.73723e+06 miss_ratio: 0.10305 max_rss_mb: 240.93
233MB 32thread base.hyper -> kops/s: 1298.69 io_bytes/op: 2539.12 miss_ratio: 0.0440307 max_rss_mb: 371.418
233MB 32thread new.hyper -> kops/s: 1421.35 io_bytes/op: 2538.75 miss_ratio: 0.0440307 max_rss_mb: 347.273
233MB 32thread base -> kops/s: 9.693 io_bytes/op: 2.77304e+06 miss_ratio: 0.103745 max_rss_mb: 569.691
233MB 32thread new -> kops/s: 9.75 io_bytes/op: 2.77559e+06 miss_ratio: 0.103798 max_rss_mb: 552.82
1597MB 1thread base.hyper -> kops/s: 58.607 io_bytes/op: 1449.14 miss_ratio: 0.0249324 max_rss_mb: 1583.55
1597MB 1thread new.hyper -> kops/s: 69.6 io_bytes/op: 1434.89 miss_ratio: 0.0247167 max_rss_mb: 1584.02
1597MB 1thread base -> kops/s: 60.478 io_bytes/op: 1421.28 miss_ratio: 0.024452 max_rss_mb: 1589.45
1597MB 1thread new -> kops/s: 63.973 io_bytes/op: 1416.07 miss_ratio: 0.0243766 max_rss_mb: 1589.24
1597MB 32thread base.hyper -> kops/s: 1436.2 io_bytes/op: 1357.93 miss_ratio: 0.0235353 max_rss_mb: 1692.92
1597MB 32thread new.hyper -> kops/s: 1605.03 io_bytes/op: 1358.04 miss_ratio: 0.023538 max_rss_mb: 1702.78
1597MB 32thread base -> kops/s: 280.059 io_bytes/op: 1350.34 miss_ratio: 0.023289 max_rss_mb: 1675.36
1597MB 32thread new -> kops/s: 283.125 io_bytes/op: 1351.05 miss_ratio: 0.0232797 max_rss_mb: 1703.83
Almost uniformly improving over base revision, especially for hot paths with HyperClockCache, up to 12% higher throughput seen (1597MB, 32thread, hyper). The improvement for that is likely coming from much simplified code for providing context for secondary cache promotion (CreateCallback/CreateContext), and possibly from less branching in block_based_table_reader. And likely a small improvement from not reconstituting key for DeleterFn.
Reviewed By: anand1976
Differential Revision: D42417818
Pulled By: pdillinger
fbshipit-source-id: f86bfdd584dce27c028b151ba56818ad14f7a432
2 years ago
|
|
|
Cache::ObjectPtr value = nullptr;
|
|
|
|
const Cache::CacheItemHelper* helper = nullptr;
|
|
|
|
// A lossless, reversible hash of the fixed-size (16 byte) cache key. This
|
|
|
|
// eliminates the need to store a hash separately.
|
|
|
|
UniqueId64x2 hashed_key = kNullUniqueId64x2;
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
size_t total_charge = 0;
|
|
|
|
|
|
|
|
inline size_t GetTotalCharge() const { return total_charge; }
|
|
|
|
|
|
|
|
// Calls deleter (if non-null) on cache key and value
|
Major Cache refactoring, CPU efficiency improvement (#10975)
Summary:
This is several refactorings bundled into one to avoid having to incrementally re-modify uses of Cache several times. Overall, there are breaking changes to Cache class, and it becomes more of low-level interface for implementing caches, especially block cache. New internal APIs make using Cache cleaner than before, and more insulated from block cache evolution. Hopefully, this is the last really big block cache refactoring, because of rather effectively decoupling the implementations from the uses. This change also removes the EXPERIMENTAL designation on the SecondaryCache support in Cache. It seems reasonably mature at this point but still subject to change/evolution (as I warn in the API docs for Cache).
The high-level motivation for this refactoring is to minimize code duplication / compounding complexity in adding SecondaryCache support to HyperClockCache (in a later PR). Other benefits listed below.
* static_cast lines of code +29 -35 (net removed 6)
* reinterpret_cast lines of code +6 -32 (net removed 26)
## cache.h and secondary_cache.h
* Always use CacheItemHelper with entries instead of just a Deleter. There are several motivations / justifications:
* Simpler for implementations to deal with just one Insert and one Lookup.
* Simpler and more efficient implementation because we don't have to track which entries are using helpers and which are using deleters
* Gets rid of hack to classify cache entries by their deleter. Instead, the CacheItemHelper includes a CacheEntryRole. This simplifies a lot of code (cache_entry_roles.h almost eliminated). Fixes https://github.com/facebook/rocksdb/issues/9428.
* Makes it trivial to adjust SecondaryCache behavior based on kind of block (e.g. don't re-compress filter blocks).
* It is arguably less convenient for many direct users of Cache, but direct users of Cache are now rare with introduction of typed_cache.h (below).
* I considered and rejected an alternative approach in which we reduce customizability by assuming each secondary cache compatible value starts with a Slice referencing the uncompressed block contents (already true or mostly true), but we apparently intend to stack secondary caches. Saving an entry from a compressed secondary to a lower tier requires custom handling offered by SaveToCallback, etc.
* Make CreateCallback part of the helper and introduce CreateContext to work with it (alternative to https://github.com/facebook/rocksdb/issues/10562). This cleans up the interface while still allowing context to be provided for loading/parsing values into primary cache. This model works for async lookup in BlockBasedTable reader (reader owns a CreateContext) under the assumption that it always waits on secondary cache operations to finish. (Otherwise, the CreateContext could be destroyed while async operation depending on it continues.) This likely contributes most to the observed performance improvement because it saves an std::function backed by a heap allocation.
* Use char* for serialized data, e.g. in SaveToCallback, where void* was confusingly used. (We use `char*` for serialized byte data all over RocksDB, with many advantages over `void*`. `memcpy` etc. are legacy APIs that should not be mimicked.)
* Add a type alias Cache::ObjectPtr = void*, so that we can better indicate the intent of the void* when it is to be the object associated with a Cache entry. Related: started (but did not complete) a refactoring to move away from "value" of a cache entry toward "object" or "obj". (It is confusing to call Cache a key-value store (like DB) when it is really storing arbitrary in-memory objects, not byte strings.)
* Remove unnecessary key param from DeleterFn. This is good for efficiency in HyperClockCache, which does not directly store the cache key in memory. (Alternative to https://github.com/facebook/rocksdb/issues/10774)
* Add allocator to Cache DeleterFn. This is a kind of future-proofing change in case we get more serious about using the Cache allocator for memory tracked by the Cache. Right now, only the uncompressed block contents are allocated using the allocator, and a pointer to that allocator is saved as part of the cached object so that the deleter can use it. (See CacheAllocationPtr.) If in the future we are able to "flatten out" our Cache objects some more, it would be good not to have to track the allocator as part of each object.
* Removes legacy `ApplyToAllCacheEntries` and changes `ApplyToAllEntries` signature for Deleter->CacheItemHelper change.
## typed_cache.h
Adds various "typed" interfaces to the Cache as internal APIs, so that most uses of Cache can use simple type safe code without casting and without explicit deleters, etc. Almost all of the non-test, non-glue code uses of Cache have been migrated. (Follow-up work: CompressedSecondaryCache deserves deeper attention to migrate.) This change expands RocksDB's internal usage of metaprogramming and SFINAE (https://en.cppreference.com/w/cpp/language/sfinae).
The existing usages of Cache are divided up at a high level into these new interfaces. See updated existing uses of Cache for examples of how these are used.
* PlaceholderCacheInterface - Used for making cache reservations, with entries that have a charge but no value.
* BasicTypedCacheInterface<TValue> - Used for primary cache storage of objects of type TValue, which can be cleaned up with std::default_delete<TValue>. The role is provided by TValue::kCacheEntryRole or given in an optional template parameter.
* FullTypedCacheInterface<TValue, TCreateContext> - Used for secondary cache compatible storage of objects of type TValue. In addition to BasicTypedCacheInterface constraints, we require TValue::ContentSlice() to return persistable data. This simplifies usage for the normal case of simple secondary cache compatibility (can give you a Slice to the data already in memory). In addition to TCreateContext performing the role of Cache::CreateContext, it is also expected to provide a factory function for creating TValue.
* For each of these, there's a "Shared" version (e.g. FullTypedSharedCacheInterface) that holds a shared_ptr to the Cache, rather than assuming external ownership by holding only a raw `Cache*`.
These interfaces introduce specific handle types for each interface instantiation, so that it's easy to see what kind of object is controlled by a handle. (Ultimately, this might not be worth the extra complexity, but it seems OK so far.)
Note: I attempted to make the cache 'charge' automatically inferred from the cache object type, such as by expecting an ApproximateMemoryUsage() function, but this is not so clean because there are cases where we need to compute the charge ahead of time and don't want to re-compute it.
## block_cache.h
This header is essentially the replacement for the old block_like_traits.h. It includes various things to support block cache access with typed_cache.h for block-based table.
## block_based_table_reader.cc
Before this change, accessing the block cache here was an awkward mix of static polymorphism (template TBlocklike) and switch-case on a dynamic BlockType value. This change mostly unifies on static polymorphism, relying on minor hacks in block_cache.h to distinguish variants of Block. We still check BlockType in some places (especially for stats, which could be improved in follow-up work) but at least the BlockType is a static constant from the template parameter. (No more awkward partial redundancy between static and dynamic info.) This likely contributes to the overall performance improvement, but hasn't been tested in isolation.
The other key source of simplification here is a more unified system of creating block cache objects: for directly populating from primary cache and for promotion from secondary cache. Both use BlockCreateContext, for context and for factory functions.
## block_based_table_builder.cc, cache_dump_load_impl.cc
Before this change, warming caches was super ugly code. Both of these source files had switch statements to basically transition from the dynamic BlockType world to the static TBlocklike world. None of that mess is needed anymore as there's a new, untyped WarmInCache function that handles all the details just as promotion from SecondaryCache would. (Fixes `TODO akanksha: Dedup below code` in block_based_table_builder.cc.)
## Everything else
Mostly just updating Cache users to use new typed APIs when reasonably possible, or changed Cache APIs when not.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10975
Test Plan:
tests updated
Performance test setup similar to https://github.com/facebook/rocksdb/issues/10626 (by cache size, LRUCache when not "hyper" for HyperClockCache):
34MB 1thread base.hyper -> kops/s: 0.745 io_bytes/op: 2.52504e+06 miss_ratio: 0.140906 max_rss_mb: 76.4844
34MB 1thread new.hyper -> kops/s: 0.751 io_bytes/op: 2.5123e+06 miss_ratio: 0.140161 max_rss_mb: 79.3594
34MB 1thread base -> kops/s: 0.254 io_bytes/op: 1.36073e+07 miss_ratio: 0.918818 max_rss_mb: 45.9297
34MB 1thread new -> kops/s: 0.252 io_bytes/op: 1.36157e+07 miss_ratio: 0.918999 max_rss_mb: 44.1523
34MB 32thread base.hyper -> kops/s: 7.272 io_bytes/op: 2.88323e+06 miss_ratio: 0.162532 max_rss_mb: 516.602
34MB 32thread new.hyper -> kops/s: 7.214 io_bytes/op: 2.99046e+06 miss_ratio: 0.168818 max_rss_mb: 518.293
34MB 32thread base -> kops/s: 3.528 io_bytes/op: 1.35722e+07 miss_ratio: 0.914691 max_rss_mb: 264.926
34MB 32thread new -> kops/s: 3.604 io_bytes/op: 1.35744e+07 miss_ratio: 0.915054 max_rss_mb: 264.488
233MB 1thread base.hyper -> kops/s: 53.909 io_bytes/op: 2552.35 miss_ratio: 0.0440566 max_rss_mb: 241.984
233MB 1thread new.hyper -> kops/s: 62.792 io_bytes/op: 2549.79 miss_ratio: 0.044043 max_rss_mb: 241.922
233MB 1thread base -> kops/s: 1.197 io_bytes/op: 2.75173e+06 miss_ratio: 0.103093 max_rss_mb: 241.559
233MB 1thread new -> kops/s: 1.199 io_bytes/op: 2.73723e+06 miss_ratio: 0.10305 max_rss_mb: 240.93
233MB 32thread base.hyper -> kops/s: 1298.69 io_bytes/op: 2539.12 miss_ratio: 0.0440307 max_rss_mb: 371.418
233MB 32thread new.hyper -> kops/s: 1421.35 io_bytes/op: 2538.75 miss_ratio: 0.0440307 max_rss_mb: 347.273
233MB 32thread base -> kops/s: 9.693 io_bytes/op: 2.77304e+06 miss_ratio: 0.103745 max_rss_mb: 569.691
233MB 32thread new -> kops/s: 9.75 io_bytes/op: 2.77559e+06 miss_ratio: 0.103798 max_rss_mb: 552.82
1597MB 1thread base.hyper -> kops/s: 58.607 io_bytes/op: 1449.14 miss_ratio: 0.0249324 max_rss_mb: 1583.55
1597MB 1thread new.hyper -> kops/s: 69.6 io_bytes/op: 1434.89 miss_ratio: 0.0247167 max_rss_mb: 1584.02
1597MB 1thread base -> kops/s: 60.478 io_bytes/op: 1421.28 miss_ratio: 0.024452 max_rss_mb: 1589.45
1597MB 1thread new -> kops/s: 63.973 io_bytes/op: 1416.07 miss_ratio: 0.0243766 max_rss_mb: 1589.24
1597MB 32thread base.hyper -> kops/s: 1436.2 io_bytes/op: 1357.93 miss_ratio: 0.0235353 max_rss_mb: 1692.92
1597MB 32thread new.hyper -> kops/s: 1605.03 io_bytes/op: 1358.04 miss_ratio: 0.023538 max_rss_mb: 1702.78
1597MB 32thread base -> kops/s: 280.059 io_bytes/op: 1350.34 miss_ratio: 0.023289 max_rss_mb: 1675.36
1597MB 32thread new -> kops/s: 283.125 io_bytes/op: 1351.05 miss_ratio: 0.0232797 max_rss_mb: 1703.83
Almost uniformly improving over base revision, especially for hot paths with HyperClockCache, up to 12% higher throughput seen (1597MB, 32thread, hyper). The improvement for that is likely coming from much simplified code for providing context for secondary cache promotion (CreateCallback/CreateContext), and possibly from less branching in block_based_table_reader. And likely a small improvement from not reconstituting key for DeleterFn.
Reviewed By: anand1976
Differential Revision: D42417818
Pulled By: pdillinger
fbshipit-source-id: f86bfdd584dce27c028b151ba56818ad14f7a432
2 years ago
|
|
|
void FreeData(MemoryAllocator* allocator) const;
|
|
|
|
|
|
|
|
// Required by concept HandleImpl
|
|
|
|
const UniqueId64x2& GetHash() const { return hashed_key; }
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
};
|
|
|
|
|
|
|
|
struct ClockHandle : public ClockHandleBasicData {
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
// Constants for handling the atomic `meta` word, which tracks most of the
|
|
|
|
// state of the handle. The meta word looks like this:
|
|
|
|
// low bits high bits
|
|
|
|
// -----------------------------------------------------------------------
|
|
|
|
// | acquire counter | release counter | state marker |
|
|
|
|
// -----------------------------------------------------------------------
|
|
|
|
|
|
|
|
// For reading or updating counters in meta word.
|
|
|
|
static constexpr uint8_t kCounterNumBits = 30;
|
|
|
|
static constexpr uint64_t kCounterMask = (uint64_t{1} << kCounterNumBits) - 1;
|
|
|
|
|
|
|
|
static constexpr uint8_t kAcquireCounterShift = 0;
|
|
|
|
static constexpr uint64_t kAcquireIncrement = uint64_t{1}
|
|
|
|
<< kAcquireCounterShift;
|
|
|
|
static constexpr uint8_t kReleaseCounterShift = kCounterNumBits;
|
|
|
|
static constexpr uint64_t kReleaseIncrement = uint64_t{1}
|
|
|
|
<< kReleaseCounterShift;
|
|
|
|
|
|
|
|
// For reading or updating the state marker in meta word
|
|
|
|
static constexpr uint8_t kStateShift = 2U * kCounterNumBits;
|
|
|
|
|
|
|
|
// Bits contribution to state marker.
|
|
|
|
// Occupied means any state other than empty
|
|
|
|
static constexpr uint8_t kStateOccupiedBit = 0b100;
|
|
|
|
// Shareable means the entry is reference counted (visible or invisible)
|
|
|
|
// (only set if also occupied)
|
|
|
|
static constexpr uint8_t kStateShareableBit = 0b010;
|
|
|
|
// Visible is only set if also shareable
|
|
|
|
static constexpr uint8_t kStateVisibleBit = 0b001;
|
|
|
|
|
|
|
|
// Complete state markers (not shifted into full word)
|
|
|
|
static constexpr uint8_t kStateEmpty = 0b000;
|
|
|
|
static constexpr uint8_t kStateConstruction = kStateOccupiedBit;
|
|
|
|
static constexpr uint8_t kStateInvisible =
|
|
|
|
kStateOccupiedBit | kStateShareableBit;
|
|
|
|
static constexpr uint8_t kStateVisible =
|
|
|
|
kStateOccupiedBit | kStateShareableBit | kStateVisibleBit;
|
|
|
|
|
|
|
|
// Constants for initializing the countdown clock. (Countdown clock is only
|
|
|
|
// in effect with zero refs, acquire counter == release counter, and in that
|
|
|
|
// case the countdown clock == both of those counters.)
|
|
|
|
static constexpr uint8_t kHighCountdown = 3;
|
|
|
|
static constexpr uint8_t kLowCountdown = 2;
|
|
|
|
static constexpr uint8_t kBottomCountdown = 1;
|
|
|
|
// During clock update, treat any countdown clock value greater than this
|
|
|
|
// value the same as this value.
|
|
|
|
static constexpr uint8_t kMaxCountdown = kHighCountdown;
|
|
|
|
// TODO: make these coundown values tuning parameters for eviction?
|
|
|
|
|
|
|
|
// See above. Mutable for read reference counting.
|
|
|
|
mutable std::atomic<uint64_t> meta{};
|
|
|
|
|
|
|
|
// Whether this is a "deteched" handle that is independently allocated
|
|
|
|
// with `new` (so must be deleted with `delete`).
|
|
|
|
// TODO: ideally this would be packed into some other data field, such
|
|
|
|
// as upper bits of total_charge, but that incurs a measurable performance
|
|
|
|
// regression.
|
|
|
|
bool standalone = false;
|
|
|
|
|
|
|
|
inline bool IsStandalone() const { return standalone; }
|
|
|
|
|
|
|
|
inline void SetStandalone() { standalone = true; }
|
|
|
|
}; // struct ClockHandle
|
|
|
|
|
|
|
|
class BaseClockTable {
|
|
|
|
public:
|
|
|
|
BaseClockTable(CacheMetadataChargePolicy metadata_charge_policy,
|
|
|
|
MemoryAllocator* allocator,
|
|
|
|
const Cache::EvictionCallback* eviction_callback,
|
|
|
|
const uint32_t* hash_seed)
|
|
|
|
: metadata_charge_policy_(metadata_charge_policy),
|
|
|
|
allocator_(allocator),
|
|
|
|
eviction_callback_(*eviction_callback),
|
|
|
|
hash_seed_(*hash_seed) {}
|
|
|
|
|
|
|
|
template <class Table>
|
|
|
|
typename Table::HandleImpl* CreateStandalone(ClockHandleBasicData& proto,
|
|
|
|
size_t capacity,
|
|
|
|
bool strict_capacity_limit,
|
|
|
|
bool allow_uncharged);
|
|
|
|
|
|
|
|
template <class Table>
|
|
|
|
Status Insert(const ClockHandleBasicData& proto,
|
|
|
|
typename Table::HandleImpl** handle, Cache::Priority priority,
|
|
|
|
size_t capacity, bool strict_capacity_limit);
|
|
|
|
|
|
|
|
void Ref(ClockHandle& handle);
|
|
|
|
|
|
|
|
size_t GetOccupancy() const {
|
|
|
|
return occupancy_.load(std::memory_order_relaxed);
|
|
|
|
}
|
|
|
|
|
|
|
|
size_t GetUsage() const { return usage_.load(std::memory_order_relaxed); }
|
|
|
|
|
|
|
|
size_t GetStandaloneUsage() const {
|
|
|
|
return standalone_usage_.load(std::memory_order_relaxed);
|
|
|
|
}
|
|
|
|
|
|
|
|
uint32_t GetHashSeed() const { return hash_seed_; }
|
|
|
|
|
|
|
|
struct EvictionData {
|
|
|
|
size_t freed_charge = 0;
|
|
|
|
size_t freed_count = 0;
|
|
|
|
};
|
|
|
|
|
|
|
|
void TrackAndReleaseEvictedEntry(ClockHandle* h, EvictionData* data);
|
|
|
|
|
|
|
|
#ifndef NDEBUG
|
|
|
|
// Acquire N references
|
|
|
|
void TEST_RefN(ClockHandle& handle, size_t n);
|
|
|
|
// Helper for TEST_ReleaseN
|
|
|
|
void TEST_ReleaseNMinus1(ClockHandle* handle, size_t n);
|
|
|
|
#endif
|
|
|
|
|
|
|
|
private: // fns
|
|
|
|
// Creates a "standalone" handle for returning from an Insert operation that
|
|
|
|
// cannot be completed by actually inserting into the table.
|
|
|
|
// Updates `standalone_usage_` but not `usage_` nor `occupancy_`.
|
|
|
|
template <class HandleImpl>
|
|
|
|
HandleImpl* StandaloneInsert(const ClockHandleBasicData& proto);
|
|
|
|
|
|
|
|
// Helper for updating `usage_` for new entry with given `total_charge`
|
|
|
|
// and evicting if needed under strict_capacity_limit=true rules. This
|
|
|
|
// means the operation might fail with Status::MemoryLimit. If
|
|
|
|
// `need_evict_for_occupancy`, then eviction of at least one entry is
|
|
|
|
// required, and the operation should fail if not possible.
|
|
|
|
// NOTE: Otherwise, occupancy_ is not managed in this function
|
|
|
|
template <class Table>
|
|
|
|
Status ChargeUsageMaybeEvictStrict(size_t total_charge, size_t capacity,
|
|
|
|
bool need_evict_for_occupancy,
|
|
|
|
typename Table::InsertState& state);
|
|
|
|
|
|
|
|
// Helper for updating `usage_` for new entry with given `total_charge`
|
|
|
|
// and evicting if needed under strict_capacity_limit=false rules. This
|
|
|
|
// means that updating `usage_` always succeeds even if forced to exceed
|
|
|
|
// capacity. If `need_evict_for_occupancy`, then eviction of at least one
|
|
|
|
// entry is required, and the operation should return false if such eviction
|
|
|
|
// is not possible. `usage_` is not updated in that case. Otherwise, returns
|
|
|
|
// true, indicating success.
|
|
|
|
// NOTE: occupancy_ is not managed in this function
|
|
|
|
template <class Table>
|
|
|
|
bool ChargeUsageMaybeEvictNonStrict(size_t total_charge, size_t capacity,
|
|
|
|
bool need_evict_for_occupancy,
|
|
|
|
typename Table::InsertState& state);
|
|
|
|
|
|
|
|
protected: // data
|
|
|
|
// We partition the following members into different cache lines
|
|
|
|
// to avoid false sharing among Lookup, Release, Erase and Insert
|
|
|
|
// operations in ClockCacheShard.
|
|
|
|
|
|
|
|
// Clock algorithm sweep pointer.
|
|
|
|
std::atomic<uint64_t> clock_pointer_{};
|
|
|
|
|
|
|
|
ALIGN_AS(CACHE_LINE_SIZE)
|
|
|
|
// Number of elements in the table.
|
|
|
|
std::atomic<size_t> occupancy_{};
|
|
|
|
|
|
|
|
// Memory usage by entries tracked by the cache (including standalone)
|
|
|
|
std::atomic<size_t> usage_{};
|
|
|
|
|
|
|
|
// Part of usage by standalone entries (not in table)
|
|
|
|
std::atomic<size_t> standalone_usage_{};
|
|
|
|
|
|
|
|
ALIGN_AS(CACHE_LINE_SIZE)
|
|
|
|
const CacheMetadataChargePolicy metadata_charge_policy_;
|
|
|
|
|
|
|
|
// From Cache, for deleter
|
|
|
|
MemoryAllocator* const allocator_;
|
|
|
|
|
|
|
|
// A reference to Cache::eviction_callback_
|
|
|
|
const Cache::EvictionCallback& eviction_callback_;
|
|
|
|
|
|
|
|
// A reference to ShardedCacheBase::hash_seed_
|
|
|
|
const uint32_t& hash_seed_;
|
|
|
|
};
|
|
|
|
|
|
|
|
class HyperClockTable : public BaseClockTable {
|
|
|
|
public:
|
|
|
|
// Target size to be exactly a common cache line size (see static_assert in
|
|
|
|
// clock_cache.cc)
|
|
|
|
struct ALIGN_AS(64U) HandleImpl : public ClockHandle {
|
|
|
|
// The number of elements that hash to this slot or a lower one, but wind
|
|
|
|
// up in this slot or a higher one.
|
|
|
|
std::atomic<uint32_t> displacements{};
|
|
|
|
|
|
|
|
}; // struct HandleImpl
|
|
|
|
|
|
|
|
struct Opts {
|
|
|
|
size_t estimated_value_size;
|
|
|
|
};
|
|
|
|
|
|
|
|
HyperClockTable(size_t capacity, bool strict_capacity_limit,
|
|
|
|
CacheMetadataChargePolicy metadata_charge_policy,
|
HyperClockCache support for SecondaryCache, with refactoring (#11301)
Summary:
Internally refactors SecondaryCache integration out of LRUCache specifically and into a wrapper/adapter class that works with various Cache implementations. Notably, this relies on separating the notion of async lookup handles from other cache handles, so that HyperClockCache doesn't have to deal with the problem of allocating handles from the hash table for lookups that might fail anyway, and might be on the same key without support for coalescing. (LRUCache's hash table can incorporate previously allocated handles thanks to its pointer indirection.) Specifically, I'm worried about the case in which hundreds of threads try to access the same block and probing in the hash table degrades to linear search on the pile of entries with the same key.
This change is a big step in the direction of supporting stacked SecondaryCaches, but there are obstacles to completing that. Especially, there is no SecondaryCache hook for evictions to pass from one to the next. It has been proposed that evictions be transmitted simply as the persisted data (as in SaveToCallback), but given the current structure provided by the CacheItemHelpers, that would require an extra copy of the block data, because there's intentionally no way to ask for a contiguous Slice of the data (to allow for flexibility in storage). `AsyncLookupHandle` and the re-worked `WaitAll()` should be essentially prepared for stacked SecondaryCaches, but several "TODO with stacked secondaries" issues remain in various places.
It could be argued that the stacking instead be done as a SecondaryCache adapter that wraps two (or more) SecondaryCaches, but at least with the current API that would require an extra heap allocation on SecondaryCache Lookup for a wrapper SecondaryCacheResultHandle that can transfer a Lookup between secondaries. We could also consider trying to unify the Cache and SecondaryCache APIs, though that might be difficult if `AsyncLookupHandle` is kept a fixed struct.
## cache.h (public API)
Moves `secondary_cache` option from LRUCacheOptions to ShardedCacheOptions so that it is applicable to HyperClockCache.
## advanced_cache.h (advanced public API)
* Add `Cache::CreateStandalone()` so that the SecondaryCache support wrapper can use it.
* Add `SetEvictionCallback()` / `eviction_callback_` so that the SecondaryCache support wrapper can use it. Only a single callback is supported for efficiency. If there is ever a need for more than one, hopefully that can be handled with a broadcast callback wrapper.
These are essentially the two "extra" pieces of `Cache` for pulling out specific SecondaryCache support from the `Cache` implementation. I think it's a good trade-off as these are reasonable, limited, and reusable "cut points" into the `Cache` implementations.
* Remove async capability from standard `Lookup()` (getting rid of awkward restrictions on pending Handles) and add `AsyncLookupHandle` and `StartAsyncLookup()`. As noted in the comments, the full struct of `AsyncLookupHandle` is exposed so that it can be stack allocated, for efficiency, though more data is being copied around than before, which could impact performance. (Lookup info -> AsyncLookupHandle -> Handle vs. Lookup info -> Handle)
I could foresee a future in which a Cache internally saves a pointer to the AsyncLookupHandle, which means it's dangerous to allow it to be copyable or even movable. It also means it's not compatible with std::vector (which I don't like requiring as an API parameter anyway), so `WaitAll()` expects any contiguous array of AsyncLookupHandles. I believe this is best for common case efficiency, while behaving well in other cases also. For example, `WaitAll()` has no effect on default-constructed AsyncLookupHandles, which look like a completed cache miss.
## cacheable_entry.h
A couple of functions are obsolete because Cache::Handle can no longer be pending.
## cache.cc
Provides default implementations for new or revamped Cache functions, especially appropriate for non-blocking caches.
## secondary_cache_adapter.{h,cc}
The full details of the Cache wrapper adding SecondaryCache support. Essentially replicates the SecondaryCache handling that was in LRUCache, but obviously refactored. There is a bit of logic duplication, where Lookup() is essentially a manually optimized version of StartAsyncLookup() and Wait(), but it's roughly a dozen lines of code.
## sharded_cache.h, typed_cache.h, charged_cache.{h,cc}, sim_cache.cc
Simply updated for Cache API changes.
## lru_cache.{h,cc}
Carefully remove SecondaryCache logic, implement `CreateStandalone` and eviction handler functionality.
## clock_cache.{h,cc}
Expose existing `CreateStandalone` functionality, add eviction handler functionality. Light refactoring.
## block_based_table_reader*
Mostly re-worked the only usage of async Lookup, which is in BlockBasedTable::MultiGet. Used arrays in place of autovector in some places for efficiency. Simplified some logic by not trying to process some cache results before they're all ready.
Created new function `BlockBasedTable::GetCachePriority()` to reduce some pre-existing code duplication (and avoid making it worse).
Fixed at least one small bug from the prior confusing mixture of async and sync Lookups. In MaybeReadBlockAndLoadToCache(), called by RetrieveBlock(), called by MultiGet() with wait=false, is_cache_hit for the block_cache_tracer entry would not be set to true if the handle was pending after Lookup and before Wait.
## Intended follow-up work
* Figure out if there are any missing stats or block_cache_tracer work in refactored BlockBasedTable::MultiGet
* Stacked secondary caches (see above discussion)
* See if we can make up for the small MultiGet performance regression.
* Study more performance with SecondaryCache
* Items evicted from over-full LRUCache in Release were not being demoted to SecondaryCache, and still aren't to minimize unit test churn. Ideally they would be demoted, but it's an exceptional case so not a big deal.
* Use CreateStandalone for cache reservations (save unnecessary hash table operations). Not a big deal, but worthy cleanup.
* Somehow I got the contract for SecondaryCache::Insert wrong in #10945. (Doesn't take ownership!) That API comment needs to be fixed, but didn't want to mingle that in here.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11301
Test Plan:
## Unit tests
Generally updated to include HCC in SecondaryCache tests, though HyperClockCache has some different, less strict behaviors that leads to some tests not really being set up to work with it. Some of the tests remain disabled with it, but I think we have good coverage without them.
## Crash/stress test
Updated to use the new combination.
## Performance
First, let's check for regression on caches without secondary cache configured. Adding support for the eviction callback is likely to have a tiny effect, but it shouldn't be worrisome. LRUCache could benefit slightly from less logic around SecondaryCache handling. We can test with cache_bench default settings, built with DEBUG_LEVEL=0 and PORTABLE=0.
```
(while :; do base/cache_bench --cache_type=hyper_clock_cache | grep Rough; done) | awk '{ sum += $9; count++; print $0; print "Average: " int(sum / count) }'
```
**Before** this and #11299 (which could also have a small effect), running for about an hour, before & after running concurrently for each cache type:
HyperClockCache: 3168662 (average parallel ops/sec)
LRUCache: 2940127
**After** this and #11299, running for about an hour:
HyperClockCache: 3164862 (average parallel ops/sec) (0.12% slower)
LRUCache: 2940928 (0.03% faster)
This is an acceptable difference IMHO.
Next, let's consider essentially the worst case of new CPU overhead affecting overall performance. MultiGet uses the async lookup interface regardless of whether SecondaryCache or folly are used. We can configure a benchmark where all block cache queries are for data blocks, and all are hits.
Create DB and test (before and after tests running simultaneously):
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=multireadrandom[-X30] -readonly -multiread_batched -batch_size=32 -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16
```
**Before**:
multireadrandom [AVG 30 runs] : 3444202 (± 57049) ops/sec; 240.9 (± 4.0) MB/sec
multireadrandom [MEDIAN 30 runs] : 3514443 ops/sec; 245.8 MB/sec
**After**:
multireadrandom [AVG 30 runs] : 3291022 (± 58851) ops/sec; 230.2 (± 4.1) MB/sec
multireadrandom [MEDIAN 30 runs] : 3366179 ops/sec; 235.4 MB/sec
So that's roughly a 3% regression, on kind of a *worst case* test of MultiGet CPU. Similar story with HyperClockCache:
**Before**:
multireadrandom [AVG 30 runs] : 3933777 (± 41840) ops/sec; 275.1 (± 2.9) MB/sec
multireadrandom [MEDIAN 30 runs] : 3970667 ops/sec; 277.7 MB/sec
**After**:
multireadrandom [AVG 30 runs] : 3755338 (± 30391) ops/sec; 262.6 (± 2.1) MB/sec
multireadrandom [MEDIAN 30 runs] : 3785696 ops/sec; 264.8 MB/sec
Roughly a 4-5% regression. Not ideal, but not the whole story, fortunately.
Let's also look at Get() in db_bench:
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X30] -readonly -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16
```
**Before**:
readrandom [AVG 30 runs] : 2198685 (± 13412) ops/sec; 153.8 (± 0.9) MB/sec
readrandom [MEDIAN 30 runs] : 2209498 ops/sec; 154.5 MB/sec
**After**:
readrandom [AVG 30 runs] : 2292814 (± 43508) ops/sec; 160.3 (± 3.0) MB/sec
readrandom [MEDIAN 30 runs] : 2365181 ops/sec; 165.4 MB/sec
That's showing roughly a 4% improvement, perhaps because of the secondary cache code that is no longer part of LRUCache. But weirdly, HyperClockCache is also showing 2-3% improvement:
**Before**:
readrandom [AVG 30 runs] : 2272333 (± 9992) ops/sec; 158.9 (± 0.7) MB/sec
readrandom [MEDIAN 30 runs] : 2273239 ops/sec; 159.0 MB/sec
**After**:
readrandom [AVG 30 runs] : 2332407 (± 11252) ops/sec; 163.1 (± 0.8) MB/sec
readrandom [MEDIAN 30 runs] : 2335329 ops/sec; 163.3 MB/sec
Reviewed By: ltamasi
Differential Revision: D44177044
Pulled By: pdillinger
fbshipit-source-id: e808e48ff3fe2f792a79841ba617be98e48689f5
2 years ago
|
|
|
MemoryAllocator* allocator,
|
|
|
|
const Cache::EvictionCallback* eviction_callback,
|
|
|
|
const uint32_t* hash_seed, const Opts& opts);
|
|
|
|
~HyperClockTable();
|
|
|
|
|
|
|
|
// For BaseClockTable::Insert
|
|
|
|
struct InsertState {};
|
|
|
|
|
|
|
|
void StartInsert(InsertState& state);
|
|
|
|
|
|
|
|
// Returns true iff there is room for the proposed number of entries.
|
|
|
|
bool GrowIfNeeded(size_t new_occupancy, InsertState& state);
|
|
|
|
|
|
|
|
HandleImpl* DoInsert(const ClockHandleBasicData& proto,
|
|
|
|
uint64_t initial_countdown, bool take_ref,
|
|
|
|
InsertState& state);
|
|
|
|
|
|
|
|
// Runs the clock eviction algorithm trying to reclaim at least
|
|
|
|
// requested_charge. Returns how much is evicted, which could be less
|
|
|
|
// if it appears impossible to evict the requested amount without blocking.
|
|
|
|
void Evict(size_t requested_charge, InsertState& state, EvictionData* data);
|
HyperClockCache support for SecondaryCache, with refactoring (#11301)
Summary:
Internally refactors SecondaryCache integration out of LRUCache specifically and into a wrapper/adapter class that works with various Cache implementations. Notably, this relies on separating the notion of async lookup handles from other cache handles, so that HyperClockCache doesn't have to deal with the problem of allocating handles from the hash table for lookups that might fail anyway, and might be on the same key without support for coalescing. (LRUCache's hash table can incorporate previously allocated handles thanks to its pointer indirection.) Specifically, I'm worried about the case in which hundreds of threads try to access the same block and probing in the hash table degrades to linear search on the pile of entries with the same key.
This change is a big step in the direction of supporting stacked SecondaryCaches, but there are obstacles to completing that. Especially, there is no SecondaryCache hook for evictions to pass from one to the next. It has been proposed that evictions be transmitted simply as the persisted data (as in SaveToCallback), but given the current structure provided by the CacheItemHelpers, that would require an extra copy of the block data, because there's intentionally no way to ask for a contiguous Slice of the data (to allow for flexibility in storage). `AsyncLookupHandle` and the re-worked `WaitAll()` should be essentially prepared for stacked SecondaryCaches, but several "TODO with stacked secondaries" issues remain in various places.
It could be argued that the stacking instead be done as a SecondaryCache adapter that wraps two (or more) SecondaryCaches, but at least with the current API that would require an extra heap allocation on SecondaryCache Lookup for a wrapper SecondaryCacheResultHandle that can transfer a Lookup between secondaries. We could also consider trying to unify the Cache and SecondaryCache APIs, though that might be difficult if `AsyncLookupHandle` is kept a fixed struct.
## cache.h (public API)
Moves `secondary_cache` option from LRUCacheOptions to ShardedCacheOptions so that it is applicable to HyperClockCache.
## advanced_cache.h (advanced public API)
* Add `Cache::CreateStandalone()` so that the SecondaryCache support wrapper can use it.
* Add `SetEvictionCallback()` / `eviction_callback_` so that the SecondaryCache support wrapper can use it. Only a single callback is supported for efficiency. If there is ever a need for more than one, hopefully that can be handled with a broadcast callback wrapper.
These are essentially the two "extra" pieces of `Cache` for pulling out specific SecondaryCache support from the `Cache` implementation. I think it's a good trade-off as these are reasonable, limited, and reusable "cut points" into the `Cache` implementations.
* Remove async capability from standard `Lookup()` (getting rid of awkward restrictions on pending Handles) and add `AsyncLookupHandle` and `StartAsyncLookup()`. As noted in the comments, the full struct of `AsyncLookupHandle` is exposed so that it can be stack allocated, for efficiency, though more data is being copied around than before, which could impact performance. (Lookup info -> AsyncLookupHandle -> Handle vs. Lookup info -> Handle)
I could foresee a future in which a Cache internally saves a pointer to the AsyncLookupHandle, which means it's dangerous to allow it to be copyable or even movable. It also means it's not compatible with std::vector (which I don't like requiring as an API parameter anyway), so `WaitAll()` expects any contiguous array of AsyncLookupHandles. I believe this is best for common case efficiency, while behaving well in other cases also. For example, `WaitAll()` has no effect on default-constructed AsyncLookupHandles, which look like a completed cache miss.
## cacheable_entry.h
A couple of functions are obsolete because Cache::Handle can no longer be pending.
## cache.cc
Provides default implementations for new or revamped Cache functions, especially appropriate for non-blocking caches.
## secondary_cache_adapter.{h,cc}
The full details of the Cache wrapper adding SecondaryCache support. Essentially replicates the SecondaryCache handling that was in LRUCache, but obviously refactored. There is a bit of logic duplication, where Lookup() is essentially a manually optimized version of StartAsyncLookup() and Wait(), but it's roughly a dozen lines of code.
## sharded_cache.h, typed_cache.h, charged_cache.{h,cc}, sim_cache.cc
Simply updated for Cache API changes.
## lru_cache.{h,cc}
Carefully remove SecondaryCache logic, implement `CreateStandalone` and eviction handler functionality.
## clock_cache.{h,cc}
Expose existing `CreateStandalone` functionality, add eviction handler functionality. Light refactoring.
## block_based_table_reader*
Mostly re-worked the only usage of async Lookup, which is in BlockBasedTable::MultiGet. Used arrays in place of autovector in some places for efficiency. Simplified some logic by not trying to process some cache results before they're all ready.
Created new function `BlockBasedTable::GetCachePriority()` to reduce some pre-existing code duplication (and avoid making it worse).
Fixed at least one small bug from the prior confusing mixture of async and sync Lookups. In MaybeReadBlockAndLoadToCache(), called by RetrieveBlock(), called by MultiGet() with wait=false, is_cache_hit for the block_cache_tracer entry would not be set to true if the handle was pending after Lookup and before Wait.
## Intended follow-up work
* Figure out if there are any missing stats or block_cache_tracer work in refactored BlockBasedTable::MultiGet
* Stacked secondary caches (see above discussion)
* See if we can make up for the small MultiGet performance regression.
* Study more performance with SecondaryCache
* Items evicted from over-full LRUCache in Release were not being demoted to SecondaryCache, and still aren't to minimize unit test churn. Ideally they would be demoted, but it's an exceptional case so not a big deal.
* Use CreateStandalone for cache reservations (save unnecessary hash table operations). Not a big deal, but worthy cleanup.
* Somehow I got the contract for SecondaryCache::Insert wrong in #10945. (Doesn't take ownership!) That API comment needs to be fixed, but didn't want to mingle that in here.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11301
Test Plan:
## Unit tests
Generally updated to include HCC in SecondaryCache tests, though HyperClockCache has some different, less strict behaviors that leads to some tests not really being set up to work with it. Some of the tests remain disabled with it, but I think we have good coverage without them.
## Crash/stress test
Updated to use the new combination.
## Performance
First, let's check for regression on caches without secondary cache configured. Adding support for the eviction callback is likely to have a tiny effect, but it shouldn't be worrisome. LRUCache could benefit slightly from less logic around SecondaryCache handling. We can test with cache_bench default settings, built with DEBUG_LEVEL=0 and PORTABLE=0.
```
(while :; do base/cache_bench --cache_type=hyper_clock_cache | grep Rough; done) | awk '{ sum += $9; count++; print $0; print "Average: " int(sum / count) }'
```
**Before** this and #11299 (which could also have a small effect), running for about an hour, before & after running concurrently for each cache type:
HyperClockCache: 3168662 (average parallel ops/sec)
LRUCache: 2940127
**After** this and #11299, running for about an hour:
HyperClockCache: 3164862 (average parallel ops/sec) (0.12% slower)
LRUCache: 2940928 (0.03% faster)
This is an acceptable difference IMHO.
Next, let's consider essentially the worst case of new CPU overhead affecting overall performance. MultiGet uses the async lookup interface regardless of whether SecondaryCache or folly are used. We can configure a benchmark where all block cache queries are for data blocks, and all are hits.
Create DB and test (before and after tests running simultaneously):
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=multireadrandom[-X30] -readonly -multiread_batched -batch_size=32 -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16
```
**Before**:
multireadrandom [AVG 30 runs] : 3444202 (± 57049) ops/sec; 240.9 (± 4.0) MB/sec
multireadrandom [MEDIAN 30 runs] : 3514443 ops/sec; 245.8 MB/sec
**After**:
multireadrandom [AVG 30 runs] : 3291022 (± 58851) ops/sec; 230.2 (± 4.1) MB/sec
multireadrandom [MEDIAN 30 runs] : 3366179 ops/sec; 235.4 MB/sec
So that's roughly a 3% regression, on kind of a *worst case* test of MultiGet CPU. Similar story with HyperClockCache:
**Before**:
multireadrandom [AVG 30 runs] : 3933777 (± 41840) ops/sec; 275.1 (± 2.9) MB/sec
multireadrandom [MEDIAN 30 runs] : 3970667 ops/sec; 277.7 MB/sec
**After**:
multireadrandom [AVG 30 runs] : 3755338 (± 30391) ops/sec; 262.6 (± 2.1) MB/sec
multireadrandom [MEDIAN 30 runs] : 3785696 ops/sec; 264.8 MB/sec
Roughly a 4-5% regression. Not ideal, but not the whole story, fortunately.
Let's also look at Get() in db_bench:
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X30] -readonly -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16
```
**Before**:
readrandom [AVG 30 runs] : 2198685 (± 13412) ops/sec; 153.8 (± 0.9) MB/sec
readrandom [MEDIAN 30 runs] : 2209498 ops/sec; 154.5 MB/sec
**After**:
readrandom [AVG 30 runs] : 2292814 (± 43508) ops/sec; 160.3 (± 3.0) MB/sec
readrandom [MEDIAN 30 runs] : 2365181 ops/sec; 165.4 MB/sec
That's showing roughly a 4% improvement, perhaps because of the secondary cache code that is no longer part of LRUCache. But weirdly, HyperClockCache is also showing 2-3% improvement:
**Before**:
readrandom [AVG 30 runs] : 2272333 (± 9992) ops/sec; 158.9 (± 0.7) MB/sec
readrandom [MEDIAN 30 runs] : 2273239 ops/sec; 159.0 MB/sec
**After**:
readrandom [AVG 30 runs] : 2332407 (± 11252) ops/sec; 163.1 (± 0.8) MB/sec
readrandom [MEDIAN 30 runs] : 2335329 ops/sec; 163.3 MB/sec
Reviewed By: ltamasi
Differential Revision: D44177044
Pulled By: pdillinger
fbshipit-source-id: e808e48ff3fe2f792a79841ba617be98e48689f5
2 years ago
|
|
|
|
|
|
|
HandleImpl* Lookup(const UniqueId64x2& hashed_key);
|
|
|
|
|
|
|
|
bool Release(HandleImpl* handle, bool useful, bool erase_if_last_ref);
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
|
|
|
|
void Erase(const UniqueId64x2& hashed_key);
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
|
|
|
|
void EraseUnRefEntries();
|
|
|
|
|
|
|
|
size_t GetTableSize() const { return size_t{1} << length_bits_; }
|
|
|
|
|
|
|
|
size_t GetOccupancyLimit() const { return occupancy_limit_; }
|
|
|
|
|
|
|
|
const HandleImpl* HandlePtr(size_t idx) const { return &array_[idx]; }
|
|
|
|
|
|
|
|
#ifndef NDEBUG
|
|
|
|
size_t& TEST_MutableOccupancyLimit() const {
|
|
|
|
return const_cast<size_t&>(occupancy_limit_);
|
|
|
|
}
|
|
|
|
|
|
|
|
// Release N references
|
|
|
|
void TEST_ReleaseN(HandleImpl* handle, size_t n);
|
|
|
|
#endif
|
|
|
|
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
private: // functions
|
|
|
|
// Returns x mod 2^{length_bits_}.
|
|
|
|
inline size_t ModTableSize(uint64_t x) {
|
|
|
|
return BitwiseAnd(x, length_bits_mask_);
|
|
|
|
}
|
|
|
|
|
|
|
|
// Returns the first slot in the probe sequence with a handle e such that
|
|
|
|
// match_fn(e) is true. At every step, the function first tests whether
|
|
|
|
// match_fn(e) holds. If this is false, it evaluates abort_fn(e) to decide
|
|
|
|
// whether the search should be aborted, and if so, FindSlot immediately
|
|
|
|
// returns nullptr. For every handle e that is not a match and not aborted,
|
|
|
|
// FindSlot runs update_fn(e, is_last) where is_last is set to true iff that
|
|
|
|
// slot will be the last probed because the next would cycle back to the first
|
|
|
|
// slot probed. This function uses templates instead of std::function to
|
|
|
|
// minimize the risk of heap-allocated closures being created.
|
|
|
|
template <typename MatchFn, typename AbortFn, typename UpdateFn>
|
|
|
|
inline HandleImpl* FindSlot(const UniqueId64x2& hashed_key,
|
|
|
|
const MatchFn& match_fn, const AbortFn& abort_fn,
|
|
|
|
const UpdateFn& update_fn);
|
|
|
|
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
// Re-decrement all displacements in probe path starting from beginning
|
|
|
|
// until (not including) the given handle
|
|
|
|
inline void Rollback(const UniqueId64x2& hashed_key, const HandleImpl* h);
|
|
|
|
|
|
|
|
// Subtracts `total_charge` from `usage_` and 1 from `occupancy_`.
|
|
|
|
// Ideally this comes after releasing the entry itself so that we
|
|
|
|
// actually have the available occupancy/usage that is claimed.
|
|
|
|
// However, that means total_charge has to be saved from the handle
|
|
|
|
// before releasing it so that it can be provided to this function.
|
|
|
|
inline void ReclaimEntryUsage(size_t total_charge);
|
|
|
|
|
Major Cache refactoring, CPU efficiency improvement (#10975)
Summary:
This is several refactorings bundled into one to avoid having to incrementally re-modify uses of Cache several times. Overall, there are breaking changes to Cache class, and it becomes more of low-level interface for implementing caches, especially block cache. New internal APIs make using Cache cleaner than before, and more insulated from block cache evolution. Hopefully, this is the last really big block cache refactoring, because of rather effectively decoupling the implementations from the uses. This change also removes the EXPERIMENTAL designation on the SecondaryCache support in Cache. It seems reasonably mature at this point but still subject to change/evolution (as I warn in the API docs for Cache).
The high-level motivation for this refactoring is to minimize code duplication / compounding complexity in adding SecondaryCache support to HyperClockCache (in a later PR). Other benefits listed below.
* static_cast lines of code +29 -35 (net removed 6)
* reinterpret_cast lines of code +6 -32 (net removed 26)
## cache.h and secondary_cache.h
* Always use CacheItemHelper with entries instead of just a Deleter. There are several motivations / justifications:
* Simpler for implementations to deal with just one Insert and one Lookup.
* Simpler and more efficient implementation because we don't have to track which entries are using helpers and which are using deleters
* Gets rid of hack to classify cache entries by their deleter. Instead, the CacheItemHelper includes a CacheEntryRole. This simplifies a lot of code (cache_entry_roles.h almost eliminated). Fixes https://github.com/facebook/rocksdb/issues/9428.
* Makes it trivial to adjust SecondaryCache behavior based on kind of block (e.g. don't re-compress filter blocks).
* It is arguably less convenient for many direct users of Cache, but direct users of Cache are now rare with introduction of typed_cache.h (below).
* I considered and rejected an alternative approach in which we reduce customizability by assuming each secondary cache compatible value starts with a Slice referencing the uncompressed block contents (already true or mostly true), but we apparently intend to stack secondary caches. Saving an entry from a compressed secondary to a lower tier requires custom handling offered by SaveToCallback, etc.
* Make CreateCallback part of the helper and introduce CreateContext to work with it (alternative to https://github.com/facebook/rocksdb/issues/10562). This cleans up the interface while still allowing context to be provided for loading/parsing values into primary cache. This model works for async lookup in BlockBasedTable reader (reader owns a CreateContext) under the assumption that it always waits on secondary cache operations to finish. (Otherwise, the CreateContext could be destroyed while async operation depending on it continues.) This likely contributes most to the observed performance improvement because it saves an std::function backed by a heap allocation.
* Use char* for serialized data, e.g. in SaveToCallback, where void* was confusingly used. (We use `char*` for serialized byte data all over RocksDB, with many advantages over `void*`. `memcpy` etc. are legacy APIs that should not be mimicked.)
* Add a type alias Cache::ObjectPtr = void*, so that we can better indicate the intent of the void* when it is to be the object associated with a Cache entry. Related: started (but did not complete) a refactoring to move away from "value" of a cache entry toward "object" or "obj". (It is confusing to call Cache a key-value store (like DB) when it is really storing arbitrary in-memory objects, not byte strings.)
* Remove unnecessary key param from DeleterFn. This is good for efficiency in HyperClockCache, which does not directly store the cache key in memory. (Alternative to https://github.com/facebook/rocksdb/issues/10774)
* Add allocator to Cache DeleterFn. This is a kind of future-proofing change in case we get more serious about using the Cache allocator for memory tracked by the Cache. Right now, only the uncompressed block contents are allocated using the allocator, and a pointer to that allocator is saved as part of the cached object so that the deleter can use it. (See CacheAllocationPtr.) If in the future we are able to "flatten out" our Cache objects some more, it would be good not to have to track the allocator as part of each object.
* Removes legacy `ApplyToAllCacheEntries` and changes `ApplyToAllEntries` signature for Deleter->CacheItemHelper change.
## typed_cache.h
Adds various "typed" interfaces to the Cache as internal APIs, so that most uses of Cache can use simple type safe code without casting and without explicit deleters, etc. Almost all of the non-test, non-glue code uses of Cache have been migrated. (Follow-up work: CompressedSecondaryCache deserves deeper attention to migrate.) This change expands RocksDB's internal usage of metaprogramming and SFINAE (https://en.cppreference.com/w/cpp/language/sfinae).
The existing usages of Cache are divided up at a high level into these new interfaces. See updated existing uses of Cache for examples of how these are used.
* PlaceholderCacheInterface - Used for making cache reservations, with entries that have a charge but no value.
* BasicTypedCacheInterface<TValue> - Used for primary cache storage of objects of type TValue, which can be cleaned up with std::default_delete<TValue>. The role is provided by TValue::kCacheEntryRole or given in an optional template parameter.
* FullTypedCacheInterface<TValue, TCreateContext> - Used for secondary cache compatible storage of objects of type TValue. In addition to BasicTypedCacheInterface constraints, we require TValue::ContentSlice() to return persistable data. This simplifies usage for the normal case of simple secondary cache compatibility (can give you a Slice to the data already in memory). In addition to TCreateContext performing the role of Cache::CreateContext, it is also expected to provide a factory function for creating TValue.
* For each of these, there's a "Shared" version (e.g. FullTypedSharedCacheInterface) that holds a shared_ptr to the Cache, rather than assuming external ownership by holding only a raw `Cache*`.
These interfaces introduce specific handle types for each interface instantiation, so that it's easy to see what kind of object is controlled by a handle. (Ultimately, this might not be worth the extra complexity, but it seems OK so far.)
Note: I attempted to make the cache 'charge' automatically inferred from the cache object type, such as by expecting an ApproximateMemoryUsage() function, but this is not so clean because there are cases where we need to compute the charge ahead of time and don't want to re-compute it.
## block_cache.h
This header is essentially the replacement for the old block_like_traits.h. It includes various things to support block cache access with typed_cache.h for block-based table.
## block_based_table_reader.cc
Before this change, accessing the block cache here was an awkward mix of static polymorphism (template TBlocklike) and switch-case on a dynamic BlockType value. This change mostly unifies on static polymorphism, relying on minor hacks in block_cache.h to distinguish variants of Block. We still check BlockType in some places (especially for stats, which could be improved in follow-up work) but at least the BlockType is a static constant from the template parameter. (No more awkward partial redundancy between static and dynamic info.) This likely contributes to the overall performance improvement, but hasn't been tested in isolation.
The other key source of simplification here is a more unified system of creating block cache objects: for directly populating from primary cache and for promotion from secondary cache. Both use BlockCreateContext, for context and for factory functions.
## block_based_table_builder.cc, cache_dump_load_impl.cc
Before this change, warming caches was super ugly code. Both of these source files had switch statements to basically transition from the dynamic BlockType world to the static TBlocklike world. None of that mess is needed anymore as there's a new, untyped WarmInCache function that handles all the details just as promotion from SecondaryCache would. (Fixes `TODO akanksha: Dedup below code` in block_based_table_builder.cc.)
## Everything else
Mostly just updating Cache users to use new typed APIs when reasonably possible, or changed Cache APIs when not.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10975
Test Plan:
tests updated
Performance test setup similar to https://github.com/facebook/rocksdb/issues/10626 (by cache size, LRUCache when not "hyper" for HyperClockCache):
34MB 1thread base.hyper -> kops/s: 0.745 io_bytes/op: 2.52504e+06 miss_ratio: 0.140906 max_rss_mb: 76.4844
34MB 1thread new.hyper -> kops/s: 0.751 io_bytes/op: 2.5123e+06 miss_ratio: 0.140161 max_rss_mb: 79.3594
34MB 1thread base -> kops/s: 0.254 io_bytes/op: 1.36073e+07 miss_ratio: 0.918818 max_rss_mb: 45.9297
34MB 1thread new -> kops/s: 0.252 io_bytes/op: 1.36157e+07 miss_ratio: 0.918999 max_rss_mb: 44.1523
34MB 32thread base.hyper -> kops/s: 7.272 io_bytes/op: 2.88323e+06 miss_ratio: 0.162532 max_rss_mb: 516.602
34MB 32thread new.hyper -> kops/s: 7.214 io_bytes/op: 2.99046e+06 miss_ratio: 0.168818 max_rss_mb: 518.293
34MB 32thread base -> kops/s: 3.528 io_bytes/op: 1.35722e+07 miss_ratio: 0.914691 max_rss_mb: 264.926
34MB 32thread new -> kops/s: 3.604 io_bytes/op: 1.35744e+07 miss_ratio: 0.915054 max_rss_mb: 264.488
233MB 1thread base.hyper -> kops/s: 53.909 io_bytes/op: 2552.35 miss_ratio: 0.0440566 max_rss_mb: 241.984
233MB 1thread new.hyper -> kops/s: 62.792 io_bytes/op: 2549.79 miss_ratio: 0.044043 max_rss_mb: 241.922
233MB 1thread base -> kops/s: 1.197 io_bytes/op: 2.75173e+06 miss_ratio: 0.103093 max_rss_mb: 241.559
233MB 1thread new -> kops/s: 1.199 io_bytes/op: 2.73723e+06 miss_ratio: 0.10305 max_rss_mb: 240.93
233MB 32thread base.hyper -> kops/s: 1298.69 io_bytes/op: 2539.12 miss_ratio: 0.0440307 max_rss_mb: 371.418
233MB 32thread new.hyper -> kops/s: 1421.35 io_bytes/op: 2538.75 miss_ratio: 0.0440307 max_rss_mb: 347.273
233MB 32thread base -> kops/s: 9.693 io_bytes/op: 2.77304e+06 miss_ratio: 0.103745 max_rss_mb: 569.691
233MB 32thread new -> kops/s: 9.75 io_bytes/op: 2.77559e+06 miss_ratio: 0.103798 max_rss_mb: 552.82
1597MB 1thread base.hyper -> kops/s: 58.607 io_bytes/op: 1449.14 miss_ratio: 0.0249324 max_rss_mb: 1583.55
1597MB 1thread new.hyper -> kops/s: 69.6 io_bytes/op: 1434.89 miss_ratio: 0.0247167 max_rss_mb: 1584.02
1597MB 1thread base -> kops/s: 60.478 io_bytes/op: 1421.28 miss_ratio: 0.024452 max_rss_mb: 1589.45
1597MB 1thread new -> kops/s: 63.973 io_bytes/op: 1416.07 miss_ratio: 0.0243766 max_rss_mb: 1589.24
1597MB 32thread base.hyper -> kops/s: 1436.2 io_bytes/op: 1357.93 miss_ratio: 0.0235353 max_rss_mb: 1692.92
1597MB 32thread new.hyper -> kops/s: 1605.03 io_bytes/op: 1358.04 miss_ratio: 0.023538 max_rss_mb: 1702.78
1597MB 32thread base -> kops/s: 280.059 io_bytes/op: 1350.34 miss_ratio: 0.023289 max_rss_mb: 1675.36
1597MB 32thread new -> kops/s: 283.125 io_bytes/op: 1351.05 miss_ratio: 0.0232797 max_rss_mb: 1703.83
Almost uniformly improving over base revision, especially for hot paths with HyperClockCache, up to 12% higher throughput seen (1597MB, 32thread, hyper). The improvement for that is likely coming from much simplified code for providing context for secondary cache promotion (CreateCallback/CreateContext), and possibly from less branching in block_based_table_reader. And likely a small improvement from not reconstituting key for DeleterFn.
Reviewed By: anand1976
Differential Revision: D42417818
Pulled By: pdillinger
fbshipit-source-id: f86bfdd584dce27c028b151ba56818ad14f7a432
2 years ago
|
|
|
MemoryAllocator* GetAllocator() const { return allocator_; }
|
|
|
|
|
|
|
|
// Returns the number of bits used to hash an element in the hash
|
|
|
|
// table.
|
|
|
|
static int CalcHashBits(size_t capacity, size_t estimated_value_size,
|
|
|
|
CacheMetadataChargePolicy metadata_charge_policy);
|
|
|
|
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
private: // data
|
|
|
|
// Number of hash bits used for table index.
|
|
|
|
// The size of the table is 1 << length_bits_.
|
|
|
|
const int length_bits_;
|
|
|
|
|
|
|
|
// For faster computation of ModTableSize.
|
|
|
|
const size_t length_bits_mask_;
|
|
|
|
|
|
|
|
// Maximum number of elements the user can store in the table.
|
|
|
|
const size_t occupancy_limit_;
|
|
|
|
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
// Array of slots comprising the hash table.
|
|
|
|
const std::unique_ptr<HandleImpl[]> array_;
|
|
|
|
}; // class HyperClockTable
|
|
|
|
|
|
|
|
// A single shard of sharded cache.
|
|
|
|
template <class Table>
|
|
|
|
class ALIGN_AS(CACHE_LINE_SIZE) ClockCacheShard final : public CacheShardBase {
|
|
|
|
public:
|
|
|
|
ClockCacheShard(size_t capacity, bool strict_capacity_limit,
|
|
|
|
CacheMetadataChargePolicy metadata_charge_policy,
|
HyperClockCache support for SecondaryCache, with refactoring (#11301)
Summary:
Internally refactors SecondaryCache integration out of LRUCache specifically and into a wrapper/adapter class that works with various Cache implementations. Notably, this relies on separating the notion of async lookup handles from other cache handles, so that HyperClockCache doesn't have to deal with the problem of allocating handles from the hash table for lookups that might fail anyway, and might be on the same key without support for coalescing. (LRUCache's hash table can incorporate previously allocated handles thanks to its pointer indirection.) Specifically, I'm worried about the case in which hundreds of threads try to access the same block and probing in the hash table degrades to linear search on the pile of entries with the same key.
This change is a big step in the direction of supporting stacked SecondaryCaches, but there are obstacles to completing that. Especially, there is no SecondaryCache hook for evictions to pass from one to the next. It has been proposed that evictions be transmitted simply as the persisted data (as in SaveToCallback), but given the current structure provided by the CacheItemHelpers, that would require an extra copy of the block data, because there's intentionally no way to ask for a contiguous Slice of the data (to allow for flexibility in storage). `AsyncLookupHandle` and the re-worked `WaitAll()` should be essentially prepared for stacked SecondaryCaches, but several "TODO with stacked secondaries" issues remain in various places.
It could be argued that the stacking instead be done as a SecondaryCache adapter that wraps two (or more) SecondaryCaches, but at least with the current API that would require an extra heap allocation on SecondaryCache Lookup for a wrapper SecondaryCacheResultHandle that can transfer a Lookup between secondaries. We could also consider trying to unify the Cache and SecondaryCache APIs, though that might be difficult if `AsyncLookupHandle` is kept a fixed struct.
## cache.h (public API)
Moves `secondary_cache` option from LRUCacheOptions to ShardedCacheOptions so that it is applicable to HyperClockCache.
## advanced_cache.h (advanced public API)
* Add `Cache::CreateStandalone()` so that the SecondaryCache support wrapper can use it.
* Add `SetEvictionCallback()` / `eviction_callback_` so that the SecondaryCache support wrapper can use it. Only a single callback is supported for efficiency. If there is ever a need for more than one, hopefully that can be handled with a broadcast callback wrapper.
These are essentially the two "extra" pieces of `Cache` for pulling out specific SecondaryCache support from the `Cache` implementation. I think it's a good trade-off as these are reasonable, limited, and reusable "cut points" into the `Cache` implementations.
* Remove async capability from standard `Lookup()` (getting rid of awkward restrictions on pending Handles) and add `AsyncLookupHandle` and `StartAsyncLookup()`. As noted in the comments, the full struct of `AsyncLookupHandle` is exposed so that it can be stack allocated, for efficiency, though more data is being copied around than before, which could impact performance. (Lookup info -> AsyncLookupHandle -> Handle vs. Lookup info -> Handle)
I could foresee a future in which a Cache internally saves a pointer to the AsyncLookupHandle, which means it's dangerous to allow it to be copyable or even movable. It also means it's not compatible with std::vector (which I don't like requiring as an API parameter anyway), so `WaitAll()` expects any contiguous array of AsyncLookupHandles. I believe this is best for common case efficiency, while behaving well in other cases also. For example, `WaitAll()` has no effect on default-constructed AsyncLookupHandles, which look like a completed cache miss.
## cacheable_entry.h
A couple of functions are obsolete because Cache::Handle can no longer be pending.
## cache.cc
Provides default implementations for new or revamped Cache functions, especially appropriate for non-blocking caches.
## secondary_cache_adapter.{h,cc}
The full details of the Cache wrapper adding SecondaryCache support. Essentially replicates the SecondaryCache handling that was in LRUCache, but obviously refactored. There is a bit of logic duplication, where Lookup() is essentially a manually optimized version of StartAsyncLookup() and Wait(), but it's roughly a dozen lines of code.
## sharded_cache.h, typed_cache.h, charged_cache.{h,cc}, sim_cache.cc
Simply updated for Cache API changes.
## lru_cache.{h,cc}
Carefully remove SecondaryCache logic, implement `CreateStandalone` and eviction handler functionality.
## clock_cache.{h,cc}
Expose existing `CreateStandalone` functionality, add eviction handler functionality. Light refactoring.
## block_based_table_reader*
Mostly re-worked the only usage of async Lookup, which is in BlockBasedTable::MultiGet. Used arrays in place of autovector in some places for efficiency. Simplified some logic by not trying to process some cache results before they're all ready.
Created new function `BlockBasedTable::GetCachePriority()` to reduce some pre-existing code duplication (and avoid making it worse).
Fixed at least one small bug from the prior confusing mixture of async and sync Lookups. In MaybeReadBlockAndLoadToCache(), called by RetrieveBlock(), called by MultiGet() with wait=false, is_cache_hit for the block_cache_tracer entry would not be set to true if the handle was pending after Lookup and before Wait.
## Intended follow-up work
* Figure out if there are any missing stats or block_cache_tracer work in refactored BlockBasedTable::MultiGet
* Stacked secondary caches (see above discussion)
* See if we can make up for the small MultiGet performance regression.
* Study more performance with SecondaryCache
* Items evicted from over-full LRUCache in Release were not being demoted to SecondaryCache, and still aren't to minimize unit test churn. Ideally they would be demoted, but it's an exceptional case so not a big deal.
* Use CreateStandalone for cache reservations (save unnecessary hash table operations). Not a big deal, but worthy cleanup.
* Somehow I got the contract for SecondaryCache::Insert wrong in #10945. (Doesn't take ownership!) That API comment needs to be fixed, but didn't want to mingle that in here.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11301
Test Plan:
## Unit tests
Generally updated to include HCC in SecondaryCache tests, though HyperClockCache has some different, less strict behaviors that leads to some tests not really being set up to work with it. Some of the tests remain disabled with it, but I think we have good coverage without them.
## Crash/stress test
Updated to use the new combination.
## Performance
First, let's check for regression on caches without secondary cache configured. Adding support for the eviction callback is likely to have a tiny effect, but it shouldn't be worrisome. LRUCache could benefit slightly from less logic around SecondaryCache handling. We can test with cache_bench default settings, built with DEBUG_LEVEL=0 and PORTABLE=0.
```
(while :; do base/cache_bench --cache_type=hyper_clock_cache | grep Rough; done) | awk '{ sum += $9; count++; print $0; print "Average: " int(sum / count) }'
```
**Before** this and #11299 (which could also have a small effect), running for about an hour, before & after running concurrently for each cache type:
HyperClockCache: 3168662 (average parallel ops/sec)
LRUCache: 2940127
**After** this and #11299, running for about an hour:
HyperClockCache: 3164862 (average parallel ops/sec) (0.12% slower)
LRUCache: 2940928 (0.03% faster)
This is an acceptable difference IMHO.
Next, let's consider essentially the worst case of new CPU overhead affecting overall performance. MultiGet uses the async lookup interface regardless of whether SecondaryCache or folly are used. We can configure a benchmark where all block cache queries are for data blocks, and all are hits.
Create DB and test (before and after tests running simultaneously):
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=multireadrandom[-X30] -readonly -multiread_batched -batch_size=32 -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16
```
**Before**:
multireadrandom [AVG 30 runs] : 3444202 (± 57049) ops/sec; 240.9 (± 4.0) MB/sec
multireadrandom [MEDIAN 30 runs] : 3514443 ops/sec; 245.8 MB/sec
**After**:
multireadrandom [AVG 30 runs] : 3291022 (± 58851) ops/sec; 230.2 (± 4.1) MB/sec
multireadrandom [MEDIAN 30 runs] : 3366179 ops/sec; 235.4 MB/sec
So that's roughly a 3% regression, on kind of a *worst case* test of MultiGet CPU. Similar story with HyperClockCache:
**Before**:
multireadrandom [AVG 30 runs] : 3933777 (± 41840) ops/sec; 275.1 (± 2.9) MB/sec
multireadrandom [MEDIAN 30 runs] : 3970667 ops/sec; 277.7 MB/sec
**After**:
multireadrandom [AVG 30 runs] : 3755338 (± 30391) ops/sec; 262.6 (± 2.1) MB/sec
multireadrandom [MEDIAN 30 runs] : 3785696 ops/sec; 264.8 MB/sec
Roughly a 4-5% regression. Not ideal, but not the whole story, fortunately.
Let's also look at Get() in db_bench:
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X30] -readonly -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16
```
**Before**:
readrandom [AVG 30 runs] : 2198685 (± 13412) ops/sec; 153.8 (± 0.9) MB/sec
readrandom [MEDIAN 30 runs] : 2209498 ops/sec; 154.5 MB/sec
**After**:
readrandom [AVG 30 runs] : 2292814 (± 43508) ops/sec; 160.3 (± 3.0) MB/sec
readrandom [MEDIAN 30 runs] : 2365181 ops/sec; 165.4 MB/sec
That's showing roughly a 4% improvement, perhaps because of the secondary cache code that is no longer part of LRUCache. But weirdly, HyperClockCache is also showing 2-3% improvement:
**Before**:
readrandom [AVG 30 runs] : 2272333 (± 9992) ops/sec; 158.9 (± 0.7) MB/sec
readrandom [MEDIAN 30 runs] : 2273239 ops/sec; 159.0 MB/sec
**After**:
readrandom [AVG 30 runs] : 2332407 (± 11252) ops/sec; 163.1 (± 0.8) MB/sec
readrandom [MEDIAN 30 runs] : 2335329 ops/sec; 163.3 MB/sec
Reviewed By: ltamasi
Differential Revision: D44177044
Pulled By: pdillinger
fbshipit-source-id: e808e48ff3fe2f792a79841ba617be98e48689f5
2 years ago
|
|
|
MemoryAllocator* allocator,
|
|
|
|
const Cache::EvictionCallback* eviction_callback,
|
|
|
|
const uint32_t* hash_seed, const typename Table::Opts& opts);
|
|
|
|
|
|
|
|
// For CacheShard concept
|
|
|
|
using HandleImpl = typename Table::HandleImpl;
|
|
|
|
// Hash is lossless hash of 128-bit key
|
|
|
|
using HashVal = UniqueId64x2;
|
|
|
|
using HashCref = const HashVal&;
|
|
|
|
static inline uint32_t HashPieceForSharding(HashCref hash) {
|
|
|
|
return Upper32of64(hash[0]);
|
|
|
|
}
|
|
|
|
static inline HashVal ComputeHash(const Slice& key, uint32_t seed) {
|
|
|
|
assert(key.size() == kCacheKeySize);
|
|
|
|
HashVal in;
|
|
|
|
HashVal out;
|
|
|
|
// NOTE: endian dependence
|
|
|
|
// TODO: use GetUnaligned?
|
|
|
|
std::memcpy(&in, key.data(), kCacheKeySize);
|
|
|
|
BijectiveHash2x64(in[1], in[0] ^ seed, &out[1], &out[0]);
|
|
|
|
return out;
|
|
|
|
}
|
|
|
|
|
|
|
|
// For reconstructing key from hashed_key. Requires the caller to provide
|
|
|
|
// backing storage for the Slice in `unhashed`
|
|
|
|
static inline Slice ReverseHash(const UniqueId64x2& hashed,
|
|
|
|
UniqueId64x2* unhashed, uint32_t seed) {
|
|
|
|
BijectiveUnhash2x64(hashed[1], hashed[0], &(*unhashed)[1], &(*unhashed)[0]);
|
|
|
|
(*unhashed)[0] ^= seed;
|
|
|
|
// NOTE: endian dependence
|
|
|
|
return Slice(reinterpret_cast<const char*>(unhashed), kCacheKeySize);
|
|
|
|
}
|
|
|
|
|
|
|
|
// Although capacity is dynamically changeable, the number of table slots is
|
|
|
|
// not, so growing capacity substantially could lead to hitting occupancy
|
|
|
|
// limit.
|
|
|
|
void SetCapacity(size_t capacity);
|
|
|
|
|
|
|
|
void SetStrictCapacityLimit(bool strict_capacity_limit);
|
|
|
|
|
Major Cache refactoring, CPU efficiency improvement (#10975)
Summary:
This is several refactorings bundled into one to avoid having to incrementally re-modify uses of Cache several times. Overall, there are breaking changes to Cache class, and it becomes more of low-level interface for implementing caches, especially block cache. New internal APIs make using Cache cleaner than before, and more insulated from block cache evolution. Hopefully, this is the last really big block cache refactoring, because of rather effectively decoupling the implementations from the uses. This change also removes the EXPERIMENTAL designation on the SecondaryCache support in Cache. It seems reasonably mature at this point but still subject to change/evolution (as I warn in the API docs for Cache).
The high-level motivation for this refactoring is to minimize code duplication / compounding complexity in adding SecondaryCache support to HyperClockCache (in a later PR). Other benefits listed below.
* static_cast lines of code +29 -35 (net removed 6)
* reinterpret_cast lines of code +6 -32 (net removed 26)
## cache.h and secondary_cache.h
* Always use CacheItemHelper with entries instead of just a Deleter. There are several motivations / justifications:
* Simpler for implementations to deal with just one Insert and one Lookup.
* Simpler and more efficient implementation because we don't have to track which entries are using helpers and which are using deleters
* Gets rid of hack to classify cache entries by their deleter. Instead, the CacheItemHelper includes a CacheEntryRole. This simplifies a lot of code (cache_entry_roles.h almost eliminated). Fixes https://github.com/facebook/rocksdb/issues/9428.
* Makes it trivial to adjust SecondaryCache behavior based on kind of block (e.g. don't re-compress filter blocks).
* It is arguably less convenient for many direct users of Cache, but direct users of Cache are now rare with introduction of typed_cache.h (below).
* I considered and rejected an alternative approach in which we reduce customizability by assuming each secondary cache compatible value starts with a Slice referencing the uncompressed block contents (already true or mostly true), but we apparently intend to stack secondary caches. Saving an entry from a compressed secondary to a lower tier requires custom handling offered by SaveToCallback, etc.
* Make CreateCallback part of the helper and introduce CreateContext to work with it (alternative to https://github.com/facebook/rocksdb/issues/10562). This cleans up the interface while still allowing context to be provided for loading/parsing values into primary cache. This model works for async lookup in BlockBasedTable reader (reader owns a CreateContext) under the assumption that it always waits on secondary cache operations to finish. (Otherwise, the CreateContext could be destroyed while async operation depending on it continues.) This likely contributes most to the observed performance improvement because it saves an std::function backed by a heap allocation.
* Use char* for serialized data, e.g. in SaveToCallback, where void* was confusingly used. (We use `char*` for serialized byte data all over RocksDB, with many advantages over `void*`. `memcpy` etc. are legacy APIs that should not be mimicked.)
* Add a type alias Cache::ObjectPtr = void*, so that we can better indicate the intent of the void* when it is to be the object associated with a Cache entry. Related: started (but did not complete) a refactoring to move away from "value" of a cache entry toward "object" or "obj". (It is confusing to call Cache a key-value store (like DB) when it is really storing arbitrary in-memory objects, not byte strings.)
* Remove unnecessary key param from DeleterFn. This is good for efficiency in HyperClockCache, which does not directly store the cache key in memory. (Alternative to https://github.com/facebook/rocksdb/issues/10774)
* Add allocator to Cache DeleterFn. This is a kind of future-proofing change in case we get more serious about using the Cache allocator for memory tracked by the Cache. Right now, only the uncompressed block contents are allocated using the allocator, and a pointer to that allocator is saved as part of the cached object so that the deleter can use it. (See CacheAllocationPtr.) If in the future we are able to "flatten out" our Cache objects some more, it would be good not to have to track the allocator as part of each object.
* Removes legacy `ApplyToAllCacheEntries` and changes `ApplyToAllEntries` signature for Deleter->CacheItemHelper change.
## typed_cache.h
Adds various "typed" interfaces to the Cache as internal APIs, so that most uses of Cache can use simple type safe code without casting and without explicit deleters, etc. Almost all of the non-test, non-glue code uses of Cache have been migrated. (Follow-up work: CompressedSecondaryCache deserves deeper attention to migrate.) This change expands RocksDB's internal usage of metaprogramming and SFINAE (https://en.cppreference.com/w/cpp/language/sfinae).
The existing usages of Cache are divided up at a high level into these new interfaces. See updated existing uses of Cache for examples of how these are used.
* PlaceholderCacheInterface - Used for making cache reservations, with entries that have a charge but no value.
* BasicTypedCacheInterface<TValue> - Used for primary cache storage of objects of type TValue, which can be cleaned up with std::default_delete<TValue>. The role is provided by TValue::kCacheEntryRole or given in an optional template parameter.
* FullTypedCacheInterface<TValue, TCreateContext> - Used for secondary cache compatible storage of objects of type TValue. In addition to BasicTypedCacheInterface constraints, we require TValue::ContentSlice() to return persistable data. This simplifies usage for the normal case of simple secondary cache compatibility (can give you a Slice to the data already in memory). In addition to TCreateContext performing the role of Cache::CreateContext, it is also expected to provide a factory function for creating TValue.
* For each of these, there's a "Shared" version (e.g. FullTypedSharedCacheInterface) that holds a shared_ptr to the Cache, rather than assuming external ownership by holding only a raw `Cache*`.
These interfaces introduce specific handle types for each interface instantiation, so that it's easy to see what kind of object is controlled by a handle. (Ultimately, this might not be worth the extra complexity, but it seems OK so far.)
Note: I attempted to make the cache 'charge' automatically inferred from the cache object type, such as by expecting an ApproximateMemoryUsage() function, but this is not so clean because there are cases where we need to compute the charge ahead of time and don't want to re-compute it.
## block_cache.h
This header is essentially the replacement for the old block_like_traits.h. It includes various things to support block cache access with typed_cache.h for block-based table.
## block_based_table_reader.cc
Before this change, accessing the block cache here was an awkward mix of static polymorphism (template TBlocklike) and switch-case on a dynamic BlockType value. This change mostly unifies on static polymorphism, relying on minor hacks in block_cache.h to distinguish variants of Block. We still check BlockType in some places (especially for stats, which could be improved in follow-up work) but at least the BlockType is a static constant from the template parameter. (No more awkward partial redundancy between static and dynamic info.) This likely contributes to the overall performance improvement, but hasn't been tested in isolation.
The other key source of simplification here is a more unified system of creating block cache objects: for directly populating from primary cache and for promotion from secondary cache. Both use BlockCreateContext, for context and for factory functions.
## block_based_table_builder.cc, cache_dump_load_impl.cc
Before this change, warming caches was super ugly code. Both of these source files had switch statements to basically transition from the dynamic BlockType world to the static TBlocklike world. None of that mess is needed anymore as there's a new, untyped WarmInCache function that handles all the details just as promotion from SecondaryCache would. (Fixes `TODO akanksha: Dedup below code` in block_based_table_builder.cc.)
## Everything else
Mostly just updating Cache users to use new typed APIs when reasonably possible, or changed Cache APIs when not.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10975
Test Plan:
tests updated
Performance test setup similar to https://github.com/facebook/rocksdb/issues/10626 (by cache size, LRUCache when not "hyper" for HyperClockCache):
34MB 1thread base.hyper -> kops/s: 0.745 io_bytes/op: 2.52504e+06 miss_ratio: 0.140906 max_rss_mb: 76.4844
34MB 1thread new.hyper -> kops/s: 0.751 io_bytes/op: 2.5123e+06 miss_ratio: 0.140161 max_rss_mb: 79.3594
34MB 1thread base -> kops/s: 0.254 io_bytes/op: 1.36073e+07 miss_ratio: 0.918818 max_rss_mb: 45.9297
34MB 1thread new -> kops/s: 0.252 io_bytes/op: 1.36157e+07 miss_ratio: 0.918999 max_rss_mb: 44.1523
34MB 32thread base.hyper -> kops/s: 7.272 io_bytes/op: 2.88323e+06 miss_ratio: 0.162532 max_rss_mb: 516.602
34MB 32thread new.hyper -> kops/s: 7.214 io_bytes/op: 2.99046e+06 miss_ratio: 0.168818 max_rss_mb: 518.293
34MB 32thread base -> kops/s: 3.528 io_bytes/op: 1.35722e+07 miss_ratio: 0.914691 max_rss_mb: 264.926
34MB 32thread new -> kops/s: 3.604 io_bytes/op: 1.35744e+07 miss_ratio: 0.915054 max_rss_mb: 264.488
233MB 1thread base.hyper -> kops/s: 53.909 io_bytes/op: 2552.35 miss_ratio: 0.0440566 max_rss_mb: 241.984
233MB 1thread new.hyper -> kops/s: 62.792 io_bytes/op: 2549.79 miss_ratio: 0.044043 max_rss_mb: 241.922
233MB 1thread base -> kops/s: 1.197 io_bytes/op: 2.75173e+06 miss_ratio: 0.103093 max_rss_mb: 241.559
233MB 1thread new -> kops/s: 1.199 io_bytes/op: 2.73723e+06 miss_ratio: 0.10305 max_rss_mb: 240.93
233MB 32thread base.hyper -> kops/s: 1298.69 io_bytes/op: 2539.12 miss_ratio: 0.0440307 max_rss_mb: 371.418
233MB 32thread new.hyper -> kops/s: 1421.35 io_bytes/op: 2538.75 miss_ratio: 0.0440307 max_rss_mb: 347.273
233MB 32thread base -> kops/s: 9.693 io_bytes/op: 2.77304e+06 miss_ratio: 0.103745 max_rss_mb: 569.691
233MB 32thread new -> kops/s: 9.75 io_bytes/op: 2.77559e+06 miss_ratio: 0.103798 max_rss_mb: 552.82
1597MB 1thread base.hyper -> kops/s: 58.607 io_bytes/op: 1449.14 miss_ratio: 0.0249324 max_rss_mb: 1583.55
1597MB 1thread new.hyper -> kops/s: 69.6 io_bytes/op: 1434.89 miss_ratio: 0.0247167 max_rss_mb: 1584.02
1597MB 1thread base -> kops/s: 60.478 io_bytes/op: 1421.28 miss_ratio: 0.024452 max_rss_mb: 1589.45
1597MB 1thread new -> kops/s: 63.973 io_bytes/op: 1416.07 miss_ratio: 0.0243766 max_rss_mb: 1589.24
1597MB 32thread base.hyper -> kops/s: 1436.2 io_bytes/op: 1357.93 miss_ratio: 0.0235353 max_rss_mb: 1692.92
1597MB 32thread new.hyper -> kops/s: 1605.03 io_bytes/op: 1358.04 miss_ratio: 0.023538 max_rss_mb: 1702.78
1597MB 32thread base -> kops/s: 280.059 io_bytes/op: 1350.34 miss_ratio: 0.023289 max_rss_mb: 1675.36
1597MB 32thread new -> kops/s: 283.125 io_bytes/op: 1351.05 miss_ratio: 0.0232797 max_rss_mb: 1703.83
Almost uniformly improving over base revision, especially for hot paths with HyperClockCache, up to 12% higher throughput seen (1597MB, 32thread, hyper). The improvement for that is likely coming from much simplified code for providing context for secondary cache promotion (CreateCallback/CreateContext), and possibly from less branching in block_based_table_reader. And likely a small improvement from not reconstituting key for DeleterFn.
Reviewed By: anand1976
Differential Revision: D42417818
Pulled By: pdillinger
fbshipit-source-id: f86bfdd584dce27c028b151ba56818ad14f7a432
2 years ago
|
|
|
Status Insert(const Slice& key, const UniqueId64x2& hashed_key,
|
|
|
|
Cache::ObjectPtr value, const Cache::CacheItemHelper* helper,
|
|
|
|
size_t charge, HandleImpl** handle, Cache::Priority priority);
|
|
|
|
|
HyperClockCache support for SecondaryCache, with refactoring (#11301)
Summary:
Internally refactors SecondaryCache integration out of LRUCache specifically and into a wrapper/adapter class that works with various Cache implementations. Notably, this relies on separating the notion of async lookup handles from other cache handles, so that HyperClockCache doesn't have to deal with the problem of allocating handles from the hash table for lookups that might fail anyway, and might be on the same key without support for coalescing. (LRUCache's hash table can incorporate previously allocated handles thanks to its pointer indirection.) Specifically, I'm worried about the case in which hundreds of threads try to access the same block and probing in the hash table degrades to linear search on the pile of entries with the same key.
This change is a big step in the direction of supporting stacked SecondaryCaches, but there are obstacles to completing that. Especially, there is no SecondaryCache hook for evictions to pass from one to the next. It has been proposed that evictions be transmitted simply as the persisted data (as in SaveToCallback), but given the current structure provided by the CacheItemHelpers, that would require an extra copy of the block data, because there's intentionally no way to ask for a contiguous Slice of the data (to allow for flexibility in storage). `AsyncLookupHandle` and the re-worked `WaitAll()` should be essentially prepared for stacked SecondaryCaches, but several "TODO with stacked secondaries" issues remain in various places.
It could be argued that the stacking instead be done as a SecondaryCache adapter that wraps two (or more) SecondaryCaches, but at least with the current API that would require an extra heap allocation on SecondaryCache Lookup for a wrapper SecondaryCacheResultHandle that can transfer a Lookup between secondaries. We could also consider trying to unify the Cache and SecondaryCache APIs, though that might be difficult if `AsyncLookupHandle` is kept a fixed struct.
## cache.h (public API)
Moves `secondary_cache` option from LRUCacheOptions to ShardedCacheOptions so that it is applicable to HyperClockCache.
## advanced_cache.h (advanced public API)
* Add `Cache::CreateStandalone()` so that the SecondaryCache support wrapper can use it.
* Add `SetEvictionCallback()` / `eviction_callback_` so that the SecondaryCache support wrapper can use it. Only a single callback is supported for efficiency. If there is ever a need for more than one, hopefully that can be handled with a broadcast callback wrapper.
These are essentially the two "extra" pieces of `Cache` for pulling out specific SecondaryCache support from the `Cache` implementation. I think it's a good trade-off as these are reasonable, limited, and reusable "cut points" into the `Cache` implementations.
* Remove async capability from standard `Lookup()` (getting rid of awkward restrictions on pending Handles) and add `AsyncLookupHandle` and `StartAsyncLookup()`. As noted in the comments, the full struct of `AsyncLookupHandle` is exposed so that it can be stack allocated, for efficiency, though more data is being copied around than before, which could impact performance. (Lookup info -> AsyncLookupHandle -> Handle vs. Lookup info -> Handle)
I could foresee a future in which a Cache internally saves a pointer to the AsyncLookupHandle, which means it's dangerous to allow it to be copyable or even movable. It also means it's not compatible with std::vector (which I don't like requiring as an API parameter anyway), so `WaitAll()` expects any contiguous array of AsyncLookupHandles. I believe this is best for common case efficiency, while behaving well in other cases also. For example, `WaitAll()` has no effect on default-constructed AsyncLookupHandles, which look like a completed cache miss.
## cacheable_entry.h
A couple of functions are obsolete because Cache::Handle can no longer be pending.
## cache.cc
Provides default implementations for new or revamped Cache functions, especially appropriate for non-blocking caches.
## secondary_cache_adapter.{h,cc}
The full details of the Cache wrapper adding SecondaryCache support. Essentially replicates the SecondaryCache handling that was in LRUCache, but obviously refactored. There is a bit of logic duplication, where Lookup() is essentially a manually optimized version of StartAsyncLookup() and Wait(), but it's roughly a dozen lines of code.
## sharded_cache.h, typed_cache.h, charged_cache.{h,cc}, sim_cache.cc
Simply updated for Cache API changes.
## lru_cache.{h,cc}
Carefully remove SecondaryCache logic, implement `CreateStandalone` and eviction handler functionality.
## clock_cache.{h,cc}
Expose existing `CreateStandalone` functionality, add eviction handler functionality. Light refactoring.
## block_based_table_reader*
Mostly re-worked the only usage of async Lookup, which is in BlockBasedTable::MultiGet. Used arrays in place of autovector in some places for efficiency. Simplified some logic by not trying to process some cache results before they're all ready.
Created new function `BlockBasedTable::GetCachePriority()` to reduce some pre-existing code duplication (and avoid making it worse).
Fixed at least one small bug from the prior confusing mixture of async and sync Lookups. In MaybeReadBlockAndLoadToCache(), called by RetrieveBlock(), called by MultiGet() with wait=false, is_cache_hit for the block_cache_tracer entry would not be set to true if the handle was pending after Lookup and before Wait.
## Intended follow-up work
* Figure out if there are any missing stats or block_cache_tracer work in refactored BlockBasedTable::MultiGet
* Stacked secondary caches (see above discussion)
* See if we can make up for the small MultiGet performance regression.
* Study more performance with SecondaryCache
* Items evicted from over-full LRUCache in Release were not being demoted to SecondaryCache, and still aren't to minimize unit test churn. Ideally they would be demoted, but it's an exceptional case so not a big deal.
* Use CreateStandalone for cache reservations (save unnecessary hash table operations). Not a big deal, but worthy cleanup.
* Somehow I got the contract for SecondaryCache::Insert wrong in #10945. (Doesn't take ownership!) That API comment needs to be fixed, but didn't want to mingle that in here.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11301
Test Plan:
## Unit tests
Generally updated to include HCC in SecondaryCache tests, though HyperClockCache has some different, less strict behaviors that leads to some tests not really being set up to work with it. Some of the tests remain disabled with it, but I think we have good coverage without them.
## Crash/stress test
Updated to use the new combination.
## Performance
First, let's check for regression on caches without secondary cache configured. Adding support for the eviction callback is likely to have a tiny effect, but it shouldn't be worrisome. LRUCache could benefit slightly from less logic around SecondaryCache handling. We can test with cache_bench default settings, built with DEBUG_LEVEL=0 and PORTABLE=0.
```
(while :; do base/cache_bench --cache_type=hyper_clock_cache | grep Rough; done) | awk '{ sum += $9; count++; print $0; print "Average: " int(sum / count) }'
```
**Before** this and #11299 (which could also have a small effect), running for about an hour, before & after running concurrently for each cache type:
HyperClockCache: 3168662 (average parallel ops/sec)
LRUCache: 2940127
**After** this and #11299, running for about an hour:
HyperClockCache: 3164862 (average parallel ops/sec) (0.12% slower)
LRUCache: 2940928 (0.03% faster)
This is an acceptable difference IMHO.
Next, let's consider essentially the worst case of new CPU overhead affecting overall performance. MultiGet uses the async lookup interface regardless of whether SecondaryCache or folly are used. We can configure a benchmark where all block cache queries are for data blocks, and all are hits.
Create DB and test (before and after tests running simultaneously):
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=multireadrandom[-X30] -readonly -multiread_batched -batch_size=32 -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16
```
**Before**:
multireadrandom [AVG 30 runs] : 3444202 (± 57049) ops/sec; 240.9 (± 4.0) MB/sec
multireadrandom [MEDIAN 30 runs] : 3514443 ops/sec; 245.8 MB/sec
**After**:
multireadrandom [AVG 30 runs] : 3291022 (± 58851) ops/sec; 230.2 (± 4.1) MB/sec
multireadrandom [MEDIAN 30 runs] : 3366179 ops/sec; 235.4 MB/sec
So that's roughly a 3% regression, on kind of a *worst case* test of MultiGet CPU. Similar story with HyperClockCache:
**Before**:
multireadrandom [AVG 30 runs] : 3933777 (± 41840) ops/sec; 275.1 (± 2.9) MB/sec
multireadrandom [MEDIAN 30 runs] : 3970667 ops/sec; 277.7 MB/sec
**After**:
multireadrandom [AVG 30 runs] : 3755338 (± 30391) ops/sec; 262.6 (± 2.1) MB/sec
multireadrandom [MEDIAN 30 runs] : 3785696 ops/sec; 264.8 MB/sec
Roughly a 4-5% regression. Not ideal, but not the whole story, fortunately.
Let's also look at Get() in db_bench:
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X30] -readonly -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16
```
**Before**:
readrandom [AVG 30 runs] : 2198685 (± 13412) ops/sec; 153.8 (± 0.9) MB/sec
readrandom [MEDIAN 30 runs] : 2209498 ops/sec; 154.5 MB/sec
**After**:
readrandom [AVG 30 runs] : 2292814 (± 43508) ops/sec; 160.3 (± 3.0) MB/sec
readrandom [MEDIAN 30 runs] : 2365181 ops/sec; 165.4 MB/sec
That's showing roughly a 4% improvement, perhaps because of the secondary cache code that is no longer part of LRUCache. But weirdly, HyperClockCache is also showing 2-3% improvement:
**Before**:
readrandom [AVG 30 runs] : 2272333 (± 9992) ops/sec; 158.9 (± 0.7) MB/sec
readrandom [MEDIAN 30 runs] : 2273239 ops/sec; 159.0 MB/sec
**After**:
readrandom [AVG 30 runs] : 2332407 (± 11252) ops/sec; 163.1 (± 0.8) MB/sec
readrandom [MEDIAN 30 runs] : 2335329 ops/sec; 163.3 MB/sec
Reviewed By: ltamasi
Differential Revision: D44177044
Pulled By: pdillinger
fbshipit-source-id: e808e48ff3fe2f792a79841ba617be98e48689f5
2 years ago
|
|
|
HandleImpl* CreateStandalone(const Slice& key, const UniqueId64x2& hashed_key,
|
|
|
|
Cache::ObjectPtr obj,
|
|
|
|
const Cache::CacheItemHelper* helper,
|
|
|
|
size_t charge, bool allow_uncharged);
|
|
|
|
|
|
|
|
HandleImpl* Lookup(const Slice& key, const UniqueId64x2& hashed_key);
|
|
|
|
|
|
|
|
bool Release(HandleImpl* handle, bool useful, bool erase_if_last_ref);
|
|
|
|
|
|
|
|
bool Release(HandleImpl* handle, bool erase_if_last_ref = false);
|
|
|
|
|
|
|
|
bool Ref(HandleImpl* handle);
|
|
|
|
|
|
|
|
void Erase(const Slice& key, const UniqueId64x2& hashed_key);
|
|
|
|
|
|
|
|
size_t GetCapacity() const;
|
|
|
|
|
|
|
|
size_t GetUsage() const;
|
|
|
|
|
|
|
|
size_t GetStandaloneUsage() const;
|
|
|
|
|
|
|
|
size_t GetPinnedUsage() const;
|
|
|
|
|
|
|
|
size_t GetOccupancyCount() const;
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
|
|
|
|
size_t GetOccupancyLimit() const;
|
|
|
|
|
|
|
|
size_t GetTableAddressCount() const;
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
|
|
|
|
void ApplyToSomeEntries(
|
Major Cache refactoring, CPU efficiency improvement (#10975)
Summary:
This is several refactorings bundled into one to avoid having to incrementally re-modify uses of Cache several times. Overall, there are breaking changes to Cache class, and it becomes more of low-level interface for implementing caches, especially block cache. New internal APIs make using Cache cleaner than before, and more insulated from block cache evolution. Hopefully, this is the last really big block cache refactoring, because of rather effectively decoupling the implementations from the uses. This change also removes the EXPERIMENTAL designation on the SecondaryCache support in Cache. It seems reasonably mature at this point but still subject to change/evolution (as I warn in the API docs for Cache).
The high-level motivation for this refactoring is to minimize code duplication / compounding complexity in adding SecondaryCache support to HyperClockCache (in a later PR). Other benefits listed below.
* static_cast lines of code +29 -35 (net removed 6)
* reinterpret_cast lines of code +6 -32 (net removed 26)
## cache.h and secondary_cache.h
* Always use CacheItemHelper with entries instead of just a Deleter. There are several motivations / justifications:
* Simpler for implementations to deal with just one Insert and one Lookup.
* Simpler and more efficient implementation because we don't have to track which entries are using helpers and which are using deleters
* Gets rid of hack to classify cache entries by their deleter. Instead, the CacheItemHelper includes a CacheEntryRole. This simplifies a lot of code (cache_entry_roles.h almost eliminated). Fixes https://github.com/facebook/rocksdb/issues/9428.
* Makes it trivial to adjust SecondaryCache behavior based on kind of block (e.g. don't re-compress filter blocks).
* It is arguably less convenient for many direct users of Cache, but direct users of Cache are now rare with introduction of typed_cache.h (below).
* I considered and rejected an alternative approach in which we reduce customizability by assuming each secondary cache compatible value starts with a Slice referencing the uncompressed block contents (already true or mostly true), but we apparently intend to stack secondary caches. Saving an entry from a compressed secondary to a lower tier requires custom handling offered by SaveToCallback, etc.
* Make CreateCallback part of the helper and introduce CreateContext to work with it (alternative to https://github.com/facebook/rocksdb/issues/10562). This cleans up the interface while still allowing context to be provided for loading/parsing values into primary cache. This model works for async lookup in BlockBasedTable reader (reader owns a CreateContext) under the assumption that it always waits on secondary cache operations to finish. (Otherwise, the CreateContext could be destroyed while async operation depending on it continues.) This likely contributes most to the observed performance improvement because it saves an std::function backed by a heap allocation.
* Use char* for serialized data, e.g. in SaveToCallback, where void* was confusingly used. (We use `char*` for serialized byte data all over RocksDB, with many advantages over `void*`. `memcpy` etc. are legacy APIs that should not be mimicked.)
* Add a type alias Cache::ObjectPtr = void*, so that we can better indicate the intent of the void* when it is to be the object associated with a Cache entry. Related: started (but did not complete) a refactoring to move away from "value" of a cache entry toward "object" or "obj". (It is confusing to call Cache a key-value store (like DB) when it is really storing arbitrary in-memory objects, not byte strings.)
* Remove unnecessary key param from DeleterFn. This is good for efficiency in HyperClockCache, which does not directly store the cache key in memory. (Alternative to https://github.com/facebook/rocksdb/issues/10774)
* Add allocator to Cache DeleterFn. This is a kind of future-proofing change in case we get more serious about using the Cache allocator for memory tracked by the Cache. Right now, only the uncompressed block contents are allocated using the allocator, and a pointer to that allocator is saved as part of the cached object so that the deleter can use it. (See CacheAllocationPtr.) If in the future we are able to "flatten out" our Cache objects some more, it would be good not to have to track the allocator as part of each object.
* Removes legacy `ApplyToAllCacheEntries` and changes `ApplyToAllEntries` signature for Deleter->CacheItemHelper change.
## typed_cache.h
Adds various "typed" interfaces to the Cache as internal APIs, so that most uses of Cache can use simple type safe code without casting and without explicit deleters, etc. Almost all of the non-test, non-glue code uses of Cache have been migrated. (Follow-up work: CompressedSecondaryCache deserves deeper attention to migrate.) This change expands RocksDB's internal usage of metaprogramming and SFINAE (https://en.cppreference.com/w/cpp/language/sfinae).
The existing usages of Cache are divided up at a high level into these new interfaces. See updated existing uses of Cache for examples of how these are used.
* PlaceholderCacheInterface - Used for making cache reservations, with entries that have a charge but no value.
* BasicTypedCacheInterface<TValue> - Used for primary cache storage of objects of type TValue, which can be cleaned up with std::default_delete<TValue>. The role is provided by TValue::kCacheEntryRole or given in an optional template parameter.
* FullTypedCacheInterface<TValue, TCreateContext> - Used for secondary cache compatible storage of objects of type TValue. In addition to BasicTypedCacheInterface constraints, we require TValue::ContentSlice() to return persistable data. This simplifies usage for the normal case of simple secondary cache compatibility (can give you a Slice to the data already in memory). In addition to TCreateContext performing the role of Cache::CreateContext, it is also expected to provide a factory function for creating TValue.
* For each of these, there's a "Shared" version (e.g. FullTypedSharedCacheInterface) that holds a shared_ptr to the Cache, rather than assuming external ownership by holding only a raw `Cache*`.
These interfaces introduce specific handle types for each interface instantiation, so that it's easy to see what kind of object is controlled by a handle. (Ultimately, this might not be worth the extra complexity, but it seems OK so far.)
Note: I attempted to make the cache 'charge' automatically inferred from the cache object type, such as by expecting an ApproximateMemoryUsage() function, but this is not so clean because there are cases where we need to compute the charge ahead of time and don't want to re-compute it.
## block_cache.h
This header is essentially the replacement for the old block_like_traits.h. It includes various things to support block cache access with typed_cache.h for block-based table.
## block_based_table_reader.cc
Before this change, accessing the block cache here was an awkward mix of static polymorphism (template TBlocklike) and switch-case on a dynamic BlockType value. This change mostly unifies on static polymorphism, relying on minor hacks in block_cache.h to distinguish variants of Block. We still check BlockType in some places (especially for stats, which could be improved in follow-up work) but at least the BlockType is a static constant from the template parameter. (No more awkward partial redundancy between static and dynamic info.) This likely contributes to the overall performance improvement, but hasn't been tested in isolation.
The other key source of simplification here is a more unified system of creating block cache objects: for directly populating from primary cache and for promotion from secondary cache. Both use BlockCreateContext, for context and for factory functions.
## block_based_table_builder.cc, cache_dump_load_impl.cc
Before this change, warming caches was super ugly code. Both of these source files had switch statements to basically transition from the dynamic BlockType world to the static TBlocklike world. None of that mess is needed anymore as there's a new, untyped WarmInCache function that handles all the details just as promotion from SecondaryCache would. (Fixes `TODO akanksha: Dedup below code` in block_based_table_builder.cc.)
## Everything else
Mostly just updating Cache users to use new typed APIs when reasonably possible, or changed Cache APIs when not.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10975
Test Plan:
tests updated
Performance test setup similar to https://github.com/facebook/rocksdb/issues/10626 (by cache size, LRUCache when not "hyper" for HyperClockCache):
34MB 1thread base.hyper -> kops/s: 0.745 io_bytes/op: 2.52504e+06 miss_ratio: 0.140906 max_rss_mb: 76.4844
34MB 1thread new.hyper -> kops/s: 0.751 io_bytes/op: 2.5123e+06 miss_ratio: 0.140161 max_rss_mb: 79.3594
34MB 1thread base -> kops/s: 0.254 io_bytes/op: 1.36073e+07 miss_ratio: 0.918818 max_rss_mb: 45.9297
34MB 1thread new -> kops/s: 0.252 io_bytes/op: 1.36157e+07 miss_ratio: 0.918999 max_rss_mb: 44.1523
34MB 32thread base.hyper -> kops/s: 7.272 io_bytes/op: 2.88323e+06 miss_ratio: 0.162532 max_rss_mb: 516.602
34MB 32thread new.hyper -> kops/s: 7.214 io_bytes/op: 2.99046e+06 miss_ratio: 0.168818 max_rss_mb: 518.293
34MB 32thread base -> kops/s: 3.528 io_bytes/op: 1.35722e+07 miss_ratio: 0.914691 max_rss_mb: 264.926
34MB 32thread new -> kops/s: 3.604 io_bytes/op: 1.35744e+07 miss_ratio: 0.915054 max_rss_mb: 264.488
233MB 1thread base.hyper -> kops/s: 53.909 io_bytes/op: 2552.35 miss_ratio: 0.0440566 max_rss_mb: 241.984
233MB 1thread new.hyper -> kops/s: 62.792 io_bytes/op: 2549.79 miss_ratio: 0.044043 max_rss_mb: 241.922
233MB 1thread base -> kops/s: 1.197 io_bytes/op: 2.75173e+06 miss_ratio: 0.103093 max_rss_mb: 241.559
233MB 1thread new -> kops/s: 1.199 io_bytes/op: 2.73723e+06 miss_ratio: 0.10305 max_rss_mb: 240.93
233MB 32thread base.hyper -> kops/s: 1298.69 io_bytes/op: 2539.12 miss_ratio: 0.0440307 max_rss_mb: 371.418
233MB 32thread new.hyper -> kops/s: 1421.35 io_bytes/op: 2538.75 miss_ratio: 0.0440307 max_rss_mb: 347.273
233MB 32thread base -> kops/s: 9.693 io_bytes/op: 2.77304e+06 miss_ratio: 0.103745 max_rss_mb: 569.691
233MB 32thread new -> kops/s: 9.75 io_bytes/op: 2.77559e+06 miss_ratio: 0.103798 max_rss_mb: 552.82
1597MB 1thread base.hyper -> kops/s: 58.607 io_bytes/op: 1449.14 miss_ratio: 0.0249324 max_rss_mb: 1583.55
1597MB 1thread new.hyper -> kops/s: 69.6 io_bytes/op: 1434.89 miss_ratio: 0.0247167 max_rss_mb: 1584.02
1597MB 1thread base -> kops/s: 60.478 io_bytes/op: 1421.28 miss_ratio: 0.024452 max_rss_mb: 1589.45
1597MB 1thread new -> kops/s: 63.973 io_bytes/op: 1416.07 miss_ratio: 0.0243766 max_rss_mb: 1589.24
1597MB 32thread base.hyper -> kops/s: 1436.2 io_bytes/op: 1357.93 miss_ratio: 0.0235353 max_rss_mb: 1692.92
1597MB 32thread new.hyper -> kops/s: 1605.03 io_bytes/op: 1358.04 miss_ratio: 0.023538 max_rss_mb: 1702.78
1597MB 32thread base -> kops/s: 280.059 io_bytes/op: 1350.34 miss_ratio: 0.023289 max_rss_mb: 1675.36
1597MB 32thread new -> kops/s: 283.125 io_bytes/op: 1351.05 miss_ratio: 0.0232797 max_rss_mb: 1703.83
Almost uniformly improving over base revision, especially for hot paths with HyperClockCache, up to 12% higher throughput seen (1597MB, 32thread, hyper). The improvement for that is likely coming from much simplified code for providing context for secondary cache promotion (CreateCallback/CreateContext), and possibly from less branching in block_based_table_reader. And likely a small improvement from not reconstituting key for DeleterFn.
Reviewed By: anand1976
Differential Revision: D42417818
Pulled By: pdillinger
fbshipit-source-id: f86bfdd584dce27c028b151ba56818ad14f7a432
2 years ago
|
|
|
const std::function<void(const Slice& key, Cache::ObjectPtr obj,
|
|
|
|
size_t charge,
|
|
|
|
const Cache::CacheItemHelper* helper)>& callback,
|
|
|
|
size_t average_entries_per_lock, size_t* state);
|
|
|
|
|
|
|
|
void EraseUnRefEntries();
|
|
|
|
|
|
|
|
std::string GetPrintableOptions() const { return std::string{}; }
|
|
|
|
|
|
|
|
HandleImpl* Lookup(const Slice& key, const UniqueId64x2& hashed_key,
|
|
|
|
const Cache::CacheItemHelper* /*helper*/,
|
Major Cache refactoring, CPU efficiency improvement (#10975)
Summary:
This is several refactorings bundled into one to avoid having to incrementally re-modify uses of Cache several times. Overall, there are breaking changes to Cache class, and it becomes more of low-level interface for implementing caches, especially block cache. New internal APIs make using Cache cleaner than before, and more insulated from block cache evolution. Hopefully, this is the last really big block cache refactoring, because of rather effectively decoupling the implementations from the uses. This change also removes the EXPERIMENTAL designation on the SecondaryCache support in Cache. It seems reasonably mature at this point but still subject to change/evolution (as I warn in the API docs for Cache).
The high-level motivation for this refactoring is to minimize code duplication / compounding complexity in adding SecondaryCache support to HyperClockCache (in a later PR). Other benefits listed below.
* static_cast lines of code +29 -35 (net removed 6)
* reinterpret_cast lines of code +6 -32 (net removed 26)
## cache.h and secondary_cache.h
* Always use CacheItemHelper with entries instead of just a Deleter. There are several motivations / justifications:
* Simpler for implementations to deal with just one Insert and one Lookup.
* Simpler and more efficient implementation because we don't have to track which entries are using helpers and which are using deleters
* Gets rid of hack to classify cache entries by their deleter. Instead, the CacheItemHelper includes a CacheEntryRole. This simplifies a lot of code (cache_entry_roles.h almost eliminated). Fixes https://github.com/facebook/rocksdb/issues/9428.
* Makes it trivial to adjust SecondaryCache behavior based on kind of block (e.g. don't re-compress filter blocks).
* It is arguably less convenient for many direct users of Cache, but direct users of Cache are now rare with introduction of typed_cache.h (below).
* I considered and rejected an alternative approach in which we reduce customizability by assuming each secondary cache compatible value starts with a Slice referencing the uncompressed block contents (already true or mostly true), but we apparently intend to stack secondary caches. Saving an entry from a compressed secondary to a lower tier requires custom handling offered by SaveToCallback, etc.
* Make CreateCallback part of the helper and introduce CreateContext to work with it (alternative to https://github.com/facebook/rocksdb/issues/10562). This cleans up the interface while still allowing context to be provided for loading/parsing values into primary cache. This model works for async lookup in BlockBasedTable reader (reader owns a CreateContext) under the assumption that it always waits on secondary cache operations to finish. (Otherwise, the CreateContext could be destroyed while async operation depending on it continues.) This likely contributes most to the observed performance improvement because it saves an std::function backed by a heap allocation.
* Use char* for serialized data, e.g. in SaveToCallback, where void* was confusingly used. (We use `char*` for serialized byte data all over RocksDB, with many advantages over `void*`. `memcpy` etc. are legacy APIs that should not be mimicked.)
* Add a type alias Cache::ObjectPtr = void*, so that we can better indicate the intent of the void* when it is to be the object associated with a Cache entry. Related: started (but did not complete) a refactoring to move away from "value" of a cache entry toward "object" or "obj". (It is confusing to call Cache a key-value store (like DB) when it is really storing arbitrary in-memory objects, not byte strings.)
* Remove unnecessary key param from DeleterFn. This is good for efficiency in HyperClockCache, which does not directly store the cache key in memory. (Alternative to https://github.com/facebook/rocksdb/issues/10774)
* Add allocator to Cache DeleterFn. This is a kind of future-proofing change in case we get more serious about using the Cache allocator for memory tracked by the Cache. Right now, only the uncompressed block contents are allocated using the allocator, and a pointer to that allocator is saved as part of the cached object so that the deleter can use it. (See CacheAllocationPtr.) If in the future we are able to "flatten out" our Cache objects some more, it would be good not to have to track the allocator as part of each object.
* Removes legacy `ApplyToAllCacheEntries` and changes `ApplyToAllEntries` signature for Deleter->CacheItemHelper change.
## typed_cache.h
Adds various "typed" interfaces to the Cache as internal APIs, so that most uses of Cache can use simple type safe code without casting and without explicit deleters, etc. Almost all of the non-test, non-glue code uses of Cache have been migrated. (Follow-up work: CompressedSecondaryCache deserves deeper attention to migrate.) This change expands RocksDB's internal usage of metaprogramming and SFINAE (https://en.cppreference.com/w/cpp/language/sfinae).
The existing usages of Cache are divided up at a high level into these new interfaces. See updated existing uses of Cache for examples of how these are used.
* PlaceholderCacheInterface - Used for making cache reservations, with entries that have a charge but no value.
* BasicTypedCacheInterface<TValue> - Used for primary cache storage of objects of type TValue, which can be cleaned up with std::default_delete<TValue>. The role is provided by TValue::kCacheEntryRole or given in an optional template parameter.
* FullTypedCacheInterface<TValue, TCreateContext> - Used for secondary cache compatible storage of objects of type TValue. In addition to BasicTypedCacheInterface constraints, we require TValue::ContentSlice() to return persistable data. This simplifies usage for the normal case of simple secondary cache compatibility (can give you a Slice to the data already in memory). In addition to TCreateContext performing the role of Cache::CreateContext, it is also expected to provide a factory function for creating TValue.
* For each of these, there's a "Shared" version (e.g. FullTypedSharedCacheInterface) that holds a shared_ptr to the Cache, rather than assuming external ownership by holding only a raw `Cache*`.
These interfaces introduce specific handle types for each interface instantiation, so that it's easy to see what kind of object is controlled by a handle. (Ultimately, this might not be worth the extra complexity, but it seems OK so far.)
Note: I attempted to make the cache 'charge' automatically inferred from the cache object type, such as by expecting an ApproximateMemoryUsage() function, but this is not so clean because there are cases where we need to compute the charge ahead of time and don't want to re-compute it.
## block_cache.h
This header is essentially the replacement for the old block_like_traits.h. It includes various things to support block cache access with typed_cache.h for block-based table.
## block_based_table_reader.cc
Before this change, accessing the block cache here was an awkward mix of static polymorphism (template TBlocklike) and switch-case on a dynamic BlockType value. This change mostly unifies on static polymorphism, relying on minor hacks in block_cache.h to distinguish variants of Block. We still check BlockType in some places (especially for stats, which could be improved in follow-up work) but at least the BlockType is a static constant from the template parameter. (No more awkward partial redundancy between static and dynamic info.) This likely contributes to the overall performance improvement, but hasn't been tested in isolation.
The other key source of simplification here is a more unified system of creating block cache objects: for directly populating from primary cache and for promotion from secondary cache. Both use BlockCreateContext, for context and for factory functions.
## block_based_table_builder.cc, cache_dump_load_impl.cc
Before this change, warming caches was super ugly code. Both of these source files had switch statements to basically transition from the dynamic BlockType world to the static TBlocklike world. None of that mess is needed anymore as there's a new, untyped WarmInCache function that handles all the details just as promotion from SecondaryCache would. (Fixes `TODO akanksha: Dedup below code` in block_based_table_builder.cc.)
## Everything else
Mostly just updating Cache users to use new typed APIs when reasonably possible, or changed Cache APIs when not.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10975
Test Plan:
tests updated
Performance test setup similar to https://github.com/facebook/rocksdb/issues/10626 (by cache size, LRUCache when not "hyper" for HyperClockCache):
34MB 1thread base.hyper -> kops/s: 0.745 io_bytes/op: 2.52504e+06 miss_ratio: 0.140906 max_rss_mb: 76.4844
34MB 1thread new.hyper -> kops/s: 0.751 io_bytes/op: 2.5123e+06 miss_ratio: 0.140161 max_rss_mb: 79.3594
34MB 1thread base -> kops/s: 0.254 io_bytes/op: 1.36073e+07 miss_ratio: 0.918818 max_rss_mb: 45.9297
34MB 1thread new -> kops/s: 0.252 io_bytes/op: 1.36157e+07 miss_ratio: 0.918999 max_rss_mb: 44.1523
34MB 32thread base.hyper -> kops/s: 7.272 io_bytes/op: 2.88323e+06 miss_ratio: 0.162532 max_rss_mb: 516.602
34MB 32thread new.hyper -> kops/s: 7.214 io_bytes/op: 2.99046e+06 miss_ratio: 0.168818 max_rss_mb: 518.293
34MB 32thread base -> kops/s: 3.528 io_bytes/op: 1.35722e+07 miss_ratio: 0.914691 max_rss_mb: 264.926
34MB 32thread new -> kops/s: 3.604 io_bytes/op: 1.35744e+07 miss_ratio: 0.915054 max_rss_mb: 264.488
233MB 1thread base.hyper -> kops/s: 53.909 io_bytes/op: 2552.35 miss_ratio: 0.0440566 max_rss_mb: 241.984
233MB 1thread new.hyper -> kops/s: 62.792 io_bytes/op: 2549.79 miss_ratio: 0.044043 max_rss_mb: 241.922
233MB 1thread base -> kops/s: 1.197 io_bytes/op: 2.75173e+06 miss_ratio: 0.103093 max_rss_mb: 241.559
233MB 1thread new -> kops/s: 1.199 io_bytes/op: 2.73723e+06 miss_ratio: 0.10305 max_rss_mb: 240.93
233MB 32thread base.hyper -> kops/s: 1298.69 io_bytes/op: 2539.12 miss_ratio: 0.0440307 max_rss_mb: 371.418
233MB 32thread new.hyper -> kops/s: 1421.35 io_bytes/op: 2538.75 miss_ratio: 0.0440307 max_rss_mb: 347.273
233MB 32thread base -> kops/s: 9.693 io_bytes/op: 2.77304e+06 miss_ratio: 0.103745 max_rss_mb: 569.691
233MB 32thread new -> kops/s: 9.75 io_bytes/op: 2.77559e+06 miss_ratio: 0.103798 max_rss_mb: 552.82
1597MB 1thread base.hyper -> kops/s: 58.607 io_bytes/op: 1449.14 miss_ratio: 0.0249324 max_rss_mb: 1583.55
1597MB 1thread new.hyper -> kops/s: 69.6 io_bytes/op: 1434.89 miss_ratio: 0.0247167 max_rss_mb: 1584.02
1597MB 1thread base -> kops/s: 60.478 io_bytes/op: 1421.28 miss_ratio: 0.024452 max_rss_mb: 1589.45
1597MB 1thread new -> kops/s: 63.973 io_bytes/op: 1416.07 miss_ratio: 0.0243766 max_rss_mb: 1589.24
1597MB 32thread base.hyper -> kops/s: 1436.2 io_bytes/op: 1357.93 miss_ratio: 0.0235353 max_rss_mb: 1692.92
1597MB 32thread new.hyper -> kops/s: 1605.03 io_bytes/op: 1358.04 miss_ratio: 0.023538 max_rss_mb: 1702.78
1597MB 32thread base -> kops/s: 280.059 io_bytes/op: 1350.34 miss_ratio: 0.023289 max_rss_mb: 1675.36
1597MB 32thread new -> kops/s: 283.125 io_bytes/op: 1351.05 miss_ratio: 0.0232797 max_rss_mb: 1703.83
Almost uniformly improving over base revision, especially for hot paths with HyperClockCache, up to 12% higher throughput seen (1597MB, 32thread, hyper). The improvement for that is likely coming from much simplified code for providing context for secondary cache promotion (CreateCallback/CreateContext), and possibly from less branching in block_based_table_reader. And likely a small improvement from not reconstituting key for DeleterFn.
Reviewed By: anand1976
Differential Revision: D42417818
Pulled By: pdillinger
fbshipit-source-id: f86bfdd584dce27c028b151ba56818ad14f7a432
2 years ago
|
|
|
Cache::CreateContext* /*create_context*/,
|
HyperClockCache support for SecondaryCache, with refactoring (#11301)
Summary:
Internally refactors SecondaryCache integration out of LRUCache specifically and into a wrapper/adapter class that works with various Cache implementations. Notably, this relies on separating the notion of async lookup handles from other cache handles, so that HyperClockCache doesn't have to deal with the problem of allocating handles from the hash table for lookups that might fail anyway, and might be on the same key without support for coalescing. (LRUCache's hash table can incorporate previously allocated handles thanks to its pointer indirection.) Specifically, I'm worried about the case in which hundreds of threads try to access the same block and probing in the hash table degrades to linear search on the pile of entries with the same key.
This change is a big step in the direction of supporting stacked SecondaryCaches, but there are obstacles to completing that. Especially, there is no SecondaryCache hook for evictions to pass from one to the next. It has been proposed that evictions be transmitted simply as the persisted data (as in SaveToCallback), but given the current structure provided by the CacheItemHelpers, that would require an extra copy of the block data, because there's intentionally no way to ask for a contiguous Slice of the data (to allow for flexibility in storage). `AsyncLookupHandle` and the re-worked `WaitAll()` should be essentially prepared for stacked SecondaryCaches, but several "TODO with stacked secondaries" issues remain in various places.
It could be argued that the stacking instead be done as a SecondaryCache adapter that wraps two (or more) SecondaryCaches, but at least with the current API that would require an extra heap allocation on SecondaryCache Lookup for a wrapper SecondaryCacheResultHandle that can transfer a Lookup between secondaries. We could also consider trying to unify the Cache and SecondaryCache APIs, though that might be difficult if `AsyncLookupHandle` is kept a fixed struct.
## cache.h (public API)
Moves `secondary_cache` option from LRUCacheOptions to ShardedCacheOptions so that it is applicable to HyperClockCache.
## advanced_cache.h (advanced public API)
* Add `Cache::CreateStandalone()` so that the SecondaryCache support wrapper can use it.
* Add `SetEvictionCallback()` / `eviction_callback_` so that the SecondaryCache support wrapper can use it. Only a single callback is supported for efficiency. If there is ever a need for more than one, hopefully that can be handled with a broadcast callback wrapper.
These are essentially the two "extra" pieces of `Cache` for pulling out specific SecondaryCache support from the `Cache` implementation. I think it's a good trade-off as these are reasonable, limited, and reusable "cut points" into the `Cache` implementations.
* Remove async capability from standard `Lookup()` (getting rid of awkward restrictions on pending Handles) and add `AsyncLookupHandle` and `StartAsyncLookup()`. As noted in the comments, the full struct of `AsyncLookupHandle` is exposed so that it can be stack allocated, for efficiency, though more data is being copied around than before, which could impact performance. (Lookup info -> AsyncLookupHandle -> Handle vs. Lookup info -> Handle)
I could foresee a future in which a Cache internally saves a pointer to the AsyncLookupHandle, which means it's dangerous to allow it to be copyable or even movable. It also means it's not compatible with std::vector (which I don't like requiring as an API parameter anyway), so `WaitAll()` expects any contiguous array of AsyncLookupHandles. I believe this is best for common case efficiency, while behaving well in other cases also. For example, `WaitAll()` has no effect on default-constructed AsyncLookupHandles, which look like a completed cache miss.
## cacheable_entry.h
A couple of functions are obsolete because Cache::Handle can no longer be pending.
## cache.cc
Provides default implementations for new or revamped Cache functions, especially appropriate for non-blocking caches.
## secondary_cache_adapter.{h,cc}
The full details of the Cache wrapper adding SecondaryCache support. Essentially replicates the SecondaryCache handling that was in LRUCache, but obviously refactored. There is a bit of logic duplication, where Lookup() is essentially a manually optimized version of StartAsyncLookup() and Wait(), but it's roughly a dozen lines of code.
## sharded_cache.h, typed_cache.h, charged_cache.{h,cc}, sim_cache.cc
Simply updated for Cache API changes.
## lru_cache.{h,cc}
Carefully remove SecondaryCache logic, implement `CreateStandalone` and eviction handler functionality.
## clock_cache.{h,cc}
Expose existing `CreateStandalone` functionality, add eviction handler functionality. Light refactoring.
## block_based_table_reader*
Mostly re-worked the only usage of async Lookup, which is in BlockBasedTable::MultiGet. Used arrays in place of autovector in some places for efficiency. Simplified some logic by not trying to process some cache results before they're all ready.
Created new function `BlockBasedTable::GetCachePriority()` to reduce some pre-existing code duplication (and avoid making it worse).
Fixed at least one small bug from the prior confusing mixture of async and sync Lookups. In MaybeReadBlockAndLoadToCache(), called by RetrieveBlock(), called by MultiGet() with wait=false, is_cache_hit for the block_cache_tracer entry would not be set to true if the handle was pending after Lookup and before Wait.
## Intended follow-up work
* Figure out if there are any missing stats or block_cache_tracer work in refactored BlockBasedTable::MultiGet
* Stacked secondary caches (see above discussion)
* See if we can make up for the small MultiGet performance regression.
* Study more performance with SecondaryCache
* Items evicted from over-full LRUCache in Release were not being demoted to SecondaryCache, and still aren't to minimize unit test churn. Ideally they would be demoted, but it's an exceptional case so not a big deal.
* Use CreateStandalone for cache reservations (save unnecessary hash table operations). Not a big deal, but worthy cleanup.
* Somehow I got the contract for SecondaryCache::Insert wrong in #10945. (Doesn't take ownership!) That API comment needs to be fixed, but didn't want to mingle that in here.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11301
Test Plan:
## Unit tests
Generally updated to include HCC in SecondaryCache tests, though HyperClockCache has some different, less strict behaviors that leads to some tests not really being set up to work with it. Some of the tests remain disabled with it, but I think we have good coverage without them.
## Crash/stress test
Updated to use the new combination.
## Performance
First, let's check for regression on caches without secondary cache configured. Adding support for the eviction callback is likely to have a tiny effect, but it shouldn't be worrisome. LRUCache could benefit slightly from less logic around SecondaryCache handling. We can test with cache_bench default settings, built with DEBUG_LEVEL=0 and PORTABLE=0.
```
(while :; do base/cache_bench --cache_type=hyper_clock_cache | grep Rough; done) | awk '{ sum += $9; count++; print $0; print "Average: " int(sum / count) }'
```
**Before** this and #11299 (which could also have a small effect), running for about an hour, before & after running concurrently for each cache type:
HyperClockCache: 3168662 (average parallel ops/sec)
LRUCache: 2940127
**After** this and #11299, running for about an hour:
HyperClockCache: 3164862 (average parallel ops/sec) (0.12% slower)
LRUCache: 2940928 (0.03% faster)
This is an acceptable difference IMHO.
Next, let's consider essentially the worst case of new CPU overhead affecting overall performance. MultiGet uses the async lookup interface regardless of whether SecondaryCache or folly are used. We can configure a benchmark where all block cache queries are for data blocks, and all are hits.
Create DB and test (before and after tests running simultaneously):
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
TEST_TMPDIR=/dev/shm base/db_bench -benchmarks=multireadrandom[-X30] -readonly -multiread_batched -batch_size=32 -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16
```
**Before**:
multireadrandom [AVG 30 runs] : 3444202 (± 57049) ops/sec; 240.9 (± 4.0) MB/sec
multireadrandom [MEDIAN 30 runs] : 3514443 ops/sec; 245.8 MB/sec
**After**:
multireadrandom [AVG 30 runs] : 3291022 (± 58851) ops/sec; 230.2 (± 4.1) MB/sec
multireadrandom [MEDIAN 30 runs] : 3366179 ops/sec; 235.4 MB/sec
So that's roughly a 3% regression, on kind of a *worst case* test of MultiGet CPU. Similar story with HyperClockCache:
**Before**:
multireadrandom [AVG 30 runs] : 3933777 (± 41840) ops/sec; 275.1 (± 2.9) MB/sec
multireadrandom [MEDIAN 30 runs] : 3970667 ops/sec; 277.7 MB/sec
**After**:
multireadrandom [AVG 30 runs] : 3755338 (± 30391) ops/sec; 262.6 (± 2.1) MB/sec
multireadrandom [MEDIAN 30 runs] : 3785696 ops/sec; 264.8 MB/sec
Roughly a 4-5% regression. Not ideal, but not the whole story, fortunately.
Let's also look at Get() in db_bench:
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom[-X30] -readonly -num=30000000 -bloom_bits=16 -cache_size=6789000000 -duration 20 -threads=16
```
**Before**:
readrandom [AVG 30 runs] : 2198685 (± 13412) ops/sec; 153.8 (± 0.9) MB/sec
readrandom [MEDIAN 30 runs] : 2209498 ops/sec; 154.5 MB/sec
**After**:
readrandom [AVG 30 runs] : 2292814 (± 43508) ops/sec; 160.3 (± 3.0) MB/sec
readrandom [MEDIAN 30 runs] : 2365181 ops/sec; 165.4 MB/sec
That's showing roughly a 4% improvement, perhaps because of the secondary cache code that is no longer part of LRUCache. But weirdly, HyperClockCache is also showing 2-3% improvement:
**Before**:
readrandom [AVG 30 runs] : 2272333 (± 9992) ops/sec; 158.9 (± 0.7) MB/sec
readrandom [MEDIAN 30 runs] : 2273239 ops/sec; 159.0 MB/sec
**After**:
readrandom [AVG 30 runs] : 2332407 (± 11252) ops/sec; 163.1 (± 0.8) MB/sec
readrandom [MEDIAN 30 runs] : 2335329 ops/sec; 163.3 MB/sec
Reviewed By: ltamasi
Differential Revision: D44177044
Pulled By: pdillinger
fbshipit-source-id: e808e48ff3fe2f792a79841ba617be98e48689f5
2 years ago
|
|
|
Cache::Priority /*priority*/, Statistics* /*stats*/) {
|
|
|
|
return Lookup(key, hashed_key);
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
}
|
|
|
|
|
|
|
|
#ifndef NDEBUG
|
|
|
|
size_t& TEST_MutableOccupancyLimit() const {
|
|
|
|
return table_.TEST_MutableOccupancyLimit();
|
|
|
|
}
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
// Acquire/release N references
|
|
|
|
void TEST_RefN(HandleImpl* handle, size_t n);
|
|
|
|
void TEST_ReleaseN(HandleImpl* handle, size_t n);
|
|
|
|
#endif
|
|
|
|
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
private: // data
|
|
|
|
Table table_;
|
|
|
|
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
// Maximum total charge of all elements stored in the table.
|
|
|
|
std::atomic<size_t> capacity_;
|
|
|
|
|
Revamp, optimize new experimental clock cache (#10626)
Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminate spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase, and
strict_capacity_limit=true Insert. Eviction uses opportunistic compare-
exchange.
* Relaxes some aggressiveness and guarantees. For example,
* Duplicate Inserts will sometimes go undetected and the shadow duplicate
will age out with eviction.
* In many cases, the older Inserted value for a given cache key will be kept
(i.e. Insert does not support overwrite).
* Entries explicitly erased (rather than evicted) might not be freed
immediately in some rare cases.
* With strict_capacity_limit=false, capacity limit is not tracked/enforced as
precisely as LRUCache, but is self-correcting and should only deviate by a
very small number of extra or fewer entries.
* Use smaller "computed default" number of cache shards in many cases,
because benefits to larger usage tracking / eviction pools outweigh the small
cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limit-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce "cliff" effect for repeated
range scans approaching and exceeding cache size.
## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
Increment internal ref count at slot
If possible hit:
Check flags atomic (and non-atomic fields)
If cache hit:
Three distinct updates to 'flags' atomic
Increment refs for internal-to-external
Return
Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
Increment acquire counter in meta word (optimistic)
If visible entry (already read meta word):
If match (read non-atomic fields):
Return
Else:
Decrement acquire counter in meta word
Else if invisible entry (rare, already read meta word):
Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
Use CAS etc. to remove
Return
Else:
Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
Use CAS etc. to remove
Return
```
## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:
base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
gt_clock: experimental ClockCache before this change
new_clock: experimental ClockCache with this change
## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944
4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821
4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)
4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38
4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)
4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37
4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46
Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. Mutex-based implementation quickly becomes contention-limited. New clock cache shows essentially perfect scaling up to number of physical cores (24), and then each hyperthreaded core adding about 1/4 the throughput of an additional physical core (see 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56
1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45
1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63
610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5
610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453
610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812
The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)
Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.
233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461
233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402
233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016
89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754
89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293
89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223
^ Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case, and this is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. This new clock cache only uses two shards for this configuration, and that helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125
34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793
34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52
As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (13% miss rate much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:
13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383
13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758
13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27
gt_clock is clearly blowing way past its memory budget for lower miss rates and best throughput. new_clock also seems to be exceeding budgets, and this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, and although still high enough that the eviction CPU time is definitely offsetting mutex contention:
13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707
13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626
Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN
Reviewed By: anand1976
Differential Revision: D39368406
Pulled By: pdillinger
fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
2 years ago
|
|
|
// Whether to reject insertion if cache reaches its full capacity.
|
|
|
|
std::atomic<bool> strict_capacity_limit_;
|
|
|
|
}; // class ClockCacheShard
|
|
|
|
|
|
|
|
class HyperClockCache
|
|
|
|
#ifdef NDEBUG
|
|
|
|
final
|
|
|
|
#endif
|
|
|
|
: public ShardedCache<ClockCacheShard<HyperClockTable>> {
|
|
|
|
public:
|
|
|
|
using Shard = ClockCacheShard<HyperClockTable>;
|
|
|
|
|
|
|
|
explicit HyperClockCache(const HyperClockCacheOptions& opts);
|
|
|
|
|
|
|
|
const char* Name() const override { return "HyperClockCache"; }
|
|
|
|
|
Major Cache refactoring, CPU efficiency improvement (#10975)
Summary:
This is several refactorings bundled into one to avoid having to incrementally re-modify uses of Cache several times. Overall, there are breaking changes to Cache class, and it becomes more of low-level interface for implementing caches, especially block cache. New internal APIs make using Cache cleaner than before, and more insulated from block cache evolution. Hopefully, this is the last really big block cache refactoring, because of rather effectively decoupling the implementations from the uses. This change also removes the EXPERIMENTAL designation on the SecondaryCache support in Cache. It seems reasonably mature at this point but still subject to change/evolution (as I warn in the API docs for Cache).
The high-level motivation for this refactoring is to minimize code duplication / compounding complexity in adding SecondaryCache support to HyperClockCache (in a later PR). Other benefits listed below.
* static_cast lines of code +29 -35 (net removed 6)
* reinterpret_cast lines of code +6 -32 (net removed 26)
## cache.h and secondary_cache.h
* Always use CacheItemHelper with entries instead of just a Deleter. There are several motivations / justifications:
* Simpler for implementations to deal with just one Insert and one Lookup.
* Simpler and more efficient implementation because we don't have to track which entries are using helpers and which are using deleters
* Gets rid of hack to classify cache entries by their deleter. Instead, the CacheItemHelper includes a CacheEntryRole. This simplifies a lot of code (cache_entry_roles.h almost eliminated). Fixes https://github.com/facebook/rocksdb/issues/9428.
* Makes it trivial to adjust SecondaryCache behavior based on kind of block (e.g. don't re-compress filter blocks).
* It is arguably less convenient for many direct users of Cache, but direct users of Cache are now rare with introduction of typed_cache.h (below).
* I considered and rejected an alternative approach in which we reduce customizability by assuming each secondary cache compatible value starts with a Slice referencing the uncompressed block contents (already true or mostly true), but we apparently intend to stack secondary caches. Saving an entry from a compressed secondary to a lower tier requires custom handling offered by SaveToCallback, etc.
* Make CreateCallback part of the helper and introduce CreateContext to work with it (alternative to https://github.com/facebook/rocksdb/issues/10562). This cleans up the interface while still allowing context to be provided for loading/parsing values into primary cache. This model works for async lookup in BlockBasedTable reader (reader owns a CreateContext) under the assumption that it always waits on secondary cache operations to finish. (Otherwise, the CreateContext could be destroyed while async operation depending on it continues.) This likely contributes most to the observed performance improvement because it saves an std::function backed by a heap allocation.
* Use char* for serialized data, e.g. in SaveToCallback, where void* was confusingly used. (We use `char*` for serialized byte data all over RocksDB, with many advantages over `void*`. `memcpy` etc. are legacy APIs that should not be mimicked.)
* Add a type alias Cache::ObjectPtr = void*, so that we can better indicate the intent of the void* when it is to be the object associated with a Cache entry. Related: started (but did not complete) a refactoring to move away from "value" of a cache entry toward "object" or "obj". (It is confusing to call Cache a key-value store (like DB) when it is really storing arbitrary in-memory objects, not byte strings.)
* Remove unnecessary key param from DeleterFn. This is good for efficiency in HyperClockCache, which does not directly store the cache key in memory. (Alternative to https://github.com/facebook/rocksdb/issues/10774)
* Add allocator to Cache DeleterFn. This is a kind of future-proofing change in case we get more serious about using the Cache allocator for memory tracked by the Cache. Right now, only the uncompressed block contents are allocated using the allocator, and a pointer to that allocator is saved as part of the cached object so that the deleter can use it. (See CacheAllocationPtr.) If in the future we are able to "flatten out" our Cache objects some more, it would be good not to have to track the allocator as part of each object.
* Removes legacy `ApplyToAllCacheEntries` and changes `ApplyToAllEntries` signature for Deleter->CacheItemHelper change.
## typed_cache.h
Adds various "typed" interfaces to the Cache as internal APIs, so that most uses of Cache can use simple type safe code without casting and without explicit deleters, etc. Almost all of the non-test, non-glue code uses of Cache have been migrated. (Follow-up work: CompressedSecondaryCache deserves deeper attention to migrate.) This change expands RocksDB's internal usage of metaprogramming and SFINAE (https://en.cppreference.com/w/cpp/language/sfinae).
The existing usages of Cache are divided up at a high level into these new interfaces. See updated existing uses of Cache for examples of how these are used.
* PlaceholderCacheInterface - Used for making cache reservations, with entries that have a charge but no value.
* BasicTypedCacheInterface<TValue> - Used for primary cache storage of objects of type TValue, which can be cleaned up with std::default_delete<TValue>. The role is provided by TValue::kCacheEntryRole or given in an optional template parameter.
* FullTypedCacheInterface<TValue, TCreateContext> - Used for secondary cache compatible storage of objects of type TValue. In addition to BasicTypedCacheInterface constraints, we require TValue::ContentSlice() to return persistable data. This simplifies usage for the normal case of simple secondary cache compatibility (can give you a Slice to the data already in memory). In addition to TCreateContext performing the role of Cache::CreateContext, it is also expected to provide a factory function for creating TValue.
* For each of these, there's a "Shared" version (e.g. FullTypedSharedCacheInterface) that holds a shared_ptr to the Cache, rather than assuming external ownership by holding only a raw `Cache*`.
These interfaces introduce specific handle types for each interface instantiation, so that it's easy to see what kind of object is controlled by a handle. (Ultimately, this might not be worth the extra complexity, but it seems OK so far.)
Note: I attempted to make the cache 'charge' automatically inferred from the cache object type, such as by expecting an ApproximateMemoryUsage() function, but this is not so clean because there are cases where we need to compute the charge ahead of time and don't want to re-compute it.
## block_cache.h
This header is essentially the replacement for the old block_like_traits.h. It includes various things to support block cache access with typed_cache.h for block-based table.
## block_based_table_reader.cc
Before this change, accessing the block cache here was an awkward mix of static polymorphism (template TBlocklike) and switch-case on a dynamic BlockType value. This change mostly unifies on static polymorphism, relying on minor hacks in block_cache.h to distinguish variants of Block. We still check BlockType in some places (especially for stats, which could be improved in follow-up work) but at least the BlockType is a static constant from the template parameter. (No more awkward partial redundancy between static and dynamic info.) This likely contributes to the overall performance improvement, but hasn't been tested in isolation.
The other key source of simplification here is a more unified system of creating block cache objects: for directly populating from primary cache and for promotion from secondary cache. Both use BlockCreateContext, for context and for factory functions.
## block_based_table_builder.cc, cache_dump_load_impl.cc
Before this change, warming caches was super ugly code. Both of these source files had switch statements to basically transition from the dynamic BlockType world to the static TBlocklike world. None of that mess is needed anymore as there's a new, untyped WarmInCache function that handles all the details just as promotion from SecondaryCache would. (Fixes `TODO akanksha: Dedup below code` in block_based_table_builder.cc.)
## Everything else
Mostly just updating Cache users to use new typed APIs when reasonably possible, or changed Cache APIs when not.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10975
Test Plan:
tests updated
Performance test setup similar to https://github.com/facebook/rocksdb/issues/10626 (by cache size, LRUCache when not "hyper" for HyperClockCache):
34MB 1thread base.hyper -> kops/s: 0.745 io_bytes/op: 2.52504e+06 miss_ratio: 0.140906 max_rss_mb: 76.4844
34MB 1thread new.hyper -> kops/s: 0.751 io_bytes/op: 2.5123e+06 miss_ratio: 0.140161 max_rss_mb: 79.3594
34MB 1thread base -> kops/s: 0.254 io_bytes/op: 1.36073e+07 miss_ratio: 0.918818 max_rss_mb: 45.9297
34MB 1thread new -> kops/s: 0.252 io_bytes/op: 1.36157e+07 miss_ratio: 0.918999 max_rss_mb: 44.1523
34MB 32thread base.hyper -> kops/s: 7.272 io_bytes/op: 2.88323e+06 miss_ratio: 0.162532 max_rss_mb: 516.602
34MB 32thread new.hyper -> kops/s: 7.214 io_bytes/op: 2.99046e+06 miss_ratio: 0.168818 max_rss_mb: 518.293
34MB 32thread base -> kops/s: 3.528 io_bytes/op: 1.35722e+07 miss_ratio: 0.914691 max_rss_mb: 264.926
34MB 32thread new -> kops/s: 3.604 io_bytes/op: 1.35744e+07 miss_ratio: 0.915054 max_rss_mb: 264.488
233MB 1thread base.hyper -> kops/s: 53.909 io_bytes/op: 2552.35 miss_ratio: 0.0440566 max_rss_mb: 241.984
233MB 1thread new.hyper -> kops/s: 62.792 io_bytes/op: 2549.79 miss_ratio: 0.044043 max_rss_mb: 241.922
233MB 1thread base -> kops/s: 1.197 io_bytes/op: 2.75173e+06 miss_ratio: 0.103093 max_rss_mb: 241.559
233MB 1thread new -> kops/s: 1.199 io_bytes/op: 2.73723e+06 miss_ratio: 0.10305 max_rss_mb: 240.93
233MB 32thread base.hyper -> kops/s: 1298.69 io_bytes/op: 2539.12 miss_ratio: 0.0440307 max_rss_mb: 371.418
233MB 32thread new.hyper -> kops/s: 1421.35 io_bytes/op: 2538.75 miss_ratio: 0.0440307 max_rss_mb: 347.273
233MB 32thread base -> kops/s: 9.693 io_bytes/op: 2.77304e+06 miss_ratio: 0.103745 max_rss_mb: 569.691
233MB 32thread new -> kops/s: 9.75 io_bytes/op: 2.77559e+06 miss_ratio: 0.103798 max_rss_mb: 552.82
1597MB 1thread base.hyper -> kops/s: 58.607 io_bytes/op: 1449.14 miss_ratio: 0.0249324 max_rss_mb: 1583.55
1597MB 1thread new.hyper -> kops/s: 69.6 io_bytes/op: 1434.89 miss_ratio: 0.0247167 max_rss_mb: 1584.02
1597MB 1thread base -> kops/s: 60.478 io_bytes/op: 1421.28 miss_ratio: 0.024452 max_rss_mb: 1589.45
1597MB 1thread new -> kops/s: 63.973 io_bytes/op: 1416.07 miss_ratio: 0.0243766 max_rss_mb: 1589.24
1597MB 32thread base.hyper -> kops/s: 1436.2 io_bytes/op: 1357.93 miss_ratio: 0.0235353 max_rss_mb: 1692.92
1597MB 32thread new.hyper -> kops/s: 1605.03 io_bytes/op: 1358.04 miss_ratio: 0.023538 max_rss_mb: 1702.78
1597MB 32thread base -> kops/s: 280.059 io_bytes/op: 1350.34 miss_ratio: 0.023289 max_rss_mb: 1675.36
1597MB 32thread new -> kops/s: 283.125 io_bytes/op: 1351.05 miss_ratio: 0.0232797 max_rss_mb: 1703.83
Almost uniformly improving over base revision, especially for hot paths with HyperClockCache, up to 12% higher throughput seen (1597MB, 32thread, hyper). The improvement for that is likely coming from much simplified code for providing context for secondary cache promotion (CreateCallback/CreateContext), and possibly from less branching in block_based_table_reader. And likely a small improvement from not reconstituting key for DeleterFn.
Reviewed By: anand1976
Differential Revision: D42417818
Pulled By: pdillinger
fbshipit-source-id: f86bfdd584dce27c028b151ba56818ad14f7a432
2 years ago
|
|
|
Cache::ObjectPtr Value(Handle* handle) override;
|
|
|
|
|
|
|
|
size_t GetCharge(Handle* handle) const override;
|
|
|
|
|
Major Cache refactoring, CPU efficiency improvement (#10975)
Summary:
This is several refactorings bundled into one to avoid having to incrementally re-modify uses of Cache several times. Overall, there are breaking changes to Cache class, and it becomes more of low-level interface for implementing caches, especially block cache. New internal APIs make using Cache cleaner than before, and more insulated from block cache evolution. Hopefully, this is the last really big block cache refactoring, because of rather effectively decoupling the implementations from the uses. This change also removes the EXPERIMENTAL designation on the SecondaryCache support in Cache. It seems reasonably mature at this point but still subject to change/evolution (as I warn in the API docs for Cache).
The high-level motivation for this refactoring is to minimize code duplication / compounding complexity in adding SecondaryCache support to HyperClockCache (in a later PR). Other benefits listed below.
* static_cast lines of code +29 -35 (net removed 6)
* reinterpret_cast lines of code +6 -32 (net removed 26)
## cache.h and secondary_cache.h
* Always use CacheItemHelper with entries instead of just a Deleter. There are several motivations / justifications:
* Simpler for implementations to deal with just one Insert and one Lookup.
* Simpler and more efficient implementation because we don't have to track which entries are using helpers and which are using deleters
* Gets rid of hack to classify cache entries by their deleter. Instead, the CacheItemHelper includes a CacheEntryRole. This simplifies a lot of code (cache_entry_roles.h almost eliminated). Fixes https://github.com/facebook/rocksdb/issues/9428.
* Makes it trivial to adjust SecondaryCache behavior based on kind of block (e.g. don't re-compress filter blocks).
* It is arguably less convenient for many direct users of Cache, but direct users of Cache are now rare with introduction of typed_cache.h (below).
* I considered and rejected an alternative approach in which we reduce customizability by assuming each secondary cache compatible value starts with a Slice referencing the uncompressed block contents (already true or mostly true), but we apparently intend to stack secondary caches. Saving an entry from a compressed secondary to a lower tier requires custom handling offered by SaveToCallback, etc.
* Make CreateCallback part of the helper and introduce CreateContext to work with it (alternative to https://github.com/facebook/rocksdb/issues/10562). This cleans up the interface while still allowing context to be provided for loading/parsing values into primary cache. This model works for async lookup in BlockBasedTable reader (reader owns a CreateContext) under the assumption that it always waits on secondary cache operations to finish. (Otherwise, the CreateContext could be destroyed while async operation depending on it continues.) This likely contributes most to the observed performance improvement because it saves an std::function backed by a heap allocation.
* Use char* for serialized data, e.g. in SaveToCallback, where void* was confusingly used. (We use `char*` for serialized byte data all over RocksDB, with many advantages over `void*`. `memcpy` etc. are legacy APIs that should not be mimicked.)
* Add a type alias Cache::ObjectPtr = void*, so that we can better indicate the intent of the void* when it is to be the object associated with a Cache entry. Related: started (but did not complete) a refactoring to move away from "value" of a cache entry toward "object" or "obj". (It is confusing to call Cache a key-value store (like DB) when it is really storing arbitrary in-memory objects, not byte strings.)
* Remove unnecessary key param from DeleterFn. This is good for efficiency in HyperClockCache, which does not directly store the cache key in memory. (Alternative to https://github.com/facebook/rocksdb/issues/10774)
* Add allocator to Cache DeleterFn. This is a kind of future-proofing change in case we get more serious about using the Cache allocator for memory tracked by the Cache. Right now, only the uncompressed block contents are allocated using the allocator, and a pointer to that allocator is saved as part of the cached object so that the deleter can use it. (See CacheAllocationPtr.) If in the future we are able to "flatten out" our Cache objects some more, it would be good not to have to track the allocator as part of each object.
* Removes legacy `ApplyToAllCacheEntries` and changes `ApplyToAllEntries` signature for Deleter->CacheItemHelper change.
## typed_cache.h
Adds various "typed" interfaces to the Cache as internal APIs, so that most uses of Cache can use simple type safe code without casting and without explicit deleters, etc. Almost all of the non-test, non-glue code uses of Cache have been migrated. (Follow-up work: CompressedSecondaryCache deserves deeper attention to migrate.) This change expands RocksDB's internal usage of metaprogramming and SFINAE (https://en.cppreference.com/w/cpp/language/sfinae).
The existing usages of Cache are divided up at a high level into these new interfaces. See updated existing uses of Cache for examples of how these are used.
* PlaceholderCacheInterface - Used for making cache reservations, with entries that have a charge but no value.
* BasicTypedCacheInterface<TValue> - Used for primary cache storage of objects of type TValue, which can be cleaned up with std::default_delete<TValue>. The role is provided by TValue::kCacheEntryRole or given in an optional template parameter.
* FullTypedCacheInterface<TValue, TCreateContext> - Used for secondary cache compatible storage of objects of type TValue. In addition to BasicTypedCacheInterface constraints, we require TValue::ContentSlice() to return persistable data. This simplifies usage for the normal case of simple secondary cache compatibility (can give you a Slice to the data already in memory). In addition to TCreateContext performing the role of Cache::CreateContext, it is also expected to provide a factory function for creating TValue.
* For each of these, there's a "Shared" version (e.g. FullTypedSharedCacheInterface) that holds a shared_ptr to the Cache, rather than assuming external ownership by holding only a raw `Cache*`.
These interfaces introduce specific handle types for each interface instantiation, so that it's easy to see what kind of object is controlled by a handle. (Ultimately, this might not be worth the extra complexity, but it seems OK so far.)
Note: I attempted to make the cache 'charge' automatically inferred from the cache object type, such as by expecting an ApproximateMemoryUsage() function, but this is not so clean because there are cases where we need to compute the charge ahead of time and don't want to re-compute it.
## block_cache.h
This header is essentially the replacement for the old block_like_traits.h. It includes various things to support block cache access with typed_cache.h for block-based table.
## block_based_table_reader.cc
Before this change, accessing the block cache here was an awkward mix of static polymorphism (template TBlocklike) and switch-case on a dynamic BlockType value. This change mostly unifies on static polymorphism, relying on minor hacks in block_cache.h to distinguish variants of Block. We still check BlockType in some places (especially for stats, which could be improved in follow-up work) but at least the BlockType is a static constant from the template parameter. (No more awkward partial redundancy between static and dynamic info.) This likely contributes to the overall performance improvement, but hasn't been tested in isolation.
The other key source of simplification here is a more unified system of creating block cache objects: for directly populating from primary cache and for promotion from secondary cache. Both use BlockCreateContext, for context and for factory functions.
## block_based_table_builder.cc, cache_dump_load_impl.cc
Before this change, warming caches was super ugly code. Both of these source files had switch statements to basically transition from the dynamic BlockType world to the static TBlocklike world. None of that mess is needed anymore as there's a new, untyped WarmInCache function that handles all the details just as promotion from SecondaryCache would. (Fixes `TODO akanksha: Dedup below code` in block_based_table_builder.cc.)
## Everything else
Mostly just updating Cache users to use new typed APIs when reasonably possible, or changed Cache APIs when not.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10975
Test Plan:
tests updated
Performance test setup similar to https://github.com/facebook/rocksdb/issues/10626 (by cache size, LRUCache when not "hyper" for HyperClockCache):
34MB 1thread base.hyper -> kops/s: 0.745 io_bytes/op: 2.52504e+06 miss_ratio: 0.140906 max_rss_mb: 76.4844
34MB 1thread new.hyper -> kops/s: 0.751 io_bytes/op: 2.5123e+06 miss_ratio: 0.140161 max_rss_mb: 79.3594
34MB 1thread base -> kops/s: 0.254 io_bytes/op: 1.36073e+07 miss_ratio: 0.918818 max_rss_mb: 45.9297
34MB 1thread new -> kops/s: 0.252 io_bytes/op: 1.36157e+07 miss_ratio: 0.918999 max_rss_mb: 44.1523
34MB 32thread base.hyper -> kops/s: 7.272 io_bytes/op: 2.88323e+06 miss_ratio: 0.162532 max_rss_mb: 516.602
34MB 32thread new.hyper -> kops/s: 7.214 io_bytes/op: 2.99046e+06 miss_ratio: 0.168818 max_rss_mb: 518.293
34MB 32thread base -> kops/s: 3.528 io_bytes/op: 1.35722e+07 miss_ratio: 0.914691 max_rss_mb: 264.926
34MB 32thread new -> kops/s: 3.604 io_bytes/op: 1.35744e+07 miss_ratio: 0.915054 max_rss_mb: 264.488
233MB 1thread base.hyper -> kops/s: 53.909 io_bytes/op: 2552.35 miss_ratio: 0.0440566 max_rss_mb: 241.984
233MB 1thread new.hyper -> kops/s: 62.792 io_bytes/op: 2549.79 miss_ratio: 0.044043 max_rss_mb: 241.922
233MB 1thread base -> kops/s: 1.197 io_bytes/op: 2.75173e+06 miss_ratio: 0.103093 max_rss_mb: 241.559
233MB 1thread new -> kops/s: 1.199 io_bytes/op: 2.73723e+06 miss_ratio: 0.10305 max_rss_mb: 240.93
233MB 32thread base.hyper -> kops/s: 1298.69 io_bytes/op: 2539.12 miss_ratio: 0.0440307 max_rss_mb: 371.418
233MB 32thread new.hyper -> kops/s: 1421.35 io_bytes/op: 2538.75 miss_ratio: 0.0440307 max_rss_mb: 347.273
233MB 32thread base -> kops/s: 9.693 io_bytes/op: 2.77304e+06 miss_ratio: 0.103745 max_rss_mb: 569.691
233MB 32thread new -> kops/s: 9.75 io_bytes/op: 2.77559e+06 miss_ratio: 0.103798 max_rss_mb: 552.82
1597MB 1thread base.hyper -> kops/s: 58.607 io_bytes/op: 1449.14 miss_ratio: 0.0249324 max_rss_mb: 1583.55
1597MB 1thread new.hyper -> kops/s: 69.6 io_bytes/op: 1434.89 miss_ratio: 0.0247167 max_rss_mb: 1584.02
1597MB 1thread base -> kops/s: 60.478 io_bytes/op: 1421.28 miss_ratio: 0.024452 max_rss_mb: 1589.45
1597MB 1thread new -> kops/s: 63.973 io_bytes/op: 1416.07 miss_ratio: 0.0243766 max_rss_mb: 1589.24
1597MB 32thread base.hyper -> kops/s: 1436.2 io_bytes/op: 1357.93 miss_ratio: 0.0235353 max_rss_mb: 1692.92
1597MB 32thread new.hyper -> kops/s: 1605.03 io_bytes/op: 1358.04 miss_ratio: 0.023538 max_rss_mb: 1702.78
1597MB 32thread base -> kops/s: 280.059 io_bytes/op: 1350.34 miss_ratio: 0.023289 max_rss_mb: 1675.36
1597MB 32thread new -> kops/s: 283.125 io_bytes/op: 1351.05 miss_ratio: 0.0232797 max_rss_mb: 1703.83
Almost uniformly improving over base revision, especially for hot paths with HyperClockCache, up to 12% higher throughput seen (1597MB, 32thread, hyper). The improvement for that is likely coming from much simplified code for providing context for secondary cache promotion (CreateCallback/CreateContext), and possibly from less branching in block_based_table_reader. And likely a small improvement from not reconstituting key for DeleterFn.
Reviewed By: anand1976
Differential Revision: D42417818
Pulled By: pdillinger
fbshipit-source-id: f86bfdd584dce27c028b151ba56818ad14f7a432
2 years ago
|
|
|
const CacheItemHelper* GetCacheItemHelper(Handle* handle) const override;
|
|
|
|
|
|
|
|
void ReportProblems(
|
|
|
|
const std::shared_ptr<Logger>& /*info_log*/) const override;
|
|
|
|
}; // class HyperClockCache
|
|
|
|
|
|
|
|
} // namespace clock_cache
|
|
|
|
|
|
|
|
} // namespace ROCKSDB_NAMESPACE
|