Revamp, optimize new experimental clock cache (#10626)

Summary:
* Consolidates most metadata into a single word per slot so that more
can be accomplished with a single atomic update. In the common case,
Lookup was previously about 4 atomic updates, now just 1 atomic update.
Common case Release was previously 1 atomic read + 1 atomic update,
now just 1 atomic update.
* Eliminates spins / waits / yields, which likely threaten some "lock free"
benefits. Compare-exchange loops are only used in explicit Erase and in
Insert with strict_capacity_limit=true. Eviction uses opportunistic
compare-exchange.
* Relaxes some aggressiveness and guarantees. For example,
  * Duplicate Inserts will sometimes go undetected and the shadow duplicate
    will age out with eviction.
  * In many cases, the older Inserted value for a given cache key will be kept
    (i.e. Insert does not support overwrite).
  * Entries explicitly erased (rather than evicted) might not be freed
    immediately in some rare cases.
  * With strict_capacity_limit=false, the capacity limit is not tracked/enforced
    as precisely as in LRUCache, but is self-correcting and should only deviate
    by a very small number of extra or fewer entries.
* Uses a smaller "computed default" number of cache shards in many cases,
because the benefits of larger usage tracking / eviction pools outweigh the
small cost of more lock-free atomic contention. The improvement in CPU and I/O
is dramatic in some limited-memory cases.
* Even without the sharding change, the eviction algorithm is likely more
effective than LRU overall because it's more stateful, even though the
"hot path" state tracking for it is essentially free with ref counting. It
is like a generalized CLOCK with aging (see code comments). I don't have
performance numbers showing a specific improvement, but in theory, for a
Poisson access pattern to each block, keeping some state allows better
estimation of time to next access (Poisson interval) than strict LRU. The
bounded randomness in CLOCK can also reduce the "cliff" effect for repeated
range scans approaching and exceeding cache size.
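
For readers unfamiliar with "CLOCK with aging", the following is a minimal single-threaded sketch of the general textbook idea, not the code in cache/clock_cache.cc. The class name, the countdown constants, and the linear-scan lookup are all assumptions made for illustration: each entry carries a small countdown that is bumped on a hit and decremented as the clock hand sweeps past, and an entry is evicted only when the hand finds its countdown at zero.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative constants (assumptions, not taken from the RocksDB source).
constexpr uint8_t kInitialCountdown = 1;
constexpr uint8_t kMaxCountdown = 3;

// Single-threaded sketch of CLOCK eviction with per-entry aging state.
// Keys are assumed to be non-negative so that -1 can mean "nothing evicted".
class AgingClockSketch {
 private:
  struct Slot {
    int key = -1;
    uint8_t countdown = 0;
    bool used = false;
  };

 public:
  explicit AgingClockSketch(size_t num_slots) : slots_(num_slots) {
    assert(num_slots > 0);
  }

  // On a hit, bump the countdown (saturating) so the entry survives more
  // sweeps of the clock hand.
  bool Lookup(int key) {
    for (Slot& s : slots_) {
      if (s.used && s.key == key) {
        if (s.countdown < kMaxCountdown) {
          ++s.countdown;
        }
        return true;
      }
    }
    return false;
  }

  // Inserts `key`. Returns the evicted key, or -1 if a free slot was used.
  int Insert(int key) {
    for (Slot& s : slots_) {
      if (!s.used) {
        Fill(s, key);
        return -1;
      }
    }
    for (;;) {
      Slot& s = slots_[hand_];
      hand_ = (hand_ + 1) % slots_.size();
      if (s.countdown == 0) {
        int victim = s.key;
        Fill(s, key);
        return victim;
      }
      --s.countdown;  // Aging: each pass of the hand costs one unit.
    }
  }

 private:
  static void Fill(Slot& s, int key) {
    s.key = key;
    s.countdown = kInitialCountdown;
    s.used = true;
  }

  std::vector<Slot> slots_;
  size_t hand_ = 0;
};
```

With three slots holding keys 1, 2, 3 and two extra hits on key 1, inserting key 4 sweeps past key 1 (aging it twice) and evicts key 2, so the frequently used entry is kept without any per-lookup list manipulation.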

## Hot path algorithm comparison
Rough descriptions, focusing on number and kind of atomic operations:
* Old `Lookup()` (2-5 atomic updates per probe):
```
Loop:
  Increment internal ref count at slot
  If possible hit:
    Check flags atomic (and non-atomic fields)
    If cache hit:
      Three distinct updates to 'flags' atomic
      Increment refs for internal-to-external
      Return
  Decrement internal ref count
while atomic read 'displacements' > 0
```
* New `Lookup()` (1-2 atomic updates per probe):
```
Loop:
  Increment acquire counter in meta word (optimistic)
  If visible entry (already read meta word):
    If match (read non-atomic fields):
      Return
    Else:
      Decrement acquire counter in meta word
  Else if invisible entry (rare, already read meta word):
    Decrement acquire counter in meta word
while atomic read 'displacements' > 0
```
* Old `Release()` (1 atomic update, conditional on atomic read, rarely more):
```
Read atomic ref count
If last reference and invisible (rare):
  Use CAS etc. to remove
  Return
Else:
  Decrement ref count
```
* New `Release()` (1 unconditional atomic update, rarely more):
```
Increment release counter in meta word
If last reference and invisible (rare):
  Use CAS etc. to remove
  Return
```
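
The new `Lookup()` / `Release()` descriptions above can be sketched with a single atomic word per slot. This is only an illustration under stated assumptions: the bit layout (30-bit acquire counter, 30-bit release counter, state in the top bits), the state values, and the names `Slot`, `Publish`, `TryAcquire`, and `RefCount` are hypothetical and are not the definitions in cache/clock_cache.h. Hash probing, counter overflow handling, and the rare CAS-based removal path are omitted.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical meta word layout (an assumption for this sketch):
//   bits  0..29  acquire counter
//   bits 30..59  release counter
//   bits 60..63  state
constexpr int kCounterBits = 30;
constexpr uint64_t kCounterMask = (uint64_t{1} << kCounterBits) - 1;
constexpr int kReleaseShift = kCounterBits;
constexpr int kStateShift = 2 * kCounterBits;
constexpr uint64_t kAcquireIncrement = uint64_t{1};
constexpr uint64_t kReleaseIncrement = uint64_t{1} << kReleaseShift;

enum State : uint64_t { kEmpty = 0, kInvisible = 2, kVisible = 3 };

struct Slot {
  std::atomic<uint64_t> meta{0};
  uint64_t key = 0;  // Non-atomic field, read only while a reference is held.
};

inline uint64_t StateOf(uint64_t meta) { return meta >> kStateShift; }

// Outstanding references = acquires minus releases (modulo counter width).
inline uint64_t RefCount(uint64_t meta) {
  uint64_t acquired = meta & kCounterMask;
  uint64_t released = (meta >> kReleaseShift) & kCounterMask;
  return (acquired - released) & kCounterMask;
}

// Makes an entry visible with zero references (single-threaded setup helper).
inline void Publish(Slot& slot, uint64_t key) {
  slot.key = key;
  slot.meta.store(static_cast<uint64_t>(kVisible) << kStateShift,
                  std::memory_order_release);
}

// One probe of Lookup: a single optimistic atomic update in the common case.
inline bool TryAcquire(Slot& slot, uint64_t key) {
  uint64_t old_meta =
      slot.meta.fetch_add(kAcquireIncrement, std::memory_order_acquire);
  if (StateOf(old_meta) == kVisible) {
    if (slot.key == key) {
      return true;  // Hit: keep the reference we just took.
    }
    slot.meta.fetch_sub(kAcquireIncrement, std::memory_order_release);
  } else if (StateOf(old_meta) == kInvisible) {
    slot.meta.fetch_sub(kAcquireIncrement, std::memory_order_release);
  }
  // Other states: the pseudocode above performs no undo. This sketch assumes
  // an inserter overwrites the whole word when it claims the slot.
  return false;
}

// Release: one unconditional atomic update. Returns true if the caller would
// need the rare cleanup path (last reference to an invisible entry).
inline bool Release(Slot& slot) {
  uint64_t old_meta =
      slot.meta.fetch_add(kReleaseIncrement, std::memory_order_release);
  assert(RefCount(old_meta) > 0 && "Release without a matching acquire");
  return StateOf(old_meta) == kInvisible &&
         RefCount(old_meta + kReleaseIncrement) == 0;
}
```

Because both counters live in the same word as the state, a hit costs one `fetch_add` and a release costs one `fetch_add`, matching the "1 atomic update" counts in the comparison above.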

## Performance test setup
Build DB with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=30000000 -disable_wal=1 -bloom_bits=16
```
Test with
```
TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=readrandom -readonly -num=30000000 -bloom_bits=16 -cache_index_and_filter_blocks=1 -cache_size=${CACHE_MB}000000 -duration 60 -threads=$THREADS -statistics
```
Numbers on a single socket Skylake Xeon system with 48 hardware threads, DEBUG_LEVEL=0 PORTABLE=0. Very similar story on a dual socket system with 80 hardware threads. Using (every 2nd) Fibonacci MB cache sizes to sample the territory between powers of two. Configurations:

* base: LRUCache before this change, but with db_bench change to default cache_numshardbits=-1 (instead of fixed at 6)
* folly: LRUCache before this change, with folly enabled (distributed mutex) but on an old compiler (sorry)
* gt_clock: experimental ClockCache before this change
* new_clock: experimental ClockCache with this change
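
As a small aid for reproducing the setup, the "every 2nd Fibonacci" cache sizes used in the results (13, 34, 89, 233, 610, 1597, 4181 MB) can be generated as sketched here. The helper name `EverySecondFibonacci` is hypothetical; each value would be substituted for `${CACHE_MB}` in the db_bench command above.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Returns every second Fibonacci number in [lo, hi], starting from the first
// Fibonacci number that is >= lo.
std::vector<uint64_t> EverySecondFibonacci(uint64_t lo, uint64_t hi) {
  assert(lo <= hi);
  std::vector<uint64_t> out;
  uint64_t a = 1;
  uint64_t b = 1;
  // Advance to the first Fibonacci number >= lo.
  while (b < lo) {
    uint64_t next = a + b;
    a = b;
    b = next;
  }
  while (b <= hi) {
    out.push_back(b);
    // Skip one Fibonacci number by advancing two steps.
    for (int i = 0; i < 2; ++i) {
      uint64_t next = a + b;
      a = b;
      b = next;
    }
  }
  return out;
}
```

Calling `EverySecondFibonacci(13, 4181)` yields the seven sizes that appear in the results, which fall between successive powers of two rather than on them.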

## Performance test results
First test "hot path" read performance, with block cache large enough for whole DB:
4181MB 1thread base -> kops/s: 47.761
4181MB 1thread folly -> kops/s: 45.877
4181MB 1thread gt_clock -> kops/s: 51.092
4181MB 1thread new_clock -> kops/s: 53.944

4181MB 16thread base -> kops/s: 284.567
4181MB 16thread folly -> kops/s: 249.015
4181MB 16thread gt_clock -> kops/s: 743.762
4181MB 16thread new_clock -> kops/s: 861.821

4181MB 24thread base -> kops/s: 303.415
4181MB 24thread folly -> kops/s: 266.548
4181MB 24thread gt_clock -> kops/s: 975.706
4181MB 24thread new_clock -> kops/s: 1205.64 (~= 24 * 53.944)

4181MB 32thread base -> kops/s: 311.251
4181MB 32thread folly -> kops/s: 274.952
4181MB 32thread gt_clock -> kops/s: 1045.98
4181MB 32thread new_clock -> kops/s: 1370.38

4181MB 48thread base -> kops/s: 310.504
4181MB 48thread folly -> kops/s: 268.322
4181MB 48thread gt_clock -> kops/s: 1195.65
4181MB 48thread new_clock -> kops/s: 1604.85 (~= 24 * 1.25 * 53.944)

4181MB 64thread base -> kops/s: 307.839
4181MB 64thread folly -> kops/s: 272.172
4181MB 64thread gt_clock -> kops/s: 1204.47
4181MB 64thread new_clock -> kops/s: 1615.37

4181MB 128thread base -> kops/s: 310.934
4181MB 128thread folly -> kops/s: 267.468
4181MB 128thread gt_clock -> kops/s: 1188.75
4181MB 128thread new_clock -> kops/s: 1595.46

Whether we have just one thread on a quiet system or an overload of threads, the new version wins every time in thousand-ops per second, sometimes dramatically so. The mutex-based implementations quickly become contention-limited. The new clock cache shows essentially perfect scaling up to the number of physical cores (24), with each additional hyperthread then adding about 1/4 the throughput of an additional physical core (see the 48 thread case). Block cache miss rates (omitted above) are negligible across the board. With partitioned instead of full filters, the maximum speed-up vs. base is more like 2.5x rather than 5x.
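
The scaling annotations in the table (`~= 24 * 53.944` and `~= 24 * 1.25 * 53.944`) follow a simple model that can be written out as a sketch. The function name and parameters are hypothetical; the inputs (53.944 kops/s single-threaded, 24 physical cores, 48 hardware threads, 1/4 gain per hyperthread) come from the text above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Ideal throughput if each physical core contributes the single-thread rate
// and each extra hyperthread contributes `hyperthread_gain` of that rate.
// Threads beyond the hardware thread count add nothing in this model.
double PredictedKops(int threads, double single_thread_kops,
                     int physical_cores, int hardware_threads,
                     double hyperthread_gain) {
  assert(physical_cores <= hardware_threads);
  int on_cores = std::min(threads, physical_cores);
  int on_hyperthreads =
      std::max(0, std::min(threads, hardware_threads) - physical_cores);
  return single_thread_kops * (on_cores + hyperthread_gain * on_hyperthreads);
}
```

For 24 threads the model gives 24 * 53.944 = 1294.656 kops/s (measured: 1205.64, about 93% of the model), and for 48 threads it gives 30 * 53.944 = 1618.32 kops/s (measured: 1604.85, about 99%). Beyond 48 threads the model stays flat, which matches the plateau in the 64 and 128 thread rows.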

Now test a large block cache with low miss ratio, but some eviction is required:
1597MB 1thread base -> kops/s: 46.603 io_bytes/op: 1584.63 miss_ratio: 0.0201066 max_rss_mb: 1589.23
1597MB 1thread folly -> kops/s: 45.079 io_bytes/op: 1530.03 miss_ratio: 0.019872 max_rss_mb: 1550.43
1597MB 1thread gt_clock -> kops/s: 48.711 io_bytes/op: 1566.63 miss_ratio: 0.0198923 max_rss_mb: 1691.4
1597MB 1thread new_clock -> kops/s: 51.531 io_bytes/op: 1589.07 miss_ratio: 0.0201969 max_rss_mb: 1583.56

1597MB 32thread base -> kops/s: 301.174 io_bytes/op: 1439.52 miss_ratio: 0.0184218 max_rss_mb: 1656.59
1597MB 32thread folly -> kops/s: 273.09 io_bytes/op: 1375.12 miss_ratio: 0.0180002 max_rss_mb: 1586.8
1597MB 32thread gt_clock -> kops/s: 904.497 io_bytes/op: 1411.29 miss_ratio: 0.0179934 max_rss_mb: 1775.89
1597MB 32thread new_clock -> kops/s: 1182.59 io_bytes/op: 1440.77 miss_ratio: 0.0185449 max_rss_mb: 1636.45

1597MB 128thread base -> kops/s: 309.91 io_bytes/op: 1438.25 miss_ratio: 0.018399 max_rss_mb: 1689.98
1597MB 128thread folly -> kops/s: 267.605 io_bytes/op: 1394.16 miss_ratio: 0.0180286 max_rss_mb: 1631.91
1597MB 128thread gt_clock -> kops/s: 691.518 io_bytes/op: 9056.73 miss_ratio: 0.0186572 max_rss_mb: 1982.26
1597MB 128thread new_clock -> kops/s: 1406.12 io_bytes/op: 1440.82 miss_ratio: 0.0185463 max_rss_mb: 1685.63

610MB 1thread base -> kops/s: 45.511 io_bytes/op: 2279.61 miss_ratio: 0.0290528 max_rss_mb: 615.137
610MB 1thread folly -> kops/s: 43.386 io_bytes/op: 2217.29 miss_ratio: 0.0289282 max_rss_mb: 600.996
610MB 1thread gt_clock -> kops/s: 46.207 io_bytes/op: 2275.51 miss_ratio: 0.0290057 max_rss_mb: 637.934
610MB 1thread new_clock -> kops/s: 48.879 io_bytes/op: 2283.1 miss_ratio: 0.0291253 max_rss_mb: 613.5

610MB 32thread base -> kops/s: 306.59 io_bytes/op: 2250 miss_ratio: 0.0288721 max_rss_mb: 683.402
610MB 32thread folly -> kops/s: 269.176 io_bytes/op: 2187.86 miss_ratio: 0.0286938 max_rss_mb: 628.742
610MB 32thread gt_clock -> kops/s: 855.097 io_bytes/op: 2279.26 miss_ratio: 0.0288009 max_rss_mb: 733.062
610MB 32thread new_clock -> kops/s: 1121.47 io_bytes/op: 2244.29 miss_ratio: 0.0289046 max_rss_mb: 666.453

610MB 128thread base -> kops/s: 305.079 io_bytes/op: 2252.43 miss_ratio: 0.0288884 max_rss_mb: 723.457
610MB 128thread folly -> kops/s: 269.583 io_bytes/op: 2204.58 miss_ratio: 0.0287001 max_rss_mb: 676.426
610MB 128thread gt_clock -> kops/s: 53.298 io_bytes/op: 8128.98 miss_ratio: 0.0292452 max_rss_mb: 956.273
610MB 128thread new_clock -> kops/s: 1301.09 io_bytes/op: 2246.04 miss_ratio: 0.0289171 max_rss_mb: 788.812

The new version is still winning every time, sometimes dramatically so, and we can tell from the maximum resident memory numbers (which contain some noise, by the way) that the new cache is not cheating on memory usage. IMPORTANT: The previous generation experimental clock cache appears to hit a serious bottleneck in the higher thread count configurations, presumably due to some of its waiting functionality. (The same bottleneck is not seen with partitioned index+filters.)

Now we consider even smaller cache sizes, with higher miss ratios, eviction work, etc.

233MB 1thread base -> kops/s: 10.557 io_bytes/op: 227040 miss_ratio: 0.0403105 max_rss_mb: 247.371
233MB 1thread folly -> kops/s: 15.348 io_bytes/op: 112007 miss_ratio: 0.0372238 max_rss_mb: 245.293
233MB 1thread gt_clock -> kops/s: 6.365 io_bytes/op: 244854 miss_ratio: 0.0413873 max_rss_mb: 259.844
233MB 1thread new_clock -> kops/s: 47.501 io_bytes/op: 2591.93 miss_ratio: 0.0330989 max_rss_mb: 242.461

233MB 32thread base -> kops/s: 96.498 io_bytes/op: 363379 miss_ratio: 0.0459966 max_rss_mb: 479.227
233MB 32thread folly -> kops/s: 109.95 io_bytes/op: 314799 miss_ratio: 0.0450032 max_rss_mb: 400.738
233MB 32thread gt_clock -> kops/s: 2.353 io_bytes/op: 385397 miss_ratio: 0.048445 max_rss_mb: 500.688
233MB 32thread new_clock -> kops/s: 1088.95 io_bytes/op: 2567.02 miss_ratio: 0.0330593 max_rss_mb: 303.402

233MB 128thread base -> kops/s: 84.302 io_bytes/op: 378020 miss_ratio: 0.0466558 max_rss_mb: 1051.84
233MB 128thread folly -> kops/s: 89.921 io_bytes/op: 338242 miss_ratio: 0.0460309 max_rss_mb: 812.785
233MB 128thread gt_clock -> kops/s: 2.588 io_bytes/op: 462833 miss_ratio: 0.0509158 max_rss_mb: 1109.94
233MB 128thread new_clock -> kops/s: 1299.26 io_bytes/op: 2565.94 miss_ratio: 0.0330531 max_rss_mb: 361.016

89MB 1thread base -> kops/s: 0.574 io_bytes/op: 5.35977e+06 miss_ratio: 0.274427 max_rss_mb: 91.3086
89MB 1thread folly -> kops/s: 0.578 io_bytes/op: 5.16549e+06 miss_ratio: 0.27276 max_rss_mb: 96.8984
89MB 1thread gt_clock -> kops/s: 0.512 io_bytes/op: 4.13111e+06 miss_ratio: 0.242817 max_rss_mb: 119.441
89MB 1thread new_clock -> kops/s: 48.172 io_bytes/op: 2709.76 miss_ratio: 0.0346162 max_rss_mb: 100.754

89MB 32thread base -> kops/s: 5.779 io_bytes/op: 6.14192e+06 miss_ratio: 0.320399 max_rss_mb: 311.812
89MB 32thread folly -> kops/s: 5.601 io_bytes/op: 5.83838e+06 miss_ratio: 0.313123 max_rss_mb: 252.418
89MB 32thread gt_clock -> kops/s: 0.77 io_bytes/op: 3.99236e+06 miss_ratio: 0.236296 max_rss_mb: 396.422
89MB 32thread new_clock -> kops/s: 1064.97 io_bytes/op: 2687.23 miss_ratio: 0.0346134 max_rss_mb: 155.293

89MB 128thread base -> kops/s: 4.959 io_bytes/op: 6.20297e+06 miss_ratio: 0.323945 max_rss_mb: 823.43
89MB 128thread folly -> kops/s: 4.962 io_bytes/op: 5.9601e+06 miss_ratio: 0.319857 max_rss_mb: 626.824
89MB 128thread gt_clock -> kops/s: 1.009 io_bytes/op: 4.1083e+06 miss_ratio: 0.242512 max_rss_mb: 1095.32
89MB 128thread new_clock -> kops/s: 1224.39 io_bytes/op: 2688.2 miss_ratio: 0.0346207 max_rss_mb: 218.223

Now something interesting has happened: the new clock cache has gained a dramatic lead in the single-threaded case. This is because the cache is so small, and full filters are so big, that dividing the cache into 64 shards leads to significant (random) imbalances in cache shards and excessive churn in imbalanced shards. The new clock cache only uses two shards for this configuration, which helps to ensure that entries are part of a sufficiently big pool that their eviction order resembles the single-shard order. (This effect is not seen with partitioned index+filters.)
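
The "64 shards vs. two shards" difference can be reproduced with a sketch of a computed-default shard count. The function name and the exact rounding are assumptions chosen to be consistent with the `DefaultShardBits` test in this commit (minimum shard size of 32 MiB for the clock cache vs. 512 KiB for LRUCache, maximum of 6 shard bits); this is not necessarily the code in cache/sharded_cache.cc.

```cpp
#include <cassert>
#include <cstdint>

// Largest number of shard bits (capped at 6) such that every shard still
// gets at least `min_shard_size` bytes of capacity.
int DefaultShardBitsSketch(uint64_t capacity, uint64_t min_shard_size) {
  assert(min_shard_size > 0);
  constexpr int kMaxShardBits = 6;
  int bits = 0;
  uint64_t shards = capacity / min_shard_size;
  while ((shards >>= 1) != 0) {
    if (++bits >= kMaxShardBits) {
      return bits;
    }
  }
  return bits;
}
```

For the 89MB configuration (`-cache_size=89000000`), a 512 KiB minimum gives 6 bits (64 shards), while a 32 MiB minimum gives 1 bit (2 shards), which is the two-shard pool described above.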

Even smaller cache size:
34MB 1thread base -> kops/s: 0.198 io_bytes/op: 1.65342e+07 miss_ratio: 0.939466 max_rss_mb: 48.6914
34MB 1thread folly -> kops/s: 0.201 io_bytes/op: 1.63416e+07 miss_ratio: 0.939081 max_rss_mb: 45.3281
34MB 1thread gt_clock -> kops/s: 0.448 io_bytes/op: 4.43957e+06 miss_ratio: 0.266749 max_rss_mb: 100.523
34MB 1thread new_clock -> kops/s: 1.055 io_bytes/op: 1.85439e+06 miss_ratio: 0.107512 max_rss_mb: 75.3125

34MB 32thread base -> kops/s: 3.346 io_bytes/op: 1.64852e+07 miss_ratio: 0.93596 max_rss_mb: 180.48
34MB 32thread folly -> kops/s: 3.431 io_bytes/op: 1.62857e+07 miss_ratio: 0.935693 max_rss_mb: 137.531
34MB 32thread gt_clock -> kops/s: 1.47 io_bytes/op: 4.89704e+06 miss_ratio: 0.295081 max_rss_mb: 392.465
34MB 32thread new_clock -> kops/s: 8.19 io_bytes/op: 3.70456e+06 miss_ratio: 0.20826 max_rss_mb: 519.793

34MB 128thread base -> kops/s: 2.293 io_bytes/op: 1.64351e+07 miss_ratio: 0.931866 max_rss_mb: 449.484
34MB 128thread folly -> kops/s: 2.34 io_bytes/op: 1.6219e+07 miss_ratio: 0.932023 max_rss_mb: 396.457
34MB 128thread gt_clock -> kops/s: 1.798 io_bytes/op: 5.4241e+06 miss_ratio: 0.324881 max_rss_mb: 1104.41
34MB 128thread new_clock -> kops/s: 10.519 io_bytes/op: 2.39354e+06 miss_ratio: 0.136147 max_rss_mb: 1050.52

As the miss ratio gets higher (say, above 10%), the CPU time spent in eviction starts to erode the advantage of using fewer shards (though a 13% miss rate is still much lower than 94%). LRU's O(1) eviction time can eventually pay off when there's enough block cache churn:

13MB 1thread base -> kops/s: 0.195 io_bytes/op: 1.65732e+07 miss_ratio: 0.946604 max_rss_mb: 45.6328
13MB 1thread folly -> kops/s: 0.197 io_bytes/op: 1.63793e+07 miss_ratio: 0.94661 max_rss_mb: 33.8633
13MB 1thread gt_clock -> kops/s: 0.519 io_bytes/op: 4.43316e+06 miss_ratio: 0.269379 max_rss_mb: 100.684
13MB 1thread new_clock -> kops/s: 0.176 io_bytes/op: 1.54148e+07 miss_ratio: 0.91545 max_rss_mb: 66.2383

13MB 32thread base -> kops/s: 3.266 io_bytes/op: 1.65544e+07 miss_ratio: 0.943386 max_rss_mb: 132.492
13MB 32thread folly -> kops/s: 3.396 io_bytes/op: 1.63142e+07 miss_ratio: 0.943243 max_rss_mb: 101.863
13MB 32thread gt_clock -> kops/s: 2.758 io_bytes/op: 5.13714e+06 miss_ratio: 0.310652 max_rss_mb: 396.121
13MB 32thread new_clock -> kops/s: 3.11 io_bytes/op: 1.23419e+07 miss_ratio: 0.708425 max_rss_mb: 321.758

13MB 128thread base -> kops/s: 2.31 io_bytes/op: 1.64823e+07 miss_ratio: 0.939543 max_rss_mb: 425.539
13MB 128thread folly -> kops/s: 2.339 io_bytes/op: 1.6242e+07 miss_ratio: 0.939966 max_rss_mb: 346.098
13MB 128thread gt_clock -> kops/s: 3.223 io_bytes/op: 5.76928e+06 miss_ratio: 0.345899 max_rss_mb: 1087.77
13MB 128thread new_clock -> kops/s: 2.984 io_bytes/op: 1.05341e+07 miss_ratio: 0.606198 max_rss_mb: 898.27

gt_clock is clearly blowing way past its memory budget in exchange for lower miss rates and best throughput. new_clock also seems to be exceeding budgets; this warrants more investigation but is not the use case we are targeting with the new cache. With partitioned index+filter, the miss ratio is much better, although still high enough that the eviction CPU time is definitely offsetting mutex contention:

13MB 1thread base -> kops/s: 16.326 io_bytes/op: 23743.9 miss_ratio: 0.205362 max_rss_mb: 65.2852
13MB 1thread folly -> kops/s: 15.574 io_bytes/op: 19415 miss_ratio: 0.184157 max_rss_mb: 56.3516
13MB 1thread gt_clock -> kops/s: 14.459 io_bytes/op: 22873 miss_ratio: 0.198355 max_rss_mb: 63.9688
13MB 1thread new_clock -> kops/s: 16.34 io_bytes/op: 24386.5 miss_ratio: 0.210512 max_rss_mb: 61.707

13MB 128thread base -> kops/s: 289.786 io_bytes/op: 23710.9 miss_ratio: 0.205056 max_rss_mb: 103.57
13MB 128thread folly -> kops/s: 185.282 io_bytes/op: 19433.1 miss_ratio: 0.184275 max_rss_mb: 116.219
13MB 128thread gt_clock -> kops/s: 354.451 io_bytes/op: 23150.6 miss_ratio: 0.200495 max_rss_mb: 102.871
13MB 128thread new_clock -> kops/s: 295.359 io_bytes/op: 24626.4 miss_ratio: 0.212452 max_rss_mb: 121.109

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10626

Test Plan: updated unit tests, stress/crash test runs including with TSAN, ASAN, UBSAN

Reviewed By: anand1976

Differential Revision: D39368406

Pulled By: pdillinger

fbshipit-source-id: 5afc44da4c656f8f751b44552bbf27bd3ca6fef9
Peter Dillinger 2 years ago committed by Facebook GitHub Bot
parent 37b75e1364
commit 5724348689
  1. cache/cache_bench_tool.cc (7)
  2. cache/cache_test.cc (132)
  3. cache/clock_cache.cc (1373)
  4. cache/clock_cache.h (1002)
  5. cache/fast_lru_cache.cc (14)
  6. cache/fast_lru_cache.h (2)
  7. cache/lru_cache.cc (14)
  8. cache/lru_cache.h (4)
  9. cache/lru_cache_test.cc (498)
  10. cache/sharded_cache.cc (21)
  11. cache/sharded_cache.h (17)
  12. db/db_block_cache_test.cc (23)
  13. db/internal_stats.cc (5)
  14. db/internal_stats.h (3)
  15. include/rocksdb/cache.h (10)
  16. tools/db_bench_tool.cc (5)

7
cache/cache_bench_tool.cc

@@ -441,6 +441,8 @@ class CacheBench {
     uint64_t total_key_size = 0;
     uint64_t total_charge = 0;
     uint64_t total_entry_count = 0;
+    uint64_t table_occupancy = 0;
+    uint64_t table_size = 0;
     std::set<Cache::DeleterFn> deleters;
     StopWatchNano timer(clock);
@@ -456,6 +458,9 @@ class CacheBench {
         std::ostringstream ostr;
         ostr << "Most recent cache entry stats:\n"
              << "Number of entries: " << total_entry_count << "\n"
+             << "Table occupancy: " << table_occupancy << " / "
+             << table_size << " = "
+             << (100.0 * table_occupancy / table_size) << "%\n"
              << "Total charge: " << BytesToHumanString(total_charge) << "\n"
              << "Average key size: "
              << (1.0 * total_key_size / total_entry_count) << "\n"
@@ -492,6 +497,8 @@ class CacheBench {
       Cache::ApplyToAllEntriesOptions opts;
       opts.average_entries_per_lock = FLAGS_gather_stats_entries_per_lock;
       shared->GetCacheBench()->cache_->ApplyToAllEntries(fn, opts);
+      table_occupancy = shared->GetCacheBench()->cache_->GetOccupancyCount();
+      table_size = shared->GetCacheBench()->cache_->GetTableAddressCount();
       stats_hist->Add(timer.ElapsedNanos() / 1000);
     }
   }

132
cache/cache_test.cc

@@ -106,6 +106,8 @@ class CacheTest : public testing::TestWithParam<std::string> {
   std::shared_ptr<Cache> cache_;
   std::shared_ptr<Cache> cache2_;
+  size_t estimated_value_size_ = 1;
   CacheTest()
       : cache_(NewCache(kCacheSize, kNumShardBits, false)),
         cache2_(NewCache(kCacheSize2, kNumShardBits2, false)) {
@@ -122,12 +124,12 @@ class CacheTest : public testing::TestWithParam<std::string> {
     }
     if (type == kClock) {
       return ExperimentalNewClockCache(
-          capacity, 1 /*estimated_value_size*/, -1 /*num_shard_bits*/,
+          capacity, estimated_value_size_, -1 /*num_shard_bits*/,
           false /*strict_capacity_limit*/, kDefaultCacheMetadataChargePolicy);
     }
     if (type == kFast) {
       return NewFastLRUCache(
-          capacity, 1 /*estimated_value_size*/, -1 /*num_shard_bits*/,
+          capacity, estimated_value_size_, -1 /*num_shard_bits*/,
           false /*strict_capacity_limit*/, kDefaultCacheMetadataChargePolicy);
     }
     return nullptr;
@@ -239,7 +241,10 @@ TEST_P(CacheTest, UsageTest) {
   auto cache = NewCache(kCapacity, 8, false, kDontChargeCacheMetadata);
   auto precise_cache = NewCache(kCapacity, 0, false, kFullChargeCacheMetadata);
   ASSERT_EQ(0, cache->GetUsage());
-  ASSERT_EQ(0, precise_cache->GetUsage());
+  size_t baseline_meta_usage = precise_cache->GetUsage();
+  if (type != kClock) {
+    ASSERT_EQ(0, baseline_meta_usage);
+  }
   size_t usage = 0;
   char value[10] = "abcdef";
@@ -258,13 +263,17 @@ TEST_P(CacheTest, UsageTest) {
                                    kv_size, DumbDeleter));
     usage += kv_size;
     ASSERT_EQ(usage, cache->GetUsage());
-    ASSERT_LT(usage, precise_cache->GetUsage());
+    if (type == kClock) {
+      ASSERT_EQ(baseline_meta_usage + usage, precise_cache->GetUsage());
+    } else {
+      ASSERT_LT(usage, precise_cache->GetUsage());
+    }
   }
   cache->EraseUnRefEntries();
   precise_cache->EraseUnRefEntries();
   ASSERT_EQ(0, cache->GetUsage());
-  ASSERT_EQ(0, precise_cache->GetUsage());
+  ASSERT_EQ(baseline_meta_usage, precise_cache->GetUsage());
   // make sure the cache will be overloaded
   for (size_t i = 1; i < kCapacity; ++i) {
@@ -284,7 +293,15 @@ TEST_P(CacheTest, UsageTest) {
   ASSERT_GT(kCapacity, cache->GetUsage());
   ASSERT_GT(kCapacity, precise_cache->GetUsage());
   ASSERT_LT(kCapacity * 0.95, cache->GetUsage());
-  ASSERT_LT(kCapacity * 0.95, precise_cache->GetUsage());
+  if (type != kClock) {
+    ASSERT_LT(kCapacity * 0.95, precise_cache->GetUsage());
+  } else {
+    // estimated value size of 1 is weird for clock cache, because
+    // almost all of the capacity will be used for metadata, and due to only
+    // using power of 2 table sizes, we might hit strict occupancy limit
+    // before hitting capacity limit.
+    ASSERT_LT(kCapacity * 0.80, precise_cache->GetUsage());
+  }
 }
 // TODO: This test takes longer than expected on ClockCache. This is
@@ -301,6 +318,10 @@ TEST_P(CacheTest, PinnedUsageTest) {
   const size_t kCapacity = 200000;
   auto cache = NewCache(kCapacity, 8, false, kDontChargeCacheMetadata);
   auto precise_cache = NewCache(kCapacity, 8, false, kFullChargeCacheMetadata);
+  size_t baseline_meta_usage = precise_cache->GetUsage();
+  if (type != kClock) {
+    ASSERT_EQ(0, baseline_meta_usage);
+  }
   size_t pinned_usage = 0;
   char value[10] = "abcdef";
@@ -390,7 +411,7 @@ TEST_P(CacheTest, PinnedUsageTest) {
   cache->EraseUnRefEntries();
   precise_cache->EraseUnRefEntries();
   ASSERT_EQ(0, cache->GetUsage());
-  ASSERT_EQ(0, precise_cache->GetUsage());
+  ASSERT_EQ(baseline_meta_usage, precise_cache->GetUsage());
 }
 TEST_P(CacheTest, HitAndMiss) {
@@ -407,16 +428,30 @@ TEST_P(CacheTest, HitAndMiss) {
   ASSERT_EQ(-1, Lookup(300));
   Insert(100, 102);
-  ASSERT_EQ(102, Lookup(100));
+  if (GetParam() == kClock) {
+    // ClockCache usually doesn't overwrite on Insert
+    ASSERT_EQ(101, Lookup(100));
+  } else {
+    ASSERT_EQ(102, Lookup(100));
+  }
   ASSERT_EQ(201, Lookup(200));
   ASSERT_EQ(-1, Lookup(300));
   ASSERT_EQ(1U, deleted_keys_.size());
   ASSERT_EQ(100, deleted_keys_[0]);
-  ASSERT_EQ(101, deleted_values_[0]);
+  if (GetParam() == kClock) {
+    ASSERT_EQ(102, deleted_values_[0]);
+  } else {
+    ASSERT_EQ(101, deleted_values_[0]);
+  }
 }
 TEST_P(CacheTest, InsertSameKey) {
+  if (GetParam() == kClock) {
+    ROCKSDB_GTEST_BYPASS(
+        "ClockCache doesn't guarantee Insert overwrite same key.");
+    return;
+  }
   Insert(1, 1);
   Insert(1, 2);
   ASSERT_EQ(2, Lookup(1));
@@ -442,6 +477,11 @@ TEST_P(CacheTest, Erase) {
 }
 TEST_P(CacheTest, EntriesArePinned) {
+  if (GetParam() == kClock) {
+    ROCKSDB_GTEST_BYPASS(
+        "ClockCache doesn't guarantee Insert overwrite same key.");
+    return;
+  }
   Insert(100, 101);
   Cache::Handle* h1 = cache_->Lookup(EncodeKey(100));
   ASSERT_EQ(101, DecodeValue(cache_->Value(h1)));
@@ -474,7 +514,6 @@ TEST_P(CacheTest, EntriesArePinned) {
 TEST_P(CacheTest, EvictionPolicy) {
   Insert(100, 101);
   Insert(200, 201);
-
   // Frequently used entry must be kept around
   for (int i = 0; i < 2 * kCacheSize; i++) {
     Insert(1000+i, 2000+i);
@@ -503,6 +542,12 @@ TEST_P(CacheTest, ExternalRefPinsEntries) {
     for (int j = 0; j < 2 * kCacheSize + 100; j++) {
       Insert(1000 + j, 2000 + j);
     }
+    // Clock cache is even more stateful and needs more churn to evict
+    if (GetParam() == kClock) {
+      for (int j = 0; j < kCacheSize; j++) {
+        Insert(11000 + j, 11000 + j);
+      }
+    }
     if (i < 2) {
       ASSERT_EQ(101, Lookup(100));
     }
@@ -810,11 +855,6 @@ TEST_P(LRUCacheTest, SetStrictCapacityLimit) {
 }
 TEST_P(CacheTest, OverCapacity) {
-  auto type = GetParam();
-  if (type == kClock) {
-    ROCKSDB_GTEST_BYPASS("Requires LRU eviction policy.");
-    return;
-  }
   size_t n = 10;
   // a LRUCache with n entries and one shard only
@@ -842,23 +882,34 @@ TEST_P(CacheTest, OverCapacity) {
   for (int i = 0; i < static_cast<int>(n + 1); i++) {
     cache->Release(handles[i]);
   }
-  // Make sure eviction is triggered.
-  cache->SetCapacity(n);
-  // cache is under capacity now since elements were released
-  ASSERT_EQ(n, cache->GetUsage());
+  if (GetParam() == kClock) {
+    // Make sure eviction is triggered.
+    ASSERT_OK(cache->Insert(EncodeKey(-1), nullptr, 1, &deleter, &handles[0]));

-  // element 0 is evicted and the rest is there
-  // This is consistent with the LRU policy since the element 0
-  // was released first
-  for (int i = 0; i < static_cast<int>(n + 1); i++) {
-    std::string key = EncodeKey(i + 1);
-    auto h = cache->Lookup(key);
-    if (h) {
-      ASSERT_NE(static_cast<size_t>(i), 0U);
-      cache->Release(h);
-    } else {
-      ASSERT_EQ(static_cast<size_t>(i), 0U);
+    // cache is under capacity now since elements were released
+    ASSERT_GE(n, cache->GetUsage());
+    // clean up
+    cache->Release(handles[0]);
+  } else {
+    // LRUCache checks for over-capacity in Release.
+    // cache is exactly at capacity now with minimal eviction
+    ASSERT_EQ(n, cache->GetUsage());
+    // element 0 is evicted and the rest is there
+    // This is consistent with the LRU policy since the element 0
+    // was released first
+    for (int i = 0; i < static_cast<int>(n + 1); i++) {
+      std::string key = EncodeKey(i + 1);
+      auto h = cache->Lookup(key);
+      if (h) {
+        ASSERT_NE(static_cast<size_t>(i), 0U);
+        cache->Release(h);
+      } else {
+        ASSERT_EQ(static_cast<size_t>(i), 0U);
+      }
     }
   }
 }
@@ -966,19 +1017,30 @@ TEST_P(CacheTest, ApplyToAllEntriesDuringResize) {
 }
 TEST_P(CacheTest, DefaultShardBits) {
-  // test1: set the flag to false. Insert more keys than capacity. See if they
-  // all go through.
-  std::shared_ptr<Cache> cache = NewCache(16 * 1024L * 1024L);
+  // Prevent excessive allocation (to save time & space)
+  estimated_value_size_ = 100000;
+  // Implementations use different minimum shard sizes
+  size_t min_shard_size = (GetParam() == kClock ? 32U * 1024U : 512U) * 1024U;
+  std::shared_ptr<Cache> cache = NewCache(32U * min_shard_size);
   ShardedCache* sc = dynamic_cast<ShardedCache*>(cache.get());
   ASSERT_EQ(5, sc->GetNumShardBits());
-  cache = NewLRUCache(511 * 1024L, -1, true);
+  cache = NewCache(min_shard_size / 1000U * 999U);
   sc = dynamic_cast<ShardedCache*>(cache.get());
   ASSERT_EQ(0, sc->GetNumShardBits());
-  cache = NewLRUCache(1024L * 1024L * 1024L, -1, true);
+  cache = NewCache(3U * 1024U * 1024U * 1024U);
   sc = dynamic_cast<ShardedCache*>(cache.get());
+  // current maximum of 6
   ASSERT_EQ(6, sc->GetNumShardBits());
+  if constexpr (sizeof(size_t) > 4) {
+    cache = NewCache(128U * min_shard_size);
+    sc = dynamic_cast<ShardedCache*>(cache.get());
+    // current maximum of 6
+    ASSERT_EQ(6, sc->GetNumShardBits());
+  }
 }
 TEST_P(CacheTest, GetChargeAndDeleter) {

1373
cache/clock_cache.cc

File diff suppressed because it is too large

1002
cache/clock_cache.h

File diff suppressed because it is too large

14
cache/fast_lru_cache.cc

@@ -173,13 +173,13 @@ inline int LRUHandleTable::FindSlot(const Slice& key,
 LRUCacheShard::LRUCacheShard(size_t capacity, size_t estimated_value_size,
                              bool strict_capacity_limit,
                              CacheMetadataChargePolicy metadata_charge_policy)
-    : capacity_(capacity),
+    : CacheShard(metadata_charge_policy),
+      capacity_(capacity),
       strict_capacity_limit_(strict_capacity_limit),
       table_(
           CalcHashBits(capacity, estimated_value_size, metadata_charge_policy)),
       usage_(0),
       lru_usage_(0) {
-  set_metadata_charge_policy(metadata_charge_policy);
   // Make empty circular linked list.
   lru_.next = &lru_;
   lru_.prev = &lru_;
@@ -525,6 +525,16 @@ size_t LRUCacheShard::GetPinnedUsage() const {
   return usage_ - lru_usage_;
 }
+size_t LRUCacheShard::GetOccupancyCount() const {
+  DMutexLock l(mutex_);
+  return table_.GetOccupancy();
+}
+size_t LRUCacheShard::GetTableAddressCount() const {
+  DMutexLock l(mutex_);
+  return table_.GetTableSize();
+}
 std::string LRUCacheShard::GetPrintableOptions() const { return std::string{}; }
 LRUCache::LRUCache(size_t capacity, size_t estimated_value_size,

2
cache/fast_lru_cache.h

@@ -368,6 +368,8 @@ class ALIGN_AS(CACHE_LINE_SIZE) LRUCacheShard final : public CacheShard {
   size_t GetUsage() const override;
   size_t GetPinnedUsage() const override;
+  size_t GetOccupancyCount() const override;
+  size_t GetTableAddressCount() const override;
   void ApplyToSomeEntries(
       const std::function<void(const Slice& key, void* value, size_t charge,

14
cache/lru_cache.cc

@@ -115,7 +115,8 @@ LRUCacheShard::LRUCacheShard(
     double low_pri_pool_ratio, bool use_adaptive_mutex,
     CacheMetadataChargePolicy metadata_charge_policy, int max_upper_hash_bits,
     const std::shared_ptr<SecondaryCache>& secondary_cache)
-    : capacity_(0),
+    : CacheShard(metadata_charge_policy),
+      capacity_(0),
       high_pri_pool_usage_(0),
       low_pri_pool_usage_(0),
       strict_capacity_limit_(strict_capacity_limit),
@@ -128,7 +129,6 @@ LRUCacheShard::LRUCacheShard(
       lru_usage_(0),
       mutex_(use_adaptive_mutex),
       secondary_cache_(secondary_cache) {
-  set_metadata_charge_policy(metadata_charge_policy);
   // Make empty circular linked list.
   lru_.next = &lru_;
   lru_.prev = &lru_;
@@ -759,6 +759,16 @@ size_t LRUCacheShard::GetPinnedUsage() const {
   return usage_ - lru_usage_;
 }
+size_t LRUCacheShard::GetOccupancyCount() const {
+  DMutexLock l(mutex_);
+  return table_.GetOccupancyCount();
+}
+size_t LRUCacheShard::GetTableAddressCount() const {
+  DMutexLock l(mutex_);
+  return size_t{1} << table_.GetLengthBits();
+}
 std::string LRUCacheShard::GetPrintableOptions() const {
   const int kBufferSize = 200;
   char buffer[kBufferSize];

4
cache/lru_cache.h

@@ -305,6 +305,8 @@ class LRUHandleTable {
   int GetLengthBits() const { return length_bits_; }
+  size_t GetOccupancyCount() const { return elems_; }
  private:
   // Return a pointer to slot that points to a cache entry that
   // matches key/hash. If there is no such cache entry, return a
@@ -394,6 +396,8 @@ class ALIGN_AS(CACHE_LINE_SIZE) LRUCacheShard final : public CacheShard {
   virtual size_t GetUsage() const override;
   virtual size_t GetPinnedUsage() const override;
+  virtual size_t GetOccupancyCount() const override;
+  virtual size_t GetTableAddressCount() const override;
   virtual void ApplyToSomeEntries(
       const std::function<void(const Slice& key, void* value, size_t charge,

498
cache/lru_cache_test.cc

@@ -521,11 +521,11 @@ class ClockCacheTest : public testing::Test {
     }
   }
-  void NewShard(size_t capacity) {
+  void NewShard(size_t capacity, bool strict_capacity_limit = true) {
     DeleteShard();
     shard_ = reinterpret_cast<ClockCacheShard*>(
         port::cacheline_aligned_alloc(sizeof(ClockCacheShard)));
-    new (shard_) ClockCacheShard(capacity, 1, true /*strict_capacity_limit*/,
+    new (shard_) ClockCacheShard(capacity, 1, strict_capacity_limit,
                                  kDontChargeCacheMetadata);
   }
@@ -539,21 +539,26 @@ class ClockCacheTest : public testing::Test {
     return Insert(std::string(kCacheKeySize, key), priority);
   }
-  Status Insert(char key, size_t len) { return Insert(std::string(len, key)); }
+  Status InsertWithLen(char key, size_t len) {
+    return Insert(std::string(len, key));
+  }
-  bool Lookup(const std::string& key) {
+  bool Lookup(const std::string& key, bool useful = true) {
     auto handle = shard_->Lookup(key, 0 /*hash*/);
     if (handle) {
-      shard_->Release(handle);
+      shard_->Release(handle, useful, /*erase_if_last_ref=*/false);
       return true;
     }
     return false;
   }
-  bool Lookup(char key) { return Lookup(std::string(kCacheKeySize, key)); }
+  bool Lookup(char key, bool useful = true) {
+    return Lookup(std::string(kCacheKeySize, key), useful);
+  }
   void Erase(const std::string& key) { shard_->Erase(key, 0 /*hash*/); }
+#if 0  // FIXME
   size_t CalcEstimatedHandleChargeWrapper(
       size_t estimated_value_size,
       CacheMetadataChargePolicy metadata_charge_policy) {
@@ -583,106 +588,419 @@ class ClockCacheTest : public testing::Test {
              (1 << (hash_bits - 1) <= max_occupancy);
     }
   }
+#endif
  private:
   ClockCacheShard* shard_ = nullptr;
 };
-TEST_F(ClockCacheTest, Validate) {
+TEST_F(ClockCacheTest, Misc) {
   NewShard(3);
-  EXPECT_OK(Insert('a', 16));
-  EXPECT_NOK(Insert('b', 15));
-  EXPECT_OK(Insert('b', 16));
-  EXPECT_NOK(Insert('c', 17));
-  EXPECT_NOK(Insert('d', 1000));
-  EXPECT_NOK(Insert('e', 11));
-  EXPECT_NOK(Insert('f', 0));
-}
-TEST_F(ClockCacheTest, ClockPriorityTest) {
-  ClockHandle handle;
-  EXPECT_EQ(handle.GetClockPriority(), ClockHandle::ClockPriority::NONE);
-  handle.SetClockPriority(ClockHandle::ClockPriority::HIGH);
-  EXPECT_EQ(handle.GetClockPriority(), ClockHandle::ClockPriority::HIGH);
-  handle.DecreaseClockPriority();
-  EXPECT_EQ(handle.GetClockPriority(), ClockHandle::ClockPriority::MEDIUM);
-  handle.DecreaseClockPriority();
-  EXPECT_EQ(handle.GetClockPriority(), ClockHandle::ClockPriority::LOW);
-  handle.SetClockPriority(ClockHandle::ClockPriority::MEDIUM);
-  EXPECT_EQ(handle.GetClockPriority(), ClockHandle::ClockPriority::MEDIUM);
-  handle.SetClockPriority(ClockHandle::ClockPriority::NONE);
-  EXPECT_EQ(handle.GetClockPriority(), ClockHandle::ClockPriority::NONE);
-  handle.SetClockPriority(ClockHandle::ClockPriority::MEDIUM);
-  EXPECT_EQ(handle.GetClockPriority(), ClockHandle::ClockPriority::MEDIUM);
-  handle.DecreaseClockPriority();
-  handle.DecreaseClockPriority();
-  EXPECT_EQ(handle.GetClockPriority(), ClockHandle::ClockPriority::NONE);
+  // Key size stuff
+  EXPECT_OK(InsertWithLen('a', 16));
+  EXPECT_NOK(InsertWithLen('b', 15));
+  EXPECT_OK(InsertWithLen('b', 16));
+  EXPECT_NOK(InsertWithLen('c', 17));
+  EXPECT_NOK(InsertWithLen('d', 1000));
+  EXPECT_NOK(InsertWithLen('e', 11));
+  EXPECT_NOK(InsertWithLen('f', 0));
+  // Some of this is motivated by code coverage
+  std::string wrong_size_key(15, 'x');
+  EXPECT_FALSE(Lookup(wrong_size_key));
+  EXPECT_FALSE(shard_->Ref(nullptr));
+  EXPECT_FALSE(shard_->Release(nullptr));
+  shard_->Erase(wrong_size_key, /*hash*/ 42);  // no-op
 }
-TEST_F(ClockCacheTest, CalcHashBitsTest) {
-  size_t capacity;
-  size_t estimated_value_size;
-  double max_occupancy;
-  int hash_bits;
-  CacheMetadataChargePolicy metadata_charge_policy;
+TEST_F(ClockCacheTest, Limits) {
+  NewShard(3, false /*strict_capacity_limit*/);
+  for (bool strict_capacity_limit : {false, true, false}) {
+    SCOPED_TRACE("strict_capacity_limit = " +
+                 std::to_string(strict_capacity_limit));

-  // Vary the cache capacity, fix the element charge.
-  for (int i = 0; i < 2048; i++) {
-    capacity = i;
-    estimated_value_size = 0;
-    metadata_charge_policy = kFullChargeCacheMetadata;
-    max_occupancy = CalcMaxOccupancy(capacity, estimated_value_size,
hash_bits = CalcHashBitsWrapper(capacity, estimated_value_size,
metadata_charge_policy);
EXPECT_TRUE(TableSizeIsAppropriate(hash_bits, max_occupancy));
// Also tests switching between strict limit and not
shard_->SetStrictCapacityLimit(strict_capacity_limit);
std::string key(16, 'x');
// Single entry charge beyond capacity
{
Status s = shard_->Insert(key, 0 /*hash*/, nullptr /*value*/,
5 /*charge*/, nullptr /*deleter*/,
nullptr /*handle*/, Cache::Priority::LOW);
if (strict_capacity_limit) {
EXPECT_TRUE(s.IsMemoryLimit());
} else {
EXPECT_OK(s);
}
}
// Single entry fills capacity
{
Cache::Handle* h;
ASSERT_OK(shard_->Insert(key, 0 /*hash*/, nullptr /*value*/, 3 /*charge*/,
nullptr /*deleter*/, &h, Cache::Priority::LOW));
// Try to insert more
Status s = Insert('a');
if (strict_capacity_limit) {
EXPECT_TRUE(s.IsMemoryLimit());
} else {
EXPECT_OK(s);
}
// Release entry filling capacity.
// Cover useful = false case.
shard_->Release(h, false /*useful*/, false /*erase_if_last_ref*/);
}
// Insert more than table size can handle (cleverly using zero-charge
// entries) to exceed occupancy limit.
{
size_t n = shard_->GetTableAddressCount() + 1;
std::unique_ptr<Cache::Handle* []> ha { new Cache::Handle* [n] {} };
Status s;
for (size_t i = 0; i < n && s.ok(); ++i) {
EncodeFixed64(&key[0], i);
s = shard_->Insert(key, 0 /*hash*/, nullptr /*value*/, 0 /*charge*/,
nullptr /*deleter*/, &ha[i], Cache::Priority::LOW);
if (i == 0) {
EXPECT_OK(s);
}
}
if (strict_capacity_limit) {
EXPECT_TRUE(s.IsMemoryLimit());
} else {
EXPECT_OK(s);
}
// Same result if not keeping a reference
s = Insert('a');
if (strict_capacity_limit) {
EXPECT_TRUE(s.IsMemoryLimit());
} else {
EXPECT_OK(s);
}
// Regardless, we didn't allow table to actually get full
EXPECT_LT(shard_->GetOccupancyCount(), shard_->GetTableAddressCount());
// Release handles
for (size_t i = 0; i < n; ++i) {
if (ha[i]) {
shard_->Release(ha[i]);
}
}
}
}
}
// Fix the cache capacity, vary the element charge.
for (int i = 0; i < 1024; i++) {
capacity = 1024;
estimated_value_size = i;
metadata_charge_policy = kFullChargeCacheMetadata;
max_occupancy = CalcMaxOccupancy(capacity, estimated_value_size,
metadata_charge_policy);
hash_bits = CalcHashBitsWrapper(capacity, estimated_value_size,
metadata_charge_policy);
EXPECT_TRUE(TableSizeIsAppropriate(hash_bits, max_occupancy));
TEST_F(ClockCacheTest, ClockEvictionTest) {
for (bool strict_capacity_limit : {false, true}) {
SCOPED_TRACE("strict_capacity_limit = " +
std::to_string(strict_capacity_limit));
NewShard(6, strict_capacity_limit);
EXPECT_OK(Insert('a', Cache::Priority::BOTTOM));
EXPECT_OK(Insert('b', Cache::Priority::LOW));
EXPECT_OK(Insert('c', Cache::Priority::HIGH));
EXPECT_OK(Insert('d', Cache::Priority::BOTTOM));
EXPECT_OK(Insert('e', Cache::Priority::LOW));
EXPECT_OK(Insert('f', Cache::Priority::HIGH));
EXPECT_TRUE(Lookup('a', /*use*/ false));
EXPECT_TRUE(Lookup('b', /*use*/ false));
EXPECT_TRUE(Lookup('c', /*use*/ false));
EXPECT_TRUE(Lookup('d', /*use*/ false));
EXPECT_TRUE(Lookup('e', /*use*/ false));
EXPECT_TRUE(Lookup('f', /*use*/ false));
// Ensure bottom are evicted first, even if new entries are low
EXPECT_OK(Insert('g', Cache::Priority::LOW));
EXPECT_OK(Insert('h', Cache::Priority::LOW));
EXPECT_FALSE(Lookup('a', /*use*/ false));
EXPECT_TRUE(Lookup('b', /*use*/ false));
EXPECT_TRUE(Lookup('c', /*use*/ false));
EXPECT_FALSE(Lookup('d', /*use*/ false));
EXPECT_TRUE(Lookup('e', /*use*/ false));
EXPECT_TRUE(Lookup('f', /*use*/ false));
// Mark g & h useful
EXPECT_TRUE(Lookup('g', /*use*/ true));
EXPECT_TRUE(Lookup('h', /*use*/ true));
// Then old LOW entries
EXPECT_OK(Insert('i', Cache::Priority::LOW));
EXPECT_OK(Insert('j', Cache::Priority::LOW));
EXPECT_FALSE(Lookup('b', /*use*/ false));
EXPECT_TRUE(Lookup('c', /*use*/ false));
EXPECT_FALSE(Lookup('e', /*use*/ false));
EXPECT_TRUE(Lookup('f', /*use*/ false));
// Mark g & h useful once again
EXPECT_TRUE(Lookup('g', /*use*/ true));
EXPECT_TRUE(Lookup('h', /*use*/ true));
EXPECT_TRUE(Lookup('i', /*use*/ false));
EXPECT_TRUE(Lookup('j', /*use*/ false));
// Then old HIGH entries
EXPECT_OK(Insert('k', Cache::Priority::LOW));
EXPECT_OK(Insert('l', Cache::Priority::LOW));
EXPECT_FALSE(Lookup('c', /*use*/ false));
EXPECT_FALSE(Lookup('f', /*use*/ false));
EXPECT_TRUE(Lookup('g', /*use*/ false));
EXPECT_TRUE(Lookup('h', /*use*/ false));
EXPECT_TRUE(Lookup('i', /*use*/ false));
EXPECT_TRUE(Lookup('j', /*use*/ false));
EXPECT_TRUE(Lookup('k', /*use*/ false));
EXPECT_TRUE(Lookup('l', /*use*/ false));
// Then the (roughly) least recently useful
EXPECT_OK(Insert('m', Cache::Priority::HIGH));
EXPECT_OK(Insert('n', Cache::Priority::HIGH));
EXPECT_TRUE(Lookup('g', /*use*/ false));
EXPECT_TRUE(Lookup('h', /*use*/ false));
EXPECT_FALSE(Lookup('i', /*use*/ false));
EXPECT_FALSE(Lookup('j', /*use*/ false));
EXPECT_TRUE(Lookup('k', /*use*/ false));
EXPECT_TRUE(Lookup('l', /*use*/ false));
// Now try changing capacity down
shard_->SetCapacity(4);
// Insert to ensure evictions happen
EXPECT_OK(Insert('o', Cache::Priority::LOW));
EXPECT_OK(Insert('p', Cache::Priority::LOW));
EXPECT_FALSE(Lookup('g', /*use*/ false));
EXPECT_FALSE(Lookup('h', /*use*/ false));
EXPECT_FALSE(Lookup('k', /*use*/ false));
EXPECT_FALSE(Lookup('l', /*use*/ false));
EXPECT_TRUE(Lookup('m', /*use*/ false));
EXPECT_TRUE(Lookup('n', /*use*/ false));
EXPECT_TRUE(Lookup('o', /*use*/ false));
EXPECT_TRUE(Lookup('p', /*use*/ false));
// Now try changing capacity up
EXPECT_TRUE(Lookup('m', /*use*/ true));
EXPECT_TRUE(Lookup('n', /*use*/ true));
shard_->SetCapacity(6);
EXPECT_OK(Insert('q', Cache::Priority::HIGH));
EXPECT_OK(Insert('r', Cache::Priority::HIGH));
EXPECT_OK(Insert('s', Cache::Priority::HIGH));
EXPECT_OK(Insert('t', Cache::Priority::HIGH));
EXPECT_FALSE(Lookup('o', /*use*/ false));
EXPECT_FALSE(Lookup('p', /*use*/ false));
EXPECT_TRUE(Lookup('m', /*use*/ false));
EXPECT_TRUE(Lookup('n', /*use*/ false));
EXPECT_TRUE(Lookup('q', /*use*/ false));
EXPECT_TRUE(Lookup('r', /*use*/ false));
EXPECT_TRUE(Lookup('s', /*use*/ false));
EXPECT_TRUE(Lookup('t', /*use*/ false));
}
}
// Zero-capacity cache, and only values have charge.
capacity = 0;
estimated_value_size = 1;
metadata_charge_policy = kDontChargeCacheMetadata;
hash_bits = CalcHashBitsWrapper(capacity, estimated_value_size,
metadata_charge_policy);
EXPECT_TRUE(TableSizeIsAppropriate(hash_bits, 0 /* max_occupancy */));
void IncrementIntDeleter(const Slice& /*key*/, void* value) {
*reinterpret_cast<int*>(value) += 1;
}
// Zero-capacity cache, and only metadata has charge.
capacity = 0;
estimated_value_size = 0;
metadata_charge_policy = kFullChargeCacheMetadata;
hash_bits = CalcHashBitsWrapper(capacity, estimated_value_size,
metadata_charge_policy);
EXPECT_TRUE(TableSizeIsAppropriate(hash_bits, 0 /* max_occupancy */));
// Testing calls to CorrectNearOverflow in Release
TEST_F(ClockCacheTest, ClockCounterOverflowTest) {
NewShard(6, /*strict_capacity_limit*/ false);
Cache::Handle* h;
int deleted = 0;
std::string my_key(kCacheKeySize, 'x');
uint32_t my_hash = 42;
ASSERT_OK(shard_->Insert(my_key, my_hash, &deleted, 1, IncrementIntDeleter,
&h, Cache::Priority::HIGH));
// Some large number outstanding
shard_->TEST_RefN(h, 123456789);
// Simulate many lookup/ref + release, plenty to overflow counters
for (int i = 0; i < 10000; ++i) {
shard_->TEST_RefN(h, 1234567);
shard_->TEST_ReleaseN(h, 1234567);
}
// Mark it invisible (to reach a different CorrectNearOverflow() in Release)
shard_->Erase(my_key, my_hash);
// Simulate many more lookup/ref + release (one-by-one would be too
// expensive for unit test)
for (int i = 0; i < 10000; ++i) {
shard_->TEST_RefN(h, 1234567);
shard_->TEST_ReleaseN(h, 1234567);
}
// Free all but last 1
shard_->TEST_ReleaseN(h, 123456789);
// Still alive
ASSERT_EQ(deleted, 0);
// Free last ref, which will finalize erasure
shard_->Release(h);
// Deleted
ASSERT_EQ(deleted, 1);
}
// Small cache, large elements.
capacity = 1024;
estimated_value_size = 8192;
metadata_charge_policy = kFullChargeCacheMetadata;
hash_bits = CalcHashBitsWrapper(capacity, estimated_value_size,
metadata_charge_policy);
EXPECT_TRUE(TableSizeIsAppropriate(hash_bits, 0 /* max_occupancy */));
// This test is mostly to exercise some corner case logic, by forcing two
// keys to have the same hash, and more
TEST_F(ClockCacheTest, CollidingInsertEraseTest) {
NewShard(6, /*strict_capacity_limit*/ false);
int deleted = 0;
std::string key1(kCacheKeySize, 'x');
std::string key2(kCacheKeySize, 'y');
std::string key3(kCacheKeySize, 'z');
uint32_t my_hash = 42;
Cache::Handle* h1;
ASSERT_OK(shard_->Insert(key1, my_hash, &deleted, 1, IncrementIntDeleter, &h1,
Cache::Priority::HIGH));
Cache::Handle* h2;
ASSERT_OK(shard_->Insert(key2, my_hash, &deleted, 1, IncrementIntDeleter, &h2,
Cache::Priority::HIGH));
Cache::Handle* h3;
ASSERT_OK(shard_->Insert(key3, my_hash, &deleted, 1, IncrementIntDeleter, &h3,
Cache::Priority::HIGH));
// Can repeatedly lookup+release despite the hash collision
Cache::Handle* tmp_h;
for (bool erase_if_last_ref : {true, false}) { // but not last ref
tmp_h = shard_->Lookup(key1, my_hash);
ASSERT_EQ(h1, tmp_h);
ASSERT_FALSE(shard_->Release(tmp_h, erase_if_last_ref));
tmp_h = shard_->Lookup(key2, my_hash);
ASSERT_EQ(h2, tmp_h);
ASSERT_FALSE(shard_->Release(tmp_h, erase_if_last_ref));
tmp_h = shard_->Lookup(key3, my_hash);
ASSERT_EQ(h3, tmp_h);
ASSERT_FALSE(shard_->Release(tmp_h, erase_if_last_ref));
}
// Make h1 invisible
shard_->Erase(key1, my_hash);
// Redundant erase
shard_->Erase(key1, my_hash);
// All still alive
ASSERT_EQ(deleted, 0);
// Invisible to Lookup
tmp_h = shard_->Lookup(key1, my_hash);
ASSERT_EQ(nullptr, tmp_h);
// Can still find h2, h3
for (bool erase_if_last_ref : {true, false}) { // but not last ref
tmp_h = shard_->Lookup(key2, my_hash);
ASSERT_EQ(h2, tmp_h);
ASSERT_FALSE(shard_->Release(tmp_h, erase_if_last_ref));
tmp_h = shard_->Lookup(key3, my_hash);
ASSERT_EQ(h3, tmp_h);
ASSERT_FALSE(shard_->Release(tmp_h, erase_if_last_ref));
}
// Also Insert with invisible entry there
ASSERT_OK(shard_->Insert(key1, my_hash, &deleted, 1, IncrementIntDeleter,
nullptr, Cache::Priority::HIGH));
tmp_h = shard_->Lookup(key1, my_hash);
// Found but distinct handle
ASSERT_NE(nullptr, tmp_h);
ASSERT_NE(h1, tmp_h);
ASSERT_TRUE(shard_->Release(tmp_h, /*erase_if_last_ref*/ true));
// tmp_h deleted
ASSERT_EQ(deleted--, 1);
// Release last ref on h1 (already invisible)
ASSERT_TRUE(shard_->Release(h1, /*erase_if_last_ref*/ false));
// h1 deleted
ASSERT_EQ(deleted--, 1);
h1 = nullptr;
// Can still find h2, h3
for (bool erase_if_last_ref : {true, false}) { // but not last ref
tmp_h = shard_->Lookup(key2, my_hash);
ASSERT_EQ(h2, tmp_h);
ASSERT_FALSE(shard_->Release(tmp_h, erase_if_last_ref));
tmp_h = shard_->Lookup(key3, my_hash);
ASSERT_EQ(h3, tmp_h);
ASSERT_FALSE(shard_->Release(tmp_h, erase_if_last_ref));
}
// Release last ref on h2
ASSERT_FALSE(shard_->Release(h2, /*erase_if_last_ref*/ false));
// h2 still not deleted (unreferenced in cache)
ASSERT_EQ(deleted, 0);
// Can still find it
tmp_h = shard_->Lookup(key2, my_hash);
ASSERT_EQ(h2, tmp_h);
// Release last ref on h2, with erase
ASSERT_TRUE(shard_->Release(h2, /*erase_if_last_ref*/ true));
// h2 deleted
ASSERT_EQ(deleted--, 1);
tmp_h = shard_->Lookup(key2, my_hash);
ASSERT_EQ(nullptr, tmp_h);
// Can still find h3
for (bool erase_if_last_ref : {true, false}) { // but not last ref
tmp_h = shard_->Lookup(key3, my_hash);
ASSERT_EQ(h3, tmp_h);
ASSERT_FALSE(shard_->Release(tmp_h, erase_if_last_ref));
}
// Large capacity.
capacity = 31924172;
estimated_value_size = 8192;
metadata_charge_policy = kFullChargeCacheMetadata;
max_occupancy =
CalcMaxOccupancy(capacity, estimated_value_size, metadata_charge_policy);
hash_bits = CalcHashBitsWrapper(capacity, estimated_value_size,
metadata_charge_policy);
EXPECT_TRUE(TableSizeIsAppropriate(hash_bits, max_occupancy));
// Release last ref on h3, without erase
ASSERT_FALSE(shard_->Release(h3, /*erase_if_last_ref*/ false));
// h3 still not deleted (unreferenced in cache)
ASSERT_EQ(deleted, 0);
// Explicit erase
shard_->Erase(key3, my_hash);
// h3 deleted
ASSERT_EQ(deleted--, 1);
tmp_h = shard_->Lookup(key3, my_hash);
ASSERT_EQ(nullptr, tmp_h);
}
// This uses the public API to effectively test CalcHashBits etc.
TEST_F(ClockCacheTest, TableSizesTest) {
for (size_t est_val_size : {1U, 5U, 123U, 2345U, 345678U}) {
SCOPED_TRACE("est_val_size = " + std::to_string(est_val_size));
for (double est_count : {1.1, 2.2, 511.9, 512.1, 2345.0}) {
SCOPED_TRACE("est_count = " + std::to_string(est_count));
size_t capacity = static_cast<size_t>(est_val_size * est_count);
// kDontChargeCacheMetadata
auto cache = ExperimentalNewClockCache(
capacity, est_val_size, /*num_shard_bits*/ -1,
/*strict_capacity_limit*/ false, kDontChargeCacheMetadata);
// Table sizes are currently only powers of two
EXPECT_GE(cache->GetTableAddressCount(), est_count / kLoadFactor);
EXPECT_LE(cache->GetTableAddressCount(), est_count / kLoadFactor * 2.0);
EXPECT_EQ(cache->GetUsage(), 0);
// kFullChargeCacheMetadata
// Because table sizes are currently only powers of two, sizes get
// really weird when metadata is a huge portion of capacity. For example,
// doubling the table size could cut by 90% the space available to
// values. Therefore, we omit those weird cases for now.
if (est_val_size >= 512) {
cache = ExperimentalNewClockCache(
capacity, est_val_size, /*num_shard_bits*/ -1,
/*strict_capacity_limit*/ false, kFullChargeCacheMetadata);
double est_count_after_meta =
(capacity - cache->GetUsage()) * 1.0 / est_val_size;
EXPECT_GE(cache->GetTableAddressCount(),
est_count_after_meta / kLoadFactor);
EXPECT_LE(cache->GetTableAddressCount(),
est_count_after_meta / kLoadFactor * 2.0);
}
}
}
}
} // namespace clock_cache
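
Review note on TableSizesTest: the two bounds it checks (at least est_count / kLoadFactor, at most twice that) follow from rounding the required slot count up to a power of two. A minimal sketch of that arithmetic, assuming a single table; `SketchTableAddressCount` is a hypothetical helper, and the load factor is passed in because the value of kLoadFactor is not shown in this diff:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not RocksDB code): smallest power-of-two slot count
// that keeps `est_count` entries at or under `load_factor`.
std::size_t SketchTableAddressCount(double est_count, double load_factor) {
  assert(est_count > 0.0);
  assert(load_factor > 0.0 && load_factor <= 1.0);
  // Minimum number of slots needed to stay within the load factor.
  const double min_slots = est_count / load_factor;
  std::size_t slots = 1;
  // Double until the table is large enough; the result is always less than
  // 2 * min_slots whenever min_slots > 1.
  while (static_cast<double>(slots) < min_slots) {
    slots <<= 1;
  }
  return slots;
}
```

With an assumed load factor of 0.7, an estimated count of 511.9 needs about 731.3 slots and rounds up to 1024, which sits inside the [731.3, 1462.6] window the test would accept.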

@ -213,9 +213,9 @@ std::string ShardedCache::GetPrintableOptions() const {
ret.append(GetShard(0)->GetPrintableOptions());
return ret;
}
int GetDefaultCacheShardBits(size_t capacity) {
int GetDefaultCacheShardBits(size_t capacity, size_t min_shard_size) {
int num_shard_bits = 0;
size_t min_shard_size = 512L * 1024L; // Every shard is at least 512KB.
size_t num_shards = capacity / min_shard_size;
while (num_shards >>= 1) {
if (++num_shard_bits >= 6) {
@ -230,4 +230,21 @@ int ShardedCache::GetNumShardBits() const { return BitsSetToOne(shard_mask_); }
uint32_t ShardedCache::GetNumShards() const { return shard_mask_ + 1; }
size_t ShardedCache::GetOccupancyCount() const {
size_t oc = 0;
uint32_t num_shards = GetNumShards();
for (uint32_t s = 0; s < num_shards; s++) {
oc += GetShard(s)->GetOccupancyCount();
}
return oc;
}
size_t ShardedCache::GetTableAddressCount() const {
size_t tac = 0;
uint32_t num_shards = GetNumShards();
for (uint32_t s = 0; s < num_shards; s++) {
tac += GetShard(s)->GetTableAddressCount();
}
return tac;
}
} // namespace ROCKSDB_NAMESPACE
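
Review note on GetDefaultCacheShardBits: with min_shard_size now a parameter, a caller can ask for fewer, larger shards. A minimal standalone sketch of the loop visible in the hunk; `SketchDefaultShardBits` is a hypothetical name, and the early return at the cap is an assumption because the hunk is cut off right after the `>= 6` comparison:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical standalone mirror of the loop in the hunk above; this is not
// the RocksDB function itself.
int SketchDefaultShardBits(std::size_t capacity,
                           std::size_t min_shard_size = 512U * 1024U) {
  assert(min_shard_size > 0);
  int num_shard_bits = 0;
  // How many minimum-size shards fit into the capacity.
  std::size_t num_shards = capacity / min_shard_size;
  // Each halving of the shard count contributes one shard bit.
  while (num_shards >>= 1) {
    // Cap taken from the `>= 6` comparison in the diff; returning here is
    // assumed, since the rest of the function is not shown.
    if (++num_shard_bits >= 6) {
      return num_shard_bits;
    }
  }
  return num_shard_bits;
}
```

Under these assumptions an 8MB cache with the traditional 512KB minimum gets 4 shard bits (16 shards), while the same cache with a 4MB minimum gets 1 bit (2 shards).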

@ -20,7 +20,8 @@ namespace ROCKSDB_NAMESPACE {
// Single cache shard interface.
class CacheShard {
public:
CacheShard() = default;
explicit CacheShard(CacheMetadataChargePolicy metadata_charge_policy)
: metadata_charge_policy_(metadata_charge_policy) {}
virtual ~CacheShard() = default;
using DeleterFn = Cache::DeleterFn;
@ -47,6 +48,8 @@ class CacheShard {
virtual void SetStrictCapacityLimit(bool strict_capacity_limit) = 0;
virtual size_t GetUsage() const = 0;
virtual size_t GetPinnedUsage() const = 0;
virtual size_t GetOccupancyCount() const = 0;
virtual size_t GetTableAddressCount() const = 0;
// Handles iterating over roughly `average_entries_per_lock` entries, using
// `state` to somehow record where it last ended up. Caller initially uses
// *state == 0 and implementation sets *state = UINT32_MAX to indicate
@ -57,13 +60,9 @@ class CacheShard {
uint32_t average_entries_per_lock, uint32_t* state) = 0;
virtual void EraseUnRefEntries() = 0;
virtual std::string GetPrintableOptions() const { return ""; }
void set_metadata_charge_policy(
CacheMetadataChargePolicy metadata_charge_policy) {
metadata_charge_policy_ = metadata_charge_policy;
}
protected:
CacheMetadataChargePolicy metadata_charge_policy_ = kDontChargeCacheMetadata;
const CacheMetadataChargePolicy metadata_charge_policy_;
};
// Generic cache interface which shards cache by hash of keys. 2^num_shard_bits
@ -106,6 +105,8 @@ class ShardedCache : public Cache {
virtual size_t GetUsage() const override;
virtual size_t GetUsage(Handle* handle) const override;
virtual size_t GetPinnedUsage() const override;
virtual size_t GetOccupancyCount() const override;
virtual size_t GetTableAddressCount() const override;
virtual void ApplyToAllEntries(
const std::function<void(const Slice& key, void* value, size_t charge,
DeleterFn deleter)>& callback,
@ -127,6 +128,8 @@ class ShardedCache : public Cache {
std::atomic<uint64_t> last_id_;
};
extern int GetDefaultCacheShardBits(size_t capacity);
// 512KB is traditional minimum shard size.
int GetDefaultCacheShardBits(size_t capacity,
size_t min_shard_size = 512U * 1024U);
} // namespace ROCKSDB_NAMESPACE

@ -939,11 +939,15 @@ TEST_F(DBBlockCacheTest, AddRedundantStats) {
for (std::shared_ptr<Cache> base_cache :
{NewLRUCache(capacity, num_shard_bits),
ExperimentalNewClockCache(
capacity, 1 /*estimated_value_size*/, num_shard_bits,
false /*strict_capacity_limit*/, kDefaultCacheMetadataChargePolicy),
NewFastLRUCache(capacity, 1 /*estimated_value_size*/, num_shard_bits,
false /*strict_capacity_limit*/,
kDefaultCacheMetadataChargePolicy)}) {
capacity,
BlockBasedTableOptions().block_size /*estimated_value_size*/,
num_shard_bits, false /*strict_capacity_limit*/,
kDefaultCacheMetadataChargePolicy),
NewFastLRUCache(
capacity,
BlockBasedTableOptions().block_size /*estimated_value_size*/,
num_shard_bits, false /*strict_capacity_limit*/,
kDefaultCacheMetadataChargePolicy)}) {
if (!base_cache) {
// Skip clock cache when not supported
continue;
@ -1298,10 +1302,11 @@ TEST_F(DBBlockCacheTest, CacheEntryRoleStats) {
for (bool partition : {false, true}) {
for (std::shared_ptr<Cache> cache :
{NewLRUCache(capacity),
ExperimentalNewClockCache(capacity, 1 /*estimated_value_size*/,
-1 /*num_shard_bits*/,
false /*strict_capacity_limit*/,
kDefaultCacheMetadataChargePolicy)}) {
ExperimentalNewClockCache(
capacity,
BlockBasedTableOptions().block_size /*estimated_value_size*/,
-1 /*num_shard_bits*/, false /*strict_capacity_limit*/,
kDefaultCacheMetadataChargePolicy)}) {
if (!cache) {
// Skip clock cache when not supported
continue;

@ -671,6 +671,9 @@ void InternalStats::CacheEntryRoleStats::BeginCollection(
<< port::GetProcessID();
cache_id = str.str();
cache_capacity = cache->GetCapacity();
cache_usage = cache->GetUsage();
table_size = cache->GetTableAddressCount();
occupancy = cache->GetOccupancyCount();
}
void InternalStats::CacheEntryRoleStats::EndCollection(
@ -695,6 +698,8 @@ std::string InternalStats::CacheEntryRoleStats::ToString(
std::ostringstream str;
str << "Block cache " << cache_id
<< " capacity: " << BytesToHumanString(cache_capacity)
<< " usage: " << BytesToHumanString(cache_usage)
<< " table_size: " << table_size << " occupancy: " << occupancy
<< " collections: " << collection_count
<< " last_copies: " << copies_of_last_collection
<< " last_secs: " << (GetLastDurationMicros() / 1000000.0)

@ -453,6 +453,9 @@ class InternalStats {
// For use with CacheEntryStatsCollector
struct CacheEntryRoleStats {
uint64_t cache_capacity = 0;
uint64_t cache_usage = 0;
size_t table_size = 0;
size_t occupancy = 0;
std::string cache_id;
std::array<uint64_t, kNumCacheEntryRoles> total_charges;
std::array<size_t, kNumCacheEntryRoles> entry_counts;

@ -404,6 +404,16 @@ class Cache {
// Returns the memory size for the entries residing in the cache.
virtual size_t GetUsage() const = 0;
// Returns the number of entries currently tracked in the table. SIZE_MAX
// means "not supported." This is used for inspecting the load factor, along
// with GetTableAddressCount().
virtual size_t GetOccupancyCount() const { return SIZE_MAX; }
// Returns the number of ways the hash function is divided for addressing
// entries. Zero means "not supported." This is used for inspecting the load
// factor, along with GetOccupancyCount().
virtual size_t GetTableAddressCount() const { return 0; }
// Returns the memory size for a specific entry in the cache.
virtual size_t GetUsage(Handle* handle) const = 0;
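
Review note on the two new accessors: the comments say they exist for inspecting the load factor, and each has its own "not supported" sentinel (SIZE_MAX for occupancy, zero for table addresses). A minimal sketch of how a caller might combine them; `SketchLoadFactor` is a hypothetical helper that takes the two raw values and is not part of the Cache API:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of the Cache API): computes
// occupancy / table_addresses, or returns false when either value is the
// "not supported" sentinel described in the comments above.
bool SketchLoadFactor(std::size_t occupancy, std::size_t table_addresses,
                      double* out) {
  assert(out != nullptr);
  // SIZE_MAX occupancy or zero table addresses means "not supported".
  if (occupancy == SIZE_MAX || table_addresses == 0) {
    return false;
  }
  *out = static_cast<double>(occupancy) /
         static_cast<double>(table_addresses);
  return true;
}
```

A caller would pass cache->GetOccupancyCount() and cache->GetTableAddressCount(); checking both sentinels avoids dividing by zero for cache implementations that keep the base-class defaults.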

@ -560,7 +560,7 @@ DEFINE_bool(universal_incremental, false,
DEFINE_int64(cache_size, 8 << 20, // 8MB
"Number of bytes to use as a cache of uncompressed data");
DEFINE_int32(cache_numshardbits, 6,
DEFINE_int32(cache_numshardbits, -1,
"Number of shards for the block cache"
" is 2 ** cache_numshardbits. Negative means use default settings."
" This is applied only if FLAGS_cache_size is non-negative.");
@ -3618,6 +3618,9 @@ class Benchmark {
}
fresh_db = true;
method = &Benchmark::TimeSeries;
} else if (name == "block_cache_entry_stats") {
// DB::Properties::kBlockCacheEntryStats
PrintStats("rocksdb.block-cache-entry-stats");
} else if (name == "stats") {
PrintStats("rocksdb.stats");
} else if (name == "resetstats") {
