rocksdb

fork of https://github.com/oxigraph/rocksdb and https://github.com/facebook/rocksdb for nextgraph and oxigraph

Peter Dillinger b55b2f45d0 Faster new DynamicBloom implementation (for memtable) (#5762)

Summary: Since DynamicBloom is now only used in-memory, we're free to change it without schema compatibility issues. The new implementation is drawn from (with manifest permission) `303542a767/bloom_simulation_tests/foo.cc (L613)`

This has several speed advantages over the prior implementation:
* Uses fastrange instead of %
* Minimum logic to determine first (and all) probed memory addresses
* (Major) Two probes per 64-bit memory fetch/write.
* Very fast and effective (murmur-like) hash expansion/re-mixing. (At least on recent CPUs, integer multiplication is very cheap.)

While a Bloom filter with 512-bit cache locality has about a 1.15x FP rate penalty (e.g. 0.84% to 0.97%), further restricting to two probes per 64 bits incurs an additional 1.12x FP rate penalty (e.g. 0.97% to 1.09%). Nevertheless, the unit tests show no "mediocre" FP rate samples, unlike the old implementation with more erratic FP rates.

Especially for the memtable, we expect speed to outweigh somewhat higher FP rates. For example, a negative table query would have to be 1000x slower than a BF query to justify doubling BF query time to shave 10% off FP rate (working assumption around 1% FP rate). While that seems likely for SSTs, my data suggests a speed factor of roughly 50x for the memtable (vs. BF; ~1.5% lower write throughput when enabling memtable Bloom filter, after this change). Thus, it's probably not worth even 5% more time in the Bloom filter to shave off 1/10th of the Bloom FP rate, or 0.1% in absolute terms, and it's probably at least 20% slower to recoup that much FP rate from this new implementation. Because of this, we do not see a need for a 'locality' option that affects the MemTable Bloom filter and have decoupled the MemTable Bloom filter from Options::bloom_locality.

Note that just 3% more memory to the Bloom filter (10.3 bits per key vs. just 10) is able to make up for the ~12% FP rate drop in the new implementation:

[] # Nearly "ideal" FP-wise but reasonably fast cache-local implementation
[~/wormhashing/bloom_simulation_tests] ./foo_gcc_IMPL_CACHE_WORM64_FROM32_any.out 10000000 6 10 $RANDOM 100000000
./foo_gcc_IMPL_CACHE_WORM64_FROM32_any.out time: 3.29372 sampled_fp_rate: 0.00985956 ...

[] # Close match to this new implementation
[~/wormhashing/bloom_simulation_tests] ./foo_gcc_IMPL_CACHE_MUL64_BLOCK_FROM32_any.out 10000000 6 10.3 $RANDOM 100000000
./foo_gcc_IMPL_CACHE_MUL64_BLOCK_FROM32_any.out time: 2.10072 sampled_fp_rate: 0.00985655 ...

[] # Old locality=1 implementation
[~/wormhashing/bloom_simulation_tests] ./foo_gcc_IMPL_CACHE_ROCKSDB_DYNAMIC_any.out 10000000 6 10 $RANDOM 100000000
./foo_gcc_IMPL_CACHE_ROCKSDB_DYNAMIC_any.out time: 3.95472 sampled_fp_rate: 0.00988943 ...

Also note the dramatic speed improvement vs. alternatives.

--

Performance unit test: DynamicBloomTest.concurrent_with_perf is updated to report more precise timing data. (Measure running time of each thread, not just longest running thread, etc.) Results averaged over various sizes enabled with --enable_perf and 20 runs each; old dynamic bloom refers to locality=1, the faster of the old:

old dynamic bloom, avg add latency = 65.6468
new dynamic bloom, avg add latency = 44.3809
old dynamic bloom, avg query latency = 50.6485
new dynamic bloom, avg query latency = 43.2186
old avg parallel add latency = 41.678
new avg parallel add latency = 24.5238
old avg parallel hit latency = 14.6322
new avg parallel hit latency = 12.3939
old avg parallel miss latency = 16.7289
new avg parallel miss latency = 12.2134

Tested on a dedicated 64-bit production machine at Facebook. Significant improvement all around.

Despite now using std::atomic<uint64_t>, quick before-and-after test on a 32-bit machine (Intel Atom N270, released 2008) shows no regression in performance, in some cases modest improvement.

--

Performance integration test (synthetic): with DEBUG_LEVEL=0, used

TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=fillrandom,readmissing,readrandom,stats --num=2000000

and optionally with

-memtable_whole_key_filtering -memtable_bloom_size_ratio=0.01

300 runs each configuration.

Write throughput change by enabling memtable bloom:
Old locality=0: -3.06%
Old locality=1: -2.37%
New: -1.50%
conclusion -> seems to substantially close the gap

Readmissing throughput change by enabling memtable bloom:
Old locality=0: +34.47%
Old locality=1: +34.80%
New: +33.25%
conclusion -> maybe a small new penalty from FP rate

Readrandom throughput change by enabling memtable bloom:
Old locality=0: +31.54%
Old locality=1: +31.13%
New: +30.60%
conclusion -> maybe also from FP rate (after memtable flush)

--

Another conclusion we can draw from this new implementation is that the existing 32-bit hash function is not inherently crippling the Bloom filter speed or accuracy, below about 5 million keys. For speed, the implementation is essentially the same whether starting with 32-bits or 64-bits of hash; it just determines whether the first multiplication after fastrange is a pseudorandom expansion or needed re-mix. Note that this multiplication can occur while memory is fetching.

For accuracy, in a standard configuration, you need about 5 million keys before you have about a 1.1x FP penalty due to using a 32-bit hash vs. 64-bit:

[~/wormhashing/bloom_simulation_tests] ./foo_gcc_IMPL_CACHE_MUL64_BLOCK_FROM32_any.out $((5 * 1000 * 1000 * 10)) 6 10 $RANDOM 100000000
./foo_gcc_IMPL_CACHE_MUL64_BLOCK_FROM32_any.out time: 2.52069 sampled_fp_rate: 0.0118267 ...

[~/wormhashing/bloom_simulation_tests] ./foo_gcc_IMPL_CACHE_MUL64_BLOCK_any.out $((5 * 1000 * 1000 * 10)) 6 10 $RANDOM 100000000
./foo_gcc_IMPL_CACHE_MUL64_BLOCK_any.out time: 2.43871 sampled_fp_rate: 0.0109059

Pull Request resolved: https://github.com/facebook/rocksdb/pull/5762
Differential Revision: D17214194
Pulled By: pdillinger
fbshipit-source-id: ad9da031772e985fd6b62a0e1db8e81892520595

6 years ago
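The speed advantages listed in the commit message above (fastrange instead of %, two bit probes per 64-bit word, and a multiply-based hash expansion) can be illustrated with a minimal C++ sketch. This is not the code from the commit: the names `SketchBloom`, `FastRange32`, `Mix32` and `TwoBitMask`, the choice of multiplier, the 12-bit rotation and the `a ^ i` word selection are all assumptions made for illustration, as is the use of the MurmurHash3 finalizer as a stand-in key hash.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Illustrative sketch only; every name below is hypothetical and is not
// RocksDB's DynamicBloom API.

// fastrange: maps a 32-bit hash onto [0, range) with one multiply and one
// shift, avoiding the integer division behind `hash % range`.
inline uint32_t FastRange32(uint32_t hash, uint32_t range) {
  return static_cast<uint32_t>((static_cast<uint64_t>(hash) * range) >> 32);
}

// Stand-in 32-bit key hash (the MurmurHash3 finalizer). Assumed here only so
// that callers have well-distributed inputs to feed the filter.
inline uint32_t Mix32(uint32_t x) {
  x ^= x >> 16;
  x *= 0x85ebca6bu;
  x ^= x >> 13;
  x *= 0xc2b2ae35u;
  x ^= x >> 16;
  return x;
}

class SketchBloom {
 public:
  // total_bits is rounded up to whole 512-bit blocks (8 words of 64 bits).
  // num_probes must be even because each word access sets or tests two bits.
  SketchBloom(uint32_t total_bits, int num_probes)
      : num_double_probes_(num_probes / 2),
        num_words_(((total_bits + 511) / 512) * 8),
        words_(num_words_) {
    assert(total_bits > 0);
    assert(num_probes >= 2 && num_probes % 2 == 0 && num_probes <= 16);
  }

  void Add(uint32_t h32) {
    const size_t a = FastRange32(h32, num_words_);  // first probed word
    uint64_t h = kMultiplier * h32;                 // expand 32 -> 64 bits
    for (int i = 0; i < num_double_probes_; ++i) {
      // XOR with i < 8 only changes the low 3 bits of the index, so every
      // probe stays inside the same aligned 8-word (512-bit) block.
      words_[a ^ static_cast<size_t>(i)].fetch_or(TwoBitMask(h),
                                                  std::memory_order_relaxed);
      h = (h >> 12) | (h << 52);  // rotate to expose 12 fresh hash bits
    }
  }

  bool MayContain(uint32_t h32) const {
    const size_t a = FastRange32(h32, num_words_);
    uint64_t h = kMultiplier * h32;
    for (int i = 0; i < num_double_probes_; ++i) {
      const uint64_t mask = TwoBitMask(h);
      const uint64_t word =
          words_[a ^ static_cast<size_t>(i)].load(std::memory_order_relaxed);
      if ((word & mask) != mask) return false;  // definitely not present
      h = (h >> 12) | (h << 52);
    }
    return true;  // present, or a false positive
  }

 private:
  // 64-bit golden-ratio constant, assumed here as the re-mixing multiplier.
  static constexpr uint64_t kMultiplier = 0x9e3779b97f4a7c13ULL;

  // Two bit positions within one 64-bit word, taken from 12 bits of hash.
  static uint64_t TwoBitMask(uint64_t h) {
    return (uint64_t{1} << (h & 63)) | (uint64_t{1} << ((h >> 6) & 63));
  }

  int num_double_probes_;
  uint32_t num_words_;
  std::vector<std::atomic<uint64_t>> words_;
};
```

A caller would hash each key to 32 bits (for example with `Mix32`), call `Add` on insert, and treat a `false` from `MayContain` as proof of absence. Keeping all probes for a key within one 512-bit block is what gives the cache locality the commit message discusses, at the cost of the somewhat higher false-positive rate it quantifies.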
buckifier	Change buckifier to support parameterized dependencies (#5648 )	6 years ago
build_tools	Port folly/synchronization/DistributedMutex to rocksdb (#5642 )	6 years ago
cache	Cleaned up and simplified LRU cache implementation (#5579 )	6 years ago
cmake	cmake: s/SNAPPY_LIBRARIES/snappy_LIBRARIES/ (#5687 )	6 years ago
coverage	Fix interpreter lines for files with python2-only syntax.	6 years ago
db	Faster new DynamicBloom implementation (for memtable) (#5762 )	6 years ago
docs	Blog post for write_unprepared (#5711 )	6 years ago
env	fix compile error: ‘FALLOC_FL_KEEP_SIZE’ undeclared (#5708 )	6 years ago
examples	Refactor trimming logic for immutable memtables (#5022 )	6 years ago
file	Persistent globally unique DB ID in manifest (#5725 )	6 years ago
hdfs	Add copyright headers per FB open-source checkup tool. (#5199 )	7 years ago
include/rocksdb	Adding DB::GetCurrentWalFile() API as a repliction/backup helper (#5765 )	6 years ago
java	Refactor trimming logic for immutable memtables (#5022 )	6 years ago
logging	Auto Roll Logger to add some extra checking to avoid segfault. (#5623 )	6 years ago
memory	Move some logging related files to logging/ (#5387 )	6 years ago
memtable	simplify include directive involving inttypes (#5402 )	6 years ago
monitoring	MOD: trim last space and comma in perf context and iostat context ToString()	6 years ago
options	Persistent globally unique DB ID in manifest (#5725 )	6 years ago
port	fix sign compare warnings (#5651 )	6 years ago
table	Faster new DynamicBloom implementation (for memtable) (#5762 )	6 years ago
test_util	Refactor trimming logic for immutable memtables (#5022 )	6 years ago
third-party	Fix TSAN failures in DistributedMutex tests (#5684 )	6 years ago
tools	crash_test to skip compaction TTL for FIFO compaction. (#5749 )	6 years ago
trace_replay	Block cache analyzer: Support reading from human readable trace file. (#5679 )	6 years ago
util	Faster new DynamicBloom implementation (for memtable) (#5762 )	6 years ago
utilities	use c++17's try_emplace if available (#5696 )	6 years ago
.clang-format	A script that automatically reformat affected lines	12 years ago
.gitignore	Block cache simulator: Add pysim to simulate caches using reinforcement learning. (#5610 )	6 years ago
.lgtm.yml	Create lgtm.yml for LGTM.com C/C++ analysis (#4058 )	7 years ago
.travis.yml	Switch Travis to Xenial build (#4789 )	6 years ago
.watchmanconfig	Added .watchmanconfig file to rocksdb repo (#5593 )	6 years ago
AUTHORS	Update RocksDB Authors File	8 years ago
CMakeLists.txt	Copy/split PlainTableBloomV1 from DynamicBloom (refactor) (#5767 )	6 years ago
CODE_OF_CONDUCT.md	Adopt Contributor Covenant	6 years ago
CONTRIBUTING.md	Add Code of Conduct	8 years ago
COPYING	Add GPLv2 as an alternative license.	9 years ago
DEFAULT_OPTIONS_HISTORY.md	options.delayed_write_rate use the rate of rate_limiter by default.	9 years ago
DUMP_FORMAT.md	First version of rocksdb_dump and rocksdb_undump.	11 years ago
HISTORY.md	Faster new DynamicBloom implementation (for memtable) (#5762 )	6 years ago
INSTALL.md	Update the version of the dependencies used by the RocksJava static build (#4761 )	7 years ago
LANGUAGE-BINDINGS.md	LANGUAGE-BINDINGS.md: mention python-rocksdb	7 years ago
LICENSE.Apache	Change RocksDB License	8 years ago
LICENSE.leveldb	Add back the LevelDB license file	8 years ago
Makefile	fix checking the '-march' flag (#5766 )	6 years ago
README.md	Add LevelDB repository link in the Readme	7 years ago
ROCKSDB_LITE.md	Fix some typos in comments and docs.	8 years ago
TARGETS	Copy/split PlainTableBloomV1 from DynamicBloom (refactor) (#5767 )	6 years ago
USERS.md	Add Crux to USERS.md	6 years ago
Vagrantfile	Adding CentOS 7 Vagrantfile & build script	8 years ago
WINDOWS_PORT.md	#5145 , rename port/dirent.h to port/port_dirent.h to avoid compile err when use port dir as header dir output (#5152 )	7 years ago
appveyor.yml	New API to get all merge operands for a Key (#5604 )	6 years ago
defs.bzl	Change buckifier to support parameterized dependencies (#5648 )	6 years ago
issue_template.md	Add a template for issues	8 years ago
src.mk	Copy/split PlainTableBloomV1 from DynamicBloom (refactor) (#5767 )	6 years ago
thirdparty.inc	Fix build jemalloc api (#5470 )	6 years ago

README.md

RocksDB: A Persistent Key-Value Store for Flash and RAM Storage

RocksDB is developed and maintained by the Facebook Database Engineering Team. It is built on earlier work on LevelDB by Sanjay Ghemawat (sanjay@google.com) and Jeff Dean (jeff@google.com).

This code is a library that forms the core building block for a fast key-value server, especially suited for storing data on flash drives. It has a Log-Structured-Merge-Database (LSM) design with flexible tradeoffs between Write-Amplification-Factor (WAF), Read-Amplification-Factor (RAF) and Space-Amplification-Factor (SAF). It has multi-threaded compactions, making it especially suitable for storing multiple terabytes of data in a single database.

Start with example usage here: https://github.com/facebook/rocksdb/tree/master/examples

See the GitHub wiki for more explanation.

The public interface is in include/. Callers should not include or rely on the details of any other header files in this package. Those internal APIs may be changed without warning.

Design discussions are conducted in https://www.facebook.com/groups/rocksdb.dev/

License

RocksDB is dual-licensed under both the GPLv2 (found in the COPYING file in the root directory) and Apache 2.0 License (found in the LICENSE.Apache file in the root directory). You may select, at your option, one of the above-listed licenses.