rocksdb

fork of https://github.com/oxigraph/rocksdb and https://github.com/facebook/rocksdb for nextgraph and oxigraph

Hui Xiao 98d5db5c2e Sort L0 files by newly introduced epoch_num (#10922)

Summary:

Context: Sorting L0 files by `largest_seqno` has at least two inconveniences:

- File ingestion and compaction involving ingested files can create files whose seqno ranges overlap with those of existing files. `force_consistency_check=true` will catch such overlapping seqno ranges, even harmless ones.
  - For example, consider the following sequence of events ("key@n" indicates key at seqno "n"):
    - insert k1@1 to memtable m1
    - ingest file s1 with k2@2, ingest file s2 with k3@3
    - insert k4@4 to m1
    - compact files s1, s2 and result in new file s3 of seqno range [2, 3]
    - flush m1 and result in new file s4 of seqno range [1, 4]. `force_consistency_check=true` will then think s4 and s3 have a file reordering corruption that might cause returning an old value of k1.
  - However, such a caught corruption is a false positive: by the requirement of file ingestion, s1 and s2 will not have keys overlapping with k1 or with anything else inserted into m1 before s1 was ingested (otherwise m1 would be flushed first, before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption.
- Single delete can decrease a file's largest seqno, and ordering by `largest_seqno` can introduce a wrong ordering, hence file reordering corruption.
  - For example, consider the following sequence of events ("key@n" indicates key at seqno "n"; credit to ajkr for this example):
    - an existing SST s1 contains only k1@1
    - insert k1@2 to memtable m1
    - ingest file s2 with k3@3, ingest file s3 with k4@4
    - insert single delete k5@5 in m1
    - flush m1 and result in new file s4 of seqno range [2, 5]
    - compact s1, s2, s3 and result in new file s5 of seqno range [1, 4]
    - compact s4 and result in new file s6 of seqno range [2] due to single delete
  - By the last step, the file ordering by largest seqno (">" means "newer") is s5 > s6, while s6 contains a newer version of k1's value (i.e., k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent it from happening when ordering by `largest_seqno`.

Therefore, we are redesigning the sorting criteria of L0 files to avoid the above inconveniences. Credit to ajkr, we now introduce `epoch_num`, which describes the order of a file being flushed or ingested/imported (a compaction output file will have the minimum `epoch_num` among its input files'). This avoids the above inconveniences in the following ways:

- In the first case above, `force_consistency_check=true` will no longer check for overlapping seqno ranges but will check `epoch_number` ordering instead. This results in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction), which won't trigger a false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption` for more.
- In the second case above, this results in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), and s5 < s6 (post-compacting s4), which are correct file orderings that cause no corruption.

Summary:

- Introduce `epoch_number` stored per `ColumnFamilyData` and sort a CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`.
  - `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned `kReservedEpochNumberForFileIngestedBehind`)
  - Compaction output file is assigned the minimum `epoch_number` among its input files'
    - Refit level: reuse the refitted file's epoch_number
  - Other paths needing `epoch_number` treatment:
    - Import column families: reuse the file's epoch_number if it exists. If not, assign one based on `NewestFirstBySeqNo`
    - Repair: reuse the file's epoch_number if it exists. If not, assign one based on `NewestFirstBySeqNo`.
  - Assigning a new epoch_number to a file and adding this file to the LSM tree should be atomic. This is guaranteed by assigning the epoch_number right upon `VersionEdit::AddFile()`, where this version edit will be applied to the LSM tree shape right after, either by holding the db mutex (e.g., flush, file ingestion, import column family) or by there being only one ongoing edit per CF (e.g., WriteLevel0TableForRecovery, Repair).
  - Assigning the minimum input epoch number to the compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). This is because, for every key "k" in the input range, a legitimate compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction, hence no misorder.
- Persist `epoch_number` of each file in the manifest and recover `epoch_number` on db recovery
  - Backward compatibility with an old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files in `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more
  - Forward compatibility with the manifest is guaranteed by the flexibility of `NewFileCustomTag`
- Replace `force_consistency_check` on L0 with `epoch_number` and remove false positive checks like case 1 with `largest_seqno` above
  - Due to the backward compatibility issue, we might encounter files with a missing epoch number at the beginning of db recovery. We will still use the old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them until we infer their epoch numbers. See usages of `EpochNumberRequirement`.
- Remove the fix from https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and its outdated tests for file reordering corruption, because that fix can be replaced by this PR.
- Misc:
  - update existing tests with `epoch_number` so make check will pass
  - update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases
  - assert db_mutex is held in a few places before calling ColumnFamilyData::NewEpochNumber()

Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922

Test Plan:

- `make check`
- New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc`
- Updated tests (i.e., `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930
- [Ongoing] Compatibility test: manually run `36a5686ec0` (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox`
- [Ongoing] normal db stress test
- [Ongoing] db stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761

Reviewed By: ajkr

Differential Revision: D41063187

Pulled By: hx235

fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9

3 years ago
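The single-delete example from the commit message above can be replayed with a small standalone model. This is only an illustrative sketch, not RocksDB code: `FileMeta`, `EpochCounter`, `MinInputEpoch`, `NewestBySeqno`, `NewestByEpoch` and `ReplaySingleDeleteExample` are hypothetical names invented here, and the model assumes only the two rules stated in the message (a flushed or ingested file takes the next epoch number; a compaction output takes the minimum epoch number of its inputs).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for the metadata of one L0 file.
struct FileMeta {
  std::string name;
  uint64_t largest_seqno;
  uint64_t epoch_number;
};

// Hypothetical per-column-family counter: each flush or ingestion takes the
// next epoch number.
class EpochCounter {
 public:
  uint64_t NewEpochNumber() { return next_++; }

 private:
  uint64_t next_ = 1;
};

// A compaction output inherits the minimum epoch number among its inputs.
uint64_t MinInputEpoch(const std::vector<FileMeta>& inputs) {
  assert(!inputs.empty());
  uint64_t min_epoch = inputs.front().epoch_number;
  for (const FileMeta& f : inputs) {
    min_epoch = std::min(min_epoch, f.epoch_number);
  }
  return min_epoch;
}

// Name of the file treated as newest when ordering by largest seqno.
std::string NewestBySeqno(const std::vector<FileMeta>& files) {
  assert(!files.empty());
  return std::max_element(files.begin(), files.end(),
                          [](const FileMeta& a, const FileMeta& b) {
                            return a.largest_seqno < b.largest_seqno;
                          })
      ->name;
}

// Name of the file treated as newest when ordering by epoch number.
std::string NewestByEpoch(const std::vector<FileMeta>& files) {
  assert(!files.empty());
  return std::max_element(files.begin(), files.end(),
                          [](const FileMeta& a, const FileMeta& b) {
                            return a.epoch_number < b.epoch_number;
                          })
      ->name;
}

// Replays the single-delete example and returns the L0 files left at the end.
std::vector<FileMeta> ReplaySingleDeleteExample() {
  EpochCounter epochs;
  FileMeta s1{"s1", 1, epochs.NewEpochNumber()};  // existing SST with k1@1
  FileMeta s2{"s2", 3, epochs.NewEpochNumber()};  // ingested, k3@3
  FileMeta s3{"s3", 4, epochs.NewEpochNumber()};  // ingested, k4@4
  FileMeta s4{"s4", 5, epochs.NewEpochNumber()};  // flush of m1, seqnos [2, 5]
  // Compact s1, s2, s3 into s5: seqnos [1, 4], minimum input epoch.
  FileMeta s5{"s5", 4, MinInputEpoch({s1, s2, s3})};
  // Compact s4 into s6: the single delete leaves only k1@2; epoch is kept.
  FileMeta s6{"s6", 2, MinInputEpoch({s4})};
  return {s5, s6};
}
```

In this model, calling `NewestBySeqno(ReplaySingleDeleteExample())` picks s5, which is the misordering described in the message, while `NewestByEpoch` picks s6, the file that actually holds the newer k1@2.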
.circleci	Upgrade CircleCI Windows Build (#10090 )	3 years ago
.github/workflows	ci: add GitHub token permissions for workflow (#10549 )	3 years ago
buckifier	Enable BLACK for internal_repo_rocksdb (#10710 )	3 years ago
build_tools	Print stack traces on frozen tests in CI (#10828 )	3 years ago
cache	Add a SecondaryCache::InsertSaved() API, use in CacheDumper impl (#10945 )	3 years ago
cmake	gcc-11 and cmake related cleanup (#9286 )	4 years ago
coverage	Enable BLACK for internal_repo_rocksdb (#10710 )	3 years ago
db	Sort L0 files by newly introduced epoch_num (#10922 )	3 years ago
db_stress_tool	Revert PR 10777 "Fix FIFO causing overlapping seqnos in L0 files due to overla…" (#10999 )	3 years ago
docs	Bump nokogiri from 1.13.9 to 1.13.10 in /docs (#11024 )	3 years ago
env	Added placeholders for MADV defines (#10881 )	3 years ago
examples	Add rocksdb_backup_restore_example to examples/.gitignore (#10825 )	3 years ago
file	Fix db_stress failure in async_io in FilePrefetchBuffer (#10949 )	3 years ago
fuzz	Add some missing headers (#10519 )	3 years ago
include/rocksdb	Sort L0 files by newly introduced epoch_num (#10922 )	3 years ago
java	Fix missing WAL in new manifest by rolling over the WAL deletion record from prev manifest (#10892 )	3 years ago
logging	Observe and warn about misconfigured HyperClockCache (#10965 )	3 years ago
memory	Fix use of make_unique in Arena::AllocateNewBlock (#11012 )	3 years ago
memtable	Run clang format against files under example/, memory/ and memtable/ folders (#10893 )	3 years ago
microbench	Avoid allocations/copies for large `GetMergeOperands()` results (#10458 )	3 years ago
monitoring	Add an unittest for Periodic compaction conflict with ongoing compaction (#10908 )	3 years ago
options	Ignore max_compaction_bytes for compaction input that are within output key-range (#10835 )	3 years ago
plugin	Add initial CMake support to plugin (#9214 )	4 years ago
port	Fix include of windows.h in mmap.h (#10885 )	3 years ago
table	replace sprintf with its safe version snprintf (v2) (#11011 )	3 years ago
test_util	clang format files under test_util/ (#10855 )	3 years ago
third-party	Meta-internal folly integration with F14FastMap (#9546 )	4 years ago
tools	Revert PR 10777 "Fix FIFO causing overlapping seqnos in L0 files due to overla…" (#10999 )	3 years ago
trace_replay	fix compile warnings (#10976 )	3 years ago
util	Improve error messages for SST footer and size errors (#11009 )	3 years ago
utilities	Sort L0 files by newly introduced epoch_num (#10922 )	3 years ago
.clang-format	…
.gitignore	Git ignore .clangd/ (#10817 )	3 years ago
.lgtm.yml	Create lgtm.yml for LGTM.com C/C++ analysis (#4058 )	7 years ago
.watchmanconfig	Added .watchmanconfig file to rocksdb repo (#5593 )	6 years ago
AUTHORS	…
CMakeLists.txt	Add a SecondaryCache::InsertSaved() API, use in CacheDumper impl (#10945 )	3 years ago
CODE_OF_CONDUCT.md	Adopt Contributor Covenant	6 years ago
CONTRIBUTING.md	…
COPYING	…
DEFAULT_OPTIONS_HISTORY.md	Add Options::DisableExtraChecks, clarify force_consistency_checks (#9363 )	4 years ago
DUMP_FORMAT.md	…
HISTORY.md	Sort L0 files by newly introduced epoch_num (#10922 )	3 years ago
INSTALL.md	Update supported VS versions in INSTALL.md (#9823 )	4 years ago
LANGUAGE-BINDINGS.md	Add grocksdb in Go language bindings (#10498 )	3 years ago
LICENSE.Apache	…
LICENSE.leveldb	…
Makefile	Fix broken dependency: update zlib from 1.2.12 to 1.2.13 (#10833 )	3 years ago
PLUGINS.md	Add pmem-rocksdb-plugin link in PLUGINs.md (#9934 )	4 years ago
README.md	Remove Travis CI (#10407 )	3 years ago
ROCKSDB_LITE.md	Fix remaining uses of "backupable" (#9792 )	4 years ago
TARGETS	Add a SecondaryCache::InsertSaved() API, use in CacheDumper impl (#10945 )	3 years ago
USERS.md	Add Apache Spark as a user (#10993 )	3 years ago
Vagrantfile	…
WINDOWS_PORT.md	Update branch name in WINDOWS_PORT.md (#8745 )	4 years ago
common.mk	Clean up variables for temporary directory (#9961 )	4 years ago
crash_test.mk	Allow a custom DB cleanup command to be passed to db_crashtest.py (#10883 )	3 years ago
issue_template.md	Add Google Group to Issue Template	6 years ago
rocksdb.pc.in	build: fix pkg-config file generation (#9953 )	3 years ago
src.mk	Add a SecondaryCache::InsertSaved() API, use in CacheDumper impl (#10945 )	3 years ago
thirdparty.inc	Fix build jemalloc api (#5470 )	6 years ago

README.md

RocksDB: A Persistent Key-Value Store for Flash and RAM Storage

RocksDB is developed and maintained by the Facebook Database Engineering Team. It is built on earlier work on LevelDB by Sanjay Ghemawat (sanjay@google.com) and Jeff Dean (jeff@google.com).

This code is a library that forms the core building block for a fast key-value server, especially suited for storing data on flash drives. It has a Log-Structured-Merge-Database (LSM) design with flexible tradeoffs between Write-Amplification-Factor (WAF), Read-Amplification-Factor (RAF) and Space-Amplification-Factor (SAF). It has multi-threaded compactions, making it especially suitable for storing multiple terabytes of data in a single database.

Start with example usage here: https://github.com/facebook/rocksdb/tree/main/examples

See the GitHub wiki for more explanation.

The public interface is in include/. Callers should not include or rely on the details of any other header files in this package. Those internal APIs may be changed without warning.

Questions and discussions are welcome on the RocksDB Developers Public Facebook group and email list on Google Groups.

License

RocksDB is dual-licensed under both the GPLv2 (found in the COPYING file in the root directory) and Apache 2.0 License (found in the LICENSE.Apache file in the root directory). You may select, at your option, one of the above-listed licenses.