rocksdb

Commit Graph

Author	SHA1	Message	Date
Peter Dillinger	cff0d1e8e6	New backup meta schema, with file temperatures (#9660 ) Summary: The primary goal of this change is to add support for backing up and restoring (applying on restore) file temperature metadata, without committing to either the DB manifest or the FS reported "current" temperatures being exclusive "source of truth". To achieve this goal, we need to add temperature information to backup metadata, which requires updated backup meta schema. Fortunately I prepared for this in https://github.com/facebook/rocksdb/issues/8069, which began forward compatibility in version 6.19.0 for this kind of schema update. (Previously, backup meta schema was not extensible! Making this schema update public will allow some other "nice to have" features like taking backups with hard links, and avoiding crc32c checksum computation when another checksum is already available.) While schema version 2 is newly public, the default schema version is still 1. Until we change the default, users will need to set to 2 to enable features like temperature data backup+restore. New metadata like temperature information will be ignored with a warning in versions before this change and since 6.19.0. The metadata is considered ignorable because a functioning DB can be restored without it. Some detail: * Some renaming because "future schema" is now just public schema 2. * Initialize some atomics in TestFs (linter reported) * Add temperature hint support to SstFileDumper (used by BackupEngine) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9660 Test Plan: related unit test majorly updated for the new functionality, including some shared testing support for tracking temperatures in a FS. Some other tests and testing hooks into production code also updated for making the backup meta schema change public. Reviewed By: ajkr Differential Revision: D34686968 Pulled By: pdillinger fbshipit-source-id: 3ac1fa3e67ee97ca8a5103d79cc87d872c1d862a	4 years ago
Peter Dillinger	ce60d0cbe5	Test refactoring for Backups+Temperatures (#9655 ) Summary: In preparation for more support for file Temperatures in BackupEngine, this change does some test refactoring: * Move DBTest2::BackupFileTemperature test to BackupEngineTest::FileTemperatures, with some updates to make it work in the new home. This test will soon be expanded for deeper backup work. * Move FileTemperatureTestFS from db_test2.cc to db_test_util.h, to support sharing because of above moved test, but split off the "no link" part to the test needing it. * Use custom FileSystems in backupable_db_test rather than custom Envs, because going through Env file interfaces doesn't support temperatures. * Fix RemapFileSystem to map DirFsyncOptions::renamed_new_name parameter to FsyncWithDirOptions, which was required because this limitation caused a crash only after moving to higher fidelity of FileSystem interface (vs. LegacyDirectoryWrapper throwing away some parameter details) * `backupable_options_` -> `engine_options_` as part of the ongoing work to get rid of the obsolete "backupable" naming. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9655 Test Plan: test code updates only Reviewed By: jay-zhuang Differential Revision: D34622183 Pulled By: pdillinger fbshipit-source-id: f24b7a596a89b9e089e960f4e5d772575513e93f	4 years ago
Jay Zhuang	d3a2f284d9	Add Temperature info in `NewSequentialFile()` (#9499 ) Summary: Add Temperature hints information from RocksDB in API `NewSequentialFile()`. backup and checkpoint operations need to open the source files with `NewSequentialFile()`, which will have the temperature hints. Other operations are not covered. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9499 Test Plan: Added unittest Reviewed By: pdillinger Differential Revision: D34006115 Pulled By: jay-zhuang fbshipit-source-id: 568b34602b76520e53128672bd07e9d886786a2f	4 years ago
Peter Dillinger	78aee6fedc	Remove obsolete backupable_db.h, utility_db.h (#9438 ) Summary: This also removes the obsolete names BackupableDBOptions and UtilityDB. API users must now use BackupEngineOptions and DBWithTTL::Open. In C API, `rocksdb_backupable_db_` is replaced `rocksdb_backup_engine_`. Similar renaming in Java API. In reference to https://github.com/facebook/rocksdb/issues/9389 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9438 Test Plan: CI Reviewed By: mrambacher Differential Revision: D33780269 Pulled By: pdillinger fbshipit-source-id: 4a6cfc5c1b4c78bcad790b9d3dd13c5fdf4a1fac	4 years ago
mrambacher	fe31dc53ca	Make the Env class Customizable (#9293 ) Summary: Allows the Env to have options (Configurable) and loads like other Customizable classes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9293 Reviewed By: pdillinger, zhichao-cao Differential Revision: D33181591 Pulled By: mrambacher fbshipit-source-id: 55e823886c654d214eda9eedd45ccdc54dac14d7	4 years ago
Andrew Kryczka	538d2365e9	Fix race condition in BackupEngineTest.ChangeManifestDuringBackupCreation (#9327 ) Summary: The failure looked like this: ``` utilities/backupable/backupable_db_test.cc:3161: Failure Value of: db_chroot_env_->FileExists(prev_manifest_path).IsNotFound() Actual: false Expected: true ``` The failure could be coerced consistently with the following patch: ``` diff --git a/db/db_impl/db_impl_compaction_flush.cc b/db/db_impl/db_impl_compaction_flush.cc index 80410f671..637636791 100644 --- a/db/db_impl/db_impl_compaction_flush.cc +++ b/db/db_impl/db_impl_compaction_flush.cc @@ -2772,6 +2772,8 @@ void DBImpl::BackgroundCallFlush(Env::Priority thread_pri) { if (job_context.HaveSomethingToClean() \|\| job_context.HaveSomethingToDelete() \|\| !log_buffer.IsEmpty()) { mutex_.Unlock(); + bg_cv_.SignalAll(); + sleep(1); TEST_SYNC_POINT("DBImpl::BackgroundCallFlush:FilesFound"); // Have to flush the info logs before bg_flush_scheduled_-- // because if bg_flush_scheduled_ becomes 0 and the lock is ``` The cause was a familiar problem, which is manual flush/compaction may return before files they obsoleted are removed. The solution is just to wait for "scheduled" work to complete, which includes all phases including cleanup. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9327 Test Plan: after this PR, even the above patch to coerce the bug cannot cause the test to fail. Reviewed By: riversand963 Differential Revision: D33252208 Pulled By: ajkr fbshipit-source-id: 720a7eaca58c7247d221911fffe3d5e1dbf581e9	4 years ago
Peter Dillinger	0050a73a4f	New stable, fixed-length cache keys (#9126 ) Summary: This change standardizes on a new 16-byte cache key format for block cache (incl compressed and secondary) and persistent cache (but not table cache and row cache). The goal is a really fast cache key with practically ideal stability and uniqueness properties without external dependencies (e.g. from FileSystem). A fixed key size of 16 bytes should enable future optimizations to the concurrent hash table for block cache, which is a heavy CPU user / bottleneck, but there appears to be measurable performance improvement even with no changes to LRUCache. This change replaces a lot of disjointed and ugly code handling cache keys with calls to a simple, clean new internal API (cache_key.h). (Preserving the old cache key logic under an option would be very ugly and likely negate the performance gain of the new approach. Complete replacement carries some inherent risk, but I think that's acceptable with sufficient analysis and testing.) The scheme for encoding new cache keys is complicated but explained in cache_key.cc. Also: EndianSwapValue is moved to math.h to be next to other bit operations. (Explains some new include "math.h".) ReverseBits operation added and unit tests added to hash_test for both. Fixes https://github.com/facebook/rocksdb/issues/7405 (presuming a root cause) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9126 Test Plan: ### Basic correctness Several tests needed updates to work with the new functionality, mostly because we are no longer relying on filesystem for stable cache keys so table builders & readers need more context info to agree on cache keys. This functionality is so core, a huge number of existing tests exercise the cache key functionality. ### Performance Create db with `TEST_TMPDIR=/dev/shm ./db_bench -bloom_bits=10 -benchmarks=fillrandom -num=3000000 -partition_index_and_filters` And test performance with `TEST_TMPDIR=/dev/shm ./db_bench -readonly -use_existing_db -bloom_bits=10 -benchmarks=readrandom -num=3000000 -duration=30 -cache_index_and_filter_blocks -cache_size=250000 -threads=4` using DEBUG_LEVEL=0 and simultaneous before & after runs. Before ops/sec, avg over 100 runs: 121924 After ops/sec, avg over 100 runs: 125385 (+2.8%) ### Collision probability I have built a tool, ./cache_bench -stress_cache_key to broadly simulate host-wide cache activity over many months, by making some pessimistic simplifying assumptions: * Every generated file has a cache entry for every byte offset in the file (contiguous range of cache keys) * All of every file is cached for its entire lifetime We use a simple table with skewed address assignment and replacement on address collision to simulate files coming & going, with quite a variance (super-Poisson) in ages. Some output with `./cache_bench -stress_cache_key -sck_keep_bits=40`: ``` Total cache or DBs size: 32TiB Writing 925.926 MiB/s or 76.2939TiB/day Multiply by 9.22337e+18 to correct for simulation losses (but still assume whole file cached) ``` These come from default settings of 2.5M files per day of 32 MB each, and `-sck_keep_bits=40` means that to represent a single file, we are only keeping 40 bits of the 128-bit cache key. With file size of 2\\25 contiguous keys (pessimistic), our simulation is about 2\\(128-40-25) or about 9 billion billion times more prone to collision than reality. More default assumptions, relatively pessimistic: * 100 DBs in same process (doesn't matter much) * Re-open DB in same process (new session ID related to old session ID) on average every 100 files generated * Restart process (all new session IDs unrelated to old) 24 times per day After enough data, we get a result at the end: ``` (keep 40 bits) 17 collisions after 2 x 90 days, est 10.5882 days between (9.76592e+19 corrected) ``` If we believe the (pessimistic) simulation and the mathematical generalization, we would need to run a billion machines all for 97 billion days to expect a cache key collision. To help verify that our generalization ("corrected") is robust, we can make our simulation more precise with `-sck_keep_bits=41` and `42`, which takes more running time to get enough data: ``` (keep 41 bits) 16 collisions after 4 x 90 days, est 22.5 days between (1.03763e+20 corrected) (keep 42 bits) 19 collisions after 10 x 90 days, est 47.3684 days between (1.09224e+20 corrected) ``` The generalized prediction still holds. With the `-sck_randomize` option, we can see that we are beating "random" cache keys (except offsets still non-randomized) by a modest amount (roughly 20x less collision prone than random), which should make us reasonably comfortable even in "degenerate" cases: ``` 197 collisions after 1 x 90 days, est 0.456853 days between (4.21372e+18 corrected) ``` I've run other tests to validate other conditions behave as expected, never behaving "worse than random" unless we start chopping off structured data. Reviewed By: zhichao-cao Differential Revision: D33171746 Pulled By: pdillinger fbshipit-source-id: f16a57e369ed37be5e7e33525ace848d0537c88f	4 years ago
Peter Dillinger	80ac7412b5	Polish/deflake BackupEngineTest.FileCollision (#9257 ) Summary: Use smaller and more predictable behaviors Pull Request resolved: https://github.com/facebook/rocksdb/pull/9257 Test Plan: gtest-parallel --repeat=N ./backupable_db_test --gtest_filter=BackupEngineTest.FileCollision before (N=50) we see inconsistent sets of SST files $ find /dev/shm/rocksdb_blah/ \| grep -o '/00.sst' \| grep -o '^[^_]' \| sort \| uniq -c 49 /000009 3 /000010 1 /000010.sst 49 /000012 3 /000013 1 /000013.sst 49 /000015 2 /000016 1 /000016.sst 22 /000018 2 /000019 1 /000019.sst 29 /000020 11 /000021 2 /000021.sst 46 /000022 2 /000022.sst 4 /000023 1 /000023.sst 27 /000025 And after (N=5000) we see $ find /dev/shm/rocksdb_blah/ \| grep -o '/00.sst' \| grep -o '^[^_]' \| sort \| uniq -c 10000 /000009 10000 /000012 5000 /000015 Reviewed By: ajkr Differential Revision: D32888393 Pulled By: pdillinger fbshipit-source-id: 5bfd075b3184bb66c5613758a53f431c406e9808	4 years ago
Hui Xiao	cff7819dff	Fix BackupEngine's internal callers of GenericRateLimiter::Request() not honoring bytes <= GetSingleBurstBytes() (#9063 ) Summary: Context: Some existing internal calls of `GenericRateLimiter::Request()` in backupable_db.cc and newly added internal calls in https://github.com/facebook/rocksdb/pull/8722/ do not make sure `bytes <= GetSingleBurstBytes()` as required by rate_limiter https://github.com/facebook/rocksdb/blob/master/include/rocksdb/rate_limiter.h#L47. Impacts of this bug include: (1) In debug build, when `GenericRateLimiter::Request()` requests bytes greater than `GenericRateLimiter:: kMinRefillBytesPerPeriod = 100` byte, process will crash due to assertion failure. See https://github.com/facebook/rocksdb/pull/9063#discussion_r737034133 and for possible scenario (2) In production build, although there will not be the above crash due to disabled assertion, the bug can lead to a request of small bytes being blocked for a long time by a request of same priority with insanely large bytes from a different thread. See updated https://github.com/facebook/rocksdb/wiki/Rate-Limiter ("Notice that although....the maximum bytes that can be granted in a single request have to be bounded...") for more info. There is an on-going effort to move rate-limiting to file wrapper level so rate limiting in `BackupEngine` and this PR might be made obsolete in the future. Summary: - Implemented loop-calling `GenericRateLimiter::Request()` with `bytes <= GetSingleBurstBytes()` as a static private helper function `BackupEngineImpl::LoopRateLimitRequestHelper` -- Considering make this a util function in `RateLimiter` later or do something with `RateLimiter::RequestToken()` - Replaced buggy internal callers with this helper function wherever requested byte is not pre-limited by `GetSingleBurstBytes()` - Removed the minimum refill bytes per period enforced by `GenericRateLimiter` since it is useless and prevents testing `GenericRateLimiter` for extreme case with small refill bytes per period. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9063 Test Plan: - Added a new test that failed the assertion before this change and now passes - It exposed bugs in [the write during creation in `CopyOrCreateFile()`](`df7cc66e17/utilities/backupable/backupable_db.cc (L2034-L2043)`), [the read of table properties in `GetFileDbIdentities()`](`df7cc66e17/utilities/backupable/backupable_db.cc (L2372-L2378)`), [some read of metadata in `BackupMeta::LoadFromFile()`](`df7cc66e17/utilities/backupable/backupable_db.cc (L2726)`) - Passing Existing tests Reviewed By: ajkr Differential Revision: D31824535 Pulled By: hx235 fbshipit-source-id: d2b3dea7a64e2a4b1e6a59fca322f0800a4fcbcc	4 years ago
Jay Zhuang	29102641dd	Skip directory fsync for filesystem btrfs (#8903 ) Summary: Directory fsync might be expensive on btrfs and it may not be needed. Here are 4 directory fsync cases: 1. creating a new file: dir-fsync is not needed on btrfs, as long as the new file itself is synced. 2. renaming a file: dir-fsync is not needed if the renamed file is synced. So an API `FsyncAfterFileRename(filename, ...)` is provided to sync the file on btrfs. By default, it just calls dir-fsync. 3. deleting files: dir-fsync is forced by set `IOOptions.force_dir_fsync = true` 4. renaming multiple files (like backup and checkpoint): dir-fsync is forced, the same as above. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8903 Test Plan: run tests on btrfs and non btrfs Reviewed By: ajkr Differential Revision: D30885059 Pulled By: jay-zhuang fbshipit-source-id: dd2730b31580b0bcaedffc318a762d7dbf25de4a	4 years ago
Peter Dillinger	3ffb3baa0b	Add (Live)FileStorageInfo API (#8968 ) Summary: New classes FileStorageInfo and LiveFileStorageInfo and 'experimental' function DB::GetLiveFilesStorageInfo, which is intended to largely replace several fragmented DB functions needed to create checkpoints and backups. This function is now used to create checkpoints and backups, because it fixes many (probably not all) of the prior complexities of checkpoint not having atomic access to DB metadata. This also ensures strong functional test coverage of the new API. Specifically, much of the old CheckpointImpl::CreateCustomCheckpoint has been migrated to and updated in DBImpl::GetLiveFilesStorageInfo, with the former now calling the latter. Also, the class FileStorageInfo in metadata.h compatibly replaces BackupFileInfo and serves as a new base class for SstFileMetaData. Some old fields of SstFileMetaData are still provided (for now) but deprecated. Although FileStorageInfo::directory is accurate when using db_paths and/or cf_paths, these have never been supported by Checkpoint nor BackupEngine and still are not. This change does now detect these cases and return NotSupported when appropriate. (More work needed for support.) Somehow this change broke ProgressCallbackDuringBackup, but the progress_callback logic was dubious to begin with because it would call the callback based on copy buffer size, not size actually copied. Logic and test updated to track size actually copied per-thread. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8968 Test Plan: tests updated. DB::GetLiveFilesStorageInfo mostly tested by use in CheckpointImpl. DBTest.SnapshotFiles updated to also test GetLiveFilesStorageInfo, including reading the data after DB close. Added CheckpointTest.CheckpointWithDbPath (NotSupported). Reviewed By: siying Differential Revision: D31242045 Pulled By: pdillinger fbshipit-source-id: b183d1ce9799e220daaefd6b3b5365d98de676c0	4 years ago
Peter Dillinger	5268cdc997	Finish BackupEngine migration to IOStatus (#8940 ) Summary: Updates a few remaining functions that should have been updated from Status -> IOStatus, and adds to HISTORY for the overall change including https://github.com/facebook/rocksdb/issues/8820. This change is for inclusion in version 6.25. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8940 Test Plan: CI Reviewed By: zhichao-cao Differential Revision: D31085029 Pulled By: pdillinger fbshipit-source-id: 91557c6a39ef1d90357d4f4dcd79af0645d87c7b	4 years ago
Zhichao Cao	82e7631de6	Replace Status with IOStatus in the backupable_db (#8820 ) Summary: In order to populate the IOStatus up to the higher level, replace some of the Status to IOStatus. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8820 Test Plan: make check Reviewed By: pdillinger Differential Revision: D30967215 Pulled By: zhichao-cao fbshipit-source-id: ccf9d5cfbd9d3de047c464aaa85f9fa43b474903	4 years ago
hx235	45175ca2e1	Charge read to rate limiter in BackupEngine (#8722 ) Summary: Context: While all the non-trivial write operations in BackupEngine go through the RateLimiter, reads currently do not. In general, this is not a huge issue because (especially since some I/O efficiency fixes) reads in BackupEngine are mostly limited by corresponding writes, for both backup and restore. But in principle we should charge the RateLimiter for reads as well. - Charged read operations in `BackupEngineImpl::CopyOrCreateFile`, `BackupEngineImpl::ReadFileAndComputeChecksum`, `BackupEngineImpl::BackupMeta::LoadFromFile` and `BackupEngineImpl::GetFileDbIdentities` Pull Request resolved: https://github.com/facebook/rocksdb/pull/8722 Test Plan: - Passed existing tests - Passed added unit tests Reviewed By: pdillinger Differential Revision: D30610464 Pulled By: hx235 fbshipit-source-id: 9b08c9387159a5385c8d390d6666377a0d0117e5	4 years ago
Andrew Kryczka	dd092c2d11	prevent stranded LATEST_BACKUP in BackupEngineTest.NoDeleteWithReadOnly (#8887 ) Summary: A "LATEST_BACKUP" file was left in the backup directory by "BackupEngineTest.NoDeleteWithReadOnly" test, affecting future test runs. In particular, it caused "BackupEngineTest.IOStats" to fail since it relies on backup directory containing only data written by its `BackupEngine`. The fix is to promote "LATEST_BACKUP" to an explicitly managed file so it is deleted in `BackupEngineTest` constructor if it exists. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8887 Test Plan: below command used to fail. Now it passes: ``` $ TEST_TMPDIR=/dev/shm ./backupable_db_test --gtest_filter='BackupEngineTest.NoDeleteWithReadOnly:BackupEngineTest.IOStats' ``` Reviewed By: pdillinger Differential Revision: D30812336 Pulled By: ajkr fbshipit-source-id: 32dfbe1368ebdab872e610764bfea5daf9a2af09	4 years ago
Andrew Kryczka	9308ff366c	Bytes read/written stats for `CreateNewBackup*()` (#8819 ) Summary: Gets `Statistics` from the options associated with the `DB` undergoing backup, and populates new ticker stats with the thread-local `IOContext` read/write counters for the threads doing backup work. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8819 Reviewed By: pdillinger Differential Revision: D30779238 Pulled By: ajkr fbshipit-source-id: 75ccafc355f90906df5cf80367f7245b985772d8	4 years ago
Peter Dillinger	0ef88538c6	Improve support for using regexes (#8740 ) Summary: * Consolidate use of std::regex for testing to testharness.cc, to minimize Facebook linters constantly flagging uses in non-production code. * Improve syntax and error messages for asserting some string matches a regex in tests. * Add a public Regex wrapper class to encapsulate existing usage in ObjectRegistry. * Remove unnecessary include <regex> * Put warnings that use of Regex in production code could cause bad performance or stack overflow. Intended follow-up work: * Replace std::regex with another underlying implementation like RE2 * Improve ObjectRegistry interface in terms of possibly confusing literal string matching vs. regex and in terms of reporting invalid regex. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8740 Test Plan: tests updated, basic unit test for public Regex, and some manual testing of temporary changes to see example error messages: utilities/backupable/backupable_db_test.cc:917: Failure 000010_1162373755_138626.blob (child.name) does not match regex [0-9]+_[0-9]+_[0-9]+[.]blobHAHAHA (pattern) db/db_basic_test.cc:74: Failure R3SHSBA8C4U0CIMV2ZB0 (sid3) does not match regex [0-9A-Z]{20}HAHAHA Reviewed By: mrambacher Differential Revision: D30706246 Pulled By: pdillinger fbshipit-source-id: ba845e8f563ccad39bdb58f44f04e9da8f78c3fd	4 years ago
Peter Dillinger	32752551b9	Fix a buffer size race condition in BackupEngine (#8732 ) Summary: If RateLimiter burst bytes changes during concurrent Restore operations Pull Request resolved: https://github.com/facebook/rocksdb/pull/8732 Test Plan: updated unit test fails with TSAN before change, passes after Reviewed By: ajkr Differential Revision: D30683879 Pulled By: pdillinger fbshipit-source-id: d0ddb3587ade91ee2a4d926b475acf7781b03086	4 years ago
Peter Dillinger	a7fd1d0881	Make backup restore atomic, with sync option (#8568 ) Summary: Guarantees that if a restore is interrupted, DB::Open will fail. This works by restoring CURRENT first to CURRENT.tmp then as a final step renaming to CURRENT. Also makes restore respect BackupEngineOptions::sync (default true). When set, the restore is guaranteed persisted by the time it returns OK. Also makes the above atomicity guarantee work in case the interruption is power loss or OS crash (not just process interruption or crash). Fixes https://github.com/facebook/rocksdb/issues/8500 Pull Request resolved: https://github.com/facebook/rocksdb/pull/8568 Test Plan: added to backup mini-stress unit test. Passes with gtest_repeat=100 (whereas fails 7 times without the CURRENT.tmp) Reviewed By: akankshamahajan15 Differential Revision: D29812605 Pulled By: pdillinger fbshipit-source-id: 24e9a993b305b1835ca95558fa7a7152e54cda8e	4 years ago
Andrew Kryczka	ed8eb436db	Move slow valgrind tests behind -DROCKSDB_FULL_VALGRIND_RUN (#8475 ) Summary: Various tests had disabled valgrind due to it slowing down and timing out (as is the case right now) the CI runs. Where a test was disabled with no comment, I assumed slowness was the cause. For these tests that were slow under valgrind, as well as the ones identified in https://github.com/facebook/rocksdb/issues/8352, this PR moves them behind the compiler flag `-DROCKSDB_FULL_VALGRIND_RUN`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8475 Test Plan: running `make full_valgrind_test`, `make valgrind_test`, `make check`; will verify they appear working correctly Reviewed By: jay-zhuang Differential Revision: D29504843 Pulled By: ajkr fbshipit-source-id: 2aac90749cfbd30d5ce11cb29a07a1b9314eeea7	4 years ago
Peter Dillinger	c26b75baa5	Deprecate obsolete "backupable db" from public APIs (#8274 ) Summary: An early design of BackupEngine used stackable DB, so I guess a DB had to opt-in to being backupable. Unfortunately the naming of that obsolete design still infects our public API and implementation. This change fixes the public API, with a deprecated backward-compatibility header. `BackupableDBOptions` is renamed to `BackupEngineOptions` (copy-replace in the public header) and backup_engine.h replaces backupable_db.h (present for backward compatibility). The only other change in backupable_db.h -> backup_engine.h is cleaning up headers. Later changes will fix the internal implementation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8274 Test Plan: The internal implementation of BackupEngine uses the name BackupEngineOptions, while the unit tests use the old name BackupableDBOptions. This gives me confidence that both still work. Reviewed By: mrambacher Differential Revision: D28259471 Pulled By: pdillinger fbshipit-source-id: a25dbe327b9772143488e7bb0ec7139ee42d0613	4 years ago
Yanqin Jin	a376c22066	Handle rename() failure in non-local FS (#8192 ) Summary: In a distributed environment, a file `rename()` operation can succeed on server (remote) side, but the client can somehow return non-ok status to RocksDB. Possible reasons include network partition, connection issue, etc. This happens in `rocksdb::SetCurrentFile()`, which can be called in `LogAndApply() -> ProcessManifestWrites()` if RocksDB tries to switch to a new MANIFEST. We currently always delete the new MANIFEST if an error occurs. This is problematic in distributed world. If the server-side successfully updates the CURRENT file via renaming, then a subsequent `DB::Open()` will try to look for the new MANIFEST and fail. As a fix, we can track the execution result of IO operations on the new MANIFEST. - If IO operations on the new MANIFEST fail, then we know the CURRENT must point to the original MANIFEST. Therefore, it is safe to remove the new MANIFEST. - If IO operations on the new MANIFEST all succeed, but somehow we end up in the clean up code block, then we do not know whether CURRENT points to the new or old MANIFEST. (For local POSIX-compliant FS, it should still point to old MANIFEST, but it does not matter if we keep the new MANIFEST.) Therefore, we keep the new MANIFEST. - Any future `LogAndApply()` will switch to a new MANIFEST and update CURRENT. - If process reopens the db immediately after the failure, then the CURRENT file can point to either the new MANIFEST or the old one, both of which exist. Therefore, recovery can succeed and ignore the other. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8192 Test Plan: make check Reviewed By: zhichao-cao Differential Revision: D27804648 Pulled By: riversand963 fbshipit-source-id: 9c16f2a5ce41bc6aadf085e48449b19ede8423e4	5 years ago
Akanksha Mahajan	c377c2ba15	Fix flaky test BackupableDBTest.FileSizeForIncremental (#8197 ) Summary: Test was flaky because for kUseDbSessionId naming, blob files use naming scheme kLegacyCrc32cAndFileSize. So expected number of files because of collision can vary. So disabling blobdb for this test case. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8197 Reviewed By: pdillinger Differential Revision: D27836997 Pulled By: akankshamahajan15 fbshipit-source-id: 5eb21a5f4acae3d6b730a9e1b207264fbc18cb80	5 years ago
Peter Dillinger	bb75092574	Misc Backup API enhancements (#8170 ) Summary: * CreateNewBackup(WithMetadata) returning the BackupID of new backup through optional new output param. This is especially useful with the new mutithreading support, so that you can transactionally determine the ID of a backup you create. * GetBackupInfo / GetLatestBackupInfo for individual backups, so that you don't have to comb through a vector of backups if you don't want to. Updated HISTORY.md (including re: BlobDB support as new feature) Pull Request resolved: https://github.com/facebook/rocksdb/pull/8170 Test Plan: Added test logic to existing tests, to minimize increase in cost of running tests Reviewed By: zhichao-cao Differential Revision: D27680410 Pulled By: pdillinger fbshipit-source-id: 1fc45b73d81aae293ccd4a43d9583d7fd915d3eb	5 years ago
Akanksha Mahajan	d52b520d51	Integrated BlobDB for backup/restore support (#8129 ) Summary: Add support for blob files for backup/restore like table files. Since DB session ID is currently not supported for blob files (there is no place to store it in the header), so for blob files uses the kLegacyCrc32cAndFileSize naming scheme even if share_files_with_checksum_naming is set to kUseDbSessionId. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8129 Test Plan: Add new test units Reviewed By: ltamasi Differential Revision: D27408510 Pulled By: akankshamahajan15 fbshipit-source-id: b27434d189a639ef3e6ad165c61a143a2daaf06e	5 years ago
Peter Dillinger	a4e82a3cca	Fix read-only DB writing to filesystem with write_dbid_to_manifest (#8164 ) Summary: Fixing another crash test failure in the case of write_dbid_to_manifest=true and reading a backup as read-only DB. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8164 Test Plan: enhanced unit test for backup as read-only DB, ran blackbox_crash_test more with elevated backup_one_in Reviewed By: zhichao-cao Differential Revision: D27622237 Pulled By: pdillinger fbshipit-source-id: 680d0f99ddb465a601737f2e3f2c80efd47384fb	5 years ago
Peter Dillinger	35af0433cf	Fix crash test with backup as read-only DB (#8161 ) Summary: Forgot to re-test crash test after adding read-only filesystem enforcement to https://github.com/facebook/rocksdb/issues/8142. The problem is ReadOnlyFileSystem would reject CreateDirIfMissing whenever DBOptions::create_if_missing=true. The fix that is better for users is to allow CreateDirIfMissing in ReadOnlyFileSystem if the directory exists, so that they don't cause a failure on using create_if_missing with opening backups as read-only DBs. Added this option test to the unit test (in addition to being in the crash test). Also fixed a couple of lints. And some better messaging from 'make format' so that when you run it with uncommitted changes, it's clear that it's only checking the uncommitted changes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8161 Test Plan: local blackbox_crash_test with amplified backup_one_in Reviewed By: ajkr Differential Revision: D27614409 Pulled By: pdillinger fbshipit-source-id: 63ccb626c7e34c200d61c6bca2a8f60da9015179	5 years ago
Peter Dillinger	879357fdb0	Make backups openable as read-only DBs (#8142 ) Summary: A current limitation of backups is that you don't know the exact database state of when the backup was taken. With this new feature, you can at least inspect the backup's DB state without restoring it by opening it as a read-only DB. Rather than add something like OpenAsReadOnlyDB to the BackupEngine API, which would inhibit opening stackable DB implementations read-only (if/when their APIs support it), we instead provide a DB name and Env that can be used to open as a read-only DB. Possible follow-up work: * Add a version of GetBackupInfo for a single backup. * Let CreateNewBackup return the BackupID of the newly-created backup. Implementation details: Refactored ChrootFileSystem to split off new base class RemapFileSystem, which allows more general remapping of files. We use this base class to implement BackupEngineImpl::RemapSharedFileSystem. To minimize API impact, I decided to just add these fields `name_for_open` and `env_for_open` to those set by GetBackupInfo when include_file_details=true. Creating the RemapSharedFileSystem adds a bit to the memory consumption, perhaps unnecessarily in some cases, but this has been mitigated by (a) only initialize the RemapSharedFileSystem lazily when GetBackupInfo with include_file_details=true is called, and (b) using the existing `shared_ptr<FileInfo>` objects to hold most of the mapping data. To enhance API safety, RemapSharedFileSystem is wrapped by new ReadOnlyFileSystem which rejects any attempts to write. This uncovered a couple of places in which DB::OpenForReadOnly would write to the filesystem, so I fixed these. Added a release note because this affects logging. Additional minor refactoring in backupable_db.cc to support the new functionality. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8142 Test Plan: new test (run with ASAN and UBSAN), added to stress test and ran it for a while with amplified backup_one_in Reviewed By: ajkr Differential Revision: D27535408 Pulled By: pdillinger fbshipit-source-id: 04666d310aa0261ef6b2385c43ca793ce1dfd148	5 years ago
Peter Dillinger	96205baa63	Likely fix flaky TableFileCorruptedBeforeBackup (#8151 ) Summary: Before corrupting a file in the DB and expecting corruption to be detected, open DB read-only to ensure file is not made obsolete by compaction. Also, to avoid obsolete files not yet deleted, only select live files to corrupt. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8151 Test Plan: watch CI Reviewed By: akankshamahajan15 Differential Revision: D27568849 Pulled By: pdillinger fbshipit-source-id: 39a69a2eafde0482b20a197949d24abe21952f27	5 years ago
Peter Dillinger	ec11c23caa	Add thread safety to BackupEngine, explain more (#8115 ) Summary: BackupEngine previously had unclear but strict concurrency requirements that the API user must follow for safe use. Now we make that clear, by separating operations into "Read," "Append," and "Write" operations, and specifying which combinations are safe across threads on the same BackupEngine object (previously none; now all, using a read-write lock), and which are safe across different BackupEngine instances open on the same backup_dir. The changes to backupable_db.h should be backward compatible. It is mostly about eliminating copies of what should be the same function and (unsurprisingly) useful documentation comments were often placed on only one of the two copies. With the re-organization, we are also grouping different categories of operations. In the future we might add BackupEngineReadAppendOnly, but that didn't seem necessary. To mark API Read operations 'const', I had to mark some implementation functions 'const' and some fields mutable. Functional changes: * Added RWMutex locking around public API functions to implement thread safety on a single object. To avoid future bugs, this is another internal class layered on top (removing many "override" in BackupEngineImpl). It would be possible to allow more concurrency between operations, rather than mutual exclusion, but IMHO not worth the work. * Fixed a race between Open() (Initialize()) and CreateNewBackup() for different objects on the same backup_dir, where Initialize() could delete the temporary meta file created during CreateNewBackup(). (This was found by the new test.) Also cleaned up a couple of "status checked" TODOs, and improved a checksum mismatch error message to include involved files. Potential follow-up work: * CreateNewBackup has an API wart because it doesn't tell you the BackupID it just created, which makes it of limited use in a multithreaded setting. * We could also consider a Refresh() function to catch up to changes made from another BackupEngine object to the same dir. * Use a lock file to prevent multiple writer BackupEngines, but this won't work on remote filesystems not supporting lock files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8115 Test Plan: new mini-stress test in backup unit tests, run with gcc, clang, ASC, TSAN, and UBSAN, 100 iterations each. Reviewed By: ajkr Differential Revision: D27347589 Pulled By: pdillinger fbshipit-source-id: 28d82ed2ac672e44085a739ddb19d297dad14b15	5 years ago
Peter Dillinger	3bfd3ed2f3	Begin forward compatibility for new backup meta schema (#8069 ) Summary: This does not add any new public APIs or published functionality, but adds the ability to read and use (and in tests, write) backups with a new meta file schema, based on the old schema but not forward-compatible (before this change). The new schema enables some capabilities not in the old: * Explicit versioning, so that users get clean error messages the next time we want to break forward compatibility. * Ignoring unrecognized fields (with warning), so that new non-critical features can be added without breaking forward compatibility. * Rejecting future "non-ignorable" fields, so that new features critical to some use-cases could potentially be added outside of linear schema versions, with broken forward compatibility. * Fields at the end of the meta file, such as for checksum of the meta file's contents (up to that point) * New optional 'size' field for each file, which is checked when present * Optionally omitting 'crc32' field, so that we aren't required to have a crc32c checksum for files to take a backup. (E.g. to support backup via hard links and to better support file custom checksums.) Because we do not have a JSON parser and to share code, the new schema is simply derived from the old schema. BackupEngine code is updated to allow missing checksums in some places, and to make that easier, `has_checksum` and `verify_checksum_after_work` are eliminated. Empty `checksum_hex` indicates checksum is unknown. I'm not too afraid of regressing on data integrity, because (a) we have pretty good test coverage of corruption detection in backups, and (b) we are increasingly relying on the DB itself for data integrity rather than it being an exclusive feature of backups. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8069 Test Plan: new unit tests, added to crash test (some local run with boosted backup probability) Reviewed By: ajkr Differential Revision: D27139824 Pulled By: pdillinger fbshipit-source-id: 9e0e4decfb42bb84783d64d2d246456d97e8e8c5	5 years ago
Yanqin Jin	7ee41a5d25	Fix a test failure when built with ASSERT_STATUS_CHECKED=1 (#8075 ) Summary: As title. Test plan ASSERT_STATUS_CHECKED=1 make -j20 backupable_db_test error_handler_fs_test ./backupable_db_test ./error_handler_fs_test Pull Request resolved: https://github.com/facebook/rocksdb/pull/8075 Reviewed By: zhichao-cao Differential Revision: D27173832 Pulled By: riversand963 fbshipit-source-id: 37dac50f7c89127804ff2572abddd4174642de30	5 years ago
Peter Dillinger	589ea6bec2	Add BackupEngine API for backup file details (#8042 ) Summary: This API can be used for things like determining how much space can be freed up by deleting a particular backup, etc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8042 Test Plan: validation of the API added to many existing backup unit tests Reviewed By: mrambacher Differential Revision: D26936577 Pulled By: pdillinger fbshipit-source-id: f0bbd90f0917b9781a6837652fb4616d9247816a	5 years ago
Peter Dillinger	4b18c46d10	Refactor: add LineFileReader and Status::MustCheck (#8026 ) Summary: Removed confusing, awkward, and undocumented internal API ReadOneLine and replaced with very simple LineFileReader. In refactoring backupable_db.cc, this has the side benefit of removing the arbitrary cap on the size of backup metadata files. Also added Status::MustCheck to make it easy to mark a Status as "must check." Using this, I can ensure that after LineFileReader::ReadLine returns false the caller checks GetStatus(). Also removed some excessive conditional compilation in status.h Pull Request resolved: https://github.com/facebook/rocksdb/pull/8026 Test Plan: added unit test, and running tests with ASSERT_STATUS_CHECKED Reviewed By: mrambacher Differential Revision: D26831687 Pulled By: pdillinger fbshipit-source-id: ef749c265a7a26bb13cd44f6f0f97db2955f6f0f	5 years ago
Peter Dillinger	847ca9f964	Make default share_files_with_checksum=true (#8020 ) Summary: New comment for share_files_with_checksum: // Only used if share_table_files is set to true. Setting to false is // DEPRECATED and potentially dangerous because in that case BackupEngine // can lose data if backing up databases with distinct or divergent // history, for example if restoring from a backup other than the latest, // writing to the DB, and creating another backup. Setting to true (default) // prevents these issues by ensuring that different table files (SSTs) with // the same number are treated as distinct. See // share_files_with_checksum_naming and ShareFilesNaming. I have also removed interim option kFlagMatchInterimNaming, which is no longer needed and was never needed for correct+compatible operation (just performance). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8020 Test Plan: tests updated. Backward+forward compatibility verified with SHORT_TEST=1 check_format_compatible.sh. ldb uses default backup options, and I manually verified shared_checksum in /tmp/rocksdb_format_compatible_peterd/bak/current/ after run. Reviewed By: ajkr Differential Revision: D26786331 Pulled By: pdillinger fbshipit-source-id: 36f968dfef1f5cacbd65154abe1d846151a55130	5 years ago
Akanksha Mahajan	f19612970d	Support retrieving checksums for blob files from the MANIFEST when checkpointing (#8003 ) Summary: The checkpointing logic supports passing file level checksums to the copy_file_cb callback function which is used by the backup code for detecting corruption during file copies. However, this is currently implemented only for table files. This PR extends the checksum retrieval to blob files as well. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8003 Test Plan: Add new test units Reviewed By: ltamasi Differential Revision: D26680701 Pulled By: akankshamahajan15 fbshipit-source-id: 1bd1e2464df6e9aa31091d35b8c72786d94cd1c5	5 years ago
mrambacher	4a09d632c4	Remove Legacy and Custom FileWrapper classes from header files (#7851 ) Summary: Removed the uses of the Legacy FileWrapper classes from the source code. The wrappers were creating an additional layer of indirection/wrapping, as the Env already has a FileSystem. Moved the Custom FileWrapper classes into the CustomEnv, as these classes are really for the private use the the CustomEnv class. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7851 Reviewed By: anand1976 Differential Revision: D26114816 Pulled By: mrambacher fbshipit-source-id: db32840e58d969d3a0fa6c25aaf13d6dcdc74150	5 years ago
Adam Retter	4926b33742	Improvements to Env::GetChildren (#7819 ) Summary: The main improvement here is to not include `.` or `..` in the results of `Env::GetChildren`. The occurrence of `.` or `..`; it is non-portable, dependent on the Operating System and the File System. See: https://www.gnu.org/software/libc/manual/html_node/Reading_002fClosing-Directory.html There were lots of duplicate checks spread through the RocksDB codebase previously to skip `.` and `..`. This new removes the need for those at the source. Also some minor fixes to `Env::GetChildren`: * Improve error handling in POSIX implementation * Remove unnecessary array allocation on Windows * Fix struct name for Windows Non-UTF-8 API Pull Request resolved: https://github.com/facebook/rocksdb/pull/7819 Reviewed By: ajkr Differential Revision: D25837394 Pulled By: jay-zhuang fbshipit-source-id: 1e137e7218d38b450af9c083f73d5357abcbba2e	5 years ago
mrambacher	cc2a180d00	Add more tests to the ASC pass list (#7834 ) Summary: Fixed the following to now pass ASC checks: * `ttl_test` * `blob_db_test` * `backupable_db_test`, * `delete_scheduler_test` Pull Request resolved: https://github.com/facebook/rocksdb/pull/7834 Reviewed By: jay-zhuang Differential Revision: D25795398 Pulled By: ajkr fbshipit-source-id: a10037817deda4fc7cbb353a2e00b62ed89b6476	5 years ago
cheng-chang	bdb7e544bd	Skip WALs according to MinLogNumberToKeep when creating checkpoint (#7789 ) Summary: In a stress test failure, we observe that a WAL is skipped when creating checkpoint, although its log number >= MinLogNumberToKeep(). This might happen in the following case: 1. when creating the checkpoint, there are 2 column families: CF0 and CF1, and there are 2 WALs: 1, 2; 2. CF0's log number is 1, CF0's active memtable is empty, CF1's log number is 2, CF1's active memtable is not empty, WAL 2 is not empty, the sequence number points to WAL 2; 2. the checkpoint process flushes CF0, since CF0' active memtable is empty, there is no need to SwitchMemtable, thus no new WAL will be created, so CF0's log number is now 2, concurrently, some data is written to CF0 and WAL 2; 3. the checkpoint process flushes CF1, WAL 3 is created and CF1's log number is now 3, CF0's log number is still 2 because CF0 is not empty and WAL 2 contains its unflushed data concurrently written in step 2; 4. the checkpoint process determines that WAL 1 and 2 are no longer needed according to [live_wal_files[i]->StartSequence() >= *sequence_number](https://github.com/facebook/rocksdb/blob/master/utilities/checkpoint/checkpoint_impl.cc#L388), so it skips linking them to the checkpoint directory; 5. but according to `MinLogNumberToKeep()`, WAL 2 still needs to be kept because CF0's log number is 2. If the checkpoint is reopened in read-only mode, and only read from the snapshot with the initial sequence number, then there will be no data loss or data inconsistency. But if the checkpoint is reopened and read from the most recent sequence number, suppose in step 3, there are also data concurrently written to CF1 and WAL 3, then the most recent sequence number refers to the latest entry in WAL 3, so the data written in step 2 should also be visible, but since WAL 2 is discarded, those data are lost. When tracking WAL in MANIFEST is enabled, when reopening the checkpoint, since WAL 2 is still tracked in MANIFEST as alive, but it's missing from the checkpoint directory, a corruption will be reported. This PR makes the checkpoint process to only skip a WAL if its log number < `MinLogNumberToKeep`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7789 Test Plan: watch existing tests to pass. Reviewed By: ajkr Differential Revision: D25662346 Pulled By: cheng-chang fbshipit-source-id: 136471095baa01886cf44809455cf855f24857a0	5 years ago
Akanksha Mahajan	20c7d7c58a	Handling misuse of snprintf return value (#7686 ) Summary: Handle misuse of snprintf return value to avoid Out of bound read/write. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7686 Test Plan: make check -j64 Reviewed By: riversand963 Differential Revision: D25030831 Pulled By: akankshamahajan15 fbshipit-source-id: 1a1d181c067c78b94d720323ae00b79566b57cfa	5 years ago
Cheng Chang	5e794b0841	Fix a recovery corner case (#7621 ) Summary: Consider the following sequence of events: 1. Db flushed an SST with file number N, appended to MANIFEST, and tried to sync the MANIFEST. 2. Syncing MANIFEST failed and db crashed. 3. Db tried to recover with this MANIFEST. In the meantime, no entry about the newly-flushed SST was found in the MANIFEST. Therefore, RocksDB replayed WAL and tried to flush to an SST file reusing the same file number N. This failed because file system does not support overwrite. Then Db deleted this file. 4. Db crashed again. 5. Db tried to recover. When db read the MANIFEST, there was an entry referencing N.sst. This could happen probably because the append in step 1 finally reached the MANIFEST and became visible. Since N.sst had been deleted in step 3, recovery failed. It is possible that N.sst created in step 1 is valid. Although step 3 would still fail since the MANIFEST was not synced properly in step 1 and 2, deleting N.sst would make it impossible for the db to recover even if the remaining part of MANIFEST was appended and visible after step 5. After this PR, in step 3, immediately after recovering from MANIFEST, a new MANIFEST is created, then we find that N.sst is not referenced in the MANIFEST, so we delete it, and we'll not reuse N as file number. Then in step 5, since the new MANIFEST does not contain N.sst, the recovery failure situation in step 5 won't happen. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7621 Test Plan: 1. some tests are updated, because these tests assume that new MANIFEST is created after WAL recovery. 2. a new unit test is added in db_basic_test to simulate step 3. Reviewed By: riversand963 Differential Revision: D24668144 Pulled By: cheng-chang fbshipit-source-id: 90d7487fbad2bc3714f5ede46ea949895b15ae3b	5 years ago
Zhichao Cao	d8ec0a760a	Make FileType Public and Replace kLogFile with kWalFile (#7580 ) Summary: As suggested by pdillinger ,The name of kLogFile is misleading, in some tests, kLogFile is defined as info log. Replace it with kWalFile and move it to public, which will be used in https://github.com/facebook/rocksdb/issues/7523 Pull Request resolved: https://github.com/facebook/rocksdb/pull/7580 Test Plan: make check Reviewed By: riversand963 Differential Revision: D24485420 Pulled By: zhichao-cao fbshipit-source-id: 955e3dacc1021bb590fde93b0a568ffe9ad80799	5 years ago
Jay Zhuang	e127fe18c3	Fix TSAN failure for backupable_db_test (#7478 ) Summary: It's a transient failure, but can be reproduce with running the test 100 times: https://app.circleci.com/pipelines/github/facebook/rocksdb/3760/workflows/de909685-f22b-45ba-a8f3-6ebb78a54e96/jobs/37039 Pull Request resolved: https://github.com/facebook/rocksdb/pull/7478 Test Plan: re-run the test 100 times Reviewed By: ajkr Differential Revision: D24035758 Pulled By: jay-zhuang fbshipit-source-id: 6b31983d5c3f7faa8d5481306098513485d0d69d	5 years ago
Peter Dillinger	9d8eb77c4d	Less I/O for incremental backups, slightly better corruption detection (#7413 ) Summary: Two relatively simple functional changes to incremental backup behavior, integrated with a minor refactoring to reduce code redundancy and improve error/log message. There are nuances to the impact of these changes, but I believe they are fundamentally good and generally safe. Those functional changes: * Incremental backups no longer read DB table files that are already saved to a shared part of the backup directory, unless `share_files_with_checksum` is used with `kLegacyCrc32cAndFileSize` naming (discouraged) where crc32c full file checksums are needed to determine file naming. * Justification: incremental backups should not need to read the whole DB, especially without rate limiting. (Although other BackupEngine reads are not rate limited either, other non-trivial reads are generally limited by a corresponding write, as in copying files.) Also, the fact that this is not already fixed was arguably a bug/oversight in the implementation of https://github.com/facebook/rocksdb/issues/7110. * When considering whether a table file is already backed up in a shared part of backup directory, BackupEngine would already query the sizes of source (DB) and pre-existing destination (backup) files. BackupEngine now uses these file sizes to detect corruption, as at least one of (a) old backup, (b) backup in progress, or (c) current DB is corrupt if there's a size mismatch. * Justification: a random related fix that also helps to cover a small hole in corruption checking uncovered by the other functional change: * For `share_table_files` without "checksum" (not recommended), the other change regresses in detecting fundamentally unsafe use of this option combination: when you might generate different versions of same SST file number. As demonstrated by `BackupableDBTest.FailOverwritingBackups,` this regression is greatly mitigated by the new file size checking. Nevertheless, almost no reason to use `share_files_with_checksum=false` should remain, and comments are updated appropriately. Also, this change renames internal function `CalculateChecksum` to `ReadFileAndComputeChecksum` to make the performance impact of this function clear in code reviews. It is not clear what 'same_path' is for in backupable_db.cc, and I suspect it cannot be true for a DB with unique file names (like DBImpl). Nevertheless, I've tried to keep its functionality intact when `true` to minimize risk for now, despite having no unit tests for which it is true. Select impact details (much more in unit tests): For `share_files_with_checksum`, I am confident there is no regression (vs. pre-6.12) in detecting DB or backup corruption at backup creation time, mostly because the old design did not leverage this extra checksum computation for detecting inconsistencies at backup creation time. (With computed checksums in names, a recently corrupted file just looked like a different file vs. what was already backed up.) Even in the hypothetical case of DB session id collision (~100 bits entropy collision), file size in name and/or our file size check add an extra layer of protection against false success in creating an accurate new backup. (Unit test included.) `DB::VerifyChecksum` and `BackupEngine::VerifyBackup` with checksum checking are still able to catch corruptions that `CreateNewBackup` does not. Note that when custom file checksum support is added to BackupEngine, that will essentially give the same power as `DB::VerifyChecksum` into `CreateNewBackup`. We could add options for `CreateNewBackup` to cover some of what would be caught by `VerifyBackup` with checksum checking. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7413 Test Plan: Two new unit tests included, both of which fail without these changes. Although we don't test the I/O improvement directly, we test it indirectly in DB corruption detection power that was inadvertently unlocked with new backup file naming PLUS computing current content checksums (now removed). (I don't think that case of DB corruption detection justifies reading the whole DB on incremental backup.) Reviewed By: zhichao-cao Differential Revision: D23818480 Pulled By: pdillinger fbshipit-source-id: 148aff16f001af5b9fd4b22f155311c2461f1bac	5 years ago
Peter Dillinger	b475a83f9d	Postponing custom checksum support in BackupEngine (#7411 ) Summary: This change reverts BackupEngine to 6.12 state to accommodate a higher-priority fix that does not easily merge with this custom checksum support. We intend to reinstate this support soon, by merging a revert of this change. For backupable_db_test, I've removed the tests depending on this feature. I've also removed relevant HISTORY.md entry. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7411 Test Plan: unit tests Reviewed By: ajkr Differential Revision: D23793835 Pulled By: pdillinger fbshipit-source-id: 7e861436539584799b13d1a8ae559b81b6d08052	5 years ago
Peter Dillinger	93719fc953	Restore file size in backup table file names (and other cleanup) (#7400 ) Summary: Prior to 6.12, backup files using share_files_with_checksum had the file size encoded in the file name, after the last '\_' and before the last '.'. We considered this an implementation detail subject to change, and indeed removed this information from the file name (with an option to use old behavior) because it was considered ineffective/inefficient for file name uniqueness. However, some downstream RocksDB users were relying on this information since the file size is not explicitly in the backup manifest file. This primary purpose of this change is "retrofitting" the 6.12 release (not yet a public release) to simultaneously support the benefits of the new naming scheme (I/O performance and data correctness at scale) and preserve the file size information, both as default behaviors. With this change, we are essentially making the file size information encoded in the file name an official, though obscure, extension of the backup meta file format. We preserve an option (kLegacyCrc32cAndFileSize) to use the original "legacy" naming scheme, with its caveats, and make it easy to omit the file size information (no kFlagIncludeFileSize), for more compact file names. But note that changing the naming scheme used on an existing db and backup directory can lead to transient space amplification, as some files will be stored under two names in the shared_checksum directory. Because some backups were saved using the original 6.12 naming scheme, we offer two ways of dealing with those files: SST files generated by older 6.12 versions can either use the default naming scheme in effect when the SST files were generated (kFlagMatchInterimNaming, default, no transient space amplification) or can use a new naming scheme (no kFlagMatchInterimNaming, potential space amplification because some already stored files getting a new name). We don't have a natural way to detect which files were generated by previous 6.12 versions, but this change hacks one in by changing DB session ids to now use a more concise encoding, reducing file name length, saving ~dozen bytes from SST files, and making them visually distinct from DB ids so that they are less likely to be mixed up. Two final auxiliary notes: Recognizing that the backup file names have become a de facto part of the backup meta schema, this change makes them easier to parse and extend by putting a distinct marker, 's', before DB session ids embedded in the name. When we extend this to allow custom checksums in the name, they can get their own marker to ensure safe parsing. For backward compatibility, file size does not get a marker but is assumed for `_[0-9]+[.]` Another change from initial 6.12 default behavior is never including file custom checksum in the file name. Looking ahead to 6.13, we do not want the default behavior to cause backup space amplification for someone turning on file custom checksum checking in BackupEngine; we want that to be an easy decision. When implemented, including file custom checksums in backup file names will be a non-default option. Actual file name patterns and priorities, as regexes: kLegacyCrc32cAndFileSize OR pre-6.12 SST file -> [0-9]+_[0-9]+_[0-9]+[.]sst kFlagMatchInterimNaming set (default) AND early 6.12 SST file -> [0-9]+_[0-9a-fA-F-]+[.]sst kUseDbSessionId AND NOT kFlagIncludeFileSize -> [0-9]+_s[0-9A-Z]{20}[.]sst kUseDbSessionId AND kFlagIncludeFileSize (default) -> [0-9]+_s[0-9A-Z]{20}_[0-9]+[.]sst We might add opt-in options for more '\_' separated data in the name, but embedded file size, if present, will always be after last '\_' and before '.sst'. This change was originally applied to version 6.12. (See https://github.com/facebook/rocksdb/issues/7390) Pull Request resolved: https://github.com/facebook/rocksdb/pull/7400 Test Plan: unit tests included. Sync point callbacks are used to mimic previous version SST files. Reviewed By: ajkr Differential Revision: D23759587 Pulled By: pdillinger fbshipit-source-id: f62d8af4e0978de0a34f26288cfbe66049b70025	5 years ago
Peter Dillinger	4e258d3e63	Fix backup/restore in stress/crash test (#7357 ) Summary: (1) Skip check on specific key if restoring an old backup (small minority of cases) because it can fail in those cases. (2) Remove an old assertion about number of column families and number of keys passed in, which is broken by atomic flush (cf_consistency) test. Like other code (for better or worse) assume a single key and iterate over column families. (3) Apply mock_direct_io to NewSequentialFile so that db_stress backup works on /dev/shm. Also add more context to output in case of backup/restore db_stress failure. Also a minor fix to BackupEngine to report first failure status in creating new backup, and drop another clue about the potential source of a "Backup failed" status. Reverts "Disable backup/restore stress test (https://github.com/facebook/rocksdb/issues/7350)" Pull Request resolved: https://github.com/facebook/rocksdb/pull/7357 Test Plan: Using backup_one_in=10000, "USE_CLANG=1 make crash_test_with_atomic_flush" for 30+ minutes "USE_CLANG=1 make blackbox_crash_test" for 30+ minutes And with use_direct_reads with TEST_TMPDIR=/dev/shm/rocksdb Reviewed By: riversand963 Differential Revision: D23567244 Pulled By: pdillinger fbshipit-source-id: e77171c2e8394d173917e36898c02dead1c40b77	5 years ago
Peter Dillinger	9aad24da55	Real fix for race in backup custom checksum checking (#7309 ) Summary: This is a "real" fix for the issue worked around in https://github.com/facebook/rocksdb/issues/7294. To get DB checksum info for live files, we now read the manifest file that will become part of the checkpoint/backup. This requires a little extra handling in taking a custom checkpoint, including only reading the manifest file up to the size prescribed by the checkpoint. This moves GetFileChecksumsFromManifest from backup code to file_checksum_helper.{h,cc} and removes apparently unnecessary checking related to column families. Updated HISTORY.md and warned potential future users of DB::GetLiveFilesChecksumInfo() Pull Request resolved: https://github.com/facebook/rocksdb/pull/7309 Test Plan: updated unit test, before and after Reviewed By: ajkr Differential Revision: D23311994 Pulled By: pdillinger fbshipit-source-id: 741e30a2dc1830e8208f7648fcc8c5f000d4e2d5	5 years ago
Peter Dillinger	a1b5484811	Work around a backup bug with DB custom checksums (#7294 ) Summary: On a read-write DB configured with DBOptions::file_checksum_gen_factory, BackupEngine::CreateNewBackup can fail intermittently, with non-OK status. This is due to a race between GetLiveFiles and GetLiveFilesChecksumInfo in creating backups. For patching 6.12 release (as this commit is intended for, except this is a forward-merged version), we can simply treat files for which we falsely failed to get checksum info as legacy files lacking checksum info. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7294 Test Plan: unit test reproducer included Reviewed By: ajkr Differential Revision: D23253489 Pulled By: pdillinger fbshipit-source-id: 9e4945dad120b776ad3e753be10b962f61f28e14	5 years ago

1 2 3 4 5 ...

273 Commits (c18c4a081c74251798ad2a1abf83bad417518481)