Summary:
`CheckpointTest.CurrentFileModifiedWhileCheckpointing` could hang
because now create checkpoint triggers flush twice. The test should wait
both flush done.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7849
Test Plan: `gtest-parallel ./checkpoint_test --gtest_filter=CheckpointTest.CurrentFileModifiedWhileCheckpointing -r 100`
Reviewed By: ajkr
Differential Revision: D25860713
Pulled By: jay-zhuang
fbshipit-source-id: e1c2f23037dedc33e205519f4289a25e77816b41
Summary:
The main improvement here is to not include `.` or `..` in the results of `Env::GetChildren`. The occurrence of `.` or `..`; it is non-portable, dependent on the Operating System and the File System. See: https://www.gnu.org/software/libc/manual/html_node/Reading_002fClosing-Directory.html
There were lots of duplicate checks spread through the RocksDB codebase previously to skip `.` and `..`. This new removes the need for those at the source.
Also some minor fixes to `Env::GetChildren`:
* Improve error handling in POSIX implementation
* Remove unnecessary array allocation on Windows
* Fix struct name for Windows Non-UTF-8 API
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7819
Reviewed By: ajkr
Differential Revision: D25837394
Pulled By: jay-zhuang
fbshipit-source-id: 1e137e7218d38b450af9c083f73d5357abcbba2e
Summary:
Currently, manifest size is determined before getting min_log_num.
But between getting manifest size and getting min_log_num, concurrently, a flush might succeed, which will write new records to manifest to make some WALs become outdated, then min_log_num will be correspondingly increased, but the new records in manifest will not be copied into the checkpoint because the manifest's size is determined before them, then the newly outdated WALs will still exist in the checkpoint's manifest, but they are not linked/copied to the checkpoint because their log number is < min_log_num, so a corruption of missing WAL will be reported when restoring from the checkpoint.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7836
Test Plan: make crash_test
Reviewed By: ajkr
Differential Revision: D25788204
Pulled By: cheng-chang
fbshipit-source-id: a4e5acf30f08270b3c0a95304ff559a9e655252f
Summary:
This PR does the following:
-> Creates a WinFileSystem class. This class is the Windows equivalent of the PosixFileSystem and will be used on Windows systems.
-> Introduces a CustomEnv class. A CustomEnv is an Env that takes a FileSystem as constructor argument. I believe there will only ever be two implementations of this class (PosixEnv and WinEnv). There is still a CustomEnvWrapper class that takes an Env and a FileSystem and wraps the Env calls with the input Env but uses the FileSystem for the FileSystem calls
-> Eliminates the public uses of the LegacyFileSystemWrapper.
With this change in place, there are effectively the following patterns of Env:
- "Base Env classes" (PosixEnv, WinEnv). These classes implement the core Env functions (e.g. Threads) and have a hard-coded input FileSystem. These classes inherit from CompositeEnv, implement the core Env functions (threads) and delegate the FileSystem-like calls to the input file system.
- Wrapped Composite Env classes (MemEnv). These classes take in an Env and a FileSystem. The core env functions are re-directed to the wrapped env. The file system calls are redirected to the input file system
- Legacy Wrapped Env classes. These classes take in an Env input (but no FileSystem). The core env functions are re-directed to the wrapped env. A "Legacy File System" is created using this env and the file system calls directed to the env itself.
With these changes in place, the PosixEnv becomes a singleton -- there is only ever one created. Any other use of the PosixEnv is via another wrapped env. This cleans up some of the issues with the env construction and destruction.
Additionally, there were places in the code that required had an Env when they required a FileSystem. Many of these places would wrap the Env with a LegacyFileSystemWrapper instead of using the env->GetFileSystem(). These places were changed, thereby removing layers of additional redirection (LegacyFileSystem --> Env --> Env::FileSystem).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7703
Reviewed By: zhichao-cao
Differential Revision: D25762190
Pulled By: anand1976
fbshipit-source-id: 1a088e97fc916f28ac69c149cd1dcad0ab31704b
Summary:
Change the StringEnv and related classes to be based on FileSystem APIs rather than the corresponding Env ones. The StringSink and StringSource classes were changed to be based on the corresponding FS file classes.
Part of a cleanup to use the newer interfaces. This change also eliminates some of the casts/wrappers to LegacyFile classes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7786
Reviewed By: jay-zhuang
Differential Revision: D25761460
Pulled By: anand1976
fbshipit-source-id: 428ae8e32b3db97dbeeca08c9d3bb0d9d4d3a38f
Summary:
1. Made `WriteBatchWithIndexInternal` into a class that stores the `DB*` or `DBOptions*`.
2. Changed the `GetFromBatch` method to be non-static and use an instance of the class. Added `MergeKey` methods to perform the merge itself and return any status.
This change unifies the multiple calls to the `MergeHelper` under a single wrapped API.
Closes https://github.com/facebook/rocksdb/issues/6683
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6851
Reviewed By: ajkr
Differential Revision: D21706574
Pulled By: pdillinger
fbshipit-source-id: 6860bd64d62669aaa591846e914eed3b674e68b1
Summary:
BasicLockEscalation will cause false-positive warnings under TSAN (this is a known issue in TSAN, see details in https://gist.github.com/spetrunia/77274cf2d5848e0a7e090d622695ed4e), skip this test if TSAN is enabled, or if we are not sure whether TSAN is enabled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7814
Test Plan: watch the tsan contrun test to pass.
Reviewed By: zhichao-cao
Differential Revision: D25708094
Pulled By: cheng-chang
fbshipit-source-id: 4fc813ff373301d033d086154cc7bb60a5e95889
Summary:
Added "no-elide-constructors to the ASSERT_STATUS_CHECK builds. This flag gives more errors/warnings for some of the Status checks where an inner class checks a Status and later returns it. In this case, without the elide check on, the returned status may not have been checked in the caller, thereby bypassing the checked code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7798
Reviewed By: jay-zhuang
Differential Revision: D25680451
Pulled By: pdillinger
fbshipit-source-id: c3f14ed9e2a13f0a8c54d839d5fb4d1fc1e93917
Summary:
In a stress test failure, we observe that a WAL is skipped when creating checkpoint, although its log number >= MinLogNumberToKeep(). This might happen in the following case:
1. when creating the checkpoint, there are 2 column families: CF0 and CF1, and there are 2 WALs: 1, 2;
2. CF0's log number is 1, CF0's active memtable is empty, CF1's log number is 2, CF1's active memtable is not empty, WAL 2 is not empty, the sequence number points to WAL 2;
2. the checkpoint process flushes CF0, since CF0' active memtable is empty, there is no need to SwitchMemtable, thus no new WAL will be created, so CF0's log number is now 2, concurrently, some data is written to CF0 and WAL 2;
3. the checkpoint process flushes CF1, WAL 3 is created and CF1's log number is now 3, CF0's log number is still 2 because CF0 is not empty and WAL 2 contains its unflushed data concurrently written in step 2;
4. the checkpoint process determines that WAL 1 and 2 are no longer needed according to [live_wal_files[i]->StartSequence() >= *sequence_number](https://github.com/facebook/rocksdb/blob/master/utilities/checkpoint/checkpoint_impl.cc#L388), so it skips linking them to the checkpoint directory;
5. but according to `MinLogNumberToKeep()`, WAL 2 still needs to be kept because CF0's log number is 2.
If the checkpoint is reopened in read-only mode, and only read from the snapshot with the initial sequence number, then there will be no data loss or data inconsistency.
But if the checkpoint is reopened and read from the most recent sequence number, suppose in step 3, there are also data concurrently written to CF1 and WAL 3, then the most recent sequence number refers to the latest entry in WAL 3, so the data written in step 2 should also be visible, but since WAL 2 is discarded, those data are lost.
When tracking WAL in MANIFEST is enabled, when reopening the checkpoint, since WAL 2 is still tracked in MANIFEST as alive, but it's missing from the checkpoint directory, a corruption will be reported.
This PR makes the checkpoint process to only skip a WAL if its log number < `MinLogNumberToKeep`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7789
Test Plan: watch existing tests to pass.
Reviewed By: ajkr
Differential Revision: D25662346
Pulled By: cheng-chang
fbshipit-source-id: 136471095baa01886cf44809455cf855f24857a0
Summary:
Range Locking - an implementation based on the locktree library
- Add a RangeTreeLockManager and RangeTreeLockTracker which implement
range locking using the locktree library.
- Point locks are handled as locks on single-point ranges.
- Add a unit test: range_locking_test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7506
Reviewed By: akankshamahajan15
Differential Revision: D25320703
Pulled By: cheng-chang
fbshipit-source-id: f86347384b42ba2b0257d67eca0f45f806b69da7
Summary:
This disables Linux/amd64 builds in Travis for PRs, and adds a
gcc-10+c++20 build in CircleCI, which should fill out sufficient coverage
vs. what we had in Travis
Fixed a use of std::is_pod, which is deprecated in c++20
Fixed ++ on a volatile in db_repl_stress.cc, with bigger refactoring.
Although ++ on this volatile was probably ok with one thread writer and
one thread reader, the code was still overly complex. There was a
deadcode check for error
`if (replThread.no_read < dataPump.no_records)` which can be proven
never to happen based on the structure of the code. It infinite loops
instead for the case intended to be checked. I just simplified the code
for what should be the same checking power.
Also most configurations seem to be using make parallelism = 2 * vcores,
so fixing / using that.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7791
Test Plan:
CI
and `while ./db_repl_stress; do echo again; done` for a while
Reviewed By: siying
Differential Revision: D25669834
Pulled By: pdillinger
fbshipit-source-id: b2c688053d0b1d52c989903449d3cd27a04130d6
Summary:
Inject the random write error to stress test, it requires set reopen=0 and disable_wal=true.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7653
Test Plan: pass db_stress and python3 db_crashtest.py blackbox
Reviewed By: ajkr
Differential Revision: D25354132
Pulled By: zhichao-cao
fbshipit-source-id: 44721104eecb416e27f65f854912c40e301dd669
Summary:
Some clients do not close their iterators until after the transaction finishes. To handle this case, we will invalidate any iterators on transaction clear.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7733
Reviewed By: cheng-chang
Differential Revision: D25261158
Pulled By: lth
fbshipit-source-id: b91320f00c54cbe0e6882b794b34f3bb5640dbc0
Summary:
To be used for implementing Range Locking.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7753
Reviewed By: zhichao-cao
Differential Revision: D25378980
Pulled By: cheng-chang
fbshipit-source-id: 801a9c5cd92a84654ca2586b73e8f69001e89320
Summary:
This PR has two commits:
1. Modify the code to allow different Lock Managers (of any kind) to be used. It is implied that a LockManager uses its own custom LockTracker.
2. Add definitions for Range Locking (class Endpoint and GetRangeLock() function.
cheng-chang, is this what you've had in mind (should the PR have both item 1 and item 2?)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7443
Reviewed By: zhichao-cao
Differential Revision: D24123172
Pulled By: cheng-chang
fbshipit-source-id: c6548ad6d4cc3c25f68d13b29147bc6fdf357185
Summary:
Handle misuse of snprintf return value to avoid Out of bound
read/write.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7686
Test Plan: make check -j64
Reviewed By: riversand963
Differential Revision: D25030831
Pulled By: akankshamahajan15
fbshipit-source-id: 1a1d181c067c78b94d720323ae00b79566b57cfa
Summary:
The patch adds iterator support to the integrated BlobDB implementation.
Whenever a blob reference is encountered during iteration, the corresponding
blob is retrieved by calling `Version::GetBlob`, assuming the `expose_blob_index`
(formerly `allow_blob`) flag is *not* set. (Note: the flag is set by the old stacked
BlobDB implementation, which has its own blob file handling/blob retrieval logic.)
In addition, `DBIter` now uniformly returns `Status::NotSupported` with the error
message `"BlobDB does not support merge operator."` when encountering a
blob reference while performing a merge (instead of potentially returning a
message that implies the database should be opened using the stacked BlobDB's
`Open`.)
TODO: We can implement support for lazily retrieving the blob value (or in other
words, bypassing the retrieval of blob values based on key) by extending the `Iterator`
API with a new `PrepareValue` method (similarly to `InternalIterator`, which already
supports lazy values).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7731
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D25256293
Pulled By: ltamasi
fbshipit-source-id: c39cd782011495a526cdff99c16f5fca400c4811
Summary:
An application may accidentally write merge operands without properly configuring `merge_operator`. We should alert them as early as possible that there's an API misuse. Previously RocksDB only notified them when a query or background operation needed to merge but couldn't. With this PR, RocksDB notifies them of the problem before applying the merge operand to the memtable (although it may already be in WAL, which seems it'd cause a crash loop until they enable `merge_operator`).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7667
Reviewed By: riversand963
Differential Revision: D24933360
Pulled By: ajkr
fbshipit-source-id: 3a4a2ceb0b7aed184113dd03b8efd735a8332f7f
Summary:
Consider the following sequence of events:
1. Db flushed an SST with file number N, appended to MANIFEST, and tried to sync the MANIFEST.
2. Syncing MANIFEST failed and db crashed.
3. Db tried to recover with this MANIFEST. In the meantime, no entry about the newly-flushed SST was found in the MANIFEST. Therefore, RocksDB replayed WAL and tried to flush to an SST file reusing the same file number N. This failed because file system does not support overwrite. Then Db deleted this file.
4. Db crashed again.
5. Db tried to recover. When db read the MANIFEST, there was an entry referencing N.sst. This could happen probably because the append in step 1 finally reached the MANIFEST and became visible. Since N.sst had been deleted in step 3, recovery failed.
It is possible that N.sst created in step 1 is valid. Although step 3 would still fail since the MANIFEST was not synced properly in step 1 and 2, deleting N.sst would make it impossible for the db to recover even if the remaining part of MANIFEST was appended and visible after step 5.
After this PR, in step 3, immediately after recovering from MANIFEST, a new MANIFEST is created, then we find that N.sst is not referenced in the MANIFEST, so we delete it, and we'll not reuse N as file number. Then in step 5, since the new MANIFEST does not contain N.sst, the recovery failure situation in step 5 won't happen.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7621
Test Plan:
1. some tests are updated, because these tests assume that new MANIFEST is created after WAL recovery.
2. a new unit test is added in db_basic_test to simulate step 3.
Reviewed By: riversand963
Differential Revision: D24668144
Pulled By: cheng-chang
fbshipit-source-id: 90d7487fbad2bc3714f5ede46ea949895b15ae3b
Summary:
The tests often times out in internal infra, skipping fsync should reduce test time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7641
Test Plan: watch existing tests to pass
Reviewed By: anand1976
Differential Revision: D24765098
Pulled By: cheng-chang
fbshipit-source-id: c62bf8110361aee901918d632cf4772435d05e8d
Summary:
This is a PR generated **semi-automatically** by an internal tool to remove unused includes and `using` statements.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7604
Test Plan: make check
Reviewed By: ajkr
Differential Revision: D24579392
Pulled By: riversand963
fbshipit-source-id: c4bfa6c6b08da1de186690d37eb73d8fff45aecd
Summary:
Fixes Issue https://github.com/facebook/rocksdb/issues/7497
When allow_data_in_errors db_options is set, log error key details in `ParseInternalKey()`
Have fixed most of the calls. Have few TODOs still pending - because have to make more deeper changes to pass in the allow_data_in_errors flag. Will do those in a separate PR later.
Tests:
- make check
- some of the existing tests that exercise the "internal key too small" condition are: dbformat_test, cuckoo_table_builder_test
- some of the existing tests that exercise the corrupted key path are: corruption_test, merge_helper_test, compaction_iterator_test
Example of new status returns:
- Key too small - `Corrupted Key: Internal Key too small. Size=5`
- Corrupt key with allow_data_in_errors option set to false: `Corrupted Key: '<redacted>' seq:3, type:3`
- Corrupt key with allow_data_in_errors option set to true: `Corrupted Key: '61' seq:3, type:3`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7515
Reviewed By: ajkr
Differential Revision: D24240264
Pulled By: ramvadiv
fbshipit-source-id: bc48f5d4475ac19d7713e16df37505b31aac42e7
Summary:
This PR does a few things:
1. The MockFileSystem class was split out from the MockEnv. This change would theoretically allow a MockFileSystem to be used by other Environments as well (if we created a means of constructing one). The MockFileSystem implements a FileSystem in its entirety and does not rely on any Wrapper implementation.
2. Make the RocksDB test suite work when MOCK_ENV=1 and ENCRYPTED_ENV=1 are set. To accomplish this, a few things were needed:
- The tests that tried to use the "wrong" environment (Env::Default() instead of env_) were updated
- The MockFileSystem was changed to support the features it was missing or mishandled (such as recursively deleting files in a directory or supporting renaming of a directory).
3. Updated the test framework to have a ROCKSDB_GTEST_SKIP macro. This can be used to flag tests that are skipped. Currently, this defaults to doing nothing (marks the test as SUCCESS) but will mark the tests as SKIPPED when RocksDB is upgraded to a version of gtest that supports this (gtest-1.10).
I have run a full "make check" with MEM_ENV, ENCRYPTED_ENV, both, and neither under both MacOS and RedHat. A few tests were disabled/skipped for the MEM/ENCRYPTED cases. The error_handler_fs_test fails/hangs for MEM_ENV (presumably a timing problem) and I will introduce another PR/issue to track that problem. (I will also push a change to disable those tests soon). There is one more test in DBTest2 that also fails which I need to investigate or skip before this PR is merged.
Theoretically, this PR should also allow the test suite to run against an Env loaded from the registry, though I do not have one to try it with currently.
Finally, once this is accepted, it would be nice if there was a CircleCI job to run these tests on a checkin so this effort does not become stale. I do not know how to do that, so if someone could write that job, it would be appreciated :)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7566
Reviewed By: zhichao-cao
Differential Revision: D24408980
Pulled By: jay-zhuang
fbshipit-source-id: 911b1554a4d0da06fd51feca0c090a4abdcb4a5f
Summary:
As suggested by pdillinger ,The name of kLogFile is misleading, in some tests, kLogFile is defined as info log. Replace it with kWalFile and move it to public, which will be used in https://github.com/facebook/rocksdb/issues/7523
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7580
Test Plan: make check
Reviewed By: riversand963
Differential Revision: D24485420
Pulled By: zhichao-cao
fbshipit-source-id: 955e3dacc1021bb590fde93b0a568ffe9ad80799
Summary:
When `ASSERT_STATUS_CHECKED` is enabled, `transaction_test` does not pass without this PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7572
Test Plan: `ASSERT_STATUS_CHECKED=1 make -j32 transaction_test && ./transaction_test`
Reviewed By: zhichao-cao
Differential Revision: D24404319
Pulled By: cheng-chang
fbshipit-source-id: 13689035995366ab06d8eada3ea404e45fef8bc5
Summary:
Further refinement of the earlier PR. Now the Status is NotFound with a subcode of PathNotFound. Also the existing functions for options parsing/loading are reverted to return InvalidArgument no matter in which way the user-provided arguments are deemed invalid.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7563
Reviewed By: zhichao-cao
Differential Revision: D24422491
Pulled By: ajkr
fbshipit-source-id: ba6b237cd0584d3f925c5ba0d349aeb8c250af67
Summary:
In order to be able to introduce more locking protocols, we need to abstract out the locking subsystem in TransactionDB into a set of interfaces.
PR https://github.com/facebook/rocksdb/pull/7013 introduces interface `LockTracker`. This PR is a follow up to take the first step to abstract out a `LockManager` interface.
Further modifications to the interface may be needed when introducing the first implementation of range lock. But the idea here is to put the range lock implementation based on range tree under the `utilities/transactions/lock/range/range_tree`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7532
Test Plan: point_lock_manager_test
Reviewed By: ajkr
Differential Revision: D24238731
Pulled By: cheng-chang
fbshipit-source-id: 2a9458cd8b3fb008d9529dbc4d3b28c24631f463
Summary:
The patch adds blob file support to the `Get` API by extending `Version` so that
whenever a blob reference is read from a file, the blob is retrieved from the corresponding
blob file and passed back to the caller. (This is assuming the blob reference is valid
and the blob file is actually part of the given `Version`.) It also introduces a cache
of `BlobFileReader`s called `BlobFileCache` that enables sharing `BlobFileReader`s
between callers. `BlobFileCache` uses the same backing cache as `TableCache`, so
`max_open_files` (if specified) limits the total number of open (table + blob) files.
TODO: proactively open/cache blob files and pin the cache handles of the readers in the
metadata objects similarly to what `VersionBuilder::LoadTableHandlers` does for
table files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7540
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D24260219
Pulled By: ltamasi
fbshipit-source-id: a8a2a4f11d3d04d6082201b52184bc4d7b0857ba
Summary:
Make LoadLatestOptions return PathNotFound if the options file does not exist. Added tests for the LoadOptions related methods.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7554
Reviewed By: akankshamahajan15
Differential Revision: D24298985
Pulled By: zhichao-cao
fbshipit-source-id: c9ae3cb12fc4a5bbef07743e1c1300f98a2441b3
Summary:
The patch does some cleanup in and around the legacy `BlobLogReader` class:
* It renames the class to `BlobLogSequentialReader` to emphasize that it is for
sequentially iterating through blobs in a blob file, as opposed to doing random
point reads using `BlobIndex`es (which is `BlobFileReader`'s jurisdiction).
* It removes some dead code from the old BlobDB implementation that references
`BlobLogReader` (namely the method `BlobFile::OpenRandomAccessReader`).
* It cleans up some `#include`s and forward declarations.
* It fixes some incorrect/outdated comments related to the reader class.
* It adds a few assertions to the `Read` methods of the class.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7517
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D24172611
Pulled By: ltamasi
fbshipit-source-id: 43e2ae1eba5c3dd30c1070cb00f217edc45bd64f
Summary:
Add db_basic_test status check list. Some of the warnings are suppressed. It is possible that some of them are due to real bugs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7452
Test Plan: See CI tests pass.
Reviewed By: zhichao-cao
Differential Revision: D23979764
fbshipit-source-id: 6151570c2a9b931b0fbb3fe939a94b2bd1583cbe
Summary:
The patch adds support for injecting errors when reading from `RandomAccessFile`
using `FaultInjectionTestEnv`. (This functionality was curiously missing
w/r/t `RandomAccessFile`, even though it was implemented for `RandomRWFile`.)
The patch also fixes up a test case in `blob_db_test` which uses `FaultInjectionTestEnv`
but has so far relied on reads from `RandomAccessFile`s succeeding even after
deactivating the filesystem.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7447
Test Plan: `make check`
Reviewed By: zhichao-cao
Differential Revision: D23971740
Pulled By: ltamasi
fbshipit-source-id: 8492736cb64b1ee138c658822535f3ff4fe560c6
Summary:
A generic algorithm in progress depends on a templatized
version of fastrange, so this change generalizes it and renames
it to fit our style guidelines, FastRange32, FastRange64, and now
FastRangeGeneric.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7436
Test Plan: added a few more test cases
Reviewed By: jay-zhuang
Differential Revision: D23958153
Pulled By: pdillinger
fbshipit-source-id: 8c3b76101653417804997e5f076623a25586f3e8
Summary:
Add new AppendWithVerify and PositionedAppendWithVerify APIs to Env and FileSystem to bring the data verification information (data checksum information) from upper layer (e.g., WritableFileWriter) to the storage layer. This PR only include the API definition, no functional codes are added to unblock other developers which depend on these APIs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7419
Test Plan: make -j32
Reviewed By: pdillinger
Differential Revision: D23883196
Pulled By: zhichao-cao
fbshipit-source-id: 94676c26bc56144cc32e3661f84f21eccd790411
Summary:
Two relatively simple functional changes to incremental backup
behavior, integrated with a minor refactoring to reduce code redundancy and
improve error/log message. There are nuances to the impact of these changes,
but I believe they are fundamentally good and generally safe. Those functional
changes:
* Incremental backups no longer read DB table files that are already saved to a
shared part of the backup directory, unless `share_files_with_checksum` is used
with `kLegacyCrc32cAndFileSize` naming (discouraged) where crc32c full file
checksums are needed to determine file naming.
* Justification: incremental backups should not need to read the whole DB,
especially without rate limiting. (Although other BackupEngine reads are not
rate limited either, other non-trivial reads are generally limited by a
corresponding write, as in copying files.) Also, the fact that this is not
already fixed was arguably a bug/oversight in the implementation of https://github.com/facebook/rocksdb/issues/7110.
* When considering whether a table file is already backed up in a shared part
of backup directory, BackupEngine would already query the sizes of source (DB)
and pre-existing destination (backup) files. BackupEngine now uses these file
sizes to detect corruption, as at least one of (a) old backup, (b) backup in
progress, or (c) current DB is corrupt if there's a size mismatch.
* Justification: a random related fix that also helps to cover a small hole
in corruption checking uncovered by the other functional change:
* For `share_table_files` without "checksum" (not recommended), the other
change regresses in detecting fundamentally unsafe use of this option
combination: when you might generate different versions of same SST file
number. As demonstrated by `BackupableDBTest.FailOverwritingBackups,` this
regression is greatly mitigated by the new file size checking. Nevertheless,
almost no reason to use `share_files_with_checksum=false` should remain, and
comments are updated appropriately.
Also, this change renames internal function `CalculateChecksum` to
`ReadFileAndComputeChecksum` to make the performance impact of this function
clear in code reviews.
It is not clear what 'same_path' is for in backupable_db.cc, and I suspect it
cannot be true for a DB with unique file names (like DBImpl). Nevertheless,
I've tried to keep its functionality intact when `true` to minimize risk for
now, despite having no unit tests for which it is true.
Select impact details (much more in unit tests): For
`share_files_with_checksum`, I am confident there is no regression (vs.
pre-6.12) in detecting DB or backup corruption at backup creation time, mostly
because the old design did not leverage this extra checksum computation for
detecting inconsistencies at backup creation time. (With computed checksums in
names, a recently corrupted file just looked like a different file vs. what was
already backed up.)
Even in the hypothetical case of DB session id collision (~100 bits entropy
collision), file size in name and/or our file size check add an extra layer of
protection against false success in creating an accurate new backup. (Unit test
included.)
`DB::VerifyChecksum` and `BackupEngine::VerifyBackup` with checksum checking
are still able to catch corruptions that `CreateNewBackup` does not. Note that
when custom file checksum support is added to BackupEngine, that will
essentially give the same power as `DB::VerifyChecksum` into `CreateNewBackup`.
We could add options for `CreateNewBackup` to cover some of what would be
caught by `VerifyBackup` with checksum checking.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7413
Test Plan:
Two new unit tests included, both of which fail without these
changes. Although we don't test the I/O improvement directly, we test it
indirectly in DB corruption detection power that was inadvertently unlocked
with new backup file naming PLUS computing current content checksums (now
removed). (I don't think that case of DB corruption detection justifies reading
the whole DB on incremental backup.)
Reviewed By: zhichao-cao
Differential Revision: D23818480
Pulled By: pdillinger
fbshipit-source-id: 148aff16f001af5b9fd4b22f155311c2461f1bac