Summary:
Prior to 6.12, backup files using share_files_with_checksum had
the file size encoded in the file name, after the last '\_' and before
the last '.'. We considered this an implementation detail subject to
change, and indeed removed this information from the file name (with an
option to use old behavior) because it was considered
ineffective/inefficient for file name uniqueness. However, some
downstream RocksDB users were relying on this information since the file
size is not explicitly in the backup manifest file.
This primary purpose of this change is "retrofitting" the 6.12 release
(not yet a public release) to simultaneously support the benefits of the
new naming scheme (I/O performance and data correctness at scale) and
preserve the file size information, both as default behaviors. With this
change, we are essentially making the file size information encoded in
the file name an official, though obscure, extension of the backup meta
file format.
We preserve an option (kLegacyCrc32cAndFileSize) to use the original
"legacy" naming scheme, with its caveats, and make it easy to omit the
file size information (no kFlagIncludeFileSize), for more compact file
names. But note that changing the naming scheme used on an existing db
and backup directory can lead to transient space amplification, as some
files will be stored under two names in the shared_checksum directory.
Because some backups were saved using the original 6.12 naming scheme,
we offer two ways of dealing with those files: SST files generated by
older 6.12 versions can either use the default naming scheme in effect
when the SST files were generated (kFlagMatchInterimNaming, default, no
transient space amplification) or can use a new naming scheme (no
kFlagMatchInterimNaming, potential space amplification because some
already stored files getting a new name).
We don't have a natural way to detect which files were generated by
previous 6.12 versions, but this change hacks one in by changing DB
session ids to now use a more concise encoding, reducing file name
length, saving ~dozen bytes from SST files, and making them visually
distinct from DB ids so that they are less likely to be mixed up.
Two final auxiliary notes:
Recognizing that the backup file names have become a de facto part of
the backup meta schema, this change makes them easier to parse and
extend by putting a distinct marker, 's', before DB session ids embedded
in the name. When we extend this to allow custom checksums in the name,
they can get their own marker to ensure safe parsing. For backward
compatibility, file size does not get a marker but is assumed for
`_[0-9]+[.]`
Another change from initial 6.12 default behavior is never including
file custom checksum in the file name. Looking ahead to 6.13, we do not
want the default behavior to cause backup space amplification for
someone turning on file custom checksum checking in BackupEngine; we
want that to be an easy decision. When implemented, including file
custom checksums in backup file names will be a non-default option.
Actual file name patterns and priorities, as regexes:
kLegacyCrc32cAndFileSize OR pre-6.12 SST file ->
[0-9]+_[0-9]+_[0-9]+[.]sst
kFlagMatchInterimNaming set (default) AND early 6.12 SST file ->
[0-9]+_[0-9a-fA-F-]+[.]sst
kUseDbSessionId AND NOT kFlagIncludeFileSize ->
[0-9]+_s[0-9A-Z]{20}[.]sst
kUseDbSessionId AND kFlagIncludeFileSize (default) ->
[0-9]+_s[0-9A-Z]{20}_[0-9]+[.]sst
We might add opt-in options for more '\_' separated data in the name,
but embedded file size, if present, will always be after last '\_' and
before '.sst'.
This change was originally applied to version 6.12. (See https://github.com/facebook/rocksdb/issues/7390)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7400
Test Plan:
unit tests included. Sync point callbacks are used to mimic
previous version SST files.
Reviewed By: ajkr
Differential Revision: D23759587
Pulled By: pdillinger
fbshipit-source-id: f62d8af4e0978de0a34f26288cfbe66049b70025
Summary:
Cleaned up the public API to use the EncryptedEnv. This change will allow providers to be developed and added to the system easier in the future. It will also allow better integration in the future with the OPTIONS file.
- The internal classes were moved out of the public API into an internal "env_encryption_ctr.h" header. Short-cut constructors were added to provide the original API functionality.
- The APIs to the constructors were changed to take shared_ptr, rather than raw pointers or references to allow better memory management and alternative implementations.
- CreateFromString methods were added to allow future expansion to other provider and cipher implementations through a standard API.
Additionally, there was a code duplication in the NewXXXFile methods. This common code was moved under a templatized function.
A first-pass at structuring the code was made to potentially allow multiple EncryptionProviders in a single EncryptedEnv. The idea was that different providers may use different cipher keys or different versions/algorithms. The EncryptedEnv should have some means of picking different providers based on information. The groundwork was started for this (the use of the provider_ member variable was localized) but the work has not been completed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7279
Reviewed By: jay-zhuang
Differential Revision: D23709440
Pulled By: zhichao-cao
fbshipit-source-id: 0e845fff0e03a52603eb9672b4ade32d063ff2f2
Summary:
This PR merges the functionality of making the ColumnFamilyOptions, TableFactory, and DBOptions into Configurable into a single PR, resolving any merge conflicts
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5753
Reviewed By: ajkr
Differential Revision: D23385030
Pulled By: zhichao-cao
fbshipit-source-id: 8b977a7731556230b9b8c5a081b98e49ee4f160a
Summary:
In a distributed file system, directory ownership is enforced by fencing
off the previous owner once they've been preempted by a new owner. This
PR adds a IOStatus subcode for ```StatusCode::IOError``` to indicate this.
Once this error is returned for a file write, the DB is put in read-only
mode and not allowed to resume in read-write mode.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7374
Test Plan: Add new unit tests in ```error_handler_fs_test```
Reviewed By: riversand963
Differential Revision: D23687777
Pulled By: anand1976
fbshipit-source-id: bef948642089dc0af399057864d9a8ca339e8b2f
Summary:
The patch adds support for exposing the start of the expiration range
for TTL blob files through the `GetLiveFilesMetaData` API. This can be
used for monitoring purposes, i.e. to make sure TTL blob files are
deleted in a timely manner. The patch also fixes a couple of uninitialized
variable issues.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7365
Test Plan: `make check`
Reviewed By: pdillinger
Differential Revision: D23605465
Pulled By: ltamasi
fbshipit-source-id: 97a9612bf5f4b058423debdd3f28f576bb23a70f
Summary:
Currently, application may pass a statistics object to db but later
wants to reduce stats tracking overhead by setting stats level to
kExceptHistogramOrTimers (the current lowest level). Tickers will still
be incremented, causing up to 1% CPU. We can add a new lowest stats
level `kExceptTickers` to disable ticker incrementing as well, thus
reducing CPU cycles spent on tickers.
Test Plan (devserver):
```
make check
make clean
DEBUG_LEVEL=0 make db_bench
./db_bench -perf_level=1 -stats_level=0 -statistics -benchmarks=fillseq,readrandom -duration=120
```
Measure CPU util (%) before and after change:
CPU util by rocksdb::RecordTick: 1.1 vs (<0.1)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7329
Reviewed By: pdillinger
Differential Revision: D23434014
Pulled By: riversand963
fbshipit-source-id: 72ff0f02a192ac476d4b0044b9f37fd4a22ff0d4
Summary:
This is adapted from https://github.com/facebook/rocksdb/issues/6678 but takes a different approach, avoiding opening a read-write DB and avoiding the `DeleteFile()` API.
First, this PR refactors how options variables are initialized in `ldb` so it can be reused in a subcommand that doesn't open a DB:
- Separated remaining option initialization logic out of `OpenDB()`. The new `PrepareOptions()` function initializes the full options state.
- Fixed an old TODO about applying the subcommand CF option overrides to the proper `ColumnFamilyOptions` object.
Second, this PR adds the `ldb unsafe_remove_sst_file` subcommand. It uses the `VersionSet`-level APIs to remove the file with the specified number.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7335
Test Plan: played with interactive python and this file removal command. Verified openability/correct results in case of multiple column families, multiple levels, etc.
Reviewed By: pdillinger
Differential Revision: D23454575
Pulled By: ajkr
fbshipit-source-id: 039b7a8cbfc42fd123dcb25821eef51d61148afe
Summary:
Since we can't land https://github.com/facebook/rocksdb/issues/7336 until the next major release, added a strong warning against the `DeleteFile()` API in the meantime.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7337
Reviewed By: pdillinger
Differential Revision: D23459728
Pulled By: ajkr
fbshipit-source-id: 326cb9b18190386080c35c761a8736d8a877dafb
Summary:
This PR creates `rocksdb_open_column_families_with_ttl` which allows C API users to open a DBWithTLL with column families.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7314
Reviewed By: cheng-chang
Differential Revision: D23430287
Pulled By: ajkr
fbshipit-source-id: 307aa21d170d1402653263a91f6f832ef76afba0
Summary:
A new file interface `SupportPrefetch()` is added. When the user overrides it to `false`, an internal prefetch buffer will be used for readahead. Useful for non-directIO but FS doesn't have readahead support.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7312
Reviewed By: anand1976
Differential Revision: D23329847
Pulled By: jay-zhuang
fbshipit-source-id: 71cd4ce6f4a820840294e4e6aec111ab76175527
Summary:
This is a "real" fix for the issue worked around in https://github.com/facebook/rocksdb/issues/7294.
To get DB checksum info for live files, we now read the manifest file
that will become part of the checkpoint/backup. This requires a little
extra handling in taking a custom checkpoint, including only reading the
manifest file up to the size prescribed by the checkpoint.
This moves GetFileChecksumsFromManifest from backup code to
file_checksum_helper.{h,cc} and removes apparently unnecessary checking
related to column families.
Updated HISTORY.md and warned potential future users of
DB::GetLiveFilesChecksumInfo()
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7309
Test Plan: updated unit test, before and after
Reviewed By: ajkr
Differential Revision: D23311994
Pulled By: pdillinger
fbshipit-source-id: 741e30a2dc1830e8208f7648fcc8c5f000d4e2d5
Summary:
When SST file is created, application is able to know the file information through OnTableFileCreated callback in LogAndNotifyTableFileCreationFinished. Since file checksum information can be useful for application when the SST file is created, we add file_checksum and file_checksum_func_name information to TableFileCreationInfo, which will be passed through OnTableFileCreated.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7108
Test Plan: make check, listener_test.
Reviewed By: ajkr
Differential Revision: D22470240
Pulled By: zhichao-cao
fbshipit-source-id: 92c20344d9b986eadfe3480f3769bf4add0dbaae
Summary:
This test uses database functionality and required more extensive work to get it to pass than the other tests. The DB functionality required for this test now passes the check.
When it was unclear what the proper behavior was for unchecked status codes, a TODO was added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7283
Reviewed By: akankshamahajan15
Differential Revision: D23251497
Pulled By: ajkr
fbshipit-source-id: 52b79629bdafa0a58de8ead1d1d66f141b331523
Summary:
- Made it clear only one record in the tail is allowed to have a problem
- Added detail about the valid use case instead of calling it legacy behavior
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7270
Reviewed By: riversand963
Differential Revision: D23169075
Pulled By: ajkr
fbshipit-source-id: 2a4b45aa8641f17efa104523fbad765012a98fb0
Summary:
Add `kEntryDeleteWithTimestamp` to `EntryType` which is a public API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7195
Test Plan: make check
Reviewed By: ajkr
Differential Revision: D22914704
Pulled By: riversand963
fbshipit-source-id: 886f73c6b70c527cad1c8fc9fc8d3afe60e1ea39
Summary:
A new option `std::shared_ptr<FileChecksumGenFactory> backup_checksum_gen_factory` is added to `BackupableDBOptions`. This allows custom checksum functions to be used for creating, verifying, or restoring backups.
Tests are added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7085
Test Plan: Passed make check
Reviewed By: pdillinger
Differential Revision: D22390756
Pulled By: gg814
fbshipit-source-id: 3b7756ca444c2129844536b91c3ca09f53b6248f
Summary:
Introduce io_timeout in ReadOptions and enabled deadline/io_timeout for
Iterators.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7161
Test Plan: New unit tests in db_basic_test
Reviewed By: riversand963
Differential Revision: D22687352
Pulled By: anand1976
fbshipit-source-id: 67bbb0e6d7ae80b256589244468494292538c6ec
Summary:
The patch moves `CompressionType` to its own header file and makes sure
all other public headers include this new header directly, as opposed to
relying on transitive includes or forward declarations.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7162
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D22676545
Pulled By: ltamasi
fbshipit-source-id: 01d7a232377a229cbbc373d0ec1bf01dc0b0ce02
Summary:
Very small improvements to document the defaults.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7215
Reviewed By: zhichao-cao
Differential Revision: D22902286
fbshipit-source-id: a754d172a0d8e4c03754f6f1771d4a693d60a770
Summary:
Adds compaction statistics (total bytes read and written) for compactions that occur for delete-triggered, periodic, and TTL compaction reasons.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7165
Test Plan:
TTL and periodic can be checked by runnning db_bench with the options activated:
/db_bench --benchmarks="fillrandom,stats" --statistics --num=10000000 -base_background_compactions=16 -periodic_compaction_seconds=1
./db_bench --benchmarks="fillrandom,stats" --statistics --num=10000000 -base_background_compactions=16 -fifo_compaction_ttl=1
Setting the time to one second causes non-zero bytes read/written for those compaction reasons. Disabling them or setting them to times longer than the test run length causes the stats to return to zero as expected.
Delete-triggered compaction counting is tested in DBTablePropertiesTest.DeletionTriggeredCompactionMarking
Reviewed By: ajkr
Differential Revision: D22693050
Pulled By: akabcenell
fbshipit-source-id: d15cef4d94576f703015c8942d5f0d492f69401d
Summary:
Make (most of) the env*_test pass when ASSERT_STATUS_CHECKED is enabled.
One test that opens a database is currently disabled in this mode, as there are many errors that need revisited for DB tests and status checks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7176
Reviewed By: cheng-chang
Differential Revision: D22799278
Pulled By: ajkr
fbshipit-source-id: 16d8a02eaeecd6df1060249b6a5811292801f2ed
Summary:
SST Partitioner interface that allows to split SST files during compactions.
It basically instruct compaction to create a new file when needed. When one is using well defined prefixes and prefixed way of defining tables it is good to define also partitioning so that promotion of some SST file does not cover huge key space on next level (worst case complete space).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6957
Reviewed By: ajkr
Differential Revision: D22461239
fbshipit-source-id: 9ce07bba08b3ba89c2d45630520368f704d1316e
Summary:
When paraoid_files_checks=true, a rolling key-value hash is generated and compared to what is written to the file. If the values do not match, the SST file is rejected.
Code put in place for the check for both flush and compaction jobs. Corresponding test added to corruption_test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7134
Reviewed By: cheng-chang
Differential Revision: D22646149
fbshipit-source-id: 8fde1984a1a11edd3bd82a413acffc5ea7aa683f
Summary:
Issue https://github.com/facebook/rocksdb/issues/7133 reported that using `system_clock` in `FileOperationInfo::TimePoint` causes the duration of file flush operation (which can be a noop on MacOS in some scenarios) appears to be 0 and fail an assertion in listener_test. Using `steady_clock` supposedly fixed the problem.
`steady_clock` actually fits better into the use cases of `FileOperationInfo::TimePoint` as all usages care about durations but not wall clock time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7153
Test Plan: make check.
Reviewed By: riversand963
Differential Revision: D22654136
Pulled By: roghnin
fbshipit-source-id: 5980b1080734bdae496a18071a2c2b5887c67d85
Summary:
BackupEngine requires computing table checksums twice when backing up table files to the `shared_checksum` directory.
The repeated computation can be avoided by utilizing the db session id stored as a part of the table properties.
Filenames of table files in the `shared_checksum` directory depend on the following conditions:
1. the naming scheme is `kOptionalChecksumAndDbSessionId`,
2. `db_session_id` is not empty,
3. checksum is available in the DB manifest.
If 1,2,3 are satisfied, then the filenames will be of the form `<file_number>_<checksum>_<db_session_id>.sst`.
If 1,2 are satisfied, then the filenames will be of the form `<file_number>_<db_session_id>.sst`.
In all other cases, the filenames are of the form `<file_number>_<checksum>_<size>.sst`.
Additionally, if `kOptionalChecksumAndDbSessionId` is used (and not falling back to `kChecksumAndFileSize`), the `<checksum>` appeared in the filenames is hexadecimally encoded, instead of being plain `uint32_t` value.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7110
Test Plan: backupable_db_test and manual tests.
Reviewed By: ajkr
Differential Revision: D22508992
Pulled By: gg814
fbshipit-source-id: 5669f0ea9ad5a097f69f6d87aca4abba15032389
Summary:
In current codebase, in write path, if Retryable IO Error happens, SetBGError is called. The retryable IO Error is converted to hard error and DB is in read only mode. User or application needs to resume it. In this PR, if Retryable IO Error happens in one DB, SetBGError will create a new thread to call Resume (auto resume). otpions.max_bgerror_resume_count controls if auto resume is enabled or not (if max_bgerror_resume_count<=0, auto resume will not be enabled). options.bgerror_resume_retry_interval controls the time interval to call Resume again if the previous resume fails due to the Retryable IO Error. If non-retryable error happens during resume, auto resume will terminate.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6765
Test Plan: Added the unit test cases in error_handler_fs_test and pass make asan_check
Reviewed By: anand1976
Differential Revision: D21916789
Pulled By: zhichao-cao
fbshipit-source-id: acb8b5e5dc3167adfa9425a5b7fc104f6b95cb0b
Summary:
1. Add the wrapper classes FileSystemTracingWrapper, FSSequentialFileTracingWrapper, FSRandomAccessFileTracingWrapper, FSWritableFileTracingWrapper, FSRandomRWFileTracingWrapper that forward the calls to underlying storage system and then pass the file operation information to IOTracer. IOTracer dumps the record in binary format for tracing.
2. Add the wrapper classes FileSystemPtr, FSSequentialFilePtr, FSRandomAccessFilePtr, FSWritableFilePtr and FSRandomRWFilePtr that overload operator-> and return ptr to underlying storage system or Tracing wrapper class based on enabling/disabling of IO tracing. These classes are added to bypass Tracing Wrapper classes when we disable tracing.
3. Add enums in trace.h that distinguish which options need to be added for different file operations(Read, close, write etc) as part of tracing record.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7002
Test Plan: make check -j64
Reviewed By: anand1976
Differential Revision: D22127897
Pulled By: akankshamahajan15
fbshipit-source-id: 74cff58ce5661c9a3832dfaa52483f3b2d8565e0
Summary:
Currently, `EventListener` in listner.h only have callback functions for file read and write. One may favor extended callback functions for more file I/O operations like flush, sync and close. This PR tries to add those interface and have them called when appropriate throughout the code base.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7055
Test Plan:
Write an experimental listener with those new callback functions with log output in them; run experiments and check logs to see those functions are actually called.
Default test suits `make check` should also be included.
Reviewed By: riversand963
Differential Revision: D22380624
Pulled By: roghnin
fbshipit-source-id: 4121491d45c2c2aae8c255e7998090559a241c6a
Summary:
When table file checksums are enabled and stored in the DB manifest by using the RocksDB default crc32c checksum function, BackupEngine will calculate the crc32c checksum of the file to be copied and compare the calculated result with the one stored in the DB manifest before copying the file to the backup directory.
After copying to the backup directory, BackupEngine will verify the checksum of the copied file with the one calculated before copying. This helps detect some rare corruption events such as bit-flips during the copying process.
No verification with checksums in DB manifest will be performed if the table file checksum function is not the RocksDB default crc32c checksum function.
In addition, If `share_table_files` and `share_files_with_checksum` are true, BackupEngine will compare the checksums computed before and after copying of the table files.
Corresponding tests are added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7015
Test Plan: Passed make check
Reviewed By: pdillinger
Differential Revision: D22165732
Pulled By: gg814
fbshipit-source-id: ee0e8cc397c455eba64545c29380b9d9853588ec
Summary:
`bool BackupableDBOptions::new_naming_for_backup_files` is updated to `BackupTableNameOption BackupableDBOptions::share_files_with_checksum_naming`, where `BackupTableNameOption` is an `enum` type with two enumerators `kChecksumAndFileSize` and `kChecksumAndFileSize`. This opens up possibilities of extenting the current naming scheme for backup table files. By default, `BackupTableNameOption BackupableDBOptions::share_files_with_checksum_naming` is set to `kChecksumAndDbSessionId`.
Revert `BackupEngine::VerifyBackup` to only check file sizes by default.
Also fix the construction of the `SstFileDumper` in `GetFileDbIdentities` by setting a proper `Env` of the `Options` passed in the constructor.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7032
Test Plan: make check
Reviewed By: ajkr
Differential Revision: D22237763
Pulled By: gg814
fbshipit-source-id: 466902a4e731babd64e30f0e82ca1aa82962e52e
Summary:
Added compaction filter support for BlobDB non-TTL values. Same as vanilla RocksDB, user compaction filter applies to all k/v pairs of the compaction for non-TTL values. It honors `min_blob_size`, which potentially results value transitions between inlined data and stored-in-blob data when size of value is changed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6850
Reviewed By: siying
Differential Revision: D22263487
Pulled By: ltamasi
fbshipit-source-id: 8fc03f8cde2a5c831e63b436b3dbf1b7f90939e8
Summary:
Current implementation of the ```read_options.deadline``` option only checks the deadline for random file reads during point lookups. This PR extends the checks to file opens, prefetches and preloads as part of table open.
The main changes are in the ```BlockBasedTable```, partitioned index and filter readers, and ```TableCache``` to take ReadOptions as an additional parameter. In ```BlockBasedTable::Open```, in order to retain existing behavior w.r.t checksum verification and block cache usage, we filter out most of the options in ```ReadOptions``` except ```deadline```. However, having the ```ReadOptions``` gives us more flexibility to honor other options like verify_checksums, fill_cache etc. in the future.
Additional changes in callsites due to function signature changes in ```NewTableReader()``` and ```FilePrefetchBuffer```.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6982
Test Plan: Add new unit tests in db_basic_test
Reviewed By: riversand963
Differential Revision: D22219515
Pulled By: anand1976
fbshipit-source-id: 8a3b92f4a889808013838603aa3ca35229cd501b
Summary:
A parameter `verify_with_checksum` is added to `BackupEngine::VerifyBackup`, which is true by default. So now `BackupEngine::VerifyBackup` verifies backup files with checksum AND file size by default. When `verify_with_checksum` is false, `BackupEngine::VerifyBackup` only compares file sizes to verify backup files.
Also add a test for the case when corruption does not change the file size.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7014
Test Plan: Passed backupable_db_test
Reviewed By: zhichao-cao
Differential Revision: D22165590
Pulled By: gg814
fbshipit-source-id: 606a7450714e868bceb38598c89fd356c6004f4f