Summary:
This new options allows application to specify that files must be
ingested to bottommost level, otherwise the ingestion will fail instead
of silently ingesting to a non-bottommost level.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9849
Test Plan: make check
Reviewed By: ajkr
Differential Revision: D35680307
Pulled By: riversand963
fbshipit-source-id: 01cf54ef6c76198f7654dc06b5544631dea1be1e
Summary:
Add the ability to cancel remote compaction on the remote side by
setting `OpenAndCompactOptions.canceled` to true.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9725
Test Plan: added unittest
Reviewed By: ajkr
Differential Revision: D35018800
Pulled By: jay-zhuang
fbshipit-source-id: be3652f9645e0347df429e42a5614d5a9b3a1ec4
Summary:
So the user is able to set event listener on the compactor
side.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9821
Test Plan: unittest added
Reviewed By: ajkr
Differential Revision: D35485388
Pulled By: jay-zhuang
fbshipit-source-id: 669d8a3aaee012b75b940470306756c03ffa09b2
Summary:
In FilePrefetchBuffer if reads are sequential, after prefetching call ReadAsync API to prefetch data asynchronously so that in next prefetching data will be available. Data prefetched asynchronously will be readahead_size/2. It uses two buffers, one for synchronous prefetching and one for asynchronous. In case, the data is overlapping, the data is copied from both buffers to third buffer to make it continuous.
This feature is under ReadOptions::async_io and is under experimental.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9674
Test Plan:
1. Add new unit tests
2. Run **db_stress** to make sure nothing crashes.
- Normal prefetch without `async_io` ran successfully:
```
export CRASH_TEST_EXT_ARGS=" --async_io=0"
make crash_test -j
```
3. **Run Regressions**.
i) Main branch without any change for normal prefetching with async_io disabled:
```
./db_bench -db=/tmp/prefix_scan_prefetch_main -benchmarks="fillseq" -key_size=32 -value_size=512 -num=5000000 -
use_direct_io_for_flush_and_compaction=true -target_file_size_base=16777216
```
```
./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
RocksDB: version 7.0
Date: Thu Mar 17 13:11:34 2022
CPU: 24 * Intel Core Processor (Broadwell)
CPUCache: 16384 KB
Keys: 32 bytes each (+ 0 bytes user-defined timestamp)
Values: 512 bytes each (256 bytes after compression)
Entries: 5000000
Prefix: 0 bytes
Keys per prefix: 0
RawSize: 2594.0 MB (estimated)
FileSize: 1373.3 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
------------------------------------------------
DB path: [/tmp/prefix_scan_prefetch_main]
seekrandom : 483618.390 micros/op 2 ops/sec; 338.9 MB/s (249 of 249 found)
```
ii) normal prefetching after changes with async_io disable:
```
./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_withchange -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1
Initializing RocksDB Options from the specified file
Initializing RocksDB Options from command-line flags
RocksDB: version 7.0
Date: Thu Mar 17 14:11:31 2022
CPU: 24 * Intel Core Processor (Broadwell)
CPUCache: 16384 KB
Keys: 32 bytes each (+ 0 bytes user-defined timestamp)
Values: 512 bytes each (256 bytes after compression)
Entries: 5000000
Prefix: 0 bytes
Keys per prefix: 0
RawSize: 2594.0 MB (estimated)
FileSize: 1373.3 MB (estimated)
Write rate: 0 bytes/second
Read rate: 0 ops/second
Compression: Snappy
Compression sampling rate: 0
Memtablerep: SkipListFactory
Perf Level: 1
------------------------------------------------
DB path: [/tmp/prefix_scan_prefetch_withchange]
seekrandom : 471347.227 micros/op 2 ops/sec; 348.1 MB/s (255 of 255 found)
```
Reviewed By: anand1976
Differential Revision: D34731543
Pulled By: akankshamahajan15
fbshipit-source-id: 8e23aa93453d5fe3c672b9231ad582f60207937f
Summary:
`ColumnFamilyOptions::OldDefaults` and `DBOptions::OldDefaults`
now deprecated. Were previously overlooked with `Options::OldDefaults` in https://github.com/facebook/rocksdb/issues/9363
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9594
Test Plan: comments only
Reviewed By: jay-zhuang
Differential Revision: D34318592
Pulled By: pdillinger
fbshipit-source-id: 773c97a61e2a8290ae154f363dd61c1f35a9dd16
Summary:
When WAL compression is enabled, add a record (new record type) to store the compression type to indicate that all subsequent records are compressed. The log reader will store the compression type when this record is encountered and use the type to uncompress the subsequent records. Compress and uncompress to be implemented in subsequent diffs.
Enabled WAL compression in some WAL tests to check for regressions. Some tests that rely on offsets have been disabled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9556
Reviewed By: anand1976
Differential Revision: D34308216
Pulled By: sidroyc
fbshipit-source-id: 7f10595e46f3277f1ea2d309fbf95e2e935a8705
Summary:
Users can set the priority for file reads associated with their operation by setting `ReadOptions::rate_limiter_priority` to something other than `Env::IO_TOTAL`. Rate limiting `VerifyChecksum()` and `VerifyFileChecksums()` is the motivation for this PR, so it also includes benchmarks and minor bug fixes to get that working.
`RandomAccessFileReader::Read()` already had support for rate limiting compaction reads. I changed that rate limiting to be non-specific to compaction, but rather performed according to the passed in `Env::IOPriority`. Now the compaction read rate limiting is supported by setting `rate_limiter_priority = Env::IO_LOW` on its `ReadOptions`.
There is no default value for the new `Env::IOPriority` parameter to `RandomAccessFileReader::Read()`. That means this PR goes through all callers (in some cases multiple layers up the call stack) to find a `ReadOptions` to provide the priority. There are TODOs for cases I believe it would be good to let user control the priority some day (e.g., file footer reads), and no TODO in cases I believe it doesn't matter (e.g., trace file reads).
The API doc only lists the missing cases where a file read associated with a provided `ReadOptions` cannot be rate limited. For cases like file ingestion checksum calculation, there is no API to provide `ReadOptions` or `Env::IOPriority`, so I didn't count that as missing.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9424
Test Plan:
- new unit tests
- new benchmarks on ~50MB database with 1MB/s read rate limit and 100ms refill interval; verified with strace reads are chunked (at 0.1MB per chunk) and spaced roughly 100ms apart.
- setup command: `./db_bench -benchmarks=fillrandom,compact -db=/tmp/testdb -target_file_size_base=1048576 -disable_auto_compactions=true -file_checksum=true`
- benchmarks command: `strace -ttfe pread64 ./db_bench -benchmarks=verifychecksum,verifyfilechecksums -use_existing_db=true -db=/tmp/testdb -rate_limiter_bytes_per_sec=1048576 -rate_limit_bg_reads=1 -rate_limit_user_ops=true -file_checksum=true`
- crash test using IO_USER priority on non-validation reads with https://github.com/facebook/rocksdb/issues/9567 reverted: `python3 tools/db_crashtest.py blackbox --max_key=1000000 --write_buffer_size=524288 --target_file_size_base=524288 --level_compaction_dynamic_level_bytes=true --duration=3600 --rate_limit_bg_reads=true --rate_limit_user_ops=true --rate_limiter_bytes_per_sec=10485760 --interval=10`
Reviewed By: hx235
Differential Revision: D33747386
Pulled By: ajkr
fbshipit-source-id: a2d985e97912fba8c54763798e04f006ccc56e0c
Summary:
Remove deprecated remote compaction APIs
`CompactionService::Start()` and `CompactionService::WaitForComplete()`.
Please use `CompactionService::StartV2()`,
`CompactionService::WaitForCompleteV2()` instead, which provides the
same information plus extra data like priority, db_id, etc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9570
Test Plan: CI
Reviewed By: riversand963
Differential Revision: D34255969
Pulled By: jay-zhuang
fbshipit-source-id: c6376eccdd1123f1c42ab53771b5f65f8160c325
Summary:
In RocksDB option new_table_reader_for_compaction_inputs has
not effect on Compaction or on the behavior of RocksDB library.
Therefore, we are removing it in the upcoming 7.0 release.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9443
Test Plan: CircleCI
Reviewed By: ajkr
Differential Revision: D33788508
Pulled By: akankshamahajan15
fbshipit-source-id: 324ca6f12bfd019e9bd5e1b0cdac39be5c3cec7d
Summary:
I feel it would be nice if we can fix this spelling error.
In `SizeApproximationOptions`, the `include_memtabtles` should be `include_memtables`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9490
Test Plan: make check
Reviewed By: hx235
Differential Revision: D33949862
Pulled By: riversand963
fbshipit-source-id: b2be67501b65d4aabb6b8df1bf25eb8d54cc1466
Summary:
ajkr reminded me that we have a rule of not including per-kv related data in `WriteOptions`.
Namely, `WriteOptions` should not include information about "what-to-write", but should just
include information about "how-to-write".
According to this rule, `WriteOptions::timestamp` (experimental) is clearly a violation. Therefore,
this PR removes `WriteOptions::timestamp` for compliance.
After the removal, we need to pass timestamp info via another set of APIs. This PR proposes a set
of overloaded functions `Put(write_opts, key, value, ts)`, `Delete(write_opts, key, ts)`, and
`SingleDelete(write_opts, key, ts)`. Planned to add `Write(write_opts, batch, ts)`, but its complexity
made me reconsider doing it in another PR (maybe).
For better checking and returning error early, we also add a new set of APIs to `WriteBatch` that take
extra `timestamp` information when writing to `WriteBatch`es.
These set of APIs in `WriteBatchWithIndex` are currently not supported, and are on our TODO list.
Removed `WriteBatch::AssignTimestamps()` and renamed `WriteBatch::AssignTimestamp()` to
`WriteBatch::UpdateTimestamps()` since this method require that all keys have space for timestamps
allocated already and multiple timestamps can be updated.
The constructor of `WriteBatch` now takes a fourth argument `default_cf_ts_sz` which is the timestamp
size of the default column family. This will be used to allocate space when calling APIs that do not
specify a column family handle.
Also, updated `DB::Get()`, `DB::MultiGet()`, `DB::NewIterator()`, `DB::NewIterators()` methods, replacing
some assertions about timestamp to returning Status code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8946
Test Plan:
make check
./db_bench -benchmarks=fillseq,fillrandom,readrandom,readseq,deleterandom -user_timestamp_size=8
./db_stress --user_timestamp_size=8 -nooverwritepercent=0 -test_secondary=0 -secondary_catch_up_one_in=0 -continuous_verification_interval=0
Make sure there is no perf regression by running the following
```
./db_bench_opt -db=/dev/shm/rocksdb -use_existing_db=0 -level0_stop_writes_trigger=256 -level0_slowdown_writes_trigger=256 -level0_file_num_compaction_trigger=256 -disable_wal=1 -duration=10 -benchmarks=fillrandom
```
Before this PR
```
DB path: [/dev/shm/rocksdb]
fillrandom : 1.831 micros/op 546235 ops/sec; 60.4 MB/s
```
After this PR
```
DB path: [/dev/shm/rocksdb]
fillrandom : 1.820 micros/op 549404 ops/sec; 60.8 MB/s
```
Reviewed By: ltamasi
Differential Revision: D33721359
Pulled By: riversand963
fbshipit-source-id: c131561534272c120ffb80711d42748d21badf09
Summary:
Apparently setting total_order_seek=true for DB::Get was
intended to allow accurate read semantics if the current prefix
extractor doesn't match what was used to generate SST files on
disk. But since prefix_extractor was made a mutable option in 5.14.0, we
have been able to detect this case and provide the correct semantics
regardless of the total_order_seek option. Since that time, the option
has only made Get() slower in a reasonably common case: prefix_extractor
unchanged and whole_key_filtering=false.
So this change primarily removes unnecessary effect of
total_order_seek on Get. Also cleans up some related comments.
Also adds a -total_order_seek option to db_bench and canonicalizes
handling of ReadOptions in db_bench so that command line options have
the expected association with library features. (There is potential
for change in regression test behavior, but the old behavior is likely
indefensible, or some other inconsistency would need to be fixed.)
TODO in follow-up work: there should be no reason for Get() to depend on
current prefix extractor at all.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9427
Test Plan:
Unit tests updated.
Performance (using db_bench update)
Create DB with `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12 -whole_key_filtering=0`
Test with and without `-total_order_seek` on `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -use_existing_db -readonly -benchmarks=readrandom -num=10000000 -duration=40 -disable_wal=1 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12`
Before this change, total_order_seek=false: 25188 ops/sec
Before this change, total_order_seek=true: 1222 ops/sec (~20x slower)
After this change, total_order_seek=false: 24570 ops/sec
After this change, total_order_seek=true: 25012 ops/sec (indistinguishable)
Reviewed By: siying
Differential Revision: D33753458
Pulled By: pdillinger
fbshipit-source-id: bf892f34907a5e407d9c40bd4d42f0adbcbe0014
Summary:
In RocksDB DBOptions::skip_log_error_on_recovery is marked as
"NOT SUPPORTED" for a long time, and setting this option does not have
any effect on the behavior of RocksDB library. Therefore, we are removing it
in the upcoming 7.0 release.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9434
Test Plan: CircleCI
Reviewed By: ajkr
Differential Revision: D33763015
Pulled By: akankshamahajan15
fbshipit-source-id: 11f09643298da6c02d3dcdb090b996f4c3cfdd76
Summary:
The API is deprecated long time ago. Clean up the codebase by
removing it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9462
Test Plan: CI, fake release: D33835220
Reviewed By: riversand963
Differential Revision: D33835103
Pulled By: jay-zhuang
fbshipit-source-id: 6d2dc12c8e7fdbe2700865a3e61f0e3f78bd8184
Summary:
In RocksDB, this option was already marked as "NOT SUPPORTED" for a long time, and setting this option does not have any effect on the behavior of RocksDB library. Therefore, we are removing it in the preparations of the upcoming 7.0 release.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9450
Reviewed By: ajkr
Differential Revision: D33802466
Pulled By: bjlemaire
fbshipit-source-id: 97570985f1400525304053476450f7ef504c0cd5
Summary:
Add an option to set the WAL compression algorithm - wal_compression.
TODO: WAL compression is not implemented and will only support zstd initially. Will be added in subsequent diffs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9432
Reviewed By: pdillinger
Differential Revision: D33797275
Pulled By: sidroyc
fbshipit-source-id: 8db81d9c9cea5e2e4f1445d3aecad8106137b8e7
Summary:
Loose ends relate to mmap on 32-bit systems. (Testing is more
complicated when the feature was completely disabled on 32-bit.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9386
Test Plan: CI
Reviewed By: ajkr
Differential Revision: D33590715
Pulled By: pdillinger
fbshipit-source-id: f2637036a538a552200adee65b6765fce8cae27b
Summary:
In response to https://github.com/facebook/rocksdb/issues/9354, this PR adds a way for users to "opt out"
of extra checks that can impact peak write performance, which
currently only includes force_consistency_checks. I considered including
some other options but did not see a db_bench performance difference.
Also clarify in comment for force_consistency_checks that it can "slow
down saturated writing."
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9363
Test Plan:
basic coverage in unit tests
Using my perf test in https://github.com/facebook/rocksdb/issues/9354 comment, I see
force_consistency_checks=true -> 725360 ops/s
force_consistency_checks=false -> 783072 ops/s
Reviewed By: mrambacher
Differential Revision: D33636559
Pulled By: pdillinger
fbshipit-source-id: 25bfd006f4844675e7669b342817dd4c6a641e84
Summary:
This option causes trace records to be written in the serialized write thread. That way, the write records in the trace must follow the same order as writes that are logged to WAL and writes that are applied to the DB.
By default I left it disabled to match existing behavior. I enabled it in `db_stress`, though, as that use case requires order of write records in trace matches the order in WAL.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9334
Test Plan:
- See if below unsynced data loss crash test can run for 24h straight. It used to crash after a few hours when reaching an unlucky trace ordering.
```
DEBUG_LEVEL=0 TEST_TMPDIR=/dev/shm /usr/local/bin/python3 -u tools/db_crashtest.py blackbox --interval=10 --max_key=100000 --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152 --value_size_mult=33 --sync_fault_injection=1 --test_batches_snapshots=0 --duration=86400
```
Reviewed By: zhichao-cao
Differential Revision: D33301990
Pulled By: ajkr
fbshipit-source-id: 82d97559727adb4462a7af69758449c8725b22d3
Summary:
`ReadOptions::iter_start_seqnum` and `DBOptions::preserve_deletes` are
deprecated, please try using user defined timestamp feature instead.
The feature is used to support differential snapshots, but not well
maintained (https://github.com/facebook/rocksdb/issues/6837, https://github.com/facebook/rocksdb/issues/8472) and the interface is not user friendly which
returns an internal key from the iterator. The user defined timestamp
feature is a more flexible feature to support similar usecase, please
switch to that if you have such usecase.
The deprecated feature will be removed in a future release.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9091
Test Plan:
check LOG
Fix https://github.com/facebook/rocksdb/issues/9090
Reviewed By: ajkr
Differential Revision: D32071750
Pulled By: jay-zhuang
fbshipit-source-id: b882c4668dd1bf26ce03c4c192f1bba584bf6104
Summary:
- Fixed bug where bottom-pri manual compactions were counting towards `bg_compaction_scheduled_` instead of `bg_bottom_compaction_scheduled_`. It seems to have no negative effect.
- Fixed bug where automatic compaction scheduling did not consider `bg_bottom_compaction_scheduled_`. Now automatic compactions cannot be scheduled that exceed the per-DB compaction concurrency limit (`max_compactions`) when some existing compactions are bottommost.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9179
Test Plan: new unit test for manual/automatic. Also verified the existing automatic/automatic test ("ConcurrentBottomPriLowPriCompactions") hanged until changing it to explicitly enable concurrency.
Reviewed By: riversand963
Differential Revision: D32488048
Pulled By: ajkr
fbshipit-source-id: 20c4c0693678e81e43f85ed3cc3402fcf26e3310
Summary:
RocksDB does auto-readahead for iterators on noticing more than two sequential reads for a table file if user doesn't provide readahead_size. The readahead starts at 8KB and doubles on every additional read up to max_auto_readahead_size. However at each level, if iterator moves over next file, readahead_size starts again from 8KB.
This PR introduces a new ReadOption "adaptive_readahead" which when set true will maintain readahead_size at each level. So when iterator moves from one file to another, new file's readahead_size will continue from previous file's readahead_size instead of scratch. However if reads are not sequential it will fall back to 8KB (default) with no prefetching for that block.
1. If block is found in cache but it was eligible for prefetch (block wasn't in Rocksdb's prefetch buffer), readahead_size will decrease by 8KB.
2. It maintains readahead_size for L1 - Ln levels.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9056
Test Plan:
Added new unit tests
Ran db_bench for "readseq, seekrandom, seekrandomwhilewriting, readrandom" with --adaptive_readahead=true and there was no regression if new feature is enabled.
Reviewed By: anand1976
Differential Revision: D31773640
Pulled By: akankshamahajan15
fbshipit-source-id: 7332d16258b846ae5cea773009195a5af58f8f98
Summary:
Add clarification/extension to comments on max_total_wal_size and the Java wrapper MaxTotalWalSize to better explain the effect of the option on log file sizes.
Closes https://github.com/facebook/rocksdb/issues/5789
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9108
Reviewed By: pdillinger
Differential Revision: D32066640
Pulled By: mrambacher
fbshipit-source-id: 7d5affc87e4119019054af9c884a2ea01d68f5b7
Summary:
* Clarify that RocksDB is not exception safe on many of our callback
and extension interfaces
* Clarify FSRandomAccessFile::MultiRead implementations must accept
non-sorted inputs (see https://github.com/facebook/rocksdb/issues/8953)
* Clarify ConcurrentTaskLimiter and SstFileManager are not (currently)
extensible interfaces
* Mark WriteBufferManager as `final`, so it is then clearly not a
callback interface, even though it smells like one
* Clarify TablePropertiesCollector Status returns are mostly ignored
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9080
Test Plan: comments only (except WriteBufferManager final)
Reviewed By: ajkr
Differential Revision: D31968782
Pulled By: pdillinger
fbshipit-source-id: 11b648ce3ce3c5e5bdc02d2eafc7ea4b864bd1d2
Summary:
Currently, if Secondary Cache is provided to the lru cache, it is used by default. We add CacheTier to advanced_options.h to describe the cache tier we used. Add a `lowest_used_cache_tier` option to `DBOptions` (immutable) and pass it to BlockBasedTableReader to decide if secondary cache will be used or not. By default it is `CacheTier::kNonVolatileTier`, which means, we always use both block cache (kVolatileTier) and secondary cache (kNonVolatileTier). By set it to `CacheTier::kVolatileTier`, the DB will not use the secondary cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9050
Test Plan: added new tests
Reviewed By: anand1976
Differential Revision: D31744769
Pulled By: zhichao-cao
fbshipit-source-id: a0575ebd23e1c6dfcfc2b4c8578764e73b15bce6
Summary:
New classes FileStorageInfo and LiveFileStorageInfo and
'experimental' function DB::GetLiveFilesStorageInfo, which is intended
to largely replace several fragmented DB functions needed to create
checkpoints and backups.
This function is now used to create checkpoints and backups, because
it fixes many (probably not all) of the prior complexities of checkpoint
not having atomic access to DB metadata. This also ensures strong
functional test coverage of the new API. Specifically, much of the old
CheckpointImpl::CreateCustomCheckpoint has been migrated to and
updated in DBImpl::GetLiveFilesStorageInfo, with the former now
calling the latter.
Also, the class FileStorageInfo in metadata.h compatibly replaces
BackupFileInfo and serves as a new base class for SstFileMetaData.
Some old fields of SstFileMetaData are still provided (for now) but
deprecated.
Although FileStorageInfo::directory is accurate when using db_paths
and/or cf_paths, these have never been supported by Checkpoint
nor BackupEngine and still are not. This change does now detect
these cases and return NotSupported when appropriate. (More work
needed for support.)
Somehow this change broke ProgressCallbackDuringBackup, but
the progress_callback logic was dubious to begin with because it
would call the callback based on copy buffer size, not size actually
copied. Logic and test updated to track size actually copied
per-thread.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8968
Test Plan:
tests updated.
DB::GetLiveFilesStorageInfo mostly tested by use in CheckpointImpl.
DBTest.SnapshotFiles updated to also test GetLiveFilesStorageInfo,
including reading the data after DB close.
Added CheckpointTest.CheckpointWithDbPath (NotSupported).
Reviewed By: siying
Differential Revision: D31242045
Pulled By: pdillinger
fbshipit-source-id: b183d1ce9799e220daaefd6b3b5365d98de676c0
Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/8991
Test Plan: the new test hangs forever without this fix and passes with this fix.
Reviewed By: hx235
Differential Revision: D31456419
Pulled By: ajkr
fbshipit-source-id: a82c0e5560b6e6153089dccd8e46163c61b07bff
Summary:
- Update few stale GitHub wiki link references from rocksdb.org
- Update the API comments for ignore_range_deletions
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8983
Reviewed By: ajkr
Differential Revision: D31355965
Pulled By: ramvadiv
fbshipit-source-id: 245ac4a6913976dd82afa308bc4aae6bff3d788c
Summary:
Add support for fallback to local compaction, the user can
return `CompactionServiceJobStatus::kUseLocal` to instruct RocksDB to
run the compaction locally instead of waiting for the remote compaction
result.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8709
Test Plan: unittest
Reviewed By: ajkr
Differential Revision: D30560163
Pulled By: jay-zhuang
fbshipit-source-id: 65d8905a4a1bc185a68daa120997f21d3198dbe1
Summary:
Add compaction priority information in RemoteCompaction, which
can be used to schedule high priority job first.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8707
Test Plan: unittest
Reviewed By: ajkr
Differential Revision: D30548401
Pulled By: jay-zhuang
fbshipit-source-id: b30446511fb31b4583c49edd8565d496cf013a34
Summary:
Currently, we only provide job_id in RemoteCompaction APIs, the
main problem of `job_id` is it cannot uniquely identify a compaction job
between DB instances or between sessions.
Providing DB and session id to the user, which will make building cross
DB compaction service easier.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8680
Test Plan: unittest
Reviewed By: ajkr
Differential Revision: D30444859
Pulled By: jay-zhuang
fbshipit-source-id: fdf107f4286564049637f154193c6d94c3c59448
Summary:
Add comment for `options.allow_fallocate` that btrfs
preallocated space are not freed and a suggestion to disable
preallocation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8646
Test Plan: No code change
Reviewed By: ajkr
Differential Revision: D30240050
Pulled By: jay-zhuang
fbshipit-source-id: 75b7190bc8276ce8d8ac2d0cb9064b386cbf4768
Summary:
Changes the API of the MemPurge process: the `bool experimental_allow_mempurge` and `experimental_mempurge_policy` flags have been replaced by a `double experimental_mempurge_threshold` option.
This change of API reflects another major change introduced in this PR: the MemPurgeDecider() function now works by sampling the memtables being flushed to estimate the overall amount of useful payload (payload minus the garbage), and then compare this useful payload estimate with the `double experimental_mempurge_threshold` value.
Therefore, when the value of this flag is `0.0` (default value), mempurge is simply deactivated. On the other hand, a value of `DBL_MAX` would be equivalent to always going through a mempurge regardless of the garbage ratio estimate.
At the moment, a `double experimental_mempurge_threshold` value else than 0.0 or `DBL_MAX` is opnly supported`with the `SkipList` memtable representation.
Regarding the sampling, this PR includes the introduction of a `MemTable::UniqueRandomSample` function that collects (approximately) random entries from the memtable by using the new `SkipList::Iterator::RandomSeek()` under the hood, or by iterating through each memtable entry, depending on the target sample size and the total number of entries.
The unit tests have been readapted to support this new API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8628
Reviewed By: pdillinger
Differential Revision: D30149315
Pulled By: bjlemaire
fbshipit-source-id: 1feef5390c95db6f4480ab4434716533d3947f27
Summary:
Add `experimental_mempurge_policy` option flag and introduce two new `MemPurge` (Memtable Garbage Collection) policies: 'ALWAYS' and 'ALTERNATE'. Default value: ALTERNATE.
`ALWAYS`: every flush will first go through a `MemPurge` process. If the output is too big to fit into a single memtable, then the mempurge is aborted and a regular flush process carries on. `ALWAYS` is designed for user that need to reduce the number of L0 SST file created to a strict minimum, and can afford a small dent in performance (possibly hits to CPU usage, read efficiency, and maximum burst write throughput).
`ALTERNATE`: a flush is transformed into a `MemPurge` except if one of the memtables being flushed is the product of a previous `MemPurge`. `ALTERNATE` is a good tradeoff between reduction in number of L0 SST files created and performance. `ALTERNATE` perform particularly well for completely random garbage ratios, or garbage ratios anywhere in (0%,50%], and even higher when there is a wild variability in garbage ratios.
This PR also includes support for `experimental_mempurge_policy` in `db_bench`.
Testing was done locally by replacing all the `MemPurge` policies of the unit tests with `ALTERNATE`, as well as local testing with `db_crashtest.py` `whitebox` and `blackbox`. Overall, if an `ALWAYS` mempurge policy passes the tests, there is no reasons why an `ALTERNATE` policy would fail, and therefore the mempurge policy was set to `ALWAYS` for all mempurge unit tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8583
Reviewed By: pdillinger
Differential Revision: D29888050
Pulled By: bjlemaire
fbshipit-source-id: e2cf26646d66679f6f5fb29842624615610759c1
Summary:
There is an extra " in options.h (`"index block""`)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8550
Test Plan: None
Reviewed By: jay-zhuang
Differential Revision: D29746077
Pulled By: autopear
fbshipit-source-id: 2e5117296e5414b7c7440d990926bc1e567a0b4f
Summary:
Implement an experimental feature called "MemPurge", which consists in purging "garbage" bytes out of a memtable and reuse the memtable struct instead of making it immutable and eventually flushing its content to storage.
The prototype is by default deactivated and is not intended for use. It is intended for correctness and validation testing. At the moment, the "MemPurge" feature can be switched on by using the `options.experimental_allow_mempurge` flag. For this early stage, when the allow_mempurge flag is set to `true`, all the flush operations will be rerouted to perform a MemPurge. This is a temporary design decision that will give us the time to explore meaningful heuristics to use MemPurge at the right time for relevant workloads . Moreover, the current MemPurge operation only supports `Puts`, `Deletes`, `DeleteRange` operations, and handles `Iterators` as well as `CompactionFilter`s that are invoked at flush time .
Three unit tests are added to `db_flush_test.cc` to test if MemPurge works correctly (and checks that the previously mentioned operations are fully supported thoroughly tested).
One noticeable design decision is the timing of the MemPurge operation in the memtable workflow: for this prototype, the mempurge happens when the memtable is switched (and usually made immutable). This is an inefficient process because it implies that the entirety of the MemPurge operation happens while holding the db_mutex. Future commits will make the MemPurge operation a background task (akin to the regular flush operation) and aim at drastically enhancing the performance of this operation. The MemPurge is also not fully "WAL-compatible" yet, but when the WAL is full, or when the regular MemPurge operation fails (or when the purged memtable still needs to be flushed), a regular flush operation takes place. Later commits will also correct these behaviors.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8454
Reviewed By: anand1976
Differential Revision: D29433971
Pulled By: bjlemaire
fbshipit-source-id: 6af48213554e35048a7e03816955100a80a26dc5
Summary:
Change the job_id for remote compaction interface, which will include
both internal compaction job_id, also a sub_compaction_job_id. It is not
a backward compatible change. The user needs to update interface during
upgrade. (We will avoid backward incompatible change after the feature is
not experimental.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8364
Reviewed By: ajkr
Differential Revision: D28917301
Pulled By: jay-zhuang
fbshipit-source-id: 6d72a21f652bb517ad6954d0387b496797fc4e11
Summary:
(1)Make CompactionService derived from Customizable by defining two extra functions that are needed, as described in customizable.h comment section
(2)Revise the MyTestCompactionService class in compaction_service_test.cc to satisfy the class inheritance requirement
(3)Specify namespace of ToString() in compaction_service_test.cc to avoid function collision with CompactionService's ancestor classes
Test did:
make -j24 compaction_service_test
./compaction_service_test
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8395
Reviewed By: jay-zhuang
Differential Revision: D29076068
Pulled By: hx235
fbshipit-source-id: c130100fa466939b3137e917f5fdc4b2ae8e37d4
Summary:
Added the ability to cancel an in-progress range compaction by storing to an atomic "canceled" variable pointed to within the CompactRangeOptions structure.
Tested via two tests added to db_tests2.cc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8351
Reviewed By: ajkr
Differential Revision: D28808894
Pulled By: ddevec
fbshipit-source-id: cb321361c9e23b084b188bb203f11c375a22c2dd
Summary:
When a memtable is flushed, it will validate number of entries it reads, and compare the number with how many entries inserted into memtable. This serves as one sanity c\
heck against memory corruption. This change will also allow more counters to be added in the future for better validation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8288
Test Plan: Pass all existing tests
Reviewed By: ajkr
Differential Revision: D28369194
fbshipit-source-id: 7ff870380c41eab7f99eee508550dcdce32838ad