Summary:
Currently, application may pass a statistics object to db but later
wants to reduce stats tracking overhead by setting stats level to
kExceptHistogramOrTimers (the current lowest level). Tickers will still
be incremented, causing up to 1% CPU. We can add a new lowest stats
level `kExceptTickers` to disable ticker incrementing as well, thus
reducing CPU cycles spent on tickers.
Test Plan (devserver):
```
make check
make clean
DEBUG_LEVEL=0 make db_bench
./db_bench -perf_level=1 -stats_level=0 -statistics -benchmarks=fillseq,readrandom -duration=120
```
Measure CPU util (%) before and after change:
CPU util by rocksdb::RecordTick: 1.1 vs (<0.1)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7329
Reviewed By: pdillinger
Differential Revision: D23434014
Pulled By: riversand963
fbshipit-source-id: 72ff0f02a192ac476d4b0044b9f37fd4a22ff0d4
Summary:
This PR creates `rocksdb_open_column_families_with_ttl` which allows C API users to open a DBWithTLL with column families.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7314
Reviewed By: cheng-chang
Differential Revision: D23430287
Pulled By: ajkr
fbshipit-source-id: 307aa21d170d1402653263a91f6f832ef76afba0
Summary:
- Closes https://github.com/facebook/rocksdb/issues/6490
- Currently MERGEs are converted to PUTs at bottom or compaction has reached the beginning of the key, this can wrongly cover a PUT future base case.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7166
Test Plan:
- Automated: `make all check`
- Manual: With `allow_ingest_behind = true`, add Merge operations to a key then run compaction. Then run ingesting external files to make sure the base case is probably compacted with existing Merges.
Reviewed By: cheng-chang
Differential Revision: D23325425
Pulled By: ajkr
fbshipit-source-id: 3eb415eb7b381b5453e45245393566153b1abb68
Summary:
L0 score is based on size target and number of files. The size target
used is `max_bytes_for_level_base`. However, the base level's size can
dynamically expand in write burst mode. In fact, it can expand so much
that L0->Lbase becomes the highest fanout in target sizes. This doesn't
make sense from an efficiency perspective, so this PR bounds the
L0->Lbase fanout to the smoothed level multiplier. The L0 scoring based
on file count remains unchanged.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7325
Test Plan:
contrived benchmark that exhibits the problem:
```
$ TEST_TMPDIR=/data/users/andrewkr/ ./db_bench -benchmarks=filluniquerandom,readrandom -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -level0_file_num_compaction_trigger=4 -level_compaction_dynamic_level_bytes=true -compression_type=none -max_background_jobs=12 -rate_limiter_bytes_per_sec=104857600 -benchmark_write_rate_limit=10485760 -num=100000000
```
Results:
- "Burst W-Amp" is the write-amp near the end of the fillrandom benchmark
- "Total W-Amp" is the write-amp after readrandom has run a while and all levels no longer need compaction
Branch | Burst W-Amp | Total W-Amp | fillrandom (MB/s)
-- | -- | -- | --
master | 20.2 | 21.5 | 4.7
dynamic-l0-score | 12.6 | 14.1 | 7.2
Reviewed By: siying
Differential Revision: D23412935
Pulled By: ajkr
fbshipit-source-id: f91f2067188e432dd39deab02f1c56f195057a0e
Summary:
The patch adds a log message to `BlobFileBuilder` that is logged upon
generating a blob file, similarly to how we log the generation of table files
during flush and compaction. The log message contains the column family
name, job id, blob file number, and the number and total size of blobs in
the new file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7324
Test Plan: Ran `make check` and checked the actual log messages using a custom `db_bench`.
Reviewed By: riversand963
Differential Revision: D23402229
Pulled By: ltamasi
fbshipit-source-id: ca42beb4db284b783d1eb2651f321032a45d0c5f
Summary:
Add a unit test case to check memory usage when
max_write_buffer_size_to_maintain is set if flushed immutable memtables are
trimmed timely or not.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7311
Test Plan: Compared the results with before bug fix.
Reviewed By: ltamasi
Differential Revision: D23321702
Pulled By: akankshamahajan15
fbshipit-source-id: da04ee21137d641a07fd499a9e2749eb036fcb1e
Summary:
The patch adds a class called `BlobFileBuilder` that can be used to build
and cut blob files in background jobs (flushes/compactions). The class
enforces a value size threshold (`min_blob_size`; smaller blobs will be inlined
in the LSM tree itself), and supports specifying a blob file size limit (`blob_file_size`),
as well as compression (`blob_compression_type`) and checksums for blob files.
It also keeps track of the generated blob files and their associated `BlobFileAddition`
metadata, which can be applied as part of the background job's `VersionEdit`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7306
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D23298817
Pulled By: ltamasi
fbshipit-source-id: 38f35d81dab1ba81f15236240612ec173d7f21b5
Summary:
Replace FSRandomAccessFile pointer with FSRandomAccessFilePtr
object in RandomAccessFileReader.
This new object wraps FSRandomAccessFile pointer.
Objective: If tracing is enabled, FSRandomAccessFile Ptr returns
FSRandomAccessFileTracingWrapper pointer that includes all necessary
information in IORecord and calls underlying FileSystem and invokes
IOTracer to dump that record in a binary file. If tracing is disabled
then, underlying FileSystem pointer is returned directly.
FSRandomAccessFilePtr wrapper class is added to bypass the FSRandomAccessFileWrapper when
tracing is disabled.
Test Plan: make check -j64
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7192
Reviewed By: anand1976
Differential Revision: D23356867
Pulled By: akankshamahajan15
fbshipit-source-id: 48f31168166a17a7444b40be44a9a9d4a5c7182c
Summary:
This is a "real" fix for the issue worked around in https://github.com/facebook/rocksdb/issues/7294.
To get DB checksum info for live files, we now read the manifest file
that will become part of the checkpoint/backup. This requires a little
extra handling in taking a custom checkpoint, including only reading the
manifest file up to the size prescribed by the checkpoint.
This moves GetFileChecksumsFromManifest from backup code to
file_checksum_helper.{h,cc} and removes apparently unnecessary checking
related to column families.
Updated HISTORY.md and warned potential future users of
DB::GetLiveFilesChecksumInfo()
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7309
Test Plan: updated unit test, before and after
Reviewed By: ajkr
Differential Revision: D23311994
Pulled By: pdillinger
fbshipit-source-id: 741e30a2dc1830e8208f7648fcc8c5f000d4e2d5
Summary:
Right now all I/O failures under PartitionIndexReader::CacheDependencies() is swallowed. This doesn't impact correctness but we've made a decision that any I/O error in read path now should be returned to users for awareness. Return errors in those cases instead.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7297
Test Plan: Add a new unit test that ingest errors in this code path and see Get() fails. Only one I/O path is hit in PartitionIndexReader::CacheDependencies(). Several option changes are attempt but not able to got other pread paths triggered. Not sure whether other failure cases would be even possible. Would rely on continuous stress test to validate it.
Reviewed By: anand1976
Differential Revision: D23257950
fbshipit-source-id: 859dbc92fa239996e1bb378329344d3d54168c03
Summary:
DBBasicTest.CompactBetweenSnapshots can time-out in some slow-I/O hosts. Parameterize it so that single test runs shorter.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7301
Test Plan: Run the test and see see different runs are of different configerations in a hacky way.
Reviewed By: ltamasi
Differential Revision: D23277733
fbshipit-source-id: 1f717b4131322d175abf9e211131fe7e9b1ef758
Summary:
When SST file is created, application is able to know the file information through OnTableFileCreated callback in LogAndNotifyTableFileCreationFinished. Since file checksum information can be useful for application when the SST file is created, we add file_checksum and file_checksum_func_name information to TableFileCreationInfo, which will be passed through OnTableFileCreated.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7108
Test Plan: make check, listener_test.
Reviewed By: ajkr
Differential Revision: D22470240
Pulled By: zhichao-cao
fbshipit-source-id: 92c20344d9b986eadfe3480f3769bf4add0dbaae
Summary:
After releasing a snapshot, it checks whether it is suitable to trigger bottom compactions.
When disabling auto compactions, it may still schedule compaction when releasing a snapshot. Whereas no compaction job will be actually handled, so the state of LSM is not changed and compaction will be triggered again and again every time releasing a snapshot.
Too frequent compactions lead to high CPU usage and high db_mutex lock contention which affects foreground write duration finally.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7267
Test Plan:
- make check
- manual test
Reviewed By: akankshamahajan15
Differential Revision: D23252880
Pulled By: ajkr
fbshipit-source-id: 4431e071a35d9912a2a3592875db27bae521434b
Summary:
More tests now pass. When in doubt, I added a TODO comment to check what should happen with an ignored error.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7305
Reviewed By: akankshamahajan15
Differential Revision: D23301262
Pulled By: ajkr
fbshipit-source-id: 5f120edc7393560aefc0633250277bbc7e8de9e6
Summary:
Some ExternalSSTFileTest runs very long on some places. Disable fsync in some tests to speed them up.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7303
Test Plan: Run these tests.
Reviewed By: riversand963
Differential Revision: D23280261
fbshipit-source-id: 0dca862e462f9e6d807f393320a1f82aa5b87e59
Summary:
When a memtable is trimmed in MemTableListVersion, the memtable
is only added to delete list if it is
the last reference. However it is not the last reference as it is held
by the super version. But the super version would not be switched if the
delete list is empty. So the memtable is never destroyed and memory
usage increases beyond write_buffer_size +
max_write_buffer_size_to_maintain.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7296
Test Plan:
1. ./db_bench -benchmarks=randomtransaction
-optimistic_transaction_db=1 -statistics -stats_interval_seconds=1
-duration=90 -num=500000 --max_write_buffer_size_to_maintain=16000000
--transaction_set_snapshot
Reviewed By: ltamasi
Differential Revision: D23267395
Pulled By: akankshamahajan15
fbshipit-source-id: 3a8d437fe9f4015f851ff84c0e29528aa946b650
Summary:
This test uses database functionality and required more extensive work to get it to pass than the other tests. The DB functionality required for this test now passes the check.
When it was unclear what the proper behavior was for unchecked status codes, a TODO was added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7283
Reviewed By: akankshamahajan15
Differential Revision: D23251497
Pulled By: ajkr
fbshipit-source-id: 52b79629bdafa0a58de8ead1d1d66f141b331523
Summary:
There's a potential deadlock caused by MockTimeEnv time value get to a large number, which causes TimedWait() wait forever. The test misuses the microseconds as seconds, making it more likely to happen.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7277
Reviewed By: pdillinger
Differential Revision: D23183873
Pulled By: jay-zhuang
fbshipit-source-id: 6fc38ebd40b4125a99551204b271f91a27e70086
Summary:
Seems it's only causing assert failure during compaction pick, but in production code, the problematic compactions are excluded at a later step.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7281
Reviewed By: akankshamahajan15
Differential Revision: D23228000
Pulled By: jay-zhuang
fbshipit-source-id: 2e4055aeebe0f5a2b07e299e0a2d51a1ad2e216d
Summary:
This diff contains following changes:
1. Replace `FSSequentialFile` pointer with `FSSequentialFilePtr` object that wraps `FSSequentialFile` pointer in `SequenceFileReader`.
Objective: If tracing is enabled, `FSSequentialFilePtr` returns `FSSequentialFileTracingWrapper` pointer that includes all necessary information in `IORecord` and calls underlying FileSystem and invokes `IOTracer` to dump that record in a binary file. If tracing is disabled then, underlying `FileSystem` pointer is returned directly. `FSSequentialFilePtr` wrapper class is added to bypass the `FSSequentialFileTracingWrapper` when tracing is disabled.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7190
Test Plan:
make check -j64
COMPILE_WITH_TSAN=1 make check -j64
Reviewed By: anand1976
Differential Revision: D23059616
Pulled By: akankshamahajan15
fbshipit-source-id: 1564b94dd1297cd0fbfe2ed5c9cc3e20f7395301
Summary:
After https://github.com/facebook/rocksdb/pull/7036, we still see extra DBTest that can timeout when running 10 or 20 in parallel. Expand skip-fsync mode in whole DBTest. Still preserve other tests from doing this mode to be conservative.
This commit reinstates https://github.com/facebook/rocksdb/issues/7049, whose un-revert was lost in an automatic
infrastructure mis-merge.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7274
Test Plan: Run all existing files.
Reviewed By: pdillinger
Differential Revision: D23177444
fbshipit-source-id: 1f61690b2ac6333c3b2c87176fef6b2cba086b33
Summary:
The two features are naturally incompatible. WAL recycling expects the recovery to succeed upon encountering a corrupt record at the point where new data ends and recycled data remains at the tail. However, `WALRecoveryMode::kTolerateCorruptedTailRecords` must fail upon encountering any such corrupt record, as it cannot differentiate between this and a real corruption, which would cause committed updates to be truncated.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7271
Reviewed By: riversand963
Differential Revision: D23169923
Pulled By: ajkr
fbshipit-source-id: 2cf8a3bcd2c9a0ecb0055a84725047a10fd4db50
Summary:
Add `kEntryDeleteWithTimestamp` to `EntryType` which is a public API.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7195
Test Plan: make check
Reviewed By: ajkr
Differential Revision: D22914704
Pulled By: riversand963
fbshipit-source-id: 886f73c6b70c527cad1c8fc9fc8d3afe60e1ea39
Summary:
The patch makes sure that the functionality required for the new integrated
BlobDB implementation (most importantly, the classes related to reading and
writing blob files) is also built in LITE mode by removing the corresponding
`#ifndef`s.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7272
Test Plan: Ran `make check` in both regular and LITE mode.
Reviewed By: zhichao-cao
Differential Revision: D23173280
Pulled By: ltamasi
fbshipit-source-id: 1596bd1a76409a8a6d83d8f1dbfe08bfdea7ffe6
Summary:
There is potential data race related CompactRange() with level refitting. After the compaction step and refitting step, some automatic compaction could put data to the destination level and cause the DB to be corrupted. Fix the bug by checking the target level to be empty.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7261
Test Plan: Add a unit test, which would fail with "Corruption: L1 have overlapping ranges '666F6F' seq:6, type:1 vs. '626172' seq:2, type:1", and now it succeeds.
Reviewed By: ajkr
Differential Revision: D23142269
fbshipit-source-id: 28bc14d5ac934c192260b23a4ce3f10a95e3ee91
Summary:
Looks like somebody simply missed initializing a member variable. The column family ID, cf_id, is not set during OnCompactionBegin. But it is set properly in the next function for OnCompactionCompleted. Need this cf_id for tracking progress of a Stardog optimize since there may be multiple compactions required for a given column family.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6938
Reviewed By: siying
Differential Revision: D23153235
Pulled By: ajkr
fbshipit-source-id: 932938de3a4ebbc7ac89702f655583862587d251
Summary:
Have a global StatsDumpScheduler for all DB instance stats dumping, including `DumpStats()` and `PersistStats()`. Before this, there're 2 dedicate threads for every DB instance, one for DumpStats() one for PersistStats(), which could create lots of threads if there're hundreds DB instances.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7223
Reviewed By: riversand963
Differential Revision: D23056737
Pulled By: jay-zhuang
fbshipit-source-id: 0faa2311142a73433ebb3317361db7cbf43faeba
Summary:
If user-defined timestamp is enabled, current implementation can expose
newer data to queries even if an older sequence number is specified via
read_options.snapshot. This PR makes Get() respect sequence-number-based
snapshot.
Solution is simple. Besides using <ukey, ts, seq> to search the index for the key,
we also verify that the candidate result's seq is smaller than or equal to seq. This
requires passing a seq via `GetContext`, which results in the majority of code
change caused by this PR.
Also added a few unit tests to demonstrate standard visibility during point lookup
and range scan when timestamp and snapshot are both present.
Test plan (devserver):
```
make check
$./db_bench --benchmarks=fillseq,readrandom -cache_size=$[64*1024*1024]
```
Result
this PR: readrandom : 4.827 micros/op 207180 ops/sec; 22.9 MB/s (1000000 of 1000000 found)
master: readrandom : 4.936 micros/op 202610 ops/sec; 22.4 MB/s (1000000 of 1000000 found)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7227
Reviewed By: ltamasi
Differential Revision: D23015242
Pulled By: riversand963
fbshipit-source-id: ea7b85a728654553ba357d2e6a207b5e40f7376a
Summary:
Manual compaction with `CompactRangeOptions::change_levels` set could
refit to a level targeted by another manual compaction. If
force_consistency_checks were disabled, it could be possible for
overlapping files to be written at that target level.
This PR prevents the possibility by calling `DisableManualCompaction()`
prior to `ReFitLevel()`. It also improves the manual compaction disabling
mechanism to wait for pending manual compactions to complete before
returning, and support disabling from multiple threads.
Fixes https://github.com/facebook/rocksdb/issues/6432.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7250
Test Plan:
crash test command that repro'd the bug reliably:
```
$ TEST_TMPDIR=/dev/shm python tools/db_crashtest.py blackbox --simple -target_file_size_base=524288 -write_buffer_size=1048576 -clear_column_family_one_in=0 -reopen=0 -max_key=10000000 -column_families=1 -max_background_compactions=8 -compact_range_one_in=100000 -compression_type=none -compaction_style=1 -num_levels=5 -universal_min_merge_width=4 -universal_max_merge_width=8 -level0_file_num_compaction_trigger=12 -rate_limiter_bytes_per_sec=1048576000 -universal_max_size_amplification_percent=100 --duration=3600 --interval=60 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --enable_compaction_filter=0
```
Reviewed By: ltamasi
Differential Revision: D23090800
Pulled By: ajkr
fbshipit-source-id: afcbcd51b42ce76789fdb907d8b9ada790709c13
Summary:
Upgrade tool chain to the latest. It is done mostly manually as build_tools/build_detect_platform fails to update many of them.
Try to fix a new clang analyze warning with the new tool chain.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7251
Test Plan: "make all", "USE_CLANG=1 make all"
Reviewed By: riversand963
Differential Revision: D23091090
fbshipit-source-id: 732e5a30137837431438f85f36296406b641f975
Summary:
The patch cleans up and refactors `CompressBlock` and `CompressBlockInternal` a bit.
In particular, it does the following:
* It renames `CompressBlockInternal` to `CompressData` and moves it to `util/compression.h`,
where other general compression-related utilities are located. This will facilitate reuse in the
BlobDB write path.
* The signature of the method is changed so it now takes `compression_format_version`
(similarly to the compression library specific methods) instead of `format_version` (which is
specific to the block based table).
* `GetCompressionFormatForVersion` no longer takes `compression_type` as a parameter.
This parameter was only used in a (not entirely up-to-date) assertion; also, removing it
eliminates the need to ensure this precondition holds at all call sites.
* Does some minor cleanup in `CompressBlock`, for instance, it is now possible to pass
only one of `sampled_output_fast` and `sampled_output_slow`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7249
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D23087278
Pulled By: ltamasi
fbshipit-source-id: e6316e45baed8b4e7de7c1780c90501c2a3439b3
Summary:
As part of the IOTracing project, this PR
1. Caches "FileSystemPtr" object(wrapper class that returns file system pointer based on tracing enabled) instead of "FileSystem" pointer.
2. FileSystemPtr object is created using FileSystem pointer and IOTracer
pointer.
3. IOTracer shared_ptr is created in DBImpl and it is passed to different classes through constructor.
4. When tracing is enabled through DB::StartIOTrace, FileSystemPtr
returns FileSystemTracingWrapper pointer for tracing purpose and when
it is disabled underlying FileSystem pointer is returned.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7180
Test Plan:
make check -j64
COMPILE_WITH_TSAN=1 make check -j64
Reviewed By: anand1976
Differential Revision: D22987117
Pulled By: akankshamahajan15
fbshipit-source-id: 6073617e4c2d5bc363914f3a1f55ae3b0a58fbf1
Summary:
A new option `std::shared_ptr<FileChecksumGenFactory> backup_checksum_gen_factory` is added to `BackupableDBOptions`. This allows custom checksum functions to be used for creating, verifying, or restoring backups.
Tests are added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7085
Test Plan: Passed make check
Reviewed By: pdillinger
Differential Revision: D22390756
Pulled By: gg814
fbshipit-source-id: 3b7756ca444c2129844536b91c3ca09f53b6248f
Summary:
We have a number of tests hanging on MacOS and windows due to
mishandling of code for mock sleeps. In addition, the code was in
terrible shape because the same variable (addon_time_) would sometimes
refer to microseconds and sometimes to seconds. One test even assumed it
was nanoseconds but was written to pass anyway.
This has been cleaned up so that DB tests generally use a SpecialEnv
function to mock sleep, for either some number of microseconds or seconds
depending on the function called. But to call one of these, the test must first
call SetMockSleep (precondition enforced with assertion), which also turns
sleeps in RocksDB into mock sleeps. To also removes accounting for actual
clock time, call SetTimeElapseOnlySleepOnReopen, which implies
SetMockSleep (on DB re-open). This latter setting only works by applying
on DB re-open, otherwise havoc can ensue if Env goes back in time with
DB open.
More specifics:
Removed some unused test classes, and updated comments on the general
problem.
Fixed DBSSTTest.GetTotalSstFilesSize using a sync point callback instead
of mock time. For this we have the only modification to production code,
inserting a sync point callback in flush_job.cc, which is not a change to
production behavior.
Removed unnecessary resetting of mock times to 0 in many tests. RocksDB
deals in relative time. Any behaviors relying on absolute date/time are likely
a bug. (The above test DBSSTTest.GetTotalSstFilesSize was the only one
clearly injecting a specific absolute time for actual testing convenience.) Just
in case I misunderstood some test, I put this note in each replacement:
// NOTE: Presumed unnecessary and removed: resetting mock time in env
Strengthened some tests like MergeTestTime, MergeCompactionTimeTest, and
FilterCompactionTimeTest in db_test.cc
stats_history_test and blob_db_test are each their own beast, rather deeply
dependent on MockTimeEnv. Each gets its own variant of a work-around for
TimedWait in a mock time environment. (Reduces redundancy and
inconsistency in stats_history_test.)
Intended follow-up:
Remove TimedWait from the public API of InstrumentedCondVar, and only
make that accessible through Env by passing in an InstrumentedCondVar and
a deadline. Then the Env implementations mocking time can fix this problem
without using sync points. (Test infrastructure using sync points interferes
with individual tests' control over sync points.)
With that change, we can simplify/consolidate the scattered work-arounds.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7101
Test Plan: make check on Linux and MacOS
Reviewed By: zhichao-cao
Differential Revision: D23032815
Pulled By: pdillinger
fbshipit-source-id: 7f33967ada8b83011fb54e8279365c008bd6610b
Summary:
`VersionStorageInfo::AddFile` currently has a debug-mode consistency
check to make sure the newly added file does not overlap with the
previous one (for levels below L0). Considering that
`VersionBuilder::CheckConsistency` also performs similar checks (in
fact, those checks are more comprehensive and cover L0 as well), this
check is redundant. The patch removes it and also cleans up `AddFile` a
little.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7237
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D23041937
Pulled By: ltamasi
fbshipit-source-id: e00665f3b83bfd17f86c54c238800f3d77d739bd
Summary:
Delete triggered compaction (DTC) for universal compaction style with ```num_levels = 1``` has been disabled for sometime due to a data correctness bug. This PR re-enables it with a bug fix. A file marked for compaction can be picked, along with all L0 files after it as the compaction input. We stop adding files to the input once we encounter a file already being compacted (the original bug failed to check the compaction status of the files).
Tests:
Add unit tests to ```compaction_picker_test.cc```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7224
Reviewed By: ajkr
Differential Revision: D23031845
Pulled By: anand1976
fbshipit-source-id: 9de3cab5f9774cede666c2c48d309a7d9b88a505
Summary:
The debug is supposed to print out two keys to show the value mismatch, which was compared just a few lines above.
However, the actual print-out is the same values (so they obviously won't be mismatched)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6587
Reviewed By: riversand963
Differential Revision: D23025279
Pulled By: ajkr
fbshipit-source-id: 4c6c35bc60b273f13c08b5464b6f690d8a5cfe41
Summary:
Introduce io_timeout in ReadOptions and enabled deadline/io_timeout for
Iterators.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7161
Test Plan: New unit tests in db_basic_test
Reviewed By: riversand963
Differential Revision: D22687352
Pulled By: anand1976
fbshipit-source-id: 67bbb0e6d7ae80b256589244468494292538c6ec
Summary:
Pointed out by https://github.com/facebook/rocksdb/issues/7197 , there is a double lock in WriteImplWALOnly.
Also find another deadlock in UnorderedWriteMemtable. Move the check after switch_all_.notify_all().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7199
Test Plan: pass make check
Reviewed By: anand1976
Differential Revision: D22961714
Pulled By: zhichao-cao
fbshipit-source-id: 0707922dc50d28ea141a15a8cdcbd1c8993ea0d8
Summary:
A colon will be added after 'msg' automatically when invoke function Status(Code _code, const Slice& msg, const Slice& msg2),
it's not needed to append a colon explicitly to 'msg'.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7041
Reviewed By: ajkr
Differential Revision: D22292801
fbshipit-source-id: 8f2d69065bb779d2613468bf9fc9169f32c3f1ec
Summary:
`WalAddition`, `WalDeletion` are defined in `wal_version.h` and used in `VersionEdit`.
`WalAddition` is used to represent events of creating a new WAL (no size, just log number), or closing a WAL (with size).
`WalDeletion` is used to represent events of deleting or archiving a WAL, it means the WAL is no longer alive (won't be replayed during recovery).
`WalSet` is the set of alive WALs kept in `VersionSet`.
1. Why use `WalDeletion` instead of relying on `MinLogNumber` to identify outdated WALs
On recovery, we can compute `MinLogNumber()` based on the log numbers kept in MANIFEST, any log with number < MinLogNumber can be ignored. So it seems that we don't need to persist `WalDeletion` to MANIFEST, since we can ignore the WALs based on MinLogNumber.
But the `MinLogNumber()` is actually a lower bound, it does not exactly mean that logs starting from MinLogNumber must exist. This is because in a corner case, when a column family is empty and never flushed, its log number is set to the largest log number, but not persisted in MANIFEST. So let's say there are 2 column families, when creating the DB, the first WAL has log number 1, so it's persisted to MANIFEST for both column families. Then CF 0 is empty and never flushed, CF 1 is updated and flushed, so a new WAL with log number 2 is created and persisted to MANIFEST for CF 1. But CF 0's log number in MANIFEST is still 1. So on recovery, MinLogNumber is 1, but since log 1 only contains data for CF 1, and CF 1 is flushed, log 1 might have already been deleted from disk.
We can make `MinLogNumber()` be the exactly minimum log number that must exist, by persisting the most recent log number for empty column families that are not flushed. But if there are N such column families, then every time a new WAL is created, we need to add N records to MANIFEST.
In current design, a record is persisted to MANIFEST only when WAL is created, closed, or deleted/archived, so the number of WAL related records are bounded to 3x number of WALs.
2. Why keep `WalSet` in `VersionSet` instead of applying the `VersionEdit`s to `VersionStorageInfo`
`VersionEdit`s are originally designed to track the addition and deletion of SST files. The SST files are related to column families, each column family has a list of `Version`s, and each `Version` keeps the set of active SST files in `VersionStorageInfo`.
But WALs are a concept of DB, they are not bounded to specific column families. So logically it does not make sense to store WALs in a column family's `Version`s.
Also, `Version`'s purpose is to keep reference to SST / blob files, so that they are not deleted until there is no version referencing them. But a WAL is deleted regardless of version references.
So we keep the WALs in `VersionSet` for the purpose of writing out the DB state's snapshot when creating new MANIFESTs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7164
Test Plan:
make version_edit_test && ./version_edit_test
make wal_edit_test && ./wal_edit_test
Reviewed By: ltamasi
Differential Revision: D22677936
Pulled By: cheng-chang
fbshipit-source-id: 5a3b6890140e572ffd79eb37e6e4c3c32361a859
Summary:
IteratorIterator::IsOutOfBound() and IteratorIterator::MayBeOutOfUpperBound() are two functions that related to upper bound check. It is hard for users to reason about this complexity. Consolidate the two functions into one and assign an enum as results to improve readability.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7200
Test Plan: Run all existing test. Would run crash test with atomic for a while.
Reviewed By: anand1976
Differential Revision: D22833181
fbshipit-source-id: a0c724267056adbd0476bde74650e6c7226077e6
Summary:
as title.
When ReadOptions.iter_start_ts is not nullptr, DBIter::key() should
return internal keys including value type.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7178
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D22935879
Pulled By: riversand963
fbshipit-source-id: 7508d962cf11ebcfa6386d2529b4f3606b47ccfd