Summary:
The goal of this change is to allow changes to the "current" (in
FileSystem) file temperatures to feed back into DB metadata, so that
they can inform decisions and stats reporting. In part because of
modular code factoring, it doesn't seem easy to do this automagically,
where opening an SST file and observing current Temperature different
from expected would trigger a change in metadata and DB manifest write
(essentially giving the deep read path access to the write path). It is also
difficult to do this while the DB is open because of the limitations of
LogAndApply.
This change allows updating file temperature metadata on a closed DB
using an experimental utility function UpdateManifestForFilesState()
or `ldb update_manifest --update_temperatures`. This should suffice for
"migration" scenarios where outside tooling has placed or re-arranged DB
files into a (different) tiered configuration without going through
RocksDB itself (currently, only compaction can change temperature
metadata).
Some details:
* Refactored and added unit test for `ldb unsafe_remove_sst_file` because
of shared functionality
* Pulled in autovector.h changes from https://github.com/facebook/rocksdb/issues/9546 to fix SuperVersionContext
move constructor (related to an older draft of this change)
Possible follow-up work:
* Support updating manifest with file checksums, such as when a
new checksum function is used and want existing DB metadata updated
for it.
* It's possible that for some repair scenarios, lighter weight than
full repair, we might want to support UpdateManifestForFilesState() to
modify critical file details like size or checksum using same
algorithm. But let's make sure these are differentiated from modifying
file details in ways that don't suspect corruption (or require extreme
trust).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9683
Test Plan: unit tests added
Reviewed By: jay-zhuang
Differential Revision: D34798828
Pulled By: pdillinger
fbshipit-source-id: cfd83e8fb10761d8c9e7f9c020d68c9106a95554
Summary:
The primary goal of this change is to add support for backing up and
restoring (applying on restore) file temperature metadata, without
committing to either the DB manifest or the FS reported "current"
temperatures being exclusive "source of truth".
To achieve this goal, we need to add temperature information to backup
metadata, which requires updated backup meta schema. Fortunately I
prepared for this in https://github.com/facebook/rocksdb/issues/8069, which began forward compatibility in version
6.19.0 for this kind of schema update. (Previously, backup meta schema
was not extensible! Making this schema update public will allow some
other "nice to have" features like taking backups with hard links, and
avoiding crc32c checksum computation when another checksum is already
available.) While schema version 2 is newly public, the default schema
version is still 1. Until we change the default, users will need to set
to 2 to enable features like temperature data backup+restore. New
metadata like temperature information will be ignored with a warning
in versions before this change and since 6.19.0. The metadata is
considered ignorable because a functioning DB can be restored without
it.
Some detail:
* Some renaming because "future schema" is now just public schema 2.
* Initialize some atomics in TestFs (linter reported)
* Add temperature hint support to SstFileDumper (used by BackupEngine)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9660
Test Plan:
related unit test majorly updated for the new functionality,
including some shared testing support for tracking temperatures in a FS.
Some other tests and testing hooks into production code also updated for
making the backup meta schema change public.
Reviewed By: ajkr
Differential Revision: D34686968
Pulled By: pdillinger
fbshipit-source-id: 3ac1fa3e67ee97ca8a5103d79cc87d872c1d862a
Summary:
In preparation for more support for file Temperatures in BackupEngine,
this change does some test refactoring:
* Move DBTest2::BackupFileTemperature test to
BackupEngineTest::FileTemperatures, with some updates to make it work
in the new home. This test will soon be expanded for deeper backup work.
* Move FileTemperatureTestFS from db_test2.cc to db_test_util.h, to
support sharing because of above moved test, but split off the "no link"
part to the test needing it.
* Use custom FileSystems in backupable_db_test rather than custom Envs,
because going through Env file interfaces doesn't support temperatures.
* Fix RemapFileSystem to map DirFsyncOptions::renamed_new_name
parameter to FsyncWithDirOptions, which was required because this
limitation caused a crash only after moving to higher fidelity of
FileSystem interface (vs. LegacyDirectoryWrapper throwing away some
parameter details)
* `backupable_options_` -> `engine_options_` as part of the ongoing
work to get rid of the obsolete "backupable" naming.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9655
Test Plan: test code updates only
Reviewed By: jay-zhuang
Differential Revision: D34622183
Pulled By: pdillinger
fbshipit-source-id: f24b7a596a89b9e089e960f4e5d772575513e93f
Summary:
We often see flaky tests due to `DB::Flush()` or `DBImpl::TEST_WaitForFlushMemTable()` not waiting until event listeners complete. For example, https://github.com/facebook/rocksdb/issues/9084, https://github.com/facebook/rocksdb/issues/9400, https://github.com/facebook/rocksdb/issues/9528, plus two new ones this week: "EventListenerTest.OnSingleDBFlushTest" and "DBFlushTest.FireOnFlushCompletedAfterCommittedResult". I ran a `make check` with the below race condition-coercing patch and fixed issues it found besides old BlobDB.
```
diff --git a/db/db_impl/db_impl_compaction_flush.cc b/db/db_impl/db_impl_compaction_flush.cc
index 0e1864788..aaba68c4a 100644
--- a/db/db_impl/db_impl_compaction_flush.cc
+++ b/db/db_impl/db_impl_compaction_flush.cc
@@ -861,6 +861,8 @@ void DBImpl::NotifyOnFlushCompleted(
mutable_cf_options.level0_stop_writes_trigger);
// release lock while notifying events
mutex_.Unlock();
+ bg_cv_.SignalAll();
+ sleep(1);
{
for (auto& info : *flush_jobs_info) {
info->triggered_writes_slowdown = triggered_writes_slowdown;
```
The reason I did not fix old BlobDB issues is because it appears to have a fundamental (non-test) issue. In particular, it uses an EventListener to keep track of the files. OnFlushCompleted() could be delayed until even after a compaction involving that flushed file completes, causing the compaction to unexpectedly delete an untracked file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9617
Test Plan: `make check` including the race condition coercing patch
Reviewed By: hx235
Differential Revision: D34384022
Pulled By: ajkr
fbshipit-source-id: 2652ded39b415277c5d6a628414345223930514e
Summary:
Valgrind was failing with the below error because we forgot to destroy
the `BackupEngine` object:
```
==421173== Command: ./db_test2 --gtest_filter=DBTest2.BackupFileTemperature
==421173==
Note: Google Test filter = DBTest2.BackupFileTemperature
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from DBTest2
[ RUN ] DBTest2.BackupFileTemperature
--421173-- WARNING: unhandled amd64-linux syscall: 425
--421173-- You may be able to write your own handler.
--421173-- Read the file README_MISSING_SYSCALL_OR_IOCTL.
--421173-- Nevertheless we consider this a bug. Please report
--421173-- it at http://valgrind.org/support/bug_reports.html.
[ OK ] DBTest2.BackupFileTemperature (3366 ms)
[----------] 1 test from DBTest2 (3371 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (3413 ms total)
[ PASSED ] 1 test.
==421173==
==421173== HEAP SUMMARY:
==421173== in use at exit: 13,042 bytes in 195 blocks
==421173== total heap usage: 26,022 allocs, 25,827 frees, 27,555,265 bytes allocated
==421173==
==421173== 8 bytes in 1 blocks are possibly lost in loss record 6 of 167
==421173== at 0x4838DBF: operator new(unsigned long) (vg_replace_malloc.c:344)
==421173== by 0x8D4606: allocate (new_allocator.h:114)
==421173== by 0x8D4606: allocate (alloc_traits.h:445)
==421173== by 0x8D4606: _M_allocate (stl_vector.h:343)
==421173== by 0x8D4606: reserve (vector.tcc:78)
==421173== by 0x8D4606: rocksdb::BackupEngineImpl::Initialize() (backupable_db.cc:1174)
==421173== by 0x8D5473: Initialize (backupable_db.cc:918)
==421173== by 0x8D5473: rocksdb::BackupEngine::Open(rocksdb::BackupEngineOptions const&, rocksdb::Env*, rocksdb::BackupEngine**) (backupable_db.cc:937)
==421173== by 0x50AC8F: Open (backup_engine.h:585)
==421173== by 0x50AC8F: rocksdb::DBTest2_BackupFileTemperature_Test::TestBody() (db_test2.cc:6996)
...
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9610
Test Plan:
```
$ make -j24 ROCKSDBTESTS_SUBSET=db_test2 valgrind_check_some
```
Reviewed By: akankshamahajan15
Differential Revision: D34371210
Pulled By: ajkr
fbshipit-source-id: 68154fcb0c51b28222efa23fa4ee02df8d925a18
Summary:
Add Temperature hints information from RocksDB in API
`NewSequentialFile()`. backup and checkpoint operations need to open the
source files with `NewSequentialFile()`, which will have the temperature
hints. Other operations are not covered.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9499
Test Plan: Added unittest
Reviewed By: pdillinger
Differential Revision: D34006115
Pulled By: jay-zhuang
fbshipit-source-id: 568b34602b76520e53128672bd07e9d886786a2f
Summary:
RocksDB try to provide temperature information in the event
listener callbacks. The information is not guaranteed, as some operation
like backup won't have these information.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9591
Test Plan: Added unittest
Reviewed By: siying, pdillinger
Differential Revision: D34309339
Pulled By: jay-zhuang
fbshipit-source-id: 4aca4f270f99fa49186d85d300da42594663d6d7
Summary:
Users can set the priority for file reads associated with their operation by setting `ReadOptions::rate_limiter_priority` to something other than `Env::IO_TOTAL`. Rate limiting `VerifyChecksum()` and `VerifyFileChecksums()` is the motivation for this PR, so it also includes benchmarks and minor bug fixes to get that working.
`RandomAccessFileReader::Read()` already had support for rate limiting compaction reads. I changed that rate limiting to be non-specific to compaction, but rather performed according to the passed in `Env::IOPriority`. Now the compaction read rate limiting is supported by setting `rate_limiter_priority = Env::IO_LOW` on its `ReadOptions`.
There is no default value for the new `Env::IOPriority` parameter to `RandomAccessFileReader::Read()`. That means this PR goes through all callers (in some cases multiple layers up the call stack) to find a `ReadOptions` to provide the priority. There are TODOs for cases I believe it would be good to let user control the priority some day (e.g., file footer reads), and no TODO in cases I believe it doesn't matter (e.g., trace file reads).
The API doc only lists the missing cases where a file read associated with a provided `ReadOptions` cannot be rate limited. For cases like file ingestion checksum calculation, there is no API to provide `ReadOptions` or `Env::IOPriority`, so I didn't count that as missing.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9424
Test Plan:
- new unit tests
- new benchmarks on ~50MB database with 1MB/s read rate limit and 100ms refill interval; verified with strace reads are chunked (at 0.1MB per chunk) and spaced roughly 100ms apart.
- setup command: `./db_bench -benchmarks=fillrandom,compact -db=/tmp/testdb -target_file_size_base=1048576 -disable_auto_compactions=true -file_checksum=true`
- benchmarks command: `strace -ttfe pread64 ./db_bench -benchmarks=verifychecksum,verifyfilechecksums -use_existing_db=true -db=/tmp/testdb -rate_limiter_bytes_per_sec=1048576 -rate_limit_bg_reads=1 -rate_limit_user_ops=true -file_checksum=true`
- crash test using IO_USER priority on non-validation reads with https://github.com/facebook/rocksdb/issues/9567 reverted: `python3 tools/db_crashtest.py blackbox --max_key=1000000 --write_buffer_size=524288 --target_file_size_base=524288 --level_compaction_dynamic_level_bytes=true --duration=3600 --rate_limit_bg_reads=true --rate_limit_user_ops=true --rate_limiter_bytes_per_sec=10485760 --interval=10`
Reviewed By: hx235
Differential Revision: D33747386
Pulled By: ajkr
fbshipit-source-id: a2d985e97912fba8c54763798e04f006ccc56e0c
Summary:
In RocksDB option new_table_reader_for_compaction_inputs has
not effect on Compaction or on the behavior of RocksDB library.
Therefore, we are removing it in the upcoming 7.0 release.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9443
Test Plan: CircleCI
Reviewed By: ajkr
Differential Revision: D33788508
Pulled By: akankshamahajan15
fbshipit-source-id: 324ca6f12bfd019e9bd5e1b0cdac39be5c3cec7d
Summary:
ajkr reminded me that we have a rule of not including per-kv related data in `WriteOptions`.
Namely, `WriteOptions` should not include information about "what-to-write", but should just
include information about "how-to-write".
According to this rule, `WriteOptions::timestamp` (experimental) is clearly a violation. Therefore,
this PR removes `WriteOptions::timestamp` for compliance.
After the removal, we need to pass timestamp info via another set of APIs. This PR proposes a set
of overloaded functions `Put(write_opts, key, value, ts)`, `Delete(write_opts, key, ts)`, and
`SingleDelete(write_opts, key, ts)`. Planned to add `Write(write_opts, batch, ts)`, but its complexity
made me reconsider doing it in another PR (maybe).
For better checking and returning error early, we also add a new set of APIs to `WriteBatch` that take
extra `timestamp` information when writing to `WriteBatch`es.
These set of APIs in `WriteBatchWithIndex` are currently not supported, and are on our TODO list.
Removed `WriteBatch::AssignTimestamps()` and renamed `WriteBatch::AssignTimestamp()` to
`WriteBatch::UpdateTimestamps()` since this method require that all keys have space for timestamps
allocated already and multiple timestamps can be updated.
The constructor of `WriteBatch` now takes a fourth argument `default_cf_ts_sz` which is the timestamp
size of the default column family. This will be used to allocate space when calling APIs that do not
specify a column family handle.
Also, updated `DB::Get()`, `DB::MultiGet()`, `DB::NewIterator()`, `DB::NewIterators()` methods, replacing
some assertions about timestamp to returning Status code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8946
Test Plan:
make check
./db_bench -benchmarks=fillseq,fillrandom,readrandom,readseq,deleterandom -user_timestamp_size=8
./db_stress --user_timestamp_size=8 -nooverwritepercent=0 -test_secondary=0 -secondary_catch_up_one_in=0 -continuous_verification_interval=0
Make sure there is no perf regression by running the following
```
./db_bench_opt -db=/dev/shm/rocksdb -use_existing_db=0 -level0_stop_writes_trigger=256 -level0_slowdown_writes_trigger=256 -level0_file_num_compaction_trigger=256 -disable_wal=1 -duration=10 -benchmarks=fillrandom
```
Before this PR
```
DB path: [/dev/shm/rocksdb]
fillrandom : 1.831 micros/op 546235 ops/sec; 60.4 MB/s
```
After this PR
```
DB path: [/dev/shm/rocksdb]
fillrandom : 1.820 micros/op 549404 ops/sec; 60.8 MB/s
```
Reviewed By: ltamasi
Differential Revision: D33721359
Pulled By: riversand963
fbshipit-source-id: c131561534272c120ffb80711d42748d21badf09
Summary:
Fix a bug that causes file temperature not preserved after DB is restarted, or options.max_manifest_file_size is hit.
Also, pass temperature information to NewRandomAccessFile() to allow users to hack a solution where they don't preserve tiering information.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9242
Test Plan: Add a unit test that would fail without the fix.
Reviewed By: jay-zhuang
Differential Revision: D32818150
fbshipit-source-id: 36aa3f148c60107f7b8e9d65b63b039f9e1a1eec
Summary:
DBTest2.RateLimitedCompactionReads sometime shows following failure:
what(): db/db_test2.cc:3976: Failure
Expected equality of these values:
i + 1
Which is: 4
NumTableFilesAtLevel(0)
Which is: 0
The assertion itself doesn't appear to be correct. Fix it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9185
Test Plan: Removing an assertion shouldn't break anything.
Reviewed By: ajkr
Differential Revision: D32549530
fbshipit-source-id: 9993372d8af89161f903337a13f3e316e690a6b8
Summary:
Add the 3 read bytes counter to the Statistic, which will be used by storage tiering and get the information for files with different temperature.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9123
Test Plan: added new testing cases.
Reviewed By: siying
Differential Revision: D32154745
Pulled By: zhichao-cao
fbshipit-source-id: b7905d6dae469a72428742364ec07b634b6f15da
Summary:
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9162
Existing TransactionUtil::CheckKeyForConflict() performs only seq-based
conflict checking. If user-defined timestamp is enabled, it should perform
conflict checking based on timestamps too.
Update TransactionUtil::CheckKey-related methods to verify the timestamp of the
latest version of a key is smaller than the read timestamp. Note that
CheckKeysForConflict() is not updated since it's used only by optimistic
transaction, and we do not plan to update it in this upcoming batch of diffs.
Existing GetLatestSequenceForKey() returns the sequence of the latest
version of a specific user key. Since we support user-defined timestamp, we
need to update this method to also return the timestamp (if enabled) of the
latest version of the key. This will be needed for snapshot validation.
Reviewed By: ltamasi
Differential Revision: D31567960
fbshipit-source-id: 2e4a14aed267435a9aa91bc632d2411c01946d44
Summary:
Allow compaction_job_test, db_io_failure_test, dbformat_test, deletefile_test, and fault_injection_test to use a custom Env object. Also move ```RegisterCustomObjects``` declaration to a header file to simplify things.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9087
Test Plan: Run manually using "buck test rocksdb/src:compaction_job_test_fbcode" etc.
Reviewed By: riversand963
Differential Revision: D32007222
Pulled By: anand1976
fbshipit-source-id: 99af58559e25bf61563dfa95dc46e31fa7375792
Summary:
`dbs` should not be cleared, as it is reused later when reopening the DBs, so we have an out-of-bounds access with `dbnames[dbnum]`. The values left in the vector don't need to be reset, as the db pointer is an out parameter for `DB::Open`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9046
Reviewed By: pdillinger
Differential Revision: D31738263
Pulled By: ot
fbshipit-source-id: c619e947b8d3dbc3d896f29971f093d3e3c794d3
Summary:
Add the file temperature to `IngestExternalFileArg` such that when SST files are ingested, user is able to assign the temperature to each SST file. If the temperature vector is empty or its size does not match the file name vector size, all ingested SST files will be assigned with `Temperature::unKnown`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8949
Test Plan: add the new test and make check
Reviewed By: siying
Differential Revision: D31127852
Pulled By: zhichao-cao
fbshipit-source-id: 141a81f0f7b473d88f4ab0cb2a21a114cbc6f83c
Summary:
For tiered storage project, we need to know the block read count and read bytes of files with different temperature. Add FileIOByTemperature to IOStatsContext and collect the bytes read and read count from different temperature files through the RandomAccessFileReader.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8710
Test Plan: make check, add the testing cases
Reviewed By: siying
Differential Revision: D30582400
Pulled By: zhichao-cao
fbshipit-source-id: d83173de594374fc8404af5ce93a6a9be72c7141
Summary:
With test sync points, we can assert on the equality of iterator value in three existing
unit tests.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8973
Test Plan:
```
gtest-parallel -r 1000 ./db_test2 --gtest_filter=DBTest2.IterRaceFlush2:DBTest2.IterRaceFlush1:DBTest2.IterRefreshRaceFlush
```
make check
Reviewed By: akankshamahajan15
Differential Revision: D31256340
Pulled By: riversand963
fbshipit-source-id: a9440767ab383e0ec61bd43ffa8fbec4ba562ea2
Summary:
Make the Statistics object into a Customizable object. Statistics can now be stored and created to/from the Options file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8637
Reviewed By: zhichao-cao
Differential Revision: D30530550
Pulled By: mrambacher
fbshipit-source-id: 5fc7d01d8431f37b2c205bbbd8342c9f697023bd
Summary:
This PR does the following:
-> Makes the MemTableRepFactory into a Customizable class and creatable/configurable via CreateFromString
-> Makes the existing implementations compatible with configurations
-> Moves the "SpecialRepFactory" test class into testutil, accessible via the ObjectRegistry or a NewSpecial API
New tests were added to validate the functionality and all existing tests pass. db_bench and memtablerep_bench were hand-tested to verify the functionality in those tools.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8419
Reviewed By: zhichao-cao
Differential Revision: D29558961
Pulled By: mrambacher
fbshipit-source-id: 81b7229636e4e649a0c914e73ac7b0f8454c931c
Summary:
* FullKey and ParseFullKey appear to serve no purpose in the public API
(or anything else) so removed. Only use in one test updated.
* NumberToString serves no purpose vs. ToString so removed, numerous
calls updated
* Remove unnecessary forward declarations in metadata.h by re-arranging
class definitions.
* Remove some unneeded semicolons
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8736
Test Plan: existing tests
Reviewed By: mrambacher
Differential Revision: D30700039
Pulled By: pdillinger
fbshipit-source-id: 1e436a576f511a6ed8b4d97af7cc8216bc729af2
Summary:
- Allow to get `Valid()`, `status()`, `key()` and `value()` of an iterator from `IteratorTraceExecutionResult`.
- Move lower bound and upper bound from `IteratorSeekQueryTraceRecord` to `IteratorQueryTraceRecord`.
Added test in `DBTest2.TraceAndReplay`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8687
Reviewed By: zhichao-cao
Differential Revision: D30457630
Pulled By: autopear
fbshipit-source-id: be433099a25895b3aa6f0c00f95ad7b1d7489c1d
Summary:
Trace file V2 added lower/upper bounds to `Iterator::Seek()` and `Iterator::SeekForPrev()`. They were not used anywhere during the execution of a `TraceRecord`. Now they are added to be used by `ReadOptions` during `Iterator::Seek()` and `Iterator::SeekForPrev()` if they are set.
Added test cases in `DBTest2.TraceAndManualReplay`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8677
Reviewed By: zhichao-cao
Differential Revision: D30438255
Pulled By: autopear
fbshipit-source-id: 82563006be0b69155990e506a74951c18af8d288
Summary:
`Replayer::Execute()` can directly returns the result (e.g, request latency, DB::Get() return code, returned value, etc.)
`Replayer::Replay()` reports the results via a callback function.
New interface:
`TraceRecordResult` in "rocksdb/trace_record_result.h".
`DBTest2.TraceAndReplay` and `DBTest2.TraceAndManualReplay` are updated accordingly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8657
Reviewed By: ajkr
Differential Revision: D30290216
Pulled By: autopear
fbshipit-source-id: 3c8d4e6b180ec743de1a9d9dcaee86064c74f0d6
Summary:
New public interfaces:
`TraceRecord` and `TraceRecord::Handler`, available in "rocksdb/trace_record.h".
`Replayer`, available in `rocksdb/utilities/replayer.h`.
User can use `DB::NewDefaultReplayer()` to create a Replayer to auto/manual replay a trace file.
Unit tests:
- `./db_test2 --gtest_filter="DBTest2.TraceAndReplay"`: Updated with the internal API changes.
- `./db_test2 --gtest_filter="DBTest2.TraceAndManualReplay"`: New for manual replay.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8611
Reviewed By: ajkr
Differential Revision: D30266329
Pulled By: autopear
fbshipit-source-id: 1ecb3cbbedae0f6a67c18f0cc82e002b4d81b6f8
Summary:
Prior to this change, the "wal_dir" DBOption would always be set (defaults to dbname) when the DBOptions were sanitized. Because of this setitng in the options file, it was not possible to rename/relocate a database directory after it had been created and use the existing options file.
After this change, the "wal_dir" option is only set under specific circumstances. Methods were added to the ImmutableDBOptions class to see if it is set and if it is set to something other than the dbname. Additionally, a method was added to retrieve the effective value of the WAL dir (either the option or the dbname/path).
Tests were added to the core and ldb to test that a database could be created and renamed without issue. Additional tests for various permutations of wal_dir were also added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8582
Reviewed By: pdillinger, autopear
Differential Revision: D29881122
Pulled By: mrambacher
fbshipit-source-id: 67d3d033dc8813d59917b0a3fba2550c0efd6dfb
Summary:
The PerThreadDBPath has already specified a slash. It does not need to be specified when initializing the test path.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8555
Reviewed By: ajkr
Differential Revision: D29758399
Pulled By: jay-zhuang
fbshipit-source-id: 6d2b878523e3e8580536e2829cb25489844d9011
Summary:
Added the ability to cancel an in-progress range compaction by storing to an atomic "canceled" variable pointed to within the CompactRangeOptions structure.
Tested via two tests added to db_tests2.cc.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8351
Reviewed By: ajkr
Differential Revision: D28808894
Pulled By: ddevec
fbshipit-source-id: cb321361c9e23b084b188bb203f11c375a22c2dd
Summary:
When a memtable is flushed, it will validate number of entries it reads, and compare the number with how many entries inserted into memtable. This serves as one sanity c\
heck against memory corruption. This change will also allow more counters to be added in the future for better validation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8288
Test Plan: Pass all existing tests
Reviewed By: ajkr
Differential Revision: D28369194
fbshipit-source-id: 7ff870380c41eab7f99eee508550dcdce32838ad
Summary:
As a part of tiered storage, writing tempeature information to manifest is needed so that after DB recovery, RocksDB still has the tiering information, to implement some further necessary functionalities.
Also fix some issues in simulated hybrid FS.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8284
Test Plan: Add a new unit test to validate that the information is indeed written and read back.
Reviewed By: zhichao-cao
Differential Revision: D28335801
fbshipit-source-id: 56aeb2e6ea090be0200181dd968c8a7278037def
Summary:
In current RocksDB, in recover the information form WAL, we do the consistency check for each column family when one WAL file is corrupted and PointInTimeRecovery is set. However, it will report a false positive alert on "SST file is ahead of WALs" when one of the CF current log number is greater than the corrupted WAL number (CF contains the data beyond the corrupted WAl) due to a new column family creation during flush. In this case, a new WAL is created (it is empty) during a flush. Also, due to some reason (e.g., storage issue or crash happens before SyncCloseLog is called), the old WAL is corrupted. The new CF has no data, therefore, it does not have the consistency issue.
Fix: when checking cfd->GetLogNumber() > corrupted_wal_number also check cfd->GetLiveSstFilesSize() > 0. So the CFs with no SST file data will skip the check here.
Note potential ignored inconsistency caused due to fix: empty CF can also be caused by write+delete. In this case, after flush, there is no SST files being generated. However, this CF still have the log in the WAL. When the WAL is corrupted, the DB might be inconsistent.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8207
Test Plan: added unit test, make crash_test
Reviewed By: riversand963
Differential Revision: D27898839
Pulled By: zhichao-cao
fbshipit-source-id: 931fc2d8b92dd00b4169bf84b94e712fd688a83e
Summary:
In a distributed environment, a file `rename()` operation can succeed on server (remote)
side, but the client can somehow return non-ok status to RocksDB. Possible reasons include
network partition, connection issue, etc. This happens in `rocksdb::SetCurrentFile()`, which
can be called in `LogAndApply() -> ProcessManifestWrites()` if RocksDB tries to switch to a
new MANIFEST. We currently always delete the new MANIFEST if an error occurs.
This is problematic in distributed world. If the server-side successfully updates the CURRENT
file via renaming, then a subsequent `DB::Open()` will try to look for the new MANIFEST and fail.
As a fix, we can track the execution result of IO operations on the new MANIFEST.
- If IO operations on the new MANIFEST fail, then we know the CURRENT must point to the original
MANIFEST. Therefore, it is safe to remove the new MANIFEST.
- If IO operations on the new MANIFEST all succeed, but somehow we end up in the clean up
code block, then we do not know whether CURRENT points to the new or old MANIFEST. (For local
POSIX-compliant FS, it should still point to old MANIFEST, but it does not matter if we keep the
new MANIFEST.) Therefore, we keep the new MANIFEST.
- Any future `LogAndApply()` will switch to a new MANIFEST and update CURRENT.
- If process reopens the db immediately after the failure, then the CURRENT file can point
to either the new MANIFEST or the old one, both of which exist. Therefore, recovery can
succeed and ignore the other.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8192
Test Plan: make check
Reviewed By: zhichao-cao
Differential Revision: D27804648
Pulled By: riversand963
fbshipit-source-id: 9c16f2a5ce41bc6aadf085e48449b19ede8423e4
Summary:
Current flush reason attribution is misleading or incorrect (depending on what the original intention was):
- Flush due to WAL reaching its maximum size is attributed to `kWriteBufferManager`
- Flushes due to full write buffer and write buffer manager are not distinguishable, both are attributed to `kWriteBufferFull`
This changes the first to a new flush reason `kWALFull`, and splits the second between `kWriteBufferManager` and `kWriteBufferFull`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8150
Reviewed By: zhichao-cao
Differential Revision: D27569645
Pulled By: ot
fbshipit-source-id: 7e3c8ca186a6e71976e6b8e937297eebd4b769cc
Summary:
With thread/process-specific dirs. (Errors seen in FB infra.)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8145
Test Plan: see in FB infra tests
Reviewed By: riversand963
Differential Revision: D27542355
Pulled By: pdillinger
fbshipit-source-id: b3c8e66f91a6a6b3a775f6fc0c3cf71e63c29ade
Summary:
For some reason I still cannot figure out, the manual flush in this test
was sometimes producing a third tiny file. I saw it a bunch of times on
ppc64le, but even running a qemu system with that architecture (and
playing with various other options) could not repro. However we did get
an instrumented Travis run to confirm the problem is indeed a third tiny
file - https://travis-ci.org/github/facebook/rocksdb/jobs/761986592. We
can avoid it by filling memtables less full and using manual flush.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8044
Reviewed By: akankshamahajan15
Differential Revision: D26892635
Pulled By: ajkr
fbshipit-source-id: 775c04176931cf01d07cc78fb82cfe3a11beebcf
Summary:
For dictionary compression, we need to collect some representative samples of the data to be compressed, which we use to either generate or train (when `CompressionOptions::zstd_max_train_bytes > 0`) a dictionary. Previously, the strategy was to buffer all the data blocks during flush, and up to the target file size during compaction. That strategy allowed us to randomly pick samples from as wide a range as possible that'd be guaranteed to land in a single output file.
However, some users try to make huge files in memory-constrained environments, where this strategy can cause OOM. This PR introduces an option, `CompressionOptions::max_dict_buffer_bytes`, that limits how much data blocks are buffered before we switch to unbuffered mode (which means creating the per-SST dictionary, writing out the buffered data, and compressing/writing new blocks as soon as they are built). It is not strict as we currently buffer more than just data blocks -- also keys are buffered. But it does make a step towards giving users predictable memory usage.
Related changes include:
- Changed sampling for dictionary compression to select unique data blocks when there is limited availability of data blocks
- Made use of `BlockBuilder::SwapAndReset()` to save an allocation+memcpy when buffering data blocks for building a dictionary
- Changed `ParseBoolean()` to accept an input containing characters after the boolean. This is necessary since, with this PR, a value for `CompressionOptions::enabled` is no longer necessarily the final component in the `CompressionOptions` string.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7970
Test Plan:
- updated `CompressionOptions` unit tests to verify limit is respected (to the extent expected in the current implementation) in various scenarios of flush/compaction to bottommost/non-bottommost level
- looked at jemalloc heap profiles right before and after switching to unbuffered mode during flush/compaction. Verified memory usage in buffering is proportional to the limit set.
Reviewed By: pdillinger
Differential Revision: D26467994
Pulled By: ajkr
fbshipit-source-id: 3da4ef9fba59974e4ef40e40c01611002c861465
Summary:
Memtable bloom filter is useful in many use cases. A default value on with conservative 1.5% memory can benefit more use cases than use cases impacted.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6584
Test Plan: Run all existing tests.
Reviewed By: pdillinger
Differential Revision: D20626739
fbshipit-source-id: 1dd45532b932139552519b8c2682bd954550c2f9
Summary:
The main improvement here is to not include `.` or `..` in the results of `Env::GetChildren`. The occurrence of `.` or `..`; it is non-portable, dependent on the Operating System and the File System. See: https://www.gnu.org/software/libc/manual/html_node/Reading_002fClosing-Directory.html
There were lots of duplicate checks spread through the RocksDB codebase previously to skip `.` and `..`. This new removes the need for those at the source.
Also some minor fixes to `Env::GetChildren`:
* Improve error handling in POSIX implementation
* Remove unnecessary array allocation on Windows
* Fix struct name for Windows Non-UTF-8 API
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7819
Reviewed By: ajkr
Differential Revision: D25837394
Pulled By: jay-zhuang
fbshipit-source-id: 1e137e7218d38b450af9c083f73d5357abcbba2e
Summary:
This PR does the following:
-> Creates a WinFileSystem class. This class is the Windows equivalent of the PosixFileSystem and will be used on Windows systems.
-> Introduces a CustomEnv class. A CustomEnv is an Env that takes a FileSystem as constructor argument. I believe there will only ever be two implementations of this class (PosixEnv and WinEnv). There is still a CustomEnvWrapper class that takes an Env and a FileSystem and wraps the Env calls with the input Env but uses the FileSystem for the FileSystem calls
-> Eliminates the public uses of the LegacyFileSystemWrapper.
With this change in place, there are effectively the following patterns of Env:
- "Base Env classes" (PosixEnv, WinEnv). These classes implement the core Env functions (e.g. Threads) and have a hard-coded input FileSystem. These classes inherit from CompositeEnv, implement the core Env functions (threads) and delegate the FileSystem-like calls to the input file system.
- Wrapped Composite Env classes (MemEnv). These classes take in an Env and a FileSystem. The core env functions are re-directed to the wrapped env. The file system calls are redirected to the input file system
- Legacy Wrapped Env classes. These classes take in an Env input (but no FileSystem). The core env functions are re-directed to the wrapped env. A "Legacy File System" is created using this env and the file system calls directed to the env itself.
With these changes in place, the PosixEnv becomes a singleton -- there is only ever one created. Any other use of the PosixEnv is via another wrapped env. This cleans up some of the issues with the env construction and destruction.
Additionally, there were places in the code that required had an Env when they required a FileSystem. Many of these places would wrap the Env with a LegacyFileSystemWrapper instead of using the env->GetFileSystem(). These places were changed, thereby removing layers of additional redirection (LegacyFileSystem --> Env --> Env::FileSystem).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7703
Reviewed By: zhichao-cao
Differential Revision: D25762190
Pulled By: anand1976
fbshipit-source-id: 1a088e97fc916f28ac69c149cd1dcad0ab31704b
Summary:
Added a fix for the failure of
DBTest2.PartitionedIndexUserToInternalKey on ppc64le in travis
Closes https://github.com/facebook/rocksdb/issues/7746
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7752
Test Plan:
Ran travis job multiple times and it passed. Will keep
watching the travis job after this patch.
Reviewed By: pdillinger
Differential Revision: D25373130
Pulled By: akankshamahajan15
fbshipit-source-id: fa0e3f85f75b687415044a506e42cc38ead87975