Summary:
There is a race condition if WAL tracking in the MANIFEST is enabled in a database that disables 2PC.
The race condition is between two background flush threads trying to install flush results to the MANIFEST.
Consider an example database with two column families: "default" (cfd0) and "cf1" (cfd1). Initially,
both column families have one mutable (active) memtable whose data backed by 6.log.
1. Trigger a manual flush for "cf1", creating a 7.log
2. Insert another key to "default", and trigger flush for "default", creating 8.log
3. BgFlushThread1 finishes writing 9.sst
4. BgFlushThread2 finishes writing 10.sst
```
Time BgFlushThread1 BgFlushThread2
| mutex_.Lock()
| precompute min_wal_to_keep as 6
| mutex_.Unlock()
| mutex_.Lock()
| precompute min_wal_to_keep as 6
| join MANIFEST write queue and mutex_.Unlock()
| write to MANIFEST
| mutex_.Lock()
| cfd1->log_number = 7
| Signal bg_flush_2 and mutex_.Unlock()
| wake up and mutex_.Lock()
| cfd0->log_number = 8
| FindObsoleteFiles() with job_context->log_number == 7
| mutex_.Unlock()
| PurgeObsoleteFiles() deletes 6.log
V
```
As shown in the above, BgFlushThread2 thinks that the min wal to keep is 6.log because "cf1" has unflushed data in 6.log (cf1.log_number=6).
Similarly, BgThread1 thinks that min wal to keep is also 6.log because "default" has unflushed data (default.log_number=6).
No WAL deletion will be written to MANIFEST because 6 is equal to `versions_->wals_.min_wal_number_to_keep`,
due to https://github.com/facebook/rocksdb/blob/7.1.fb/db/memtable_list.cc#L513:L514.
The bg flush thread that finishes last will perform file purging. `job_context.log_number` will be evaluated as 7, i.e.
the min wal that contains unflushed data, causing 6.log to be deleted. However, MANIFEST thinks 6.log should still exist.
If you close the db at this point, you won't be able to re-open it if `track_and_verify_wal_in_manifest` is true.
We must handle the case of multiple bg flush threads, and it is difficult for one bg flush thread to know
the correct min wal number until the other bg flush threads have finished committing to the manifest and updated
the `cfd::log_number`.
To fix this issue, we rename an existing variable `min_log_number_to_keep_2pc` to `min_log_number_to_keep`,
and use it to track WAL file deletion in non-2pc mode as well.
This variable is updated only 1) during recovery with mutex held, or 2) in the MANIFEST write thread.
`min_log_number_to_keep` means RocksDB will delete WALs below it, although there may be WALs
above it which are also obsolete. Formally, we will have [min_wal_to_keep, max_obsolete_wal]. During recovery, we
make sure that only WALs above max_obsolete_wal are checked and added back to `alive_log_files_`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9715
Test Plan:
```
make check
```
Also ran stress test below (with asan) to make sure it completes successfully.
```
TEST_TMPDIR=/dev/shm/rocksdb OPT=-g ASAN_OPTIONS=disable_coredump=0 \
CRASH_TEST_EXT_ARGS=--compression_type=zstd SKIP_FORMAT_BUCK_CHECKS=1 \
make J=52 -j52 blackbox_asan_crash_test
```
Reviewed By: ltamasi
Differential Revision: D34984412
Pulled By: riversand963
fbshipit-source-id: c7b21a8d84751bb55ea79c9f387103d21b231005
Summary:
Previously, the OnErrorRecoveryCompleted callback was called when
RocksDB was able to successfully recover from a retryable error.
However, if the recovery failed and was eventually stopped, there was no
indication of the status. To fix that, a new OnErrorRecoveryEnd callback
is introduced that deprecates the OnErrorRecoveryCompleted callback. The
new callback is called with the original error and the new error status.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9244
Test Plan: Add a new unit test in error_handler_fs_test
Reviewed By: zhichao-cao
Differential Revision: D32922303
Pulled By: anand1976
fbshipit-source-id: f04e77a9cb92c5ea6385590682d3fcf559971b99
Summary:
1. Extend FlushJobInfo and CompactionJobInfo with information about the blob files generated by flush/compaction jobs. This PR add two structures BlobFileInfo and BlobFileGarbageInfo that contains the required information of blob files.
2. Notify the creation and deletion of blob files through OnBlobFileCreationStarted, OnBlobFileCreated, and OnBlobFileDeleted.
3. Test OnFile*Finish operations notifications with Blob Files.
4. Log the blob file creation/deletion events through EventLogger in Log file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8675
Test Plan: Add new unit tests in listener_test
Reviewed By: ltamasi
Differential Revision: D30412613
Pulled By: akankshamahajan15
fbshipit-source-id: ca51b63c6e8c8d0485a38c503572bc5a82bd5d07
Summary:
I very recently realized that with https://github.com/facebook/rocksdb/issues/8669 we cannot later add
file numbers to external SST files (so that more can share db session
ids for better uniqueness properties), because of forward compatibility.
We would have a version of RocksDB that assumes session IDs are unique
on external SST files and therefore can't really break that invariant in
future files.
This change adds a table property for "orig_file_number" which is
populated by normal SST files and also external SST files generated by
SstFileWriter. SstFileWriter now keeps a db_session_id for life of the
object and increments its own file numbers for embedding in table
properties. (They are arguably "fake" file numbers because these numbers
and not embedded in the file name.)
While updating block_based_table_builder, I removed several unnecessary
fields from Rep, because following the pattern would have created
another unnecessary field.
This change also updates block_based_table_reader to use this new
property when available, which means that for newer SST files, we can
determine the stable/original <db_session_id,file_number> unique
identifier using just the file contents, not the file name. (It's a bit
complicated; detailed comments in block_based_table_reader.)
Also added DB host id to properties listing by sst_dump, which could be
useful in debugging.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8686
Test Plan: majorly overhauled StableCacheKeys test for this change
Reviewed By: zhichao-cao
Differential Revision: D30457742
Pulled By: pdillinger
fbshipit-source-id: 2e5ae7dddeb94fb9d8eac8a928486aed8b8cd445
Summary:
- Added Type/CreateFromString
- Added ability to load EventListeners to DBOptions
- Since EventListeners did not previously have a Name(), defaulted to "". If there is no name, the listener cannot be loaded from the ObjectRegistry.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8473
Reviewed By: zhichao-cao
Differential Revision: D29901488
Pulled By: mrambacher
fbshipit-source-id: 2d3a4aa6db1562ac03e7ad41b360e3521d486254
Summary:
With Ribbon filter work and possible variance in actual bits
per key (or prefix; general term "entry") to achieve certain FP rates,
I've received a request to be able to track actual bits per key in
generated filters. This change adds a num_filter_entries table
property, which can be combined with filter_size to get bits per key
(entry).
This can vary from num_entries in at least these ways:
* Different versions of same key are only counted once in filters.
* With prefix filters, several user keys map to the same filter entry.
* A single filter can include both prefixes and user keys.
Note that FilterBlockBuilder::NumAdded() didn't do anything useful
except distinguish empty from non-empty.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8323
Test Plan: basic unit test included, others updated
Reviewed By: jay-zhuang
Differential Revision: D28596210
Pulled By: pdillinger
fbshipit-source-id: 529a111f3c84501e5a470bc84705e436ee68c376
Summary:
Added `TableProperties::{fast,slow}_compression_estimated_data_size`.
These properties are present in block-based tables when
`ColumnFamilyOptions::sample_for_compression > 0` and the necessary
compression library is supported when the file is generated. They
contain estimates of what `TableProperties::data_size` would be if the
"fast"/"slow" compression library had been used instead. One
limitation is we do not record exactly which "fast" (ZSTD or Zlib)
or "slow" (LZ4 or Snappy) compression library produced the result.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8139
Test Plan:
- new unit test
- ran `db_bench` with `sample_for_compression=1`; verified the `data_size` property matches the `{slow,fast}_compression_estimated_data_size` when the same compression type is used for the output file compression and the sampled compression
Reviewed By: riversand963
Differential Revision: D27454338
Pulled By: ajkr
fbshipit-source-id: 9529293de93ddac7f03b2e149d746e9f634abac4
Summary:
When SST file is created, application is able to know the file information through OnTableFileCreated callback in LogAndNotifyTableFileCreationFinished. Since file checksum information can be useful for application when the SST file is created, we add file_checksum and file_checksum_func_name information to TableFileCreationInfo, which will be passed through OnTableFileCreated.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7108
Test Plan: make check, listener_test.
Reviewed By: ajkr
Differential Revision: D22470240
Pulled By: zhichao-cao
fbshipit-source-id: 92c20344d9b986eadfe3480f3769bf4add0dbaae
Summary:
`db_id` and `db_session_id` are now part of the table properties for all formats and stored in SST files. This adds about 99 bytes to each new SST file.
The `TablePropertiesNames` for these two identifiers are `rocksdb.creating.db.identity` and `rocksdb.creating.session.identity`.
In addition, SST files generated from SstFileWriter and Repairer have DB identity “SST Writer” and “DB Repairer”, respectively. Their DB session IDs are generated in the same way as `DB::GetDbSessionId`.
A table property test is added.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6983
Test Plan: make check and some manual tests.
Reviewed By: zhichao-cao
Differential Revision: D22048826
Pulled By: gg814
fbshipit-source-id: afdf8c11424a6f509b5c0b06dafad584a80103c9
Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433
Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.
Differential Revision: D19977691
fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
Summary:
This is groundwork for adding garbage collection support to BlobDB. The
patch adds logic that keeps track of the oldest blob file referred to by
each SST file. The oldest blob file is identified during flush/
compaction (similarly to how the range of keys covered by the SST is
identified), and persisted in the manifest as a custom field of the new
file edit record. Blob indexes with TTL are ignored for the purposes of
identifying the oldest blob file (since such blob files are cleaned up by the
TTL logic in BlobDB).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5903
Test Plan:
Added new unit tests; also ran db_bench in BlobDB mode, inspected the
manifest using ldb, and confirmed (by scanning the SST files using
sst_dump) that the value of the oldest blob file number field matches
the contents of the file for each SST.
Differential Revision: D17859997
Pulled By: ltamasi
fbshipit-source-id: 21662c137c6259a6af70446faaf3a9912c550e90
Summary:
When a new SST file is created via flush or compaction, we dump out the table properties, however only a few table properties are logged. The change here is to log all the table properties
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5168
Differential Revision: D14876928
Pulled By: vjnadimpalli
fbshipit-source-id: 1aca42ad00f9f650761d39e187f8beeb8700149b
Summary:
This commit implements automatic recovery from a Status::NoSpace() error
during background operations such as write callback, flush and
compaction. The broad design is as follows -
1. Compaction errors are treated as soft errors and don't put the
database in read-only mode. A compaction is delayed until enough free
disk space is available to accomodate the compaction outputs, which is
estimated based on the input size. This means that users can continue to
write, and we rely on the WriteController to delay or stop writes if the
compaction debt becomes too high due to persistent low disk space
condition
2. Errors during write callback and flush are treated as hard errors,
i.e the database is put in read-only mode and goes back to read-write
only fater certain recovery actions are taken.
3. Both types of recovery rely on the SstFileManagerImpl to poll for
sufficient disk space. We assume that there is a 1-1 mapping between an
SFM and the underlying OS storage container. For cases where multiple
DBs are hosted on a single storage container, the user is expected to
allocate a single SFM instance and use the same one for all the DBs. If
no SFM is specified by the user, DBImpl::Open() will allocate one, but
this will be one per DB and each DB will recover independently. The
recovery implemented by SFM is as follows -
a) On the first occurance of an out of space error during compaction,
subsequent
compactions will be delayed until the disk free space check indicates
enough available space. The required space is computed as the sum of
input sizes.
b) The free space check requirement will be removed once the amount of
free space is greater than the size reserved by in progress
compactions when the first error occured
c) If the out of space error is a hard error, a background thread in
SFM will poll for sufficient headroom before triggering the recovery
of the database and putting it in write-only mode. The headroom is
calculated as the sum of the write_buffer_size of all the DB instances
associated with the SFM
4. EventListener callbacks will be called at the start and completion of
automatic recovery. Users can disable the auto recov ery in the start
callback, and later initiate it manually by calling DB::Resume()
Todo:
1. More extensive testing
2. Add disk full condition to db_stress (follow-on PR)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164
Differential Revision: D9846378
Pulled By: anand1976
fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a
Summary:
This PR comments out the rest of the unused arguments which allow us to turn on the -Wunused-parameter flag. This is the second part of a codemod relating to https://github.com/facebook/rocksdb/pull/3557.
Closes https://github.com/facebook/rocksdb/pull/3662
Differential Revision: D7426121
Pulled By: Dayvedde
fbshipit-source-id: 223994923b42bd4953eb016a0129e47560f7e352
Summary:
Some users want to prevent rocksdb from entering read-only mode in certain error cases. This diff gives them a callback, `OnBackgroundError`, that they can use to achieve it.
- call `OnBackgroundError` every time we consider setting `bg_error_`. Use its result to assign `bg_error_` but not to change the function's return status.
- classified calls using `BackgroundErrorReason` to give the callback some info about where the error happened
- renamed `ParanoidCheck` to something more specific so we can provide a clear `BackgroundErrorReason`
- unit tests for the most common cases: flush or compaction errors
Closes https://github.com/facebook/rocksdb/pull/2477
Differential Revision: D5300190
Pulled By: ajkr
fbshipit-source-id: a0ea4564249719b83428e3f4c6ca2c49e366e9b3
Summary: Added EventListener::OnTableFileCreationStarted. EventListener::OnTableFileCreated will be called on failure case. User can check creation status via TableFileCreationInfo::status.
Test Plan: unit test.
Reviewers: dhruba, yhchiang, ott, sdong
Reviewed By: sdong
Subscribers: sdong, kradhakrishnan, IslamAbdelRahman, andrewkr, yhchiang, leveldb, ott, dhruba
Differential Revision: https://reviews.facebook.net/D56337
Summary: It would be nice to have and access to table properties in compaction callbacks. In MyRocks project, it will make possible to update optimizer statistics online.
Test Plan: ran the unit test. Ran myrocks with the new way of collecting stats.
Reviewers: igor, rven, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D48267
Summary:
Add EventListener::OnTableFileDeletion(), which will be
called when a table file is deleted.
Test Plan: Extend three existing tests in db_test to verify the deleted files.
Reviewers: rven, anthony, kradhakrishnan, igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38931
Summary: Remove a TODO that has been done
Test Plan: make
Reviewers: sdong, igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D39429
Summary:
Add EventListener::OnTableFileCreated(), which will be called
when a table file is created. This patch is part of the
EventLogger and EventListener integration.
Test Plan: Augment existing test in db/listener_test.cc
Reviewers: anthony, kradhakrishnan, rven, igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38865
Summary:
Rename EventLoggerHelpers EventHelpers, as it's going to include
all event-related helper functions instead of EventLogger only stuffs.
Test Plan: make
Reviewers: sdong, rven, anthony
Reviewed By: anthony
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D39093
Summary:
Fixed a missing "}" at the end of the generated JSON Log
in EventLoggerHelpers::LogTableFileCreation.
Test Plan: db_bench
Reviewers: igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38919
Summary:
Allow EventLogger to directly log from a JSONWriter. This allows
the JSONWriter to be shared by EventLogger and potentially EventListener,
which is an important step to integrate EventLogger and EventListener.
This patch also rewrites EventLoggerHelpers::LogTableFileCreation(),
which uses the new API to generate identical log.
Test Plan:
Run db_bench in debug mode and make sure the log is correct and no
assertions fail.
Reviewers: sdong, anthony, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38709