Summary:
Previously we store sequence number range of each blob files, and use the sequence number range to check if the file can be possibly visible by a snapshot. But it adds complexity to the code, since the sequence number is only available after a write. (The current implementation get sequence number by calling GetLatestSequenceNumber(), which is wrong.) With the patch, we are not storing sequence number range, and check if snapshot_sequence < obsolete_sequence to decide if the file is visible by a snapshot (previously we check if first_sequence <= snapshot_sequence < obsolete_sequence).
Closes https://github.com/facebook/rocksdb/pull/3274
Differential Revision: D6571497
Pulled By: yiwu-arbug
fbshipit-source-id: ca06479dc1fcd8782f6525b62b7762cd47d61909
Summary:
added `ThreadType::BOTTOM_PRIORITY` which is used in the `ThreadStatus` object to indicate the thread is used for bottom-pri compactions. Previously there was a bug where we mislabeled such threads as `ThreadType::LOW_PRIORITY`.
Closes https://github.com/facebook/rocksdb/pull/3270
Differential Revision: D6559428
Pulled By: ajkr
fbshipit-source-id: 96b1a50a9c19492b1a5fd1b77cf7061a6f9f1d1c
Summary:
Currently WriteImplWALOnly simply returns when disableWAL is set. This is an incorrect behavior since it does not allocated the sequence number, which is a side-effect of writing to the WAL. This patch fixes the issue.
Closes https://github.com/facebook/rocksdb/pull/3262
Differential Revision: D6550974
Pulled By: maysamyabandeh
fbshipit-source-id: 745a83ae8f04e7ca6c8ffb247d6ef16c287c52e7
Summary:
DeleteScheduler::MarkAsTrash() don't handle existing .trash files correctly
This cause rocksdb to not being able to delete existing .trash files on restart
Closes https://github.com/facebook/rocksdb/pull/3261
Differential Revision: D6548003
Pulled By: IslamAbdelRahman
fbshipit-source-id: c3800639412e587a690062c63076a5a08881e0e6
Summary:
Add snapshot_checker check whenever we need to check sequence against snapshots and decide what to do with an input key. The changes are related to one of:
* compaction filter
* single delete
* delete at bottom level
* merge
Closes https://github.com/facebook/rocksdb/pull/3251
Differential Revision: D6537850
Pulled By: yiwu-arbug
fbshipit-source-id: 3faba40ed5e37779f4a0cb7ae78af9546659c7f2
Summary:
A data race is caught by tsan_crash test between compaction and DropColumnFamily:
https://gist.github.com/yiwu-arbug/5a2b4baae05eeb99ae1719b650f30a44 Compaction checks if the column family has been dropped on each key input, while user can issue DropColumnFamily which updates cfd->dropped_, causing the data race. Fixing it by making cfd->dropped_ an atomic.
Closes https://github.com/facebook/rocksdb/pull/3250
Differential Revision: D6535991
Pulled By: yiwu-arbug
fbshipit-source-id: 5571df020beae7fa7db6fff5ad0d598f49962895
Summary:
Let me know if more test coverage is needed
Closes https://github.com/facebook/rocksdb/pull/3213
Differential Revision: D6457165
Pulled By: miasantreble
fbshipit-source-id: 3f944abff28aa7775237f1c4f61c64ccbad4eea9
Summary:
db/compaction_job.cc:
ReportStartedCompaction(compaction);
CID 1419863 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
2. uninit_member: Non-static class member bottommost_level_ is not initialized in this constructor nor in any functions that it calls.
db/compaction_picker_universal.cc:
7struct InputFileInfo {
2. uninit_member: Non-static class member level is not initialized in this constructor nor in any functions that it calls.
CID 1405355 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
4. uninit_member: Non-static class member index is not initialized in this constructor nor in any functions that it calls.
38 InputFileInfo() : f(nullptr) {}
db/dbformat.h:
ParsedInternalKey()
84 : sequence(kMaxSequenceNumber) // Make code analyzer happy
CID 1168095 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
2. uninit_member: Non-static class member type is not initialized in this constructor nor in any functions that it calls.
85 {} // Intentionally left uninitialized (for speed)
Closes https://github.com/facebook/rocksdb/pull/3091
Differential Revision: D6534558
Pulled By: yiwu-arbug
fbshipit-source-id: 5ada975956196d267b3f149386842af71eda7553
Summary:
db/version_builder.cc:
117 base_vstorage_->InternalComparator();
CID 1351713 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
2. uninit_member: Non-static class member field level_zero_cmp_.internal_comparator is not initialized in this constructor nor in any functions that it calls.
db/version_edit.h:
145 FdWithKeyRange()
146 : fd(),
147 smallest_key(),
148 largest_key() {
CID 1418254 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
2. uninit_member: Non-static class member file_metadata is not initialized in this constructor nor in any functions that it calls.
149 }
db/version_set.cc:
120 }
CID 1322789 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
4. uninit_member: Non-static class member curr_file_level_ is not initialized in this constructor nor in any functions that it calls.
121 }
db/write_batch.cc:
939 assert(cf_mems_);
CID 1419862 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
3. uninit_member: Non-static class member rebuilding_trx_seq_ is not initialized in this constructor nor in any functions that it calls.
940 }
Closes https://github.com/facebook/rocksdb/pull/3092
Differential Revision: D6505666
Pulled By: yiwu-arbug
fbshipit-source-id: fd2c68948a0280772691a419d72ac7e190951d86
Summary:
error message was
```
==3095==ERROR: AddressSanitizer: stack-use-after-scope on address 0x7ffd18216c40 at pc 0x0000005edda1 bp 0x7ffd18215550 sp 0x7ffd18214d00
...
Address 0x7ffd18216c40 is located in stack of thread T0 at offset 1952 in frame
#0 internal_repo_rocksdb/db_compaction_test.cc:1520 rocksdb::DBCompactionTest_DeleteFileRangeFileEndpointsOverlapBug_Test::TestBody()
```
It was unsafe to have slices referring to the temporary string objects' buffers, as those strings were destroyed before the slices were used. Fixed it by assigning the strings returned by `Key()` to local variables.
Closes https://github.com/facebook/rocksdb/pull/3238
Differential Revision: D6507864
Pulled By: ajkr
fbshipit-source-id: dd07de1a0070c6748c1ab4f3d7bd31f9a81889d0
Summary:
`GetCurrentTime()` is used to populate `creation_time` table property during flushes and compactions. It is safe to ignore `GetCurrentTime()` failures here but they should be logged.
(Note that `creation_time` property was introduced as part of TTL-based FIFO compaction in #2480.)
Tes Plan:
`make check`
Closes https://github.com/facebook/rocksdb/pull/3231
Differential Revision: D6501935
Pulled By: sagar0
fbshipit-source-id: 376adcf4ab801d3a43ec4453894b9a10909c8eb6
Summary:
Fix for #2833.
- In `DeleteFilesInRange`, use `GetCleanInputsWithinInterval` instead of `GetOverlappingInputs` to make sure we get a clean cut set of files to delete.
- In `GetCleanInputsWithinInterval`, support nullptr as `begin_key` or `end_key`.
- In `GetOverlappingInputsRangeBinarySearch`, move the assertion for non-empty range away from `ExtendFileRangeWithinInterval`, which should be allowed to return an empty range (via `end_index < begin_index`).
Closes https://github.com/facebook/rocksdb/pull/2843
Differential Revision: D5772387
Pulled By: ajkr
fbshipit-source-id: e554e8461823c6be82b21a9262a2da02b3957881
Summary:
Since #1665, on merge error, iterator will be set to corrupted status, but it doesn't invalidate the iterator. Fixing it.
Closes https://github.com/facebook/rocksdb/pull/3226
Differential Revision: D6499094
Pulled By: yiwu-arbug
fbshipit-source-id: 80222930f949e31f90a6feaa37ddc3529b510d2c
Summary:
I started adding gflags support for cmake on linux and got frustrated that I'd need to duplicate the build_detect_platform logic, which determines namespace based on attempting compilation. We can do it differently -- use the GFLAGS_NAMESPACE macro if available, and if not, that indicates it's an old gflags version without configurable namespace so we can simply hardcode "google".
Closes https://github.com/facebook/rocksdb/pull/3212
Differential Revision: D6456973
Pulled By: ajkr
fbshipit-source-id: 3e6d5bde3ca00d4496a120a7caf4687399f5d656
Summary:
Add PreReleaseCallback to be called at the end of WriteImpl but before publishing the sequence number. The callback is used in WritePrepareTxn to i) update the commit map, ii) update the last published sequence number in the 2nd write queue. It also ensures that all the commits will go to the 2nd queue.
These changes will ensure that the commit map is updated before the sequence number is published and used by reading snapshots. If we use two write queues, the snapshots will use the seq number published by the 2nd queue. If we use one write queue (the default, the snapshots will use the last seq number in the memtable, which also indicates the last published seq number.
Closes https://github.com/facebook/rocksdb/pull/3205
Differential Revision: D6438959
Pulled By: maysamyabandeh
fbshipit-source-id: f8b6c434e94bc5f5ab9cb696879d4c23e2577ab9
Summary:
When Seek a key less than `lower_bound`, should return `lower_bound`.
ajkr PTAL
Closes https://github.com/facebook/rocksdb/pull/3199
Differential Revision: D6421126
Pulled By: ajkr
fbshipit-source-id: a06c825830573e0040630704f6bcb3f7f48626f7
Summary:
This is a simpler version of #3097 by removing all unrelated changes.
Fixing the bug where concurrent writes may get Status::OK while it actually gets IOError on WAL write. This happens when multiple writes form a write batch group, and the leader get an IOError while writing to WAL. The leader failed to pass the error to followers in the group, and the followers end up returning Status::OK() while actually writing nothing. The bug only affect writes in a batch group. Future writes after the batch group will correctly return immediately with the IOError.
Closes https://github.com/facebook/rocksdb/pull/3201
Differential Revision: D6421644
Pulled By: yiwu-arbug
fbshipit-source-id: 1c2a455c5b73f6842423785eb8a9dbfbb191dc0e
Summary:
Before we were checking every file in the level which was unnecessary. We can piggyback onto the code for checking point-key overlap, which already opens all the files that could possibly contain overlapping range deletions. This PR makes us check just the range deletions from those files, so no extra ones will be opened.
Closes https://github.com/facebook/rocksdb/pull/3179
Differential Revision: D6358125
Pulled By: ajkr
fbshipit-source-id: 00e200770fdb8f3cc6b1b2da232b755e4ba36279
Summary:
Expose read and write options via the C API
Closes https://github.com/facebook/rocksdb/pull/3185
Differential Revision: D6389658
Pulled By: sagar0
fbshipit-source-id: 1848912750329a476805b3cb2f315e7b71f61472
Summary:
Augment WriteWithCallbackTest to also test when seq_per_batch is true.
Closes https://github.com/facebook/rocksdb/pull/3195
Differential Revision: D6398143
Pulled By: maysamyabandeh
fbshipit-source-id: 7bc4218609355ec20fed25df426a8455ec2390d3
Summary:
This diff adds a new ticker stat, NUMBER_ITER_SKIP, to count the
number of internal keys skipped during iteration. Keys can be skipped
due to deletes, or lower sequence number, or higher sequence number
than the one requested.
Also, fix the issue when StatisticsData is naturally aligned on cacheline boundary,
padding becomes a zero size array, which the Windows compiler doesn't
like. So add a cacheline worth of padding in that case to keep it happy.
We cannot conditionally add padding as gcc doesn't allow using sizeof
in preprocessor directives.
Closes https://github.com/facebook/rocksdb/pull/3177
Differential Revision: D6353897
Pulled By: anand1976
fbshipit-source-id: 441d5a09af9c4e22e7355242dfc0c7b27aa0a6c2
Summary:
Allow users to configure the trash-to-DB size ratio limit, so
that ratelimits for deletes can be enforced even when larger portions of
the database are being deleted.
Closes https://github.com/facebook/rocksdb/pull/3158
Differential Revision: D6304897
Pulled By: gdavidsson
fbshipit-source-id: a28dd13059ebab7d4171b953ed91ce383a84d6b3
Summary:
Refactor the logic around WriteCallback in the write path to clarify when and how exactly we advance the sequence number and making sure it is consistent across the code.
Closes https://github.com/facebook/rocksdb/pull/3168
Differential Revision: D6324312
Pulled By: maysamyabandeh
fbshipit-source-id: 9a34f479561fdb2a5d01ef6d37a28908d03bbe33
Summary:
The sequence number was not properly advanced after a rollback marker. The patch extends the existing unit tests to detect the bug and also fixes it.
Closes https://github.com/facebook/rocksdb/pull/3157
Differential Revision: D6304291
Pulled By: maysamyabandeh
fbshipit-source-id: 1b519c44a5371b802da49c9e32bd00087a8da401
Summary:
When testing rebuilding_trx_ in MemTableInserter might still be set before the tests finishes which would cause ASAN alarms for leaks. This patch deletes the pointers in MemTableInserter destructor.
Closes https://github.com/facebook/rocksdb/pull/3162
Differential Revision: D6317113
Pulled By: maysamyabandeh
fbshipit-source-id: a68be70709a4fff7ac2b768660119311968f9c21
Summary:
This patch clarifies and refactors the logic around tracked keys in transactions.
Closes https://github.com/facebook/rocksdb/pull/3140
Differential Revision: D6290258
Pulled By: maysamyabandeh
fbshipit-source-id: 03b50646264cbcc550813c060b180fc7451a55c1
Summary:
Add tests to ensure that WritePrepared and WriteCommitted policies are cross compatible when the db WAL is empty. This is important when the admin want to switch between the policies. In such case, before the switch the admin needs to empty the WAL by i) committing/rollbacking all the pending transactions, ii) FlushMemTables
Closes https://github.com/facebook/rocksdb/pull/3118
Differential Revision: D6227247
Pulled By: maysamyabandeh
fbshipit-source-id: bcde3d92c1e89cda3b9cfa69f6a20af5d8993db7
Summary:
Add per-exe execution capability
Add fix parsing of groups/tests
Add timer test exclusion
Fix unit tests
Ifdef threadpool specific tests that do not pass on Vista threadpool.
Remove spurious outout from prefix_test so test case listing works
properly.
Fix not using standard test directories results in file creation errors
in sst_dump_test.
BlobDb fixes:
In C++ end() iterators can not be dereferenced. They are not valid.
When deleting blob_db_ set it to nullptr before any other code executes.
Not fixed:. On Windows you can not delete a file while it is open.
[ RUN ] BlobDBTest.ReadWhileGC
d:\dev\rocksdb\rocksdb\utilities\blob_db\blob_db_test.cc(75): error: DestroyBlobDB(dbname_, options, bdb_options)
IO error: Failed to delete: d:/mnt/db\testrocksdb-17444/blob_db_test/blob_dir/000001.blob: Permission denied
d:\dev\rocksdb\rocksdb\utilities\blob_db\blob_db_test.cc(75): error: DestroyBlobDB(dbname_, options, bdb_options)
IO error: Failed to delete: d:/mnt/db\testrocksdb-17444/blob_db_test/blob_dir/000001.blob: Permission denied
write_batch
Should not call front() if there is a chance the container is empty
Closes https://github.com/facebook/rocksdb/pull/3152
Differential Revision: D6293274
Pulled By: sagar0
fbshipit-source-id: 318c3717c22087fae13b18715dffb24565dbd956
Summary:
Add a simple policy for NVMe write time life hint
Closes https://github.com/facebook/rocksdb/pull/3095
Differential Revision: D6298030
Pulled By: shligit
fbshipit-source-id: 9a72a42e32e92193af11599eb71f0cf77448e24d
Summary:
The previous compression type selection caused unexpected behavior when the base level was also the bottommost level. The following sequence of events could happen:
- full compaction generates files with `bottommost_compression` type
- now base level is bottommost level since all files are in the same level
- any compaction causes files to be rewritten `compression_per_level` type since bottommost compression didn't apply to base level
I changed the code to make bottommost compression apply to base level.
Closes https://github.com/facebook/rocksdb/pull/3141
Differential Revision: D6264614
Pulled By: ajkr
fbshipit-source-id: d7aaa8675126896684154a1f2c9034d6214fde82
Summary:
While investigating the usage of `new_table_iterator_nanos` perf counter, I saw some code was wrapper around with unnecessary status check ... so removed it.
Closes https://github.com/facebook/rocksdb/pull/3120
Differential Revision: D6229181
Pulled By: sagar0
fbshipit-source-id: f8a44fe67f5a05df94553fdb233b21e54e88cc34
Summary:
Instead of using samples directly, we now support passing the samples through zstd's dictionary generator when `CompressionOptions::zstd_max_train_bytes` is set to nonzero. If set to zero, we will use the samples directly as the dictionary -- same as before.
Note this is the first step of #2987, extracted into a separate PR per reviewer request.
Closes https://github.com/facebook/rocksdb/pull/3057
Differential Revision: D6116891
Pulled By: ajkr
fbshipit-source-id: 70ab13cc4c734fa02e554180eed0618b75255497
Summary:
Previously setting `write_buffer_size` with `SetOptions` would only apply to new memtables. An internal user wanted it to take effect immediately, instead of at an arbitrary future point, to prevent OOM.
This PR makes the memtable's size mutable, and makes `SetOptions()` mutate it. There is one case when we preserve the old behavior, which is when memtable prefix bloom filter is enabled and the user is increasing the memtable's capacity. That's because the prefix bloom filter's size is fixed and wouldn't work as well on a larger memtable.
Closes https://github.com/facebook/rocksdb/pull/3119
Differential Revision: D6228304
Pulled By: ajkr
fbshipit-source-id: e44bd9d10a5f8c9d8c464bf7436070bb3eafdfc9
Summary:
After adding expiration to blob index in #3066, we are now able to add a compaction filter to cleanup expired blob index entries.
Closes https://github.com/facebook/rocksdb/pull/3090
Differential Revision: D6183812
Pulled By: yiwu-arbug
fbshipit-source-id: 9cb03267a9702975290e758c9c176a2c03530b83
Summary:
Blob db will keep blob file if data in the file is visible to an active snapshot. Before this patch it checks whether there is an active snapshot has sequence number greater than the earliest sequence in the file. This is problematic since we take snapshot on every read, if it keep having reads, old blob files will not be cleanup. Change to check if there is an active snapshot falls in the range of [earliest_sequence, obsolete_sequence) where obsolete sequence is
1. if data is relocated to another file by garbage collection, it is the latest sequence at the time garbage collection finish
2. otherwise, it is the latest sequence of the file
Closes https://github.com/facebook/rocksdb/pull/3087
Differential Revision: D6182519
Pulled By: yiwu-arbug
fbshipit-source-id: cdf4c35281f782eb2a9ad6a87b6727bbdff27a45
Summary:
It's also defined in db/dbformat.cc per 7fe3b32896
Closes https://github.com/facebook/rocksdb/pull/3111
Differential Revision: D6219140
Pulled By: ajkr
fbshipit-source-id: 0f2b14e41457334a4665c6b7e3f42f1a060a0f35
Summary:
Implements ValidateSnapshot for WritePrepared txns and also adds a unit test to clarify the contract of this function.
Closes https://github.com/facebook/rocksdb/pull/3101
Differential Revision: D6199405
Pulled By: maysamyabandeh
fbshipit-source-id: ace509934c307ea5d26f4bbac5f836d7c80fd240
Summary:
The motivation for this PR is to add to RocksDB support for differential (incremental) snapshots, as snapshot of the DB changes between two points in time (one can think of it as diff between to sequence numbers, or the diff D which can be thought of as an SST file or just set of KVs that can be applied to sequence number S1 to get the database to the state at sequence number S2).
This feature would be useful for various distributed storages layers built on top of RocksDB, as it should help reduce resources (time and network bandwidth) needed to recover and rebuilt DB instances as replicas in the context of distributed storages.
From the API standpoint that would like client app requesting iterator between (start seqnum) and current DB state, and reading the "diff".
This is a very draft PR for initial review in the discussion on the approach, i'm going to rework some parts and keep updating the PR.
For now, what's done here according to initial discussions:
Preserving deletes:
- We want to be able to optionally preserve recent deletes for some defined period of time, so that if a delete came in recently and might need to be included in the next incremental snapshot it would't get dropped by a compaction. This is done by adding new param to Options (preserve deletes flag) and new variable to DB Impl where we keep track of the sequence number after which we don't want to drop tombstones, even if they are otherwise eligible for deletion.
- I also added a new API call for clients to be able to advance this cutoff seqnum after which we drop deletes; i assume it's more flexible to let clients control this, since otherwise we'd need to keep some kind of timestamp < -- > seqnum mapping inside the DB, which sounds messy and painful to support. Clients could make use of it by periodically calling GetLatestSequenceNumber(), noting the timestamp, doing some calculation and figuring out by how much we need to advance the cutoff seqnum.
- Compaction codepath in compaction_iterator.cc has been modified to avoid dropping tombstones with seqnum > cutoff seqnum.
Iterator changes:
- couple params added to ReadOptions, to optionally allow client to request internal keys instead of user keys (so that client can get the latest value of a key, be it delete marker or a put), as well as min timestamp and min seqnum.
TableCache changes:
- I modified table_cache code to be able to quickly exclude SST files from iterators heep if creation_time on the file is less then iter_start_ts as passed in ReadOptions. That would help a lot in some DB settings (like reading very recent data only or using FIFO compactions), but not so much for universal compaction with more or less long iterator time span.
What's left:
- Still looking at how to best plug that inside DBIter codepath. So far it seems that FindNextUserKeyInternal only parses values as UserKeys, and iter->key() call generally returns user key. Can we add new API to DBIter as internal_key(), and modify this internal method to optionally set saved_key_ to point to the full internal key? I don't need to store actual seqnum there, but I do need to store type.
Closes https://github.com/facebook/rocksdb/pull/2999
Differential Revision: D6175602
Pulled By: mikhail-antonov
fbshipit-source-id: c779a6696ee2d574d86c69cec866a3ae095aa900
Summary:
GetCommitTimeWriteBatch is currently used to store some state as part of commit in 2PC. In MyRocks it is specifically used to store some data that would be needed only during recovery. So it is not need to be stored in memtable right after each commit.
This patch enables an optimization to write the GetCommitTimeWriteBatch only to the WAL. The batch will be written to memtable during recovery when the WAL is replayed. To cover the case when WAL is deleted after memtable flush, the batch is also buffered and written to memtable right before each memtable flush.
Closes https://github.com/facebook/rocksdb/pull/3071
Differential Revision: D6148023
Pulled By: maysamyabandeh
fbshipit-source-id: 2d09bae5565abe2017c0327421010d5c0d55eaa7
Summary:
The assertion was caught by `MySQLStyleTransactionTest/MySQLStyleTransactionTest.TransactionStressTest/5` when run in a loop. The caller doesn't track whether the released snapshot is oldest, so let this function handle that case.
Closes https://github.com/facebook/rocksdb/pull/3080
Differential Revision: D6185257
Pulled By: ajkr
fbshipit-source-id: 4b3015c11db5d31e46521a00af568546ef4558cd