Summary:
The CollapsedRangeDelMap was entirely mishandling tombstones at the same
sequence number when the tombstones did not have identical start and end
keys. Such tombstones are common since 90fc40690, which causes
tombstones to be split during compactions.
For example, if the tombstone [a, c) @ 1 lies across a compaction
boundary at b, it will be split into [a, b) @ 1 and [b, c) @ 1. Without
this patch, the collapsed range deletion map would look like this:
a -> 1
b -> 1
c -> 0
Notice how the b -> 1 entry is redundant. When the tombstones overlap,
the problem is even worse. Consider tombstones [a, c) @ 1 and [b, d) @
1, which produces this map without this patch:
a -> 1
b -> 1
c -> 0
d -> 0
This map is corrupt, as a map can never contain adjacent sentinel (zero)
entries. When the iterator advances from b to c, it will notice that c
is a sentinel enty and skip to d--but d is also a sentinel entry! Asking
what tombstone this iterator points to will trigger an assertion, as it
is not pointing to a valid tombstone.
/cc ajkr
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4424
Differential Revision: D10039248
Pulled By: abhimadan
fbshipit-source-id: 6d737c1e88d60e80cf27286726627ba44463e7f4
Summary:
Improve time measurements for AddTombstones to only include the
call and not the VectorIterator setup. Also add a new
add_tombstones_per_run flag to call AddTombstones multiple times per
aggregator, which will help simulate more realistic workloads.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4395
Differential Revision: D9996811
Pulled By: abhimadan
fbshipit-source-id: 5865a95c323fbd9b3606493013664b4890fe5a02
Summary:
Fix IO error on read not being handle and crashing the DB. With the fix we properly return the error.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4410
Differential Revision: D9979246
Pulled By: yiwu-arbug
fbshipit-source-id: 111a85675067a29c03cb60e9a34103f4ff636694
Summary:
Make the CompactOnDeletionCollectorFactory class public, and provide
methods to update the window size and deletion trigger params. These
will take effect on subsequent created SST files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4403
Differential Revision: D9976857
Pulled By: anand1976
fbshipit-source-id: 31dbf0511c12fa2bb9b2a7ba620079e0ee09cf48
Summary:
If range tombstones are generated every few writes, the
KeyGenerator's limit is now extended to account for the additional
Next() calls. This is primarily important for `filluniquerandom`
benchmarks that enforce the call limit.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4404
Differential Revision: D9949326
Pulled By: abhimadan
fbshipit-source-id: 0bdfeb2cad2098dc0b8b029236dab5e4bef25e38
Summary:
The default for index_block_restart_interval is 1 but some use 16 in production. The patch extends crash test to test both values.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4383
Differential Revision: D9887304
Pulled By: maysamyabandeh
fbshipit-source-id: a8d00fea974a79ad563f9f4d9d7b069e9f746a8f
Summary:
Per #4387 this should address the validation error with the link tag. This is a quick fix, a future iteration could significantly upgrade the jekyll integration.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4392
Differential Revision: D9923643
Pulled By: gfosco
fbshipit-source-id: e7ed478e55c907add8319290326540e6e44fc0d6
Summary:
Add a unit test for range collapsing when non-default comparator is used. This exposes the bug fixed in #4386.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4388
Differential Revision: D9918252
Pulled By: ajkr
fbshipit-source-id: 99501b96b251eab41791a7e33b27055ee36c5c39
Summary:
The Comparator passed to CollapsedRangeDelMap was not used for
operator less of the std::map `rep_` object contained in
CollapsedRangeDelMap. So the map was always sorted using the
default ByteWiseComparator, which seems wrong.
Passing the specified Comparator through for usage in that map
object fixes actual problems we were seeing with RangeDelete operations
that do not delete keys as expected when using a custom Comparator.
I found that the tests in current master crash when I run them locally,
both with and without my patch, at the very same location. I therefore
don't know if the patch breaks something else, but it seems to fix
RangeDeletion issues in our product that uses RocksDB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4386
Differential Revision: D9916506
Pulled By: ajkr
fbshipit-source-id: 27bff8c775831f089dde8c5289df7343d88b2d66
Summary:
Value delta encoding in format_version 4 requires the differences between the size of two consecutive handles to be sent to BlockBuilder::Add. This applies not only to indexes on blocks but also the indexes on indexes and filters in partitioned indexes and filters respectively. The patch fixes a bug where the partitioned filters would encode the entire size of the handle rather than the difference of the size with the last size.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4381
Differential Revision: D9879505
Pulled By: maysamyabandeh
fbshipit-source-id: 27a22e49b482b927fbd5629dc310c46d63d4b6d1
Summary:
To measure the results of upcoming DeleteRange v2 work, this commit adds
simple benchmarks for RangeDelAggregator. It measures the average time
for AddTombstones and ShouldDelete calls.
Using this to compare the results before #4014 and on the latest master (using the default arguments) produces the following results:
Before #4014:
```
=======================
Results:
=======================
AddTombstones: 1356.28 us
ShouldDelete: 0.401732 us
```
Latest master:
```
=======================
Results:
=======================
AddTombstones: 740.82 us
ShouldDelete: 0.383271 us
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4363
Differential Revision: D9881676
Pulled By: abhimadan
fbshipit-source-id: 793e7d61aa4b9d47eb917bbcc03f08695b5e5442
Summary:
1. Add override keyword to overridden virtual functions in EventListener
2. Fix a memory corruption that can happen during DB shutdown when in
read-only mode due to a background write error
3. Fix uninitialized buffers in error_handler_test.cc that cause
valgrind to complain
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4375
Differential Revision: D9875779
Pulled By: anand1976
fbshipit-source-id: 022ede1edc01a9f7e21ecf4c61ef7d46545d0640
Summary:
This is a follow up to #4370. The earlier comment is not correct.
Thanks to ajkr for pointing this out.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4380
Differential Revision: D9874667
Pulled By: sagar0
fbshipit-source-id: f4e092d86b29c715258210b770643d367e38caae
Summary:
Including tools/trace_analyzer_tool.cc in rocksdb_lib was causing conflicts in dependent binaries due to duplicate gflag (other_prefix).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4371
Differential Revision: D9846953
Pulled By: anand1976
fbshipit-source-id: 80b4aa36ab8428b8f6dceb896c45532684102709
Summary:
This commit implements automatic recovery from a Status::NoSpace() error
during background operations such as write callback, flush and
compaction. The broad design is as follows -
1. Compaction errors are treated as soft errors and don't put the
database in read-only mode. A compaction is delayed until enough free
disk space is available to accomodate the compaction outputs, which is
estimated based on the input size. This means that users can continue to
write, and we rely on the WriteController to delay or stop writes if the
compaction debt becomes too high due to persistent low disk space
condition
2. Errors during write callback and flush are treated as hard errors,
i.e the database is put in read-only mode and goes back to read-write
only fater certain recovery actions are taken.
3. Both types of recovery rely on the SstFileManagerImpl to poll for
sufficient disk space. We assume that there is a 1-1 mapping between an
SFM and the underlying OS storage container. For cases where multiple
DBs are hosted on a single storage container, the user is expected to
allocate a single SFM instance and use the same one for all the DBs. If
no SFM is specified by the user, DBImpl::Open() will allocate one, but
this will be one per DB and each DB will recover independently. The
recovery implemented by SFM is as follows -
a) On the first occurance of an out of space error during compaction,
subsequent
compactions will be delayed until the disk free space check indicates
enough available space. The required space is computed as the sum of
input sizes.
b) The free space check requirement will be removed once the amount of
free space is greater than the size reserved by in progress
compactions when the first error occured
c) If the out of space error is a hard error, a background thread in
SFM will poll for sufficient headroom before triggering the recovery
of the database and putting it in write-only mode. The headroom is
calculated as the sum of the write_buffer_size of all the DB instances
associated with the SFM
4. EventListener callbacks will be called at the start and completion of
automatic recovery. Users can disable the auto recov ery in the start
callback, and later initiate it manually by calling DB::Resume()
Todo:
1. More extensive testing
2. Add disk full condition to db_stress (follow-on PR)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164
Differential Revision: D9846378
Pulled By: anand1976
fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a
Summary:
Because `base_files` and `added_files` both are sorted, using a merge
operation to these two sorted arrays is more effective. The complexity
is reduced to linear time.
- optmize the merge complexity.
- move the `NDEBUG` of sorted `added_files` out of merge process.
Signed-off-by: JiYou <jiyou09@gmail.com>
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4366
Differential Revision: D9833592
Pulled By: ajkr
fbshipit-source-id: dd32b67ebdca4c20e5e9546ab8082cecefe99fd0
Summary:
The code is dead in RocksDB as `log::Reader::initial_offset_` is always zero. We should delete it so we don't have to maintain it like in #4359.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4362
Differential Revision: D9817829
Pulled By: ajkr
fbshipit-source-id: 474a2c679e5bd273b40608f3a5332931d9eefe6d
Summary:
Please consider this small PR providing access to the `MemoryUsage::GetApproximateMemoryUsageByType` function in plain C API. Actually I'm working on Go application and now trying to investigate the reasons of high memory consumption (#4313). Go [wrappers](https://github.com/tecbot/gorocksdb) are built on the top of Rocksdb C API. According to the #706, `MemoryUsage::GetApproximateMemoryUsageByType` is considered as the best option to get database internal memory usage stats, but it wasn't supported in C API yet.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4340
Differential Revision: D9655135
Pulled By: ajkr
fbshipit-source-id: a3d2f3f47c143ae75862fbcca2f571ea1b49e14a
Summary:
With #3983 the size of IndexBlockIter was increased. This had resulted in a regression on P50 latencies in one of our benchmarks. The patch reduces IndexBlockIter size be eliminating active_comparator_ field from the class.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4358
Differential Revision: D9781737
Pulled By: maysamyabandeh
fbshipit-source-id: 71e2b28d90ff0813db9e04b737ae73e185583c52
Summary:
Before the fix:
On a PowerPC machine, run the following
```
$ make jtest
```
The command will fail due to "undefined symbol: crc32c_ppc". It was caused by
'rocksdbjava' Makefile target not including crc32c_ppc object files when
generating the shared lib. The fix is simple.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4357
Differential Revision: D9779474
Pulled By: riversand963
fbshipit-source-id: 3c5ec9068c2b9c796e6500f71cd900267064fd51
Summary:
`RangeDelAggregator::AddTombstones` contained an assertion which stated that, if a range tombstone extended past the largest key in the sstable, then `FileMetaData::largest` must have a sentinel sequence number of `kMaxSequenceNumber`, which implies that the tombstone's end key is safe to truncate. However, `largest` will not be a sentinel key when the next sstable in the level's smallest key is equal to the current sstable's largest key, which caused the assertion to fail.
The assertion must hold for the truncation to be safe, so it has been moved to an additional check on end-key truncation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4356
Differential Revision: D9760891
Pulled By: abhimadan
fbshipit-source-id: 7c20c3885cd919dcd14f291f88fd27aa33defebc
Summary:
TransactionOptions::skip_concurrency_control allows pessimistic transactions to skip the overhead of concurrency control. This could be as an optimization if the application knows that the transaction would not have any conflict with concurrent transactions. It is currently used during recovery assuming (i) application guarantees no conflict between prepared transactions in the WAL (ii) application guarantees that recovered transactions will be rolled back/commit before new transactions start.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4346
Differential Revision: D9759149
Pulled By: maysamyabandeh
fbshipit-source-id: f896e84fa58b0b584be904c7fd3883a41ea3215b
Summary:
In C++ 11, the order of argument and move evaluation in a statement such
as below is unspecified -
foo(a.b).bar(std::move(a))
The compiler is free to evaluate std::move(a) first, and then a.b is unspecified.
In C++ 17, this will be safe if a draft proposal around function
chaining rules is accepted.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4348
Differential Revision: D9688810
Pulled By: anand1976
fbshipit-source-id: e4651d0ca03dcf007e50371a0fc72c0d1e710fb4
Summary:
Reverting is needed to unblock a user building against master, who is blocked for multiple days due to a thread-safety issue in `GetEmptyDict`. We haven't been able to fix it quickly, so reverting.
Simply ran `git revert 6c40806e51a89386d2b066fddf73d3fd03a36f65`. There were no merge conflicts.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4347
Differential Revision: D9668365
Pulled By: ajkr
fbshipit-source-id: 0c56334f0a23cf5ee0233d4e4679eae6709739cd
Summary:
As you know, almost all compilers support "pragma once" keyword instead of using include guards. To be keep consistency between header files, all header files are edited.
Besides this, try to fix some warnings about loss of data.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4339
Differential Revision: D9654990
Pulled By: ajkr
fbshipit-source-id: c2cf3d2d03a599847684bed81378c401920ca848
Summary:
This is a followup to #4311. Checking `!RangeDelAggregator::IsEmpty()` before opening a dedicated range tombstone SST did not properly prevent empty SSTs from being generated. That's because it relies on `CollapsedRangeDelMap::Size`, which had an underflow bug when the map was empty. This PR fixes that underflow bug.
Also fixed an uninitialized variable in db_stress.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4336
Differential Revision: D9600080
Pulled By: ajkr
fbshipit-source-id: bc6980ca79d2cd01b825ebc9dbccd51c1a70cfc7
Summary:
`GetLiveFiles` and `GetLiveFilesMetadata` should return path relative to db path.
It is a separate issue when `path_relative` is false how can we return relative path. But `DBImpl::GetLiveFiles` don't handle it as well when there are multiple `db_paths`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4326
Differential Revision: D9545904
Pulled By: yiwu-arbug
fbshipit-source-id: 6762d879fcb561df2b612e6fdfb4a6b51db03f5d
Summary:
Currently unity-test is failing because both trace_replay.cc and trace_analyzer_tool.cc defined `DecodeCFAndKey` under anonymous namespace. It is supposed to be fine except unity test will dump all source files together and now we have a conflict.
Another issue with trace_analyzer_tool.cc is that it is using some utility functions from ldb_cmd which is not included in Makefile for unity_test, I chose to update TESTHARNESS to include LIBOBJECTS. Feel free to comment if there is a less intrusive way to solve this.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4323
Differential Revision: D9599170
Pulled By: miasantreble
fbshipit-source-id: 38765b11f8e7de92b43c63bdcf43ea914abdc029
Summary:
trace_analyzer_tool should only be in ANALYZER_LIB_SOURCES and not in LIB_SOURCES.
This fixes java_test travis build failures seen in jtest.
Blame: a6d3de4e7a
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4331
Differential Revision: D9560377
Pulled By: sagar0
fbshipit-source-id: 6b9636201a920b56ee0f61e367fee5d3dca692b0
Summary:
In our application we spawn helper child processes concurrently with
opening rocksdb. In one situation I observed that the child process had inherited
the rocksdb lock file as well as directory handles to the rocksdb storage location.
The code in env_posix takes care to set CLOEXEC but doesn't use `O_CLOEXEC` at the
time that the files are opened which means that there is a window of opportunity
to leak the descriptors across a fork/exec boundary.
This diff introduces a helper that can conditionally set the `O_CLOEXEC` bit for
the open call using the same logic as that in the existing helper for setting
that flag post-open.
I've preserved the post-open logic for systems that don't have `O_CLOEXEC`.
I've introduced setting `O_CLOEXEC` for what appears to be a number of temporary
or transient files and directory handles; I suspect that none of the files
opened by Rocks are intended to be inherited by a forked child process.
In one case, `fopen` is used to open a file. I've added the use of the glibc-specific `e`
mode to turn on `O_CLOEXEC` for this case. While this doesn't cover all posix systems,
it is an improvement for our common deployment system.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4328
Reviewed By: ajkr
Differential Revision: D9553046
Pulled By: wez
fbshipit-source-id: acdb89f7a85ca649b22fe3c3bd76f82142bec2bf
Summary:
Basically at the moment it seems it's possible to cause write stall by calling flush (either manually vis DB::Flush(), or from Backup Engine directly calling FlushMemTable() while background flush may be already happening.
One of the ways to fix it is that in DBImpl::CompactRange() we already check for possible stall and delay flush if needed before we actually proceed to call FlushMemTable(). We can simply move this delay logic to separate method and call it from FlushMemTable.
This is draft patch, for first look; need to check tests/update SyncPoints and most certainly would need to add allow_write_stall method to FlushOptions().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4297
Differential Revision: D9420705
Pulled By: mikhail-antonov
fbshipit-source-id: f81d206b55e1d7b39e4dc64242fdfbceeea03fcc
Summary: There were a few files that were missed when AutoHeaders were moved to their own file. Add explicit loads
Reviewed By: yfeldblum
Differential Revision: D9499942
fbshipit-source-id: 942bf3a683b8961e1b6244136f6337477dcc45af
Summary: For the CURRENT file forged during checkpoint, we were forgetting to `fsync` or `fdatasync` it after its creation. This PR fixes it.
Differential Revision: D9525939
Pulled By: ajkr
fbshipit-source-id: a505483644026ee3f501cfc0dcbe74832165b2e3