Summary:
According to 4848bd0c4e/db/log_reader.cc (L355), the original text is misleading when describing the layout of RecyclableLogHeader.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4315
Differential Revision: D9505284
Pulled By: riversand963
fbshipit-source-id: 79994c37a69e7003f03453e7efc0186feeafa609
Summary:
This PR fixes issue 3842. We drop deletion markers iff
1. We are the bottom most level AND
2. All other occurrences of the key are in the same snapshot range as the delete
I've also enhanced db_stress_test to add an option that does a full compare of the keys. This is done by a single thread (thread # 0). For tests I've run (so far)
make check -j64
db_stress
db_stress --acquire_snapshot_one_in=1000 --ops_per_thread=100000 /* to verify that new code doesnt break existing tests */
./db_stress --compare_full_db_state_snapshot=true --acquire_snapshot_one_in=1000 --ops_per_thread=100000 /* to verify new test code */
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4289
Differential Revision: D9491165
Pulled By: shrikanthshankar
fbshipit-source-id: ce144834f31736c189aaca81bed356ba990331e2
Summary:
RocksDB currently queues individual column family for flushing. This is not sufficient to support the needs of some applications that want to enforce order/dependency between column families, given that multiple foreground and background activities can trigger flushing in RocksDB.
This PR aims to address this limitation. Each flush request is described as a `FlushRequest` that can contain multiple column families. A background flushing thread pops one flush request from the queue at a time and processes it.
This PR does not enable atomic_flush yet, but is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3952
Differential Revision: D8529933
Pulled By: riversand963
fbshipit-source-id: 78908a21e389a3a3f7de2a79bae0cd13af5f3539
Summary:
I have a PR to start calling `OnTableFileCreated` for empty SSTs: #4307. However, it is a behavior change so should not go into a patch release.
This PR adds back a check to make sure range deletions at least exist before starting file creation. This PR should be safe to backport to earlier versions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4311
Differential Revision: D9493734
Pulled By: ajkr
fbshipit-source-id: f0d43cda4cfd904f133cfe3a6eb622f52a9ccbe8
Summary:
In RocksDB, for a given SST file, all data blocks are compressed with the same dictionary. When we compress a block using the dictionary's raw bytes, the compression library first has to digest the dictionary to get it into a usable form. This digestion work is redundant and ideally should be done once per file.
ZSTD offers APIs for the caller to create and reuse a digested dictionary object (`ZSTD_CDict`). In this PR, we call `ZSTD_createCDict` once per file to digest the raw bytes. Then we use `ZSTD_compress_usingCDict` to compress each data block using the pre-digested dictionary. Once the file's created `ZSTD_freeCDict` releases the resources held by the digested dictionary.
There are a couple other changes included in this PR:
- Changed the parameter object for (un)compression functions from `CompressionContext`/`UncompressionContext` to `CompressionInfo`/`UncompressionInfo`. This avoids the previous pattern, where `CompressionContext`/`UncompressionContext` had to be mutated before calling a (un)compression function depending on whether dictionary should be used. I felt that mutation was error-prone so eliminated it.
- Added support for digested uncompression dictionaries (`ZSTD_DDict`) as well. However, this PR does not support reusing them across uncompression calls for the same file. That work is deferred to a later PR when we will store the `ZSTD_DDict` objects in block cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4251
Differential Revision: D9257078
Pulled By: ajkr
fbshipit-source-id: 21b8cb6bbdd48e459f1c62343780ab66c0a64438
Summary:
The API comment on `OnTableFileCreationStarted` (b6280d01f9/include/rocksdb/listener.h (L331-L333)) led users to believe a call to `OnTableFileCreationStarted` will always be matched with a call to `OnTableFileCreated`. However, we were skipping the `OnTableFileCreated` call in one case: no error happens but also no file is generated since there's no data.
This PR adds the call to `OnTableFileCreated` for that case. The filename will be "(nil)" and the size will be zero.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4307
Differential Revision: D9485201
Pulled By: ajkr
fbshipit-source-id: 2f077ec7913f128487aae2624c69a50762394df6
Summary:
Add the unit test of Iterator (Seek and SeekForPrev) to trace_analyzer_test. The output files after analyzing the trace file are checked to make sure that analyzing results are correct.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4282
Differential Revision: D9436758
Pulled By: zhichao-cao
fbshipit-source-id: 88d471c9a69e07382d9c6a45eba72773b171e7c2
Summary:
Memtables are selected for flushing by the flush job. Currently we
have listener which is invoked when memtables for a column family are
flushed. That listener does not indicate which memtable was flushed in
the notification. If clients want to know if particular data in the
memtable was retired, there is no straight forward way to know this.
This method will help users who implement memtablerep factory and extend
interface for memtablerep, to know if the data in the memtable was
retired.
Another option that was tried, was to depend on memtable destructor to
be called after flush to mark that data was persisted. This works all
the time but sometimes there can huge delays between actual flush
happening and memtable getting destroyed. Hence, if anyone who is
waiting for data to persist will have to wait that longer.
It is expected that anyone who is implementing this method to have
return quickly as it blocks RocksDB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4304
Reviewed By: riversand963
Differential Revision: D9472312
Pulled By: gdrane
fbshipit-source-id: 8e693308dee749586af3a4c5d4fcf1fa5276ea4d
Summary:
When returning `kNoEntry` from HashIndex lookup, previously we invalidate the
`biter` by set `current_=restarts_`, so that the search can continue to the next
block in case the search result may reside in the next block.
There is one problem: when we are searching for a missing key, if the search
finds a `kNoEntry` and continue the search to the next block, there is also a
non-trivial possibility that the HashIndex return `kNoEntry` too, and the
expensive index iterator `Next()` will happen several times for nothing.
The solution is that if the hash table returns `kNoEntry`, `SeekForGetImpl()` just search the last restart interval for the key. It will stop at the first key that is large than the seek_key, or to the end of the block, and each case will be handled correctly.
Microbenchmark script:
```
TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=fillseq,readtocache,readmissing \
--cache_size=20000000000 --use_data_block_hash_index={true|false}
```
`readmissing` performance (lower is better):
```
binary: 3.6098 micros/op
hash (before applying diff): 4.1048 micros/op
hash (after applying diff): 3.3502 micros/op
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4296
Differential Revision: D9419159
Pulled By: fgwu
fbshipit-source-id: 21e3eedcccbc47a249aa8eb4bf405c9def0b8a05
Summary:
We want to sample the file I/O issued by RocksDB and report the function calls. This requires us to include the file paths otherwise it's hard to tell what has been going on.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4039
Differential Revision: D8670178
Pulled By: riversand963
fbshipit-source-id: 97ee806d1c583a2983e28e213ee764dc6ac28f7a
Summary:
User reported (https://github.com/facebook/rocksdb/issues/4168) that when opening RocksDB in read-only mode, some statistics are not correctly reported. After some investigation, we believe the following counters are indeed not reported during Get() call in a read-only DB:
rocksdb.memtable.hit
rocksdb.memtable.miss
rocksdb.number.keys.read
rocksdb.bytes.read
As well as histogram rocksdb.bytes.per.read
and perf context get_read_bytes
This PR will add the necessary counter reporting logic in the Get() call path
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4260
Differential Revision: D9476431
Pulled By: miasantreble
fbshipit-source-id: 7ab409d4e59df05d09ae8b69fe75554e5aa240d6
Summary:
ZSTD's dynamic library exports `ZDICT_trainFromBuffer` symbol since v1.1.3, and its static library exports it since v0.6.1. We don't know whether linkage is static or dynamic, so just require v1.1.3 to use dictionary trainer.
Fixes the issue reported here: https://jira.mariadb.org/browse/MDEV-16525.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4295
Differential Revision: D9417183
Pulled By: ajkr
fbshipit-source-id: 0e89d2f48d9e7f6eee73e7f4572660a9f7122db8
Summary:
Previously, the trace_analyzer_tool will be complied with other libobjects, which let the GFLAGS of trace_analyzer appear in other tools (e.g., db_bench, rocksdb_dump, and etc.). When using '--help', the help information of trace_analyzer will appear in other tool help information, which will cause confusion issues.
Currently, trace_analyzer_tool is built and used only by trace_analyzer and trace_analyzer_test to avoid the issues.
Tested with make asan_check.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4290
Differential Revision: D9413163
Pulled By: zhichao-cao
fbshipit-source-id: ed5d20c4575a53ca15ff62a2ffe601d5cf278cc4
Summary:
`BlockBasedTableBuilder::Add()` eventually calls
`DataBlockHashIndexBuilder::EstimateSize()`. The previous implementation
divides the `num_keys` by the `util_ratio_` to get an estimizted `num_buckets`.
Such division is expensive as it happens in every
`BlockBasedTableBuilder::Add()`.
This diff estimates the `num_buckets` by double addition instead of double
division. Specifically, in each `Add()`, we add `bucket_per_key_`
(inverse of `util_ratio_`) to the current `estimiated_num_buckets_`. The cost is
that we are gonna have the `estimated_num_buckets_` stored as one extra field
in the DataBlockHashIndexBuilder.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4293
Differential Revision: D9412472
Pulled By: fgwu
fbshipit-source-id: 2925c8509a401e7bd3c1ab1d9e9c7244755c277a
Summary:
Fix expired file not being evicted from the DB. We have a background task (previously called `CheckSeqFiles` and I rename it to `EvictExpiredFiles`) to scan and remove expired files, but it only close the files, not marking them as expired.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4294
Differential Revision: D9415984
Pulled By: yiwu-arbug
fbshipit-source-id: eff7bf0331c52a7ccdb02318602bff7f64f3ef3d
Summary:
Clang analyze is not happy in two pieces of code, with "Potential memory leak". No idea what the problem but slightly changing the code makes clang happy.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4292
Differential Revision: D9413555
Pulled By: siying
fbshipit-source-id: 9428c9d3664530c72129feefd135ee63d8386137
Summary:
Suppress two CLANG analyze warnings. They don't seem to be real bugs
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4291
Differential Revision: D9407333
Pulled By: siying
fbshipit-source-id: 2ed63d88fa0b217fdccb1572d7508467c2203dc8
Summary:
During recovery, RocksDB is able to handle version edits that belong to group commits.
This PR is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3945
Differential Revision: D8529122
Pulled By: riversand963
fbshipit-source-id: 57cb0f9cc55ecca684a837742d6626dc9c07f37e
Summary:
Transaction has had methods to deal with SavePoints already, but
was missing the PopSavePoint method provided by WriteBatch and
WriteBatchWithIndex.
This PR adds PopSavePoint to Transaction as well. Having the method
on Transaction-level too is useful for applications that repeatedly
execute a sequence of operations that normally succeed, but infrequently
need to get rolled back. Using SavePoints here is sensible, but as
operations normally succeed the application may pile up a lot of
useless SavePoints inside a Transaction, leading to slightly increased
memory usage for managing the unneeded SavePoints.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4256
Differential Revision: D9326932
Pulled By: yiwu-arbug
fbshipit-source-id: 53a0af18a6c7e87feff8a56f1f3eab9df7f371d6
Summary:
Closes https://github.com/facebook/rocksdb/issues/4195
CompactRangeOptions are available the CPP API, but not in the Java API. This PR adds CompactRangeOptions to the Java API and adds an overloaded compactRange() method. See https://github.com/facebook/rocksdb/issues/4195 for the original discussion.
This change supports all fields of CompactRangeOptions, including the required enum converters in the JNI portal.
Significant changes:
- Make CompactRangeOptions available in the compactRange() for Java.
- Deprecate other compactRange() methods that have individual option params, like in the CPP code.
- Migrate rocksdb_compactrange_helper() to CompactRangeOptions.
- Add Java unit tests for CompactRangeOptions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4220
Differential Revision: D9380007
Pulled By: sagar0
fbshipit-source-id: 6af6c334f221427f1997b33fb24c3986b092fed6
Summary:
Followup for #4266. There is one more place in **get_context.cc** where **MergeOperator::ShouldMerge** should be called with reversed list of operands.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4284
Differential Revision: D9380008
Pulled By: sagar0
fbshipit-source-id: 70ec26e607e5b88465e1acbdcd6c6171bd76b9f2
Summary:
Add `--data_block_index_type` and `--data_block_hash_table_util_ratio` option to `db_bench`.
`--data_block_index_type` can be either of `binary` (default) or `binary_and_hash`;
`--data_block_hash_table_util_ratio` will be a double. The default value is `0.75`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4281
Differential Revision: D9361476
Pulled By: fgwu
fbshipit-source-id: dc53e01acef9db81b9eec5e8a96f3bc8ed718c10
Summary:
Right now, `ldb idump` may have memory out of control if there is a big range of tombstones. Add an option to cut maxinum number of keys in GetAllKeyVersions(), and push down --max_num_ikeys from ldb.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4271
Differential Revision: D9369149
Pulled By: siying
fbshipit-source-id: 7cbb797b7d2fa16573495a7e84937456d3ff25bf
Summary:
This PR addresses issue #3865 and implements the following approach to fix it:
- adds `MergeContext::GetOperandsDirectionForward` and `MergeContext::GetOperandsDirectionBackward` to query merge operands in a specific order
- `MergeContext::GetOperands` becomes a shortcut for `MergeContext::GetOperandsDirectionForward`
- pass `MergeContext::GetOperandsDirectionBackward` to `MergeOperator::ShouldMerge` and document the order
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4266
Differential Revision: D9360750
Pulled By: sagar0
fbshipit-source-id: 20cb73ff017760b062ecdcf4382560767086e092
Summary:
Add hash index support to data blocks, which helps to reduce the CPU utilization of point-lookup operations. This feature is backward compatible with the data block created without the hash index. It is disabled by default unless `BlockBasedTableOptions::data_block_index_type` is set to `data_block_index_type = kDataBlockBinaryAndHash.`
The DB size would be bigger with the hash index option as a hash table is added at the end of each data block. If the hash utilization ratio is 1:1, the space overhead is one byte per key. The hash table utilization ratio is adjustable using `BlockBasedTableOptions::data_block_hash_table_util_ratio`. A lower utilization ratio will improve more on the point-lookup efficiency, but take more space too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4174
Differential Revision: D8965914
Pulled By: fgwu
fbshipit-source-id: 1c6bae5d1fc39c80282d8890a72e9e67bc247198
Summary:
The wrong options are used in the trace_analyzer_test, removed. The potential loses integer precision are fixed.
Pass the specified testing case, make asan_check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4274
Reviewed By: yiwu-arbug
Differential Revision: D9327811
Pulled By: zhichao-cao
fbshipit-source-id: d62cb18d6586503a490cd323bfc1c672b68b346e
Summary:
In the OnTableFileCreation() listener, assert on various TableProperties
only when file size > 0 bytes. The listener can get called even for 0
byte SSTs which have been deleted.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4273
Differential Revision: D9322738
Pulled By: anand1976
fbshipit-source-id: 17cdfb3d0da946b9a158d7328e5db1c87973956b
Summary:
Add a unit test to check that iterators release data blocks after it has moved away from it. Verify the same for compaction input iterators.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4170
Differential Revision: D8962513
Pulled By: siying
fbshipit-source-id: 05a5b604d7d29887fb488f2cda7286f554a14407
Summary:
A framework of trace analyzing for RocksDB
After collecting the trace by using the tool of [PR #3837](https://github.com/facebook/rocksdb/pull/3837). User can use the Trace Analyzer to interpret, analyze, and characterize the collected workload.
**Input:**
1. trace file
2. Whole keys space file
**Statistics:**
1. Access count of each operation (Get, Put, Delete, SingleDelete, DeleteRange, Merge) in each column family.
2. Key hotness (access count) of each one
3. Key space separation based on given prefix
4. Key size distribution
5. Value size distribution if appliable
6. Top K accessed keys
7. QPS statistics including the average QPS and peak QPS
8. Top K accessed prefix
9. The query correlation analyzing, output the number of X after Y and the corresponding average time
intervals
**Output:**
1. key access heat map (either in the accessed key space or whole key space)
2. trace sequence file (interpret the raw trace file to line base text file for future use)
3. Time serial (The key space ID and its access time)
4. Key access count distritbution
5. Key size distribution
6. Value size distribution (in each intervals)
7. whole key space separation by the prefix
8. Accessed key space separation by the prefix
9. QPS of each operation and each column family
10. Top K QPS and their accessed prefix range
**Test:**
1. Added the unit test of analyzing Get, Put, Delete, SingleDelete, DeleteRange, Merge
2. Generated the trace and analyze the trace
**Implemented but not tested (due to the limitation of trace_replay):**
1. Analyzing Iterator, supporting Seek() and SeekForPrev() analyzing
2. Analyzing the number of Key found by Get
**Future Work:**
1. Support execution time analyzing of each requests
2. Support cache hit situation and block read situation of Get
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4091
Differential Revision: D9256157
Pulled By: zhichao-cao
fbshipit-source-id: f0ceacb7eedbc43a3eee6e85b76087d7832a8fe6
Summary:
Due to 4ea56b1bd0, we should also remove the
assersion in stress test. This removal can be temporary, and we can add it back
once we figure out the reason for the 0-byte SSTs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4268
Differential Revision: D9297186
Pulled By: riversand963
fbshipit-source-id: cebba9a68f42e815f8cf24471176d2cfdf962f63
Summary:
Revert this change. Not generating the OnTableFileCreated() notification for a 0 byte SST on flush breaks the assumption that every OnTableFileCreationStarted() notification is followed by a corresponding OnTableFileCreated().
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4263
Differential Revision: D9285623
Pulled By: anand1976
fbshipit-source-id: 808c3dcd498b4b4f4ed4be947a29a24b2296aa8d
Summary:
In the current trace_and replay, Get an WriteBatch are traced. This pull request track down the Seek() and SeekForPrev() to the trace file. <target_key, timestamp, column_family_id> are write to the file.
Replay of Iterator is not supported in the current implementation.
Tested with trace_analyzer.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4228
Differential Revision: D9201381
Pulled By: zhichao-cao
fbshipit-source-id: 6f9cc9cb3c20260af741bee065ec35c5c96354ab
Summary:
The pair of ROCKSDB_LITE condition inclusion is redundant, it is already inside the #ifndef ROCKSDB_LITE. Remove them to void confusion.
Tested by make asan_check.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4254
Differential Revision: D9281652
Pulled By: zhichao-cao
fbshipit-source-id: 06bf7641ede71391f21f6a3fe37fbd13f0e2a43a
Summary:
After refactoring in https://github.com/facebook/rocksdb/pull/4158 the properties block is written after the index block. This breaks the existing logic in estimating the index size in partitioned indexes. The patch fixes that by using the accurate index block size, which is available since by the time we write the properties block, the index block is already written.
The patch also fixes an issue in estimating the partition size with format_version=3 which was resulting into partitions smaller than the configured metadata_block_size.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4259
Differential Revision: D9274454
Pulled By: maysamyabandeh
fbshipit-source-id: c82d045505cca3e7ed1a44ee1eaa26e4f25a4272
Summary:
TSAN fails due to comparison between signed int and unsigned long. Fix it by
static_casting.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4250
Differential Revision: D9256535
Pulled By: riversand963
fbshipit-source-id: c6bad23ff70c6d0ec58e2e85c401ce0ad45de609
Summary:
Given that index value is a BlockHandle, which is basically an <offset, size> pair we can apply delta encoding on the values. The first value at each index restart interval encoded the full BlockHandle but the rest encode only the size. Refer to IndexBlockIter::DecodeCurrentValue for the detail of the encoding. This reduces the index size which helps using the block cache more efficiently. The feature is enabled with using format_version 4.
The feature comes with a bit of cpu overhead which should be paid back by the higher cache hits due to smaller index block size.
Results with sysbench read-only using 4k blocks and using 16 index restart interval:
Format 2:
19585 rocksdb read-only range=100
Format 3:
19569 rocksdb read-only range=100
Format 4:
19352 rocksdb read-only range=100
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3983
Differential Revision: D8361343
Pulled By: maysamyabandeh
fbshipit-source-id: f882ee082322acac32b0072e2bdbb0b5f854e651
Summary:
I noticed we were building against zstd 1.3.0 which is missing optimizations that our fbcode customers have (they're on zstd 1.3.5). Ran `./build_tools/update_dependencies.sh` to catch us up. Omitted the changes it made for gcc-4.8 since it's broken.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4244
Differential Revision: D9230111
Pulled By: ajkr
fbshipit-source-id: 3e8ec1d8a961f98ec77c8c6580bde4caacf2d437
Summary:
The error_filter.py script parses the output of the "Build and run" stage of continuous tests to check for errors. It is currently only detecting compile errors and not link errors. This change fixes that.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4247
Differential Revision: D9233735
Pulled By: anand1976
fbshipit-source-id: 16e5a04950891cd9aba5cb3efcb6abc2a2e0d5ae
Summary:
This change allows using std::string_view objects directly in the
DB API:
db->Get(some_string_view_object, ...);
The conversion from std::string_view to rocksdb::Slice is done
automatically, thanks to the added constructor.
I'm stopping short of adding an implicit conversion operator
from rocksdb::Slice to std::string_view, as I don't think that's
a good idea for PinnableSlices.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4242
Differential Revision: D9224134
Pulled By: anand1976
fbshipit-source-id: f50aad04dd0b01737907c0fb88d495c83a81f4e4
Summary:
Although delete scheduler implementation allows for the interface not to be supported, the delete_scheduler_test does not allow for that.
Address compiler warnings
Make sst_dump_test use test directory structure as the current execution directory may not be writiable.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4221
Differential Revision: D9210152
Pulled By: siying
fbshipit-source-id: 381a74511e969ecb8089d5c4b4df87dc30c8df63