Summary:
Implement trace sampling to allow user to specify the sampling frequency, i.e. save one per how many requests, so that a user does not need to log all if he/she is interested in only a sampled set.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4963
Differential Revision: D14011190
Pulled By: tang-jianfeng
fbshipit-source-id: 078b631d9319b67cb089dd2c30e21d0df8dc406a
Summary:
ValidateSnapshot checks if another txn has committed a value to about-to-be-locked key since a particular snapshot. It applies an optimization of looking into only the memtable if snapshot seq is larger than the earliest seq in the memtables. With a long-running txn in WritePrepared, the prepared value might be flushed out to the disk and yet it commits after the snapshot, which breaks this optimization. The patch fixes that by disabling this optimization when the min_uncomitted seq at the time the snapshot was taken is lower than earliest seq in the memtables.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4961
Differential Revision: D14009947
Pulled By: maysamyabandeh
fbshipit-source-id: 1d11679950326f7c4094b433e6b821b729f08850
Summary:
It would avoid bugs if the reused PinnableSlice is not actually reassigned and yet the programmer makes conclusions based on the size of the Slice.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4962
Differential Revision: D14012710
Pulled By: maysamyabandeh
fbshipit-source-id: 23f4e173386b5461fd5650f44cde470805f4e816
Summary:
We want to reserve some right that some extra information added manifest
in the future can be forward compatible by previous versions. Now we create a
place holder for that. A bit in tag is added to indicate that a field can be
safely ignored.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4960
Differential Revision: D14000484
Pulled By: siying
fbshipit-source-id: cbf5bad3f9d5ec798f789806f244d1c20d3b66d6
Summary:
We found that the behavior of CompactionFilter::IgnoreSnapshots() = false isn't
what we have expected. We thought that snapshot will always be preserved.
However, we just realized that, if no snapshot is created while compaction
starts, and a snapshot is created after that, the data seen from the snapshot
can successfully be dropped by the compaction. This creates a strange behavior
to the feature, which is hard to explain. Like what is documented in code
comment, this feature is not very useful with snapshot anyway. The decision
is to deprecate the feature.
We keep the function to avoid to break users code. However, we will fail
compactions if false is returned.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4954
Differential Revision: D13981900
Pulled By: siying
fbshipit-source-id: 2db8c2c3865acd86a28dca625945d1481b1d1e36
Summary:
Cuckoo Hash is less useful than we initially expected. Remove it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4953
Differential Revision: D13979264
Pulled By: siying
fbshipit-source-id: 2a60afdaa989f045357398b43a1cc5d46f4492ed
Summary:
Commit of delayed prepared has two non-atomic steps: add to commit cache, remove from delayed_prepared_. Similarly in ::IsInSnapshot we read from commit cache first and then look into delayed_prepared_. Due to non-atomicity thus the reader might not find the
prep_seq that is just committed neither in commit cache nor in delayed_prepared_. To fix that i)
we check if there was any delayed prepared BEFORE looking into commit
cache, ii) if there was, we complete the search steps to be these: i)
commit cache, ii) delayed prepared, commit cache again. In this way if
the first query to commit cache missed the commit, the 2nd will catch it. The cost of the redundant read from commit cache is paid only if delayed_prepared_ is nonempty which should be a very rare scenario.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4947
Differential Revision: D13952754
Pulled By: maysamyabandeh
fbshipit-source-id: 8f47826b13f8ce154398d842028342423f4ca2b2
Summary:
4985a9f73b (diff-e5276985b26a0551957144f4420a594bR511)
changes the meaning of latency reporting from running time per query, to elapse_time / #ops, without providing a reason why.
Considering that this is a counter-intuitive reporting, Reverting the change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4949
Differential Revision: D13964684
Pulled By: siying
fbshipit-source-id: d6304d3d4b5a802daa292302623c7dbca9a680bc
Summary:
This will fix the following build error:
> db/db_test.cc: In member function ‘virtual void rocksdb::DBTest_CompactFilesShouldTriggerAutoCompaction_Test::TestBody()’:
> db/db_test.cc:5462:8: error: ‘class rocksdb::DB’ has no member named ‘GetColumnFamilyMetaData’
> db_->GetColumnFamilyMetaData(db_->DefaultColumnFamily(), &cf_meta_data);
> db/db_test.cc:5490:8: error: ‘class rocksdb::DB’ has no member named ‘GetColumnFamilyMetaData’
> db_->GetColumnFamilyMetaData(db_->DefaultColumnFamily(), &cf_meta_data);
> db/db_test.cc:5499:8: error: ‘class rocksdb::DB’ has no member named ‘GetColumnFamilyMetaData’
> db_->GetColumnFamilyMetaData(db_->DefaultColumnFamily(), &cf_meta_data);
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4950
Differential Revision: D13965378
Pulled By: miasantreble
fbshipit-source-id: a975435476fe555b1cd9d5da263ee3da3acdea56
Summary:
Existing implementation of PerfContext does not define copy constructor or assignment operator, which could potentially cause problems when user create copies and resets the builtin one. This PR address the issue by providing these two constructors with deep copy semantics.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4919
Differential Revision: D13960406
Pulled By: miasantreble
fbshipit-source-id: 36aab5aaee65d4480f537e4e22148faa45e8e334
Summary:
CompactFiles() may block auto compaction which could cuase DB hang when it
reachs level0_stop_writes_trigger.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4940
Differential Revision: D13929648
Pulled By: cooldoger
fbshipit-source-id: 10842df38df3bebf862cd1a120a88ce961fdd381
Summary:
In NotFound cases, stats BYTES_READ and perf_context.get_read_bytes is still be increased. The amount increased will be
whatever size of the string or PinnableSlice that users passed in as the output data structure. This is wrong. Fix this by not
increasing these two counters.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4938
Differential Revision: D13908963
Pulled By: siying
fbshipit-source-id: 60bce42e4fbb9862bba3da36dbc27b2963ea6162
Summary:
Fix the ouput overlap bug when using subcompactions, the upper bound of output
file was extended incorrectly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4898
Differential Revision: D13736107
Pulled By: ajkr
fbshipit-source-id: 21dca09f81d5f07bf2766bf566f9b50dcab7d8e3
Summary:
WritePrepared maintains a list of snapshots that are <= max_evicted_seq_. Based on this list, old_commit_map_ is updated if an evicted commit entry overlaps with such snapshot. Such lists are garbage collected when the release of snapshot is reported to WritePreparedTxnDB, which is the next time max_evicted_seq_ is updated and yet the snapshot is not found is the list returned from DB. This logic was broken since ReleaseSnapshotInternal was using "< max_evicted_seq_" to cleanup old_commit_map_, which would leave a snapshot uncleaned if it "= max_evicted_seq_". The patch fixes that and adds a unit test to check for the bug.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4944
Differential Revision: D13945000
Pulled By: maysamyabandeh
fbshipit-source-id: 0c904294f735911f52348a148bf1f945282fc17c
Summary:
The patch fixes the following analyze error by checking the return status of ParseInternalKey.
```
db/merge_helper.cc:306:23: warning: The right operand of '==' is a garbage value
assert(kTypeMerge == orig_ikey.type);
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4937
Differential Revision: D13908506
Pulled By: maysamyabandeh
fbshipit-source-id: 68d7771e75519da3d4bd807fd231675ec12093f6
Summary:
nvme device path doesn't have "block" as like "nvme/nvme0/nvme0n1"
or "nvme/nvme0/nvme0n1/nvme0n1p1". the last directory such as
"nvme0n1p1" should be removed if nvme drive is partitioned.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4866
Differential Revision: D13627824
Pulled By: riversand963
fbshipit-source-id: 09ab968f349f3dbb890beea20193f1359b17d317
Summary:
1. this commit fixes our handling of a combination of two separate edge
cases. If a flush job does not pick any memtable to flush (because another
flush job has already picked the same memtables), and the column family
assigned to the flush job is dropped right before RocksDB calls
rocksdb::InstallMemtableAtomicFlushResults, our original code passes
a FileMetaData object whose file number is 0, failing the assertion in
rocksdb::InstallMemtableAtomicFlushResults (assert(m->GetFileNumber() > 0)).
2. Also piggyback a small change: since we already create a local copy of column family's mutable CF options to eliminate potential race condition with `SetOptions` call, we might as well use the local copy in other function calls in the same scope.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4932
Differential Revision: D13901322
Pulled By: riversand963
fbshipit-source-id: b936580af7c127ea0c6c19ea10cd5fcede9fb0f9
Summary:
The `DeleteRange` end key is exclusive, not inclusive. Updated API comment accordingly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4935
Differential Revision: D13905406
Pulled By: ajkr
fbshipit-source-id: f577db841a279427991ecf9005cd56b30c8eb3c7
Summary:
FlushMemTablesToOutputFiles calls FlushMemTableToOutputFile for each column family. The patch moves the take-snapshot logic to outside FlushMemTableToOutputFile so that it does it once for all the flushes. This also addresses a deadlock issue for resetting the managed snapshot of job_snapshot in the 2nd call to FlushMemTableToOutputFile.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4934
Differential Revision: D13900747
Pulled By: maysamyabandeh
fbshipit-source-id: f3cd650c5fff24cf95c1aaf8a10c149d42bf042c
Summary:
Measure CPU time consumed for a compaction and report it in the stats report
Enable NowCPUNanos() to work for MacOS
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4889
Differential Revision: D13701276
Pulled By: zinoale
fbshipit-source-id: 5024e5bbccd4dd10fd90d947870237f436445055
Summary:
before file ingestion (in preparation phase), verify the checksums of
the blocks of the external SST file, including properties block with global
seqno.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4916
Differential Revision: D13863501
Pulled By: riversand963
fbshipit-source-id: dc54697f970e3807832e2460f7228fcc7efe81ee
Summary:
Blob DB files are not tracked by the SFM, so they currently don't get
deleted in the background. Force them to be deleted in background so
rate limiting can be applied
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4928
Differential Revision: D13854649
Pulled By: anand1976
fbshipit-source-id: 8031ce66842ff0af440c715d886b377983dad7d8
Summary:
We used to call `printf $(t_run)` and later feed the result to GNU parallel in the recipe of target `check_0`. However, this approach is problematic when the length of $(t_run) exceeds the
maximum length of a command and the `printf` command cannot be executed. Instead we use 'find -print' to avoid generating an overly long command.
**This PR is actually the last commit of #4916. Prefer to merge this PR separately.**
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4922
Differential Revision: D13845883
Pulled By: riversand963
fbshipit-source-id: b56de7f7af43337c6ec89b931de843c9667cb679
Summary:
While stepping through the code I noticed that there is a redundant call to TableFileName.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4925
Differential Revision: D13845749
Pulled By: sagar0
fbshipit-source-id: 31db45716b4d720e0e0350dd457b49d6f1848e7d
Summary:
Store_index_in_file is a less useful feature. To simplify the code to maintain, we are dropping the feature.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4914
Differential Revision: D13791883
Pulled By: siying
fbshipit-source-id: d187c5d662584866103e4b77d09dfb925509ae2e
Summary:
In the MixGraph benchmark of db_bench #4788 , the char array is initialized with an argument from user's input, which can cause build error on some platforms. Also, the msg char array size can be potentially smaller than the printed data, which should be extended from 100 to 256.
Tested with make check.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4918
Differential Revision: D13844298
Pulled By: sagar0
fbshipit-source-id: 33c4809c5c4438f0a9f7b289d3f42e20c545bbab
Summary:
- When building with internal dependencies, specify this toolchain by setting `ROCKSDB_FBCODE_BUILD_WITH_PLATFORM007=1`
- It is not enabled by default. However, it is enabled for TSAN builds in CI since there is a known problem with TSAN in gcc-5: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71090
- I did not add support for Lua since (1) we agreed to deprecate it, and (2) we only have an internal build for v5.3 with this toolchain while that has breaking changes compared to our current version (v5.2).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4923
Differential Revision: D13827226
Pulled By: ajkr
fbshipit-source-id: 9aa3388ed3679777cfb15ef8cbcb83c07f62f947
Summary:
https://github.com/facebook/rocksdb/pull/4053 avoids memcopy for Get() results if files are immortable
(read-only DB, max_open_files=-1) and the file is ammaped. The same optimization is being applied to PlainTable
here.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4924
Differential Revision: D13827749
Pulled By: siying
fbshipit-source-id: 1f2cbfc530b40ce08ccd53f95f6e78de4d1c2f96
Summary:
I didn't find where customized hash function is used in DynamicBloom. This can only reduce performance. Remove it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4915
Differential Revision: D13794452
Pulled By: siying
fbshipit-source-id: e38669b11e01444d2d782da11c7decabbd851819
Summary:
Previously compaction was not collapsing operands for a first
key on a layer, even in cases when it was its root of history. Some
tests (CompactionJobTest.NonAssocMerge) was actually accounting
for that bug,
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4909
Differential Revision: D13781169
Pulled By: finik
fbshipit-source-id: d2de353ecf05bec39b942cd8d5b97a8dc445f336
Summary:
- If block cache disabled or not used for meta-blocks, `BlockBasedTableReader::Rep::uncompression_dict` owns the `UncompressionDict`. It is preloaded during `PrefetchIndexAndFilterBlocks`.
- If block cache is enabled and used for meta-blocks, block cache owns the `UncompressionDict`, which holds dictionary and digested dictionary when needed. It is never prefetched though there is a TODO for this in the code. The cache key is simply the compression dictionary block handle.
- New stats for compression dictionary accesses in block cache: "BLOCK_CACHE_COMPRESSION_DICT_*" and "compression_dict_block_read_count"
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4881
Differential Revision: D13663801
Pulled By: ajkr
fbshipit-source-id: bdcc54044e180855cdcc57639b493b0e016c9a3f
Summary:
Even one key falls in a file's range, we can not infer it definitely exists in this file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4902
Differential Revision: D13795018
Pulled By: siying
fbshipit-source-id: 590956f727e9440fcdee55ad9541ace934c64914
Summary:
compaction_pri = kMinOverlappingRatio usually provides much better write amplification than the default.
https://github.com/facebook/rocksdb/pull/4907 fixes one shortcome of this option. Make it default.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4911
Differential Revision: D13789262
Pulled By: siying
fbshipit-source-id: d90acf8c4dede44f00d183ca4c7a210259378269
Summary:
The pointer `get_context` was passed as the value for the boolean argument `index_key_is_full`. Luckily the pointer was always non-null so evaluated to true which is the correct value for the boolean argument. But we were missing out on batch updates to stats since we were not passing anything for the `GetContext*` argument and it defaults to `nullptr`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4913
Differential Revision: D13791449
Pulled By: ajkr
fbshipit-source-id: dbe40bf406c64d34cb5298604145d18b9e0ca9be
Summary:
Fixed clang static analyzer warning about division by 0.
```
ar: creating librocksdb_debug.a
tools/db_bench_tool.cc:4650:43: warning: Division by zero
int pos = static_cast<int>(rand_num % range_);
~~~~~~~~~^~~~~~~~
1 warning generated.
make: *** [analyze] Error 1
```
This is from the new code I recently merged in ce8e88d2d7.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4910
Differential Revision: D13788037
Pulled By: sagar0
fbshipit-source-id: f48851dca85047c19fbb1a361e25ce643aa4c7ea
Summary:
Right now, CompactionPri = kMinOverlappingRatio provides best write amplification, but it doesn't
prioritize files with more tombstones. We combine the two good features: make kMinOverlappingRatio
to boost files with lots of tombstones too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4907
Differential Revision: D13788774
Pulled By: siying
fbshipit-source-id: 1991cbb495fb76c8b529de69896e38d81ed9d9b3
Summary:
False-negative about path not existing. The regex is ignoring the "." in front of a path.
Example: "./path/to/file"
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4682
Differential Revision: D13777110
Pulled By: sagar0
fbshipit-source-id: 9f8173b7581407555fdc055580732aeab37d4ade
Summary:
Right now, deleting blob files is not rate limited, even if SstFileManger is specified.
On the other hand, rate limiting blob deletion is not supported. With this change, Blob file
deletion will go through SstFileManager too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4904
Differential Revision: D13772545
Pulled By: siying
fbshipit-source-id: bd1b1d0beb26d5167385e00b7ecb8b94b879de84
Summary:
Remove unused blob WAL filter so that users are not confused.
I was initially under the impression that we have WAL Filter support in BlobDB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4896
Differential Revision: D13725709
Pulled By: sagar0
fbshipit-source-id: f997d7546e138a474036e88b957907cc714327f1
Summary:
Based on the specific workload models (key access distribution, value size distribution, and iterator scan length distribution, the QPS variation), the MixGraph benchmark generate the synthetic workload according to these distributions which can reflect the real-world workload characteristics.
After user enable the tracing function, they will get the trace file. By analyzing the trace file with the trace_analyzer tool, user can generate a set of statistic data files including. The *_accessed_key_stats.txt, *-accessed_value_size_distribution.txt, *-iterator_length_distribution.txt, and *-qps_stats.txt are mainly used to fit the Matlab model fitting. After that, user can get the parameters of the workload distributions (the modeling details are described: [here](https://github.com/facebook/rocksdb/wiki/RocksDB-Trace%2C-Replay%2C-and-Analyzer))
The key access distribution follows the The two-term power model. The probability density function is: `f(x) = ax^{b}+c`. The corresponding parameters are key_dist_a, key_dist_b, and key_dist_c in db_bench
For the value size distribution and iterator scan length distribution, they both follow the Generalized Pareto Distribution. The probability density function is `f(x) = (1/sigma)(1+k*(x-theta)/sigma))^{-1-1/k)`. The parameters are: value_k, value_theta, value_sigma and iter_k, iter_theta, iter_sigma. For more information about the Generalized Pareto Distribution, users can find the [wiki](https://en.wikipedia.org/wiki/Generalized_Pareto_distribution) and [Matalb page](https://www.mathworks.com/help/stats/generalized-pareto-distribution.html)
As for the QPS, it follows the diurnal pattern. So Sine is a good model to fit it. `F(x) = sine_a*sin(sine_b*x + sine_c) + sine_d`. The trace_will tell you the average QPS in the print out resutls, which is sine_d. After user fit the "*-qps_stats.txt" to the Matlab model, user can get the sine_a, sine_b, and sine_c. By using the 4 parameters, user can control the QPS variation including the period, average, changes.
To use the bench mark, user can indicate the following parameters as examples:
```
-benchmarks="mixgraph" -key_dist_a=0.002312 -key_dist_b=0.3467 -value_k=0.9233 -value_sigma=226.4092 -iter_k=2.517 -iter_sigma=14.236 -mix_get_ratio=0.7 -mix_put_ratio=0.25 -mix_seek_ratio=0.05 -sine_mix_rate_interval_milliseconds=500 -sine_a=15000 -sine_b=1 -sine_d=20000
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4788
Differential Revision: D13573940
Pulled By: sagar0
fbshipit-source-id: e184c27e07b4f1bc0b436c2be36c5090c1fb0222
Summary:
This is essentially a re-submission of #4251 with a few improvements:
- Split `CompressionDict` into two separate classes: `CompressionDict` and `UncompressionDict`
- Eliminated `Init` functions. Instead do all initialization work in constructors.
- Added test case for parallel DB open, which is the scenario where #4251 failed under TSAN.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4849
Differential Revision: D13606039
Pulled By: ajkr
fbshipit-source-id: 08c236059798c710db9cbf545fce0f371232d447
Summary:
Fix how CompactionIterator::findEarliestVisibleSnapshots handles released snapshot. It fixing the two scenarios:
Scenario 1:
key1 has two values v1 and v2. There're two snapshots s1 and s2 taken after v1 and v2 are committed. Right after compaction output v2, s1 is released. Now findEarliestVisibleSnapshot may see s1 being released, and return the next snapshot, which is s2. That's larger than v2's earliest visible snapshot, which was s1.
The fix: the only place we check against last snapshot and current key snapshot is when we decide whether to compact out a value if it is hidden by a later value. In the check if we see current snapshot is even larger than last snapshot, we know last snapshot is released, and we are safe to compact out current key.
Scenario 2:
key1 has two values v1 and v2. there are two snapshots s1 and s2 taken after v1 and v2 are committed. During compaction before we process the key, s1 is released. When compaction process v2, snapshot checker may return kSnapshotReleased, and the earliest visible snapshot for v2 become s2. When compaction process v1, snapshot checker may return kIsInSnapshot (for WritePrepared transaction, it could be because v1 is still in commit cache). The result will become inconsistent here.
The fix: remember the set of released snapshots ever reported by snapshot checker, and ignore them when finding result for findEarliestVisibleSnapshot.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4890
Differential Revision: D13705538
Pulled By: maysamyabandeh
fbshipit-source-id: e577f0d9ee1ff5a6035f26859e56902ecc85a5a4
Summary:
Here is the order of ops in a commit: 1) update commit cache 2) publish seq, 3) RemovePrepared. In case of a delayed prepared, there will be a gap between when the commit is visible to snapshots until delayed_prepared_ is cleaned up. To tell apart this case from a delayed uncommitted txn from, the commit entry of a delayed prepared is also stored in delayed_prepared_commits_, which is updated before publishing the commit.
Also logic in GetSnapshotInternal that ensures that each new snapshot is always larger than max_evicted_seq_ is updated to check against the upcoming value of max_evicted_seq_ rather than its current one. This is because AdvanceMaxEvictedSeq gets the list of snapshots lower than the new max, before updating max_evicted_seq_.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4894
Differential Revision: D13726988
Pulled By: maysamyabandeh
fbshipit-source-id: 1e70d78061b50c944c9816bf4b6dac405ab4ccd3