Summary:
This patch addresses a couple of minor TODOs for WritePrepared Txn such as double checking some assert statements at runtime as well, skip extra AddPrepared in non-2pc transactions, and safety check for infinite loops.
Closes https://github.com/facebook/rocksdb/pull/3302
Differential Revision: D6617002
Pulled By: maysamyabandeh
fbshipit-source-id: ef6673c139cb49f64c0879508d2f573b78609aca
Summary:
**# Summary**
RocksDB uses SSE crc32 intrinsics to calculate the crc32 values but it does it in single way fashion (not pipelined on single CPU core). Intel's whitepaper () published an algorithm that uses 3-way pipelining for the crc32 intrinsics, then use pclmulqdq intrinsic to combine the values. Because pclmulqdq has overhead on its own, this algorithm will show perf gains on buffers larger than 216 bytes, which makes RocksDB a perfect user, since most of the buffers RocksDB call crc32c on is over 4KB. Initial db_bench show tremendous CPU gain.
This change uses the 3-way SSE algorithm by default. The old SSE algorithm is now behind a compiler tag NO_THREEWAY_CRC32C. If user compiles the code with NO_THREEWAY_CRC32C=1 then the old SSE Crc32c algorithm would be used. If the server does not have SSE4.2 at the run time the slow way (Non SSE) will be used.
**# Performance Test Results**
We ran the FillRandom and ReadRandom benchmarks in db_bench. ReadRandom is the point of interest here since it calculates the CRC32 for the in-mem buffers. We did 3 runs for each algorithm.
Before this change the CRC32 value computation takes about 11.5% of total CPU cost, and with the new 3-way algorithm it reduced to around 4.5%. The overall throughput also improved from 25.53MB/s to 27.63MB/s.
1) ReadRandom in db_bench overall metrics
PER RUN
Algorithm | run | micros/op | ops/sec |Throughput (MB/s)
3-way | 1 | 4.143 | 241387 | 26.7
3-way | 2 | 3.775 | 264872 | 29.3
3-way | 3 | 4.116 | 242929 | 26.9
FastCrc32c|1 | 4.037 | 247727 | 27.4
FastCrc32c|2 | 4.648 | 215166 | 23.8
FastCrc32c|3 | 4.352 | 229799 | 25.4
AVG
Algorithm | Average of micros/op | Average of ops/sec | Average of Throughput (MB/s)
3-way | 4.01 | 249,729 | 27.63
FastCrc32c | 4.35 | 230,897 | 25.53
2) Crc32c computation CPU cost (inclusive samples percentage)
PER RUN
Implementation | run | TotalSamples | Crc32c percentage
3-way | 1 | 4,572,250,000 | 4.37%
3-way | 2 | 3,779,250,000 | 4.62%
3-way | 3 | 4,129,500,000 | 4.48%
FastCrc32c | 1 | 4,663,500,000 | 11.24%
FastCrc32c | 2 | 4,047,500,000 | 12.34%
FastCrc32c | 3 | 4,366,750,000 | 11.68%
**# Test Plan**
make -j64 corruption_test && ./corruption_test
By default it uses 3-way SSE algorithm
NO_THREEWAY_CRC32C=1 make -j64 corruption_test && ./corruption_test
make clean && DEBUG_LEVEL=0 make -j64 db_bench
make clean && DEBUG_LEVEL=0 NO_THREEWAY_CRC32C=1 make -j64 db_bench
Closes https://github.com/facebook/rocksdb/pull/3173
Differential Revision: D6330882
Pulled By: yingsu00
fbshipit-source-id: 8ec3d89719533b63b536a736663ca6f0dd4482e9
Summary:
We added a new verification that ensures a value that snapshot reads when is released is the same as when it was created. This test however fails when the cf is dropped in between. The patch skips the tests if that was the case.
Closes https://github.com/facebook/rocksdb/pull/3279
Differential Revision: D6581584
Pulled By: maysamyabandeh
fbshipit-source-id: afe37d371c0f91818d2e279b3949b810e112e8eb
Summary:
Add "--use_txn" option to use transactional API in db_stress, default being WRITE_PREPARED policy, which is the main intention of modifying db_stress. It also extend the existing snapshots to verify that before releasing a snapshot a read from it returns the same value as before.
Closes https://github.com/facebook/rocksdb/pull/3243
Differential Revision: D6556912
Pulled By: maysamyabandeh
fbshipit-source-id: 1ae31465be362d44bd06e635e2e9e49a1da11268
Summary:
clang-analyzer complaint about db_ being nullptr, but it couldn't be because it checks exec_stats before proceed. Add an assert to get around the false-positive.
Test Plan
`make analyze`
Closes https://github.com/facebook/rocksdb/pull/3236
Differential Revision: D6505417
Pulled By: yiwu-arbug
fbshipit-source-id: e5b65764ea994dd9e4bab3e697b97dc70dc22cab
Summary:
This is to fix tools/check_format_compatible.sh. The tool try to open
old versions of rocksdb with the provided options file. When options
file is missing (e.g. rocksdb 2.2), it should still proceed with default
options.
Closes https://github.com/facebook/rocksdb/pull/3232
Differential Revision: D6503955
Pulled By: yiwu-arbug
fbshipit-source-id: e44cfcce7ddc7d12cf83466ed3f3fe7624aa78b8
Summary:
Add recent versions for format compatible test. We should probably update the script to auto include available versions (by looking at include/rocksdb/versions.h and deduce branch names), but we can do it later.
Closes https://github.com/facebook/rocksdb/pull/3233
Differential Revision: D6503631
Pulled By: yiwu-arbug
fbshipit-source-id: e2b01d1ef6e784ff6ffa1bd75d741755e3c69a8c
Summary:
I started adding gflags support for cmake on linux and got frustrated that I'd need to duplicate the build_detect_platform logic, which determines namespace based on attempting compilation. We can do it differently -- use the GFLAGS_NAMESPACE macro if available, and if not, that indicates it's an old gflags version without configurable namespace so we can simply hardcode "google".
Closes https://github.com/facebook/rocksdb/pull/3212
Differential Revision: D6456973
Pulled By: ajkr
fbshipit-source-id: 3e6d5bde3ca00d4496a120a7caf4687399f5d656
Summary:
Add PreReleaseCallback to be called at the end of WriteImpl but before publishing the sequence number. The callback is used in WritePrepareTxn to i) update the commit map, ii) update the last published sequence number in the 2nd write queue. It also ensures that all the commits will go to the 2nd queue.
These changes will ensure that the commit map is updated before the sequence number is published and used by reading snapshots. If we use two write queues, the snapshots will use the seq number published by the 2nd queue. If we use one write queue (the default, the snapshots will use the last seq number in the memtable, which also indicates the last published seq number.
Closes https://github.com/facebook/rocksdb/pull/3205
Differential Revision: D6438959
Pulled By: maysamyabandeh
fbshipit-source-id: f8b6c434e94bc5f5ab9cb696879d4c23e2577ab9
Summary:
- Made CLI arguments take precedence over options file when both are provided. Note some of the CLI args are not settable via options file, like `--compression_max_dict_bytes`, so it's necessary to allow both ways of providing options simultaneously.
- Changed `PrepareOptionsForOpenDB` to update the proper `ColumnFamilyOptions` if one exists for the user's `--column_family_name` argument. I supported this only in the base class, `LDBCommand`, so it works for the general arguments. Will defer adding support for subcommand-specific arguments.
- Made the command fail if `--try_load_options` is provided and loading options file returns NotFound. I found the previous behavior of silently continuing confusing.
Closes https://github.com/facebook/rocksdb/pull/3144
Differential Revision: D6270544
Pulled By: ajkr
fbshipit-source-id: 7c2eac9f9b38720523d74466fb9e78db53561367
Summary:
tools/ldb_cmd.cc:
```
310 ignore_unknown_options_ = IsFlagPresent(flags, ARG_IGNORE_UNKNOWN_OPTIONS);
CID 1322798 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
5. uninit_member: Non-static class member db_ttl_ is not initialized in this constructor nor in any functions that it calls.
311}
```
Closes https://github.com/facebook/rocksdb/pull/3122
Differential Revision: D6428576
Pulled By: sagar0
fbshipit-source-id: d77f04dd201f7f1d9f59ef88a215ee7ad7b934e9
Summary:
We hit "Illegal instruction" error in regression test with "shlx" instruction. Setting PORTABLE=1 to resolve it.
Closes https://github.com/facebook/rocksdb/pull/3165
Differential Revision: D6321972
Pulled By: yiwu-arbug
fbshipit-source-id: cc9fe0dbd4698d1b66a750a0b062f66899862719
Summary:
- moved existing compression options to `InitializeOptionsGeneral` since they cannot be set through options file
- added flag for `zstd_max_train_bytes` which was recently introduced by #3057
Closes https://github.com/facebook/rocksdb/pull/3128
Differential Revision: D6240460
Pulled By: ajkr
fbshipit-source-id: 27dbebd86a55de237ba6a45cc79cff9214e82ebc
Summary:
Without this option, running the compact benchmark on a DB containing only bottommost files simply returned immediately.
Closes https://github.com/facebook/rocksdb/pull/3138
Differential Revision: D6256660
Pulled By: ajkr
fbshipit-source-id: e3b64543acd503d821066f4200daa201d4fb3a9d
Summary:
- Release all snapshots before crashing and reopening the DB. Without this, we may attempt to release snapshots from an old DB using a new DB. That tripped an assertion.
- Release multiple snapshots in the same operation if needed. Without this, we would sometimes leak snapshots.
Closes https://github.com/facebook/rocksdb/pull/3098
Differential Revision: D6194923
Pulled By: ajkr
fbshipit-source-id: b9c89bcca7ebcbb6c7802c616f9d1175a005aadf
Summary:
Add options to `db_stress` (correctness testing tool) to randomly acquire snapshot and release it after some period of time. It's useful for correctness testing of #3009, as well as other parts of compaction that behave differently depending on which snapshots are held.
Closes https://github.com/facebook/rocksdb/pull/3038
Differential Revision: D6086501
Pulled By: ajkr
fbshipit-source-id: 3ec0d8666c78ac507f1f808887c4ff759ba9b865
Summary:
Dynamic adjustment of rate limit according to demand for background I/O. It increases by a factor when limiter is drained too frequently, and decreases by the same factor when limiter is not drained frequently enough. The parameters for this behavior are fixed in `GenericRateLimiter::Tune`. Other changes:
- make rate limiter's `Env*` configurable for testing
- track num drain intervals in RateLimiter so we don't have to rely on stats, which may be shared across different DB instances from the ones that share the RateLimiter.
Closes https://github.com/facebook/rocksdb/pull/2899
Differential Revision: D5858704
Pulled By: ajkr
fbshipit-source-id: cc2bac30f85e7f6fd63655d0a6732ef9ed7403b1
Summary:
it's unsupported in options file, so the flag should be respected by db_bench even when an options file is provided.
Closes https://github.com/facebook/rocksdb/pull/2910
Differential Revision: D5869836
Pulled By: ajkr
fbshipit-source-id: f67f591ae083e95e989f86b6fad50765d2e3d855
Summary:
- Switched all instances of SetMinPossibleForUserKey and SetMaxPossibleForUserKey in accordance to InternalKeyComparator's comparison logic
Closes https://github.com/facebook/rocksdb/pull/2868
Differential Revision: D5804152
Pulled By: axxufb
fbshipit-source-id: 80be35e04f2e8abc35cc64abe1fecb03af24e183
Summary:
**Summary**:
Set defaults for high-pri and low-pri thread pools in regression test script.
**Reason for this change**:
With #2680 , high-pri and low-pri thread pools get different numbers than before if `num_high_pri_threads` and `num_low_pri_threads` options are not explicitly passed to db_bench in regression test script ... leading to a false-positive regression.
**Test Plan**:
REMOTE_HOST=udb1671.prn3 TEST_MODE=1 FBSOURCE=~/fbsource ~/fbsource/fbcode/rocks/tools/debug_regression_test.sh viewstate (with very minor changes to the internals).
Observe P50 and P99 which showed up as regressions in our graphs.
Stats with the commit prior to #2680 , ie. 4f81ab3 :
seekrandomwhilewriting : 75.096 micros/op 13316 ops/sec; 168.6 MB/s (7499074 of 7500000 found)
Microseconds per seek:
Count: 120000000 Average: 1197.7254 StdDev: 33.35
Min: 187 Median: 980.5292 Max: 1816424
Percentiles: **P50: 980.53** P75: 1494.57 **P99: 4185.64** P99.9: 7800.11 P99.99: 15039.64
Stats at #2680, ie. at commit dce6d5a (false-positive regression):
seekrandomwhilewriting : 85.330 micros/op 11719 ops/sec; 148.4 MB/s (7499073 of 7500000 found)
Microseconds per seek:
Count: 120000000 Average: 1362.3261 StdDev: 27.86
Min: 185 Median: 1088.1915 Max: 652760
Percentiles: **P50: 1088.19** P75: 1658.12 **P99: 5361.15** P99.9: 7997.95 P99.99: 11730.07
Stats with the current change on top of dce6d5a :
seekrandomwhilewriting : 77.780 micros/op 12856 ops/sec; 162.8 MB/s (7499102 of 7500000 found)
Microseconds per seek:
Count: 120000000 Average: 1226.6744 StdDev: 17.16
Min: 185 Median: 994.2956 Max: 2553530
Percentiles: **P50: 994.30** P75: 1513.68 **P99: 4284.30** P99.9: 9338.64 P99.99: 23008.86
Closes https://github.com/facebook/rocksdb/pull/2801
Differential Revision: D5742338
Pulled By: sagar0
fbshipit-source-id: cc5d727c1a131f2a7070d1bb892efbe929b976ff
Summary:
store a zero as the checksum when disabled since it's easier to keep block trailer a fixed length.
Closes https://github.com/facebook/rocksdb/pull/2781
Differential Revision: D5694702
Pulled By: ajkr
fbshipit-source-id: 69cea9da415778ba2b600dfd9d0dfc8cb5188ecd
Summary:
Clang complain about an cast from uint64_t to int32 in db_stress. Fixing it.
Closes https://github.com/facebook/rocksdb/pull/2755
Differential Revision: D5655947
Pulled By: yiwu-arbug
fbshipit-source-id: cfac10e796e0adfef4727090b50975b0d6e2c9be
Summary:
fix some things that made this command hard to use from CLI:
- use default values for `target_file_size_base` and `max_bytes_for_level_base`. previously we were using small values for these but default value of `write_buffer_size`, which led to enormous number of L1 files.
- failure message for `value_size_mult` too big. previously there was just an assert, so in non-debug mode it'd overrun the value buffer and crash mysteriously.
- only print verification success if there's no failure. before it'd print both in the failure case.
- support `memtable_prefix_bloom_size_ratio`
- support `num_bottom_pri_threads` (universal compaction)
Closes https://github.com/facebook/rocksdb/pull/2741
Differential Revision: D5629495
Pulled By: ajkr
fbshipit-source-id: ddad97d6d4ba0884e7c0f933b0a359712514fc1d
Summary:
Support a window of `active_width` keys that rolls through `[0, max_key)` over the duration of the test. Operations only affect keys inside the window. This gives us the ability to detect L0->L0 deletion bug (#2722).
Closes https://github.com/facebook/rocksdb/pull/2739
Differential Revision: D5628555
Pulled By: ajkr
fbshipit-source-id: 9cb2d8f4ab1a7c73f7797b8e19f7094970ea8749
Summary:
- like other subcommands, reporting compression sizes should be specified with the `--command` CLI arg.
- also added `--compression_types` arg as it's useful to restrict the types of compression used, at least in my dictionary compression experiments.
Closes https://github.com/facebook/rocksdb/pull/2706
Differential Revision: D5589520
Pulled By: ajkr
fbshipit-source-id: 305bb4ebcc95eecc8a85523cd3b1050619c9ddc5
Summary:
Previously we could only select the CF on which to operate uniformly at random. This is a limitation, e.g., when testing universal compaction as all CFs would need to run full compaction at roughly the same time, which isn't realistic.
This PR allows the user to specify the probability distribution for selecting CFs via the `--column_family_distribution` argument.
Closes https://github.com/facebook/rocksdb/pull/2677
Differential Revision: D5544436
Pulled By: ajkr
fbshipit-source-id: 478d56260995236ae90895ce5bd51f38882e185a
Summary:
Right now, if direct I/O is enabled, prefetching the last 512KB cannot be applied, except compaction inputs or readahead is enabled for iterators. This can create a lot of I/O for HDD cases. To solve the problem, the 512KB is prefetched in block based table if direct I/O is enabled. The prefetched buffer is passed in totegher with random access file reader, so that we try to read from the buffer before reading from the file. This can be extended in the future to support flexible user iterator readahead too.
Closes https://github.com/facebook/rocksdb/pull/2708
Differential Revision: D5593091
Pulled By: siying
fbshipit-source-id: ee36ff6d8af11c312a2622272b21957a7b5c81e7
Summary:
TSAN shows error when we grab too many locks at the same time. In TSAN crash test, make one shard key cover 2^22 keys so that no many keys will be hold at the same time.
Closes https://github.com/facebook/rocksdb/pull/2719
Differential Revision: D5609035
Pulled By: siying
fbshipit-source-id: 930e5d63fff92dbc193dc154c4c615efbdf06c6a
Summary:
We need a tool to check any sst file corruption in the db.
It will check all the sst files in current version and read all the blocks (data, meta, index) with checksum verification. If any verification fails, the function will return non-OK status.
Closes https://github.com/facebook/rocksdb/pull/2498
Differential Revision: D5324269
Pulled By: lightmark
fbshipit-source-id: 6f8a272008b722402a772acfc804524c9d1a483b
Summary:
The background thread pools' sizes weren't easily configurable by `max_background_compactions` and `max_background_flushes` in multi-instance setups. Introduced separate arguments for their sizes.
Closes https://github.com/facebook/rocksdb/pull/2680
Differential Revision: D5550675
Pulled By: ajkr
fbshipit-source-id: bab5f0a7bc5db63bb084d0c10facbe437096367d
Summary:
When we had a single thread pool for compactions, a thread could be busy for a long time (minutes) executing a compaction involving the bottom level. In multi-instance setups, the entire thread pool could be consumed by such bottom-level compactions. Then, top-level compactions (e.g., a few L0 files) would be blocked for a long time ("head-of-line blocking"). Such top-level compactions are critical to prevent compaction stalls as they can quickly reduce number of L0 files / sorted runs.
This diff introduces a bottom-priority queue for universal compactions including the bottom level. This alleviates the head-of-line blocking situation for fast, top-level compactions.
- Added `Env::Priority::BOTTOM` thread pool. This feature is only enabled if user explicitly configures it to have a positive number of threads.
- Changed `ThreadPoolImpl`'s default thread limit from one to zero. This change is invisible to users as we call `IncBackgroundThreadsIfNeeded` on the low-pri/high-pri pools during `DB::Open` with values of at least one. It is necessary, though, for bottom-pri to start with zero threads so the feature is disabled by default.
- Separated `ManualCompaction` into two parts in `PrepickedCompaction`. `PrepickedCompaction` is used for any compaction that's picked outside of its execution thread, either manual or automatic.
- Forward universal compactions involving last level to the bottom pool (worker thread's entry point is `BGWorkBottomCompaction`).
- Track `bg_bottom_compaction_scheduled_` so we can wait for bottom-level compactions to finish. We don't count them against the background jobs limits. So users of this feature will get an extra compaction for free.
Closes https://github.com/facebook/rocksdb/pull/2580
Differential Revision: D5422916
Pulled By: ajkr
fbshipit-source-id: a74bd11f1ea4933df3739b16808bb21fcd512333
Summary:
Move an option necessary for running db_bench on multiple CFs into the general initialization area, so it works with both flag-based init and OPTIONS-based init.
Closes https://github.com/facebook/rocksdb/pull/2675
Differential Revision: D5541378
Pulled By: ajkr
fbshipit-source-id: 169926cb4ae95c17974f744faf7cc794d41e5c0a
Summary:
* Dump blob db options to info log
* Remove BlobDBOptionsImpl to disallow dynamic cast *BlobDBOptions into *BlobDBOptionsImpl. Move options there to be constants or into BlobDBOptions. The dynamic cast is broken after #2645
* Change some of the default options
* Remove blob_db_options.min_blob_size, which is unimplemented. Will implement it soon.
Closes https://github.com/facebook/rocksdb/pull/2671
Differential Revision: D5529912
Pulled By: yiwu-arbug
fbshipit-source-id: dcd58ca981db5bcc7f123b65a0d6f6ae0dc703c7
Summary:
Replace dynamic_cast<> so that users can choose to build with RTTI off, so that they can save several bytes per object, and get tiny more memory available.
Some nontrivial changes:
1. Add Comparator::GetRootComparator() to get around the internal comparator hack
2. Add the two experiemental functions to DB
3. Add TableFactory::GetOptionString() to avoid unnecessary casting to get the option string
4. Since 3 is done, move the parsing option functions for table factory to table factory files too, to be symmetric.
Closes https://github.com/facebook/rocksdb/pull/2645
Differential Revision: D5502723
Pulled By: siying
fbshipit-source-id: fd13cec5601cf68a554d87bfcf056f2ffa5fbf7c
Summary:
it should be a bool
Closes https://github.com/facebook/rocksdb/pull/2653
Differential Revision: D5506148
Pulled By: ajkr
fbshipit-source-id: f142f0f3aa8b678c68adef12e5ac6e1e163306f3
Summary:
This reverts the previous commit 1d7048c598, which broke the build.
Did a `git revert 1d7048c`.
Closes https://github.com/facebook/rocksdb/pull/2627
Differential Revision: D5476473
Pulled By: sagar0
fbshipit-source-id: 4756ff5c0dfc88c17eceb00e02c36176de728d06
Summary: This uses `clang-tidy` to comment out unused parameters (in functions, methods and lambdas) in fbcode. Cases that the tool failed to handle are fixed manually.
Reviewed By: igorsugak
Differential Revision: D5454343
fbshipit-source-id: 5dee339b4334e25e963891b519a5aa81fbf627b2
Summary:
For forward_compatible_checkout_objs the local branch is already created in previous step. This patch avoid recreating it. This should address "fatal: A branch named '3.10.fb' already exists." errors.
Closes https://github.com/facebook/rocksdb/pull/2606
Differential Revision: D5443786
Pulled By: maysamyabandeh
fbshipit-source-id: 69d5a67b87677429cf36e3a467bd114d341f3b9c
Summary:
The new local branch specified with -b cannot be called master. Use tmp prefix to avoid name collision.
Closes https://github.com/facebook/rocksdb/pull/2600
Differential Revision: D5442944
Pulled By: maysamyabandeh
fbshipit-source-id: 4a623d9b21d6cc01bee812b2799790315bdf5f6e
Summary:
This will fix the error: "error: pathspec '2.2.fb.branch' did not match any file(s) known to git."
Tested by manually sshing to sandcastle and running the command.
Closes https://github.com/facebook/rocksdb/pull/2599
Differential Revision: D5441130
Pulled By: maysamyabandeh
fbshipit-source-id: a22fd6a52221471bafbba8990394b499535e5812