rocksdb

Commit Graph

Author	SHA1	Message	Date
Mark Callaghan	177b2fa341	Set the value for --version, add --build_info (#10275 ) Summary: ./db_bench --version db_bench version 7.5.0 ./db_bench --build_info (RocksDB) 7.5.0 rocksdb_build_date: 2022-06-29 09:58:04 rocksdb_build_git_sha: `d96febeeaa` rocksdb_build_git_tag: print_version_githash Pull Request resolved: https://github.com/facebook/rocksdb/pull/10275 Test Plan: run it Reviewed By: ajkr Differential Revision: D37524720 Pulled By: mdcallag fbshipit-source-id: 0f6c819dbadf7b033a4a3ba2941992bb76b4ff99	3 years ago
Mark Callaghan	9eced1a344	Add the git hash and full RocksDB version to report.tsv (#10277 ) Summary: Previously the version was displayed as $major.$minor. This changes it to $major.$minor.$patch. This also adds the git hash of the commit from which RocksDB was built to the end of report.tsv. I confirmed that benchmark_log_tool.py still parses it and that the people who consume/graph these results are OK with it. Example output: ops_sec mb_sec lsm_sz blob_sz c_wgb w_amp c_mbps c_wsecs c_csecs b_rgb b_wgb usec_op p50 p99 p99.9 p99.99 pmax uptime stall% Nstall u_cpu s_cpu rss test date version job_id githash 609488 244.1 1GB 0.0GB, 1.4 0.7 93.3 39 38 0 0 1.6 1.0 4 15 26 5365 15 0.0 0 0.1 0.0 0.5 fillseq.wal_disabled.v400 2022-06-29T13:36:05 7.5.0 `6115254416` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10277 Test Plan: Run it Reviewed By: jay-zhuang Differential Revision: D37532418 Pulled By: mdcallag fbshipit-source-id: 55e472640d51265819b228d3373c9fa9b62b660d	3 years ago
zczhu	e716bda010	Add FLAGS_compaction_pri into crash_test (#10255 ) Summary: Add FLAGS_compaction_pri into correctness test Pull Request resolved: https://github.com/facebook/rocksdb/pull/10255 Test Plan: run crash_test with FLAGS_compaction_pri Reviewed By: ajkr Differential Revision: D37510372 Pulled By: littlepig2013 fbshipit-source-id: 73d93a0a047d0c3993c8a512383dd6ee6acef641	3 years ago
Mark Callaghan	720ab355f9	Add undefok for BlobDB options not supported prior to 7.5 (#10276 ) Summary: This adds --undefok to support use of this script with BlobDB for db_bench versions prior to 7.5 when the options land in a release. While there is a limit to how far back this script can go WRT backwards compatibility, this is an easy change to support early 7.x releases. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10276 Test Plan: Run it with versions of db_bench that do not and then do support these options Reviewed By: gangliao Differential Revision: D37529299 Pulled By: mdcallag fbshipit-source-id: 7bb1feec5c68760e6d64792c585bfbde4f5e52d8	3 years ago
Guido Tagliavini Ponce	57a0e2f304	Clock cache (#10273 ) Summary: This is the initial step in the development of a lock-free clock cache. This PR includes the base hash table design (which we mostly ported over from FastLRUCache) and the clock eviction algorithm. Importantly, it's still _not_ lock-free---all operations use a shard lock. Besides the locking, there are other features left as future work: - Remove keys from the handles. Instead, use 128-bit bijective hashes of them for handle comparisons, probing (we need two 32-bit hashes of the key for double hashing) and sharding (we need one 6-bit hash). - Remove the clock_usage_ field, which is updated on every lookup. Even if it were atomically updated, it could cause memory invalidations across cores. - Middle insertions into the clock list. - A test that exercises the clock eviction policy. - Update the Java API of ClockCache and Java calls to C++. Along the way, we improved the code and comments quality of FastLRUCache. These changes are relatively minor. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10273 Test Plan: ``make -j24 check`` Reviewed By: pdillinger Differential Revision: D37522461 Pulled By: guidotag fbshipit-source-id: 3d70b737dbb70dcf662f00cef8c609750f083943	3 years ago
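The clock eviction policy named in the commit above can be illustrated with a minimal single-threaded sketch. Everything in it (`ToyClockCache`, its slot layout, the linear key scan, and the choice to insert entries with a cleared reference bit) is a hypothetical simplification for illustration, not RocksDB's ClockCache; it has no sharding, locking, or hash table.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Toy single-threaded clock cache over a fixed array of slots.
// All names are hypothetical; this is not RocksDB's ClockCache.
class ToyClockCache {
 public:
  explicit ToyClockCache(std::size_t capacity) : slots_(capacity) {
    assert(capacity > 0);
  }

  // Returns the cached value, if any, and sets the slot's reference bit.
  std::optional<int> Lookup(const std::string& key) {
    for (Slot& s : slots_) {
      if (s.occupied && s.key == key) {
        s.referenced = true;
        return s.value;
      }
    }
    return std::nullopt;
  }

  // Overwrites an existing key, otherwise takes the slot chosen by the hand.
  void Insert(const std::string& key, int value) {
    for (Slot& s : slots_) {
      if (s.occupied && s.key == key) {
        s.value = value;
        s.referenced = true;
        return;
      }
    }
    Slot& victim = FindVictim();
    victim.key = key;
    victim.value = value;
    victim.occupied = true;
    victim.referenced = false;  // Assumption: new entries start unreferenced.
  }

 private:
  struct Slot {
    std::string key;
    int value = 0;
    bool occupied = false;
    bool referenced = false;
  };

  // Sweeps the hand around the array. Empty or unreferenced slots are
  // taken; referenced slots get a second chance (bit cleared, hand moves on).
  Slot& FindVictim() {
    while (true) {
      Slot& s = slots_[hand_];
      hand_ = (hand_ + 1) % slots_.size();
      if (!s.occupied || !s.referenced) {
        return s;
      }
      s.referenced = false;
    }
  }

  std::vector<Slot> slots_;
  std::size_t hand_ = 0;
};
```

With a capacity of two, inserting `a` and `b`, looking up `a`, and then inserting `c` evicts `b`, because the hand clears the reference bit on `a` and moves on to the unreferenced `b`.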
Mark Callaghan	28f2d3cca6	Benchmark fix write amplification computation (#10236 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10236 Reviewed By: ajkr Differential Revision: D37489898 Pulled By: mdcallag fbshipit-source-id: 4b4565973b1f2c47342b4d1b857c8f89e91da145	3 years ago
Yanqin Jin	d3de59255a	Enable compaction filter for db_stress with user-defined timestamp (#10259 ) Summary: Before this PR, when user-defined timestamp is enabled, db_stress disables compaction filter. This is no longer necessary after this PR, since the `DbStressCompactionFilter` is now aware of the presence of timestamps. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10259 Test Plan: TEST_TMPDIR=/dev/shm make crash_test_with_ts Reviewed By: ajkr Differential Revision: D37459692 Pulled By: riversand963 fbshipit-source-id: 8fe62e90a63bd9317fe1bb95a2b4984080c9e5ef	3 years ago
Andrew Kryczka	f322f273b0	Temporarily disable mempurge in crash test (#10252 ) Summary: Need to disable it for now as CI is failing, particularly `MultiOpsTxnsStressTest`. Investigation details in internal task T124324915. This PR disables mempurge more widely than `MultiOpsTxnsStressTest` until we know the issue is contained to that particular test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10252 Reviewed By: riversand963 Differential Revision: D37432948 Pulled By: ajkr fbshipit-source-id: d0cf5b0e0ec7c3142c382a0347f35a4c34f4607a	3 years ago
Mark Callaghan	6061905790	Wrapper for benchmark.sh to run a sequence of db_bench tests (#10215 ) Summary: This provides two things: 1) Runs a sequence of db_bench tests. This sequence was chosen to provide good coverage with less variance. 2) Makes it easier to do A/B testing for multiple binaries. This combines the report.tsv files into summary.tsv to make it easier to compare results across multiple binaries. Example output for 2) is: ops_sec mb_sec lsm_sz blob_sz c_wgb w_amp c_mbps c_wsecs c_csecs b_rgb b_wgb usec_op p50 p99 p99.9 p99.99 pmax uptime stall% Nstall u_cpu s_cpu rss test date version job_id 1115171 446.7 9GB 8.9 1.0 454.7 26 26 0 0 0.9 0.5 2 7 51 5547 20 0.0 0 0.1 0.1 0.2 fillseq.wal_disabled.v400 2022-04-12T08:53:51 6.0 1045726 418.9 8GB 0.0GB 8.4 1.0 432.4 27 26 0 0 1.0 0.5 2 6 102 5618 20 0.0 0 0.1 0.0 0.1 fillseq.wal_disabled.v400 2022-04-12T12:25:36 6.28 ops_sec mb_sec lsm_sz blob_sz c_wgb w_amp c_mbps c_wsecs c_csecs b_rgb b_wgb usec_op p50 p99 p99.9 p99.99 pmax uptime stall% Nstall u_cpu s_cpu rss test date version job_id 2969192 1189.3 16GB 0.0 0.0 0 0 0 0 10.8 9.3 25 33 49 13551 1781 0.0 0 48.2 6.8 16.8 readrandom.t32 2022-04-12T08:54:28 6.0 2692922 1078.6 16GB 0.0GB 0.0 0.0 0 0 0 0 11.9 10.2 30 38 56 49735 1781 0.0 0 47.8 6.7 16.8 readrandom.t32 2022-04-12T12:26:15 6.28 ... 
ops_sec mb_sec lsm_sz blob_sz c_wgb w_amp c_mbps c_wsecs c_csecs b_rgb b_wgb usec_op p50 p99 p99.9 p99.99 pmax uptime stall% Nstall u_cpu s_cpu rss test date version job_id 180227 72.2 38GB 1126.4 8.7 643.2 3286 3218 0 0 177.6 50.2 2687 4083 6148 854083 1793 68.4 7804 17.0 5.9 0.5 overwrite.t32.s0 2022-04-12T11:55:21 6.0 236512 94.7 31GB 0.0GB 1502.9 8.9 862.2 5242 5125 0 0 135.3 59.9 2537 3268 5404 18545 1785 49.7 5112 25.5 8.0 9.4 overwrite.t32.s0 2022-04-12T15:27:25 6.28 Example output with formatting preserved is here: https://gist.github.com/mdcallag/4432e5bbaf91915c916d46bd6ce3c313 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10215 Test Plan: run it Reviewed By: jay-zhuang Differential Revision: D37299892 Pulled By: mdcallag fbshipit-source-id: e6e0ed638fd7e8deeb869d700593fdc3eba899c8	3 years ago
Gang Liao	2352e2dfda	Add the blob cache to the stress tests and the benchmarking tool (#10202 ) Summary: In order to facilitate correctness and performance testing, we would like to add the new blob cache to our stress test tool `db_stress` and our continuously running crash test script `db_crashtest.py`, as well as our synthetic benchmarking tool `db_bench` and the BlobDB performance testing script `run_blob_bench.sh`. As part of this task, we would also like to utilize these benchmarking tools to get some initial performance numbers about the effectiveness of caching blobs. This PR is a part of https://github.com/facebook/rocksdb/issues/10156 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10202 Reviewed By: ltamasi Differential Revision: D37325739 Pulled By: gangliao fbshipit-source-id: deb65d0d414502270dd4c324d987fd5469869fa8	3 years ago
Peter Dillinger	84210c9489	Add data block hash index to crash test, fix MultiGet issue (#10220 ) Summary: There was a bug in the MultiGet enhancement in https://github.com/facebook/rocksdb/issues/9899 with data block hash index, which was not caught because data block hash index was never added to stress tests. This change fixes both issues. Fixes https://github.com/facebook/rocksdb/issues/10186 I intend to pick this into the 7.4.0 release candidate Pull Request resolved: https://github.com/facebook/rocksdb/pull/10220 Test Plan: Failure quickly reproduces in crash test with kDataBlockBinaryAndHash, and does not seem to with the fix. Reproducing the failure with a unit test I believe would be too tricky and fragile to be worthwhile. Reviewed By: anand1976 Differential Revision: D37315647 Pulled By: pdillinger fbshipit-source-id: 9f648265bba867275edc752f7a56611a59401cba	3 years ago
Guido Tagliavini Ponce	3afed7408c	Replace per-shard chained hash tables with open-addressing scheme (#10194 ) Summary: In FastLRUCache, we replace the current chained per-shard hash table by an open-addressing hash table. In particular, this allows us to preallocate all handles. Because all handles are preallocated, this implementation doesn't support strict_capacity_limit = false (i.e., allowing insertions beyond the predefined capacity). This clashes with current assumptions of some tests, namely two tests in cache_test and the crash tests. We have disabled these for now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10194 Test Plan: ``make -j24 check`` Reviewed By: pdillinger Differential Revision: D37296770 Pulled By: guidotag fbshipit-source-id: 232ff1b8260331d868ebf4e3e5d8ad709390b0ad	3 years ago
Gang Liao	deff48bcef	Add blob source to retrieve blobs in RocksDB (#10198 ) Summary: There is currently no caching mechanism for blobs, which is not ideal especially when the database resides on remote storage (where we cannot rely on the OS page cache). As part of this task, we would like to make it possible for the application to configure a blob cache. In this task, we formally introduced the blob source to RocksDB. BlobSource is a new abstraction layer that provides universal access to blobs, regardless of whether they are in the blob cache, secondary cache, or (remote) storage. Depending on user settings, it always fetches blobs from multi-tier cache and storage with minimal cost. Note: The new `MultiGetBlob()` implementation is not included in the current PR. To go faster, we aim to create a separate PR for it in parallel! This PR is a part of https://github.com/facebook/rocksdb/issues/10156 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10198 Reviewed By: ltamasi Differential Revision: D37294735 Pulled By: gangliao fbshipit-source-id: 9cb50422d9dd1bc03798501c2778b6c7520c7a1e	3 years ago
Peter Dillinger	ccb4f047ae	Add 7.4 to format compatibility test (#10209 ) Summary: Forgotten in https://github.com/facebook/rocksdb/issues/10204 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10209 Test Plan: local run with SHORT_TEST=1 Reviewed By: hx235 Differential Revision: D37284028 Pulled By: pdillinger fbshipit-source-id: 631c1969906d002acc930662dcd5eefc0c758429	3 years ago
Hui Xiao	a5d773e077	Add rate-limiting support to batched MultiGet() (#10159 ) Summary: Context/Summary: https://github.com/facebook/rocksdb/pull/9424 added rate-limiting support for user reads, which does not include batched `MultiGet()`s that call `RandomAccessFileReader::MultiRead()`. The reason is that it's harder (compared with RandomAccessFileReader::Read()) to implement the ideal rate-limiting where we first call `RateLimiter::RequestToken()` for allowed bytes to multi-read and then consume those bytes by satisfying as many requests in `MultiRead()` as possible. For example, it can be tricky to decide whether we want partially fulfilled requests within one `MultiRead()` or not. However, due to a recent urgent user request, we decide to pursue an elementary (but a conditionally ineffective) solution where we accumulate enough rate limiter requests toward the total bytes needed by one `MultiRead()` before doing that `MultiRead()`. This is not ideal when the total bytes are huge as we will actually consume a huge bandwidth from rate-limiter causing a burst on disk. This is not what we ultimately want with rate limiter. Therefore a follow-up work is noted through TODO comments. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10159 Test Plan: - Modified existing unit test `DBRateLimiterOnReadTest/DBRateLimiterOnReadTest.NewMultiGet` - Traced the underlying system calls `io_uring_enter` and verified they are 10 seconds apart from each other correctly under the setting of `strace -ftt -e trace=io_uring_enter ./db_bench -benchmarks=multireadrandom -db=/dev/shm/testdb2 -readonly -num=50 -threads=1 -multiread_batched=1 -batch_size=100 -duration=10 -rate_limiter_bytes_per_sec=200 -rate_limiter_refill_period_us=1000000 -rate_limit_bg_reads=1 -disable_auto_compactions=1 -rate_limit_user_ops=1` where each `MultiRead()` read about 2000 bytes (inspected by debugger) and the rate limiter grants 200 bytes per second. 
- Stress test: - Verified `./db_stress (-test_cf_consistency=1/test_batches_snapshots=1) -use_multiget=1 -cache_size=1048576 -rate_limiter_bytes_per_sec=10241024 -rate_limit_bg_reads=1 -rate_limit_user_ops=1` work Reviewed By: ajkr, anand1976 Differential Revision: D37135172 Pulled By: hx235 fbshipit-source-id: 73b8e8f14761e5d4b77235dfe5d41f4eea968bcd	3 years ago
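The 10-second spacing reported in the test plan above is consistent with simple arithmetic: about 2000 bytes per `MultiRead()` at 200 bytes per second, refilled once per 1,000,000 microseconds. The sketch below works that out under a deliberately simplified model, assuming the limiter starts empty and grants a fixed number of bytes per refill period. `MicrosToAccumulate` is a hypothetical helper, not a RocksDB function.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: microseconds needed to accumulate `total_bytes`
// from a limiter that starts empty and grants a fixed byte budget once
// per refill period. This is a simplified model, not RocksDB's limiter.
std::int64_t MicrosToAccumulate(std::int64_t total_bytes,
                                std::int64_t bytes_per_sec,
                                std::int64_t refill_period_us) {
  assert(total_bytes >= 0 && bytes_per_sec > 0 && refill_period_us > 0);
  // Bytes granted per refill period.
  std::int64_t bytes_per_refill = bytes_per_sec * refill_period_us / 1000000;
  if (bytes_per_refill < 1) {
    bytes_per_refill = 1;  // Modeling choice: never grant zero bytes.
  }
  // Round up: a partially needed period still has to elapse in full.
  const std::int64_t refills =
      (total_bytes + bytes_per_refill - 1) / bytes_per_refill;
  return refills * refill_period_us;
}
```

For the flags in the test plan, `MicrosToAccumulate(2000, 200, 1000000)` gives 10,000,000 microseconds, which is the 10 seconds observed between `io_uring_enter` calls.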
Andrew Kryczka	5d6005c780	Add WriteOptions::protection_bytes_per_key (#10037 ) Summary: Added an option, `WriteOptions::protection_bytes_per_key`, that controls how many bytes per key we use for integrity protection in `WriteBatch`. It takes effect when `WriteBatch::GetProtectionBytesPerKey() == 0`. Currently the only supported value is eight. Invoking a user API with it set to any other nonzero value will result in `Status::NotSupported` returned to the user. There is also a bug fix for integrity protection with `inplace_callback`, where we forgot to take into account the possible change in varint length when calculating KV checksum for the final encoded buffer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10037 Test Plan: - Manual - Set default value of `WriteOptions::protection_bytes_per_key` to eight and ran `make check -j24` - Enabled in MyShadow for 1+ week - Automated - Unit tests have a `WriteMode` that enables the integrity protection via `WriteOptions` - Crash test - in most cases, use `WriteOptions::protection_bytes_per_key` to enable integrity protection Reviewed By: cbi42 Differential Revision: D36614569 Pulled By: ajkr fbshipit-source-id: 8650087ceac9b61b560f1e5fafe5e1baf9c725fb	3 years ago
Peter Dillinger	126c223714	Remove deprecated block-based filter (#10184 ) Summary: In https://github.com/facebook/rocksdb/issues/9535, release 7.0, we hid the old block-based filter from being created using the public API, because of its inefficiency. Although we normally maintain read compatibility on old DBs forever, filters are not required for reading a DB, only for optimizing read performance. Thus, it should be acceptable to remove this code and the substantial maintenance burden it carries as useful features are developed and validated (such as user timestamp). This change completely removes the code for reading and writing the old block-based filters, net removing about 1370 lines of code no longer needed. Options removed from testing / benchmarking tools. The prior existence is only evident in a couple of places: * `CacheEntryRole::kDeprecatedFilterBlock` - We can update this public API enum in a major release to minimize source code incompatibilities. * A warning is logged when an old table file is opened that used the old block-based filter. This is provided as a courtesy, and would be a pain to unit test, so manual testing should suffice. Unfortunately, sst_dump does not tell you whether a file uses block-based filter, and the structure of the code makes it very difficult to fix. * To detect that case, `kObsoleteFilterBlockPrefix` (renamed from `kFilterBlockPrefix`) for metaindex is maintained (for now). Other notes: * In some cases where numbers are associated with filter configurations, we have had to update the assigned numbers so that they all correspond to something that exists. * Fixed potential stat counting bug by assuming `filter_checked = false` for cases like `filter == nullptr` rather than assuming `filter_checked = true` * Removed obsolete `block_offset` and `prefix_extractor` parameters from several functions. 
* Removed some unnecessary checks `if (!table_prefix_extractor() && !prefix_extractor)` because the caller guarantees the prefix extractor exists and is compatible Pull Request resolved: https://github.com/facebook/rocksdb/pull/10184 Test Plan: tests updated, manually test new warning in LOG using base version to generate a DB Reviewed By: riversand963 Differential Revision: D37212647 Pulled By: pdillinger fbshipit-source-id: 06ee020d8de3b81260ffc36ad0c1202cbf463a80	3 years ago
Peter Dillinger	94329ae4ec	Use only ASCII in source files (#10164 ) Summary: Fix existing usage of non-ASCII and add a check to prevent future use. Added `-n` option to greps to provide line numbers. Alternative to https://github.com/facebook/rocksdb/issues/10147 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10164 Test Plan: used new checker to find & fix cases, manually check db_bench output is preserved Reviewed By: akankshamahajan15 Differential Revision: D37148792 Pulled By: pdillinger fbshipit-source-id: 68c8b57e7ab829369540d532590bf756938855c7	3 years ago
Changyu Bi	9882652b0e	Verify write batch checksum before WAL (#10114 ) Summary: Context: WriteBatch can have key-value checksums when it was created `with protection_bytes_per_key > 0`. This PR added checksum verification for write batches before they are written to WAL. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10114 Test Plan: - Added new unit tests to db_kv_checksum_test.cc: `make check -j32` - benchmark on performance regression: `./db_bench --benchmarks=fillrandom[-X20] -db=/dev/shm/test_rocksdb -write_batch_protection_bytes_per_key=8` - Pre-PR: ` fillrandom [AVG 20 runs] : 198875 (± 3006) ops/sec; 22.0 (± 0.3) MB/sec ` - Post-PR: ` fillrandom [AVG 20 runs] : 196487 (± 2279) ops/sec; 21.7 (± 0.3) MB/sec ` Mean regressed about 1% (198875 -> 196487 ops/sec). Reviewed By: ajkr Differential Revision: D36917464 Pulled By: cbi42 fbshipit-source-id: 29beb74edf65f04b1a890b4f650d873dc7ed790d	3 years ago
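The "about 1%" figure in the commit above is a relative change between the two mean throughputs. A minimal sketch of that arithmetic follows; `RelativeChangePercent` is a hypothetical helper name introduced here for illustration.

```cpp
#include <cassert>

// Hypothetical helper: signed relative change from `before` to `after`,
// expressed as a percentage of `before`.
double RelativeChangePercent(double before, double after) {
  assert(before != 0.0);
  return (after - before) / before * 100.0;
}
```

Applied to the reported means, `RelativeChangePercent(198875.0, 196487.0)` is roughly -1.2. The absolute drop of 2388 ops/sec is smaller than the pre-PR standard deviation of 3006 ops/sec quoted in the same test plan.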
Yanqin Jin	ce419c0f10	Allow db_bench and db_stress to set `allow_data_in_errors` (#10171 ) Summary: There is `Options::allow_data_in_errors` that controls whether RocksDB is allowed to log data, e.g. key, value, etc in LOG files. It is false by default. However, in db_bench and db_stress, it is often ok to log data because there is no concern about privacy. This PR allows db_stress and db_bench to set this option on the command line, while it remains false by default. Furthermore, make crash/recovery test driven by db_crashtest.py to opt-in. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10171 Test Plan: Stress test and db_bench Reviewed By: hx235 Differential Revision: D37163787 Pulled By: riversand963 fbshipit-source-id: 0242f24d292ba15b6faf8ff903963b85d3e011f8	3 years ago
Hui Xiao	d665afdbf3	Account memory of FileMetaData in global memory limit (#9924 ) Summary: Context/Summary: As revealed by heap profiling, allocation of `FileMetaData` for [newly created file added to a Version](https://github.com/facebook/rocksdb/pull/9924/files#diff-a6aa385940793f95a2c5b39cc670bd440c4547fa54fd44622f756382d5e47e43R774) can consume significant heap memory. This PR is to account that toward our global memory limit based on block cache capacity. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9924 Test Plan: - Previous `make check` verified there are only 2 places where the memory of the allocated `FileMetaData` can be released - New unit test `TEST_P(ChargeFileMetadataTestWithParam, Basic)` - db bench (CPU cost of `charge_file_metadata` in write and compact) - write micros/op: -0.24% : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 (remove this option for pre-PR) -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 \| egrep 'fillseq'` - compact micros/op -0.87% : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 -numdistinct=1000 && ./db_bench -benchmarks=compact -db=$TEST_TMPDIR -use_existing_db=1 -charge_file_metadata=1 -disable_auto_compactions=1 \| egrep 'compact'` table 1 - write #-run \| (pre-PR) avg micros/op \| std micros/op \| (post-PR) micros/op \| std micros/op \| change (%) -- \| -- \| -- \| -- \| -- \| -- 10 \| 3.9711 \| 0.264408 \| 3.9914 \| 0.254563 \| 0.5111933721 20 \| 3.83905 \| 0.0664488 \| 3.8251 \| 0.0695456 \| -0.3633711465 40 \| 3.86625 \| 0.136669 \| 3.8867 \| 0.143765 \| 0.5289363078 80 \| 3.87828 \| 0.119007 \| 3.86791 \| 0.115674 \| -0.2673865734 160 \| 3.87677 \| 0.162231 \| 3.86739 \| 0.16663 \| -0.2419539978 table 2 - compact #-run \| (pre-PR) avg micros/op \| std micros/op \| (post-PR) micros/op \| std micros/op \| 
change (%) -- \| -- \| -- \| -- \| -- \| -- 10 \| 2,399,650.00 \| 96,375.80 \| 2,359,537.00 \| 53,243.60 \| -1.67 20 \| 2,410,480.00 \| 89,988.00 \| 2,433,580.00 \| 91,121.20 \| 0.96 40 \| 2.41E+06 \| 121811 \| 2.39E+06 \| 131525 \| -0.96 80 \| 2.40E+06 \| 134503 \| 2.39E+06 \| 108799 \| -0.78 - stress test: `python3 tools/db_crashtest.py blackbox --charge_file_metadata=1 --cache_size=1` killed as normal Reviewed By: ajkr Differential Revision: D36055583 Pulled By: hx235 fbshipit-source-id: b60eab94707103cb1322cf815f05810ef0232625	3 years ago
Guido Tagliavini Ponce	f105e1a501	Make the per-shard hash table fixed-size. (#10154 ) Summary: We make the size of the per-shard hash table fixed. The base level of the hash table is now preallocated with the required capacity. The user must provide an estimate of the size of the values. Notice that even though the base level becomes fixed, the chains are still dynamic. Overall, the shard capacity mechanisms haven't changed, so we don't need to test this. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10154 Test Plan: `make -j24 check` Reviewed By: pdillinger Differential Revision: D37124451 Pulled By: guidotag fbshipit-source-id: cba6ac76052fe0ec60b8ff4211b3de7650e80d0c	3 years ago
Mark Callaghan	04bd347995	Increase num_levels for universal from 8 to 40 (#10158 ) Summary: See https://github.com/facebook/rocksdb/issues/10082 for more details. Trivial move isn't done for universal when compaction is from L0 into L0. So a too small value for num_levels with db_bench means there will be fewer trivial moves with universal and that means that write-amp will increase. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10158 Test Plan: run it Reviewed By: siying Differential Revision: D37122519 Pulled By: mdcallag fbshipit-source-id: 1cb39049676f68a6cc3ea8d105a9965f89d4d09e	3 years ago
Yanqin Jin	1777e5f7e9	Snapshots with user-specified timestamps (#9879 ) Summary: In RocksDB, keys are associated with (internal) sequence numbers which denote when the keys are written to the database. Sequence numbers in different RocksDB instances are unrelated, thus not comparable. It is nice if we can associate sequence numbers with their corresponding actual timestamps. One thing we can do is to support user-defined timestamp, which allows the applications to specify the format of custom timestamps and encode a timestamp with each key. More details can be found at https://github.com/facebook/rocksdb/wiki/User-defined-Timestamp-%28Experimental%29. This PR provides a different but complementary approach. We can associate rocksdb snapshots (defined in https://github.com/facebook/rocksdb/blob/7.2.fb/include/rocksdb/snapshot.h#L20) with user-specified timestamps. Since a snapshot is essentially an object representing a sequence number, this PR establishes a bi-directional mapping between sequence numbers and timestamps. In the past, snapshots are usually taken by readers. The current super-version is grabbed, and a `rocksdb::Snapshot` object is created with the last published sequence number of the super-version. You can see that the reader actually has no good idea of what timestamp to assign to this snapshot, because by the time the `GetSnapshot()` is called, an arbitrarily long period of time may have already elapsed since the last write, which is when the last published sequence number is written. This observation motivates the creation of "timestamped" snapshots on the write path. Currently, this functionality is exposed only to the layer of `TransactionDB`. Application can tell RocksDB to create a snapshot when a transaction commits, effectively associating the last sequence number with a timestamp. It is also assumed that application will ensure any two snapshots with timestamps should satisfy the following: ``` snapshot1.seq < snapshot2.seq iff. 
snapshot1.ts < snapshot2.ts ``` If the application can guarantee that when a reader takes a timestamped snapshot, there are no active writes going on in the database, then we also allow the user to use a new API `TransactionDB::CreateTimestampedSnapshot()` to create a snapshot with associated timestamp. Code example ```cpp // Create a timestamped snapshot when committing transaction. txn->SetCommitTimestamp(100); txn->SetSnapshotOnNextOperation(); txn->Commit(); // A wrapper API for convenience Status Transaction::CommitAndTryCreateSnapshot( std::shared_ptr<TransactionNotifier> notifier, TxnTimestamp ts, std::shared_ptr<const Snapshot>* ret); // Create a timestamped snapshot if caller guarantees no concurrent writes std::pair<Status, std::shared_ptr<const Snapshot>> snapshot = txn_db->CreateTimestampedSnapshot(100); ``` The snapshots created in this way will be managed by RocksDB with ref-counting and potentially shared with other readers. We provide the following APIs for readers to retrieve a snapshot given a timestamp. ```cpp // Return the timestamped snapshot corresponding to given timestamp. If ts is // kMaxTxnTimestamp, then we return the latest timestamped snapshot if present. // Otherwise, we return the snapshot whose timestamp is equal to `ts`. If no // such snapshot exists, then we return null. std::shared_ptr<const Snapshot> TransactionDB::GetTimestampedSnapshot(TxnTimestamp ts) const; // Return the latest timestamped snapshot if present. std::shared_ptr<const Snapshot> TransactionDB::GetLatestTimestampedSnapshot() const; ``` We also provide two additional APIs for stats collection and reporting purposes. ```cpp Status TransactionDB::GetAllTimestampedSnapshots( std::vector<std::shared_ptr<const Snapshot>>& snapshots) const; // Return timestamped snapshots whose timestamps fall in [ts_lb, ts_ub) and store them in `snapshots`. 
Status TransactionDB::GetTimestampedSnapshots( TxnTimestamp ts_lb, TxnTimestamp ts_ub, std::vector<std::shared_ptr<const Snapshot>>& snapshots) const; ``` To prevent the number of timestamped snapshots from growing infinitely, we provide the following API to release timestamped snapshots whose timestamps are older than or equal to a given threshold. ```cpp void TransactionDB::ReleaseTimestampedSnapshotsOlderThan(TxnTimestamp ts); ``` Before shutdown, RocksDB will release all timestamped snapshots. Comparison with user-defined timestamp and how they can be combined: User-defined timestamp persists every key with a timestamp, while timestamped snapshots maintain a volatile mapping between snapshots (sequence numbers) and timestamps. Different internal keys with the same user key but different timestamps will be treated as different by compaction, thus a newer version will not hide older versions (with smaller timestamps) unless they are eligible for garbage collection. In contrast, taking a timestamped snapshot at a certain sequence number and timestamp prevents all the keys visible in this snapshot from being dropped by compaction. Here, visible means (seq < snapshot and most recent). The timestamped snapshot supports the semantics of reading at an exact point in time. Timestamped snapshots can also be used with user-defined timestamp. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9879 Test Plan: ``` make check TEST_TMPDIR=/dev/shm make crash_test_with_txn ``` Reviewed By: siying Differential Revision: D35783919 Pulled By: riversand963 fbshipit-source-id: 586ad905e169189e19d3bfc0cb0177a7239d1bd4	3 years ago
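The mapping between sequence numbers and timestamps described in the commit above can be pictured with a toy in-memory registry. This is a sketch under stated assumptions, not the TransactionDB implementation: `ToySnapshot`, `ToySnapshotRegistry`, and all of its methods are hypothetical names, and the registry enforces the ordering invariant in the simplest way, by accepting only snapshots whose sequence number and timestamp both exceed the latest one stored.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <vector>

using ToyTimestamp = std::uint64_t;

// Hypothetical stand-in for a snapshot: a sequence number plus a timestamp.
struct ToySnapshot {
  std::uint64_t seq;
  ToyTimestamp ts;
};

class ToySnapshotRegistry {
 public:
  // Assumption: snapshots arrive in increasing order, which keeps
  // "seq1 < seq2 iff ts1 < ts2" true for every stored pair.
  bool Add(std::uint64_t seq, ToyTimestamp ts) {
    if (!by_ts_.empty()) {
      const ToySnapshot& last = *by_ts_.rbegin()->second;
      if (ts <= last.ts || seq <= last.seq) {
        return false;
      }
    }
    by_ts_[ts] = std::make_shared<const ToySnapshot>(ToySnapshot{seq, ts});
    return true;
  }

  // Exact-match lookup; returns null when no snapshot has this timestamp.
  std::shared_ptr<const ToySnapshot> Get(ToyTimestamp ts) const {
    auto it = by_ts_.find(ts);
    if (it == by_ts_.end()) {
      return nullptr;
    }
    return it->second;
  }

  // Latest snapshot by timestamp, or null when the registry is empty.
  std::shared_ptr<const ToySnapshot> Latest() const {
    if (by_ts_.empty()) {
      return nullptr;
    }
    return by_ts_.rbegin()->second;
  }

  // Snapshots whose timestamps fall in the half-open range [lb, ub).
  std::vector<std::shared_ptr<const ToySnapshot>> InRange(
      ToyTimestamp lb, ToyTimestamp ub) const {
    assert(lb <= ub);
    std::vector<std::shared_ptr<const ToySnapshot>> out;
    for (auto it = by_ts_.lower_bound(lb);
         it != by_ts_.end() && it->first < ub; ++it) {
      out.push_back(it->second);
    }
    return out;
  }

 private:
  std::map<ToyTimestamp, std::shared_ptr<const ToySnapshot>> by_ts_;
};
```

Keeping the entries in an ordered map keyed by timestamp makes both the exact lookup and the `[ts_lb, ts_ub)` range query straightforward, and the `shared_ptr` values mirror the ref-counted sharing with readers that the commit message describes.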
Akanksha Mahajan	ecfd4aef0c	Enable wal_compression in crash_tests (#10141 ) Summary: Same as title Pull Request resolved: https://github.com/facebook/rocksdb/pull/10141 Test Plan: ``` export CRASH_TEST_EXT_ARGS=" --wal_compression=zstd" make crash_test -j ``` Reviewed By: riversand963 Differential Revision: D37042810 Pulled By: akankshamahajan15 fbshipit-source-id: 53f0793d78241f1b5c954dcc808cb4c0a3e9172a	3 years ago
Mark Callaghan	9efae14428	Fix parsing of db_bench output (#10124 ) Summary: A recent diff added a few more fields to one of the db_bench output lines that gets parsed. This diff updates tools/benchmark.sh to handle that. overwrite : 7.939 micros/op 125963 ops/sec; 50.5 MB/s overwrite : 7.854 micros/op 127320 ops/sec 1800.001 seconds 229176999 operations; 51.0 MB/s Pull Request resolved: https://github.com/facebook/rocksdb/pull/10124 Test Plan: Run it Reviewed By: jay-zhuang Differential Revision: D36945137 Pulled By: mdcallag fbshipit-source-id: 9c96f79491411da997e369a3be9c6b921a21d0fa	3 years ago
Yanqin Jin	f890527b16	Update test for secondary instance in stress test (#10121 ) Summary: This PR updates secondary instance testing in stress test by default. A background thread will be started (disabled by default), running a secondary instance tailing the logs of the primary. Periodically (every 1 sec), this thread calls `TryCatchUpWithPrimary()` and uses point lookup or range scan to read some random keys with only very basic verification to make sure no assertion failure is triggered. Thanks to https://github.com/facebook/rocksdb/issues/10061 , we can enable secondary instance when user-defined timestamp is enabled. Also removed a less useful test configuration, `secondary_catch_up_one_in`. This is very similar to the periodic catch-up. In the last commit, I decided not to enable it now, but just update the tests, since secondary instance does not work well when the underlying file is renamed by primary, e.g. SstFileManager. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10121 Test Plan: ``` TEST_TMPDIR=/dev/shm/rocksdb make crash_test TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_ts TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_atomic_flush ``` Reviewed By: ajkr Differential Revision: D36939458 Pulled By: riversand963 fbshipit-source-id: 1c065b7efc3690fc341569b9d369a5cbd8ef6b3e	3 years ago
Levi Tamasi	7d36bc4273	Fix some bugs in verify_random_db.sh (#10112 ) Summary: The patch attempts to fix three bugs in `verify_random_db.sh`: 1) https://github.com/facebook/rocksdb/pull/9937 changed the default for `--try_load_options` to true in the script's use case, so we have to explicitly set it to false if the corresponding argument of the script is 0. This should fix the issue we've been seeing with our forward compatibility tests where 7.3 is unable to open a database created by the version on main after adding a new configuration option. 2) The script seems to support two "extra parameters"; however, in practice, if the second one was set, only that one was passed on to `ldb`. Now both get forwarded. 3) When running the `diff` command, the base DB directory was passed as the second argument instead of the file containing the `ldb` output (this actually seems to work, probably accidentally though). Pull Request resolved: https://github.com/facebook/rocksdb/pull/10112 Reviewed By: pdillinger Differential Revision: D36911363 Pulled By: ltamasi fbshipit-source-id: fe29db4e28d373cee51a12322c59050fc50e926d	3 years ago
Guido Tagliavini Ponce	cf85607795	Add support for FastLRUCache in db_bench. (#10096 ) Summary: db_bench can now run with FastLRUCache. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10096 Test Plan: - Temporarily add an ``assert(false)`` in the execution path that sets up the FastLRUCache. Run ``make -j24 db_bench``. Then confirm that the appropriate code path is used by running ``./db_bench -cache_type=fast_lru_cache`` and checking that the assert fires. Repeat for LRUCache. - Verify that FastLRUCache (currently a clone of LRUCache) produces benchmark data similar to LRUCache's, by comparing the outputs of ``./db_bench -benchmarks=fillseq,fillrandom,readseq,readrandom -cache_type=fast_lru_cache`` and ``./db_bench -benchmarks=fillseq,fillrandom,readseq,readrandom -cache_type=lru_cache``. Reviewed By: gitbw95 Differential Revision: D36898774 Pulled By: guidotag fbshipit-source-id: f9f6b6f6da124f88b21b3c8dee742fbb04eff773	3 years ago
Yanqin Jin	2b3c50c429	Temporarily disable wal compression (#10108 ) Summary: Will re-enable after fixing the bug in https://github.com/facebook/rocksdb/issues/10099 and https://github.com/facebook/rocksdb/issues/10097. Right now, the priority is https://github.com/facebook/rocksdb/issues/10087, but the bug in WAL compression prevents the mini crash test from passing. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10108 Reviewed By: pdillinger Differential Revision: D36897214 Pulled By: riversand963 fbshipit-source-id: d64dc52738222d5f66003f7731dc46eaeed812be	3 years ago
Mark Callaghan	5506954b1f	Enhance to support more tuning options, and universal and integrated… (#9704 ) Summary: … BlobDB for all tests This does two big things: * provides more tuning options * supports universal and integrated BlobDB for all of the benchmarks that are leveled-only It does several smaller things, and I will list a few * sets l0_slowdown_writes_trigger which wasn't set before this diff. * improves readability in report.tsv by using smaller field names in the header * adds more columns to report.tsv report.tsv before this diff: ``` ops_sec mb_sec total_size_gb level0_size_gb sum_gb write_amplification write_mbps usec_op percentile_50 percentile_75 percentile_99 percentile_99.9 percentile_99.99 uptime stall_time stall_percent test_name test_date rocksdb_version job_id 823294 329.8 0.0 21.5 21.5 1.0 183.4 1.2 1.0 1.0 3 6 14 120 00:00:0.000 0.0 fillseq.wal_disabled.v400 2022-03-16T15:46:45.000-07:00 7.0 326520 130.8 0.0 0.0 0.0 0.0 0 12.2 139.8 155.1 170 234 250 60 00:00:0.000 0.0 multireadrandom.t4 2022-03-16T15:48:47.000-07:00 7.0 86313 345.7 0.0 0.0 0.0 0.0 0 46.3 44.8 50.6 75 84 108 60 00:00:0.000 0.0 revrangewhilewriting.t4 2022-03-16T15:50:48.000-07:00 7.0 101294 405.7 0.0 0.1 0.1 1.0 1.6 39.5 40.4 45.9 64 75 103 62 00:00:0.000 0.0 fwdrangewhilewriting.t4 2022-03-16T15:52:50.000-07:00 7.0 258141 103.4 0.0 0.1 1.2 18.2 19.8 15.5 14.3 18.1 28 34 48 62 00:00:0.000 0.0 readwhilewriting.t4 2022-03-16T15:54:51.000-07:00 7.0 334690 134.1 0.0 7.6 18.7 4.2 308.8 12.0 11.8 13.7 21 30 62 62 00:00:0.000 0.0 overwrite.t4.s0 2022-03-16T15:56:53.000-07:00 7.0 ``` report.tsv with this diff: ``` ops_sec mb_sec lsm_sz blob_sz c_wgb w_amp c_mbps c_wsecs c_csecs b_rgb b_wgb usec_op p50 p99 p99.9 p99.99 pmax uptime stall% Nstall u_cpu s_cpu rss test date version job_id 831144 332.9 22GB 0.0GB, 21.7 1.0 185.1 264 262 0 0 1.2 1.0 3 6 14 9198 120 0.0 0 0.4 0.0 0.7 fillseq.wal_disabled.v400 2022-03-16T16:21:23 7.0 325229 130.3 22GB 0.0GB, 0.0 0.0 0 0 0 0 12.3 139.8 170 
237 249 572 60 0.0 0 0.4 0.1 1.2 multireadrandom.t4 2022-03-16T16:23:25 7.0 312920 125.3 26GB 0.0GB, 11.1 2.6 189.3 115 113 0 0 12.8 11.8 21 34 1255 6442 60 0.2 1 0.7 0.1 0.6 overwritesome.t4.s0 2022-03-16T16:25:27 7.0 81698 327.2 25GB 0.0GB, 0.0 0.0 0 0 0 0 48.9 46.2 79 246 369 9445 60 0.0 0 0.4 0.1 1.4 revrangewhilewriting.t4 2022-03-16T16:30:21 7.0 92484 370.4 25GB 0.0GB, 0.1 1.5 1.1 1 0 0 0 43.2 42.3 75 103 110 9512 62 0.0 0 0.4 0.1 1.4 fwdrangewhilewriting.t4 2022-03-16T16:32:24 7.0 241661 96.8 25GB 0.0GB, 0.1 1.5 1.1 1 0 0 0 16.5 17.1 30 34 49 9092 62 0.0 0 0.4 0.1 1.4 readwhilewriting.t4 2022-03-16T16:34:27 7.0 305234 122.3 30GB 0.0GB, 12.1 2.7 201.7 127 124 0 0 13.1 11.8 21 128 1934 6339 62 0.0 0 0.7 0.1 0.7 overwrite.t4.s0 2022-03-16T16:36:30 7.0 ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/9704 Test Plan: run it Reviewed By: jay-zhuang Differential Revision: D36864627 Pulled By: mdcallag fbshipit-source-id: d5af1cfc258a16865210163fa6fd1b803ab1a7d3	3 years ago
Gang Liao	e6432dfd4c	Make it possible to enable blob files starting from a certain LSM tree level (#10077 ) Summary: Currently, if blob files are enabled (i.e. `enable_blob_files` is true), large values are extracted both during flush/recovery (when SST files are written into level 0 of the LSM tree) and during compaction into any LSM tree level. For certain use cases that have a mix of short-lived and long-lived values, it might make sense to support extracting large values only during compactions whose output level is greater than or equal to a specified LSM tree level (e.g. compactions into L1/L2/... or above). This could reduce the space amplification caused by large values that are turned into garbage shortly after being written at the price of some write amplification incurred by long-lived values whose extraction to blob files is delayed. In order to achieve this, we would like to do the following: - Add a new configuration option `blob_file_starting_level` (default: 0) to `AdvancedColumnFamilyOptions` (and `MutableCFOptions` and extend the related logic) - Instantiate `BlobFileBuilder` in `BuildTable` (used during flush and recovery, where the LSM tree level is L0) and `CompactionJob` iff `enable_blob_files` is set and the LSM tree level is `>= blob_file_starting_level` - Add unit tests for the new functionality, and add the new option to our stress tests (`db_stress` and `db_crashtest.py` ) - Add the new option to our benchmarking tool `db_bench` and the BlobDB benchmark script `run_blob_bench.sh` - Add the new option to the `ldb` tool (see https://github.com/facebook/rocksdb/wiki/Administration-and-Data-Access-Tool) - Ideally extend the C and Java bindings with the new option - Update the BlobDB wiki to document the new option. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10077 Reviewed By: ltamasi Differential Revision: D36884156 Pulled By: gangliao fbshipit-source-id: 942bab025f04633edca8564ed64791cb5e31627d	3 years ago
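The condition stated in the commit above (extract large values only when `enable_blob_files` is set and the output level is `>= blob_file_starting_level`) can be illustrated with a small predicate. This is only a sketch of the rule as written; the function name `should_extract_to_blob` is hypothetical and the real check lives in the C++ code paths the commit names.

```shell
# Hypothetical predicate: succeeds when large values would go to blob files.
# $1 = enable_blob_files (true/false)
# $2 = LSM tree level being written (0 for flush and recovery)
# $3 = blob_file_starting_level
should_extract_to_blob() {
  [ "$1" = "true" ] && [ "$2" -ge "$3" ]
}
```

With `blob_file_starting_level` set to 1, a flush (level 0) would keep large values in the SST file, while a compaction into L1 or deeper would extract them.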
Zichen Zhu	65893ad959	Explicitly closing all directory file descriptors (#10049 ) Summary: Currently, the DB directory file descriptor is left open until the DB object is destroyed (`DB::Close()` does not close the file descriptor). To verify this, comment out the lines between `db_ = nullptr` and `db_->Close()` (lines 512-515 in ldb_cmd.cc) to leak the `db_` object, build the `ldb` tool and run ``` strace --trace=open,openat,close ./ldb --db=$TEST_TMPDIR --ignore_unknown_options put K1 V1 --create_if_missing ``` There is one directory file descriptor that is not closed in the strace log. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10049 Test Plan: Add a new unit test DBBasicTest.DBCloseAllDirectoryFDs: Open a database with a different WAL directory and three different data directories, and all directory file descriptors should be closed after calling Close(). Explicitly call Close() once a directory file descriptor is no longer used, so that the counts of directory opens and closes match. Reviewed By: ajkr, hx235 Differential Revision: D36722135 Pulled By: littlepig2013 fbshipit-source-id: 07bdc2abc417c6b30997b9bbef1f79aa757b21ff	3 years ago
Guido Tagliavini Ponce	b4d0e041d0	Add support for FastLRUCache in stress and crash tests. (#10081 ) Summary: Stress tests can run with the experimental FastLRUCache. Crash tests randomly choose between LRUCache and FastLRUCache. Since only LRUCache supports a secondary cache, we validate the `--secondary_cache_uri` and `--cache_type` flags: when `--secondary_cache_uri` is set, the `--cache_type` is set to `lru_cache`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10081 Test Plan: - To test that the FastLRUCache is used and the stress test runs successfully, run `make -j24 CRASH_TEST_EXT_ARGS=--duration=960 blackbox_crash_test_with_atomic_flush`. The cache type should sometimes be `fast_lru_cache`. - To test the flag validation, run `make -j24 CRASH_TEST_EXT_ARGS="--duration=960 --secondary_cache_uri=x" blackbox_crash_test_with_atomic_flush` multiple times. The test will always be aborted (which is okay). Check that the cache type is always `lru_cache`. Reviewed By: anand1976 Differential Revision: D36839908 Pulled By: guidotag fbshipit-source-id: ebcdfdcd12ec04c96c09ae5b9c9d1e613bdd1725	3 years ago
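The flag validation in the commit above amounts to one override rule. A minimal sketch, assuming a hypothetical helper `resolve_cache_type` (the actual validation is done by the crash test tooling, not by this function):

```shell
# Hypothetical helper: prints the cache type that would actually be used.
# $1 = requested --cache_type value
# $2 = --secondary_cache_uri value (empty when unset)
resolve_cache_type() {
  if [ -n "$2" ]; then
    # Only LRUCache supports a secondary cache, so force it.
    echo "lru_cache"
  else
    echo "$1"
  fi
}
```

For example, `resolve_cache_type fast_lru_cache x` yields `lru_cache`, which matches the test plan's expectation that the cache type is always `lru_cache` when a secondary cache URI is given.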
Andrew Kryczka	f6e45382e9	Disable file ingestion in crash test for CF consistency (#10067 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10067 Reviewed By: jay-zhuang Differential Revision: D36727948 Pulled By: ajkr fbshipit-source-id: a3502730412c01ba63d822a5d4bf56f8bae8fcb2	3 years ago
Andrew Kryczka	91ba7837b7	Enable IngestExternalFile() in crash test (#9357 ) Summary: Thanks to https://github.com/facebook/rocksdb/issues/9919 and https://github.com/facebook/rocksdb/issues/10051 the known bugs in file ingestion (besides mmap read + file checksum) are fixed. Now we can try again to enable file ingestion in crash test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9357 Test Plan: stress file ingestion heavily for an hour: `$ TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox --max_key=1000000 --ingest_external_file_one_in=100 --duration=3600 --interval=20 --write_buffer_size=524288 --target_file_size_base=524288 --max_bytes_for_level_base=2097152` Reviewed By: riversand963 Differential Revision: D33410746 Pulled By: ajkr fbshipit-source-id: d276431390995a67f68390d61c06a40945fdd280	3 years ago
Peter Dillinger	bd170dda03	Abort RocksDB performance regression test on failure in test setup (#10053 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10053 Need to exit if ldb command fails, to avoid running db_bench on empty/bad DB and considering the results valid. Reviewed By: jay-zhuang Differential Revision: D36673200 fbshipit-source-id: e0d78a0d397e0e335d82d9349bfd612d38ffb552	3 years ago
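The idea in the commit above (stop as soon as a setup command fails, so that db_bench never runs against an empty or bad DB) can be sketched with a small wrapper. The name `run_or_abort` is hypothetical and this is not the regression script's actual code.

```shell
# Hypothetical wrapper: run a setup command and stop the script if it fails.
run_or_abort() {
  if ! "$@"; then
    echo "setup step failed: $*" >&2
    exit 1
  fi
}
```

Wrapping the `ldb` invocation in such a helper before starting db_bench means a failed setup ends the run instead of producing results that look valid.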
Yanqin Jin	9901e7f681	Enable checkpoint and backup in db_stress when timestamp is enabled (#10047 ) Summary: After https://github.com/facebook/rocksdb/issues/10030 and https://github.com/facebook/rocksdb/issues/10004, we can enable checkpoint and backup in stress tests when user-defined timestamp is enabled. This PR has no production risk. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10047 Test Plan: ``` TEST_TMPDIR=/dev/shm make crash_test_with_ts ``` Reviewed By: jowlyzhang Differential Revision: D36641565 Pulled By: riversand963 fbshipit-source-id: d86c9d87efcc34c32d1aa176af691d32b897644a	3 years ago
Changyu Bi	8515bd50c9	Support read rate-limiting in SequentialFileReader (#9973 ) Summary: Added rate limiter and read rate-limiting support to SequentialFileReader. I've updated call sites to SequentialFileReader::Read with appropriate IO priority (or left a TODO and specified IO_TOTAL for now). The PR is separated into four commits: the first one added the rate-limiting support, but with some fixes in the unit test since the number of bytes requested from the rate limiter in SequentialFileReader is not accurate (there is overcharge at EOF). The second commit fixed this by allowing SequentialFileReader to check file size and determine how many bytes are left in the file to read. The third commit added benchmark related code. The fourth commit moved the logic of using file size to avoid overcharging the rate limiter into backup engine (the main user of SequentialFileReader). Pull Request resolved: https://github.com/facebook/rocksdb/pull/9973 Test Plan: - `make check`, backup_engine_test covers usage of SequentialFileReader with rate limiter. - Run db_bench to check if rate limiting is throttling as expected: Verified that reads and writes are together throttled at 2MB/s, and at 0.2MB chunks that are 100ms apart. - Set up: `./db_bench --benchmarks=fillrandom -db=/dev/shm/test_rocksdb` - Benchmark: ``` strace -ttfe read,write ./db_bench --benchmarks=backup -db=/dev/shm/test_rocksdb --backup_rate_limit=2097152 --use_existing_db strace -ttfe read,write ./db_bench --benchmarks=restore -db=/dev/shm/test_rocksdb --restore_rate_limit=2097152 --use_existing_db ``` - db bench on backup and restore to ensure no performance regression. 
- backup (avg over 50 runs): pre-change: 1.90443e+06 micros/op; post-change: 1.8993e+06 micros/op (improve by 0.2%) - restore (avg over 50 runs): pre-change: 1.79105e+06 micros/op; post-change: 1.78192e+06 micros/op (improve by 0.5%) ``` # Set up ./db_bench --benchmarks=fillrandom -db=/tmp/test_rocksdb -num=10000000 # benchmark TEST_TMPDIR=/tmp/test_rocksdb NUM_RUN=50 for ((j=0;j<$NUM_RUN;j++)) do ./db_bench -db=$TEST_TMPDIR -num=10000000 -benchmarks=backup -use_existing_db \| egrep 'backup' # Restore #./db_bench -db=$TEST_TMPDIR -num=10000000 -benchmarks=restore -use_existing_db done > rate_limit.txt && awk -v NUM_RUN=$NUM_RUN '{sum+=$3;sum_sqrt+=$3^2}END{print sum/NUM_RUN, sqrt(sum_sqrt/NUM_RUN-(sum/NUM_RUN)^2)}' rate_limit.txt >> rate_limit_2.txt ``` Reviewed By: hx235 Differential Revision: D36327418 Pulled By: cbi42 fbshipit-source-id: e75d4307cff815945482df5ba630c1e88d064691	3 years ago
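The awk one-liner in the commit above computes the mean and the population standard deviation of the third column. The sketch below shows the same arithmetic on made-up input; the function name `mean_std` is hypothetical, the sample values are not benchmark results, and it uses `NR` in place of `NUM_RUN` on the assumption that every run produced exactly one line.

```shell
# Sketch: mean and population standard deviation of column 3 read from stdin.
# Assumes one result line per run, so NR stands in for NUM_RUN.
mean_std() {
  awk '{ sum += $3; sum_sq += $3 ^ 2 }
       END { mean = sum / NR; print mean, sqrt(sum_sq / NR - mean ^ 2) }'
}

# Eight synthetic lines shaped like "backup : <value> micros/op".
printf 'backup : %s micros/op\n' 2 4 4 4 5 5 7 9 | mean_std   # prints "5 2"
```

The standard deviation uses the identity Var(x) = E[x^2] - (E[x])^2, which is why the script accumulates the sum of squares alongside the sum.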
Levi Tamasi	253ae017fa	Update version on main to 7.4 and add 7.3 to the format compatibility checks (#10038 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10038 Reviewed By: riversand963 Differential Revision: D36604533 Pulled By: ltamasi fbshipit-source-id: 54ccd0a4b32a320b5640a658ea6846ee897065d1	3 years ago
Changyu Bi	cc23b46da1	Support using ZDICT_finalizeDictionary to generate zstd dictionary (#9857 ) Summary: An untrained dictionary is currently simply the concatenation of several samples. The ZSTD API, ZDICT_finalizeDictionary(), can improve such a dictionary's effectiveness at low cost. This PR changes how dictionary is created by calling the ZSTD ZDICT_finalizeDictionary() API instead of creating raw content dictionary (when max_dict_buffer_bytes > 0), and pass in all buffered uncompressed data blocks as samples. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9857 Test Plan: #### db_bench test for cpu/memory of compression+decompression and space saving on synthetic data: Set up: change the parameter [here](`fb9a167a55/tools/db_bench_tool.cc (L1766)`) to 16384 to make synthetic data more compressible. ``` # linked local ZSTD with version 1.5.2 # DEBUG_LEVEL=0 ROCKSDB_NO_FBCODE=1 ROCKSDB_DISABLE_ZSTD=1 EXTRA_CXXFLAGS="-DZSTD_STATIC_LINKING_ONLY -DZSTD -I/data/users/changyubi/install/include/" EXTRA_LDFLAGS="-L/data/users/changyubi/install/lib/ -l:libzstd.a" make -j32 db_bench dict_bytes=16384 train_bytes=1048576 echo "========== No Dictionary ==========" TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=0 -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1 TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=0 -block_size=4096 2>&1 \| grep elapsed du -hc /dev/shm/dbbench/sst \| grep total echo "========== Raw Content Dictionary ==========" TEST_TMPDIR=/dev/shm ./db_bench_main -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -block_size=4096 -max_background_jobs=24 -memtablerep=vector 
-allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1 TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench_main -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -block_size=4096 2>&1 \| grep elapsed du -hc /dev/shm/dbbench/sst \| grep total echo "========== FinalizeDictionary ==========" TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1 TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false -block_size=4096 2>&1 \| grep elapsed du -hc /dev/shm/dbbench/sst \| grep total echo "========== TrainDictionary ==========" TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=filluniquerandom,compact -num=10000000 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -block_size=4096 -max_background_jobs=24 -memtablerep=vector -allow_concurrent_memtable_write=false -disable_wal=true -max_write_buffer_number=8 >/dev/null 2>&1 TEST_TMPDIR=/dev/shm /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -block_size=4096 2>&1 \| grep elapsed du -hc /dev/shm/dbbench/sst \| grep total # Result: TrainDictionary is much better on space saving, but FinalizeDictionary seems to use less memory. 
# before compression data size: 1.2GB dict_bytes=16384 max_dict_buffer_bytes = 1048576 space cpu/memory No Dictionary 468M 14.93user 1.00system 0:15.92elapsed 100%CPU (0avgtext+0avgdata 23904maxresident)k Raw Dictionary 251M 15.81user 0.80system 0:16.56elapsed 100%CPU (0avgtext+0avgdata 156808maxresident)k FinalizeDictionary 236M 11.93user 0.64system 0:12.56elapsed 100%CPU (0avgtext+0avgdata 89548maxresident)k TrainDictionary 84M 7.29user 0.45system 0:07.75elapsed 100%CPU (0avgtext+0avgdata 97288maxresident)k ``` #### Benchmark on 10 sample SST files for spacing saving and CPU time on compression: FinalizeDictionary is comparable to TrainDictionary in terms of space saving, and takes less time in compression. ``` dict_bytes=16384 train_bytes=1048576 for sst_file in `ls ../temp/myrock-sst/` do echo "******** $sst_file ********" echo "========== No Dictionary ==========" ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD echo "========== Raw Content Dictionary ==========" ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD --compression_max_dict_bytes=$dict_bytes echo "========== FinalizeDictionary ==========" ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD --compression_max_dict_bytes=$dict_bytes --compression_zstd_max_train_bytes=$train_bytes --compression_use_zstd_finalize_dict echo "========== TrainDictionary ==========" ./sst_dump --file="../temp/myrock-sst/$sst_file" --command=recompress --compression_level_from=6 --compression_level_to=6 --compression_types=kZSTD --compression_max_dict_bytes=$dict_bytes --compression_zstd_max_train_bytes=$train_bytes done 010240.sst (Size/Time) 011029.sst 013184.sst 021552.sst 185054.sst 185137.sst 191666.sst 7560381.sst 7604174.sst 
7635312.sst No Dictionary 28165569 / 2614419 32899411 / 2976832 32977848 / 3055542 31966329 / 2004590 33614351 / 1755877 33429029 / 1717042 33611933 / 1776936 33634045 / 2771417 33789721 / 2205414 33592194 / 388254 Raw Content Dictionary 28019950 / 2697961 33748665 / 3572422 33896373 / 3534701 26418431 / 2259658 28560825 / 1839168 28455030 / 1846039 28494319 / 1861349 32391599 / 3095649 33772142 / 2407843 33592230 / 474523 FinalizeDictionary 27896012 / 2650029 33763886 / 3719427 33904283 / 3552793 26008225 / 2198033 28111872 / 1869530 28014374 / 1789771 28047706 / 1848300 32296254 / 3204027 33698698 / 2381468 33592344 / 517433 TrainDictionary 28046089 / 2740037 33706480 / 3679019 33885741 / 3629351 25087123 / 2204558 27194353 / 1970207 27234229 / 1896811 27166710 / 1903119 32011041 / 3322315 32730692 / 2406146 33608631 / 570593 ``` #### Decompression/Read test: With FinalizeDictionary/TrainDictionary, some data structures used for decompression are stored in the dictionary, so they are expected to be faster in terms of decompression/reads. 
``` dict_bytes=16384 train_bytes=1048576 echo "No Dictionary" TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=0 > /dev/null 2>&1 TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=0 2>&1 \| grep MB/s echo "Raw Dictionary" TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes > /dev/null 2>&1 TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes 2>&1 \| grep MB/s echo "FinalizeDict" TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false > /dev/null 2>&1 TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes -compression_use_zstd_dict_trainer=false 2>&1 \| grep MB/s echo "Train Dictionary" TEST_TMPDIR=/dev/shm/ ./db_bench -benchmarks=filluniquerandom,compact -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes > /dev/null 2>&1 TEST_TMPDIR=/dev/shm/ ./db_bench -use_existing_db=true -benchmarks=readrandom -cache_size=0 -compression_type=zstd -compression_max_dict_bytes=$dict_bytes -compression_zstd_max_train_bytes=$train_bytes 2>&1 \| grep MB/s No Dictionary readrandom : 12.183 micros/op 82082 ops/sec 12.183 seconds 1000000 operations; 9.1 MB/s (1000000 of 1000000 found) Raw Dictionary readrandom : 12.314 micros/op 81205 ops/sec 12.314 seconds 1000000 operations; 9.0 MB/s (1000000 of 1000000 found) FinalizeDict readrandom : 9.787 micros/op 102180 ops/sec 9.787 seconds 
1000000 operations; 11.3 MB/s (1000000 of 1000000 found) Train Dictionary readrandom : 9.698 micros/op 103108 ops/sec 9.699 seconds 1000000 operations; 11.4 MB/s (1000000 of 1000000 found) ``` Reviewed By: ajkr Differential Revision: D35720026 Pulled By: cbi42 fbshipit-source-id: 24d230fdff0fd28a1bb650658798f00dfcfb2a1f	3 years ago
Peter Dillinger	280b9f371a	Fix auto_prefix_mode performance with partitioned filters (#10012 ) Summary: Essentially refactored the RangeMayExist implementation in FullFilterBlockReader to FilterBlockReaderCommon so that it applies to partitioned filters as well. (The function is not called for the block-based filter case.) RangeMayExist is essentially a series of checks around a possible PrefixMayExist, and I'm confident those checks should be the same for partitioned as for full filters. (I think it's likely that bugs remain in those checks, but this change is overall a simplifying one.) Added auto_prefix_mode support to db_bench Other small fixes as well Fixes https://github.com/facebook/rocksdb/issues/10003 Pull Request resolved: https://github.com/facebook/rocksdb/pull/10012 Test Plan: Expanded unit test that uses statistics to check for filter optimization, fails without the production code changes here Performance: populate two DBs with ``` TEST_TMPDIR=/dev/shm/rocksdb_nonpartitioned ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8 TEST_TMPDIR=/dev/shm/rocksdb_partitioned ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=30000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8 -partition_index_and_filters ``` Observe no measurable change in non-partitioned performance ``` TEST_TMPDIR=/dev/shm/rocksdb_nonpartitioned ./db_bench -benchmarks=seekrandom[-X1000] -num=10000000 -readonly -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8 -auto_prefix_mode -cache_index_and_filter_blocks=1 -cache_size=1000000000 -duration 20 ``` Before: seekrandom [AVG 15 runs] : 11798 (± 331) ops/sec After: seekrandom 
[AVG 15 runs] : 11724 (± 315) ops/sec Observe big improvement with partitioned (also supported by bloom use statistics) ``` TEST_TMPDIR=/dev/shm/rocksdb_partitioned ./db_bench -benchmarks=seekrandom[-X1000] -num=10000000 -readonly -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=8 -partition_index_and_filters -auto_prefix_mode -cache_index_and_filter_blocks=1 -cache_size=1000000000 -duration 20 ``` Before: seekrandom [AVG 12 runs] : 2942 (± 57) ops/sec After: seekrandom [AVG 12 runs] : 7489 (± 184) ops/sec Reviewed By: siying Differential Revision: D36469796 Pulled By: pdillinger fbshipit-source-id: bcf1e2a68d347b32adb2b27384f945434e7a266d	3 years ago
Jay Zhuang	c6d326d3d7	Track SST unique id in MANIFEST and verify (#9990 ) Summary: Start tracking the SST unique id in MANIFEST, which is verified against the SST properties to make sure the SST file has not been overwritten or misplaced. A DB option `try_verify_sst_unique_id` (default: false) is introduced to enable/disable the verification. If enabled, it opens all SST files during DB-open to read the unique_id from table properties, so it's recommended to use it with `max_open_files = -1` to pre-open the files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9990 Test Plan: unittests, format-compatible test, mini-crash Reviewed By: anand1976 Differential Revision: D36381863 Pulled By: jay-zhuang fbshipit-source-id: 89ea2eb6b35ed3e80ead9c724eb096083eaba63f	3 years ago
Hui Xiao	3573558ec5	Rewrite memory-charging feature's option API (#9926 ) Summary: Context: Previous PR https://github.com/facebook/rocksdb/pull/9748, https://github.com/facebook/rocksdb/pull/9073, https://github.com/facebook/rocksdb/pull/8428 added separate flag for each charged memory area. Such API design is not scalable as we charge more and more memory areas. Also, we foresee an opportunity to consolidate this feature with other cache usage related features such as `cache_index_and_filter_blocks` using `CacheEntryRole`. Therefore we decided to consolidate all these flags with `CacheUsageOptions cache_usage_options` and this PR serves as the first step by consolidating memory-charging related flags. Summary: - Replaced old API reference with new ones, including making `kCompressionDictionaryBuildingBuffer` opt-out and added a unit test for that - Added missing db bench/stress test for some memory charging features - Renamed related test suite to indicate they are under the same theme of memory charging - Refactored a commonly used mocked cache component in memory charging related tests to reduce code duplication - Replaced the phrases "memory tracking" / "cache reservation" (other than CacheReservationManager-related ones) with "memory charging" for standard description of this feature. 
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9926 Test Plan: - New unit test for opt-out `kCompressionDictionaryBuildingBuffer` `TEST_F(ChargeCompressionDictionaryBuildingBufferTest, Basic)` - New unit test for option validation/sanitization `TEST_F(CacheUsageOptionsOverridesTest, SanitizeAndValidateOptions)` - CI - db bench (in case querying new options introduces regression) +0.5% micros/op: `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_compression_dictionary_building_buffer=1(remove this for comparison) -compression_max_dict_bytes=10000 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 \| egrep 'fillseq'` #-run \| (pre-PR) avg micros/op \| std micros/op \| (post-PR) micros/op \| std micros/op \| change (%) -- \| -- \| -- \| -- \| -- \| -- 10 \| 3.9711 \| 0.264408 \| 3.9914 \| 0.254563 \| 0.5111933721 20 \| 3.83905 \| 0.0664488 \| 3.8251 \| 0.0695456 \| -0.3633711465 40 \| 3.86625 \| 0.136669 \| 3.8867 \| 0.143765 \| 0.5289363078 - db_stress: `python3 tools/db_crashtest.py blackbox -charge_compression_dictionary_building_buffer=1 -charge_filter_construction=1 -charge_table_reader=1 -cache_size=1` killed as normal Reviewed By: ajkr Differential Revision: D36054712 Pulled By: hx235 fbshipit-source-id: d406e90f5e0c5ea4dbcb585a484ad9302d4302af	3 years ago
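The "change (%)" column in the db bench table of the commit above is the relative difference between the post-PR and pre-PR averages. A small sketch of that arithmetic, using a hypothetical helper `pct_change` and the first row of the table as input:

```shell
# Hypothetical helper: relative change of post versus pre, in percent.
# $1 = pre-PR avg micros/op, $2 = post-PR avg micros/op
pct_change() {
  awk -v pre="$1" -v post="$2" 'BEGIN { printf "%.4f\n", (post - pre) / pre * 100 }'
}

pct_change 3.9711 3.9914   # prints 0.5112, matching the table's 0.5111933721
```

Because the metric is micros/op, a positive value means the post-PR run was slower and a negative value means it was faster.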
Yanqin Jin	f6d9730ea1	Fix stress test with best-efforts-recovery (#9986 ) Summary: This PR makes three changes: - Since we are testing with disable_wal = true and best_efforts_recovery, set the column family count to 1, as required by the `ExpectedState` tracking and replaying logic. - During backup and checkpoint restore, disable best-efforts-recovery. This does not matter now because db_crashtest.py always disables the WAL when testing best-efforts-recovery. In the future, if we enable the WAL, then not setting `restore_options.best_efforts_recovery` will cause the backup DB not to recover the WALs, and thus to differ from the DB (which enables the WAL). - During verification of backup and checkpoint restore, print the key where the expected state and the DB are inconsistent. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9986 Test Plan: TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_best_efforts_recovery Reviewed By: siying Differential Revision: D36353105 Pulled By: riversand963 fbshipit-source-id: a484da161273e6216a1f7e245bac15a349693917	3 years ago
Andrew Kryczka	e943bbdd2f	Temporarily disable sync_fault_injection (#9979 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9979 Reviewed By: siying Differential Revision: D36301555 Pulled By: ajkr fbshipit-source-id: ed298d3484b6aad3ef19746e984bf4c52be33a9f	3 years ago
yaphet	26768edb65	Support single delete in ldb (#9469 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9469 Reviewed By: riversand963 Differential Revision: D33953484 fbshipit-source-id: f4e84a2d9865957d744c7e84ff02ffbb0a62b0a8	3 years ago
Peter Dillinger	c5c58708db	Fix format_compatible blowing away its TEST_TMPDIR (#9970 ) Summary: https://github.com/facebook/rocksdb/issues/9961 broke format_compatible check because of `make clean` referencing TEST_TMPDIR. The Makefile behavior seems reasonable to me, so here's a fix in check_format_compatible.sh Apparently I also included removing a redundant part of our CircleCI config. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9970 Test Plan: manual run: SHORT_TEST=1 ./tools/check_format_compatible.sh Reviewed By: riversand963 Differential Revision: D36258172 Pulled By: pdillinger fbshipit-source-id: d46507f04614e888b414ff23b88d040ae2b5c294	3 years ago
sdong	736a7b5433	Remove own ToString() (#9955 ) Summary: ToString() was created because some platforms don't support std::to_string(). However, we've already used std::to_string() by mistake for 16 months (in db/db_info_dumper.cc). This commit just removes ToString(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/9955 Test Plan: Watch CI tests Reviewed By: riversand963 Differential Revision: D36176799 fbshipit-source-id: bdb6dcd0e3a3ab96a1ac810f5d0188f684064471	3 years ago
Andrew Kryczka	62d84e2a2b	db_stress fault injection in release mode (#9957 ) Summary: Previously all fault injection was ignored in release mode. This PR adds it back except for read fault injection (`--read_fault_one_in > 0`) since its dependency (`IGNORE_STATUS_IF_ERROR`) is unavailable in release mode. Other notable changes include: - Moved `EnableWriteErrorInjection()` for `--write_fault_one_in > 0` so it's after `DB::Open()` without depending on `SyncPoint` - Made `--read_fault_one_in > 0` return an error in release mode - Updated `db_crashtest.py` to always set `--read_fault_one_in=0` in release mode Pull Request resolved: https://github.com/facebook/rocksdb/pull/9957 Test Plan: ``` $ DEBUG_LEVEL=0 make -j24 db_stress $ DEBUG_LEVEL=0 TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox ``` Reviewed By: anand1976 Differential Revision: D36193830 Pulled By: ajkr fbshipit-source-id: 0b97946b4e3f06e3e0f6e7833c2763da08ec5321	3 years ago

1340 Commits (2acbf386a38421760d73a62cfcf2a66bfaf8d711)