rocksdb

Commit Graph

Author	SHA1	Message	Date
Xinyu Zeng	8b74cea7fe	Reduce comparator objects init cost in BlockIter (#9611 ) Summary: This PR solves the problem discussed in https://github.com/facebook/rocksdb/issues/7149. By storing the pointer of InternalKeyComparator as icmp_ in BlockIter, the object size remains the same. And for each call to CompareCurrentKey, there is no need to create Comparator objects. One can use icmp_ directly or use the "user_comparator" from the icmp_. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9611 Test Plan: with https://github.com/facebook/rocksdb/issues/9903, ``` $ TEST_TMPDIR=/dev/shm python3.6 ../benchmark/tools/compare.py benchmarks ./db_basic_bench ../rocksdb-pr9611/db_basic_bench --benchmark_filter=DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:0/negative_query:0/enable_filter:0/mmap:1/iterations:262144/threads:1 --benchmark_repetitions=50 ... Comparing ./db_basic_bench to ../rocksdb-pr9611/db_basic_bench Benchmark Time CPU Time Old Time New CPU Old CPU New ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ... DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:0/negative_query:0/enable_filter:0/mmap:1/iterations:262144/threads:1_pvalue 0.0001 0.0001 U Test, Repetitions: 50 vs 50 DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:0/negative_query:0/enable_filter:0/mmap:1/iterations:262144/threads:1_mean -0.0483 -0.0483 3924 3734 3924 3734 DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:0/negative_query:0/enable_filter:0/mmap:1/iterations:262144/threads:1_median -0.0713 -0.0713 3971 3687 3970 3687 DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:0/negative_query:0/enable_filter:0/mmap:1/iterations:262144/threads:1_stddev -0.0342 -0.0344 225 217 225 217 DBGet/comp_style:0/max_data:134217728/per_key_size:256/enable_statistics:0/negative_query:0/enable_filter:0/mmap:1/iterations:262144/threads:1_cv +0.0148 +0.0146 0 0 0 0 OVERALL_GEOMEAN -0.0483 -0.0483 0 0 0 0 ``` Reviewed By: akankshamahajan15 Differential Revision: D35882037 Pulled By: ajkr fbshipit-source-id: 9e5337bbad8f1239dff7aa9f6549020d599bfcdf	4 years ago
Siying Dong	b82edffc7b	Improve comments to options.allow_mmap_reads (#9936 ) Summary: It confused users and use that with options.allow_mmap_reads = true, CPU is high with checksum verification. Add a comment to explain it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9936 Reviewed By: anand1976 Differential Revision: D36106529 fbshipit-source-id: 3d723bd686f96a84c694c8b2d91ad28d9ccfd979	4 years ago
Andrew Kryczka	440c7f6306	db_basic_bench fix for DB object cleanup (#9939 ) Summary: Use `unique_ptr<DB>` to make sure the DB object is deleted. Previously it was not, which led to accumulating file descriptors for deleted directories because a `DBImpl::db_dir_` from each test remained alive. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9939 Test Plan: run `lsof -p $(pidof db_basic_bench)` while benchmark runs; verify no FDs for deleted directories. Reviewed By: jay-zhuang Differential Revision: D36108761 Pulled By: ajkr fbshipit-source-id: cfe02646b038a445af7d5db8989eb1f40d658359	4 years ago
Peter Dillinger	bb87164db3	Fork and simplify LRUCache for developing enhancements (#9917 ) Summary: To support a project to prototype and evaluate algorithmic enhancments and alternatives to LRUCache, here I have separated out LRUCache into internal-only "FastLRUCache" and cut it down to essentials, so that details like secondary cache handling and priorities do not interfere with prototyping. These can be re-integrated later as needed, along with refactoring to minimize code duplication (which would slow down prototyping for now). Pull Request resolved: https://github.com/facebook/rocksdb/pull/9917 Test Plan: unit tests updated to ensure basic functionality has (likely) been preserved Reviewed By: anand1976 Differential Revision: D35995554 Pulled By: pdillinger fbshipit-source-id: d67b20b7ada3b5d3bfe56d897a73885894a1d9db	4 years ago
Peter Dillinger	4b9a1a2f56	Fix db_crashtest.py call inconsistency in crash_test.mk (#9935 ) Summary: Some tests crashing because not using custom DB_STRESS_CMD Pull Request resolved: https://github.com/facebook/rocksdb/pull/9935 Test Plan: internal tests Reviewed By: riversand963 Differential Revision: D36104347 Pulled By: pdillinger fbshipit-source-id: 23f080704a124174203f54ffd85578c2047effe5	4 years ago
Mark Callaghan	b6ec3328af	Make --benchmarks=flush flush the default column family (#9887 ) Summary: db_bench --benchmarks=flush wasn't flushing the default column family. This is for https://github.com/facebook/rocksdb/issues/9880 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9887 Test Plan: Confirm that flush works (.log is empty) when "flush" added to benchmark list Confirm that .log is not empty otherwise. Repeat for all combinations for: uses column families, uses multiple databases ./db_bench --benchmarks=overwrite --num=10000 ls -lrt /tmp/rocksdbtest-2260/dbbench/.log -rw-r--r-- 1 me users 1380286 Apr 21 10:47 /tmp/rocksdbtest-2260/dbbench/000004.log ./db_bench --benchmarks=overwrite,flush --num=10000 ls -lrt /tmp/rocksdbtest-2260/dbbench/.log -rw-r--r-- 1 me users 0 Apr 21 10:48 /tmp/rocksdbtest-2260/dbbench/000008.log ./db_bench --benchmarks=overwrite --num=10000 --num_column_families=4 ls -lrt /tmp/rocksdbtest-2260/dbbench/.log -rw-r--r-- 1 me users 1387823 Apr 21 10:49 /tmp/rocksdbtest-2260/dbbench/000004.log ./db_bench --benchmarks=overwrite,flush --num=10000 --num_column_families=4 ls -lrt /tmp/rocksdbtest-2260/dbbench/.log -rw-r--r-- 1 me users 0 Apr 21 10:51 /tmp/rocksdbtest-2260/dbbench/000014.log ./db_bench --benchmarks=overwrite --num=10000 --num_multi_db=2 ls -lrt /tmp/rocksdbtest-2260/dbbench/[01]/.log -rw-r--r-- 1 me users 1380838 Apr 21 10:55 /tmp/rocksdbtest-2260/dbbench/0/000004.log -rw-r--r-- 1 me users 1379734 Apr 21 10:55 /tmp/rocksdbtest-2260/dbbench/1/000004.log ./db_bench --benchmarks=overwrite,flush --num=10000 --num_multi_db=2 ls -lrt /tmp/rocksdbtest-2260/dbbench/[01]/.log -rw-r--r-- 1 me users 0 Apr 21 10:57 /tmp/rocksdbtest-2260/dbbench/0/000013.log -rw-r--r-- 1 me users 0 Apr 21 10:57 /tmp/rocksdbtest-2260/dbbench/1/000013.log ./db_bench --benchmarks=overwrite --num=10000 --num_column_families=4 --num_multi_db=2 ls -lrt /tmp/rocksdbtest-2260/dbbench/[01]/.log -rw-r--r-- 1 me users 1395108 Apr 21 10:52 /tmp/rocksdbtest-2260/dbbench/1/000004.log -rw-r--r-- 1 me users 1380411 Apr 21 10:52 /tmp/rocksdbtest-2260/dbbench/0/000004.log ./db_bench --benchmarks=overwrite,flush --num=10000 --num_column_families=4 --num_multi_db=2 ls -lrt /tmp/rocksdbtest-2260/dbbench/[01]/.log -rw-r--r-- 1 me users 0 Apr 21 10:54 /tmp/rocksdbtest-2260/dbbench/0/000022.log -rw-r--r-- 1 me users 0 Apr 21 10:54 /tmp/rocksdbtest-2260/dbbench/1/000022.log Reviewed By: ajkr Differential Revision: D36026777 Pulled By: mdcallag fbshipit-source-id: d42d3d7efceea7b9a25bbbc0f04461d2b7301122	4 years ago
Yanqin Jin	2b5df21e95	Remove ifdef for try_emplace after upgrading to c++17 (#9932 ) Summary: Test plan make check Pull Request resolved: https://github.com/facebook/rocksdb/pull/9932 Reviewed By: ajkr Differential Revision: D36085404 Pulled By: riversand963 fbshipit-source-id: 2ece14ca0e2e4c1288339ff79e7e126b76eaf786	4 years ago
Andrew Kryczka	cda34dd64a	Allow consecutive SingleDelete() in stress/crash test (#9930 ) Summary: We need to support consecutive SingleDelete(), so this PR adds it to the stress/crash tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9930 Test Plan: `python3 tools/db_crashtest.py blackbox --simple --nooverwritepercent=50 --writepercent=90 --delpercent=10 --readpercent=0 --prefixpercent=0 --delrangepercent=0 --iterpercent=0 --max_key=1000000 --duration=3600 --interval=10 --write_buffer_size=1048576 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --value_size_mult=33` Reviewed By: riversand963 Differential Revision: D36081863 Pulled By: ajkr fbshipit-source-id: 3566cdbaed375b8003126fc298968eb1a854317f	4 years ago
Yanqin Jin	06394ff4e7	Fix a bug of CompactionIterator/CompactionFilter using `Delete` (#9929 ) Summary: When compaction filter determines that a key should be removed, it updates the internal key's type to `Delete`. If this internal key is preserved in current compaction but seen by a later compaction together with `SingleDelete`, it will cause compaction iterator to return Corruption. To fix the issue, compaction filter should return more information in addition to the intention of removing a key. Therefore, we add a new `kRemoveWithSingleDelete` to `CompactionFilter::Decision`. Seeing `kRemoveWithSingleDelete`, compaction iterator will update the op type of the internal key to `kTypeSingleDelete`. In addition, I updated db_stress_shared_state.[cc\|h] so that `no_overwrite_ids_` becomes `const`. It is easier to reason about thread-safety if accessed from multiple threads. This information is passed to `PrepareTxnDBOptions()` when calling from `Open()` so that we can set up the rollback deletion type callback for transactions. Finally, disable compaction filter for multiops_txn because the key removal logic of `DbStressCompactionFilter` does not quite work with `MultiOpsTxnsStressTest`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9929 Test Plan: make check make crash_test make crash_test_with_txn Reviewed By: anand1976 Differential Revision: D36069678 Pulled By: riversand963 fbshipit-source-id: cedd2f1ba958af59ad3916f1ba6f424307955f92	4 years ago
Changyu Bi	37f490834d	Specify largest_seqno in VerifyChecksum (#9919 ) Summary: `VerifyChecksum()` does not specify `largest_seqno` when creating a `TableReader`. As a result, the `TableReader` uses the `TableReaderOptions` default value (0) for `largest_seqno`. This causes the following error when the file has a nonzero global seqno in its properties: ``` Corruption: An external sst file with version 2 have global seqno property with value , while largest seqno in the file is 0 ``` This PR fixes this by specifying `largest_seqno` in `VerifyChecksumInternal` with `largest_seqno` from the file metadata. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9919 Test Plan: `make check` Reviewed By: ajkr Differential Revision: D36028824 Pulled By: cbi42 fbshipit-source-id: 428d028a79386f46ef97bb6b6051dc76c83e1f2b	4 years ago
Yanqin Jin	2b5c29f9f3	Enforce the contract of SingleDelete (#9888 ) Summary: Enforce the contract of SingleDelete so that they are not mixed with Delete for the same key. Otherwise, it will lead to undefined behavior. See https://github.com/facebook/rocksdb/wiki/Single-Delete#notes. Also fix unit tests and write-unprepared. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9888 Test Plan: make check Reviewed By: ajkr Differential Revision: D35837817 Pulled By: riversand963 fbshipit-source-id: acd06e4dcba8cb18df92b44ed18c57e10e5a7635	4 years ago
Anvesh Komuravelli	aafb377bb5	Update protection info on recovered logs data (#9875 ) Summary: Update protection info on recovered logs data Pull Request resolved: https://github.com/facebook/rocksdb/pull/9875 Test Plan: - Benchmark setup: `TEST_TMPDIR=/dev/shm/100MB_WAL_DB/ ./db_bench -benchmarks=fillrandom -write_buffer_size=1048576000` - Benchmark command: `TEST_TMPDIR=/dev/shm/100MB_WAL_DB/ /usr/bin/time ./db_bench -use_existing_db=true -benchmarks=overwrite -write_buffer_size=1048576000 -writes=1 -report_open_timing=true` - Results before this PR ``` OpenDb: 2350.14 milliseconds OpenDb: 2296.94 milliseconds OpenDb: 2184.29 milliseconds OpenDb: 2167.59 milliseconds OpenDb: 2231.24 milliseconds OpenDb: 2109.57 milliseconds OpenDb: 2197.71 milliseconds OpenDb: 2120.8 milliseconds OpenDb: 2148.12 milliseconds OpenDb: 2207.95 milliseconds ``` - Results after this PR ``` OpenDb: 2424.52 milliseconds OpenDb: 2359.84 milliseconds OpenDb: 2317.68 milliseconds OpenDb: 2339.4 milliseconds OpenDb: 2325.36 milliseconds OpenDb: 2321.06 milliseconds OpenDb: 2353.98 milliseconds OpenDb: 2344.64 milliseconds OpenDb: 2384.09 milliseconds OpenDb: 2428.58 milliseconds ``` Mean regressed 7.2% (2201.4 -> 2359.9) Reviewed By: ajkr Differential Revision: D36012787 Pulled By: akomurav fbshipit-source-id: d2aba09f29c6beb2fd0fe8e1e359be910b4ef02a	4 years ago
Akanksha Mahajan	fce65e7e4f	Fix bug in async_io path which reads incorrect length (#9916 ) Summary: In FilePrefetchBuffer, in case data is overlapping between two buffers and more data is required to read and copy that to third buffer, incorrect length was updated resulting in ``` Iterator diverged from control iterator which has value 00000000000310C3000000000000012B0000000000000274 total_order_seek: 1 auto_prefix_mode: 0 S 000000000002C37F000000000000012B000000000000001C NNNPPPPPNN; total_order_seek: 1 auto_prefix_mode: 0 S 000000000002F10B00000000000000BF78787878787878 NNNPNNNNPN; total_order_seek: 1 auto_prefix_mode: 0 S 00000000000310C3000000000000012B000000000000026B iterator is not valid Control CF default db_stress: db_stress_tool/db_stress_test_base.cc:1388: void rocksdb::StressTest::VerifyIterator(rocksdb::ThreadState, rocksdb::ColumnFamilyHandle, const rocksdb::ReadOptions&, rocksdb::Iterator, rocksdb::Iterator, rocksdb::StressTest::LastIterateOp, const rocksdb::Slice&, const string&, bool*): Assertion `false' failed. Aborted (core dumped) ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/9916 Test Plan: ``` - CircleCI jobs - Ran db_stress with OPTIONS file which caught the bug ./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --async_io=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=42.26248932628998 --bottommost_compression_type=disable --cache_index_and_filter_blocks=0 --cache_size=8388608 --checkpoint_one_in=0 --checksum_type=kxxHash --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_ttl=0 --compression_max_dict_buffer_bytes=1073741823 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zstd --compression_zstd_max_train_bytes=65536 --continuous_verification_interval=0 --db=/dev/shm/rocksdb/ --db_write_buffer_size=134217728 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_blob_files=0 --enable_compaction_filter=0 --enable_pipelined_write=0 --fail_if_options_file_error=0 --file_checksum_impl=none --flush_one_in=1000000 --format_version=4 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=12 --index_type=2 --ingest_external_file_one_in=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --long_running_snapshots=0 --mark_for_compaction_one_file_in=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=8388608 --memtable_prefix_bloom_size_ratio=0.001 --memtable_whole_key_filtering=1 --memtablerep=skip_list --mmap_read=0 --mock_direct_io=False --nooverwritepercent=1 --open_files=100 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=16 --ops_per_thread=100000000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=-1 --prefixpercent=0 --prepopulate_block_cache=0 --progress_reports=0 --read_fault_one_in=0 --read_only=0 --readpercent=50 --recycle_log_file_num=1 --reopen=0 --reserve_table_reader_memory=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=0 --secondary_catch_up_one_in=0 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=0 --subcompactions=2 --sync=0 --sync_fault_injection=False --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=0 --top_level_index_pinning=3 --unpartitioned_pinning=3 --use_blob_db=0 --use_block_based_filter=0 --use_clock_cache=0 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=0 --use_full_merge_v1=0 --use_merge=0 --use_multiget=0 --use_txn=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --wal_compression=zstd --write_buffer_size=4194304 --write_dbid_to_manifest=0 --writepercent=35 --options_file=/home/akankshamahajan/OPTIONS.orig -column_families=1 db_bench with async_io enabled to make sure db_bench completes successfully without any failure. - ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1 -async_io=1 ``` crash_test in progress Reviewed By: anand1976 Differential Revision: D35985789 Pulled By: akankshamahajan15 fbshipit-source-id: 5abe185f34caa99ca587d4bdc8954bd0802b1bf9	4 years ago
Yanqin Jin	94e245a14d	Improve stress test for MultiOpsTxnsStressTest (#9829 ) Summary: Adds more coverage to `MultiOpsTxnsStressTest` with a focus on write-prepared transactions. 1. Add a hack to manually evict commit cache entries. We currently cannot assign small values to `wp_commit_cache_bits` because it requires a prepared transaction to commit within a certain range of sequence numbers, otherwise it will throw. 2. Add coverage for commit-time-write-batch. If write policy is write-prepared, we need to set `use_only_the_last_commit_time_batch_for_recovery` to true. 3. After each flush/compaction, verify data consistency. This is possible since data size can be small: default numbers of primary/secondary keys are just 1000. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9829 Test Plan: ``` TEST_TMPDIR=/dev/shm/rocksdb_crashtest_blackbox/ make blackbox_crash_test_with_multiops_wp_txn ``` Reviewed By: pdillinger Differential Revision: D35806678 Pulled By: riversand963 fbshipit-source-id: d7fde7a29fda0fb481a61f553e0ca0c47da93616	4 years ago
Herman Lee	d9d456de49	Fix locktree accesses to PessimisticTransactions (#9898 ) Summary: The current locktree implementation stores the address of the PessimisticTransactions object as the TXNID. However, when a transaction is blocked on a lock, it records the list of waitees with conflicting locks using the rocksdb assigned TransactionID. This is performed by calling GetID() on PessimisticTransactions objects of the waitees, and then recorded in the waiter's list. However, there is no guarantee the objects are valid when recording the waitee list during the conflict callbacks because the waitee could have released the lock and freed the PessimisticTransactions object. The waitee/txnid values are only valid PessimisticTransaction objects while the mutex for the root of the locktree is held. The simplest fix for this problem is to use the address of the PessimisticTransaction as the TransactionID so that it is consistent with its usage in the locktree. The TXNID is only converted back to a PessimisticTransaction for the report_wait callbacks. Since these callbacks are now all made within the critical section where the lock_request queue mutx is held, these conversions will be safe. Otherwise, only the uint64_t TXNID of the waitee is registerd with the waiter transaction. The PessimisitcTransaction object of the waitee is never referenced. The main downside of this approach is the TransactionID will not change if the PessimisticTransaction object is reused for new transactions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9898 Test Plan: Add a new test case and run unit tests. Also verified with MyRocks workloads using range locks that the crash no longer happens. Reviewed By: riversand963 Differential Revision: D35950376 Pulled By: hermanlee fbshipit-source-id: 8c9cae272e23e487fc139b6a8ed5b8f8f24b1570	4 years ago
Paras Sethia	68ee228dec	RocksDB: fix bug in crash-recovery correctness testing (#9897 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9897 Fixes https://github.com/facebook/rocksdb/issues/9385. Update State to reflect the value in the DB after a crash Reviewed By: ajkr Differential Revision: D35788808 fbshipit-source-id: 2d21d8537ab380a17cad3e90ac72b3eb1b56de9f	4 years ago
Peter Dillinger	9d0cae7104	Eliminate unnecessary (slow) block cache Ref()ing in MultiGet (#9899 ) Summary: When MultiGet() determines that multiple query keys can be served by examining the same data block in block cache (one Lookup()), each PinnableSlice referring to data in that data block needs to hold on to the block in cache so that they can be released at arbitrary times by the API user. Historically this is accomplished with extra calls to Ref() on the Handle from Lookup(), with each PinnableSlice cleanup calling Release() on the Handle, but this creates extra contention on the block cache for the extra Ref()s and Release()es, especially because they hit the same cache shard repeatedly. In the case of merge operands (possibly more cases?), the problem was compounded by doing an extra Ref()+eventual Release() for each merge operand for a key reusing a block (which could be the same key!), rather than one Ref() per key. (Note: the non-shared case with `biter` was already one per key.) This change optimizes MultiGet not to rely on these extra, contentious Ref()+Release() calls by instead, in the shared block case, wrapping the cache Release() cleanup in a refcounted object referenced by the PinnableSlices, such that after the last wrapped reference is released, the cache entry is Release()ed. Relaxed atomic refcounts should be much faster than mutex-guarded Ref() and Release(), and much less prone to a performance cliff when MultiGet() does a lot of block sharing. Note that I did not use std::shared_ptr, because that would require an extra indirection object (shared_ptr itself new/delete) in order to associate a ref increment/decrement with a Cleanable cleanup entry. (If I assumed it was the size of two pointers, I could do some hackery to make it work without the extra indirection, but that's too fragile.) Some details: * Fixed (removed) extra block cache tracing entries in cases of cache entry reuse in MultiGet, but it's likely that in some other cases traces are missing (XXX comment inserted) * Moved existing implementations for cleanable.h from iterator.cc to new cleanable.cc * Improved API comments on Cleanable * Added a public SharedCleanablePtr class to cleanable.h in case others could benefit from the same pattern (potentially many Cleanables and/or smart pointers referencing a shared Cleanable) * Add a typedef for MultiGetContext::Mask * Some variable renaming for clarity Pull Request resolved: https://github.com/facebook/rocksdb/pull/9899 Test Plan: Added unit tests for SharedCleanablePtr. Greatly enhanced ability of existing tests to detect cache use-after-free. * Release PinnableSlices from MultiGet as they are read rather than in bulk (in db_test_util wrapper). * In ASAN build, default to using a trivially small LRUCache for block_cache so that entries are immediately erased when unreferenced. (Updated two tests that depend on caching.) New ASAN testsuite running time seems OK to me. If I introduce a bug into my implementation where we skip the shared cleanups on block reuse, ASAN detects the bug in `db_basic_test MultiGet`. If I remove either of the above testing enhancements, the bug is not detected. Consider for follow-up work: manipulate or randomize ordering of PinnableSlice use and release from MultiGet db_test_util wrapper. But in typical cases, natural ordering gives pretty good functional coverage. Performance test: In the extreme (but possible) case of MultiGetting the same or adjacent keys in a batch, throughput can improve by an order of magnitude. `./db_bench -benchmarks=multireadrandom -db=/dev/shm/testdb -readonly -num=5 -duration=10 -threads=20 -multiread_batched -batch_size=200` Before ops/sec, num=5: 1,384,394 Before ops/sec, num=500: 6,423,720 After ops/sec, num=500: 10,658,794 After ops/sec, num=5: 16,027,257 Also note that previously, with high parallelism, having query keys concentrated in a single block was worse than spreading them out a bit. Now concentrated in a single block is faster than spread out, which is hopefully consistent with natural expectation. Random query performance: with num=1000000, over 999 x 10s runs running before & after simultaneously (each -threads=12): Before: multireadrandom [AVG 999 runs] : 1088699 (± 7344) ops/sec; 120.4 (± 0.8 ) MB/sec After: multireadrandom [AVG 999 runs] : 1090402 (± 7230) ops/sec; 120.6 (± 0.8 ) MB/sec Possibly better, possibly in the noise. Reviewed By: anand1976 Differential Revision: D35907003 Pulled By: pdillinger fbshipit-source-id: bbd244d703649a8ca12d476f2d03853ed9d1a17e	4 years ago
Andrew Kryczka	ce2d8a4239	fix clang-analyze in corruption_test (#9908 ) Summary: This PR fixes a clang-analyze error that I introduced in https://github.com/facebook/rocksdb/issues/9906: ``` db/corruption_test.cc:358:15: warning: Called C++ object pointer is null ASSERT_OK(db_->Put(WriteOptions(), cfhs[0], "k", "v")); ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ./test_util/testharness.h:76:62: note: expanded from macro 'ASSERT_OK' ASSERT_PRED_FORMAT1(ROCKSDB_NAMESPACE::test::AssertStatus, s) ^ third-party/gtest-1.8.1/fused-src/gtest/gtest.h:19909:36: note: expanded from macro 'ASSERT_PRED_FORMAT1' GTEST_PRED_FORMAT1_(pred_format, v1, GTEST_FATAL_FAILURE_) ^~ third-party/gtest-1.8.1/fused-src/gtest/gtest.h:19892:34: note: expanded from macro 'GTEST_PRED_FORMAT1_' GTEST_ASSERT_(pred_format(#v1, v1), \ ^~ third-party/gtest-1.8.1/fused-src/gtest/gtest.h:19868:52: note: expanded from macro 'GTEST_ASSERT_' if (const ::testing::AssertionResult gtest_ar = (expression)) \ ^~~~~~~~~~ 1 warning generated. ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/9908 Reviewed By: riversand963 Differential Revision: D35953147 Pulled By: ajkr fbshipit-source-id: 9b837bd7581c6e1e2cdbc961c099652256eb9d4b	4 years ago
Andrew Kryczka	1eb279dcce	Add mmap DBGet microbench parameters (#9903 ) Summary: I tried evaluating https://github.com/facebook/rocksdb/issues/9611 using DBGet microbenchmarks but mostly found the change is well within the noise even for hundreds of repetitions; meanwhile, the InternalKeyComparator CPU it saves is 1-2% according to perf so it should be measurable. In this PR I tried adding a mmap mode that will bypass compression/checksum/block cache/file read to focus more on the block lookup paths, and also increased the Get() count. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9903 Reviewed By: jay-zhuang, riversand963 Differential Revision: D35907375 Pulled By: ajkr fbshipit-source-id: 69490d5040ef0863e1ce296724104d0aa7667215	4 years ago
Andrew Kryczka	c5d367f472	Revert open logic changes in #9634 (#9906 ) Summary: Left HISTORY.md and unit tests. Added a new unit test to repro the corruption scenario that this PR fixes, and HISTORY.md line for that. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9906 Reviewed By: riversand963 Differential Revision: D35940093 Pulled By: ajkr fbshipit-source-id: 9816f99e1ce405ba36f316beb4f6378c37c8c86b	4 years ago
Akanksha Mahajan	3653029dda	Add stats related to async prefetching (#9845 ) Summary: Add stats PREFETCHED_BYTES_DISCARDED and POLL_WAIT_MICROS. PREFETCHED_BYTES_DISCARDED records number of prefetched bytes discarded by FilePrefetchBuffer. POLL_WAIT_MICROS records the time taken by underling file_system Poll API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9845 Test Plan: Update existing tests Reviewed By: anand1976 Differential Revision: D35909694 Pulled By: akankshamahajan15 fbshipit-source-id: e009ef940bb9ed72c9446f5529095caabb8a1e36	4 years ago
RoeyMaor	6d2577e567	Bugfix/fix manual flush blocking bug (#9893 ) Summary: Fix https://github.com/facebook/rocksdb/issues/9892 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9893 Reviewed By: jay-zhuang Differential Revision: D35880959 Pulled By: ajkr fbshipit-source-id: dad1139ad0983cfbd5c5cd6fa6b71022f889735a	4 years ago
Jaromir Vanek	fb9a167a55	Add 95% confidence intervals to db_bench output (#9882 ) Summary: Enhancing `db_bench` output with 95% statistical confidence intervals for better performance evaluation. The goal is to unambiguously separate random variance when running benchmark over multiple iterations. Output enhanced with confidence intervals exposed in brackets: ``` $ ./db_bench --benchmarks=fillseq[-X10] Running benchmark for 10 times fillseq : 4.961 micros/op 201578 ops/sec; 22.3 MB/s fillseq : 5.030 micros/op 198824 ops/sec; 22.0 MB/s fillseq [AVG 2 runs] : 200201 (± 2698) ops/sec; 22.1 (± 0.3) MB/sec fillseq : 4.963 micros/op 201471 ops/sec; 22.3 MB/s fillseq [AVG 3 runs] : 200624 (± 1765) ops/sec; 22.2 (± 0.2) MB/sec fillseq : 5.035 micros/op 198625 ops/sec; 22.0 MB/s fillseq [AVG 4 runs] : 200124 (± 1586) ops/sec; 22.1 (± 0.2) MB/sec fillseq : 4.979 micros/op 200861 ops/sec; 22.2 MB/s fillseq [AVG 5 runs] : 200272 (± 1262) ops/sec; 22.2 (± 0.1) MB/sec fillseq : 4.893 micros/op 204367 ops/sec; 22.6 MB/s fillseq [AVG 6 runs] : 200954 (± 1688) ops/sec; 22.2 (± 0.2) MB/sec fillseq : 4.914 micros/op 203502 ops/sec; 22.5 MB/s fillseq [AVG 7 runs] : 201318 (± 1595) ops/sec; 22.3 (± 0.2) MB/sec fillseq : 4.998 micros/op 200074 ops/sec; 22.1 MB/s fillseq [AVG 8 runs] : 201163 (± 1415) ops/sec; 22.3 (± 0.2) MB/sec fillseq : 4.946 micros/op 202188 ops/sec; 22.4 MB/s fillseq [AVG 9 runs] : 201277 (± 1267) ops/sec; 22.3 (± 0.1) MB/sec fillseq : 5.093 micros/op 196331 ops/sec; 21.7 MB/s fillseq [AVG 10 runs] : 200782 (± 1491) ops/sec; 22.2 (± 0.2) MB/sec fillseq [AVG 10 runs] : 200782 (± 1491) ops/sec; 22.2 (± 0.2) MB/sec fillseq [MEDIAN 10 runs] : 201166 ops/sec; 22.3 MB/s ``` For more explicit interval representation, use `--confidence_interval_only` flag: ``` $ ./db_bench --benchmarks=fillseq[-X10] --confidence_interval_only Running benchmark for 10 times fillseq : 4.935 micros/op 202648 ops/sec; 22.4 MB/s fillseq : 5.078 micros/op 196943 ops/sec; 21.8 MB/s fillseq [CI95 2 runs] : (194205, 205385) ops/sec; (21.5, 22.7) MB/sec fillseq : 5.159 micros/op 193816 ops/sec; 21.4 MB/s fillseq [CI95 3 runs] : (192735, 202869) ops/sec; (21.3, 22.4) MB/sec fillseq : 4.947 micros/op 202158 ops/sec; 22.4 MB/s fillseq [CI95 4 runs] : (194721, 203061) ops/sec; (21.5, 22.5) MB/sec fillseq : 4.908 micros/op 203756 ops/sec; 22.5 MB/s fillseq [CI95 5 runs] : (196113, 203615) ops/sec; (21.7, 22.5) MB/sec fillseq : 5.063 micros/op 197528 ops/sec; 21.9 MB/s fillseq [CI95 6 runs] : (196319, 202631) ops/sec; (21.7, 22.4) MB/sec fillseq : 5.214 micros/op 191799 ops/sec; 21.2 MB/s fillseq [CI95 7 runs] : (194953, 201803) ops/sec; (21.6, 22.3) MB/sec fillseq : 5.260 micros/op 190095 ops/sec; 21.0 MB/s fillseq [CI95 8 runs] : (193749, 200937) ops/sec; (21.4, 22.2) MB/sec fillseq : 5.076 micros/op 196992 ops/sec; 21.8 MB/s fillseq [CI95 9 runs] : (194134, 200474) ops/sec; (21.5, 22.2) MB/sec fillseq : 5.388 micros/op 185603 ops/sec; 20.5 MB/s fillseq [CI95 10 runs] : (192487, 199781) ops/sec; (21.3, 22.1) MB/sec fillseq [AVG 10 runs] : 196134 (± 3647) ops/sec; 21.7 (± 0.4) MB/sec fillseq [MEDIAN 10 runs] : 196968 ops/sec; 21.8 MB/sec ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/9882 Reviewed By: pdillinger Differential Revision: D35796148 Pulled By: vanekjar fbshipit-source-id: 8313712d16728ff982b8aff28195ee56622385b8	4 years ago
Akanksha Mahajan	5bd374b392	Add experimental new FS API AbortIO to cancel read request (#9901 ) Summary: Add experimental new API AbortIO in FileSystem to abort the read requests submitted asynchronously through ReadAsync API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9901 Test Plan: Existing tests Reviewed By: anand1976 Differential Revision: D35885591 Pulled By: akankshamahajan15 fbshipit-source-id: df3944e6e9e6e487af1fa688376b4abb6837fb02	4 years ago
yuzhangyu	ac29645743	Add blob dump support to the dump_live_files command (#9896 ) Summary: This patch completes the second part of the task: "Add blob support to the dump and dump_live_files command" Pull Request resolved: https://github.com/facebook/rocksdb/pull/9896 Reviewed By: ltamasi Differential Revision: D35852667 Pulled By: jowlyzhang fbshipit-source-id: a006456c881f468a92da689e895134762e9574e1	4 years ago
yuzhangyu	fff28a7725	Add blob dump support to the dump command (#9881 ) Summary: This patch is the first part of adding blob dump support. It only adds blob dump support to the dump command. A follow up patch will add blob dump support to the dump_live_files command. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9881 Reviewed By: ltamasi Differential Revision: D35796731 Pulled By: jowlyzhang fbshipit-source-id: 2cc5973b222d505a331ac7b969edcf992b47c5ee	4 years ago
Yanqin Jin	d13825e586	Add rollback_deletion_type_callback to TxnDBOptions (#9873 ) Summary: This PR does not affect write-committed. Add a member, `rollback_deletion_type_callback` to TransactionDBOptions so that a write-prepared transaction, when rolling back, can call this callback to decide if a `Delete` or `SingleDelete` should be used to cancel a prior `Put` written to the database during prepare phase. The purpose of this PR is to prevent mixing `Delete` and `SingleDelete` for the same key, causing undefined behaviors. Without this PR, the following can happen: ``` // The application always issues SingleDelete when deleting keys. txn1->Put('a'); txn1->Prepare(); // writes to memtable and potentially gets flushed/compacted to Lmax txn1->Rollback(); // inserts DELETE('a') txn2->Put('a'); txn2->Commit(); // writes to memtable and potentially gets flushed/compacted ``` In the database, we may have ``` L0: [PUT('a', s=100)] L1: [DELETE('a', s=90)] Lmax: [PUT('a', s=0)] ``` If a compaction compacts L0 and L1, then we have ``` L1: [PUT('a', s=100)] Lmax: [PUT('a', s=0)] ``` If a future transaction issues a SingleDelete, we have ``` L0: [SD('a', s=110)] L1: [PUT('a', s=100)] Lmax: [PUT('a', s=0)] ``` Then, a compaction including L0, L1 and Lmax leads to ``` Lmax: [PUT('a', s=0)] ``` which is incorrect. Similar bugs reported and addressed in https://github.com/cockroachdb/pebble/issues/1255. Based on our team's current priority, we have decided to take this approach for now. We may come back and revisit in the future. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9873 Test Plan: make check Reviewed By: ltamasi Differential Revision: D35762170 Pulled By: riversand963 fbshipit-source-id: b28d56eefc786b53c9844b9ef4a7807acdd82c8d	4 years ago
Peter Dillinger	1bac873fcf	Mark GetLiveFilesStorageInfo ready for production use (#9868 ) Summary: ... by filling out remaining testing hole: handling of db_pathsi+cf_paths. (Note that while GetLiveFilesStorageInfo works with db_paths / cf_paths, Checkpoint and BackupEngine do not and are marked appropriately.) Also improved comments for "live files" APIs, and grouped them together in db.h. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9868 Test Plan: Adding to existing unit tests Reviewed By: jay-zhuang Differential Revision: D35752254 Pulled By: pdillinger fbshipit-source-id: c70eb67748fad61826e2f554b674638700abefb2	4 years ago
Jay Zhuang	2ea4205a69	Add 7.2 to compatible check (#9858 ) Summary: Add 7.2 to compatible check (should change it with version update). Pull Request resolved: https://github.com/facebook/rocksdb/pull/9858 Reviewed By: riversand963 Differential Revision: D35722897 Pulled By: jay-zhuang fbshipit-source-id: 08c782b9344599d7296543eb0c61afcd9a869a1a	4 years ago
yuzhangyu	9b5790f018	Add --decode_blob_index option to idump and dump commands (#9870 ) Summary: This patch completes the first part of the task: "Extend all three commands so they can decode and print blob references if a new option --decode_blob_index is specified" Pull Request resolved: https://github.com/facebook/rocksdb/pull/9870 Reviewed By: ltamasi Differential Revision: D35753932 Pulled By: jowlyzhang fbshipit-source-id: 9d2bbba0eef2ed86b982767eba9de1b4881f35c9	4 years ago
Hui Xiao	a5063c8931	Fix issue of opening too many files in BlockBasedTableReaderCapMemoryTest.CapMemoryUsageUnderCacheCapacity (#9869 ) Summary: Context: Unit test for https://github.com/facebook/rocksdb/pull/9748 keeps opening new files to see whether the new feature is able to correctly constrain the opening based on block cache capacity. However, the unit test has two places written inefficiently that can lead to opening too many new files relative to underlying operating system/file system constraint, even before hitting the block cache capacity: (1) [opened_table_reader_num < 2 * max_table_reader_num](https://github.com/facebook/rocksdb/pull/9748/files?show-viewed-files=true&file-filters%5B%5D=#diff-ec9f5353e317df71093094734ba29193b94a998f0f9c9af924e4c99692195eeaR438), which can leads to 1200 + open files because of (2) below (2) NewLRUCache(6 * CacheReservationManagerImpl<CacheEntryRole::kBlockBasedTableReader>::GetDummyEntrySize()) in [here](https://github.com/facebook/rocksdb/pull/9748/files?show-viewed-files=true&file-filters%5B%5D=#diff-ec9f5353e317df71093094734ba29193b94a998f0f9c9af924e4c99692195eeaR364) Therefore we see CI failures like this on machine with a strict open file limit ~1000 (see the "table_1021" naming in following error msg) https://app.circleci.com/pipelines/github/facebook/rocksdb/12886/workflows/75524682-3fa4-41ee-9a61-81827b51d99b/jobs/345270 ``` fs_->NewWritableFile(path, foptions, &file, nullptr) IO error: While open a file for appending: /dev/shm/rocksdb.Jedwt/run-block_based_table_reader_test-CapMemoryUsageUnderCacheCapacity-BlockBasedTableReaderCapMemoryTest.CapMemoryUsageUnderCacheCapacity-0/block_based_table_reader_test_1668910_829492452552920927/table_1021: Too many open files ``` Summary: - Revised the test more efficiently on the above 2 places, including using 1.1 instead 2 in the threshold and lowering down the block cache capacity a bit - Renamed some variables for clarity Pull Request resolved: https://github.com/facebook/rocksdb/pull/9869 Test Plan: - Manual inspection of max opened table reader in all test case, which is around ~389 - Circle CI to see if error is gone Reviewed By: ajkr Differential Revision: D35752655 Pulled By: hx235 fbshipit-source-id: 8a0953d39d561babfa4257b8ed8550bb21b04839	4 years ago
Bo Wang	01fdec23fe	Add release note for #9747 (#9874 ) Summary: Add release note for CompressedSecondaryCache and the update of SecondaryCache::Lookup(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/9874 Reviewed By: jay-zhuang Differential Revision: D35765973 Pulled By: gitbw95 fbshipit-source-id: 98232508c4f2047216def9c11a038cfb98709690	4 years ago
Peter Dillinger	682fc8ba6a	Release note for #9546 (#9872 ) Summary: We don't really have a mechanism for internal-only release notes, so adding this to the standard release notes. For picking into 7.2 release. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9872 Test Plan: release note only Reviewed By: jay-zhuang Differential Revision: D35761307 Pulled By: pdillinger fbshipit-source-id: 5d1932767fff48456323df948604dbb956ac27b2	4 years ago
Federico Guerinoni	bbf5867353	Add C API for setting `strict_capacity_limit` (#9855 ) Summary: This allows to set with true the field `strict_capacity_limit` from C API and other languages that wrap that. Signed-off-by: Federico Guerinoni <guerinoni.federico@gmail.com> Closes: https://github.com/facebook/rocksdb/issues/9707 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9855 Reviewed By: ajkr Differential Revision: D35724150 Pulled By: jay-zhuang fbshipit-source-id: d8514797e9d90b1cd88329018f9ac4776722aa0f	4 years ago
Andrew Kryczka	690f1edf37	Avoid overwriting OPTIONS file settings in db_bench (#9862 ) Summary: `InitializeOptionsGeneral()` was overwriting many options that were already configured by OPTIONS file, potentially with the flag default values. This PR changes that function to only overwrite options in limited scenarios, as described at the top of its definition. Block cache is still a violation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9862 Test Plan: ran under various scenarios (multi-DB, single DB, OPTIONS file, flags) and verified options are set as expected Reviewed By: jay-zhuang Differential Revision: D35736960 Pulled By: ajkr fbshipit-source-id: 75b77740af37e6f5741618f8a8f5685df2417d03	4 years ago
Peter Dillinger	1601433b3a	Misc CI improvements / additions (#9859 ) Summary: * Add valgrind test to nightly CircleCI (in case it can catch something that ASAN/UBSAN does not) * Add clang13+asan+ubsan+folly test to nightly CircleCI, for broader testing * Consolidate many copies of ASAN_OPTIONS= while also allowing it to be inherited from parent environment rather than always overridden. * Move UBSAN exclusion from Makefile into options_settable_test.cc Pull Request resolved: https://github.com/facebook/rocksdb/pull/9859 Test Plan: CI Reviewed By: jay-zhuang Differential Revision: D35730903 Pulled By: pdillinger fbshipit-source-id: 6f5464034e8115f9a07f6f7aec1de9219ec2837c	4 years ago
Hui Xiao	e83c55439a	Conditionally declare and define variable that is unused in LITE mode (#9854 ) Summary: Context: As mentioned in https://github.com/facebook/rocksdb/issues/9701, we have the following in LITE=1 make static_lib for v7.0.2 ``` CC file/sequence_file_reader.o CC file/sst_file_manager_impl.o CC file/writable_file_writer.o In file included from file/writable_file_writer.cc:10: ./file/writable_file_writer.h:163:15: error: private field 'temperature_' is not used [-Werror,-Wunused-private-field] Temperature temperature_; ^ 1 error generated. make: *** [file/writable_file_writer.o] Error 1 ``` as titled Pull Request resolved: https://github.com/facebook/rocksdb/pull/9854 Test Plan: - Local `LITE=1 make static_lib` reveals the same error and error is gone after this fix - CI Reviewed By: ajkr, jay-zhuang Differential Revision: D35706585 Pulled By: hx235 fbshipit-source-id: 7743310298231ad6866304ffa2225c8abdc91d9a	4 years ago
Peter Dillinger	41237dd306	Add "no compression" job to CircleCI (#9850 ) Summary: Since they operate at distinct abstraction layers, I thought it was prudent to combine with EncryptedEnv CI test for each PR, for efficiency in testing. Also added supported compressions to sst_dump --help output so that CI job can verify no compiled-in compression support. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9850 Test Plan: CI, some manual stuff Reviewed By: riversand963 Differential Revision: D35682346 Pulled By: pdillinger fbshipit-source-id: be9879c1533fed304ee32c89fd9ba4b07c2b90cc	4 years ago
Jay Zhuang	3d473235d4	Update main version.h to NEXT release (7.3) (#9852 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9852 Reviewed By: ajkr Differential Revision: D35694753 Pulled By: jay-zhuang fbshipit-source-id: 729d416afc588e5db2367e899589bbb5419820d6	4 years ago
Jay Zhuang	673ada8225	Update HISTORY.md for 7.2 release (#9848 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/9848 Reviewed By: riversand963 Differential Revision: D35677606 Pulled By: jay-zhuang fbshipit-source-id: 8a597ea47f302a6f51fb6672a33c848d613bccfc	4 years ago
sdong	4f9c0fd083	Add Aggregation Merge Operator (#9780 ) Summary: Add a merge operator that allows users to register specific aggregation function so that they can does aggregation based per key using different aggregation types. See comments of function CreateAggMergeOperator() for actual usage. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9780 Test Plan: Add a unit test to coverage various cases. Reviewed By: ltamasi Differential Revision: D35267444 fbshipit-source-id: 5b02f31c4f3e17e96dd4025cdc49fca8c2868628	4 years ago
Levi Tamasi	db536ee045	Propagate errors from UpdateBoundaries (#9851 ) Summary: In `FileMetaData`, we keep track of the lowest-numbered blob file referenced by the SST file in question for the purposes of BlobDB's garbage collection in the `oldest_blob_file_number` field, which is updated in `UpdateBoundaries`. However, with the current code, `BlobIndex` decoding errors (or invalid blob file numbers) are swallowed in this method. The patch changes this by propagating these errors and failing the corresponding flush/compaction. (Note that since blob references are generated by the BlobDB code and also parsed by `CompactionIterator`, in reality this can only happen in the case of memory corruption.) This change necessitated updating some unit tests that involved fake/corrupt `BlobIndex` objects. Some of these just used a dummy string like `"blob_index"` as a placeholder; these were replaced with real `BlobIndex`es. Some were relying on the earlier behavior to simulate corruption; these were replaced with `SyncPoint`-based test code that corrupts a valid blob reference at read time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9851 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D35683671 Pulled By: ltamasi fbshipit-source-id: f7387af9945c48e4d5c4cd864f1ba425c7ad51f6	4 years ago
Yanqin Jin	be81609b43	Add a `fail_if_not_bottommost_level` to IngestExternalFileOptions (#9849 ) Summary: This new options allows application to specify that files must be ingested to bottommost level, otherwise the ingestion will fail instead of silently ingesting to a non-bottommost level. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9849 Test Plan: make check Reviewed By: ajkr Differential Revision: D35680307 Pulled By: riversand963 fbshipit-source-id: 01cf54ef6c76198f7654dc06b5544631dea1be1e	4 years ago
Akanksha Mahajan	0c7f455f85	Make initial auto readahead_size configurable (#9836 ) Summary: Make initial auto readahead_size configurable Pull Request resolved: https://github.com/facebook/rocksdb/pull/9836 Test Plan: Added new unit test Ran regression: Without change: ``` ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1 Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags RocksDB: version 7.0 Date: Thu Mar 17 13:11:34 2022 CPU: 24 * Intel Core Processor (Broadwell) CPUCache: 16384 KB Keys: 32 bytes each (+ 0 bytes user-defined timestamp) Values: 512 bytes each (256 bytes after compression) Entries: 5000000 Prefix: 0 bytes Keys per prefix: 0 RawSize: 2594.0 MB (estimated) FileSize: 1373.3 MB (estimated) Write rate: 0 bytes/second Read rate: 0 ops/second Compression: Snappy Compression sampling rate: 0 Memtablerep: SkipListFactory Perf Level: 1 ------------------------------------------------ DB path: [/tmp/prefix_scan_prefetch_main] seekrandom : 483618.390 micros/op 2 ops/sec; 338.9 MB/s (249 of 249 found) ``` With this change: ``` ./db_bench -use_existing_db=true -db=/tmp/prefix_scan_prefetch_main -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -use_direct_reads=true -seek_nexts=327680 -duration=120 -ops_between_duration_checks=1 Set seed to 1649895440554504 because --seed was 0 Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags RocksDB: version 7.2 Date: Wed Apr 13 17:17:20 2022 CPU: 24 * Intel Core Processor (Broadwell) CPUCache: 16384 KB Keys: 32 bytes each (+ 0 bytes user-defined timestamp) Values: 512 bytes each (256 bytes after compression) Entries: 5000000 Prefix: 0 bytes Keys per prefix: 0 RawSize: 2594.0 MB (estimated) FileSize: 1373.3 MB (estimated) Write rate: 0 bytes/second Read rate: 0 ops/second Compression: Snappy Compression sampling rate: 0 Memtablerep: SkipListFactory Perf Level: 1 ------------------------------------------------ DB path: [/tmp/prefix_scan_prefetch_main] ... finished 100 ops seekrandom : 476892.488 micros/op 2 ops/sec; 344.6 MB/s (252 of 252 found) ``` Reviewed By: anand1976 Differential Revision: D35632815 Pulled By: akankshamahajan15 fbshipit-source-id: c8057a88f9294c9d03b1d434b03affe02f74d796	4 years ago
sdong	d5dfa8c6fe	Upgrade development environment. (#9843 ) Summary: It's to support Meta's internal environment platform010. Gcc still doesn't work but USE_CLANG=1 should work. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9843 Test Plan: Try to make and ROCKSDB_FBCODE_BUILD_WITH_PLATFORM010=1 USE_CLANG=1 make Reviewed By: pdillinger Differential Revision: D35652507 fbshipit-source-id: a4a14b2fa4a2d6ca6fbf1b65060e81c39f079363	4 years ago
Jay Zhuang	e91ec64cac	Remove flaky servicelab metrics DBPut P95/P99 (#9844 ) Summary: The P95 and P99 metrics are flaky, similar to DBGet ones which removed in https://github.com/facebook/rocksdb/issues/9742 . Pull Request resolved: https://github.com/facebook/rocksdb/pull/9844 Test Plan: `$ ./buckifier/buckify_rocksdb.py` Reviewed By: ajkr Differential Revision: D35655531 Pulled By: jay-zhuang fbshipit-source-id: c1409f0fba4e23d461a65f988c27ac5e2ae85d13	4 years ago
yuzhangyu	082eb04200	Add option --decode_blob_index to dump_live_files command (#9842 ) Summary: This change only add decode blob index support to dump_live_files command, which is part of a task to add blob support to a few commands. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9842 Reviewed By: ltamasi Differential Revision: D35650167 Pulled By: jowlyzhang fbshipit-source-id: a78151b98bc38ac6f52c6e01ca6927a3429ddd14	4 years ago
Yanqin Jin	fe63899d1a	Add checks to GetUpdatesSince (#9459 ) Summary: Make `DB::GetUpdatesSince` return early if told to scan WALs generated by transactions with write-prepared or write-unprepared policies (`seq_per_batch` is true), as indicated by API comment. Also add checks to `TransactionLogIterator` to clarify some conditions. No API change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9459 Test Plan: make check Closing https://github.com/facebook/rocksdb/issues/1565 Reviewed By: akankshamahajan15 Differential Revision: D33821243 Pulled By: riversand963 fbshipit-source-id: c8b155d020ce0980e2d3b3b1da40b96e65b48d79	4 years ago
Yanqin Jin	0bd4dcde6b	CompactionIterator sees consistent view of which keys are committed (#9830 ) Summary: This PR does not affect the functionality of `DB` and write-committed transactions. `CompactionIterator` uses `KeyCommitted(seq)` to determine if a key in the database is committed. As the name 'write-committed' implies, if write-committed policy is used, a key exists in the database only if it is committed. In fact, the implementation of `KeyCommitted()` is as follows: ``` inline bool KeyCommitted(SequenceNumber seq) { // For non-txn-db and write-committed, snapshot_checker_ is always nullptr. return snapshot_checker_ == nullptr \|\| snapshot_checker_->CheckInSnapshot(seq, kMaxSequence) == SnapshotCheckerResult::kInSnapshot; } ``` With that being said, we focus on write-prepared/write-unprepared transactions. A few notes: - A key can exist in the db even if it's uncommitted. Therefore, we rely on `snapshot_checker_` to determine data visibility. We also require that all writes go through transaction API instead of the raw `WriteBatch` + `Write`, thus at most one uncommitted version of one user key can exist in the database. - `CompactionIterator` outputs a key as long as the key is uncommitted. Due to the above reasons, it is possible that `CompactionIterator` decides to output an uncommitted key without doing further checks on the key (`NextFromInput()`). By the time the key is being prepared for output, the key becomes committed because the `snapshot_checker_(seq, kMaxSequence)` becomes true in the implementation of `KeyCommitted()`. Then `CompactionIterator` will try to zero its sequence number and hit assertion error if the key is a tombstone. To fix this issue, we should make the `CompactionIterator` see a consistent view of the input keys. Note that for write-prepared/write-unprepared, the background flush/compaction jobs already take a "job snapshot" before starting processing keys. The job snapshot is released only after the entire flush/compaction finishes. We can use this snapshot to determine whether a key is committed or not with minor change to `KeyCommitted()`. ``` inline bool KeyCommitted(SequenceNumber sequence) { // For non-txn-db and write-committed, snapshot_checker_ is always nullptr. return snapshot_checker_ == nullptr \|\| snapshot_checker_->CheckInSnapshot(sequence, job_snapshot_) == SnapshotCheckerResult::kInSnapshot; } ``` As a result, whether a key is committed or not will remain a constant throughout compaction, causing no trouble for `CompactionIterator`s assertions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9830 Test Plan: make check Reviewed By: ltamasi Differential Revision: D35561162 Pulled By: riversand963 fbshipit-source-id: 0e00d200c195240341cfe6d34cbc86798b315b9f	4 years ago
Jonathan Albrecht	844a35108b	Fix minimum libzstd version that supports ZSTD_STREAMING (#9841 ) Summary: The minimum libzstd version that has `ZSTD_compressStream2` is 1.4.0 so only define ZSTD_STREAMING in that case. Fixes building on Ubuntu 18.04 which has libzstd 1.3.3 as its repository version. Fixes https://github.com/facebook/rocksdb/issues/9795 Pull Request resolved: https://github.com/facebook/rocksdb/pull/9841 Test Plan: Build and test on Ubuntu 18.04 with: apt-get install libsnappy-dev zlib1g-dev libbz2-dev liblz4-dev \ libzstd-dev libgflags-dev g++ make curl Reviewed By: ajkr Differential Revision: D35648738 Pulled By: jay-zhuang fbshipit-source-id: 2a9e969bcc17a7dc10172f3817283409de885811	4 years ago

1 2 3 4 5 ...

10983 Commits (8b74cea7fec704cf67a168fef9e452ede56f1974) All Branches Search

10983 Commits (8b74cea7fec704cf67a168fef9e452ede56f1974)

All Branches