Tag:
					 Branch:
					 Tree:
					f4a5446cab
				
			
			
		
		main
				
					oxigraph-8.1.1
				
					oxigraph-8.3.2
				
					oxigraph-main
				
			
			
				
			
		${ noResults }
		
	
		
			301 Commits (f4a5446cab9a5797758e5fd491917280a39359ca)
		
	
	
		
	
	| Author | SHA1 | Message | Date | 
|---|---|---|---|
|  Changyu Bi | e9d6a0d7ce | Fix asan failure caused by range tombstone start key use-after-free (#11106) Summary: the `last_tombstone_start_user_key` variable in `BuildTable()` and in `CompactionOutputs::AddRangeDels()` may point to a start key that is freed if user-defined timestamp is enabled. This was causing ASAN failure and this PR fixes this issue. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11106 Test Plan: Added UT for repro. Reviewed By: ajkr Differential Revision: D42590862 Pulled By: cbi42 fbshipit-source-id: c493265ececdf89636d801d55ae929806c4d4b2c | 3 years ago | 
|  Changyu Bi | 4d0f9a995c | Consider TTL compaction file cutting earlier to prevent small output file (#11075) Summary: in `CompactionOutputs::ShouldStopBefore()`, TTL-related states, `cur_files_to_cut_for_ttl_` and `next_files_to_cut_for_ttl_`, are not updated if the function returns early. This can cause unnecessary compaction output file cuttings and hence produce smaller output files, which may hurt write amp. See the example in the unit test for how this "unnecessary file cutting" can happen. This PR fixes this issue by moving the code for updating TTL states earlier in `CompactionOutputs::ShouldStopBefore()` so that the states are updated for each key. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11075 Test Plan: - Added new unit test. Reviewed By: hx235 Differential Revision: D42398739 Pulled By: cbi42 fbshipit-source-id: 09fab66679c1a734abcfc31bcea33dd9aeb9dbc7 | 3 years ago | 
|  Changyu Bi | 6a82b68788 | Avoid counting extra range tombstone compensated size in `AddRangeDels()` (#11091) Summary: in `CompactionOutputs::AddRangeDels()`, range tombstones with the same start and end key but different sequence numbers all contribute to compensated range tombstone size. This PR removes this redundancy. This PR also includes a fix from https://github.com/facebook/rocksdb/issues/11067 where a range tombstone that is not within a file's range was being added to the file. This fixes an assertion failure for `icmp.Compare(start, end) <= 0` in VersionSet::ApproximateSize() when calculating compensated range tombstone size. Assertions and a comment/essay was added to reason that no such range tombstone will be added after this fix. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11091 Test Plan: - Added unit tests - Stress test with small key range: `python3 tools/db_crashtest.py blackbox --simple --max_key=100 --interval=600 --write_buffer_size=262144 --target_file_size_base=256 --max_bytes_for_level_base=262144 --block_size=128 --value_size_mult=33 --subcompactions=10` Reviewed By: ajkr Differential Revision: D42521588 Pulled By: cbi42 fbshipit-source-id: 5bda3fe38997995314e1f7592319af12b69bc4f8 | 3 years ago | 
|  Changyu Bi | f515d9d203 | Revert #10802 Consider range tombstone in compaction output file cutting (#11089) Summary:
This reverts commit  | 3 years ago | 
|  Hui Xiao | 9502856edd | Add missing range conflict check between file ingestion and RefitLevel() (#10988) Summary: **Context:** File ingestion never checks whether the key range it acts on overlaps with an ongoing RefitLevel() (used in `CompactRange()` with `change_level=true`). That's because RefitLevel() doesn't register and make its key range known to file ingestion. Though it checks overlapping with other compactions by https://github.com/facebook/rocksdb/blob/7.8.fb/db/external_sst_file_ingestion_job.cc#L998. RefitLevel() (used in `CompactRange()` with `change_level=true`) doesn't check whether the key range it acts on overlaps with an ongoing file ingestion. That's because file ingestion does not register and make its key range known to other compactions. - Note that non-refitlevel-compaction (e.g, manual compaction w/o RefitLevel() or general compaction) also does not check key range overlap with ongoing file ingestion for the same reason. - But it's fine. Credited to cbi42's discovery, `WaitForIngestFile` was called by background and foreground compactions. They were introduced in | 3 years ago | 
|  Changyu Bi | cc6f323705 | Include estimated bytes deleted by range tombstones in compensated file size (#10734) Summary: compensate file sizes in compaction picking so files with range tombstones are preferred, such that they get compacted down earlier as they tend to delete a lot of data. This PR adds a `compensated_range_deletion_size` field in FileMeta that is computed during Flush/Compaction and persisted in MANIFEST. This value is added to `compensated_file_size` which will be used for compaction picking. Currently, for a file in level L, `compensated_range_deletion_size` is set to the estimated bytes deleted by range tombstone of this file in all levels > L. This helps to reduce space amp when data in older levels are covered by range tombstones in level L. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10734 Test Plan: - Added unit tests. - benchmark to check if the above definition `compensated_range_deletion_size` is reducing space amp as intended, without affecting write amp too much. The experiment set up favorable for this optimization: large range tombstone issued infrequently. Command used: ``` ./db_bench -benchmarks=fillrandom,waitforcompaction,stats,levelstats -use_existing_db=false -avoid_flush_during_recovery=true -write_buffer_size=33554432 -level_compaction_dynamic_level_bytes=true -max_background_jobs=8 -max_bytes_for_level_base=134217728 -target_file_size_base=33554432 -writes_per_range_tombstone=500000 -range_tombstone_width=5000000 -num=50000000 -benchmark_write_rate_limit=8388608 -threads=16 -duration=1800 --max_num_range_tombstones=1000000000 ``` In this experiment, each thread wrote 16 range tombstones over the duration of 30 minutes, each range tombstone has width 5M that is the 10% of the key space width. Results shows this PR generates a smaller DB size. Compaction stats from this PR: ``` Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 2/0 31.54 MB 0.5 0.0 0.0 0.0 8.4 8.4 0.0 1.0 0.0 63.4 135.56 110.94 544 0.249 0 0 0.0 0.0 L4 3/0 96.55 MB 0.8 18.5 6.7 11.8 18.4 6.6 0.0 2.7 65.3 64.9 290.08 284.03 108 2.686 284M 1957K 0.0 0.0 L5 15/0 404.41 MB 1.0 19.1 7.7 11.4 18.8 7.4 0.3 2.5 66.6 65.7 292.93 285.34 220 1.332 293M 3808K 0.0 0.0 L6 143/0 4.12 GB 0.0 45.0 7.5 37.5 41.6 4.1 0.0 5.5 71.2 65.9 647.00 632.66 251 2.578 739M 47M 0.0 0.0 Sum 163/0 4.64 GB 0.0 82.6 21.9 60.7 87.2 26.5 0.3 10.4 61.9 65.4 1365.58 1312.97 1123 1.216 1318M 52M 0.0 0.0 ``` Compaction stats from main: ``` Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 0/0 0.00 KB 0.0 0.0 0.0 0.0 8.4 8.4 0.0 1.0 0.0 60.5 142.12 115.89 569 0.250 0 0 0.0 0.0 L4 3/0 85.68 MB 1.0 17.7 6.8 10.9 17.6 6.7 0.0 2.6 62.7 62.3 289.05 281.79 112 2.581 272M 2309K 0.0 0.0 L5 11/0 293.73 MB 1.0 18.8 7.5 11.2 18.5 7.2 0.5 2.5 64.9 63.9 296.07 288.50 220 1.346 288M 4365K 0.0 0.0 L6 130/0 3.94 GB 0.0 51.5 7.6 43.9 47.9 3.9 0.0 6.3 67.2 62.4 784.95 765.92 258 3.042 848M 51M 0.0 0.0 Sum 144/0 4.31 GB 0.0 88.0 21.9 66.0 92.3 26.3 0.5 11.0 59.6 62.5 1512.19 1452.09 1159 1.305 1409M 58M 0.0 0.0``` Reviewed By: ajkr Differential Revision: D39834713 Pulled By: cbi42 fbshipit-source-id: fe9341040b8704a8fbb10cad5cf5c43e962c7e6b | 3 years ago | 
|  Changyu Bi | 53b703eafe | Fix an assertion failure in `CompactionOutputs::AddRangeDels()` (#11040) Summary: the [assertion]( | 3 years ago | 
|  Changyu Bi | f02c708aa3 | Consider range tombstone in compaction output file cutting (#10802) Summary: This PR is the first step for Issue https://github.com/facebook/rocksdb/issues/4811. Currently compaction output files are cut at point keys, and the decision is made mainly in `CompactionOutputs::ShouldStopBefore()`. This makes it possible for range tombstones to cause large compactions that does not respect `max_compaction_bytes`. For example, we can have a large range tombstone that overlaps with too many files from the next level. Another example is when there is a gap between a range tombstone and another key. The first issue may be more acceptable, as a lot of data is deleted. This PR address the second issue by calling `ShouldStopBefore()` for range tombstone start keys. The main change is for `CompactionIterator` to emit range tombstone start keys to be processed by `CompactionOutputs`. A new `CompactionMergingIterator` is introduced and only used under `CompactionIterator` for this purpose. Further improvement after this PR include 1) cut compaction output at some grandparent boundary key instead of at the next point key or range tombstone start key and 2) cut compaction output file within a large range tombstone (it may be easier and reasonable to only do it for range tombstones at the end of a compaction output). Pull Request resolved: https://github.com/facebook/rocksdb/pull/10802 Test Plan: - added unit tests in db_range_del_test. - stress test: `python3 tools/db_crashtest.py whitebox --[simple|enable_ts] --verify_iterator_with_expected_state_one_in=5 --delrangepercent=5 --prefixpercent=2 --writepercent=58 --readpercen=21 --duration=36000 --range_deletion_width=1000000` Reviewed By: ajkr, jay-zhuang Differential Revision: D40308827 Pulled By: cbi42 fbshipit-source-id: a8fd6f70a3f09d0ef7a40e006f6c964bba8c00df | 3 years ago | 
|  Hui Xiao | 98d5db5c2e | Sort L0 files by newly introduced epoch_num (#10922) Summary:
**Context:**
Sorting L0 files by `largest_seqno` has at least two inconvenience:
-  File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap.
    - For example, consider the following sequence of events ("key@n" indicates key at seqno "n")
       - insert k1@1 to memtable m1
       - ingest file s1 with k2@2, ingest file s2 with k3@3
        - insert k4@4 to m1
       - compact files s1, s2 and  result in new file s3 of seqno range [2, 3]
       - flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1
    - However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption.
- Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption
    - For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr  for this example)
        - an existing SST s1 contains only k1@1
        - insert k1@2 to memtable m1
        - ingest file s2 with k3@3, ingest file s3 with k4@4
        - insert single delete k5@5 in m1
        - flush m1 and result in new file s4 of seqno range [2, 5]
        - compact s1, s2, s3 and result in new file s5 of seqno range [1, 4]
        - compact s4 and result in new file s6 of seqno range [2] due to single delete
    - By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno`
Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways:
- In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number`  ordering check. This will result in file ordering s1 <  s2 <  s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more.
- In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption.
**Summary:**
- Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`.
  - `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`)
  - Compaction output file  is assigned with the minimum `epoch_number` among input files'
      - Refit level: reuse refitted file's epoch_number
  -  Other paths needing `epoch_number` treatment:
     - Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`
     - Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`.
  -  Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or  by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair).
  - Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder.
- Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery
   - Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more
   - Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag`
- Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above
   - Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`.
- Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR.
- Misc:
   - update existing tests with `epoch_number` so make check will pass
   - update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases
   - assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber()
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922
Test Plan:
- `make check`
- New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc`
- Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930
- [Ongoing] Compatibility test: manually run  | 3 years ago | 
|  Jay Zhuang | 1078d860a9 | Add an unittest for Periodic compaction conflict with ongoing compaction (#10908) Summary: Add a tiered storage migration test which would conflict with an ongoing penultimate level compaction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10908 Test Plan: Test only change Reviewed By: anand1976 Differential Revision: D40864509 Pulled By: ajkr fbshipit-source-id: e316e849a01a6c71a41be130101f909b6c0498cb | 3 years ago | 
|  Hui Xiao | f1574a20ff | Revert PR 10777 "Fix FIFO causing overlapping seqnos in L0 files due to overla…" (#10999) Summary:
**Context/Summary:**
This reverts commit  | 3 years ago | 
|  Hui Xiao | d8c043f7ad | Trigger FIFO file deletion in non L0 only if exceeding max_table_files_size (#10955) Summary: **Context** https://github.com/facebook/rocksdb/pull/10348 allows multi-level FIFO but accidentally made change to the logic of deleting files in `FIFOCompactionPicker::PickSizeCompaction`. With [this](https://github.com/facebook/rocksdb/pull/10348/files#diff-d8fb3d50749aa69b378de447e3d9cf2f48abe0281437f010b5d61365a7b813fdR156) and [this](https://github.com/facebook/rocksdb/pull/10348/files#diff-d8fb3d50749aa69b378de447e3d9cf2f48abe0281437f010b5d61365a7b813fdR235) together, it deletes one file in non-L0 even when `total_size <= mutable_cf_options.compaction_options_fifo.max_table_files_size`, which is incorrect. As a consequence, FIFO exercises more file deletion in our crash testing, which is not able to verify correctly on deleted keys in the file deleted by compaction. This results in errors `error : inconsistent values for key 000000000000239F000000000000012B000000000000028B: expected state has the key, Get() returns NotFound. Verification failed :(` or `Expected state has key 00000000000023A90000000000000003787878, iterator is at key 00000000000023A9000000000000004178 Column family: default, op_logs: S 00000000000023A90000000000000003787878` **Summary**: - Delete file for non-L0 only if `total_size <= mutable_cf_options.compaction_options_fifo.max_table_files_size` - Add some helpful log to LOG file Pull Request resolved: https://github.com/facebook/rocksdb/pull/10955 Test Plan: - Errors repro-ed by ``` ./db_stress --preserve_unverified_changes=1 --acquire_snapshot_one_in=10000 --adaptive_readahead=0 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --async_io=0 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=10 --bottommost_compression_type=none --bytes_per_sync=0 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=1 --checkpoint_one_in=1000000 --checksum_type=kxxHash --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=3 --compaction_style=2 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8589934591 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=xpress --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test/rocksdb_crashtest_whitebox --db_write_buffer_size=1048576 --delpercent=0 --delrangepercent=0 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --expected_values_dir=/dev/shm/rocksdb_test/rocksdb_crashtest_expected --fail_if_options_file_error=1 --fifo_allow_compaction=1 --file_checksum_impl=xxh64 --flush_one_in=1000000 --format_version=4 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=10 --index_type=2 --ingest_external_file_one_in=1000000 --initial_auto_readahead_size=16384 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=False --log2_keys_per_lock=10 --long_running_snapshots=0 --manual_wal_flush_one_in=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=524288 --max_background_compactions=1 --max_bytes_for_level_base=67108864 --max_key=25000000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.01 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=0 --mock_direct_io=True --nooverwritepercent=0 --num_file_reads_for_auto_readahead=2 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=40000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=7 --prefixpercent=5 --prepopulate_block_cache=0 --preserve_internal_time_seconds=3600 --progress_reports=0 --read_fault_one_in=1000 --readahead_size=0 --readpercent=65 --recycle_log_file_num=1 --reopen=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=0 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --stats_dump_period_sec=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=16777216 --target_file_size_multiplier=1 --test_batches_snapshots=0 --top_level_index_pinning=1 --unpartitioned_pinning=1 --use_direct_io_for_flush_and_compaction=1 --use_direct_reads=1 --use_full_merge_v1=1 --use_merge=0 --use_multiget=0 --use_put_entity_one_in=0 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_iterator_with_expected_state_one_in=0 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=33554432 --write_dbid_to_manifest=1 --writepercent=20 ``` is gone after this fix - CI Reviewed By: ajkr Differential Revision: D41319441 Pulled By: hx235 fbshipit-source-id: 6939753767007f7449ea7055b1420aabd03d7709 | 3 years ago | 
|  Andrew Kryczka | 54c2542df2 | Support tiering when file endpoints overlap (#10961) Summary: Enabled output to penultimate level when file endpoints overlap. This is probably only possible when range tombstones span files. Otherwise the overlapping files would all be included in the penultimate level inputs thanks to our atomic compaction unit logic. Also, corrected `penultimate_output_range_type_`, which is a minor fix as it appears only used for logging. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10961 Test Plan: updated unit test Reviewed By: cbi42 Differential Revision: D41370615 Pulled By: ajkr fbshipit-source-id: 7e75ec369a3b41b8382b336446c81825a4c4f572 | 3 years ago | 
|  Changyu Bi | 6c5ec92070 | Set correct temperature for range tombstone only file in penultimate level (#10972) Summary: before this PR, if there is a range tombstone-only file generated in penultimate level, it is marked the `last_level_temperature`. This PR fixes this issue. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10972 Test Plan: added unit test for this scenario. Reviewed By: ajkr Differential Revision: D41449215 Pulled By: cbi42 fbshipit-source-id: 1e06b5ae3bc0183db2991a45965a9807a7e8be0c | 3 years ago | 
|  Andrew Kryczka | 097f9f4425 | Fix CompactionIterator flag for penultimate level output (#10967) Summary: We were not resetting it in non-debug mode so it could be true once and then stay true for future keys where it should be false. This PR adds the reset logic. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10967 Test Plan: - built `db_bench` with DEBUG_LEVEL=0 - ran benchmark: `TEST_TMPDIR=/dev/shm/prefix ./db_bench -benchmarks=fillrandom -compaction_style=1 -preserve_internal_time_seconds=100 -preclude_last_level_data_seconds=10 -write_buffer_size=1048576 -target_file_size_base=1048576 -subcompactions=8 -duration=120` - compared "output_to_penultimate_level: X bytes + last: Y bytes" lines in LOG output - Before this fix, Y was always zero - After this fix, Y gradually increased throughout the benchmark Reviewed By: riversand963 Differential Revision: D41417726 Pulled By: ajkr fbshipit-source-id: ace1e9a289e751a5b0c2fbaa8addd4eda5525329 | 3 years ago | 
|  Yanqin Jin | 7d26e4c5a3 | Basic Support for Merge with user-defined timestamp (#10819) Summary:
This PR implements the originally disabled `Merge()` APIs when user-defined timestamp is enabled.
Simplest usage:
```cpp
// assume string append merge op is used with '.' as delimiter.
// ts1 < ts2
db->Put(WriteOptions(), "key", ts1, "v0");
db->Merge(WriteOptions(), "key", ts2, "1");
ReadOptions ro;
ro.timestamp = &ts2;
db->Get(ro, "key", &value);
ASSERT_EQ("v0.1", value);
```
Some code comments are added for clarity.
Note: support for timestamp in `DB::GetMergeOperands()` will be done in a follow-up PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10819
Test Plan: make check
Reviewed By: ltamasi
Differential Revision: D40603195
Pulled By: riversand963
fbshipit-source-id: f96d6f183258f3392d80377025529f7660503013 | 3 years ago | 
|  Jay Zhuang | b36ec37a4b | clang-format for db/compaction (#10882) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10882 Reviewed By: riversand963 Differential Revision: D40724867 Pulled By: jay-zhuang fbshipit-source-id: 7f387724f8cd07d8d2b90566a515a4e9078d21f1 | 3 years ago | 
|  Hui Xiao | fc74abb436 | Fix FIFO causing overlapping seqnos in L0 files due to overlapped seqnos between ingested files and memtable's (#10777) Summary: **Context:** Same as https://github.com/facebook/rocksdb/pull/5958#issue-511150930 but apply the fix to FIFO Compaction case Repro: ``` COERCE_CONTEXT_SWICH=1 make -j56 db_stress ./db_stress --acquire_snapshot_one_in=0 --adaptive_readahead=0 --allow_data_in_errors=True --async_io=1 --avoid_flush_during_recovery=0 --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=0 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=18 --bottommost_compression_type=disable --bytes_per_sync=262144 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=0 --charge_file_metadata=1 --charge_filter_construction=1 --charge_table_reader=1 --checkpoint_one_in=0 --checksum_type=kCRC32c --clear_column_family_one_in=0 --column_families=1 --compact_files_one_in=0 --compact_range_one_in=1000 --compaction_pri=3 --open_files=-1 --compaction_style=2 --fifo_allow_compaction=1 --compaction_ttl=0 --compression_max_dict_buffer_bytes=8388607 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=zlib --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --data_block_index_type=0 --db=/dev/shm/rocksdb_test0/rocksdb_crashtest_whitebox --db_write_buffer_size=8388608 --delpercent=4 --delrangepercent=1 --destroy_db_initially=1 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=0 --get_property_one_in=0 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=15 --index_type=3 --ingest_external_file_one_in=100 --initial_auto_readahead_size=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --log2_keys_per_lock=10 --long_running_snapshots=0 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=100000 --max_key_len=3 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=1048576 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=4194304 --memtable_prefix_bloom_size_ratio=0.5 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --memtablerep=skip_list --mmap_read=1 --mock_direct_io=False --nooverwritepercent=1 --num_file_reads_for_auto_readahead=0 --num_levels=1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=32 --open_write_fault_one_in=0 --ops_per_thread=200000 --optimize_filters_for_memory=0 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=1 --pause_background_one_in=0 --periodic_compaction_seconds=0 --prefix_size=8 --prefixpercent=5 --prepopulate_block_cache=0 --progress_reports=0 --read_fault_one_in=0 --readahead_size=16384 --readpercent=45 --recycle_log_file_num=1 --reopen=20 --ribbon_starting_level=999 --snapshot_hold_ops=1000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --subcompactions=2 --sync=0 --sync_fault_injection=0 --target_file_size_base=524288 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --unpartitioned_pinning=0 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_full_merge_v1=1 --use_merge=0 --use_multiget=1 --user_timestamp_size=0 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=0 --verify_db_one_in=1000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=zstd --write_buffer_size=524288 --write_dbid_to_manifest=0 --writepercent=35 put or merge error: Corruption: force_consistency_checks(DEBUG): VersionBuilder: L0 file https://github.com/facebook/rocksdb/issues/479 with seqno 23711 29070 vs. file https://github.com/facebook/rocksdb/issues/482 with seqno 27138 29049 ``` **Summary:** FIFO only does intra-L0 compaction in the following four cases. For other cases, FIFO drops data instead of compacting on data, which is irrelevant to the overlapping seqno issue we are solving. - [FIFOCompactionPicker::PickSizeCompaction](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L155) when `total size < compaction_options_fifo.max_table_files_size` and `compaction_options_fifo.allow_compaction == true` - For this path, we simply reuse the fix in `FindIntraL0Compaction` https://github.com/facebook/rocksdb/pull/5958/files#diff-c261f77d6dd2134333c4a955c311cf4a196a08d3c2bb6ce24fd6801407877c89R56 - This path was not stress-tested at all. Therefore we covered `fifo.allow_compaction` in stress test to surface the overlapping seqno issue we are fixing here. - [FIFOCompactionPicker::PickCompactionToWarm](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L313) when `compaction_options_fifo.age_for_warm > 0` - For this path, we simply replicate the idea in https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and skip files of largest seqno greater than `earliest_mem_seqno` - This path was not stress-tested at all. However covering `age_for_warm` option worths a separate PR to deal with db stress compatibility. Therefore we manually tested this path for this PR - [FIFOCompactionPicker::CompactRange](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker_fifo.cc#L365) that ends up picking one of the above two compactions - [CompactionPicker::CompactFiles](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.cc#L378) - Since `SanitizeCompactionInputFiles()` will be called [before](https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_picker.h#L111-L113) `CompactionPicker::CompactFiles` , we simply replicate the idea in https://github.com/facebook/rocksdb/pull/5958#issue-511150930 in `SanitizeCompactionInputFiles()`. To simplify implementation, we return `Stats::Abort()` on encountering seqno-overlapped file when doing compaction to L0 instead of skipping the file and proceed with the compaction. Some additional clean-up included in this PR: - Renamed `earliest_memtable_seqno` to `earliest_mem_seqno` for consistent naming - Added comment about `earliest_memtable_seqno` in related APIs - Made parameter `earliest_memtable_seqno` constant and required Pull Request resolved: https://github.com/facebook/rocksdb/pull/10777 Test Plan: - make check - New unit test `TEST_P(DBCompactionTestFIFOCheckConsistencyWithParam, FlushAfterIntraL0CompactionWithIngestedFile)`corresponding to the above 4 cases, which will fail accordingly without the fix - Regular CI stress run on this PR + stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761 and on FIFO compaction only Reviewed By: ajkr Differential Revision: D40090485 Pulled By: hx235 fbshipit-source-id: 52624186952ee7109117788741aeeac86b624a4f | 3 years ago | 
|  Jay Zhuang | f726d29a82 | Allow penultimate level output for the last level only compaction (#10822) Summary: Allow the last level only compaction able to output result to penultimate level if the penultimate level is empty. Which will also block the other compaction output to the penultimate level. (it includes the PR https://github.com/facebook/rocksdb/issues/10829) Pull Request resolved: https://github.com/facebook/rocksdb/pull/10822 Reviewed By: siying Differential Revision: D40389180 Pulled By: jay-zhuang fbshipit-source-id: 4e5dcdce307795b5e07b5dd1fa29dd75bb093bad | 3 years ago | 
|  Changyu Bi | 333abe9c55 | Ignore max_compaction_bytes for compaction input that are within output key-range (#10835) Summary: When picking compaction input files, we sometimes stop picking a file that is fully included in the output key-range due to hitting max_compaction_bytes. Including these input files can potentially reduce WA at the expense of larger compactions. Larger compaction should be fine as files from input level are usually 10X smaller than files from output level. This PR adds a mutable CF option `ignore_max_compaction_bytes_for_input` that is enabled by default. We can remove this option once we are sure it is safe. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10835 Test Plan: - CI, a unit test on max_compaction_bytes fails before turning this flag off. - Benchmark does not show much difference in WA: `./db_bench --benchmarks=fillrandom,waitforcompaction,stats,levelstats -max_background_jobs=12 -num=2000000000 -target_file_size_base=33554432 --write_buffer_size=33554432` ``` main: ** Compaction Stats [default] ** Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 3/0 91.59 MB 0.8 70.9 0.0 70.9 200.8 129.9 0.0 1.5 25.2 71.2 2886.55 2463.45 9725 0.297 1093M 254K 0.0 0.0 L1 9/0 248.03 MB 1.0 392.0 129.8 262.2 391.7 129.5 0.0 3.0 69.0 68.9 5821.71 5536.90 804 7.241 6029M 5814K 0.0 0.0 L2 87/0 2.50 GB 1.0 537.0 128.5 408.5 533.8 125.2 0.7 4.2 69.5 69.1 7912.24 7323.70 4417 1.791 8299M 36M 0.0 0.0 L3 836/0 24.99 GB 1.0 616.9 118.3 498.7 594.5 95.8 5.2 5.0 66.9 64.5 9442.38 8490.28 4204 2.246 9749M 306M 0.0 0.0 L4 2355/0 62.95 GB 0.3 67.3 37.1 30.2 54.2 24.0 38.9 1.5 72.2 58.2 954.37 821.18 917 1.041 1076M 173M 0.0 0.0 Sum 3290/0 90.77 GB 0.0 1684.2 413.7 1270.5 1775.0 504.5 44.9 13.7 63.8 67.3 27017.25 24635.52 20067 1.346 26G 522M 0.0 0.0 Cumulative compaction: 1774.96 GB write, 154.29 MB/s write, 1684.19 GB read, 146.40 MB/s read, 27017.3 seconds This PR: ** Compaction Stats [default] ** Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) CompMergeCPU(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Rblob(GB) Wblob(GB) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ L0 3/0 45.71 MB 0.8 72.9 0.0 72.9 202.8 129.9 0.0 1.6 25.4 70.7 2938.16 2510.36 9741 0.302 1124M 265K 0.0 0.0 L1 8/0 234.54 MB 0.9 384.5 129.8 254.7 384.2 129.6 0.0 3.0 69.0 68.9 5708.08 5424.43 791 7.216 5913M 5753K 0.0 0.0 L2 84/0 2.47 GB 1.0 543.1 128.6 414.5 539.9 125.4 0.7 4.2 69.6 69.2 7989.31 7403.13 4418 1.808 8393M 36M 0.0 0.0 L3 839/0 24.96 GB 1.0 615.6 118.4 497.2 593.2 96.0 5.1 5.0 66.6 64.1 9471.23 8489.31 4193 2.259 9726M 306M 0.0 0.0 L4 2360/0 63.04 GB 0.3 67.6 37.3 30.3 54.4 24.1 38.9 1.5 71.5 57.6 967.30 827.99 907 1.066 1080M 173M 0.0 0.0 Sum 3294/0 90.75 GB 0.0 1683.8 414.2 1269.6 1774.5 504.9 44.8 13.7 63.7 67.1 27074.08 24655.22 20050 1.350 26G 522M 0.0 0.0 Cumulative compaction: 1774.52 GB write, 157.09 MB/s write, 1683.77 GB read, 149.06 MB/s read, 27074.1 seconds ``` Reviewed By: ajkr Differential Revision: D40518319 Pulled By: cbi42 fbshipit-source-id: f4ea614bc0ebefe007ffaf05bb9aec9a8ca25b60 | 3 years ago | 
|  Jay Zhuang | 1663f77d2a | Fix no internal time recorded for small preclude_last_level (#10829) Summary: When the `preclude_last_level_data_seconds` or `preserve_internal_time_seconds` is smaller than 100 (seconds), no seqno->time information was recorded. Also make sure all data will be compacted to the last level even if there's no write to record the time information. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10829 Test Plan: added unittest Reviewed By: siying Differential Revision: D40443934 Pulled By: jay-zhuang fbshipit-source-id: 2ecf1361daf9f3e5c3385aee6dc924fa59e2813a | 3 years ago | 
|  Yueh-Hsuan Chiang | e267909ecf | Enable a multi-level db to smoothly migrate to FIFO via DB::Open (#10348) Summary: FIFO compaction can theoretically open a DB with any compaction style. However, the current code only allows FIFO compaction to open a DB with a single level. This PR relaxes the limitation of FIFO compaction and allows it to open a DB with multiple levels. Below is the read / write / compaction behavior: * The read behavior is untouched, and it works like a regular rocksdb instance. * The write behavior is untouched as well. When a FIFO compacted DB is opened with multiple levels, all new files will still be in level 0, and no files will be moved to a different level. * Compaction logic is extended. It will first identify the bottom-most non-empty level. Then, it will delete the oldest file in that level. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10348 Test Plan: Added a new test to verify the migration from level to FIFO where the db has multiple levels. Extended existing test cases in db_test and db_basic_test to also verify all entries of a key after reopening the DB with FIFO compaction. Reviewed By: jay-zhuang Differential Revision: D40233744 fbshipit-source-id: 6cc011d6c3467e6bfb9b6a4054b87619e69815e1 | 3 years ago | 
|  Peter Dillinger | e466173d5c | Print stack traces on frozen tests in CI (#10828) Summary: Instead of existing calls to ps from gnu_parallel, call a new wrapper that does ps, looks for unit test like processes, and uses pstack or gdb to print thread stack traces. Also, using `ps -wwf` instead of `ps -wf` ensures output is not cut off. For security, CircleCI runs with security restrictions on ptrace (/proc/sys/kernel/yama/ptrace_scope = 1), and this change adds a work-around to `InstallStackTraceHandler()` (only used by testing tools) to allow any process from the same user to debug it. (I've also touched >100 files to ensure all the unit tests call this function.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/10828 Test Plan: local manual + temporary infinite loop in a unit test to observe in CircleCI Reviewed By: hx235 Differential Revision: D40447634 Pulled By: pdillinger fbshipit-source-id: 718a4c4a5b54fa0f9af2d01a446162b45e5e84e1 | 3 years ago | 
|  Jay Zhuang | 5a5f21c489 | Allow the last level data moving up to penultimate level (#10782) Summary: Lock the penultimate level for the whole compaction inputs range, so any key in that compaction is safe to move up from the last level to penultimate level. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10782 Reviewed By: siying Differential Revision: D40231540 Pulled By: siying fbshipit-source-id: ca115cc8b4018b35d797329fa85a19b06cc8c13e | 3 years ago | 
|  Jay Zhuang | c401f285c3 | Add option `preserve_internal_time_seconds` to preserve the time info (#10747) Summary: Add option `preserve_internal_time_seconds` to preserve the internal time information. It's mostly for the migration of the existing data to tiered storage ( `preclude_last_level_data_seconds`). When the tiering feature is just enabled, the existing data won't have the time information to decide if it's hot or cold. Enabling this feature will start collect and preserve the time information for the new data. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10747 Reviewed By: siying Differential Revision: D39910141 Pulled By: siying fbshipit-source-id: 25c21638e37b1a7c44006f636b7d714fe7242138 | 3 years ago | 
|  Yanqin Jin | 11943e8b27 | Exclude timestamp when checking compaction boundaries (#10787) Summary: When checking if a range [start, end) overlaps with a compaction whose range is [start1, end1), always exclude timestamp from start, end, start1 and end1, otherwise some versions of one user key may be compacted to bottommost layer while others remain in the original level. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10787 Test Plan: make check Reviewed By: ltamasi Differential Revision: D40187672 Pulled By: ltamasi fbshipit-source-id: 81226267fd3e33ffa79665c62abadf2ebec45496 | 3 years ago | 
|  Jay Zhuang | 23fa5b7789 | Use `sstableKeyCompare()` for compaction output boundary check (#10763) Summary: To make it consistent with the compaction picker which uses the `sstableKeyCompare()` to pick the overlap files. For example, without this change, it may cut L1 files like: ``` L1: [2-21] [22-30] L2: [1-10] [21-30] ``` Because "21" on L1 is smaller than "21" on L2. But for compaction, these 2 files are overlapped. `sstableKeyCompare()` also take range delete into consideration which may cut file for the same key. It also makes the `max_compaction_bytes` calculation more accurate for cases like above, the overlapped bytes was under estimated. Also make sure the 2 keys won't be splitted to 2 files because of reaching `max_compaction_bytes`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10763 Reviewed By: cbi42 Differential Revision: D39971904 Pulled By: cbi42 fbshipit-source-id: bcc309e9c3dc61a8f50667a6f633e6132c0154a8 | 3 years ago | 
|  Jay Zhuang | f007ad8b4f | RoundRobin TTL compaction (#10725) Summary: For RoundRobin compaction, the data should be mostly sorted per level and within level. Use normal compaction picker for RR until all expired data is compacted. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10725 Reviewed By: ajkr Differential Revision: D39771069 Pulled By: jay-zhuang fbshipit-source-id: 7ccf88d7c093fad5673bda73a7b08cc4757780cd | 3 years ago | 
|  Changyu Bi | 9f2363f4c4 | User-defined timestamp support for `DeleteRange()` (#10661) Summary: Add user-defined timestamp support for range deletion. The new API is `DeleteRange(opt, cf, begin_key, end_key, ts)`. Most of the change is to update the comparator to compare without timestamp. Other than that, major changes are - internal range tombstone data structures (`FragmentedRangeTombstoneList`, `RangeTombstone`, etc.) to store timestamps. - Garbage collection of range tombstones and range tombstone covered keys during compaction. - Get()/MultiGet() to return the timestamp of a range tombstone when needed. - Get/Iterator with range tombstones bounded by readoptions.timestamp. - timestamp crash test now issues DeleteRange by default. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10661 Test Plan: - Added unit test: `make check` - Stress test: `python3 tools/db_crashtest.py --enable_ts whitebox --readpercent=57 --prefixpercent=4 --writepercent=25 -delpercent=5 --iterpercent=5 --delrangepercent=4` - Ran `db_bench` to measure regression when timestamp is not enabled. The tests are for write (with some range deletion) and iterate with DB fitting in memory: `./db_bench--benchmarks=fillrandom,seekrandom --writes_per_range_tombstone=200 --max_write_buffer_number=100 --min_write_buffer_number_to_merge=100 --writes=500000 --reads=500000 --seek_nexts=10 --disable_auto_compactions -disable_wal=true --max_num_range_tombstones=1000`. Did not see consistent regression in no timestamp case. | micros/op | fillrandom | seekrandom | | --- | --- | --- | |main| 2.58 |10.96| |PR 10661| 2.68 |10.63| Reviewed By: riversand963 Differential Revision: D39441192 Pulled By: cbi42 fbshipit-source-id: f05aca3c41605caf110daf0ff405919f300ddec2 | 3 years ago | 
|  Changyu Bi | fd71a82f4f | Use actual file size when checking max_compaction_size (#10728) Summary: currently, there are places in compaction_picker where we add up `compensated_file_size` of files being compacted and limit the sum to be under `max_compaction_bytes`. `compensated_file_size` contains booster for point tombstones and should be used only for determining file's compaction priority. This PR replaces `compensated_file_size` with actual file size in such places. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10728 Test Plan: CI Reviewed By: ajkr Differential Revision: D39789427 Pulled By: cbi42 fbshipit-source-id: 1f89fb6c0159c53bf01d8dc783f465959f442c81 | 3 years ago | 
|  Jay Zhuang | f3cc66632b | Align compaction output file boundaries to the next level ones (#10655) Summary:
Try to align the compaction output file boundaries to the next level ones
(grandparent level), to reduce the level compaction write-amplification.
In level compaction, there are "wasted" data at the beginning and end of the
output level files. Align the file boundary can avoid such "wasted" compaction.
With this PR, it tries to align the non-bottommost level file boundaries to its
next level ones. It may cut file when the file size is large enough (at least
50% of target_file_size) and not too large (2x target_file_size).
db_bench shows about 12.56% compaction reduction:
```
TEST_TMPDIR=/data/dbbench2 ./db_bench --benchmarks=fillrandom,readrandom -max_background_jobs=12 -num=400000000 -target_file_size_base=33554432
# baseline:
Flush(GB): cumulative 25.882, interval 7.216
Cumulative compaction: 285.90 GB write, 162.36 MB/s write, 269.68 GB read, 153.15 MB/s read, 2926.7 seconds
# with this change:
Flush(GB): cumulative 25.882, interval 7.753
Cumulative compaction: 249.97 GB write, 141.96 MB/s write, 233.74 GB read, 132.74 MB/s read, 2534.9 seconds
```
The compaction simulator shows a similar result (14% with 100G random data).
As a side effect, with this PR, the SST file size can exceed the
target_file_size, but is capped at 2x target_file_size. And there will be
smaller files. Here are file size statistics when loading 100GB with the target
file size 32MB:
```
          baseline      this_PR
count  1.656000e+03  1.705000e+03
mean   3.116062e+07  3.028076e+07
std    7.145242e+06  8.046139e+06
```
The feature is enabled by default, to revert to the old behavior disable it
with `AdvancedColumnFamilyOptions.level_compaction_dynamic_file_size = false`
Also includes https://github.com/facebook/rocksdb/issues/1963 to cut file before skippable grandparent file. Which is for
use case like user adding 2 or more non-overlapping data range at the same
time, it can reduce the overlapping of 2 datasets in the lower levels.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10655
Reviewed By: cbi42
Differential Revision: D39552321
Pulled By: jay-zhuang
fbshipit-source-id: 640d15f159ab0cd973f2426cfc3af266fc8bdde2 | 3 years ago | 
|  Jay Zhuang | 92df36985d | Deflake CompactionServiceTest.BasicCompactions (#10697) Summary: The background compaction may still running while the test end, which would cause ASAN stack-use-after-scope error. Explicitly close the DB before test end. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10697 Test Plan: able to reproduce with: ``` gtest-parallel ./compaction_service_test --gtest_filter=CompactionServiceTest.BasicCompactions -r 10000 -w 100 ``` Reviewed By: gitbw95 Differential Revision: D39590974 Pulled By: jay-zhuang fbshipit-source-id: da264b2e6a276afbda7d5ff7adb9d7b8d4213d90 | 3 years ago | 
|  Jay Zhuang | 849cf1bf68 | Refactor Compaction file cut `ShouldStopBefore()` (#10629) Summary: Consolidate compaction output cut logic to `ShouldStopBefore()` and move it inside of CompactionOutputs class. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10629 Reviewed By: cbi42 Differential Revision: D39315536 Pulled By: jay-zhuang fbshipit-source-id: 7d81037babbd35c276bbaad02dbc2bb555fdac18 | 3 years ago | 
|  Yanqin Jin | ce2c11d848 | Fix a bug by setting up subcompaction bounds properly (#10658) Summary: When user-defined timestamp is enabled, subcompaction bounds should be set up properly. When creating InputIterator for the compaction, the `start` and `end` should have their timestamp portions set to kMaxTimestamp, which is the highest possible timestamp. This is similar to what we do with setting up their sequence numbers to `kMaxSequenceNumber`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10658 Test Plan: ```bash make check rm -rf /dev/shm/rocksdb/* && mkdir /dev/shm/rocksdb/rocksdb_crashtest_expected && ./db_stress --allow_data_in_errors=True --clear_column_family_one_in=0 --continuous_verification_interval=0 --data_block_index_type=1 --db=/dev/shm/rocksdb//rocksdb_crashtest_blackbox --delpercent=5 --delrangepercent=0 --expected_values_dir=/dev/shm/rocksdb//rocksdb_crashtest_expected --iterpercent=0 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=25000000 --max_write_batch_group_size_bytes=1048576 --nooverwritepercent=1 --ops_per_thread=300000 --paranoid_file_checks=1 --partition_filters=0 --prefix_size=8 --prefixpercent=5 --readpercent=30 --reopen=0 --snapshot_hold_ops=100000 --subcompactions=4 --target_file_size_base=65536 --target_file_size_multiplier=2 --test_batches_snapshots=0 --test_cf_consistency=0 --use_multiget=1 --user_timestamp_size=8 --value_size_mult=32 --verify_checksum=1 --write_buffer_size=65536 --writepercent=60 -disable_wal=1 -column_families=1 ``` Reviewed By: akankshamahajan15 Differential Revision: D39393402 Pulled By: riversand963 fbshipit-source-id: f276e35b19fce51a175c368a502fb0718d1f3871 | 3 years ago | 
|  Yanqin Jin | 3d67d79154 | Fix overlapping check by excluding timestamp (#10615) Summary:
With user-defined timestamp, checking overlapping should exclude
timestamp part from key. This has already been done for range checking
for files in sstableKeyCompare(), but not yet done when checking with
concurrent compactions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10615
Test Plan:
(Will add more tests)
make check
(Repro seems easier with this commit sha: git checkout  | 3 years ago | 
|  Peter Dillinger | 6de7081cf3 | Always verify SST unique IDs on SST file open (#10532) Summary: Although we've been tracking SST unique IDs in the DB manifest unconditionally, checking has been opt-in and with an extra pass at DB::Open time. This changes the behavior of `verify_sst_unique_id_in_manifest` to check unique ID against manifest every time an SST file is opened through table cache (normal DB operations), replacing the explicit pass over files at DB::Open time. This change also enables the option by default and removes the "EXPERIMENTAL" designation. One possible criticism is that the option no longer ensures the integrity of a DB at Open time. This is far from an all-or-nothing issue. Verifying the IDs of all SST files hardly ensures all the data in the DB is readable. (VerifyChecksum is supposed to do that.) Also, with max_open_files=-1 (default, extremely common), all SST files are opened at DB::Open time anyway. Implementation details: * `VerifySstUniqueIdInManifest()` functions are the extra/explicit pass that is now removed. * Unit tests that manipulate/corrupt table properties have to opt out of this check, because that corrupts the "actual" unique id. (And even for testing we don't currently have a mechanism to set "no unique id" in the in-memory file metadata for new files.) * A lot of other unit test churn relates to (a) default checking on, and (b) checking on SST open even without DB::Open (e.g. on flush) * Use `FileMetaData` for more `TableCache` operations (in place of `FileDescriptor`) so that we have access to the unique_id whenever we might need to open an SST file. **There is the possibility of performance impact because we can no longer use the more localized `fd` part of an `FdWithKeyRange` but instead follow the `file_metadata` pointer. However, this change (possible regression) is only done for `GetMemoryUsageByTableReaders`.** * Removed a completely unnecessary constructor overload of `TableReaderOptions` Possible follow-up: * Verification only happens when opening through table cache. Are there more places where this should happen? * Improve error message when there is a file size mismatch vs. manifest (FIXME added in the appropriate place). * I'm not sure there's a justification for `FileDescriptor` to be distinct from `FileMetaData`. * I'm skeptical that `FdWithKeyRange` really still makes sense for optimizing some data locality by duplicating some data in memory, but I could be wrong. * An unnecessary overload of NewTableReader was recently added, in the public API nonetheless (though unusable there). It should be cleaned up to put most things under `TableReaderOptions`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10532 Test Plan: updated unit tests Performance test showing no significant difference (just noise I think): `./db_bench -benchmarks=readwhilewriting[-X10] -num=3000000 -disable_wal=1 -bloom_bits=8 -write_buffer_size=1000000 -target_file_size_base=1000000` Before: readwhilewriting [AVG 10 runs] : 68702 (± 6932) ops/sec After: readwhilewriting [AVG 10 runs] : 68239 (± 7198) ops/sec Reviewed By: jay-zhuang Differential Revision: D38765551 Pulled By: pdillinger fbshipit-source-id: a827a708155f12344ab2a5c16e7701c7636da4c2 | 3 years ago | 
|  Hui Xiao | 8a85946f58 | Add missing mutex when reading from shared variable bg_bottom_compaction_scheduled_, bg_compaction_scheduled_ (#10610) Summary: **Context/Summary:** According to https://github.com/facebook/rocksdb/blob/7.6.fb/db/compaction/compaction_job.h#L328-L332, any reading in the form of `*bg_compaction_scheduled_` , `*bg_bottom_compaction_scheduled_` should be protected by mutex, which isn't the case for some assert statement. This leads to a data race that can be repro-ed by the following command (command coming soon) ``` db=/dev/shm/rocksdb_crashtest_blackbox exp=/dev/shm/rocksdb_crashtest_expected rm -rf $db $exp mkdir -p $exp ./db_stress --clear_column_family_one_in=0 --column_families=1 --db=$db --delpercent=10 --delrangepercent=0 --destroy_db_initially=1 --expected_values_dir=$exp --iterpercent=0 --key_len_percent_dist=1,30,69 --max_key=1000000 --max_key_len=3 --prefixpercent=0 --readpercent=0 --reopen=0 --ops_per_thread=100000000 --value_size_mult=32 --writepercent=90 --compaction_pri=4 --use_txn=1 --level_compaction_dynamic_level_bytes=True --compaction_ttl=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --value_size_mult=32 --verify_db_one_in=1000 --write_buffer_size=65536 --mark_for_compaction_one_file_in=10 --max_background_compactions=20 --max_key=25000000 --max_key_len=3 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=2097152 --target_file_size_base=2097152 --target_file_size_multiplier=2 ``` ``` WARNING: ThreadSanitizer: data race (pid=73424) Read of size 4 at 0x7b8c0000151c by thread T13: #0 ReleaseSubcompactionResources internal_repo_rocksdb/repo/db/compaction/compaction_job.cc:390 (db_stress+0x630aa3) https://github.com/facebook/rocksdb/issues/1 rocksdb::CompactionJob::Run() internal_repo_rocksdb/repo/db/compaction/compaction_job.cc:741 (db_stress+0x630aa3) https://github.com/facebook/rocksdb/issues/2 rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority) internal_repo_rocksdb/repo/db/db_impl/db_impl_compaction_flush.cc:3436 (db_stress+0x60b2cc) https://github.com/facebook/rocksdb/issues/3 rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority) internal_repo_rocksdb/repo/db/db_impl/db_impl_compaction_flush.cc:2950 (db_stress+0x606d79) https://github.com/facebook/rocksdb/issues/4 rocksdb::DBImpl::BGWorkCompaction(void*) internal_repo_rocksdb/repo/db/db_impl/db_impl_compaction_flush.cc:2693 (db_stress+0x60356a) Previous write of size 4 at 0x7b8c0000151c by thread T12 (mutexes: write M438955329917552448): #0 rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority) internal_repo_rocksdb/repo/db/db_impl/db_impl_compaction_flush.cc:3018 (db_stress+0x6072a1) https://github.com/facebook/rocksdb/issues/1 rocksdb::DBImpl::BGWorkCompaction(void*) internal_repo_rocksdb/repo/db/db_impl/db_impl_compaction_flush.cc:2693 (db_stress+0x60356a) Location is heap block of size 6720 at 0x7b8c00000000 allocated by main thread: #0 operator new(unsigned long, std::align_val_t) <null> (db_stress+0xbab5bb) https://github.com/facebook/rocksdb/issues/1 rocksdb::DBImpl::Open(rocksdb::DBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::DB**, bool, bool) internal_repo_rocksdb/repo/db/db_impl/db_impl_open.cc:1811 (db_stress+0x69769a) https://github.com/facebook/rocksdb/issues/2 rocksdb::TransactionDB::Open(rocksdb::DBOptions const&, rocksdb::TransactionDBOptions const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<rocksdb::ColumnFamilyDescriptor, std::allocator<rocksdb::ColumnFamilyDescriptor> > const&, std::vector<rocksdb::ColumnFamilyHandle*, std::allocator<rocksdb::ColumnFamilyHandle*> >*, rocksdb::TransactionDB**) internal_repo_rocksdb/repo/utilities/transactions/pessimistic_transaction_db.cc:258 (db_stress+0x8ae1f4) https://github.com/facebook/rocksdb/issues/3 rocksdb::StressTest::Open(rocksdb::SharedState*) internal_repo_rocksdb/repo/db_stress_tool/db_stress_test_base.cc:2611 (db_stress+0x32b927) https://github.com/facebook/rocksdb/issues/4 rocksdb::StressTest::InitDb(rocksdb::SharedState*) internal_repo_rocksdb/repo/db_stress_tool/db_stress_test_base.cc:290 (db_stress+0x34712c) ``` This PR added all the missing mutex that should've been in place Pull Request resolved: https://github.com/facebook/rocksdb/pull/10610 Test Plan: - Past repro command - Existing CI Reviewed By: riversand963 Differential Revision: D39143016 Pulled By: hx235 fbshipit-source-id: 51dd4db55ad306f3dbda5d0dd54d6f2513cf70f2 | 3 years ago | 
|  Hui Xiao | e484b81eee | Sync dir containing CURRENT after RenameFile on CURRENT as much as possible (#10573) Summary:
**Context:**
Below crash test revealed a bug that directory containing CURRENT file (short for `dir_contains_current_file` below) was not always get synced after a new CURRENT is created and being called with `RenameFile` as part of the creation.
This bug exposes a risk that such un-synced directory containing the updated CURRENT can’t survive a host crash (e.g, power loss) hence get corrupted. This then will be followed by a recovery from a corrupted CURRENT that we don't want.
The root-cause is that a nullptr `FSDirectory* dir_contains_current_file` sometimes gets passed-down to `SetCurrentFile()` hence in those case `dir_contains_current_file->FSDirectory::FsyncWithDirOptions()` will be skipped  (which otherwise will internally call`Env/FS::SyncDic()` )
```
./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --allow_data_in_errors=True --avoid_unnecessary_blocking_io=0 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=8 --block_size=16384 --bloom_bits=134.8015470676662 --bottommost_compression_type=disable --cache_size=8388608 --checkpoint_one_in=1000000 --checksum_type=kCRC32c --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=2 --compaction_ttl=100 --compression_max_dict_buffer_bytes=511 --compression_max_dict_bytes=16384 --compression_type=zstd --compression_use_zstd_dict_trainer=1 --compression_zstd_max_train_bytes=65536 --continuous_verification_interval=0 --data_block_index_type=0 --db=$db --db_write_buffer_size=1048576 --delpercent=5 --delrangepercent=0 --destroy_db_initially=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=1 --expected_values_dir=$exp --fail_if_options_file_error=1 --file_checksum_impl=none --flush_one_in=1000000 --get_current_wal_file_one_in=0 --get_live_files_one_in=1000000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=4 --ingest_external_file_one_in=0 --iterpercent=10 --key_len_percent_dist=1,30,69 --level_compaction_dynamic_level_bytes=True --mark_for_compaction_one_file_in=10 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=10000 --max_key_len=3 --max_manifest_file_size=16384 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=0 --memtable_prefix_bloom_size_ratio=0.001 --memtable_protection_bytes_per_key=1 --memtable_whole_key_filtering=1 --mmap_read=1 --nooverwritepercent=1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_pinning=2 --pause_background_one_in=1000000 --periodic_compaction_seconds=0 --prefix_size=5 --prefixpercent=5 --prepopulate_block_cache=1 --progress_reports=0 --read_fault_one_in=1000 --readpercent=45 --recycle_log_file_num=0 --reopen=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=32 --secondary_cache_uri=compressed_secondary_cache://capacity=8388608 --set_options_one_in=10000 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=0 --sst_file_manager_bytes_per_truncate=0 --subcompactions=3 --sync_fault_injection=1 --target_file_size_base=2097 --target_file_size_multiplier=2 --test_batches_snapshots=1 --top_level_index_pinning=1 --use_full_merge_v1=1 --use_merge=1 --value_size_mult=32 --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=524288 --write_buffer_size=4194 --writepercent=35
```
```
stderr:
WARNING: prefix_size is non-zero but memtablerep != prefix_hash
db_stress: utilities/fault_injection_fs.cc:748: virtual rocksdb::IOStatus rocksdb::FaultInjectionTestFS::RenameFile(const std::string &, const std::string &, const rocksdb::IOOptions &, rocksdb::IODebugContext *): Assertion `tlist.find(tdn.second) == tlist.end()' failed.`
```
**Summary:**
The PR ensured the non-test path pass down a non-null dir containing CURRENT (which is by current RocksDB assumption just db_dir) by doing the following:
- Renamed `directory_to_fsync` as `dir_contains_current_file` in `SetCurrentFile()` to tighten the association between this directory and CURRENT file
- Changed `SetCurrentFile()` API to require `dir_contains_current_file` being passed-in, instead of making it by default nullptr.
    -  Because `SetCurrentFile()`'s `dir_contains_current_file` is passed down from `VersionSet::LogAndApply()` then `VersionSet::ProcessManifestWrites()` (i.e, think about this as a chain of 3 functions related to MANIFEST update), these 2 functions also got refactored to require `dir_contains_current_file`
- Updated the non-test-path callers of these 3 functions to obtain and pass in non-nullptr `dir_contains_current_file`, which by current assumption of RocksDB, is the `FSDirectory* db_dir`.
    - `db_impl` path will obtain `DBImpl::directories_.getDbDir()` while others with no access to such `directories_` are obtained on the fly by creating such object `FileSystem::NewDirectory(..)` and manage it by unique pointers to ensure short life time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10573
Test Plan:
- `make check`
- Passed the repro db_stress command
- For future improvement, since we currently don't assert dir containing CURRENT to be non-nullptr due to https://github.com/facebook/rocksdb/pull/10573#pullrequestreview-1087698899, there is still chances that future developers mistakenly pass down nullptr dir containing CURRENT thus resulting skipped sync dir and cause the bug again. Therefore a smarter test (e.g, such as quoted from ajkr  "(make) unsynced data loss to be dropping files corresponding to unsynced directory entries") is still needed.
Reviewed By: ajkr
Differential Revision: D39005886
Pulled By: hx235
fbshipit-source-id: 336fb9090d0cfa6ca3dd580db86268007dde7f5a | 3 years ago | 
|  Chen Lixiang | 9593fd1c82 | Fix wrong compression type and options in universal compaction picker (#10515) Summary: In UniversalCompactionBuilder::PickCompactionToReduceSortedRuns, we passed start_level to get compression type and options. I think that is wrong and we should use output_level instead. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10515 Reviewed By: hx235 Differential Revision: D38611335 Pulled By: ajkr fbshipit-source-id: bb860caed4b6c6bbde8f75fc50cf875a9f04723d | 3 years ago | 
|  Changyu Bi | fd165c869d | Add memtable per key-value checksum (#10281) Summary: Append per key-value checksum to internal key. These checksums are verified on read paths including Get, Iterator and during Flush. Get and Iterator will return `Corruption` status if there is a checksum verification failure. Flush will make DB become read-only upon memtable entry checksum verification failure. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10281 Test Plan: - Added new unit test cases: `make check` - Benchmark on memtable insert ``` TEST_TMPDIR=/dev/shm/memtable_write ./db_bench -benchmarks=fillseq -disable_wal=true -max_write_buffer_number=100 -num=10000000 -min_write_buffer_number_to_merge=100 # avg over 10 runs Baseline: 1166936 ops/sec memtable 2 bytes kv checksum : 1.11674e+06 ops/sec (-4%) memtable 2 bytes kv checksum + write batch 8 bytes kv checksum: 1.08579e+06 ops/sec (-6.95%) write batch 8 bytes kv checksum: 1.17979e+06 ops/sec (+1.1%) ``` - Benchmark on only memtable read: ops/sec dropped 31% for `readseq` due to time spend on verifying checksum. ops/sec for `readrandom` dropped ~6.8%. ``` # Readseq sudo TEST_TMPDIR=/dev/shm/memtable_read ./db_bench -benchmarks=fillseq,readseq"[-X20]" -disable_wal=true -max_write_buffer_number=100 -num=10000000 -min_write_buffer_number_to_merge=100 readseq [AVG 20 runs] : 7432840 (± 212005) ops/sec; 822.3 (± 23.5) MB/sec readseq [MEDIAN 20 runs] : 7573878 ops/sec; 837.9 MB/sec With -memtable_protection_bytes_per_key=2: readseq [AVG 20 runs] : 5134607 (± 119596) ops/sec; 568.0 (± 13.2) MB/sec readseq [MEDIAN 20 runs] : 5232946 ops/sec; 578.9 MB/sec # Readrandom sudo TEST_TMPDIR=/dev/shm/memtable_read ./db_bench -benchmarks=fillrandom,readrandom"[-X10]" -disable_wal=true -max_write_buffer_number=100 -num=1000000 -min_write_buffer_number_to_merge=100 readrandom [AVG 10 runs] : 140236 (± 3938) ops/sec; 9.8 (± 0.3) MB/sec readrandom [MEDIAN 10 runs] : 140545 ops/sec; 9.8 MB/sec With -memtable_protection_bytes_per_key=2: readrandom [AVG 10 runs] : 130632 (± 2738) ops/sec; 9.1 (± 0.2) MB/sec readrandom [MEDIAN 10 runs] : 130341 ops/sec; 9.1 MB/sec ``` - Stress test: `python3 -u tools/db_crashtest.py whitebox --duration=1800` Reviewed By: ajkr Differential Revision: D37607896 Pulled By: cbi42 fbshipit-source-id: fdaefb475629d2471780d4a5f5bf81b44ee56113 | 3 years ago | 
|  Peter Dillinger | 9fa5c146d7 | LOG more info on oldest snapshot and sequence numbers (#10454) Summary: The info LOG file does not currently give any direct information about the existence of old, live snapshots, nor how to estimate wall time from a sequence number within the scope of LOG history. This change addresses both with: * Logging smallest and largest seqnos for generated SST files, which can help associate sequence numbers with write time (based on flushes). * Logging oldest_snapshot_seqno for each compaction, which (along with that seqno info) helps us to determine how much old data might be kept around for old (leaked?) snapshots. Including the date here I thought might be excessive. I wanted to log the date and seqno of the oldest snapshot with periodic stats, but the current structure of the code doesn't really support that because `DumpDBStats` doesn't have access to the DB object. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10454 Test Plan: manual inspect LOG from `KEEP_DB=1 ./db_basic_test --gtest_filter=*CompactBetweenSnapshots*` Reviewed By: ajkr Differential Revision: D38326948 Pulled By: pdillinger fbshipit-source-id: 294918ffc04a419844146cd826045321b4d5c038 | 3 years ago | 
|  sherriiiliu | 4753e5a2e9 | Fix wrong value passed in compaction filter in BlobDB (#10391) Summary: New blobdb has a bug in compaction filter, where `blob_value_` is not reset for next iterated key. This will cause blob_value_ not empty and previous value read from blob is passed into the filter function for next key, even if its value is not in blob. Fixed by reseting regardless of key type. Test Case: Add `FilterByValueLength` test case in `DBBlobCompactionTest` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10391 Reviewed By: riversand963 Differential Revision: D38629900 Pulled By: ltamasi fbshipit-source-id: 47d23ff2e5ec697958a210db9e6ceeb8b2fc49fa | 3 years ago | 
|  Yanqin Jin | fee2c472d0 | Include minimal contextual information in `CompactionIterator` (#10505) Summary: The main purpose is to make debugging easier without sacrificing performance. Instead of using a boolean variable for `CompactionIterator::valid_`, we can extend it to an `uint8_t`, using the LSB to denote if the compaction iterator is valid and 4 additional bits to denote where the iterator is set valid inside `NextFromInput()`. Therefore, when the control flow reaches `PrepareOutput()` and hits assertion there, we can have a better idea of what has gone wrong. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10505 Test Plan: make check ``` TEST_TMPDIR=/dev/shm/rocksdb time ./db_bench -compression_type=none -write_buffer_size=1073741824 -benchmarks=fillseq,flush ``` The above command has a 'flush' benchmark which uses `CompactionIterator`. I haven't observed any CPU regression or drop in throughput or latency increase. Reviewed By: ltamasi Differential Revision: D38551615 Pulled By: riversand963 fbshipit-source-id: 1250848fc118bb753d71fa9ff8ba840df999f5e0 | 3 years ago | 
|  Jay Zhuang | 3f763763aa | Change `bottommost_temperture` to `last_level_temperture` (#10471) Summary: Change tiered compaction feature from `bottommost_temperture` to `last_level_temperture`. The old option is kept for migration purpose only, which is behaving the same as `last_level_temperture` and it will be removed in the next release. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10471 Test Plan: CI Reviewed By: siying Differential Revision: D38450621 Pulled By: jay-zhuang fbshipit-source-id: cc1cdf8bad409376fec0152abc0a64fb72a91527 | 3 years ago | 
|  Jay Zhuang | 375534752a | Improve universal compaction picker for tiered compaction (#10467) Summary: Current universal compaction picker may cause extra size amplification compaction if there're more hot data on penultimate level. Improve the picker to skip the last level for size amp calculation if tiered compaction is enabled, which can 1. avoid extra unnecessary size amp compaction; 2. typically cold tier (the last level) is not size constrained, so skip size amp for cold tier is intended; Pull Request resolved: https://github.com/facebook/rocksdb/pull/10467 Test Plan: CI and added unittest Reviewed By: siying Differential Revision: D38391350 Pulled By: jay-zhuang fbshipit-source-id: 103c0731c05e0a7e8f267e9e829d022328be25d2 | 3 years ago | 
|  Levi Tamasi | cc8ded6152 | Do not put blobs read during compaction into cache (#10457) Summary: During compaction, blobs are currently read using the default `ReadOptions`, which has the `fill_cache` flag set to true. Earlier, this didn't make any difference since we didn't have a blob cache; however, now we have to explicitly set this flag to false to avoid polluting the cache during compaction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10457 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D38333528 Pulled By: ltamasi fbshipit-source-id: 5b4d49a1e39543bee73c7df2aa9194fb101875e2 | 3 years ago | 
|  Yanqin Jin | fbfcf5cbcd | Remove unused fields from FileMetaData (temporarily) (#10443) Summary: FileMetaData::[min|max]_timestamp are not currently being used or tracked by RocksDB, even when user-defined timestamp is enabled. Each of them is a std::string which can occupy 32 bytes. Remove them for now. They may be added back when we have a pressing need for them. When we do add them back, consider store them in a more compact way, e.g. one boolean flag and a byte array of size 16. Per file min/max timestamp bounds are available as table properties. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10443 Test Plan: make check Reviewed By: pdillinger Differential Revision: D38292275 Pulled By: riversand963 fbshipit-source-id: 841dc4e855ad8f8481c80cb020603de9607c9c94 | 3 years ago | 
|  Akanksha Mahajan | 56463d443d | Provide support for subcompactions with user-defined timestamps (#10344) Summary: The subcompaction logic currently picks file boundaries as subcompaction boundaries. This is not compatible with user-defined timestamps because of two issues. Issue1: ReadOptions.iterate_lower_bound and ReadOptions.iterate_upper_bound contains timestamps which results in assertion failure as BlockBasedTableIterator expects bounds to be without timestamps. As result, because of wrong comparison end key is returned as user_key resulting in assertion failure. Issue2: Since it might result in two keys that only differ by user timestamp getting processed by two different subcompactions (and thus two different CompactionIterator state machines), which in turn can cause data correction issues. This PR provide support to reenable subcompactions with user-defined timestamps. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10344 Test Plan: Added new unit test - Without fix for Issue1 unit test MultipleSubCompactions fails with error: ``` db_with_timestamp_compaction_test: ./db/compaction/clipping_iterator.h:247: void rocksdb::ClippingIterat│ or::AssertBounds(): Assertion `!valid_ || !end_ || cmp_->Compare(key(), *end_) < 0' failed. Received signal 6 (Aborted) │ #0 /usr/local/fbcode/platform009/lib/libc.so.6(gsignal+0x100) [0x7f8fbbbfe530] db_with_timestamp_compaction_test: ./db/compaction/clipping_iterator.h:247: void rocksdb::ClippingIterator::AssertBounds(): Assertion `!valid_ || !end_ || cmp_->Compare(key(), *end_) < 0' failed. Aborted (core dumped) ``` Ran stress test `make crash_test_with_ts -j32` Reviewed By: riversand963 Differential Revision: D38220841 Pulled By: akankshamahajan15 fbshipit-source-id: 5d5cae2bd37fcaeba1e77fce0a69070ad4158ccb | 3 years ago | 
|  Zichen Zhu | c945a9a664 | Allow sufficient subcompactions under round-robin compaction priority (#10422) Summary: Allow sufficient subcompactions can be used when the number of input files is less than `max_subcompactions` under round-robin compaction priority. Test Case: Add `RoundRobinWithoutAdditionalResources` into `db_compaction_test` Pull Request resolved: https://github.com/facebook/rocksdb/pull/10422 Reviewed By: ajkr Differential Revision: D38186545 Pulled By: littlepig2013 fbshipit-source-id: b8e5098306f1e5b9561dfafafc8300a38f7fe88e | 3 years ago | 
|  Zichen Zhu | 8860fc902a | Support subcmpct using reserved resources for round-robin priority (#10341) Summary: Earlier implementation of round-robin priority can only pick one file at a time and disallows parallel compactions within the same level. In this PR, round-robin compaction policy will expand towards more input files with respecting some additional constraints, which are summarized as follows: * Constraint 1: We can only pick consecutive files - Constraint 1a: When a file is being compacted (or some input files are being compacted after expanding), we cannot choose it and have to stop choosing more files - Constraint 1b: When we reach the last file (with the largest keys), we cannot choose more files (the next file will be the first one with small keys) * Constraint 2: We should ensure the total compaction bytes (including the overlapped files from the next level) is no more than `mutable_cf_options_.max_compaction_bytes` * Constraint 3: We try our best to pick as many files as possible so that the post-compaction level size can be just less than `MaxBytesForLevel(start_level_)` * Constraint 4: If trivial move is allowed, we reuse the logic of `TryNonL0TrivialMove()` instead of expanding files with Constraint 3 More details can be found in `LevelCompactionBuilder::SetupOtherFilesWithRoundRobinExpansion()`. The above optimization accelerates the process of moving the compaction cursor, in which the write-amp can be further reduced. While a large compaction may lead to high write stall, we break this large compaction into several subcompactions **regardless of** the `max_subcompactions` limit. The number of subcompactions for round-robin compaction priority is determined through the following steps: * Step 1: Initialized against `max_output_file_limit`, the number of input files in the start level, and also the range size limit `ranges.size()` * Step 2: Call `AcquireSubcompactionResources()`when max subcompactions is not sufficient, but we may or may not obtain desired resources, additional number of resources is stored in `extra_num_subcompaction_threads_reserved_`). Subcompaction limit is changed and update `num_planned_subcompactions` with `GetSubcompactionLimit()` * Step 3: Call `ShrinkSubcompactionResources()` to ensure extra resources can be released (extra resources may exist for round-robin compaction when the number of actual number of subcompactions is less than the number of planned subcompactions) More details can be found in `CompactionJob::AcquireSubcompactionResources()`,`CompactionJob::ShrinkSubcompactionResources()`, and `CompactionJob::ReleaseSubcompactionResources()`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10341 Test Plan: Add `CompactionPriMultipleFilesRoundRobin[1-3]` unit test in `compaction_picker_test.cc` and `RoundRobinSubcompactionsAgainstResources.SubcompactionsUsingResources/[0-4]`, `RoundRobinSubcompactionsAgainstPressureToken.PressureTokenTest/[0-1]` in `db_compaction_test.cc` Reviewed By: ajkr, hx235 Differential Revision: D37792644 Pulled By: littlepig2013 fbshipit-source-id: 7fecb7c4ffd97b34bbf6e3b760b2c35a772a0657 | 3 years ago |