rocksdb

Commit Graph

Author	SHA1	Message	Date
Hui Xiao	629605d645	Move prefetching responsibility to page cache for compaction read under non directIO usecase (#11631 ) Summary: Context/Summary As titled. The benefit of doing so is to explicitly call readahead() instead of relying on page cache behavior for compaction read when we know that we most likely need readahead, as compaction read is sequential read. Test Extended the existing UT to cover compaction read case Pull Request resolved: https://github.com/facebook/rocksdb/pull/11631 Reviewed By: ajkr Differential Revision: D47681437 Pulled By: hx235 fbshipit-source-id: 78792f64985c4dc44aa8f2a9c41ab3e8bbc0bc90	2 years ago
Andrew Kryczka	05c3b8ecac	Prepare for specialized interface for row cache (#11620 ) Summary: An internal user wants to implement a key-aware row cache policy. For that, they need to know the components of the cache key, especially the user key component. With a specialized `RowCache` interface, we will be able to tell them the components so they won't have to make assumptions about our internal key schema. This PR prepares for the specialized `RowCache` interface by updating the migration plan of https://github.com/facebook/rocksdb/issues/11450. I added a release note for the removed APIs and didn't mention the added ones for now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11620 Reviewed By: pdillinger Differential Revision: D47536962 Pulled By: ajkr fbshipit-source-id: bbee0fc4ad67fc699a66b8f2b4ea4544dd003691	2 years ago
akankshamahajan	749b179c04	Remove reallocation of AlignedBuffer in direct_io sync reads if already aligned (#11600 ) Summary: Remove reallocation of AlignedBuffer in direct_io sync reads in RandomAccessFileReader::Read if buffer passed is already aligned. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11600 Test Plan: Setup: `TEST_TMPDIR=./tmp-db/ ./db_bench -benchmarks=filluniquerandom -disable_auto_compactions=true -target_file_size_base=1048576 -write_buffer_size=1048576 -compression_type=none` Benchmark: `TEST_TMPDIR=./tmp-db/ perf record ./db_bench --cache_size=8388608 --use_existing_db=true --disable_auto_compactions=true --benchmarks=seekrandom --use_direct_reads=true -use_direct_io_for_flush_and_compaction=true -reads=1000 -seek_nexts=1 -max_auto_readahead_size=131072 -initial_auto_readahead_size=16384 -adaptive_readahead=true -num_file_reads_for_auto_readahead=0` Perf profile- Before: ``` 8.73% db_bench libc.so.6 [.] __memmove_evex_unaligned_erms 3.34% db_bench [kernel.vmlinux] [k] filemap_get_read_batch ``` After: ``` 2.50% db_bench [kernel.vmlinux] [k] filemap_get_read_batch 2.29% db_bench libc.so.6 [.] __memmove_evex_unaligned_erms ``` `make crash_test -j` with direct_io enabled completed successfully locally. Ran few benchmarks with direct_io from seek_nexts varying between 912 to 327680 and different readahead_size parameters and it showed no regression so far. Reviewed By: ajkr Differential Revision: D47478598 Pulled By: akankshamahajan15 fbshipit-source-id: 6a48e21cb34696f5d09c22a6311a3a1cb5f9cf33	2 years ago
Peter Dillinger	b1b6f87fbe	Some small improvements to HyperClockCache (#11601 ) Summary: Stacked on https://github.com/facebook/rocksdb/issues/11572 * Minimize use of std::function and lambdas to minimize chances of compiler heap-allocating closures (unnecessary stress on allocator). It appears that converting FindSlot to a template enables inlining the lambda parameters, avoiding heap allocations. * Clean up some logic with FindSlot (FIXMEs from https://github.com/facebook/rocksdb/issues/11572) * Fix handling of rare case of probing all slots, with new unit test. (Previously Insert would not roll back displacements in that case, which would kill performance if it were to happen.) * Add an -early_exit option to cache_bench for gathering memory stats before deallocation. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11601 Test Plan: unit test added for probing all slots ## Seeing heap allocations Run `MALLOC_CONF="stats_print:true" ./cache_bench -cache_type=hyper_clock_cache` before https://github.com/facebook/rocksdb/issues/11572 vs. after this change. Before, we see this in the interesting bin statistics: ``` size nrequests ---- --------- 32 578460 64 24340 8192 578460 ``` And after: ``` size nrequests ---- --------- 32 (insignificant) 64 24370 8192 579130 ``` ## Performance test Build with `make USE_CLANG=1 PORTABLE=0 DEBUG_LEVEL=0 -j32 cache_bench` Run `./cache_bench -cache_type=hyper_clock_cache -ops_per_thread=5000000` in before and after configurations, simultaneously: ``` Before: Complete in 33.244 s; Rough parallel ops/sec = 2406442 After: Complete in 32.773 s; Rough parallel ops/sec = 2441019 ``` Reviewed By: jowlyzhang Differential Revision: D47375092 Pulled By: pdillinger fbshipit-source-id: 46f0f57257ddb374290a0a38c651764ea60ba410	2 years ago
Changyu Bi	df082c8d1d	Deprecate option `periodic_compaction_seconds` for FIFO compaction (#11550 ) Summary: both options `ttl` and `periodic_compaction_seconds` have the same meaning for FIFO compaction, which is redundant and can be confusing to use. For example, setting TTL to 0 does not disable TTL: user needs to also set periodic_compaction_seconds to 0. Another example is that dynamically setting `periodic_compaction_seconds` (surprisingly) has no effect on TTL compaction. This is because FIFO compaction picker internally only looks at value of `ttl`. The value of `ttl` is set in `SanitizeOptions()` which takes into account the value of `periodic_compaction_seconds`, but dynamically setting an option does not invoke this method. This PR clarifies the usage of both options for FIFO compaction: only `ttl` should be used, `periodic_compaction_seconds` will not have any effect on FIFO compaction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11550 Test Plan: - updated existing unit test `DBOptionsTest.SanitizeFIFOPeriodicCompaction` - checked existing values of both options in feature matrix: https://fburl.com/daiquery/xxd0gs9w. All current use cases either have `periodic_compaction_seconds = 0` or have `periodic_compaction_seconds > ttl`, so should not cause change of behavior. Reviewed By: ajkr Differential Revision: D46902959 Pulled By: cbi42 fbshipit-source-id: a9ede235b276783b4906aaec443551fa62ceff4c	2 years ago
akankshamahajan	94c247bff8	Update HISTORY.md for branch cut for 8.4.fb (#11565 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/11565 Reviewed By: jowlyzhang, cbi42 Differential Revision: D47027788 Pulled By: akankshamahajan15 fbshipit-source-id: e5e8db2eb21f8aa68fe072f0e1b63b83ba7beb9f	2 years ago
akankshamahajan	ff1cc8a63e	Fix extra prefetching when num_file_reads_for_auto_readahead is 1 in async_io (#11560 ) Summary: When num_file_reads_for_auto_readahead = 1, during seek, it would go for prefetching extra data in the second buffer along with seek data, which would lead to an increase in read data and discarded bytes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11560 Test Plan: Added unit test Reviewed By: anand1976 Differential Revision: D47008102 Pulled By: akankshamahajan15 fbshipit-source-id: 566c6131cb5f968d5efb81fd0ab233ff7e534ab0	2 years ago
akankshamahajan	fbd2f563bb	Add an interface to provide support for underlying FS to pass their own buffer during reads (#11324 ) Summary: 1. Public API change: Replace `use_async_io` API in file_system with `SupportedOps` API which is used by underlying FileSystem to indicate to upper layers whether the FileSystem supports different operations introduced in `enum FSSupportedOps `. Right now operations are `async_io` and whether FS will provide its own buffer during reads or not. The api is changed to extend it to various FileSystem operations in one API rather than creating a separate API for each operation. 2. Provide support for underlying FS to pass their own buffer during Reads (async and sync read) instead of using RocksDB provided `scratch` (buffer) in `FSReadRequest`. Currently only MultiRead supports it and later will be extended to other reads as well (point lookup, scan etc). More details in how to enable in file_system.h Pull Request resolved: https://github.com/facebook/rocksdb/pull/11324 Test Plan: Tested locally Reviewed By: anand1976 Differential Revision: D44465322 Pulled By: akankshamahajan15 fbshipit-source-id: 9ec9e08f839b5cc815e75d5dade6cd549998d0ec	2 years ago
Yu Zhang	7521478b43	Record the `persist_user_defined_timestamps` flag in manifest (#11515 ) Summary: Start to record the value of the flag `AdvancedColumnFamilyOptions.persist_user_defined_timestamps` in the Manifest and table properties for a SST file when it is created. And use the recorded flag when creating a table reader for the SST file. This flag's default value is true, it is only explicitly recorded if it's false. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11515 Test Plan: ``` make all check ./version_edit_test ``` Reviewed By: ltamasi Differential Revision: D46920386 Pulled By: jowlyzhang fbshipit-source-id: 075c20363d3d2cc1368422ecc805617ed135cc26	2 years ago
Alexandre Lavigne	2926e0718c	Add missing parameter in C API (#11542 ) Summary: The class `NewCompactOnDeletionCollectorFactory` exposes the parameter `delete_ratio`. The C API `rocksdb_options_add_compact_on_deletion_collector_factory` does not allow a user to pass a delete ratio down to the C++ class below. The class has a default value for the delete ratio, which makes it pass the compilation and the tests. closes https://github.com/facebook/rocksdb/issues/11541 Pull Request resolved: https://github.com/facebook/rocksdb/pull/11542 Reviewed By: ajkr Differential Revision: D46770908 Pulled By: cbi42 fbshipit-source-id: 7b5162fe459896052e392e2d85a8f6c01db3b464	2 years ago
Yu Zhang	b421a8c21b	Add a ticker to track number of trash files deleted in background thread (#11540 ) Summary: This ticker combined with `rocksdb.files.marked.trash` can help give a better picture of how DeleteScheduler is keeping up. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11540 Test Plan: ``` ./delete_scheduler_test ``` Reviewed By: ajkr Differential Revision: D46746401 Pulled By: jowlyzhang fbshipit-source-id: f3daa622aa3ddefe7d673e0cc257a47699d506df	2 years ago
Changyu Bi	bc04ec85db	Make option `level_compaction_dynamic_level_bytes` true by default (#11525 ) Summary: after https://github.com/facebook/rocksdb/issues/11321 and https://github.com/facebook/rocksdb/issues/11340 (both included in RocksDB v8.2), migration from `level_compaction_dynamic_level_bytes=false` to `level_compaction_dynamic_level_bytes=true` is automatic by RocksDB and requires no manual compaction from user. Making the option true by default as it has several advantages: 1. better space amplification guarantee (a more stable LSM shape). 2. compaction is more adaptive to write traffic. 3. automatic draining of unneeded levels. Wiki is updated with more detail: https://github.com/facebook/rocksdb/wiki/Leveled-Compaction#option-level_compaction_dynamic_level_bytes-and-levels-target-size. The PR mostly contains fixes for unit tests as they assumed `level_compaction_dynamic_level_bytes=false`. Most notable change is commit `f742be330c` and `b1928e42b3` which override the default option in DBTestBase to still set `level_compaction_dynamic_level_bytes=false` by default. This helps to reduce the change needed for unit tests. I think this default option override in unit tests is okay since the behavior of `level_compaction_dynamic_level_bytes=true` is tested by explicitly setting this option. Also, `level_compaction_dynamic_level_bytes=false` may be more desired in unit tests as it makes it easier to create a desired LSM shape. Comment for option `level_compaction_dynamic_level_bytes` is updated to reflect this change and change made in https://github.com/facebook/rocksdb/issues/10057. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11525 Test Plan: `make -j32 J=32 check` several times to try to catch flaky tests due to this option change. Reviewed By: ajkr Differential Revision: D46654256 Pulled By: cbi42 fbshipit-source-id: 6b5827dae124f6f1fdc8cca2ac6f6fcd878830e1	2 years ago
mayue.fight	fa878a0107	Support to create a CF by importing multiple non-overlapping CFs (#11378 ) Summary: The original Feature Request is from [https://github.com/facebook/rocksdb/issues/11317](https://github.com/facebook/rocksdb/issues/11317). Flink uses rocksdb as the state backend, all DB options are the same, and the keys of each DB instance are adjacent and there is no key overlap between two db instances. In the Flink rescaling scenario, it is necessary to quickly split the DB according to a certain key range or quickly merge multiple DBs into one. This PR is mainly used to quickly merge multiple DBs into one. We hope to extend the function of `CreateColumnFamilyWithImports` to support creating ColumnFamily by importing multiple ColumnFamily with no overlapping keys. The import logic is almost the same as `CreateColumnFamilyWithImport`, but it will check whether there is key overlap between CF when importing. The import will fail if there are key overlaps. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11378 Reviewed By: ajkr Differential Revision: D46413709 Pulled By: cbi42 fbshipit-source-id: 846d0049fad11c59cf460fa846c345b26c658dfb	2 years ago
Peter Dillinger	70bf5ef093	Avoid destroying default PosixEnv, safely (#11538 ) Summary: Use another static object to join threads instead. This change is motivated by a case in which some code using NewLRUCache() -> ShardedCacheBase -> SemiStructuredUniqueIdGen -> GenerateRawUniqueId() -> Env::Default() was happening during static destruction. I didn't see anything else in PosixEnv or base classes that would cause a problem by not destroying. (WinEnv is already not destroyed; see env_default.cc) Pull Request resolved: https://github.com/facebook/rocksdb/pull/11538 Test Plan: test added, which would previously fail with UBSAN: ``` $ ./env_test --gtest_filter=Destruct Note: Google Test filter = Destruct [==========] Running 1 test from 1 test case. [----------] Global test environment set-up. [----------] 1 test from EnvTestMisc [ RUN ] EnvTestMisc.StaticDestruction [ OK ] EnvTestMisc.StaticDestruction (0 ms) [----------] 1 test from EnvTestMisc (0 ms total) [----------] Global test environment tear-down [==========] 1 test from 1 test case ran. (0 ms total) [ PASSED ] 1 test. env/env_test.cc:3561:23: runtime error: member call on address 0x7f7b96671ca8 which does not point to an object of type 'rocksdb::Env' 0x7f7b96671ca8: note: object is of type 'N7rocksdb12ConfigurableE' 00 00 00 00 90 a7 f7 95 7b 7f 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ^~~~~~~~~~~~~~~~~~~~~~~ vptr for 'N7rocksdb12ConfigurableE' UndefinedBehaviorSanitizer: undefined-behavior env/env_test.cc:3561:23 in $ ``` Reviewed By: jowlyzhang Differential Revision: D46737389 Pulled By: pdillinger fbshipit-source-id: 0f80a443bf799ffc5641e898cf3a75f7d10a987b	2 years ago
Changyu Bi	15e8a843d9	Do not include last level in compaction when `allow_ingest_behind=true` (#11489 ) Summary: when a DB is configured with `allow_ingest_behind = true`, the last level should be reserved for ingested files and these files should not be included in any compaction. Currently, a major compaction can compact these files to smaller levels. This can cause future files to be rejected for ingest behind (see `ExternalSstFileIngestionJob::CheckLevelForIngestedBehindFile()`). This PR fixes the issue such that files in the last level are not included in any compaction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11489 Test Plan: * Updated unit test `ExternalSSTFileTest.IngestBehind` to test that last level is not included in manual and auto-compaction. Reviewed By: ajkr Differential Revision: D46455711 Pulled By: cbi42 fbshipit-source-id: 5e2142c2a709ef932ad797897795021c06c4ac8c	2 years ago
Andrew Kryczka	cac3240cbf	add property "rocksdb.obsolete-sst-files-size" (#11533 ) Summary: See "unreleased_history/new_features/obsolete_sst_files_size.md" for description Pull Request resolved: https://github.com/facebook/rocksdb/pull/11533 Test Plan: updated unit test Reviewed By: jowlyzhang Differential Revision: D46703152 Pulled By: ajkr fbshipit-source-id: ea5e31cd6293eccc154130c13e66b5271f57c102	2 years ago
Ignat Loskutov	7c67aee4a0	statistics.cc: fix mistype (#11509 ) Summary: Add new tickers: `rocksdb.error.handler.bg.error.count`, `rocksdb.error.handler.bg.io.error.count`, `rocksdb.error.handler.bg.retryable.io.error.count` to replace the misspelled ones: `rocksdb.error.handler.bg.errro.count`, `rocksdb.error.handler.bg.io.errro.count`, `rocksdb.error.handler.bg.retryable.io.errro.count` ('error' instead of 'errro'). Users should switch to use the new tickers before 9.0 release as the misspelled old tickers will be completely removed then. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11509 Reviewed By: ltamasi Differential Revision: D46542809 Pulled By: jowlyzhang fbshipit-source-id: a2a6d8354af46a060de81d40ef6f5336a80bd32e	2 years ago
Yu Zhang	77dda0d9d8	Fix use after move in data block hash index (#11505 ) Summary: Fix a use-after-move issue in block.cc and added some unit tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11505 Test Plan: ``` make all check ./block_test ``` Reviewed By: ltamasi Differential Revision: D46506188 Pulled By: jowlyzhang fbshipit-source-id: 316ed8ddd221c00b2bce2cf9fd47eea686cd74a5	2 years ago
Changyu Bi	2e8cc98ab2	Fix subcompaction bug to allow running two subcompactions (#11501 ) Summary: as reported in https://github.com/facebook/rocksdb/issues/11476, RocksDB currently does not execute compactions in two subcompactions even when they qualify. This PR fixes this issue. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11501 Test Plan: * Add a new unit test. * Run crash test with max_subcompactions=2: `python3 tools/db_crashtest.py blackbox --simple --subcompactions=2 --target_file_size_base=1048576 --compaction_style=0` * saw logs showing compactions being executed as 2 subcompactions ``` 2023/06/01-17:28:44.028470 3147486 (Original Log Time 2023/06/01-17:28:44.025972) EVENT_LOG_v1 {"time_micros": 1685665724025939, "job": 6, "event": "compaction_finished", "compaction_time_micros": 34539, "compaction_time_cpu_micros": 26404, "output_level": 6, "num_output_files": 2, "total_output_size": 1109796, "num_input_records": 13188, "num_output_records": 13021, "num_subcompactions": 2, "output_compression": "NoCompression", "num_single_delete_mismatches": 0, "num_single_delete_fallthrough": 0, "lsm_state": [0, 0, 0, 0, 0, 0, 13]} ``` Reviewed By: ajkr Differential Revision: D46411497 Pulled By: cbi42 fbshipit-source-id: 3ebfc02e19f78f782e114a9546dc3d481d496258	2 years ago
Changyu Bi	4aa52d89cf	Drop range tombstone during non-bottommost compaction (#11459 ) Summary: Similar to point tombstones, we can drop a range tombstone during compaction when we know its range does not exist in any higher level. This PR adds this optimization. Some existing test in db_range_del_test is fixed to work under this optimization. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11459 Test Plan: * Add unit test `DBRangeDelTest, NonBottommostCompactionDropRangetombstone`. * Ran crash test that issues range deletion for a few hours: `python3 tools/db_crashtest.py blackbox --simple --write_buffer_size=1048576 --delrangepercent=10 --writepercent=31 --readpercent=40` Reviewed By: ajkr Differential Revision: D46007904 Pulled By: cbi42 fbshipit-source-id: 3f37205b6778b7d55ed106369ca41b0632a6d0fd	2 years ago
Peter Dillinger	7a9b264f36	Some fixes to unreleased_history/ (#11504 ) Summary: * Add a "Performance Improvements" section * Add required copyright headers Pull Request resolved: https://github.com/facebook/rocksdb/pull/11504 Test Plan: manual Reviewed By: hx235 Differential Revision: D46405128 Pulled By: pdillinger fbshipit-source-id: 4f878dfd0170d381d3051a44c13479c860e812c0	2 years ago
Changyu Bi	e95cc1217d	`CompactRange()` always compacts to bottommost level for leveled compaction (#11468 ) Summary: currently for leveled compaction, the max output level of a call to `CompactRange()` is pre-computed before compacting each level. This max output level is the max level whose key range overlaps with the manual compaction key range. However, during manual compaction, files in the max output level may be compacted down further by some background compaction. When this background compaction is a trivial move, there is a race condition and the manual compaction may not be able to compact all keys in the specified key range. This PR updates `CompactRange()` to always compact to the bottommost level to make this race condition more unlikely (it can still happen, see more in comment here: `796f58f42a/db/db_impl/db_impl_compaction_flush.cc (L1180C29-L1184)`). This PR also changes the behavior of CompactRange() when `bottommost_level_compaction=kIfHaveCompactionFilter` (the default option). The old behavior is that, if a compaction filter is provided, CompactRange() always does an intra-level compaction at the final output level for all files in the manual compaction key range. The only exception is when `first_overlapped_level = 0` and `max_overlapped_level = 0`. It’s awkward to maintain the same behavior after this PR since we do not compute max_overlapped_level anymore. So the new behavior is similar to kForceOptimized: always does intra-level compaction at the bottommost level, but not including new files generated during this manual compaction. Several unit tests are updated to work with this new manual compaction behavior. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11468 Test Plan: Add new unit tests `DBCompactionTest.ManualCompactionCompactAllKeysInRange*` Reviewed By: ajkr Differential Revision: D46079619 Pulled By: cbi42 fbshipit-source-id: 19d844ba4ec8dc1a0b8af5d2f36ff15820c6e76f	2 years ago
Changyu Bi	9f1ce6d804	Make `unreleased_history/release.sh` work on macOS (#11494 ) Summary: I got the following errors when running `unreleased_history/release.sh` on my mac. This is because macOS does not have the GNU versions of awk and find by default. This PR updates the script to work on macOS. ``` awk: calling undefined function strftime input record number 43, file source line number 4 find: -regextype: unknown primary or operator ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/11494 Test Plan: manually run `DRY_RUN=1 unreleased_history/release.sh | less` on macOS and CentOS8 machines. Reviewed By: ajkr Differential Revision: D46328442 Pulled By: cbi42 fbshipit-source-id: a7570cd3480fcd25ac1438beb0d59fe655f9a71a	2 years ago
Yu Zhang	56ca9e3106	Logging timestamp size record in WAL and use it during recovery (#11471 ) Summary: Start logging the timestamp size record in WAL and use the record during recovery. Currently, user comparator cannot be different from what was used to create a column family, so the timestamp size record is just used to confirm it's consistent with the timestamp size the running user comparator indicates. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11471 Test Plan: ``` make all check ./db_secondary_test ./db_wal_test --gtest_filter="WithTimestamp" ./repair_test --gtest_filter="WithTimestamp" ``` Reviewed By: ltamasi Differential Revision: D46236769 Pulled By: jowlyzhang fbshipit-source-id: f6c60b5c8defdb05021c63df302ccc0be1275ad0	2 years ago
Peter Dillinger	8848ec92dd	Better management of unreleased HISTORY (#11481 ) Summary: See new NOTE in HISTORY.md and unreleased_history/README.txt Pull Request resolved: https://github.com/facebook/rocksdb/pull/11481 Test Plan: some manual testing on my CentOS 8 system Reviewed By: jaykorean Differential Revision: D46233342 Pulled By: pdillinger fbshipit-source-id: daf59cf3dc907f450b469090dcc481a30a7d7c0d	2 years ago

25 Commits (1567108fc10e50c68f6d9df1223c1c6e2d6aab2e)