rocksdb

Commit Graph

Author	SHA1	Message	Date
Jay Zhuang	ac7dcfda10	Add missing ComputeCompactionScore() for a new universal manual compaction (#7281 ) Summary: Seems it's only causing assert failure during compaction pick, but in production code, the problematic compactions are excluded at a later step. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7281 Reviewed By: akankshamahajan15 Differential Revision: D23228000 Pulled By: jay-zhuang fbshipit-source-id: 2e4055aeebe0f5a2b07e299e0a2d51a1ad2e216d	5 years ago
Levi Tamasi	b9bb59d49d	Add initial set of options for integrated blob write path (#7280 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7280 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D23195192 Pulled By: ltamasi fbshipit-source-id: 743b382de391963e62ba86119e9fbd0233ea3b3a	5 years ago
Akanksha Mahajan	cc24ac14eb	Store FSSequentialFilePtr object in SequenceFileReader (#7190 ) Summary: This diff contains following changes: 1. Replace `FSSequentialFile` pointer with `FSSequentialFilePtr` object that wraps `FSSequentialFile` pointer in `SequenceFileReader`. Objective: If tracing is enabled, `FSSequentialFilePtr` returns `FSSequentialFileTracingWrapper` pointer that includes all necessary information in `IORecord` and calls underlying FileSystem and invokes `IOTracer` to dump that record in a binary file. If tracing is disabled then, underlying `FileSystem` pointer is returned directly. `FSSequentialFilePtr` wrapper class is added to bypass the `FSSequentialFileTracingWrapper` when tracing is disabled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7190 Test Plan: make check -j64 COMPILE_WITH_TSAN=1 make check -j64 Reviewed By: anand1976 Differential Revision: D23059616 Pulled By: akankshamahajan15 fbshipit-source-id: 1564b94dd1297cd0fbfe2ed5c9cc3e20f7395301	5 years ago
sdong	b194c21bba	Whole DBTest to skip fsync (#7274 ) Summary: After https://github.com/facebook/rocksdb/pull/7036, we still see extra DBTest that can timeout when running 10 or 20 in parallel. Expand skip-fsync mode in whole DBTest. Still preserve other tests from doing this mode to be conservative. This commit reinstates https://github.com/facebook/rocksdb/issues/7049, whose un-revert was lost in an automatic infrastructure mis-merge. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7274 Test Plan: Run all existing files. Reviewed By: pdillinger Differential Revision: D23177444 fbshipit-source-id: 1f61690b2ac6333c3b2c87176fef6b2cba086b33	5 years ago
Andrew Kryczka	5d5ff82408	Disable `recycle_log_file_num` with `kTolerateCorruptedTailRecords` (#7271 ) Summary: The two features are naturally incompatible. WAL recycling expects the recovery to succeed upon encountering a corrupt record at the point where new data ends and recycled data remains at the tail. However, `WALRecoveryMode::kTolerateCorruptedTailRecords` must fail upon encountering any such corrupt record, as it cannot differentiate between this and a real corruption, which would cause committed updates to be truncated. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7271 Reviewed By: riversand963 Differential Revision: D23169923 Pulled By: ajkr fbshipit-source-id: 2cf8a3bcd2c9a0ecb0055a84725047a10fd4db50	5 years ago
Yanqin Jin	92593d511a	Add a new EntryType for deletion with timestamp (#7195 ) Summary: Add `kEntryDeleteWithTimestamp` to `EntryType` which is a public API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7195 Test Plan: make check Reviewed By: ajkr Differential Revision: D22914704 Pulled By: riversand963 fbshipit-source-id: 886f73c6b70c527cad1c8fc9fc8d3afe60e1ea39	5 years ago
Levi Tamasi	9b083cb11c	Build blob file reader/writer classes in LITE mode as well (#7272 ) Summary: The patch makes sure that the functionality required for the new integrated BlobDB implementation (most importantly, the classes related to reading and writing blob files) is also built in LITE mode by removing the corresponding `#ifndef`s. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7272 Test Plan: Ran `make check` in both regular and LITE mode. Reviewed By: zhichao-cao Differential Revision: D23173280 Pulled By: ltamasi fbshipit-source-id: 1596bd1a76409a8a6d83d8f1dbfe08bfdea7ffe6	5 years ago
sdong	1760637539	CompactRange() refit level should confirm destination level is not empty (#7261 ) Summary: There is potential data race related CompactRange() with level refitting. After the compaction step and refitting step, some automatic compaction could put data to the destination level and cause the DB to be corrupted. Fix the bug by checking the target level to be empty. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7261 Test Plan: Add a unit test, which would fail with "Corruption: L1 have overlapping ranges '666F6F' seq:6, type:1 vs. '626172' seq:2, type:1", and now it succeeds. Reviewed By: ajkr Differential Revision: D23142269 fbshipit-source-id: 28bc14d5ac934c192260b23a4ce3f10a95e3ee91	5 years ago
matthewvon	2ad88ceae9	Populate cf_id member of CompactionJobInfo for OnCompactionBegin (#6938 ) Summary: Looks like somebody simply missed initializing a member variable. The column family ID, cf_id, is not set during OnCompactionBegin. But it is set properly in the next function for OnCompactionCompleted. Need this cf_id for tracking progress of a Stardog optimize since there may be multiple compactions required for a given column family. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6938 Reviewed By: siying Differential Revision: D23153235 Pulled By: ajkr fbshipit-source-id: 932938de3a4ebbc7ac89702f655583862587d251	5 years ago
Jay Zhuang	69760b4d05	Introduce a global StatsDumpScheduler for stats dumping (#7223 ) Summary: Have a global StatsDumpScheduler for all DB instance stats dumping, including `DumpStats()` and `PersistStats()`. Before this, there're 2 dedicate threads for every DB instance, one for DumpStats() one for PersistStats(), which could create lots of threads if there're hundreds DB instances. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7223 Reviewed By: riversand963 Differential Revision: D23056737 Pulled By: jay-zhuang fbshipit-source-id: 0faa2311142a73433ebb3317361db7cbf43faeba	5 years ago
Yanqin Jin	d758273ceb	Get() with timestamp should respect snapshot (#7227 ) Summary: If user-defined timestamp is enabled, current implementation can expose newer data to queries even if an older sequence number is specified via read_options.snapshot. This PR makes Get() respect sequence-number-based snapshot. Solution is simple. Besides using <ukey, ts, seq> to search the index for the key, we also verify that the candidate result's seq is smaller than or equal to seq. This requires passing a seq via `GetContext`, which results in the majority of code change caused by this PR. Also added a few unit tests to demonstrate standard visibility during point lookup and range scan when timestamp and snapshot are both present. Test plan (devserver): ``` make check $./db_bench --benchmarks=fillseq,readrandom -cache_size=$[6410241024] ``` Result this PR: readrandom : 4.827 micros/op 207180 ops/sec; 22.9 MB/s (1000000 of 1000000 found) master: readrandom : 4.936 micros/op 202610 ops/sec; 22.4 MB/s (1000000 of 1000000 found) Pull Request resolved: https://github.com/facebook/rocksdb/pull/7227 Reviewed By: ltamasi Differential Revision: D23015242 Pulled By: riversand963 fbshipit-source-id: ea7b85a728654553ba357d2e6a207b5e40f7376a	5 years ago
Andrew Kryczka	a1aa3f8385	Disable manual compaction during `ReFitLevel()` (#7250 ) Summary: Manual compaction with `CompactRangeOptions::change_levels` set could refit to a level targeted by another manual compaction. If force_consistency_checks were disabled, it could be possible for overlapping files to be written at that target level. This PR prevents the possibility by calling `DisableManualCompaction()` prior to `ReFitLevel()`. It also improves the manual compaction disabling mechanism to wait for pending manual compactions to complete before returning, and support disabling from multiple threads. Fixes https://github.com/facebook/rocksdb/issues/6432. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7250 Test Plan: crash test command that repro'd the bug reliably: ``` $ TEST_TMPDIR=/dev/shm python tools/db_crashtest.py blackbox --simple -target_file_size_base=524288 -write_buffer_size=1048576 -clear_column_family_one_in=0 -reopen=0 -max_key=10000000 -column_families=1 -max_background_compactions=8 -compact_range_one_in=100000 -compression_type=none -compaction_style=1 -num_levels=5 -universal_min_merge_width=4 -universal_max_merge_width=8 -level0_file_num_compaction_trigger=12 -rate_limiter_bytes_per_sec=1048576000 -universal_max_size_amplification_percent=100 --duration=3600 --interval=60 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --enable_compaction_filter=0 ``` Reviewed By: ltamasi Differential Revision: D23090800 Pulled By: ajkr fbshipit-source-id: afcbcd51b42ce76789fdb907d8b9ada790709c13	5 years ago
sdong	e7358da9a2	Upgrade tool chain (#7251 ) Summary: Upgrade tool chain to the latest. It is done mostly manually as build_tools/build_detect_platform fails to update many of them. Try to fix a new clang analyze warning with the new tool chain. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7251 Test Plan: "make all", "USE_CLANG=1 make all" Reviewed By: riversand963 Differential Revision: D23091090 fbshipit-source-id: 732e5a30137837431438f85f36296406b641f975	5 years ago
Levi Tamasi	9d6f48ec1d	Clean up CompressBlock/CompressBlockInternal a bit (#7249 ) Summary: The patch cleans up and refactors `CompressBlock` and `CompressBlockInternal` a bit. In particular, it does the following: * It renames `CompressBlockInternal` to `CompressData` and moves it to `util/compression.h`, where other general compression-related utilities are located. This will facilitate reuse in the BlobDB write path. * The signature of the method is changed so it now takes `compression_format_version` (similarly to the compression library specific methods) instead of `format_version` (which is specific to the block based table). * `GetCompressionFormatForVersion` no longer takes `compression_type` as a parameter. This parameter was only used in a (not entirely up-to-date) assertion; also, removing it eliminates the need to ensure this precondition holds at all call sites. * Does some minor cleanup in `CompressBlock`, for instance, it is now possible to pass only one of `sampled_output_fast` and `sampled_output_slow`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7249 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D23087278 Pulled By: ltamasi fbshipit-source-id: e6316e45baed8b4e7de7c1780c90501c2a3439b3	5 years ago
Akanksha Mahajan	1f9f630b27	Store FileSystemPtr object that contains FileSystem ptr (#7180 ) Summary: As part of the IOTracing project, this PR 1. Caches "FileSystemPtr" object(wrapper class that returns file system pointer based on tracing enabled) instead of "FileSystem" pointer. 2. FileSystemPtr object is created using FileSystem pointer and IOTracer pointer. 3. IOTracer shared_ptr is created in DBImpl and it is passed to different classes through constructor. 4. When tracing is enabled through DB::StartIOTrace, FileSystemPtr returns FileSystemTracingWrapper pointer for tracing purpose and when it is disabled underlying FileSystem pointer is returned. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7180 Test Plan: make check -j64 COMPILE_WITH_TSAN=1 make check -j64 Reviewed By: anand1976 Differential Revision: D22987117 Pulled By: akankshamahajan15 fbshipit-source-id: 6073617e4c2d5bc363914f3a1f55ae3b0a58fbf1	5 years ago
Zitan Chen	b578ca2e4d	BackupEngine supports custom file checksums (#7085 ) Summary: A new option `std::shared_ptr<FileChecksumGenFactory> backup_checksum_gen_factory` is added to `BackupableDBOptions`. This allows custom checksum functions to be used for creating, verifying, or restoring backups. Tests are added. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7085 Test Plan: Passed make check Reviewed By: pdillinger Differential Revision: D22390756 Pulled By: gg814 fbshipit-source-id: 3b7756ca444c2129844536b91c3ca09f53b6248f	5 years ago
Peter Dillinger	6ac1d25fd0	Fix+clean up handling of mock sleeps (#7101 ) Summary: We have a number of tests hanging on MacOS and windows due to mishandling of code for mock sleeps. In addition, the code was in terrible shape because the same variable (addon_time_) would sometimes refer to microseconds and sometimes to seconds. One test even assumed it was nanoseconds but was written to pass anyway. This has been cleaned up so that DB tests generally use a SpecialEnv function to mock sleep, for either some number of microseconds or seconds depending on the function called. But to call one of these, the test must first call SetMockSleep (precondition enforced with assertion), which also turns sleeps in RocksDB into mock sleeps. To also removes accounting for actual clock time, call SetTimeElapseOnlySleepOnReopen, which implies SetMockSleep (on DB re-open). This latter setting only works by applying on DB re-open, otherwise havoc can ensue if Env goes back in time with DB open. More specifics: Removed some unused test classes, and updated comments on the general problem. Fixed DBSSTTest.GetTotalSstFilesSize using a sync point callback instead of mock time. For this we have the only modification to production code, inserting a sync point callback in flush_job.cc, which is not a change to production behavior. Removed unnecessary resetting of mock times to 0 in many tests. RocksDB deals in relative time. Any behaviors relying on absolute date/time are likely a bug. (The above test DBSSTTest.GetTotalSstFilesSize was the only one clearly injecting a specific absolute time for actual testing convenience.) Just in case I misunderstood some test, I put this note in each replacement: // NOTE: Presumed unnecessary and removed: resetting mock time in env Strengthened some tests like MergeTestTime, MergeCompactionTimeTest, and FilterCompactionTimeTest in db_test.cc stats_history_test and blob_db_test are each their own beast, rather deeply dependent on MockTimeEnv. Each gets its own variant of a work-around for TimedWait in a mock time environment. (Reduces redundancy and inconsistency in stats_history_test.) Intended follow-up: Remove TimedWait from the public API of InstrumentedCondVar, and only make that accessible through Env by passing in an InstrumentedCondVar and a deadline. Then the Env implementations mocking time can fix this problem without using sync points. (Test infrastructure using sync points interferes with individual tests' control over sync points.) With that change, we can simplify/consolidate the scattered work-arounds. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7101 Test Plan: make check on Linux and MacOS Reviewed By: zhichao-cao Differential Revision: D23032815 Pulled By: pdillinger fbshipit-source-id: 7f33967ada8b83011fb54e8279365c008bd6610b	5 years ago
Levi Tamasi	a99fb67233	Remove redundant consistency check from VersionStorageInfo::AddFile (#7237 ) Summary: `VersionStorageInfo::AddFile` currently has a debug-mode consistency check to make sure the newly added file does not overlap with the previous one (for levels below L0). Considering that `VersionBuilder::CheckConsistency` also performs similar checks (in fact, those checks are more comprehensive and cover L0 as well), this check is redundant. The patch removes it and also cleans up `AddFile` a little. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7237 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D23041937 Pulled By: ltamasi fbshipit-source-id: e00665f3b83bfd17f86c54c238800f3d77d739bd	5 years ago
anand76	f308da5273	Fix delete triggered compaction for single level universal (#7224 ) Summary: Delete triggered compaction (DTC) for universal compaction style with ```num_levels = 1``` has been disabled for sometime due to a data correctness bug. This PR re-enables it with a bug fix. A file marked for compaction can be picked, along with all L0 files after it as the compaction input. We stop adding files to the input once we encounter a file already being compacted (the original bug failed to check the compaction status of the files). Tests: Add unit tests to ```compaction_picker_test.cc``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/7224 Reviewed By: ajkr Differential Revision: D23031845 Pulled By: anand1976 fbshipit-source-id: 9de3cab5f9774cede666c2c48d309a7d9b88a505	5 years ago
Yuhong Guo	5444942f15	Fix cmake build on MacOS (#7205 ) Summary: 1. `std::random_shuffle` is deprecated and now we can use `std::shuffle` ``` /rocksdb/db/prefix_test.cc:590:12: error: 'random_shuffle<std::__1::__wrap_iter<unsigned long long > >' is deprecated [-Werror,-Wdeprecated-declarations] std::random_shuffle(prefixes.begin(), prefixes.end()); ^ /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/algorithm:2982:1: note: 'random_shuffle<std::__1::__wrap_iter<unsigned long long > >' has been explicitly marked deprecated here _LIBCPP_DEPRECATED_IN_CXX14 void ^ /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/__config:1107:39: note: expanded from macro '_LIBCPP_DEPRECATED_IN_CXX14' # define _LIBCPP_DEPRECATED_IN_CXX14 _LIBCPP_DEPRECATED ^ /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/../include/c++/v1/__config:1090:48: note: expanded from macro '_LIBCPP_DEPRECATED' # define _LIBCPP_DEPRECATED __attribute__ ((deprecated)) ``` 2. `c_test` link error with `-DROCKSDB_BUILD_SHARED=OFF`: ``` [ 7%] Linking CXX executable c_test ld: library not found for -lrocksdb-shared clang: error: linker command failed with exit code 1 (use -v to see invocation) make[5]: * [c_test] Error 1 make[4]: * [CMakeFiles/c_test.dir/all] Error 2 make[4]: *** Waiting for unfinished jobs.... ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/7205 Reviewed By: ajkr Differential Revision: D23030641 Pulled By: pdillinger fbshipit-source-id: f270e50fc0b824ca1a0876ec5c65d33f55a72dd0	5 years ago
Remington Brasga	633bff2f19	Fixed typo on Value mismatch error in db_test (#6587 ) Summary: The debug is supposed to print out two keys to show the value mismatch, which was compared just a few lines above. However, the actual print-out is the same values (so they obviously won't be mismatched) Pull Request resolved: https://github.com/facebook/rocksdb/pull/6587 Reviewed By: riversand963 Differential Revision: D23025279 Pulled By: ajkr fbshipit-source-id: 4c6c35bc60b273f13c08b5464b6f690d8a5cfe41	5 years ago
anand76	832b056a30	Enable IO timeouts for iterators (#7161 ) Summary: Introduce io_timeout in ReadOptions and enabled deadline/io_timeout for Iterators. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7161 Test Plan: New unit tests in db_basic_test Reviewed By: riversand963 Differential Revision: D22687352 Pulled By: anand1976 fbshipit-source-id: 67bbb0e6d7ae80b256589244468494292538c6ec	5 years ago
Zhichao Cao	b79f13b2aa	Fix the potential deadlock in WriteImplWALOnly and UnorderedWriteMemtable (#7199 ) Summary: Pointed out by https://github.com/facebook/rocksdb/issues/7197 , there is a double lock in WriteImplWALOnly. Also find another deadlock in UnorderedWriteMemtable. Move the check after switch_all_.notify_all(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/7199 Test Plan: pass make check Reviewed By: anand1976 Differential Revision: D22961714 Pulled By: zhichao-cao fbshipit-source-id: 0707922dc50d28ea141a15a8cdcbd1c8993ea0d8	5 years ago
mrambacher	56f468b356	Add more tests to ASSERT_STATUS_CHECKED (#7211 ) Summary: Added 4 more tests to those which pass ASSERT_STATUS_CHECKED (cache_test, lru_cache_test, filename_test, filelock_test). Pull Request resolved: https://github.com/facebook/rocksdb/pull/7211 Reviewed By: ajkr Differential Revision: D22982858 Pulled By: zhichao-cao fbshipit-source-id: acdd071582ed6aa7447ed96c5732f10bf720d783	5 years ago
Yingchun Lai	67bbac3621	Remove duplicate colon in Status message (#7041 ) Summary: A colon will be added after 'msg' automatically when invoke function Status(Code _code, const Slice& msg, const Slice& msg2), it's not needed to append a colon explicitly to 'msg'. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7041 Reviewed By: ajkr Differential Revision: D22292801 fbshipit-source-id: 8f2d69065bb779d2613468bf9fc9169f32c3f1ec	5 years ago
jsteemann	5e1808d515	fix typo: paraniod -> paranoid (#7163 ) Summary: Rename "paraniod" to "paranoid" in a few places. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7163 Reviewed By: ajkr Differential Revision: D22678242 fbshipit-source-id: 28b1011a736d0a95612676f7e1b9500a70c324b4	5 years ago
Cheng Chang	cd48ecaa1a	Define WAL related classes to be used in VersionEdit and VersionSet (#7164 ) Summary: `WalAddition`, `WalDeletion` are defined in `wal_version.h` and used in `VersionEdit`. `WalAddition` is used to represent events of creating a new WAL (no size, just log number), or closing a WAL (with size). `WalDeletion` is used to represent events of deleting or archiving a WAL, it means the WAL is no longer alive (won't be replayed during recovery). `WalSet` is the set of alive WALs kept in `VersionSet`. 1. Why use `WalDeletion` instead of relying on `MinLogNumber` to identify outdated WALs On recovery, we can compute `MinLogNumber()` based on the log numbers kept in MANIFEST, any log with number < MinLogNumber can be ignored. So it seems that we don't need to persist `WalDeletion` to MANIFEST, since we can ignore the WALs based on MinLogNumber. But the `MinLogNumber()` is actually a lower bound, it does not exactly mean that logs starting from MinLogNumber must exist. This is because in a corner case, when a column family is empty and never flushed, its log number is set to the largest log number, but not persisted in MANIFEST. So let's say there are 2 column families, when creating the DB, the first WAL has log number 1, so it's persisted to MANIFEST for both column families. Then CF 0 is empty and never flushed, CF 1 is updated and flushed, so a new WAL with log number 2 is created and persisted to MANIFEST for CF 1. But CF 0's log number in MANIFEST is still 1. So on recovery, MinLogNumber is 1, but since log 1 only contains data for CF 1, and CF 1 is flushed, log 1 might have already been deleted from disk. We can make `MinLogNumber()` be the exactly minimum log number that must exist, by persisting the most recent log number for empty column families that are not flushed. But if there are N such column families, then every time a new WAL is created, we need to add N records to MANIFEST. In current design, a record is persisted to MANIFEST only when WAL is created, closed, or deleted/archived, so the number of WAL related records are bounded to 3x number of WALs. 2. Why keep `WalSet` in `VersionSet` instead of applying the `VersionEdit`s to `VersionStorageInfo` `VersionEdit`s are originally designed to track the addition and deletion of SST files. The SST files are related to column families, each column family has a list of `Version`s, and each `Version` keeps the set of active SST files in `VersionStorageInfo`. But WALs are a concept of DB, they are not bounded to specific column families. So logically it does not make sense to store WALs in a column family's `Version`s. Also, `Version`'s purpose is to keep reference to SST / blob files, so that they are not deleted until there is no version referencing them. But a WAL is deleted regardless of version references. So we keep the WALs in `VersionSet` for the purpose of writing out the DB state's snapshot when creating new MANIFESTs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7164 Test Plan: make version_edit_test && ./version_edit_test make wal_edit_test && ./wal_edit_test Reviewed By: ltamasi Differential Revision: D22677936 Pulled By: cheng-chang fbshipit-source-id: 5a3b6890140e572ffd79eb37e6e4c3c32361a859	5 years ago
sdong	5c1a544122	Clean up InternalIterator upper bound logic a little bit (#7200 ) Summary: IteratorIterator::IsOutOfBound() and IteratorIterator::MayBeOutOfUpperBound() are two functions that related to upper bound check. It is hard for users to reason about this complexity. Consolidate the two functions into one and assign an enum as results to improve readability. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7200 Test Plan: Run all existing test. Would run crash test with atomic for a while. Reviewed By: anand1976 Differential Revision: D22833181 fbshipit-source-id: a0c724267056adbd0476bde74650e6c7226077e6	5 years ago
Yanqin Jin	2735b0275d	ReadOptions.iter_start_ts should support tombstones (#7178 ) Summary: as title. When ReadOptions.iter_start_ts is not nullptr, DBIter::key() should return internal keys including value type. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7178 Test Plan: make check Reviewed By: ltamasi Differential Revision: D22935879 Pulled By: riversand963 fbshipit-source-id: 7508d962cf11ebcfa6386d2529b4f3606b47ccfd	5 years ago
Akanksha Mahajan	493f425e77	Add support to start and end IOTracing through DB APIs (#7203 ) Summary: 1. Add support to start io tracing through DB::StartIOTrace(Env, const TraceOptions&, std::unique_ptr<TraceWriter>&&) and end tracing through DB::EndIOTrace(). This doesn't trace DB::Open. User side code: //Open DB DB::Open(options, dbname, &db); / Start tracing / db->StartIOTrace(env, trace_opt, std::move(trace_writer)); / Perform Operations / /End tracing*/ db->EndIOTrace(); 2. Fix the build errors for Windows. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7203 Test Plan: make check -j64 Reviewed By: anand1976 Differential Revision: D22901947 Pulled By: akankshamahajan15 fbshipit-source-id: e59c0b785a802168e6f1aa028d99c224a35cb30c	5 years ago
Andrew Kryczka	a4a4a2dabd	dedup ReadOptions in iterator hierarchy (#7210 ) Summary: Previously, a `ReadOptions` object was stored in every `BlockBasedTableIterator` and every `LevelIterator`. This redundancy consumes extra memory, resulting in the `Arena` making more allocations, and iteration observing worse cache performance. This PR migrates callers of `NewInternalIterator()` and `MakeInputIterator()` to provide a `ReadOptions` object guaranteed to outlive the returned iterator. When the iterator's lifetime will be managed by the user, this lifetime guarantee is achieved by storing the `ReadOptions` value in `ArenaWrappedDBIter`. Then, sub-iterators of `NewInternalIterator()` and `MakeInputIterator()` can hold a reference-to-const `ReadOptions`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7210 Test Plan: - `make check` under ASAN and valgrind - benchmark: on a DB with 2 L0 files and 3 L1+ levels, this PR reduced `Arena` allocation 4792 -> 4160 bytes. Reviewed By: anand1976 Differential Revision: D22861323 Pulled By: ajkr fbshipit-source-id: 54aebb3e89c872eeab0f5793b4b6e42878d093ce	5 years ago
Aaron Kabcenell	56ed601df3	Compaction Read/Write Stats by Compaction Type (#7165 ) Summary: Adds compaction statistics (total bytes read and written) for compactions that occur for delete-triggered, periodic, and TTL compaction reasons. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7165 Test Plan: TTL and periodic can be checked by runnning db_bench with the options activated: /db_bench --benchmarks="fillrandom,stats" --statistics --num=10000000 -base_background_compactions=16 -periodic_compaction_seconds=1 ./db_bench --benchmarks="fillrandom,stats" --statistics --num=10000000 -base_background_compactions=16 -fifo_compaction_ttl=1 Setting the time to one second causes non-zero bytes read/written for those compaction reasons. Disabling them or setting them to times longer than the test run length causes the stats to return to zero as expected. Delete-triggered compaction counting is tested in DBTablePropertiesTest.DeletionTriggeredCompactionMarking Reviewed By: ajkr Differential Revision: D22693050 Pulled By: akabcenell fbshipit-source-id: d15cef4d94576f703015c8942d5f0d492f69401d	5 years ago
codingsh	50f206ad84	feat: export SetBackgroundThreads(n, Env::BOTTOM); (#7191 ) Summary: - https://github.com/rust-rocksdb/rust-rocksdb/pull/448 Pull Request resolved: https://github.com/facebook/rocksdb/pull/7191 Reviewed By: riversand963 Differential Revision: D22809066 Pulled By: ajkr fbshipit-source-id: 036939f9a28cacc3f677c318d1aed97fe5f4f85e	5 years ago
sdong	692f6a3138	Implement NextAndGetResult() in memtable and level iterator (#7179 ) Summary: NextAndGetResult() is not implemented in memtable and is very simply implemented in level iterator. The result is that for a normal leveled iterator, performance regression will be observed for calling PrepareValue() for most iterator Next(). Mitigate the problem by implementing the function for both iterators. In level iterator, the implementation cannot be perfect as when calling file iterator's SeekToFirst() we don't have information about whether the value is prepared. Fortunately, the first key should not cause a big portion of the CPu. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7179 Test Plan: Run normal crash test for a while. Reviewed By: anand1976 Differential Revision: D22783840 fbshipit-source-id: c19f45cdf21b756190adef97a3b66ccde3936e05	5 years ago
mrambacher	d9d190742c	Make env_test work with ASSERT_STATUS_CHECKED (#7176 ) Summary: Make (most of) the env_test pass when ASSERT_STATUS_CHECKED is enabled. One test that opens a database is currently disabled in this mode, as there are many errors that need revisited for DB tests and status checks. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7176 Reviewed By: cheng-chang Differential Revision: D22799278 Pulled By: ajkr fbshipit-source-id: 16d8a02eaeecd6df1060249b6a5811292801f2ed	5 years ago
codingsh	83ea266b43	export stats_persist_period_sec (#7168 ) Summary: fixed - https://github.com/rust-rocksdb/rust-rocksdb/issues/447 - https://github.com/rust-rocksdb/rust-rocksdb/pull/448 Pull Request resolved: https://github.com/facebook/rocksdb/pull/7168 Reviewed By: cheng-chang Differential Revision: D22736013 Pulled By: ajkr fbshipit-source-id: fdd784aa75d26a367b9108b05ffdd94a2ae117d3	5 years ago
Tomas Kolda	cd4592c220	SST Partitioner interface that allows to split SST files (#6957 ) Summary: SST Partitioner interface that allows to split SST files during compactions. It basically instruct compaction to create a new file when needed. When one is using well defined prefixes and prefixed way of defining tables it is good to define also partitioning so that promotion of some SST file does not cover huge key space on next level (worst case complete space). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6957 Reviewed By: ajkr Differential Revision: D22461239 fbshipit-source-id: 9ce07bba08b3ba89c2d45630520368f704d1316e	5 years ago
Jay Zhuang	b0c5ecd6b3	Make max_subcompactions dynamically changeable (#7159 ) Summary: Make `max-subcompactions` dynamically changeable by passing the `DBOption` to Compaction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7159 Reviewed By: siying Differential Revision: D22671238 Pulled By: jay-zhuang fbshipit-source-id: 311ca9f6bb606965544d8708616d358cfed5be42	5 years ago
Levi Tamasi	0d04a8434a	Sync blob files before closing them (#7160 ) Summary: BlobDB currently syncs each blob file periodically after writing a certain amount of data (as specified by the configuration option `BlobDBOptions::bytes_per_sync`) and all open blob files when the base DB's memtables are flushed. With the patch, in addition to the above, blob files are also synced right before being closed, after the footer has been written. This will be beneficial for the new integrated blob file write path as well. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7160 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D22672646 Pulled By: ltamasi fbshipit-source-id: 62b34263543a7e74abcbb7adf011daa1e699998f	5 years ago
Cheng Chang	96ce0470a7	Clean snapshot dir before taking snapshot (#7156 ) Summary: `DBTest::SnapshotFiles` runs the tests in a `while` loop. Currently, the snapshot directory is not cleaned up in each loop, so previous snapshot files may remain in the next loop's snapshot. When I'm working on https://github.com/facebook/rocksdb/pull/7129, when checking the tracked WALs in MANIFEST, I find that this test always fails because it reads some unknown WAL. It turns out that the unknown WAL is left from previous loops. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7156 Test Plan: make db_test && ./db_test --gtest_filters=*SnapshotFiles Reviewed By: siying Differential Revision: D22668360 Pulled By: cheng-chang fbshipit-source-id: 69d4aa3506038ba30e218e8ae966357935a99c6c	5 years ago
mrambacher	d44cbc5314	Add hash of key/value checks when paranoid_file_checks=true (#7134 ) Summary: When paraoid_files_checks=true, a rolling key-value hash is generated and compared to what is written to the file. If the values do not match, the SST file is rejected. Code put in place for the check for both flush and compaction jobs. Corresponding test added to corruption_test. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7134 Reviewed By: cheng-chang Differential Revision: D22646149 fbshipit-source-id: 8fde1984a1a11edd3bd82a413acffc5ea7aa683f	5 years ago
Haosen Wen	dbc51adbac	Use steady_clock instead of system_clock in FileOperationInfo::TimePoint (#7153 ) Summary: Issue https://github.com/facebook/rocksdb/issues/7133 reported that using `system_clock` in `FileOperationInfo::TimePoint` causes the duration of file flush operation (which can be a noop on MacOS in some scenarios) appears to be 0 and fail an assertion in listener_test. Using `steady_clock` supposedly fixed the problem. `steady_clock` actually fits better into the use cases of `FileOperationInfo::TimePoint` as all usages care about durations but not wall clock time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7153 Test Plan: make check. Reviewed By: riversand963 Differential Revision: D22654136 Pulled By: roghnin fbshipit-source-id: 5980b1080734bdae496a18071a2c2b5887c67d85	5 years ago
sdong	1cf4731dbb	column_family_test: fix a data race related to sleeping task (#7150 ) Summary: TSAN reports warning in one column_family_test: WARNING: ThreadSanitizer: data race (pid=16352) Write of size 8 at 0x7ffcdf042158 by main thread: #0 pthread_cond_destroy <null> (column_family_test+0x471f65) https://github.com/facebook/rocksdb/issues/1 rocksdb::port::CondVar::~CondVar() /home/circleci/project/port/port_posix.cc:101:49 (column_family_test+0x8a627a) https://github.com/facebook/rocksdb/issues/2 rocksdb::test::SleepingBackgroundTask::~SleepingBackgroundTask() /home/circleci/project/./test_util/testutil.h:397:7 (column_family_test+0x54b6e2) https://github.com/facebook/rocksdb/issues/3 rocksdb::ColumnFamilyTest_FlushCloseWALFiles_Test::TestBody() /home/circleci/project/db/column_family_test.cc:3008:1 (column_family_test+0x54b6e2) ...... Previous read of size 8 at 0x7ffcdf042158 by thread T2 (mutexes: write M0): #0 pthread_cond_broadcast <null> (column_family_test+0x471dd2) https://github.com/facebook/rocksdb/issues/1 rocksdb::port::CondVar::SignalAll() /home/circleci/project/port/port_posix.cc:139:28 (column_family_test+0x8a651a) https://github.com/facebook/rocksdb/issues/2 rocksdb::test::SleepingBackgroundTask::DoSleep() /home/circleci/project/./test_util/testutil.h:412:12 (column_family_test+0x58574b) ...... Likely, SleepingBackgroundTask::DoSleep() started to execute after the main thread has finished everything, cancelled and waited for sleeping tasks to finish. At this time, although DoSlee() will not sleep, but it also accesses the mutex, creating a data race with destructor of the test. Fix this bug by waiting for the sleeping task to start sleeping after it is scheduled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7150 Test Plan: Run these modified tests and make sure it doesn't break. Reviewed By: riversand963 Differential Revision: D22630716 fbshipit-source-id: cc5781cf69083685de406490438898238bdfc2d3	5 years ago
sdong	9870704420	Fix a minor data race in stats dumping threads initialization (#7151 ) Summary: https://github.com/facebook/rocksdb/pull/7145 creates a minor data race against the stat creation counter. Turn it to atomic. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7151 Test Plan: Run the test. Reviewed By: ajkr Differential Revision: D22631014 fbshipit-source-id: c6fb69ac5b9df7139795dacea5ce9fb9fd3278d7	5 years ago
Zhichao Cao	ed4712fe7e	Remove time out testing cases in error_handler_fs_test (#7141 ) Summary: Remove the 3 testing cases that cause the time out in linux build by https://github.com/facebook/rocksdb/issues/6765 . Will fix them later. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7141 Test Plan: make asan_check, buck run Reviewed By: ajkr Differential Revision: D22593831 Pulled By: zhichao-cao fbshipit-source-id: 14956c36476ecc3393f613178c22e13df843126e	5 years ago
Andrew Kryczka	9a83fd21e6	stagger first DumpMallocStats after opening DB (#7145 ) Summary: Previously when running `db_bench` with large value for `num_multi_dbs` and enabled `Options::dump_malloc_stats`, we would see most CPU spent in jemalloc locking. After this PR that no longer shows up at the top of the profile. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7145 Reviewed By: riversand963 Differential Revision: D22593031 Pulled By: ajkr fbshipit-source-id: 3b3fc91f93249c6afee53f59f34c487c3fc5add6	5 years ago
sdong	ca5a069a79	Suppress a TSAN warning (#7126 ) Summary: TSAN shows warning with clang with warning similar to this: WARNING: ThreadSanitizer: data race (pid=10159) Atomic write of size 8 at 0x7b5000002890 by thread T33: #0 __tsan_atomic64_store <null> (db_test+0x4ca2b5) https://github.com/facebook/rocksdb/issues/1 std::__atomic_base<unsigned long>::store(unsigned long, std::memory_order) /usr/bin/../lib/gcc/x86_64-linux-gnu/5.4.0/../../../../include/c++/5.4.0/bits/atomic_base.h:374:2 (db_test+0x774fde) https://github.com/facebook/rocksdb/issues/2 rocksdb::VersionSet::SetLastSequence(unsigned long) /home/circleci/project/./db/version_set.h:1057:20 (db_test+0x774fde) https://github.com/facebook/rocksdb/issues/3 rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch, rocksdb::WriteCallback, unsigned long, unsigned long, bool, unsigned long, unsigned long, rocksdb::PreReleaseCallback) /home/circleci/project/db/db_impl/db_impl_write.cc:449:18 (db_test+0x774fde) ...... Previous read of size 8 at 0x7b5000002890 by thread T5 (mutexes: write M1044689462619020832): #0 rocksdb::DBImpl::ReleaseSnapshot(rocksdb::Snapshot const) /home/circleci/project/db/db_impl/db_impl.cc (db_test+0x6f4ae7) https://github.com/facebook/rocksdb/issues/1 rocksdb::(anonymous namespace)::MTThreadBody(void) /home/circleci/project/db/db_test.cc:2514:13 (db_test+0x56ac59) https://github.com/facebook/rocksdb/issues/2 rocksdb::(anonymous namespace)::StartThreadWrapper(void) /home/circleci/project/env/env_posix.cc:443:3 (db_test+0x88c4cd) It is not limited to ReleaseSnapshot() and rocksdb::DBImpl::MultiCFSnapshot(). While we are not 100% sure it doesn't indicate any correctness violation, we suppress them for now to keep TSAN clean with more tests so that we can cover more bugs with CI. In the gcc runs we have been running, this warning rarely shows up. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7126 Test Plan: See the mini-TSAN test to pass with reasonable run time. Reviewed By: ajkr Differential Revision: D22552375 fbshipit-source-id: ebdd3854cb3becec3403970326a1ca961db2ab00	5 years ago
Zhichao Cao	a10f12eda1	Auto resume the DB from Retryable IO Error (#6765 ) Summary: In current codebase, in write path, if Retryable IO Error happens, SetBGError is called. The retryable IO Error is converted to hard error and DB is in read only mode. User or application needs to resume it. In this PR, if Retryable IO Error happens in one DB, SetBGError will create a new thread to call Resume (auto resume). otpions.max_bgerror_resume_count controls if auto resume is enabled or not (if max_bgerror_resume_count<=0, auto resume will not be enabled). options.bgerror_resume_retry_interval controls the time interval to call Resume again if the previous resume fails due to the Retryable IO Error. If non-retryable error happens during resume, auto resume will terminate. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6765 Test Plan: Added the unit test cases in error_handler_fs_test and pass make asan_check Reviewed By: anand1976 Differential Revision: D21916789 Pulled By: zhichao-cao fbshipit-source-id: acb8b5e5dc3167adfa9425a5b7fc104f6b95cb0b	5 years ago
Yanqin Jin	27735dea9a	Report corrupted keys during compaction (#7124 ) Summary: Currently, RocksDB lets compaction to go through even in case of corrupted keys, the number of which is reported in CompactionJobStats. However, RocksDB does not check this value. We should let compaction run in a stricter mode. Temporarily disable two tests that allow corrupted keys in compaction. With this PR, the two tests will assert(false) and terminate. Still need to investigate what is the recommended google-test way of doing it. Death test (EXPECT_DEATH) in gtest has warnings now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7124 Test Plan: make check Reviewed By: ajkr Differential Revision: D22530722 Pulled By: riversand963 fbshipit-source-id: 6a5a6a992028c6d4f92cb74693c92db462ae4ad6	5 years ago
Levi Tamasi	bdf4de6cb9	Remove some dead code from BlobLogWriter (#7125 ) Summary: Periodic syncing of blob files is performed by `WritableFileWriter`; `bytes_per_sync_` and `next_sync_offset_` in `BlobLogWriter` are actually unused (or more precisely, only used by methods that are themselves unused). The patch removes all this dead code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7125 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D22531021 Pulled By: ltamasi fbshipit-source-id: 6b293ad5a79d3e6bf15c5c68f7aedd7ce7a15f10	5 years ago

... 2 3 4 5 6 ...

4275 Commits (e062a719cc8c7c9aa19c8c58131784b3acaec859)