rocksdb

Commit Graph

Author	SHA1	Message	Date
Giuseppe Ottaviano	48cd7a3aae	Fix flush reason attribution (#8150 ) Summary: Current flush reason attribution is misleading or incorrect (depending on what the original intention was): - Flush due to WAL reaching its maximum size is attributed to `kWriteBufferManager` - Flushes due to full write buffer and write buffer manager are not distinguishable, both are attributed to `kWriteBufferFull` This changes the first to a new flush reason `kWALFull`, and splits the second between `kWriteBufferManager` and `kWriteBufferFull`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8150 Reviewed By: zhichao-cao Differential Revision: D27569645 Pulled By: ot fbshipit-source-id: 7e3c8ca186a6e71976e6b8e937297eebd4b769cc	4 years ago
Peter Dillinger	879357fdb0	Make backups openable as read-only DBs (#8142 ) Summary: A current limitation of backups is that you don't know the exact database state of when the backup was taken. With this new feature, you can at least inspect the backup's DB state without restoring it by opening it as a read-only DB. Rather than add something like OpenAsReadOnlyDB to the BackupEngine API, which would inhibit opening stackable DB implementations read-only (if/when their APIs support it), we instead provide a DB name and Env that can be used to open as a read-only DB. Possible follow-up work: * Add a version of GetBackupInfo for a single backup. * Let CreateNewBackup return the BackupID of the newly-created backup. Implementation details: Refactored ChrootFileSystem to split off new base class RemapFileSystem, which allows more general remapping of files. We use this base class to implement BackupEngineImpl::RemapSharedFileSystem. To minimize API impact, I decided to just add these fields `name_for_open` and `env_for_open` to those set by GetBackupInfo when include_file_details=true. Creating the RemapSharedFileSystem adds a bit to the memory consumption, perhaps unnecessarily in some cases, but this has been mitigated by (a) only initialize the RemapSharedFileSystem lazily when GetBackupInfo with include_file_details=true is called, and (b) using the existing `shared_ptr<FileInfo>` objects to hold most of the mapping data. To enhance API safety, RemapSharedFileSystem is wrapped by new ReadOnlyFileSystem which rejects any attempts to write. This uncovered a couple of places in which DB::OpenForReadOnly would write to the filesystem, so I fixed these. Added a release note because this affects logging. Additional minor refactoring in backupable_db.cc to support the new functionality. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8142 Test Plan: new test (run with ASAN and UBSAN), added to stress test and ran it for a while with amplified backup_one_in Reviewed By: ajkr Differential Revision: D27535408 Pulled By: pdillinger fbshipit-source-id: 04666d310aa0261ef6b2385c43ca793ce1dfd148	4 years ago
Akanksha Mahajan	689b13e639	Add request_id in IODebugContext. (#8045 ) Summary: Add request_id in IODebugContext which will be populated by underlying FileSystem for IOTracing purposes. Update IOTracer to trace request_id in the tracing records. Provided API IODebugContext::SetRequestId which will set the request_id and enable tracing for request_id. The API hides the implementation and underlying file system needs to call this API directly. Update DB::StartIOTrace API and remove redundant Env* from the arguments as it's not used and DB already has Env that is passed down to IOTracer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8045 Test Plan: Update unit test. Differential Revision: D26899871 Pulled By: akankshamahajan15 fbshipit-source-id: 56adef52ee5af0fb3060b607c3af1ec01635fa2b	4 years ago
sherriiiliu	e6534900bd	Fix possible hang issue in ~DBImpl() when flush is scheduled in LOW pool (#8125 ) Summary: In DBImpl::CloseHelper, we wait for bg_compaction_scheduled_ and bg_flush_scheduled_ to drop to 0. Unschedule is called prior to cancel any unscheduled flushes/compactions. It is assumed that anything in the high priority is a flush, and anything in the low priority pool is a compaction. This assumption, however, is broken when the high-pri pool is full. As a result, bg_compaction_scheduled_ can go < 0 and bg_flush_scheduled_ will remain > 0 and DB can be in hang state. The fix is, we decrement the `bg_{flush,compaction,bottom_compaction}_scheduled_` inside the `Unschedule{Flush,Compaction,BottomCompaction}Callback()`s. DB `mutex_` will make the counts atomic in `Unschedule`. Related discussion: https://github.com/facebook/rocksdb/issues/7928 Pull Request resolved: https://github.com/facebook/rocksdb/pull/8125 Test Plan: Added new test case which hangs without the fix. Reviewed By: jay-zhuang Differential Revision: D27390043 Pulled By: ajkr fbshipit-source-id: 78a367fba9a59ac5607ad24bd1c46dc16d5ec110	4 years ago
anand76	7d7f14480e	Always truncate the latest WAL file on DB Open (#8122 ) Summary: Currently, we only truncate the latest alive WAL files when the DB is opened. If the latest WAL file is empty or was flushed during Open, it's not truncated since the file will be deleted later on in the Open path. However, before deletion, a new WAL file is created, and if the process crash loops between the new WAL file creation and deletion of the old WAL file, the preallocated space will keep accumulating and eventually use up all disk space. To prevent this, always truncate the latest WAL file, even if it's empty or the data was flushed. Tests: Add unit tests to db_wal_test Pull Request resolved: https://github.com/facebook/rocksdb/pull/8122 Reviewed By: riversand963 Differential Revision: D27366132 Pulled By: anand1976 fbshipit-source-id: f923cc03ef033ccb32b140d36c6a63a8152f0e8e	4 years ago
Akanksha Mahajan	27d57a035e	Use SST file manager to track blob files as well (#8037 ) Summary: Extend support to track blob files in SST File manager. This PR notifies SstFileManager whenever a new blob file is created, via OnAddFile and an obsolete blob file deleted via OnDeleteFile and delete file via ScheduleFileDeletion. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8037 Test Plan: Add new unit tests Reviewed By: ltamasi Differential Revision: D26891237 Pulled By: akankshamahajan15 fbshipit-source-id: 04c69ccfda2a73782fd5c51982dae58dd11979b6	4 years ago
mrambacher	3dff28cf9b	Use SystemClock* instead of std::shared_ptr<SystemClock> in lower level routines (#8033 ) Summary: For performance purposes, the lower level routines were changed to use a SystemClock* instead of a std::shared_ptr<SystemClock>. The shared ptr has some performance degradation on certain hardware classes. For most of the system, there is no risk of the pointer being deleted/invalid because the shared_ptr will be stored elsewhere. For example, the ImmutableDBOptions stores the Env which has a std::shared_ptr<SystemClock> in it. The SystemClock* within the ImmutableDBOptions is essentially a "short cut" to gain access to this constant resource. There were a few classes (PeriodicWorkScheduler?) where the "short cut" property did not hold. In those cases, the shared pointer was preserved. Using db_bench readrandom perf_level=3 on my EC2 box, this change performed as well or better than 6.17: 6.17: readrandom : 28.046 micros/op 854902 ops/sec; 61.3 MB/s (355999 of 355999 found) 6.18: readrandom : 32.615 micros/op 735306 ops/sec; 52.7 MB/s (290999 of 290999 found) PR: readrandom : 27.500 micros/op 871909 ops/sec; 62.5 MB/s (367999 of 367999 found) (Note that the times for 6.18 are prior to revert of the SystemClock). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8033 Reviewed By: pdillinger Differential Revision: D27014563 Pulled By: mrambacher fbshipit-source-id: ad0459eba03182e454391b5926bf5cdd45657b67	4 years ago
Akanksha Mahajan	46cf5fbfdd	Extend VerifyFileChecksums API for blob files (#7979 ) Summary: Extend VerifyFileChecksums API to verify blob files in case of use_file_checksum. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7979 Test Plan: New unit test db_blob_corruption_test Reviewed By: ltamasi Differential Revision: D26534040 Pulled By: akankshamahajan15 fbshipit-source-id: 7dc5951a3df9d265ea1265e0122b43c966856ade	4 years ago
Zhichao Cao	b0fd1cc45a	Introduce a new trace file format (v 0.2) for better extension (#7977 ) Summary: The trace file record and payload encoding is fixed, which requires complex backward-compatibility handling. This PR introduces a new trace file format, which makes it easier to add new entries to the payload and does not have backward-compatibility issues. V 0.1 is still supported in this PR. Added the tracing for lower_bound and upper_bound for iterator. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7977 Test Plan: make check. tested with old trace file in replay and analyzing. Reviewed By: anand1976 Differential Revision: D26529948 Pulled By: zhichao-cao fbshipit-source-id: ebb75a127ce3c07c25a1ccc194c551f917896a76	4 years ago
Jay Zhuang	cf160b98e1	Add full_history_ts_low option to compaction (#7884 ) Summary: The full_history_ts_low is used for user-defined timestamp GC compaction, which is introduced in https://github.com/facebook/rocksdb/issues/7740, https://github.com/facebook/rocksdb/issues/7657 and https://github.com/facebook/rocksdb/issues/7655. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7884 Reviewed By: ltamasi Differential Revision: D25982553 Pulled By: jay-zhuang fbshipit-source-id: 36303d412d65b5d8166b6da24fa21ad85adbabee	4 years ago
Zhichao Cao	95013df278	Do not set bg error for compaction in retryable IO Error case (#7899 ) Summary: When a retryable IO error occurs during compaction, it is mapped to a soft error and set as the BG error. However, auto resume is not called to clear the soft error since compaction will reschedule by itself. With this change, when a retryable IO error occurs during compaction, the BG error is not set. The user is informed of the error via EventHelper. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7899 Test Plan: tested with error_handler_fs_test Reviewed By: anand1976 Differential Revision: D26094097 Pulled By: zhichao-cao fbshipit-source-id: c53424f11d237405592cd762f43cbbdf8da8234f	4 years ago
mrambacher	12f1137355	Add a SystemClock class to capture the time functions of an Env (#7858 ) Summary: Introduces and uses a SystemClock class to RocksDB. This class contains the time-related functions of an Env and these functions can be redirected from the Env to the SystemClock. Many of the places that used an Env (Timer, PerfStepTimer, RepeatableThread, RateLimiter, WriteController) for time-related functions have been changed to use SystemClock instead. There are likely more places that can be changed, but this is a start to show what can/should be done. Over time it would be nice to migrate most (if not all) of the uses of the time functions from the Env to the SystemClock. There are several Env classes that implement these functions. Most of these have not been converted yet to SystemClock implementations; that will come in a subsequent PR. It would be good to unify many of the Mock Timer implementations, so that they behave similarly and be tested similarly (some override Sleep, some use a MockSleep, etc). Additionally, this change will allow new methods to be introduced to the SystemClock (like https://github.com/facebook/rocksdb/issues/7101 WaitFor) in a consistent manner across a smaller number of classes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7858 Reviewed By: pdillinger Differential Revision: D26006406 Pulled By: mrambacher fbshipit-source-id: ed10a8abbdab7ff2e23d69d85bd25b3e7e899e90	4 years ago
Levi Tamasi	1afbd1948c	Add initial blob support to batched MultiGet (#7766 ) Summary: The patch adds initial support for reading blobs to the batched `MultiGet` API. The current implementation simply retrieves the blob values as the blob indexes are encountered; that is, reads from blob files are currently not batched. (This will be optimized in a separate phase.) In addition, the patch removes some dead code related to BlobDB from the batched `MultiGet` implementation, namely the `is_blob` / `is_blob_index` flags that are passed around in `DBImpl` and `MemTable` / `MemTableListVersion`. These were never hooked up to anything and wouldn't work anyways, since a single flag is not sufficient to communicate the "blobness" of multiple key-values. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7766 Test Plan: `make check` Reviewed By: jay-zhuang Differential Revision: D25479290 Pulled By: ltamasi fbshipit-source-id: 7aba2d290e31876ee592bcf1adfd1018713a8000	4 years ago
Adam Retter	8ff6557e7f	Add further tests to ASSERT_STATUS_CHECKED (2) (#7698 ) Summary: Second batch of adding more tests to ASSERT_STATUS_CHECKED. * external_sst_file_basic_test * checkpoint_test * db_wal_test * db_block_cache_test * db_logical_block_size_cache_test * db_blob_index_test * optimistic_transaction_test * transaction_test * point_lock_manager_test * write_prepared_transaction_test * write_unprepared_transaction_test Pull Request resolved: https://github.com/facebook/rocksdb/pull/7698 Reviewed By: cheng-chang Differential Revision: D25441664 Pulled By: pdillinger fbshipit-source-id: 9e78867f32321db5d4833e95eb96c5734526ef00	4 years ago
Levi Tamasi	61932cdf1d	Add blob support to DBIter (#7731 ) Summary: The patch adds iterator support to the integrated BlobDB implementation. Whenever a blob reference is encountered during iteration, the corresponding blob is retrieved by calling `Version::GetBlob`, assuming the `expose_blob_index` (formerly `allow_blob`) flag is not set. (Note: the flag is set by the old stacked BlobDB implementation, which has its own blob file handling/blob retrieval logic.) In addition, `DBIter` now uniformly returns `Status::NotSupported` with the error message `"BlobDB does not support merge operator."` when encountering a blob reference while performing a merge (instead of potentially returning a message that implies the database should be opened using the stacked BlobDB's `Open`.) TODO: We can implement support for lazily retrieving the blob value (or in other words, bypassing the retrieval of blob values based on key) by extending the `Iterator` API with a new `PrepareValue` method (similarly to `InternalIterator`, which already supports lazy values). Pull Request resolved: https://github.com/facebook/rocksdb/pull/7731 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D25256293 Pulled By: ltamasi fbshipit-source-id: c39cd782011495a526cdff99c16f5fca400c4811	4 years ago
Cheng Chang	70f2e0916a	Write min_log_number_to_keep to MANIFEST during atomic flush under 2 phase commit (#7570 ) Summary: When 2 phase commit is enabled, if there are prepared data in a WAL, the WAL should be kept, the minimum log number for such a WAL is written to MANIFEST during flush. In atomic flush, such information is not written to MANIFEST. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7570 Test Plan: Added a new unit test `DBAtomicFlushTest.ManualFlushUnder2PC`, this test fails in atomic flush without this PR, after this PR, it succeeds. Reviewed By: riversand963 Differential Revision: D24394222 Pulled By: cheng-chang fbshipit-source-id: 60ce74b21b704804943be40c8de01b41269cf116	4 years ago
Jay Zhuang	7fec715db4	Make CompactRange and GetApproximateSizes work with timestamp (#7684 ) Summary: Add timestamp to the `CompactRange()` and `GetApproximateSizes` range keys if needed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7684 Test Plan: make check Reviewed By: riversand963 Differential Revision: D25015421 Pulled By: jay-zhuang fbshipit-source-id: 51ca0756087eb053a3b11801e5c7ce1c6e2d38a9	4 years ago
Cheng Chang	8c93b16f02	Track WAL in MANIFEST: Update logic for computing min_log_number_to_keep in atomic flush (#7660 ) Summary: The logic for computing min_log_number_to_keep in atomic flush was incorrect. For example, when all column families are flushed, the min_log_number_to_keep should be the latest new log. But the incorrect logic calls `PrecomputeMinLogNumberToKeepNon2PC` for each column family, and returns the minimum of them. However, `PrecomputeMinLogNumberToKeepNon2PC(cf)` assumes column families other than `cf` are flushed, but in case all column families are flushed, this assumption is incorrect. Without this fix, the WAL referenced by the computed min_log_number_to_keep may actually contain no unflushed data, so the WAL might have actually been deleted from disk on recovery, then an incorrect error `Corruption: missing WAL` will be reported. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7660 Test Plan: run `make crash_test_with_atomic_flush` on devserver added a unit test in `db_flush_test` Reviewed By: riversand963 Differential Revision: D24906265 Pulled By: cheng-chang fbshipit-source-id: 08deda62e71f67f59e3b7925cdd86dd09bd4f430	4 years ago
Cheng Chang	5e794b0841	Fix a recovery corner case (#7621 ) Summary: Consider the following sequence of events: 1. Db flushed an SST with file number N, appended to MANIFEST, and tried to sync the MANIFEST. 2. Syncing MANIFEST failed and db crashed. 3. Db tried to recover with this MANIFEST. In the meantime, no entry about the newly-flushed SST was found in the MANIFEST. Therefore, RocksDB replayed WAL and tried to flush to an SST file reusing the same file number N. This failed because file system does not support overwrite. Then Db deleted this file. 4. Db crashed again. 5. Db tried to recover. When db read the MANIFEST, there was an entry referencing N.sst. This could happen probably because the append in step 1 finally reached the MANIFEST and became visible. Since N.sst had been deleted in step 3, recovery failed. It is possible that N.sst created in step 1 is valid. Although step 3 would still fail since the MANIFEST was not synced properly in step 1 and 2, deleting N.sst would make it impossible for the db to recover even if the remaining part of MANIFEST was appended and visible after step 5. After this PR, in step 3, immediately after recovering from MANIFEST, a new MANIFEST is created, then we find that N.sst is not referenced in the MANIFEST, so we delete it, and we'll not reuse N as file number. Then in step 5, since the new MANIFEST does not contain N.sst, the recovery failure situation in step 5 won't happen. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7621 Test Plan: 1. some tests are updated, because these tests assume that new MANIFEST is created after WAL recovery. 2. a new unit test is added in db_basic_test to simulate step 3. Reviewed By: riversand963 Differential Revision: D24668144 Pulled By: cheng-chang fbshipit-source-id: 90d7487fbad2bc3714f5ede46ea949895b15ae3b	4 years ago
Cheng Chang	1e40696dd1	Track WAL in MANIFEST: LogAndApply WAL events to MANIFEST (#7601 ) Summary: When a WAL is synced, an edit is written to MANIFEST. After flushing memtables, the obsoleted WALs are piggybacked to MANIFEST while writing the new L0 files to MANIFEST. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7601 Test Plan: `track_and_verify_wals_in_manifest` is enabled by default for all tests extending `DBBasicTest`, and in db_stress_test. Unit test `wal_edit_test`, `version_edit_test`, and `version_set_test` are also updated. Watch all tests to pass. Reviewed By: ltamasi Differential Revision: D24553957 Pulled By: cheng-chang fbshipit-source-id: 66a569ff1bdced38e22900bd240b73113906e040	4 years ago
Yanqin Jin	fde0cd7ced	Add API to verify whole sst file checksum (#7578 ) Summary: Existing API `VerifyChecksum()` allows application to verify sst files' block checksums. Since whole file, user-specified checksum is tracked in MANIFEST, we can expose a new API to verify sst files' file checksums. ``` // Compute table file checksums if applicable and compare with MANIFEST. // Returns OK if no file has mismatching whole-file checksum. Status DB::VerifyFileChecksums(const ReadOptions& /*read_options*/); ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/7578 Test Plan: make check Reviewed By: pdillinger Differential Revision: D24436783 Pulled By: riversand963 fbshipit-source-id: 52b51519b842f2b3c4e3351998a97c86cbec85b3	4 years ago
Andrew Kryczka	1e00909730	Periodically flush info log out of application buffer (#7488 ) Summary: This PR schedules a background thread (shared across all DB instances) to flush info log every ten seconds. This improves debuggability in case of RocksDB hanging since it ensures the log messages leading up to the hang will eventually become visible in the log. The bulk of this PR is moving monitoring/stats_dump_scheduler* to db/periodic_work_scheduler* and making the corresponding name changes since now the scheduler handles info log flushing, not just stats dumping. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7488 Reviewed By: riversand963 Differential Revision: D24065165 Pulled By: ajkr fbshipit-source-id: 339c47a0ff43b79fdbd055fbd9fefbb6f9d8d3b5	4 years ago
Zhichao Cao	c268628c25	Map retryable IO error during Flush without WAL to soft error and no switch memtable during resume (#7310 ) Summary: In the current implementation, any retryable IO error that happens during Flush is mapped to a hard error. In this case, the DB is stopped and writes are stalled until the background error is cleared. In this PR, if WAL is DISABLED, a retryable IO error during Flush is mapped to a soft error, so that the memtable can continue to receive writes. At the same time, if auto resume is triggered, SwitchMemtable will not be called during Flush when resuming the DB, to avoid too many small memtables. Test cases are added. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7310 Test Plan: adding new unit test, pass make check. Reviewed By: anand1976 Differential Revision: D23710892 Pulled By: zhichao-cao fbshipit-source-id: bc4ca50d11c6b23b60d2c0cb171d86d542b038e9	4 years ago
Adam Retter	3ac07a12fe	RocksJava - Add errorIfLogFileExists parameter to RocksDB.openReadOnly (#7046 ) Summary: Expose from C++ API to Java API. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7046 Reviewed By: riversand963 Differential Revision: D23726297 Pulled By: pdillinger fbshipit-source-id: fc66bf626ce6fe9797e7d021ac849eacab91bf6d	4 years ago
Jay Zhuang	69760b4d05	Introduce a global StatsDumpScheduler for stats dumping (#7223 ) Summary: Have a global StatsDumpScheduler for all DB instance stats dumping, including `DumpStats()` and `PersistStats()`. Before this, there were 2 dedicated threads for every DB instance, one for DumpStats() and one for PersistStats(), which could create lots of threads if there are hundreds of DB instances. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7223 Reviewed By: riversand963 Differential Revision: D23056737 Pulled By: jay-zhuang fbshipit-source-id: 0faa2311142a73433ebb3317361db7cbf43faeba	4 years ago
Andrew Kryczka	a1aa3f8385	Disable manual compaction during `ReFitLevel()` (#7250 ) Summary: Manual compaction with `CompactRangeOptions::change_levels` set could refit to a level targeted by another manual compaction. If force_consistency_checks were disabled, it could be possible for overlapping files to be written at that target level. This PR prevents the possibility by calling `DisableManualCompaction()` prior to `ReFitLevel()`. It also improves the manual compaction disabling mechanism to wait for pending manual compactions to complete before returning, and support disabling from multiple threads. Fixes https://github.com/facebook/rocksdb/issues/6432. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7250 Test Plan: crash test command that repro'd the bug reliably: ``` $ TEST_TMPDIR=/dev/shm python tools/db_crashtest.py blackbox --simple -target_file_size_base=524288 -write_buffer_size=1048576 -clear_column_family_one_in=0 -reopen=0 -max_key=10000000 -column_families=1 -max_background_compactions=8 -compact_range_one_in=100000 -compression_type=none -compaction_style=1 -num_levels=5 -universal_min_merge_width=4 -universal_max_merge_width=8 -level0_file_num_compaction_trigger=12 -rate_limiter_bytes_per_sec=1048576000 -universal_max_size_amplification_percent=100 --duration=3600 --interval=60 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --enable_compaction_filter=0 ``` Reviewed By: ltamasi Differential Revision: D23090800 Pulled By: ajkr fbshipit-source-id: afcbcd51b42ce76789fdb907d8b9ada790709c13	4 years ago
Akanksha Mahajan	1f9f630b27	Store FileSystemPtr object that contains FileSystem ptr (#7180 ) Summary: As part of the IOTracing project, this PR 1. Caches "FileSystemPtr" object(wrapper class that returns file system pointer based on tracing enabled) instead of "FileSystem" pointer. 2. FileSystemPtr object is created using FileSystem pointer and IOTracer pointer. 3. IOTracer shared_ptr is created in DBImpl and it is passed to different classes through constructor. 4. When tracing is enabled through DB::StartIOTrace, FileSystemPtr returns FileSystemTracingWrapper pointer for tracing purpose and when it is disabled underlying FileSystem pointer is returned. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7180 Test Plan: make check -j64 COMPILE_WITH_TSAN=1 make check -j64 Reviewed By: anand1976 Differential Revision: D22987117 Pulled By: akankshamahajan15 fbshipit-source-id: 6073617e4c2d5bc363914f3a1f55ae3b0a58fbf1	4 years ago
Zhichao Cao	b79f13b2aa	Fix the potential deadlock in WriteImplWALOnly and UnorderedWriteMemtable (#7199 ) Summary: Pointed out by https://github.com/facebook/rocksdb/issues/7197 , there is a double lock in WriteImplWALOnly. Also find another deadlock in UnorderedWriteMemtable. Move the check after switch_all_.notify_all(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/7199 Test Plan: pass make check Reviewed By: anand1976 Differential Revision: D22961714 Pulled By: zhichao-cao fbshipit-source-id: 0707922dc50d28ea141a15a8cdcbd1c8993ea0d8	4 years ago
Akanksha Mahajan	493f425e77	Add support to start and end IOTracing through DB APIs (#7203 ) Summary: 1. Add support to start io tracing through DB::StartIOTrace(Env*, const TraceOptions&, std::unique_ptr<TraceWriter>&&) and end tracing through DB::EndIOTrace(). This doesn't trace DB::Open. User side code: //Open DB DB::Open(options, dbname, &db); /* Start tracing */ db->StartIOTrace(env, trace_opt, std::move(trace_writer)); /* Perform Operations */ /*End tracing*/ db->EndIOTrace(); 2. Fix the build errors for Windows. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7203 Test Plan: make check -j64 Reviewed By: anand1976 Differential Revision: D22901947 Pulled By: akankshamahajan15 fbshipit-source-id: e59c0b785a802168e6f1aa028d99c224a35cb30c	4 years ago
Andrew Kryczka	a4a4a2dabd	dedup ReadOptions in iterator hierarchy (#7210 ) Summary: Previously, a `ReadOptions` object was stored in every `BlockBasedTableIterator` and every `LevelIterator`. This redundancy consumes extra memory, resulting in the `Arena` making more allocations, and iteration observing worse cache performance. This PR migrates callers of `NewInternalIterator()` and `MakeInputIterator()` to provide a `ReadOptions` object guaranteed to outlive the returned iterator. When the iterator's lifetime will be managed by the user, this lifetime guarantee is achieved by storing the `ReadOptions` value in `ArenaWrappedDBIter`. Then, sub-iterators of `NewInternalIterator()` and `MakeInputIterator()` can hold a reference-to-const `ReadOptions`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7210 Test Plan: - `make check` under ASAN and valgrind - benchmark: on a DB with 2 L0 files and 3 L1+ levels, this PR reduced `Arena` allocation 4792 -> 4160 bytes. Reviewed By: anand1976 Differential Revision: D22861323 Pulled By: ajkr fbshipit-source-id: 54aebb3e89c872eeab0f5793b4b6e42878d093ce	4 years ago
Zhichao Cao	a10f12eda1	Auto resume the DB from Retryable IO Error (#6765 ) Summary: In current codebase, in write path, if Retryable IO Error happens, SetBGError is called. The retryable IO Error is converted to hard error and DB is in read only mode. User or application needs to resume it. In this PR, if Retryable IO Error happens in one DB, SetBGError will create a new thread to call Resume (auto resume). options.max_bgerror_resume_count controls if auto resume is enabled or not (if max_bgerror_resume_count<=0, auto resume will not be enabled). options.bgerror_resume_retry_interval controls the time interval to call Resume again if the previous resume fails due to the Retryable IO Error. If non-retryable error happens during resume, auto resume will terminate. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6765 Test Plan: Added the unit test cases in error_handler_fs_test and pass make asan_check Reviewed By: anand1976 Differential Revision: D21916789 Pulled By: zhichao-cao fbshipit-source-id: acb8b5e5dc3167adfa9425a5b7fc104f6b95cb0b	4 years ago
wenh	4924a506b9	Reduce `env_->GetChildren()` calls in DBImpl::Recover() (#7044 ) Summary: There currently exist multiple `GetChildren()` calls in `DBImpl::Recover()`, which can be expensive in cases of distributed file systems. This pull request tries to call `GetChildren()` on each necessary directory only _once_ and reuse the results in the places of repeated calls in current code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7044 Test Plan: Run `make check` and use the default test suite. The modified code should be semantically identical to the current code. As a proof of this solution, we may optionally deploy the system onto a (real or simulated) distributed system and expect reduced latency caused by manifest fetching. (WIP) Reviewed By: riversand963 Differential Revision: D22419925 Pulled By: roghnin fbshipit-source-id: d3774fbfbc246c5527101bc16747eb5c90919886	4 years ago
Zitan Chen	373d5ac485	BackupEngine verifies table file checksums on creating new backups (#7015 ) Summary: When table file checksums are enabled and stored in the DB manifest by using the RocksDB default crc32c checksum function, BackupEngine will calculate the crc32c checksum of the file to be copied and compare the calculated result with the one stored in the DB manifest before copying the file to the backup directory. After copying to the backup directory, BackupEngine will verify the checksum of the copied file with the one calculated before copying. This helps detect some rare corruption events such as bit-flips during the copying process. No verification with checksums in DB manifest will be performed if the table file checksum function is not the RocksDB default crc32c checksum function. In addition, if `share_table_files` and `share_files_with_checksum` are true, BackupEngine will compare the checksums computed before and after copying of the table files. Corresponding tests are added. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7015 Test Plan: Passed make check Reviewed By: pdillinger Differential Revision: D22165732 Pulled By: gg814 fbshipit-source-id: ee0e8cc397c455eba64545c29380b9d9853588ec	4 years ago
sdong	80b107a0a9	Divide WriteCallbackTest.WriteWithCallbackTest (#7037 ) Summary: WriteCallbackTest.WriteWithCallbackTest has a deep for-loop and in some cases runs very long. Parameterized it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7037 Test Plan: Run the test and see it passes. Reviewed By: ltamasi Differential Revision: D22269259 fbshipit-source-id: a1b6687b5bf4609754833d14cf383d68bc7ab27a	4 years ago
Yanqin Jin	e66199d848	First step towards handling MANIFEST write error (#6949 ) Summary: This PR provides preliminary support for handling IO error during MANIFEST write. File write/sync is not guaranteed to be atomic. If we encounter an IOError while writing/syncing to the MANIFEST file, we cannot be sure about the state of the MANIFEST file. The version edits may or may not have reached the file. During cleanup, if we delete the newly-generated SST files referenced by the pending version edit(s), but the version edit(s) actually are persistent in the MANIFEST, then next recovery attempt will process the version edit(s) and then fail since the SST files have already been deleted. One approach is to truncate the MANIFEST after write/sync error, so that it is safe to delete the SST files. However, file truncation may not be supported on certain file systems. Therefore, we take the following approach. If an IOError is detected during MANIFEST write/sync, we disable file deletions for the faulty database. Depending on whether the IOError is retryable (set by underlying file system), either RocksDB or application can call `DB::Resume()`, or simply shutdown and restart. During `Resume()`, RocksDB will try to switch to a new MANIFEST and write all existing in-memory version storage in the new file. If this succeeds, then RocksDB may proceed. If all recovery is completed, then file deletions will be re-enabled. Note that multiple threads can call `LogAndApply()` at the same time, though only one of them will be going through the process MANIFEST write, possibly batching the version edits of other threads. When the leading MANIFEST writer finishes, all of the MANIFEST writing threads in this batch will have the same IOError. They will all call `ErrorHandler::SetBGError()` in which file deletion will be disabled. Possible future directions: - Add an `ErrorContext` structure so that it is easier to pass more info to `ErrorHandler`. Currently, as in this example, a new `BackgroundErrorReason` has to be added. Test plan (dev server): make check Pull Request resolved: https://github.com/facebook/rocksdb/pull/6949 Reviewed By: anand1976 Differential Revision: D22026020 Pulled By: riversand963 fbshipit-source-id: f3c68a2ef45d9b505d0d625c7c5e0c88495b91c8	4 years ago
sdong	d6b7b7712f	Fix a bug that causes iterator to return wrong result in a rare data race (#6973 ) Summary: The bug fixed in https://github.com/facebook/rocksdb/pull/1816/ is now applicable to iterator too. This was not an issue but https://github.com/facebook/rocksdb/pull/2886 caused the regression. If a put and DB flush happens just between iterator to get latest sequence number and getting super version, empty result for the key or an older value can be returned, which is wrong. Fix it in the same way as the fix in https://github.com/facebook/rocksdb/issues/1816, that is to get the sequence number after referencing the super version. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6973 Test Plan: Will run stress tests for a while to make sure there is no general regression. Reviewed By: ajkr Differential Revision: D22029348 fbshipit-source-id: 94390f93630906796d6e2fec321f44a920953fd1	4 years ago
Zitan Chen	88db97b06d	Add a DB Session ID (#6959 ) Summary: Added DB::GetDbSessionId by using the same format and machinery as DB::GetDbIdentity. The DB Session ID is generated (and therefore, updated) each time a DB object is opened. It is written to the LOG file right after the line of “DB SUMMARY”. A test for the uniqueness, for different openings and during the same opening, is also added. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6959 Test Plan: Passed make check Reviewed By: zhichao-cao Differential Revision: D21951721 Pulled By: gg814 fbshipit-source-id: 958a48a612db49a39998ea703cded45987d3fa8b	4 years ago
Levi Tamasi	ac3ae1df0b	Find/purge obsolete blob files (#6807 ) Summary: The patch extends `FindObsoleteFiles` and `PurgeObsoleteFiles` with support for blob files. The behavior is analogous to SST files: obsolete blob files are put on the "candidates for deletion" list, while live (and pending) files are preserved. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6807 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D21406249 Pulled By: ltamasi fbshipit-source-id: 1948f71c31927564b61e8af394f50ca3964880d9	5 years ago
Levi Tamasi	a00ddf1574	Expose the set of live blob files from Version/VersionSet (#6785 ) Summary: The patch adds logic that returns the set of live blob files from `Version::AddLiveFiles` and `VersionSet::AddLiveFiles` (in addition to live table files), and also cleans up the code a bit, for example, by exposing only the numbers of table files as opposed to the earlier `FileDescriptor`s that no clients used. Moreover, the patch extends the `GetLiveFiles` API so that it also exposes blob files in the current version. Similarly to https://github.com/facebook/rocksdb/pull/6755, this is a building block for identifying and purging obsolete blob files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6785 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D21336210 Pulled By: ltamasi fbshipit-source-id: fc1aede8a49eacd03caafbc5f6f9ce43b6270821	5 years ago
Yanqin Jin	e04f3bce4f	Update CURRENT file after best-efforts recovery (#6746 ) Summary: After a successful recovery, the CURRENT file should be updated to point to the valid MANIFEST. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6746 Test Plan: make check Reviewed By: anand1976 Differential Revision: D21189876 Pulled By: riversand963 fbshipit-source-id: 7537b49988c5c425ebe9505a5cc260de351ad79b	5 years ago
anand76	c1ccd6b6af	Implement deadline support for MultiGet (#6710 ) Summary: Initial implementation of ReadOptions.deadline for MultiGet. If the request takes longer than the deadline, the keys not yet found will be returned with Status::TimedOut(). This implementation enforces the deadline in DBImpl, which is fairly high level. It's best effort and may not check the deadline after every key lookup, but may do so after a batch of keys. In subsequent stages, we will extend this to passing a timeout down to the FileSystem. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6710 Test Plan: Add new unit tests Reviewed By: riversand963 Differential Revision: D21149158 Pulled By: anand1976 fbshipit-source-id: 9f44eecffeb40873f5034ed59a66d21f9f88879e	5 years ago
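The batch-granular deadline check described in the commit above can be illustrated with a minimal, self-contained sketch. This is not RocksDB code: `ToyMultiGet`, `ToyStatus`, `ToyResult`, and the injected `now` clock are hypothetical names chosen for illustration, and an in-memory `std::map` stands in for the database.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; none of these are RocksDB types.
enum class ToyStatus { kOk, kNotFound, kTimedOut };

struct ToyResult {
  ToyStatus status;
  std::string value;
};

using ToyClock = std::chrono::steady_clock;

// Looks up `keys` in `table`, checking the deadline once per batch of
// `batch_size` keys rather than after every key. Keys in batches that were
// never started keep the kTimedOut status.
std::vector<ToyResult> ToyMultiGet(
    const std::map<std::string, std::string>& table,
    const std::vector<std::string>& keys, ToyClock::time_point deadline,
    std::size_t batch_size,
    const std::function<ToyClock::time_point()>& now) {
  assert(batch_size > 0);
  // Every key starts out as timed out; completed lookups overwrite that.
  std::vector<ToyResult> results(keys.size(),
                                 ToyResult{ToyStatus::kTimedOut, ""});
  for (std::size_t start = 0; start < keys.size(); start += batch_size) {
    if (now() >= deadline) break;  // best effort: checked per batch only
    const std::size_t end = std::min(start + batch_size, keys.size());
    for (std::size_t i = start; i < end; ++i) {
      const auto it = table.find(keys[i]);
      if (it == table.end()) {
        results[i] = ToyResult{ToyStatus::kNotFound, ""};
      } else {
        results[i] = ToyResult{ToyStatus::kOk, it->second};
      }
    }
  }
  return results;
}
```

Passing the clock as a parameter is a choice made for this sketch so that the timeout path can be exercised deterministically; it says nothing about how DBImpl obtains the time.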
Mike Kolupaev	e45673dece	Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621 ) Summary: Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype. Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling. It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. 
The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas. Note that the deferred value loading only happens for internal iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621 Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats. Reviewed By: siying Differential Revision: D20786930 Pulled By: al13n321 fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee	5 years ago
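The `PrepareValue()`-before-`value()` contract described in the commit above can be sketched as follows. Only the method names `PrepareValue()` and `value()` come from the commit message; `ToyLazyIterator`, its `Loader` callback, and `ok()` are hypothetical and do not reflect the actual `InternalIterator` interface.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Hypothetical iterator positioned on a single entry; not a RocksDB class.
class ToyLazyIterator {
 public:
  // `load_value` stands in for reading a data block. It returns false to
  // simulate an I/O error.
  using Loader = std::function<bool(std::string*)>;

  ToyLazyIterator(std::string key_from_index, Loader load_value)
      : key_(std::move(key_from_index)), load_value_(std::move(load_value)) {}

  // The key is known from the index, so no block read is needed here.
  const std::string& key() const { return key_; }

  // Loads the value on the first call. On failure it records the error and
  // returns false, which gives the caller a place to observe the failure.
  bool PrepareValue() {
    if (prepared_) return true;
    if (!load_value_(&value_)) {
      ok_ = false;
      return false;
    }
    prepared_ = true;
    return true;
  }

  // Only valid after PrepareValue() has returned true.
  const std::string& value() const {
    assert(prepared_);
    return value_;
  }

  bool ok() const { return ok_; }

 private:
  std::string key_;
  Loader load_value_;
  std::string value_;
  bool prepared_ = false;
  bool ok_ = true;
};
```

A caller that only needs keys never triggers the load, while a caller that needs the value checks the result of `PrepareValue()` first, so `value()` itself never has to report an error.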
Zhichao Cao	4246888101	Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487 ) Summary: In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status. The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487 Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check Reviewed By: anand1976 Differential Revision: D20685017 Pulled By: zhichao-cao fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0	5 years ago
Huisheng Liu	a6ce5c823b	multiget support for timestamps (#6483 ) Summary: Add timestamp support for MultiGet(). timestamp from readoptions is honored, and timestamps can be returned along with values. MultiReadRandom perf test (10 minutes) on the same development machine ram drive with the same DB data shows no regression (within margin of error). The test is adapted from https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks. base line (commit `17bef7d3a`): multireadrandom : 104.173 micros/op 307167 ops/sec; (5462999 of 5462999 found) This PR: multireadrandom : 104.199 micros/op 307095 ops/sec; (5307999 of 5307999 found) .\db_bench --db=r:\rocksdb.github --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --cache_size=2147483648 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=r:\rocksdb.github\WAL_LOG --sync=0 --verify_checksum=1 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --duration=600 --benchmarks=multireadrandom --use_existing_db=1 --num=25000000 --threads=32 --allow_concurrent_memtable_write=0 Pull Request resolved: https://github.com/facebook/rocksdb/pull/6483 Reviewed By: anand1976 Differential Revision: D20498373 Pulled By: riversand963 fbshipit-source-id: 8505f22bc40fd791bc7dd05e48d7e67c91edb627	5 years ago
Yanqin Jin	fb09ef05dc	Attempt to recover from db with missing table files (#6334 ) Summary: There are situations when RocksDB tries to recover, but the db is in an inconsistent state due to SST files referenced in the MANIFEST being missing. In this case, previous RocksDB will just fail the recovery and return a non-ok status. This PR enables another possibility. During recovery, RocksDB checks possible MANIFEST files, and try to recover to the most recent state without missing table file. `VersionSet::Recover()` applies version edits incrementally and "materializes" a version only when this version does not reference any missing table file. After processing the entire MANIFEST, the version created last will be the latest version. `DBImpl::Recover()` calls `VersionSet::Recover()`. Afterwards, WAL replay will not be performed. To use this capability, set `options.best_efforts_recovery = true` when opening the db. Best-efforts recovery is currently incompatible with atomic flush. Test plan (on devserver): ``` $make check $COMPILE_WITH_ASAN=1 make all && make check ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/6334 Reviewed By: anand1976 Differential Revision: D19778960 Pulled By: riversand963 fbshipit-source-id: c27ea80f29bc952e7d3311ecf5ee9c54393b40a8	5 years ago
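The incremental "materialize only when no table file is missing" idea from the commit above can be modeled with a small sketch. This is a simplified model under stated assumptions, not the logic of `VersionSet::Recover()`: `ToyVersionEdit`, `ToyVersion`, and `ToyBestEffortsRecover` are hypothetical names, a version is reduced to a set of file numbers, and each edit only adds or deletes files.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical, heavily simplified version edit: file numbers added/removed.
struct ToyVersionEdit {
  std::vector<std::uint64_t> added_files;
  std::vector<std::uint64_t> deleted_files;
};

// A toy "version" is just the set of table file numbers it references.
using ToyVersion = std::set<std::uint64_t>;

// Replays the edits in MANIFEST order. After each edit, the state under
// construction is kept as a recoverable version only if every file it
// references exists on disk. Returns false if no such version was seen.
bool ToyBestEffortsRecover(const std::vector<ToyVersionEdit>& manifest_edits,
                           const std::set<std::uint64_t>& files_on_disk,
                           ToyVersion* recovered) {
  assert(recovered != nullptr);
  ToyVersion building;
  bool found = false;
  for (const ToyVersionEdit& edit : manifest_edits) {
    for (std::uint64_t f : edit.deleted_files) building.erase(f);
    for (std::uint64_t f : edit.added_files) building.insert(f);
    // Both sets iterate in sorted order, as std::includes requires.
    const bool complete =
        std::includes(files_on_disk.begin(), files_on_disk.end(),
                      building.begin(), building.end());
    if (complete) {
      *recovered = building;  // the last complete state wins
      found = true;
    }
  }
  return found;
}
```

In the real feature, the capability is enabled by setting `options.best_efforts_recovery = true` when opening the DB, as the commit message states; the sketch only illustrates why the result is the most recent state without a missing file.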
Zhichao Cao	8d73137ae8	Replace Directory with FSDirectory in DB (#6468 ) Summary: In the current code base, we can use Directory from Env to manage directory (e.g, Fsync()). The PR https://github.com/facebook/rocksdb/issues/5761 introduce the File System as a new Env API. So we further replace the Directory class in DB with FSDirectory such that we can have more IO information from IOStatus returned by FSDirectory. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6468 Test Plan: pass make asan_check Differential Revision: D20195261 Pulled By: zhichao-cao fbshipit-source-id: 93962cb9436852bfcfb76e086d9e7babd461cbe1	5 years ago
Huisheng Liu	904a60ff63	return timestamp from get (#6409 ) Summary: Added new Get() methods that return timestamp. Dummy implementation is given so that classes derived from DB don't need to be touched to provide their implementation. MultiGet is not included. ReadRandom perf test (10 minutes) on the same development machine ram drive with the same DB data shows no regression (within margin of error). The test is adapted from https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks. base line (commit `72ee067b9`): 101.712 micros/op 314602 ops/sec; 36.0 MB/s (5658999 of 5658999 found) This PR: 100.288 micros/op 319071 ops/sec; 36.5 MB/s (5674999 of 5674999 found) ./db_bench --db=r:\rocksdb.github --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --cache_size=2147483648 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=r:\rocksdb.github\WAL_LOG --sync=0 --verify_checksum=1 --delete_obsolete_files_period_micros=314572800 --max_background_compactions=4 --max_background_flushes=0 --level0_slowdown_writes_trigger=16 --level0_stop_writes_trigger=24 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --mmap_read=1 --mmap_write=0 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --duration=600 --benchmarks=readrandom --use_existing_db=1 --num=25000000 --threads=32 Pull Request resolved: https://github.com/facebook/rocksdb/pull/6409 Differential Revision: D20200086 Pulled By: riversand963 fbshipit-source-id: 490edd74d924f62bd8ae9c29c2a6bbbb8410ca50	5 years ago
Michael R. Crusoe	051696bf98	fix some spelling typos (#6464 ) Summary: Found from Debian's "Lintian" program Pull Request resolved: https://github.com/facebook/rocksdb/pull/6464 Differential Revision: D20162862 Pulled By: zhichao-cao fbshipit-source-id: 06941ee2437b038b2b8045becbe9d2c6fbff3e12	5 years ago
sdong	fdf882ded2	Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433 ) Summary: When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433 Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag. Differential Revision: D19977691 fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e	5 years ago
Andrew Kryczka	c6abe30ee3	Fix concurrent full purge and WAL recycling (#5900 ) Summary: We were removing the file from `log_recycle_files_` before renaming it with `ReuseWritableFile()`. Since `ReuseWritableFile()` occurs outside the DB mutex, it was possible for a concurrent full purge to sneak in and delete the file before it could be renamed. Consequently, `SwitchMemtable()` would fail and the DB would enter read-only mode. The fix is to hold the old file number in `log_recycle_files_` until after the file has been renamed. Full purge uses that list to decide which files to keep, so it can no longer delete a file pending recycling. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5900 Test Plan: new unit test Differential Revision: D19771719 Pulled By: ajkr fbshipit-source-id: 094346349ca3fb499712e62de03905acc30b5ce8	5 years ago
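The ordering fix described in the commit above can be illustrated with a single-threaded toy model. Everything here is hypothetical (`ToyWalState`, `ToyFullPurge`, `ToyRecycleLog`, and the `unlocked_window` callback that stands in for the period outside the DB mutex); only the idea of keeping the old file number in `log_recycle_files_` until the rename has happened comes from the commit message.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <deque>
#include <functional>
#include <set>

// Hypothetical model of the on-disk WAL files and the bookkeeping lists.
struct ToyWalState {
  std::set<std::uint64_t> files_on_disk;
  std::deque<std::uint64_t> log_recycle_files;  // purge must keep these
  std::set<std::uint64_t> live_logs;            // purge must keep these too
};

// Toy full purge: deletes every file that is neither live nor on the
// recycle list.
void ToyFullPurge(ToyWalState& s) {
  std::set<std::uint64_t> kept;
  for (std::uint64_t f : s.files_on_disk) {
    const bool recyclable =
        std::find(s.log_recycle_files.begin(), s.log_recycle_files.end(), f) !=
        s.log_recycle_files.end();
    if (recyclable || s.live_logs.count(f) == 1) kept.insert(f);
  }
  s.files_on_disk.swap(kept);
}

// Recycles the oldest recyclable log as `new_number`. With
// hold_until_renamed == false the entry is dropped before the rename (the
// pre-fix ordering); with true it is dropped only afterwards (the fix).
// `unlocked_window` runs where a concurrent purge could sneak in.
bool ToyRecycleLog(ToyWalState& s, std::uint64_t new_number,
                   bool hold_until_renamed,
                   const std::function<void()>& unlocked_window) {
  assert(s.files_on_disk.count(new_number) == 0);
  if (s.log_recycle_files.empty()) return false;
  const std::uint64_t old_number = s.log_recycle_files.front();
  if (!hold_until_renamed) s.log_recycle_files.pop_front();
  unlocked_window();
  // The "rename" succeeds only if the old file still exists.
  const bool renamed = s.files_on_disk.erase(old_number) == 1;
  if (renamed) {
    s.files_on_disk.insert(new_number);
    s.live_logs.insert(new_number);
  }
  if (hold_until_renamed && !s.log_recycle_files.empty()) {
    s.log_recycle_files.pop_front();
  }
  return renamed;
}
```

With the pre-fix ordering, a purge running inside the window deletes the file and the rename fails; with the fixed ordering the purge still sees the file number on the list and leaves the file alone.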

88 Commits (2e5de5a2c31e9fd694e8e83a5333b97c4a4aac29)