rocksdb

Commit Graph

Author	SHA1	Message	Date
Yanqin Jin	4fce44fc8b	Improve flushing multiple column families (#4708 ) Summary: If one column family is dropped, we should simply skip it and continue to flush other active ones. Currently we use Status::ShutdownInProgress to notify caller of column families being dropped. In the future, we should consider using a different Status code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4708 Differential Revision: D13378954 Pulled By: riversand963 fbshipit-source-id: 42f248cdf2d32d4c0f677cd39012694b8f1328ca	6 years ago
DorianZheng	2670fe8c73	Get `CompactionJobInfo` from CompactFiles Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4716 Differential Revision: D13207677 Pulled By: ajkr fbshipit-source-id: d0ccf5a66df6cbb07288b0c5ebad81fd9df3926b	6 years ago
Burton Li	a8b9891f95	Concurrent task limiter for compaction thread control (#4332 ) Summary: The PR is targeting to resolve the issue of: https://github.com/facebook/rocksdb/issues/3972#issue-330771918 We have a rocksdb created with leveled-compaction with multiple column families (CFs), some of CFs are using HDD to store big and less frequently accessed data and others are using SSD. When there are continuously write traffics going on to all CFs, the compaction thread pool is mostly occupied by those slow HDD compactions, which blocks fully utilize SSD bandwidth. Since atomic write and transaction is needed across CFs, so splitting it to multiple rocksdb instance is not an option for us. With the compaction thread control, we got 30%+ HDD write throughput gain, and also a lot smooth SSD write since less write stall happening. ConcurrentTaskLimiter can be shared with multi-CFs across rocksdb instances, so the feature does not only work for multi-CFs scenarios, but also for multi-rocksdbs scenarios, who need disk IO resource control per tenant. The usage is straight forward: e.g.: // // Enable compaction thread limiter thru ColumnFamilyOptions // std::shared_ptr<ConcurrentTaskLimiter> ctl(NewConcurrentTaskLimiter("foo_limiter", 4)); Options options; ColumnFamilyOptions cf_opt(options); cf_opt.compaction_thread_limiter = ctl; ... // // Compaction thread limiter can be tuned or disabled on-the-fly // ctl->SetMaxOutstandingTask(12); // enlarge to 12 tasks ... ctl->ResetMaxOutstandingTask(); // disable (bypass) thread limiter ctl->SetMaxOutstandingTask(-1); // Same as above ... ctl->SetMaxOutstandingTask(0); // full throttle (0 task) // // Sharing compaction thread limiter among CFs (to resolve multiple storage perf issue) // std::shared_ptr<ConcurrentTaskLimiter> ctl_ssd(NewConcurrentTaskLimiter("ssd_limiter", 8)); std::shared_ptr<ConcurrentTaskLimiter> ctl_hdd(NewConcurrentTaskLimiter("hdd_limiter", 4)); Options options; ColumnFamilyOptions cf_opt_ssd1(options); ColumnFamilyOptions cf_opt_ssd2(options); ColumnFamilyOptions cf_opt_hdd1(options); ColumnFamilyOptions cf_opt_hdd2(options); ColumnFamilyOptions cf_opt_hdd3(options); // SSD CFs cf_opt_ssd1.compaction_thread_limiter = ctl_ssd; cf_opt_ssd2.compaction_thread_limiter = ctl_ssd; // HDD CFs cf_opt_hdd1.compaction_thread_limiter = ctl_hdd; cf_opt_hdd2.compaction_thread_limiter = ctl_hdd; cf_opt_hdd3.compaction_thread_limiter = ctl_hdd; ... // // The limiter is disabled by default (or set to nullptr explicitly) // Options options; ColumnFamilyOptions cf_opt(options); cf_opt.compaction_thread_limiter = nullptr; Pull Request resolved: https://github.com/facebook/rocksdb/pull/4332 Differential Revision: D13226590 Pulled By: siying fbshipit-source-id: 14307aec55b8bd59c8223d04aa6db3c03d1b0c1d	6 years ago
DorianZheng	4862720e08	Expose column family id to `FlushJobInfo` Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4772 Differential Revision: D13428923 Pulled By: ajkr fbshipit-source-id: e351e9c5eea97816db25429e129357a8af90712a	6 years ago
Yanqin Jin	b96fccb1e6	Move a function to critical section (#4752 ) Summary: Test plan ``` $make clean && make -j32 all check ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4752 Differential Revision: D13344705 Pulled By: riversand963 fbshipit-source-id: fc3a43174d09d70ccc2b09decd78e1da1b6ba9d1	6 years ago
Yanqin Jin	8d7bc76f36	Fix a flaky test DBFlushTest.SyncFail (#4633 ) Summary: There is a race condition in DBFlushTest.SyncFail, as illustrated below. ``` time thread1 bg_flush_thread \| Flush(wait=false, cfd) \| refs_before=cfd->current()->TEST_refs() PickMemtable calls cfd->current()->Ref() V ``` The race condition between thread1 getting the ref count of cfd's current version and bg_flush_thread incrementing the cfd's current version makes it possible for later assertion on refs_before to fail. Therefore, we add test sync points to enforce the order and assert on the ref count before and after PickMemtable is called in bg_flush_thread. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4633 Differential Revision: D12967131 Pulled By: riversand963 fbshipit-source-id: a99d2bacb7869ec5d8d03b24ef2babc0e6ae1a3b	6 years ago
Yanqin Jin	147697420a	Rollback memtable flush upon atomic flush fail (#4641 ) Summary: This fixes an assertion. An atomic flush can have multiple flush jobs. Some of them may fail. If any of them fails, we need to rollback all of them. For the flush jobs that do fail, we already call `RollbackMemTableFlush` in `FlushJob::Run`. The tricky part is for flush jobs that have completed successfully. We need to call `RollbackMemTableFlush` for them as well. The newly added DBAtomicFlushTest.AtomicFlushRollbackSomeJobs will SigAbort without the corresponding change in AtomicFlushMemTablesToOutputFiles. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4641 Differential Revision: D12943649 Pulled By: riversand963 fbshipit-source-id: c66a4a664a1e0938e938fd41edc5a70c34cdd868	6 years ago
Soli	a478682260	Fix #3840 : only `SyncClosedLogs` for multiple CFs (#4460 ) Summary: Call `SyncClosedLogs()` only if there are more than one column families. Update several unit tests (in `fault_injection_test` and `db_flush_test`) correspondingly. See #3840 for more info. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4460 Differential Revision: D12896377 Pulled By: riversand963 fbshipit-source-id: f49afdaec32568f12f001219a3aec1dfde3b32bf	6 years ago
DorianZheng	0f88160f67	Fix `CompactFiles` bug (#4665 ) Summary: `CompactFiles` gets `SuperVersion` before `WaitForIngestFile`, while `IngestExternalFile` may add files that overlap with `input_file_names` The timeline of execution flow is as follow: Let's say that level N has two file [1,2] and [5,6] ``` timeline user_thread1 user_thread2 t0 \| CompactFiles([1, 2], [5, 6]) begin t1 \| GetReferencedSuperVersion() t2 \| IngestExternalFile([3,4]) to level N begin t3 \| CompactFiles resume V ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4665 Differential Revision: D13030674 Pulled By: ajkr fbshipit-source-id: 8be19477fd6e505032267a979d32f3097cc3be51	6 years ago
Yanqin Jin	05dec0c7c7	Remove redundant member var and set options (#4631 ) Summary: In the past, both `DBImpl::atomic_flush_` and `DBImpl::immutable_db_options_.atomic_flush` exist. However, we fail to set `immutable_db_options_.atomic_flush`, but use `DBImpl::atomic_flush_` which is set correctly. This does not lead to incorrect behavior, but is a duplicate of information. Since `immutable_db_options_` is always there and has `atomic_flush`, we should use it as source of truth and remove `DBImpl::atomic_flush_`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4631 Differential Revision: D12928371 Pulled By: riversand963 fbshipit-source-id: f85a811959d3828aad4a3a1b05f71facf19c636d	6 years ago
Yi Wu	859dbda6e3	Fix DBTest.SoftLimit flakyness (#4658 ) Summary: The flakyness can be reproduced with the following patch: ``` --- a/db/db_impl_compaction_flush.cc +++ b/db/db_impl_compaction_flush.cc @@ -2013,6 +2013,9 @@ void DBImpl::BackgroundCallFlush() { if (job_context.HaveSomethingToDelete()) { PurgeObsoleteFiles(job_context); } + static int f_count = 0; + printf("clean flush job context %d\n", ++f_count); + env_->SleepForMicroseconds(1000000); job_context.Clean(); mutex_.Lock(); } ``` The issue is that FlushMemtable with opt.wait=true does not wait for `OnStallConditionsChanged` being called. The event listener is triggered on `JobContext::Clean`, which happens after flush result is installed. At the time we check for stall condition after flushing memtable, the job context cleanup may not be finished. To fix the flaykyness, we use sync point to create a custom WaitForFlush that waits for context cleanup. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4658 Differential Revision: D13007301 Pulled By: yiwu-arbug fbshipit-source-id: d98395ee7b0ad4c62e83e8d0e9b6028058c61712	6 years ago
Sagar Vemuri	dc3528077a	Update all unique/shared_ptr instances to be qualified with namespace std (#4638 ) Summary: Ran the following commands to recursively change all the files under RocksDB: ``` find . -type f -name ".cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} + find . -type f -name ".cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} + find . -type f -name ".cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} + find . -type f -name ".cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} + ``` Running `make format` updated some formatting on the files touched. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638 Differential Revision: D12934992 Pulled By: sagar0 fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8	6 years ago
Andrew Kryczka	5c794d94c4	Prevent manual flush hanging in read-only mode (#4615 ) Summary: The logic to wait for stall conditions to clear before beginning a manual flush didn't take into account whether the DB was in read-only mode. In read-only mode the stall conditions would never clear since no background work is happening, so the wait would be never-ending. It's probably better to return an error to the user. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4615 Differential Revision: D12888008 Pulled By: ajkr fbshipit-source-id: 1c474b42a7ac38d9fd0d0e2340ff1d53e684d83c	6 years ago
Andrew Kryczka	b8f68bac38	Prevent manual compaction hanging in read-only mode (#4611 ) Summary: A background compaction with pre-picked files (i.e., either a manual compaction or a bottom-pri compaction) fails when the DB is in read-only mode. In the failure handling, we forgot to unregister the compaction and the files it covered. Then subsequent manual compactions could conflict with this zombie compaction (possibly Halloween related) and wait forever for it to finish. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4611 Differential Revision: D12871217 Pulled By: ajkr fbshipit-source-id: 9d24e921d5bbd2ee8c2c9536a30abfa42a220c6e	6 years ago
Yanqin Jin	5b4c709fad	Enable atomic flush (#4023 ) Summary: Adds a DB option `atomic_flush` to control whether to enable this feature. This PR is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752). Pull Request resolved: https://github.com/facebook/rocksdb/pull/4023 Differential Revision: D8518381 Pulled By: riversand963 fbshipit-source-id: 1e3bb33e99bb102876a31b378d93b0138ff6634f	6 years ago
Yanqin Jin	e633983cf1	Add support to flush multiple CFs atomically (#4262 ) Summary: Leverage existing `FlushJob` to implement atomic flush of multiple column families. This PR depends on other PRs and is a subset of #3752 . This PR itself is not sufficient in fulfilling atomic flush. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4262 Differential Revision: D9283109 Pulled By: riversand963 fbshipit-source-id: 65401f913e4160b0a61c0be6cd02adc15dad28ed	6 years ago
Peter Pei	09814f2cfc	support OnCompactionBegin (#4431 ) Summary: fix #4288 Add `OnCompactionBegin` support to `rocksdb::EventListener`. Currently, we only have these three callbacks: - OnFlushBegin - OnFlushCompleted - OnCompactionCompleted As paolococchi requested in #4288 , and ajkr agreed, we should also support `OnCompactionBegin`. This PR is a try to implement the support of `OnCompactionBegin`. Hope it is useful to you. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4431 Differential Revision: D10055515 Pulled By: yiwu-arbug fbshipit-source-id: 39c0f95f8e9ff1c7ca3a10787502a17f258d2334	6 years ago
jsteemann	141ef7f8d3	avoid copying when iterating using range-based for (#4459 ) Summary: this avoids a few copies of std::string and other structs in the context of range-based for loops. instead of copying the values for each iteration, use a const reference to avoid copying. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4459 Differential Revision: D10282045 Pulled By: sagar0 fbshipit-source-id: 5012e910dca279abd2be847e1fb432d96274edfb	6 years ago
Zhongyi Xie	cac87fcf57	move dump stats to a separate thread (#4382 ) Summary: Currently statistics are supposed to be dumped to info log at intervals of `options.stats_dump_period_sec`. However the implementation choice was to bind it with compaction thread, meaning if the database has been serving very light traffic, the stats may not get dumped at all. We decided to separate stats dumping into a new timed thread using `TimerQueue`, which is already used in blob_db. This will allow us schedule new timed tasks with more deterministic behavior. Tested with db_bench using `--stats_dump_period_sec=20` in command line: > LOG:2018/09/17-14:07:45.575025 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS ------- LOG:2018/09/17-14:08:05.643286 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS ------- LOG:2018/09/17-14:08:25.691325 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS ------- LOG:2018/09/17-14:08:45.740989 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS ------- LOG content: > 2018/09/17-14:07:45.575025 7fe99fbfe700 [WARN] [db/db_impl.cc:605] ------- DUMPING STATS ------- 2018/09/17-14:07:45.575080 7fe99fbfe700 [WARN] [db/db_impl.cc:606] DB Stats Uptime(secs): 20.0 total, 20.0 interval Cumulative writes: 4447K writes, 4447K keys, 4447K commit groups, 1.0 writes per commit group, ingest: 5.57 GB, 285.01 MB/s Cumulative WAL: 4447K writes, 0 syncs, 4447638.00 writes per sync, written: 5.57 GB, 285.01 MB/s Cumulative stall: 00:00:0.012 H:M:S, 0.1 percent Interval writes: 4447K writes, 4447K keys, 4447K commit groups, 1.0 writes per commit group, ingest: 5700.71 MB, 285.01 MB/s Interval WAL: 4447K writes, 0 syncs, 4447638.00 writes per sync, written: 5.57 MB, 285.01 MB/s Interval stall: 00:00:0.012 H:M:S, 0.1 percent Compaction Stats [default] Level Files Size Score Read(GB) Rn(GB) Rnp1(GB) Write(GB) Wnew(GB) Moved(GB) W-Amp Rd(MB/s) Wr(MB/s) Comp(sec) Comp(cnt) Avg(sec) KeyIn KeyDrop Pull Request resolved: https://github.com/facebook/rocksdb/pull/4382 Differential Revision: D9933051 Pulled By: miasantreble fbshipit-source-id: 6d12bb1e4977674eea4bf2d2ac6d486b814bb2fa	6 years ago
DorianZheng	e0f05754ba	Expose column family id to OnCompactionCompleted (#4466 ) Summary: The controller you requested could not be found. PTAL Pull Request resolved: https://github.com/facebook/rocksdb/pull/4466 Differential Revision: D10241358 Pulled By: yiwu-arbug fbshipit-source-id: 99664eb286860a6c8844d50efeb0ef6f0e10dd1e	6 years ago
Anand Ananthabhotla	a27fce408e	Auto recovery from out of space errors (#4164 ) Summary: This commit implements automatic recovery from a Status::NoSpace() error during background operations such as write callback, flush and compaction. The broad design is as follows - 1. Compaction errors are treated as soft errors and don't put the database in read-only mode. A compaction is delayed until enough free disk space is available to accomodate the compaction outputs, which is estimated based on the input size. This means that users can continue to write, and we rely on the WriteController to delay or stop writes if the compaction debt becomes too high due to persistent low disk space condition 2. Errors during write callback and flush are treated as hard errors, i.e the database is put in read-only mode and goes back to read-write only fater certain recovery actions are taken. 3. Both types of recovery rely on the SstFileManagerImpl to poll for sufficient disk space. We assume that there is a 1-1 mapping between an SFM and the underlying OS storage container. For cases where multiple DBs are hosted on a single storage container, the user is expected to allocate a single SFM instance and use the same one for all the DBs. If no SFM is specified by the user, DBImpl::Open() will allocate one, but this will be one per DB and each DB will recover independently. The recovery implemented by SFM is as follows - a) On the first occurance of an out of space error during compaction, subsequent compactions will be delayed until the disk free space check indicates enough available space. The required space is computed as the sum of input sizes. b) The free space check requirement will be removed once the amount of free space is greater than the size reserved by in progress compactions when the first error occured c) If the out of space error is a hard error, a background thread in SFM will poll for sufficient headroom before triggering the recovery of the database and putting it in write-only mode. The headroom is calculated as the sum of the write_buffer_size of all the DB instances associated with the SFM 4. EventListener callbacks will be called at the start and completion of automatic recovery. Users can disable the auto recov ery in the start callback, and later initiate it manually by calling DB::Resume() Todo: 1. More extensive testing 2. Add disk full condition to db_stress (follow-on PR) Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164 Differential Revision: D9846378 Pulled By: anand1976 fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a	6 years ago
Mikhail Antonov	927f274939	Avoiding write stall caused by manual flushes (#4297 ) Summary: Basically at the moment it seems it's possible to cause write stall by calling flush (either manually vis DB::Flush(), or from Backup Engine directly calling FlushMemTable() while background flush may be already happening. One of the ways to fix it is that in DBImpl::CompactRange() we already check for possible stall and delay flush if needed before we actually proceed to call FlushMemTable(). We can simply move this delay logic to separate method and call it from FlushMemTable. This is draft patch, for first look; need to check tests/update SyncPoints and most certainly would need to add allow_write_stall method to FlushOptions(). Pull Request resolved: https://github.com/facebook/rocksdb/pull/4297 Differential Revision: D9420705 Pulled By: mikhail-antonov fbshipit-source-id: f81d206b55e1d7b39e4dc64242fdfbceeea03fcc	6 years ago
Yanqin Jin	7daae512d2	Refactor flush request queueing and processing (#3952 ) Summary: RocksDB currently queues individual column family for flushing. This is not sufficient to support the needs of some applications that want to enforce order/dependency between column families, given that multiple foreground and background activities can trigger flushing in RocksDB. This PR aims to address this limitation. Each flush request is described as a `FlushRequest` that can contain multiple column families. A background flushing thread pops one flush request from the queue at a time and processes it. This PR does not enable atomic_flush yet, but is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752). Pull Request resolved: https://github.com/facebook/rocksdb/pull/3952 Differential Revision: D8529933 Pulled By: riversand963 fbshipit-source-id: 78908a21e389a3a3f7de2a79bae0cd13af5f3539	6 years ago
Yanqin Jin	1f802773bc	Update JobContext. (#3949 ) Summary: In the past, we assume that a job modifies a single column family. Therefore, a job can create at most one superversion since each superversion corresponds to one column family. This assumption leads to the fact that a `JobContext` has only one member variable called `superversion_context`. Now we want to support group flush of column families, indicating that each job can create multiple superversions. Therefore, we need to make the following change to accommodate this new feature. Add a vector of `SuperVersionContext` to `JobContext` to support installing superversions for multiple column families in one job context. This PR is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752). Pull Request resolved: https://github.com/facebook/rocksdb/pull/3949 Differential Revision: D8864895 Pulled By: riversand963 fbshipit-source-id: 5937a48817276370d3c8172db9c8aafc826d97ca	6 years ago
Yanqin Jin	54de56844d	Remove random writes from SST file ingestion (#4172 ) Summary: RocksDB used to store global_seqno in external SST files written by SstFileWriter. During file ingestion, RocksDB uses `pwrite` to update the `global_seqno`. Since random write is not supported in some non-POSIX compliant file systems, external SST file ingestion is not supported on these file systems. To address this limitation, we no longer update `global_seqno` during file ingestion. Later RocksDB uses the MANIFEST and other information in table properties to deduce global seqno for externally-ingested SST files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4172 Differential Revision: D8961465 Pulled By: riversand963 fbshipit-source-id: 4382ec85270a96be5bc0cf33758ca2b167b05071	6 years ago
DorianZheng	f5e46354d2	Protect external file when ingesting (#4099 ) Summary: If crash happen after a hard link established, Recover function may reuse the file number that has already assigned to the internal file, and this will overwrite the external file. To protect the external file, we have to make sure the file number will never being reused. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4099 Differential Revision: D9034092 Pulled By: riversand963 fbshipit-source-id: 3f1a737440b86aa2ef01673e5013aacbb7c33e28	6 years ago
Anand Ananthabhotla	52d4c9b7f6	Allow DB resume after background errors (#3997 ) Summary: Currently, if RocksDB encounters errors during a write operation (user requested or BG operations), it sets DBImpl::bg_error_ and fails subsequent writes. This PR allows the DB to be resumed for certain classes of errors. It consists of 3 parts - 1. Introduce Status::Severity in rocksdb::Status to indicate whether a given error can be recovered from or not 2. Refactor the error handling code so that setting bg_error_ and deciding on severity is in one place 3. Provide an API for the user to clear the error and resume the DB instance This whole change is broken up into multiple PRs. Initially, we only allow clearing the error for Status::NoSpace() errors during background flush/compaction. Subsequent PRs will expand this to include more errors and foreground operations such as Put(), and implement a polling mechanism for out-of-space errors. Closes https://github.com/facebook/rocksdb/pull/3997 Differential Revision: D8653831 Pulled By: anand1976 fbshipit-source-id: 6dc835c76122443a7668497c0226b4f072bc6afd	6 years ago
Andrew Kryczka	3d7dc75b36	Bottommost level-based compactions in bottom-pri pool Summary: This feature was introduced for universal compaction in `cc01985d`. At that point we thought it'd be used only to prevent long-running universal full compactions from blocking short-lived upper-level compactions. Now we have a level compaction user who could benefit from it since they use more expensive compression algorithm in the bottom level. So enable it for level. Closes https://github.com/facebook/rocksdb/pull/3835 Differential Revision: D7957179 Pulled By: ajkr fbshipit-source-id: 177285d2cef3b650b6a4d81dc5db84bc441c9fe4	7 years ago
Siying Dong	d59549298f	Skip deleted WALs during recovery Summary: This patch record min log number to keep to the manifest while flushing SST files to ignore them and any WAL older than them during recovery. This is to avoid scenarios when we have a gap between the WAL files are fed to the recovery procedure. The gap could happen by for example out-of-order WAL deletion. Such gap could cause problems in 2PC recovery where the prepared and commit entry are placed into two separate WAL and gap in the WALs could result into not processing the WAL with the commit entry and hence breaking the 2PC recovery logic. Before the commit, for 2PC case, we determined which log number to keep in FindObsoleteFiles(). We looked at the earliest logs with outstanding prepare entries, or prepare entries whose respective commit or abort are in memtable. With the commit, the same calculation is done while we apply the SST flush. Just before installing the flush file, we precompute the earliest log file to keep after the flush finishes using the same logic (but skipping the memtables just flushed), record this information to the manifest entry for this new flushed SST file. This pre-computed value is also remembered in memory, and will later be used to determine whether a log file can be deleted. This value is unlikely to change until next flush because the commit entry will stay in memtable. (In WritePrepared, we could have removed the older log files as soon as all prepared entries are committed. It's not yet done anyway. Even if we do it, the only thing we loss with this new approach is earlier log deletion between two flushes, which does not guarantee to happen anyway because the obsolete file clean-up function is only executed after flush or compaction) This min log number to keep is stored in the manifest using the safely-ignore customized field of AddFile entry, in order to guarantee that the DB generated using newer release can be opened by previous releases no older than 4.2. Closes https://github.com/facebook/rocksdb/pull/3765 Differential Revision: D7747618 Pulled By: siying fbshipit-source-id: d00c92105b4f83852e9754a1b70d6b64cb590729	7 years ago
Huachao Huang	ed7a95b28c	Add max_subcompactions as a compaction option Summary: Sometimes we want to compact files as fast as possible, but don't want to set a large `max_subcompactions` in the `DBOptions` by default. I add a `max_subcompactions` options to `CompactionOptions` so that we can choose a proper concurrency dynamically. Closes https://github.com/facebook/rocksdb/pull/3775 Differential Revision: D7792357 Pulled By: ajkr fbshipit-source-id: 94f54c3784dce69e40a229721a79a97e80cd6a6c	7 years ago
Yanqin Jin	7dfbe33532	Rename pending_compaction_ to queued_for_compaction_. Summary: We use `queued_for_flush_` to indicate a column family has been added to the flush queue. Similarly and to be consistent in our naming, we need to use `queued_for_compaction_` to indicate a column family has been added to the compaction queue. In the past we used `pending_compaction_` which can also be ambiguous. Closes https://github.com/facebook/rocksdb/pull/3781 Differential Revision: D7790063 Pulled By: riversand963 fbshipit-source-id: 6786b11a4fcaea36dc9b4672233dbe042f921804	7 years ago
Yanqin Jin	513b5ce618	Rename pending_flush_ to queued_for_flush_. Summary: With ColumnFamilyData::pending_flush_, we have the following code snippet in DBImpl::ScheedulePendingFlush ``` if (!cfd->pending_flush() && cfd->imm()->IsFlushPending()) { ... } ``` `Pending` is ambiguous, and I feel `queued_for_flush` is a better name, especially for the sake of readability. Closes https://github.com/facebook/rocksdb/pull/3777 Differential Revision: D7783066 Pulled By: riversand963 fbshipit-source-id: f1bd8c8bfe5eafd2c94da0d8566c9b2b6bb57229	7 years ago
Siying Dong	63c965cdb4	Sync parent directory after deleting a file in delete scheduler Summary: sync parent directory after deleting a file in delete scheduler. Otherwise, trim speed may not be as smooth as what we want. Closes https://github.com/facebook/rocksdb/pull/3767 Differential Revision: D7760136 Pulled By: siying fbshipit-source-id: ec131d53b61953f09c60d67e901e5eeb2716b05f	7 years ago
Zhongyi Xie	e1e826b980	check return status for Sync() and Append() calls to avoid corruption Summary: Right now in `SyncClosedLogs`, `CopyFile`, and `AddRecord`, where `Sync` and `Append` are invoked in a loop, the error status are not checked. This could lead to potential corruption as later calls will overwrite the error status. Closes https://github.com/facebook/rocksdb/pull/3740 Differential Revision: D7678848 Pulled By: miasantreble fbshipit-source-id: 4b0b412975989dfe80348f73217b9c4122a4bd77	7 years ago
David Lai	3be9b36453	comment unused parameters to turn on -Wunused-parameter flag Summary: This PR comments out the rest of the unused arguments which allow us to turn on the -Wunused-parameter flag. This is the second part of a codemod relating to https://github.com/facebook/rocksdb/pull/3557. Closes https://github.com/facebook/rocksdb/pull/3662 Differential Revision: D7426121 Pulled By: Dayvedde fbshipit-source-id: 223994923b42bd4953eb016a0129e47560f7e352	7 years ago
Yanqin Jin	d95014b9df	fix some text in comments. Summary: 1. Remove redundant text. 2. Make terminology consistent across all comments and doc of RocksDB. Also do our best to conform to conventions. Specifically, use 'callback' instead of 'call-back' [wikipedia](https://en.wikipedia.org/wiki/Callback_(computer_programming)). Closes https://github.com/facebook/rocksdb/pull/3693 Differential Revision: D7560396 Pulled By: riversand963 fbshipit-source-id: ba8c251c487f4e7d1872a1a8dc680f9e35a6ffb8	7 years ago
Phani Shekhar Mantripragada	446b32cfc3	Support for Column family specific paths. Summary: In this change, an option to set different paths for different column families is added. This option is set via cf_paths setting of ColumnFamilyOptions. This option will work in a similar fashion to db_paths setting. Cf_paths is a vector of Dbpath values which contains a pair of the absolute path and target size. Multiple levels in a Column family can go to different paths if cf_paths has more than one path. To maintain backward compatibility, if cf_paths is not specified for a column family, db_paths setting will be used. Note that, if db_paths setting is also not specified, RocksDB already has code to use db_name as the only path. Changes : 1) A new member "cf_paths" is added to ImmutableCfOptions. This is set, based on cf_paths setting of ColumnFamilyOptions and db_paths setting of ImmutableDbOptions. This member is used to identify the path information whenever files are accessed. 2) Validation checks are added for cf_paths setting based on existing checks for db_paths setting. 3) DestroyDB, PurgeObsoleteFiles etc. are edited to support multiple cf_paths. 4) Unit tests are added appropriately. Closes https://github.com/facebook/rocksdb/pull/3102 Differential Revision: D6951697 Pulled By: ajkr fbshipit-source-id: 60d2262862b0a8fd6605b09ccb0da32bb331787d	7 years ago
Amy Tai	1579626d0d	Enable cancelling manual compactions if they hit the sfm size limit Summary: Manual compactions should be cancelled, just like scheduled compactions are cancelled, if sfm->EnoughRoomForCompaction is not true. Closes https://github.com/facebook/rocksdb/pull/3670 Differential Revision: D7457683 Pulled By: amytai fbshipit-source-id: 669b02fdb707f75db576d03d2c818fb98d1876f5	7 years ago
Fosco Marotto	d12112d05e	Throw NoSpace instead of IOError when out of space. Summary: Replaces #1702 and is updated from feedback. Closes https://github.com/facebook/rocksdb/pull/3531 Differential Revision: D7457395 Pulled By: gfosco fbshipit-source-id: 25a21dd8cfa5a6e42e024208b444d9379d920c82	7 years ago
Yanqin Jin	1f5def1653	Fix race condition causing double deletion of ssts Summary: Possible interleaved execution of background compaction thread calling `FindObsoleteFiles (no full scan) / PurgeObsoleteFiles` and user thread calling `FindObsoleteFiles (full scan) / PurgeObsoleteFiles` can lead to race condition on which RocksDB attempts to delete a file twice. The second attempt will fail and return `IO error`. This may occur to other files, but this PR targets sst. Also add a unit test to verify that this PR fixes the issue. The newly added unit test `obsolete_files_test` has a test case for this scenario, implemented in `ObsoleteFilesTest#RaceForObsoleteFileDeletion`. `TestSyncPoint`s are used to coordinate the interleaving the `user_thread` and background compaction thread. They execute as follows ``` timeline user_thread background_compaction thread t1 \| FindObsoleteFiles(full_scan=false) t2 \| FindObsoleteFiles(full_scan=true) t3 \| PurgeObsoleteFiles t4 \| PurgeObsoleteFiles V ``` When `user_thread` invokes `FindObsoleteFiles` with full scan, it collects ALL files in RocksDB directory, including the ones that background compaction thread have collected in its job context. Then `user_thread` will see an IO error when trying to delete these files in `PurgeObsoleteFiles` because background compaction thread has already deleted the file in `PurgeObsoleteFiles`. To fix this, we make RocksDB remember which (SST) files have been found by threads after calling `FindObsoleteFiles` (see `DBImpl#files_grabbed_for_purge_`). Therefore, when another thread calls `FindObsoleteFiles` with full scan, it will not collect such files. ajkr could you take a look and comment? Thanks! Closes https://github.com/facebook/rocksdb/pull/3638 Differential Revision: D7384372 Pulled By: riversand963 fbshipit-source-id: 01489516d60012e722ee65a80e1449e589ce26d3	7 years ago
Maysam Yabandeh	3e417a6607	WritePrepared Txn: AddPrepared for all sub-batches Summary: Currently AddPrepared is performed only on the first sub-batch if there are duplicate keys in the write batch. This could cause a problem if the transaction takes too long to commit and the seq number of the first sub-patch moved to old_prepared_ but not the seq of the later ones. The patch fixes this by calling AddPrepared for all sub-patches. Closes https://github.com/facebook/rocksdb/pull/3651 Differential Revision: D7388635 Pulled By: maysamyabandeh fbshipit-source-id: 0ccd80c150d9bc42fe955e49ddb9d7ca353067b4	7 years ago
Zhongyi Xie	1cbc96d236	FlushReason improvement Summary: Right now flush reason "SuperVersion Change" covers a few different scenarios which is a bit vague. For example, the following db_bench job should trigger "Write Buffer Full" > $ TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -write_buffer_size=1048576 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 $ grep 'flush_reason' /dev/shm/dbbench/LOG ... 2018/03/06-17:30:42.543638 7f2773b99700 EVENT_LOG_v1 {"time_micros": 1520386242543634, "job": 192, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018024, "flush_reason": "SuperVersion Change"} 2018/03/06-17:30:42.569541 7f2773b99700 EVENT_LOG_v1 {"time_micros": 1520386242569536, "job": 193, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018104, "flush_reason": "SuperVersion Change"} 2018/03/06-17:30:42.596396 7f2773b99700 EVENT_LOG_v1 {"time_micros": 1520386242596392, "job": 194, "event": "flush_started", "num_memtables": 1, "num_entries": 7008, "num_deletes": 0, "memory_usage": 1018048, "flush_reason": "SuperVersion Change"} 2018/03/06-17:30:42.622444 7f2773b99700 EVENT_LOG_v1 {"time_micros": 1520386242622440, "job": 195, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018104, "flush_reason": "SuperVersion Change"} With the fix: > 2018/03/19-14:40:02.341451 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602341444, "job": 98, "event": "flush_started", "num_memtables": 1, "num_entries": 7009, "num_deletes": 0, "memory_usage": 1018008, "flush_reason": "Write Buffer Full"} 2018/03/19-14:40:02.379655 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602379642, "job": 100, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018016, "flush_reason": "Write Buffer Full"} 2018/03/19-14:40:02.418479 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602418474, "job": 101, "event": "flush_started", "num_memtables": 1, "num_entries": 7009, "num_deletes": 0, "memory_usage": 1018104, "flush_reason": "Write Buffer Full"} 2018/03/19-14:40:02.455084 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602455079, "job": 102, "event": "flush_started", "num_memtables": 1, "num_entries": 7009, "num_deletes": 0, "memory_usage": 1018048, "flush_reason": "Write Buffer Full"} 2018/03/19-14:40:02.492293 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602492288, "job": 104, "event": "flush_started", "num_memtables": 1, "num_entries": 7007, "num_deletes": 0, "memory_usage": 1018056, "flush_reason": "Write Buffer Full"} 2018/03/19-14:40:02.528720 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602528715, "job": 105, "event": "flush_started", "num_memtables": 1, "num_entries": 7006, "num_deletes": 0, "memory_usage": 1018104, "flush_reason": "Write Buffer Full"} 2018/03/19-14:40:02.566255 7f11dc257700 EVENT_LOG_v1 {"time_micros": 1521495602566238, "job": 107, "event": "flush_started", "num_memtables": 1, "num_entries": 7009, "num_deletes": 0, "memory_usage": 1018112, "flush_reason": "Write Buffer Full"} Closes https://github.com/facebook/rocksdb/pull/3627 Differential Revision: D7328772 Pulled By: miasantreble fbshipit-source-id: 67c94065fbdd36930f09930aad0aaa6d2c152bb8	7 years ago
Niv Dayan	da82aab126	allowing CompactFiles to return new file names Summary: This is a small API extension to allow the CompactFiles method to return the names of files that were created during the compaction. Closes https://github.com/facebook/rocksdb/pull/3608 Differential Revision: D7275789 Pulled By: siying fbshipit-source-id: 1ec0c3954a0f10cd877efb5f29f9be6c7b59e9ba	7 years ago
Amy Tai	e476d0e252	Adding stat to count cancelled compactions Summary: Added a stat that counts the number of cancelled compactions. Closes https://github.com/facebook/rocksdb/pull/3574 Differential Revision: D7190259 Pulled By: amytai fbshipit-source-id: d5ce82dc9398da6d6d34023ad4ed8cec909852a3	7 years ago
amytai	0a3db28d98	Disallow compactions if there isn't enough free space Summary: This diff handles cases where compaction causes an ENOSPC error. This does not handle corner cases where another background job is started while compaction is running, and the other background job triggers ENOSPC, although we do allow the user to provision for these background jobs with SstFileManager::SetCompactionBufferSize. It also does not handle the case where compaction has finished and some other background job independently triggers ENOSPC. Usage: Functionality is inside SstFileManager. In particular, users should set SstFileManager::SetMaxAllowedSpaceUsage, which is the reference highwatermark for determining whether to cancel compactions. Closes https://github.com/facebook/rocksdb/pull/3449 Differential Revision: D7016941 Pulled By: amytai fbshipit-source-id: 8965ab8dd8b00972e771637a41b4e6c645450445	7 years ago
Andrew Kryczka	5d68243e61	Comment out unused variables Summary: Submitting on behalf of another employee. Closes https://github.com/facebook/rocksdb/pull/3557 Differential Revision: D7146025 Pulled By: ajkr fbshipit-source-id: 495ca5db5beec3789e671e26f78170957704e77e	7 years ago
Andrew Kryczka	3ae0047278	skip CompactRange flush based on memtable contents Summary: CompactRange has a call to Flush because we guarantee that, at the time it's called, all existing keys in the range will be pushed through the user's compaction filter. However, previously the flush was done blindly, so it'd happen even if the memtable does not contain keys in the range specified by the user. This caused unnecessarily many L0 files to be created, leading to write stalls in some cases. This PR checks the memtable's contents, and decides to flush only if it overlaps with `CompactRange`'s range. - Move the memtable overlap check logic from `ExternalSstFileIngestionJob` to `ColumnFamilyData::RangesOverlapWithMemtables` - Reuse the above logic in `CompactRange` and skip flushing if no overlap Closes https://github.com/facebook/rocksdb/pull/3520 Differential Revision: D7018897 Pulled By: ajkr fbshipit-source-id: a3c6b1cfae56687b49dd89ccac7c948e53545934	7 years ago
Igor Sugak	aba3409740	Back out "[codemod] - comment out unused parameters" Reviewed By: igorsugak fbshipit-source-id: 4a93675cc1931089ddd574cacdb15d228b1e5f37	7 years ago
David Lai	f4a030ce81	- comment out unused parameters Reviewed By: everiq, igorsugak Differential Revision: D7046710 fbshipit-source-id: 8e10b1f1e2aecebbfb229c742e214db887e5a461	7 years ago
Andrew Kryczka	ee1c802675	Add delay before flush in CompactRange to avoid write stalling Summary: - Refactored logic for checking write stall condition to a helper function: `GetWriteStallConditionAndCause`. Now it is decoupled from the logic for updating WriteController / stats in `RecalculateWriteStallConditions`, so we can reuse it for predicting whether write stall will occur. - Updated `CompactRange` to first check whether the one additional immutable memtable / L0 file would cause stalling before it flushes. If so, it waits until that is no longer true. - Updated `bg_cv_` to be signaled on `SetOptions` calls. The stall conditions `CompactRange` cares about can change when (1) flush finishes, (2) compaction finishes, or (3) options dynamically change. The cv was already signaled for (1) and (2) but not yet for (3). Closes https://github.com/facebook/rocksdb/pull/3381 Differential Revision: D6754983 Pulled By: ajkr fbshipit-source-id: 5613e03f1524df7192dc6ae885d40fd8f091d972	7 years ago

1 2

88 Commits (c15df15f0784494f897dbf442624f8c5545fdafd)