rocksdb

Commit Graph

Author	SHA1	Message	Date
Cheng Chang	4fc216649d	Support direct IO in RandomAccessFileReader::MultiRead (#6446 ) Summary: By supporting direct IO in RandomAccessFileReader::MultiRead, the benefits of parallel IO (IO uring) and direct IO can be combined. In direct IO mode, read requests are aligned and merged together before being issued to RandomAccessFile::MultiRead, so blocks in the original requests might share the same underlying buffer, the shared buffers are returned in `aligned_bufs`, which is a new parameter of the `MultiRead` API. For example, suppose alignment requirement for direct IO is 4KB, one request is (offset: 1KB, len: 1KB), another request is (offset: 3KB, len: 1KB), then since they all belong to page (offset: 0, len: 4KB), `MultiRead` only reads the page with direct IO into a buffer on heap, and returns 2 Slices referencing regions in that same buffer. See `random_access_file_reader_test.cc` for more examples. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6446 Test Plan: Added a new test `random_access_file_reader_test.cc`. Reviewed By: anand1976 Differential Revision: D20097518 Pulled By: cheng-chang fbshipit-source-id: ca48a8faf9c3af146465c102ef6b266a363e78d1	6 years ago
Levi Tamasi	c15e85bdcb	Move BlobDB related files under db/ to db/blob/ (#6519 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6519 Test Plan: ``` make all make check ``` Differential Revision: D20400691 Pulled By: ltamasi fbshipit-source-id: 20ef911cf1c2c92c7f71ef0b493f9be64f2eef94	6 years ago
Chao Zhao	4028eba67b	Optional sequence number exporting during checkpoint creation (#5528 ) Summary: Add sequence_number_ptr to the checkpoint interface to expose the sequence number during taking the checkpoint. The number will be consistent with the seq # in rocksdb log. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5528 Test Plan: make check -j64 Reviewed By: Winger1994 Differential Revision: D16080209 fbshipit-source-id: 6dc3c7680287ee97d673c5e61f89aae1f43e33df	6 years ago
Yanqin Jin	d93812c9ae	Iterator with timestamp (#6255 ) Summary: Preliminary support for iterator with user timestamp. Current implementation does not consider merge operator and reverse iterator. Auto compaction is also disabled in unit tests. Create an iterator with timestamp. ``` ... read_opts.timestamp = &ts; auto* iter = db->NewIterator(read_opts); // target is key without timestamp. for (iter->Seek(target); iter->Valid(); iter->Next()) {} for (iter->SeekToFirst(); iter->Valid(); iter->Next()) {} delete iter; read_opts.timestamp = &ts1; // lower_bound and upper_bound are without timestamp. read_opts.iterate_lower_bound = &lower_bound; read_opts.iterate_upper_bound = &upper_bound; auto* iter1 = db->NewIterator(read_opts); // Do Seek or SeekToFirst() delete iter1; ``` Test plan (dev server) ``` $make check ``` Simple benchmarking (dev server) 1. The overhead introduced by this PR even when timestamp is disabled. key size: 16 bytes value size: 100 bytes Entries: 1000000 Data reside in main memory, and try to stress iterator. Repeated three times on master and this PR. - Seek without next ``` ./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3 ``` master: 159047.0 ops/sec this PR: 158922.3 ops/sec (2% drop in throughput) - Seek and next 10 times ``` ./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3 -seek_nexts=10 ``` master: 109539.3 ops/sec this PR: 107519.7 ops/sec (2% drop in throughput) Pull Request resolved: https://github.com/facebook/rocksdb/pull/6255 Differential Revision: D19438227 Pulled By: riversand963 fbshipit-source-id: b66b4979486f8474619f4aa6bdd88598870b0746	6 years ago
Cheng Chang	0a0151fb99	Remove memcpy from RandomAccessFileReader::Read in direct IO mode (#6455 ) Summary: In direct IO mode, RandomAccessFileReader::Read allocates an internal aligned buffer, and then copies the result into the scratch buffer. If the result is only temporarily used inside a function, there is no need to do the memcpy and just let the result Slice refer to the internally allocated buffer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6455 Test Plan: make check Differential Revision: D20106753 Pulled By: cheng-chang fbshipit-source-id: 44f505843837bba47a56e3fa2c4dd3bd76486b58	6 years ago
Otto Kekäläinen	f6c2777d95	Fix spelling: commited -> committed (#6481 ) Summary: In most places in the code the variable names are spelled correctly as COMMITTED but in a couple places not. This fixes them and ensures the variable is always called COMMITTED everywhere. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6481 Differential Revision: D20306776 Pulled By: pdillinger fbshipit-source-id: b6c1bfe41db559b4bc6955c530934460c07f7022	6 years ago
Huisheng Liu	904a60ff63	return timestamp from get (#6409 ) Summary: Added new Get() methods that return timestamp. Dummy implementation is given so that classes derived from DB don't need to be touched to provide their implementation. MultiGet is not included. ReadRandom perf test (10 minutes) on the same development machine ram drive with the same DB data shows no regression (within marge of error). The test is adapted from https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks. base line (commit `72ee067b9`): 101.712 micros/op 314602 ops/sec; 36.0 MB/s (5658999 of 5658999 found) This PR: 100.288 micros/op 319071 ops/sec; 36.5 MB/s (5674999 of 5674999 found) ./db_bench --db=r:\rocksdb.github --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --cache_size=2147483648 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=r:\rocksdb.github\WAL_LOG --sync=0 --verify_checksum=1 --delete_obsolete_files_period_micros=314572800 --max_background_compactions=4 --max_background_flushes=0 --level0_slowdown_writes_trigger=16 --level0_stop_writes_trigger=24 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --mmap_read=1 --mmap_write=0 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --duration=600 --benchmarks=readrandom --use_existing_db=1 --num=25000000 --threads=32 Pull Request resolved: https://github.com/facebook/rocksdb/pull/6409 Differential Revision: D20200086 Pulled By: riversand963 fbshipit-source-id: 490edd74d924f62bd8ae9c29c2a6bbbb8410ca50	6 years ago
Michael R. Crusoe	051696bf98	fix some spelling typos (#6464 ) Summary: Found from Debian's "Lintian" program Pull Request resolved: https://github.com/facebook/rocksdb/pull/6464 Differential Revision: D20162862 Pulled By: zhichao-cao fbshipit-source-id: 06941ee2437b038b2b8045becbe9d2c6fbff3e12	6 years ago
Manuel Ung	41535d0218	WriteUnPrepared: Pass in correct subbatch count during rollback (#6463 ) Summary: Today `WriteUnpreparedTxn::RollbackInternal` will write the rollback batch assuming that there is only a single subbatch. However, because untracked_keys_ are currently not deduplicated, it's possible for duplicate keys to exist, and thus split the batch. Also, tracked_keys_ also does not support compators outside of the bytewise comparators, so it's possible for duplicates to occur there as well. To solve this, just pass in the correct subbatch count. Also, removed `WriteUnpreparedRollbackPreReleaseCallback` to unify the Commit/Rollback codepaths some more. Also, fixed a bug in `CommitInternal` where if 1. two_write_queue is true and 2. include_data is true, then `WriteUnpreparedCommitEntryPreReleaseCallback` ends up calling `AddCommitted` on the commit time write batch a second time on the second write. To fix, `WriteUnpreparedCommitEntryPreReleaseCallback` is re-initialized. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6463 Differential Revision: D20150153 Pulled By: lth fbshipit-source-id: df0b42d39406c75af73df995aa1138f0db539cd1	6 years ago
sdong	fdf882ded2	Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433 ) Summary: When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433 Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag. Differential Revision: D19977691 fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e	6 years ago
Andrew Kryczka	0f9dcb88b2	Return NotSupported from WriteBatchWithIndex::DeleteRange (#5393 ) Summary: As discovered in https://github.com/facebook/rocksdb/issues/5260 and https://github.com/facebook/rocksdb/issues/5392, reads on the indexed batch do not account for range tombstones. So, return `Status::NotSupported` from `WriteBatchWithIndex::DeleteRange` until we properly support it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5393 Test Plan: added unit test Differential Revision: D19912360 Pulled By: ajkr fbshipit-source-id: 0bbfc978ea015d64516ca708fce2429abba524cb	6 years ago
Manuel Ung	dc23c125c3	WriteUnPrepared: Untracked keys (#6404 ) Summary: For write unprepared, some applications may bypass the transaction api, and write keys directly into the write batch. However, since they are not tracked, rollbacks (both for savepoint and transaction) are not aware that these keys have to be rolled back. The fix is to track them in `WriteUnpreparedTxn::untracked_keys_`. This is populated whenever we flush unprepared batches into the DB. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6404 Differential Revision: D19842023 Pulled By: lth fbshipit-source-id: a9edfc643d5c905fc89da9a9a9094d30c9b70108	6 years ago
wolfkdy	29e24434fe	refine code (#6420 ) Summary: I create a new branch from the branch new upsteram/master and "git merge --squash". Maybe it will fix everything. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6420 Differential Revision: D19897152 Pulled By: zhichao-cao fbshipit-source-id: 6575d9e3b23e360f42ee1480b43028b5fcc20136	6 years ago
Manuel Ung	fb571509a7	WriteUnPrepared: Enable WAL during crash recovery (#6418 ) Summary: Unfortunately, it seems like mysqld reuses xids across machine restarts. When that happens, we could have something like the following happening: ``` BEGIN_PREPARE(unprepared) Put(a) END_PREPARE(xid = 1) -- crash and recover with Put(a) rolled back as it was not prepared BEGIN_PREPARE(prepared) Put(b) END_PREPARE(xid = 1) COMMIT(xid = 1) -- crash and recover with both a, b ``` To solve this, we will have to log the rollback batch into the WAL during recovery. WritePrepared already logs the rollback batch into the WAL, if a rollback happens after prepare, so there is no problem there. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6418 Differential Revision: D19896151 Pulled By: lth fbshipit-source-id: 2ff65ddc5fe75efd57736fed4b7cd7a109d26609	6 years ago
sdong	ac8e89a443	Should flush and sync WAL when writing it in DB::Open() (#6417 ) Summary: A recent fix related to 2pc https://github.com/facebook/rocksdb/pull/6313/ writes something to WAL, but does not flush or sync. This causes assertion failure "impl->TEST_WALBufferIsEmpty()" if manual_wal_flush = true. We should fsync the entry to make sure a second power reset can recover. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6417 Test Plan: Add manual_wal_flush=true case in TransactionTest.DoubleCrashInRecovery and fix a bug in the test so that the bug can be reproduced. It passes with the fix. Differential Revision: D19894537 fbshipit-source-id: f1e84e49e2269f583c6019743118292cd8b6598e	6 years ago
Kefu Chai	debc4ef18b	utilities/env_librados: copy use bufferlist::iterator (#6395 ) Summary: to adapt the change in ceph upstream where the bufferlist::copy() method was removed in `c724369010` Signed-off-by: Kefu Chai <tchaikov@gmail.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/6395 Differential Revision: D19816815 Pulled By: zhichao-cao fbshipit-source-id: 9210767b91af0ecdcf5dfaa3e70edcaeea55135f	6 years ago
sdong	876c2dbff4	Allow readahead when reading option files. (#6372 ) Summary: Right, when reading from option files, no readahead is used and 8KB buffer is used. It might introduce high latency if the file system provide high latency and doesn't do readahead. Instead, introduce a readahead to the file. When calling inside DB, infer the value from options.log_readahead. Otherwise, a default 512KB readahead size is used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6372 Test Plan: Add --log_readahead_size in db_bench. Run it with several options and observe read size from option files using strace. Differential Revision: D19727739 fbshipit-source-id: e6d8053b0a64259abc087f1f388b9cd66fa8a583	6 years ago
Levi Tamasi	1b4be4cac9	BlobDB: ignore trivially moved files when updating the SST<->blob file mapping (#6381 ) Summary: BlobDB keeps track of the mapping between SSTs and blob files using the `OnFlushCompleted` and `OnCompactionCompleted` callbacks of the `EventListener` interface: upon receiving a flush notification, a link is added between the newly flushed SST and the corresponding blob file; for compactions, links are removed for the inputs and added for the outputs. The earlier code performed this link deletion and addition even for trivially moved files; the new code walks through the two lists together (in a fashion that's similar to merge sort) and skips such files. This should mitigate https://github.com/facebook/rocksdb/issues/6338, wherein an assertion is triggered with the earlier code when a compaction notification for a trivial move precedes the flush notification for the moved SST. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6381 Test Plan: make check Differential Revision: D19773729 Pulled By: ltamasi fbshipit-source-id: ae0f273ded061110dd9334e8fb99b0d7786650b0	6 years ago
Yanqin Jin	f2fbc5d668	Shorten certain test names to avoid infra failure (#6352 ) Summary: Unit test names, together with other components, are used to create log files during some internal testing. Overly long names cause infra failure due to file names being too long. Look for internal tests. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6352 Differential Revision: D19649307 Pulled By: riversand963 fbshipit-source-id: 6f29de096e33c0eaa87d9c8702f810eda50059e7	6 years ago
Levi Tamasi	9e3ace42a4	Add statistics for BlobDB GC (#6296 ) Summary: The patch adds statistics support to the new BlobDB garbage collection implementation; namely, it adds support for the following (pre-existing) tickers: `BLOB_DB_GC_NUM_FILES`: the number of blob files obsoleted by the GC logic. `BLOB_DB_GC_NUM_NEW_FILES`: the number of new blob files generated by the GC logic. `BLOB_DB_GC_FAILURES`: the number of failed GC passes (where a GC pass is equivalent to a (sub)compaction). `BLOB_DB_GC_NUM_KEYS_RELOCATED`: the number of blobs relocated to new blob files by the GC logic. `BLOB_DB_GC_BYTES_RELOCATED`: the total size of blobs relocated to new blob files. The tickers `BLOB_DB_GC_NUM_KEYS_OVERWRITTEN`, `BLOB_DB_GC_NUM_KEYS_EXPIRED`, `BLOB_DB_GC_BYTES_OVERWRITTEN`, `BLOB_DB_GC_BYTES_EXPIRED`, and `BLOB_DB_GC_MICROS` are not relevant for the new GC logic, and are thus marked deprecated. The patch also adds a couple of log messages that log the number and total size of blobs encountered and relocated during a GC pass, as well as the number of blob files created and obsoleted. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6296 Test Plan: Extended unit tests and used the BlobDB mode of `db_bench`. Differential Revision: D19402513 Pulled By: ltamasi fbshipit-source-id: d53d2bfbf4928a1db1e9346c67ebb9007b8932ec	6 years ago
Maysam Yabandeh	2f973ca96e	Double Crash in kPointInTimeRecovery with TransactionDB (#6313 ) Summary: In WritePrepared there could be gap in sequence numbers. This breaks the trick we use in kPointInTimeRecovery which assume the first seq in the log right after the corrupted log is one larger than the last seq we read from the logs. To let this trick keep working, we add a dummy entry with the expected sequence to the first log right after recovery. Also in WriteCommitted, if the log right after the corrupted log is empty, since it has no sequence number to let the sequential trick work, it is assumed as unexpected behavior. This is however expected to happen if we close the db after recovering from a corruption and before writing anything new to it. To remedy that, we apply the same technique by writing a dummy entry to the log that is created after the corrupted log. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6313 Differential Revision: D19458291 Pulled By: maysamyabandeh fbshipit-source-id: 09bc49e574690085df45b034ca863ff315937e2d	6 years ago
matthewvon	e6e8b9e871	Correct pragma once problem with Bazel on Windows (#6321 ) Summary: This is a simple edit to have two #include file paths be consistent within range_del_aggregator.{h,cc} with everywhere else. The impact of this inconsistency is that it actual breaks a Bazel based build on the Windows platform. The same pragma once failure occurs with both Windows Visual C++ 2019 and clang for Windows 9.0. Bazel's "sandboxing" of the builds causes both compilers to not properly recognize "rocksdb/types.h" and "include/rocksdb/types.h" to be the same file (also comparator.h). My guess is that the backslash versus forward slash mixing within path names is the underlying issue. But, everything builds fine once the include paths in these two source files are consistent with the rest of the repository. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6321 Differential Revision: D19506585 Pulled By: ltamasi fbshipit-source-id: 294c346607edc433ab99eaabc9c880ee7426817a	6 years ago
Levi Tamasi	1dd7873e08	Remove earlier partial BlobDB GC implementation (#6278 ) Summary: In addition to removing the earlier partially implemented garbage collection logic from the BlobDB codebase, the patch also removes the test cases (as well as the related sync points, as appropriate) that were only relevant for the old implementation, and reworks the remaining ones so they use the new GC logic. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6278 Test Plan: `make check` Differential Revision: D19335226 Pulled By: ltamasi fbshipit-source-id: 0cc1794bc9892feda1426ed5522a318f3cb1b692	6 years ago
Maysam Yabandeh	eff5e076f5	unordered_write incompatible with max_successive_merges (#6284 ) Summary: unordered_write is incompatible with non-zero max_successive_merges. Although we check this at runtime, we currently don't prevent the user from setting this combination in options. This has led to stress tests to fail with this combination is tried in ::SetOptions. The patch fixes that and also reverts the changes performed by https://github.com/facebook/rocksdb/pull/6254, in which max_successive_merges was mistakenly declared incompatible with unordered_write. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6284 Differential Revision: D19356115 Pulled By: maysamyabandeh fbshipit-source-id: f06dadec777622bd75f267361c022735cf8cecb6	6 years ago
sdong	39410bcb3d	Fix some shadow warning (#6242 ) Summary: Some shadow warning shows up when using gcc 4.8. An example: ./utilities/blob_db/blob_compaction_filter.h: In constructor ‘rocksdb::blob_db::BlobIndexCompactionFilterFactoryBase::BlobIndexCompactionFilterFactoryBase(rocksdb::blob_db::lobDBImpl, rocksdb::Env, rocksdb::Statistics*)’: ./utilities/blob_db/blob_compaction_filter.h:121:7: error: declaration of ‘blob_db_impl’ shadows a member of 'this' [-Werror=shadow] : blob_db_impl_(blob_db_impl), env_(_env), statistics_(_statistics) {} ^ Fix them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6242 Test Plan: Build and see the warnings go away. Differential Revision: D19217789 fbshipit-source-id: 8ef631941f23dab47a388e060adec24b72efd65e	6 years ago
Maysam Yabandeh	5709e97a74	Skip CancelAllBackgroundWork if DBImpl is already closed (#6268 ) Summary: WritePreparedTxnDB calls CancelAllBackgroundWork in its destructor to avoid dangling references to it from background job's SnapshotChecker callback. However, if the DBImpl is already closed, the info log might be closed with it, which causes memory leak when CancelAllBackgroundWork tries to print to the info log. The patch fixes that by calling CancelAllBackgroundWork only if the db is not closed already. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6268 Differential Revision: D19303439 Pulled By: maysamyabandeh fbshipit-source-id: 4228a6be7e78d43c90630347baa89b008200bd15	6 years ago
wolfkdy	1ab1231acf	parallel occ (#6240 ) Summary: This is a continuation of https://github.com/facebook/rocksdb/pull/5320/files I open a new mr for these purposes, half a year has past since the old mr is posted so it's almost impossible to fulfill some points below on the old mr, especially 5) 1) add validation modes for optimistic txns 2) modify unittests to test both modes 3) make format 4) refine hash functor 5) push to master Pull Request resolved: https://github.com/facebook/rocksdb/pull/6240 Differential Revision: D19301296 fbshipit-source-id: 5b5b3cbd39558f43947f7d2dec6cd31a06386edb	6 years ago
Maysam Yabandeh	48a678b7c9	Prevent an incompatible combination of options (#6254 ) Summary: allow_concurrent_memtable_write is incompatible with non-zero max_successive_merges. Although we check this at runtime, we currently don't prevent the user from setting this combination in options. This has led to stress tests to fail with this combination is tried in ::SetOptions. The patch fixes that. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6254 Differential Revision: D19265819 Pulled By: maysamyabandeh fbshipit-source-id: 47f2e2dc26fe0972c7152f4da15dadb9703f1179	6 years ago
Levi Tamasi	1ebaa762e6	Log garbage_collection_cutoff alongside the other BlobDB options Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6229 Differential Revision: D19191195 Pulled By: ltamasi fbshipit-source-id: 2a3c4785299641a46e022fc012460b759a689fce	6 years ago
Levi Tamasi	7a7ca8eb5b	BlobDB: only compare CF IDs when checking whether an API call is for the default CF (#6226 ) Summary: BlobDB currently only supports using the default column family. The earlier code enforces this by comparing the `ColumnFamilyHandle` passed to the `Get`/`Put`/etc. call with the handle returned by `DefaultColumnFamily` (which, at the end of the day, comes from `DBImpl::default_cf_handle_`). Since other `ColumnFamilyHandle`s can also point to the default column family, this can reject legitimate requests as well. (As an example, with the earlier code, the handle returned by `BlobDB::Open` cannot actually be used in API calls.) The patch fixes this by comparing only the IDs of the column family handles instead of the pointers themselves. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6226 Test Plan: `make check` Differential Revision: D19187461 Pulled By: ltamasi fbshipit-source-id: 54ce2e12ebb1f07e6d1e70e3b1e0213dfa94bda2	6 years ago
anand1976	1be48cb895	Fix crash in Transaction::MultiGet() when num_keys > 32 Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6192 Test Plan: Add a unit test that fails without the fix and passes now make check Differential Revision: D19124781 Pulled By: anand1976 fbshipit-source-id: 8c8cb6fa16c3fc23ec011e168561a13f76bbd783	6 years ago
Levi Tamasi	0d2172f128	Make it possible to enable periodic compactions for BlobDB (#6172 ) Summary: Periodic compactions ensure that even SSTs that do not get picked up otherwise eventually go through compaction; used in conjunction with BlobDB's garbage collection, they enable BlobDB to reclaim space when old blob files are used by such straggling SSTs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6172 Test Plan: Ran `make check` and used the BlobDB mode of `db_bench`. Differential Revision: D19045045 Pulled By: ltamasi fbshipit-source-id: 04636ecc4b6cfe8d495bf656faa65d54a5eb1a93	6 years ago
anand76	afa2420c2b	Introduce a new storage specific Env API (#5761 ) Summary: The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc. This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO. The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before. This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection. The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761 Differential Revision: D18868376 Pulled By: anand1976 fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f	6 years ago
Levi Tamasi	583c6953d8	Move out valid blobs from the oldest blob files during compaction (#6121 ) Summary: The patch adds logic that relocates live blobs from the oldest N non-TTL blob files as they are encountered during compaction (assuming the BlobDB configuration option `enable_garbage_collection` is `true`), where N is defined as the number of immutable non-TTL blob files multiplied by the value of a new BlobDB configuration option called `garbage_collection_cutoff`. (The default value of this parameter is 0.25, that is, by default the valid blobs residing in the oldest 25% of immutable non-TTL blob files are relocated.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/6121 Test Plan: Added unit test and tested using the BlobDB mode of `db_bench`. Differential Revision: D18785357 Pulled By: ltamasi fbshipit-source-id: 8c21c512a18fba777ec28765c88682bb1a5e694e	6 years ago
Levi Tamasi	3b607610df	Do not update SST <-> blob file mapping if compaction failed Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6156 Test Plan: Extended unit tests. Differential Revision: D18943867 Pulled By: ltamasi fbshipit-source-id: b3669d2dd6af08e987ad1a59d6712ae2514da0b1	6 years ago
Adam Retter	a61ec9ae3b	Fix BlobDB compilation on older GCC versions Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6094 Differential Revision: D18731951 Pulled By: ltamasi fbshipit-source-id: 5b73c6009c748f6a2a48d4d880b1259980d801d4	6 years ago
Adam Retter	6d58ea901d	Fix compilation under MSVC VS2015 (#6081 ) Summary: NOTE: this also needs to be back-ported to 6.4.6 and possibly older branches if further releases from them is envisaged. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6081 Differential Revision: D18710107 Pulled By: zhichao-cao fbshipit-source-id: 03260f9316566e2bfc12c7d702d6338bb7941e01	6 years ago
Levi Tamasi	d9314a9214	Refactor and clean up the code that reads a blob from a file (#6093 ) Summary: This patch factors out the logic that reads a (potentially compressed) blob from a file into a separate helper method `GetRawBlobFromFile`, and cleans up the code a bit. Also, errors during decompression are now logged/propagated to the user by returning a `Status` code of `Corruption`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6093 Test Plan: `make check` Differential Revision: D18716673 Pulled By: ltamasi fbshipit-source-id: 44144bc064cab616862d5643f34384f2bae6eb78	6 years ago
Levi Tamasi	72daa92d3a	Refactor blob file creation logic (#6066 ) Summary: The patch refactors and cleans up the logic around creating new blob files by moving the common code of `SelectBlobFile` and `SelectBlobFileTTL` to a new helper method `CreateBlobFileAndWriter`, bringing the implementation of `SelectBlobFile` and `SelectBlobFileTTL` into sync, and increasing encapsulation by adding new constructors for `BlobFile` and `BlobLogHeader`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6066 Test Plan: Ran `make check` and used the BlobDB mode of `db_bench` to sanity test both the TTL and the non-TTL code paths. Differential Revision: D18646921 Pulled By: ltamasi fbshipit-source-id: e5705a84807932e31dccab4f49b3e64369cea26d	6 years ago
Sebastiano Peluso	fcd7e03832	Ignore value of BackupableDBOptions::max_valid_backups_to_open when B… (#6072 ) Summary: This change ignores the value of BackupableDBOptions::max_valid_backups_to_open when a BackupEngine is not read-only. Issue: https://github.com/facebook/rocksdb/issues/4997 Note on tests: I had to remove test case WriteOnlyEngine of BackupableDBTest because it was not consistent with the new semantic of BackupableDBOptions::max_valid_backups_to_open. Maybe, we should think about adding a new interface for append-only BackupEngines. On the other hand, I changed LimitBackupsOpened test case to use a read-only BackupEngine, and I added a new specific test case for the change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6072 Reviewed By: pdillinger Differential Revision: D18687364 Pulled By: sebastianopeluso fbshipit-source-id: 77bc1f927d623964d59137a93de123bbd719da4e	6 years ago
Levi Tamasi	279c488395	Mark blob files not needed by any memtables/SSTs obsolete (#6032 ) Summary: The patch adds logic to mark no longer needed blob files obsolete upon database open and whenever a flush or compaction completes. Unneeded blob files are detected by iterating through live immutable non-TTL blob files starting from the lowest-numbered one, and stopping when a blob file used by any SSTs or potentially used by memtables is found. (The latter is determined by comparing the sequence number at which the blob file became immutable with the largest sequence number received in flush notifications.) In addition, the patch cleans up the logic around closing and obsoleting blob files and enforces invariants around this area (blob files are now guaranteed to go through the stages mutable-non-obsolete, immutable-non-obsolete, and immutable-obsolete in this order). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6032 Test Plan: Extended unit tests and tested using the BlobDB mode of `db_bench`. Differential Revision: D18495610 Pulled By: ltamasi fbshipit-source-id: 11825b84af74f3f4abfd9bcae04e80870ae58961	6 years ago
Maysam Yabandeh	0058daef7b	Disable SmallestUnCommittedSeq in Valgrind run (#6035 ) Summary: SmallestUnCommittedSeq sometimes takes too long when run under Valgrind. The patch disables it when the tests are run under Valgrind. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6035 Differential Revision: D18509198 Pulled By: maysamyabandeh fbshipit-source-id: 1191443b9fedb6b9c50d6b76f5c92371f5030230	6 years ago
Peter Dillinger	e8e7fb1dcf	More fixes to auto-GarbageCollect in BackupEngine (#6023 ) Summary: Production: * Fixes GarbageCollect (and auto-GC triggered by PurgeOldBackups, DeleteBackup, or CreateNewBackup) to clean up backup directory independent of current settings (except max_valid_backups_to_open; see issue https://github.com/facebook/rocksdb/issues/4997) and prior settings used with same backup directory. * Fixes GarbageCollect (and auto-GC) not to attempt to remove "." and ".." entries from directories. * Clarifies contract with users in modifying BackupEngine operations. In short, leftovers from any incomplete operation are cleaned up by any subsequent call to that same kind of operation (PurgeOldBackups and DeleteBackup considered the same kind of operation). GarbageCollect is available to clean up after all kinds. (NB: right now PurgeOldBackups and DeleteBackup will clean up after incomplete CreateNewBackup, but we aren't promising to continue that behavior.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/6023 Test Plan: * Refactors open parameters to use an option enum, for readability, etc. (Also fixes an unused parameter bug in the redundant OpenDBAndBackupEngineShareWithChecksum.) * Fixes an apparent bug in ShareTableFilesWithChecksumsTransition in which old backup data was destroyed in the transition to be tested. That test is now augmented to ensure GarbageCollect (or auto-GC) does not remove shared files when BackupEngine is opened with share_table_files=false. * Augments DeleteTmpFiles test to ensure that CreateNewBackup does auto-GC when an incompletely created backup is detected. Differential Revision: D18453559 Pulled By: pdillinger fbshipit-source-id: 5e54e7b08d711b161bc9c656181012b69a8feac4	6 years ago
anand76	6c7b1a0cc7	Batched MultiGet API for multiple column families (#5816 ) Summary: Add a new API that allows a user to call MultiGet specifying multiple keys belonging to different column families. This is mainly useful for users who want to do a consistent read of keys across column families, with the added performance benefits of batching and returning values using PinnableSlice. As part of this change, the code in the original multi-column family MultiGet for acquiring the super versions has been refactored into a separate function that can be used by both, the batching and the non-batching versions of MultiGet. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5816 Test Plan: make check make asan_check asan_crash_test Differential Revision: D18408676 Pulled By: anand1976 fbshipit-source-id: 933e7bec91dd70e7b633be4ff623a1116cc28c8d	6 years ago
Levi Tamasi	8e7aa62813	BlobDB: Maintain mapping between blob files and SSTs (#6020 ) Summary: The patch adds logic to BlobDB to maintain the mapping between blob files and SSTs for which the blob file in question is the oldest blob file referenced by the SST file. The mapping is initialized during database open based on the information retrieved using `GetLiveFilesMetaData`, and updated after flushes/compactions based on the information received through the `EventListener` interface (or, in the case of manual compactions issued through the `CompactFiles` API, the `CompactionJobInfo` object). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6020 Test Plan: Added a unit test; also tested using the BlobDB mode of `db_bench`. Differential Revision: D18410508 Pulled By: ltamasi fbshipit-source-id: dd9e778af781cfdb0d7056298c54ba9cebdd54a5	6 years ago
Peter Dillinger	aa63abf698	Auto-GarbageCollect on PurgeOldBackups and DeleteBackup (#6015 ) Summary: Only if there is a crash, power failure, or I/O error in DeleteBackup, shared or private files from the backup might be left behind that are not cleaned up by PurgeOldBackups or DeleteBackup-- only by GarbageCollect. This makes the BackupEngine API "leaky by default." Even if it means a modest performance hit, I think we should make Delete and Purge do as they say, with ongoing best effort: i.e. future calls will attempt to finish any incomplete work from earlier calls. This change does that by having DeleteBackup and PurgeOldBackups do a GarbageCollect, unless (to minimize performance hit) this BackupEngine has already done a GarbageCollect and there have been no deletion-related I/O errors in that GarbageCollect or since then. Rejected alternative 1: remove meta file last instead of first. This would in theory turn partially deleted backups into corrupted backups, but code changes would be needed to allow the missing files and consider it acceptably corrupt, rather than failing to open the BackupEngine. This might be a reasonable choice, but I mostly rejected it because it doesn't solve the legacy problem of cleaning up existing lingering files. Rejected alternative 2: use a deletion marker file. If deletion started with creating a file that marks a backup as flagged for deletion, then we could reliably detect partially deleted backups and efficiently finish removing them. In addition to not solving the legacy problem, this could be precarious if there's a disk full situation, and we try to create a new file in order to delete some files. Ugh. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6015 Test Plan: Updated unit tests Differential Revision: D18401333 Pulled By: pdillinger fbshipit-source-id: 12944e372ce6809f3f5a4c416c3b321a8927d925	6 years ago
Levi Tamasi	f80050fa8f	Add file number/oldest referenced blob file number to {Sst,Live}FileMetaData (#6011 ) Summary: The patch exposes the file numbers of the SSTs as well as the oldest blob files they contain a reference to through the GetColumnFamilyMetaData/ GetLiveFilesMetaData interface. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6011 Test Plan: Fixed and extended the existing unit tests. (The earlier ColumnFamilyMetaDataTest wasn't really testing anything because the generated memtables were never flushed, so the metadata structure was essentially empty.) Differential Revision: D18361697 Pulled By: ltamasi fbshipit-source-id: d5ed1d94ac70858b84393c48711441ddfe1251e9	6 years ago
Yanqin Jin	67e735dbf9	Rename BlockBasedTable::ReadMetaBlock (#6009 ) Summary: According to https://github.com/facebook/rocksdb/wiki/Rocksdb-BlockBasedTable-Format, the block read by BlockBasedTable::ReadMetaBlock is actually the meta index block. Therefore, it is better to rename the function to ReadMetaIndexBlock. This PR also applies some format change to existing code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6009 Test Plan: make check Differential Revision: D18333238 Pulled By: riversand963 fbshipit-source-id: 2c4340a29b3edba53d19c132cbfd04caf6242aed	6 years ago
Sergei Petrunia	230bcae7b6	Add a limited support for iteration bounds into BaseDeltaIterator (#5403 ) Summary: For MDEV-19670: MyRocks: key lookups into deleted data are very slow BaseDeltaIterator remembers iterate_upper_bound and will not let delta_iterator_ walk above the iterate_upper_bound if base_iterator_ is not valid anymore. == Rationale == The most straightforward way would be to make the delta_iterator (which is a rocksdb::WBWIIterator) to support iterator bounds. But checking for bounds has an extra CPU overhead. So we put the check into BaseDeltaIterator, and only make it when base_iterator_ is not valid. (note: We could take it even further, and move the check a few lines down, and only check iterator bounds ourselves if base_iterator_ is not valid AND delta_iterator_ hit a tombstone). Pull Request resolved: https://github.com/facebook/rocksdb/pull/5403 Differential Revision: D15863092 Pulled By: maysamyabandeh fbshipit-source-id: 8da458e7b9af95ff49356666f69664b4a6ccf49b	6 years ago
Maysam Yabandeh	52733b4498	WritePrepared: Fix flaky test MaxCatchupWithNewSnapshot (#5850 ) Summary: MaxCatchupWithNewSnapshot tests that the snapshot sequence number will be larger than the max sequence number when the snapshot was taken. However since the test does not have access to the max sequence number when the snapshot was taken, it uses max sequence number after that, which could have advanced the snapshot by then, thus making the test flaky. The fix is to compare with max sequence number before the snapshot was taken, which is a lower bound for the value when the snapshot was taken. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5850 Test Plan: ~/gtest-parallel/gtest-parallel --repeat=12800 ./write_prepared_transaction_test --gtest_filter="MaxCatchupWithNewSnapshot" Differential Revision: D17608926 Pulled By: maysamyabandeh fbshipit-source-id: b122ae5a27f982b290bd60da852e28d3c5eb0136	6 years ago

... 3 4 5 6 7 ...

1278 Commits (61c9bd49c16ec27304c2b0685fc72d0a9997b182)