rocksdb

Commit Graph

Author	SHA1	Message	Date
sdong	4720ba4391	Remove RocksDB LITE (#11147 ) Summary: We haven't been actively maintaining RocksDB LITE recently and the size must have gone up significantly. We are removing the support. Most of the changes were done through the following command: unifdef -m -UROCKSDB_LITE `git grep -l ROCKSDB_LITE | egrep '[.](cc|h)'` by Peter Dillinger. Other changes were manually applied to build scripts, CircleCI manifests, places where ROCKSDB_LITE is used in an expression, and the file db_stress_test_base.cc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11147 Test Plan: See CI Reviewed By: pdillinger Differential Revision: D42796341 fbshipit-source-id: 4920e15fc2060c2cd2221330a6d0e5e65d4b7fe2	2 years ago
Levi Tamasi	4d9cb433fa	Run clang-format on utilities/ (except utilities/transactions/) (#10853 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/10853 Test Plan: `make check` Reviewed By: siying Differential Revision: D40651315 Pulled By: ltamasi fbshipit-source-id: 8b270ff4777a06464be86e376c2a680427866a46	2 years ago
Peter Dillinger	ef443cead4	Refactor to avoid confusing "raw block" (#10408 ) Summary: We have a lot of confusing code because of mixed, sometimes completely opposite uses of the term "raw block" or "raw contents", sometimes within the same source file. For example, in `BlockBasedTableBuilder`, `raw_block_contents` and `raw_size` generally referred to uncompressed block contents and size, while `WriteRawBlock` referred to writing a block that is already compressed if it is going to be. Meanwhile, in `BlockBasedTable`, `raw_block_contents` either referred to a (maybe compressed) block with trailer, or a maybe compressed block maybe without trailer. (Note: left as follow-up work to use C++ typing to better sort out the various kinds of BlockContents.) This change primarily tries to apply some consistent terminology around the kinds of block representations, avoiding the unclear "raw". (Any meaning of "raw" assumes some bias toward the storage layer or toward the logical data layer.) Preferred terminology: * Serialized block - bytes that go into storage. For block-based table (usually the case) this includes the block trailer. WART: block `size` may or may not include the trailer; need to be clear about whether it does or not. * Maybe compressed block - like a serialized block, but without the trailer (or no promise of including a trailer). Must be accompanied by a CompressionType. * Uncompressed block - "payload" bytes that are either stored with no compression, used as input to compression function, or result of decompression function. * Parsed block - an in-memory form of a block in block cache, as it is used by the table reader. Different C++ types are used depending on the block type (see block_like_traits.h). Other refactorings: * Misc corrections/improvements of internal API comments * Remove a few misleading / unhelpful / redundant comments. * Use move semantics in some places to simplify contracts * Use better parameter names to indicate which parameters are used for outputs * Remove some extraneous `extern` * Various clean-ups to `CacheDumperImpl` (mostly unnecessary code) Pull Request resolved: https://github.com/facebook/rocksdb/pull/10408 Test Plan: existing tests Reviewed By: akankshamahajan15 Differential Revision: D38172617 Pulled By: pdillinger fbshipit-source-id: ccb99299f324ac5ca46996d34c5089621a4f260c	2 years ago
Andrew Kryczka	babe56ddba	Add rate limiter priority to ReadOptions (#9424 ) Summary: Users can set the priority for file reads associated with their operation by setting `ReadOptions::rate_limiter_priority` to something other than `Env::IO_TOTAL`. Rate limiting `VerifyChecksum()` and `VerifyFileChecksums()` is the motivation for this PR, so it also includes benchmarks and minor bug fixes to get that working. `RandomAccessFileReader::Read()` already had support for rate limiting compaction reads. I changed that rate limiting to be non-specific to compaction, but rather performed according to the passed in `Env::IOPriority`. Now the compaction read rate limiting is supported by setting `rate_limiter_priority = Env::IO_LOW` on its `ReadOptions`. There is no default value for the new `Env::IOPriority` parameter to `RandomAccessFileReader::Read()`. That means this PR goes through all callers (in some cases multiple layers up the call stack) to find a `ReadOptions` to provide the priority. There are TODOs for cases where I believe it would be good to let the user control the priority some day (e.g., file footer reads), and no TODO in cases where I believe it doesn't matter (e.g., trace file reads). The API doc only lists the missing cases where a file read associated with a provided `ReadOptions` cannot be rate limited. For cases like file ingestion checksum calculation, there is no API to provide `ReadOptions` or `Env::IOPriority`, so I didn't count that as missing. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9424 Test Plan: - new unit tests - new benchmarks on ~50MB database with 1MB/s read rate limit and 100ms refill interval; verified with strace reads are chunked (at 0.1MB per chunk) and spaced roughly 100ms apart. - setup command: `./db_bench -benchmarks=fillrandom,compact -db=/tmp/testdb -target_file_size_base=1048576 -disable_auto_compactions=true -file_checksum=true` - benchmarks command: `strace -ttfe pread64 ./db_bench -benchmarks=verifychecksum,verifyfilechecksums -use_existing_db=true -db=/tmp/testdb -rate_limiter_bytes_per_sec=1048576 -rate_limit_bg_reads=1 -rate_limit_user_ops=true -file_checksum=true` - crash test using IO_USER priority on non-validation reads with https://github.com/facebook/rocksdb/issues/9567 reverted: `python3 tools/db_crashtest.py blackbox --max_key=1000000 --write_buffer_size=524288 --target_file_size_base=524288 --level_compaction_dynamic_level_bytes=true --duration=3600 --rate_limit_bg_reads=true --rate_limit_user_ops=true --rate_limiter_bytes_per_sec=10485760 --interval=10` Reviewed By: hx235 Differential Revision: D33747386 Pulled By: ajkr fbshipit-source-id: a2d985e97912fba8c54763798e04f006ccc56e0c	3 years ago
Jay Zhuang	29102641dd	Skip directory fsync for filesystem btrfs (#8903 ) Summary: Directory fsync might be expensive on btrfs and it may not be needed. Here are 4 directory fsync cases: 1. creating a new file: dir-fsync is not needed on btrfs, as long as the new file itself is synced. 2. renaming a file: dir-fsync is not needed if the renamed file is synced. So an API `FsyncAfterFileRename(filename, ...)` is provided to sync the file on btrfs. By default, it just calls dir-fsync. 3. deleting files: dir-fsync is forced by setting `IOOptions.force_dir_fsync = true` 4. renaming multiple files (like backup and checkpoint): dir-fsync is forced, the same as above. Pull Request resolved: https://github.com/facebook/rocksdb/pull/8903 Test Plan: run tests on btrfs and non btrfs Reviewed By: ajkr Differential Revision: D30885059 Pulled By: jay-zhuang fbshipit-source-id: dd2730b31580b0bcaedffc318a762d7dbf25de4a	3 years ago
mrambacher	3dff28cf9b	Use SystemClock* instead of std::shared_ptr<SystemClock> in lower level routines (#8033 ) Summary: For performance purposes, the lower level routines were changed to use a SystemClock* instead of a std::shared_ptr<SystemClock>. The shared ptr has some performance degradation on certain hardware classes. For most of the system, there is no risk of the pointer being deleted/invalid because the shared_ptr will be stored elsewhere. For example, the ImmutableDBOptions stores the Env which has a std::shared_ptr<SystemClock> in it. The SystemClock* within the ImmutableDBOptions is essentially a "short cut" to gain access to this constant resource. There were a few classes (PeriodicWorkScheduler?) where the "short cut" property did not hold. In those cases, the shared pointer was preserved. Using db_bench readrandom perf_level=3 on my EC2 box, this change performed as well or better than 6.17: 6.17: readrandom : 28.046 micros/op 854902 ops/sec; 61.3 MB/s (355999 of 355999 found) 6.18: readrandom : 32.615 micros/op 735306 ops/sec; 52.7 MB/s (290999 of 290999 found) PR: readrandom : 27.500 micros/op 871909 ops/sec; 62.5 MB/s (367999 of 367999 found) (Note that the times for 6.18 are prior to revert of the SystemClock). Pull Request resolved: https://github.com/facebook/rocksdb/pull/8033 Reviewed By: pdillinger Differential Revision: D27014563 Pulled By: mrambacher fbshipit-source-id: ad0459eba03182e454391b5926bf5cdd45657b67	4 years ago
mrambacher	4a09d632c4	Remove Legacy and Custom FileWrapper classes from header files (#7851 ) Summary: Removed the uses of the Legacy FileWrapper classes from the source code. The wrappers were creating an additional layer of indirection/wrapping, as the Env already has a FileSystem. Moved the Custom FileWrapper classes into the CustomEnv, as these classes are really for the private use of the CustomEnv class. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7851 Reviewed By: anand1976 Differential Revision: D26114816 Pulled By: mrambacher fbshipit-source-id: db32840e58d969d3a0fa6c25aaf13d6dcdc74150	4 years ago
mrambacher	12f1137355	Add a SystemClock class to capture the time functions of an Env (#7858 ) Summary: Introduces and uses a SystemClock class to RocksDB. This class contains the time-related functions of an Env and these functions can be redirected from the Env to the SystemClock. Many of the places that used an Env (Timer, PerfStepTimer, RepeatableThread, RateLimiter, WriteController) for time-related functions have been changed to use SystemClock instead. There are likely more places that can be changed, but this is a start to show what can/should be done. Over time it would be nice to migrate most (if not all) of the uses of the time functions from the Env to the SystemClock. There are several Env classes that implement these functions. Most of these have not been converted yet to SystemClock implementations; that will come in a subsequent PR. It would be good to unify many of the Mock Timer implementations, so that they behave similarly and be tested similarly (some override Sleep, some use a MockSleep, etc). Additionally, this change will allow new methods to be introduced to the SystemClock (like https://github.com/facebook/rocksdb/issues/7101 WaitFor) in a consistent manner across a smaller number of classes. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7858 Reviewed By: pdillinger Differential Revision: D26006406 Pulled By: mrambacher fbshipit-source-id: ed10a8abbdab7ff2e23d69d85bd25b3e7e899e90	4 years ago
Levi Tamasi	431e8afba7	Do not explicitly flush blob files when using the integrated BlobDB (#7892 ) Summary: In the original stacked BlobDB implementation, which writes blobs to blob files immediately and treats blob files as logs, it makes sense to flush the file after writing each blob to protect against process crashes; however, in the integrated implementation, which builds blob files in the background jobs, this unnecessarily reduces performance. This patch fixes this by simply adding a `do_flush` flag to `BlobLogWriter`, which is set to `true` by the stacked implementation and to `false` by the new code. Note: the change itself is trivial but the tests needed some work; since in the new implementation, blobs are now buffered, adding a blob to `BlobFileBuilder` is no longer guaranteed to result in an actual I/O. Therefore, we can no longer rely on `FaultInjectionTestEnv` when testing failure cases; instead, we manipulate the return values of I/O methods directly using `SyncPoint`s. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7892 Test Plan: `make check` Reviewed By: jay-zhuang Differential Revision: D26022814 Pulled By: ltamasi fbshipit-source-id: b3dce419f312137fa70d84cdd9b908fd5d60d8cd	4 years ago
mrambacher	cc2a180d00	Add more tests to the ASC pass list (#7834 ) Summary: Fixed the following to now pass ASC checks: * `ttl_test` * `blob_db_test` * `backupable_db_test`, * `delete_scheduler_test` Pull Request resolved: https://github.com/facebook/rocksdb/pull/7834 Reviewed By: jay-zhuang Differential Revision: D25795398 Pulled By: ajkr fbshipit-source-id: a10037817deda4fc7cbb353a2e00b62ed89b6476	4 years ago
Akanksha Mahajan	20c7d7c58a	Handling misuse of snprintf return value (#7686 ) Summary: Handle misuse of snprintf return value to avoid Out of bound read/write. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7686 Test Plan: make check -j64 Reviewed By: riversand963 Differential Revision: D25030831 Pulled By: akankshamahajan15 fbshipit-source-id: 1a1d181c067c78b94d720323ae00b79566b57cfa	4 years ago
Levi Tamasi	61932cdf1d	Add blob support to DBIter (#7731 ) Summary: The patch adds iterator support to the integrated BlobDB implementation. Whenever a blob reference is encountered during iteration, the corresponding blob is retrieved by calling `Version::GetBlob`, assuming the `expose_blob_index` (formerly `allow_blob`) flag is not set. (Note: the flag is set by the old stacked BlobDB implementation, which has its own blob file handling/blob retrieval logic.) In addition, `DBIter` now uniformly returns `Status::NotSupported` with the error message `"BlobDB does not support merge operator."` when encountering a blob reference while performing a merge (instead of potentially returning a message that implies the database should be opened using the stacked BlobDB's `Open`.) TODO: We can implement support for lazily retrieving the blob value (or in other words, bypassing the retrieval of blob values based on key) by extending the `Iterator` API with a new `PrepareValue` method (similarly to `InternalIterator`, which already supports lazy values). Pull Request resolved: https://github.com/facebook/rocksdb/pull/7731 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D25256293 Pulled By: ltamasi fbshipit-source-id: c39cd782011495a526cdff99c16f5fca400c4811	4 years ago
Levi Tamasi	ee8c79d40d	Turn the compression_type check in BlobDBImpl::DecompressSlice into an assertion (#7127 ) Summary: In both cases where `BlobDBImpl::DecompressSlice` is called, `compression_type` is already checked at the call site; thus, the check inside the method is redundant and can be turned into an assertion. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7127 Test Plan: `make check` Reviewed By: zhichao-cao Differential Revision: D22533454 Pulled By: ltamasi fbshipit-source-id: ae524443fc6abe0a5fb12327a3fe761a9cd2c831	4 years ago
Levi Tamasi	bdf4de6cb9	Remove some dead code from BlobLogWriter (#7125 ) Summary: Periodic syncing of blob files is performed by `WritableFileWriter`; `bytes_per_sync_` and `next_sync_offset_` in `BlobLogWriter` are actually unused (or more precisely, only used by methods that are themselves unused). The patch removes all this dead code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7125 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D22531021 Pulled By: ltamasi fbshipit-source-id: 6b293ad5a79d3e6bf15c5c68f7aedd7ce7a15f10	4 years ago
Levi Tamasi	a693341604	Move the blob file format related classes to the main namespace, rename reader/writer (#7086 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/7086 Test Plan: `make check` Reviewed By: zhichao-cao Differential Revision: D22395420 Pulled By: ltamasi fbshipit-source-id: 088a20097bd6b73b0c433cd79725779f97ec04f2	4 years ago
Jay Zhuang	00de699096	Replace reinterpret_cast with static_cast_with_check (#7067 ) Summary: Replace `reinterpret_cast` with `static_cast_with_check` for `DBImpl` and `ColumnFamilyHandleImpl`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7067 Reviewed By: siying Differential Revision: D22361587 Pulled By: jay-zhuang fbshipit-source-id: dfe9e8f3af39c3d27cc372c55ab9ad905eb0a5a1	4 years ago
Burton Li	5be2cb6948	Compaction filter support for BlobDB (#6850 ) Summary: Added compaction filter support for BlobDB non-TTL values. Same as vanilla RocksDB, user compaction filter applies to all k/v pairs of the compaction for non-TTL values. It honors `min_blob_size`, which potentially results in value transitions between inlined data and stored-in-blob data when the size of a value is changed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6850 Reviewed By: siying Differential Revision: D22263487 Pulled By: ltamasi fbshipit-source-id: 8fc03f8cde2a5c831e63b436b3dbf1b7f90939e8	4 years ago
Levi Tamasi	06c3b85b9a	Disallow using the base DB's storage directory as blob_dir in BlobDB (#6810 ) Summary: https://github.com/facebook/rocksdb/pull/6807 extends the logic that identifies and purges obsolete files to blob files handled by RocksDB itself. In order to prevent that from interfering with the current BlobDB code, we need to make sure that `BlobDBOptions::blob_dir` is different from the storage directories used by the base DB. (Note: this is true by default.) The patch adds a check that explicitly disallows this configuration and returns `Status::NotSupported` from `BlobDB::Open` in such cases. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6810 Test Plan: Tested using the BlobDB mode of `db_bench`. Reviewed By: riversand963 Differential Revision: D21412676 Pulled By: ltamasi fbshipit-source-id: 6630cc7481e48c8bf55d59423b25f14d52ffe681	5 years ago
anand76	ab13d43e1d	Pass a timeout to FileSystem for random reads (#6751 ) Summary: Calculate ```IOOptions::timeout``` using ```ReadOptions::deadline``` and pass it to ```FileSystem::Read/FileSystem::MultiRead```. This allows us to impose a tighter bound on the time taken by Get/MultiGet on FileSystem/Envs that support IO timeouts. Even on those that don't support them, check in ```RandomAccessFileReader::Read``` and ```MultiRead``` and return ```Status::TimedOut()``` if the deadline is exceeded. For now, TableReader creation, which might do file opens and reads, is not covered. It will be implemented in another PR. Tests: Update existing unit tests to verify the correct timeout value is being passed Pull Request resolved: https://github.com/facebook/rocksdb/pull/6751 Reviewed By: riversand963 Differential Revision: D21285631 Pulled By: anand1976 fbshipit-source-id: d89af843e5a91ece866e87aa29438b52a65a8567	5 years ago
Derrick Pallas	5272305437	Fix FilterBench when RTTI=0 (#6732 ) Summary: The dynamic_cast in the filter benchmark causes release mode to fail due to no-rtti. Replace with static_cast_with_check. Signed-off-by: Derrick Pallas <derrick@pallas.us> Addition by peterd: Remove unnecessary 2nd template arg on all static_cast_with_check Pull Request resolved: https://github.com/facebook/rocksdb/pull/6732 Reviewed By: ltamasi Differential Revision: D21304260 Pulled By: pdillinger fbshipit-source-id: 6e8eb437c4ca5a16dbbfa4053d67c4ad55f1608c	5 years ago
Peter Dillinger	31da5e34c1	C++20 compatibility (#6697 ) Summary: Based on https://github.com/facebook/rocksdb/issues/6648 (CLA Signed), but heavily modified / extended: * Implicit capture of this via [=] deprecated in C++20, and [=,this] not standard before C++20 -> now using explicit capture lists * Implicit copy operator deprecated in gcc 9 -> add explicit '= default' definition * std::random_shuffle deprecated in C++17 and removed in C++20 -> migrated to a replacement in RocksDB random.h API * Add the ability to build with different std version through -DCMAKE_CXX_STANDARD=11/14/17/20 on the cmake command line * Minimal rebuild flag of MSVC is deprecated and is forbidden with /std:c++latest (C++20) * Added MSVC 2019 C++11 & MSVC 2019 C++20 in AppVeyor * Added GCC 9 C++11 & GCC9 C++20 in Travis Pull Request resolved: https://github.com/facebook/rocksdb/pull/6697 Test Plan: make check and CI Reviewed By: cheng-chang Differential Revision: D21020318 Pulled By: pdillinger fbshipit-source-id: 12311be5dbd8675a0e2c817f7ec50fa11c18ab91	5 years ago
Cheng Chang	4fc216649d	Support direct IO in RandomAccessFileReader::MultiRead (#6446 ) Summary: By supporting direct IO in RandomAccessFileReader::MultiRead, the benefits of parallel IO (IO uring) and direct IO can be combined. In direct IO mode, read requests are aligned and merged together before being issued to RandomAccessFile::MultiRead, so blocks in the original requests might share the same underlying buffer, the shared buffers are returned in `aligned_bufs`, which is a new parameter of the `MultiRead` API. For example, suppose alignment requirement for direct IO is 4KB, one request is (offset: 1KB, len: 1KB), another request is (offset: 3KB, len: 1KB), then since they all belong to page (offset: 0, len: 4KB), `MultiRead` only reads the page with direct IO into a buffer on heap, and returns 2 Slices referencing regions in that same buffer. See `random_access_file_reader_test.cc` for more examples. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6446 Test Plan: Added a new test `random_access_file_reader_test.cc`. Reviewed By: anand1976 Differential Revision: D20097518 Pulled By: cheng-chang fbshipit-source-id: ca48a8faf9c3af146465c102ef6b266a363e78d1	5 years ago
Levi Tamasi	c15e85bdcb	Move BlobDB related files under db/ to db/blob/ (#6519 ) Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6519 Test Plan: ``` make all make check ``` Differential Revision: D20400691 Pulled By: ltamasi fbshipit-source-id: 20ef911cf1c2c92c7f71ef0b493f9be64f2eef94	5 years ago
Cheng Chang	0a0151fb99	Remove memcpy from RandomAccessFileReader::Read in direct IO mode (#6455 ) Summary: In direct IO mode, RandomAccessFileReader::Read allocates an internal aligned buffer, and then copies the result into the scratch buffer. If the result is only temporarily used inside a function, there is no need to do the memcpy and just let the result Slice refer to the internally allocated buffer. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6455 Test Plan: make check Differential Revision: D20106753 Pulled By: cheng-chang fbshipit-source-id: 44f505843837bba47a56e3fa2c4dd3bd76486b58	5 years ago
Huisheng Liu	904a60ff63	return timestamp from get (#6409 ) Summary: Added new Get() methods that return timestamp. Dummy implementation is given so that classes derived from DB don't need to be touched to provide their implementation. MultiGet is not included. ReadRandom perf test (10 minutes) on the same development machine ram drive with the same DB data shows no regression (within margin of error). The test is adapted from https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks. base line (commit `72ee067b9`): 101.712 micros/op 314602 ops/sec; 36.0 MB/s (5658999 of 5658999 found) This PR: 100.288 micros/op 319071 ops/sec; 36.5 MB/s (5674999 of 5674999 found) ./db_bench --db=r:\rocksdb.github --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --cache_size=2147483648 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=r:\rocksdb.github\WAL_LOG --sync=0 --verify_checksum=1 --delete_obsolete_files_period_micros=314572800 --max_background_compactions=4 --max_background_flushes=0 --level0_slowdown_writes_trigger=16 --level0_stop_writes_trigger=24 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --mmap_read=1 --mmap_write=0 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --duration=600 --benchmarks=readrandom --use_existing_db=1 --num=25000000 --threads=32 Pull Request resolved: https://github.com/facebook/rocksdb/pull/6409 Differential Revision: D20200086 Pulled By: riversand963 fbshipit-source-id: 490edd74d924f62bd8ae9c29c2a6bbbb8410ca50	5 years ago
sdong	fdf882ded2	Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433 ) Summary: When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for users to solve the problem, the RocksDB namespace is changed to a flag which can be overridden at build time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433 Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag. Differential Revision: D19977691 fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e	5 years ago
Levi Tamasi	1b4be4cac9	BlobDB: ignore trivially moved files when updating the SST<->blob file mapping (#6381 ) Summary: BlobDB keeps track of the mapping between SSTs and blob files using the `OnFlushCompleted` and `OnCompactionCompleted` callbacks of the `EventListener` interface: upon receiving a flush notification, a link is added between the newly flushed SST and the corresponding blob file; for compactions, links are removed for the inputs and added for the outputs. The earlier code performed this link deletion and addition even for trivially moved files; the new code walks through the two lists together (in a fashion that's similar to merge sort) and skips such files. This should mitigate https://github.com/facebook/rocksdb/issues/6338, wherein an assertion is triggered with the earlier code when a compaction notification for a trivial move precedes the flush notification for the moved SST. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6381 Test Plan: make check Differential Revision: D19773729 Pulled By: ltamasi fbshipit-source-id: ae0f273ded061110dd9334e8fb99b0d7786650b0	5 years ago
Levi Tamasi	9e3ace42a4	Add statistics for BlobDB GC (#6296 ) Summary: The patch adds statistics support to the new BlobDB garbage collection implementation; namely, it adds support for the following (pre-existing) tickers: `BLOB_DB_GC_NUM_FILES`: the number of blob files obsoleted by the GC logic. `BLOB_DB_GC_NUM_NEW_FILES`: the number of new blob files generated by the GC logic. `BLOB_DB_GC_FAILURES`: the number of failed GC passes (where a GC pass is equivalent to a (sub)compaction). `BLOB_DB_GC_NUM_KEYS_RELOCATED`: the number of blobs relocated to new blob files by the GC logic. `BLOB_DB_GC_BYTES_RELOCATED`: the total size of blobs relocated to new blob files. The tickers `BLOB_DB_GC_NUM_KEYS_OVERWRITTEN`, `BLOB_DB_GC_NUM_KEYS_EXPIRED`, `BLOB_DB_GC_BYTES_OVERWRITTEN`, `BLOB_DB_GC_BYTES_EXPIRED`, and `BLOB_DB_GC_MICROS` are not relevant for the new GC logic, and are thus marked deprecated. The patch also adds a couple of log messages that log the number and total size of blobs encountered and relocated during a GC pass, as well as the number of blob files created and obsoleted. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6296 Test Plan: Extended unit tests and used the BlobDB mode of `db_bench`. Differential Revision: D19402513 Pulled By: ltamasi fbshipit-source-id: d53d2bfbf4928a1db1e9346c67ebb9007b8932ec	5 years ago
Levi Tamasi	1dd7873e08	Remove earlier partial BlobDB GC implementation (#6278 ) Summary: In addition to removing the earlier partially implemented garbage collection logic from the BlobDB codebase, the patch also removes the test cases (as well as the related sync points, as appropriate) that were only relevant for the old implementation, and reworks the remaining ones so they use the new GC logic. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6278 Test Plan: `make check` Differential Revision: D19335226 Pulled By: ltamasi fbshipit-source-id: 0cc1794bc9892feda1426ed5522a318f3cb1b692	5 years ago
Levi Tamasi	7a7ca8eb5b	BlobDB: only compare CF IDs when checking whether an API call is for the default CF (#6226 ) Summary: BlobDB currently only supports using the default column family. The earlier code enforces this by comparing the `ColumnFamilyHandle` passed to the `Get`/`Put`/etc. call with the handle returned by `DefaultColumnFamily` (which, at the end of the day, comes from `DBImpl::default_cf_handle_`). Since other `ColumnFamilyHandle`s can also point to the default column family, this can reject legitimate requests as well. (As an example, with the earlier code, the handle returned by `BlobDB::Open` cannot actually be used in API calls.) The patch fixes this by comparing only the IDs of the column family handles instead of the pointers themselves. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6226 Test Plan: `make check` Differential Revision: D19187461 Pulled By: ltamasi fbshipit-source-id: 54ce2e12ebb1f07e6d1e70e3b1e0213dfa94bda2	5 years ago
Levi Tamasi	0d2172f128	Make it possible to enable periodic compactions for BlobDB (#6172 ) Summary: Periodic compactions ensure that even SSTs that do not get picked up otherwise eventually go through compaction; used in conjunction with BlobDB's garbage collection, they enable BlobDB to reclaim space when old blob files are used by such straggling SSTs. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6172 Test Plan: Ran `make check` and used the BlobDB mode of `db_bench`. Differential Revision: D19045045 Pulled By: ltamasi fbshipit-source-id: 04636ecc4b6cfe8d495bf656faa65d54a5eb1a93	5 years ago
anand76	afa2420c2b	Introduce a new storage specific Env API (#5761 ) Summary: The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether it's retry-able or not, scope (i.e. fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc. This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO. The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before. This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection. The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761 Differential Revision: D18868376 Pulled By: anand1976 fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f	5 years ago
Levi Tamasi	583c6953d8	Move out valid blobs from the oldest blob files during compaction (#6121 ) Summary: The patch adds logic that relocates live blobs from the oldest N non-TTL blob files as they are encountered during compaction (assuming the BlobDB configuration option `enable_garbage_collection` is `true`), where N is defined as the number of immutable non-TTL blob files multiplied by the value of a new BlobDB configuration option called `garbage_collection_cutoff`. (The default value of this parameter is 0.25, that is, by default the valid blobs residing in the oldest 25% of immutable non-TTL blob files are relocated.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/6121 Test Plan: Added unit test and tested using the BlobDB mode of `db_bench`. Differential Revision: D18785357 Pulled By: ltamasi fbshipit-source-id: 8c21c512a18fba777ec28765c88682bb1a5e694e	5 years ago
Levi Tamasi	3b607610df	Do not update SST <-> blob file mapping if compaction failed Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6156 Test Plan: Extended unit tests. Differential Revision: D18943867 Pulled By: ltamasi fbshipit-source-id: b3669d2dd6af08e987ad1a59d6712ae2514da0b1	5 years ago
Adam Retter	a61ec9ae3b	Fix BlobDB compilation on older GCC versions Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/6094 Differential Revision: D18731951 Pulled By: ltamasi fbshipit-source-id: 5b73c6009c748f6a2a48d4d880b1259980d801d4	5 years ago
Levi Tamasi	d9314a9214	Refactor and clean up the code that reads a blob from a file (#6093 ) Summary: This patch factors out the logic that reads a (potentially compressed) blob from a file into a separate helper method `GetRawBlobFromFile`, and cleans up the code a bit. Also, errors during decompression are now logged/propagated to the user by returning a `Status` code of `Corruption`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6093 Test Plan: `make check` Differential Revision: D18716673 Pulled By: ltamasi fbshipit-source-id: 44144bc064cab616862d5643f34384f2bae6eb78	5 years ago
Levi Tamasi	72daa92d3a	Refactor blob file creation logic (#6066 ) Summary: The patch refactors and cleans up the logic around creating new blob files by moving the common code of `SelectBlobFile` and `SelectBlobFileTTL` to a new helper method `CreateBlobFileAndWriter`, bringing the implementation of `SelectBlobFile` and `SelectBlobFileTTL` into sync, and increasing encapsulation by adding new constructors for `BlobFile` and `BlobLogHeader`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6066 Test Plan: Ran `make check` and used the BlobDB mode of `db_bench` to sanity test both the TTL and the non-TTL code paths. Differential Revision: D18646921 Pulled By: ltamasi fbshipit-source-id: e5705a84807932e31dccab4f49b3e64369cea26d	5 years ago
Levi Tamasi	279c488395	Mark blob files not needed by any memtables/SSTs obsolete (#6032 ) Summary: The patch adds logic to mark no longer needed blob files obsolete upon database open and whenever a flush or compaction completes. Unneeded blob files are detected by iterating through live immutable non-TTL blob files starting from the lowest-numbered one, and stopping when a blob file used by any SSTs or potentially used by memtables is found. (The latter is determined by comparing the sequence number at which the blob file became immutable with the largest sequence number received in flush notifications.) In addition, the patch cleans up the logic around closing and obsoleting blob files and enforces invariants around this area (blob files are now guaranteed to go through the stages mutable-non-obsolete, immutable-non-obsolete, and immutable-obsolete in this order). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6032 Test Plan: Extended unit tests and tested using the BlobDB mode of `db_bench`. Differential Revision: D18495610 Pulled By: ltamasi fbshipit-source-id: 11825b84af74f3f4abfd9bcae04e80870ae58961	5 years ago
Levi Tamasi	8e7aa62813	BlobDB: Maintain mapping between blob files and SSTs (#6020 ) Summary: The patch adds logic to BlobDB to maintain the mapping between blob files and SSTs for which the blob file in question is the oldest blob file referenced by the SST file. The mapping is initialized during database open based on the information retrieved using `GetLiveFilesMetaData`, and updated after flushes/compactions based on the information received through the `EventListener` interface (or, in the case of manual compactions issued through the `CompactFiles` API, the `CompactionJobInfo` object). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6020 Test Plan: Added a unit test; also tested using the BlobDB mode of `db_bench`. Differential Revision: D18410508 Pulled By: ltamasi fbshipit-source-id: dd9e778af781cfdb0d7056298c54ba9cebdd54a5	5 years ago
Sagar Vemuri	4c9aa30a62	Auto enable Periodic Compactions if a Compaction Filter is used (#5865 ) Summary: - Periodic compactions are auto-enabled if a compaction filter or a compaction filter factory is set, in Level Compaction. - The default value of `periodic_compaction_seconds` is changed to UINT64_MAX, which lets RocksDB auto-tune periodic compactions as needed. An explicit value of 0 will still work as before, i.e., to disable periodic compactions completely. For now, on seeing a compaction filter along with a UINT64_MAX value for `periodic_compaction_seconds`, RocksDB will make SST files older than 30 days go through periodic compactions. Some RocksDB users make use of compaction filters to control when their data can be deleted, usually with a custom TTL logic. But it is occasionally possible that the compactions get delayed by considerable time due to factors like low writes to a key range, data reaching bottom level, etc before the TTL expiry. Periodic Compactions feature was originally built to help such cases. Now periodic compactions are auto enabled by default when compaction filters or compaction filter factories are used, as it is generally helpful to all cases to collect garbage. `periodic_compaction_seconds` is set to a large value, 30 days, in `SanitizeOptions` when RocksDB sees that a `compaction_filter` or `compaction_filter_factory` is used. This is done only for Level Compaction style. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5865 Test Plan: - Added a new test `DBCompactionTest.LevelPeriodicCompactionWithCompactionFilters` to make sure that `periodic_compaction_seconds` is set if either `compaction_filter` or `compaction_filter_factory` options are set. - `COMPILE_WITH_ASAN=1 make check` Differential Revision: D17659180 Pulled By: sagar0 fbshipit-source-id: 4887b9cf2e53cf2dc93a7b658c6b15e1181217ee	5 years ago
Levi Tamasi	a59dc843a4	Move blob_index.h to db/ (#5919 ) Summary: Extracted from PR https://github.com/facebook/rocksdb/issues/5903 for technical reasons. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5919 Test Plan: make check Differential Revision: D17910132 Pulled By: ltamasi fbshipit-source-id: 6ecbb8d6e84b2a1d1f28575ad48ac3cc65833eb5	5 years ago
sdong	b931f84e56	Divide file_reader_writer.h and .cc (#5803 ) Summary: file_reader_writer.h and .cc contain several files and helper functions, and are hard to navigate. Separate them into multiple files and put them under file/ Pull Request resolved: https://github.com/facebook/rocksdb/pull/5803 Test Plan: Build whole project using make and cmake. Differential Revision: D17374550 fbshipit-source-id: 10efca907721e7a78ed25bbf74dc5410dea05987	5 years ago
Pratik Dhandharia	1b4c104a67	replace some reinterpret_cast with static_cast_with_check (#5740 ) Summary: This PR focuses on replacing some of the reinterpret_cast<DBImpl*> to static_cast_with_check<DBImpl, DB>. Files impacted: ./db/db_impl/db_impl_compaction_flush.cc ./db/write_batch.cc ./utilities/blob_db/blob_db_impl.cc ./utilities/transactions/pessimistic_transaction_db.cc ./utilities/transactions/transaction_base.cc ./utilities/transactions/write_prepared_txn_db.cc ./utilities/transactions/write_unprepared_txn_db.cc Pull Request resolved: https://github.com/facebook/rocksdb/pull/5740 Differential Revision: D17055691 Pulled By: pdhandharia fbshipit-source-id: 0f8034d1b32eade56e37d59c04b7bf236a81d8e8	5 years ago
Levi Tamasi	0a97125ec0	Fix data races in BlobDB (#5698 ) Summary: Some accesses to blob_files_ and open_ttl_files_ in BlobDBImpl, as well as to expiration_range_ in BlobFile were not properly synchronized. The patch fixes this and also makes sure the invariant that obsolete_files_ is a subset of blob_files_ holds even when an attempt to delete an obsolete blob file fails. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5698 Test Plan: COMPILE_WITH_TSAN=1 make blob_db_test gtest-parallel --repeat=1000 ./blob_db_test --gtest_filter="ShutdownWait" The test fails with TSAN errors ~20 times out of 1000 without the patch but completes successfully 1000 out of 1000 times with the fix. Differential Revision: D16793235 Pulled By: ltamasi fbshipit-source-id: 8034b987598d4fdc9f15098d4589cc49cde484e9	5 years ago
Vijay Nadimpalli	d150e01474	New API to get all merge operands for a Key (#5604 ) Summary: This is a new API added to db.h to allow for fetching all merge operands associated with a Key. The main motivation for this API is to support use cases where doing a full online merge is not necessary as it is performance sensitive. Example use-cases: 1. Update subset of columns and read subset of columns - Imagine a SQL Table, a row is encoded as a K/V pair (as it is done in MyRocks). If there are many columns and users only updated one of them, we can use merge operator to reduce write amplification. While users only read one or two columns in the read query, this feature can avoid a full merging of the whole row, and save some CPU. 2. Updating very few attributes in a value which is a JSON-like document - Updating one attribute can be done efficiently using merge operator, while reading back one attribute can be done more efficiently if we don't need to do a full merge. ---------------------------------------------------------------------------------------------------- API : Status GetMergeOperands( const ReadOptions& options, ColumnFamilyHandle* column_family, const Slice& key, PinnableSlice* merge_operands, GetMergeOperandsOptions* get_merge_operands_options, int* number_of_operands) Example usage : int size = 100; int number_of_operands = 0; std::vector<PinnableSlice> values(size); GetMergeOperandsOptions merge_operands_info; db_->GetMergeOperands(ReadOptions(), db_->DefaultColumnFamily(), "k1", values.data(), &merge_operands_info, &number_of_operands); Description : Returns all the merge operands corresponding to the key. If the number of merge operands in DB is greater than get_merge_operands_options->expected_max_number_of_operands, no merge operands are returned and status is Incomplete. Merge operands returned are in the order of insertion. 
merge_operands -> Points to an array of at least get_merge_operands_options->expected_max_number_of_operands elements, and the caller is responsible for allocating it. If the status returned is Incomplete then number_of_operands will contain the total number of merge operands found in DB for key. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5604 Test Plan: Added unit test and perf test in db_bench that can be run using the command: ./db_bench -benchmarks=getmergeoperands --merge_operator=sortlist Differential Revision: D16657366 Pulled By: vjnadimpalli fbshipit-source-id: 0faadd752351745224ee12d4ae9ef3cb529951bf	5 years ago
anand76	e0d9d57750	Fix bugs in WAL trash file handling (#5520 ) Summary: 1. Cleanup WAL trash files on open 2. Don't apply deletion rate limit if WAL dir is different from db dir Pull Request resolved: https://github.com/facebook/rocksdb/pull/5520 Test Plan: Add new unit tests and make check Differential Revision: D16096750 Pulled By: anand1976 fbshipit-source-id: 6f07858ad864b754b711db416f0389c45ede599b	5 years ago
sdong	58c4aee42e	TransactionUtil::CheckKey() to skip unnecessary history (#4941 ) Summary: If a memtable definitely covers a key, there isn't a need to check older memtables. We can skip them by checking the earliest sequence number. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4941 Differential Revision: D13932666 fbshipit-source-id: b9d52f234b8ad9dd3bf6547645cd457175a3ca9b	6 years ago
Siying Dong	000b9ec217	Move some logging related files to logging/ (#5387 ) Summary: Many logging related source files are under util/. It will be more structured if they are together. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5387 Differential Revision: D15579036 Pulled By: siying fbshipit-source-id: 3850134ed50b8c0bb40a0c8ae1f184fa4081303f	6 years ago
Vijay Nadimpalli	49c5a12dbe	Organizing rocksdb/db directory Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/5390 Differential Revision: D15579388 Pulled By: vjnadimpalli fbshipit-source-id: 5bfc95e31554b8ff05b97b76d6534113f527f366	6 years ago
Siying Dong	8843129ece	Move some memory related files from util/ to memory/ (#5382 ) Summary: Move arena, allocator, and memory tools under util to a separate memory/ directory. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5382 Differential Revision: D15564655 Pulled By: siying fbshipit-source-id: 9cd6b5d0d3d52b39606e19221fa154596e5852a5	6 years ago

149 Commits (03ccb1cd4227191d02b6794d9b0468d091a50860)