Summary:
In https://github.com/facebook/rocksdb/pull/6455, we modified the interface of `RandomAccessFileReader::Read` to be able to get rid of memcpy in direct IO mode.
This PR applies the new interface to `BlockFetcher` when reading blocks from SST files in direct IO mode.
Without this PR, in direct IO mode, when fetching and uncompressing compressed blocks, `BlockFetcher` first copies the raw compressed block into `BlockFetcher::compressed_buf_` or `BlockFetcher::stack_buf_` inside `RandomAccessFileReader::Read`, depending on the block size. Then, during uncompression, it copies the uncompressed block into `BlockFetcher::heap_buf_`.
In this PR, we get rid of the first memcpy and directly uncompress the block from `direct_io_buf_` to `heap_buf_`.
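As a rough sketch of the resulting read path (illustrative only; the `Read` overload and the `AlignedBuf` alias are internal, and their exact signatures are assumptions here, not quoted from the header):
```cpp
#include "file/random_access_file_reader.h"  // internal header; path assumed

// Sketch only -- approximates the post-#6455 internal interface; not the
// actual BlockFetcher code. With scratch == nullptr in direct IO mode, the
// reader fills *direct_io_buf (an aligned buffer) and points *contents into
// it, so the caller can uncompress straight from it without a memcpy.
rocksdb::Status ReadBlockNoCopy(rocksdb::RandomAccessFileReader* reader,
                                uint64_t offset, size_t n,
                                rocksdb::Slice* contents,
                                rocksdb::AlignedBuf* direct_io_buf) {
  return reader->Read(offset, n, contents, /*scratch=*/nullptr, direct_io_buf);
}
```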
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6689
Test Plan: A new unit test `block_fetcher_test` is added.
Reviewed By: anand1976
Differential Revision: D21006729
Pulled By: cheng-chang
fbshipit-source-id: 2370b92c24075692423b81277415feb2aed5d980
Summary:
After a successful recovery, the CURRENT file should be updated to point to the valid MANIFEST.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6746
Test Plan: make check
Reviewed By: anand1976
Differential Revision: D21189876
Pulled By: riversand963
fbshipit-source-id: 7537b49988c5c425ebe9505a5cc260de351ad79b
Summary:
The methods in convenience.h are used to compare/convert objects to/from strings. There is a mishmash of parameters in use here with more needed in the future. This PR replaces those parameters with a single structure.
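A minimal sketch of the consolidated style, assuming the single structure is the `ConfigOptions` used by these functions in later releases (the struct name and its fields are assumptions, not quoted from this PR):
```cpp
#include <string>

#include "rocksdb/convenience.h"
#include "rocksdb/options.h"

// Flags that used to be separate bool parameters (e.g. input_strings_escaped,
// ignore_unknown_options) now travel together in one structure.
rocksdb::Status ParseCfOptions(const rocksdb::ColumnFamilyOptions& base,
                               rocksdb::ColumnFamilyOptions* out) {
  rocksdb::ConfigOptions config_options;
  config_options.input_strings_escaped = true;
  config_options.ignore_unknown_options = false;
  return rocksdb::GetColumnFamilyOptionsFromString(
      config_options, base, "write_buffer_size=1048576", out);
}
```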
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6389
Reviewed By: siying
Differential Revision: D21163707
Pulled By: zhichao-cao
fbshipit-source-id: f807b4cc7e2b0af3871536b69546b2604dfa81bd
Summary:
1. Add changes so that max_background_flushes can be set dynamically.
2. Add a testcase DBOptionsTest.SetBackgroundFlushThreads which sets
max_background_flushes dynamically using SetDBOptions (see the sketch below).
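A minimal sketch of the runtime change the testcase exercises, via the existing `DB::SetDBOptions()` string-map API:
```cpp
#include "rocksdb/db.h"

// Raise max_background_flushes on a live DB; before this change the option
// could only be set at open time.
rocksdb::Status BumpFlushThreads(rocksdb::DB* db) {
  return db->SetDBOptions({{"max_background_flushes", "4"}});
}
```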
Test Plan:
1. make -j64 check
2. Run the new testcase DBOptionsTest.SetBackgroundFlushThreads
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6701
Reviewed By: ajkr
Differential Revision: D21028010
Pulled By: akankshamahajan15
fbshipit-source-id: 5f949e4a8fd3c32537b637947b7ee09a69cfc7c1
Summary:
IsDirectory() is a common API to check whether a path is a regular file or
directory.
POSIX: call stat() and use S_ISDIR(st_mode)
Windows: PathIsDirectoryA() and PathIsDirectoryW()
HDFS: FileSystem.IsDirectory()
Java: File.IsDirectory()
...
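For instance, the POSIX strategy above boils down to (standalone sketch, not the actual Env implementation):
```cpp
#include <sys/stat.h>

// POSIX sketch: stat() the path and test S_ISDIR on st_mode. The real
// Env/FileSystem method also reports IO errors separately.
bool IsDirectoryPosix(const char* path) {
  struct stat sbuf;
  if (stat(path, &sbuf) != 0) {
    return false;  // path missing or unreadable
  }
  return S_ISDIR(sbuf.st_mode);
}
```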
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6711
Test Plan: make check
Reviewed By: anand1976
Differential Revision: D21053520
Pulled By: riversand963
fbshipit-source-id: 680aadfd8ce982b63689190cf31b3145d5a89e27
Summary:
Context: Index type `kBinarySearchWithFirstKey` added the ability for the sst file iterator to sometimes report a key from the index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report an error. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and the kBinarySearchWithFirstKey implementation was considered a prototype.
Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.
It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.
Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.
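The intended call pattern looks roughly like this (a sketch; `PrepareValue()` and `allow_unprepared_value` are the names introduced above, but `InternalIterator` is internal and exact signatures may differ):
```cpp
// With allow_unprepared_value set when the internal iterator was created,
// value() is only valid after a successful PrepareValue(); on failure the
// IO error / Status::Incomplete is surfaced via status().
void ScanWithDeferredValues(rocksdb::InternalIterator* iter,
                            const rocksdb::Slice& target) {
  for (iter->Seek(target); iter->Valid(); iter->Next()) {
    if (!iter->PrepareValue()) {
      break;  // deferred block read failed; inspect iter->status()
    }
    rocksdb::Slice value = iter->value();  // safe to use now
    (void)value;
  }
}
```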
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621
Test Plan: make -j5 check. Will also deploy to some logdevice test clusters and look at stats.
Reviewed By: siying
Differential Revision: D20786930
Pulled By: al13n321
fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee
Summary:
Add NewFileChecksumGenCrc32cFactory to the file checksum public interface such that applications can use the built-in crc32c checksum factory.
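Usage is a one-liner (sketch; the factory function name is from this summary, while the `file_checksum_gen_factory` options field is an assumption):
```cpp
#include "rocksdb/file_checksum.h"
#include "rocksdb/options.h"

// Wire the built-in crc32c checksum factory into the DB options.
void EnableCrc32cFileChecksums(rocksdb::Options* options) {
  options->file_checksum_gen_factory =
      rocksdb::NewFileChecksumGenCrc32cFactory();
}
```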
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6688
Test Plan: pass make asan_check
Reviewed By: riversand963
Differential Revision: D21006859
Pulled By: zhichao-cao
fbshipit-source-id: ea8a45196a8b77c310728ab05f6cc0f49f3baef0
Summary:
In index blocks since `format_version=3`, user keys are written
rather than internal keys. When reading such blocks, the comparator is
obtained via `InternalKeyComparator::user_comparator()`. That function
must not return an unwrapped result as the wrapper class provides
accounting logic to populate `PerfContext::user_key_comparison_count`.
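To illustrate why the wrapper matters, here is a simplified stand-in (not the actual UserComparatorWrapper): every comparison bumps the perf counter, so returning the raw user comparator would silently skip the accounting.
```cpp
#include <string>

#include "rocksdb/comparator.h"
#include "rocksdb/perf_context.h"

class CountingComparator : public rocksdb::Comparator {
 public:
  explicit CountingComparator(const rocksdb::Comparator* user_cmp)
      : user_cmp_(user_cmp) {}
  const char* Name() const override { return user_cmp_->Name(); }
  int Compare(const rocksdb::Slice& a,
              const rocksdb::Slice& b) const override {
    rocksdb::get_perf_context()->user_key_comparison_count++;  // accounting
    return user_cmp_->Compare(a, b);
  }
  void FindShortestSeparator(std::string* start,
                             const rocksdb::Slice& limit) const override {
    user_cmp_->FindShortestSeparator(start, limit);
  }
  void FindShortSuccessor(std::string* key) const override {
    user_cmp_->FindShortSuccessor(key);
  }

 private:
  const rocksdb::Comparator* user_cmp_;
};
```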
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6650
Test Plan:
ran db_bench and verified
`PerfContext::user_key_comparison_count` became larger.
Reviewed By: cheng-chang
Differential Revision: D20866325
Pulled By: ajkr
fbshipit-source-id: ad755d46bda31157dacc5b66e532279f19ad538c
Summary:
Although these optimizations are not user facing, we still feel it's valuable to call them out in HISTORY.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6679
Test Plan: no need
Reviewed By: zhichao-cao
Differential Revision: D20945916
Pulled By: cheng-chang
fbshipit-source-id: f3e790c07f3bcc4a8a74246c4fa232800ddd4438
Summary:
On reading an ingested SST file, `DataBlockIter` will replace the seqno encoded in a key with the global seqno. However, if the original seqno was part of the prefix used for the next key, the global seqno is by mistake used as part of the prefix to construct the next key, causing a wrong result to be returned. Although at this point it is only a software error while data in the file is not corrupted, the issue can further cause compaction output to be out of order and corrupted results when the ingested SST participates in compaction. Fix the issue by saving the actual seqno and restoring it before the key is used as the prefix to construct the next key.
The unit test is by Little-Wallace from https://github.com/facebook/rocksdb/issues/6666. Fixing https://github.com/facebook/rocksdb/issues/6666.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6669
Test Plan:
New unit test
Signed-off-by: Yi Wu <yiwu@pingcap.com>
Reviewed By: cheng-chang
Differential Revision: D20931808
Pulled By: ajkr
fbshipit-source-id: f01959c35d6a493954dca981663766c7a5a9e8ab
Summary:
… to CFOptions
https://github.com/facebook/rocksdb/pull/6615 made several compression related options dynamically changeable. They were moved to MutableCFOptions. However, they are not copied back to ColumnFamilyOptions, so the changed values are not written to option files and are unavailable for some other uses. Fix it by copying them back.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6668
Test Plan: Add a unit test to make sure that when a MutableCFOptions is converted to CFOptions and back to MutableCFOptions, they stay the same. This test would fail without the fix.
Reviewed By: ajkr
Differential Revision: D20923999
fbshipit-source-id: c3bccd6923b00d677764e2269bed6a95ad7ed780
Summary:
This PR adds support for pipelined & parallel compression optimization for `BlockBasedTableBuilder`. This optimization makes block building, block compression and block appending a pipeline, and uses multiple threads to accelerate block compression. Users can set `CompressionOptions::parallel_threads` greater than 1 to enable compression parallelism.
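Enabling it is a single option (minimal sketch):
```cpp
#include "rocksdb/options.h"

// Values > 1 turn on the pipelined build/compress/append path, with this
// many compression threads per table builder.
void EnableParallelCompression(rocksdb::Options* options) {
  options->compression_opts.parallel_threads = 4;
}
```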
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6262
Reviewed By: ajkr
Differential Revision: D20651306
fbshipit-source-id: 62125590a9c15b6d9071def9dc72589c1696a4cb
Summary:
https://github.com/facebook/rocksdb/issues/6247 reports that when the write buffer manager fails to insert the dummy entry into the block cache, a null pointer is still stored and later used to release the handle, causing corruption. Fix the bug by not releasing the entry when the handle is null.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6619
Test Plan: Add a unit test that fails without the fix.
Reviewed By: ajkr
Differential Revision: D20776769
fbshipit-source-id: 4127fbd9f295a0a3e45774746ffcd91f939f6287
Summary:
These three options should be made dynamically changeable. Simply add them to MutableCFOptions and make the change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6615
Test Plan: Add a unit test to make sure that SetOptions() can change the options.
Reviewed By: riversand963
Differential Revision: D20755951
fbshipit-source-id: 8165f4fd7a7a665cc7fb049698935022a5d2e7ff
Summary:
For FIFO compaction, we use flush time instead of oldest key time as the
creation time. This is to prevent FIFO compaction from dropping files whose oldest
key time is older than TTL but which contain keys newer than TTL.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6612
Test Plan: make check
Reviewed By: siying
Differential Revision: D20748217
Pulled By: riversand963
fbshipit-source-id: 3f7b00a847020760537cdddd12f6fe039e5bc663
Summary:
In the current implementation, the sst file checksum is calculated by a shared checksum function object, which may make some checksum functions, such as SHA1, hard to apply here. In this implementation, each sst file will have its own checksum generator object, created by FileChecksumGenFactory. Users need to implement their own FileChecksumGenerator and Factory to plug in their checksum calculation method.
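A rough sketch of the plug-in surface (assuming `FileChecksumGenerator`/`FileChecksumGenFactory` virtuals named Update/Finalize/GetChecksum/Name; a real generator would wrap an actual digest such as SHA1):
```cpp
#include <cstddef>
#include <memory>
#include <string>

#include "rocksdb/file_checksum.h"

// Skeleton generator: feed bytes via Update(), seal via Finalize(), then
// expose the digest via GetChecksum(). Hash internals are elided.
class Sha1ChecksumGen : public rocksdb::FileChecksumGenerator {
 public:
  void Update(const char* data, size_t n) override {
    // feed (data, n) into the hash state here
  }
  void Finalize() override {
    checksum_ = "";  // write the finished digest into checksum_
  }
  std::string GetChecksum() const override { return checksum_; }
  const char* Name() const override { return "Sha1"; }

 private:
  std::string checksum_;
};

class Sha1ChecksumGenFactory : public rocksdb::FileChecksumGenFactory {
 public:
  std::unique_ptr<rocksdb::FileChecksumGenerator> CreateFileChecksumGenerator(
      const rocksdb::FileChecksumGenContext& /*context*/) override {
    return std::unique_ptr<rocksdb::FileChecksumGenerator>(
        new Sha1ChecksumGen());
  }
  const char* Name() const override { return "Sha1ChecksumGenFactory"; }
};
```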
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6600
Test Plan: tested with make asan_check
Reviewed By: riversand963
Differential Revision: D20717670
Pulled By: zhichao-cao
fbshipit-source-id: 2a74c1c280ac11a07a1980185b43b671acaa71c6
Summary:
When creating a database backup, the background threads will not only consume IO resources by copying files, but also consume CPU, such as by computing checksums. During peak times, the CPU consumption by the background threads might affect online queries.
This PR makes it possible to decrease CPU priority of these threads when creating a new backup.
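A sketch of the intended usage (the `CreateBackupOptions` fields shown are assumptions based on this description, not quoted from the header):
```cpp
#include "rocksdb/utilities/backupable_db.h"

// Create a backup with the background work demoted to low CPU priority.
rocksdb::Status BackupWithLowCpuPriority(rocksdb::BackupEngine* engine,
                                         rocksdb::DB* db) {
  rocksdb::CreateBackupOptions options;
  options.decrease_background_thread_cpu_priority = true;  // assumed field
  options.background_thread_cpu_priority = rocksdb::CpuPriority::kLow;
  return engine->CreateNewBackup(options, db);
}
```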
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6602
Test Plan: make check
Reviewed By: siying, zhichao-cao
Differential Revision: D20683216
Pulled By: cheng-chang
fbshipit-source-id: 9978b9ed9488e8ce135e90ca083e5b4b7221fd84
Summary:
Version 4 has been around long enough, for compatibility and
extensive validation, that it should be default.
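For reference, the previously opt-in form (now redundant, since 4 is the default):
```cpp
#include "rocksdb/options.h"
#include "rocksdb/table.h"

// Explicitly pin format_version=4 in the block-based table options.
rocksdb::Options MakeOptions() {
  rocksdb::BlockBasedTableOptions table_options;
  table_options.format_version = 4;  // now the default after this change
  rocksdb::Options options;
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));
  return options;
}
```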
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6582
Test Plan:
CI (w.r.t. changing the default; format_version=4 is well
tested and massively in production at Facebook)
Reviewed By: siying
Differential Revision: D20625233
Pulled By: pdillinger
fbshipit-source-id: 2f83ed874cffa4a39bc7a66cdf3833b978fbb948
Summary:
When applying a new version in the non-DB-open case, optimize_filters_for_hits is used for max_threads, which is clearly a bug. It is not clear what the intended value was in the first place, but the value 1 makes sense here, as it creates no extra threads. This bug is not expected to cause user-visible problems, since C++ implicitly converts bool to 0 or 1.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6576
Test Plan: Run all existing tests.
Reviewed By: ajkr
Differential Revision: D20602467
fbshipit-source-id: 40b2cd8619aba09ae9242b36c415464db3c9b737
Summary:
There are situations when RocksDB tries to recover, but the db is in an inconsistent state due to SST files referenced in the MANIFEST being missing. In this case, RocksDB previously just failed the recovery and returned a non-ok status.
This PR enables another possibility. During recovery, RocksDB checks possible MANIFEST files and tries to recover to the most recent state without missing table files. `VersionSet::Recover()` applies version edits incrementally and "materializes" a version only when this version does not reference any missing table file. After processing the entire MANIFEST, the version created last will be the latest version.
`DBImpl::Recover()` calls `VersionSet::Recover()`. Afterwards, WAL replay will *not* be performed.
To use this capability, set `options.best_efforts_recovery = true` when opening the db. Best-efforts recovery is currently incompatible with atomic flush.
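A minimal open-time sketch:
```cpp
#include <string>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

// Open with best-efforts recovery: recover to the most recent version that
// references no missing table files; WALs are not replayed afterwards.
rocksdb::Status OpenWithBestEffortsRecovery(const std::string& dbname,
                                            rocksdb::DB** dbptr) {
  rocksdb::Options options;
  options.best_efforts_recovery = true;
  return rocksdb::DB::Open(options, dbname, dbptr);
}
```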
Test plan (on devserver):
```
$make check
$COMPILE_WITH_ASAN=1 make all && make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6334
Reviewed By: anand1976
Differential Revision: D19778960
Pulled By: riversand963
fbshipit-source-id: c27ea80f29bc952e7d3311ecf5ee9c54393b40a8
Summary:
When users fail to open a DB due to a file lock failure, it is sometimes hard for them to debug. We now include the time the lock was acquired and the thread ID that acquired it, to help users debug problems like this. The default Env's thread ID is used.
Since the type of lockedFiles is changed, it is also renamed to follow the naming convention.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6507
Test Plan: Add a unit test and improve an existing test to validate the case.
Differential Revision: D20378333
fbshipit-source-id: 312fe0e9733fd1d1e9969c321b90ce523cf4708a
Summary:
Preliminary support for iterator with user timestamp. The current implementation does not cover the merge operator or reverse iteration. Auto compaction is also disabled in unit tests.
Create an iterator with timestamp.
```
...
read_opts.timestamp = &ts;
auto* iter = db->NewIterator(read_opts);
// target is key without timestamp.
for (iter->Seek(target); iter->Valid(); iter->Next()) {}
for (iter->SeekToFirst(); iter->Valid(); iter->Next()) {}
delete iter;
read_opts.timestamp = &ts1;
// lower_bound and upper_bound are without timestamp.
read_opts.iterate_lower_bound = &lower_bound;
read_opts.iterate_upper_bound = &upper_bound;
auto* iter1 = db->NewIterator(read_opts);
// Do Seek or SeekToFirst()
delete iter1;
```
Test plan (dev server)
```
$make check
```
Simple benchmarking (dev server)
1. The overhead introduced by this PR even when timestamp is disabled.
key size: 16 bytes
value size: 100 bytes
Entries: 1000000
Data reside in main memory, and try to stress iterator.
Repeated three times on master and this PR.
- Seek without next
```
./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3
```
master: 159047.0 ops/sec
this PR: 158922.3 ops/sec (~0.1% drop in throughput)
- Seek and next 10 times
```
./db_bench -db=/dev/shm/rocksdbtest-1000 -benchmarks=fillseq,seekrandom -enable_pipelined_write=false -disable_wal=true -format_version=3 -seek_nexts=10
```
master: 109539.3 ops/sec
this PR: 107519.7 ops/sec (2% drop in throughput)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6255
Differential Revision: D19438227
Pulled By: riversand963
fbshipit-source-id: b66b4979486f8474619f4aa6bdd88598870b0746
Summary:
In most places in the code the variable names are spelled correctly as
COMMITTED, but in a couple of places they are not. This fixes them and ensures the
variable is consistently called COMMITTED everywhere.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6481
Differential Revision: D20306776
Pulled By: pdillinger
fbshipit-source-id: b6c1bfe41db559b4bc6955c530934460c07f7022
Summary:
In CompactRange, if there is no key in the memtable falling in the specified range, the flush is skipped.
This PR extends this skipping logic to SST file levels: it starts compaction from the highest level (starting from L0) that has files with keys falling in the specified range, instead of always starting from L0.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6482
Test Plan:
A new test ManualCompactionTest::SkipLevel is added.
Also updated a test in db_test2 related to index block cache hit statistics: the index cache hit count is increased by 1 in this PR because, when checking overlap for the key range in L0, OverlapWithLevelIterator will do a seek in the table cache iterator, which will read from the cached index.
Also updated db_compaction_test and db_test to use correct range for full compaction.
Differential Revision: D20251149
Pulled By: cheng-chang
fbshipit-source-id: f822157cf4796972bd5035d9d7178d8dfb7af08b
Summary:
When DBImpl::GetCreationTimeOfOldestFile() calls Version::GetCreationTimeOfOldestFile(), the version is not directly or indirectly referenced, so an event like compaction can race with the operation and cause DBImpl::GetCreationTimeOfOldestFile() to access deallocated data. This was caught by an ASAN run:
==268==ERROR: AddressSanitizer: heap-use-after-free on address 0x612000b7d198 at pc 0x000018332913 bp 0x7f391510d310 sp 0x7f391510d308
READ of size 8 at 0x612000b7d198 thread T845 (store_load-33)
SCARINESS: 51 (8-byte-read-heap-use-after-free)
#0 0x18332912 in rocksdb::Version::GetCreationTimeOfOldestFile(unsigned long*) rocksdb/src/db/version_set.cc:1488
#1 0x1803ddaa in rocksdb::DBImpl::GetCreationTimeOfOldestFile(unsigned long*) rocksdb/src/db/db_impl/db_impl.cc:4499
#2 0xe24ca09 in rocksdb::StackableDB::GetCreationTimeOfOldestFile(unsigned long*) rocksdb/utilities/stackable_db.h:392
......
0x612000b7d198 is located 216 bytes inside of 296-byte region [0x612000b7d0c0,0x612000b7d1e8)
freed by thread T28 here:
......
#5 0x1832c73f in std::vector<rocksdb::FileMetaData*, std::allocator<rocksdb::FileMetaData*> >::~vector() third-party-buck/platform007/build/libgcc/include/c++/trunk/bits/stl_vector.h:435
#6 0x1832c73f in rocksdb::VersionStorageInfo::~VersionStorageInfo() rocksdb/src/db/version_set.cc:734
#7 0x1832cf42 in rocksdb::Version::~Version() rocksdb/src/db/version_set.cc:758
#8 0x9d1bb5 in rocksdb::Version::Unref() rocksdb/src/db/version_set.cc:2869
#9 0x183e7631 in rocksdb::Compaction::~Compaction() rocksdb/src/db/compaction/compaction.cc:275
#10 0x9e6de6 in std::default_delete<rocksdb::Compaction>::operator()(rocksdb::Compaction*) const third-party-buck/platform007/build/libgcc/include/c++/trunk/bits/unique_ptr.h:78
#11 0x9e6de6 in std::unique_ptr<rocksdb::Compaction, std::default_delete<rocksdb::Compaction> >::reset(rocksdb::Compaction*) third-party-buck/platform007/build/libgcc/include/c++/trunk/bits/unique_ptr.h:376
#12 0x9e6de6 in rocksdb::DBImpl::BackgroundCompaction(bool*, rocksdb::JobContext*, rocksdb::LogBuffer*, rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority) rocksdb/src/db/db_impl/db_impl_compaction_flush.cc:2826
#13 0x9ac3b8 in rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction*, rocksdb::Env::Priority) rocksdb/src/db/db_impl/db_impl_compaction_flush.cc:2320
#14 0x9abff7 in rocksdb::DBImpl::BGWorkCompaction(void*) rocksdb/src/db/db_impl/db_impl_compaction_flush.cc:2096
......
Fix the issue by referencing the super version and using the referenced version from it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6473
Test Plan: Run ASAN for all existing tests.
Differential Revision: D20196416
fbshipit-source-id: 5f4a7918110fc7b8dd7841932d376bc9d1e59d6f
Summary:
Original author: jeffrey-xiao
If we are writing a global seqno for an ingested file, the range
tombstone metablock gets accessed and put into the cache during
ingestion preparation. At the time, the global seqno of the ingested
file has not yet been determined, so the cached block will not have a
global seqno. When the file is ingested and we read its range tombstone
metablock, it will be returned from the cache with no global seqno. In
that case, we use the actual seqnos stored in the range tombstones,
which are all zero, so the tombstones cover nothing.
This commit removes the global_seqno_ variable from Block. When iterating
over a block, the global seqno for the block is determined by the
iterator instead of being stored as a mutable attribute in Block.
Additionally, this commit adds a regression test to check that keys are
deleted when ingesting a file with a global seqno and range deletion
tombstones.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6429
Differential Revision: D19961563
Pulled By: ajkr
fbshipit-source-id: 5cf777397fa3e452401f0bf0364b0750492487b7
Summary:
Currently, a new MANIFEST file is assigned a new file number when 1) no
MANIFEST is open, or 2) current MANIFEST file size exceeds a threshold. This is
not sufficient. There are cases when the caller explicitly specifies that a new
MANIFEST be created. For example, if user sets options.write_dbid_to_manifest = true,
and there are WAL files, then RocksDB will run into an issue during recovery.
`DBImpl::Recover()` will call `LogAndApply()` to write dbid. At this point, the db being
recovered creates a new MANIFEST, say, MANIFEST-000003. Since there are WALs,
`DBImpl::RecoverLogFiles` will be called. Towards the end of this function, we call
`LogAndApply(new_descriptor_log=true)`, which explicitly creates a new MANIFEST.
However, the manifest_file_number is wrong before this fix. Consequently, RocksDB
opens an existing, non-empty file for append, effectively truncating the file to zero.
If a crash occurs, then there will be data loss.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6426
Test Plan: make check
Differential Revision: D19951866
Pulled By: riversand963
fbshipit-source-id: 4b1b9fc28d4fe2ac12764b388ef9e61f05e766da
Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To give users a tool to solve the problem, the RocksDB namespace is changed to a macro, ROCKSDB_NAMESPACE, which can be overridden at build time.
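As a sketch, the override is supplied at compile time and code refers to the macro instead of the literal namespace (the example namespace value is hypothetical):
```cpp
// Built with, e.g.: g++ -DROCKSDB_NAMESPACE=my_vendored_rocksdb ...
// When the flag is absent, ROCKSDB_NAMESPACE defaults to rocksdb.
#include <string>

#include "rocksdb/db.h"

ROCKSDB_NAMESPACE::Status OpenUnderCustomNamespace(
    const std::string& name, ROCKSDB_NAMESPACE::DB** dbptr) {
  ROCKSDB_NAMESPACE::Options options;
  options.create_if_missing = true;
  return ROCKSDB_NAMESPACE::DB::Open(options, name, dbptr);
}
```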
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433
Test Plan: Build release, all and jtest. Tried building with ROCKSDB_NAMESPACE overridden to another value.
Differential Revision: D19977691
fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
Summary:
We have discovered bugs related to io_uring, so turn it off by default.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6405
Test Plan: Manually run build_tools/build_detect_platform and observe outputs.
Differential Revision: D19862792
fbshipit-source-id: 5d5e8e2762997b72a145ae59389ef3d7e4ccd060
Summary:
Unrevert the previous fix to propagate error status, and an additional fix to not treat a memtable lookup MergeInProgress status as an error.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6403
Test Plan:
Unit tests
Tried running stress tests but couldn't repro the stress failure
Differential Revision: D19846721
Pulled By: anand1976
fbshipit-source-id: 7db10cccbdc863d9b559497f0a46b608d2488ca4
Summary:
It is very useful to support direct ByteBuffers in Java. It allows zero memory copies, and some serializers use them directly, so one does not need to create a byte[] array for them.
This change also contains some fixes for Windows JNI build.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/2283
Differential Revision: D19834971
Pulled By: pdillinger
fbshipit-source-id: 44173aa02afc9836c5498c592fd1ea95b6086e8e
Summary:
This reverts commit d70011bccc. The commit is causing some stress test failure due to unexpected Status::MergeInProgress() return for some keys.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6401
Differential Revision: D19826623
Pulled By: anand1976
fbshipit-source-id: edd634cede9cb7bdd2cb8f46e662ea709b16d2f1
Summary:
In the current code base, RocksDB generates a checksum for each block and verifies the checksum at usage. This PR enables SST file checksums. After an SST file is generated by Flush or Compaction, RocksDB generates the SST file checksum and stores the checksum value and checksum method name in the vs_info and MANIFEST as part of the FileMetadata.
Added enable_sst_file_checksum to Options to enable or disable file checksums. Added sst_file_checksum to Options such that users can plug in their own SST file checksum calculation method by overriding the SstFileChecksum class. The checksum information includes a uint32_t checksum value and a checksum name (string). A new tool is added to LDB such that users can dump out a list of file checksum information from the MANIFEST. If the user enables file checksums but does not provide an sst_file_checksum instance, RocksDB will use the default crc32c checksum implemented in table/sst_file_checksum_crc32c.h
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6216
Test Plan: Added test cases in table_test and ldb_cmd_test to verify that the checksum is correct at different levels. Passes make asan_check.
Differential Revision: D19171461
Pulled By: zhichao-cao
fbshipit-source-id: b2e53479eefc5bb0437189eaa1941670e5ba8b87
Summary:
Currently, any IO errors and checksum mismatches while reading data
blocks are being ignored by the batched MultiGet. It's only looking at
the GetContext state. Fix that.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6387
Test Plan: Add unit tests
Differential Revision: D19799819
Pulled By: anand1976
fbshipit-source-id: 46133dccbb04e64067b9fe6cda73e282203db969
Summary:
Right now, when reading from option files, no readahead is used and an 8KB buffer is used. It might introduce high latency if the file system has high latency and doesn't do readahead. Instead, introduce readahead to the file. When called inside the DB, infer the value from options.log_readahead_size. Otherwise, a default 512KB readahead size is used.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6372
Test Plan: Add --log_readahead_size in db_bench. Run it with several options and observe read size from option files using strace.
Differential Revision: D19727739
fbshipit-source-id: e6d8053b0a64259abc087f1f388b9cd66fa8a583
Summary:
BlobDB keeps track of the mapping between SSTs and blob files using
the `OnFlushCompleted` and `OnCompactionCompleted` callbacks of
the `EventListener` interface: upon receiving a flush notification, a link
is added between the newly flushed SST and the corresponding blob file;
for compactions, links are removed for the inputs and added for the outputs.
The earlier code performed this link deletion and addition even for
trivially moved files; the new code walks through the two lists together
(in a fashion that's similar to merge sort) and skips such files.
This should mitigate https://github.com/facebook/rocksdb/issues/6338,
wherein an assertion is triggered with the earlier code when a compaction
notification for a trivial move precedes the flush notification for the
moved SST.
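A generic illustration of the merge-sort-style walk (not the actual BlobDB code): file numbers present in both sorted lists are trivially moved, so neither an unlink nor a link is recorded for them.
```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

void DiffSortedFileLists(const std::vector<uint64_t>& inputs,   // sorted
                         const std::vector<uint64_t>& outputs,  // sorted
                         std::vector<uint64_t>* unlinked,
                         std::vector<uint64_t>* linked) {
  std::size_t i = 0, o = 0;
  while (i < inputs.size() && o < outputs.size()) {
    if (inputs[i] == outputs[o]) {
      ++i;  // trivially moved: present on both sides, skip entirely
      ++o;
    } else if (inputs[i] < outputs[o]) {
      unlinked->push_back(inputs[i++]);  // compaction input: remove link
    } else {
      linked->push_back(outputs[o++]);  // compaction output: add link
    }
  }
  for (; i < inputs.size(); ++i) unlinked->push_back(inputs[i]);
  for (; o < outputs.size(); ++o) linked->push_back(outputs[o]);
}
```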
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6381
Test Plan: make check
Differential Revision: D19773729
Pulled By: ltamasi
fbshipit-source-id: ae0f273ded061110dd9334e8fb99b0d7786650b0
Summary:
When paranoid_checks is on, DBImpl::CheckConsistency() iterates over all sst files and calls Env::GetFileSize() for each of them. As far as I could understand, this is pretty arbitrary and doesn't affect correctness - if the filesystem doesn't corrupt fsynced files, the file sizes will always match; if it does, it may as well corrupt contents as well as sizes, and rocksdb doesn't check contents on open.
If there are thousands of sst files, getting all their sizes takes a while. If, on top of that, Env is overridden to use some remote storage instead of the local filesystem, it can be *really* slow and overload the remote storage service. This PR adds an option to not do GetFileSize(); instead it does GetChildren() for the parent directory to check that all the expected sst files are at least present, but doesn't check their sizes.
We can't just disable paranoid_checks instead because paranoid_checks does a few other important things: making the DB read-only on write errors, printing error messages on read errors, etc.
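A sketch of opting in; the option name used here is an assumption for illustration (see the PR for the actual knob):
```cpp
#include "rocksdb/options.h"

// Keep paranoid_checks (read-only on write errors, error messages on read
// errors, etc.) but skip the per-file GetFileSize() pass on open.
void SkipSstSizeChecksOnOpen(rocksdb::Options* options) {
  options->paranoid_checks = true;
  options->skip_checking_sst_file_sizes_on_db_open = true;  // assumed name
}
```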
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6353
Test Plan: ran the added sanity check unit test. Will try it out in a LogDevice test cluster where the GetFileSize() calls are causing a lot of trouble.
Differential Revision: D19656425
Pulled By: al13n321
fbshipit-source-id: c2c421b367633033760d1f56747bad206d1fbf82
Summary:
This is a redesign of the API for RocksJava comparators with the aim of improving performance. It also simplifies the class hierarchy.
**NOTE**: This breaks backwards compatibility for existing 3rd party Comparators implemented in Java... so we need to consider carefully which release branches this goes into.
Previously when implementing a comparator in Java the developer had a choice of subclassing either `DirectComparator` or `Comparator`, which would use direct and non-direct byte-buffers respectively (via `DirectSlice` and `Slice`).
In this redesign we have eliminated the overhead of using the Java Slice classes, and just use `ByteBuffer`s. The `ComparatorOptions` supplied when constructing a Comparator allow you to choose between direct and non-direct byte buffers by setting `useDirect`.
In addition, the `ComparatorOptions` now allow you to choose whether a ByteBuffer is reused over multiple comparator calls, by setting `maxReusedBufferSize > 0`. When buffers are reused, ComparatorOptions provides a choice of mutex type by setting `useAdaptiveMutex`.
---
[JMH benchmarks previously indicated](https://github.com/facebook/rocksdb/pull/6241#issue-356398306) that the difference between C++ and Java for implementing a comparator was ~7x slowdown in Java.
With these changes, when reusing buffers and guarding access to them via mutexes the slowdown is approximately the same. However, these changes offer a new facility to not reuse mutexes, which reduces the slowdown to ~5.5x in Java. We also offer a `thread_local` mechanism for reusing buffers, which reduces the slowdown to ~5.2x in Java (closes https://github.com/facebook/rocksdb/pull/4425).
These changes also form a good base for further optimisation work such as further JNI lookup caching, and JNI critical.
---
These numbers were captured without jemalloc. With jemalloc, the performance improves for all tests, and the Java slowdown reduces to between 4.8x and 5.x.
```
ComparatorBenchmarks.put native_bytewise thrpt 25 124483.795 ± 2032.443 ops/s
ComparatorBenchmarks.put native_reverse_bytewise thrpt 25 114414.536 ± 3486.156 ops/s
ComparatorBenchmarks.put java_bytewise_non-direct_reused-64_adaptive-mutex thrpt 25 17228.250 ± 1288.546 ops/s
ComparatorBenchmarks.put java_bytewise_non-direct_reused-64_non-adaptive-mutex thrpt 25 16035.865 ± 1248.099 ops/s
ComparatorBenchmarks.put java_bytewise_non-direct_reused-64_thread-local thrpt 25 21571.500 ± 871.521 ops/s
ComparatorBenchmarks.put java_bytewise_direct_reused-64_adaptive-mutex thrpt 25 23613.773 ± 8465.660 ops/s
ComparatorBenchmarks.put java_bytewise_direct_reused-64_non-adaptive-mutex thrpt 25 16768.172 ± 5618.489 ops/s
ComparatorBenchmarks.put java_bytewise_direct_reused-64_thread-local thrpt 25 23921.164 ± 8734.742 ops/s
ComparatorBenchmarks.put java_bytewise_non-direct_no-reuse thrpt 25 17899.684 ± 839.679 ops/s
ComparatorBenchmarks.put java_bytewise_direct_no-reuse thrpt 25 22148.316 ± 1215.527 ops/s
ComparatorBenchmarks.put java_reverse_bytewise_non-direct_reused-64_adaptive-mutex thrpt 25 11311.126 ± 820.602 ops/s
ComparatorBenchmarks.put java_reverse_bytewise_non-direct_reused-64_non-adaptive-mutex thrpt 25 11421.311 ± 807.210 ops/s
ComparatorBenchmarks.put java_reverse_bytewise_non-direct_reused-64_thread-local thrpt 25 11554.005 ± 960.556 ops/s
ComparatorBenchmarks.put java_reverse_bytewise_direct_reused-64_adaptive-mutex thrpt 25 22960.523 ± 1673.421 ops/s
ComparatorBenchmarks.put java_reverse_bytewise_direct_reused-64_non-adaptive-mutex thrpt 25 18293.317 ± 1434.601 ops/s
ComparatorBenchmarks.put java_reverse_bytewise_direct_reused-64_thread-local thrpt 25 24479.361 ± 2157.306 ops/s
ComparatorBenchmarks.put java_reverse_bytewise_non-direct_no-reuse thrpt 25 7942.286 ± 626.170 ops/s
ComparatorBenchmarks.put java_reverse_bytewise_direct_no-reuse thrpt 25 11781.955 ± 1019.843 ops/s
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6252
Differential Revision: D19331064
Pulled By: pdillinger
fbshipit-source-id: 1f3b794e6a14162b2c3ffb943e8c0e64a0c03738
Summary:
Non-zero recycle_log_file_num is incompatible with kPointInTimeRecovery and kAbsoluteConsistency recovery modes. Currently SanitizeOptions changes the recovery mode to kTolerateCorruptedTailRecords, while to resolve this option conflict it makes more sense to compromise recycle_log_file_num, which is a performance feature, instead of wal_recovery_mode, which is a safety feature.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6351
Differential Revision: D19648931
Pulled By: maysamyabandeh
fbshipit-source-id: dd0bf78349edc007518a00c4d63931fd69294ad7
Summary:
Fix for issue https://github.com/facebook/rocksdb/issues/6316
When an append/sync of the manifest file fails due to an IO error such
as NoSpace, we don't always put the DB in read-only mode. This is true
for flushes and compactions, as well as foreground operations such as column family
add/drop, CompactFiles etc. Subsequent changes to the DB will be
recorded in the same manifest file, which would have a corrupted record
in the middle due to the previous failure. On next DB::Open(), it will
fail to process the full manifest and data will be lost.
To fix this, we reset VersionSet::descriptor_log_ on append/sync
failure, which will force a new manifest file to be written on the next
append.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6331
Test Plan: Add new unit tests in error_handler_test.cc
Differential Revision: D19632951
Pulled By: anand1976
fbshipit-source-id: 68d527cb6e59a94cbbbf9f5a17a7f464381d51e3
Summary:
The patch adds statistics support to the new BlobDB garbage collection implementation;
namely, it adds support for the following (pre-existing) tickers:
`BLOB_DB_GC_NUM_FILES`: the number of blob files obsoleted by the GC logic.
`BLOB_DB_GC_NUM_NEW_FILES`: the number of new blob files generated by the GC logic.
`BLOB_DB_GC_FAILURES`: the number of failed GC passes (where a GC pass is
equivalent to a (sub)compaction).
`BLOB_DB_GC_NUM_KEYS_RELOCATED`: the number of blobs relocated to new blob
files by the GC logic.
`BLOB_DB_GC_BYTES_RELOCATED`: the total size of blobs relocated to new blob files.
The tickers `BLOB_DB_GC_NUM_KEYS_OVERWRITTEN`, `BLOB_DB_GC_NUM_KEYS_EXPIRED`,
`BLOB_DB_GC_BYTES_OVERWRITTEN`, `BLOB_DB_GC_BYTES_EXPIRED`, and
`BLOB_DB_GC_MICROS` are not relevant for the new GC logic, and are thus marked
deprecated.
The patch also adds a couple of log messages that log the number and total size of
blobs encountered and relocated during a GC pass, as well as the number of blob
files created and obsoleted.
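Reading the new tickers follows the usual Statistics pattern (sketch):
```cpp
#include <cstdint>

#include "rocksdb/statistics.h"

// Pull a few of the GC tickers listed above out of a Statistics object.
void ReportBlobGcTickers(rocksdb::Statistics* stats) {
  uint64_t files_obsoleted =
      stats->getTickerCount(rocksdb::BLOB_DB_GC_NUM_FILES);
  uint64_t files_created =
      stats->getTickerCount(rocksdb::BLOB_DB_GC_NUM_NEW_FILES);
  uint64_t bytes_relocated =
      stats->getTickerCount(rocksdb::BLOB_DB_GC_BYTES_RELOCATED);
  (void)files_obsoleted;
  (void)files_created;
  (void)bytes_relocated;
}
```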
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6296
Test Plan: Extended unit tests and used the BlobDB mode of `db_bench`.
Differential Revision: D19402513
Pulled By: ltamasi
fbshipit-source-id: d53d2bfbf4928a1db1e9346c67ebb9007b8932ec