rocksdb

Commit Graph

Author	SHA1	Message	Date
Levi Tamasi	a00ddf1574	Expose the set of live blob files from Version/VersionSet (#6785 ) Summary: The patch adds logic that returns the set of live blob files from `Version::AddLiveFiles` and `VersionSet::AddLiveFiles` (in addition to live table files), and also cleans up the code a bit, for example, by exposing only the numbers of table files as opposed to the earlier `FileDescriptor`s that no clients used. Moreover, the patch extends the `GetLiveFiles` API so that it also exposes blob files in the current version. Similarly to https://github.com/facebook/rocksdb/pull/6755, this is a building block for identifying and purging obsolete blob files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6785 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D21336210 Pulled By: ltamasi fbshipit-source-id: fc1aede8a49eacd03caafbc5f6f9ce43b6270821	6 years ago
anand76	ab13d43e1d	Pass a timeout to FileSystem for random reads (#6751 ) Summary: Calculate ```IOOptions::timeout``` using ```ReadOptions::deadline``` and pass it to ```FileSystem::Read/FileSystem::MultiRead```. This allows us to impose a tighter bound on the time taken by Get/MultiGet on FileSystem/Envs that support IO timeouts. Even on those that don't support, check in ```RandomAccessFileReader::Read``` and ```MultiRead``` and return ```Status::TimedOut()``` if the deadline is exceeded. For now, TableReader creation, which might do file opens and reads, are not covered. It will be implemented in another PR. Tests: Update existing unit tests to verify the correct timeout value is being passed Pull Request resolved: https://github.com/facebook/rocksdb/pull/6751 Reviewed By: riversand963 Differential Revision: D21285631 Pulled By: anand1976 fbshipit-source-id: d89af843e5a91ece866e87aa29438b52a65a8567	6 years ago
Levi Tamasi	fe238e5438	Keep track of obsolete blob files in VersionSet (#6755 ) Summary: The patch adds logic to keep track of obsolete blob files. A blob file becomes obsolete when the last `shared_ptr` that points to the corresponding `SharedBlobFileMetaData` object goes away, which, in turn, happens when the last `Version` that contains the blob file is destroyed. No longer needed blob files are added to the obsolete list in `VersionSet` using a custom deleter to avoid unnecessary coupling between `SharedBlobFileMetaData` and `VersionSet`. Obsolete blob files are returned by `VersionSet::GetObsoleteFiles` and stored in `JobContext`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6755 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D21233155 Pulled By: ltamasi fbshipit-source-id: 47757e06fdc0127f27ed57f51abd27893d9a7b7a	6 years ago
Derrick Pallas	5272305437	Fix FilterBench when RTTI=0 (#6732 ) Summary: The dynamic_cast in the filter benchmark causes release mode to fail due to no-rtti. Replace with static_cast_with_check. Signed-off-by: Derrick Pallas <derrick@pallas.us> Addition by peterd: Remove unnecessary 2nd template arg on all static_cast_with_check Pull Request resolved: https://github.com/facebook/rocksdb/pull/6732 Reviewed By: ltamasi Differential Revision: D21304260 Pulled By: pdillinger fbshipit-source-id: 6e8eb437c4ca5a16dbbfa4053d67c4ad55f1608c	6 years ago
Yanqin Jin	d4398e08fc	Fix timestamp support for MultiGet (#6748 ) Summary: 1. Avoid nullptr dereference when passing timestamp to KeyContext creation. 2. Construct LookupKey correctly with timestamp when creating MultiGetContext. 3. Compare without timestamp when sorting KeyContexts. Fixes https://github.com/facebook/rocksdb/issues/6745 Test plan (dev server): make check Pull Request resolved: https://github.com/facebook/rocksdb/pull/6748 Reviewed By: pdillinger Differential Revision: D21258691 Pulled By: riversand963 fbshipit-source-id: 44e65b759c18b9986947783edf03be4f890bb004	6 years ago
Yanqin Jin	e04f3bce4f	Update CURRENT file after best-efforts recovery (#6746 ) Summary: After a successful recovery, the CURRENT file should be updated to point to the valid MANIFEST. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6746 Test Plan: make check Reviewed By: anand1976 Differential Revision: D21189876 Pulled By: riversand963 fbshipit-source-id: 7537b49988c5c425ebe9505a5cc260de351ad79b	6 years ago
anand76	c1ccd6b6af	Implement deadline support for MultiGet (#6710 ) Summary: Initial implementation of ReadOptions.deadline for MultiGet. If the request takes longer than the deadline, the keys not yet found will be returned with Status::TimedOut(). This implementation enforces the deadline in DBImpl, which is fairly high level. Its best effort and may not check the deadline after every key lookup, but may do so after a batch of keys. In subsequent stages, we will extend this to passing a timeout down to the FileSystem. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6710 Test Plan: Add new unit tests Reviewed By: riversand963 Differential Revision: D21149158 Pulled By: anand1976 fbshipit-source-id: 9f44eecffeb40873f5034ed59a66d21f9f88879e	6 years ago
Akanksha Mahajan	03a1d95db0	Set max_background_flushes dynamically (#6701 ) Summary: 1. Add changes so that max_background_flushes can be set dynamically. 2. Add a testcase DBOptionsTest.SetBackgroundFlushThreads which set the max_background_flushes dynamically using SetDBOptions. TestPlan: 1. make -j64 check 2. Using new testcase DBOptionsTest.SetBackgroundFlushThreads Pull Request resolved: https://github.com/facebook/rocksdb/pull/6701 Reviewed By: ajkr Differential Revision: D21028010 Pulled By: akankshamahajan15 fbshipit-source-id: 5f949e4a8fd3c32537b637947b7ee09a69cfc7c1	6 years ago
Mike Kolupaev	e45673dece	Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621 ) Summary: Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype. Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling. It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas. Note that the deferred value loading only happens for internal iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621 Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats. Reviewed By: siying Differential Revision: D20786930 Pulled By: al13n321 fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee	6 years ago
Steven Fackler	f53cdab3d7	Hex encode keys in compaction flush logs (#6616 ) Summary: The raw key bytes are currently dumped directly into the log messages, which is not ideal if the keys aren't ASCII strings. Null bytes in particular can cut off bits of the message early. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6616 Reviewed By: ajkr Differential Revision: D20879218 Pulled By: anand1976 fbshipit-source-id: 825a20715fe6d8012c0163c6e7b8159f7926a1a7	6 years ago
sdong	80979f81c7	Make options.bottommost_compression, compression_opts and bottommost_compression_opts dynamically changeable. (#6615 ) Summary: These three options should be made dynamically changeable. Simply add them to MutableCFOptions and made the change. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6615 Test Plan: Add a unit test to make sure that SetOptions() can change the options. Reviewed By: riversand963 Differential Revision: D20755951 fbshipit-source-id: 8165f4fd7a7a665cc7fb049698935022a5d2e7ff	6 years ago
Cheng Chang	ee50b8d499	Be able to decrease background thread's CPU priority when creating database backup (#6602 ) Summary: When creating a database backup, the background threads will not only consume IO resources by copying files, but also consuming CPU such as by computing checksums. During peak times, the CPU consumption by the background threads might affect online queries. This PR makes it possible to decrease CPU priority of these threads when creating a new backup. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6602 Test Plan: make check Reviewed By: siying, zhichao-cao Differential Revision: D20683216 Pulled By: cheng-chang fbshipit-source-id: 9978b9ed9488e8ce135e90ca083e5b4b7221fd84	6 years ago
Zhichao Cao	4246888101	Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487 ) Summary: In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status. The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()). Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487 Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check Reviewed By: anand1976 Differential Revision: D20685017 Pulled By: zhichao-cao fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0	6 years ago
sdong	6fd0ed4993	CompactRange() to use bottom pool when goes to bottommost level (#6593 ) Summary: In automatic compaction, if a compaction is bottommost, it goes to bottom thread pool. We should do the same for manual compaction too. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6593 Test Plan: Add a unit test. See all existing tests pass. Reviewed By: ajkr Differential Revision: D20637408 fbshipit-source-id: cb03031e8f895085f7acf6d2d65e69e84c9ddef3	6 years ago
Huisheng Liu	a6ce5c823b	multiget support for timestamps (#6483 ) Summary: Add timestamp support for MultiGet(). timestamp from readoptions is honored, and timestamps can be returned along with values. MultiReadRandom perf test (10 minutes) on the same development machine ram drive with the same DB data shows no regression (within marge of error). The test is adapted from https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks. base line (commit `17bef7d3a`): multireadrandom : 104.173 micros/op 307167 ops/sec; (5462999 of 5462999 found) This PR: multireadrandom : 104.199 micros/op 307095 ops/sec; (5307999 of 5307999 found) .\db_bench --db=r:\rocksdb.github --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --cache_size=2147483648 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=r:\rocksdb.github\WAL_LOG --sync=0 --verify_checksum=1 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --duration=600 --benchmarks=multireadrandom --use_existing_db=1 --num=25000000 --threads=32 --allow_concurrent_memtable_write=0 Pull Request resolved: https://github.com/facebook/rocksdb/pull/6483 Reviewed By: anand1976 Differential Revision: D20498373 Pulled By: riversand963 fbshipit-source-id: 8505f22bc40fd791bc7dd05e48d7e67c91edb627	6 years ago
anand76	a9d168cfd7	Simplify migration to FileSystem API (#6552 ) Summary: The current Env/FileSystem API separation has a couple of issues - 1. It requires the user to specify 2 options - ```Options::env``` and ```Options::file_system``` - which means they have to make code changes to benefit from the new APIs. Furthermore, there is a risk of accessing the same APIs in two different ways, through Env in the old way and through FileSystem in the new way. The two may not always match, for example, if env is ```PosixEnv``` and FileSystem is a custom implementation. Any stray RocksDB calls to env will use the ```PosixEnv``` implementation rather than the file_system implementation. 2. There needs to be a simple way for the FileSystem developer to instantiate an Env for backward compatibility purposes. This PR solves the above issues and simplifies the migration in the following ways - 1. Embed a shared_ptr to the ```FileSystem``` in the ```Env```, and remove ```Options::file_system``` as a configurable option. This way, no code changes will be required in application code to benefit from the new API. The default Env constructor uses a ```LegacyFileSystemWrapper``` as the embedded ```FileSystem```. 1a. - This also makes it more robust by ensuring that even if RocksDB has some stray calls to Env APIs rather than FileSystem, they will go through the same object and thus there is no risk of getting out of sync. 2. Provide a ```NewCompositeEnv()``` API that can be used to construct a PosixEnv with a custom FileSystem implementation. This eliminates an indirection to call Env APIs, and relieves the FileSystem developer of the burden of having to implement wrappers for the Env APIs. 3. Add a couple of missing FileSystem APIs - ```SanitizeEnvOptions()``` and ```NewLogger()``` Tests: 1. New unit tests 2. make check and make asan_check Pull Request resolved: https://github.com/facebook/rocksdb/pull/6552 Reviewed By: riversand963 Differential Revision: D20592038 Pulled By: anand1976 fbshipit-source-id: c3801ad4153f96d21d5a3ae26c92ba454d1bf1f7	6 years ago
Yanqin Jin	fb09ef05dc	Attempt to recover from db with missing table files (#6334 ) Summary: There are situations when RocksDB tries to recover, but the db is in an inconsistent state due to SST files referenced in the MANIFEST being missing. In this case, previous RocksDB will just fail the recovery and return a non-ok status. This PR enables another possibility. During recovery, RocksDB checks possible MANIFEST files, and try to recover to the most recent state without missing table file. `VersionSet::Recover()` applies version edits incrementally and "materializes" a version only when this version does not reference any missing table file. After processing the entire MANIFEST, the version created last will be the latest version. `DBImpl::Recover()` calls `VersionSet::Recover()`. Afterwards, WAL replay will not be performed. To use this capability, set `options.best_efforts_recovery = true` when opening the db. Best-efforts recovery is currently incompatible with atomic flush. Test plan (on devserver): ``` $make check $COMPILE_WITH_ASAN=1 make all && make check ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/6334 Reviewed By: anand1976 Differential Revision: D19778960 Pulled By: riversand963 fbshipit-source-id: c27ea80f29bc952e7d3311ecf5ee9c54393b40a8	6 years ago
Cheng Chang	2d9efc9ab2	Cache result of GetLogicalBufferSize in Linux (#6457 ) Summary: In Linux, when reopening DB with many SST files, profiling shows that 100% system cpu time spent for a couple of seconds for `GetLogicalBufferSize`. This slows down MyRocks' recovery time when site is down. This PR introduces two new APIs: 1. `Env::RegisterDbPaths` and `Env::UnregisterDbPaths` lets `DB` tell the env when it starts or stops using its database directories . The `PosixFileSystem` takes this opportunity to set up a cache from database directories to the corresponding logical block sizes. 2. `LogicalBlockSizeCache` is defined only for OS_LINUX to cache the logical block sizes. Other modifications: 1. rename `logical buffer size` to `logical block size` to be consistent with Linux terms. 2. declare `GetLogicalBlockSize` in `PosixHelper` to expose it to `PosixFileSystem`. 3. change the functions `IOError` and `IOStatus` in `env/io_posix.h` to have external linkage since they are used in other translation units too. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6457 Test Plan: 1. A new unit test is added for `LogicalBlockSizeCache` in `env/io_posix_test.cc`. 2. A new integration test is added for `DB` operations related to the cache in `db/db_logical_block_size_cache_test.cc`. `make check` Differential Revision: D20131243 Pulled By: cheng-chang fbshipit-source-id: 3077c50f8065c0bffb544d8f49fb10bba9408d04	6 years ago
Cheng Chang	afb97094ae	Skip high levels with no key falling in the range in CompactRange (#6482 ) Summary: In CompactRange, if there is no key in memtable falling in the specified range, then flush is skipped. This PR extends this skipping logic to SST file levels: it starts compaction from the highest level (starting from L0) that has files with key falling in the specified range, instead of always starts from L0. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6482 Test Plan: A new test ManualCompactionTest::SkipLevel is added. Also updated a test related to statistics of index block cache hit in db_test2, the index cache hit is increased by 1 in this PR because when checking overlap for the key range in L0, OverlapWithLevelIterator will do a seek in the table cache iterator, which will read from the cached index. Also updated db_compaction_test and db_test to use correct range for full compaction. Differential Revision: D20251149 Pulled By: cheng-chang fbshipit-source-id: f822157cf4796972bd5035d9d7178d8dfb7af08b	6 years ago
Kefu Chai	03dbd11ead	s/const auto/const auto&/ when doing loop (#6477 ) Summary: this silences following warning from clang-11 ``` rocksdb/db/db_impl/db_impl_compaction_flush.cc:1040:21: warning: loop variable 'newf' of type 'const std::pair<int, rocksdb::FileMetaData>' creates a copy from type 'const std::pair<int\ , rocksdb::FileMetaData>' [-Wrange-loop-analysis] for (const auto newf : c->edit()->GetNewFiles()) { ^ rocksdb/db/db_impl/db_impl_compaction_flush.cc:1040:10: note: use reference type 'const std::pair<int, rocksdb::FileMetaData> &' to prevent copying for (const auto newf : c->edit()->GetNewFiles()) { ^~~~~~~~~~~~~~~~~ & ``` Signed-off-by: Kefu Chai <tchaikov@gmail.com> Pull Request resolved: https://github.com/facebook/rocksdb/pull/6477 Differential Revision: D20211850 Pulled By: ltamasi fbshipit-source-id: 3e89e13a12bba79f1b934d46b7c4c0576cdafb01	6 years ago
sdong	17bef7d3a8	Fix data race of GetCreationTimeOfOldestFile() (#6473 ) Summary: When DBImpl::GetCreationTimeOfOldestFile() calls Version::GetCreationTimeOfOldestFile(), the version is not directly or indirectly referenced, so an event like compaction can race with the operation and cause DBImpl::GetCreationTimeOfOldestFile() to access delocated data. This was caught by an ASAN run: ==268==ERROR: AddressSanitizer: heap-use-after-free on address 0x612000b7d198 at pc 0x000018332913 bp 0x7f391510d310 sp 0x7f391510d308 READ of size 8 at 0x612000b7d198 thread T845 (store_load-33) SCARINESS: 51 (8-byte-read-heap-use-after-free) #0 0x18332912 in rocksdb::Version::GetCreationTimeOfOldestFile(unsigned long) rocksdb/src/db/version_set.cc:1488 https://github.com/facebook/rocksdb/issues/1 0x1803ddaa in rocksdb::DBImpl::GetCreationTimeOfOldestFile(unsigned long) rocksdb/src/db/db_impl/db_impl.cc:4499 https://github.com/facebook/rocksdb/issues/2 0xe24ca09 in rocksdb::StackableDB::GetCreationTimeOfOldestFile(unsigned long) rocksdb/utilities/stackable_db.h:392 ...... 0x612000b7d198 is located 216 bytes inside of 296-byte region [0x612000b7d0c0,0x612000b7d1e8) freed by thread T28 here: ...... https://github.com/facebook/rocksdb/issues/5 0x1832c73f in std::vector<rocksdb::FileMetaData, std::allocator<rocksdb::FileMetaData> >::~vector() third-party-buck/platform007/build/libgcc/include/c++/trunk/bits/stl_vector.h:435 https://github.com/facebook/rocksdb/issues/6 0x1832c73f in rocksdb::VersionStorageInfo::~VersionStorageInfo() rocksdb/src/db/version_set.cc:734 https://github.com/facebook/rocksdb/issues/7 0x1832cf42 in rocksdb::Version::~Version() rocksdb/src/db/version_set.cc:758 https://github.com/facebook/rocksdb/issues/8 0x9d1bb5 in rocksdb::Version::Unref() rocksdb/src/db/version_set.cc:2869 https://github.com/facebook/rocksdb/issues/9 0x183e7631 in rocksdb::Compaction::~Compaction() rocksdb/src/db/compaction/compaction.cc:275 https://github.com/facebook/rocksdb/issues/10 0x9e6de6 in std::default_delete<rocksdb::Compaction>::operator()(rocksdb::Compaction) const third-party-buck/platform007/build/libgcc/include/c++/trunk/bits/unique_ptr.h:78 https://github.com/facebook/rocksdb/issues/11 0x9e6de6 in std::unique_ptr<rocksdb::Compaction, std::default_delete<rocksdb::Compaction> >::reset(rocksdb::Compaction) third-party-buck/platform007/build/libgcc/include/c++/trunk/bits/unique_ptr.h:376 https://github.com/facebook/rocksdb/issues/12 0x9e6de6 in rocksdb::DBImpl::BackgroundCompaction(bool, rocksdb::JobContext, rocksdb::LogBuffer, rocksdb::DBImpl::PrepickedCompaction, rocksdb::Env::Priority) rocksdb/src/db/db_impl/db_impl_compaction_flush.cc:2826 https://github.com/facebook/rocksdb/issues/13 0x9ac3b8 in rocksdb::DBImpl::BackgroundCallCompaction(rocksdb::DBImpl::PrepickedCompaction, rocksdb::Env::Priority) rocksdb/src/db/db_impl/db_impl_compaction_flush.cc:2320 https://github.com/facebook/rocksdb/issues/14 0x9abff7 in rocksdb::DBImpl::BGWorkCompaction(void*) rocksdb/src/db/db_impl/db_impl_compaction_flush.cc:2096 ...... Fix the issue by reference the super version and use the referenced version from it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6473 Test Plan: Run ASAN for all existing tests. Differential Revision: D20196416 fbshipit-source-id: 5f4a7918110fc7b8dd7841932d376bc9d1e59d6f	6 years ago
Zhichao Cao	8d73137ae8	Replace Directory with FSDirectory in DB (#6468 ) Summary: In the current code base, we can use Directory from Env to manage directory (e.g, Fsync()). The PR https://github.com/facebook/rocksdb/issues/5761 introduce the File System as a new Env API. So we further replace the Directory class in DB with FSDirectory such that we can have more IO information from IOStatus returned by FSDirectory. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6468 Test Plan: pass make asan_check Differential Revision: D20195261 Pulled By: zhichao-cao fbshipit-source-id: 93962cb9436852bfcfb76e086d9e7babd461cbe1	6 years ago
Huisheng Liu	904a60ff63	return timestamp from get (#6409 ) Summary: Added new Get() methods that return timestamp. Dummy implementation is given so that classes derived from DB don't need to be touched to provide their implementation. MultiGet is not included. ReadRandom perf test (10 minutes) on the same development machine ram drive with the same DB data shows no regression (within marge of error). The test is adapted from https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks. base line (commit `72ee067b9`): 101.712 micros/op 314602 ops/sec; 36.0 MB/s (5658999 of 5658999 found) This PR: 100.288 micros/op 319071 ops/sec; 36.5 MB/s (5674999 of 5674999 found) ./db_bench --db=r:\rocksdb.github --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --cache_size=2147483648 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=134217728 --max_bytes_for_level_base=1073741824 --disable_wal=0 --wal_dir=r:\rocksdb.github\WAL_LOG --sync=0 --verify_checksum=1 --delete_obsolete_files_period_micros=314572800 --max_background_compactions=4 --max_background_flushes=0 --level0_slowdown_writes_trigger=16 --level0_stop_writes_trigger=24 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --mmap_read=1 --mmap_write=0 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --duration=600 --benchmarks=readrandom --use_existing_db=1 --num=25000000 --threads=32 Pull Request resolved: https://github.com/facebook/rocksdb/pull/6409 Differential Revision: D20200086 Pulled By: riversand963 fbshipit-source-id: 490edd74d924f62bd8ae9c29c2a6bbbb8410ca50	6 years ago
Michael R. Crusoe	051696bf98	fix some spelling typos (#6464 ) Summary: Found from Debian's "Lintian" program Pull Request resolved: https://github.com/facebook/rocksdb/pull/6464 Differential Revision: D20162862 Pulled By: zhichao-cao fbshipit-source-id: 06941ee2437b038b2b8045becbe9d2c6fbff3e12	6 years ago
Yanqin Jin	890d87fadc	Some minor fix-ups (#6440 ) Summary: Cleanup some code without any real change in functionality. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6440 Differential Revision: D20015891 Pulled By: riversand963 fbshipit-source-id: 33e18754b0f002006a6d4805e9aaf84c0c8ad25a	6 years ago
sdong	fdf882ded2	Replace namespace name "rocksdb" with ROCKSDB_NAMESPACE (#6433 ) Summary: When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433 Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag. Differential Revision: D19977691 fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e	6 years ago
Andrew Kryczka	c6abe30ee3	Fix concurrent full purge and WAL recycling (#5900 ) Summary: We were removing the file from `log_recycle_files_` before renaming it with `ReuseWritableFile()`. Since `ReuseWritableFile()` occurs outside the DB mutex, it was possible for a concurrent full purge to sneak in and delete the file before it could be renamed. Consequently, `SwitchMemtable()` would fail and the DB would enter read-only mode. The fix is to hold the old file number in `log_recycle_files_` until after the file has been renamed. Full purge uses that list to decide which files to keep, so it can no longer delete a file pending recycling. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5900 Test Plan: new unit test Differential Revision: D19771719 Pulled By: ajkr fbshipit-source-id: 094346349ca3fb499712e62de03905acc30b5ce8	6 years ago
sdong	ac8e89a443	Should flush and sync WAL when writing it in DB::Open() (#6417 ) Summary: A recent fix related to 2pc https://github.com/facebook/rocksdb/pull/6313/ writes something to WAL, but does not flush or sync. This causes assertion failure "impl->TEST_WALBufferIsEmpty()" if manual_wal_flush = true. We should fsync the entry to make sure a second power reset can recover. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6417 Test Plan: Add manual_wal_flush=true case in TransactionTest.DoubleCrashInRecovery and fix a bug in the test so that the bug can be reproduced. It passes with the fix. Differential Revision: D19894537 fbshipit-source-id: f1e84e49e2269f583c6019743118292cd8b6598e	6 years ago
Huisheng Liu	5138764eb5	Fix destroydb (#6308 ) Summary: It's observed on Windows DestroyDB failed to remove the log file because the logger is still alive in sst file manager and holding a handle to the log file. This fix makes sure the logger is released before attempt to clear the database directory. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6308 Differential Revision: D19818829 Pulled By: riversand963 fbshipit-source-id: 54c3e6859aadaaba4a49b3e851b73dc35ec7dc6a	6 years ago
Zhichao Cao	4369f2c7bb	Checksum for each SST file and stores in MANIFEST (#6216 ) Summary: In the current code base, RocksDB generate the checksum for each block and verify the checksum at usage. Current PR enable SST file checksum. After a SST file is generated by Flush or Compaction, RocksDB generate the SST file checksum and store the checksum value and checksum method name in the vs_info and MANIFEST as part for the FileMetadata. Added the enable_sst_file_checksum to Options to enable or disable file checksum. Added sst_file_checksum to Options such that user can plugin their own SST file checksum calculate method via overriding the SstFileChecksum class. The checksum information inlcuding uint32_t checksum value and a checksum name (string). A new tool is added to LDB such that user can dump out a list of file checksum information from MANIFEST. If user enables the file checksum but does not provide the sst_file_checksum instance, RocksDB will use the default crc32checksum implemented in table/sst_file_checksum_crc32c.h Pull Request resolved: https://github.com/facebook/rocksdb/pull/6216 Test Plan: Added the testing case in table_test and ldb_cmd_test to verify checksum is correct in different level. Pass make asan_check. Differential Revision: D19171461 Pulled By: zhichao-cao fbshipit-source-id: b2e53479eefc5bb0437189eaa1941670e5ba8b87	6 years ago
Yutian Li	2e0159ec9e	Add error status for no_slowdown & low priority write (#6396 ) Summary: When `no_slowdown` is enabled, it returns `Status::Incomplete("Write stall")` if a stall would occur. This patch adds descriptive text for when `no_slowdown` and `low_pri` are enabled. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6396 Differential Revision: D19808978 Pulled By: cheng-chang fbshipit-source-id: a53b0d25ed414c821a086531e0222027f925e627	6 years ago
Levi Tamasi	752c87af78	Clean up VersionEdit a bit (#6383 ) Summary: This is a bunch of small improvements to `VersionEdit`. Namely, the patch * Makes the names and order of variables, methods, and code chunks related to the various information elements more consistent, and adds missing getters for the sake of completeness. * Initializes previously uninitialized stack variables. * Marks all getters const to improve const correctness. * Adds in-class initializers and removes the default ctor that would create an object with uninitialized built-in fields and call `Clear` afterwards. * Adds a new type alias for new files and changes the existing `typedef` for deleted files into a type alias as well. * Makes the helper method `DecodeNewFile4From` private. * Switches from long-winded iterator syntax to range based loops in a couple of places. * Fixes a couple of assignments where an integer 0 was assigned to boolean members. * Fixes a getter which used to return a `const std::string` instead of the intended `const std::string&`. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6383 Test Plan: make check Differential Revision: D19780537 Pulled By: ltamasi fbshipit-source-id: b0b4f09fee0ec0e7c7b7a6d76bfe5346e91824d0	6 years ago
Cheng Chang	5f478b9f75	Remove outdated comment (#6379 ) Summary: Since the logic for handling IDENTITY file is now inside `NewDB`, the comment above `NewDB` is no longer relevant. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6379 Test Plan: not needed Differential Revision: D19795440 Pulled By: cheng-chang fbshipit-source-id: 0b1cca87ac6d92474701c46aa4c8d4d708bfa19b	6 years ago
Cheng Chang	0a74e1b958	Add status checks during DB::Open (#6380 ) Summary: Several statuses were not checked during DB::Open. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6380 Test Plan: make check Differential Revision: D19780237 Pulled By: cheng-chang fbshipit-source-id: c8d189d20344bd1607890dd1449345bda2ef96b9	6 years ago
Yanqin Jin	f361cedf06	Atomic flush rollback once on failure (#6385 ) Summary: Before this fix, atomic flush codepath may hit an assertion failure on a specific failure case. If all flush jobs within an atomic flush succeed (they do not write to MANIFEST), but batch writing version edits to MANIFEST fails, then `cfd->imm()->RollbackMemTableFlush()` will be called twice, and the second invocation hits assertion failure `assert(m->flush_in_progress_)` since the first invocation resets the variable `flush_in_progress_` to false already. Test plan (dev server): ``` ./db_flush_test --gtest_filter=DBAtomicFlushTest/DBAtomicFlushTest.RollbackAfterFailToInstallResults make check ``` Both must succeed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6385 Differential Revision: D19782943 Pulled By: riversand963 fbshipit-source-id: 84e1592625e729d1b70fdc8479959387a74cb121	6 years ago
Mike Kolupaev	1ed7d9b1b5	Avoid lots of calls to Env::GetFileSize() in SstFileManagerImpl when opening DB (#6363 ) Summary: Before this PR it calls GetFileSize() once for each sst file in the DB. This can take a long time if there are be tens of thousands of sst files (e.g. in thousands of column families), and even longer if Env is talking to some remote service rather than local filesystem. This PR makes DB::Open() use sst file sizes that are already known from manifest (typically almost all files in the DB) and only call GetFileSize() for non-sst or obsolete files. Note that GetFileSize() is also called and checked against manifest in CheckConsistency(), so the calls in SstFileManagerImpl were completely redundant. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6363 Test Plan: deployed to a test cluster, looked at a dump of Env calls (from a custom instrumented Env) - no more thousands of GetFileSize()s. Differential Revision: D19702509 Pulled By: al13n321 fbshipit-source-id: 99f8110620cb2e9d0c092dfcdbb11f3af4ff8b73	6 years ago
Mike Kolupaev	637e64b9ac	Add an option to prevent DB::Open() from querying sizes of all sst files (#6353 ) Summary: When paranoid_checks is on, DBImpl::CheckConsistency() iterates over all sst files and calls Env::GetFileSize() for each of them. As far as I could understand, this is pretty arbitrary and doesn't affect correctness - if filesystem doesn't corrupt fsynced files, the file sizes will always match; if it does, it may as well corrupt contents as well as sizes, and rocksdb doesn't check contents on open. If there are thousands of sst files, getting all their sizes takes a while. If, on top of that, Env is overridden to use some remote storage instead of local filesystem, it can be really slow and overload the remote storage service. This PR adds an option to not do GetFileSize(); instead it does GetChildren() for parent directory to check that all the expected sst files are at least present, but doesn't check their sizes. We can't just disable paranoid_checks instead because paranoid_checks do a few other important things: make the DB read-only on write errors, print error messages on read errors, etc. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6353 Test Plan: ran the added sanity check unit test. Will try it out in a LogDevice test cluster where the GetFileSize() calls are causing a lot of trouble. Differential Revision: D19656425 Pulled By: al13n321 fbshipit-source-id: c2c421b367633033760d1f56747bad206d1fbf82	6 years ago
sdong	f195d8d523	Use ReadFileToString() to get content from IDENTITY file (#6365 ) Summary: Right now when reading IDENTITY file, we use a very similar logic as ReadFileToString() while it does an extra file size check, which may be expensive in some file systems. There is no reason to duplicate the logic. Use ReadFileToString() instead. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6365 Test Plan: RUn all existing tests. Differential Revision: D19709399 fbshipit-source-id: 3bac31f3b2471f98a0d2694278b41e9cd34040fe	6 years ago
sdong	36c504be17	Avoid create directory for every column families (#6358 ) Summary: A relatively recent regression causes for every CF, create and open directory is called for the DB directory, unless CF has a private directory. This doesn't scale well with large number of column families. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6358 Test Plan: Run all existing tests and see it pass. strace with db_bench --num_column_families and observe it doesn't open directory for number of column families. Differential Revision: D19675141 fbshipit-source-id: da01d9216f1dae3f03d4064fbd88ce71245bd9be	6 years ago
Maysam Yabandeh	3316d29221	Disable recycle_log_file_num when it is incompatible with recovery mode (#6351 ) Summary: Non-zero recycle_log_file_num is incompatible with kPointInTimeRecovery and kAbsoluteConsistency recovery modes. Currently SanitizeOptions changes the recovery mode to kTolerateCorruptedTailRecords, while to resolve this option conflict it makes more sense to compromise recycle_log_file_num, which is a performance feature, instead of wal_recovery_mode, which is a safety feature. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6351 Differential Revision: D19648931 Pulled By: maysamyabandeh fbshipit-source-id: dd0bf78349edc007518a00c4d63931fd69294ad7	6 years ago
Maysam Yabandeh	2f973ca96e	Double Crash in kPointInTimeRecovery with TransactionDB (#6313 ) Summary: In WritePrepared there could be gap in sequence numbers. This breaks the trick we use in kPointInTimeRecovery which assume the first seq in the log right after the corrupted log is one larger than the last seq we read from the logs. To let this trick keep working, we add a dummy entry with the expected sequence to the first log right after recovery. Also in WriteCommitted, if the log right after the corrupted log is empty, since it has no sequence number to let the sequential trick work, it is assumed as unexpected behavior. This is however expected to happen if we close the db after recovering from a corruption and before writing anything new to it. To remedy that, we apply the same technique by writing a dummy entry to the log that is created after the corrupted log. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6313 Differential Revision: D19458291 Pulled By: maysamyabandeh fbshipit-source-id: 09bc49e574690085df45b034ca863ff315937e2d	6 years ago
Andrew Kryczka	5b33cfa1e3	fix `WriteBufferManager` flush log message (#6335 ) Summary: It chooses the oldest memtable, not the largest one. This is an important difference for users whose CFs receive non-uniform write rates. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6335 Differential Revision: D19588865 Pulled By: maysamyabandeh fbshipit-source-id: 62ad4325b0182f5f27858584cd73fd5978fb2cec	6 years ago
Levi Tamasi	73f65b457e	Adjust thread pool sizes when setting max_background_jobs dynamically (#6300 ) Summary: https://github.com/facebook/rocksdb/pull/2205 introduced a new configuration option called `max_background_jobs`, superseding the earlier options `max_background_flushes` and `max_background_compactions`. However, unlike `max_background_compactions`, setting `max_background_jobs` dynamically through the `SetDBOptions` interface does not adjust the size of the thread pools (see https://github.com/facebook/rocksdb/issues/6298). The patch fixes this. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6300 Test Plan: Extended unit test. Differential Revision: D19430899 Pulled By: ltamasi fbshipit-source-id: 704006605b3c13c3d1b997ccc0831ee369721074	6 years ago
Yanqin Jin	bce5189f4d	Fix error message (#6264 ) Summary: Fix an error message when CURRENT is not found. Test plan (dev server) ``` make check ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/6264 Differential Revision: D19300699 Pulled By: riversand963 fbshipit-source-id: 303fa206386a125960ecca1dbdeff07422690caf	6 years ago
Mike Kolupaev	ce63eda6f0	Fix use-after-free and double-deleting files in BackgroundCallPurge() (#6193 ) Summary: The bad code was: ``` mutex.Lock(); // `mutex` protects `container` for (auto& x : container) { mutex.Unlock(); // do stuff to x mutex.Lock(); } ``` It's incorrect because both `x` and the iterator may become invalid if another thread modifies the container while this thread is not holding the mutex. Broken by https://github.com/facebook/rocksdb/pull/5796 - it replaced a `while (!container.empty())` loop with a `for (auto x : container)`. (RocksDB code does a lot of such unlocking+re-locking of mutexes, and this type of bugs comes up a lot :/ ) Pull Request resolved: https://github.com/facebook/rocksdb/pull/6193 Test Plan: Ran some logdevice integration tests that were crashing without this fix. Differential Revision: D19116874 Pulled By: al13n321 fbshipit-source-id: 9672bc4227c1b68f46f7436db2b96811adb8c703	6 years ago
解轶伦	39fcaf8246	delete superversions in BackgroundCallPurge (#6146 ) Summary: I found that CleanupSuperVersion() may block Get() for 30ms+ （per MemTable is 256MB）. Then I found "delete sv" in ~SuperVersion() takes the time. The backtrace looks like this DBImpl::GetImpl() -> DBImpl::ReturnAndCleanupSuperVersion() -> DBImpl::CleanupSuperVersion() : delete sv; -> ~SuperVersion() I think it's better to delete in a background thread, please review it。 Pull Request resolved: https://github.com/facebook/rocksdb/pull/6146 Differential Revision: D18972066 fbshipit-source-id: 0f7b0b70b9bb1e27ad6fc1c8a408fbbf237ae08c	6 years ago
anand76	afa2420c2b	Introduce a new storage specific Env API (#5761 ) Summary: The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc. This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO. The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before. This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection. The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761 Differential Revision: D18868376 Pulled By: anand1976 fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f	6 years ago
Levi Tamasi	6d54eb3dc2	Do not create/install new SuperVersion if nothing was deleted during memtable trim (#6169 ) Summary: We have observed an increase in CPU load caused by frequent calls to `ColumnFamilyData::InstallSuperVersion` from `DBImpl::TrimMemtableHistory` when using `max_write_buffer_size_to_maintain` to limit the amount of memtable history maintained for transaction conflict checking. As it turns out, this is caused by the code creating and installing a new `SuperVersion` even if no memtables were actually trimmed. The patch adds a check to avoid this. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6169 Test Plan: Compared `perf` output for ``` ./db_bench -benchmarks=randomtransaction -optimistic_transaction_db=1 -statistics -stats_interval_seconds=1 -duration=90 -num=500000 --max_write_buffer_size_to_maintain=16000000 --transaction_set_snapshot=1 --threads=32 ``` before and after the change. With the fix, the call chain `rocksdb::DBImpl::TrimMemtableHistory` -> `rocksdb::ColumnFamilyData::InstallSuperVersion` -> `rocksdb::ThreadLocalPtr::StaticMeta::Scrape` no longer registers in the `perf` report. Differential Revision: D19031509 Pulled By: ltamasi fbshipit-source-id: 02686fce594e5b50eba0710e4b28a9b808c8aa20	6 years ago
Jermy Li	c2029f9716	Support concurrent CF iteration and drop (#6147 ) Summary: It's easy to cause coredump when closing ColumnFamilyHandle with unreleased iterators, especially iterators release is controlled by java GC when using JNI. This patch fixed concurrent CF iteration and drop, we let iterators(actually SuperVersion) hold a ColumnFamilyData reference to prevent the CF from being released too early. fixed https://github.com/facebook/rocksdb/issues/5982 Pull Request resolved: https://github.com/facebook/rocksdb/pull/6147 Differential Revision: D18926378 fbshipit-source-id: 1dff6d068c603d012b81446812368bfee95a5e15	6 years ago
Connor	a844591201	wait pending memtable writes on file ingestion or compact range (#6113 ) Summary: Summary: This PR fixes two unordered_write related issues: - ingestion job may skip the necessary memtable flush https://github.com/facebook/rocksdb/issues/6026 - compact range may cause memtable is flushed before pending unordered write finished 1. `CompactRange` triggers memtable flush but doesn't wait for pending-writes 2. there are some pending writes but memtable is already flushed 3. the memtable related WAL is removed( note that the pending-writes were recorded in that WAL). 4. pending-writes write to newer created memtable 5. there is a restart 6. missing the previous pending-writes because WAL is removed but they aren't included in SST. How to solve: - Wait pending memtable writes before ingestion job check memtable key range - Wait pending memtable writes before flush memtable. Note that: `CompactRange` calls `RangesOverlapWithMemtables` too without waiting for pending waits, but I'm not sure whether it affects the correctness. Test Plan: make check Pull Request resolved: https://github.com/facebook/rocksdb/pull/6113 Differential Revision: D18895674 Pulled By: maysamyabandeh fbshipit-source-id: da22b4476fc7e06c176020e7cc171eb78189ecaf	6 years ago

... 2 3 4 5 6

281 Commits (47b424f4bd51078591e674ff936de5a270530ce2)