rocksdb

Commit Graph

Author	SHA1	Message	Date
Siying Dong	a5e851e113	Reformatting some recent changes (#4161 ) Summary: Lint is not happy with some new code recently committed. Format them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4161 Differential Revision: D8940582 Pulled By: siying fbshipit-source-id: c9b43b1ef8c88b5e923911058b44eb77234b36b7	7 years ago
Siying Dong	8425c8bd4d	BlockBasedTableReader: automatically adjust tail prefetch size (#4156 ) Summary: Right now we use one hard-coded prefetch size to prefetch data from the tail of the SST files. However, this may introduce a waste for some use cases, while not efficient for others. Introduce a way to adjust this prefetch size by tracking 32 recent times, and pick a value with which the wasted read is less than 10% Pull Request resolved: https://github.com/facebook/rocksdb/pull/4156 Differential Revision: D8916847 Pulled By: siying fbshipit-source-id: 8413f9eb3987e0033ed0bd910f83fc2eeaaf5758	7 years ago
Andrew Kryczka	ab35505e21	Write properties metablock last in block-based tables (#4158 ) Summary: The properties meta-block should come at the end since we always need to read it when opening a file, unlike index/filter/other meta-blocks, which are sometimes read depending on the user's configuration. This ordering will allow us to (in a future PR) do a small readahead on the end of the file to read properties and meta-index blocks with one I/O. The bulk of this PR is a refactoring of the `BlockBasedTableBuilder::Finish` function. It was previously too large with inconsistent error handling, which made it difficult to change. So I broke it up into one function per meta-block write, and tried to make error handling consistent within those functions. Then reordering the metablocks was trivial -- just reorder the calls to these helper functions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4158 Differential Revision: D8921705 Pulled By: ajkr fbshipit-source-id: 96c9cc3182eb1adf11af46adab79dbeba7b12fcc	7 years ago
Andrew Kryczka	b5613227a9	Smaller tail readahead when not reading index/filters (#4159 ) Summary: In all cases during `BlockBasedTable::Open`, we issue at least three read requests to the file's tail: (1) footer, (2) metaindex block, and (3) properties block. Depending on the config, we may also read other metablocks like filter and index. This PR issues smaller readahead when we expect to do only the three necessary reads mentioned above. Then, 4KB should be enough (ignoring the case where there are lots of user-defined properties). We can keep doing 512KB readahead when additional reads are expected. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4159 Differential Revision: D8924002 Pulled By: ajkr fbshipit-source-id: cfc713275de4d05ce11f18571f1d72e27ccd3356	7 years ago
Maysam Yabandeh	b55da012f6	Refactor IndexBlockIter (#4141 ) Summary: Refactor IndexBlockIter to reduce conditional branches on key_includes_seq_. IndexBlockIter::Prev is also separated from DataBlockIter::Prev, not to cache the prev entries as they are of less importance when iterating over the index block. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4141 Differential Revision: D8866437 Pulled By: maysamyabandeh fbshipit-source-id: fdac76880426fc2be7d3c6354c09ab98f6657d4b	7 years ago
Siying Dong	8f06b4fa01	Separate some IndexBlockIter logic from BlockIter (#4136 ) Summary: Some logic only related to IndexBlockIter is separated from BlockIter to IndexBlockIter. This is done by writing an exclusive Seek() and SeekForPrev() for DataBlockIter, and all metadata block iter and tombstone block iter now use data block iter. Dealing with the BinarySeek() sharing problem by passing in the comparator to use. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4136 Reviewed By: maysamyabandeh Differential Revision: D8859673 Pulled By: siying fbshipit-source-id: 703e5e6824b82b7cbf4721f3594b94127797ca9e	7 years ago
Nathan VanBenschoten	ef7815b803	Support range deletion tombstones in IngestExternalFile SSTs (#3778 ) Summary: Fixes #3391. This change adds a `DeleteRange` method to `SstFileWriter` and adds support for ingesting SSTs with range deletion tombstones. This is important for applications that need to atomically ingest SSTs while clearing out any existing keys in a given key range. Pull Request resolved: https://github.com/facebook/rocksdb/pull/3778 Differential Revision: D8821836 Pulled By: anand1976 fbshipit-source-id: ca7786c1947ff129afa703dab011d524c7883844	7 years ago
Maysam Yabandeh	8581a93a6b	Per-thread unique test db names (#4135 ) Summary: The patch makes sure that two parallel test threads will operate on different db paths. This enables using open source tools such as gtest-parallel to run the tests of a file in parallel. Example: ``` ~/gtest-parallel/gtest-parallel ./table_test``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4135 Differential Revision: D8846653 Pulled By: maysamyabandeh fbshipit-source-id: 799bad1abb260e3d346bcb680d2ae207a852ba84	7 years ago
Tamir Duberstein	7bee48bdbd	Add GCC 8 to Travis (#3433 ) Summary: - Avoid `strdup` to use jemalloc on Windows - Use `size_t` for consistency - Add GCC 8 to Travis - Add CMAKE_BUILD_TYPE=Release to Travis Pull Request resolved: https://github.com/facebook/rocksdb/pull/3433 Differential Revision: D6837948 Pulled By: sagar0 fbshipit-source-id: b8543c3a4da9cd07ee9a33f9f4623188e233261f	7 years ago
Maysam Yabandeh	d4ad32d7bd	Refactor BlockIter (#4121 ) Summary: BlockIter is getting crowded including details that specific only to either index or data blocks. The patch moves down such details to DataBlockIter and IndexBlockIter, both inheriting from BlockIter. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4121 Differential Revision: D8816832 Pulled By: maysamyabandeh fbshipit-source-id: d492e74155c11d8a0c1c85cd7ee33d24c7456197	7 years ago
Nikhil Benesch	5cd8240b86	Test range deletions with more configurations (#4021 ) Summary: Run the basic range deletion tests against the standard set of configurations. This testing exposed that files with hash indexes and partitioned indexes were not handling the case where the file contained only range deletions--i.e., where the index was empty. Additionally file a TODO about the fact that range deletions are broken when allow_mmap_reads = true is set. /cc ajkr nvanbenschoten Best viewed with ?w=1: https://github.com/facebook/rocksdb/pull/4021/files?w=1 Pull Request resolved: https://github.com/facebook/rocksdb/pull/4021 Differential Revision: D8811860 Pulled By: ajkr fbshipit-source-id: 3cc07e6d6210a2a00b932866481b3d5c59775343	7 years ago
Yanqin Jin	d4d9fe8e57	Fix a bug caused by not copying the block trailer. (#4096 ) Summary: This was caught by crash test, and the following is a simple way to reproduce it and verify the fix. One way to trigger this code path is to use the following configuration: - Compress SST file - Enable direct IO and prefetch buffer - Do NOT use compressed block cache Closes https://github.com/facebook/rocksdb/pull/4096 Differential Revision: D8742009 Pulled By: riversand963 fbshipit-source-id: f13381078bbb0dce92f60bd313a78ab602bcacd2	7 years ago
Yanqin Jin	39218a72a4	Increase the size of LRU cache. (#4090 ) Summary: Increase the size of each shard so that the number of cache hit/miss match expectation. Otherwise FilterBlockInBlockCache test will fail. Closes https://github.com/facebook/rocksdb/pull/4090 Differential Revision: D8736158 Pulled By: riversand963 fbshipit-source-id: 5cdbc06b02390389fd5b72a6d251d88949ad3d91	7 years ago
Maysam Yabandeh	2462763b2e	Fix mis-spoken assert on prefetch_filter and prefetch_index (#4077 ) Summary: We can have prefetch_index without prefetch_filter but not the other way around. The assert statement is fixed. Closes https://github.com/facebook/rocksdb/pull/4077 Differential Revision: D8694472 Pulled By: maysamyabandeh fbshipit-source-id: ccd2804d9d9cdafb1c3e65062c7bc38603e69004	7 years ago
Maysam Yabandeh	29ffbb8a50	Charging block cache more accurately (#4073 ) Summary: Currently the block cache is charged only by the size of the raw data block and excludes the overhead of the c++ objects that contain the raw data block. The patch improves the accuracy of the charge by including the c++ object overhead into it. Closes https://github.com/facebook/rocksdb/pull/4073 Differential Revision: D8686552 Pulled By: maysamyabandeh fbshipit-source-id: 8472f7fc163c0644533bc6942e20cdd5725f520f	7 years ago
Zhongyi Xie	14f409c0f1	PrefixMayMatch: remove unnecessary check for prefix_extractor_ (#4067 ) Summary: with https://github.com/facebook/rocksdb/pull/3601 and https://github.com/facebook/rocksdb/pull/3899, `prefix_extractor_` is not really being used in block based filter and full filter's version of `PrefixMayMatch` because now `prefix_extractor` is passed as an argument. Also it is now possible that prefix_extractor_ may be initialized to nullptr when a non-standard prefix_extractor is used and also for ROCKSDB_LITE. Removing these checks should not break any existing tests. Closes https://github.com/facebook/rocksdb/pull/4067 Differential Revision: D8669002 Pulled By: miasantreble fbshipit-source-id: 0e701ba912b8a26734fadb72d15bb1b266b6176a	7 years ago
Zhichao Cao	1f6efabe23	Add bottommost_compression_opts to for bottommost_compression (#3985 ) Summary: …ression For `CompressionType` we have options `compression` and `bottommost_compression`. Thus, to make the compression options consitent with the compression type when bottommost_compression is enabled, we add the bottommost_compression_opts Closes https://github.com/facebook/rocksdb/pull/3985 Reviewed By: riversand963 Differential Revision: D8385911 Pulled By: zhichao-cao fbshipit-source-id: 07bc533dd61bcf1cef5927d8d62901c13d38d5fc	7 years ago
Maysam Yabandeh	235ab9dd32	Pin mmap files in ReadOnlyDB (#4053 ) Summary: https://github.com/facebook/rocksdb/pull/3881 fixed a bug where PinnableSlice pin mmap files which could be deleted with background compaction. This is however a non-issue for ReadOnlyDB when there is no compaction running and max_open_files is -1. This patch reenables the pinning feature for that case. Closes https://github.com/facebook/rocksdb/pull/4053 Differential Revision: D8662546 Pulled By: maysamyabandeh fbshipit-source-id: 402962602eb0f644e17822748332999c3af029fd	7 years ago
Manuel Ung	a16e00b7b9	WriteUnPrepared Txn: Disable seek to snapshot optimization (#3955 ) Summary: This is implemented by extending ReadCallback with another function `MaxUnpreparedSequenceNumber` which returns the largest visible sequence number for the current transaction, if there is uncommitted data written to DB. Otherwise, it returns zero, indicating no uncommitted data. There are the places where reads had to be modified. - Get and Seek/Next was just updated to seek to max(snapshot_seq, MaxUnpreparedSequenceNumber()) instead, and iterate until a key was visible. - Prev did not need need updates since it did not use the Seek to sequence number optimization. Assuming that locks were held when writing unprepared keys, and ValidateSnapshot runs, there should only be committed keys and unprepared keys of the current transaction, all of which are visible. Prev will simply iterate to get the last visible key. - Reseeking to skip keys optimization was also disabled for write unprepared, since it's possible to hit the max_skip condition even while reseeking. There needs to be some way to resolve infinite looping in this case. Closes https://github.com/facebook/rocksdb/pull/3955 Differential Revision: D8286688 Pulled By: lth fbshipit-source-id: 25e42f47fdeb5f7accea0f4fd350ef35198caafe	7 years ago
Nikhil Benesch	17339dc2f3	Add table property tracking number of range deletions (#4016 ) Summary: Add a new table property, rocksdb.num.range-deletions, which tracks the number of range deletions in a block-based table. Range deletions are no longer counted in rocksdb.num.entries; as discovered in PR #3778, there are various code paths that implicitly assume that rocksdb.num.entries counts only true keys, not range deletions. /cc ajkr nvanbenschoten Closes https://github.com/facebook/rocksdb/pull/4016 Differential Revision: D8527575 Pulled By: ajkr fbshipit-source-id: 92e7edbe78fda53756a558013c9fb496e7764fd7	7 years ago
Zhongyi Xie	408205a36b	use user_key and iterate_upper_bound to determine compatibility of bloom filters (#3899 ) Summary: Previously in https://github.com/facebook/rocksdb/pull/3601 bloom filter will only be checked if `prefix_extractor` in the mutable_cf_options matches the one found in the SST file. This PR relaxes the requirement by checking if all keys in the range [user_key, iterate_upper_bound) all share the same prefix after transforming using the BF in the SST file. If so, the bloom filter is considered compatible and will continue to be looked at. Closes https://github.com/facebook/rocksdb/pull/3899 Differential Revision: D8157459 Pulled By: miasantreble fbshipit-source-id: 18d17cba56a1005162f8d5db7a27aba277089c41	7 years ago
Sagar Vemuri	189f0c27aa	Make BlockBasedTableIterator compaction-aware (#4048 ) Summary: Pass in `for_compaction` to `BlockBasedTableIterator` via `BlockBasedTableReader::NewIterator`. In `7103559f49`, `for_compaction` was set in `BlockBasedTable::Rep` via `BlockBasedTable::SetupForCompaction`. In hindsight it was not the right decision; it also caused TSAN to complain. Closes https://github.com/facebook/rocksdb/pull/4048 Differential Revision: D8601056 Pulled By: sagar0 fbshipit-source-id: 30127e898c15c38c1080d57710b8c5a6d64a0ab3	7 years ago
Maysam Yabandeh	80ade9ad83	Pin top-level index on partitioned index/filter blocks (#4037 ) Summary: Top-level index in partitioned index/filter blocks are small and could be pinned in memory. So far we use that by cache_index_and_filter_blocks to false. This however make it difficult to keep account of the total memory usage. This patch introduces pin_top_level_index_and_filter which in combination with cache_index_and_filter_blocks=true keeps the top-level index in cache and yet pinned them to avoid cache misses and also cache lookup overhead. Closes https://github.com/facebook/rocksdb/pull/4037 Differential Revision: D8596218 Pulled By: maysamyabandeh fbshipit-source-id: 3a5f7f9ca6b4b525b03ff6bd82354881ae974ad2	7 years ago
Sagar Vemuri	7103559f49	Improve direct IO range scan performance with readahead (#3884 ) Summary: This PR extends the improvements in #3282 to also work when using Direct IO. We see 4.5X performance improvement in seekrandom benchmark doing long range scans, when using direct reads, on flash. Description: This change improves the performance of iterators doing long range scans (e.g. big/full index or table scans in MyRocks) by using readahead and prefetching additional data on each disk IO, and storing in a local buffer. This prefetching is automatically enabled on noticing more than 2 IOs for the same table file during iteration. The readahead size starts with 8KB and is exponentially increased on each additional sequential IO, up to a max of 256 KB. This helps in cutting down the number of IOs needed to complete the range scan. Implementation Details: - Used `FilePrefetchBuffer` as the underlying buffer to store the readahead data. `FilePrefetchBuffer` can now take file_reader, readahead_size and max_readahead_size as input to the constructor, and automatically do readahead. - `FilePrefetchBuffer::TryReadFromCache` can now call `FilePrefetchBuffer::Prefetch` if readahead is enabled. - `AlignedBuffer` (which is the underlying store for `FilePrefetchBuffer`) now takes a few additional args in `AlignedBuffer::AllocateNewBuffer` to allow copying data from the old buffer. - Made sure not to re-read partial chunks of data that were already available in the buffer, from device again. - Fixed a couple of cases where `AlignedBuffer::cursize_` was not being properly kept up-to-date. Constraints: - Similar to #3282, this gets currently enabled only when ReadOptions.readahead_size = 0 (which is the default value). - Since the prefetched data is stored in a temporary buffer allocated on heap, this could increase the memory usage if you have many iterators doing long range scans simultaneously. - Enabled only for user reads, and disabled for compactions. Compaction reads are controlled by the options `use_direct_io_for_flush_and_compaction` and `compaction_readahead_size`, and the current feature takes precautions not to mess with them. Benchmarks: I used the same benchmark as used in #3282. Data fill: ``` TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=fillrandom -num=1000000000 -compression_type="none" -level_compaction_dynamic_level_bytes ``` Do a long range scan: Seekrandom with large number of nexts ``` TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=seekrandom -use_direct_reads -duration=60 -num=1000000000 -use_existing_db -seek_nexts=10000 -statistics -histogram ``` ``` Before: seekrandom : 37939.906 micros/op 26 ops/sec; 29.2 MB/s (1636 of 1999 found) With this change: seekrandom : 8527.720 micros/op 117 ops/sec; 129.7 MB/s (6530 of 7999 found) ``` ~4.5X perf improvement. Taken on an average of 3 runs. Closes https://github.com/facebook/rocksdb/pull/3884 Differential Revision: D8082143 Pulled By: sagar0 fbshipit-source-id: 4d7a8561cbac03478663713df4d31ad2620253bb	7 years ago
Maysam Yabandeh	28a9d8910b	Fix the bug with duplicate prefix in partition filters (#4024 ) Summary: https://github.com/facebook/rocksdb/pull/3764 introduced an optimization feature to skip duplicate prefix entires in full bloom filters. Unfortunately it also introduces a bug in partitioned full filters, where the duplicate prefix should still be inserted if it is in a new partition. The patch fixes the bug by resetting the duplicate detection logic each time a partition is cut. This bug could result into false negatives, which means that DB could skip an existing key. Closes https://github.com/facebook/rocksdb/pull/4024 Differential Revision: D8518866 Pulled By: maysamyabandeh fbshipit-source-id: 044f4d988e606a330ecafd8c79daceb68b8796bf	7 years ago
Siying Dong	92ee3350e0	BlockBasedTableIterator to keep BlockIter after out of upper bound (#4004 ) Summary: `b555ed30a4` makes the BlockBasedTableIterator to be invalidated if the current position if over the upper bound. However, this can bring performance regression to the case of multiple Seek()s hitting the same data block but all out of upper bound. For example, if an SST file has a data block containing following keys : {a, z} The user sets the upper bound to be "x", and it executed following queries: Seek("b") Seek("c") Seek("d") Before the upper bound optimization, these queries always come to this same current data block of the iterator, but now inside each Seek() the data block is read from the block cache but is returned again. To prevent this regression case, we keep the current data block iterator if it is upper bound. Closes https://github.com/facebook/rocksdb/pull/4004 Differential Revision: D8463192 Pulled By: siying fbshipit-source-id: 8710628b30acde7063a097c3184d6c4333a8ef81	7 years ago
Zhongyi Xie	80bc35927c	Should only decode restart points for uncompressed blocks (#3996 ) Summary: The Block object assumes contents are uncompressed. Block's constructor tries to read the number of restarts, but does not get an accurate number when its contents are compressed, which is causing issues like https://github.com/facebook/rocksdb/issues/3843. This PR address this issue by skipping reconstruction of restart points when blocks are known to be compressed. Somehow the restart points can be read directly when Snappy is used and some tests (for example https://github.com/facebook/rocksdb/blob/master/db/db_block_cache_test.cc#L196) expects blocks to be fully constructed even when Snappy compression is used, so here we keep the restart point logic for Snappy. Closes https://github.com/facebook/rocksdb/pull/3996 Differential Revision: D8416186 Pulled By: miasantreble fbshipit-source-id: 002c0b62b9e5d89fb7736563d354ce0023c8cb28	7 years ago
Siying Dong	d82f1421b4	Fix regression bug of Prev() with upper bound (#3989 ) Summary: A recent change pushed down the upper bound checking to child iterators. However, this causes the logic of following sequence wrong: Seek(key); if (!Valid()) SeekToLast(); Because !Valid() may be caused by upper bounds, rather than the end of the iterator. In this case SeekToLast() points to totally wrong places. This can cause wrong results, infinite loops, or segfault in some cases. This sequence is called when changing direction from forward to backward. And this by itself also implicitly happen during reseeking optimization in Prev(). Fix this bug by using SeekForPrev() rather than this sequuence, as what is already done in prefix extrator case. Closes https://github.com/facebook/rocksdb/pull/3989 Differential Revision: D8385422 Pulled By: siying fbshipit-source-id: 429e869990cfd2dc389421e0836fc496bed67bb4	7 years ago
Andrew Kryczka	9d347332fb	Fix argument mismatch in BlockBasedTableBuilder (#3974 ) Summary: The sixth argument should be `key_includes_seq` bool, the seventh a `GetContext`. We were mistakenly passing the `GetContext` as the sixth argument and relying on the default (nullptr) for the seventh. This would make statistics inaccurate, at least. Blame: `402b7aa0` Closes https://github.com/facebook/rocksdb/pull/3974 Differential Revision: D8344907 Pulled By: ajkr fbshipit-source-id: 3ad865a0541d6d30f75dfc726352788118cfe12e	7 years ago
Fenggang Wu	3593275357	Remove restart point from the properties_block (#3970 ) Summary: Property block will be read sequentially and cached in a heap located object, so there's no need for restart points. Thus we set the restart interval to infinity to save space. Closes https://github.com/facebook/rocksdb/pull/3970 Differential Revision: D8332586 Pulled By: fgwu fbshipit-source-id: 899c3267832a81d0f084ec2db6b387332f461134	7 years ago
Fenggang Wu	f4502944c3	Change db path for BlockBasedTableTest.BadOptions (#3965 ) Summary: BadOptions test creates a temporary db path changed to table_block_based_bad_options_test to avoid collide with that created by the PrefixAndWholeKeyTest Closes https://github.com/facebook/rocksdb/pull/3965 Differential Revision: D8316080 Pulled By: fgwu fbshipit-source-id: bb8e0fdfdb9abf0e5ce94494b4388cd1622ee032	7 years ago
Maysam Yabandeh	b73652169e	Extend format 3 to partitioned index/filters (#3958 ) Summary: format_version 3 changes the format of index blocks by storing user keys instead of the internal keys, which saves 8-bytes per key. This patch extends the format to top-level indexes in partitioned index/filters. Closes https://github.com/facebook/rocksdb/pull/3958 Differential Revision: D8294615 Pulled By: maysamyabandeh fbshipit-source-id: 17666cc16b8076c363972e2308e31547e835f0fe	7 years ago
Zhongyi Xie	f1592a06c2	run make format for PR 3838 (#3954 ) Summary: PR https://github.com/facebook/rocksdb/pull/3838 made some changes that triggers lint warnings. Run `make format` to fix formatting as suggested by siying . Also piggyback two changes: 1) fix singleton destruction order for windows and posix env 2) fix two clang warnings Closes https://github.com/facebook/rocksdb/pull/3954 Differential Revision: D8272041 Pulled By: miasantreble fbshipit-source-id: 7c4fd12bd17aac13534520de0c733328aa3c6c9f	7 years ago
Mike Kolupaev	812c7371d3	Fix performance regression in Get() for block-based tables (#3953 ) Summary: This fixes a regression in one of myrocks regression tests (readwhilewriting), introduced in `8bf555f487` This PR changes two lines of code: one of them actually fixes the observed regression, the other is a mostly unrelated small fix that I'm piggy-backing here. EDIT: Nevermind, it fixes one line. More details in inline comments. Closes https://github.com/facebook/rocksdb/pull/3953 Differential Revision: D8270664 Pulled By: al13n321 fbshipit-source-id: a7d91e196807d1e816551591257c700f70e4ccac	7 years ago
Maysam Yabandeh	d0c38c0c8c	Extend some tests to format_version=3 (#3942 ) Summary: format_version=3 changes the format of SST index. This is however not being tested currently since tests only work with the default format_version which is currently 2. The patch extends the most related tests to also test for format_version=3. Closes https://github.com/facebook/rocksdb/pull/3942 Differential Revision: D8238413 Pulled By: maysamyabandeh fbshipit-source-id: 915725f55753dd8e9188e802bf471c23645ad035	7 years ago
Dmitri Smirnov	f4b72d7056	Provide a way to override windows memory allocator with jemalloc for ZSTD Summary: Windows does not have LD_PRELOAD mechanism to override all memory allocation functions and ZSTD makes use of C-tuntime calloc. During flushes and compactions default system allocator fragments and the system slows down considerably. For builds with jemalloc we employ an advanced ZSTD context creation API that re-directs memory allocation to jemalloc. To reduce the cost of context creation on each block we cache ZSTD context within the block based table builder while a new SST file is being built, this will help all platform builds including those w/o jemalloc. This avoids system allocator fragmentation and improves the performance. The change does not address random reads and currently on Windows reads with ZSTD regress as compared with SNAPPY compression. Closes https://github.com/facebook/rocksdb/pull/3838 Differential Revision: D8229794 Pulled By: miasantreble fbshipit-source-id: 719b622ab7bf4109819bc44f45ec66f0dd3ee80d	7 years ago
Andrew Kryczka	fea2b1dfb2	Copy Get() result when file reads use mmap Summary: For iterator reads, a `SuperVersion` is pinned to preserve a snapshot of SST files, and `Block`s are pinned to allow `key()` and `value()` to return pointers directly into a RocksDB memory region. This works for both non-mmap reads, where the block owns the memory region, and mmap reads, where the file owns the memory region. For point reads with `PinnableSlice`, only the `Block` object is pinned. This works for non-mmap reads because the block owns the memory region, so even if the file is deleted after compaction, the memory region survives. However, for mmap reads, file deletion causes the memory region to which the `PinnableSlice` refers to be unmapped. The result is usually a segfault upon accessing the `PinnableSlice`, although sometimes it returned wrong results (I repro'd this a bunch of times with `db_stress`). This PR copies the value into the `PinnableSlice` when it comes from mmap'd memory. We can tell whether the `Block` owns its memory using `Block::cachable()`, which is unset when reads do not use the provided buffer as is the case with mmap file reads. When that is false we ensure the result of `Get()` is copied. This feels like a short-term solution as ideally we'd have the `PinnableSlice` pin the mmap'd memory so we can do zero-copy reads. It seemed hard so I chose this approach to fix correctness in the meantime. Closes https://github.com/facebook/rocksdb/pull/3881 Differential Revision: D8076288 Pulled By: ajkr fbshipit-source-id: 31d78ec010198723522323dbc6ea325122a46b08	7 years ago
Zhongyi Xie	2a0dfaa044	fix PrefixExtractorChanged: pass raw pointer instead shared_ptr Summary: This should resolve the performance regression caused by the unnecessary copying of the shared_ptr. Closes https://github.com/facebook/rocksdb/pull/3937 Differential Revision: D8232330 Pulled By: miasantreble fbshipit-source-id: 7885bf7cd190b6f87164c52d6edd328298c13f97	7 years ago
Maysam Yabandeh	44cf84932f	Fix the bug of some test scenarios being put after kEnd Summary: DBTestBase::OptionConfig includes the scenarios that unit tests could iterate over them by calling ChangeOptions(). Some of the options have been mistakenly put after kEnd which makes them essentially invisible to ChangeOptions() caller. This patch fixes it except for kUniversalSubcompactions which is left as TODO since it would break some unit tests. Closes https://github.com/facebook/rocksdb/pull/3935 Differential Revision: D8230748 Pulled By: maysamyabandeh fbshipit-source-id: edddb8fffcd161af1809fef24798ce118f8593db	7 years ago
Maysam Yabandeh	03cda531e4	Check for rep_->table_properties being nullptr Summary: The very old sst formats do not have table_properties and rep_->table_properties is thus nullptr. The recent patch in https://github.com/facebook/rocksdb/pull/3894 does not check for nullptr and hence makes it backward incompatible. This patch adds the check. Closes https://github.com/facebook/rocksdb/pull/3918 Differential Revision: D8188638 Pulled By: maysamyabandeh fbshipit-source-id: b1d986665ecf0b4d1c442adfa8a193b97707d47b	7 years ago
Maysam Yabandeh	402b7aa07f	Exclude seq from index keys Summary: Index blocks have the same format as data blocks. The keys therefore similarly to the keys in the data blocks are internal keys, which means that in addition to the user key it also has 8 bytes that encodes sequence number and value type. This extra 8 bytes however is not necessary in index blocks since the index keys act as an separator between two data blocks. The only exception is when the last key of a block and the first key of the next block share the same user key, in which the sequence number is required to act as a separator. The patch excludes the sequence from index keys only if the above special case does not happen for any of the index keys. It then records that in the property block. The reader looks at the property block to see if it should expect sequence numbers in the keys of the index block.s Closes https://github.com/facebook/rocksdb/pull/3894 Differential Revision: D8118775 Pulled By: maysamyabandeh fbshipit-source-id: 915479f028b5799ca91671d67455ecdefbd873bd	7 years ago
Nathan VanBenschoten	8c3bf0801b	Check status when reading HashIndexPrefixesMetadataBlock Summary: This was missed in a refactor of `ReadBlockContents` (`2f1a3a4`). Closes https://github.com/facebook/rocksdb/pull/3906 Differential Revision: D8172648 Pulled By: ajkr fbshipit-source-id: 27e453b19795fea974bfed4721105be6f3a12090	7 years ago
Yanqin Jin	aa53579d6c	Fix segfault caused by object premature destruction Summary: Please refer to earlier discussion in [issue 3609](https://github.com/facebook/rocksdb/issues/3609). There was also an alternative fix in [PR 3888](https://github.com/facebook/rocksdb/pull/3888), but the proposed solution requires complex change. To summarize the cause of the problem. Upon creation of a column family, a `BlockBasedTableFactory` object is `new`ed and encapsulated by a `std::shared_ptr`. Since there is no other `std::shared_ptr` pointing to this `BlockBasedTableFactory`, when the column family is dropped, the `ColumnFamilyData` is `delete`d, causing the destructor of `std::shared_ptr`. Since there is no other `std::shared_ptr`, the underlying memory is also freed. Later when the db exits, it releases all the table readers, including the table readers that have been operating on the dropped column family. This needs to access the `table_options` owned by `BlockBasedTableFactory` that has already been deleted. Therefore, a segfault is raised. Previous workaround is to purge all obsolete files upon `ColumnFamilyData` destruction, which leads to a force release of table readers of the dropped column family. However this does not work when the user disables file deletion. Our solution in this PR is making a copy of `table_options` in `BlockBasedTable::Rep`. This solution increases memory copy and usage, but is much simpler. Test plan ``` $ make -j16 $ ./column_family_test --gtest_filter=ColumnFamilyTest.CreateDropAndDestroy:ColumnFamilyTest.CreateDropAndDestroyWithoutFileDeletion ``` Expected behavior: All tests should pass. Closes https://github.com/facebook/rocksdb/pull/3898 Differential Revision: D8149421 Pulled By: riversand963 fbshipit-source-id: eaecc2e064057ef607fbdd4cc275874f866c3438	7 years ago
Zhongyi Xie	6c73a46693	Fix a backward compatibility problem with table_properties being nullptr Summary: Currently when ldb built from master tries to open a DB from version 2.2, there will be a segfault because table_properties didn't exist back then. Closes https://github.com/facebook/rocksdb/pull/3890 Differential Revision: D8100914 Pulled By: miasantreble fbshipit-source-id: b255e8aedc54695432be2e704839c857dabdd65a	7 years ago
Zhongyi Xie	c3ebc75843	Move prefix_extractor to MutableCFOptions Summary: Currently it is not possible to change bloom filter config without restart the db, which is causing a lot of operational complexity for users. This PR aims to make it possible to dynamically change bloom filter config. Closes https://github.com/facebook/rocksdb/pull/3601 Differential Revision: D7253114 Pulled By: miasantreble fbshipit-source-id: f22595437d3e0b86c95918c484502de2ceca120c	7 years ago
Andrew Kryczka	7b655214d2	Assert keys/values pinned by range deletion meta-block iterators Summary: `RangeDelAggregator` holds the pointers returned by `BlockIter::key()` and `BlockIter::value()` so requires the data to which they point is pinned. `BlockIter::key()` points into block memory and is guaranteed to be pinned if and only if prefix encoding is disabled (or, equivalently, restart interval is set to one). I think `BlockIter::value()` is always pinned. Added an assert for these and removed the wrong TODO about increasing restart interval, which would enable key prefix encoding and break the assertion. Closes https://github.com/facebook/rocksdb/pull/3875 Differential Revision: D8063667 Pulled By: ajkr fbshipit-source-id: 60b5ebcc0cdd610dd6aad9e74a23378793672c41	7 years ago
Siying Dong	26da3676d9	class Block to store num_restarts_ Summary: Right now, every Block::NewIterator() reads num_restarts_ from the block, which is already read in Block::Block(). This sometimes cause a CPU cache miss. Although fetching this cacheline can usually benefit follow-up block restart offset reading, as they are close to each other, it's almost free to get ride of this read by storing it in the Block class. Closes https://github.com/facebook/rocksdb/pull/3869 Differential Revision: D8052493 Pulled By: siying fbshipit-source-id: 9c72360f0c2d7329f3c198ce4eaedd2bc14b87c1	8 years ago
Mike Kolupaev	8bf555f487	Change and clarify the relationship between Valid(), status() and Seek() for all iterators. Also fix some bugs Summary: Before this PR, Iterator/InternalIterator may simultaneously have non-ok status() and Valid() = true. That state means that the last operation failed, but the iterator is nevertheless positioned on some unspecified record. Likely intended uses of that are: If some sst files are corrupted, a normal iterator can be used to read the data from files that are not corrupted. * When using read_tier = kBlockCacheTier, read the data that's in block cache, skipping over the data that is not. However, this behavior wasn't documented well (and until recently the wiki on github had misleading incorrect information). In the code there's a lot of confusion about the relationship between status() and Valid(), and about whether Seek()/SeekToLast()/etc reset the status or not. There were a number of bugs caused by this confusion, both inside rocksdb and in the code that uses rocksdb (including ours). This PR changes the convention to: * If status() is not ok, Valid() always returns false. * Any seek operation resets status. (Before the PR, it depended on iterator type and on particular error.) This does sacrifice the two use cases listed above, but siying said it's ok. Overview of the changes: * A commit that adds missing status checks in MergingIterator. This fixes a bug that actually affects us, and we need it fixed. `DBIteratorTest.NonBlockingIterationBugRepro` explains the scenario. * Changes to lots of iterator types to make all of them conform to the new convention. Some bug fixes along the way. By far the biggest changes are in DBIter, which is a big messy piece of code; I tried to make it less big and messy but mostly failed. * A stress-test for DBIter, to gain some confidence that I didn't break it. It does a few million random operations on the iterator, while occasionally modifying the underlying data (like ForwardIterator does) and occasionally returning non-ok status from internal iterator. To find the iterator types that needed changes I searched for "public .Iterator" in the code. Here's an overview of all 27 iterator types: Iterators that didn't need changes: status() is always ok(), or Valid() is always false: MemTableIterator, ModelIter, TestIterator, KVIter (2 classes with this name anonymous namespaces), LoggingForwardVectorIterator, VectorIterator, MockTableIterator, EmptyIterator, EmptyInternalIterator. * Thin wrappers that always pass through Valid() and status(): ArenaWrappedDBIter, TtlIterator, InternalIteratorFromIterator. Iterators with changes (see inline comments for details): * DBIter - an overhaul: - It used to silently skip corrupted keys (`FindParseableKey()`), which seems dangerous. This PR makes it just stop immediately after encountering a corrupted key, just like it would for other kinds of corruption. Let me know if there was actually some deeper meaning in this behavior and I should put it back. - It had a few code paths silently discarding subiterator's status. The stress test caught a few. - The backwards iteration code path was expecting the internal iterator's set of keys to be immutable. It's probably always true in practice at the moment, since ForwardIterator doesn't support backwards iteration, but this PR fixes it anyway. See added DBIteratorTest.ReverseToForwardBug for an example. - Some parts of backwards iteration code path even did things like `assert(iter_->Valid())` after a seek, which is never a safe assumption. - It used to not reset status on seek for some types of errors. - Some simplifications and better comments. - Some things got more complicated from the added error handling. I'm open to ideas for how to make it nicer. * MergingIterator - check status after every operation on every subiterator, and in some places assert that valid subiterators have ok status. * ForwardIterator - changed to the new convention, also slightly simplified. * ForwardLevelIterator - fixed some bugs and simplified. * LevelIterator - simplified. * TwoLevelIterator - changed to the new convention. Also fixed a bug that would make SeekForPrev() sometimes silently ignore errors from first_level_iter_. * BlockBasedTableIterator - minor changes. * BlockIter - replaced `SetStatus()` with `Invalidate()` to make sure non-ok BlockIter is always invalid. * PlainTableIterator - some seeks used to not reset status. * CuckooTableIterator - tiny code cleanup. * ManagedIterator - fixed some bugs. * BaseDeltaIterator - changed to the new convention and fixed a bug. * BlobDBIterator - seeks used to not reset status. * KeyConvertingIterator - some small change. Closes https://github.com/facebook/rocksdb/pull/3810 Differential Revision: D7888019 Pulled By: al13n321 fbshipit-source-id: 4aaf6d3421c545d16722a815b2fa2e7912bc851d	8 years ago
Siying Dong	ddfd2525d2	Make BlockIter final Summary: Now BlockBasedTableIterator directly uses BlockIter. By making BlockIter final, we can prevent unintended virtual function overriding. Closes https://github.com/facebook/rocksdb/pull/3828 Differential Revision: D7933816 Pulled By: siying fbshipit-source-id: 026a08cb5c5b6d3d6f44743152b4251da4756f2c	8 years ago
Maysam Yabandeh	fc522bdb3e	Evenly split HarnessTest.Randomized Summary: Currently HarnessTest.Randomized is already split but some of the splits are faster than the others. The reason is that each split takes a continuous range of the generated args and the test with later args takes longer to finish. The patch evenly split the args among splits in a round robin fashion. Before: ``` [ OK ] HarnessTest.Randomized1n2 (2278 ms) [ OK ] HarnessTest.Randomized3n4 (1095 ms) [ OK ] HarnessTest.Randomized5 (658 ms) [ OK ] HarnessTest.Randomized6 (1258 ms) [ OK ] HarnessTest.Randomized7 (6476 ms) [ OK ] HarnessTest.Randomized8 (8182 ms) ``` After ``` [ OK ] HarnessTest.Randomized1 (2649 ms) [ OK ] HarnessTest.Randomized2 (2645 ms) [ OK ] HarnessTest.Randomized3 (2577 ms) [ OK ] HarnessTest.Randomized4 (2490 ms) [ OK ] HarnessTest.Randomized5 (2553 ms) [ OK ] HarnessTest.Randomized6 (2560 ms) [ OK ] HarnessTest.Randomized7 (2501 ms) [ OK ] HarnessTest.Randomized8 (2574 ms) ``` Closes https://github.com/facebook/rocksdb/pull/3808 Differential Revision: D7882663 Pulled By: maysamyabandeh fbshipit-source-id: 09b749a9684b6d7d65466aa4b00c5334a49e833e	8 years ago

... 4 5 6 7 8 ...

1022 Commits (8abd41a54413751748c1a62397002483e0a001e2)