Summary:
For our default block cache, each additional entry has extra memory overhead. It include LRUHandle (72 bytes currently) and the cache key (two varint64, file id and offset). The usage is not negligible. For example for block_size=4k, the overhead accounts for an extra 2% memory usage for the cache. The patch charging the cache for the extra usage, reducing untracked memory usage outside block cache. The feature is enabled by default and can be disabled by passing kDontChargeCacheMetadata to the cache constructor.
This PR builds up on https://github.com/facebook/rocksdb/issues/4258
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5797
Test Plan:
- Existing tests are updated to either disable the feature when the test has too much dependency on the old way of accounting the usage or increasing the cache capacity to account for the additional charge of metadata.
- The Usage tests in cache_test.cc are augmented to test the cache usage under kFullChargeCacheMetadata.
Differential Revision: D17396833
Pulled By: maysamyabandeh
fbshipit-source-id: 7684ccb9f8a40ca595e4f5efcdb03623afea0c6f
Summary:
Use delete to disable automatic generated methods instead of private, and put the constructor together for more clear.This modification cause the unused field warning, so add unused attribute to disable this warning.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5009
Differential Revision: D17288733
fbshipit-source-id: 8a767ce096f185f1db01bd28fc88fef1cdd921f3
Summary:
The first key is used to defer reading the data block until this file gets to the top of merging iterator's heap. For short range scans, most files never make it to the top of the heap, so this change can reduce read amplification by a lot sometimes.
Consider the following workload. There are a few data streams (we'll be calling them "logs"), each stream consisting of a sequence of blobs (we'll be calling them "records"). Each record is identified by log ID and a sequence number within the log. RocksDB key is concatenation of log ID and sequence number (big endian). Reads are mostly relatively short range scans, each within a single log. Writes are mostly sequential for each log, but writes to different logs are randomly interleaved. Compactions are disabled; instead, when we accumulate a few tens of sst files, we create a new column family and start writing to it.
So, a typical sst file consists of a few ranges of blocks, each range corresponding to one log ID (we use FlushBlockPolicy to cut blocks at log boundaries). A typical read would go like this. First, iterator Seek() reads one block from each sst file. Then a series of Next()s move through one sst file (since writes to each log are mostly sequential) until the subiterator reaches the end of this log in this sst file; then Next() switches to the next sst file and reads sequentially from that, and so on. Often a range scan will only return records from a small number of blocks in small number of sst files; in this case, the cost of initial Seek() reading one block from each file may be bigger than the cost of reading the actually useful blocks.
Neither iterate_upper_bound nor bloom filters can prevent reading one block from each file in Seek(). But this PR can: if the index contains first key from each block, we don't have to read the block until this block actually makes it to the top of merging iterator's heap, so for short range scans we won't read any blocks from most of the sst files.
This PR does the deferred block loading inside value() call. This is not ideal: there's no good way to report an IO error from inside value(). As discussed with siying offline, it would probably be better to change InternalIterator's interface to explicitly fetch deferred value and get status. I'll do it in a separate PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5289
Differential Revision: D15256423
Pulled By: al13n321
fbshipit-source-id: 750e4c39ce88e8d41662f701cf6275d9388ba46a
Summary:
Instead of creating a new DataBlockIterator for every key in a MultiGet batch, reuse it if the next key is in the same block. This results in a small 1-2% cpu improvement.
TEST_TMPDIR=/dev/shm/multiget numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Without the change -
multireadrandom : 3.066 micros/op 326122 ops/sec; (29375968 of 29375968 found)
With the change -
multireadrandom : 3.003 micros/op 332945 ops/sec; (29983968 of 29983968 found)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5314
Differential Revision: D15742108
Pulled By: anand1976
fbshipit-source-id: 220fb0b8eea9a0d602ddeb371528f7af7936d771
Summary:
There are too many types of files under util/. Some test related files don't belong to there or just are just loosely related. Mo
ve them to a new directory test_util/, so that util/ is cleaner.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5377
Differential Revision: D15551366
Pulled By: siying
fbshipit-source-id: 0f5c8653832354ef8caa31749c0143815d719e2c
Summary:
In long scans, virtual function calls of Next(), Valid(), key() and value() are not trivial. By introducing NextAndGetResult(), Some of the Next(), Valid() and key() calls are consolidated into one virtual function call to reduce CPU.
Also did some inline tricks and add some "final" randomly in some functions. Even without the "final" annotation, most Next() calls are inlined with -O3, but sometimes with a final it is inlined by O2 too. It doesn't hurt to add those final annotations.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5197
Differential Revision: D14945977
Pulled By: siying
fbshipit-source-id: 7003969f9a5f1d5717f0bda503b91d19ba75ed88
Summary:
This is second attempt for #5101. Original commit message:
`BlockBasedTableIterator` avoid reading next block on `Next()` if it detects the iterator will be out of bound, by checking against index key. The optimization was added in #2239, and by the time it only check the bound per block. It seems later change make it a per-key check, which introduce unnecessary key comparisons.
This patch come with two fixes:
Fix 1: To optimize checking for bounds, we need comparing the bounds with index key as well. However BlockBasedTableIterator doesn't know whether its index iterator is internally using user keys or internal keys. The patch fixes that by extending InternalIterator with a user_key() function that is overridden by In IndexBlockIter.
Fix 2: In #5101 we return `IsOutOfBound()=true` when block index key is out of bound. But the index key can be larger than smallest key of the next file on the level. That file can be within upper bound and should not be filtered out.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5142
Differential Revision: D14907113
Pulled By: siying
fbshipit-source-id: ac95775c5b4e7b700f76ab43e39f45402c98fbfb
Summary:
Following files were run through automatic formatter:
db/db_impl.cc
db/db_impl.h
db/db_impl_compaction_flush.cc
db/db_impl_debug.cc
db/db_impl_files.cc
db/db_impl_readonly.h
db/db_impl_write.cc
db/dbformat.cc
db/dbformat.h
table/block.cc
table/block.h
table/block_based_filter_block.cc
table/block_based_filter_block.h
table/block_based_filter_block_test.cc
table/block_based_table_builder.cc
table/block_based_table_reader.cc
table/block_based_table_reader.h
table/block_builder.cc
table/block_builder.h
table/block_fetcher.cc
table/block_prefix_index.cc
table/block_prefix_index.h
table/block_test.cc
table/format.cc
table/format.h
I could easily run all the files, but I don't want people to feel that
I'm doing it for lines of code changes :)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5114
Differential Revision: D14633040
Pulled By: siying
fbshipit-source-id: 3f346cb53bf21e8c10704400da548dfce1e89a52
Summary:
Always enable properties block checksum verification for block-based table. For external SST file ingested with 'write_global_seqno==true', we use 'DecodeEntrySlow' to parse its blocks' contents so that the process will not die upon failing the assertion possibly caused by corruption.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4956
Differential Revision: D14012741
Pulled By: riversand963
fbshipit-source-id: 8b766e6f54b36f8f9e074c0e19e0926ec3cce186
Summary:
We carry compression type and "cachable" variables for every block in the block cache, while they take well-known values. 8-byte is wasted for each block (2-byte for useful information but it takes 8 bytes because of padding). With this change, these two variables are removed.
The cachable information is only useful in the process of reading the block. We use other information to infer from it. For compressed blocks, the compression type is a part of the block content itself so we can get it from there.
Some code is slightly refactored so that the cachable information can flow better.
Another change is to only use class BlockContents for compressed block, and narrow the class Block to only be used for uncompressed blocks, including blocks in compressed block cache. This can make the Block class less confusing. It also saves tens of bytes for each block in compressed block cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4650
Differential Revision: D12969070
Pulled By: siying
fbshipit-source-id: 548b62724e9eb66993026429fd9c7c3acd1f95ed
Summary:
With #3983 the size of IndexBlockIter was increased. This had resulted in a regression on P50 latencies in one of our benchmarks. The patch reduces IndexBlockIter size be eliminating active_comparator_ field from the class.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4358
Differential Revision: D9781737
Pulled By: maysamyabandeh
fbshipit-source-id: 71e2b28d90ff0813db9e04b737ae73e185583c52
Summary:
Add hash index support to data blocks, which helps to reduce the CPU utilization of point-lookup operations. This feature is backward compatible with the data block created without the hash index. It is disabled by default unless `BlockBasedTableOptions::data_block_index_type` is set to `data_block_index_type = kDataBlockBinaryAndHash.`
The DB size would be bigger with the hash index option as a hash table is added at the end of each data block. If the hash utilization ratio is 1:1, the space overhead is one byte per key. The hash table utilization ratio is adjustable using `BlockBasedTableOptions::data_block_hash_table_util_ratio`. A lower utilization ratio will improve more on the point-lookup efficiency, but take more space too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4174
Differential Revision: D8965914
Pulled By: fgwu
fbshipit-source-id: 1c6bae5d1fc39c80282d8890a72e9e67bc247198
Summary:
Add a unit test to check that iterators release data blocks after it has moved away from it. Verify the same for compaction input iterators.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4170
Differential Revision: D8962513
Pulled By: siying
fbshipit-source-id: 05a5b604d7d29887fb488f2cda7286f554a14407
Summary:
Given that index value is a BlockHandle, which is basically an <offset, size> pair we can apply delta encoding on the values. The first value at each index restart interval encoded the full BlockHandle but the rest encode only the size. Refer to IndexBlockIter::DecodeCurrentValue for the detail of the encoding. This reduces the index size which helps using the block cache more efficiently. The feature is enabled with using format_version 4.
The feature comes with a bit of cpu overhead which should be paid back by the higher cache hits due to smaller index block size.
Results with sysbench read-only using 4k blocks and using 16 index restart interval:
Format 2:
19585 rocksdb read-only range=100
Format 3:
19569 rocksdb read-only range=100
Format 4:
19352 rocksdb read-only range=100
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3983
Differential Revision: D8361343
Pulled By: maysamyabandeh
fbshipit-source-id: f882ee082322acac32b0072e2bdbb0b5f854e651
Summary:
Refactor IndexBlockIter to reduce conditional branches on key_includes_seq_. IndexBlockIter::Prev is also separated from DataBlockIter::Prev, not to cache the prev entries as they are of less importance when iterating over the index block.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4141
Differential Revision: D8866437
Pulled By: maysamyabandeh
fbshipit-source-id: fdac76880426fc2be7d3c6354c09ab98f6657d4b
Summary:
Some logic only related to IndexBlockIter is separated from BlockIter to IndexBlockIter. This is done by writing an exclusive Seek() and SeekForPrev() for DataBlockIter, and all metadata block iter and tombstone block iter now use data block iter. Dealing with the BinarySeek() sharing problem by passing in the comparator to use.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4136
Reviewed By: maysamyabandeh
Differential Revision: D8859673
Pulled By: siying
fbshipit-source-id: 703e5e6824b82b7cbf4721f3594b94127797ca9e
Summary:
BlockIter is getting crowded including details that specific only to either index or data blocks. The patch moves down such details to DataBlockIter and IndexBlockIter, both inheriting from BlockIter.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4121
Differential Revision: D8816832
Pulled By: maysamyabandeh
fbshipit-source-id: d492e74155c11d8a0c1c85cd7ee33d24c7456197
Summary:
Currently the block cache is charged only by the size of the raw data block and excludes the overhead of the c++ objects that contain the raw data block. The patch improves the accuracy of the charge by including the c++ object overhead into it.
Closes https://github.com/facebook/rocksdb/pull/4073
Differential Revision: D8686552
Pulled By: maysamyabandeh
fbshipit-source-id: 8472f7fc163c0644533bc6942e20cdd5725f520f
Summary:
This fixes a regression in one of myrocks regression tests (readwhilewriting), introduced in 8bf555f487
This PR changes two lines of code: one of them actually fixes the observed regression, the other is a mostly unrelated small fix that I'm piggy-backing here. EDIT: Nevermind, it fixes one line. More details in inline comments.
Closes https://github.com/facebook/rocksdb/pull/3953
Differential Revision: D8270664
Pulled By: al13n321
fbshipit-source-id: a7d91e196807d1e816551591257c700f70e4ccac
Summary:
For iterator reads, a `SuperVersion` is pinned to preserve a snapshot of SST files, and `Block`s are pinned to allow `key()` and `value()` to return pointers directly into a RocksDB memory region. This works for both non-mmap reads, where the block owns the memory region, and mmap reads, where the file owns the memory region.
For point reads with `PinnableSlice`, only the `Block` object is pinned. This works for non-mmap reads because the block owns the memory region, so even if the file is deleted after compaction, the memory region survives. However, for mmap reads, file deletion causes the memory region to which the `PinnableSlice` refers to be unmapped. The result is usually a segfault upon accessing the `PinnableSlice`, although sometimes it returned wrong results (I repro'd this a bunch of times with `db_stress`).
This PR copies the value into the `PinnableSlice` when it comes from mmap'd memory. We can tell whether the `Block` owns its memory using `Block::cachable()`, which is unset when reads do not use the provided buffer as is the case with mmap file reads. When that is false we ensure the result of `Get()` is copied.
This feels like a short-term solution as ideally we'd have the `PinnableSlice` pin the mmap'd memory so we can do zero-copy reads. It seemed hard so I chose this approach to fix correctness in the meantime.
Closes https://github.com/facebook/rocksdb/pull/3881
Differential Revision: D8076288
Pulled By: ajkr
fbshipit-source-id: 31d78ec010198723522323dbc6ea325122a46b08
Summary:
Index blocks have the same format as data blocks. The keys therefore similarly to the keys in the data blocks are internal keys, which means that in addition to the user key it also has 8 bytes that encodes sequence number and value type. This extra 8 bytes however is not necessary in index blocks since the index keys act as an separator between two data blocks. The only exception is when the last key of a block and the first key of the next block share the same user key, in which the sequence number is required to act as a separator.
The patch excludes the sequence from index keys only if the above special case does not happen for any of the index keys. It then records that in the property block. The reader looks at the property block to see if it should expect sequence numbers in the keys of the index block.s
Closes https://github.com/facebook/rocksdb/pull/3894
Differential Revision: D8118775
Pulled By: maysamyabandeh
fbshipit-source-id: 915479f028b5799ca91671d67455ecdefbd873bd
Summary:
Right now, every Block::NewIterator() reads num_restarts_ from the block, which is already read in Block::Block(). This sometimes cause a CPU cache miss. Although fetching this cacheline can usually benefit follow-up block restart offset reading, as they are close to each other, it's almost free to get ride of this read by storing it in the Block class.
Closes https://github.com/facebook/rocksdb/pull/3869
Differential Revision: D8052493
Pulled By: siying
fbshipit-source-id: 9c72360f0c2d7329f3c198ce4eaedd2bc14b87c1
Summary:
Before this PR, Iterator/InternalIterator may simultaneously have non-ok status() and Valid() = true. That state means that the last operation failed, but the iterator is nevertheless positioned on some unspecified record. Likely intended uses of that are:
* If some sst files are corrupted, a normal iterator can be used to read the data from files that are not corrupted.
* When using read_tier = kBlockCacheTier, read the data that's in block cache, skipping over the data that is not.
However, this behavior wasn't documented well (and until recently the wiki on github had misleading incorrect information). In the code there's a lot of confusion about the relationship between status() and Valid(), and about whether Seek()/SeekToLast()/etc reset the status or not. There were a number of bugs caused by this confusion, both inside rocksdb and in the code that uses rocksdb (including ours).
This PR changes the convention to:
* If status() is not ok, Valid() always returns false.
* Any seek operation resets status. (Before the PR, it depended on iterator type and on particular error.)
This does sacrifice the two use cases listed above, but siying said it's ok.
Overview of the changes:
* A commit that adds missing status checks in MergingIterator. This fixes a bug that actually affects us, and we need it fixed. `DBIteratorTest.NonBlockingIterationBugRepro` explains the scenario.
* Changes to lots of iterator types to make all of them conform to the new convention. Some bug fixes along the way. By far the biggest changes are in DBIter, which is a big messy piece of code; I tried to make it less big and messy but mostly failed.
* A stress-test for DBIter, to gain some confidence that I didn't break it. It does a few million random operations on the iterator, while occasionally modifying the underlying data (like ForwardIterator does) and occasionally returning non-ok status from internal iterator.
To find the iterator types that needed changes I searched for "public .*Iterator" in the code. Here's an overview of all 27 iterator types:
Iterators that didn't need changes:
* status() is always ok(), or Valid() is always false: MemTableIterator, ModelIter, TestIterator, KVIter (2 classes with this name anonymous namespaces), LoggingForwardVectorIterator, VectorIterator, MockTableIterator, EmptyIterator, EmptyInternalIterator.
* Thin wrappers that always pass through Valid() and status(): ArenaWrappedDBIter, TtlIterator, InternalIteratorFromIterator.
Iterators with changes (see inline comments for details):
* DBIter - an overhaul:
- It used to silently skip corrupted keys (`FindParseableKey()`), which seems dangerous. This PR makes it just stop immediately after encountering a corrupted key, just like it would for other kinds of corruption. Let me know if there was actually some deeper meaning in this behavior and I should put it back.
- It had a few code paths silently discarding subiterator's status. The stress test caught a few.
- The backwards iteration code path was expecting the internal iterator's set of keys to be immutable. It's probably always true in practice at the moment, since ForwardIterator doesn't support backwards iteration, but this PR fixes it anyway. See added DBIteratorTest.ReverseToForwardBug for an example.
- Some parts of backwards iteration code path even did things like `assert(iter_->Valid())` after a seek, which is never a safe assumption.
- It used to not reset status on seek for some types of errors.
- Some simplifications and better comments.
- Some things got more complicated from the added error handling. I'm open to ideas for how to make it nicer.
* MergingIterator - check status after every operation on every subiterator, and in some places assert that valid subiterators have ok status.
* ForwardIterator - changed to the new convention, also slightly simplified.
* ForwardLevelIterator - fixed some bugs and simplified.
* LevelIterator - simplified.
* TwoLevelIterator - changed to the new convention. Also fixed a bug that would make SeekForPrev() sometimes silently ignore errors from first_level_iter_.
* BlockBasedTableIterator - minor changes.
* BlockIter - replaced `SetStatus()` with `Invalidate()` to make sure non-ok BlockIter is always invalid.
* PlainTableIterator - some seeks used to not reset status.
* CuckooTableIterator - tiny code cleanup.
* ManagedIterator - fixed some bugs.
* BaseDeltaIterator - changed to the new convention and fixed a bug.
* BlobDBIterator - seeks used to not reset status.
* KeyConvertingIterator - some small change.
Closes https://github.com/facebook/rocksdb/pull/3810
Differential Revision: D7888019
Pulled By: al13n321
fbshipit-source-id: 4aaf6d3421c545d16722a815b2fa2e7912bc851d
Summary:
Now BlockBasedTableIterator directly uses BlockIter. By making BlockIter final, we can prevent unintended virtual function overriding.
Closes https://github.com/facebook/rocksdb/pull/3828
Differential Revision: D7933816
Pulled By: siying
fbshipit-source-id: 026a08cb5c5b6d3d6f44743152b4251da4756f2c
Summary:
This is a pre-cleaning up before a major block based table iterator refactoring. BlockBasedTable::NewDataBlockIterator() will always return BlockIter. This simplifies the logic and code and enable further refactoring and optimization.
Closes https://github.com/facebook/rocksdb/pull/3398
Differential Revision: D6780165
Pulled By: siying
fbshipit-source-id: 273f7dc896724f682c0118fb69a359d9cc4418b4
Summary:
In gcc-7 the following is an error identified by -Werror=class-memaccess
In file included from ./table/get_context.h:14:0,
from db/version_set.cc:43:
./table/block.h: In constructor ‘rocksdb::BlockReadAmpBitmap::BlockReadAmpBitmap(size_t, size_t, rocksdb::Statistics*)’:
./table/block.h:73:53: error: ‘void* memset(void*, int, size_t)’ clearing an object of type ‘struct std::atomic<unsigned int>’ with no trivial copy-assignment; use value-initialization instead [-Werror=class-memaccess]
memset(bitmap_, 0, bitmap_size * kBytesPersEntry);
^
In file included from ./db/version_set.h:23:0,
from db/version_set.cc:12:
/toolchain/include/c++/8.0.0/atomic:684:12: note: ‘struct std::atomic<unsigned int>’ declared here
struct atomic<unsigned int> : __atomic_base<unsigned int>
^~~~~~~~~~~~~~~~~~~~
As a solution the default initializer can be applied in list context.
Signed-off-by: Daniel Black <daniel.black@au.ibm.com>
Closes https://github.com/facebook/rocksdb/pull/2561
Differential Revision: D5398714
Pulled By: siying
fbshipit-source-id: d883fb88ec7535eee60d551038fe91f14488be36
Summary:
Consider BlockReadAmpBitmap with bytes_per_bit = 32. Suppose bytes [a, b) were used, while bytes [a-32, a)
and [b+1, b+33) weren't used; more formally, the union of ranges passed to BlockReadAmpBitmap::Mark() contains [a, b) and doesn't intersect with [a-32, a) and [b+1, b+33). Then bits [floor(a/32), ceil(b/32)] will be set, and so the number of useful bytes will be estimated as (ceil(b/32) - floor(a/32)) * 32, which is on average equal to b-a+31.
An extreme example: if we use 1 byte from each block, it'll be counted as 32 bytes from each block.
It's easy to remove this bias by slightly changing the semantics of the bitmap. Currently each bit represents a byte range [i*32, (i+1)*32).
This diff makes each bit represent a single byte: i*32 + X, where X is a random number in [0, 31] generated when bitmap is created. So, e.g., if you read a single byte at random, with probability 31/32 it won't be counted at all, and with probability 1/32 it will be counted as 32 bytes; so, on average it's counted as 1 byte.
*But there is one exception: the last bit will always set with the old way.*
(*) - assuming read_amp_bytes_per_bit = 32.
Closes https://github.com/facebook/rocksdb/pull/2259
Differential Revision: D5035652
Pulled By: lightmark
fbshipit-source-id: bd98b1b9b49fbe61f9e3781d07f624e3cbd92356
Summary:
to void future bug that caused by the mix of userkey/internalkey
Closes https://github.com/facebook/rocksdb/pull/2084
Differential Revision: D4825889
Pulled By: lightmark
fbshipit-source-id: 28411db
Summary:
We can support SST files >2GB but we don't support blocks >2GB
Closes https://github.com/facebook/rocksdb/pull/1465
Differential Revision: D4132140
Pulled By: yiwu-arbug
fbshipit-source-id: 63bf12d
Summary:
reland https://reviews.facebook.net/D62523
- Update SstFileWriter to include a property for a global sequence number in the SST file `rocksdb.external_sst_file.global_seqno`
- Update TableProperties to be aware of the offset of each property in the file
- Update BlockBasedTableReader and Block to be able to honor the sequence number in `rocksdb.external_sst_file.global_seqno` property and use it to overwrite all sequence number in the file
Something worth mentioning is that we don't update the seqno in the index block since and when doing a binary search, the reason for that is that it's guaranteed that SST files with global seqno will have only one user_key and each key will have seqno=0 encoded in it, This mean that this key is greater than any other key with seqno> 0. That mean that we can actually keep the current logic for these blocks
Test Plan: unit tests
Reviewers: sdong, yhchiang
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D65211
Summary:
- Update SstFileWriter to include a property for a global sequence number in the SST file `rocksdb.external_sst_file.global_seqno`
- Update TableProperties to be aware of the offset of each property in the file
- Update BlockBasedTableReader and Block to be able to honor the sequence number in `rocksdb.external_sst_file.global_seqno` property and use it to overwrite all sequence number in the file
Something worth mentioning is that we don't update the seqno in the index block since and when doing a binary search, the reason for that is that it's guaranteed that SST files with global seqno will have only one user_key and each key will have seqno=0 encoded in it, This mean that this key is greater than any other key with seqno> 0. That mean that we can actually keep the current logic for these blocks
Test Plan: unit tests
Reviewers: andrewkr, yhchiang, yiwu, sdong
Reviewed By: sdong
Subscribers: hcz, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D62523
Summary:
Add new Iterator API, `SeekForPrev`: find the last key that <= target key
support prefix_extractor
support prefix_same_as_start
support upper_bound
not supported in iterators without Prev()
Also add tests in db_iter_test and db_iterator_test
Pass all tests
Cheers!
Test Plan: make all check -j64
Reviewers: andrewkr, yiwu, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D64149
Summary: Fix two Windows build problems.
Test Plan: Build on Windows and run all Linux tests.
Reviewers: IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D63189
Summary:
Add ReadOptions::read_amp_bytes_per_bit option which allow us to create a bitmap for every data block we read
the bitmap will contain (block_size / read_amp_bytes_per_bit) bits.
We will use this bitmap to mark which bytes have been used of the block so we can calculate the read amplification
Test Plan: added new tests
Reviewers: andrewkr, yhchiang, sdong
Reviewed By: sdong
Subscribers: yiwu, leveldb, march, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D58707
Summary:
Right now the way we do BlockIter::Prev() is like this
- Go to the beginning of the restart interval
- Keep moving forward (and decoding keys using ParseNextKey()) until we reach the desired key
This can be optimized by caching the decoded entries in the first pass and reusing them in consecutive BlockIter::Prev() calls
Before caching
```
DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readreverse" --db="/dev/shm/bench_prev_opt/" --use_existing_db --disable_auto_compactions
DB path: [/dev/shm/bench_prev_opt/]
readreverse : 0.413 micros/op 2423972 ops/sec; 268.2 MB/s
DB path: [/dev/shm/bench_prev_opt/]
readreverse : 0.414 micros/op 2413867 ops/sec; 267.0 MB/s
DB path: [/dev/shm/bench_prev_opt/]
readreverse : 0.410 micros/op 2440881 ops/sec; 270.0 MB/s
DB path: [/dev/shm/bench_prev_opt/]
readreverse : 0.414 micros/op 2417298 ops/sec; 267.4 MB/s
DB path: [/dev/shm/bench_prev_opt/]
readreverse : 0.413 micros/op 2421682 ops/sec; 267.9 MB/s
```
After caching
```
DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readreverse" --db="/dev/shm/bench_prev_opt/" --use_existing_db --disable_auto_compactions
DB path: [/dev/shm/bench_prev_opt/]
readreverse : 0.324 micros/op 3088955 ops/sec; 341.7 MB/s
DB path: [/dev/shm/bench_prev_opt/]
readreverse : 0.335 micros/op 2980999 ops/sec; 329.8 MB/s
DB path: [/dev/shm/bench_prev_opt/]
readreverse : 0.341 micros/op 2929681 ops/sec; 324.1 MB/s
DB path: [/dev/shm/bench_prev_opt/]
readreverse : 0.344 micros/op 2908490 ops/sec; 321.8 MB/s
DB path: [/dev/shm/bench_prev_opt/]
readreverse : 0.338 micros/op 2958404 ops/sec; 327.3 MB/s
```
Test Plan: COMPILE_WITH_ASAN=1 make check -j64
Reviewers: andrewkr, yiwu, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, yoshinorim
Differential Revision: https://reviews.facebook.net/D59463
Summary: Deprecate this one option and delete code and tests that are now superfluous.
Test Plan: all tests pass
Reviewers: igor, yhchiang, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: msalib, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D55317
Summary:
While trying to reuse PinData() / ReleasePinnedData() .. to optimize away some memcpys I realized that there is a significant overhead for using PinData() / ReleasePinnedData if they were called many times.
This diff refactor the pinning logic by introducing PinnedIteratorsManager a centralized component that will be created once and will be notified whenever we need to Pin an Iterator. This implementation have much less overhead than the original implementation
Test Plan:
make check -j64
COMPILE_WITH_ASAN=1 make check -j64
Reviewers: yhchiang, sdong, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56493
Summary:
This patch update the Iterator API to introduce new functions that allow users to keep the Slices returned by key() valid as long as the Iterator is not deleted
ReadOptions::pin_data : If true keep loaded blocks in memory as long as the iterator is not deleted
Iterator::IsKeyPinned() : If true, this mean that the Slice returned by key() is valid as long as the iterator is not deleted
Also add a new option BlockBasedTableOptions::use_delta_encoding to allow users to disable delta_encoding if needed.
Benchmark results (using https://phabricator.fb.com/P20083553)
```
// $ du -h /home/tec/local/normal.4K.Snappy/db10077
// 6.1G /home/tec/local/normal.4K.Snappy/db10077
// $ du -h /home/tec/local/zero.8K.LZ4/db10077
// 6.4G /home/tec/local/zero.8K.LZ4/db10077
// Benchmarks for shard db10077
// _build/opt/rocks/benchmark/rocks_copy_benchmark \
// --normal_db_path="/home/tec/local/normal.4K.Snappy/db10077" \
// --zero_db_path="/home/tec/local/zero.8K.LZ4/db10077"
// First run
// ============================================================================
// rocks/benchmark/RocksCopyBenchmark.cpp relative time/iter iters/s
// ============================================================================
// BM_StringCopy 1.73s 576.97m
// BM_StringPiece 103.74% 1.67s 598.55m
// ============================================================================
// Match rate : 1000000 / 1000000
// Second run
// ============================================================================
// rocks/benchmark/RocksCopyBenchmark.cpp relative time/iter iters/s
// ============================================================================
// BM_StringCopy 611.99ms 1.63
// BM_StringPiece 203.76% 300.35ms 3.33
// ============================================================================
// Match rate : 1000000 / 1000000
```
Test Plan: Unit tests
Reviewers: sdong, igor, anthony, yhchiang, rven
Reviewed By: rven
Subscribers: dhruba, lovro, adsharma
Differential Revision: https://reviews.facebook.net/D48999
Summary:
Separate a new class InternalIterator from class Iterator, when the look-up is done internally, which also means they operate on key with sequence ID and type.
This change will enable potential future optimizations but for now InternalIterator's functions are still the same as Iterator's.
At the same time, separate the cleanup function to a separate class and let both of InternalIterator and Iterator inherit from it.
Test Plan: Run all existing tests.
Reviewers: igor, yhchiang, anthony, kradhakrishnan, IslamAbdelRahman, rven
Reviewed By: rven
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D48549
Summary:
Currently, when we insert something into block cache, we say that the block cache capacity decreased by the size of the block. However, size of the block might be less than the actual memory used by this object. For example, 4.5KB block will actually use 8KB of memory. So even if we configure block cache to 10GB, our actually memory usage of block cache will be 20GB!
This problem showed up a lot in testing and just recently also showed up in MongoRocks production where we were using 30GB more memory than expected.
This diff will fix the problem. Instead of counting the block size, we will count memory used by the block. That way, a block cache configured to be 10GB will actually use only 10GB of memory.
I'm using non-portable function and I couldn't find info on portability on Google. However, it seems to work on Linux, which will cover majority of our use-cases.
Test Plan:
1. fill up mongo instance with 80GB of data
2. restart mongo with block cache size configured to 10GB
3. do a table scan in mongo
4. memory usage before the diff: 12GB. memory usage after the diff: 10.5GB
Reviewers: sdong, MarkCallaghan, rven, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D40635
Summary:
We need to turn on -Wshorten-64-to-32 for mobile. See D1671432 (internal phabricator) for details.
This diff turns on the warning flag and fixes all the errors. There were also some interesting errors that I might call bugs, especially in plain table. Going forward, I think it makes sense to have this flag turned on and be very very careful when converting 64-bit to 32-bit variables.
Test Plan: compiles
Reviewers: ljin, rven, yhchiang, sdong
Reviewed By: yhchiang
Subscribers: bobbaldwin, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D28689