Summary:
Add kTypeBlobIndex value type, which will be used by blob db only, to insert a (key, blob_offset) KV pair. The purpose is to
1. Make it possible to open existing rocksdb instance as blob db. Existing value will be of kTypeIndex type, while value inserted by blob db will be of kTypeBlobIndex.
2. Make rocksdb able to detect if the db contains value written by blob db, if so return error.
3. Make it possible to have blob db optionally store value in SST file (with kTypeValue type) or as a blob value (with kTypeBlobIndex type).
The root db (DBImpl) basically pretended kTypeBlobIndex are normal value on write. On Get if is_blob is provided, return whether the value read is of kTypeBlobIndex type, or return Status::NotSupported() status if is_blob is not provided. On scan allow_blob flag is pass and if the flag is true, return wether the value is of kTypeBlobIndex type via iter->IsBlob().
Changes on blob db side will be in a separate patch.
Closes https://github.com/facebook/rocksdb/pull/2886
Differential Revision: D5838431
Pulled By: yiwu-arbug
fbshipit-source-id: 3c5306c62bc13bb11abc03422ec5cbcea1203cca
Summary:
In SST files, restart interval helps us search in data blocks. However, some meta blocks will be read sequentially, so there's no need for restart points. Restart interval will introduce extra space in the block (https://github.com/facebook/rocksdb/blob/master/table/block_builder.cc#L80). We will see if we can remove this redundant space. (Maybe set restart interval to infinite.)
Closes https://github.com/facebook/rocksdb/pull/2940
Differential Revision: D5930139
Pulled By: miasantreble
fbshipit-source-id: 92b1b23c15cffa90378343ac846b713623b19c21
Summary:
When using with compressed cache it is possible that the status is ok but the block is not actually added to the block cache. The patch takes this case into account.
Closes https://github.com/facebook/rocksdb/pull/2945
Differential Revision: D5937613
Pulled By: maysamyabandeh
fbshipit-source-id: 5428cf1115e5046b3d01ab78d26cb181122af4c6
Summary:
For every merge operand encountered for a key in the read path we now have the ability to decide whether to look further (to retrieve more merge operands for the key) or stop and invoke the merge operator to return the value. The user needs to override `ShouldMerge()` method with a condition to terminate search when true to avail this facility.
This has a couple of advantages:
1. It helps in limiting the number of merge operands that are looked at to compute a value as part of a user Get operation.
2. It allows to peek at a merge key-value to see if further merge operands need to look at.
Example: Limiting the number of merge operands that are looked at: Lets say you have 10 merge operands for a key spread over various levels. If you only want RocksDB to look at the latest two merge operands instead of all 10 to compute the value, it is now possible with this PR. You can set the condition in `ShouldMerge()` to return true when the size of the operand list is 2. Look at the example implementation in the unit test. Without this PR, a Get might look at all the 10 merge operands in different levels before invoking the merge-operator.
Added a new unit test.
Made sure that there is no perf regression by running benchmarks.
Command line to Load data:
```
TEST_TMPDIR=/dev/shm ./db_bench --benchmarks="mergerandom" --merge_operator="uint64add" --num=10000000
...
mergerandom : 12.861 micros/op 77757 ops/sec; 8.6 MB/s ( updates:10000000)
```
**ReadRandomMergeRandom bechmark results:**
Command line:
```
TEST_TMPDIR=/dev/shm ./db_bench --benchmarks="readrandommergerandom" --merge_operator="uint64add" --num=10000000
```
Base -- Without this code change (on commit fc7476b):
```
readrandommergerandom : 38.586 micros/op 25916 ops/sec; (reads:3001599 merges:6998401 total:10000000 hits:842235 maxlength:8)
```
With this code change:
```
readrandommergerandom : 38.653 micros/op 25870 ops/sec; (reads:3001599 merges:6998401 total:10000000 hits:842235 maxlength:8)
```
Closes https://github.com/facebook/rocksdb/pull/2923
Differential Revision: D5898239
Pulled By: sagar0
fbshipit-source-id: daefa325019f77968639a75c851d46352c2303ef
Summary:
A user encountered segfault on the call to `CacheDependencies()`, probably because `NewIndexIterator()` failed before populating `*index_entry`. Let's avoid the call in that case.
Closes https://github.com/facebook/rocksdb/pull/2939
Differential Revision: D5928611
Pulled By: ajkr
fbshipit-source-id: 484be453dbb00e5e160e9c6a1bc933df7d80f574
Summary:
Three small optimizations:
(1) iter_->IsKeyPinned() shouldn't be called if read_options.pin_data is not true. This may trigger function call all the way down the iterator tree.
(2) reuse the iterator key object in DBIter::FindNextUserEntryInternal(). The constructor of the class has some overheads.
(3) Move the switching direction logic in MergingIterator::Next() to a separate function.
These three in total improves readseq performance by about 3% in my benchmark setting.
Closes https://github.com/facebook/rocksdb/pull/2880
Differential Revision: D5829252
Pulled By: siying
fbshipit-source-id: 991aea10c6d6c3b43769cb4db168db62954ad1e3
Summary:
Merging iterator invokes InternalKeyComparator.Compare() frequently to heap merge. By making InternalKeyComparator final and merging iterator to directly use InternalKeyComparator rather than through Iterator interface, we can give compiler a choice to avoid one more virtual function call if possible. I ran readseq benchmark in memory-only use case to make sure the performance at least doesn't regress.
I have to disable the final key word in debug build, as a hack test class depends on overriding the class.
Closes https://github.com/facebook/rocksdb/pull/2860
Differential Revision: D5800461
Pulled By: siying
fbshipit-source-id: ab876f22a09bb5c560740911412336e0e25ccb53
Summary:
This patch instruments the read path to verify each read value against an optional ReadCallback class. If the value is rejected, the reader moves on to the next value. The WritePreparedTxn makes use of this feature to skip sequence numbers that are not in the read snapshot.
Closes https://github.com/facebook/rocksdb/pull/2850
Differential Revision: D5787375
Pulled By: maysamyabandeh
fbshipit-source-id: 49d808b3062ab35e7ae98ad388f659757794184c
Summary:
store a zero as the checksum when disabled since it's easier to keep block trailer a fixed length.
Closes https://github.com/facebook/rocksdb/pull/2781
Differential Revision: D5694702
Pulled By: ajkr
fbshipit-source-id: 69cea9da415778ba2b600dfd9d0dfc8cb5188ecd
Summary:
This is the warning that clang considers a bug and has been causing it to fail:
```
table/block_based_table_reader.cc:240:27: warning: Potential leak of memory pointed to by 'block.value'
for (; biter.Valid(); biter.Next()) {
^~~~~
```
Actually clang just doesn't have enough knowledge to statically determine it's safe. We can teach it using an assert.
Closes https://github.com/facebook/rocksdb/pull/2779
Differential Revision: D5691225
Pulled By: ajkr
fbshipit-source-id: 3f0d545bf44636953b30ee5243c63239e8f16d8e
Summary:
Allow `Slice` holding nullptr as a sentinel value but not in comparisons. This new restriction eliminates the need for the manual checks in 39ef900551, while still conforming to glibc's `memcmp` API. Thanks siying for the idea. Users may need to migrate, so mentioned it in HISTORY.md.
Closes https://github.com/facebook/rocksdb/pull/2777
Differential Revision: D5686016
Pulled By: ajkr
fbshipit-source-id: 03a2ca3fd9a0ebade9d0d5686c81d59a9534f563
Summary:
This is the continuation of https://github.com/facebook/rocksdb/pull/2661 for filter partitions. When pin_l0 is set (along with cache_xxx), then open table open the filter partitions are loaded into the cache and pinned there.
Closes https://github.com/facebook/rocksdb/pull/2766
Differential Revision: D5671098
Pulled By: maysamyabandeh
fbshipit-source-id: 174f24018f1d7f1129621e7380287b65b67d2115
Summary:
This fixes the existing logic for pinning l0 index partitions. The patch preloads the partitions into block cache and pin them if they belong to level 0 and pin_l0 is set.
The drawback is that it does many small IOs when preloading all the partitions into the cache is direct io is enabled. Working for a solution for that.
Closes https://github.com/facebook/rocksdb/pull/2661
Differential Revision: D5554010
Pulled By: maysamyabandeh
fbshipit-source-id: 1e6f32a3524d71355c77d4138516dcfb601ca7b2
Summary:
Right now, if direct I/O is enabled, prefetching the last 512KB cannot be applied, except compaction inputs or readahead is enabled for iterators. This can create a lot of I/O for HDD cases. To solve the problem, the 512KB is prefetched in block based table if direct I/O is enabled. The prefetched buffer is passed in totegher with random access file reader, so that we try to read from the buffer before reading from the file. This can be extended in the future to support flexible user iterator readahead too.
Closes https://github.com/facebook/rocksdb/pull/2708
Differential Revision: D5593091
Pulled By: siying
fbshipit-source-id: ee36ff6d8af11c312a2622272b21957a7b5c81e7
Summary:
We need a tool to check any sst file corruption in the db.
It will check all the sst files in current version and read all the blocks (data, meta, index) with checksum verification. If any verification fails, the function will return non-OK status.
Closes https://github.com/facebook/rocksdb/pull/2498
Differential Revision: D5324269
Pulled By: lightmark
fbshipit-source-id: 6f8a272008b722402a772acfc804524c9d1a483b
Summary:
Replace dynamic_cast<> so that users can choose to build with RTTI off, so that they can save several bytes per object, and get tiny more memory available.
Some nontrivial changes:
1. Add Comparator::GetRootComparator() to get around the internal comparator hack
2. Add the two experiemental functions to DB
3. Add TableFactory::GetOptionString() to avoid unnecessary casting to get the option string
4. Since 3 is done, move the parsing option functions for table factory to table factory files too, to be symmetric.
Closes https://github.com/facebook/rocksdb/pull/2645
Differential Revision: D5502723
Pulled By: siying
fbshipit-source-id: fd13cec5601cf68a554d87bfcf056f2ffa5fbf7c
Summary:
Breaking commit: d12691b86f
In the above commit, I moved the `TableCache` cleanup logic from `Version` destructor into `PurgeObsoleteFiles`. I missed cleaning up `TableCache` entries for the current `Version` during DB destruction.
This PR adds that logic to `VersionSet` destructor. One unfortunate side effect is now we're potentially deleting `TableReader`s after `column_family_set_.reset()`, which means we can't call `BlockBasedTableReader::Close` a second time as the block cache might already be destroyed.
Closes https://github.com/facebook/rocksdb/pull/2662
Differential Revision: D5515108
Pulled By: ajkr
fbshipit-source-id: 2cb820e19aa813e0d258d17f76b2d7b6b7ee0b18
Summary:
This reverts the previous commit 1d7048c598, which broke the build.
Did a `git revert 1d7048c`.
Closes https://github.com/facebook/rocksdb/pull/2627
Differential Revision: D5476473
Pulled By: sagar0
fbshipit-source-id: 4756ff5c0dfc88c17eceb00e02c36176de728d06
Summary: This uses `clang-tidy` to comment out unused parameters (in functions, methods and lambdas) in fbcode. Cases that the tool failed to handle are fixed manually.
Reviewed By: igorsugak
Differential Revision: D5454343
fbshipit-source-id: 5dee339b4334e25e963891b519a5aa81fbf627b2
Summary:
This patch enables using PinnableSlice for RowCache, changes include
not releasing the cache handle immediately after lookup in TableCache::Get, instead pass a Cleanble function which does Cache::RleaseHandle.
Closes https://github.com/facebook/rocksdb/pull/2492
Differential Revision: D5316216
Pulled By: maysamyabandeh
fbshipit-source-id: d2a684bd7e4ba73772f762e58a82b5f4fbd5d362
Summary:
In gcc-7 the following is an error identified by -Werror=class-memaccess
In file included from ./table/get_context.h:14:0,
from db/version_set.cc:43:
./table/block.h: In constructor ‘rocksdb::BlockReadAmpBitmap::BlockReadAmpBitmap(size_t, size_t, rocksdb::Statistics*)’:
./table/block.h:73:53: error: ‘void* memset(void*, int, size_t)’ clearing an object of type ‘struct std::atomic<unsigned int>’ with no trivial copy-assignment; use value-initialization instead [-Werror=class-memaccess]
memset(bitmap_, 0, bitmap_size * kBytesPersEntry);
^
In file included from ./db/version_set.h:23:0,
from db/version_set.cc:12:
/toolchain/include/c++/8.0.0/atomic:684:12: note: ‘struct std::atomic<unsigned int>’ declared here
struct atomic<unsigned int> : __atomic_base<unsigned int>
^~~~~~~~~~~~~~~~~~~~
As a solution the default initializer can be applied in list context.
Signed-off-by: Daniel Black <daniel.black@au.ibm.com>
Closes https://github.com/facebook/rocksdb/pull/2561
Differential Revision: D5398714
Pulled By: siying
fbshipit-source-id: d883fb88ec7535eee60d551038fe91f14488be36
Summary:
this modify allows third-party tables able to support delete range
Closes https://github.com/facebook/rocksdb/pull/2035
Differential Revision: D5407973
Pulled By: ajkr
fbshipit-source-id: 82e364b7dd5a198660788d59543f15b8f95cc418
Summary:
The casting seemed to cause a problem.
I think this might increase it to unsigned long.
Closes https://github.com/facebook/rocksdb/pull/2562
Differential Revision: D5406842
Pulled By: siying
fbshipit-source-id: 736adef31448229a58a1a48bdbe77792f36736e8
Summary:
Valgrind reports that it is not initialized.
Closes https://github.com/facebook/rocksdb/pull/2541
Differential Revision: D5376084
Pulled By: maysamyabandeh
fbshipit-source-id: 55c312f4f506863aa0d25ff92c8c34b57f48b860
Summary:
Currently metadata_block_size controls only index partition size. With this patch a partition is cut after any of index or filter partitions reaches metadata_block_size.
Closes https://github.com/facebook/rocksdb/pull/2452
Differential Revision: D5275651
Pulled By: maysamyabandeh
fbshipit-source-id: 5057e4424b4c8902043782e6bf8c38f0c4f25160
Summary:
We've got some DBs where iterators return Status with message "Corruption: block checksum mismatch" all the time. That's not very informative. It would be much easier to investigate if the error message contained the file name - then we would know e.g. how old the corrupted file is, which would be very useful for finding the root cause. This PR adds file name, offset and other stuff to some block corruption-related status messages.
It doesn't improve all the error messages, just a few that were easy to improve. I'm mostly interested in "block checksum mismatch" and "Bad table magic number" since they're the only corruption errors that I've ever seen in the wild.
Closes https://github.com/facebook/rocksdb/pull/2507
Differential Revision: D5345702
Pulled By: al13n321
fbshipit-source-id: fc8023d43f1935ad927cef1b9c55481ab3cb1339
Summary:
Introducing FIFO compactions with TTL.
FIFO compaction is based on size only which makes it tricky to enable in production as use cases can have organic growth. A user requested an option to drop files based on the time of their creation instead of the total size.
To address that request:
- Added a new TTL option to FIFO compaction options.
- Updated FIFO compaction score to take TTL into consideration.
- Added a new table property, creation_time, to keep track of when the SST file is created.
- Creation_time is set as below:
- On Flush: Set to the time of flush.
- On Compaction: Set to the max creation_time of all the files involved in the compaction.
- On Repair and Recovery: Set to the time of repair/recovery.
- Old files created prior to this code change will have a creation_time of 0.
- FIFO compaction with TTL is enabled when ttl > 0. All files older than ttl will be deleted during compaction. i.e. `if (file.creation_time < (current_time - ttl)) then delete(file)`. This will enable cases where you might want to delete all files older than, say, 1 day.
- FIFO compaction will fall back to the prior way of deleting files based on size if:
- the creation_time of all files involved in compaction is 0.
- the total size (of all SST files combined) does not drop below `compaction_options_fifo.max_table_files_size` even if the files older than ttl are deleted.
This feature is not supported if max_open_files != -1 or with table formats other than Block-based.
**Test Plan:**
Added tests.
**Benchmark results:**
Base: FIFO with max size: 100MB ::
```
svemuri@dev15905 ~/rocksdb (fifo-compaction) $ TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=readwhilewriting --num=5000000 --threads=16 --compaction_style=2 --fifo_compaction_max_table_files_size_mb=100
readwhilewriting : 1.924 micros/op 519858 ops/sec; 13.6 MB/s (1176277 of 5000000 found)
```
With TTL (a low one for testing) ::
```
svemuri@dev15905 ~/rocksdb (fifo-compaction) $ TEST_TMPDIR=/dev/shm ./db_bench --benchmarks=readwhilewriting --num=5000000 --threads=16 --compaction_style=2 --fifo_compaction_max_table_files_size_mb=100 --fifo_compaction_ttl=20
readwhilewriting : 1.902 micros/op 525817 ops/sec; 13.7 MB/s (1185057 of 5000000 found)
```
Example Log lines:
```
2017/06/26-15:17:24.609249 7fd5a45ff700 (Original Log Time 2017/06/26-15:17:24.609177) [db/compaction_picker.cc:1471] [default] FIFO compaction: picking file 40 with creation time 1498515423 for deletion
2017/06/26-15:17:24.609255 7fd5a45ff700 (Original Log Time 2017/06/26-15:17:24.609234) [db/db_impl_compaction_flush.cc:1541] [default] Deleted 1 files
...
2017/06/26-15:17:25.553185 7fd5a61a5800 [DEBUG] [db/db_impl_files.cc:309] [JOB 0] Delete /dev/shm/dbbench/000040.sst type=2 #40 -- OK
2017/06/26-15:17:25.553205 7fd5a61a5800 EVENT_LOG_v1 {"time_micros": 1498515445553199, "job": 0, "event": "table_file_deletion", "file_number": 40}
```
SST Files remaining in the dbbench dir, after db_bench execution completed:
```
svemuri@dev15905 ~/rocksdb (fifo-compaction) $ ls -l /dev/shm//dbbench/*.sst
-rw-r--r--. 1 svemuri users 30749887 Jun 26 15:17 /dev/shm//dbbench/000042.sst
-rw-r--r--. 1 svemuri users 30768779 Jun 26 15:17 /dev/shm//dbbench/000044.sst
-rw-r--r--. 1 svemuri users 30757481 Jun 26 15:17 /dev/shm//dbbench/000046.sst
```
Closes https://github.com/facebook/rocksdb/pull/2480
Differential Revision: D5305116
Pulled By: sagar0
fbshipit-source-id: 3e5cfcf5dd07ed2211b5b37492eb235b45139174
Summary:
Throughput: 46k tps in our sysbench settings (filling the details later)
The idea is to have the simplest change that gives us a reasonable boost
in 2PC throughput.
Major design changes:
1. The WAL file internal buffer is not flushed after each write. Instead
it is flushed before critical operations (WAL copy via fs) or when
FlushWAL is called by MySQL. Flushing the WAL buffer is also protected
via mutex_.
2. Use two sequence numbers: last seq, and last seq for write. Last seq
is the last visible sequence number for reads. Last seq for write is the
next sequence number that should be used to write to WAL/memtable. This
allows to have a memtable write be in parallel to WAL writes.
3. BatchGroup is not used for writes. This means that we can have
parallel writers which changes a major assumption in the code base. To
accommodate for that i) allow only 1 WriteImpl that intends to write to
memtable via mem_mutex_--which is fine since in 2PC almost all of the memtable writes
come via group commit phase which is serial anyway, ii) make all the
parts in the code base that assumed to be the only writer (via
EnterUnbatched) to also acquire mem_mutex_, iii) stat updates are
protected via a stat_mutex_.
Note: the first commit has the approach figured out but is not clean.
Submitting the PR anyway to get the early feedback on the approach. If
we are ok with the approach I will go ahead with this updates:
0) Rebase with Yi's pipelining changes
1) Currently batching is disabled by default to make sure that it will be
consistent with all unit tests. Will make this optional via a config.
2) A couple of unit tests are disabled. They need to be updated with the
serial commit of 2PC taken into account.
3) Replacing BatchGroup with mem_mutex_ got a bit ugly as it requires
releasing mutex_ beforehand (the same way EnterUnbatched does). This
needs to be cleaned up.
Closes https://github.com/facebook/rocksdb/pull/2345
Differential Revision: D5210732
Pulled By: maysamyabandeh
fbshipit-source-id: 78653bd95a35cd1e831e555e0e57bdfd695355a4
Summary:
We currently do not support partitioning filters if indexes are not partitioned. The patch makes sure that these two are consistent.
Closes https://github.com/facebook/rocksdb/pull/2455
Differential Revision: D5275644
Pulled By: maysamyabandeh
fbshipit-source-id: b61701ac8914c2206d06f5e33ff6f67b24406d1d
Summary:
When Partitioning index/filter is enabled the user might need to check the index block size as well as the top-level index size via sst_dump. This patch records i) number of partitions, ii) top-level index size and make it accessible through sst_dump. The number of partitions for filters is the same as that of indexes. The top-level index for filters has a similar size to top-level index for indexes, so it is not repeated.
Closes https://github.com/facebook/rocksdb/pull/2437
Differential Revision: D5224225
Pulled By: maysamyabandeh
fbshipit-source-id: 5324598c75793523aef1bb7ee225a5475e95a9cb
Summary:
We estimate number of reads per SST files, by updating the counter per file in sampled read requests. This information can later be used to trigger compactions to improve read performacne.
Closes https://github.com/facebook/rocksdb/pull/2417
Differential Revision: D5193528
Pulled By: siying
fbshipit-source-id: b4241c5ad0eaf444b61afb53f8e6290d9f5da2df
Summary:
filter_block_set_ access must also be protected with mutex.
Closes https://github.com/facebook/rocksdb/pull/2413
Differential Revision: D5193159
Pulled By: maysamyabandeh
fbshipit-source-id: 6987fc219d9a65c20b9c7e52151aef4b8e4882e6
Summary:
… headers
https://github.com/facebook/rocksdb/pull/2199 should not reference RocksDB-specific macros (like ROCKSDB_SUPPORT_THREAD_LOCAL in this case) to public headers, `iostats_context.h` and `perf_context.h`. We shouldn't do that because users have to provide these compiler flags when building their binary with RocksDB.
We should hide the thread local global variable inside our implementation and just expose a function api to retrieve these variables. It may break some users for now but good for long term.
make check -j64
Closes https://github.com/facebook/rocksdb/pull/2380
Differential Revision: D5177896
Pulled By: lightmark
fbshipit-source-id: 6fcdfac57f2e2dcfe60992b7385c5403f6dcb390
Summary:
Fixes the following scenario:
1. Set prefix extractor. Enable bloom filters, with `whole_key_filtering = false`. Use compaction filter that sometimes returns `kRemoveAndSkipUntil`.
2. Do a compaction.
3. Compaction creates an iterator with `total_order_seek = false`, calls `SeekToFirst()` on it, then repeatedly calls `Next()`.
4. At some point compaction filter returns `kRemoveAndSkipUntil`.
5. Compaction calls `Seek(skip_until)` on the iterator. The key that it seeks to happens to have prefix that doesn't match the bloom filter. Since `total_order_seek = false`, iterator becomes invalid, and compaction thinks that it has reached the end. The rest of the compaction input is silently discarded.
The fix is to make compaction iterator use `total_order_seek = true`.
The implementation for PlainTable is quite awkward. I've made `kRemoveAndSkipUntil` officially incompatible with PlainTable. If you try to use them together, compaction will fail, and DB will enter read-only mode (`bg_error_`). That's not a very graceful way to communicate a misconfiguration, but the alternatives don't seem worth the implementation time and complexity. To be able to check in advance that `kRemoveAndSkipUntil` is not going to be used with PlainTable, we'd need to extend the interface of either `CompactionFilter` or `InternalIterator`. It seems unlikely that anyone will ever want to use `kRemoveAndSkipUntil` with PlainTable: PlainTable probably has very few users, and `kRemoveAndSkipUntil` has only one user so far: us (logdevice).
Closes https://github.com/facebook/rocksdb/pull/2349
Differential Revision: D5110388
Pulled By: lightmark
fbshipit-source-id: ec29101a99d9dcd97db33923b87f72bce56cc17a
Summary:
Some users want to monitor column family activity in their custom memtable implementations. Previously there was no way to figure out with which column family a memtable is associated. This diff:
- adds an overload to MemTableRepFactory::CreateMemTableRep() that provides the CF ID. For compatibility, its default implementation calls the old overload.
- updates MemTable to create MemTableRep's using the new overload.
Closes https://github.com/facebook/rocksdb/pull/2346
Differential Revision: D5108061
Pulled By: ajkr
fbshipit-source-id: 3a1921214a348dd8ea0f54e1cab3b71c3d46d616
Summary:
Previously sst_file_writer only supports kTypeValue, we need kTypeMerge and kTypeDeletion also as user requested.
Closes https://github.com/facebook/rocksdb/pull/2361
Differential Revision: D5139402
Pulled By: lightmark
fbshipit-source-id: 092a60756d01692539d817a3765ebfd58a8d7f88
Summary:
The default IO priority of WritableFiles is IO_TOTAL, meaning that
they will bypass the rate limiter if it's passed in the options.
This change allows to pass an io priority in construction, so that by
setting IO_LOW or IO_HIGH the rate limit will be honored.
It also fixes a minor bug: SstFileWriter's copy and move constructor
are not disabled and incorrect, as any copy/move will result in a
double free. Switching to unique_ptr makes the object correctly
movable and non-copyable as expected.
Also fix minor style inconsistencies.
Closes https://github.com/facebook/rocksdb/pull/2335
Differential Revision: D5113260
Pulled By: sagar0
fbshipit-source-id: e084236e7ff0b50a56cbeceaa9fedd5e210bf9f8
Summary:
BlockBasedTable::compaction_optimized_ is never used but can cause TSAN warning. Remove it.
Closes https://github.com/facebook/rocksdb/pull/2324
Differential Revision: D5085533
Pulled By: siying
fbshipit-source-id: 2feefce6806d559dfb4ab2989aa3db36752fe25d
Summary:
Consider BlockReadAmpBitmap with bytes_per_bit = 32. Suppose bytes [a, b) were used, while bytes [a-32, a)
and [b+1, b+33) weren't used; more formally, the union of ranges passed to BlockReadAmpBitmap::Mark() contains [a, b) and doesn't intersect with [a-32, a) and [b+1, b+33). Then bits [floor(a/32), ceil(b/32)] will be set, and so the number of useful bytes will be estimated as (ceil(b/32) - floor(a/32)) * 32, which is on average equal to b-a+31.
An extreme example: if we use 1 byte from each block, it'll be counted as 32 bytes from each block.
It's easy to remove this bias by slightly changing the semantics of the bitmap. Currently each bit represents a byte range [i*32, (i+1)*32).
This diff makes each bit represent a single byte: i*32 + X, where X is a random number in [0, 31] generated when bitmap is created. So, e.g., if you read a single byte at random, with probability 31/32 it won't be counted at all, and with probability 1/32 it will be counted as 32 bytes; so, on average it's counted as 1 byte.
*But there is one exception: the last bit will always set with the old way.*
(*) - assuming read_amp_bytes_per_bit = 32.
Closes https://github.com/facebook/rocksdb/pull/2259
Differential Revision: D5035652
Pulled By: lightmark
fbshipit-source-id: bd98b1b9b49fbe61f9e3781d07f624e3cbd92356
Summary:
Based on my experience with linkbench, We should not skip loading bloom filter blocks when they are not available in block cache when using Iterator::Seek
Actually I am not sure why this behavior existed in the first place
Closes https://github.com/facebook/rocksdb/pull/2255
Differential Revision: D5010721
Pulled By: maysamyabandeh
fbshipit-source-id: 0af545a06ac4baeecb248706ec34d009c2480ca4
Summary:
Any non-raw-data dependent object must be destructed before the table
closes. There was a bug of not doing that for filter object. This patch
fixes the bug and adds a unit test to prevent such bugs in future.
Closes https://github.com/facebook/rocksdb/pull/2246
Differential Revision: D5001318
Pulled By: maysamyabandeh
fbshipit-source-id: 6d8772e58765485868094b92964da82ef9730b6d