Summary:
Given that the index value is a BlockHandle, which is basically an <offset, size> pair, we can apply delta encoding to the values. The first value at each index restart interval encodes the full BlockHandle, but the rest encode only the size. Refer to IndexBlockIter::DecodeCurrentValue for the details of the encoding. This reduces the index size, which helps use the block cache more efficiently. The feature is enabled by setting format_version to 4.
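For reference, a minimal sketch of enabling the feature via the table options (standard BlockBasedTableOptions API; the restart interval shown matches the benchmark below):
```
#include "rocksdb/options.h"
#include "rocksdb/table.h"

// Sketch: format_version 4 enables delta encoding of index values.
rocksdb::Options MakeOptionsWithDeltaEncodedIndexValues() {
  rocksdb::BlockBasedTableOptions table_options;
  table_options.format_version = 4;
  table_options.index_block_restart_interval = 16;  // restart interval used below
  rocksdb::Options options;
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));
  return options;
}
```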
The feature comes with a bit of CPU overhead, which should be paid back by the higher cache hit rate due to the smaller index block size.
Results with sysbench read-only using 4KB blocks and an index restart interval of 16:
Format 2:
19585 rocksdb read-only range=100
Format 3:
19569 rocksdb read-only range=100
Format 4:
19352 rocksdb read-only range=100
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3983
Differential Revision: D8361343
Pulled By: maysamyabandeh
fbshipit-source-id: f882ee082322acac32b0072e2bdbb0b5f854e651
Summary:
HashMayMatch is related to AddKey() rather than CreateFilter().
Also applies some minor fixes: #4191, #4200, #3910.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4202
Differential Revision: D9180945
Pulled By: maysamyabandeh
fbshipit-source-id: 6f07b81c5bb9bda5c0273475b486ba8a030471e6
Summary:
Hi, it would be great if we could expose this API, so that LogDevice can use it to track the total size of trash files and alarm if it grows too large relative to the disk size. There are probably other customers that would be interested in this as well. :)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4206
Differential Revision: D9115516
Pulled By: gdavidsson
fbshipit-source-id: f34993a940e39cb0a0b544ae8298546499b7e047
Summary:
A framework for tracing and replaying RocksDB operations.
A binary trace file is created by capturing the DB operations, and it can be replayed back at the same rate using db_bench.
- Column-families are supported
- Multi-threaded tracing is supported.
- TraceReader and TraceWriter are exposed to the user, so that tracing to various destinations can be enabled (say, to other messaging/logging services). By default, a FileTraceReader and FileTraceWriter are implemented to capture to a file and replay from it (see the sketch after the list of supported operations below).
- This is not yet ideal to be enabled in production due to large performance overhead, but it can be safely tried out in a shadow setup, say, for analyzing RocksDB operations.
Currently supported DB operations:
- Writes:
-- Put
-- Merge
-- Delete
-- SingleDelete
-- DeleteRange
-- Write
- Reads:
-- Get (point lookups)
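A hedged sketch of capturing a trace with the file-based writer (API names such as NewFileTraceWriter and DB::StartTrace are taken from the interfaces described above; exact signatures at the time of this PR may differ):
```
#include <memory>
#include <string>
#include <utility>

#include "rocksdb/db.h"
#include "rocksdb/trace_reader_writer.h"

// Sketch: capture DB operations to a binary trace file, then stop tracing.
rocksdb::Status TraceWorkload(rocksdb::DB* db, rocksdb::Env* env,
                              const std::string& trace_file) {
  std::unique_ptr<rocksdb::TraceWriter> trace_writer;
  rocksdb::Status s = rocksdb::NewFileTraceWriter(
      env, rocksdb::EnvOptions(), trace_file, &trace_writer);
  if (!s.ok()) return s;
  s = db->StartTrace(rocksdb::TraceOptions(), std::move(trace_writer));
  if (!s.ok()) return s;
  // ... run the workload to be captured (Put/Get/Merge/Delete/...) ...
  return db->EndTrace();
}
```
The captured file can then be replayed at the original rate with db_bench, as noted above.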
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3837
Differential Revision: D7974837
Pulled By: sagar0
fbshipit-source-id: 8ec65aaf336504bc1f6ed0feae67f6ed5ef97a72
Summary:
The cache line size was computed dynamically based on the length of the filter bits and the number of cache lines encoded in the footer. This calculation had to be dynamic in case users migrate their data between platforms with different cache line sizes. The downside, though, was that bloom filter probing became expensive because it did integer mod and division.
However, since we know all possible cache line sizes are powers of two, we should be able to use bit shift to find the cache line, and bitwise-and to find the bit within the cache line. To do this, we compute the log-base-two of cache line size in the constructor, and use that in bitwise operations to replace division/mod.
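Roughly, the replacement looks like the following (an illustrative sketch of the shift/mask idea, not the exact code in the PR):
```
#include <cstdint>

// Sketch: when cache_line_size is a power of two, division and modulo can be
// replaced with a shift and a mask using log2(cache_line_size).
inline uint32_t CacheLineOf(uint32_t byte_offset, uint32_t log2_cache_line_size) {
  // Before: byte_offset / cache_line_size
  return byte_offset >> log2_cache_line_size;
}

inline uint32_t OffsetInCacheLine(uint32_t byte_offset, uint32_t log2_cache_line_size) {
  // Before: byte_offset % cache_line_size
  return byte_offset & ((1u << log2_cache_line_size) - 1);
}
```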
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4071
Differential Revision: D8684067
Pulled By: ajkr
fbshipit-source-id: 50298872fba5acd01e8269cd7abcc51a095e0f61
Summary:
This adds support for writing unprepared batches based on size defined in `TransactionOptions::max_write_batch_size`. This is done by overriding methods that modify data (Put/Delete/SingleDelete/Merge) and checking first if write batch size has exceeded threshold. If so, the write batch is written to DB as an unprepared batch.
Support for Commit/Rollback for unprepared batch is added as well. This has been done by simply extending the WritePrepared Commit/Rollback logic to take care of all unprep_seq numbers either when updating prepare heap, or adding to commit map. For updating the commit map, this logic exists inside `WriteUnpreparedCommitEntryPreReleaseCallback`.
A test change was also made to have transactions unregister themselves when committing without prepare. This is because with write unprepared, there may be unprepared entries (which act similarly to prepared entries) already when a commit is done without prepare.
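A hedged sketch of configuring size-based unprepared batches (option names from the description above; the threshold value is an assumption, and the TransactionDB itself is assumed to be opened with the write-unprepared policy):
```
#include "rocksdb/utilities/transaction_db.h"

// Sketch: batches exceeding max_write_batch_size get written to the DB as
// unprepared batches before the transaction prepares or commits.
rocksdb::Transaction* BeginUnpreparedTxn(rocksdb::TransactionDB* txn_db) {
  rocksdb::TransactionOptions txn_options;
  txn_options.max_write_batch_size = 1 << 20;  // threshold (assumed value)
  rocksdb::Transaction* txn =
      txn_db->BeginTransaction(rocksdb::WriteOptions(), txn_options);
  // ... Put/Delete/SingleDelete/Merge as usual; oversized batches are
  // flushed to the DB unprepared, and Commit/Rollback covers all of them ...
  return txn;
}
```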
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4104
Differential Revision: D8785717
Pulled By: lth
fbshipit-source-id: c02006e281ec1ce00f628e2a7beec0ee73096a91
Summary:
The member msgs of class Status contains all types of status messages.
When users dump a Status object, msgs will confuse them. So move it out
of class Status by making it a file-local static variable.
Closes #3831.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4144
Differential Revision: D8941419
Pulled By: sagar0
fbshipit-source-id: 56b0510258465ff26db15aa6b04e01532e053e3d
Summary:
Lint is not happy with some recently committed code. Format it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4161
Differential Revision: D8940582
Pulled By: siying
fbshipit-source-id: c9b43b1ef8c88b5e923911058b44eb77234b36b7
Summary:
Right now we use one hard-coded prefetch size to prefetch data from the tail of SST files. However, this may introduce waste for some use cases, while not being efficient for others.
Introduce a way to adjust this prefetch size by tracking the 32 most recent reads, and pick a value for which the wasted read is less than 10%.
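A rough sketch of one way such a heuristic could work (illustrative only; the actual tracking structure in the PR may differ):
```
#include <algorithm>
#include <cstddef>
#include <deque>

// Sketch: remember the 32 most recent tail-read sizes and suggest the largest
// candidate size that wastes less than 10% of the prefetched bytes overall.
class TailPrefetchSizeTracker {
 public:
  void Record(size_t bytes_needed) {
    history_.push_back(bytes_needed);
    if (history_.size() > 32) history_.pop_front();
  }

  size_t SuggestPrefetchSize(size_t fallback) const {
    if (history_.empty()) return fallback;
    size_t best = *std::min_element(history_.begin(), history_.end());
    for (size_t candidate : history_) {
      size_t wasted = 0, total = 0;
      for (size_t needed : history_) {
        total += candidate;
        if (candidate > needed) wasted += candidate - needed;
      }
      if (wasted * 10 < total) {  // less than 10% of prefetched bytes wasted
        best = std::max(best, candidate);
      }
    }
    return best;
  }

 private:
  std::deque<size_t> history_;
};
```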
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4156
Differential Revision: D8916847
Pulled By: siying
fbshipit-source-id: 8413f9eb3987e0033ed0bd910f83fc2eeaaf5758
Summary: Windows requires new/delete for memory allocations to be overridden. Refactor to be less intrusive.
Differential Revision: D8878047
Pulled By: siying
fbshipit-source-id: 35f2b5fec2f88ea48c9be926539c6469060aab36
Summary:
Users sometimes see their memtable size far smaller than expected. They have probably hit fragmentation of shard blocks. Cap their size anyway to reduce the impact of the problem. 128KB is conservative, so I don't imagine it can cause any performance problem.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4147
Differential Revision: D8886706
Pulled By: siying
fbshipit-source-id: 8528a2a4196aa4457274522e2565fd3ff28f621e
Summary:
Added Get/Put/Encode/Decode support for Fixed16 (uint16_t). Unit tests added in `coding_test.cc`.
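For illustration, a minimal little-endian encode/decode pair in the spirit of the existing Fixed32/Fixed64 helpers (a sketch, not the exact helpers added in `util/coding.h`):
```
#include <cstdint>
#include <string>

// Sketch: encode/decode a uint16_t in little-endian byte order.
inline void PutFixed16Sketch(std::string* dst, uint16_t value) {
  char buf[2];
  buf[0] = static_cast<char>(value & 0xff);
  buf[1] = static_cast<char>((value >> 8) & 0xff);
  dst->append(buf, 2);
}

inline uint16_t DecodeFixed16Sketch(const char* ptr) {
  return static_cast<uint16_t>(static_cast<uint8_t>(ptr[0]) |
                               (static_cast<uint8_t>(ptr[1]) << 8));
}
```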
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4142
Differential Revision: D8873516
Pulled By: fgwu
fbshipit-source-id: 331913e0a9a8fe9c95606a08e856e953477d64d3
Summary:
Our "rocksdb.sst.read.micros" stat includes time spent waiting for rate limiter. It probably only affects people rate limiting compaction reads, which is fairly rare.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4102
Differential Revision: D8848506
Pulled By: miasantreble
fbshipit-source-id: 01258ac5ae56e4eee372978cfc9143a6869f8bfc
Summary:
The patch makes sure that two parallel test threads will operate on different db paths. This enables using open source tools such as gtest-parallel to run the tests of a file in parallel.
Example: `~/gtest-parallel/gtest-parallel ./table_test`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4135
Differential Revision: D8846653
Pulled By: maysamyabandeh
fbshipit-source-id: 799bad1abb260e3d346bcb680d2ae207a852ba84
Summary:
Picked up a task to convert this to use the gtest framework. It can't be this simple, can it?
It works, but should all the std::cout be removed?
```
[$] ~/git/rocksdb [gft !]: ./merge_test
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from MergeTest
[ RUN ] MergeTest.MergeDbTest
Test read-modify-write counters...
a: 3
1
2
a: 3
b: 1225
3
Compaction started ...
Compaction ended
a: 3
b: 1225
Test merge-based counters...
a: 3
1
2
a: 3
b: 1225
3
Test merge in memtable...
a: 3
1
2
a: 3
b: 1225
3
Test Partial-Merge
Test merge-operator not set after reopen
[ OK ] MergeTest.MergeDbTest (93 ms)
[ RUN ] MergeTest.MergeDbTtlTest
Opening database with TTL
Test read-modify-write counters...
a: 3
1
2
a: 3
b: 1225
3
Compaction started ...
Compaction ended
a: 3
b: 1225
Test merge-based counters...
a: 3
1
2
a: 3
b: 1225
3
Test merge in memtable...
Opening database with TTL
a: 3
1
2
a: 3
b: 1225
3
Test Partial-Merge
Opening database with TTL
Opening database with TTL
Opening database with TTL
Opening database with TTL
Test merge-operator not set after reopen
[ OK ] MergeTest.MergeDbTtlTest (97 ms)
[----------] 2 tests from MergeTest (190 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (190 ms total)
[ PASSED ] 2 tests.
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4114
Differential Revision: D8822886
Pulled By: gfosco
fbshipit-source-id: c299d008e883c3bb911d2b357a2e9e4423f8e91a
Summary:
- Avoid `strdup` to use jemalloc on Windows
- Use `size_t` for consistency
- Add GCC 8 to Travis
- Add CMAKE_BUILD_TYPE=Release to Travis
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3433
Differential Revision: D6837948
Pulled By: sagar0
fbshipit-source-id: b8543c3a4da9cd07ee9a33f9f4623188e233261f
Summary:
Copy data between buffers inside FilePrefetchBuffer only when the chunk length is greater than 0. Otherwise AlignedBuffer was accessing memory out of its range, causing crashes.
Removing the tracking of buffer length outside of `AlignedBuffer`, i.e. in `FilePrefetchBuffer` and `ReadaheadRandomAccessFile`, will follow in a separate PR, as it is not the root cause of the crash reported in #4051. (`FilePrefetchBuffer` itself has been this way from its inception, and `ReadaheadRandomAccessFile` was updated to add the buffer length at some point).
Comprehensive tests for `FilePrefetchBuffer` also to follow in a separate PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4100
Differential Revision: D8792590
Pulled By: sagar0
fbshipit-source-id: 3578f45761cf6884243e767f749db4016ccc93e1
Summary:
Right now slow deletion with ftruncate doesn't work well with checkpoints because it ruins hard-linked files in checkpoints. To fix it, check that the file has no other hard links before ftruncating it.
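A hedged sketch of the check (the helper name is hypothetical; the actual change lives in the delete scheduler):
```
#include <sys/stat.h>

// Sketch: only ftruncate (for slow deletion) when this is the last hard link,
// so checkpoint hard links to the same inode are left intact.
bool SafeToTruncateForSlowDelete(const char* path) {
  struct stat st;
  if (stat(path, &st) != 0) {
    return false;  // cannot stat; be conservative and skip truncation
  }
  return st.st_nlink <= 1;  // >1 means a checkpoint still shares the inode
}
```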
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4093
Differential Revision: D8730360
Pulled By: siying
fbshipit-source-id: 756eea5bce8a87b9a2ea3a5bfa190b2cab6f75df
Summary:
Since the filter data is unaligned, even though we ensure all probes are within a span of `cache_line_size` bytes, those bytes can span two cache lines. In that case I doubt hardware prefetching does a great job considering we don't necessarily access those two cache lines in order. This guess seems correct since adding explicit prefetch instructions reduced filter lookup overhead by 19.4%.
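For illustration, the prefetch can be issued along the lines of the following sketch (using the GCC/Clang builtin; the PR's actual wrapper may differ):
```
#include <cstdint>

// Sketch: prefetch both cache lines an unaligned probe span may touch, since
// the probed bytes can straddle a cache line boundary.
inline void PrefetchProbeSpan(const char* filter_data, uint32_t byte_offset,
                              uint32_t cache_line_size) {
  const char* p = filter_data + byte_offset;
  __builtin_prefetch(p, 0 /* read */, 3 /* high temporal locality */);
  __builtin_prefetch(p + cache_line_size, 0, 3);
}
```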
Closes https://github.com/facebook/rocksdb/pull/4068
Differential Revision: D8674189
Pulled By: ajkr
fbshipit-source-id: 747427d9a17900151c17820488e3f7efe06b1871
Summary:
Currently, if RocksDB encounters errors during a write operation (user requested or BG operations), it sets DBImpl::bg_error_ and fails subsequent writes. This PR allows the DB to be resumed for certain classes of errors. It consists of 3 parts -
1. Introduce Status::Severity in rocksdb::Status to indicate whether a given error can be recovered from or not
2. Refactor the error handling code so that setting bg_error_ and deciding on severity is in one place
3. Provide an API for the user to clear the error and resume the DB instance
This whole change is broken up into multiple PRs. Initially, we only allow clearing the error for Status::NoSpace() errors during background flush/compaction. Subsequent PRs will expand this to include more errors and foreground operations such as Put(), and implement a polling mechanism for out-of-space errors.
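A hedged sketch of how a user might react to a recoverable background error (the severity accessor and resume call follow the description above; exact names in this initial PR may differ):
```
#include "rocksdb/db.h"

// Sketch: check the severity of a background error and, once the underlying
// condition (e.g. out of space) is addressed, clear it and resume the DB.
rocksdb::Status TryRecover(rocksdb::DB* db, const rocksdb::Status& bg_error) {
  if (bg_error.severity() >= rocksdb::Status::Severity::kFatalError) {
    return bg_error;  // not recoverable in place
  }
  // ... free up disk space or otherwise clear the error condition ...
  return db->Resume();  // clears bg_error_ and resumes background work
}
```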
Closes https://github.com/facebook/rocksdb/pull/3997
Differential Revision: D8653831
Pulled By: anand1976
fbshipit-source-id: 6dc835c76122443a7668497c0226b4f072bc6afd
Summary:
Various rearrangements of the cch math, and replacing `= '\0'` with memset, failed to convince the compiler the string was nul-terminated. So I took the perverse option of changing strncpy to strcpy.
Return null if memory couldn't be allocated.
util/status.cc: In static member function ‘static const char* rocksdb::Status::CopyState(const char*)’:
util/status.cc:28:15: error: ‘char* strncpy(char*, const char*, size_t)’ output truncated before terminating nul copying as many bytes from a string as its length [-Werror=stringop-truncation]
std::strncpy(result, state, cch - 1);
~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~
util/status.cc:19:18: note: length computed here
std::strlen(state) + 1; // +1 for the null terminator
~~~~~~~~~~~^~~~~~~
cc1plus: all warnings being treated as errors
make: *** [Makefile:645: shared-objects/util/status.o] Error 1
Closes #2705.
Closes https://github.com/facebook/rocksdb/pull/3870
Differential Revision: D8594114
Pulled By: anand1976
fbshipit-source-id: ab20f3a456a711e4d29144ebe630e4fe3c99ec25
Summary:
Previously, in https://github.com/facebook/rocksdb/pull/3601, the bloom filter would only be checked if the `prefix_extractor` in the mutable_cf_options matched the one found in the SST file.
This PR relaxes the requirement by checking whether all keys in the range [user_key, iterate_upper_bound) share the same prefix after being transformed by the prefix extractor recorded in the SST file. If so, the bloom filter is considered compatible and will continue to be consulted.
Closes https://github.com/facebook/rocksdb/pull/3899
Differential Revision: D8157459
Pulled By: miasantreble
fbshipit-source-id: 18d17cba56a1005162f8d5db7a27aba277089c41
Summary:
Top-level indexes in partitioned index/filter blocks are small and could be pinned in memory. So far we have made use of that by setting cache_index_and_filter_blocks to false. This, however, makes it difficult to keep track of the total memory usage. This patch introduces pin_top_level_index_and_filter, which in combination with cache_index_and_filter_blocks=true keeps the top-level index in the cache yet pins it, avoiding both cache misses and cache lookup overhead.
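A minimal sketch of the resulting configuration (option names as described above):
```
#include "rocksdb/options.h"
#include "rocksdb/table.h"

// Sketch: charge the top-level index/filter to the block cache but pin it,
// avoiding both cache misses and per-lookup cache overhead.
rocksdb::Options MakePinnedTopLevelIndexOptions() {
  rocksdb::BlockBasedTableOptions table_options;
  table_options.index_type =
      rocksdb::BlockBasedTableOptions::kTwoLevelIndexSearch;  // partitioned index
  table_options.partition_filters = true;
  table_options.cache_index_and_filter_blocks = true;
  table_options.pin_top_level_index_and_filter = true;
  rocksdb::Options options;
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));
  return options;
}
```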
Closes https://github.com/facebook/rocksdb/pull/4037
Differential Revision: D8596218
Pulled By: maysamyabandeh
fbshipit-source-id: 3a5f7f9ca6b4b525b03ff6bd82354881ae974ad2
Summary:
This PR extends the improvements in #3282 to also work when using Direct IO.
We see **4.5X performance improvement** in seekrandom benchmark doing long range scans, when using direct reads, on flash.
**Description:**
This change improves the performance of iterators doing long range scans (e.g. big/full index or table scans in MyRocks) by using readahead and prefetching additional data on each disk IO, and storing in a local buffer. This prefetching is automatically enabled on noticing more than 2 IOs for the same table file during iteration. The readahead size starts with 8KB and is exponentially increased on each additional sequential IO, up to a max of 256 KB. This helps in cutting down the number of IOs needed to complete the range scan.
**Implementation Details:**
- Used `FilePrefetchBuffer` as the underlying buffer to store the readahead data. `FilePrefetchBuffer` can now take file_reader, readahead_size and max_readahead_size as input to the constructor, and automatically do readahead.
- `FilePrefetchBuffer::TryReadFromCache` can now call `FilePrefetchBuffer::Prefetch` if readahead is enabled.
- `AlignedBuffer` (which is the underlying store for `FilePrefetchBuffer`) now takes a few additional args in `AlignedBuffer::AllocateNewBuffer` to allow copying data from the old buffer.
- Made sure not to re-read from the device partial chunks of data that were already available in the buffer.
- Fixed a couple of cases where `AlignedBuffer::cursize_` was not being properly kept up-to-date.
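A rough sketch of the readahead growth described above (illustrative; the real logic lives inside `FilePrefetchBuffer` and the iterator code):
```
#include <cstddef>

// Sketch: no readahead for the first 2 IOs on a file; afterwards start at 8KB
// and double on each additional sequential IO, capped at 256KB.
class ReadaheadSizer {
 public:
  size_t NextReadaheadSize() {
    ++num_sequential_ios_;
    if (num_sequential_ios_ <= 2) return 0;
    size_t current = readahead_size_;
    readahead_size_ *= 2;
    if (readahead_size_ > kMaxReadahead) readahead_size_ = kMaxReadahead;
    return current;
  }

 private:
  static constexpr size_t kMaxReadahead = 256 * 1024;
  size_t readahead_size_ = 8 * 1024;
  size_t num_sequential_ios_ = 0;
};
```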
**Constraints:**
- Similar to #3282, this is currently enabled only when ReadOptions.readahead_size = 0 (which is the default value).
- Since the prefetched data is stored in a temporary buffer allocated on heap, this could increase the memory usage if you have many iterators doing long range scans simultaneously.
- Enabled only for user reads, and disabled for compactions. Compaction reads are controlled by the options `use_direct_io_for_flush_and_compaction` and `compaction_readahead_size`, and the current feature takes precautions not to mess with them.
**Benchmarks:**
I used the same benchmark as used in #3282.
Data fill:
```
TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=fillrandom -num=1000000000 -compression_type="none" -level_compaction_dynamic_level_bytes
```
Do a long range scan: Seekrandom with large number of nexts
```
TEST_TMPDIR=/data/users/$USER/benchmarks/iter ./db_bench -benchmarks=seekrandom -use_direct_reads -duration=60 -num=1000000000 -use_existing_db -seek_nexts=10000 -statistics -histogram
```
```
Before:
seekrandom : 37939.906 micros/op 26 ops/sec; 29.2 MB/s (1636 of 1999 found)
With this change:
seekrandom : 8527.720 micros/op 117 ops/sec; 129.7 MB/s (6530 of 7999 found)
```
~4.5X perf improvement, averaged over 3 runs.
Closes https://github.com/facebook/rocksdb/pull/3884
Differential Revision: D8082143
Pulled By: sagar0
fbshipit-source-id: 4d7a8561cbac03478663713df4d31ad2620253bb
Summary:
We potentially need this information for tracing, profiling and diagnosis.
Closes https://github.com/facebook/rocksdb/pull/4026
Differential Revision: D8555214
Pulled By: riversand963
fbshipit-source-id: 4263e06c00b6d5410b46aa46eb4e358ff2161dd2
Summary:
Here are some fixes for the build on Solaris SPARC.
It also fixes the CRC test on big-endian platforms.
Closes https://github.com/facebook/rocksdb/pull/4000
Differential Revision: D8455394
Pulled By: ajkr
fbshipit-source-id: c9289a7b541a5628139c6b77e84368e14dc3d174
Summary:
Rebased and resubmitting #1831 on behalf of stevelittle.
The problem is when a single process attempts to open the same DB twice, the second attempt fails due to LOCK file held. If the second attempt had opened the LOCK file, it'll now need to close it, and closing causes the file to be unlocked. Then, any subsequent attempt to open the DB will succeed, which is the wrong behavior.
The solution was to track which files a process has locked in PosixEnv, and check those before opening a LOCK file.
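A hedged sketch of the idea (a process-wide set of locked paths consulted before taking the file lock; names are illustrative, not the actual PosixEnv members):
```
#include <mutex>
#include <set>
#include <string>

// Sketch: remember which LOCK files this process already holds, so a second
// Open() in the same process fails without closing/unlocking the real lock.
class LockedFileTracker {
 public:
  // Returns false if this process already holds a lock on fname.
  bool TryInsert(const std::string& fname) {
    std::lock_guard<std::mutex> guard(mu_);
    return locked_files_.insert(fname).second;
  }
  void Remove(const std::string& fname) {
    std::lock_guard<std::mutex> guard(mu_);
    locked_files_.erase(fname);
  }

 private:
  std::mutex mu_;
  std::set<std::string> locked_files_;
};
```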
Fixes #1780.
Closes https://github.com/facebook/rocksdb/pull/3993
Differential Revision: D8398984
Pulled By: ajkr
fbshipit-source-id: 2755fe66950a0c9de63075f932f9e15768041918
Summary:
In `db_stress` profile the vast majority of CPU time is spent acquiring the `SyncPoint` mutex. I mistakenly assumed #3939 had fixed this mutex contention problem by disabling `SyncPoint` processing. But actually the lock was still being acquired just to check whether processing is enabled. We can avoid that overhead by using an atomic to track whether it's enabled.
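Roughly, the fast path could look like this sketch (illustrative; the actual SyncPoint implementation differs in detail):
```
#include <atomic>
#include <mutex>

// Sketch: check an atomic flag before taking the mutex, so the common
// "sync points disabled" case costs only an atomic load.
class SyncPointSketch {
 public:
  void EnableProcessing() { enabled_.store(true, std::memory_order_release); }
  void DisableProcessing() { enabled_.store(false, std::memory_order_release); }

  void Process(const char* point) {
    if (!enabled_.load(std::memory_order_acquire)) return;  // no mutex needed
    std::lock_guard<std::mutex> guard(mu_);
    // ... look up callbacks/dependencies for `point` under the lock ...
    (void)point;
  }

 private:
  std::atomic<bool> enabled_{false};
  std::mutex mu_;
};
```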
Closes https://github.com/facebook/rocksdb/pull/3991
Differential Revision: D8393825
Pulled By: ajkr
fbshipit-source-id: 5bc4e3c722ee7304e7a9c2439998c456b05a6897
Summary:
Depending on the compression type, `CompressBlock` calls the corresponding compress method. It calls ZSTD_Compress for both kZSTD and kZSTDNotFinalCompression (https://github.com/facebook/rocksdb/blob/master/table/block_based_table_builder.cc#L169).
However, ZSTD_Compress currently expects the type to be only kZSTD, which causes assert failures and crashes. The same also applies to ZSTD_Uncompress.
Closes https://github.com/facebook/rocksdb/pull/3964
Differential Revision: D8308715
Pulled By: miasantreble
fbshipit-source-id: e5125f53edb829c9c33733167bec74e4793d0782
Summary:
PR https://github.com/facebook/rocksdb/pull/3838 made some changes that triggers lint warnings.
Run `make format` to fix formatting as suggested by siying.
Also piggyback two changes:
1) fix singleton destruction order for windows and posix env
2) fix two clang warnings
Closes https://github.com/facebook/rocksdb/pull/3954
Differential Revision: D8272041
Pulled By: miasantreble
fbshipit-source-id: 7c4fd12bd17aac13534520de0c733328aa3c6c9f
Summary:
format_version=3 changes the format of the SST index. This is however not currently being tested, since tests only run with the default format_version, which is currently 2. The patch extends the most relevant tests to also cover format_version=3.
Closes https://github.com/facebook/rocksdb/pull/3942
Differential Revision: D8238413
Pulled By: maysamyabandeh
fbshipit-source-id: 915725f55753dd8e9188e802bf471c23645ad035
Summary:
Ensure the PosixEnv singleton is destroyed first since its destructor waits for background threads to all complete. This ensures background threads cannot hit sync points after the SyncPoint singleton is destroyed, which was previously possible.
Closes https://github.com/facebook/rocksdb/pull/3951
Differential Revision: D8265295
Pulled By: ajkr
fbshipit-source-id: 7738dd458c5d993a78377dd0420e82badada81ab
Summary:
Windows does not have an LD_PRELOAD mechanism to override all memory allocation functions, and ZSTD makes use of C-runtime calloc. During flushes and compactions the default system allocator fragments and the system slows down considerably.
For builds with jemalloc we employ an advanced ZSTD context creation API that redirects memory allocation to jemalloc. To reduce the cost of context creation on each block, we cache the ZSTD context within the block-based table builder while a new SST file is being built; this helps all platform builds, including those without jemalloc. This avoids system allocator fragmentation and improves performance.
The change does not address random reads and currently on Windows reads with ZSTD regress as compared with SNAPPY compression.
Closes https://github.com/facebook/rocksdb/pull/3838
Differential Revision: D8229794
Pulled By: miasantreble
fbshipit-source-id: 719b622ab7bf4109819bc44f45ec66f0dd3ee80d
Summary:
Index blocks have the same format as data blocks. The keys, therefore, similarly to the keys in the data blocks, are internal keys, which means that in addition to the user key they also include 8 bytes that encode the sequence number and value type. These extra 8 bytes are however not necessary in index blocks, since the index keys act as separators between two data blocks. The only exception is when the last key of a block and the first key of the next block share the same user key, in which case the sequence number is required to act as a separator.
The patch excludes the sequence number from index keys only if the above special case does not happen for any of the index keys. It then records that in the property block. The reader looks at the property block to see if it should expect sequence numbers in the keys of the index blocks.
Closes https://github.com/facebook/rocksdb/pull/3894
Differential Revision: D8118775
Pulled By: maysamyabandeh
fbshipit-source-id: 915479f028b5799ca91671d67455ecdefbd873bd
Summary:
Previously we were using -1 as the default for every library, which was legacy from our zlib options. That worked for a while, but after zstd introduced a146ee04ae, it started giving poor compression ratios by default in zstd.
This PR adds a constant to RocksDB public API, `CompressionOptions::kDefaultCompressionLevel`, which will get translated to the default value specific to the compression library being used in "util/compression.h". The constant uses a number that appears to be larger than any library's maximum compression level.
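A minimal sketch of using the new constant (constant name as introduced above):
```
#include "rocksdb/options.h"

// Sketch: request the library-specific default compression level instead of
// hard-coding a number that means different things to different libraries.
rocksdb::Options MakeZstdOptions() {
  rocksdb::Options options;
  options.compression = rocksdb::kZSTD;
  options.compression_opts.level =
      rocksdb::CompressionOptions::kDefaultCompressionLevel;
  return options;
}
```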
Closes https://github.com/facebook/rocksdb/pull/3895
Differential Revision: D8125780
Pulled By: ajkr
fbshipit-source-id: 2db157a89118cd4f94577c2f4a0a5ff31c8391c6
Summary:
`RangeDelAggregator` holds the pointers returned by `BlockIter::key()` and `BlockIter::value()` so requires the data to which they point is pinned. `BlockIter::key()` points into block memory and is guaranteed to be pinned if and only if prefix encoding is disabled (or, equivalently, restart interval is set to one). I think `BlockIter::value()` is always pinned. Added an assert for these and removed the wrong TODO about increasing restart interval, which would enable key prefix encoding and break the assertion.
Closes https://github.com/facebook/rocksdb/pull/3875
Differential Revision: D8063667
Pulled By: ajkr
fbshipit-source-id: 60b5ebcc0cdd610dd6aad9e74a23378793672c41
Summary:
Right now ReverseBytewiseComparator::FindShortestSeparator() doesn't really shorten the key, and ReverseBytewiseComparator::FindShortestSuccessor() seems to return wrong results. The code is confusing too, as it uses BytewiseComparatorImpl::FindShortestSeparator(), but that function actually won't do anything if the first key is larger than the second.
Implement ReverseBytewiseComparator::FindShortestSeparator() and override ReverseBytewiseComparator::FindShortestSuccessor() to be empty.
Closes https://github.com/facebook/rocksdb/pull/3836
Differential Revision: D7959762
Pulled By: siying
fbshipit-source-id: 93acb621c16ce6f23e087ae4e19f7d84d1254683
Summary:
Currently manual_wal_flush, if set in the options, is applied only to the WAL files created during WAL switch. The configuration thus does not affect the first WAL file. The patch fixes that and also updates the related unit tests.
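For context, a hedged sketch of how the option is typically used (standard options/API, not code from the patch):
```
#include <string>

#include "rocksdb/db.h"

// Sketch: with manual_wal_flush the WAL is buffered and only persisted when
// FlushWAL() is called explicitly; after this fix it covers the first WAL too.
rocksdb::Status WriteWithManualWalFlush(const std::string& path) {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.manual_wal_flush = true;
  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, path, &db);
  if (!s.ok()) return s;
  s = db->Put(rocksdb::WriteOptions(), "key", "value");
  if (s.ok()) {
    s = db->FlushWAL(/*sync=*/true);  // explicitly flush (and sync) the WAL
  }
  delete db;
  return s;
}
```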
This PR is built on top of https://github.com/facebook/rocksdb/pull/3756
Closes https://github.com/facebook/rocksdb/pull/3824
Differential Revision: D7909153
Pulled By: maysamyabandeh
fbshipit-source-id: 024ed99d2555db06bf096c902b998e432bb7b9ce
Summary:
Previously `DBOptions::use_direct_io_for_flush_and_compaction=true` combined with `DBOptions::use_direct_reads=false` could cause RocksDB to simultaneously read from two file descriptors for the same file, where background reads used direct I/O and foreground reads used buffered I/O. Our measurements found this mixed-mode I/O negatively impacted foreground read perf, compared to when only buffered I/O was used.
This PR makes the mixed-mode I/O situation impossible by repurposing `DBOptions::use_direct_io_for_flush_and_compaction` to only apply to background writes, and `DBOptions::use_direct_reads` to apply to all reads. There is no risk of direct background writes happening simultaneously with buffered reads, since we never read from and write to the same file simultaneously.
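A minimal sketch of the resulting option semantics (options as named above):
```
#include "rocksdb/options.h"

// Sketch: use_direct_reads now covers all reads (foreground and compaction),
// while use_direct_io_for_flush_and_compaction only covers background writes,
// so mixed buffered/direct reads of the same file can no longer occur.
rocksdb::Options MakeDirectIOOptions() {
  rocksdb::Options options;
  options.use_direct_reads = true;
  options.use_direct_io_for_flush_and_compaction = true;
  return options;
}
```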
Closes https://github.com/facebook/rocksdb/pull/3829
Differential Revision: D7915443
Pulled By: ajkr
fbshipit-source-id: 78bcbf276449b7e7766ab6b0db246f789fb1b279
Summary:
`ReadaheadRandomAccessFile` had an unwritten assumption, which was that its wrapped file's `Read()` function always copies into the provided scratch buffer. Actually this was not true when the wrapped file was `PosixMmapReadableFile`, whose `Read()` implementation does no copying and instead returns a `Slice` pointing directly into the `mmap`'d memory region. This PR:
- prevents `ReadaheadRandomAccessFile` from ever wrapping mmap readable files
- adds an assert for the assumption `ReadaheadRandomAccessFile` makes about the wrapped file's use of scratch buffer
Closes https://github.com/facebook/rocksdb/pull/3813
Differential Revision: D7891513
Pulled By: ajkr
fbshipit-source-id: dc64a55222d6af280c39a1852ee39e9e9d7cde7d
Summary:
Rollback was disabled in stress test since there was a concurrency issue in WritePrepared rollback algorithm. The issue is fixed by caching the column family handles in WritePrepared to skip getting them from the db when needed for rollback.
Tested by running transaction stress test under tsan.
Closes https://github.com/facebook/rocksdb/pull/3785
Differential Revision: D7793727
Pulled By: maysamyabandeh
fbshipit-source-id: d81ab6fda0e53186ca69944cfe0712ce4869451e
Summary:
Sync the parent directory after deleting a file in the delete scheduler. Otherwise, trim speed may not be as smooth as we want.
Closes https://github.com/facebook/rocksdb/pull/3767
Differential Revision: D7760136
Pulled By: siying
fbshipit-source-id: ec131d53b61953f09c60d67e901e5eeb2716b05f
Summary:
The WritePrepared rollback implementation is not ready to be invoked in the middle of a workload. This is due to the lack of synchronization to obtain the cf handle from the db. Temporarily disabling this until the problem with rollback is fixed.
Closes https://github.com/facebook/rocksdb/pull/3772
Differential Revision: D7769041
Pulled By: maysamyabandeh
fbshipit-source-id: 0e3b0ce679bc2afba82e653a40afa3f045722754
Summary:
Background activities like compaction can negatively affect
latency of higher-priority tasks like request processing. To avoid this,
rocksdb already lowers the IO priority of background threads on Linux
systems. While this takes care of typical IO-bound systems, it does not
help much when CPU (temporarily) becomes the bottleneck. This is
especially likely when using more expensive compression settings.
This patch adds an API to allow for lowering the CPU priority of
background threads, modeled on the IO priority API. Benchmarks (see
below) show significant latency and throughput improvements when CPU
bound. As a result, workloads with some CPU usage bursts should benefit
from lower latencies at a given utilization, or should be able to push
utilization higher at a given request latency target.
A useful side effect is that compaction CPU usage is now easily visible
in common tools, allowing for an easier estimation of the contribution
of compaction vs. request processing threads.
As with IO priority, the implementation is limited to Linux, degrading
to a no-op on other systems.
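A hedged sketch of how an application might use the new API (the method name mirrors the existing IO-priority API described above; treat the exact name as an assumption):
```
#include "rocksdb/db.h"
#include "rocksdb/env.h"

// Sketch: lower the CPU priority of the LOW (compaction) thread pool so
// compaction bursts interfere less with request processing. As with the IO
// priority API, this is expected to be a no-op on non-Linux platforms.
void LowerBackgroundCpuPriority(rocksdb::DB* db) {
  rocksdb::Env* env = db->GetEnv();
  env->LowerThreadPoolCPUPriority(rocksdb::Env::Priority::LOW);
}
```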
Closes https://github.com/facebook/rocksdb/pull/3763
Differential Revision: D7740096
Pulled By: gwicke
fbshipit-source-id: e5d32373e8dc403a7b0c2227023f9ce4f22b413c
Summary:
Currently WritePrepared rolls back a transaction with prepare sequence number prepare_seq by i) writing a single rollback batch with rollback_seq, ii) adding <rollback_seq, rollback_seq> to the commit cache, and iii) removing prepare_seq from the PrepareHeap.
This is correct assuming that no snapshot is taken when a transaction is rolled back, which is the case with the way MySQL does rollback, i.e. after recovery. Otherwise, if max_evicted_seq advances past prepare_seq, a live snapshot might assume the data to be committed since it does not find it in the CommitCache.
The change is to simply add <prepare_seq, rollback_seq> to the commit cache before removing prepare_seq from the PrepareHeap. In this way, if max_evicted_seq advances past prepare_seq, the existing mechanism we have to check evicted entries against live snapshots will make sure that the live snapshot does not see the data of the rolled-back transaction.
Closes https://github.com/facebook/rocksdb/pull/3745
Differential Revision: D7696193
Pulled By: maysamyabandeh
fbshipit-source-id: c9a2d46341ddc03554dded1303520a1cab74ef9c
Summary:
Right now in `SyncClosedLogs`, `CopyFile`, and `AddRecord`, where `Sync` and `Append` are invoked in a loop, the error status is not checked. This could lead to potential corruption, as later calls will overwrite the error status.
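A minimal sketch of the pattern being fixed (illustrative, using a generic writable-file loop rather than the exact call sites):
```
#include <vector>

#include "rocksdb/env.h"
#include "rocksdb/slice.h"

// Sketch: propagate the first failure out of the loop instead of letting a
// later Append()/Sync() silently overwrite the error status.
rocksdb::Status AppendAll(rocksdb::WritableFile* file,
                          const std::vector<rocksdb::Slice>& chunks) {
  for (const rocksdb::Slice& chunk : chunks) {
    rocksdb::Status s = file->Append(chunk);
    if (!s.ok()) return s;
  }
  return file->Sync();
}
```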
Closes https://github.com/facebook/rocksdb/pull/3740
Differential Revision: D7678848
Pulled By: miasantreble
fbshipit-source-id: 4b0b412975989dfe80348f73217b9c4122a4bd77
Summary:
Previously threads were named "rocksdb:bg<index in thread pool>", so the first thread in all thread pools would be named "rocksdb:bg0". Users want to be able to distinguish threads used for flush (high-pri) vs regular compaction (low-pri) vs compaction to the bottom level (bottom-pri). So I changed the thread naming convention to include the thread-pool priority.
Closes https://github.com/facebook/rocksdb/pull/3702
Differential Revision: D7581415
Pulled By: ajkr
fbshipit-source-id: ce04482b6acd956a401ef22dc168b84f76f7d7c1
Summary:
db_stress was already capable running transactions by setting use_txn. Running it under stress showed a couple of problems fixed in this patch.
- The uncommitted transaction must be either rolled back or committed after recovery.
- Current implementation of WritePrepared transaction cannot handle cf drop before crash. Clarified that in the comments and added safety checks. When running with use_txn, clear_column_family_one_in must be set to 0.
Closes https://github.com/facebook/rocksdb/pull/3733
Differential Revision: D7654419
Pulled By: maysamyabandeh
fbshipit-source-id: a024bad80a9dc99677398c00d29ff17d4436b7f3
Summary:
This PR fixes a few contbuild failures:
1. ASAN memory leak in Block::NewIterator (table/block.cc:429). The proper destruction of first_level_iter_ and second_level_iter_ of two_level_iterator.cc is missing from the code after the refactoring in https://github.com/facebook/rocksdb/pull/3406
2. various unused param errors introduced by https://github.com/facebook/rocksdb/pull/3662
3. updated comment for `ForceReleaseCachedEntry` to emphasize the use of `force_erase` flag.
Closes https://github.com/facebook/rocksdb/pull/3718
Reviewed By: maysamyabandeh
Differential Revision: D7621192
Pulled By: miasantreble
fbshipit-source-id: 476c94264083a0730ded957c29de7807e4f5b146