Summary:
Provide a block_align option in BlockBasedTableOptions to allow
alignment of SST file data blocks. This will avoid higher
IOPS/throughput load due to < 4KB data blocks spanning 2 4KB pages.
When this option is set to true, the block alignment is set to lower of
block size and 4KB.
Closes https://github.com/facebook/rocksdb/pull/3502
Differential Revision: D7400897
Pulled By: anand1976
fbshipit-source-id: 04cc3bd144e88e3431a4f97604e63ad7a0f06d44
Summary:
Add `bytes_max_delete_chunk` in SstFileManager so that we can drop a large file in multiple batches.
Closes https://github.com/facebook/rocksdb/pull/3640
Differential Revision: D7358679
Pulled By: siying
fbshipit-source-id: ef17f0da2f5723dbece2669485a9b91b3edc0bb7
Summary:
Implemented PositionedAppend() and use_direct_io() for TestWritableFile.
With these changes, FaultInjectionTestEnv can be used with DirectIO enabled.
Closes https://github.com/facebook/rocksdb/pull/3586
Differential Revision: D7244305
Pulled By: yiwu-arbug
fbshipit-source-id: f6b7aece53daa0f9977bc684164a0693693e514c
Summary:
This diff handles cases where compaction causes an ENOSPC error.
This does not handle corner cases where another background job is started while compaction is running, and the other background job triggers ENOSPC, although we do allow the user to provision for these background jobs with SstFileManager::SetCompactionBufferSize.
It also does not handle the case where compaction has finished and some other background job independently triggers ENOSPC.
Usage: Functionality is inside SstFileManager. In particular, users should set SstFileManager::SetMaxAllowedSpaceUsage, which is the reference highwatermark for determining whether to cancel compactions.
Closes https://github.com/facebook/rocksdb/pull/3449
Differential Revision: D7016941
Pulled By: amytai
fbshipit-source-id: 8965ab8dd8b00972e771637a41b4e6c645450445
Summary:
Now that files scheduled for deletion are kept in the same directory, we don't need to constrain deletion scheduler to `db_paths[0]`. Previously this was done because there was a separate trash directory, and this constraint prevented files from being accidentally copied to another filesystem when they're scheduled for deletion.
Closes https://github.com/facebook/rocksdb/pull/3544
Differential Revision: D7093786
Pulled By: ajkr
fbshipit-source-id: 202f5c92d925eafebec1281fb95bb5828d33414f
Summary:
In attempting to build a static lib for use in iOS, I ran in to lots of type errors between uint64_t and size_t. This PR contains the changes I made to get `TARGET_OS=IOS make static_lib` to succeed while also getting Xcode to build successfully with the resulting `librocksdb.a` library imported.
This also compiles for me on macOS and tests fine, but I'm really not sure if I made the correct decisions about where to `static_cast` and where to change types.
Also up for discussion: is iOS worth supporting? Getting the static lib is just part one, we aren't providing any bridging headers or wrappers like the ObjectiveRocks project, it won't be a great experience.
Closes https://github.com/facebook/rocksdb/pull/3503
Differential Revision: D7106457
Pulled By: gfosco
fbshipit-source-id: 82ac2073de7e1f09b91f6b4faea91d18bd311f8e
Summary:
This patch addressed several issues.
Portability including db_test std::thread -> port::Thread Cc: @
and %z to ROCKSDB portable macro. Cc: maysamyabandeh
Implement Env::AreFilesSame
Make the implementation of file unique number more robust
Get rid of C-runtime and go directly to Windows API when dealing
with file primitives.
Implement GetSectorSize() and aling unbuffered read on the value if
available.
Adjust Windows Logger for the new interface, implement CloseImpl() Cc: anand1976
Fix test running script issue where $status var was of incorrect scope
so the failures were swallowed and not reported.
DestroyDB() creates a logger and opens a LOG file in the directory
being cleaned up. This holds a lock on the folder and the cleanup is
prevented. This fails one of the checkpoin tests. We observe the same in production.
We close the log file in this change.
Fix DBTest2.ReadAmpBitmapLiveInCacheAfterDBClose failure where the test
attempts to open a directory with NewRandomAccessFile which does not
work on Windows.
Fix DBTest.SoftLimit as it is dependent on thread timing. CC: yiwu-arbug
Closes https://github.com/facebook/rocksdb/pull/3552
Differential Revision: D7156304
Pulled By: siying
fbshipit-source-id: 43db0a757f1dfceffeb2b7988043156639173f5b
Summary:
Move DuplicateDetector and SetComparator to its own header file in util. It would also address a complaint in the unity test.
Closes https://github.com/facebook/rocksdb/pull/3567
Differential Revision: D7163268
Pulled By: maysamyabandeh
fbshipit-source-id: 6ddf82773473646dbbc1284ae601a78c4907c778
Summary:
Red diff to remove existing implementation of garbage collection. The current approach is reference counting kind of approach and require a lot of effort to get the size counter right on compaction and deletion. I'm going to go with a simple mark-sweep kind of approach and will send another PR for that.
CompactionEventListener was added solely for blob db and it adds complexity and overhead to compaction iterator. Removing it as well.
Closes https://github.com/facebook/rocksdb/pull/3551
Differential Revision: D7130190
Pulled By: yiwu-arbug
fbshipit-source-id: c3a375ad2639a3f6ed179df6eda602372cc5b8df
Summary:
The recent Logger::Close() and DBImpl::Close() implementation rely on
calling the CloseImpl() virtual function from the destructor, which will
not work. Refactor the implementation to have a private close helper
function in derived classes that can be called by both CloseImpl() and
the destructor.
Closes https://github.com/facebook/rocksdb/pull/3528
Reviewed By: gfosco
Differential Revision: D7049303
Pulled By: anand1976
fbshipit-source-id: 76a64cbf403209216dfe4864ecf96b5d7f3db9f4
Summary:
Deadlock: a memtable flush holds DB::mutex_ and calls ThreadLocalPtr::Scrape(), which locks ThreadLocalPtr mutex; meanwhile, a thread exit handler locks ThreadLocalPtr mutex and calls SuperVersionUnrefHandle, which tries to lock DB::mutex_.
This deadlock is hit all the time on our workload. It blocks our release.
In general, the problem is that ThreadLocalPtr takes an arbitrary callback and calls it while holding a lock on a global mutex. The same global mutex is (at least in some cases) locked by almost all ThreadLocalPtr methods, on any instance of ThreadLocalPtr. So, there'll be a deadlock if the callback tries to do anything to any instance of ThreadLocalPtr, or waits for another thread to do so.
So, probably the only safe way to use ThreadLocalPtr callbacks is to do only do simple and lock-free things in them.
This PR fixes the deadlock by making sure that local_sv_ never holds the last reference to a SuperVersion, and therefore SuperVersionUnrefHandle never has to do any nontrivial cleanup.
I also searched for other uses of ThreadLocalPtr to see if they may have similar bugs. There's only one other use, in transaction_lock_mgr.cc, and it looks fine.
Closes https://github.com/facebook/rocksdb/pull/3510
Reviewed By: sagar0
Differential Revision: D7005346
Pulled By: al13n321
fbshipit-source-id: 37575591b84f07a891d6659e87e784660fde815f
Summary:
Calling `std::vector::reserve()` causes memory to be reallocated and then data to be moved. It was called prior to adding every block. This reallocation could be done a huge amount of times, e.g., for users with large index blocks.
Instead, we can simply use `std::vector::emplace_back()` in such a way that preserves the no-memory-leak guarantee, while letting the vector decide when to reallocate space. Now I see reallocation/moving happen O(logN) times, rather than O(N) times, where N is the final size of vector.
Closes https://github.com/facebook/rocksdb/pull/3508
Differential Revision: D6994228
Pulled By: ajkr
fbshipit-source-id: ab7c11e13ff37c8c6c8249be7a79566a4068cd27
Summary:
- removed a few unneeded variables
- fused some variable declarations and their assignments
- fixed right-trimming code in string_util.cc to not underflow
- simplifed an assertion
- move non-nullptr check assertion before dereferencing of that pointer
- pass an std::string function parameter by const reference instead of by value (avoiding potential copy)
Closes https://github.com/facebook/rocksdb/pull/3507
Differential Revision: D7004679
Pulled By: sagar0
fbshipit-source-id: 52944952d9b56dfcac3bea3cd7878e315bb563c4
Summary:
We don't do fsync() after truncate in direct I/O writeable file (in fact we don't do any fsync ever). This can cause metadata not persistent to disk after the file is generated. We call it instead.
Closes https://github.com/facebook/rocksdb/pull/3500
Differential Revision: D6981482
Pulled By: siying
fbshipit-source-id: 7e2b591b7e5dd1b96fc0775515b8b9e6092980ef
Summary:
Now we suppress alignment UBSAN error as a whole. Suppressing 3-way CRC and murmurhash feels a better idea than turning off alignment check as a whole.
Closes https://github.com/facebook/rocksdb/pull/3495
Differential Revision: D6971273
Pulled By: siying
fbshipit-source-id: 080b59fed6df494b9f622ef7cb5d42d39e6a8cdf
Summary:
ForwardIterator::SVCleanup() sometimes didn't pin superversion when it was supposed to. See the added test for the scenario. Here's the ASAN output of the added test without the fix (using `COMPILE_WITH_ASAN=1 make`): https://pastebin.com/9rD0Ywws
Closes https://github.com/facebook/rocksdb/pull/3415
Differential Revision: D6817414
Pulled By: al13n321
fbshipit-source-id: bc80c44ea78a3a1fa885dfa448a26111f91afb24
Summary:
`ReadaheadRandomAccessFile` is used by iterators for file reads in several cases, like in compaction when `compaction_readahead_size > 0` or `use_direct_io_for_flush_and_compaction == true`, or in user iterator when `ReadOptions::readahead_size > 0`. `ReadaheadRandomAccessFile` maintains an internal buffer for readahead data. It assumes that, if the buffer's length is less than `ReadaheadRandomAccessFile::readahead_size_`, which is fixed in the constructor, then EOF has been reached so it doesn't try reading further.
Recently, d938226af4 started calling `RandomAccessFile::Prefetch` with various lengths: 8KB, 16KB, etc. When the `RandomAccessFile` is a `ReadaheadRandomAccessFile`, it triggers the above condition and incorrectly determines EOF. If a block is partially in the readahead buffer and EOF is incorrectly decided, the result is a truncated data block.
The problem is reproducible:
```
TEST_TMPDIR=/data/compaction_bench ./db_bench -benchmarks=fillrandom -write_buffer_size=1048576 -target_file_size_base=1048576 -block_size=18384 -use_direct_io_for_flush_and_compaction=true
...
put error: Corruption: truncated block read from /data/compaction_bench/dbbench/000014.sst offset 20245, expected 10143 bytes, got 8427
```
Closes https://github.com/facebook/rocksdb/pull/3454
Differential Revision: D6869405
Pulled By: ajkr
fbshipit-source-id: 87001c299e7600a37c0dcccbd0368e0954c929cf
Summary:
FilePrefetchBuffer::Prefetch is currently rounds the offset up which does not fit its new use cases in prefetching index/filter blocks, as it would skips over some the offsets that were requested to be prefetched. This patch rounds down instead.
Fixes#3180
Closes https://github.com/facebook/rocksdb/pull/3413
Differential Revision: D6816392
Pulled By: maysamyabandeh
fbshipit-source-id: 3aaeaf59c55d72b61dacfae6d4a8e65eccb3c553
Summary:
Currently, the only way to close an open DB is to destroy the DB
object. There is no way for the caller to know the status. In one
instance, the destructor encountered an error due to failure to
close a log file on HDFS. In order to prevent silent failures, we add
DB::Close() that calls CloseImpl() which must be implemented by its
descendants.
The main failure point in the destructor is closing the log file. This
patch also adds a Close() entry point to Logger in order to get status.
When DBOptions::info_log is allocated and owned by the DBImpl, it is
explicitly closed by DBImpl::CloseImpl().
Closes https://github.com/facebook/rocksdb/pull/3348
Differential Revision: D6698158
Pulled By: anand1976
fbshipit-source-id: 9468e2892553eb09c4c41b8723f590c0dbd8ab7d
Summary:
This fixes the following warnings when compiled with GCC7:
util/transaction_test_util.cc: In static member function ‘static rocksdb::Status rocksdb::RandomTransactionInserter::DBGet(rocksdb::DB*, rocksdb::Transaction*, rocksdb::ReadOptions&, uint16_t, uint64_t, bool, uint64_t*, std::__cxx11::string*, bool*)’:
util/transaction_test_util.cc:75:8: error: ‘snprintf’ output may be truncated before the last format character [-Werror=format-truncation=]
Status RandomTransactionInserter::DBGet(
^~~~~~~~~~~~~~~~~~~~~~~~~
util/transaction_test_util.cc:84:11: note: ‘snprintf’ output between 5 and 6 bytes into a destination of size 5
snprintf(prefix_buf, sizeof(prefix_buf), "%.4u", set_i + 1);
~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
util/transaction_test_util.cc: In static member function ‘static rocksdb::Status rocksdb::RandomTransactionInserter::Verify(rocksdb::DB*, uint16_t, uint64_t, bool, rocksdb::Random64*)’:
util/transaction_test_util.cc:245:8: error: ‘snprintf’ output may be truncated before the last format character [-Werror=format-truncation=]
Status RandomTransactionInserter::Verify(DB* db, uint16_t num_sets,
^~~~~~~~~~~~~~~~~~~~~~~~~
util/transaction_test_util.cc:268:13: note: ‘snprintf’ output between 5 and 6 bytes into a destination of size 5
snprintf(prefix_buf, sizeof(prefix_buf), "%.4u", set_i + 1);
~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Closes https://github.com/facebook/rocksdb/pull/3295
Differential Revision: D6609411
Pulled By: maysamyabandeh
fbshipit-source-id: 33f0add471056eb59db2f8bd4366e6dfbb1a187d
Summary:
**# Summary**
RocksDB uses SSE crc32 intrinsics to calculate the crc32 values but it does it in single way fashion (not pipelined on single CPU core). Intel's whitepaper () published an algorithm that uses 3-way pipelining for the crc32 intrinsics, then use pclmulqdq intrinsic to combine the values. Because pclmulqdq has overhead on its own, this algorithm will show perf gains on buffers larger than 216 bytes, which makes RocksDB a perfect user, since most of the buffers RocksDB call crc32c on is over 4KB. Initial db_bench show tremendous CPU gain.
This change uses the 3-way SSE algorithm by default. The old SSE algorithm is now behind a compiler tag NO_THREEWAY_CRC32C. If user compiles the code with NO_THREEWAY_CRC32C=1 then the old SSE Crc32c algorithm would be used. If the server does not have SSE4.2 at the run time the slow way (Non SSE) will be used.
**# Performance Test Results**
We ran the FillRandom and ReadRandom benchmarks in db_bench. ReadRandom is the point of interest here since it calculates the CRC32 for the in-mem buffers. We did 3 runs for each algorithm.
Before this change the CRC32 value computation takes about 11.5% of total CPU cost, and with the new 3-way algorithm it reduced to around 4.5%. The overall throughput also improved from 25.53MB/s to 27.63MB/s.
1) ReadRandom in db_bench overall metrics
PER RUN
Algorithm | run | micros/op | ops/sec |Throughput (MB/s)
3-way | 1 | 4.143 | 241387 | 26.7
3-way | 2 | 3.775 | 264872 | 29.3
3-way | 3 | 4.116 | 242929 | 26.9
FastCrc32c|1 | 4.037 | 247727 | 27.4
FastCrc32c|2 | 4.648 | 215166 | 23.8
FastCrc32c|3 | 4.352 | 229799 | 25.4
AVG
Algorithm | Average of micros/op | Average of ops/sec | Average of Throughput (MB/s)
3-way | 4.01 | 249,729 | 27.63
FastCrc32c | 4.35 | 230,897 | 25.53
2) Crc32c computation CPU cost (inclusive samples percentage)
PER RUN
Implementation | run | TotalSamples | Crc32c percentage
3-way | 1 | 4,572,250,000 | 4.37%
3-way | 2 | 3,779,250,000 | 4.62%
3-way | 3 | 4,129,500,000 | 4.48%
FastCrc32c | 1 | 4,663,500,000 | 11.24%
FastCrc32c | 2 | 4,047,500,000 | 12.34%
FastCrc32c | 3 | 4,366,750,000 | 11.68%
**# Test Plan**
make -j64 corruption_test && ./corruption_test
By default it uses 3-way SSE algorithm
NO_THREEWAY_CRC32C=1 make -j64 corruption_test && ./corruption_test
make clean && DEBUG_LEVEL=0 make -j64 db_bench
make clean && DEBUG_LEVEL=0 NO_THREEWAY_CRC32C=1 make -j64 db_bench
Closes https://github.com/facebook/rocksdb/pull/3173
Differential Revision: D6330882
Pulled By: yingsu00
fbshipit-source-id: 8ec3d89719533b63b536a736663ca6f0dd4482e9
Summary:
added `ThreadType::BOTTOM_PRIORITY` which is used in the `ThreadStatus` object to indicate the thread is used for bottom-pri compactions. Previously there was a bug where we mislabeled such threads as `ThreadType::LOW_PRIORITY`.
Closes https://github.com/facebook/rocksdb/pull/3270
Differential Revision: D6559428
Pulled By: ajkr
fbshipit-source-id: 96b1a50a9c19492b1a5fd1b77cf7061a6f9f1d1c
Summary:
There were a few places where MSVC's implicit truncation warnings were getting triggered, which was causing the MSVC build to fail due to warnings being treated as errors. This resolves the issues by making the truncations in some places explicit, and by making it so there are no truncations of literals.
Fixes#3239
Supersedes #3259
Closes https://github.com/facebook/rocksdb/pull/3273
Reviewed By: yiwu-arbug
Differential Revision: D6569204
Pulled By: Orvid
fbshipit-source-id: c188cf1cf98d9acb6d94b71875041cc81f8ff088
Summary:
DeleteScheduler::MarkAsTrash() don't handle existing .trash files correctly
This cause rocksdb to not being able to delete existing .trash files on restart
Closes https://github.com/facebook/rocksdb/pull/3261
Differential Revision: D6548003
Pulled By: IslamAbdelRahman
fbshipit-source-id: c3800639412e587a690062c63076a5a08881e0e6
Summary:
I started adding gflags support for cmake on linux and got frustrated that I'd need to duplicate the build_detect_platform logic, which determines namespace based on attempting compilation. We can do it differently -- use the GFLAGS_NAMESPACE macro if available, and if not, that indicates it's an old gflags version without configurable namespace so we can simply hardcode "google".
Closes https://github.com/facebook/rocksdb/pull/3212
Differential Revision: D6456973
Pulled By: ajkr
fbshipit-source-id: 3e6d5bde3ca00d4496a120a7caf4687399f5d656
Summary:
Allow users to configure the trash-to-DB size ratio limit, so
that ratelimits for deletes can be enforced even when larger portions of
the database are being deleted.
Closes https://github.com/facebook/rocksdb/pull/3158
Differential Revision: D6304897
Pulled By: gdavidsson
fbshipit-source-id: a28dd13059ebab7d4171b953ed91ce383a84d6b3
Summary:
This confused some users who were getting compression type from the logs.
Closes https://github.com/facebook/rocksdb/pull/3153
Differential Revision: D6294964
Pulled By: ajkr
fbshipit-source-id: 3c813376d33682dc6ccafc9a78df1a2e2528985e
Summary:
The current implementation of PinnableSlice move assignment have an issue #3163. We are moving away from it instead of try to get the move assignment right, since it is too tricky.
Closes https://github.com/facebook/rocksdb/pull/3164
Differential Revision: D6319201
Pulled By: yiwu-arbug
fbshipit-source-id: 8f3279021f3710da4a4caa14fd238ed2df902c48
Summary:
After move assignment, we need to re-initialized the moved PinnableSlice.
Also update blob_db_impl.cc to not reuse the moved PinnableSlice since it is supposed to be in an undefined state after move.
Closes https://github.com/facebook/rocksdb/pull/3127
Differential Revision: D6238585
Pulled By: yiwu-arbug
fbshipit-source-id: bd99f2e37406c4f7de160c7dee6a2e8126bc224e
Summary:
util/concurrent_arena.h:
CID 1396145 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
2. uninit_member: Non-static class member free_begin_ is not initialized in this constructor nor in any functions that it calls.
94 Shard() : allocated_and_unused_(0) {}
util/dynamic_bloom.cc:
1. Condition hash_func == NULL, taking true branch.
CID 1322821 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
3. uninit_member: Non-static class member data_ is not initialized in this constructor nor in any functions that it calls.
47 hash_func_(hash_func == nullptr ? &BloomHash : hash_func) {}
48
util/file_reader_writer.h:
204 private:
205 AlignedBuffer buffer_;
member_not_init_in_gen_ctor: The compiler-generated constructor for this class does not initialize buffer_offset_.
206 uint64_t buffer_offset_;
CID 1418246 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
member_not_init_in_gen_ctor: The compiler-generated constructor for this class does not initialize buffer_len_.
207 size_t buffer_len_;
208};
util/thread_local.cc:
341#endif
CID 1322795 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR)
3. uninit_member: Non-static class member pthread_key_ is not initialized in this constructor nor in any functions that it calls.
342}
40struct ThreadData {
2. uninit_member: Non-static class member next is not initialized in this constructor nor in any functions that it calls.
CID 1400668 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR)
4. uninit_member: Non-static class member prev is not initialized in this constructor nor in any functions that it calls.
41 explicit ThreadData(ThreadLocalPtr::StaticMeta* _inst) : entries(), inst(_inst) {}
42 std::vector<Entry> entries;
1. member_decl: Class member declaration for next.
43 ThreadData* next;
3. member_decl: Class member declaration for prev.
44 ThreadData* prev;
45 ThreadLocalPtr::StaticMeta* inst;
46};
Closes https://github.com/facebook/rocksdb/pull/3123
Differential Revision: D6233566
Pulled By: sagar0
fbshipit-source-id: aa2068790ea69787a0035c0db39d59b0c25108db
Summary:
Instead of using samples directly, we now support passing the samples through zstd's dictionary generator when `CompressionOptions::zstd_max_train_bytes` is set to nonzero. If set to zero, we will use the samples directly as the dictionary -- same as before.
Note this is the first step of #2987, extracted into a separate PR per reviewer request.
Closes https://github.com/facebook/rocksdb/pull/3057
Differential Revision: D6116891
Pulled By: ajkr
fbshipit-source-id: 70ab13cc4c734fa02e554180eed0618b75255497
Summary:
SstFileManager move files that need to be deleted into a trash directory.
Deprecate this behaviour and instead add ".trash" extension to files that need to be deleted
Closes https://github.com/facebook/rocksdb/pull/2970
Differential Revision: D5976805
Pulled By: IslamAbdelRahman
fbshipit-source-id: 27374ece4315610b2792c30ffcd50232d4c9a343
Summary:
ColumnFamilyOptions::compaction_options_fifo and all its sub-fields can be set dynamically now.
Some of the ways in which the fifo compaction options can be set are:
- `SetOptions({{"compaction_options_fifo", "{max_table_files_size=1024}"}})`
- `SetOptions({{"compaction_options_fifo", "{ttl=600;}"}})`
- `SetOptions({{"compaction_options_fifo", "{max_table_files_size=1024;ttl=600;}"}})`
- `SetOptions({{"compaction_options_fifo", "{max_table_files_size=51;ttl=49;allow_compaction=true;}"}})`
Most of the code has been made generic enough so that it could be reused later to make universal options (and other such nested defined-types) dynamic with very few lines of parsing/serializing code changes.
Introduced a few new functions like `ParseStruct`, `SerializeStruct` and `GetStringFromStruct`.
The duplicate code in `GetStringFromDBOptions` and `GetStringFromColumnFamilyOptions` has been moved into `GetStringFromStruct`. So they become just simple wrappers now.
Closes https://github.com/facebook/rocksdb/pull/3006
Differential Revision: D6058619
Pulled By: sagar0
fbshipit-source-id: 1e8f78b3374ca5249bb4f3be8a6d3bb4cbc52f92