Summary:
Provide a block_align option in BlockBasedTableOptions to allow
alignment of SST file data blocks. This will avoid higher
IOPS/throughput load due to < 4KB data blocks spanning 2 4KB pages.
When this option is set to true, the block alignment is set to lower of
block size and 4KB.
Closes https://github.com/facebook/rocksdb/pull/3502
Differential Revision: D7400897
Pulled By: anand1976
fbshipit-source-id: 04cc3bd144e88e3431a4f97604e63ad7a0f06d44
Summary:
Current commit cache size is 2^21. This was due to a type. With 2^23 commit entries we can have transactions as long as 64s without incurring the cost of having them evicted from the commit cache before their commit. Here is the math:
2^23 / 2 (one out of two seq numbers are for commit) / 2^16 TPS = 2^6 = 64s
Closes https://github.com/facebook/rocksdb/pull/3657
Differential Revision: D7411211
Pulled By: maysamyabandeh
fbshipit-source-id: e7cacf40579f3acf940643d8a1cfe5dd201caa35
Summary:
Currently log_writer->AddRecord in WriteImpl is protected from concurrent calls via FlushWAL only if two_write_queues_ option is set. The patch fixes the problem by i) skip log_writer->AddRecord in FlushWAL if manual_wal_flush is not set, ii) protects log_writer->AddRecord in WriteImpl via log_write_mutex_ if manual_wal_flush_ is set but two_write_queues_ is not.
Fixes#3599
Closes https://github.com/facebook/rocksdb/pull/3656
Differential Revision: D7405608
Pulled By: maysamyabandeh
fbshipit-source-id: d6cc265051c77ae49c7c6df4f427350baaf46934
Summary:
I found that each instance of MySQLStyleTransactionTest.TransactionStressTest/x is taking more than 10 hours to complete on our continuous testing environment, causing the whole valgrind run to timeout after a day. So excluding these tests.
Closes https://github.com/facebook/rocksdb/pull/3652
Differential Revision: D7400332
Pulled By: sagar0
fbshipit-source-id: 987810574506d01487adf7c2de84d4817ec3d22d
Summary:
Currently AddPrepared is performed only on the first sub-batch if there are duplicate keys in the write batch. This could cause a problem if the transaction takes too long to commit and the seq number of the first sub-patch moved to old_prepared_ but not the seq of the later ones. The patch fixes this by calling AddPrepared for all sub-patches.
Closes https://github.com/facebook/rocksdb/pull/3651
Differential Revision: D7388635
Pulled By: maysamyabandeh
fbshipit-source-id: 0ccd80c150d9bc42fe955e49ddb9d7ca353067b4
Summary:
RangeDelAggregator will remember the files whose range tombstones have been added,
so the caller can check whether the file has been added before call AddTombstones.
Closes https://github.com/facebook/rocksdb/pull/3635
Differential Revision: D7354604
Pulled By: ajkr
fbshipit-source-id: 9b9f7ec130556028df417e650711554b46d8d107
Summary:
We have not been updating our HISTORY.md change log with the RocksJava changes. Going forward, lets add Java changes also to HISTORY.md.
There is an old java/HISTORY-JAVA.md, but it hasn't been updated in years. It is much easier to remember to update the change log in a single file, HISTORY.md.
I added information about shared block cache here, which was introduced in #3623.
Closes https://github.com/facebook/rocksdb/pull/3647
Differential Revision: D7384448
Pulled By: sagar0
fbshipit-source-id: 9b6e569f44e6df5cb7ba06413d9975df0b517d20
Summary:
Summary
========
`InlineSkipList<>::Insert` takes the `key` parameter as a C-string. Then, it performs multiple comparisons with it requiring the `GetLengthPrefixedSlice()` to be spawn in `MemTable::KeyComparator::operator()(const char* prefix_len_key1, const char* prefix_len_key2)` on the same data over and over. The patch tries to optimize that.
Rough performance comparison
=====
Big keys, no compression.
```
$ ./db_bench --writes 20000000 --benchmarks="fillrandom" --compression_type none -key_size 256
(...)
fillrandom : 4.222 micros/op 236836 ops/sec; 80.4 MB/s
```
```
$ ./db_bench --writes 20000000 --benchmarks="fillrandom" --compression_type none -key_size 256
(...)
fillrandom : 4.064 micros/op 246059 ops/sec; 83.5 MB/s
```
TODO
======
In ~~a separated~~ this PR:
- [x] Go outside the write path. Maybe even eradicate the C-string-taking variant of `KeyIsAfterNode` entirely.
- [x] Try to cache the transformations applied by `KeyComparator` & friends in situations where we havy many comparisons with the same key.
Closes https://github.com/facebook/rocksdb/pull/3516
Differential Revision: D7059300
Pulled By: ajkr
fbshipit-source-id: 6f027dbb619a488129f79f79b5f7dbe566fb2dbb
Summary:
Fsync after writing global sequence number to the ingestion file in ExternalSstFileIngestionJob. Otherwise the file metadata could be incorrect.
Closes https://github.com/facebook/rocksdb/pull/3644
Differential Revision: D7373813
Pulled By: sagar0
fbshipit-source-id: 4da2c9e71a8beb5c08b4ac955f288ee1576358b8
Summary:
It was misnamed. It actually updates `bg_error_` if `PreprocessWrite()` or `WriteToWAL()` fail, not related to the user callback.
Closes https://github.com/facebook/rocksdb/pull/3485
Differential Revision: D6955787
Pulled By: ajkr
fbshipit-source-id: bd7afc3fdb7a52830c021cbfc25fcbc3ab7d5e10
Summary:
Add `bytes_max_delete_chunk` in SstFileManager so that we can drop a large file in multiple batches.
Closes https://github.com/facebook/rocksdb/pull/3640
Differential Revision: D7358679
Pulled By: siying
fbshipit-source-id: ef17f0da2f5723dbece2669485a9b91b3edc0bb7
Summary:
This commit fixes a race condition on calling SetLastPublishedSequence. The function must be called only from the 2nd write queue when two_write_queues is enabled. However there was a bug that would also call it from the main write queue if CommitTimeWriteBatch is provided to the commit request and yet use_only_the_last_commit_time_batch_for_recovery optimization is not enabled. To fix that we penalize the commit request in such cases by doing an additional write solely to publish the seq number from the 2nd queue.
Closes https://github.com/facebook/rocksdb/pull/3641
Differential Revision: D7361508
Pulled By: maysamyabandeh
fbshipit-source-id: bf8f7a27e5cccf5425dccbce25eb0032e8e5a4d7
Summary:
The 10MB buffer in BackupEngineImpl::BackupMeta::StoreToFile can be corrupted with a large number of files. Added a check to determine current buffer length and append data to file if buffer becomes full.
Resolves https://github.com/facebook/rocksdb/issues/3228
Closes https://github.com/facebook/rocksdb/pull/3636
Differential Revision: D7354160
Pulled By: ajkr
fbshipit-source-id: eec12d38095a0d17551a4aaee52b99d30a555722
Summary:
This pull request exposes the interface of PerfContext as C API
Closes https://github.com/facebook/rocksdb/pull/3607
Differential Revision: D7294225
Pulled By: ajkr
fbshipit-source-id: eddcfbc13538f379950b2c8b299486695ffb5e2c
Summary:
Changes to support sharing block cache using the Java API.
Previously DB instances could share the block cache only when the same Options instance is passed to all the DB instances. But now, with this change, it is possible to explicitly create a cache and pass it to multiple options instances, to share the block cache.
Implementing this for [Rocksandra](https://github.com/instagram/cassandra/tree/rocks_3.0), but this feature has been requested by many java api users over the years.
Closes https://github.com/facebook/rocksdb/pull/3623
Differential Revision: D7305794
Pulled By: sagar0
fbshipit-source-id: 03e4e8ed7aeee6f88bada4a8365d4279ede2ad71
Summary:
* Fix BlobDBImpl::GCFileAndUpdateLSM doesn't close the new file, and the new file will not be able to be garbage collected later.
* Fix BlobDBImpl::GCFileAndUpdateLSM doesn't copy over metadata from old file to new file.
Closes https://github.com/facebook/rocksdb/pull/3639
Differential Revision: D7355092
Pulled By: yiwu-arbug
fbshipit-source-id: 4fa3594ac5ce376bed1af04a545c532cfc0088c4
Summary:
When destorying column family handle after the column family has been deleted, the handle may hold share pointers of some objects in ColumnFamilyOptions, but in the destructor, the destructing order may cause some of the objects to be destoryed before being used by the following steps. Fix it by making a copy of the option object and destory it as the last step.
Closes https://github.com/facebook/rocksdb/pull/3610
Differential Revision: D7281025
Pulled By: siying
fbshipit-source-id: ac18f3b2841788cba4ccfa1abd8d59158c1113bc
Summary:
Previously, the compaction in `DBCompactionTestWithParam.ForceBottommostLevelCompaction` generated multiple files in no-compression use case, andone file in compression use case. I increased `target_file_size_base` so it generates one file in both use cases.
Closes https://github.com/facebook/rocksdb/pull/3625
Differential Revision: D7311885
Pulled By: ajkr
fbshipit-source-id: 97f249fa83a9924ac34357a4bb3189c969ecb107
Summary:
I modified the Makefile so that we can compile rocksdb on OpenBSD.
The instructions for building have been added to INSTALL.md.
The whole compilation process works fine like this on OpenBSD-current
Closes https://github.com/facebook/rocksdb/pull/3617
Differential Revision: D7323754
Pulled By: siying
fbshipit-source-id: 990037d1cc69138d22f85bd77ef4dc8c1ba9edea
Summary:
In original $ROCKSDB_HOME/Makefile, the command used to generate ctags is
```
ctags * -R
```
However, this failed to generate tags for me.
I did some search on the usage of ctags command and found that it should be
```
ctags -R .
```
or
```
ctags -R *
```
After the change, I can find the tags in vim using `:ts <identifier>`.
Closes https://github.com/facebook/rocksdb/pull/3626
Reviewed By: ajkr
Differential Revision: D7320217
Pulled By: riversand963
fbshipit-source-id: e4cd8f8a67842370a2343f0213df3cbd07754111
Summary:
This changes the console output when the RocksJava tests are run. It makes spotting the errors and failures much easier; perviously the output was malformed with results like "ERun" where the "E" represented an error in the preceding test.
Closes https://github.com/facebook/rocksdb/pull/3621
Differential Revision: D7306172
Pulled By: sagar0
fbshipit-source-id: 3fa6f6e1ca6c6ea7ceef55a23ca81903716132b7
Summary:
If there are a lot of overlapped files in L0, creating a merging iterator for
all files in L0 to check overlap can be very slow because we need to read and
seek all files in L0. However, in that case, the ingested file is likely to
overlap with some files in L0, so if we check those files one by one, we can stop
once we encounter overlap.
Ref: https://github.com/facebook/rocksdb/issues/3540
Closes https://github.com/facebook/rocksdb/pull/3564
Differential Revision: D7196784
Pulled By: anand1976
fbshipit-source-id: 8700c1e903bd515d0fa7005b6ce9b3a3d9db2d67
Summary:
This is a small API extension to allow the CompactFiles method to return the names of files that were created during the compaction.
Closes https://github.com/facebook/rocksdb/pull/3608
Differential Revision: D7275789
Pulled By: siying
fbshipit-source-id: 1ec0c3954a0f10cd877efb5f29f9be6c7b59e9ba
Summary:
We missed updating version.h on master when cutting 5.11.fb and 5.12.fb branches. It should be the same as the version in the latest release branch (or should it be one more?).
I noticed this when trying to run some upgrade/downgrade tests from 5.11 to some new code on master.
Closes https://github.com/facebook/rocksdb/pull/3611
Differential Revision: D7282917
Pulled By: sagar0
fbshipit-source-id: 205ee75b77c5b6bbcea95a272760b427025a4aba
Summary:
`Writer::WriteBuffer` was always called at the beginning of checkpoint/backup. But that log writer has no internal synchronization, which meant the same buffer could be flushed twice in a race condition case, causing a WAL entry to be duplicated. Then subsequent WAL entries would be at unexpected offsets, causing the 32KB block boundaries to be overlapped and manifesting as a corruption.
This PR fixes the behavior to only use `WriteBuffer` (via `FlushWAL`) in checkpoint/backup when manual WAL flush is enabled. In that case, users are responsible for providing synchronization between WAL flushes. We can also consider removing the call entirely.
Closes https://github.com/facebook/rocksdb/pull/3603
Differential Revision: D7277447
Pulled By: ajkr
fbshipit-source-id: 1b15bd7fd930511222b075418c10de0aaa70a35a
Summary:
Implemented PositionedAppend() and use_direct_io() for TestWritableFile.
With these changes, FaultInjectionTestEnv can be used with DirectIO enabled.
Closes https://github.com/facebook/rocksdb/pull/3586
Differential Revision: D7244305
Pulled By: yiwu-arbug
fbshipit-source-id: f6b7aece53daa0f9977bc684164a0693693e514c
Summary:
I landed #3544 which made this test flaky. The reason was the files scheduled for deletion sometimes went through the trash-marking process, and sometimes were deleted directly. Our counter only bumped on the former code path, so if the latter code path was used, we'd miss counting a file deleted by deletion scheduler. This PR also bumps the counter in the latter code path.
Closes https://github.com/facebook/rocksdb/pull/3593
Differential Revision: D7226173
Pulled By: yiwu-arbug
fbshipit-source-id: 81ab44c60834df6ff88db1d73ea34e26c6e93c39
Summary:
enable_pipelined_write was not set in BuildDBOptions() causing its default
value to be dumped in the OPTIONS file
Closes https://github.com/facebook/rocksdb/pull/3585
Differential Revision: D7226395
Pulled By: yiwu-arbug
fbshipit-source-id: 45a659a48d18103ac9ee74bb8805dd0a6ec12474
Summary:
This is an abstraction for working with custom Comparators implemented in native C++ code from Java. Native code must directly extend `rocksdb::Comparator`. When the native code comparator is compiled into the RocksDB codebase, you can then create a Java Class, and JNI stub to wrap it.
Useful if the C++/JNI barrier overhead is too much for your applications comparator performance.
An example is provided in `java/rocksjni/native_comparator_wrapper_test.cc` and `java/src/main/java/org/rocksdb/NativeComparatorWrapperTest.java`.
Closes https://github.com/facebook/rocksdb/pull/3334
Differential Revision: D7172605
Pulled By: miasantreble
fbshipit-source-id: e24b7eb267a3bcb6afa214e0379a1d5e8a2ceabe
Summary:
Added a stat that counts the number of cancelled compactions.
Closes https://github.com/facebook/rocksdb/pull/3574
Differential Revision: D7190259
Pulled By: amytai
fbshipit-source-id: d5ce82dc9398da6d6d34023ad4ed8cec909852a3
Summary:
The CRC is actually calculated based on the record type and payload.
The wiki should also be updated accordingly and extended with a section on the recyclable record format.
Closes https://github.com/facebook/rocksdb/pull/3576
Differential Revision: D7196478
Pulled By: siying
fbshipit-source-id: 39f7a0395075cc73e2aa2bfc9e42c85bce35e765