Summary:
Choose to preload some files if options.max_open_files != -1. This can slightly narrow the gap of performance between options.max_open_files is -1 and a large number. To avoid a significant regression to DB reopen speed if options.max_open_files != -1. Limit the files to preload in DB open time to 16.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3340
Differential Revision: D6686945
Pulled By: siying
fbshipit-source-id: 8ec11bbdb46e3d0cdee7b6ad5897a09c5a07869f
Summary:
1. Remove unused API SubtractCompactionTask().
2. Assert outstanding tasks drop to zero in ConcurrentTaskLimiterImpl destructor.
3. Remove GetOutstandingTask() check from manual compaction test, as TEST_WaitForCompact() doesn't synced with 'delete prepicked_compaction' in DBImpl::BGWorkCompaction(), which may make the test flaky.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4795
Differential Revision: D13542183
Pulled By: siying
fbshipit-source-id: 5eb2a47e62efe4126937149aa0df6e243ebefc33
Summary:
Add a new per level counter for block cache hits, increase it by one on every successful attempt to get an entry from cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4796
Differential Revision: D13513688
Pulled By: zinoale
fbshipit-source-id: 104df038f1232e3356e162eb2d8ca138e34a8281
Summary:
Previously we were cleaning up range tombstone meta-block by calling `ReleaseCachedEntry`, which wouldn't work if `value != nullptr && cache_handle == nullptr`. This happened at least in the case with mmap reads and block cache both enabled. I noticed `NewDataBlockIterator` intends to handle all these cases, so migrated to that instead of `NewUnfragmentedRangeTombstoneIterator`.
Also changed the table-opening logic to fail on `ReadRangeDelBlock` failure, since that can cause data corruption. Added a test case to verify this behavior. Note the test case does not fail on `TryReopen` because failure to preload table handlers is not considered critical. However, it does fail on any read involving that file since it cannot return correct data.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4810
Differential Revision: D13534296
Pulled By: ajkr
fbshipit-source-id: 55dde1111717cea6ec4bf38418daab81ccef3599
Summary:
Introduce the first CPU timing counter, perf_context.get_cpu_nanos. This opens a door to more CPU counters in the future.
Only Posix Env has it implemented using clock_gettime() with CLOCK_THREAD_CPUTIME_ID. How accurate the counter is depends on the platform.
Make PerfStepTimer to take an Env as an argument, and sometimes pass it in. The direct reason is to make the unit tests to use SpecialEnv where we can ingest logic there. But in long term, this is a good change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4741
Differential Revision: D13287798
Pulled By: siying
fbshipit-source-id: 090361049d9d5095d1d1a369fe1338d2e2e1c73f
Summary:
To avoid a race on the flag, make it an atomic_bool. This
doesn't seem to significantly affect benchmarks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4801
Differential Revision: D13523845
Pulled By: abhimadan
fbshipit-source-id: 3bc29f53c50a4e06cd9f8c6232a4bb221868e055
Summary:
This TODO was already addressed, but I forgot to remove it
before landing the PR it came from.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4800
Differential Revision: D13522284
Pulled By: abhimadan
fbshipit-source-id: 7766bc4f5b54e47d355cf26137ef5e86c604472a
Summary:
This PR contains the following fixes:
1. Fixing Makefile to support non-default locations of developer tools
2. Fixing compile error using a patch from https://github.com/facebook/rocksdb/pull/4007
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4687
Differential Revision: D13287263
Pulled By: riversand963
fbshipit-source-id: 4525eb42ba7b6f82af5f9bfb8e52fa4024e27ccc
Summary:
1) `transaction_base.h` overrides from `transaction.h` with a `const boolean do_validate`.
The non-const base declaration, which I cannot see the need for, causes a compilation error on Microsoft Windows.
2) Implicit cast from `double` to `uint64_t` causes a compilation error on Microsoft Windows.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4798
Differential Revision: D13519734
Pulled By: sagar0
fbshipit-source-id: 6e8cb80e9a589b1122e1500c21b8e3a3a472b459
Summary:
Note that Snappy now requires CMake to build it, so I added a note about RocksJava to the README.md file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4761
Differential Revision: D13403811
Pulled By: ajkr
fbshipit-source-id: 8fcd7e3dc7f7152080364a374d3065472f417eff
Summary:
in certain cases, we do not perform memtable switching if the active
memtable of the column family is empty. Two exceptions:
1. In manual flush, if cached_recoverable_state_empty_ is false, then we need
to switch memtable due to requirement of transaction.
2. In switch WAL, we need to switch memtable anyway because we have to seal the
memtable if the WAL on which it depends will be closed.
This change can potentially delay the occurence of write stalls because number
of memtables increase more slowly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4792
Differential Revision: D13499501
Pulled By: riversand963
fbshipit-source-id: 91c9b17ae753578578039f3851667d93610005e1
Summary:
Now that v2 is fully functional, the v1 aggregator is removed.
The v2 aggregator has been renamed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4778
Differential Revision: D13495930
Pulled By: abhimadan
fbshipit-source-id: 9d69500a60a283e79b6c4fa938fc68a8aa4d40d6
Summary:
Avoids re-downloading the .tar.gz files for the static build of RocksJava if they are already present.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4769
Differential Revision: D13491919
Pulled By: ajkr
fbshipit-source-id: 9265f577e049838dc40335d54f1ff2b4f972c38c
Summary:
RangeDelAggregatorV2 now supports ShouldDelete calls on
snapshot stripes and creation of range tombstone compaction iterators.
RangeDelAggregator is no longer used on any non-test code path, and will
be removed in a future commit.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4758
Differential Revision: D13439254
Pulled By: abhimadan
fbshipit-source-id: fe105bcf8e3d4a2df37a622d5510843cd71b0401
Summary:
Expose common stats min,max,count,sum via statistics JNI. These stats are not fully exposed on the Java side as is, but are available on the native side.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4742
Differential Revision: D13403766
Pulled By: ajkr
fbshipit-source-id: 5b70f7bd3fb7490aab73dcbd09f13490fce5c773
Summary:
The test fails sporadically expecting the DB to be empty after DeleteFilesInRange(..., nullptr, nullptr) call which is not. Debugging shows cases where the files are skipped since they are being compacted. The patch fixes the test by waiting for the last CompactRange to finish before calling DeleteFilesInRange.
Verified by
```
~/gtest-parallel/gtest-parallel ./db_compaction_test --gtest_filter=DBCompactionTest.DeleteFileRange --repeat=10000
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4784
Differential Revision: D13469402
Pulled By: maysamyabandeh
fbshipit-source-id: 3d8f44abe205b82c69f01e7edf27e1f8098248e1
Summary:
Updating the `HistogramType.java` and `TickerType.java` to expose and correct metrics for statistics callbacks.
Moved `NO_ITERATOR_CREATED` to the proper stat name and deprecated `NO_ITERATORS`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4733
Differential Revision: D13466936
Pulled By: sagar0
fbshipit-source-id: a58d1edcc07c7b68c3525b1aa05828212c89c6c7
Summary:
Separate flag for enabling option from flag for enabling dedicated atomic stress test. I have found setting the former without setting the latter can detect different problems.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4781
Differential Revision: D13463211
Pulled By: ajkr
fbshipit-source-id: 054f777885b2dc7d5ea99faafa21d6537eee45fd
Summary:
options_file_number_ must be written under db::mutex_ sine its read is protected by mutex_ in ::GetLiveFiles(). However currently it is written in ::RenameTempFileToOptionsFile() which according to its contract must be called without holding db::mutex_. The patch fixes the race condition by also acquitting the mutex_ before writing options_file_number_. Also it does that only if the rename of option file is successful.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4780
Differential Revision: D13461411
Pulled By: maysamyabandeh
fbshipit-source-id: 2d5bae96a1f3e969ef2505b737cf2d7ae749787b
Summary:
If one column family is dropped, we should simply skip it and continue to flush
other active ones.
Currently we use Status::ShutdownInProgress to notify caller of column families
being dropped. In the future, we should consider using a different Status code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4708
Differential Revision: D13378954
Pulled By: riversand963
fbshipit-source-id: 42f248cdf2d32d4c0f677cd39012694b8f1328ca
Summary:
It sometimes times out with it is run with TSAN. The patch reduces the iteration from 50 to 30. This reduces the normal runtime from 5.2 to 3.1 seconds and should similarly address the TSAN timeout problem.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4779
Differential Revision: D13456862
Pulled By: maysamyabandeh
fbshipit-source-id: fdc0ad7d781b1c33b771d2415ff5fa2f1b5e2537
Summary:
The PR is targeting to resolve the issue of:
https://github.com/facebook/rocksdb/issues/3972#issue-330771918
We have a rocksdb created with leveled-compaction with multiple column families (CFs), some of CFs are using HDD to store big and less frequently accessed data and others are using SSD.
When there are continuously write traffics going on to all CFs, the compaction thread pool is mostly occupied by those slow HDD compactions, which blocks fully utilize SSD bandwidth.
Since atomic write and transaction is needed across CFs, so splitting it to multiple rocksdb instance is not an option for us.
With the compaction thread control, we got 30%+ HDD write throughput gain, and also a lot smooth SSD write since less write stall happening.
ConcurrentTaskLimiter can be shared with multi-CFs across rocksdb instances, so the feature does not only work for multi-CFs scenarios, but also for multi-rocksdbs scenarios, who need disk IO resource control per tenant.
The usage is straight forward:
e.g.:
//
// Enable compaction thread limiter thru ColumnFamilyOptions
//
std::shared_ptr<ConcurrentTaskLimiter> ctl(NewConcurrentTaskLimiter("foo_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = ctl;
...
//
// Compaction thread limiter can be tuned or disabled on-the-fly
//
ctl->SetMaxOutstandingTask(12); // enlarge to 12 tasks
...
ctl->ResetMaxOutstandingTask(); // disable (bypass) thread limiter
ctl->SetMaxOutstandingTask(-1); // Same as above
...
ctl->SetMaxOutstandingTask(0); // full throttle (0 task)
//
// Sharing compaction thread limiter among CFs (to resolve multiple storage perf issue)
//
std::shared_ptr<ConcurrentTaskLimiter> ctl_ssd(NewConcurrentTaskLimiter("ssd_limiter", 8));
std::shared_ptr<ConcurrentTaskLimiter> ctl_hdd(NewConcurrentTaskLimiter("hdd_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt_ssd1(options);
ColumnFamilyOptions cf_opt_ssd2(options);
ColumnFamilyOptions cf_opt_hdd1(options);
ColumnFamilyOptions cf_opt_hdd2(options);
ColumnFamilyOptions cf_opt_hdd3(options);
// SSD CFs
cf_opt_ssd1.compaction_thread_limiter = ctl_ssd;
cf_opt_ssd2.compaction_thread_limiter = ctl_ssd;
// HDD CFs
cf_opt_hdd1.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd2.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd3.compaction_thread_limiter = ctl_hdd;
...
//
// The limiter is disabled by default (or set to nullptr explicitly)
//
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = nullptr;
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4332
Differential Revision: D13226590
Pulled By: siying
fbshipit-source-id: 14307aec55b8bd59c8223d04aa6db3c03d1b0c1d
Summary:
The test has been failing sporadically probably because the configured compaction options were actually unused. Verified that by the following:
```
~/gtest-parallel/gtest-parallel ./db_compaction_test --gtest_filter=DBCompactionTest.DeleteFileRange --repeat=1000
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4776
Differential Revision: D13441052
Pulled By: maysamyabandeh
fbshipit-source-id: d35075b9e6cef9b9c9d0d571f9cd72ade8eda55d
Summary:
In Direct I/O case, WritableFileWriter::Close() rewrites the last block again, even if there is nothing new. The reason is that, Close() flushes the buffer. For non-direct I/O case, the buffer is empty in this case so it is a no-op. However, in direct I/O case, the partial data in the last block is kept in the buffer because it needs to be rewritten for the next write. This piece of data is flushed again. This commit fixes it by skipping this write out if `pending_sync_` flag shows that there isn't new data sync last sync.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4771
Differential Revision: D13420426
Pulled By: siying
fbshipit-source-id: 9d39ec9a215b1425d4ed40d85e0eba1f5daa75c6
Summary:
This PR fixes#4721. When an exception is caught and thrown as a different exception, then the original exception should be inserted as a cause of the new exception. This bug in RocksDB was swallowing the underlying exception from `NativeLibraryLoader` and throwing the following exception
```
...
Caused by: java.lang.RuntimeException: Unable to load the RocksDB shared libraryjava.nio.channels.ClosedByInterruptException
at org.rocksdb.RocksDB.loadLibrary(RocksDB.java:67)
at org.rocksdb.RocksDB.<clinit>(RocksDB.java:35)
... 73 more
```
The fix is simple and self-explanatory.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4728
Differential Revision: D13418371
Pulled By: sagar0
fbshipit-source-id: d76c25af2a83a0f8ba62cc8d7b721bfddc85fdf1
Summary:
To support the flush/compaction use cases of RangeDelAggregator
in v2, FragmentedRangeTombstoneIterator now supports dropping tombstones
that cannot be read in the compaction output file. Furthermore,
FragmentedRangeTombstoneIterator supports the "snapshot striping" use
case by allowing an iterator to be split by a list of snapshots.
RangeDelAggregatorV2 will use these changes in a follow-up change.
In the process of making these changes, other miscellaneous cleanups
were also done in these files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4740
Differential Revision: D13287382
Pulled By: abhimadan
fbshipit-source-id: f5aeb03e1b3058049b80c02a558ee48f723fa48c
Summary:
Fixes some RocksJava regressions recently introduced, whereby RocksJava would not build on JDK 7.
These should have been visible on Travis-CI!
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4768
Differential Revision: D13418173
Pulled By: sagar0
fbshipit-source-id: 57bf223188887f84d9e072031af2e0d2c8a69c30
Summary:
Change the directory where ExternalSSTFileBasicTest* tests run.
**Problem:**
Without this change, I spent considerable time chasing around a non-existent issue as ExternalSSTFileTest.* and ExternalSSTFileBasicTest.* create similar directories.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4766
Differential Revision: D13409384
Pulled By: sagar0
fbshipit-source-id: c33e1f4d505dfa6efbc788d6c57cdb680053ded3
Summary:
When adding CompactionFilter and CompactionFilterFactory settings to the Java layer, ColumnFamilyOptions was modified directly instead of ColumnFamilyOptionsInterface. This meant that the old-stye Options monolith was left behind.
This patch fixes that, by:
- promoting the CompactionFilter + CompactionFilterFactory setters from ColumnFamilyOptions -> ColumnFamilyOptionsInterface
- adding getters in ColumnFamilyOptionsInterface
- implementing setters in Options
- implementing getters in both ColumnFamilyOptions and Options
- adding testcases
- reusing a test CompactionFilterFactory by moving it to a common location
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3461
Differential Revision: D13278788
Pulled By: sagar0
fbshipit-source-id: 72602c6eb97dc80734e718abb5e2e9958d3c753b
Summary:
It is possible to see a situation like the following when
subcompactions are enabled:
1. A subcompaction boundary is set to `[b, e)`.
2. The first output file in a subcompaction has `c@20` as its smallest key
3. The range tombstone `[a, d)30` is encountered.
4. The tombstone is written to the range-del meta block and the new
smallest key is set to `b@0` (since no keys in this subcompaction's
output can be smaller than `b`).
5. A key `b@10` in a lower level will now reappear, since it is not
covered by the truncated start key `b@0`.
In general, unless the smallest data key in a file has a seqnum of 0, it
is not safe to truncate a tombstone at the start key to have a seqnum of
0, since it can expose keys with a seqnum greater than 0 but less than
the tombstone's actual seqnum.
To fix this, when the lower bound of a file is from the subcompaction
boundaries, we now set the seqnum of an artificially extended smallest
key to the tombstone's seqnum. This is safe because subcompactions
operate over disjoint sets of keys, and the subcompactions that can
experience this problem are not the first subcompaction (which is
unbounded on the left).
Furthermore, there is now an assertion to detect the described anomalous
case.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4723
Differential Revision: D13236188
Pulled By: abhimadan
fbshipit-source-id: a6da6a113f2de1e2ff307ca72e055300c8fe5692
Summary:
Compile logs have a bit of noise due to missing javadoc annotations. Updating docs to reduce.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4764
Differential Revision: D13400193
Pulled By: sagar0
fbshipit-source-id: 65c7efb70747cc3bb35a336a6881ea6536ae5ff4
Summary:
Fix the definition of assumed_tracked in Transaction that was introduced in #4680
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4762
Differential Revision: D13399150
Pulled By: maysamyabandeh
fbshipit-source-id: 2a30fe49e3c44adacd7e45cd48eae95023ca9dca
Summary:
1. DBImplReadOnly::GetLiveFiles should not return NotSupported. Instead, it
should call DBImpl::GetLiveFiles(flush_memtable=false).
2. In DBImp::Recover, we should also recover the OPTIONS file name and/or
number so that an immediate subsequent GetLiveFiles will get the correct
OPTIONS name.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4681
Differential Revision: D13069205
Pulled By: riversand963
fbshipit-source-id: 3e6a0174307d06db5a01feb099b306cea1f7f88a
Summary:
Add counters to track block cache index/filter hits and misses. We currently count aggregate hits and misses, which includes index/filter/data blocks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4540
Differential Revision: D10459652
Pulled By: anand1976
fbshipit-source-id: 0c59eee7f12f5103dcb6686f0e7995babe63d425
Summary:
Uses a newer build toolchain but the same old GLIBC when building releases of RocksJava for Linux x64 in the Docker Container.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4756
Differential Revision: D13383575
Pulled By: sagar0
fbshipit-source-id: 27c58814876e434d5fa61395e6664cfc5f6830b1
Summary:
Refactored and simplified `BlockBasedTable::Open` to be similar to `BlockBasedTableBuilder::Finish` as both these functions complement each other. Also added `BlockBasedTableBuilder::WriteFooter` along the way.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4636
Differential Revision: D12933319
Pulled By: sagar0
fbshipit-source-id: 1ff1d02f6d80a63b5ba720a1fc75e71c7344137b
Summary:
the original test does not give enough time difference between tombstone write time and the expire time point, which make test flaky.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4755
Reviewed By: maysamyabandeh
Differential Revision: D13369681
Pulled By: wpc
fbshipit-source-id: 22576f354c63cd0b39d8b35c3913303707503ea9
Summary:
Transaction::GetForUpdate is extended with a do_validate parameter with default value of true. If false it skips validating the snapshot (if there is any) before doing the read. After the read it also returns the latest value (expects the ReadOptions::snapshot to be nullptr). This allows RocksDB applications to use GetForUpdate similarly to how InnoDB does. Similarly ::Merge, ::Put, ::Delete, and ::SingleDelete are extended with assume_exclusive_tracked with default value of false. It true it indicates that call is assumed to be after a ::GetForUpdate(do_validate=false).
The Java APIs are accordingly updated.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4680
Differential Revision: D13068508
Pulled By: maysamyabandeh
fbshipit-source-id: f0b59db28f7f6a078b60844d902057140765e67d
Summary:
As titled. Update history to include a recent bug fix in
9be3e6b488.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4753
Differential Revision: D13350286
Pulled By: riversand963
fbshipit-source-id: b6324780dee4cb1757bc2209403a08531c150c08
Summary:
When write stall has already been triggered due to number of L0 files reaching
threshold, file ingestion must proceed with its flush without waiting for the
write stall condition to cleared by the compaction because compaction can wait
for ingestion to finish (circular wait).
In order to avoid this wait, we can set `FlushOptions.allow_write_stall` to be
true (default is false). Setting it to false can cause deadlock.
This can happen when the number of compaction threads is low.
Considere the following
```
Time compaction_thread ingestion_thread
| num_running_ingest_file_++
| while(num_running_ingest_file_>0){wait}
| flush
V
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4751
Differential Revision: D13343037
Pulled By: riversand963
fbshipit-source-id: d3b95938814af46ec4c463feff0b50c70bd8b23f