Summary:
https://github.com/facebook/rocksdb/pull/3340 introduces preloading when max_open_files != -1.
It doesn't preload index and filter in non-initial file loading case. This is a little bit too
complicated to understand. We observed in one MyRocks use case where the filter is expected to be
preloaded but is not. To simplify the use case, we simply always prefetch the index and filter.
They anyway is expected to be loaded in the file verification phase anyway.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4852
Differential Revision: D13595402
Pulled By: siying
fbshipit-source-id: d4d8624eb3e849e20aeb990df2100502d85aff31
Summary:
IsInSnapshotEmptyMapTest tests that IsInSnapshot returns correct value for existing data after a recovery, where max is not zero and yet commit cache is empty. The existing test was preliminary which is improved in this patch. It also increases the db sequence after recovery so that there the snapshot immediately taken after recovery would have a sequence number different than that of max_evicted_seq. This simplifies the logic in IsInSnapshot by not having to consider the special case that an old snapshot might be equal to max_evicted_seq and yet not present in old_commit_map.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4853
Differential Revision: D13595223
Pulled By: maysamyabandeh
fbshipit-source-id: 77c12ca8a3f61a47479a93bef2038ff502dc3322
Summary:
as titled.
We can remove the assersion because we do not perform verification in
AtomicFlushStressTest::TestCheckpoint for similar reasons to TestGet, TestPut,
etc.
Therefore, we override TestCheckpoint in AtomicFlushStressTest so that the
assertion `rand_column_families.size() == rand_keys.size()' is removed, and we
do not verify the DB in this function.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4846
Differential Revision: D13583377
Pulled By: riversand963
fbshipit-source-id: 03647f3da67e27a397413fd666e3bb43003bf596
Summary:
The rollback algorithm in WritePrepared transactions requires reading the values before the transaction start. Currently it uses the prepare_seq -1 as the snapshot sequence number for the read. This is not correct since the passed sequence number must be for a valid snapshot. The patch fixes it by passing kMaxSequenceNumber instead. This is fine since all the writes done by the aborted transaction will be skipped during the read anyway.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4851
Differential Revision: D13592773
Pulled By: maysamyabandeh
fbshipit-source-id: ff1bf92ea9909d4cccb173bdff49febc0e9eb7a2
Summary:
In compaction iterator, if the next value of single delete is a blob value, it should not treated as mismatch. This is only a minor fix and doesn't affect correctness.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4848
Differential Revision: D13585812
Pulled By: yiwu-arbug
fbshipit-source-id: 0ff6223fa03a644ac9fd8a2d77f9d6711d0a62b0
Summary:
Previously for point lookup we decided which file to look into based on user key overlap only. We also did not truncate range tombstones in the point lookup code path. These two ideas did not interact well in cases like this:
- L1 has range tombstone [a, c)#1 and point key b#2. The data is split between file1 with range [a#1,1, b#72057594037927935,15], and file2 with range [b#2, c#1].
- L1's file2 gets compacted to L2.
- User issues `Get()` for b#3.
- L1's file1 is opened and the range tombstone [a, c)#1 is found for b, while no point-key for b is found in L1.
- `Get()` assumes that the range tombstone must cover all data in that range in lower levels, so short circuits and returns `NotFound`.
The solution to this problem is to not look into files that only overlap with the point lookup at a range tombstone sentinel endpoint. In the above example, this would mean not opening L1's file1 or its tombstones during the `Get()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4829
Differential Revision: D13561355
Pulled By: ajkr
fbshipit-source-id: a13c21c816870a2f5d32a48af6dbd719a7d9d19f
Summary:
as titled.
Since different bg flush threads can flush different sets of column families
(due to column family creation and drop), we decide not to let one thread
perform atomic flush result installation for other threads. Bg flush threads
will install their atomic flush results sequentially to MANIFEST, using
a conditional variable, i.e. atomic_flush_install_cv_ to coordinate.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4791
Differential Revision: D13498930
Pulled By: riversand963
fbshipit-source-id: dd7482fc41f4bd22dad1e1ef7d4764ef424688d7
Summary:
Declare Jemalloc non-standard APIs as weak symbols, so that if Jemalloc is linked with the binary, these symbols will be replaced by Jemalloc's, otherwise they will be nullptr. This is similar to how folly detect jemalloc, but we assume the main program use jemalloc as long as jemalloc is linked: https://github.com/facebook/folly/blob/master/folly/memory/Malloc.h#L147
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4844
Differential Revision: D13574934
Pulled By: yiwu-arbug
fbshipit-source-id: 7ea871beb1be7d5a1259cc38f9b78078793db2db
Summary:
The original implementation has two problems:
1. f0dda35d7d/db/db_impl_write.cc (L478)f0dda35d7d/db/write_thread.h (L231)
If the callback status of leader of the write_group fails, then the whole write_group will not write to WAL, this may cause data loss.
2. f0dda35d7d/db/write_thread.h (L130)
The annotation says that Writer.status is the status of memtable inserter, but the original implementation use it for another case which is not consistent with the original design. Looks like we can still reuse Writer.status, but we should modify the annotation, so Writer.status is not only the status of memtable inserter.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4838
Differential Revision: D13574070
Pulled By: yiwu-arbug
fbshipit-source-id: a2a2aefcfd329c4c6a91652bf090aaf1ce119c4b
Summary:
The current implementation hardcode the default options in different
places, which makes it impossible to support other environments (like
encrypted environment).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4839
Differential Revision: D13573578
Pulled By: sagar0
fbshipit-source-id: 76b58b4b758902798d10ff2f52d9f39abff015e7
Summary:
DBSSTTest.RateLimitedDelete is flakey. The root cause is not completely identified, but
the compaction waiting in the test doesn't strictly wait for compaction cleaning to finish, which
may cause test flakiness. Fix it first and see whether the failures still happen.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4840
Differential Revision: D13567273
Pulled By: siying
fbshipit-source-id: 6fce38b912aff92a925231e7aa9bb0fef892761a
Summary:
Updated stress test will support testing of db in read-only mode.
The user has to make sure that only read/scan operations are enabled.
This PR relies on #4681.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4690
Differential Revision: D13102741
Pulled By: riversand963
fbshipit-source-id: f5a36b34db187fe12dd355f7eda161f99d6c75e4
Summary:
- To be consistent with the accounting of other optypes in `TableProperties`, we should count range tombstones in `TableProperties::num_entries` and `TableProperties::num_deletions`.
- Updated assertions in stress test's `OnTableFileCreated` handler to accept files with range tombstones only.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4841
Differential Revision: D13568424
Pulled By: ajkr
fbshipit-source-id: 0139d7806494eda20ece67ec460d2458dbbf6026
Summary:
Avoid locking the DB mutex in order to reference SuperVersions. Instead, we get the thread local cached SuperVersion for each column family in the list. It depends on finding a sequence number that overlaps with all the open memtables. We start with the latest published sequence number, and if any of the memtables is sealed before we can get all the SuperVersions, the process is repeated. After a few times, give up and lock the DB mutex.
Tests:
1. Unit tests
2. make check
3. db_bench -
TEST_TMPDIR=/dev/shm ./db_bench -use_existing_db=true -benchmarks=readrandom -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=5000000 -reads=1000000 -threads=32 -compression_type=none -cache_size=1048576000 -batch_size=1 -bloom_bits=1
readrandom : 0.167 micros/op 5983920 ops/sec; 426.2 MB/s (1000000 of 1000000 found)
Multireadrandom with batch size 1:
multireadrandom : 0.176 micros/op 5684033 ops/sec; (1000000 of 1000000 found)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4754
Differential Revision: D13363550
Pulled By: anand1976
fbshipit-source-id: 6243e8de7dbd9c8bb490a8eca385da0c855b1dd4
Summary:
`rocksdb-shared.lib` is missing while `rocksdb.lib` and `rocksdb-shared.dll` are installed correctly.
Add `ARCHIVE DESTINATION` to fix this issue.
Refer to CMake doc for more details: [
https://cmake.org/cmake/help/v3.13/command/install.html#installing-targets](https://cmake.org/cmake/help/v3.13/command/install.html#installing-targets)
> ARCHIVE
> Static libraries are treated as ARCHIVE targets, except those marked with the FRAMEWORK property on macOS (see FRAMEWORK below.) For DLL platforms (all Windows-based systems including Cygwin), the DLL import library is treated as an ARCHIVE target.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4832
Differential Revision: D13566301
Pulled By: siying
fbshipit-source-id: 56e4ef82f7d5c63bd181ddf23b691336ad77881a
Summary:
The `flush_reason` parameter in `DBImpl::InstallSuperVersionAndScheduleWork` is
not used. Remove it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4816
Differential Revision: D13543218
Pulled By: riversand963
fbshipit-source-id: 8fc75d49462ce092e85aef0fe0c50936140db153
Summary:
Choose to preload some files if options.max_open_files != -1. This can slightly narrow the gap of performance between options.max_open_files is -1 and a large number. To avoid a significant regression to DB reopen speed if options.max_open_files != -1. Limit the files to preload in DB open time to 16.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3340
Differential Revision: D6686945
Pulled By: siying
fbshipit-source-id: 8ec11bbdb46e3d0cdee7b6ad5897a09c5a07869f
Summary:
1. Remove unused API SubtractCompactionTask().
2. Assert outstanding tasks drop to zero in ConcurrentTaskLimiterImpl destructor.
3. Remove GetOutstandingTask() check from manual compaction test, as TEST_WaitForCompact() doesn't synced with 'delete prepicked_compaction' in DBImpl::BGWorkCompaction(), which may make the test flaky.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4795
Differential Revision: D13542183
Pulled By: siying
fbshipit-source-id: 5eb2a47e62efe4126937149aa0df6e243ebefc33
Summary:
Add a new per level counter for block cache hits, increase it by one on every successful attempt to get an entry from cache.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4796
Differential Revision: D13513688
Pulled By: zinoale
fbshipit-source-id: 104df038f1232e3356e162eb2d8ca138e34a8281
Summary:
Previously we were cleaning up range tombstone meta-block by calling `ReleaseCachedEntry`, which wouldn't work if `value != nullptr && cache_handle == nullptr`. This happened at least in the case with mmap reads and block cache both enabled. I noticed `NewDataBlockIterator` intends to handle all these cases, so migrated to that instead of `NewUnfragmentedRangeTombstoneIterator`.
Also changed the table-opening logic to fail on `ReadRangeDelBlock` failure, since that can cause data corruption. Added a test case to verify this behavior. Note the test case does not fail on `TryReopen` because failure to preload table handlers is not considered critical. However, it does fail on any read involving that file since it cannot return correct data.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4810
Differential Revision: D13534296
Pulled By: ajkr
fbshipit-source-id: 55dde1111717cea6ec4bf38418daab81ccef3599
Summary:
Introduce the first CPU timing counter, perf_context.get_cpu_nanos. This opens a door to more CPU counters in the future.
Only Posix Env has it implemented using clock_gettime() with CLOCK_THREAD_CPUTIME_ID. How accurate the counter is depends on the platform.
Make PerfStepTimer to take an Env as an argument, and sometimes pass it in. The direct reason is to make the unit tests to use SpecialEnv where we can ingest logic there. But in long term, this is a good change.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4741
Differential Revision: D13287798
Pulled By: siying
fbshipit-source-id: 090361049d9d5095d1d1a369fe1338d2e2e1c73f
Summary:
To avoid a race on the flag, make it an atomic_bool. This
doesn't seem to significantly affect benchmarks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4801
Differential Revision: D13523845
Pulled By: abhimadan
fbshipit-source-id: 3bc29f53c50a4e06cd9f8c6232a4bb221868e055
Summary:
This TODO was already addressed, but I forgot to remove it
before landing the PR it came from.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4800
Differential Revision: D13522284
Pulled By: abhimadan
fbshipit-source-id: 7766bc4f5b54e47d355cf26137ef5e86c604472a
Summary:
This PR contains the following fixes:
1. Fixing Makefile to support non-default locations of developer tools
2. Fixing compile error using a patch from https://github.com/facebook/rocksdb/pull/4007
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4687
Differential Revision: D13287263
Pulled By: riversand963
fbshipit-source-id: 4525eb42ba7b6f82af5f9bfb8e52fa4024e27ccc
Summary:
1) `transaction_base.h` overrides from `transaction.h` with a `const boolean do_validate`.
The non-const base declaration, which I cannot see the need for, causes a compilation error on Microsoft Windows.
2) Implicit cast from `double` to `uint64_t` causes a compilation error on Microsoft Windows.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4798
Differential Revision: D13519734
Pulled By: sagar0
fbshipit-source-id: 6e8cb80e9a589b1122e1500c21b8e3a3a472b459
Summary:
Note that Snappy now requires CMake to build it, so I added a note about RocksJava to the README.md file.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4761
Differential Revision: D13403811
Pulled By: ajkr
fbshipit-source-id: 8fcd7e3dc7f7152080364a374d3065472f417eff
Summary:
in certain cases, we do not perform memtable switching if the active
memtable of the column family is empty. Two exceptions:
1. In manual flush, if cached_recoverable_state_empty_ is false, then we need
to switch memtable due to requirement of transaction.
2. In switch WAL, we need to switch memtable anyway because we have to seal the
memtable if the WAL on which it depends will be closed.
This change can potentially delay the occurence of write stalls because number
of memtables increase more slowly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4792
Differential Revision: D13499501
Pulled By: riversand963
fbshipit-source-id: 91c9b17ae753578578039f3851667d93610005e1
Summary:
Now that v2 is fully functional, the v1 aggregator is removed.
The v2 aggregator has been renamed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4778
Differential Revision: D13495930
Pulled By: abhimadan
fbshipit-source-id: 9d69500a60a283e79b6c4fa938fc68a8aa4d40d6
Summary:
Avoids re-downloading the .tar.gz files for the static build of RocksJava if they are already present.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4769
Differential Revision: D13491919
Pulled By: ajkr
fbshipit-source-id: 9265f577e049838dc40335d54f1ff2b4f972c38c
Summary:
RangeDelAggregatorV2 now supports ShouldDelete calls on
snapshot stripes and creation of range tombstone compaction iterators.
RangeDelAggregator is no longer used on any non-test code path, and will
be removed in a future commit.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4758
Differential Revision: D13439254
Pulled By: abhimadan
fbshipit-source-id: fe105bcf8e3d4a2df37a622d5510843cd71b0401
Summary:
Expose common stats min,max,count,sum via statistics JNI. These stats are not fully exposed on the Java side as is, but are available on the native side.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4742
Differential Revision: D13403766
Pulled By: ajkr
fbshipit-source-id: 5b70f7bd3fb7490aab73dcbd09f13490fce5c773
Summary:
The test fails sporadically expecting the DB to be empty after DeleteFilesInRange(..., nullptr, nullptr) call which is not. Debugging shows cases where the files are skipped since they are being compacted. The patch fixes the test by waiting for the last CompactRange to finish before calling DeleteFilesInRange.
Verified by
```
~/gtest-parallel/gtest-parallel ./db_compaction_test --gtest_filter=DBCompactionTest.DeleteFileRange --repeat=10000
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4784
Differential Revision: D13469402
Pulled By: maysamyabandeh
fbshipit-source-id: 3d8f44abe205b82c69f01e7edf27e1f8098248e1
Summary:
Updating the `HistogramType.java` and `TickerType.java` to expose and correct metrics for statistics callbacks.
Moved `NO_ITERATOR_CREATED` to the proper stat name and deprecated `NO_ITERATORS`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4733
Differential Revision: D13466936
Pulled By: sagar0
fbshipit-source-id: a58d1edcc07c7b68c3525b1aa05828212c89c6c7
Summary:
Separate flag for enabling option from flag for enabling dedicated atomic stress test. I have found setting the former without setting the latter can detect different problems.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4781
Differential Revision: D13463211
Pulled By: ajkr
fbshipit-source-id: 054f777885b2dc7d5ea99faafa21d6537eee45fd
Summary:
options_file_number_ must be written under db::mutex_ sine its read is protected by mutex_ in ::GetLiveFiles(). However currently it is written in ::RenameTempFileToOptionsFile() which according to its contract must be called without holding db::mutex_. The patch fixes the race condition by also acquitting the mutex_ before writing options_file_number_. Also it does that only if the rename of option file is successful.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4780
Differential Revision: D13461411
Pulled By: maysamyabandeh
fbshipit-source-id: 2d5bae96a1f3e969ef2505b737cf2d7ae749787b
Summary:
If one column family is dropped, we should simply skip it and continue to flush
other active ones.
Currently we use Status::ShutdownInProgress to notify caller of column families
being dropped. In the future, we should consider using a different Status code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4708
Differential Revision: D13378954
Pulled By: riversand963
fbshipit-source-id: 42f248cdf2d32d4c0f677cd39012694b8f1328ca
Summary:
It sometimes times out with it is run with TSAN. The patch reduces the iteration from 50 to 30. This reduces the normal runtime from 5.2 to 3.1 seconds and should similarly address the TSAN timeout problem.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4779
Differential Revision: D13456862
Pulled By: maysamyabandeh
fbshipit-source-id: fdc0ad7d781b1c33b771d2415ff5fa2f1b5e2537