Summary:
Rebased and resubmitting #1831 on behalf of stevelittle.
The problem is when a single process attempts to open the same DB twice, the second attempt fails due to LOCK file held. If the second attempt had opened the LOCK file, it'll now need to close it, and closing causes the file to be unlocked. Then, any subsequent attempt to open the DB will succeed, which is the wrong behavior.
The solution was to track which files a process has locked in PosixEnv, and check those before opening a LOCK file.
Fixes#1780.
Closes https://github.com/facebook/rocksdb/pull/3993
Differential Revision: D8398984
Pulled By: ajkr
fbshipit-source-id: 2755fe66950a0c9de63075f932f9e15768041918
Summary:
db_stress initialization randomly chooses a set of keys to not overwrite. It was doing it separately for each column family. That caused 30+ second initialization times for the non-simple crash tests, which have 10 CFs. This PR:
- reuses the same set of randomly chosen no-overwrite keys across all CFs
- logs a couple more timestamps so we can more easily see initialization time
Closes https://github.com/facebook/rocksdb/pull/3990
Differential Revision: D8393821
Pulled By: ajkr
fbshipit-source-id: d0b263a298df607285ffdd8b0983ff6575cc6c34
Summary:
In `db_stress` profile the vast majority of CPU time is spent acquiring the `SyncPoint` mutex. I mistakenly assumed #3939 had fixed this mutex contention problem by disabling `SyncPoint` processing. But actually the lock was still being acquired just to check whether processing is enabled. We can avoid that overhead by using an atomic to track whether it's enabled.
Closes https://github.com/facebook/rocksdb/pull/3991
Differential Revision: D8393825
Pulled By: ajkr
fbshipit-source-id: 5bc4e3c722ee7304e7a9c2439998c456b05a6897
Summary:
A recent change pushed down the upper bound checking to child iterators. However, this causes the logic of following sequence wrong:
Seek(key);
if (!Valid()) SeekToLast();
Because !Valid() may be caused by upper bounds, rather than the end of the iterator. In this case SeekToLast() points to totally wrong places. This can cause wrong results, infinite loops, or segfault in some cases.
This sequence is called when changing direction from forward to backward. And this by itself also implicitly happen during reseeking optimization in Prev().
Fix this bug by using SeekForPrev() rather than this sequuence, as what is already done in prefix extrator case.
Closes https://github.com/facebook/rocksdb/pull/3989
Differential Revision: D8385422
Pulled By: siying
fbshipit-source-id: 429e869990cfd2dc389421e0836fc496bed67bb4
Summary:
The sixth argument should be `key_includes_seq` bool, the seventh a `GetContext*`. We were mistakenly passing the `GetContext*` as the sixth argument and relying on the default (nullptr) for the seventh. This would make statistics inaccurate, at least.
Blame: 402b7aa0
Closes https://github.com/facebook/rocksdb/pull/3974
Differential Revision: D8344907
Pulled By: ajkr
fbshipit-source-id: 3ad865a0541d6d30f75dfc726352788118cfe12e
Summary:
Fix a crash in `WinEnvIO::GetSectorSize` that happens on old Windows systems (e.g Windows 7).
On old Windows systems that don't support querying StorageAccessAlignmentProperty using IOCTL_STORAGE_QUERY_PROPERTY, the flow calls a different DeviceIoControl with nullptr as lpBytesReturned.
When the code reaches this point, we get an access violation.
Closes https://github.com/facebook/rocksdb/pull/3975
Differential Revision: D8385186
Pulled By: ajkr
fbshipit-source-id: fae4c9b4b0a52c8a10182e1b35bcaa30dc393bbb
Summary:
Property block will be read sequentially and cached in a heap located
object, so there's no need for restart points. Thus we set the restart
interval to infinity to save space.
Closes https://github.com/facebook/rocksdb/pull/3970
Differential Revision: D8332586
Pulled By: fgwu
fbshipit-source-id: 899c3267832a81d0f084ec2db6b387332f461134
Summary:
BadOptions test creates a temporary db path changed to
table_block_based_bad_options_test to avoid collide with that created by
the PrefixAndWholeKeyTest
Closes https://github.com/facebook/rocksdb/pull/3965
Differential Revision: D8316080
Pulled By: fgwu
fbshipit-source-id: bb8e0fdfdb9abf0e5ce94494b4388cd1622ee032
Summary:
Fixed the fprintf format of uint64_t by using PRIu64 in file tools/ldb_cmd.cc
Closes https://github.com/facebook/rocksdb/pull/3963
Differential Revision: D8306179
Pulled By: zhichao-cao
fbshipit-source-id: 597dcd55321576801bbf2cf4714736ebc4750a0c
Summary:
We use `db_stress.cc` intensively to test and verify the behavior of RocksDB. Sometimes we need to add new tests for recently added features. Original `StressTest` class provides many general functionality that can be leveraged by other tests. Therefore, in this refactoring PR, I try to identify the general operations as well as operations that future tests most likely want to customize. Future tests can inherit `StressTest` and overriding the virtual functions to test custom logic.
Closes https://github.com/facebook/rocksdb/pull/3902
Differential Revision: D8284607
Pulled By: riversand963
fbshipit-source-id: 019302d04665a2b18334b6d05d04a477168c8ea4
Summary:
Depending on the compression type, `CompressBlock` calls the compress method for each compression type. It calls ZSTD_Compress for both kZSTD and kZSTDNotFinalCompression (https://github.com/facebook/rocksdb/blob/master/table/block_based_table_builder.cc#L169).
However currently ZSTD_Compress only expects the type to be kZSTD and this is causing assert failures and crashes. The same also applies to ZSTD_Uncompress.
Closes https://github.com/facebook/rocksdb/pull/3964
Differential Revision: D8308715
Pulled By: miasantreble
fbshipit-source-id: e5125f53edb829c9c33733167bec74e4793d0782
Summary:
format_version 3 changes the format of index blocks by storing user keys instead of the internal keys, which saves 8-bytes per key. This patch extends the format to top-level indexes in partitioned index/filters.
Closes https://github.com/facebook/rocksdb/pull/3958
Differential Revision: D8294615
Pulled By: maysamyabandeh
fbshipit-source-id: 17666cc16b8076c363972e2308e31547e835f0fe
Summary:
This adds some rules in the tools/advisor/advisor/rules.ini (refer this for more information) file and corresponding python parser scripts for parsing the rules file and the rocksdb LOG and OPTIONS files. This is WIP for adding rules depending on ODS. The starting point of the script is the rocksdb/tools/advisor/advisor/rule_parser.py file.
Closes https://github.com/facebook/rocksdb/pull/3934
Reviewed By: maysamyabandeh
Differential Revision: D8304059
Pulled By: poojam23
fbshipit-source-id: 47f2a50f04d46d40e225dd1cbf58ba490f79e239
Summary:
CompactFiles checked whether the existing files conflicted with the chosen compaction. But it missed checking whether future files would conflict, i.e., when another compaction was simultaneously writing new files to the same range at the same output level.
Closes https://github.com/facebook/rocksdb/pull/3926
Differential Revision: D8218996
Pulled By: ajkr
fbshipit-source-id: 21cb00a6fed4c8c62d3ed2ff810962e6bdc2fdfb
Summary:
PR https://github.com/facebook/rocksdb/pull/3838 made some changes that triggers lint warnings.
Run `make format` to fix formatting as suggested by siying .
Also piggyback two changes:
1) fix singleton destruction order for windows and posix env
2) fix two clang warnings
Closes https://github.com/facebook/rocksdb/pull/3954
Differential Revision: D8272041
Pulled By: miasantreble
fbshipit-source-id: 7c4fd12bd17aac13534520de0c733328aa3c6c9f
Summary:
This fixes a regression in one of myrocks regression tests (readwhilewriting), introduced in 8bf555f487
This PR changes two lines of code: one of them actually fixes the observed regression, the other is a mostly unrelated small fix that I'm piggy-backing here. EDIT: Nevermind, it fixes one line. More details in inline comments.
Closes https://github.com/facebook/rocksdb/pull/3953
Differential Revision: D8270664
Pulled By: al13n321
fbshipit-source-id: a7d91e196807d1e816551591257c700f70e4ccac
Summary:
format_version=3 changes the format of SST index. This is however not being tested currently since tests only work with the default format_version which is currently 2. The patch extends the most related tests to also test for format_version=3.
Closes https://github.com/facebook/rocksdb/pull/3942
Differential Revision: D8238413
Pulled By: maysamyabandeh
fbshipit-source-id: 915725f55753dd8e9188e802bf471c23645ad035
Summary:
Ensure the PosixEnv singleton is destroyed first since its destructor waits for background threads to all complete. This ensures background threads cannot hit sync points after the SyncPoint singleton is destroyed, which was previously possible.
Closes https://github.com/facebook/rocksdb/pull/3951
Differential Revision: D8265295
Pulled By: ajkr
fbshipit-source-id: 7738dd458c5d993a78377dd0420e82badada81ab
Summary:
Windows does not have LD_PRELOAD mechanism to override all memory allocation functions and ZSTD makes use of C-tuntime calloc. During flushes and compactions default system allocator fragments and the system slows down considerably.
For builds with jemalloc we employ an advanced ZSTD context creation API that re-directs memory allocation to jemalloc. To reduce the cost of context creation on each block we cache ZSTD context within the block based table builder while a new SST file is being built, this will help all platform builds including those w/o jemalloc. This avoids system allocator fragmentation and improves the performance.
The change does not address random reads and currently on Windows reads with ZSTD regress as compared with SNAPPY compression.
Closes https://github.com/facebook/rocksdb/pull/3838
Differential Revision: D8229794
Pulled By: miasantreble
fbshipit-source-id: 719b622ab7bf4109819bc44f45ec66f0dd3ee80d
Summary:
We need to keep the DB directory around since the direct IO check in "db_crashtest.py" relies on it existing. This PR fixes an issue where it was removed after each stress test run during the second half of whitebox crash testing.
Closes https://github.com/facebook/rocksdb/pull/3946
Differential Revision: D8247998
Pulled By: ajkr
fbshipit-source-id: 4e7cffbdab9b40df125e7842d0d59916e76261d3
Summary:
Previous commit https://github.com/facebook/rocksdb/pull/3935 unhide a few test options which includes kDirectIO. However it's not supported by RocksDB lite. Need to hide this option from the lite build.
Closes https://github.com/facebook/rocksdb/pull/3943
Differential Revision: D8242757
Pulled By: miasantreble
fbshipit-source-id: 1edfad3a5d01a46bfb7eedee765981ebe02c500a
Summary:
For iterator reads, a `SuperVersion` is pinned to preserve a snapshot of SST files, and `Block`s are pinned to allow `key()` and `value()` to return pointers directly into a RocksDB memory region. This works for both non-mmap reads, where the block owns the memory region, and mmap reads, where the file owns the memory region.
For point reads with `PinnableSlice`, only the `Block` object is pinned. This works for non-mmap reads because the block owns the memory region, so even if the file is deleted after compaction, the memory region survives. However, for mmap reads, file deletion causes the memory region to which the `PinnableSlice` refers to be unmapped. The result is usually a segfault upon accessing the `PinnableSlice`, although sometimes it returned wrong results (I repro'd this a bunch of times with `db_stress`).
This PR copies the value into the `PinnableSlice` when it comes from mmap'd memory. We can tell whether the `Block` owns its memory using `Block::cachable()`, which is unset when reads do not use the provided buffer as is the case with mmap file reads. When that is false we ensure the result of `Get()` is copied.
This feels like a short-term solution as ideally we'd have the `PinnableSlice` pin the mmap'd memory so we can do zero-copy reads. It seemed hard so I chose this approach to fix correctness in the meantime.
Closes https://github.com/facebook/rocksdb/pull/3881
Differential Revision: D8076288
Pulled By: ajkr
fbshipit-source-id: 31d78ec010198723522323dbc6ea325122a46b08
Summary:
Previously `db_stress` attempted to configure direct I/O dynamically in `SetOptions()` which had multiple problems (ummm must've never been tested):
- It's a DB option so SetDBOptions should've been called instead
- It's not a dynamic option so even SetDBOptions would fail
- It required enabling SyncPoint to mask O_DIRECT since it had no way to detect whether the DB directory was in tmpfs or not. This required locking that consumed ~80% of db_stress CPU.
In this PR I delete the broken dynamic config and instead configure it statically, only enabling it if the DB directory truly supports O_DIRECT.
Closes https://github.com/facebook/rocksdb/pull/3939
Differential Revision: D8238120
Pulled By: ajkr
fbshipit-source-id: 60bb2deebe6c9b54a3f788079261715b4a229279
Summary:
As titled.
I have not extended the Compatibility tests because the new WAL markers are still unimplemented.
Closes https://github.com/facebook/rocksdb/pull/3941
Differential Revision: D8238394
Pulled By: lth
fbshipit-source-id: 980e3d44837bbf2cfa64047f9738f559dfac4b1d
Summary:
This should resolve the performance regression caused by the unnecessary copying of the shared_ptr.
Closes https://github.com/facebook/rocksdb/pull/3937
Differential Revision: D8232330
Pulled By: miasantreble
fbshipit-source-id: 7885bf7cd190b6f87164c52d6edd328298c13f97
Summary:
DBTestBase::OptionConfig includes the scenarios that unit tests could iterate over them by calling ChangeOptions(). Some of the options have been mistakenly put after kEnd which makes them essentially invisible to ChangeOptions() caller. This patch fixes it except for kUniversalSubcompactions which is left as TODO since it would break some unit tests.
Closes https://github.com/facebook/rocksdb/pull/3935
Differential Revision: D8230748
Pulled By: maysamyabandeh
fbshipit-source-id: edddb8fffcd161af1809fef24798ce118f8593db
Summary:
DBImpl::FindObsoleteFiles() may call GetChildren() multiple times if different CFs are on the same path. Fix it.
Closes https://github.com/facebook/rocksdb/pull/3885
Differential Revision: D8084634
Pulled By: siying
fbshipit-source-id: b471fbc251f6a05e9243304dc14c0831060cc0b0
Summary:
Small format error below causes build to fail. I believe that this :
```
fprintf(stderr, "num reads to do %lu\n", reads_);
```
Can be changed to this:
```
fprintf(stderr, "num reads to do %" PRIu64 "\n", reads_);
```
Successful build
```
CC utilities/blob_db/blob_dump_tool.o
AR librocksdb_debug.a
ar: creating archive librocksdb_debug.a
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin/ranlib: file: librocksdb_debug.a(rocks_lua_compaction_filter.o) has no symbols
CC tools/db_bench.o
CC tools/db_bench_tool.o
tools/db_bench_tool.cc:4532:46: error: format specifies type 'unsigned long' but the argument has type 'int64_t' (aka 'long long') [-Werror,-Wformat]
fprintf(stderr, "num reads to do %lu\n", reads_);
~~~ ^~~~~~
%lld
1 error generated.
make: *** [tools/db_bench_tool.o] Error 1
```
```
$ cd rocksdb
$ make all
$ g++ --version
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 9.1.0 (clang-902.0.39.1)
Target: x86_64-apple-darwin17.5.0
Thread model: posix
InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
```
Closes https://github.com/facebook/rocksdb/pull/3909
Differential Revision: D8215710
Pulled By: siying
fbshipit-source-id: 15e49fb02a818fec846e9f9b2a50e372b6b67751
Summary:
In order to make valgrind check test to pass in a day, remove some tests that run prohibitively slow under valgrind.
Closes https://github.com/facebook/rocksdb/pull/3924
Differential Revision: D8210184
Pulled By: siying
fbshipit-source-id: 5b06fb08f3cf57571d422d05a0dbddc9f9376f7a
Summary:
This is still WIP, but I'm hoping for early feedback on the overall approach.
This patch implements deletion triggered compaction, which till now only
worked for leveled, for universal style. SST files are marked for
compaction by the CompactOnDeletionCollertor table property. This is
expected to be used when free disk space is low and the user wants to
reclaim space by deleting a bunch of keys. The deletions are expected to
be dense. In such a situation, we want to avoid a full compaction due to
its space overhead.
The strategy used in this case is similar to leveled. We pick one file
from the set of files marked for compaction. We then expand the inputs
to a clean cut on the same level, and then pick overlapping files from
the next non-mepty level. Picking files from the next level can cause
the key range to expand, and we opportunistically expand inputs in the
source level to include files wholly in this key range.
The main side effect of this is that it breaks the property of no time
range overlap between levels. This shouldn't break any functionality.
Closes https://github.com/facebook/rocksdb/pull/3860
Differential Revision: D8124397
Pulled By: anand1976
fbshipit-source-id: bfa2a9dd6817930e991b35d3a8e7e61304ed3dcf
Summary:
The very old sst formats do not have table_properties and rep_->table_properties is thus nullptr. The recent patch in https://github.com/facebook/rocksdb/pull/3894 does not check for nullptr and hence makes it backward incompatible. This patch adds the check.
Closes https://github.com/facebook/rocksdb/pull/3918
Differential Revision: D8188638
Pulled By: maysamyabandeh
fbshipit-source-id: b1d986665ecf0b4d1c442adfa8a193b97707d47b
Summary:
Index blocks have the same format as data blocks. The keys therefore similarly to the keys in the data blocks are internal keys, which means that in addition to the user key it also has 8 bytes that encodes sequence number and value type. This extra 8 bytes however is not necessary in index blocks since the index keys act as an separator between two data blocks. The only exception is when the last key of a block and the first key of the next block share the same user key, in which the sequence number is required to act as a separator.
The patch excludes the sequence from index keys only if the above special case does not happen for any of the index keys. It then records that in the property block. The reader looks at the property block to see if it should expect sequence numbers in the keys of the index block.s
Closes https://github.com/facebook/rocksdb/pull/3894
Differential Revision: D8118775
Pulled By: maysamyabandeh
fbshipit-source-id: 915479f028b5799ca91671d67455ecdefbd873bd
Summary:
This was missed in a refactor of `ReadBlockContents` (2f1a3a4).
Closes https://github.com/facebook/rocksdb/pull/3906
Differential Revision: D8172648
Pulled By: ajkr
fbshipit-source-id: 27e453b19795fea974bfed4721105be6f3a12090
Summary:
Please refer to earlier discussion in [issue 3609](https://github.com/facebook/rocksdb/issues/3609).
There was also an alternative fix in [PR 3888](https://github.com/facebook/rocksdb/pull/3888), but the proposed solution requires complex change.
To summarize the cause of the problem. Upon creation of a column family, a `BlockBasedTableFactory` object is `new`ed and encapsulated by a `std::shared_ptr`. Since there is no other `std::shared_ptr` pointing to this `BlockBasedTableFactory`, when the column family is dropped, the `ColumnFamilyData` is `delete`d, causing the destructor of `std::shared_ptr`. Since there is no other `std::shared_ptr`, the underlying memory is also freed.
Later when the db exits, it releases all the table readers, including the table readers that have been operating on the dropped column family. This needs to access the `table_options` owned by `BlockBasedTableFactory` that has already been deleted. Therefore, a segfault is raised.
Previous workaround is to purge all obsolete files upon `ColumnFamilyData` destruction, which leads to a force release of table readers of the dropped column family. However this does not work when the user disables file deletion.
Our solution in this PR is making a copy of `table_options` in `BlockBasedTable::Rep`. This solution increases memory copy and usage, but is much simpler.
Test plan
```
$ make -j16
$ ./column_family_test --gtest_filter=ColumnFamilyTest.CreateDropAndDestroy:ColumnFamilyTest.CreateDropAndDestroyWithoutFileDeletion
```
Expected behavior:
All tests should pass.
Closes https://github.com/facebook/rocksdb/pull/3898
Differential Revision: D8149421
Pulled By: riversand963
fbshipit-source-id: eaecc2e064057ef607fbdd4cc275874f866c3438
Summary:
```PosixMmapReadableFile::fd_``` is closed after created, but needs to remain open for the lifetime of `PosixMmapReadableFile` since it is used whenever `InvalidateCache` is called.
Closes https://github.com/facebook/rocksdb/pull/2764
Differential Revision: D8152515
Pulled By: ajkr
fbshipit-source-id: b738a6a55ba4e392f9b0f374ff396a1e61c64f65
Summary:
Add flush_before_backup to rocksdb_backup_engine_create_new_backup. make c api able to control the flush before backup behavior.
Closes https://github.com/facebook/rocksdb/pull/3897
Differential Revision: D8157676
Pulled By: ajkr
fbshipit-source-id: 88998c62f89f087bf8672398fd7ddafabbada505
Summary:
Implement midpoint insertion strategy where new blocks will be insert to the middle of LRU list, then move the head on the first hit in cache.
Closes https://github.com/facebook/rocksdb/pull/3877
Differential Revision: D8100895
Pulled By: yiwu-arbug
fbshipit-source-id: f4bd83cb8be469e5d02072cfc8bd66011391f3da
Summary:
Catch up with Posix features
NewWritableRWFile must fail when file does not exists
Implement Env::Truncate()
Adjust Env options optimization functions
Implement MemoryMappedBuffer on Windows.
Closes https://github.com/facebook/rocksdb/pull/3857
Differential Revision: D8053610
Pulled By: ajkr
fbshipit-source-id: ccd0d46c29648a9f6f496873bc1c9d6c5547487e
Summary:
to workaround issue of http://tracker.ceph.com/issues/21422 .
and in tcmalloc aligned_alloc and posix_memalign() are basically the
same thing. the same applies to GNU glibc.
fixes#3175
Signed-off-by: Kefu Chai <tchaikov@gmail.com>
Closes https://github.com/facebook/rocksdb/pull/3862
Differential Revision: D8147930
Pulled By: yiwu-arbug
fbshipit-source-id: 355afe93c4dd0a96a0d711ef190e8b86fbe8d11d