Summary:
fix test failure of ReadAmpBitmap and ReadAmpBitmapLiveInCacheAfterDBClose.
test ReadAmpBitmapLiveInCacheAfterDBClose individually and make check
Closes https://github.com/facebook/rocksdb/pull/2271
Differential Revision: D5038133
Pulled By: lightmark
fbshipit-source-id: 803cd6f45ccfdd14a9d9473c8af311033e164be8
Summary:
- added a feature test in build_detect_platform to check whether sched_getcpu() is available. glibc offers it only on some platforms (e.g., linux but not mac); this way should be easier than maintaining a list of platforms on which it's available.
- refactored PhysicalCoreID() to be simpler / less repetitive. ordered the conditional compilation clauses from most-to-least preferred
Closes https://github.com/facebook/rocksdb/pull/2272
Differential Revision: D5038093
Pulled By: ajkr
fbshipit-source-id: 81d7db3cc620250de220bdeb3194b2b3d7673de7
Summary:
When building rocksdb in fbcode using `make`, util/build_version.cc is always updated (gitignore/hgignore doesn't apply because the file is already checked into fbcode). To use the rocksdb makefile from our own makefile, I would like an option to prevent the metadata update, which is of no value for us.
Closes https://github.com/facebook/rocksdb/pull/2264
Differential Revision: D5037846
Pulled By: siying
fbshipit-source-id: 9fa005725c5ecb31d9cbe2e738cbee209591f08a
Summary:
Updates to CentOS 5 have been archived as CentOS 5 is EOL. We now pull the updates from the vault. This is a stop gap solution, I will send a PR in a couple days which uses fixed Docker containers (with the updates pre-installed) instead.
sagar0 Here you go :-)
Closes https://github.com/facebook/rocksdb/pull/2270
Differential Revision: D5033637
Pulled By: sagar0
fbshipit-source-id: a9312dd1bc18bfb8653f06ffa0a1512b4415720d
Summary:
Consider BlockReadAmpBitmap with bytes_per_bit = 32. Suppose bytes [a, b) were used, while bytes [a-32, a)
and [b+1, b+33) weren't used; more formally, the union of ranges passed to BlockReadAmpBitmap::Mark() contains [a, b) and doesn't intersect with [a-32, a) and [b+1, b+33). Then bits [floor(a/32), ceil(b/32)] will be set, and so the number of useful bytes will be estimated as (ceil(b/32) - floor(a/32)) * 32, which is on average equal to b-a+31.
An extreme example: if we use 1 byte from each block, it'll be counted as 32 bytes from each block.
It's easy to remove this bias by slightly changing the semantics of the bitmap. Currently each bit represents a byte range [i*32, (i+1)*32).
This diff makes each bit represent a single byte: i*32 + X, where X is a random number in [0, 31] generated when bitmap is created. So, e.g., if you read a single byte at random, with probability 31/32 it won't be counted at all, and with probability 1/32 it will be counted as 32 bytes; so, on average it's counted as 1 byte.
*But there is one exception: the last bit will always set with the old way.*
(*) - assuming read_amp_bytes_per_bit = 32.
Closes https://github.com/facebook/rocksdb/pull/2259
Differential Revision: D5035652
Pulled By: lightmark
fbshipit-source-id: bd98b1b9b49fbe61f9e3781d07f624e3cbd92356
Summary:
Updated PhysicalCoreID() to use sched_getcpu() on x86_64 for glibc >= 2.22. Added a new
function named GetCPUID() that calls sched_getcpu(), to avoid repeated code. This change is done as per the comments of PR: https://github.com/facebook/rocksdb/pull/2230
Signed-off-by: Jos Collin <jcollin@redhat.com>
Closes https://github.com/facebook/rocksdb/pull/2260
Differential Revision: D5025734
Pulled By: ajkr
fbshipit-source-id: f4cca68c12573cafcf8531e7411a1e733bbf8eef
Summary:
Based on my experience with linkbench, We should not skip loading bloom filter blocks when they are not available in block cache when using Iterator::Seek
Actually I am not sure why this behavior existed in the first place
Closes https://github.com/facebook/rocksdb/pull/2255
Differential Revision: D5010721
Pulled By: maysamyabandeh
fbshipit-source-id: 0af545a06ac4baeecb248706ec34d009c2480ca4
Summary:
RocksDB is compiled as part of MyRocks (MySQL storage engine) build.
MySQL already defines `CACHE_LINE_SIZE` and therefore we're getting a
conflict. Change RocksDB definition to be more cognizant of this.
Closes https://github.com/facebook/rocksdb/pull/2257
Differential Revision: D5013188
Pulled By: gunnarku
fbshipit-source-id: cfa76fe99f90dcd82aa09204e2f1f35e07a82b41
Summary:
Adding DB::CreateColumnFamilie() and DB::DropColumnFamilies() to bulk create/drop column families. This is to address the problem creating/dropping 1k column families takes minutes. The bottleneck is we persist options files for every single column family create/drop, and it parses the persisted options file for verification, which take a lot CPU time.
The new APIs simply create/drop column families individually, and persist options file once at the end. This improves create 1k column families to within ~0.1s. Further improvement can be merge manifest write to one IO.
Closes https://github.com/facebook/rocksdb/pull/2248
Differential Revision: D5001578
Pulled By: yiwu-arbug
fbshipit-source-id: d4e00bda671451e0b314c13e12ad194b1704aa03
Summary:
Any non-raw-data dependent object must be destructed before the table
closes. There was a bug of not doing that for filter object. This patch
fixes the bug and adds a unit test to prevent such bugs in future.
Closes https://github.com/facebook/rocksdb/pull/2246
Differential Revision: D5001318
Pulled By: maysamyabandeh
fbshipit-source-id: 6d8772e58765485868094b92964da82ef9730b6d
Summary:
- downcase includes for case-sensitive filesystems
- give targets the same name (librocksdb) on all platforms
With this patch it is possible to cross-compile RocksDB for Windows
from a Linux host using mingw.
cc yuslepukhin orgads
Closes https://github.com/facebook/rocksdb/pull/2107
Differential Revision: D4849784
Pulled By: siying
fbshipit-source-id: ad26ed6b4d393851aa6551e6aa4201faba82ef60
Summary:
Now if we have iterate_upper_bound set, we continue read until get a key >= upper_bound. For a lot of cases that neighboring data blocks have a user key gap between them, our index key will be a user key in the middle to get a shorter size. For example, if we have blocks:
[a b c d][f g h]
Then the index key for the first block will be 'e'.
then if upper bound is any key between 'd' and 'e', for example, d1, d2, ..., d99999999999, we don't have to read the second block and also know that we have done our iteration by reaching the last key that smaller the upper bound already.
This diff can reduce RA in most cases.
Closes https://github.com/facebook/rocksdb/pull/2239
Differential Revision: D4990693
Pulled By: lightmark
fbshipit-source-id: ab30ea2e3c6edf3fddd5efed3c34fcf7739827ff
Summary:
Allow an option for users to do some compaction in FIFO compaction, to pay some write amplification for fewer number of files.
Closes https://github.com/facebook/rocksdb/pull/2163
Differential Revision: D4895953
Pulled By: siying
fbshipit-source-id: a1ab608dd0627211f3e1f588a2e97159646e1231
Summary:
Changed dynamic leveling to stop setting the base level's size bound below `max_bytes_for_level_base`.
Behavior for config where `max_bytes_for_level_base == level0_file_num_compaction_trigger * write_buffer_size` and same amount of data in L0 and base-level:
- Before #2027, compaction scoring would favor base-level due to dividing by size smaller than `max_bytes_for_level_base`.
- After #2027, L0 and Lbase get equal scores. The disadvantage is L0 is often compacted before reaching the num files trigger since `write_buffer_size` can be bigger than the dynamically chosen base-level size. This increases write-amp.
- After this diff, L0 and Lbase still get equal scores. Now it takes `level0_file_num_compaction_trigger` files of size `write_buffer_size` to trigger L0 compaction by size, fixing the write-amp problem above.
Closes https://github.com/facebook/rocksdb/pull/2123
Differential Revision: D4861570
Pulled By: ajkr
fbshipit-source-id: 467ddef56ed1f647c14d86bb018bcb044c39b964
Summary:
When user doesn't set a limit on compaction output file size, let's use the sum of the input files' sizes. This will avoid passing UINT64_MAX as fallocate()'s length. Reported in #2249.
Test setup:
- command: `TEST_TMPDIR=/data/rocksdb-test/ strace -e fallocate ./db_compaction_test --gtest_filter=DBCompactionTest.ManualCompactionUnknownOutputSize`
- filesystem: xfs
before this diff:
`fallocate(10, 01, 0, 1844674407370955160) = -1 ENOSPC (No space left on device)`
after this diff:
`fallocate(10, 01, 0, 1977) = 0`
Closes https://github.com/facebook/rocksdb/pull/2252
Differential Revision: D5007275
Pulled By: ajkr
fbshipit-source-id: 4491404a6ae8a41328aede2e2d6f4d9ac3e38880
Summary:
we align the buffer with logical sector size and should not test it with page size, which is usually 4k.
Closes https://github.com/facebook/rocksdb/pull/2245
Differential Revision: D5001842
Pulled By: lightmark
fbshipit-source-id: a7135fcf6351c6db363e8908956b1e193a4a6291
Summary:
Makes max_open_files db option dynamically set-able by SetDBOptions. During the call of SetDBOptions we call SetCapacity on the table cache, which is a LRUCache.
Closes https://github.com/facebook/rocksdb/pull/2185
Differential Revision: D4979189
Pulled By: yiwu-arbug
fbshipit-source-id: ca7e8dc5e3619c79434f579be4847c0f7e56afda
Summary: Checked the return value of __get_cpuid(). Implemented the else case where the arch is different from i386 and x86_64.
Pulled By: ajkr
Differential Revision: D4973496
fbshipit-source-id: c40fdef5840364c2a79b1d11df0db5d4ec3d6a4a
Summary:
A data race between a manual and an auto compaction can cause a scheduled automatic compaction to be cancelled and never rescheduled again. This may cause a condition of hanging forever. Fix this by always making sure the cancelled compaction is put back to the compaction queue.
Closes https://github.com/facebook/rocksdb/pull/2238
Differential Revision: D4984591
Pulled By: siying
fbshipit-source-id: 3ab153886403c7b991896dcb2158b96cac12f227
Summary:
Some filters such as partitioned filter have pointers to the table for which they are created. Therefore is they are stored in the block cache, the should be forcibly erased from block cache before closing the table, which would result into deleting the object. Otherwise the destructor will be called later when the cache is lazily erasing the object, which having the parent table no longer existent it could result into undefined behavior.
Update: there will be still cases the filter is not removed from the cache since the table has not kept a pointer to the cache handle to be able to forcibly release it later. We make sure that the filter destructor does not access the table pointer to get around such cases.
Closes https://github.com/facebook/rocksdb/pull/2207
Differential Revision: D4941591
Pulled By: maysamyabandeh
fbshipit-source-id: 56fbab2a11cf447e1aa67caa30b58d7bd7ce5bbd
Summary:
With row cache being enabled, table cache is doing a short circuit for reading data. This path needs to be updated to take advantage of pinnable slice. In the meanwhile we disabling pinning in this path.
Closes https://github.com/facebook/rocksdb/pull/2237
Differential Revision: D4982389
Pulled By: maysamyabandeh
fbshipit-source-id: 542630d0cf23cfb1f0c397da82e7053df7966591
Summary:
ColumnFamilyData::ConstructNewMemtable is called out of DB mutex, and it asserts current_ is not empty, but current_ should only be accessed inside DB mutex. Remove this assert to make TSAN happy.
Closes https://github.com/facebook/rocksdb/pull/2235
Differential Revision: D4978531
Pulled By: siying
fbshipit-source-id: 423685a7dae88ed3faaa9e1b9ccb3427ac704a4b
Summary:
VALGRIND_VER was left empty after moving the environment to GCC-5. Set it back.
Closes https://github.com/facebook/rocksdb/pull/2234
Differential Revision: D4978534
Pulled By: siying
fbshipit-source-id: f0640d58e8f575f75fb3f8b92e686c9e0b6a59bb
Summary:
TSAN sometimes complaints data race of PosixLogger::flush_pending_. Make it atomic.
Closes https://github.com/facebook/rocksdb/pull/2231
Differential Revision: D4973397
Pulled By: siying
fbshipit-source-id: 571e886e3eca3231705919d573e250c1c1ec3764
Summary:
In case users cast a subclass of db* into dbimpl*
Closes https://github.com/facebook/rocksdb/pull/2222
Differential Revision: D4964486
Pulled By: lightmark
fbshipit-source-id: 0ccdc08ee8e7a193dfbbe0218c3cbfd795662ca1
Summary:
Remove double buffering on RandomRead on Windows.
With more logic appear in file reader/write Read no longer
obeys forwarding calls to Windows implementation.
Previously direct_io (unbuffered) was only available on Windows
but now is supported as generic.
We remove intermediate buffering on Windows.
Remove random_access_max_buffer_size option which was windows specific.
Non-zero values for that opton introduced unnecessary lock contention.
Remove Env::EnableReadAhead(), Env::ShouldForwardRawRequest() that are
no longer necessary.
Add aligned buffer reads for cases when requested reads exceed read ahead size.
Closes https://github.com/facebook/rocksdb/pull/2105
Differential Revision: D4847770
Pulled By: siying
fbshipit-source-id: 8ab48f8e854ab498a4fd398a6934859792a2788f
Summary:
�fix the buffer size in case of ppl use buffer size as their block_size.
Closes https://github.com/facebook/rocksdb/pull/2198
Differential Revision: D4956878
Pulled By: lightmark
fbshipit-source-id: 8bb0dc9c133887aadcd625d5261a3d1110b71473
Summary:
It resets all the ticker and histogram stats to zero. Needed to change the locking a bit since Reset() is the only operation that manipulates multiple tickers/histograms together, and that operation should be seen as atomic by other operations that access tickers/histograms.
Closes https://github.com/facebook/rocksdb/pull/2213
Differential Revision: D4952232
Pulled By: ajkr
fbshipit-source-id: c0475c3e4c7b940120d53891b69c3091149a0679
Summary:
Every time after a compaction/flush finish, we issue user reads to put the table into block cache which includes a couple of IO that read footer, index blocks, meta block, etc. So we implement Prefetch here to reduce IO.
Closes https://github.com/facebook/rocksdb/pull/2196
Differential Revision: D4931782
Pulled By: lightmark
fbshipit-source-id: 5a13d58dcab209964352322217193bbf7ff78149
Summary:
I'm going to add more DB tests for statistics as currently we have very few. I started a file dedicated to this purpose and moved the existing stats-specific tests there.
Closes https://github.com/facebook/rocksdb/pull/2211
Differential Revision: D4951558
Pulled By: ajkr
fbshipit-source-id: 05d11c35079c40ecabdfd2cf5556ccb761f694a4
Summary:
Support buck load with universal compaction.
More test cases to be added.
Closes https://github.com/facebook/rocksdb/pull/2202
Differential Revision: D4935360
Pulled By: lightmark
fbshipit-source-id: cc3ca1b6f42faa503207dab1408d6bcf393ee5b5
Summary:
These code paths forked when checkpoint was introduced by copy/pasting the core backup logic. Over time they diverged and bug fixes were sometimes applied to one but not the other (like fix to include all relevant WALs for 2PC), or it required extra effort to fix both (like fix to forge CURRENT file). This diff reunites the code paths by extracting the core logic into a function, CreateCustomCheckpoint(), that is customizable via callbacks to implement both checkpoint and backup.
Related changes:
- flush_before_backup is now forcibly enabled when 2PC is enabled
- Extracted CheckpointImpl class definition into a header file. This is so the function, CreateCustomCheckpoint(), can be called by internal rocksdb code but not exposed to users.
- Implemented more functions in DummyDB/DummyLogFile (in backupable_db_test.cc) that are used by CreateCustomCheckpoint().
Closes https://github.com/facebook/rocksdb/pull/1932
Differential Revision: D4622986
Pulled By: ajkr
fbshipit-source-id: 157723884236ee3999a682673b64f7457a7a0d87