Summary:
Add a histogram in statistics to help users understand how many merge operands they merge.
Closes https://github.com/facebook/rocksdb/pull/2373
Differential Revision: D5139983
Pulled By: siying
fbshipit-source-id: 61b9ba8ca83f358530a4833d68f0103b56a0e182
Summary:
Previously sst_file_writer only supports kTypeValue, we need kTypeMerge and kTypeDeletion also as user requested.
Closes https://github.com/facebook/rocksdb/pull/2361
Differential Revision: D5139402
Pulled By: lightmark
fbshipit-source-id: 092a60756d01692539d817a3765ebfd58a8d7f88
Summary:
This log message shouldn't be a warning; some services are seeing high warning count due to this.
The count for the below line is a few hundreds of millions, as per Logview:
```
[rocksdb/src/db/column_family.cc:729] [checkpoints] Increasing compaction threads because we have 2 level-0 files
```
Closes https://github.com/facebook/rocksdb/pull/2364
Differential Revision: D5123565
Pulled By: sagar0
fbshipit-source-id: a07ce499a4f82f0ebde9cda9f4948fb9df6a734c
Summary:
stop calling Close() at the end of tests holding a compaction pressure token since it causes the write controller to be deleted while it's still needed. these calls were pointless anyways since Close() is already called in the test's destructor.
Closes https://github.com/facebook/rocksdb/pull/2367
Differential Revision: D5125906
Pulled By: ajkr
fbshipit-source-id: 6cad8673e5546a82ff602ac0ba59cc3f68dbde46
Summary:
- `max_background_flushes` and `max_background_compactions` are still supported for backwards compatibility
- `base_background_compactions` is completely deprecated. Now we just throttle to one background compaction when there's no pressure.
- `max_background_jobs` is added to automatically partition the concurrent background jobs into flushes vs compactions. Currently it's very simple as we just allocate one-fourth of the jobs to flushes, and the remaining can be used for compactions.
- The test cases that set `base_background_compactions > 1` needed to be updated. I just grab the pressure token such that the desired number of compactions can be scheduled.
Closes https://github.com/facebook/rocksdb/pull/2205
Differential Revision: D4937461
Pulled By: ajkr
fbshipit-source-id: df52cbbd497e13bbc9a60560a5ac2a2526b3f1f9
Summary:
It's hard for RocksDB to come up with a good default of delayed write rate. Use rate given by rate limiter if it is availalbe. This provides the I/O order of magnitude.
Closes https://github.com/facebook/rocksdb/pull/2357
Differential Revision: D5115324
Pulled By: siying
fbshipit-source-id: 341065ad2211c981fc804011c0f0e59a50c7e754
Summary:
`cf_stats_snapshot_.seconds_up` appears to be never updated, unlike `db_stats_snapshot_.seconds_up`, which is updated here: https://github.com/facebook/rocksdb/blob/master/db/internal_stats.cc#L883
This leads to wrong information in the log, for example:
```
** Compaction Stats [default] **
....
Uptime(secs): 85591.2 total, 85591.2 interval
```
Even though DB's interval is correctly logged as 60 seconds:
```
** DB Stats **
Uptime(secs): 85591.2 total, 637.8 interval
```
Closes https://github.com/facebook/rocksdb/pull/2338
Differential Revision: D5114131
Pulled By: sagar0
fbshipit-source-id: 85243a38213236ccbb601a7f7aaa8865eaa8083c
Summary:
Fix build error in db_iter.cc when running clang-analyzer.
```
CC db/db_iter.o
db/db_iter.cc:938:21: error: no matching constructor for initialization of 'rocksdb::ParsedInternalKey'
ParsedInternalKey ikey(Slice(), 0, 0);
^ ~~~~~~~~~~~~~
./db/dbformat.h:84:3: note: candidate constructor not viable: no known conversion from 'int' to 'rocksdb::ValueType' for 3rd argument
ParsedInternalKey(const Slice& u, const SequenceNumber& seq, ValueType t)
^
./db/dbformat.h:78:8: note: candidate constructor (the implicit copy constructor) not viable: requires 1 argument, but 3 were provided
struct ParsedInternalKey {
^
./db/dbformat.h:78:8: note: candidate constructor (the implicit move constructor) not viable: requires 1 argument, but 3 were provided
./db/dbformat.h:83:3: note: candidate constructor not viable: requires 0 arguments, but 3 were provided
ParsedInternalKey() { } // Intentionally left uninitialized (for speed)
^
1 error generated.
```
Closes https://github.com/facebook/rocksdb/pull/2354
Differential Revision: D5115751
Pulled By: sagar0
fbshipit-source-id: b0e386d4e935e4725b07761c3ca5f7a8cbde3692
Summary:
Previously users could set `max_background_flushes=0` to force rocksdb to use a single thread pool for both background flushes and compactions. That'll no longer be possible since I'm going to deprecate `max_background_flushes` and `max_background_compactions` in favor of a single option. This diff introduces a new way to force a single thread pool: when high-pri pool has zero threads, all background jobs will be submitted to low-pri pool.
Note the majority of the code change is adding `Env::GetBackgroundThreads()`, which is necessary to check whether the user has provided a zero-sized thread pool.
Closes https://github.com/facebook/rocksdb/pull/2204
Differential Revision: D4936256
Pulled By: ajkr
fbshipit-source-id: 929a07a0c0705f7766f5339cd013ff74e90d6e01
Summary:
Disable direct reads for log and manifest. Direct reads should not affect sequential_file
Also add kDirectIO for option_config_ in db_test_util
Closes https://github.com/facebook/rocksdb/pull/2337
Differential Revision: D5100261
Pulled By: lightmark
fbshipit-source-id: 0ebfd13b93fa1b8f9acae514ac44f8125a05868b
Summary:
PipelineWriteImpl is an alternative approach to WriteImpl. In WriteImpl, only one thread is allow to write at the same time. This thread will do both WAL and memtable writes for all write threads in the write group. Pending writers wait in queue until the current writer finishes. In the pipeline write approach, two queue is maintained: one WAL writer queue and one memtable writer queue. All writers (regardless of whether they need to write WAL) will still need to first join the WAL writer queue, and after the house keeping work and WAL writing, they will need to join memtable writer queue if needed. The benefit of this approach is that
1. Writers without memtable writes (e.g. the prepare phase of two phase commit) can exit write thread once WAL write is finish. They don't need to wait for memtable writes in case of group commit.
2. Pending writers only need to wait for previous WAL writer finish to be able to join the write thread, instead of wait also for previous memtable writes.
Merging #2056 and #2058 into this PR.
Closes https://github.com/facebook/rocksdb/pull/2286
Differential Revision: D5054606
Pulled By: yiwu-arbug
fbshipit-source-id: ee5b11efd19d3e39d6b7210937b11cefdd4d1c8d
Summary:
Fixing two types of clang-analyzer false positives:
* db is deleted and then reopen, and clang-analyzer thinks we are reusing the pointer after it has been deleted. Adding asserts to hint clang-analyzer the pointer is recreated.
* ParsedInternalKey is (intentionally) uninitialized. Initialize the struct only when clang-analyzer is running.
Closes https://github.com/facebook/rocksdb/pull/2334
Differential Revision: D5093801
Pulled By: yiwu-arbug
fbshipit-source-id: f51355382098eb3da5ab9f64e094c6d03e6bdf7d
Summary:
TSAN shows warning of data race of EnvCounter::num_new_writable_file_. Make it atomic.
Closes https://github.com/facebook/rocksdb/pull/2331
Differential Revision: D5089215
Pulled By: siying
fbshipit-source-id: 15f6dcfb770a3310cbb6337c22482c8b330daffc
Summary:
First cut for early review; there are few conceptual points to answer and some code structure issues.
For conceptual points -
- restriction-wise, we're going to disallow ingest_behind if (use_seqno_zero_out=true || disable_auto_compaction=false), the user is responsible to properly open and close DB with required params
- we wanted to ingest into reserved bottom most level. Should we fail fast if bottom level isn't empty, or should we attempt to ingest if file fits there key-ranges-wise?
- Modifying AssignLevelForIngestedFile seems the place we we'd handle that.
On code structure - going to refactor GenerateAndAddExternalFile call in the test class to allow passing instance of IngestionOptions, that's just going to incur lots of changes at callsites.
Closes https://github.com/facebook/rocksdb/pull/2144
Differential Revision: D4873732
Pulled By: lightmark
fbshipit-source-id: 81cb698106b68ef8797f564453651d50900e153a
Summary:
I've added functions to the C API to support Transactions as requested in #1637 and to support Checkpoint.
I have also added the corresponding tests to c_test.c
For now, the following is omitted:
1. Optimistic Transactions
2. The column family variation of functions
Closes https://github.com/facebook/rocksdb/pull/2236
Differential Revision: D4989510
Pulled By: yiwu-arbug
fbshipit-source-id: 518cb39f76d5e9ec9690d633fcdc014b98958071
Summary:
fix test failure of ReadAmpBitmap and ReadAmpBitmapLiveInCacheAfterDBClose.
test ReadAmpBitmapLiveInCacheAfterDBClose individually and make check
Closes https://github.com/facebook/rocksdb/pull/2271
Differential Revision: D5038133
Pulled By: lightmark
fbshipit-source-id: 803cd6f45ccfdd14a9d9473c8af311033e164be8
Summary:
- added a feature test in build_detect_platform to check whether sched_getcpu() is available. glibc offers it only on some platforms (e.g., linux but not mac); this way should be easier than maintaining a list of platforms on which it's available.
- refactored PhysicalCoreID() to be simpler / less repetitive. ordered the conditional compilation clauses from most-to-least preferred
Closes https://github.com/facebook/rocksdb/pull/2272
Differential Revision: D5038093
Pulled By: ajkr
fbshipit-source-id: 81d7db3cc620250de220bdeb3194b2b3d7673de7
Summary:
Consider BlockReadAmpBitmap with bytes_per_bit = 32. Suppose bytes [a, b) were used, while bytes [a-32, a)
and [b+1, b+33) weren't used; more formally, the union of ranges passed to BlockReadAmpBitmap::Mark() contains [a, b) and doesn't intersect with [a-32, a) and [b+1, b+33). Then bits [floor(a/32), ceil(b/32)] will be set, and so the number of useful bytes will be estimated as (ceil(b/32) - floor(a/32)) * 32, which is on average equal to b-a+31.
An extreme example: if we use 1 byte from each block, it'll be counted as 32 bytes from each block.
It's easy to remove this bias by slightly changing the semantics of the bitmap. Currently each bit represents a byte range [i*32, (i+1)*32).
This diff makes each bit represent a single byte: i*32 + X, where X is a random number in [0, 31] generated when bitmap is created. So, e.g., if you read a single byte at random, with probability 31/32 it won't be counted at all, and with probability 1/32 it will be counted as 32 bytes; so, on average it's counted as 1 byte.
*But there is one exception: the last bit will always set with the old way.*
(*) - assuming read_amp_bytes_per_bit = 32.
Closes https://github.com/facebook/rocksdb/pull/2259
Differential Revision: D5035652
Pulled By: lightmark
fbshipit-source-id: bd98b1b9b49fbe61f9e3781d07f624e3cbd92356
Summary:
Based on my experience with linkbench, We should not skip loading bloom filter blocks when they are not available in block cache when using Iterator::Seek
Actually I am not sure why this behavior existed in the first place
Closes https://github.com/facebook/rocksdb/pull/2255
Differential Revision: D5010721
Pulled By: maysamyabandeh
fbshipit-source-id: 0af545a06ac4baeecb248706ec34d009c2480ca4
Summary:
Adding DB::CreateColumnFamilie() and DB::DropColumnFamilies() to bulk create/drop column families. This is to address the problem creating/dropping 1k column families takes minutes. The bottleneck is we persist options files for every single column family create/drop, and it parses the persisted options file for verification, which take a lot CPU time.
The new APIs simply create/drop column families individually, and persist options file once at the end. This improves create 1k column families to within ~0.1s. Further improvement can be merge manifest write to one IO.
Closes https://github.com/facebook/rocksdb/pull/2248
Differential Revision: D5001578
Pulled By: yiwu-arbug
fbshipit-source-id: d4e00bda671451e0b314c13e12ad194b1704aa03
Summary:
- downcase includes for case-sensitive filesystems
- give targets the same name (librocksdb) on all platforms
With this patch it is possible to cross-compile RocksDB for Windows
from a Linux host using mingw.
cc yuslepukhin orgads
Closes https://github.com/facebook/rocksdb/pull/2107
Differential Revision: D4849784
Pulled By: siying
fbshipit-source-id: ad26ed6b4d393851aa6551e6aa4201faba82ef60
Summary:
Now if we have iterate_upper_bound set, we continue read until get a key >= upper_bound. For a lot of cases that neighboring data blocks have a user key gap between them, our index key will be a user key in the middle to get a shorter size. For example, if we have blocks:
[a b c d][f g h]
Then the index key for the first block will be 'e'.
then if upper bound is any key between 'd' and 'e', for example, d1, d2, ..., d99999999999, we don't have to read the second block and also know that we have done our iteration by reaching the last key that smaller the upper bound already.
This diff can reduce RA in most cases.
Closes https://github.com/facebook/rocksdb/pull/2239
Differential Revision: D4990693
Pulled By: lightmark
fbshipit-source-id: ab30ea2e3c6edf3fddd5efed3c34fcf7739827ff
Summary:
Allow an option for users to do some compaction in FIFO compaction, to pay some write amplification for fewer number of files.
Closes https://github.com/facebook/rocksdb/pull/2163
Differential Revision: D4895953
Pulled By: siying
fbshipit-source-id: a1ab608dd0627211f3e1f588a2e97159646e1231
Summary:
Changed dynamic leveling to stop setting the base level's size bound below `max_bytes_for_level_base`.
Behavior for config where `max_bytes_for_level_base == level0_file_num_compaction_trigger * write_buffer_size` and same amount of data in L0 and base-level:
- Before #2027, compaction scoring would favor base-level due to dividing by size smaller than `max_bytes_for_level_base`.
- After #2027, L0 and Lbase get equal scores. The disadvantage is L0 is often compacted before reaching the num files trigger since `write_buffer_size` can be bigger than the dynamically chosen base-level size. This increases write-amp.
- After this diff, L0 and Lbase still get equal scores. Now it takes `level0_file_num_compaction_trigger` files of size `write_buffer_size` to trigger L0 compaction by size, fixing the write-amp problem above.
Closes https://github.com/facebook/rocksdb/pull/2123
Differential Revision: D4861570
Pulled By: ajkr
fbshipit-source-id: 467ddef56ed1f647c14d86bb018bcb044c39b964
Summary:
When user doesn't set a limit on compaction output file size, let's use the sum of the input files' sizes. This will avoid passing UINT64_MAX as fallocate()'s length. Reported in #2249.
Test setup:
- command: `TEST_TMPDIR=/data/rocksdb-test/ strace -e fallocate ./db_compaction_test --gtest_filter=DBCompactionTest.ManualCompactionUnknownOutputSize`
- filesystem: xfs
before this diff:
`fallocate(10, 01, 0, 1844674407370955160) = -1 ENOSPC (No space left on device)`
after this diff:
`fallocate(10, 01, 0, 1977) = 0`
Closes https://github.com/facebook/rocksdb/pull/2252
Differential Revision: D5007275
Pulled By: ajkr
fbshipit-source-id: 4491404a6ae8a41328aede2e2d6f4d9ac3e38880
Summary:
Makes max_open_files db option dynamically set-able by SetDBOptions. During the call of SetDBOptions we call SetCapacity on the table cache, which is a LRUCache.
Closes https://github.com/facebook/rocksdb/pull/2185
Differential Revision: D4979189
Pulled By: yiwu-arbug
fbshipit-source-id: ca7e8dc5e3619c79434f579be4847c0f7e56afda
Summary:
A data race between a manual and an auto compaction can cause a scheduled automatic compaction to be cancelled and never rescheduled again. This may cause a condition of hanging forever. Fix this by always making sure the cancelled compaction is put back to the compaction queue.
Closes https://github.com/facebook/rocksdb/pull/2238
Differential Revision: D4984591
Pulled By: siying
fbshipit-source-id: 3ab153886403c7b991896dcb2158b96cac12f227
Summary:
ColumnFamilyData::ConstructNewMemtable is called out of DB mutex, and it asserts current_ is not empty, but current_ should only be accessed inside DB mutex. Remove this assert to make TSAN happy.
Closes https://github.com/facebook/rocksdb/pull/2235
Differential Revision: D4978531
Pulled By: siying
fbshipit-source-id: 423685a7dae88ed3faaa9e1b9ccb3427ac704a4b
Summary:
In case users cast a subclass of db* into dbimpl*
Closes https://github.com/facebook/rocksdb/pull/2222
Differential Revision: D4964486
Pulled By: lightmark
fbshipit-source-id: 0ccdc08ee8e7a193dfbbe0218c3cbfd795662ca1
Summary:
Remove double buffering on RandomRead on Windows.
With more logic appear in file reader/write Read no longer
obeys forwarding calls to Windows implementation.
Previously direct_io (unbuffered) was only available on Windows
but now is supported as generic.
We remove intermediate buffering on Windows.
Remove random_access_max_buffer_size option which was windows specific.
Non-zero values for that opton introduced unnecessary lock contention.
Remove Env::EnableReadAhead(), Env::ShouldForwardRawRequest() that are
no longer necessary.
Add aligned buffer reads for cases when requested reads exceed read ahead size.
Closes https://github.com/facebook/rocksdb/pull/2105
Differential Revision: D4847770
Pulled By: siying
fbshipit-source-id: 8ab48f8e854ab498a4fd398a6934859792a2788f
Summary:
It resets all the ticker and histogram stats to zero. Needed to change the locking a bit since Reset() is the only operation that manipulates multiple tickers/histograms together, and that operation should be seen as atomic by other operations that access tickers/histograms.
Closes https://github.com/facebook/rocksdb/pull/2213
Differential Revision: D4952232
Pulled By: ajkr
fbshipit-source-id: c0475c3e4c7b940120d53891b69c3091149a0679
Summary:
I'm going to add more DB tests for statistics as currently we have very few. I started a file dedicated to this purpose and moved the existing stats-specific tests there.
Closes https://github.com/facebook/rocksdb/pull/2211
Differential Revision: D4951558
Pulled By: ajkr
fbshipit-source-id: 05d11c35079c40ecabdfd2cf5556ccb761f694a4
Summary:
Support buck load with universal compaction.
More test cases to be added.
Closes https://github.com/facebook/rocksdb/pull/2202
Differential Revision: D4935360
Pulled By: lightmark
fbshipit-source-id: cc3ca1b6f42faa503207dab1408d6bcf393ee5b5