Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
Summary:
Enhance ::Insert and ::Verify test functions to add artificial delay between prepare and commit, and take snapshot and reads respectively. A future PR will make use of these to improve stress tests to test against long-running transactions as well as long-running backup jobs. Also randomly sets set_snapshot to false for inserters to skip setting the snapshot in the initialization phase and let the snapshot be taken later explicitly.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4970
Differential Revision: D14031342
Pulled By: maysamyabandeh
fbshipit-source-id: b52b453751f0b25b81b23c48892bc1d152464cab
Summary:
Implement trace sampling to allow user to specify the sampling frequency, i.e. save one per how many requests, so that a user does not need to log all if he/she is interested in only a sampled set.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4963
Differential Revision: D14011190
Pulled By: tang-jianfeng
fbshipit-source-id: 078b631d9319b67cb089dd2c30e21d0df8dc406a
Summary:
Measure CPU time consumed for a compaction and report it in the stats report
Enable NowCPUNanos() to work for MacOS
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4889
Differential Revision: D13701276
Pulled By: zinoale
fbshipit-source-id: 5024e5bbccd4dd10fd90d947870237f436445055
Summary:
Blob DB files are not tracked by the SFM, so they currently don't get
deleted in the background. Force them to be deleted in background so
rate limiting can be applied
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4928
Differential Revision: D13854649
Pulled By: anand1976
fbshipit-source-id: 8031ce66842ff0af440c715d886b377983dad7d8
Summary:
I didn't find where customized hash function is used in DynamicBloom. This can only reduce performance. Remove it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4915
Differential Revision: D13794452
Pulled By: siying
fbshipit-source-id: e38669b11e01444d2d782da11c7decabbd851819
Summary:
- If block cache disabled or not used for meta-blocks, `BlockBasedTableReader::Rep::uncompression_dict` owns the `UncompressionDict`. It is preloaded during `PrefetchIndexAndFilterBlocks`.
- If block cache is enabled and used for meta-blocks, block cache owns the `UncompressionDict`, which holds dictionary and digested dictionary when needed. It is never prefetched though there is a TODO for this in the code. The cache key is simply the compression dictionary block handle.
- New stats for compression dictionary accesses in block cache: "BLOCK_CACHE_COMPRESSION_DICT_*" and "compression_dict_block_read_count"
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4881
Differential Revision: D13663801
Pulled By: ajkr
fbshipit-source-id: bdcc54044e180855cdcc57639b493b0e016c9a3f
Summary:
Right now, deleting blob files is not rate limited, even if SstFileManger is specified.
On the other hand, rate limiting blob deletion is not supported. With this change, Blob file
deletion will go through SstFileManager too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4904
Differential Revision: D13772545
Pulled By: siying
fbshipit-source-id: bd1b1d0beb26d5167385e00b7ecb8b94b879de84
Summary:
This is essentially a re-submission of #4251 with a few improvements:
- Split `CompressionDict` into two separate classes: `CompressionDict` and `UncompressionDict`
- Eliminated `Init` functions. Instead do all initialization work in constructors.
- Added test case for parallel DB open, which is the scenario where #4251 failed under TSAN.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4849
Differential Revision: D13606039
Pulled By: ajkr
fbshipit-source-id: 08c236059798c710db9cbf545fce0f371232d447
Summary:
By convention, time_t almost always stores the integral number of seconds since
00:00 hours, Jan 1, 1970 UTC, according to http://www.cplusplus.com/reference/ctime/time_t/.
We surely want more precision than seconds.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4868
Differential Revision: D13633046
Pulled By: riversand963
fbshipit-source-id: 4e01e23a22e8838023c51a91247a286dbf3a5396
Summary:
Declare Jemalloc non-standard APIs as weak symbols, so that if Jemalloc is linked with the binary, these symbols will be replaced by Jemalloc's, otherwise they will be nullptr. This is similar to how folly detect jemalloc, but we assume the main program use jemalloc as long as jemalloc is linked: https://github.com/facebook/folly/blob/master/folly/memory/Malloc.h#L147
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4844
Differential Revision: D13574934
Pulled By: yiwu-arbug
fbshipit-source-id: 7ea871beb1be7d5a1259cc38f9b78078793db2db
Summary:
1. Remove unused API SubtractCompactionTask().
2. Assert outstanding tasks drop to zero in ConcurrentTaskLimiterImpl destructor.
3. Remove GetOutstandingTask() check from manual compaction test, as TEST_WaitForCompact() doesn't synced with 'delete prepicked_compaction' in DBImpl::BGWorkCompaction(), which may make the test flaky.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4795
Differential Revision: D13542183
Pulled By: siying
fbshipit-source-id: 5eb2a47e62efe4126937149aa0df6e243ebefc33
Summary:
RangeDelAggregatorV2 now supports ShouldDelete calls on
snapshot stripes and creation of range tombstone compaction iterators.
RangeDelAggregator is no longer used on any non-test code path, and will
be removed in a future commit.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4758
Differential Revision: D13439254
Pulled By: abhimadan
fbshipit-source-id: fe105bcf8e3d4a2df37a622d5510843cd71b0401
Summary:
The PR is targeting to resolve the issue of:
https://github.com/facebook/rocksdb/issues/3972#issue-330771918
We have a rocksdb created with leveled-compaction with multiple column families (CFs), some of CFs are using HDD to store big and less frequently accessed data and others are using SSD.
When there are continuously write traffics going on to all CFs, the compaction thread pool is mostly occupied by those slow HDD compactions, which blocks fully utilize SSD bandwidth.
Since atomic write and transaction is needed across CFs, so splitting it to multiple rocksdb instance is not an option for us.
With the compaction thread control, we got 30%+ HDD write throughput gain, and also a lot smooth SSD write since less write stall happening.
ConcurrentTaskLimiter can be shared with multi-CFs across rocksdb instances, so the feature does not only work for multi-CFs scenarios, but also for multi-rocksdbs scenarios, who need disk IO resource control per tenant.
The usage is straight forward:
e.g.:
//
// Enable compaction thread limiter thru ColumnFamilyOptions
//
std::shared_ptr<ConcurrentTaskLimiter> ctl(NewConcurrentTaskLimiter("foo_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = ctl;
...
//
// Compaction thread limiter can be tuned or disabled on-the-fly
//
ctl->SetMaxOutstandingTask(12); // enlarge to 12 tasks
...
ctl->ResetMaxOutstandingTask(); // disable (bypass) thread limiter
ctl->SetMaxOutstandingTask(-1); // Same as above
...
ctl->SetMaxOutstandingTask(0); // full throttle (0 task)
//
// Sharing compaction thread limiter among CFs (to resolve multiple storage perf issue)
//
std::shared_ptr<ConcurrentTaskLimiter> ctl_ssd(NewConcurrentTaskLimiter("ssd_limiter", 8));
std::shared_ptr<ConcurrentTaskLimiter> ctl_hdd(NewConcurrentTaskLimiter("hdd_limiter", 4));
Options options;
ColumnFamilyOptions cf_opt_ssd1(options);
ColumnFamilyOptions cf_opt_ssd2(options);
ColumnFamilyOptions cf_opt_hdd1(options);
ColumnFamilyOptions cf_opt_hdd2(options);
ColumnFamilyOptions cf_opt_hdd3(options);
// SSD CFs
cf_opt_ssd1.compaction_thread_limiter = ctl_ssd;
cf_opt_ssd2.compaction_thread_limiter = ctl_ssd;
// HDD CFs
cf_opt_hdd1.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd2.compaction_thread_limiter = ctl_hdd;
cf_opt_hdd3.compaction_thread_limiter = ctl_hdd;
...
//
// The limiter is disabled by default (or set to nullptr explicitly)
//
Options options;
ColumnFamilyOptions cf_opt(options);
cf_opt.compaction_thread_limiter = nullptr;
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4332
Differential Revision: D13226590
Pulled By: siying
fbshipit-source-id: 14307aec55b8bd59c8223d04aa6db3c03d1b0c1d
Summary:
In Direct I/O case, WritableFileWriter::Close() rewrites the last block again, even if there is nothing new. The reason is that, Close() flushes the buffer. For non-direct I/O case, the buffer is empty in this case so it is a no-op. However, in direct I/O case, the partial data in the last block is kept in the buffer because it needs to be rewritten for the next write. This piece of data is flushed again. This commit fixes it by skipping this write out if `pending_sync_` flag shows that there isn't new data sync last sync.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4771
Differential Revision: D13420426
Pulled By: siying
fbshipit-source-id: 9d39ec9a215b1425d4ed40d85e0eba1f5daa75c6
Summary:
Transaction::GetForUpdate is extended with a do_validate parameter with default value of true. If false it skips validating the snapshot (if there is any) before doing the read. After the read it also returns the latest value (expects the ReadOptions::snapshot to be nullptr). This allows RocksDB applications to use GetForUpdate similarly to how InnoDB does. Similarly ::Merge, ::Put, ::Delete, and ::SingleDelete are extended with assume_exclusive_tracked with default value of false. It true it indicates that call is assumed to be after a ::GetForUpdate(do_validate=false).
The Java APIs are accordingly updated.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4680
Differential Revision: D13068508
Pulled By: maysamyabandeh
fbshipit-source-id: f0b59db28f7f6a078b60844d902057140765e67d
Summary:
Currently tests are failing on master with the following message:
> util/jemalloc_nodump_allocator.cc:132:8: error: unused parameter ‘options’ [-Werror=unused-parameter]
Status NewJemallocNodumpAllocator(
This PR attempts to fix the issue
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4738
Differential Revision: D13278804
Pulled By: miasantreble
fbshipit-source-id: 64a6204aa685bd85d8b5080655cafef9980fac2f
Summary:
Add option to limit tcache usage by allocation size. This is to reduce total tcache size in case there are many user threads accessing the allocator and incur non-trivial memory usage.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4736
Differential Revision: D13269305
Pulled By: yiwu-arbug
fbshipit-source-id: 95a9b7fc67facd66837c849137e30e137112e19d
Summary:
Removed `one_time_use` flag, which removed the need for some
tests, and changed all `NewRangeTombstoneIterator` methods to return
`FragmentedRangeTombstoneIterators`.
These changes also led to removing `RangeDelAggregatorV2::AddUnfragmentedTombstones`
and one of the `MemTableListVersion::AddRangeTombstoneIterators` methods.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4692
Differential Revision: D13106570
Pulled By: abhimadan
fbshipit-source-id: cbab5432d7fc2d9cdfd8d9d40361a1bffaa8f845
Summary:
If user do not end the trace manually, the tracing will continue which can potential use up all the storage space and cause problem. In this PR, the max trace file size is added to the TraceOptions and user can set the value if they need or the default is 64GB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4610
Differential Revision: D12893400
Pulled By: zhichao-cao
fbshipit-source-id: acf4b5a6076bb691778bdfbac4864e1006758953
Summary:
Previously, every range tombstone iterator was seeked on every
ShouldDelete call, which quickly degraded performance for long range
scans. This PR improves performance by tracking iterator positions and
only advancing iterators when necessary.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4677
Differential Revision: D13205373
Pulled By: abhimadan
fbshipit-source-id: 80c199dace1e19362a4c61c686bf01913eae87cb
Summary:
It called the autovector::push_back simply in autovector::emplace_back.
This was not efficient, and then optimazed this function through the
perfect forwarding.
This was the src and result of the benchmark(using the google'benchmark library, the type of elem in
autovector was std::string, and call emplace_back with the "char *" type):
https://gist.github.com/monadbobo/93448b89a42737b08cbada81de75c5cd
PS: The benchmark's result of previous PR was not accurate, and so I update the test case and result.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4606
Differential Revision: D13046813
Pulled By: sagar0
fbshipit-source-id: 19cde1bcadafe899aa454b703acb35737a1cc02d
Summary:
Use the `DBOptions` that the backup engine already holds to figure out the right `EnvOptions` to use when reading the DB files. This means that, if a user opened a DB instance with `use_direct_reads=true`, then using `BackupEngine` to back up that DB instance will use direct I/O to read files when calculating checksums and copying. Currently the WALs and manifests would still be read using buffered I/O to prevent mixing direct I/O reads with concurrent buffered I/O writes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4640
Differential Revision: D13015268
Pulled By: ajkr
fbshipit-source-id: 77006ad6f3e00ce58374ca4793b785eea0db6269
Summary:
Ran the following commands to recursively change all the files under RocksDB:
```
find . -type f -name "*.cc" -exec sed -i 's/ unique_ptr/ std::unique_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/<unique_ptr/<std::unique_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/ shared_ptr/ std::shared_ptr/g' {} +
find . -type f -name "*.cc" -exec sed -i 's/<shared_ptr/<std::shared_ptr/g' {} +
```
Running `make format` updated some formatting on the files touched.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4638
Differential Revision: D12934992
Pulled By: sagar0
fbshipit-source-id: 45a15d23c230cdd64c08f9c0243e5183934338a8
Summary:
Long absolute file names in log make it hard to read the LOG files.
So we shorter them to relative to the root of RocksDB project path.
In most cases, they will only have one level directory and one file name.
There was [a talk](#4316) about making "util/logging.h" a public header file.
But we concern the conflicts that might be introduced in for macros
named `STRINGIFY`, `TOSTRING`, and `PREPEND_FILE_LINE`.
So I prepend a prefix `ROCKS_LOG_` to them.
I also remove the line that includes "port.h" which seems unneccessary here.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4616
Differential Revision: D12892857
Pulled By: siying
fbshipit-source-id: af79aaf82153b8fd66b5966aced39a51fbca9c6c
Summary:
Introduce `JemallocNodumpAllocator`, which allow exclusion of block cache usage from core dump. It utilize custom hook of jemalloc arena, and when jemalloc arena request memory from system, the allocator use the hook to set `MADV_DONTDUMP ` to the memory. The implementation is basically the same as `folly::JemallocNodumpAllocator`, except for some minor difference:
1. It only support jemalloc >= 5.0
2. When the allocator destruct, it explicitly destruct the corresponding arena via `arena.<i>.destroy` via `mallctl`.
Depending on #4502.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4534
Differential Revision: D10435474
Pulled By: yiwu-arbug
fbshipit-source-id: e80edea755d3853182485d2be710376384ce0bb4
Summary:
Rename the interface, as it is mean to be a generic interface for memory allocation.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4590
Differential Revision: D10866340
Pulled By: yiwu-arbug
fbshipit-source-id: 85cb753351a40cb856c046aeaa3f3b369eef3d16
Summary:
Previously, range tombstones were accumulated from every level, which
was necessary if a range tombstone in a higher level covered a key in a lower
level. However, RangeDelAggregator::AddTombstones's complexity is based on
the number of tombstones that are currently stored in it, which is wasteful in
the Get case, where we only need to know the highest sequence number of range
tombstones that cover the key from higher levels, and compute the highest covering
sequence number at the current level. This change introduces this optimization, and
removes the use of RangeDelAggregator from the Get path.
In the benchmark results, the following command was used to initialize the database:
```
./db_bench -db=/dev/shm/5k-rts -use_existing_db=false -benchmarks=filluniquerandom -write_buffer_size=1048576 -compression_type=lz4 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -value_size=112 -key_size=16 -block_size=4096 -level_compaction_dynamic_level_bytes=true -num=5000000 -max_background_jobs=12 -benchmark_write_rate_limit=20971520 -range_tombstone_width=100 -writes_per_range_tombstone=100 -max_num_range_tombstones=50000 -bloom_bits=8
```
...and the following command was used to measure read throughput:
```
./db_bench -db=/dev/shm/5k-rts/ -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=5000000 -reads=100000 -threads=32
```
The filluniquerandom command was only run once, and the resulting database was used
to measure read performance before and after the PR. Both binaries were compiled with
`DEBUG_LEVEL=0`.
Readrandom results before PR:
```
readrandom : 4.544 micros/op 220090 ops/sec; 16.9 MB/s (63103 of 100000 found)
```
Readrandom results after PR:
```
readrandom : 11.147 micros/op 89707 ops/sec; 6.9 MB/s (63103 of 100000 found)
```
So it's actually slower right now, but this PR paves the way for future optimizations (see #4493).
----
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4449
Differential Revision: D10370575
Pulled By: abhimadan
fbshipit-source-id: 9a2e152be1ef36969055c0e9eb4beb0d96c11f4d
Summary:
This fixes three tests that fail with relatively recent tools and libraries:
The tests are:
* `spatial_db_test`
* `table_test`
* `db_universal_compaction_test`
I'm using:
* `gcc` 7.3.0
* `glibc` 2.27
* `snappy` 1.1.7
* `gflags` 2.2.1
* `zlib` 1.2.11
* `bzip2` 1.0.6.0.1
* `lz4` 1.8.2
* `jemalloc` 5.0.1
The versions used in the Travis environment (which is two Ubuntu LTS versions behind the current one and doesn't use `lz4` or `jemalloc`) don't seem to have a problem. However, to be safe, I verified that these tests pass with and without my changes in a trusty Docker container without `lz4` and `jemalloc`.
However, I do get an unrelated set of other failures when using a trusty Docker container that uses `lz4` and `jemalloc`:
```
db/db_universal_compaction_test.cc:506: Failure
Value of: num + 1
Actual: 3
Expected: NumSortedRuns(1)
Which is: 4
[ FAILED ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/0, where GetParam() = (1, false) (1189 ms)
[ RUN ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/1
db/db_universal_compaction_test.cc:506: Failure
Value of: num + 1
Actual: 3
Expected: NumSortedRuns(1)
Which is: 4
[ FAILED ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/1, where GetParam() = (1, true) (1246 ms)
[ RUN ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/2
db/db_universal_compaction_test.cc:506: Failure
Value of: num + 1
Actual: 3
Expected: NumSortedRuns(1)
Which is: 4
[ FAILED ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/2, where GetParam() = (3, false) (1237 ms)
[ RUN ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/3
db/db_universal_compaction_test.cc:506: Failure
Value of: num + 1
Actual: 3
Expected: NumSortedRuns(1)
Which is: 4
[ FAILED ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/3, where GetParam() = (3, true) (1195 ms)
[ RUN ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/4
db/db_universal_compaction_test.cc:506: Failure
Value of: num + 1
Actual: 3
Expected: NumSortedRuns(1)
Which is: 4
[ FAILED ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/4, where GetParam() = (5, false) (1161 ms)
[ RUN ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/5
db/db_universal_compaction_test.cc:506: Failure
Value of: num + 1
Actual: 3
Expected: NumSortedRuns(1)
Which is: 4
[ FAILED ] UniversalCompactionNumLevels/DBTestUniversalCompaction.DynamicUniversalCompactionReadAmplification/5, where GetParam() = (5, true) (1229 ms)
```
I haven't attempted to fix these since I'm not using trusty and Travis doesn't use `lz4` and `jemalloc`. However, the final commit in this PR does at least fix the compilation errors that occur when using trusty's version of `lz4`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4562
Differential Revision: D10510917
Pulled By: maysamyabandeh
fbshipit-source-id: 59534042015ec339270e5fc2f6ac4d859370d189
Summary:
A fix similar to #4410 but on the write path. On IO error on `SelectBlobFile()` we didn't return error code properly, but simply a nullptr of `BlobFile`. The `AppendBlob()` method didn't have null check for the pointer and caused crash. The fix make sure we properly return error code in this case.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4580
Differential Revision: D10513849
Pulled By: yiwu-arbug
fbshipit-source-id: 80bca920d1d7a3541149de981015ad83e0aa14b5
Summary:
Couple of very minor improvements (typos in comments, full qualification of class name, reordering members of a struct to make it smaller)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4564
Differential Revision: D10510183
Pulled By: maysamyabandeh
fbshipit-source-id: c7ddf9bfbf2db08cd31896c3fd93789d3fa68c8b
Summary:
When `MockTimeEnv` is used in test to mock time methods, we cannot use `CondVar::TimedWait` because it is using real time, not the mocked time for wait timeout. On Mac the method can return immediately without awaking other waiting threads, if the real time is larger than `wait_until` (which is a mocked time). When that happen, the `wait()` method will fall into an infinite loop.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4560
Differential Revision: D10472851
Pulled By: yiwu-arbug
fbshipit-source-id: 898902546ace7db7ac509337dd8677a527209d19
Summary:
Current `log::Reader` does not perform retry after encountering `EOF`. In the future, we need the log reader to be able to retry tailing the log even after `EOF`.
Current implementation is simple. It does not provide more advanced retry policies. Will address this in the future.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4394
Differential Revision: D9926508
Pulled By: riversand963
fbshipit-source-id: d86d145792a41bd64a72f642a2a08c7b7b5201e1
Summary:
Using const string& can avoid one extra string copy. This PR addresses a recent comment made by siying on #3933.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4491
Differential Revision: D10381211
Pulled By: riversand963
fbshipit-source-id: 27fc2d65d84bc7cd07833c77cdc47f06dcfaeb31
Summary:
We would like to collect file-system-level statistics including file name, offset, length, return code, latency, etc., which requires to add callbacks to intercept file IO function calls when RocksDB is running.
To collect file-system-level statistics, users can inherit the class `EventListener`, as in `TestFileOperationListener `. Note that `TestFileOperationListener::ShouldBeNotifiedOnFileIO()` returns true.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/3933
Differential Revision: D10219571
Pulled By: riversand963
fbshipit-source-id: 7acc577a2d31097766a27adb6f78eaf8b1e8ff15
Summary:
Closes https://github.com/facebook/rocksdb/issues/4462
I'm not sure if you'll be happy with `std::random_device{}`, perhaps you would want to use your rand instance instead. I didn't test to see if your rand instance supports the requirements that `std::shuffle` takes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4482
Differential Revision: D10325133
Pulled By: yiwu-arbug
fbshipit-source-id: 47b7adaf4bb2b8d64cf090ea6b1b48ef53180581
Summary:
The original logic was assuming that the only architectures that the code would build for on Windows were x86 and x64. This change will enable building for arm32 on Windows as well.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4349
Differential Revision: D10280887
Pulled By: sagar0
fbshipit-source-id: 9ca0bede25505d22e13acf916d38aeeaaf5d981a
Summary:
It also renames InstallMemtableFlushResults to MaybeInstallMemtableFlushResults to clarify its contract.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4464
Differential Revision: D10224918
Pulled By: maysamyabandeh
fbshipit-source-id: 04e3f2d8542002cb9f8010cb436f5152751b3cbe
Summary:
The old error message was misleading because it led people to believe the truncation operation failed.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4454
Differential Revision: D10203575
Pulled By: riversand963
fbshipit-source-id: c76482a132566635cb55d4c73d45c461f295ec43
Summary:
This is a conceptually simple change, but it touches many files to
pass the allocator through function calls.
We introduce CacheAllocator, which can be used by clients to configure
custom allocator for cache blocks. Our motivation is to hook this up
with folly's `JemallocNodumpAllocator`
(f43ce6d686/folly/experimental/JemallocNodumpAllocator.h),
but there are many other possible use cases.
Additionally, this commit cleans up memory allocation in
`util/compression.h`, making sure that all allocations are wrapped in a
unique_ptr as soon as possible.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4437
Differential Revision: D10132814
Pulled By: yiwu-arbug
fbshipit-source-id: be1343a4b69f6048df127939fea9bbc96969f564
Summary:
Introduce `RepeatableThread` utility to run task periodically in a separate thread. It is basically the same as the the same class in fbcode, and in addition provide a helper method to let tests mock time and trigger execution one at a time.
We can use this class to replace `TimerQueue` in #4382 and `BlobDB`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4423
Differential Revision: D10020932
Pulled By: yiwu-arbug
fbshipit-source-id: 3616bef108c39a33c92eedb1256de424b7c04087
Summary:
Fix IO error on read not being handle and crashing the DB. With the fix we properly return the error.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4410
Differential Revision: D9979246
Pulled By: yiwu-arbug
fbshipit-source-id: 111a85675067a29c03cb60e9a34103f4ff636694
Summary:
1. Add override keyword to overridden virtual functions in EventListener
2. Fix a memory corruption that can happen during DB shutdown when in
read-only mode due to a background write error
3. Fix uninitialized buffers in error_handler_test.cc that cause
valgrind to complain
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4375
Differential Revision: D9875779
Pulled By: anand1976
fbshipit-source-id: 022ede1edc01a9f7e21ecf4c61ef7d46545d0640
Summary:
This commit implements automatic recovery from a Status::NoSpace() error
during background operations such as write callback, flush and
compaction. The broad design is as follows -
1. Compaction errors are treated as soft errors and don't put the
database in read-only mode. A compaction is delayed until enough free
disk space is available to accomodate the compaction outputs, which is
estimated based on the input size. This means that users can continue to
write, and we rely on the WriteController to delay or stop writes if the
compaction debt becomes too high due to persistent low disk space
condition
2. Errors during write callback and flush are treated as hard errors,
i.e the database is put in read-only mode and goes back to read-write
only fater certain recovery actions are taken.
3. Both types of recovery rely on the SstFileManagerImpl to poll for
sufficient disk space. We assume that there is a 1-1 mapping between an
SFM and the underlying OS storage container. For cases where multiple
DBs are hosted on a single storage container, the user is expected to
allocate a single SFM instance and use the same one for all the DBs. If
no SFM is specified by the user, DBImpl::Open() will allocate one, but
this will be one per DB and each DB will recover independently. The
recovery implemented by SFM is as follows -
a) On the first occurance of an out of space error during compaction,
subsequent
compactions will be delayed until the disk free space check indicates
enough available space. The required space is computed as the sum of
input sizes.
b) The free space check requirement will be removed once the amount of
free space is greater than the size reserved by in progress
compactions when the first error occured
c) If the out of space error is a hard error, a background thread in
SFM will poll for sufficient headroom before triggering the recovery
of the database and putting it in write-only mode. The headroom is
calculated as the sum of the write_buffer_size of all the DB instances
associated with the SFM
4. EventListener callbacks will be called at the start and completion of
automatic recovery. Users can disable the auto recov ery in the start
callback, and later initiate it manually by calling DB::Resume()
Todo:
1. More extensive testing
2. Add disk full condition to db_stress (follow-on PR)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4164
Differential Revision: D9846378
Pulled By: anand1976
fbshipit-source-id: 80ea875dbd7f00205e19c82215ff6e37da10da4a