Summary:
The max batch size that we can write to the WAL is controlled by a static manner. So if the leader write is less than 128 KB we will have the batch size as leader write size + 128 KB else the limit will be 1 MB. Both of them are statically defined.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5759
Differential Revision: D17329298
fbshipit-source-id: a3d910629d8d8ca84ea39ad89c2b2d284571ded5
Summary:
Use delete to disable automatic generated methods instead of private, and put the constructor together for more clear.This modification cause the unused field warning, so add unused attribute to disable this warning.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5009
Differential Revision: D17288733
fbshipit-source-id: 8a767ce096f185f1db01bd28fc88fef1cdd921f3
Summary:
Adding a light weight API to get last live WAL file name and size. Meant to be used as a helper for backup/restore tooling in a larger ecosystem such as MySQL with a MyRocks storage engine.
Specifically within MySQL's backup/restore mechanism, this call can be made with a write lock on the mysql db to get a transactionally consistent snapshot of the current WAL file position along with other non-rocksdb log/data files.
Without this, the alternative would be to take the aforementioned lock, scan the WAL dir for all files, find the last file and note its exact size as the rocksdb 'checkpoint'.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5765
Differential Revision: D17172717
Pulled By: affandar
fbshipit-source-id: f2fabafd4c0e6fc45f126670c8c88a9f84cb8a37
Summary:
Each DB has a globally unique ID. A DB can be physically copied around, or backed-up and restored, and the users should be identify the same DB. This unique ID right now is stored as plain text in file IDENTITY under the DB directory. This approach introduces at least two problems: (1) the file is not checksumed; (2) the source of truth of a DB is the manifest file, which can be copied separately from IDENTITY file, causing the DB ID to be wrong.
The goal of this PR is solve this problem by moving the DB ID to manifest. To begin with we will write to both identity file and manifest. Write to Manifest is controlled via the flag write_dbid_to_manifest in Options and default is false.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5725
Test Plan: Added unit tests.
Differential Revision: D16963840
Pulled By: vjnadimpalli
fbshipit-source-id: 8a86a4c8c82c716003c40fd6b9d2d758030d92e9
Summary:
MyRocks currently sets `max_write_buffer_number_to_maintain` in order to maintain enough history for transaction conflict checking. The effectiveness of this approach depends on the size of memtables. When memtables are small, it may not keep enough history; when memtables are large, this may consume too much memory.
We are proposing a new way to configure memtable list history: by limiting the memory usage of immutable memtables. The new option is `max_write_buffer_size_to_maintain` and it will take precedence over the old `max_write_buffer_number_to_maintain` if they are both set to non-zero values. The new option accounts for the total memory usage of flushed immutable memtables and mutable memtable. When the total usage exceeds the limit, RocksDB may start dropping immutable memtables (which is also called trimming history), starting from the oldest one.
The semantics of the old option actually works both as an upper bound and lower bound. History trimming will start if number of immutable memtables exceeds the limit, but it will never go below (limit-1) due to history trimming.
In order the mimic the behavior with the new option, history trimming will stop if dropping the next immutable memtable causes the total memory usage go below the size limit. For example, assuming the size limit is set to 64MB, and there are 3 immutable memtables with sizes of 20, 30, 30. Although the total memory usage is 80MB > 64MB, dropping the oldest memtable will reduce the memory usage to 60MB < 64MB, so in this case no memtable will be dropped.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5022
Differential Revision: D14394062
Pulled By: miasantreble
fbshipit-source-id: 60457a509c6af89d0993f988c9b5c2aa9e45f5c5
Summary:
Recently readahead is introduced for checksum verifying. However, users cannot override the setting for the checksum verifying before external SST file ingestion. Introduce a new option for the purpose.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5721
Test Plan: Add a new unit test for it.
Differential Revision: D16906896
fbshipit-source-id: 218ec37001ddcc05411cefddbe233d15ab308476
Summary:
Right now VerifyChecksum() doesn't do read-ahead. In some use cases, users won't be able to achieve good performance. With this change, by default, RocksDB will do a default readahead, and users will be able to overwrite the readahead size by passing in a ReadOptions.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5713
Test Plan: Add a new unit test.
Differential Revision: D16860874
fbshipit-source-id: 0cff0fe79ac855d3d068e6ccd770770854a68413
Summary:
Add a command in ldb so that users can print out tombstones in SST files.
In order to test the code, change the interface of LDBCommandRunner::RunCommand() so that it doesn't return from the program, but return the status code.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5615
Test Plan: Add a new unit test
Differential Revision: D16550326
fbshipit-source-id: 88ddfe6984bdcbb3a528abdd115089df09eba52e
Summary:
The changes transaction_test to set `txn_db_options.default_write_batch_flush_threshold = 1` in order to give better test coverage for WriteUnprepared.
As part of the change, some tests had to be updated.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5658
Differential Revision: D16740468
Pulled By: lth
fbshipit-source-id: 3821eec20baf13917c8c1fab444332f75a509de9
Summary:
This is a new API added to db.h to allow for fetching all merge operands associated with a Key. The main motivation for this API is to support use cases where doing a full online merge is not necessary as it is performance sensitive. Example use-cases:
1. Update subset of columns and read subset of columns -
Imagine a SQL Table, a row is encoded as a K/V pair (as it is done in MyRocks). If there are many columns and users only updated one of them, we can use merge operator to reduce write amplification. While users only read one or two columns in the read query, this feature can avoid a full merging of the whole row, and save some CPU.
2. Updating very few attributes in a value which is a JSON-like document -
Updating one attribute can be done efficiently using merge operator, while reading back one attribute can be done more efficiently if we don't need to do a full merge.
----------------------------------------------------------------------------------------------------
API :
Status GetMergeOperands(
const ReadOptions& options, ColumnFamilyHandle* column_family,
const Slice& key, PinnableSlice* merge_operands,
GetMergeOperandsOptions* get_merge_operands_options,
int* number_of_operands)
Example usage :
int size = 100;
int number_of_operands = 0;
std::vector<PinnableSlice> values(size);
GetMergeOperandsOptions merge_operands_info;
db_->GetMergeOperands(ReadOptions(), db_->DefaultColumnFamily(), "k1", values.data(), merge_operands_info, &number_of_operands);
Description :
Returns all the merge operands corresponding to the key. If the number of merge operands in DB is greater than merge_operands_options.expected_max_number_of_operands no merge operands are returned and status is Incomplete. Merge operands returned are in the order of insertion.
merge_operands-> Points to an array of at-least merge_operands_options.expected_max_number_of_operands and the caller is responsible for allocating it. If the status returned is Incomplete then number_of_operands will contain the total number of merge operands found in DB for key.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5604
Test Plan:
Added unit test and perf test in db_bench that can be run using the command:
./db_bench -benchmarks=getmergeoperands --merge_operator=sortlist
Differential Revision: D16657366
Pulled By: vjnadimpalli
fbshipit-source-id: 0faadd752351745224ee12d4ae9ef3cb529951bf
Summary:
if read_options.snapshot is not set, ::Get will take the last sequence number after taking a super-version and uses that as the sequence number. Theoretically max_eviceted_seq_ could advance this sequence number. This could lead ::IsInSnapshot that will be invoked by the ReadCallback to notice the absence of the snapshot. In this case, the ReadCallback should have passed a non-value to snap_released so that it could be set by the ::IsInSnapshot. The patch does that, and adds a unit test to verify it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5664
Differential Revision: D16614033
Pulled By: maysamyabandeh
fbshipit-source-id: 06fb3fd4aacd75806ed1a1acec7961f5d02486f2
Summary:
Add savepoint support when the current transaction has flushed unprepared batches.
Rolling back to savepoint is similar to rolling back a transaction. It requires the set of keys that have changed since the savepoint, re-reading the keys at the snapshot at that savepoint, and the restoring the old keys by writing out another unprepared batch.
For this strategy to work though, we must be capable of reading keys at a savepoint. This does not work if keys were written out using the same sequence number before and after a savepoint. Therefore, when we flush out unprepared batches, we must split the batch by savepoint if any savepoints exist.
eg. If we have the following:
```
Put(A)
Put(B)
Put(C)
SetSavePoint()
Put(D)
Put(E)
SetSavePoint()
Put(F)
```
Then we will write out 3 separate unprepared batches:
```
Put(A) 1
Put(B) 1
Put(C) 1
Put(D) 2
Put(E) 2
Put(F) 3
```
This is so that when we rollback to eg. the first savepoint, we can just read keys at snapshot_seq = 1.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5627
Differential Revision: D16584130
Pulled By: lth
fbshipit-source-id: 6d100dd548fb20c4b76661bd0f8a2647e64477fa
Summary:
In some cases, we don't have to get really accurate number. Something like 10% off is fine, we can create a new option for that use case. In this case, we can calculate size for full files first, and avoid estimation inside SST files if full files got us a huge number. For example, if we already covered 100GB of data, we should be able to skip partial dives into 10 SST files of 30MB.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5609
Differential Revision: D16433481
Pulled By: elipoz
fbshipit-source-id: 5830b31e1c656d0fd3a00d7fd2678ddc8f6e601b
Summary:
Master branch had been left at 6.2 and history of 6.3 and beyond were merged. Updated this to correct.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5652
Differential Revision: D16570498
Pulled By: gfosco
fbshipit-source-id: 79f62ec570539a3e3d7d7c84a6cf7b722395fafe
Summary:
The ssize_t type was introduced in https://github.com/facebook/rocksdb/pull/5633, but it seems like it's a POSIX specific type.
I just need a signed type to represent number of bytes, so use int64_t instead. It seems like we have a typedef from SSIZE_T for Windows, but it doesn't seem like we ever include "port/port.h" in our public header files.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5638
Differential Revision: D16526269
Pulled By: lth
fbshipit-source-id: 8d3a5c41003951b74b29bc5f1d949b2b22da0cee
Summary:
Instead of reusing `TransactionOptions::max_write_batch_size` for determining when to flush a write batch for write unprepared, add a new variable called `write_batch_flush_threshold` for this use case instead.
Also add `TransactionDBOptions::default_write_batch_flush_threshold` which sets the default value if `TransactionOptions::write_batch_flush_threshold` is unspecified.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5633
Differential Revision: D16520364
Pulled By: lth
fbshipit-source-id: d75ae5a2141ce7708982d5069dc3f0b58d250e8c
Summary:
The new DB::GetApproximateSizes with SizeApproximationOptions argument, which allows to add more options/knobs to the DB::GetApproximateSizes call (beyond only the include_flags)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5626
Differential Revision: D16496913
Pulled By: elipoz
fbshipit-source-id: ee8c6c182330a285fa056ecfc3905a592b451720
Summary:
In previous https://github.com/facebook/rocksdb/issues/5079, we added user-specified timestamp to `DB::Get()` and `DB::Put()`. Limitation is that these two functions may cause extra memory allocation and key copy. The reason is that `WriteBatch` does not allocate extra memory for timestamps because it is not aware of timestamp size, and we did not provide an API to assign/update timestamp of each key within a `WriteBatch`.
We address these issues in this PR by doing the following.
1. Add a `timestamp_size_` to `WriteBatch` so that `WriteBatch` can take timestamps into account when calling `WriteBatch::Put`, `WriteBatch::Delete`, etc.
2. Add APIs `WriteBatch::AssignTimestamp` and `WriteBatch::AssignTimestamps` so that application can assign/update timestamps for each key in a `WriteBatch`.
3. Avoid key copy in `GetImpl` by adding new constructor to `LookupKey`.
Test plan (on devserver):
```
$make clean && COMPILE_WITH_ASAN=1 make -j32 all
$./db_basic_test --gtest_filter=Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/*
$make check
```
If the API extension looks good, I will add more unit tests.
Some simple benchmark using db_bench.
```
$rm -rf /dev/shm/dbbench/* && TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillseq,readrandom -num=1000000
$rm -rf /dev/shm/dbbench/* && TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=1000000 -disable_wal=true
```
Master is at a78503bd6c.
```
| | readrandom | fillrandom |
| master | 15.53 MB/s | 25.97 MB/s |
| PR5502 | 16.70 MB/s | 25.80 MB/s |
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5502
Differential Revision: D16340894
Pulled By: riversand963
fbshipit-source-id: 51132cf792be07d1efc3ac33f5768c4ee2608bb8
Summary:
The ::snap_refresh_nanos option is incompatible with DeleteRange feature. Currently the code relies on range_del_agg.IsEmpty() to disable it if there are range delete tombstones. However ::IsEmpty does not guarantee that there is no RangeDelete tombstones in the SST files. The patch declares the two features incompatible in inline comments until we later figure how to properly detect the presence of RangeDelete tombstones in compaction inputs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5625
Differential Revision: D16468218
Pulled By: maysamyabandeh
fbshipit-source-id: bd7beca278bc7e1db75e7ee4522d05a3a6ca86f4
Summary:
The ObjectRegistry class replaces the Registrar and NewCustomObjects. Objects are registered with the registry by Type (the class must implement the static const char *Type() method).
This change is necessary for a few reasons:
- By having a class (rather than static template instances), the class can be passed between compilation units, meaning that objects could be registered and shared from a dynamic library with an executable.
- By having a class with instances, different units could have different objects registered. This could be useful if, for example, one Option allowed for a dynamic library and one did not.
When combined with some other PRs (being able to load shared libraries, a Configurable interface to configure objects to/from string), this code will allow objects in external shared libraries to be added to a RocksDB image at run-time, rather than requiring every new extension to be built into the main library and called explicitly by every program.
Test plan (on riversand963's devserver)
```
$COMPILE_WITH_ASAN=1 make -j32 all && sleep 1 && make check
```
All tests pass.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5293
Differential Revision: D16363396
Pulled By: riversand963
fbshipit-source-id: fbe4acb615bfc11103eef40a0b288845791c0180
Summary:
RocksDB has historically stored uncompression dictionary objects in the block
cache as opposed to storing just the block contents. This neccesitated
evicting the object upon table close. With the new code, only the raw blocks
are stored in the cache, eliminating the need for eviction.
In addition, the patch makes the following improvements:
1) Compression dictionary blocks are now prefetched/pinned similarly to
index/filter blocks.
2) A copy operation got eliminated when the uncompression dictionary is
retrieved.
3) Errors related to retrieving the uncompression dictionary are propagated as
opposed to silently ignored.
Note: the patch temporarily breaks the compression dictionary evicition stats.
They will be fixed in a separate phase.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5584
Test Plan: make asan_check
Differential Revision: D16344151
Pulled By: ltamasi
fbshipit-source-id: 2962b295f5b19628f9da88a3fcebbce5a5017a7b
Summary:
There are concerns about the correctness of this patch. Disabling by default until the concerns are resolved.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5606
Differential Revision: D16428064
Pulled By: maysamyabandeh
fbshipit-source-id: a89280f0ea85796c9c9dfbfd9a8e91dad9b000b3
Summary:
Added log_readahead_size option to control prefetching for Log::Reader.
This is mostly useful for reading a remotely located log, as it can save the number of round-trips when reading it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5592
Differential Revision: D16362989
Pulled By: elipoz
fbshipit-source-id: c5d4d5245a44008cd59879640efff70c091ad3e8
Summary:
Refresh of the earlier change here - https://github.com/facebook/rocksdb/issues/5135
This is a review request for code change needed for - https://github.com/facebook/rocksdb/issues/3469
"Add support for taking snapshot of a column family and creating column family from a given CF snapshot"
We have an implementation for this that we have been testing internally. We have two new APIs that together provide this functionality.
(1) ExportColumnFamily() - This API is modelled after CreateCheckpoint() as below.
// Exports all live SST files of a specified Column Family onto export_dir,
// returning SST files information in metadata.
// - SST files will be created as hard links when the directory specified
// is in the same partition as the db directory, copied otherwise.
// - export_dir should not already exist and will be created by this API.
// - Always triggers a flush.
virtual Status ExportColumnFamily(ColumnFamilyHandle* handle,
const std::string& export_dir,
ExportImportFilesMetaData** metadata);
Internally, the API will DisableFileDeletions(), GetColumnFamilyMetaData(), Parse through
metadata, creating links/copies of all the sst files, EnableFileDeletions() and complete the call by
returning the list of file metadata.
(2) CreateColumnFamilyWithImport() - This API is modeled after IngestExternalFile(), but invoked only during a CF creation as below.
// CreateColumnFamilyWithImport() will create a new column family with
// column_family_name and import external SST files specified in metadata into
// this column family.
// (1) External SST files can be created using SstFileWriter.
// (2) External SST files can be exported from a particular column family in
// an existing DB.
// Option in import_options specifies whether the external files are copied or
// moved (default is copy). When option specifies copy, managing files at
// external_file_path is caller's responsibility. When option specifies a
// move, the call ensures that the specified files at external_file_path are
// deleted on successful return and files are not modified on any error
// return.
// On error return, column family handle returned will be nullptr.
// ColumnFamily will be present on successful return and will not be present
// on error return. ColumnFamily may be present on any crash during this call.
virtual Status CreateColumnFamilyWithImport(
const ColumnFamilyOptions& options, const std::string& column_family_name,
const ImportColumnFamilyOptions& import_options,
const ExportImportFilesMetaData& metadata,
ColumnFamilyHandle** handle);
Internally, this API creates a new CF, parses all the sst files and adds it to the specified column family, at the same level and with same sequence number as in the metadata. Also performs safety checks with respect to overlaps between the sst files being imported.
If incoming sequence number is higher than current local sequence number, local sequence
number is updated to reflect this.
Note, as the sst files is are being moved across Column Families, Column Family name in sst file
will no longer match the actual column family on destination DB. The API does not modify Column
Family name or id in the sst files being imported.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5495
Differential Revision: D16018881
fbshipit-source-id: 9ae2251025d5916d35a9fc4ea4d6707f6be16ff9
Summary:
Current PosixLogger performs IO operations using posix calls. Thus the
current implementation will not work for non-posix env. Created a new
logger class EnvLogger that uses env specific WritableFileWriter for IO operations.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5491
Test Plan: make check
Differential Revision: D15909002
Pulled By: ggaurav28
fbshipit-source-id: 13a8105176e8e42db0c59798d48cb6a0dbccc965
Summary:
Right now ldb can open running DB through read-only DB. However, it might leave info logs files to the read-only DB directory. Add an option to open the DB as secondary to avoid it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5537
Test Plan:
Run
./ldb scan --max_keys=10 --db=/tmp/rocksdbtest-2491/dbbench --secondary_path=/tmp --no_value --hex
and
./ldb get 0x00000000000000103030303030303030 --hex --db=/tmp/rocksdbtest-2491/dbbench --secondary_path=/tmp
against a normal db_bench run and observe the output changes. Also observe that no new info logs files are created under /tmp/rocksdbtest-2491/dbbench.
Run without --secondary_path and observe that new info logs created under /tmp/rocksdbtest-2491/dbbench.
Differential Revision: D16113886
fbshipit-source-id: 4e09dec47c2528f6ca08a9e7a7894ba2d9daebbb
Summary:
Previously `GetAllKeyVersions()` supports default column family only. This PR add support for other column families.
Test plan (devserver):
```
$make clean && COMPILE_WITH_ASAN=1 make -j32 db_basic_test
$./db_basic_test --gtest_filter=DBBasicTest.GetAllKeyVersions
```
All other unit tests must pass.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5544
Differential Revision: D16147551
Pulled By: riversand963
fbshipit-source-id: 5a61aece2a32d789e150226a9b8d53f4a5760168
Summary:
Mid-point insertion is a useful feature and is mature now. Make it default. Also changed cache_index_and_filter_blocks_with_high_priority=true as default accordingly, so that we won't evict index and filter blocks easier after the change, to avoid too many surprises to users.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5508
Test Plan: Run all existing tests.
Differential Revision: D16021179
fbshipit-source-id: ce8456e8d43b3bfb48df6c304b5290a9d19817eb
Summary:
Add C binding for secondary instance as well as unit test.
Test plan (on devserver)
```
$make clean && COMPILE_WITH_ASAN=1 make -j20 all
$./c_test
$make check
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5505
Differential Revision: D16000043
Pulled By: riversand963
fbshipit-source-id: 3361ef6bfdf4ce12438cee7290a0ac203b5250bd
Summary:
The first key is used to defer reading the data block until this file gets to the top of merging iterator's heap. For short range scans, most files never make it to the top of the heap, so this change can reduce read amplification by a lot sometimes.
Consider the following workload. There are a few data streams (we'll be calling them "logs"), each stream consisting of a sequence of blobs (we'll be calling them "records"). Each record is identified by log ID and a sequence number within the log. RocksDB key is concatenation of log ID and sequence number (big endian). Reads are mostly relatively short range scans, each within a single log. Writes are mostly sequential for each log, but writes to different logs are randomly interleaved. Compactions are disabled; instead, when we accumulate a few tens of sst files, we create a new column family and start writing to it.
So, a typical sst file consists of a few ranges of blocks, each range corresponding to one log ID (we use FlushBlockPolicy to cut blocks at log boundaries). A typical read would go like this. First, iterator Seek() reads one block from each sst file. Then a series of Next()s move through one sst file (since writes to each log are mostly sequential) until the subiterator reaches the end of this log in this sst file; then Next() switches to the next sst file and reads sequentially from that, and so on. Often a range scan will only return records from a small number of blocks in small number of sst files; in this case, the cost of initial Seek() reading one block from each file may be bigger than the cost of reading the actually useful blocks.
Neither iterate_upper_bound nor bloom filters can prevent reading one block from each file in Seek(). But this PR can: if the index contains first key from each block, we don't have to read the block until this block actually makes it to the top of merging iterator's heap, so for short range scans we won't read any blocks from most of the sst files.
This PR does the deferred block loading inside value() call. This is not ideal: there's no good way to report an IO error from inside value(). As discussed with siying offline, it would probably be better to change InternalIterator's interface to explicitly fetch deferred value and get status. I'll do it in a separate PR.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5289
Differential Revision: D15256423
Pulled By: al13n321
fbshipit-source-id: 750e4c39ce88e8d41662f701cf6275d9388ba46a
Summary:
Currently the read-ahead logic for user reads and compaction reads go through different code paths where compaction reads create new table readers and use `ReadaheadRandomAccessFile`. This change is to unify read-ahead logic to use read-ahead in BlockBasedTableReader::InitDataBlock(). As a result of the change `ReadAheadRandomAccessFile` class and `new_table_reader_for_compaction_inputs` option will no longer be used.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5431
Test Plan:
make check
Here is the benchmarking - https://gist.github.com/vjnadimpalli/083cf423f7b6aa12dcdb14c858bc18a5
Differential Revision: D15772533
Pulled By: vjnadimpalli
fbshipit-source-id: b71dca710590471ede6fb37553388654e2e479b9
Summary:
This PR adds a BlockCacheTraceSimulator that reports the miss ratios given different cache configurations. A cache configuration contains "cache_name,num_shard_bits,cache_capacities". For example, "lru, 1, 1K, 2K, 4M, 4G".
When we replay the trace, we also perform lookups and inserts on the simulated caches.
In the end, it reports the miss ratio for each tuple <cache_name, num_shard_bits, cache_capacity> in a output file.
This PR also adds a main source block_cache_trace_analyzer so that we can run the analyzer in command line.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5449
Test Plan:
Added tests for block_cache_trace_analyzer.
COMPILE_WITH_ASAN=1 make check -j32.
Differential Revision: D15797073
Pulled By: HaoyuHuang
fbshipit-source-id: aef0c5c2e7938f3e8b6a10d4a6a50e6928ecf408
Summary:
This PR continues the work in https://github.com/facebook/rocksdb/pull/4748 and https://github.com/facebook/rocksdb/pull/4535 by adding a new DBOption `persist_stats_to_disk` which instructs RocksDB to persist stats history to RocksDB itself. When statistics is enabled, and both options `stats_persist_period_sec` and `persist_stats_to_disk` are set, RocksDB will periodically write stats to a built-in column family in the following form: key -> (timestamp in microseconds)#(stats name), value -> stats value. The existing API `GetStatsHistory` will detect the current value of `persist_stats_to_disk` and either read from in-memory data structure or from the hidden column family on disk.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5046
Differential Revision: D15863138
Pulled By: miasantreble
fbshipit-source-id: bb82abdb3f2ca581aa42531734ac799f113e931b
Summary:
This PR integrates the block cache tracer class into db_impl.cc.
db_impl.cc contains a member variable of AtomicBlockCacheTraceWriter class and passes its reference to the block_based_table_reader.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5433
Differential Revision: D15728016
Pulled By: HaoyuHuang
fbshipit-source-id: 23d5659e8c82d556833dcc1a5558aac8c1f7db71
Summary:
When using `PRIu64` type of printf specifier, current code base does the following:
```
#ifndef __STDC_FORMAT_MACROS
#define __STDC_FORMAT_MACROS
#endif
#include <inttypes.h>
```
However, this can be simplified to
```
#include <cinttypes>
```
as long as flag `-std=c++11` is used.
This should solve issues like https://github.com/facebook/rocksdb/issues/5159
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5402
Differential Revision: D15701195
Pulled By: miasantreble
fbshipit-source-id: 6dac0a05f52aadb55e9728038599d3d2e4b59d03
Summary:
It's useful to be able to (optionally) associate key-value pairs with user-provided timestamps. This PR is an early effort towards this goal and continues the work of facebook#4942. A suite of new unit tests exist in DBBasicTestWithTimestampWithParam. Support for timestamp requires the user to provide timestamp as a slice in `ReadOptions` and `WriteOptions`. All timestamps of the same database must share the same length, format, etc. The format of the timestamp is the same throughout the same database, and the user is responsible for providing a comparator function (Comparator) to order the <key, timestamp> tuples. Once created, the format and length of the timestamp cannot change (at least for now).
Test plan (on devserver):
```
$COMPILE_WITH_ASAN=1 make -j32 all
$./db_basic_test --gtest_filter=Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/*
$make check
```
All tests must pass.
We also run the following db_bench tests to verify whether there is regression on Get/Put while timestamp is not enabled.
```
$TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillseq,readrandom -num=1000000
$TEST_TMPDIR=/dev/shm ./db_bench -benchmarks=fillrandom -num=1000000
```
Repeat for 6 times for both versions.
Results are as follows:
```
| | readrandom | fillrandom |
| master | 16.77 MB/s | 47.05 MB/s |
| PR5079 | 16.44 MB/s | 47.03 MB/s |
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5079
Differential Revision: D15132946
Pulled By: riversand963
fbshipit-source-id: 833a0d657eac21182f0f206c910a6438154c742c
Summary:
Previous code has a warning when compile with tsan, leading to an error since we have -Werror.
Compilation result
```
In file included from ./env/env_chroot.h:12,
from env/env_test.cc:40:
./include/rocksdb/env.h: In instantiation of ‘rocksdb::Status rocksdb::DynamicLibrary::LoadFunction(const string&, std::function<T>*) [with T = void*(void*, const char*); std::__cxx11::string = std::__cxx11::basic_string<char>]’:
env/env_test.cc:260:5: required from here
./include/rocksdb/env.h:1010:17: error: cast between incompatible function types from ‘rocksdb::DynamicLibrary::FunctionPtr’ {aka ‘void* (*)()’} to ‘void* (*)(void*, const char*)’ [-Werror=cast-function-type]
*function = reinterpret_cast<T*>(ptr);
^~~~~~~~~~~~~~~~~~~~~~~~~
cc1plus: all warnings being treated as errors
make: *** [env/env_test.o] Error 1
```
It also has another error reported by clang
```
env/env_posix.cc:141:11: warning: Value stored to 'err' during its initialization is never read
char* err = dlerror(); // Clear any old error
^~~ ~~~~~~~~~
1 warning generated.
```
Test plan (on my devserver).
```
$make clean
$OPT=-g ROCKSDB_FBCODE_BUILD_WITH_PLATFORM007=1 COMPILE_WITH_TSAN=1 make -j32
$
$make clean
$USE_CLANG=1 TEST_TMPDIR=/dev/shm/rocksdb OPT=-g make -j1 analyze
```
Both should pass.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5414
Differential Revision: D15637315
Pulled By: riversand963
fbshipit-source-id: 8e307483761019a4d5998cab92d49516d7edffbf
Summary:
Define the Env:: MultiRead() method to allow callers to request multiple block reads in one shot. The underlying Env implementation can parallelize it if it chooses to in order to reduce the overall IO latency.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5311
Differential Revision: D15502172
Pulled By: anand1976
fbshipit-source-id: 2b228269c2e11b5f54694d6b2bb3119c8a8ce2b9
Summary:
This change adds a Dynamic Library class to the RocksDB Env. Dynamic libraries are populated via the Env::LoadLibrary method.
The addition of dynamic library support allows for a few different features to be developed:
1. The compression code can be changed to use dynamic library support. This would allow RocksDB to determine at run-time what compression packages were installed. This change would eliminate the need to make sure the build-time and run-time environment had the same library set. It would also simplify some of the Java build issues (where it attempts to build and include various packages inside the RocksDB jars).
2. Along with other features (to be provided in a subsequent PR), this change would allow code/configurations to be added to RocksDB at run-time. For example, the build system includes code for building an "rados" environment and adding "Cassandra" features. Instead of these extensions being built into the base RocksDB code, these extensions could be loaded at run-time as required/appropriate, either by configuration or explicitly.
We intend to push out other changes in support of the extending RocksDB at run-time via configurations.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5281
Differential Revision: D15447613
Pulled By: riversand963
fbshipit-source-id: 452cd4f54511c0bceee18f6d9d919aae9fd25fef
Summary:
This enables the user to set TransactionDBOptions::skip_concurrency_control so the standard `DB::Write(const WriteOptions& opts, WriteBatch* updates)` would skip the concurrency control. This would give higher throughput to the users who know their use case doesn't need concurrency control.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5330
Differential Revision: D15525932
Pulled By: maysamyabandeh
fbshipit-source-id: 68421ac1ba34f549a4a8de9ce4c2dccf6fb4b06b
Summary:
RocksDB always tries to perform a hard link operation on the external SST file to ingest. This operation can fail if the external SST resides on a different device/FS, or the underlying FS does not support hard link. Currently RocksDB assumes that if the link fails, the user is willing to perform file copy, which is not true according to the post. This commit provides an option named 'failed_move_fall_back_to_copy' for users to choose which behavior they want.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5333
Differential Revision: D15457597
Pulled By: HaoyuHuang
fbshipit-source-id: f3626e13f845db4f7ed970a53ec8a2b1f0d62214
Summary:
WritePrepared transactions when configured with two_write_queues=true offers higher throughput with unordered_write feature without however compromising the rocksdb guarantees. This is because it performs ordering among writes in a 2nd step that is not tied to memtable write speed. The 2nd step is naturally provided by 2PC when the commit phase does the ordering as well. Without 2PC, the 2nd step would only be provided when we use two_write_queues=true, where WritePrepared after performing the writes, in a 2nd step uses the 2nd queue to assign order to the writes.
The patch clarifies the need for two_write_queues=true in the HISTORY and inline comments of unordered_writes. Moreover it extends the stress tests of WritePrepared to unordred_write.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5313
Differential Revision: D15379977
Pulled By: maysamyabandeh
fbshipit-source-id: 5b6f05b9b59285dcbf3b0532215ba9fe7d926e00
Summary:
Performing unordered writes in rocksdb when unordered_write option is set to true. When enabled the writes to memtable are done without joining any write thread. This offers much higher write throughput since the upcoming writes would not have to wait for the slowest memtable write to finish. The tradeoff is that the writes visible to a snapshot might change over time. If the application cannot tolerate that, it should implement its own mechanisms to work around that. Using TransactionDB with WRITE_PREPARED write policy is one way to achieve that. Doing so increases the max throughput by 2.2x without however compromising the snapshot guarantees.
The patch is prepared based on an original by siying
Existing unit tests are extended to include unordered_write option.
Benchmark Results:
```
TEST_TMPDIR=/dev/shm/ ./db_bench_unordered --benchmarks=fillrandom --threads=32 --num=10000000 -max_write_buffer_number=16 --max_background_jobs=64 --batch_size=8 --writes=3000000 -level0_file_num_compaction_trigger=99999 --level0_slowdown_writes_trigger=99999 --level0_stop_writes_trigger=99999 -enable_pipelined_write=false -disable_auto_compactions --unordered_write=1
```
With WAL
- Vanilla RocksDB: 78.6 MB/s
- WRITER_PREPARED with unordered_write: 177.8 MB/s (2.2x)
- unordered_write: 368.9 MB/s (4.7x with relaxed snapshot guarantees)
Without WAL
- Vanilla RocksDB: 111.3 MB/s
- WRITER_PREPARED with unordered_write: 259.3 MB/s MB/s (2.3x)
- unordered_write: 645.6 MB/s (5.8x with relaxed snapshot guarantees)
- WRITER_PREPARED with unordered_write disable concurrency control: 185.3 MB/s MB/s (2.35x)
Limitations:
- The feature is not yet extended to `max_successive_merges` > 0. The feature is also incompatible with `enable_pipelined_write` = true as well as with `allow_concurrent_memtable_write` = false.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5218
Differential Revision: D15219029
Pulled By: maysamyabandeh
fbshipit-source-id: 38f2abc4af8780148c6128acdba2b3227bc81759
Summary:
There were no C bindings for lowering thread pool priority. This adds those.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5285
Differential Revision: D15290050
Pulled By: siying
fbshipit-source-id: b2ed94d0c39d27434ace2204829a242b53d0d67a
Summary:
Part of compaction cpu goes to processing snapshot list, the larger the list the bigger the overhead. Although the lifetime of most of the snapshots is much shorter than the lifetime of compactions, the compaction conservatively operates on the list of snapshots that it initially obtained. This patch allows the snapshot list to be updated via a callback if the compaction is taking long. This should let the compaction to continue more efficiently with much smaller snapshot list.
For simplicity, to avoid the feature is disabled in two cases: i) When more than one sub-compaction are sharing the same snapshot list, ii) when Range Delete is used in which the range delete aggregator has its own copy of snapshot list.
This fixes the reverted https://github.com/facebook/rocksdb/pull/5099 issue with range deletes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5278
Differential Revision: D15203291
Pulled By: maysamyabandeh
fbshipit-source-id: fa645611e606aa222c7ce53176dc5bb6f259c258