|
|
|
// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
|
|
|
|
// This source code is licensed under both the GPLv2 (found in the
|
|
|
|
// COPYING file in the root directory) and Apache 2.0 License
|
|
|
|
// (found in the LICENSE.Apache file in the root directory).
|
|
|
|
//
|
|
|
|
// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
|
|
|
|
// Use of this source code is governed by a BSD-style license that can be
|
|
|
|
// found in the LICENSE file. See the AUTHORS file for names of contributors.
|
|
|
|
|
|
|
|
#include "db/version_set.h"
|
|
|
|
|
CompactFiles, EventListener and GetDatabaseMetaData
Summary:
This diff adds three sets of APIs to RocksDB.
= GetColumnFamilyMetaData =
* This APIs allow users to obtain the current state of a RocksDB instance on one column family.
* See GetColumnFamilyMetaData in include/rocksdb/db.h
= EventListener =
* A virtual class that allows users to implement a set of
call-back functions which will be called when specific
events of a RocksDB instance happens.
* To register EventListener, simply insert an EventListener to ColumnFamilyOptions::listeners
= CompactFiles =
* CompactFiles API inputs a set of file numbers and an output level, and RocksDB
will try to compact those files into the specified level.
= Example =
* Example code can be found in example/compact_files_example.cc, which implements
a simple external compactor using EventListener, GetColumnFamilyMetaData, and
CompactFiles API.
Test Plan:
listener_test
compactor_test
example/compact_files_example
export ROCKSDB_TESTS=CompactFiles
db_test
export ROCKSDB_TESTS=MetaData
db_test
Reviewers: ljin, igor, rven, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D24705
10 years ago
|
|
|
#include <stdio.h>
|
|
|
|
|
|
|
|
#include <algorithm>
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
#include <array>
|
|
|
|
#include <cinttypes>
|
|
|
|
#include <list>
|
|
|
|
#include <map>
|
|
|
|
#include <set>
|
|
|
|
#include <string>
|
|
|
|
#include <unordered_map>
|
|
|
|
#include <vector>
|
|
|
|
|
|
|
|
#include "compaction/compaction.h"
|
|
|
|
#include "db/blob/blob_file_cache.h"
|
|
|
|
#include "db/blob/blob_file_reader.h"
|
|
|
|
#include "db/blob/blob_index.h"
|
Measure file read latency histogram per level
Summary: In internal stats, remember read latency histogram, if statistics is enabled. It can be retrieved from DB::GetProperty() with "rocksdb.dbstats" property, if it is enabled.
Test Plan: Manually run db_bench and prints out "rocksdb.dbstats" by hand and make sure it prints out as expected
Reviewers: igor, IslamAbdelRahman, rven, kradhakrishnan, anthony, yhchiang
Reviewed By: yhchiang
Subscribers: MarkCallaghan, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D44193
9 years ago
|
|
|
#include "db/internal_stats.h"
|
|
|
|
#include "db/log_reader.h"
|
|
|
|
#include "db/log_writer.h"
|
|
|
|
#include "db/memtable.h"
|
|
|
|
#include "db/merge_context.h"
|
|
|
|
#include "db/merge_helper.h"
|
Introduce FullMergeV2 (eliminate memcpy from merge operators)
Summary:
This diff update the code to pin the merge operator operands while the merge operation is done, so that we can eliminate the memcpy cost, to do that we need a new public API for FullMerge that replace the std::deque<std::string> with std::vector<Slice>
This diff is stacked on top of D56493 and D56511
In this diff we
- Update FullMergeV2 arguments to be encapsulated in MergeOperationInput and MergeOperationOutput which will make it easier to add new arguments in the future
- Replace std::deque<std::string> with std::vector<Slice> to pass operands
- Replace MergeContext std::deque with std::vector (based on a simple benchmark I ran https://gist.github.com/IslamAbdelRahman/78fc86c9ab9f52b1df791e58943fb187)
- Allow FullMergeV2 output to be an existing operand
```
[Everything in Memtable | 10K operands | 10 KB each | 1 operand per key]
DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=10000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000
[FullMergeV2]
readseq : 0.607 micros/op 1648235 ops/sec; 16121.2 MB/s
readseq : 0.478 micros/op 2091546 ops/sec; 20457.2 MB/s
readseq : 0.252 micros/op 3972081 ops/sec; 38850.5 MB/s
readseq : 0.237 micros/op 4218328 ops/sec; 41259.0 MB/s
readseq : 0.247 micros/op 4043927 ops/sec; 39553.2 MB/s
[master]
readseq : 3.935 micros/op 254140 ops/sec; 2485.7 MB/s
readseq : 3.722 micros/op 268657 ops/sec; 2627.7 MB/s
readseq : 3.149 micros/op 317605 ops/sec; 3106.5 MB/s
readseq : 3.125 micros/op 320024 ops/sec; 3130.1 MB/s
readseq : 4.075 micros/op 245374 ops/sec; 2400.0 MB/s
```
```
[Everything in Memtable | 10K operands | 10 KB each | 10 operand per key]
DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=1000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000
[FullMergeV2]
readseq : 3.472 micros/op 288018 ops/sec; 2817.1 MB/s
readseq : 2.304 micros/op 434027 ops/sec; 4245.2 MB/s
readseq : 1.163 micros/op 859845 ops/sec; 8410.0 MB/s
readseq : 1.192 micros/op 838926 ops/sec; 8205.4 MB/s
readseq : 1.250 micros/op 800000 ops/sec; 7824.7 MB/s
[master]
readseq : 24.025 micros/op 41623 ops/sec; 407.1 MB/s
readseq : 18.489 micros/op 54086 ops/sec; 529.0 MB/s
readseq : 18.693 micros/op 53495 ops/sec; 523.2 MB/s
readseq : 23.621 micros/op 42335 ops/sec; 414.1 MB/s
readseq : 18.775 micros/op 53262 ops/sec; 521.0 MB/s
```
```
[Everything in Block cache | 10K operands | 10 KB each | 1 operand per key]
[FullMergeV2]
$ DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions
readseq : 14.741 micros/op 67837 ops/sec; 663.5 MB/s
readseq : 1.029 micros/op 971446 ops/sec; 9501.6 MB/s
readseq : 0.974 micros/op 1026229 ops/sec; 10037.4 MB/s
readseq : 0.965 micros/op 1036080 ops/sec; 10133.8 MB/s
readseq : 0.943 micros/op 1060657 ops/sec; 10374.2 MB/s
[master]
readseq : 16.735 micros/op 59755 ops/sec; 584.5 MB/s
readseq : 3.029 micros/op 330151 ops/sec; 3229.2 MB/s
readseq : 3.136 micros/op 318883 ops/sec; 3119.0 MB/s
readseq : 3.065 micros/op 326245 ops/sec; 3191.0 MB/s
readseq : 3.014 micros/op 331813 ops/sec; 3245.4 MB/s
```
```
[Everything in Block cache | 10K operands | 10 KB each | 10 operand per key]
DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10-operands-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions
[FullMergeV2]
readseq : 24.325 micros/op 41109 ops/sec; 402.1 MB/s
readseq : 1.470 micros/op 680272 ops/sec; 6653.7 MB/s
readseq : 1.231 micros/op 812347 ops/sec; 7945.5 MB/s
readseq : 1.091 micros/op 916590 ops/sec; 8965.1 MB/s
readseq : 1.109 micros/op 901713 ops/sec; 8819.6 MB/s
[master]
readseq : 27.257 micros/op 36687 ops/sec; 358.8 MB/s
readseq : 4.443 micros/op 225073 ops/sec; 2201.4 MB/s
readseq : 5.830 micros/op 171526 ops/sec; 1677.7 MB/s
readseq : 4.173 micros/op 239635 ops/sec; 2343.8 MB/s
readseq : 4.150 micros/op 240963 ops/sec; 2356.8 MB/s
```
Test Plan: COMPILE_WITH_ASAN=1 make check -j64
Reviewers: yhchiang, andrewkr, sdong
Reviewed By: sdong
Subscribers: lovro, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D57075
9 years ago
|
|
|
#include "db/pinned_iterators_manager.h"
|
|
|
|
#include "db/table_cache.h"
|
|
|
|
#include "db/version_builder.h"
|
|
|
|
#include "db/version_edit_handler.h"
|
|
|
|
#include "file/filename.h"
|
|
|
|
#include "file/random_access_file_reader.h"
|
|
|
|
#include "file/read_write_util.h"
|
|
|
|
#include "file/writable_file_writer.h"
|
|
|
|
#include "monitoring/file_read_sample.h"
|
|
|
|
#include "monitoring/perf_context_imp.h"
|
|
|
|
#include "monitoring/persistent_stats_history.h"
|
|
|
|
#include "options/options_helper.h"
|
|
|
|
#include "rocksdb/env.h"
|
|
|
|
#include "rocksdb/merge_operator.h"
|
|
|
|
#include "rocksdb/write_buffer_manager.h"
|
|
|
|
#include "table/format.h"
|
|
|
|
#include "table/get_context.h"
|
|
|
|
#include "table/internal_iterator.h"
|
|
|
|
#include "table/merging_iterator.h"
|
|
|
|
#include "table/meta_blocks.h"
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
#include "table/multiget_context.h"
|
|
|
|
#include "table/plain/plain_table_factory.h"
|
|
|
|
#include "table/table_reader.h"
|
|
|
|
#include "table/two_level_iterator.h"
|
|
|
|
#include "test_util/sync_point.h"
|
|
|
|
#include "util/cast_util.h"
|
|
|
|
#include "util/coding.h"
|
|
|
|
#include "util/stop_watch.h"
|
|
|
|
#include "util/string_util.h"
|
|
|
|
#include "util/user_comparator_wrapper.h"
|
|
|
|
|
|
|
|
namespace ROCKSDB_NAMESPACE {
|
|
|
|
|
|
|
|
namespace {
|
|
|
|
|
|
|
|
// Find File in LevelFilesBrief data structure
|
|
|
|
// Within an index range defined by left and right
|
|
|
|
int FindFileInRange(const InternalKeyComparator& icmp,
|
|
|
|
const LevelFilesBrief& file_level,
|
|
|
|
const Slice& key,
|
|
|
|
uint32_t left,
|
|
|
|
uint32_t right) {
|
|
|
|
auto cmp = [&](const FdWithKeyRange& f, const Slice& k) -> bool {
|
|
|
|
return icmp.InternalKeyComparator::Compare(f.largest_key, k) < 0;
|
|
|
|
};
|
|
|
|
const auto &b = file_level.files;
|
|
|
|
return static_cast<int>(std::lower_bound(b + left,
|
|
|
|
b + right, key, cmp) - b);
|
|
|
|
}
|
|
|
|
|
|
|
|
Status OverlapWithIterator(const Comparator* ucmp,
|
|
|
|
const Slice& smallest_user_key,
|
|
|
|
const Slice& largest_user_key,
|
|
|
|
InternalIterator* iter,
|
|
|
|
bool* overlap) {
|
|
|
|
InternalKey range_start(smallest_user_key, kMaxSequenceNumber,
|
|
|
|
kValueTypeForSeek);
|
|
|
|
iter->Seek(range_start.Encode());
|
|
|
|
if (!iter->status().ok()) {
|
|
|
|
return iter->status();
|
|
|
|
}
|
|
|
|
|
|
|
|
*overlap = false;
|
|
|
|
if (iter->Valid()) {
|
|
|
|
ParsedInternalKey seek_result;
|
|
|
|
Status s = ParseInternalKey(iter->key(), &seek_result,
|
|
|
|
false /* log_err_key */); // TODO
|
|
|
|
if (!s.ok()) return s;
|
|
|
|
|
|
|
|
if (ucmp->CompareWithoutTimestamp(seek_result.user_key, largest_user_key) <=
|
|
|
|
0) {
|
|
|
|
*overlap = true;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return iter->status();
|
|
|
|
}
|
|
|
|
|
|
|
|
// Class to help choose the next file to search for the particular key.
|
|
|
|
// Searches and returns files level by level.
|
|
|
|
// We can search level-by-level since entries never hop across
|
|
|
|
// levels. Therefore we are guaranteed that if we find data
|
|
|
|
// in a smaller level, later levels are irrelevant (unless we
|
|
|
|
// are MergeInProgress).
|
|
|
|
class FilePicker {
|
|
|
|
public:
|
|
|
|
FilePicker(std::vector<FileMetaData*>* files, const Slice& user_key,
|
|
|
|
const Slice& ikey, autovector<LevelFilesBrief>* file_levels,
|
|
|
|
unsigned int num_levels, FileIndexer* file_indexer,
|
|
|
|
const Comparator* user_comparator,
|
|
|
|
const InternalKeyComparator* internal_comparator)
|
|
|
|
: num_levels_(num_levels),
|
|
|
|
curr_level_(static_cast<unsigned int>(-1)),
|
Adding pin_l0_filter_and_index_blocks_in_cache feature and related fixes.
Summary:
When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache.
What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention.
Test Plan:
'export TEST_TMPDIR=/dev/shm/ && DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32' is OK.
I didn't run the Java tests, I don't have Java set up on my devserver.
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56133
9 years ago
|
|
|
returned_file_level_(static_cast<unsigned int>(-1)),
|
|
|
|
hit_file_level_(static_cast<unsigned int>(-1)),
|
|
|
|
search_left_bound_(0),
|
|
|
|
search_right_bound_(FileIndexer::kLevelMaxIndex),
|
|
|
|
#ifndef NDEBUG
|
|
|
|
files_(files),
|
|
|
|
#endif
|
|
|
|
level_files_brief_(file_levels),
|
|
|
|
is_hit_file_last_in_level_(false),
|
|
|
|
curr_file_level_(nullptr),
|
|
|
|
user_key_(user_key),
|
|
|
|
ikey_(ikey),
|
|
|
|
file_indexer_(file_indexer),
|
|
|
|
user_comparator_(user_comparator),
|
|
|
|
internal_comparator_(internal_comparator) {
|
|
|
|
#ifdef NDEBUG
|
|
|
|
(void)files;
|
|
|
|
#endif
|
|
|
|
// Setup member variables to search first level.
|
|
|
|
search_ended_ = !PrepareNextLevel();
|
|
|
|
if (!search_ended_) {
|
|
|
|
// Prefetch Level 0 table data to avoid cache miss if possible.
|
|
|
|
for (unsigned int i = 0; i < (*level_files_brief_)[0].num_files; ++i) {
|
|
|
|
auto* r = (*level_files_brief_)[0].files[i].fd.table_reader;
|
|
|
|
if (r) {
|
|
|
|
r->Prepare(ikey);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
int GetCurrentLevel() const { return curr_level_; }
|
Adding pin_l0_filter_and_index_blocks_in_cache feature and related fixes.
Summary:
When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache.
What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention.
Test Plan:
'export TEST_TMPDIR=/dev/shm/ && DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32' is OK.
I didn't run the Java tests, I don't have Java set up on my devserver.
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56133
9 years ago
|
|
|
|
|
|
|
FdWithKeyRange* GetNextFile() {
|
|
|
|
while (!search_ended_) { // Loops over different levels.
|
|
|
|
while (curr_index_in_curr_level_ < curr_file_level_->num_files) {
|
|
|
|
// Loops over all files in current level.
|
|
|
|
FdWithKeyRange* f = &curr_file_level_->files[curr_index_in_curr_level_];
|
|
|
|
hit_file_level_ = curr_level_;
|
|
|
|
is_hit_file_last_in_level_ =
|
|
|
|
curr_index_in_curr_level_ == curr_file_level_->num_files - 1;
|
|
|
|
int cmp_largest = -1;
|
|
|
|
|
|
|
|
// Do key range filtering of files or/and fractional cascading if:
|
|
|
|
// (1) not all the files are in level 0, or
|
|
|
|
// (2) there are more than 3 current level files
|
|
|
|
// If there are only 3 or less current level files in the system, we skip
|
|
|
|
// the key range filtering. In this case, more likely, the system is
|
|
|
|
// highly tuned to minimize number of tables queried by each query,
|
|
|
|
// so it is unlikely that key range filtering is more efficient than
|
|
|
|
// querying the files.
|
|
|
|
if (num_levels_ > 1 || curr_file_level_->num_files > 3) {
|
|
|
|
// Check if key is within a file's range. If search left bound and
|
|
|
|
// right bound point to the same find, we are sure key falls in
|
|
|
|
// range.
|
|
|
|
assert(curr_level_ == 0 ||
|
|
|
|
curr_index_in_curr_level_ == start_index_in_curr_level_ ||
|
|
|
|
user_comparator_->CompareWithoutTimestamp(
|
|
|
|
user_key_, ExtractUserKey(f->smallest_key)) <= 0);
|
|
|
|
|
|
|
|
int cmp_smallest = user_comparator_->CompareWithoutTimestamp(
|
|
|
|
user_key_, ExtractUserKey(f->smallest_key));
|
|
|
|
if (cmp_smallest >= 0) {
|
|
|
|
cmp_largest = user_comparator_->CompareWithoutTimestamp(
|
|
|
|
user_key_, ExtractUserKey(f->largest_key));
|
|
|
|
}
|
|
|
|
|
|
|
|
// Setup file search bound for the next level based on the
|
|
|
|
// comparison results
|
|
|
|
if (curr_level_ > 0) {
|
|
|
|
file_indexer_->GetNextLevelIndex(curr_level_,
|
|
|
|
curr_index_in_curr_level_,
|
|
|
|
cmp_smallest, cmp_largest,
|
|
|
|
&search_left_bound_,
|
|
|
|
&search_right_bound_);
|
|
|
|
}
|
|
|
|
// Key falls out of current file's range
|
|
|
|
if (cmp_smallest < 0 || cmp_largest > 0) {
|
|
|
|
if (curr_level_ == 0) {
|
|
|
|
++curr_index_in_curr_level_;
|
|
|
|
continue;
|
|
|
|
} else {
|
|
|
|
// Search next level.
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
#ifndef NDEBUG
|
|
|
|
// Sanity check to make sure that the files are correctly sorted
|
|
|
|
if (prev_file_) {
|
|
|
|
if (curr_level_ != 0) {
|
|
|
|
int comp_sign = internal_comparator_->Compare(
|
|
|
|
prev_file_->largest_key, f->smallest_key);
|
|
|
|
assert(comp_sign < 0);
|
|
|
|
} else {
|
|
|
|
// level == 0, the current file cannot be newer than the previous
|
|
|
|
// one. Use compressed data structure, has no attribute seqNo
|
|
|
|
assert(curr_index_in_curr_level_ > 0);
|
|
|
|
assert(!NewestFirstBySeqNo(files_[0][curr_index_in_curr_level_],
|
|
|
|
files_[0][curr_index_in_curr_level_-1]));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
prev_file_ = f;
|
|
|
|
#endif
|
Adding pin_l0_filter_and_index_blocks_in_cache feature and related fixes.
Summary:
When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache.
What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention.
Test Plan:
'export TEST_TMPDIR=/dev/shm/ && DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32' is OK.
I didn't run the Java tests, I don't have Java set up on my devserver.
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56133
9 years ago
|
|
|
returned_file_level_ = curr_level_;
|
|
|
|
if (curr_level_ > 0 && cmp_largest < 0) {
|
|
|
|
// No more files to search in this level.
|
|
|
|
search_ended_ = !PrepareNextLevel();
|
|
|
|
} else {
|
|
|
|
++curr_index_in_curr_level_;
|
|
|
|
}
|
|
|
|
return f;
|
|
|
|
}
|
|
|
|
// Start searching next level.
|
|
|
|
search_ended_ = !PrepareNextLevel();
|
|
|
|
}
|
|
|
|
// Search ended.
|
|
|
|
return nullptr;
|
|
|
|
}
|
|
|
|
|
|
|
|
// getter for current file level
|
|
|
|
// for GET_HIT_L0, GET_HIT_L1 & GET_HIT_L2_AND_UP counts
|
|
|
|
unsigned int GetHitFileLevel() { return hit_file_level_; }
|
|
|
|
|
|
|
|
// Returns true if the most recent "hit file" (i.e., one returned by
|
|
|
|
// GetNextFile()) is at the last index in its level.
|
|
|
|
bool IsHitFileLastInLevel() { return is_hit_file_last_in_level_; }
|
|
|
|
|
|
|
|
private:
|
|
|
|
unsigned int num_levels_;
|
|
|
|
unsigned int curr_level_;
|
Adding pin_l0_filter_and_index_blocks_in_cache feature and related fixes.
Summary:
When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache.
What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention.
Test Plan:
'export TEST_TMPDIR=/dev/shm/ && DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32' is OK.
I didn't run the Java tests, I don't have Java set up on my devserver.
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56133
9 years ago
|
|
|
unsigned int returned_file_level_;
|
|
|
|
unsigned int hit_file_level_;
|
|
|
|
int32_t search_left_bound_;
|
|
|
|
int32_t search_right_bound_;
|
|
|
|
#ifndef NDEBUG
|
|
|
|
std::vector<FileMetaData*>* files_;
|
|
|
|
#endif
|
|
|
|
autovector<LevelFilesBrief>* level_files_brief_;
|
|
|
|
bool search_ended_;
|
|
|
|
bool is_hit_file_last_in_level_;
|
|
|
|
LevelFilesBrief* curr_file_level_;
|
|
|
|
unsigned int curr_index_in_curr_level_;
|
|
|
|
unsigned int start_index_in_curr_level_;
|
|
|
|
Slice user_key_;
|
|
|
|
Slice ikey_;
|
|
|
|
FileIndexer* file_indexer_;
|
|
|
|
const Comparator* user_comparator_;
|
|
|
|
const InternalKeyComparator* internal_comparator_;
|
|
|
|
#ifndef NDEBUG
|
|
|
|
FdWithKeyRange* prev_file_;
|
|
|
|
#endif
|
|
|
|
|
|
|
|
// Setup local variables to search next level.
|
|
|
|
// Returns false if there are no more levels to search.
|
|
|
|
bool PrepareNextLevel() {
|
|
|
|
curr_level_++;
|
|
|
|
while (curr_level_ < num_levels_) {
|
|
|
|
curr_file_level_ = &(*level_files_brief_)[curr_level_];
|
|
|
|
if (curr_file_level_->num_files == 0) {
|
|
|
|
// When current level is empty, the search bound generated from upper
|
|
|
|
// level must be [0, -1] or [0, FileIndexer::kLevelMaxIndex] if it is
|
|
|
|
// also empty.
|
|
|
|
assert(search_left_bound_ == 0);
|
|
|
|
assert(search_right_bound_ == -1 ||
|
|
|
|
search_right_bound_ == FileIndexer::kLevelMaxIndex);
|
|
|
|
// Since current level is empty, it will need to search all files in
|
|
|
|
// the next level
|
|
|
|
search_left_bound_ = 0;
|
|
|
|
search_right_bound_ = FileIndexer::kLevelMaxIndex;
|
|
|
|
curr_level_++;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
// Some files may overlap each other. We find
|
|
|
|
// all files that overlap user_key and process them in order from
|
|
|
|
// newest to oldest. In the context of merge-operator, this can occur at
|
|
|
|
// any level. Otherwise, it only occurs at Level-0 (since Put/Deletes
|
|
|
|
// are always compacted into a single entry).
|
|
|
|
int32_t start_index;
|
|
|
|
if (curr_level_ == 0) {
|
|
|
|
// On Level-0, we read through all files to check for overlap.
|
|
|
|
start_index = 0;
|
|
|
|
} else {
|
|
|
|
// On Level-n (n>=1), files are sorted. Binary search to find the
|
|
|
|
// earliest file whose largest key >= ikey. Search left bound and
|
|
|
|
// right bound are used to narrow the range.
|
Fix point lookup on range tombstone sentinel endpoint (#4829)
Summary:
Previously for point lookup we decided which file to look into based on user key overlap only. We also did not truncate range tombstones in the point lookup code path. These two ideas did not interact well in cases like this:
- L1 has range tombstone [a, c)#1 and point key b#2. The data is split between file1 with range [a#1,1, b#72057594037927935,15], and file2 with range [b#2, c#1].
- L1's file2 gets compacted to L2.
- User issues `Get()` for b#3.
- L1's file1 is opened and the range tombstone [a, c)#1 is found for b, while no point-key for b is found in L1.
- `Get()` assumes that the range tombstone must cover all data in that range in lower levels, so short circuits and returns `NotFound`.
The solution to this problem is to not look into files that only overlap with the point lookup at a range tombstone sentinel endpoint. In the above example, this would mean not opening L1's file1 or its tombstones during the `Get()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4829
Differential Revision: D13561355
Pulled By: ajkr
fbshipit-source-id: a13c21c816870a2f5d32a48af6dbd719a7d9d19f
6 years ago
|
|
|
if (search_left_bound_ <= search_right_bound_) {
|
|
|
|
if (search_right_bound_ == FileIndexer::kLevelMaxIndex) {
|
|
|
|
search_right_bound_ =
|
|
|
|
static_cast<int32_t>(curr_file_level_->num_files) - 1;
|
|
|
|
}
|
Fix point lookup on range tombstone sentinel endpoint (#4829)
Summary:
Previously for point lookup we decided which file to look into based on user key overlap only. We also did not truncate range tombstones in the point lookup code path. These two ideas did not interact well in cases like this:
- L1 has range tombstone [a, c)#1 and point key b#2. The data is split between file1 with range [a#1,1, b#72057594037927935,15], and file2 with range [b#2, c#1].
- L1's file2 gets compacted to L2.
- User issues `Get()` for b#3.
- L1's file1 is opened and the range tombstone [a, c)#1 is found for b, while no point-key for b is found in L1.
- `Get()` assumes that the range tombstone must cover all data in that range in lower levels, so short circuits and returns `NotFound`.
The solution to this problem is to not look into files that only overlap with the point lookup at a range tombstone sentinel endpoint. In the above example, this would mean not opening L1's file1 or its tombstones during the `Get()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4829
Differential Revision: D13561355
Pulled By: ajkr
fbshipit-source-id: a13c21c816870a2f5d32a48af6dbd719a7d9d19f
6 years ago
|
|
|
// `search_right_bound_` is an inclusive upper-bound, but since it was
|
|
|
|
// determined based on user key, it is still possible the lookup key
|
|
|
|
// falls to the right of `search_right_bound_`'s corresponding file.
|
|
|
|
// So, pass a limit one higher, which allows us to detect this case.
|
|
|
|
start_index =
|
|
|
|
FindFileInRange(*internal_comparator_, *curr_file_level_, ikey_,
|
|
|
|
static_cast<uint32_t>(search_left_bound_),
|
Fix point lookup on range tombstone sentinel endpoint (#4829)
Summary:
Previously for point lookup we decided which file to look into based on user key overlap only. We also did not truncate range tombstones in the point lookup code path. These two ideas did not interact well in cases like this:
- L1 has range tombstone [a, c)#1 and point key b#2. The data is split between file1 with range [a#1,1, b#72057594037927935,15], and file2 with range [b#2, c#1].
- L1's file2 gets compacted to L2.
- User issues `Get()` for b#3.
- L1's file1 is opened and the range tombstone [a, c)#1 is found for b, while no point-key for b is found in L1.
- `Get()` assumes that the range tombstone must cover all data in that range in lower levels, so short circuits and returns `NotFound`.
The solution to this problem is to not look into files that only overlap with the point lookup at a range tombstone sentinel endpoint. In the above example, this would mean not opening L1's file1 or its tombstones during the `Get()`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4829
Differential Revision: D13561355
Pulled By: ajkr
fbshipit-source-id: a13c21c816870a2f5d32a48af6dbd719a7d9d19f
6 years ago
|
|
|
static_cast<uint32_t>(search_right_bound_) + 1);
|
|
|
|
if (start_index == search_right_bound_ + 1) {
|
|
|
|
// `ikey_` comes after `search_right_bound_`. The lookup key does
|
|
|
|
// not exist on this level, so let's skip this level and do a full
|
|
|
|
// binary search on the next level.
|
|
|
|
search_left_bound_ = 0;
|
|
|
|
search_right_bound_ = FileIndexer::kLevelMaxIndex;
|
|
|
|
curr_level_++;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
// search_left_bound > search_right_bound, key does not exist in
|
|
|
|
// this level. Since no comparison is done in this level, it will
|
|
|
|
// need to search all files in the next level.
|
|
|
|
search_left_bound_ = 0;
|
|
|
|
search_right_bound_ = FileIndexer::kLevelMaxIndex;
|
|
|
|
curr_level_++;
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
start_index_in_curr_level_ = start_index;
|
|
|
|
curr_index_in_curr_level_ = start_index;
|
|
|
|
#ifndef NDEBUG
|
|
|
|
prev_file_ = nullptr;
|
|
|
|
#endif
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
// curr_level_ = num_levels_. So, no more levels to search.
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
};
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
|
|
|
|
class FilePickerMultiGet {
|
|
|
|
private:
|
|
|
|
struct FilePickerContext;
|
|
|
|
|
|
|
|
public:
|
|
|
|
FilePickerMultiGet(MultiGetRange* range,
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
autovector<LevelFilesBrief>* file_levels,
|
|
|
|
unsigned int num_levels, FileIndexer* file_indexer,
|
|
|
|
const Comparator* user_comparator,
|
|
|
|
const InternalKeyComparator* internal_comparator)
|
|
|
|
: num_levels_(num_levels),
|
|
|
|
curr_level_(static_cast<unsigned int>(-1)),
|
|
|
|
returned_file_level_(static_cast<unsigned int>(-1)),
|
|
|
|
hit_file_level_(static_cast<unsigned int>(-1)),
|
|
|
|
range_(range),
|
|
|
|
batch_iter_(range->begin()),
|
|
|
|
batch_iter_prev_(range->begin()),
|
|
|
|
upper_key_(range->begin()),
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
maybe_repeat_key_(false),
|
|
|
|
current_level_range_(*range, range->begin(), range->end()),
|
|
|
|
current_file_range_(*range, range->begin(), range->end()),
|
|
|
|
level_files_brief_(file_levels),
|
|
|
|
is_hit_file_last_in_level_(false),
|
|
|
|
curr_file_level_(nullptr),
|
|
|
|
file_indexer_(file_indexer),
|
|
|
|
user_comparator_(user_comparator),
|
|
|
|
internal_comparator_(internal_comparator) {
|
|
|
|
for (auto iter = range_->begin(); iter != range_->end(); ++iter) {
|
|
|
|
fp_ctx_array_[iter.index()] =
|
|
|
|
FilePickerContext(0, FileIndexer::kLevelMaxIndex);
|
|
|
|
}
|
|
|
|
|
|
|
|
// Setup member variables to search first level.
|
|
|
|
search_ended_ = !PrepareNextLevel();
|
|
|
|
if (!search_ended_) {
|
|
|
|
// REVISIT
|
|
|
|
// Prefetch Level 0 table data to avoid cache miss if possible.
|
|
|
|
// As of now, only PlainTableReader and CuckooTableReader do any
|
|
|
|
// prefetching. This may not be necessary anymore once we implement
|
|
|
|
// batching in those table readers
|
|
|
|
for (unsigned int i = 0; i < (*level_files_brief_)[0].num_files; ++i) {
|
|
|
|
auto* r = (*level_files_brief_)[0].files[i].fd.table_reader;
|
|
|
|
if (r) {
|
|
|
|
for (auto iter = range_->begin(); iter != range_->end(); ++iter) {
|
|
|
|
r->Prepare(iter->ikey);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
int GetCurrentLevel() const { return curr_level_; }
|
|
|
|
|
|
|
|
// Iterates through files in the current level until it finds a file that
|
|
|
|
// contains atleast one key from the MultiGet batch
|
|
|
|
bool GetNextFileInLevelWithKeys(MultiGetRange* next_file_range,
|
|
|
|
size_t* file_index, FdWithKeyRange** fd,
|
|
|
|
bool* is_last_key_in_file) {
|
|
|
|
size_t curr_file_index = *file_index;
|
|
|
|
FdWithKeyRange* f = nullptr;
|
|
|
|
bool file_hit = false;
|
|
|
|
int cmp_largest = -1;
|
|
|
|
if (curr_file_index >= curr_file_level_->num_files) {
|
|
|
|
// In the unlikely case the next key is a duplicate of the current key,
|
|
|
|
// and the current key is the last in the level and the internal key
|
|
|
|
// was not found, we need to skip lookup for the remaining keys and
|
|
|
|
// reset the search bounds
|
|
|
|
if (batch_iter_ != current_level_range_.end()) {
|
|
|
|
++batch_iter_;
|
|
|
|
for (; batch_iter_ != current_level_range_.end(); ++batch_iter_) {
|
|
|
|
struct FilePickerContext& fp_ctx = fp_ctx_array_[batch_iter_.index()];
|
|
|
|
fp_ctx.search_left_bound = 0;
|
|
|
|
fp_ctx.search_right_bound = FileIndexer::kLevelMaxIndex;
|
|
|
|
}
|
|
|
|
}
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
return false;
|
|
|
|
}
|
|
|
|
// Loops over keys in the MultiGet batch until it finds a file with
|
|
|
|
// atleast one of the keys. Then it keeps moving forward until the
|
|
|
|
// last key in the batch that falls in that file
|
|
|
|
while (batch_iter_ != current_level_range_.end() &&
|
|
|
|
(fp_ctx_array_[batch_iter_.index()].curr_index_in_curr_level ==
|
|
|
|
curr_file_index ||
|
|
|
|
!file_hit)) {
|
|
|
|
struct FilePickerContext& fp_ctx = fp_ctx_array_[batch_iter_.index()];
|
|
|
|
f = &curr_file_level_->files[fp_ctx.curr_index_in_curr_level];
|
|
|
|
Slice& user_key = batch_iter_->ukey_without_ts;
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
|
|
|
|
// Do key range filtering of files or/and fractional cascading if:
|
|
|
|
// (1) not all the files are in level 0, or
|
|
|
|
// (2) there are more than 3 current level files
|
|
|
|
// If there are only 3 or less current level files in the system, we
|
|
|
|
// skip the key range filtering. In this case, more likely, the system
|
|
|
|
// is highly tuned to minimize number of tables queried by each query,
|
|
|
|
// so it is unlikely that key range filtering is more efficient than
|
|
|
|
// querying the files.
|
|
|
|
if (num_levels_ > 1 || curr_file_level_->num_files > 3) {
|
|
|
|
// Check if key is within a file's range. If search left bound and
|
|
|
|
// right bound point to the same find, we are sure key falls in
|
|
|
|
// range.
|
|
|
|
int cmp_smallest = user_comparator_->CompareWithoutTimestamp(
|
|
|
|
user_key, false, ExtractUserKey(f->smallest_key), true);
|
|
|
|
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
assert(curr_level_ == 0 ||
|
|
|
|
fp_ctx.curr_index_in_curr_level ==
|
|
|
|
fp_ctx.start_index_in_curr_level ||
|
|
|
|
cmp_smallest <= 0);
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
|
|
|
|
if (cmp_smallest >= 0) {
|
|
|
|
cmp_largest = user_comparator_->CompareWithoutTimestamp(
|
|
|
|
user_key, false, ExtractUserKey(f->largest_key), true);
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
} else {
|
|
|
|
cmp_largest = -1;
|
|
|
|
}
|
|
|
|
|
|
|
|
// Setup file search bound for the next level based on the
|
|
|
|
// comparison results
|
|
|
|
if (curr_level_ > 0) {
|
|
|
|
file_indexer_->GetNextLevelIndex(
|
|
|
|
curr_level_, fp_ctx.curr_index_in_curr_level, cmp_smallest,
|
|
|
|
cmp_largest, &fp_ctx.search_left_bound,
|
|
|
|
&fp_ctx.search_right_bound);
|
|
|
|
}
|
|
|
|
// Key falls out of current file's range
|
|
|
|
if (cmp_smallest < 0 || cmp_largest > 0) {
|
|
|
|
next_file_range->SkipKey(batch_iter_);
|
|
|
|
} else {
|
|
|
|
file_hit = true;
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
file_hit = true;
|
|
|
|
}
|
|
|
|
if (cmp_largest == 0) {
|
|
|
|
// cmp_largest is 0, which means the next key will not be in this
|
|
|
|
// file, so stop looking further. However, its possible there are
|
|
|
|
// duplicates in the batch, so find the upper bound for the batch
|
|
|
|
// in this file (upper_key_) by skipping past the duplicates. We
|
|
|
|
// leave batch_iter_ as is since we may have to pick up from there
|
|
|
|
// for the next file, if this file has a merge value rather than
|
|
|
|
// final value
|
|
|
|
upper_key_ = batch_iter_;
|
|
|
|
++upper_key_;
|
|
|
|
while (upper_key_ != current_level_range_.end() &&
|
|
|
|
user_comparator_->CompareWithoutTimestamp(
|
|
|
|
batch_iter_->ukey_without_ts, false,
|
|
|
|
upper_key_->ukey_without_ts, false) == 0) {
|
|
|
|
++upper_key_;
|
|
|
|
}
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
break;
|
|
|
|
} else {
|
|
|
|
if (curr_level_ == 0) {
|
|
|
|
// We need to look through all files in level 0
|
|
|
|
++fp_ctx.curr_index_in_curr_level;
|
|
|
|
}
|
|
|
|
++batch_iter_;
|
|
|
|
}
|
|
|
|
if (!file_hit) {
|
|
|
|
curr_file_index =
|
|
|
|
(batch_iter_ != current_level_range_.end())
|
|
|
|
? fp_ctx_array_[batch_iter_.index()].curr_index_in_curr_level
|
|
|
|
: curr_file_level_->num_files;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
*fd = f;
|
|
|
|
*file_index = curr_file_index;
|
|
|
|
*is_last_key_in_file = cmp_largest == 0;
|
|
|
|
if (!*is_last_key_in_file) {
|
|
|
|
// If the largest key in the batch overlapping the file is not the
|
|
|
|
// largest key in the file, upper_ley_ would not have been updated so
|
|
|
|
// update it here
|
|
|
|
upper_key_ = batch_iter_;
|
|
|
|
}
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
return file_hit;
|
|
|
|
}
|
|
|
|
|
|
|
|
FdWithKeyRange* GetNextFile() {
|
|
|
|
while (!search_ended_) {
|
|
|
|
// Start searching next level.
|
|
|
|
if (batch_iter_ == current_level_range_.end()) {
|
|
|
|
search_ended_ = !PrepareNextLevel();
|
|
|
|
continue;
|
|
|
|
} else {
|
|
|
|
if (maybe_repeat_key_) {
|
|
|
|
maybe_repeat_key_ = false;
|
|
|
|
// Check if we found the final value for the last key in the
|
|
|
|
// previous lookup range. If we did, then there's no need to look
|
|
|
|
// any further for that key, so advance batch_iter_. Else, keep
|
|
|
|
// batch_iter_ positioned on that key so we look it up again in
|
|
|
|
// the next file
|
|
|
|
// For L0, always advance the key because we will look in the next
|
|
|
|
// file regardless for all keys not found yet
|
|
|
|
if (current_level_range_.CheckKeyDone(batch_iter_) ||
|
|
|
|
curr_level_ == 0) {
|
|
|
|
batch_iter_ = upper_key_;
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
}
|
|
|
|
}
|
|
|
|
// batch_iter_prev_ will become the start key for the next file
|
|
|
|
// lookup
|
|
|
|
batch_iter_prev_ = batch_iter_;
|
|
|
|
}
|
|
|
|
|
|
|
|
MultiGetRange next_file_range(current_level_range_, batch_iter_prev_,
|
|
|
|
current_level_range_.end());
|
|
|
|
size_t curr_file_index =
|
|
|
|
(batch_iter_ != current_level_range_.end())
|
|
|
|
? fp_ctx_array_[batch_iter_.index()].curr_index_in_curr_level
|
|
|
|
: curr_file_level_->num_files;
|
|
|
|
FdWithKeyRange* f;
|
|
|
|
bool is_last_key_in_file;
|
|
|
|
if (!GetNextFileInLevelWithKeys(&next_file_range, &curr_file_index, &f,
|
|
|
|
&is_last_key_in_file)) {
|
|
|
|
search_ended_ = !PrepareNextLevel();
|
|
|
|
} else {
|
|
|
|
if (is_last_key_in_file) {
|
|
|
|
// Since cmp_largest is 0, batch_iter_ still points to the last key
|
|
|
|
// that falls in this file, instead of the next one. Increment
|
|
|
|
// the file index for all keys between batch_iter_ and upper_key_
|
|
|
|
auto tmp_iter = batch_iter_;
|
|
|
|
while (tmp_iter != upper_key_) {
|
|
|
|
++(fp_ctx_array_[tmp_iter.index()].curr_index_in_curr_level);
|
|
|
|
++tmp_iter;
|
|
|
|
}
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
maybe_repeat_key_ = true;
|
|
|
|
}
|
|
|
|
// Set the range for this file
|
|
|
|
current_file_range_ =
|
|
|
|
MultiGetRange(next_file_range, batch_iter_prev_, upper_key_);
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
returned_file_level_ = curr_level_;
|
|
|
|
hit_file_level_ = curr_level_;
|
|
|
|
is_hit_file_last_in_level_ =
|
|
|
|
curr_file_index == curr_file_level_->num_files - 1;
|
|
|
|
return f;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
// Search ended
|
|
|
|
return nullptr;
|
|
|
|
}
|
|
|
|
|
|
|
|
// getter for current file level
|
|
|
|
// for GET_HIT_L0, GET_HIT_L1 & GET_HIT_L2_AND_UP counts
|
|
|
|
unsigned int GetHitFileLevel() { return hit_file_level_; }
|
|
|
|
|
|
|
|
// Returns true if the most recent "hit file" (i.e., one returned by
|
|
|
|
// GetNextFile()) is at the last index in its level.
|
|
|
|
bool IsHitFileLastInLevel() { return is_hit_file_last_in_level_; }
|
|
|
|
|
|
|
|
const MultiGetRange& CurrentFileRange() { return current_file_range_; }
|
|
|
|
|
|
|
|
private:
|
|
|
|
unsigned int num_levels_;
|
|
|
|
unsigned int curr_level_;
|
|
|
|
unsigned int returned_file_level_;
|
|
|
|
unsigned int hit_file_level_;
|
|
|
|
|
|
|
|
struct FilePickerContext {
|
|
|
|
int32_t search_left_bound;
|
|
|
|
int32_t search_right_bound;
|
|
|
|
unsigned int curr_index_in_curr_level;
|
|
|
|
unsigned int start_index_in_curr_level;
|
|
|
|
|
|
|
|
FilePickerContext(int32_t left, int32_t right)
|
|
|
|
: search_left_bound(left), search_right_bound(right),
|
|
|
|
curr_index_in_curr_level(0), start_index_in_curr_level(0) {}
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
|
|
|
|
FilePickerContext() = default;
|
|
|
|
};
|
|
|
|
std::array<FilePickerContext, MultiGetContext::MAX_BATCH_SIZE> fp_ctx_array_;
|
|
|
|
MultiGetRange* range_;
|
|
|
|
// Iterator to iterate through the keys in a MultiGet batch, that gets reset
|
|
|
|
// at the beginning of each level. Each call to GetNextFile() will position
|
|
|
|
// batch_iter_ at or right after the last key that was found in the returned
|
|
|
|
// SST file
|
|
|
|
MultiGetRange::Iterator batch_iter_;
|
|
|
|
// An iterator that records the previous position of batch_iter_, i.e last
|
|
|
|
// key found in the previous SST file, in order to serve as the start of
|
|
|
|
// the batch key range for the next SST file
|
|
|
|
MultiGetRange::Iterator batch_iter_prev_;
|
|
|
|
MultiGetRange::Iterator upper_key_;
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
bool maybe_repeat_key_;
|
|
|
|
MultiGetRange current_level_range_;
|
|
|
|
MultiGetRange current_file_range_;
|
|
|
|
autovector<LevelFilesBrief>* level_files_brief_;
|
|
|
|
bool search_ended_;
|
|
|
|
bool is_hit_file_last_in_level_;
|
|
|
|
LevelFilesBrief* curr_file_level_;
|
|
|
|
FileIndexer* file_indexer_;
|
|
|
|
const Comparator* user_comparator_;
|
|
|
|
const InternalKeyComparator* internal_comparator_;
|
|
|
|
|
|
|
|
// Setup local variables to search next level.
|
|
|
|
// Returns false if there are no more levels to search.
|
|
|
|
bool PrepareNextLevel() {
|
|
|
|
if (curr_level_ == 0) {
|
|
|
|
MultiGetRange::Iterator mget_iter = current_level_range_.begin();
|
|
|
|
if (fp_ctx_array_[mget_iter.index()].curr_index_in_curr_level <
|
|
|
|
curr_file_level_->num_files) {
|
|
|
|
batch_iter_prev_ = current_level_range_.begin();
|
|
|
|
upper_key_ = batch_iter_ = current_level_range_.begin();
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
return true;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
curr_level_++;
|
|
|
|
// Reset key range to saved value
|
|
|
|
while (curr_level_ < num_levels_) {
|
|
|
|
bool level_contains_keys = false;
|
|
|
|
curr_file_level_ = &(*level_files_brief_)[curr_level_];
|
|
|
|
if (curr_file_level_->num_files == 0) {
|
|
|
|
// When current level is empty, the search bound generated from upper
|
|
|
|
// level must be [0, -1] or [0, FileIndexer::kLevelMaxIndex] if it is
|
|
|
|
// also empty.
|
|
|
|
|
|
|
|
for (auto mget_iter = current_level_range_.begin();
|
|
|
|
mget_iter != current_level_range_.end(); ++mget_iter) {
|
|
|
|
struct FilePickerContext& fp_ctx = fp_ctx_array_[mget_iter.index()];
|
|
|
|
|
|
|
|
assert(fp_ctx.search_left_bound == 0);
|
|
|
|
assert(fp_ctx.search_right_bound == -1 ||
|
|
|
|
fp_ctx.search_right_bound == FileIndexer::kLevelMaxIndex);
|
|
|
|
// Since current level is empty, it will need to search all files in
|
|
|
|
// the next level
|
|
|
|
fp_ctx.search_left_bound = 0;
|
|
|
|
fp_ctx.search_right_bound = FileIndexer::kLevelMaxIndex;
|
|
|
|
}
|
|
|
|
// Skip all subsequent empty levels
|
|
|
|
do {
|
|
|
|
++curr_level_;
|
|
|
|
} while ((curr_level_ < num_levels_) &&
|
|
|
|
(*level_files_brief_)[curr_level_].num_files == 0);
|
|
|
|
continue;
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
}
|
|
|
|
|
|
|
|
// Some files may overlap each other. We find
|
|
|
|
// all files that overlap user_key and process them in order from
|
|
|
|
// newest to oldest. In the context of merge-operator, this can occur at
|
|
|
|
// any level. Otherwise, it only occurs at Level-0 (since Put/Deletes
|
|
|
|
// are always compacted into a single entry).
|
|
|
|
int32_t start_index = -1;
|
|
|
|
current_level_range_ =
|
|
|
|
MultiGetRange(*range_, range_->begin(), range_->end());
|
|
|
|
for (auto mget_iter = current_level_range_.begin();
|
|
|
|
mget_iter != current_level_range_.end(); ++mget_iter) {
|
|
|
|
struct FilePickerContext& fp_ctx = fp_ctx_array_[mget_iter.index()];
|
|
|
|
if (curr_level_ == 0) {
|
|
|
|
// On Level-0, we read through all files to check for overlap.
|
|
|
|
start_index = 0;
|
|
|
|
level_contains_keys = true;
|
|
|
|
} else {
|
|
|
|
// On Level-n (n>=1), files are sorted. Binary search to find the
|
|
|
|
// earliest file whose largest key >= ikey. Search left bound and
|
|
|
|
// right bound are used to narrow the range.
|
|
|
|
if (fp_ctx.search_left_bound <= fp_ctx.search_right_bound) {
|
|
|
|
if (fp_ctx.search_right_bound == FileIndexer::kLevelMaxIndex) {
|
|
|
|
fp_ctx.search_right_bound =
|
|
|
|
static_cast<int32_t>(curr_file_level_->num_files) - 1;
|
|
|
|
}
|
|
|
|
// `search_right_bound_` is an inclusive upper-bound, but since it
|
|
|
|
// was determined based on user key, it is still possible the lookup
|
|
|
|
// key falls to the right of `search_right_bound_`'s corresponding
|
|
|
|
// file. So, pass a limit one higher, which allows us to detect this
|
|
|
|
// case.
|
|
|
|
Slice& ikey = mget_iter->ikey;
|
|
|
|
start_index = FindFileInRange(
|
|
|
|
*internal_comparator_, *curr_file_level_, ikey,
|
|
|
|
static_cast<uint32_t>(fp_ctx.search_left_bound),
|
|
|
|
static_cast<uint32_t>(fp_ctx.search_right_bound) + 1);
|
|
|
|
if (start_index == fp_ctx.search_right_bound + 1) {
|
|
|
|
// `ikey_` comes after `search_right_bound_`. The lookup key does
|
|
|
|
// not exist on this level, so let's skip this level and do a full
|
|
|
|
// binary search on the next level.
|
|
|
|
fp_ctx.search_left_bound = 0;
|
|
|
|
fp_ctx.search_right_bound = FileIndexer::kLevelMaxIndex;
|
|
|
|
current_level_range_.SkipKey(mget_iter);
|
|
|
|
continue;
|
|
|
|
} else {
|
|
|
|
level_contains_keys = true;
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
// search_left_bound > search_right_bound, key does not exist in
|
|
|
|
// this level. Since no comparison is done in this level, it will
|
|
|
|
// need to search all files in the next level.
|
|
|
|
fp_ctx.search_left_bound = 0;
|
|
|
|
fp_ctx.search_right_bound = FileIndexer::kLevelMaxIndex;
|
|
|
|
current_level_range_.SkipKey(mget_iter);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
fp_ctx.start_index_in_curr_level = start_index;
|
|
|
|
fp_ctx.curr_index_in_curr_level = start_index;
|
|
|
|
}
|
|
|
|
if (level_contains_keys) {
|
|
|
|
batch_iter_prev_ = current_level_range_.begin();
|
|
|
|
upper_key_ = batch_iter_ = current_level_range_.begin();
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
return true;
|
|
|
|
}
|
|
|
|
curr_level_++;
|
|
|
|
}
|
|
|
|
// curr_level_ = num_levels_. So, no more levels to search.
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
};
|
|
|
|
} // anonymous namespace
|
|
|
|
|
|
|
|
VersionStorageInfo::~VersionStorageInfo() { delete[] files_; }
|
|
|
|
|
|
|
|
Version::~Version() {
|
|
|
|
assert(refs_ == 0);
|
|
|
|
|
|
|
|
// Remove from linked list
|
|
|
|
prev_->next_ = next_;
|
|
|
|
next_->prev_ = prev_;
|
|
|
|
|
|
|
|
// Drop references to files
|
|
|
|
for (int level = 0; level < storage_info_.num_levels_; level++) {
|
|
|
|
for (size_t i = 0; i < storage_info_.files_[level].size(); i++) {
|
|
|
|
FileMetaData* f = storage_info_.files_[level][i];
|
|
|
|
assert(f->refs > 0);
|
|
|
|
f->refs--;
|
|
|
|
if (f->refs <= 0) {
|
|
|
|
assert(cfd_ != nullptr);
|
|
|
|
uint32_t path_id = f->fd.GetPathId();
|
|
|
|
assert(path_id < cfd_->ioptions()->cf_paths.size());
|
|
|
|
vset_->obsolete_files_.push_back(
|
|
|
|
ObsoleteFileInfo(f, cfd_->ioptions()->cf_paths[path_id].path));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
create compressed_levels_ in Version, allocate its space using arena. Make Version::Get, Version::FindFile faster
Summary:
Define CompressedFileMetaData that just contains fd, smallest_slice, largest_slice. Create compressed_levels_ in Version, the space is allocated using arena
Thus increase the file meta data locality, speed up "Get" and "FindFile"
benchmark with in-memory tmpfs, could have 4% improvement under "random read" and 2% improvement under "read while writing"
benchmark command:
./db_bench --db=/mnt/db/rocksdb --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --block_size=4096 --cache_size=17179869184 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=33554432 --max_bytes_for_level_base=1073741824 --disable_wal=0 --sync=0 --disable_data_sync=1 --verify_checksum=1 --delete_obsolete_files_period_micros=314572800 --max_grandparent_overlap_factor=10 --max_background_compactions=4 --max_background_flushes=0 --level0_slowdown_writes_trigger=16 --level0_stop_writes_trigger=24 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --mmap_read=1 --mmap_write=0 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --perf_level=0 --benchmarks=readwhilewriting,readwhilewriting,readwhilewriting --use_existing_db=1 --num=52428800 --threads=1 —writes_per_second=81920
Read Random:
From 1.8363 ms/op, improve to 1.7587 ms/op.
Read while writing:
From 2.985 ms/op, improve to 2.924 ms/op.
Test Plan:
make all check
Reviewers: ljin, haobo, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, igor
Differential Revision: https://reviews.facebook.net/D19419
11 years ago
|
|
|
int FindFile(const InternalKeyComparator& icmp,
|
|
|
|
const LevelFilesBrief& file_level,
|
create compressed_levels_ in Version, allocate its space using arena. Make Version::Get, Version::FindFile faster
Summary:
Define CompressedFileMetaData that just contains fd, smallest_slice, largest_slice. Create compressed_levels_ in Version, the space is allocated using arena
Thus increase the file meta data locality, speed up "Get" and "FindFile"
benchmark with in-memory tmpfs, could have 4% improvement under "random read" and 2% improvement under "read while writing"
benchmark command:
./db_bench --db=/mnt/db/rocksdb --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --block_size=4096 --cache_size=17179869184 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=33554432 --max_bytes_for_level_base=1073741824 --disable_wal=0 --sync=0 --disable_data_sync=1 --verify_checksum=1 --delete_obsolete_files_period_micros=314572800 --max_grandparent_overlap_factor=10 --max_background_compactions=4 --max_background_flushes=0 --level0_slowdown_writes_trigger=16 --level0_stop_writes_trigger=24 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --mmap_read=1 --mmap_write=0 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --perf_level=0 --benchmarks=readwhilewriting,readwhilewriting,readwhilewriting --use_existing_db=1 --num=52428800 --threads=1 —writes_per_second=81920
Read Random:
From 1.8363 ms/op, improve to 1.7587 ms/op.
Read while writing:
From 2.985 ms/op, improve to 2.924 ms/op.
Test Plan:
make all check
Reviewers: ljin, haobo, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, igor
Differential Revision: https://reviews.facebook.net/D19419
11 years ago
|
|
|
const Slice& key) {
|
|
|
|
return FindFileInRange(icmp, file_level, key, 0,
|
|
|
|
static_cast<uint32_t>(file_level.num_files));
|
create compressed_levels_ in Version, allocate its space using arena. Make Version::Get, Version::FindFile faster
Summary:
Define CompressedFileMetaData that just contains fd, smallest_slice, largest_slice. Create compressed_levels_ in Version, the space is allocated using arena
Thus increase the file meta data locality, speed up "Get" and "FindFile"
benchmark with in-memory tmpfs, could have 4% improvement under "random read" and 2% improvement under "read while writing"
benchmark command:
./db_bench --db=/mnt/db/rocksdb --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --block_size=4096 --cache_size=17179869184 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=33554432 --max_bytes_for_level_base=1073741824 --disable_wal=0 --sync=0 --disable_data_sync=1 --verify_checksum=1 --delete_obsolete_files_period_micros=314572800 --max_grandparent_overlap_factor=10 --max_background_compactions=4 --max_background_flushes=0 --level0_slowdown_writes_trigger=16 --level0_stop_writes_trigger=24 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --mmap_read=1 --mmap_write=0 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --perf_level=0 --benchmarks=readwhilewriting,readwhilewriting,readwhilewriting --use_existing_db=1 --num=52428800 --threads=1 —writes_per_second=81920
Read Random:
From 1.8363 ms/op, improve to 1.7587 ms/op.
Read while writing:
From 2.985 ms/op, improve to 2.924 ms/op.
Test Plan:
make all check
Reviewers: ljin, haobo, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, igor
Differential Revision: https://reviews.facebook.net/D19419
11 years ago
|
|
|
}
|
|
|
|
|
|
|
|
void DoGenerateLevelFilesBrief(LevelFilesBrief* file_level,
|
|
|
|
const std::vector<FileMetaData*>& files,
|
|
|
|
Arena* arena) {
|
|
|
|
assert(file_level);
|
|
|
|
assert(arena);
|
|
|
|
|
|
|
|
size_t num = files.size();
|
|
|
|
file_level->num_files = num;
|
|
|
|
char* mem = arena->AllocateAligned(num * sizeof(FdWithKeyRange));
|
|
|
|
file_level->files = new (mem)FdWithKeyRange[num];
|
|
|
|
|
|
|
|
for (size_t i = 0; i < num; i++) {
|
|
|
|
Slice smallest_key = files[i]->smallest.Encode();
|
|
|
|
Slice largest_key = files[i]->largest.Encode();
|
|
|
|
|
|
|
|
// Copy key slice to sequential memory
|
|
|
|
size_t smallest_size = smallest_key.size();
|
|
|
|
size_t largest_size = largest_key.size();
|
|
|
|
mem = arena->AllocateAligned(smallest_size + largest_size);
|
|
|
|
memcpy(mem, smallest_key.data(), smallest_size);
|
|
|
|
memcpy(mem + smallest_size, largest_key.data(), largest_size);
|
|
|
|
|
|
|
|
FdWithKeyRange& f = file_level->files[i];
|
|
|
|
f.fd = files[i]->fd;
|
|
|
|
f.file_metadata = files[i];
|
|
|
|
f.smallest_key = Slice(mem, smallest_size);
|
|
|
|
f.largest_key = Slice(mem + smallest_size, largest_size);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
static bool AfterFile(const Comparator* ucmp,
|
create compressed_levels_ in Version, allocate its space using arena. Make Version::Get, Version::FindFile faster
Summary:
Define CompressedFileMetaData that just contains fd, smallest_slice, largest_slice. Create compressed_levels_ in Version, the space is allocated using arena
Thus increase the file meta data locality, speed up "Get" and "FindFile"
benchmark with in-memory tmpfs, could have 4% improvement under "random read" and 2% improvement under "read while writing"
benchmark command:
./db_bench --db=/mnt/db/rocksdb --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --block_size=4096 --cache_size=17179869184 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=33554432 --max_bytes_for_level_base=1073741824 --disable_wal=0 --sync=0 --disable_data_sync=1 --verify_checksum=1 --delete_obsolete_files_period_micros=314572800 --max_grandparent_overlap_factor=10 --max_background_compactions=4 --max_background_flushes=0 --level0_slowdown_writes_trigger=16 --level0_stop_writes_trigger=24 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --mmap_read=1 --mmap_write=0 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --perf_level=0 --benchmarks=readwhilewriting,readwhilewriting,readwhilewriting --use_existing_db=1 --num=52428800 --threads=1 —writes_per_second=81920
Read Random:
From 1.8363 ms/op, improve to 1.7587 ms/op.
Read while writing:
From 2.985 ms/op, improve to 2.924 ms/op.
Test Plan:
make all check
Reviewers: ljin, haobo, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, igor
Differential Revision: https://reviews.facebook.net/D19419
11 years ago
|
|
|
const Slice* user_key, const FdWithKeyRange* f) {
|
|
|
|
// nullptr user_key occurs before all keys and is therefore never after *f
|
|
|
|
return (user_key != nullptr &&
|
|
|
|
ucmp->CompareWithoutTimestamp(*user_key,
|
|
|
|
ExtractUserKey(f->largest_key)) > 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
static bool BeforeFile(const Comparator* ucmp,
|
create compressed_levels_ in Version, allocate its space using arena. Make Version::Get, Version::FindFile faster
Summary:
Define CompressedFileMetaData that just contains fd, smallest_slice, largest_slice. Create compressed_levels_ in Version, the space is allocated using arena
Thus increase the file meta data locality, speed up "Get" and "FindFile"
benchmark with in-memory tmpfs, could have 4% improvement under "random read" and 2% improvement under "read while writing"
benchmark command:
./db_bench --db=/mnt/db/rocksdb --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --block_size=4096 --cache_size=17179869184 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=33554432 --max_bytes_for_level_base=1073741824 --disable_wal=0 --sync=0 --disable_data_sync=1 --verify_checksum=1 --delete_obsolete_files_period_micros=314572800 --max_grandparent_overlap_factor=10 --max_background_compactions=4 --max_background_flushes=0 --level0_slowdown_writes_trigger=16 --level0_stop_writes_trigger=24 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --mmap_read=1 --mmap_write=0 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --perf_level=0 --benchmarks=readwhilewriting,readwhilewriting,readwhilewriting --use_existing_db=1 --num=52428800 --threads=1 —writes_per_second=81920
Read Random:
From 1.8363 ms/op, improve to 1.7587 ms/op.
Read while writing:
From 2.985 ms/op, improve to 2.924 ms/op.
Test Plan:
make all check
Reviewers: ljin, haobo, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, igor
Differential Revision: https://reviews.facebook.net/D19419
11 years ago
|
|
|
const Slice* user_key, const FdWithKeyRange* f) {
|
|
|
|
// nullptr user_key occurs after all keys and is therefore never before *f
|
|
|
|
return (user_key != nullptr &&
|
|
|
|
ucmp->CompareWithoutTimestamp(*user_key,
|
|
|
|
ExtractUserKey(f->smallest_key)) < 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
bool SomeFileOverlapsRange(
|
|
|
|
const InternalKeyComparator& icmp,
|
|
|
|
bool disjoint_sorted_files,
|
|
|
|
const LevelFilesBrief& file_level,
|
|
|
|
const Slice* smallest_user_key,
|
|
|
|
const Slice* largest_user_key) {
|
|
|
|
const Comparator* ucmp = icmp.user_comparator();
|
|
|
|
if (!disjoint_sorted_files) {
|
|
|
|
// Need to check against all files
|
create compressed_levels_ in Version, allocate its space using arena. Make Version::Get, Version::FindFile faster
Summary:
Define CompressedFileMetaData that just contains fd, smallest_slice, largest_slice. Create compressed_levels_ in Version, the space is allocated using arena
Thus increase the file meta data locality, speed up "Get" and "FindFile"
benchmark with in-memory tmpfs, could have 4% improvement under "random read" and 2% improvement under "read while writing"
benchmark command:
./db_bench --db=/mnt/db/rocksdb --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --block_size=4096 --cache_size=17179869184 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=33554432 --max_bytes_for_level_base=1073741824 --disable_wal=0 --sync=0 --disable_data_sync=1 --verify_checksum=1 --delete_obsolete_files_period_micros=314572800 --max_grandparent_overlap_factor=10 --max_background_compactions=4 --max_background_flushes=0 --level0_slowdown_writes_trigger=16 --level0_stop_writes_trigger=24 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --mmap_read=1 --mmap_write=0 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --perf_level=0 --benchmarks=readwhilewriting,readwhilewriting,readwhilewriting --use_existing_db=1 --num=52428800 --threads=1 —writes_per_second=81920
Read Random:
From 1.8363 ms/op, improve to 1.7587 ms/op.
Read while writing:
From 2.985 ms/op, improve to 2.924 ms/op.
Test Plan:
make all check
Reviewers: ljin, haobo, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, igor
Differential Revision: https://reviews.facebook.net/D19419
11 years ago
|
|
|
for (size_t i = 0; i < file_level.num_files; i++) {
|
|
|
|
const FdWithKeyRange* f = &(file_level.files[i]);
|
|
|
|
if (AfterFile(ucmp, smallest_user_key, f) ||
|
|
|
|
BeforeFile(ucmp, largest_user_key, f)) {
|
|
|
|
// No overlap
|
|
|
|
} else {
|
|
|
|
return true; // Overlap
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
// Binary search over file list
|
|
|
|
uint32_t index = 0;
|
|
|
|
if (smallest_user_key != nullptr) {
|
|
|
|
// Find the leftmost possible internal key for smallest_user_key
|
|
|
|
InternalKey small;
|
|
|
|
small.SetMinPossibleForUserKey(*smallest_user_key);
|
create compressed_levels_ in Version, allocate its space using arena. Make Version::Get, Version::FindFile faster
Summary:
Define CompressedFileMetaData that just contains fd, smallest_slice, largest_slice. Create compressed_levels_ in Version, the space is allocated using arena
Thus increase the file meta data locality, speed up "Get" and "FindFile"
benchmark with in-memory tmpfs, could have 4% improvement under "random read" and 2% improvement under "read while writing"
benchmark command:
./db_bench --db=/mnt/db/rocksdb --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --block_size=4096 --cache_size=17179869184 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=33554432 --max_bytes_for_level_base=1073741824 --disable_wal=0 --sync=0 --disable_data_sync=1 --verify_checksum=1 --delete_obsolete_files_period_micros=314572800 --max_grandparent_overlap_factor=10 --max_background_compactions=4 --max_background_flushes=0 --level0_slowdown_writes_trigger=16 --level0_stop_writes_trigger=24 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --mmap_read=1 --mmap_write=0 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --perf_level=0 --benchmarks=readwhilewriting,readwhilewriting,readwhilewriting --use_existing_db=1 --num=52428800 --threads=1 —writes_per_second=81920
Read Random:
From 1.8363 ms/op, improve to 1.7587 ms/op.
Read while writing:
From 2.985 ms/op, improve to 2.924 ms/op.
Test Plan:
make all check
Reviewers: ljin, haobo, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, igor
Differential Revision: https://reviews.facebook.net/D19419
11 years ago
|
|
|
index = FindFile(icmp, file_level, small.Encode());
|
|
|
|
}
|
|
|
|
|
create compressed_levels_ in Version, allocate its space using arena. Make Version::Get, Version::FindFile faster
Summary:
Define CompressedFileMetaData that just contains fd, smallest_slice, largest_slice. Create compressed_levels_ in Version, the space is allocated using arena
Thus increase the file meta data locality, speed up "Get" and "FindFile"
benchmark with in-memory tmpfs, could have 4% improvement under "random read" and 2% improvement under "read while writing"
benchmark command:
./db_bench --db=/mnt/db/rocksdb --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --block_size=4096 --cache_size=17179869184 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=33554432 --max_bytes_for_level_base=1073741824 --disable_wal=0 --sync=0 --disable_data_sync=1 --verify_checksum=1 --delete_obsolete_files_period_micros=314572800 --max_grandparent_overlap_factor=10 --max_background_compactions=4 --max_background_flushes=0 --level0_slowdown_writes_trigger=16 --level0_stop_writes_trigger=24 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --mmap_read=1 --mmap_write=0 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --perf_level=0 --benchmarks=readwhilewriting,readwhilewriting,readwhilewriting --use_existing_db=1 --num=52428800 --threads=1 —writes_per_second=81920
Read Random:
From 1.8363 ms/op, improve to 1.7587 ms/op.
Read while writing:
From 2.985 ms/op, improve to 2.924 ms/op.
Test Plan:
make all check
Reviewers: ljin, haobo, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, igor
Differential Revision: https://reviews.facebook.net/D19419
11 years ago
|
|
|
if (index >= file_level.num_files) {
|
|
|
|
// beginning of range is after all files, so no overlap.
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
create compressed_levels_ in Version, allocate its space using arena. Make Version::Get, Version::FindFile faster
Summary:
Define CompressedFileMetaData that just contains fd, smallest_slice, largest_slice. Create compressed_levels_ in Version, the space is allocated using arena
Thus increase the file meta data locality, speed up "Get" and "FindFile"
benchmark with in-memory tmpfs, could have 4% improvement under "random read" and 2% improvement under "read while writing"
benchmark command:
./db_bench --db=/mnt/db/rocksdb --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --block_size=4096 --cache_size=17179869184 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=33554432 --max_bytes_for_level_base=1073741824 --disable_wal=0 --sync=0 --disable_data_sync=1 --verify_checksum=1 --delete_obsolete_files_period_micros=314572800 --max_grandparent_overlap_factor=10 --max_background_compactions=4 --max_background_flushes=0 --level0_slowdown_writes_trigger=16 --level0_stop_writes_trigger=24 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --mmap_read=1 --mmap_write=0 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --perf_level=0 --benchmarks=readwhilewriting,readwhilewriting,readwhilewriting --use_existing_db=1 --num=52428800 --threads=1 —writes_per_second=81920
Read Random:
From 1.8363 ms/op, improve to 1.7587 ms/op.
Read while writing:
From 2.985 ms/op, improve to 2.924 ms/op.
Test Plan:
make all check
Reviewers: ljin, haobo, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, igor
Differential Revision: https://reviews.facebook.net/D19419
11 years ago
|
|
|
return !BeforeFile(ucmp, largest_user_key, &file_level.files[index]);
|
|
|
|
}
|
|
|
|
|
|
|
|
namespace {
|
|
|
|
|
|
|
|
class LevelIterator final : public InternalIterator {
|
|
|
|
public:
|
|
|
|
// @param read_options Must outlive this iterator.
|
|
|
|
LevelIterator(TableCache* table_cache, const ReadOptions& read_options,
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
|
|
|
const FileOptions& file_options,
|
|
|
|
const InternalKeyComparator& icomparator,
|
|
|
|
const LevelFilesBrief* flevel,
|
|
|
|
const SliceTransform* prefix_extractor, bool should_sample,
|
|
|
|
HistogramImpl* file_read_hist, TableReaderCaller caller,
|
|
|
|
bool skip_filters, int level, RangeDelAggregator* range_del_agg,
|
|
|
|
const std::vector<AtomicCompactionUnitBoundary>*
|
Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621)
Summary:
Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype.
Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.
It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.
Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621
Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats.
Reviewed By: siying
Differential Revision: D20786930
Pulled By: al13n321
fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee
5 years ago
|
|
|
compaction_boundaries = nullptr,
|
|
|
|
bool allow_unprepared_value = false)
|
|
|
|
: table_cache_(table_cache),
|
|
|
|
read_options_(read_options),
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
|
|
|
file_options_(file_options),
|
|
|
|
icomparator_(icomparator),
|
|
|
|
user_comparator_(icomparator.user_comparator()),
|
|
|
|
flevel_(flevel),
|
|
|
|
prefix_extractor_(prefix_extractor),
|
|
|
|
file_read_hist_(file_read_hist),
|
|
|
|
should_sample_(should_sample),
|
|
|
|
caller_(caller),
|
|
|
|
skip_filters_(skip_filters),
|
Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621)
Summary:
Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype.
Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.
It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.
Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621
Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats.
Reviewed By: siying
Differential Revision: D20786930
Pulled By: al13n321
fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee
5 years ago
|
|
|
allow_unprepared_value_(allow_unprepared_value),
|
|
|
|
file_index_(flevel_->num_files),
|
|
|
|
level_(level),
|
|
|
|
range_del_agg_(range_del_agg),
|
|
|
|
pinned_iters_mgr_(nullptr),
|
|
|
|
compaction_boundaries_(compaction_boundaries) {
|
|
|
|
// Empty level is not supported.
|
|
|
|
assert(flevel_ != nullptr && flevel_->num_files > 0);
|
|
|
|
}
|
|
|
|
|
|
|
|
~LevelIterator() override { delete file_iter_.Set(nullptr); }
|
|
|
|
|
|
|
|
void Seek(const Slice& target) override;
|
|
|
|
void SeekForPrev(const Slice& target) override;
|
|
|
|
void SeekToFirst() override;
|
|
|
|
void SeekToLast() override;
|
|
|
|
void Next() final override;
|
|
|
|
bool NextAndGetResult(IterateResult* result) override;
|
|
|
|
void Prev() override;
|
|
|
|
|
|
|
|
bool Valid() const override { return file_iter_.Valid(); }
|
|
|
|
Slice key() const override {
|
|
|
|
assert(Valid());
|
|
|
|
return file_iter_.key();
|
|
|
|
}
|
|
|
|
|
|
|
|
Slice value() const override {
|
|
|
|
assert(Valid());
|
|
|
|
return file_iter_.value();
|
|
|
|
}
|
|
|
|
|
|
|
|
Status status() const override {
|
Change and clarify the relationship between Valid(), status() and Seek*() for all iterators. Also fix some bugs
Summary:
Before this PR, Iterator/InternalIterator may simultaneously have non-ok status() and Valid() = true. That state means that the last operation failed, but the iterator is nevertheless positioned on some unspecified record. Likely intended uses of that are:
* If some sst files are corrupted, a normal iterator can be used to read the data from files that are not corrupted.
* When using read_tier = kBlockCacheTier, read the data that's in block cache, skipping over the data that is not.
However, this behavior wasn't documented well (and until recently the wiki on github had misleading incorrect information). In the code there's a lot of confusion about the relationship between status() and Valid(), and about whether Seek()/SeekToLast()/etc reset the status or not. There were a number of bugs caused by this confusion, both inside rocksdb and in the code that uses rocksdb (including ours).
This PR changes the convention to:
* If status() is not ok, Valid() always returns false.
* Any seek operation resets status. (Before the PR, it depended on iterator type and on particular error.)
This does sacrifice the two use cases listed above, but siying said it's ok.
Overview of the changes:
* A commit that adds missing status checks in MergingIterator. This fixes a bug that actually affects us, and we need it fixed. `DBIteratorTest.NonBlockingIterationBugRepro` explains the scenario.
* Changes to lots of iterator types to make all of them conform to the new convention. Some bug fixes along the way. By far the biggest changes are in DBIter, which is a big messy piece of code; I tried to make it less big and messy but mostly failed.
* A stress-test for DBIter, to gain some confidence that I didn't break it. It does a few million random operations on the iterator, while occasionally modifying the underlying data (like ForwardIterator does) and occasionally returning non-ok status from internal iterator.
To find the iterator types that needed changes I searched for "public .*Iterator" in the code. Here's an overview of all 27 iterator types:
Iterators that didn't need changes:
* status() is always ok(), or Valid() is always false: MemTableIterator, ModelIter, TestIterator, KVIter (2 classes with this name anonymous namespaces), LoggingForwardVectorIterator, VectorIterator, MockTableIterator, EmptyIterator, EmptyInternalIterator.
* Thin wrappers that always pass through Valid() and status(): ArenaWrappedDBIter, TtlIterator, InternalIteratorFromIterator.
Iterators with changes (see inline comments for details):
* DBIter - an overhaul:
- It used to silently skip corrupted keys (`FindParseableKey()`), which seems dangerous. This PR makes it just stop immediately after encountering a corrupted key, just like it would for other kinds of corruption. Let me know if there was actually some deeper meaning in this behavior and I should put it back.
- It had a few code paths silently discarding subiterator's status. The stress test caught a few.
- The backwards iteration code path was expecting the internal iterator's set of keys to be immutable. It's probably always true in practice at the moment, since ForwardIterator doesn't support backwards iteration, but this PR fixes it anyway. See added DBIteratorTest.ReverseToForwardBug for an example.
- Some parts of backwards iteration code path even did things like `assert(iter_->Valid())` after a seek, which is never a safe assumption.
- It used to not reset status on seek for some types of errors.
- Some simplifications and better comments.
- Some things got more complicated from the added error handling. I'm open to ideas for how to make it nicer.
* MergingIterator - check status after every operation on every subiterator, and in some places assert that valid subiterators have ok status.
* ForwardIterator - changed to the new convention, also slightly simplified.
* ForwardLevelIterator - fixed some bugs and simplified.
* LevelIterator - simplified.
* TwoLevelIterator - changed to the new convention. Also fixed a bug that would make SeekForPrev() sometimes silently ignore errors from first_level_iter_.
* BlockBasedTableIterator - minor changes.
* BlockIter - replaced `SetStatus()` with `Invalidate()` to make sure non-ok BlockIter is always invalid.
* PlainTableIterator - some seeks used to not reset status.
* CuckooTableIterator - tiny code cleanup.
* ManagedIterator - fixed some bugs.
* BaseDeltaIterator - changed to the new convention and fixed a bug.
* BlobDBIterator - seeks used to not reset status.
* KeyConvertingIterator - some small change.
Closes https://github.com/facebook/rocksdb/pull/3810
Differential Revision: D7888019
Pulled By: al13n321
fbshipit-source-id: 4aaf6d3421c545d16722a815b2fa2e7912bc851d
7 years ago
|
|
|
return file_iter_.iter() ? file_iter_.status() : Status::OK();
|
|
|
|
}
|
|
|
|
|
Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621)
Summary:
Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype.
Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.
It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.
Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621
Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats.
Reviewed By: siying
Differential Revision: D20786930
Pulled By: al13n321
fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee
5 years ago
|
|
|
bool PrepareValue() override {
|
|
|
|
return file_iter_.PrepareValue();
|
|
|
|
}
|
|
|
|
|
|
|
|
inline bool MayBeOutOfLowerBound() override {
|
|
|
|
assert(Valid());
|
|
|
|
return may_be_out_of_lower_bound_ && file_iter_.MayBeOutOfLowerBound();
|
|
|
|
}
|
|
|
|
|
|
|
|
inline IterBoundCheck UpperBoundCheckResult() override {
|
|
|
|
if (Valid()) {
|
|
|
|
return file_iter_.UpperBoundCheckResult();
|
|
|
|
} else {
|
|
|
|
return IterBoundCheck::kUnknown;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
void SetPinnedItersMgr(PinnedIteratorsManager* pinned_iters_mgr) override {
|
|
|
|
pinned_iters_mgr_ = pinned_iters_mgr;
|
|
|
|
if (file_iter_.iter()) {
|
|
|
|
file_iter_.SetPinnedItersMgr(pinned_iters_mgr);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
bool IsKeyPinned() const override {
|
|
|
|
return pinned_iters_mgr_ && pinned_iters_mgr_->PinningEnabled() &&
|
|
|
|
file_iter_.iter() && file_iter_.IsKeyPinned();
|
|
|
|
}
|
|
|
|
|
|
|
|
bool IsValuePinned() const override {
|
|
|
|
return pinned_iters_mgr_ && pinned_iters_mgr_->PinningEnabled() &&
|
|
|
|
file_iter_.iter() && file_iter_.IsValuePinned();
|
|
|
|
}
|
|
|
|
|
|
|
|
private:
|
|
|
|
// Return true if at least one invalid file is seen and skipped.
|
|
|
|
bool SkipEmptyFileForward();
|
|
|
|
void SkipEmptyFileBackward();
|
|
|
|
void SetFileIterator(InternalIterator* iter);
|
|
|
|
void InitFileIterator(size_t new_file_index);
|
|
|
|
|
|
|
|
const Slice& file_smallest_key(size_t file_index) {
|
|
|
|
assert(file_index < flevel_->num_files);
|
|
|
|
return flevel_->files[file_index].smallest_key;
|
|
|
|
}
|
|
|
|
|
|
|
|
bool KeyReachedUpperBound(const Slice& internal_key) {
|
|
|
|
return read_options_.iterate_upper_bound != nullptr &&
|
|
|
|
user_comparator_.CompareWithoutTimestamp(
|
|
|
|
ExtractUserKey(internal_key), /*a_has_ts=*/true,
|
|
|
|
*read_options_.iterate_upper_bound, /*b_has_ts=*/false) >= 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
InternalIterator* NewFileIterator() {
|
|
|
|
assert(file_index_ < flevel_->num_files);
|
|
|
|
auto file_meta = flevel_->files[file_index_];
|
|
|
|
if (should_sample_) {
|
|
|
|
sample_file_read_inc(file_meta.file_metadata);
|
|
|
|
}
|
|
|
|
|
|
|
|
const InternalKey* smallest_compaction_key = nullptr;
|
|
|
|
const InternalKey* largest_compaction_key = nullptr;
|
|
|
|
if (compaction_boundaries_ != nullptr) {
|
|
|
|
smallest_compaction_key = (*compaction_boundaries_)[file_index_].smallest;
|
|
|
|
largest_compaction_key = (*compaction_boundaries_)[file_index_].largest;
|
|
|
|
}
|
|
|
|
CheckMayBeOutOfLowerBound();
|
|
|
|
return table_cache_->NewIterator(
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
|
|
|
read_options_, file_options_, icomparator_, *file_meta.file_metadata,
|
|
|
|
range_del_agg_, prefix_extractor_,
|
|
|
|
nullptr /* don't need reference to table */, file_read_hist_, caller_,
|
|
|
|
/*arena=*/nullptr, skip_filters_, level_,
|
|
|
|
/*max_file_size_for_l0_meta_pin=*/0, smallest_compaction_key,
|
Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621)
Summary:
Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype.
Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.
It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.
Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621
Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats.
Reviewed By: siying
Differential Revision: D20786930
Pulled By: al13n321
fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee
5 years ago
|
|
|
largest_compaction_key, allow_unprepared_value_);
|
|
|
|
}
|
|
|
|
|
|
|
|
// Check if current file being fully within iterate_lower_bound.
|
|
|
|
//
|
|
|
|
// Note MyRocks may update iterate bounds between seek. To workaround it,
|
|
|
|
// we need to check and update may_be_out_of_lower_bound_ accordingly.
|
|
|
|
void CheckMayBeOutOfLowerBound() {
|
|
|
|
if (read_options_.iterate_lower_bound != nullptr &&
|
|
|
|
file_index_ < flevel_->num_files) {
|
|
|
|
may_be_out_of_lower_bound_ =
|
|
|
|
user_comparator_.CompareWithoutTimestamp(
|
|
|
|
ExtractUserKey(file_smallest_key(file_index_)), /*a_has_ts=*/true,
|
|
|
|
*read_options_.iterate_lower_bound, /*b_has_ts=*/false) < 0;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
TableCache* table_cache_;
|
|
|
|
const ReadOptions& read_options_;
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
|
|
|
const FileOptions& file_options_;
|
|
|
|
const InternalKeyComparator& icomparator_;
|
|
|
|
const UserComparatorWrapper user_comparator_;
|
|
|
|
const LevelFilesBrief* flevel_;
|
|
|
|
mutable FileDescriptor current_value_;
|
|
|
|
// `prefix_extractor_` may be non-null even for total order seek. Checking
|
|
|
|
// this variable is not the right way to identify whether prefix iterator
|
|
|
|
// is used.
|
|
|
|
const SliceTransform* prefix_extractor_;
|
|
|
|
|
Measure file read latency histogram per level
Summary: In internal stats, remember read latency histogram, if statistics is enabled. It can be retrieved from DB::GetProperty() with "rocksdb.dbstats" property, if it is enabled.
Test Plan: Manually run db_bench and prints out "rocksdb.dbstats" by hand and make sure it prints out as expected
Reviewers: igor, IslamAbdelRahman, rven, kradhakrishnan, anthony, yhchiang
Reviewed By: yhchiang
Subscribers: MarkCallaghan, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D44193
9 years ago
|
|
|
HistogramImpl* file_read_hist_;
|
|
|
|
bool should_sample_;
|
|
|
|
TableReaderCaller caller_;
|
Skip bottom-level filter block caching when hit-optimized
Summary:
When Get() or NewIterator() trigger file loads, skip caching the filter block if
(1) optimize_filters_for_hits is set and (2) the file is on the bottommost
level. Also skip checking filters under the same conditions, which means that
for a preloaded file or a file that was trivially-moved to the bottom level, its
filter block will eventually expire from the cache.
- added parameters/instance variables in various places in order to propagate the config ("skip_filters") from version_set to block_based_table_reader
- in BlockBasedTable::Rep, this optimization prevents filter from being loaded when the file is opened simply by setting filter_policy = nullptr
- in BlockBasedTable::Get/BlockBasedTable::NewIterator, this optimization prevents filter from being used (even if it was loaded already) by setting filter = nullptr
Test Plan:
updated unit test:
$ ./db_test --gtest_filter=DBTest.OptimizeFiltersForHits
will also run 'make check'
Reviewers: sdong, igor, paultuckfield, anthony, rven, kradhakrishnan, IslamAbdelRahman, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D51633
9 years ago
|
|
|
bool skip_filters_;
|
Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621)
Summary:
Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype.
Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.
It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.
Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621
Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats.
Reviewed By: siying
Differential Revision: D20786930
Pulled By: al13n321
fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee
5 years ago
|
|
|
bool allow_unprepared_value_;
|
|
|
|
bool may_be_out_of_lower_bound_ = true;
|
|
|
|
size_t file_index_;
|
Adding pin_l0_filter_and_index_blocks_in_cache feature and related fixes.
Summary:
When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache.
What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention.
Test Plan:
'export TEST_TMPDIR=/dev/shm/ && DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32' is OK.
I didn't run the Java tests, I don't have Java set up on my devserver.
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56133
9 years ago
|
|
|
int level_;
|
|
|
|
RangeDelAggregator* range_del_agg_;
|
|
|
|
IteratorWrapper file_iter_; // May be nullptr
|
|
|
|
PinnedIteratorsManager* pinned_iters_mgr_;
|
|
|
|
|
|
|
|
// To be propagated to RangeDelAggregator in order to safely truncate range
|
|
|
|
// tombstones.
|
|
|
|
const std::vector<AtomicCompactionUnitBoundary>* compaction_boundaries_;
|
|
|
|
};
|
|
|
|
|
|
|
|
void LevelIterator::Seek(const Slice& target) {
|
|
|
|
// Check whether the seek key fall under the same file
|
|
|
|
bool need_to_reseek = true;
|
|
|
|
if (file_iter_.iter() != nullptr && file_index_ < flevel_->num_files) {
|
|
|
|
const FdWithKeyRange& cur_file = flevel_->files[file_index_];
|
|
|
|
if (icomparator_.InternalKeyComparator::Compare(
|
|
|
|
target, cur_file.largest_key) <= 0 &&
|
|
|
|
icomparator_.InternalKeyComparator::Compare(
|
|
|
|
target, cur_file.smallest_key) >= 0) {
|
|
|
|
need_to_reseek = false;
|
|
|
|
assert(static_cast<size_t>(FindFile(icomparator_, *flevel_, target)) ==
|
|
|
|
file_index_);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (need_to_reseek) {
|
|
|
|
TEST_SYNC_POINT("LevelIterator::Seek:BeforeFindFile");
|
|
|
|
size_t new_file_index = FindFile(icomparator_, *flevel_, target);
|
|
|
|
InitFileIterator(new_file_index);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (file_iter_.iter() != nullptr) {
|
|
|
|
file_iter_.Seek(target);
|
|
|
|
}
|
|
|
|
if (SkipEmptyFileForward() && prefix_extractor_ != nullptr &&
|
|
|
|
!read_options_.total_order_seek && !read_options_.auto_prefix_mode &&
|
|
|
|
file_iter_.iter() != nullptr && file_iter_.Valid()) {
|
|
|
|
// We've skipped the file we initially positioned to. In the prefix
|
|
|
|
// seek case, it is likely that the file is skipped because of
|
|
|
|
// prefix bloom or hash, where more keys are skipped. We then check
|
|
|
|
// the current key and invalidate the iterator if the prefix is
|
|
|
|
// already passed.
|
|
|
|
// When doing prefix iterator seek, when keys for one prefix have
|
|
|
|
// been exhausted, it can jump to any key that is larger. Here we are
|
|
|
|
// enforcing a stricter contract than that, in order to make it easier for
|
|
|
|
// higher layers (merging and DB iterator) to reason the correctness:
|
|
|
|
// 1. Within the prefix, the result should be accurate.
|
|
|
|
// 2. If keys for the prefix is exhausted, it is either positioned to the
|
|
|
|
// next key after the prefix, or make the iterator invalid.
|
|
|
|
// A side benefit will be that it invalidates the iterator earlier so that
|
|
|
|
// the upper level merging iterator can merge fewer child iterators.
|
|
|
|
size_t ts_sz = user_comparator_.timestamp_size();
|
|
|
|
Slice target_user_key_without_ts =
|
|
|
|
ExtractUserKeyAndStripTimestamp(target, ts_sz);
|
|
|
|
Slice file_user_key_without_ts =
|
|
|
|
ExtractUserKeyAndStripTimestamp(file_iter_.key(), ts_sz);
|
|
|
|
if (prefix_extractor_->InDomain(target_user_key_without_ts) &&
|
|
|
|
(!prefix_extractor_->InDomain(file_user_key_without_ts) ||
|
|
|
|
user_comparator_.CompareWithoutTimestamp(
|
|
|
|
prefix_extractor_->Transform(target_user_key_without_ts), false,
|
|
|
|
prefix_extractor_->Transform(file_user_key_without_ts),
|
|
|
|
false) != 0)) {
|
|
|
|
SetFileIterator(nullptr);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
CheckMayBeOutOfLowerBound();
|
|
|
|
}
|
|
|
|
|
|
|
|
void LevelIterator::SeekForPrev(const Slice& target) {
|
|
|
|
size_t new_file_index = FindFile(icomparator_, *flevel_, target);
|
|
|
|
if (new_file_index >= flevel_->num_files) {
|
|
|
|
new_file_index = flevel_->num_files - 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
InitFileIterator(new_file_index);
|
|
|
|
if (file_iter_.iter() != nullptr) {
|
|
|
|
file_iter_.SeekForPrev(target);
|
|
|
|
SkipEmptyFileBackward();
|
|
|
|
}
|
|
|
|
CheckMayBeOutOfLowerBound();
|
|
|
|
}
|
|
|
|
|
|
|
|
void LevelIterator::SeekToFirst() {
|
|
|
|
InitFileIterator(0);
|
|
|
|
if (file_iter_.iter() != nullptr) {
|
|
|
|
file_iter_.SeekToFirst();
|
|
|
|
}
|
|
|
|
SkipEmptyFileForward();
|
|
|
|
CheckMayBeOutOfLowerBound();
|
|
|
|
}
|
|
|
|
|
|
|
|
void LevelIterator::SeekToLast() {
|
|
|
|
InitFileIterator(flevel_->num_files - 1);
|
|
|
|
if (file_iter_.iter() != nullptr) {
|
|
|
|
file_iter_.SeekToLast();
|
|
|
|
}
|
|
|
|
SkipEmptyFileBackward();
|
|
|
|
CheckMayBeOutOfLowerBound();
|
|
|
|
}
|
|
|
|
|
|
|
|
void LevelIterator::Next() {
|
|
|
|
assert(Valid());
|
|
|
|
file_iter_.Next();
|
|
|
|
SkipEmptyFileForward();
|
|
|
|
}
|
|
|
|
|
|
|
|
bool LevelIterator::NextAndGetResult(IterateResult* result) {
|
|
|
|
assert(Valid());
|
|
|
|
bool is_valid = file_iter_.NextAndGetResult(result);
|
|
|
|
if (!is_valid) {
|
|
|
|
SkipEmptyFileForward();
|
|
|
|
is_valid = Valid();
|
|
|
|
if (is_valid) {
|
|
|
|
result->key = key();
|
|
|
|
result->bound_check_result = file_iter_.UpperBoundCheckResult();
|
|
|
|
// Ideally, we should return the real file_iter_.value_prepared but the
|
|
|
|
// information is not here. It would casue an extra PrepareValue()
|
|
|
|
// for the first key of a file.
|
|
|
|
result->value_prepared = !allow_unprepared_value_;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return is_valid;
|
|
|
|
}
|
|
|
|
|
|
|
|
void LevelIterator::Prev() {
|
|
|
|
assert(Valid());
|
|
|
|
file_iter_.Prev();
|
|
|
|
SkipEmptyFileBackward();
|
|
|
|
}
|
|
|
|
|
|
|
|
bool LevelIterator::SkipEmptyFileForward() {
|
|
|
|
bool seen_empty_file = false;
|
|
|
|
while (file_iter_.iter() == nullptr ||
|
Change and clarify the relationship between Valid(), status() and Seek*() for all iterators. Also fix some bugs
Summary:
Before this PR, Iterator/InternalIterator may simultaneously have non-ok status() and Valid() = true. That state means that the last operation failed, but the iterator is nevertheless positioned on some unspecified record. Likely intended uses of that are:
* If some sst files are corrupted, a normal iterator can be used to read the data from files that are not corrupted.
* When using read_tier = kBlockCacheTier, read the data that's in block cache, skipping over the data that is not.
However, this behavior wasn't documented well (and until recently the wiki on github had misleading incorrect information). In the code there's a lot of confusion about the relationship between status() and Valid(), and about whether Seek()/SeekToLast()/etc reset the status or not. There were a number of bugs caused by this confusion, both inside rocksdb and in the code that uses rocksdb (including ours).
This PR changes the convention to:
* If status() is not ok, Valid() always returns false.
* Any seek operation resets status. (Before the PR, it depended on iterator type and on particular error.)
This does sacrifice the two use cases listed above, but siying said it's ok.
Overview of the changes:
* A commit that adds missing status checks in MergingIterator. This fixes a bug that actually affects us, and we need it fixed. `DBIteratorTest.NonBlockingIterationBugRepro` explains the scenario.
* Changes to lots of iterator types to make all of them conform to the new convention. Some bug fixes along the way. By far the biggest changes are in DBIter, which is a big messy piece of code; I tried to make it less big and messy but mostly failed.
* A stress-test for DBIter, to gain some confidence that I didn't break it. It does a few million random operations on the iterator, while occasionally modifying the underlying data (like ForwardIterator does) and occasionally returning non-ok status from internal iterator.
To find the iterator types that needed changes I searched for "public .*Iterator" in the code. Here's an overview of all 27 iterator types:
Iterators that didn't need changes:
* status() is always ok(), or Valid() is always false: MemTableIterator, ModelIter, TestIterator, KVIter (2 classes with this name anonymous namespaces), LoggingForwardVectorIterator, VectorIterator, MockTableIterator, EmptyIterator, EmptyInternalIterator.
* Thin wrappers that always pass through Valid() and status(): ArenaWrappedDBIter, TtlIterator, InternalIteratorFromIterator.
Iterators with changes (see inline comments for details):
* DBIter - an overhaul:
- It used to silently skip corrupted keys (`FindParseableKey()`), which seems dangerous. This PR makes it just stop immediately after encountering a corrupted key, just like it would for other kinds of corruption. Let me know if there was actually some deeper meaning in this behavior and I should put it back.
- It had a few code paths silently discarding subiterator's status. The stress test caught a few.
- The backwards iteration code path was expecting the internal iterator's set of keys to be immutable. It's probably always true in practice at the moment, since ForwardIterator doesn't support backwards iteration, but this PR fixes it anyway. See added DBIteratorTest.ReverseToForwardBug for an example.
- Some parts of backwards iteration code path even did things like `assert(iter_->Valid())` after a seek, which is never a safe assumption.
- It used to not reset status on seek for some types of errors.
- Some simplifications and better comments.
- Some things got more complicated from the added error handling. I'm open to ideas for how to make it nicer.
* MergingIterator - check status after every operation on every subiterator, and in some places assert that valid subiterators have ok status.
* ForwardIterator - changed to the new convention, also slightly simplified.
* ForwardLevelIterator - fixed some bugs and simplified.
* LevelIterator - simplified.
* TwoLevelIterator - changed to the new convention. Also fixed a bug that would make SeekForPrev() sometimes silently ignore errors from first_level_iter_.
* BlockBasedTableIterator - minor changes.
* BlockIter - replaced `SetStatus()` with `Invalidate()` to make sure non-ok BlockIter is always invalid.
* PlainTableIterator - some seeks used to not reset status.
* CuckooTableIterator - tiny code cleanup.
* ManagedIterator - fixed some bugs.
* BaseDeltaIterator - changed to the new convention and fixed a bug.
* BlobDBIterator - seeks used to not reset status.
* KeyConvertingIterator - some small change.
Closes https://github.com/facebook/rocksdb/pull/3810
Differential Revision: D7888019
Pulled By: al13n321
fbshipit-source-id: 4aaf6d3421c545d16722a815b2fa2e7912bc851d
7 years ago
|
|
|
(!file_iter_.Valid() && file_iter_.status().ok() &&
|
|
|
|
file_iter_.iter()->UpperBoundCheckResult() !=
|
|
|
|
IterBoundCheck::kOutOfBound)) {
|
|
|
|
seen_empty_file = true;
|
|
|
|
// Move to next file
|
|
|
|
if (file_index_ >= flevel_->num_files - 1) {
|
|
|
|
// Already at the last file
|
|
|
|
SetFileIterator(nullptr);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (KeyReachedUpperBound(file_smallest_key(file_index_ + 1))) {
|
|
|
|
SetFileIterator(nullptr);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
InitFileIterator(file_index_ + 1);
|
|
|
|
if (file_iter_.iter() != nullptr) {
|
|
|
|
file_iter_.SeekToFirst();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return seen_empty_file;
|
|
|
|
}
|
|
|
|
|
|
|
|
void LevelIterator::SkipEmptyFileBackward() {
|
|
|
|
while (file_iter_.iter() == nullptr ||
|
Change and clarify the relationship between Valid(), status() and Seek*() for all iterators. Also fix some bugs
Summary:
Before this PR, Iterator/InternalIterator may simultaneously have non-ok status() and Valid() = true. That state means that the last operation failed, but the iterator is nevertheless positioned on some unspecified record. Likely intended uses of that are:
* If some sst files are corrupted, a normal iterator can be used to read the data from files that are not corrupted.
* When using read_tier = kBlockCacheTier, read the data that's in block cache, skipping over the data that is not.
However, this behavior wasn't documented well (and until recently the wiki on github had misleading incorrect information). In the code there's a lot of confusion about the relationship between status() and Valid(), and about whether Seek()/SeekToLast()/etc reset the status or not. There were a number of bugs caused by this confusion, both inside rocksdb and in the code that uses rocksdb (including ours).
This PR changes the convention to:
* If status() is not ok, Valid() always returns false.
* Any seek operation resets status. (Before the PR, it depended on iterator type and on particular error.)
This does sacrifice the two use cases listed above, but siying said it's ok.
Overview of the changes:
* A commit that adds missing status checks in MergingIterator. This fixes a bug that actually affects us, and we need it fixed. `DBIteratorTest.NonBlockingIterationBugRepro` explains the scenario.
* Changes to lots of iterator types to make all of them conform to the new convention. Some bug fixes along the way. By far the biggest changes are in DBIter, which is a big messy piece of code; I tried to make it less big and messy but mostly failed.
* A stress-test for DBIter, to gain some confidence that I didn't break it. It does a few million random operations on the iterator, while occasionally modifying the underlying data (like ForwardIterator does) and occasionally returning non-ok status from internal iterator.
To find the iterator types that needed changes I searched for "public .*Iterator" in the code. Here's an overview of all 27 iterator types:
Iterators that didn't need changes:
* status() is always ok(), or Valid() is always false: MemTableIterator, ModelIter, TestIterator, KVIter (2 classes with this name anonymous namespaces), LoggingForwardVectorIterator, VectorIterator, MockTableIterator, EmptyIterator, EmptyInternalIterator.
* Thin wrappers that always pass through Valid() and status(): ArenaWrappedDBIter, TtlIterator, InternalIteratorFromIterator.
Iterators with changes (see inline comments for details):
* DBIter - an overhaul:
- It used to silently skip corrupted keys (`FindParseableKey()`), which seems dangerous. This PR makes it just stop immediately after encountering a corrupted key, just like it would for other kinds of corruption. Let me know if there was actually some deeper meaning in this behavior and I should put it back.
- It had a few code paths silently discarding subiterator's status. The stress test caught a few.
- The backwards iteration code path was expecting the internal iterator's set of keys to be immutable. It's probably always true in practice at the moment, since ForwardIterator doesn't support backwards iteration, but this PR fixes it anyway. See added DBIteratorTest.ReverseToForwardBug for an example.
- Some parts of backwards iteration code path even did things like `assert(iter_->Valid())` after a seek, which is never a safe assumption.
- It used to not reset status on seek for some types of errors.
- Some simplifications and better comments.
- Some things got more complicated from the added error handling. I'm open to ideas for how to make it nicer.
* MergingIterator - check status after every operation on every subiterator, and in some places assert that valid subiterators have ok status.
* ForwardIterator - changed to the new convention, also slightly simplified.
* ForwardLevelIterator - fixed some bugs and simplified.
* LevelIterator - simplified.
* TwoLevelIterator - changed to the new convention. Also fixed a bug that would make SeekForPrev() sometimes silently ignore errors from first_level_iter_.
* BlockBasedTableIterator - minor changes.
* BlockIter - replaced `SetStatus()` with `Invalidate()` to make sure non-ok BlockIter is always invalid.
* PlainTableIterator - some seeks used to not reset status.
* CuckooTableIterator - tiny code cleanup.
* ManagedIterator - fixed some bugs.
* BaseDeltaIterator - changed to the new convention and fixed a bug.
* BlobDBIterator - seeks used to not reset status.
* KeyConvertingIterator - some small change.
Closes https://github.com/facebook/rocksdb/pull/3810
Differential Revision: D7888019
Pulled By: al13n321
fbshipit-source-id: 4aaf6d3421c545d16722a815b2fa2e7912bc851d
7 years ago
|
|
|
(!file_iter_.Valid() && file_iter_.status().ok())) {
|
|
|
|
// Move to previous file
|
|
|
|
if (file_index_ == 0) {
|
|
|
|
// Already the first file
|
|
|
|
SetFileIterator(nullptr);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
InitFileIterator(file_index_ - 1);
|
|
|
|
if (file_iter_.iter() != nullptr) {
|
|
|
|
file_iter_.SeekToLast();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
void LevelIterator::SetFileIterator(InternalIterator* iter) {
|
|
|
|
if (pinned_iters_mgr_ && iter) {
|
|
|
|
iter->SetPinnedItersMgr(pinned_iters_mgr_);
|
|
|
|
}
|
|
|
|
|
|
|
|
InternalIterator* old_iter = file_iter_.Set(iter);
|
|
|
|
if (pinned_iters_mgr_ && pinned_iters_mgr_->PinningEnabled()) {
|
|
|
|
pinned_iters_mgr_->PinIterator(old_iter);
|
|
|
|
} else {
|
|
|
|
delete old_iter;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
void LevelIterator::InitFileIterator(size_t new_file_index) {
|
|
|
|
if (new_file_index >= flevel_->num_files) {
|
|
|
|
file_index_ = new_file_index;
|
|
|
|
SetFileIterator(nullptr);
|
|
|
|
return;
|
|
|
|
} else {
|
|
|
|
// If the file iterator shows incomplete, we try it again if users seek
|
|
|
|
// to the same file, as this time we may go to a different data block
|
|
|
|
// which is cached in block cache.
|
|
|
|
//
|
|
|
|
if (file_iter_.iter() != nullptr && !file_iter_.status().IsIncomplete() &&
|
|
|
|
new_file_index == file_index_) {
|
|
|
|
// file_iter_ is already constructed with this iterator, so
|
|
|
|
// no need to change anything
|
|
|
|
} else {
|
|
|
|
file_index_ = new_file_index;
|
|
|
|
InternalIterator* iter = NewFileIterator();
|
|
|
|
SetFileIterator(iter);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
} // anonymous namespace
|
|
|
|
|
|
|
|
Status Version::GetTableProperties(std::shared_ptr<const TableProperties>* tp,
|
|
|
|
const FileMetaData* file_meta,
|
|
|
|
const std::string* fname) const {
|
|
|
|
auto table_cache = cfd_->table_cache();
|
|
|
|
auto ioptions = cfd_->ioptions();
|
|
|
|
Status s = table_cache->GetTableProperties(
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
|
|
|
file_options_, cfd_->internal_comparator(), file_meta->fd, tp,
|
|
|
|
mutable_cf_options_.prefix_extractor.get(), true /* no io */);
|
|
|
|
if (s.ok()) {
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
|
|
|
|
// We only ignore error type `Incomplete` since it's by design that we
|
|
|
|
// disallow table when it's not in table cache.
|
|
|
|
if (!s.IsIncomplete()) {
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
|
|
|
|
// 2. Table is not present in table cache, we'll read the table properties
|
|
|
|
// directly from the properties block in the file.
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
|
|
|
std::unique_ptr<FSRandomAccessFile> file;
|
|
|
|
std::string file_name;
|
|
|
|
if (fname != nullptr) {
|
|
|
|
file_name = *fname;
|
|
|
|
} else {
|
|
|
|
file_name =
|
|
|
|
TableFileName(ioptions->cf_paths, file_meta->fd.GetNumber(),
|
|
|
|
file_meta->fd.GetPathId());
|
|
|
|
}
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
|
|
|
s = ioptions->fs->NewRandomAccessFile(file_name, file_options_, &file,
|
|
|
|
nullptr);
|
|
|
|
if (!s.ok()) {
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
|
|
|
|
TableProperties* raw_table_properties;
|
|
|
|
// By setting the magic number to kInvalidTableMagicNumber, we can by
|
|
|
|
// pass the magic number check in the footer.
|
Move rate_limiter, write buffering, most perf context instrumentation and most random kill out of Env
Summary: We want to keep Env a think layer for better portability. Less platform dependent codes should be moved out of Env. In this patch, I create a wrapper of file readers and writers, and put rate limiting, write buffering, as well as most perf context instrumentation and random kill out of Env. It will make it easier to maintain multiple Env in the future.
Test Plan: Run all existing unit tests.
Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D42321
10 years ago
|
|
|
std::unique_ptr<RandomAccessFileReader> file_reader(
|
|
|
|
new RandomAccessFileReader(
|
|
|
|
std::move(file), file_name, nullptr /* env */, io_tracer_,
|
|
|
|
nullptr /* stats */, 0 /* hist_type */, nullptr /* file_read_hist */,
|
|
|
|
nullptr /* rate_limiter */, ioptions->listeners));
|
|
|
|
s = ReadTableProperties(
|
Move rate_limiter, write buffering, most perf context instrumentation and most random kill out of Env
Summary: We want to keep Env a think layer for better portability. Less platform dependent codes should be moved out of Env. In this patch, I create a wrapper of file readers and writers, and put rate limiting, write buffering, as well as most perf context instrumentation and random kill out of Env. It will make it easier to maintain multiple Env in the future.
Test Plan: Run all existing unit tests.
Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D42321
10 years ago
|
|
|
file_reader.get(), file_meta->fd.GetFileSize(),
|
|
|
|
Footer::kInvalidTableMagicNumber /* table's magic number */, *ioptions,
|
|
|
|
&raw_table_properties, false /* compression_type_missing */);
|
|
|
|
if (!s.ok()) {
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
RecordTick(ioptions->statistics, NUMBER_DIRECT_LOAD_TABLE_PROPERTIES);
|
|
|
|
|
|
|
|
*tp = std::shared_ptr<const TableProperties>(raw_table_properties);
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
|
|
|
|
Status Version::GetPropertiesOfAllTables(TablePropertiesCollection* props) {
|
|
|
|
Status s;
|
|
|
|
for (int level = 0; level < storage_info_.num_levels_; level++) {
|
|
|
|
s = GetPropertiesOfAllTables(props, level);
|
|
|
|
if (!s.ok()) {
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return Status::OK();
|
|
|
|
}
|
|
|
|
|
|
|
|
Status Version::TablesRangeTombstoneSummary(int max_entries_to_print,
|
|
|
|
std::string* out_str) {
|
|
|
|
if (max_entries_to_print <= 0) {
|
|
|
|
return Status::OK();
|
|
|
|
}
|
|
|
|
int num_entries_left = max_entries_to_print;
|
|
|
|
|
|
|
|
std::stringstream ss;
|
|
|
|
|
|
|
|
for (int level = 0; level < storage_info_.num_levels_; level++) {
|
|
|
|
for (const auto& file_meta : storage_info_.files_[level]) {
|
|
|
|
auto fname =
|
|
|
|
TableFileName(cfd_->ioptions()->cf_paths, file_meta->fd.GetNumber(),
|
|
|
|
file_meta->fd.GetPathId());
|
|
|
|
|
|
|
|
ss << "=== file : " << fname << " ===\n";
|
|
|
|
|
|
|
|
TableCache* table_cache = cfd_->table_cache();
|
|
|
|
std::unique_ptr<FragmentedRangeTombstoneIterator> tombstone_iter;
|
|
|
|
|
|
|
|
Status s = table_cache->GetRangeTombstoneIterator(
|
|
|
|
ReadOptions(), cfd_->internal_comparator(), *file_meta,
|
|
|
|
&tombstone_iter);
|
|
|
|
if (!s.ok()) {
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
if (tombstone_iter) {
|
|
|
|
tombstone_iter->SeekToFirst();
|
|
|
|
|
|
|
|
while (tombstone_iter->Valid() && num_entries_left > 0) {
|
|
|
|
ss << "start: " << tombstone_iter->start_key().ToString(true)
|
|
|
|
<< " end: " << tombstone_iter->end_key().ToString(true)
|
|
|
|
<< " seq: " << tombstone_iter->seq() << '\n';
|
|
|
|
tombstone_iter->Next();
|
|
|
|
num_entries_left--;
|
|
|
|
}
|
|
|
|
if (num_entries_left <= 0) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (num_entries_left <= 0) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
assert(num_entries_left >= 0);
|
|
|
|
if (num_entries_left <= 0) {
|
|
|
|
ss << "(results may not be complete)\n";
|
|
|
|
}
|
|
|
|
|
|
|
|
*out_str = ss.str();
|
|
|
|
return Status::OK();
|
|
|
|
}
|
|
|
|
|
|
|
|
Status Version::GetPropertiesOfAllTables(TablePropertiesCollection* props,
|
|
|
|
int level) {
|
|
|
|
for (const auto& file_meta : storage_info_.files_[level]) {
|
|
|
|
auto fname =
|
|
|
|
TableFileName(cfd_->ioptions()->cf_paths, file_meta->fd.GetNumber(),
|
|
|
|
file_meta->fd.GetPathId());
|
|
|
|
// 1. If the table is already present in table cache, load table
|
|
|
|
// properties from there.
|
|
|
|
std::shared_ptr<const TableProperties> table_properties;
|
|
|
|
Status s = GetTableProperties(&table_properties, file_meta, &fname);
|
|
|
|
if (s.ok()) {
|
|
|
|
props->insert({fname, table_properties});
|
|
|
|
} else {
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return Status::OK();
|
|
|
|
}
|
|
|
|
|
|
|
|
Status Version::GetPropertiesOfTablesInRange(
|
|
|
|
const Range* range, std::size_t n, TablePropertiesCollection* props) const {
|
|
|
|
for (int level = 0; level < storage_info_.num_non_empty_levels(); level++) {
|
|
|
|
for (decltype(n) i = 0; i < n; i++) {
|
|
|
|
// Convert user_key into a corresponding internal key.
|
|
|
|
InternalKey k1(range[i].start, kMaxSequenceNumber, kValueTypeForSeek);
|
|
|
|
InternalKey k2(range[i].limit, kMaxSequenceNumber, kValueTypeForSeek);
|
|
|
|
std::vector<FileMetaData*> files;
|
|
|
|
storage_info_.GetOverlappingInputs(level, &k1, &k2, &files, -1, nullptr,
|
|
|
|
false);
|
|
|
|
for (const auto& file_meta : files) {
|
|
|
|
auto fname =
|
|
|
|
TableFileName(cfd_->ioptions()->cf_paths,
|
|
|
|
file_meta->fd.GetNumber(), file_meta->fd.GetPathId());
|
|
|
|
if (props->count(fname) == 0) {
|
|
|
|
// 1. If the table is already present in table cache, load table
|
|
|
|
// properties from there.
|
|
|
|
std::shared_ptr<const TableProperties> table_properties;
|
|
|
|
Status s = GetTableProperties(&table_properties, file_meta, &fname);
|
|
|
|
if (s.ok()) {
|
|
|
|
props->insert({fname, table_properties});
|
|
|
|
} else {
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return Status::OK();
|
|
|
|
}
|
|
|
|
|
|
|
|
Status Version::GetAggregatedTableProperties(
|
|
|
|
std::shared_ptr<const TableProperties>* tp, int level) {
|
|
|
|
TablePropertiesCollection props;
|
|
|
|
Status s;
|
|
|
|
if (level < 0) {
|
|
|
|
s = GetPropertiesOfAllTables(&props);
|
|
|
|
} else {
|
|
|
|
s = GetPropertiesOfAllTables(&props, level);
|
|
|
|
}
|
|
|
|
if (!s.ok()) {
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
|
|
|
|
auto* new_tp = new TableProperties();
|
|
|
|
for (const auto& item : props) {
|
|
|
|
new_tp->Add(*item.second);
|
|
|
|
}
|
|
|
|
tp->reset(new_tp);
|
|
|
|
return Status::OK();
|
|
|
|
}
|
|
|
|
|
|
|
|
size_t Version::GetMemoryUsageByTableReaders() {
|
|
|
|
size_t total_usage = 0;
|
|
|
|
for (auto& file_level : storage_info_.level_files_brief_) {
|
|
|
|
for (size_t i = 0; i < file_level.num_files; i++) {
|
|
|
|
total_usage += cfd_->table_cache()->GetMemoryUsageByTableReader(
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
|
|
|
file_options_, cfd_->internal_comparator(), file_level.files[i].fd,
|
|
|
|
mutable_cf_options_.prefix_extractor.get());
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return total_usage;
|
|
|
|
}
|
|
|
|
|
CompactFiles, EventListener and GetDatabaseMetaData
Summary:
This diff adds three sets of APIs to RocksDB.
= GetColumnFamilyMetaData =
* This APIs allow users to obtain the current state of a RocksDB instance on one column family.
* See GetColumnFamilyMetaData in include/rocksdb/db.h
= EventListener =
* A virtual class that allows users to implement a set of
call-back functions which will be called when specific
events of a RocksDB instance happens.
* To register EventListener, simply insert an EventListener to ColumnFamilyOptions::listeners
= CompactFiles =
* CompactFiles API inputs a set of file numbers and an output level, and RocksDB
will try to compact those files into the specified level.
= Example =
* Example code can be found in example/compact_files_example.cc, which implements
a simple external compactor using EventListener, GetColumnFamilyMetaData, and
CompactFiles API.
Test Plan:
listener_test
compactor_test
example/compact_files_example
export ROCKSDB_TESTS=CompactFiles
db_test
export ROCKSDB_TESTS=MetaData
db_test
Reviewers: ljin, igor, rven, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D24705
10 years ago
|
|
|
void Version::GetColumnFamilyMetaData(ColumnFamilyMetaData* cf_meta) {
|
|
|
|
assert(cf_meta);
|
|
|
|
assert(cfd_);
|
|
|
|
|
|
|
|
cf_meta->name = cfd_->GetName();
|
|
|
|
cf_meta->size = 0;
|
|
|
|
cf_meta->file_count = 0;
|
|
|
|
cf_meta->levels.clear();
|
|
|
|
|
|
|
|
auto* ioptions = cfd_->ioptions();
|
|
|
|
auto* vstorage = storage_info();
|
|
|
|
|
|
|
|
for (int level = 0; level < cfd_->NumberLevels(); level++) {
|
|
|
|
uint64_t level_size = 0;
|
|
|
|
cf_meta->file_count += vstorage->LevelFiles(level).size();
|
|
|
|
std::vector<SstFileMetaData> files;
|
|
|
|
for (const auto& file : vstorage->LevelFiles(level)) {
|
|
|
|
uint32_t path_id = file->fd.GetPathId();
|
|
|
|
std::string file_path;
|
|
|
|
if (path_id < ioptions->cf_paths.size()) {
|
|
|
|
file_path = ioptions->cf_paths[path_id].path;
|
CompactFiles, EventListener and GetDatabaseMetaData
Summary:
This diff adds three sets of APIs to RocksDB.
= GetColumnFamilyMetaData =
* This APIs allow users to obtain the current state of a RocksDB instance on one column family.
* See GetColumnFamilyMetaData in include/rocksdb/db.h
= EventListener =
* A virtual class that allows users to implement a set of
call-back functions which will be called when specific
events of a RocksDB instance happens.
* To register EventListener, simply insert an EventListener to ColumnFamilyOptions::listeners
= CompactFiles =
* CompactFiles API inputs a set of file numbers and an output level, and RocksDB
will try to compact those files into the specified level.
= Example =
* Example code can be found in example/compact_files_example.cc, which implements
a simple external compactor using EventListener, GetColumnFamilyMetaData, and
CompactFiles API.
Test Plan:
listener_test
compactor_test
example/compact_files_example
export ROCKSDB_TESTS=CompactFiles
db_test
export ROCKSDB_TESTS=MetaData
db_test
Reviewers: ljin, igor, rven, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D24705
10 years ago
|
|
|
} else {
|
|
|
|
assert(!ioptions->cf_paths.empty());
|
|
|
|
file_path = ioptions->cf_paths.back().path;
|
CompactFiles, EventListener and GetDatabaseMetaData
Summary:
This diff adds three sets of APIs to RocksDB.
= GetColumnFamilyMetaData =
* This APIs allow users to obtain the current state of a RocksDB instance on one column family.
* See GetColumnFamilyMetaData in include/rocksdb/db.h
= EventListener =
* A virtual class that allows users to implement a set of
call-back functions which will be called when specific
events of a RocksDB instance happens.
* To register EventListener, simply insert an EventListener to ColumnFamilyOptions::listeners
= CompactFiles =
* CompactFiles API inputs a set of file numbers and an output level, and RocksDB
will try to compact those files into the specified level.
= Example =
* Example code can be found in example/compact_files_example.cc, which implements
a simple external compactor using EventListener, GetColumnFamilyMetaData, and
CompactFiles API.
Test Plan:
listener_test
compactor_test
example/compact_files_example
export ROCKSDB_TESTS=CompactFiles
db_test
export ROCKSDB_TESTS=MetaData
db_test
Reviewers: ljin, igor, rven, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D24705
10 years ago
|
|
|
}
|
|
|
|
const uint64_t file_number = file->fd.GetNumber();
|
|
|
|
files.emplace_back(SstFileMetaData{
|
|
|
|
MakeTableFileName("", file_number), file_number, file_path,
|
|
|
|
static_cast<size_t>(file->fd.GetFileSize()), file->fd.smallest_seqno,
|
|
|
|
file->fd.largest_seqno, file->smallest.user_key().ToString(),
|
CompactFiles, EventListener and GetDatabaseMetaData
Summary:
This diff adds three sets of APIs to RocksDB.
= GetColumnFamilyMetaData =
* This APIs allow users to obtain the current state of a RocksDB instance on one column family.
* See GetColumnFamilyMetaData in include/rocksdb/db.h
= EventListener =
* A virtual class that allows users to implement a set of
call-back functions which will be called when specific
events of a RocksDB instance happens.
* To register EventListener, simply insert an EventListener to ColumnFamilyOptions::listeners
= CompactFiles =
* CompactFiles API inputs a set of file numbers and an output level, and RocksDB
will try to compact those files into the specified level.
= Example =
* Example code can be found in example/compact_files_example.cc, which implements
a simple external compactor using EventListener, GetColumnFamilyMetaData, and
CompactFiles API.
Test Plan:
listener_test
compactor_test
example/compact_files_example
export ROCKSDB_TESTS=CompactFiles
db_test
export ROCKSDB_TESTS=MetaData
db_test
Reviewers: ljin, igor, rven, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D24705
10 years ago
|
|
|
file->largest.user_key().ToString(),
|
|
|
|
file->stats.num_reads_sampled.load(std::memory_order_relaxed),
|
|
|
|
file->being_compacted, file->oldest_blob_file_number,
|
|
|
|
file->TryGetOldestAncesterTime(), file->TryGetFileCreationTime(),
|
|
|
|
file->file_checksum, file->file_checksum_func_name});
|
|
|
|
files.back().num_entries = file->num_entries;
|
|
|
|
files.back().num_deletions = file->num_deletions;
|
CompactFiles, EventListener and GetDatabaseMetaData
Summary:
This diff adds three sets of APIs to RocksDB.
= GetColumnFamilyMetaData =
* This APIs allow users to obtain the current state of a RocksDB instance on one column family.
* See GetColumnFamilyMetaData in include/rocksdb/db.h
= EventListener =
* A virtual class that allows users to implement a set of
call-back functions which will be called when specific
events of a RocksDB instance happens.
* To register EventListener, simply insert an EventListener to ColumnFamilyOptions::listeners
= CompactFiles =
* CompactFiles API inputs a set of file numbers and an output level, and RocksDB
will try to compact those files into the specified level.
= Example =
* Example code can be found in example/compact_files_example.cc, which implements
a simple external compactor using EventListener, GetColumnFamilyMetaData, and
CompactFiles API.
Test Plan:
listener_test
compactor_test
example/compact_files_example
export ROCKSDB_TESTS=CompactFiles
db_test
export ROCKSDB_TESTS=MetaData
db_test
Reviewers: ljin, igor, rven, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D24705
10 years ago
|
|
|
level_size += file->fd.GetFileSize();
|
|
|
|
}
|
|
|
|
cf_meta->levels.emplace_back(
|
|
|
|
level, level_size, std::move(files));
|
|
|
|
cf_meta->size += level_size;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
uint64_t Version::GetSstFilesSize() {
|
|
|
|
uint64_t sst_files_size = 0;
|
|
|
|
for (int level = 0; level < storage_info_.num_levels_; level++) {
|
|
|
|
for (const auto& file_meta : storage_info_.LevelFiles(level)) {
|
|
|
|
sst_files_size += file_meta->fd.GetFileSize();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return sst_files_size;
|
|
|
|
}
|
CompactFiles, EventListener and GetDatabaseMetaData
Summary:
This diff adds three sets of APIs to RocksDB.
= GetColumnFamilyMetaData =
* This APIs allow users to obtain the current state of a RocksDB instance on one column family.
* See GetColumnFamilyMetaData in include/rocksdb/db.h
= EventListener =
* A virtual class that allows users to implement a set of
call-back functions which will be called when specific
events of a RocksDB instance happens.
* To register EventListener, simply insert an EventListener to ColumnFamilyOptions::listeners
= CompactFiles =
* CompactFiles API inputs a set of file numbers and an output level, and RocksDB
will try to compact those files into the specified level.
= Example =
* Example code can be found in example/compact_files_example.cc, which implements
a simple external compactor using EventListener, GetColumnFamilyMetaData, and
CompactFiles API.
Test Plan:
listener_test
compactor_test
example/compact_files_example
export ROCKSDB_TESTS=CompactFiles
db_test
export ROCKSDB_TESTS=MetaData
db_test
Reviewers: ljin, igor, rven, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D24705
10 years ago
|
|
|
|
|
|
|
void Version::GetCreationTimeOfOldestFile(uint64_t* creation_time) {
|
|
|
|
uint64_t oldest_time = port::kMaxUint64;
|
|
|
|
for (int level = 0; level < storage_info_.num_non_empty_levels_; level++) {
|
|
|
|
for (FileMetaData* meta : storage_info_.LevelFiles(level)) {
|
|
|
|
assert(meta->fd.table_reader != nullptr);
|
|
|
|
uint64_t file_creation_time = meta->TryGetFileCreationTime();
|
|
|
|
if (file_creation_time == kUnknownFileCreationTime) {
|
|
|
|
*creation_time = 0;
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
if (file_creation_time < oldest_time) {
|
|
|
|
oldest_time = file_creation_time;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
*creation_time = oldest_time;
|
|
|
|
}
|
|
|
|
|
|
|
|
uint64_t VersionStorageInfo::GetEstimatedActiveKeys() const {
|
avoid returning a number-of-active-keys estimate of nearly 2^64
Summary:
If accumulated_num_non_deletions_ were ever smaller than
accumulated_num_deletions_, the computation of
"accumulated_num_non_deletions_ - accumulated_num_deletions_"
would result in a logically "negative" value, but since
the two operands are unsigned (uint64_t), the result corresponding
to e.g., -1 would 2^64-1.
Instead, return 0 in that case.
Test Plan:
- ensure "make check" still passes
- temporarily add an "abort();" call in the new "if"-block, and
observe that it fails in some test cases. However, note that
this case is triggered only when the two numbers are equal.
Thus, no test case triggers the erroneous behavior this
change is designed to avoid. If anyone can construct a
scenario in which that bug would be triggered, I'll be
happy to add a test case.
Reviewers: ljin, igor, rven, igor.sugak, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D36489
10 years ago
|
|
|
// Estimation will be inaccurate when:
|
|
|
|
// (1) there exist merge keys
|
|
|
|
// (2) keys are directly overwritten
|
|
|
|
// (3) deletion on non-existing keys
|
|
|
|
// (4) low number of samples
|
|
|
|
if (current_num_samples_ == 0) {
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (current_num_non_deletions_ <= current_num_deletions_) {
|
avoid returning a number-of-active-keys estimate of nearly 2^64
Summary:
If accumulated_num_non_deletions_ were ever smaller than
accumulated_num_deletions_, the computation of
"accumulated_num_non_deletions_ - accumulated_num_deletions_"
would result in a logically "negative" value, but since
the two operands are unsigned (uint64_t), the result corresponding
to e.g., -1 would 2^64-1.
Instead, return 0 in that case.
Test Plan:
- ensure "make check" still passes
- temporarily add an "abort();" call in the new "if"-block, and
observe that it fails in some test cases. However, note that
this case is triggered only when the two numbers are equal.
Thus, no test case triggers the erroneous behavior this
change is designed to avoid. If anyone can construct a
scenario in which that bug would be triggered, I'll be
happy to add a test case.
Reviewers: ljin, igor, rven, igor.sugak, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D36489
10 years ago
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
uint64_t est = current_num_non_deletions_ - current_num_deletions_;
|
avoid returning a number-of-active-keys estimate of nearly 2^64
Summary:
If accumulated_num_non_deletions_ were ever smaller than
accumulated_num_deletions_, the computation of
"accumulated_num_non_deletions_ - accumulated_num_deletions_"
would result in a logically "negative" value, but since
the two operands are unsigned (uint64_t), the result corresponding
to e.g., -1 would 2^64-1.
Instead, return 0 in that case.
Test Plan:
- ensure "make check" still passes
- temporarily add an "abort();" call in the new "if"-block, and
observe that it fails in some test cases. However, note that
this case is triggered only when the two numbers are equal.
Thus, no test case triggers the erroneous behavior this
change is designed to avoid. If anyone can construct a
scenario in which that bug would be triggered, I'll be
happy to add a test case.
Reviewers: ljin, igor, rven, igor.sugak, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D36489
10 years ago
|
|
|
|
|
|
|
uint64_t file_count = 0;
|
|
|
|
for (int level = 0; level < num_levels_; ++level) {
|
|
|
|
file_count += files_[level].size();
|
|
|
|
}
|
|
|
|
|
|
|
|
if (current_num_samples_ < file_count) {
|
|
|
|
// casting to avoid overflowing
|
|
|
|
return
|
|
|
|
static_cast<uint64_t>(
|
|
|
|
(est * static_cast<double>(file_count) / current_num_samples_)
|
|
|
|
);
|
|
|
|
} else {
|
avoid returning a number-of-active-keys estimate of nearly 2^64
Summary:
If accumulated_num_non_deletions_ were ever smaller than
accumulated_num_deletions_, the computation of
"accumulated_num_non_deletions_ - accumulated_num_deletions_"
would result in a logically "negative" value, but since
the two operands are unsigned (uint64_t), the result corresponding
to e.g., -1 would 2^64-1.
Instead, return 0 in that case.
Test Plan:
- ensure "make check" still passes
- temporarily add an "abort();" call in the new "if"-block, and
observe that it fails in some test cases. However, note that
this case is triggered only when the two numbers are equal.
Thus, no test case triggers the erroneous behavior this
change is designed to avoid. If anyone can construct a
scenario in which that bug would be triggered, I'll be
happy to add a test case.
Reviewers: ljin, igor, rven, igor.sugak, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D36489
10 years ago
|
|
|
return est;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
double VersionStorageInfo::GetEstimatedCompressionRatioAtLevel(
|
|
|
|
int level) const {
|
|
|
|
assert(level < num_levels_);
|
|
|
|
uint64_t sum_file_size_bytes = 0;
|
|
|
|
uint64_t sum_data_size_bytes = 0;
|
|
|
|
for (auto* file_meta : files_[level]) {
|
|
|
|
sum_file_size_bytes += file_meta->fd.GetFileSize();
|
|
|
|
sum_data_size_bytes += file_meta->raw_key_size + file_meta->raw_value_size;
|
|
|
|
}
|
|
|
|
if (sum_file_size_bytes == 0) {
|
|
|
|
return -1.0;
|
|
|
|
}
|
|
|
|
return static_cast<double>(sum_data_size_bytes) / sum_file_size_bytes;
|
|
|
|
}
|
|
|
|
|
In DB::NewIterator(), try to allocate the whole iterator tree in an arena
Summary:
In this patch, try to allocate the whole iterator tree starting from DBIter from an arena
1. ArenaWrappedDBIter is created when serves as the entry point of an iterator tree, with an arena in it.
2. Add an option to create iterator from arena for following iterators: DBIter, MergingIterator, MemtableIterator, all mem table's iterators, all table reader's iterators and two level iterator.
3. MergeIteratorBuilder is created to incrementally build the tree of internal iterators. It is passed to mem table list and version set and add iterators to it.
Limitations:
(1) Only DB::NewIterator() without tailing uses the arena. Other cases, including readonly DB and compactions are still from malloc
(2) Two level iterator itself is allocated in arena, but not iterators inside it.
Test Plan: make all check
Reviewers: ljin, haobo
Reviewed By: haobo
Subscribers: leveldb, dhruba, yhchiang, igor
Differential Revision: https://reviews.facebook.net/D18513
11 years ago
|
|
|
void Version::AddIterators(const ReadOptions& read_options,
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
|
|
|
const FileOptions& soptions,
|
|
|
|
MergeIteratorBuilder* merge_iter_builder,
|
Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621)
Summary:
Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype.
Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.
It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.
Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621
Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats.
Reviewed By: siying
Differential Revision: D20786930
Pulled By: al13n321
fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee
5 years ago
|
|
|
RangeDelAggregator* range_del_agg,
|
|
|
|
bool allow_unprepared_value) {
|
|
|
|
assert(storage_info_.finalized_);
|
|
|
|
|
|
|
|
for (int level = 0; level < storage_info_.num_non_empty_levels(); level++) {
|
|
|
|
AddIteratorsForLevel(read_options, soptions, merge_iter_builder, level,
|
Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621)
Summary:
Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype.
Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.
It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.
Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621
Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats.
Reviewed By: siying
Differential Revision: D20786930
Pulled By: al13n321
fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee
5 years ago
|
|
|
range_del_agg, allow_unprepared_value);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
void Version::AddIteratorsForLevel(const ReadOptions& read_options,
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
|
|
|
const FileOptions& soptions,
|
|
|
|
MergeIteratorBuilder* merge_iter_builder,
|
|
|
|
int level,
|
Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621)
Summary:
Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype.
Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.
It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.
Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621
Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats.
Reviewed By: siying
Differential Revision: D20786930
Pulled By: al13n321
fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee
5 years ago
|
|
|
RangeDelAggregator* range_del_agg,
|
|
|
|
bool allow_unprepared_value) {
|
|
|
|
assert(storage_info_.finalized_);
|
|
|
|
if (level >= storage_info_.num_non_empty_levels()) {
|
|
|
|
// This is an empty level
|
|
|
|
return;
|
|
|
|
} else if (storage_info_.LevelFilesBrief(level).num_files == 0) {
|
|
|
|
// No files in this level
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
bool should_sample = should_sample_file_read();
|
|
|
|
|
|
|
|
auto* arena = merge_iter_builder->GetArena();
|
|
|
|
if (level == 0) {
|
|
|
|
// Merge all level zero files together since they may overlap
|
|
|
|
for (size_t i = 0; i < storage_info_.LevelFilesBrief(0).num_files; i++) {
|
|
|
|
const auto& file = storage_info_.LevelFilesBrief(0).files[i];
|
|
|
|
merge_iter_builder->AddIterator(cfd_->table_cache()->NewIterator(
|
|
|
|
read_options, soptions, cfd_->internal_comparator(),
|
|
|
|
*file.file_metadata, range_del_agg,
|
|
|
|
mutable_cf_options_.prefix_extractor.get(), nullptr,
|
|
|
|
cfd_->internal_stats()->GetFileReadHist(0),
|
|
|
|
TableReaderCaller::kUserIterator, arena,
|
|
|
|
/*skip_filters=*/false, /*level=*/0, max_file_size_for_l0_meta_pin_,
|
|
|
|
/*smallest_compaction_key=*/nullptr,
|
|
|
|
/*largest_compaction_key=*/nullptr, allow_unprepared_value));
|
In DB::NewIterator(), try to allocate the whole iterator tree in an arena
Summary:
In this patch, try to allocate the whole iterator tree starting from DBIter from an arena
1. ArenaWrappedDBIter is created when serves as the entry point of an iterator tree, with an arena in it.
2. Add an option to create iterator from arena for following iterators: DBIter, MergingIterator, MemtableIterator, all mem table's iterators, all table reader's iterators and two level iterator.
3. MergeIteratorBuilder is created to incrementally build the tree of internal iterators. It is passed to mem table list and version set and add iterators to it.
Limitations:
(1) Only DB::NewIterator() without tailing uses the arena. Other cases, including readonly DB and compactions are still from malloc
(2) Two level iterator itself is allocated in arena, but not iterators inside it.
Test Plan: make all check
Reviewers: ljin, haobo
Reviewed By: haobo
Subscribers: leveldb, dhruba, yhchiang, igor
Differential Revision: https://reviews.facebook.net/D18513
11 years ago
|
|
|
}
|
|
|
|
if (should_sample) {
|
|
|
|
// Count ones for every L0 files. This is done per iterator creation
|
|
|
|
// rather than Seek(), while files in other levels are recored per seek.
|
|
|
|
// If users execute one range query per iterator, there may be some
|
|
|
|
// discrepancy here.
|
|
|
|
for (FileMetaData* meta : storage_info_.LevelFiles(0)) {
|
|
|
|
sample_file_read_inc(meta);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
} else if (storage_info_.LevelFilesBrief(level).num_files > 0) {
|
|
|
|
// For levels > 0, we can use a concatenating iterator that sequentially
|
|
|
|
// walks through the non-overlapping files in the level, opening them
|
|
|
|
// lazily.
|
|
|
|
auto* mem = arena->AllocateAligned(sizeof(LevelIterator));
|
|
|
|
merge_iter_builder->AddIterator(new (mem) LevelIterator(
|
|
|
|
cfd_->table_cache(), read_options, soptions,
|
|
|
|
cfd_->internal_comparator(), &storage_info_.LevelFilesBrief(level),
|
|
|
|
mutable_cf_options_.prefix_extractor.get(), should_sample_file_read(),
|
|
|
|
cfd_->internal_stats()->GetFileReadHist(level),
|
|
|
|
TableReaderCaller::kUserIterator, IsFilterSkipped(level), level,
|
|
|
|
range_del_agg,
|
|
|
|
/*compaction_boundaries=*/nullptr, allow_unprepared_value));
|
In DB::NewIterator(), try to allocate the whole iterator tree in an arena
Summary:
In this patch, try to allocate the whole iterator tree starting from DBIter from an arena
1. ArenaWrappedDBIter is created when serves as the entry point of an iterator tree, with an arena in it.
2. Add an option to create iterator from arena for following iterators: DBIter, MergingIterator, MemtableIterator, all mem table's iterators, all table reader's iterators and two level iterator.
3. MergeIteratorBuilder is created to incrementally build the tree of internal iterators. It is passed to mem table list and version set and add iterators to it.
Limitations:
(1) Only DB::NewIterator() without tailing uses the arena. Other cases, including readonly DB and compactions are still from malloc
(2) Two level iterator itself is allocated in arena, but not iterators inside it.
Test Plan: make all check
Reviewers: ljin, haobo
Reviewed By: haobo
Subscribers: leveldb, dhruba, yhchiang, igor
Differential Revision: https://reviews.facebook.net/D18513
11 years ago
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
Status Version::OverlapWithLevelIterator(const ReadOptions& read_options,
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
|
|
|
const FileOptions& file_options,
|
|
|
|
const Slice& smallest_user_key,
|
|
|
|
const Slice& largest_user_key,
|
|
|
|
int level, bool* overlap) {
|
|
|
|
assert(storage_info_.finalized_);
|
|
|
|
|
|
|
|
auto icmp = cfd_->internal_comparator();
|
|
|
|
auto ucmp = icmp.user_comparator();
|
|
|
|
|
|
|
|
Arena arena;
|
|
|
|
Status status;
|
|
|
|
ReadRangeDelAggregator range_del_agg(&icmp,
|
|
|
|
kMaxSequenceNumber /* upper_bound */);
|
|
|
|
|
|
|
|
*overlap = false;
|
|
|
|
|
|
|
|
if (level == 0) {
|
|
|
|
for (size_t i = 0; i < storage_info_.LevelFilesBrief(0).num_files; i++) {
|
|
|
|
const auto file = &storage_info_.LevelFilesBrief(0).files[i];
|
|
|
|
if (AfterFile(ucmp, &smallest_user_key, file) ||
|
|
|
|
BeforeFile(ucmp, &largest_user_key, file)) {
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
ScopedArenaIterator iter(cfd_->table_cache()->NewIterator(
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
|
|
|
read_options, file_options, cfd_->internal_comparator(),
|
|
|
|
*file->file_metadata, &range_del_agg,
|
|
|
|
mutable_cf_options_.prefix_extractor.get(), nullptr,
|
|
|
|
cfd_->internal_stats()->GetFileReadHist(0),
|
|
|
|
TableReaderCaller::kUserIterator, &arena,
|
|
|
|
/*skip_filters=*/false, /*level=*/0, max_file_size_for_l0_meta_pin_,
|
|
|
|
/*smallest_compaction_key=*/nullptr,
|
Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621)
Summary:
Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype.
Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.
It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.
Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621
Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats.
Reviewed By: siying
Differential Revision: D20786930
Pulled By: al13n321
fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee
5 years ago
|
|
|
/*largest_compaction_key=*/nullptr,
|
|
|
|
/*allow_unprepared_value=*/false));
|
|
|
|
status = OverlapWithIterator(
|
|
|
|
ucmp, smallest_user_key, largest_user_key, iter.get(), overlap);
|
|
|
|
if (!status.ok() || *overlap) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
} else if (storage_info_.LevelFilesBrief(level).num_files > 0) {
|
|
|
|
auto mem = arena.AllocateAligned(sizeof(LevelIterator));
|
|
|
|
ScopedArenaIterator iter(new (mem) LevelIterator(
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
|
|
|
cfd_->table_cache(), read_options, file_options,
|
|
|
|
cfd_->internal_comparator(), &storage_info_.LevelFilesBrief(level),
|
|
|
|
mutable_cf_options_.prefix_extractor.get(), should_sample_file_read(),
|
|
|
|
cfd_->internal_stats()->GetFileReadHist(level),
|
|
|
|
TableReaderCaller::kUserIterator, IsFilterSkipped(level), level,
|
|
|
|
&range_del_agg));
|
|
|
|
status = OverlapWithIterator(
|
|
|
|
ucmp, smallest_user_key, largest_user_key, iter.get(), overlap);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (status.ok() && *overlap == false &&
|
|
|
|
range_del_agg.IsRangeOverlapped(smallest_user_key, largest_user_key)) {
|
|
|
|
*overlap = true;
|
|
|
|
}
|
|
|
|
return status;
|
|
|
|
}
|
|
|
|
|
|
|
|
VersionStorageInfo::VersionStorageInfo(
|
|
|
|
const InternalKeyComparator* internal_comparator,
|
|
|
|
const Comparator* user_comparator, int levels,
|
|
|
|
CompactionStyle compaction_style, VersionStorageInfo* ref_vstorage,
|
|
|
|
bool _force_consistency_checks)
|
|
|
|
: internal_comparator_(internal_comparator),
|
|
|
|
user_comparator_(user_comparator),
|
|
|
|
// cfd is nullptr if Version is dummy
|
|
|
|
num_levels_(levels),
|
|
|
|
num_non_empty_levels_(0),
|
|
|
|
file_indexer_(user_comparator),
|
|
|
|
compaction_style_(compaction_style),
|
|
|
|
files_(new std::vector<FileMetaData*>[num_levels_]),
|
|
|
|
base_level_(num_levels_ == 1 ? -1 : 1),
|
|
|
|
level_multiplier_(0.0),
|
|
|
|
files_by_compaction_pri_(num_levels_),
|
Allowing L0 -> L1 trivial move on sorted data
Summary:
This diff updates the logic of how we do trivial move, now trivial move can run on any number of files in input level as long as they are not overlapping
The conditions for trivial move have been updated
Introduced conditions:
- Trivial move cannot happen if we have a compaction filter (except if the compaction is not manual)
- Input level files cannot be overlapping
Removed conditions:
- Trivial move only run when the compaction is not manual
- Input level should can contain only 1 file
More context on what tests failed because of Trivial move
```
DBTest.CompactionsGenerateMultipleFiles
This test is expecting compaction on a file in L0 to generate multiple files in L1, this test will fail with trivial move because we end up with one file in L1
```
```
DBTest.NoSpaceCompactRange
This test expect compaction to fail when we force environment to report running out of space, of course this is not valid in trivial move situation
because trivial move does not need any extra space, and did not check for that
```
```
DBTest.DropWrites
Similar to DBTest.NoSpaceCompactRange
```
```
DBTest.DeleteObsoleteFilesPendingOutputs
This test expect that a file in L2 is deleted after it's moved to L3, this is not valid with trivial move because although the file was moved it is now used by L3
```
```
CuckooTableDBTest.CompactionIntoMultipleFiles
Same as DBTest.CompactionsGenerateMultipleFiles
```
This diff is based on a work by @sdong https://reviews.facebook.net/D34149
Test Plan: make -j64 check
Reviewers: rven, sdong, igor
Reviewed By: igor
Subscribers: yhchiang, ott, march, dhruba, sdong
Differential Revision: https://reviews.facebook.net/D34797
10 years ago
|
|
|
level0_non_overlapping_(false),
|
|
|
|
next_file_to_compact_by_size_(num_levels_),
|
|
|
|
compaction_score_(num_levels_),
|
|
|
|
compaction_level_(num_levels_),
|
|
|
|
l0_delay_trigger_count_(0),
|
|
|
|
accumulated_file_size_(0),
|
|
|
|
accumulated_raw_key_size_(0),
|
|
|
|
accumulated_raw_value_size_(0),
|
|
|
|
accumulated_num_non_deletions_(0),
|
|
|
|
accumulated_num_deletions_(0),
|
|
|
|
current_num_non_deletions_(0),
|
|
|
|
current_num_deletions_(0),
|
|
|
|
current_num_samples_(0),
|
|
|
|
estimated_compaction_needed_bytes_(0),
|
|
|
|
finalized_(false),
|
|
|
|
force_consistency_checks_(_force_consistency_checks) {
|
|
|
|
if (ref_vstorage != nullptr) {
|
|
|
|
accumulated_file_size_ = ref_vstorage->accumulated_file_size_;
|
|
|
|
accumulated_raw_key_size_ = ref_vstorage->accumulated_raw_key_size_;
|
|
|
|
accumulated_raw_value_size_ = ref_vstorage->accumulated_raw_value_size_;
|
|
|
|
accumulated_num_non_deletions_ =
|
|
|
|
ref_vstorage->accumulated_num_non_deletions_;
|
|
|
|
accumulated_num_deletions_ = ref_vstorage->accumulated_num_deletions_;
|
|
|
|
current_num_non_deletions_ = ref_vstorage->current_num_non_deletions_;
|
|
|
|
current_num_deletions_ = ref_vstorage->current_num_deletions_;
|
|
|
|
current_num_samples_ = ref_vstorage->current_num_samples_;
|
|
|
|
oldest_snapshot_seqnum_ = ref_vstorage->oldest_snapshot_seqnum_;
|
|
|
|
}
|
hints for narrowing down FindFile range and avoiding checking unrelevant L0 files
Summary:
The file tree structure in Version is prebuilt and the range of each file is known.
On the Get() code path, we do binary search in FindFile() by comparing
target key with each file's largest key and also check the range for each L0 file.
With some pre-calculated knowledge, each key comparision that has been done can serve
as a hint to narrow down further searches:
(1) If a key falls within a L0 file's range, we can safely skip the next
file if its range does not overlap with the current one.
(2) If a key falls within a file's range in level L0 - Ln-1, we should only
need to binary search in the next level for files that overlap with the current one.
(1) will be able to skip some files depending one the key distribution.
(2) can greatly reduce the range of binary search, especially for bottom
levels, given that one file most likely only overlaps with N files from
the level below (where N is max_bytes_for_level_multiplier). So on level
L, we will only look at ~N files instead of N^L files.
Some inital results: measured with 500M key DB, when write is light (10k/s = 1.2M/s), this
improves QPS ~7% on top of blocked bloom. When write is heavier (80k/s =
9.6M/s), it gives us ~13% improvement.
Test Plan: make all check
Reviewers: haobo, igor, dhruba, sdong, yhchiang
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17205
11 years ago
|
|
|
}
|
|
|
|
|
|
|
|
Version::Version(ColumnFamilyData* column_family_data, VersionSet* vset,
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
|
|
|
const FileOptions& file_opt,
|
|
|
|
const MutableCFOptions mutable_cf_options,
|
|
|
|
const std::shared_ptr<IOTracer>& io_tracer,
|
|
|
|
uint64_t version_number)
|
|
|
|
: env_(vset->env_),
|
|
|
|
clock_(env_->GetSystemClock()),
|
|
|
|
cfd_(column_family_data),
|
|
|
|
info_log_((cfd_ == nullptr) ? nullptr : cfd_->ioptions()->info_log),
|
|
|
|
db_statistics_((cfd_ == nullptr) ? nullptr
|
|
|
|
: cfd_->ioptions()->statistics),
|
|
|
|
table_cache_((cfd_ == nullptr) ? nullptr : cfd_->table_cache()),
|
|
|
|
blob_file_cache_(cfd_ ? cfd_->blob_file_cache() : nullptr),
|
|
|
|
merge_operator_((cfd_ == nullptr) ? nullptr
|
|
|
|
: cfd_->ioptions()->merge_operator),
|
|
|
|
storage_info_(
|
|
|
|
(cfd_ == nullptr) ? nullptr : &cfd_->internal_comparator(),
|
|
|
|
(cfd_ == nullptr) ? nullptr : cfd_->user_comparator(),
|
|
|
|
cfd_ == nullptr ? 0 : cfd_->NumberLevels(),
|
|
|
|
cfd_ == nullptr ? kCompactionStyleLevel
|
|
|
|
: cfd_->ioptions()->compaction_style,
|
|
|
|
(cfd_ == nullptr || cfd_->current() == nullptr)
|
|
|
|
? nullptr
|
|
|
|
: cfd_->current()->storage_info(),
|
|
|
|
cfd_ == nullptr ? false : cfd_->ioptions()->force_consistency_checks),
|
|
|
|
vset_(vset),
|
|
|
|
next_(this),
|
|
|
|
prev_(this),
|
|
|
|
refs_(0),
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
|
|
|
file_options_(file_opt),
|
|
|
|
mutable_cf_options_(mutable_cf_options),
|
|
|
|
max_file_size_for_l0_meta_pin_(
|
|
|
|
MaxFileSizeForL0MetaPin(mutable_cf_options_)),
|
|
|
|
version_number_(version_number),
|
|
|
|
io_tracer_(io_tracer) {}
|
|
|
|
|
|
|
|
Status Version::GetBlob(const ReadOptions& read_options, const Slice& user_key,
|
|
|
|
const Slice& blob_index_slice,
|
|
|
|
PinnableSlice* value) const {
|
|
|
|
if (read_options.read_tier == kBlockCacheTier) {
|
|
|
|
return Status::Incomplete("Cannot read blob: no disk I/O allowed");
|
|
|
|
}
|
|
|
|
|
|
|
|
BlobIndex blob_index;
|
|
|
|
|
|
|
|
{
|
|
|
|
Status s = blob_index.DecodeFrom(blob_index_slice);
|
|
|
|
if (!s.ok()) {
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
Integrated blob garbage collection: relocate blobs (#7694)
Summary:
The patch adds basic garbage collection support to the integrated BlobDB
implementation. Valid blobs residing in the oldest blob files are relocated
as they are encountered during compaction. The threshold that determines
which blob files qualify is computed based on the configuration option
`blob_garbage_collection_age_cutoff`, which was introduced in https://github.com/facebook/rocksdb/issues/7661 .
Once a blob is retrieved for the purposes of relocation, it passes through the
same logic that extracts large values to blob files in general. This means that
if, for instance, the size threshold for key-value separation (`min_blob_size`)
got changed or writing blob files got disabled altogether, it is possible for the
value to be moved back into the LSM tree. In particular, one way to re-inline
all blob values if needed would be to perform a full manual compaction with
`enable_blob_files` set to `false`, `enable_blob_garbage_collection` set to
`true`, and `blob_file_garbage_collection_age_cutoff` set to `1.0`.
Some TODOs that I plan to address in separate PRs:
1) We'll have to measure the amount of new garbage in each blob file and log
`BlobFileGarbage` entries as part of the compaction job's `VersionEdit`.
(For the time being, blob files are cleaned up solely based on the
`oldest_blob_file_number` relationships.)
2) When compression is used for blobs, the compression type hasn't changed,
and the blob still qualifies for being written to a blob file, we can simply copy
the compressed blob to the new file instead of going through decompression
and compression.
3) We need to update the formula for computing write amplification to account
for the amount of data read from blob files as part of GC.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7694
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D25069663
Pulled By: ltamasi
fbshipit-source-id: bdfa8feb09afcf5bca3b4eba2ba72ce2f15cd06a
4 years ago
|
|
|
return GetBlob(read_options, user_key, blob_index, value);
|
|
|
|
}
|
|
|
|
|
|
|
|
Status Version::GetBlob(const ReadOptions& read_options, const Slice& user_key,
|
|
|
|
const BlobIndex& blob_index,
|
|
|
|
PinnableSlice* value) const {
|
|
|
|
assert(value);
|
|
|
|
|
|
|
|
if (blob_index.HasTTL() || blob_index.IsInlined()) {
|
|
|
|
return Status::Corruption("Unexpected TTL/inlined blob index");
|
|
|
|
}
|
|
|
|
|
|
|
|
const auto& blob_files = storage_info_.GetBlobFiles();
|
|
|
|
|
|
|
|
const uint64_t blob_file_number = blob_index.file_number();
|
|
|
|
|
|
|
|
const auto it = blob_files.find(blob_file_number);
|
|
|
|
if (it == blob_files.end()) {
|
|
|
|
return Status::Corruption("Invalid blob file number");
|
|
|
|
}
|
|
|
|
|
|
|
|
CacheHandleGuard<BlobFileReader> blob_file_reader;
|
|
|
|
|
|
|
|
{
|
|
|
|
assert(blob_file_cache_);
|
|
|
|
const Status s = blob_file_cache_->GetBlobFileReader(blob_file_number,
|
|
|
|
&blob_file_reader);
|
|
|
|
if (!s.ok()) {
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
assert(blob_file_reader.GetValue());
|
|
|
|
const Status s = blob_file_reader.GetValue()->GetBlob(
|
|
|
|
read_options, user_key, blob_index.offset(), blob_index.size(),
|
|
|
|
blob_index.compression(), value);
|
|
|
|
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
|
Use SST files for Transaction conflict detection
Summary:
Currently, transactions can fail even if there is no actual write conflict. This is due to relying on only the memtables to check for write-conflicts. Users have to tune memtable settings to try to avoid this, but it's hard to figure out exactly how to tune these settings.
With this diff, TransactionDB will use both memtables and SST files to determine if there are any write conflicts. This relies on the fact that BlockBasedTable stores sequence numbers for all writes that happen after any open snapshot. Also, D50295 is needed to prevent SingleDelete from disappearing writes (the TODOs in this test code will be fixed once the other diff is approved and merged).
Note that Optimistic transactions will still rely on tuning memtable settings as we do not want to read from SST while on the write thread. Also, memtable settings can still be used to reduce how often TransactionDB needs to read SST files.
Test Plan: unit tests, db bench
Reviewers: rven, yhchiang, kradhakrishnan, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb, yoshinorim
Differential Revision: https://reviews.facebook.net/D50475
9 years ago
|
|
|
void Version::Get(const ReadOptions& read_options, const LookupKey& k,
|
|
|
|
PinnableSlice* value, std::string* timestamp, Status* status,
|
|
|
|
MergeContext* merge_context,
|
Use only "local" range tombstones during Get (#4449)
Summary:
Previously, range tombstones were accumulated from every level, which
was necessary if a range tombstone in a higher level covered a key in a lower
level. However, RangeDelAggregator::AddTombstones's complexity is based on
the number of tombstones that are currently stored in it, which is wasteful in
the Get case, where we only need to know the highest sequence number of range
tombstones that cover the key from higher levels, and compute the highest covering
sequence number at the current level. This change introduces this optimization, and
removes the use of RangeDelAggregator from the Get path.
In the benchmark results, the following command was used to initialize the database:
```
./db_bench -db=/dev/shm/5k-rts -use_existing_db=false -benchmarks=filluniquerandom -write_buffer_size=1048576 -compression_type=lz4 -target_file_size_base=1048576 -max_bytes_for_level_base=4194304 -value_size=112 -key_size=16 -block_size=4096 -level_compaction_dynamic_level_bytes=true -num=5000000 -max_background_jobs=12 -benchmark_write_rate_limit=20971520 -range_tombstone_width=100 -writes_per_range_tombstone=100 -max_num_range_tombstones=50000 -bloom_bits=8
```
...and the following command was used to measure read throughput:
```
./db_bench -db=/dev/shm/5k-rts/ -use_existing_db=true -benchmarks=readrandom -disable_auto_compactions=true -num=5000000 -reads=100000 -threads=32
```
The filluniquerandom command was only run once, and the resulting database was used
to measure read performance before and after the PR. Both binaries were compiled with
`DEBUG_LEVEL=0`.
Readrandom results before PR:
```
readrandom : 4.544 micros/op 220090 ops/sec; 16.9 MB/s (63103 of 100000 found)
```
Readrandom results after PR:
```
readrandom : 11.147 micros/op 89707 ops/sec; 6.9 MB/s (63103 of 100000 found)
```
So it's actually slower right now, but this PR paves the way for future optimizations (see #4493).
----
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4449
Differential Revision: D10370575
Pulled By: abhimadan
fbshipit-source-id: 9a2e152be1ef36969055c0e9eb4beb0d96c11f4d
6 years ago
|
|
|
SequenceNumber* max_covering_tombstone_seq, bool* value_found,
|
|
|
|
bool* key_exists, SequenceNumber* seq, ReadCallback* callback,
|
New API to get all merge operands for a Key (#5604)
Summary:
This is a new API added to db.h to allow for fetching all merge operands associated with a Key. The main motivation for this API is to support use cases where doing a full online merge is not necessary as it is performance sensitive. Example use-cases:
1. Update subset of columns and read subset of columns -
Imagine a SQL Table, a row is encoded as a K/V pair (as it is done in MyRocks). If there are many columns and users only updated one of them, we can use merge operator to reduce write amplification. While users only read one or two columns in the read query, this feature can avoid a full merging of the whole row, and save some CPU.
2. Updating very few attributes in a value which is a JSON-like document -
Updating one attribute can be done efficiently using merge operator, while reading back one attribute can be done more efficiently if we don't need to do a full merge.
----------------------------------------------------------------------------------------------------
API :
Status GetMergeOperands(
const ReadOptions& options, ColumnFamilyHandle* column_family,
const Slice& key, PinnableSlice* merge_operands,
GetMergeOperandsOptions* get_merge_operands_options,
int* number_of_operands)
Example usage :
int size = 100;
int number_of_operands = 0;
std::vector<PinnableSlice> values(size);
GetMergeOperandsOptions merge_operands_info;
db_->GetMergeOperands(ReadOptions(), db_->DefaultColumnFamily(), "k1", values.data(), merge_operands_info, &number_of_operands);
Description :
Returns all the merge operands corresponding to the key. If the number of merge operands in DB is greater than merge_operands_options.expected_max_number_of_operands no merge operands are returned and status is Incomplete. Merge operands returned are in the order of insertion.
merge_operands-> Points to an array of at-least merge_operands_options.expected_max_number_of_operands and the caller is responsible for allocating it. If the status returned is Incomplete then number_of_operands will contain the total number of merge operands found in DB for key.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5604
Test Plan:
Added unit test and perf test in db_bench that can be run using the command:
./db_bench -benchmarks=getmergeoperands --merge_operator=sortlist
Differential Revision: D16657366
Pulled By: vjnadimpalli
fbshipit-source-id: 0faadd752351745224ee12d4ae9ef3cb529951bf
5 years ago
|
|
|
bool* is_blob, bool do_merge) {
|
|
|
|
Slice ikey = k.internal_key();
|
|
|
|
Slice user_key = k.user_key();
|
|
|
|
|
|
|
|
assert(status->ok() || status->IsMergeInProgress());
|
|
|
|
|
Use SST files for Transaction conflict detection
Summary:
Currently, transactions can fail even if there is no actual write conflict. This is due to relying on only the memtables to check for write-conflicts. Users have to tune memtable settings to try to avoid this, but it's hard to figure out exactly how to tune these settings.
With this diff, TransactionDB will use both memtables and SST files to determine if there are any write conflicts. This relies on the fact that BlockBasedTable stores sequence numbers for all writes that happen after any open snapshot. Also, D50295 is needed to prevent SingleDelete from disappearing writes (the TODOs in this test code will be fixed once the other diff is approved and merged).
Note that Optimistic transactions will still rely on tuning memtable settings as we do not want to read from SST while on the write thread. Also, memtable settings can still be used to reduce how often TransactionDB needs to read SST files.
Test Plan: unit tests, db bench
Reviewers: rven, yhchiang, kradhakrishnan, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb, yoshinorim
Differential Revision: https://reviews.facebook.net/D50475
9 years ago
|
|
|
if (key_exists != nullptr) {
|
|
|
|
// will falsify below if not found
|
|
|
|
*key_exists = true;
|
|
|
|
}
|
|
|
|
|
Introduce FullMergeV2 (eliminate memcpy from merge operators)
Summary:
This diff update the code to pin the merge operator operands while the merge operation is done, so that we can eliminate the memcpy cost, to do that we need a new public API for FullMerge that replace the std::deque<std::string> with std::vector<Slice>
This diff is stacked on top of D56493 and D56511
In this diff we
- Update FullMergeV2 arguments to be encapsulated in MergeOperationInput and MergeOperationOutput which will make it easier to add new arguments in the future
- Replace std::deque<std::string> with std::vector<Slice> to pass operands
- Replace MergeContext std::deque with std::vector (based on a simple benchmark I ran https://gist.github.com/IslamAbdelRahman/78fc86c9ab9f52b1df791e58943fb187)
- Allow FullMergeV2 output to be an existing operand
```
[Everything in Memtable | 10K operands | 10 KB each | 1 operand per key]
DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=10000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000
[FullMergeV2]
readseq : 0.607 micros/op 1648235 ops/sec; 16121.2 MB/s
readseq : 0.478 micros/op 2091546 ops/sec; 20457.2 MB/s
readseq : 0.252 micros/op 3972081 ops/sec; 38850.5 MB/s
readseq : 0.237 micros/op 4218328 ops/sec; 41259.0 MB/s
readseq : 0.247 micros/op 4043927 ops/sec; 39553.2 MB/s
[master]
readseq : 3.935 micros/op 254140 ops/sec; 2485.7 MB/s
readseq : 3.722 micros/op 268657 ops/sec; 2627.7 MB/s
readseq : 3.149 micros/op 317605 ops/sec; 3106.5 MB/s
readseq : 3.125 micros/op 320024 ops/sec; 3130.1 MB/s
readseq : 4.075 micros/op 245374 ops/sec; 2400.0 MB/s
```
```
[Everything in Memtable | 10K operands | 10 KB each | 10 operand per key]
DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=1000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000
[FullMergeV2]
readseq : 3.472 micros/op 288018 ops/sec; 2817.1 MB/s
readseq : 2.304 micros/op 434027 ops/sec; 4245.2 MB/s
readseq : 1.163 micros/op 859845 ops/sec; 8410.0 MB/s
readseq : 1.192 micros/op 838926 ops/sec; 8205.4 MB/s
readseq : 1.250 micros/op 800000 ops/sec; 7824.7 MB/s
[master]
readseq : 24.025 micros/op 41623 ops/sec; 407.1 MB/s
readseq : 18.489 micros/op 54086 ops/sec; 529.0 MB/s
readseq : 18.693 micros/op 53495 ops/sec; 523.2 MB/s
readseq : 23.621 micros/op 42335 ops/sec; 414.1 MB/s
readseq : 18.775 micros/op 53262 ops/sec; 521.0 MB/s
```
```
[Everything in Block cache | 10K operands | 10 KB each | 1 operand per key]
[FullMergeV2]
$ DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions
readseq : 14.741 micros/op 67837 ops/sec; 663.5 MB/s
readseq : 1.029 micros/op 971446 ops/sec; 9501.6 MB/s
readseq : 0.974 micros/op 1026229 ops/sec; 10037.4 MB/s
readseq : 0.965 micros/op 1036080 ops/sec; 10133.8 MB/s
readseq : 0.943 micros/op 1060657 ops/sec; 10374.2 MB/s
[master]
readseq : 16.735 micros/op 59755 ops/sec; 584.5 MB/s
readseq : 3.029 micros/op 330151 ops/sec; 3229.2 MB/s
readseq : 3.136 micros/op 318883 ops/sec; 3119.0 MB/s
readseq : 3.065 micros/op 326245 ops/sec; 3191.0 MB/s
readseq : 3.014 micros/op 331813 ops/sec; 3245.4 MB/s
```
```
[Everything in Block cache | 10K operands | 10 KB each | 10 operand per key]
DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10-operands-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions
[FullMergeV2]
readseq : 24.325 micros/op 41109 ops/sec; 402.1 MB/s
readseq : 1.470 micros/op 680272 ops/sec; 6653.7 MB/s
readseq : 1.231 micros/op 812347 ops/sec; 7945.5 MB/s
readseq : 1.091 micros/op 916590 ops/sec; 8965.1 MB/s
readseq : 1.109 micros/op 901713 ops/sec; 8819.6 MB/s
[master]
readseq : 27.257 micros/op 36687 ops/sec; 358.8 MB/s
readseq : 4.443 micros/op 225073 ops/sec; 2201.4 MB/s
readseq : 5.830 micros/op 171526 ops/sec; 1677.7 MB/s
readseq : 4.173 micros/op 239635 ops/sec; 2343.8 MB/s
readseq : 4.150 micros/op 240963 ops/sec; 2356.8 MB/s
```
Test Plan: COMPILE_WITH_ASAN=1 make check -j64
Reviewers: yhchiang, andrewkr, sdong
Reviewed By: sdong
Subscribers: lovro, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D57075
9 years ago
|
|
|
PinnedIteratorsManager pinned_iters_mgr;
|
|
|
|
uint64_t tracing_get_id = BlockCacheTraceHelper::kReservedGetId;
|
|
|
|
if (vset_ && vset_->block_cache_tracer_ &&
|
|
|
|
vset_->block_cache_tracer_->is_tracing_enabled()) {
|
|
|
|
tracing_get_id = vset_->block_cache_tracer_->NextGetId();
|
|
|
|
}
|
|
|
|
|
|
|
|
// Note: the old StackableDB-based BlobDB passes in
|
|
|
|
// GetImplOptions::is_blob_index; for the integrated BlobDB implementation, we
|
|
|
|
// need to provide it here.
|
|
|
|
bool is_blob_index = false;
|
|
|
|
bool* const is_blob_to_use = is_blob ? is_blob : &is_blob_index;
|
|
|
|
|
|
|
|
GetContext get_context(
|
|
|
|
user_comparator(), merge_operator_, info_log_, db_statistics_,
|
|
|
|
status->ok() ? GetContext::kNotFound : GetContext::kMerge, user_key,
|
|
|
|
do_merge ? value : nullptr, do_merge ? timestamp : nullptr, value_found,
|
|
|
|
merge_context, do_merge, max_covering_tombstone_seq, clock_, seq,
|
|
|
|
merge_operator_ ? &pinned_iters_mgr : nullptr, callback, is_blob_to_use,
|
|
|
|
tracing_get_id);
|
Introduce FullMergeV2 (eliminate memcpy from merge operators)
Summary:
This diff update the code to pin the merge operator operands while the merge operation is done, so that we can eliminate the memcpy cost, to do that we need a new public API for FullMerge that replace the std::deque<std::string> with std::vector<Slice>
This diff is stacked on top of D56493 and D56511
In this diff we
- Update FullMergeV2 arguments to be encapsulated in MergeOperationInput and MergeOperationOutput which will make it easier to add new arguments in the future
- Replace std::deque<std::string> with std::vector<Slice> to pass operands
- Replace MergeContext std::deque with std::vector (based on a simple benchmark I ran https://gist.github.com/IslamAbdelRahman/78fc86c9ab9f52b1df791e58943fb187)
- Allow FullMergeV2 output to be an existing operand
```
[Everything in Memtable | 10K operands | 10 KB each | 1 operand per key]
DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=10000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000
[FullMergeV2]
readseq : 0.607 micros/op 1648235 ops/sec; 16121.2 MB/s
readseq : 0.478 micros/op 2091546 ops/sec; 20457.2 MB/s
readseq : 0.252 micros/op 3972081 ops/sec; 38850.5 MB/s
readseq : 0.237 micros/op 4218328 ops/sec; 41259.0 MB/s
readseq : 0.247 micros/op 4043927 ops/sec; 39553.2 MB/s
[master]
readseq : 3.935 micros/op 254140 ops/sec; 2485.7 MB/s
readseq : 3.722 micros/op 268657 ops/sec; 2627.7 MB/s
readseq : 3.149 micros/op 317605 ops/sec; 3106.5 MB/s
readseq : 3.125 micros/op 320024 ops/sec; 3130.1 MB/s
readseq : 4.075 micros/op 245374 ops/sec; 2400.0 MB/s
```
```
[Everything in Memtable | 10K operands | 10 KB each | 10 operand per key]
DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="mergerandom,readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --merge_keys=1000 --num=10000 --disable_auto_compactions --value_size=10240 --write_buffer_size=1000000000
[FullMergeV2]
readseq : 3.472 micros/op 288018 ops/sec; 2817.1 MB/s
readseq : 2.304 micros/op 434027 ops/sec; 4245.2 MB/s
readseq : 1.163 micros/op 859845 ops/sec; 8410.0 MB/s
readseq : 1.192 micros/op 838926 ops/sec; 8205.4 MB/s
readseq : 1.250 micros/op 800000 ops/sec; 7824.7 MB/s
[master]
readseq : 24.025 micros/op 41623 ops/sec; 407.1 MB/s
readseq : 18.489 micros/op 54086 ops/sec; 529.0 MB/s
readseq : 18.693 micros/op 53495 ops/sec; 523.2 MB/s
readseq : 23.621 micros/op 42335 ops/sec; 414.1 MB/s
readseq : 18.775 micros/op 53262 ops/sec; 521.0 MB/s
```
```
[Everything in Block cache | 10K operands | 10 KB each | 1 operand per key]
[FullMergeV2]
$ DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions
readseq : 14.741 micros/op 67837 ops/sec; 663.5 MB/s
readseq : 1.029 micros/op 971446 ops/sec; 9501.6 MB/s
readseq : 0.974 micros/op 1026229 ops/sec; 10037.4 MB/s
readseq : 0.965 micros/op 1036080 ops/sec; 10133.8 MB/s
readseq : 0.943 micros/op 1060657 ops/sec; 10374.2 MB/s
[master]
readseq : 16.735 micros/op 59755 ops/sec; 584.5 MB/s
readseq : 3.029 micros/op 330151 ops/sec; 3229.2 MB/s
readseq : 3.136 micros/op 318883 ops/sec; 3119.0 MB/s
readseq : 3.065 micros/op 326245 ops/sec; 3191.0 MB/s
readseq : 3.014 micros/op 331813 ops/sec; 3245.4 MB/s
```
```
[Everything in Block cache | 10K operands | 10 KB each | 10 operand per key]
DEBUG_LEVEL=0 make db_bench -j64 && ./db_bench --benchmarks="readseq,readseq,readseq,readseq,readseq" --merge_operator="max" --num=100000 --db="/dev/shm/merge-random-10-operands-10K-10KB" --cache_size=1000000000 --use_existing_db --disable_auto_compactions
[FullMergeV2]
readseq : 24.325 micros/op 41109 ops/sec; 402.1 MB/s
readseq : 1.470 micros/op 680272 ops/sec; 6653.7 MB/s
readseq : 1.231 micros/op 812347 ops/sec; 7945.5 MB/s
readseq : 1.091 micros/op 916590 ops/sec; 8965.1 MB/s
readseq : 1.109 micros/op 901713 ops/sec; 8819.6 MB/s
[master]
readseq : 27.257 micros/op 36687 ops/sec; 358.8 MB/s
readseq : 4.443 micros/op 225073 ops/sec; 2201.4 MB/s
readseq : 5.830 micros/op 171526 ops/sec; 1677.7 MB/s
readseq : 4.173 micros/op 239635 ops/sec; 2343.8 MB/s
readseq : 4.150 micros/op 240963 ops/sec; 2356.8 MB/s
```
Test Plan: COMPILE_WITH_ASAN=1 make check -j64
Reviewers: yhchiang, andrewkr, sdong
Reviewed By: sdong
Subscribers: lovro, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D57075
9 years ago
|
|
|
|
|
|
|
// Pin blocks that we read to hold merge operands
|
|
|
|
if (merge_operator_) {
|
|
|
|
pinned_iters_mgr.StartPinning();
|
|
|
|
}
|
|
|
|
|
|
|
|
FilePicker fp(
|
|
|
|
storage_info_.files_, user_key, ikey, &storage_info_.level_files_brief_,
|
|
|
|
storage_info_.num_non_empty_levels_, &storage_info_.file_indexer_,
|
|
|
|
user_comparator(), internal_comparator());
|
|
|
|
FdWithKeyRange* f = fp.GetNextFile();
|
|
|
|
|
|
|
|
while (f != nullptr) {
|
Cache fragmented range tombstones in BlockBasedTableReader (#4493)
Summary:
This allows tombstone fragmenting to only be performed when the table is opened, and cached for subsequent accesses.
On the same DB used in #4449, running `readrandom` results in the following:
```
readrandom : 0.983 micros/op 1017076 ops/sec; 78.3 MB/s (63103 of 100000 found)
```
Now that Get performance in the presence of range tombstones is reasonable, I also compared the performance between a DB with range tombstones, "expanded" range tombstones (several point tombstones that cover the same keys the equivalent range tombstone would cover, a common workaround for DeleteRange), and no range tombstones. The created DBs had 5 million keys each, and DeleteRange was called at regular intervals (depending on the total number of range tombstones being written) after 4.5 million Puts. The table below summarizes the results of a `readwhilewriting` benchmark (in order to provide somewhat more realistic results):
```
Tombstones? | avg micros/op | stddev micros/op | avg ops/s | stddev ops/s
----------------- | ------------- | ---------------- | ------------ | ------------
None | 0.6186 | 0.04637 | 1,625,252.90 | 124,679.41
500 Expanded | 0.6019 | 0.03628 | 1,666,670.40 | 101,142.65
500 Unexpanded | 0.6435 | 0.03994 | 1,559,979.40 | 104,090.52
1k Expanded | 0.6034 | 0.04349 | 1,665,128.10 | 125,144.57
1k Unexpanded | 0.6261 | 0.03093 | 1,600,457.50 | 79,024.94
5k Expanded | 0.6163 | 0.05926 | 1,636,668.80 | 154,888.85
5k Unexpanded | 0.6402 | 0.04002 | 1,567,804.70 | 100,965.55
10k Expanded | 0.6036 | 0.05105 | 1,667,237.70 | 142,830.36
10k Unexpanded | 0.6128 | 0.02598 | 1,634,633.40 | 72,161.82
25k Expanded | 0.6198 | 0.04542 | 1,620,980.50 | 116,662.93
25k Unexpanded | 0.5478 | 0.0362 | 1,833,059.10 | 121,233.81
50k Expanded | 0.5104 | 0.04347 | 1,973,107.90 | 184,073.49
50k Unexpanded | 0.4528 | 0.03387 | 2,219,034.50 | 170,984.32
```
After a large enough quantity of range tombstones are written, range tombstone Gets can become faster than reading from an equivalent DB with several point tombstones.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4493
Differential Revision: D10842844
Pulled By: abhimadan
fbshipit-source-id: a7d44534f8120e6aabb65779d26c6b9df954c509
6 years ago
|
|
|
if (*max_covering_tombstone_seq > 0) {
|
|
|
|
// The remaining files we look at will only contain covered keys, so we
|
|
|
|
// stop here.
|
|
|
|
break;
|
Cache fragmented range tombstones in BlockBasedTableReader (#4493)
Summary:
This allows tombstone fragmenting to only be performed when the table is opened, and cached for subsequent accesses.
On the same DB used in #4449, running `readrandom` results in the following:
```
readrandom : 0.983 micros/op 1017076 ops/sec; 78.3 MB/s (63103 of 100000 found)
```
Now that Get performance in the presence of range tombstones is reasonable, I also compared the performance between a DB with range tombstones, "expanded" range tombstones (several point tombstones that cover the same keys the equivalent range tombstone would cover, a common workaround for DeleteRange), and no range tombstones. The created DBs had 5 million keys each, and DeleteRange was called at regular intervals (depending on the total number of range tombstones being written) after 4.5 million Puts. The table below summarizes the results of a `readwhilewriting` benchmark (in order to provide somewhat more realistic results):
```
Tombstones? | avg micros/op | stddev micros/op | avg ops/s | stddev ops/s
----------------- | ------------- | ---------------- | ------------ | ------------
None | 0.6186 | 0.04637 | 1,625,252.90 | 124,679.41
500 Expanded | 0.6019 | 0.03628 | 1,666,670.40 | 101,142.65
500 Unexpanded | 0.6435 | 0.03994 | 1,559,979.40 | 104,090.52
1k Expanded | 0.6034 | 0.04349 | 1,665,128.10 | 125,144.57
1k Unexpanded | 0.6261 | 0.03093 | 1,600,457.50 | 79,024.94
5k Expanded | 0.6163 | 0.05926 | 1,636,668.80 | 154,888.85
5k Unexpanded | 0.6402 | 0.04002 | 1,567,804.70 | 100,965.55
10k Expanded | 0.6036 | 0.05105 | 1,667,237.70 | 142,830.36
10k Unexpanded | 0.6128 | 0.02598 | 1,634,633.40 | 72,161.82
25k Expanded | 0.6198 | 0.04542 | 1,620,980.50 | 116,662.93
25k Unexpanded | 0.5478 | 0.0362 | 1,833,059.10 | 121,233.81
50k Expanded | 0.5104 | 0.04347 | 1,973,107.90 | 184,073.49
50k Unexpanded | 0.4528 | 0.03387 | 2,219,034.50 | 170,984.32
```
After a large enough quantity of range tombstones are written, range tombstone Gets can become faster than reading from an equivalent DB with several point tombstones.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4493
Differential Revision: D10842844
Pulled By: abhimadan
fbshipit-source-id: a7d44534f8120e6aabb65779d26c6b9df954c509
6 years ago
|
|
|
}
|
|
|
|
if (get_context.sample()) {
|
|
|
|
sample_file_read_inc(f->file_metadata);
|
|
|
|
}
|
|
|
|
|
|
|
|
bool timer_enabled =
|
|
|
|
GetPerfLevel() >= PerfLevel::kEnableTimeExceptForMutex &&
|
|
|
|
get_perf_context()->per_level_perf_context_enabled;
|
|
|
|
StopWatchNano timer(clock_, timer_enabled /* auto_start */);
|
Measure file read latency histogram per level
Summary: In internal stats, remember read latency histogram, if statistics is enabled. It can be retrieved from DB::GetProperty() with "rocksdb.dbstats" property, if it is enabled.
Test Plan: Manually run db_bench and prints out "rocksdb.dbstats" by hand and make sure it prints out as expected
Reviewers: igor, IslamAbdelRahman, rven, kradhakrishnan, anthony, yhchiang
Reviewed By: yhchiang
Subscribers: MarkCallaghan, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D44193
9 years ago
|
|
|
*status = table_cache_->Get(
|
|
|
|
read_options, *internal_comparator(), *f->file_metadata, ikey,
|
|
|
|
&get_context, mutable_cf_options_.prefix_extractor.get(),
|
Skip bottom-level filter block caching when hit-optimized
Summary:
When Get() or NewIterator() trigger file loads, skip caching the filter block if
(1) optimize_filters_for_hits is set and (2) the file is on the bottommost
level. Also skip checking filters under the same conditions, which means that
for a preloaded file or a file that was trivially-moved to the bottom level, its
filter block will eventually expire from the cache.
- added parameters/instance variables in various places in order to propagate the config ("skip_filters") from version_set to block_based_table_reader
- in BlockBasedTable::Rep, this optimization prevents filter from being loaded when the file is opened simply by setting filter_policy = nullptr
- in BlockBasedTable::Get/BlockBasedTable::NewIterator, this optimization prevents filter from being used (even if it was loaded already) by setting filter = nullptr
Test Plan:
updated unit test:
$ ./db_test --gtest_filter=DBTest.OptimizeFiltersForHits
will also run 'make check'
Reviewers: sdong, igor, paultuckfield, anthony, rven, kradhakrishnan, IslamAbdelRahman, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D51633
9 years ago
|
|
|
cfd_->internal_stats()->GetFileReadHist(fp.GetHitFileLevel()),
|
|
|
|
IsFilterSkipped(static_cast<int>(fp.GetHitFileLevel()),
|
Adding pin_l0_filter_and_index_blocks_in_cache feature and related fixes.
Summary:
When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache.
What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention.
Test Plan:
'export TEST_TMPDIR=/dev/shm/ && DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32' is OK.
I didn't run the Java tests, I don't have Java set up on my devserver.
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56133
9 years ago
|
|
|
fp.IsHitFileLastInLevel()),
|
|
|
|
fp.GetHitFileLevel(), max_file_size_for_l0_meta_pin_);
|
|
|
|
// TODO: examine the behavior for corrupted key
|
|
|
|
if (timer_enabled) {
|
|
|
|
PERF_COUNTER_BY_LEVEL_ADD(get_from_table_nanos, timer.ElapsedNanos(),
|
|
|
|
fp.GetHitFileLevel());
|
|
|
|
}
|
|
|
|
if (!status->ok()) {
|
|
|
|
return;
|
hints for narrowing down FindFile range and avoiding checking unrelevant L0 files
Summary:
The file tree structure in Version is prebuilt and the range of each file is known.
On the Get() code path, we do binary search in FindFile() by comparing
target key with each file's largest key and also check the range for each L0 file.
With some pre-calculated knowledge, each key comparision that has been done can serve
as a hint to narrow down further searches:
(1) If a key falls within a L0 file's range, we can safely skip the next
file if its range does not overlap with the current one.
(2) If a key falls within a file's range in level L0 - Ln-1, we should only
need to binary search in the next level for files that overlap with the current one.
(1) will be able to skip some files depending one the key distribution.
(2) can greatly reduce the range of binary search, especially for bottom
levels, given that one file most likely only overlaps with N files from
the level below (where N is max_bytes_for_level_multiplier). So on level
L, we will only look at ~N files instead of N^L files.
Some inital results: measured with 500M key DB, when write is light (10k/s = 1.2M/s), this
improves QPS ~7% on top of blocked bloom. When write is heavier (80k/s =
9.6M/s), it gives us ~13% improvement.
Test Plan: make all check
Reviewers: haobo, igor, dhruba, sdong, yhchiang
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17205
11 years ago
|
|
|
}
|
|
|
|
|
|
|
|
// report the counters before returning
|
|
|
|
if (get_context.State() != GetContext::kNotFound &&
|
|
|
|
get_context.State() != GetContext::kMerge &&
|
|
|
|
db_statistics_ != nullptr) {
|
|
|
|
get_context.ReportCounters();
|
|
|
|
}
|
|
|
|
switch (get_context.State()) {
|
|
|
|
case GetContext::kNotFound:
|
|
|
|
// Keep searching in other files
|
|
|
|
break;
|
|
|
|
case GetContext::kMerge:
|
|
|
|
// TODO: update per-level perfcontext user_key_return_count for kMerge
|
|
|
|
break;
|
|
|
|
case GetContext::kFound:
|
|
|
|
if (fp.GetHitFileLevel() == 0) {
|
|
|
|
RecordTick(db_statistics_, GET_HIT_L0);
|
|
|
|
} else if (fp.GetHitFileLevel() == 1) {
|
|
|
|
RecordTick(db_statistics_, GET_HIT_L1);
|
|
|
|
} else if (fp.GetHitFileLevel() >= 2) {
|
|
|
|
RecordTick(db_statistics_, GET_HIT_L2_AND_UP);
|
|
|
|
}
|
|
|
|
|
|
|
|
PERF_COUNTER_BY_LEVEL_ADD(user_key_return_count, 1,
|
|
|
|
fp.GetHitFileLevel());
|
|
|
|
|
|
|
|
if (is_blob_index) {
|
|
|
|
if (do_merge && value) {
|
|
|
|
*status = GetBlob(read_options, user_key, *value, value);
|
|
|
|
if (!status->ok()) {
|
|
|
|
if (status->IsIncomplete()) {
|
|
|
|
get_context.MarkKeyMayExist();
|
|
|
|
}
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return;
|
|
|
|
case GetContext::kDeleted:
|
|
|
|
// Use empty error message for speed
|
|
|
|
*status = Status::NotFound();
|
|
|
|
return;
|
|
|
|
case GetContext::kCorrupt:
|
|
|
|
*status = Status::Corruption("corrupted key for ", user_key);
|
|
|
|
return;
|
|
|
|
case GetContext::kUnexpectedBlobIndex:
|
|
|
|
ROCKS_LOG_ERROR(info_log_, "Encounter unexpected blob index.");
|
|
|
|
*status = Status::NotSupported(
|
|
|
|
"Encounter unexpected blob index. Please open DB with "
|
|
|
|
"ROCKSDB_NAMESPACE::blob_db::BlobDB instead.");
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
f = fp.GetNextFile();
|
|
|
|
}
|
|
|
|
if (db_statistics_ != nullptr) {
|
|
|
|
get_context.ReportCounters();
|
|
|
|
}
|
|
|
|
if (GetContext::kMerge == get_context.State()) {
|
New API to get all merge operands for a Key (#5604)
Summary:
This is a new API added to db.h to allow for fetching all merge operands associated with a Key. The main motivation for this API is to support use cases where doing a full online merge is not necessary as it is performance sensitive. Example use-cases:
1. Update subset of columns and read subset of columns -
Imagine a SQL Table, a row is encoded as a K/V pair (as it is done in MyRocks). If there are many columns and users only updated one of them, we can use merge operator to reduce write amplification. While users only read one or two columns in the read query, this feature can avoid a full merging of the whole row, and save some CPU.
2. Updating very few attributes in a value which is a JSON-like document -
Updating one attribute can be done efficiently using merge operator, while reading back one attribute can be done more efficiently if we don't need to do a full merge.
----------------------------------------------------------------------------------------------------
API :
Status GetMergeOperands(
const ReadOptions& options, ColumnFamilyHandle* column_family,
const Slice& key, PinnableSlice* merge_operands,
GetMergeOperandsOptions* get_merge_operands_options,
int* number_of_operands)
Example usage :
int size = 100;
int number_of_operands = 0;
std::vector<PinnableSlice> values(size);
GetMergeOperandsOptions merge_operands_info;
db_->GetMergeOperands(ReadOptions(), db_->DefaultColumnFamily(), "k1", values.data(), merge_operands_info, &number_of_operands);
Description :
Returns all the merge operands corresponding to the key. If the number of merge operands in DB is greater than merge_operands_options.expected_max_number_of_operands no merge operands are returned and status is Incomplete. Merge operands returned are in the order of insertion.
merge_operands-> Points to an array of at-least merge_operands_options.expected_max_number_of_operands and the caller is responsible for allocating it. If the status returned is Incomplete then number_of_operands will contain the total number of merge operands found in DB for key.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5604
Test Plan:
Added unit test and perf test in db_bench that can be run using the command:
./db_bench -benchmarks=getmergeoperands --merge_operator=sortlist
Differential Revision: D16657366
Pulled By: vjnadimpalli
fbshipit-source-id: 0faadd752351745224ee12d4ae9ef3cb529951bf
5 years ago
|
|
|
if (!do_merge) {
|
|
|
|
*status = Status::OK();
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
if (!merge_operator_) {
|
|
|
|
*status = Status::InvalidArgument(
|
|
|
|
"merge_operator is not properly initialized.");
|
|
|
|
return;
|
|
|
|
}
|
[RocksDB] [MergeOperator] The new Merge Interface! Uses merge sequences.
Summary:
Here are the major changes to the Merge Interface. It has been expanded
to handle cases where the MergeOperator is not associative. It does so by stacking
up merge operations while scanning through the key history (i.e.: during Get() or
Compaction), until a valid Put/Delete/end-of-history is encountered; it then
applies all of the merge operations in the correct sequence starting with the
base/sentinel value.
I have also introduced an "AssociativeMerge" function which allows the user to
take advantage of associative merge operations (such as in the case of counters).
The implementation will always attempt to merge the operations/operands themselves
together when they are encountered, and will resort to the "stacking" method if
and only if the "associative-merge" fails.
This implementation is conjectured to allow MergeOperator to handle the general
case, while still providing the user with the ability to take advantage of certain
efficiencies in their own merge-operator / data-structure.
NOTE: This is a preliminary diff. This must still go through a lot of review,
revision, and testing. Feedback welcome!
Test Plan:
-This is a preliminary diff. I have only just begun testing/debugging it.
-I will be testing this with the existing MergeOperator use-cases and unit-tests
(counters, string-append, and redis-lists)
-I will be "desk-checking" and walking through the code with the help gdb.
-I will find a way of stress-testing the new interface / implementation using
db_bench, db_test, merge_test, and/or db_stress.
-I will ensure that my tests cover all cases: Get-Memtable,
Get-Immutable-Memtable, Get-from-Disk, Iterator-Range-Scan, Flush-Memtable-to-L0,
Compaction-L0-L1, Compaction-Ln-L(n+1), Put/Delete found, Put/Delete not-found,
end-of-history, end-of-file, etc.
-A lot of feedback from the reviewers.
Reviewers: haobo, dhruba, zshao, emayanke
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11499
12 years ago
|
|
|
// merge_operands are in saver and we hit the beginning of the key history
|
|
|
|
// do a final merge of nullptr and operands;
|
|
|
|
std::string* str_value = value != nullptr ? value->GetSelf() : nullptr;
|
|
|
|
*status = MergeHelper::TimedFullMerge(
|
|
|
|
merge_operator_, user_key, nullptr, merge_context->GetOperands(),
|
|
|
|
str_value, info_log_, db_statistics_, clock_,
|
|
|
|
nullptr /* result_operand */, true);
|
|
|
|
if (LIKELY(value != nullptr)) {
|
|
|
|
value->PinSelf();
|
|
|
|
}
|
|
|
|
} else {
|
Use SST files for Transaction conflict detection
Summary:
Currently, transactions can fail even if there is no actual write conflict. This is due to relying on only the memtables to check for write-conflicts. Users have to tune memtable settings to try to avoid this, but it's hard to figure out exactly how to tune these settings.
With this diff, TransactionDB will use both memtables and SST files to determine if there are any write conflicts. This relies on the fact that BlockBasedTable stores sequence numbers for all writes that happen after any open snapshot. Also, D50295 is needed to prevent SingleDelete from disappearing writes (the TODOs in this test code will be fixed once the other diff is approved and merged).
Note that Optimistic transactions will still rely on tuning memtable settings as we do not want to read from SST while on the write thread. Also, memtable settings can still be used to reduce how often TransactionDB needs to read SST files.
Test Plan: unit tests, db bench
Reviewers: rven, yhchiang, kradhakrishnan, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb, yoshinorim
Differential Revision: https://reviews.facebook.net/D50475
9 years ago
|
|
|
if (key_exists != nullptr) {
|
|
|
|
*key_exists = false;
|
|
|
|
}
|
|
|
|
*status = Status::NotFound(); // Use an empty error message for speed
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
void Version::MultiGet(const ReadOptions& read_options, MultiGetRange* range,
|
|
|
|
ReadCallback* callback) {
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
PinnedIteratorsManager pinned_iters_mgr;
|
|
|
|
|
|
|
|
// Pin blocks that we read to hold merge operands
|
|
|
|
if (merge_operator_) {
|
|
|
|
pinned_iters_mgr.StartPinning();
|
|
|
|
}
|
|
|
|
uint64_t tracing_mget_id = BlockCacheTraceHelper::kReservedGetId;
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
|
|
|
|
if (vset_ && vset_->block_cache_tracer_ &&
|
|
|
|
vset_->block_cache_tracer_->is_tracing_enabled()) {
|
|
|
|
tracing_mget_id = vset_->block_cache_tracer_->NextGetId();
|
|
|
|
}
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
// Even though we know the batch size won't be > MAX_BATCH_SIZE,
|
|
|
|
// use autovector in order to avoid unnecessary construction of GetContext
|
|
|
|
// objects, which is expensive
|
|
|
|
autovector<GetContext, 16> get_ctx;
|
|
|
|
for (auto iter = range->begin(); iter != range->end(); ++iter) {
|
|
|
|
assert(iter->s->ok() || iter->s->IsMergeInProgress());
|
|
|
|
get_ctx.emplace_back(
|
|
|
|
user_comparator(), merge_operator_, info_log_, db_statistics_,
|
|
|
|
iter->s->ok() ? GetContext::kNotFound : GetContext::kMerge,
|
|
|
|
iter->ukey_with_ts, iter->value, iter->timestamp, nullptr,
|
|
|
|
&(iter->merge_context), true, &iter->max_covering_tombstone_seq, clock_,
|
|
|
|
nullptr, merge_operator_ ? &pinned_iters_mgr : nullptr, callback,
|
|
|
|
&iter->is_blob_index, tracing_mget_id);
|
|
|
|
// MergeInProgress status, if set, has been transferred to the get_context
|
|
|
|
// state, so we set status to ok here. From now on, the iter status will
|
|
|
|
// be used for IO errors, and get_context state will be used for any
|
|
|
|
// key level errors
|
|
|
|
*(iter->s) = Status::OK();
|
|
|
|
}
|
|
|
|
int get_ctx_index = 0;
|
|
|
|
for (auto iter = range->begin(); iter != range->end();
|
|
|
|
++iter, get_ctx_index++) {
|
|
|
|
iter->get_context = &(get_ctx[get_ctx_index]);
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
}
|
|
|
|
|
|
|
|
MultiGetRange file_picker_range(*range, range->begin(), range->end());
|
|
|
|
FilePickerMultiGet fp(
|
|
|
|
&file_picker_range,
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
&storage_info_.level_files_brief_, storage_info_.num_non_empty_levels_,
|
|
|
|
&storage_info_.file_indexer_, user_comparator(), internal_comparator());
|
|
|
|
FdWithKeyRange* f = fp.GetNextFile();
|
|
|
|
Status s;
|
|
|
|
uint64_t num_index_read = 0;
|
|
|
|
uint64_t num_filter_read = 0;
|
|
|
|
uint64_t num_data_read = 0;
|
|
|
|
uint64_t num_sst_read = 0;
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
|
|
|
|
while (f != nullptr) {
|
|
|
|
MultiGetRange file_range = fp.CurrentFileRange();
|
|
|
|
bool timer_enabled =
|
|
|
|
GetPerfLevel() >= PerfLevel::kEnableTimeExceptForMutex &&
|
|
|
|
get_perf_context()->per_level_perf_context_enabled;
|
|
|
|
StopWatchNano timer(clock_, timer_enabled /* auto_start */);
|
|
|
|
s = table_cache_->MultiGet(
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
read_options, *internal_comparator(), *f->file_metadata, &file_range,
|
|
|
|
mutable_cf_options_.prefix_extractor.get(),
|
|
|
|
cfd_->internal_stats()->GetFileReadHist(fp.GetHitFileLevel()),
|
|
|
|
IsFilterSkipped(static_cast<int>(fp.GetHitFileLevel()),
|
|
|
|
fp.IsHitFileLastInLevel()),
|
|
|
|
fp.GetHitFileLevel());
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
// TODO: examine the behavior for corrupted key
|
|
|
|
if (timer_enabled) {
|
|
|
|
PERF_COUNTER_BY_LEVEL_ADD(get_from_table_nanos, timer.ElapsedNanos(),
|
|
|
|
fp.GetHitFileLevel());
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
}
|
|
|
|
if (!s.ok()) {
|
|
|
|
// TODO: Set status for individual keys appropriately
|
|
|
|
for (auto iter = file_range.begin(); iter != file_range.end(); ++iter) {
|
|
|
|
*iter->s = s;
|
|
|
|
file_range.MarkKeyDone(iter);
|
|
|
|
}
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
uint64_t batch_size = 0;
|
|
|
|
for (auto iter = file_range.begin(); s.ok() && iter != file_range.end();
|
|
|
|
++iter) {
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
GetContext& get_context = *iter->get_context;
|
|
|
|
Status* status = iter->s;
|
|
|
|
// The Status in the KeyContext takes precedence over GetContext state
|
|
|
|
// Status may be an error if there were any IO errors in the table
|
|
|
|
// reader. We never expect Status to be NotFound(), as that is
|
|
|
|
// determined by get_context
|
|
|
|
assert(!status->IsNotFound());
|
|
|
|
if (!status->ok()) {
|
|
|
|
file_range.MarkKeyDone(iter);
|
|
|
|
continue;
|
|
|
|
}
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
|
|
|
|
if (get_context.sample()) {
|
|
|
|
sample_file_read_inc(f->file_metadata);
|
|
|
|
}
|
|
|
|
batch_size++;
|
|
|
|
num_index_read += get_context.get_context_stats_.num_index_read;
|
|
|
|
num_filter_read += get_context.get_context_stats_.num_filter_read;
|
|
|
|
num_data_read += get_context.get_context_stats_.num_data_read;
|
|
|
|
num_sst_read += get_context.get_context_stats_.num_sst_read;
|
|
|
|
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
// report the counters before returning
|
|
|
|
if (get_context.State() != GetContext::kNotFound &&
|
|
|
|
get_context.State() != GetContext::kMerge &&
|
|
|
|
db_statistics_ != nullptr) {
|
|
|
|
get_context.ReportCounters();
|
|
|
|
} else {
|
|
|
|
if (iter->max_covering_tombstone_seq > 0) {
|
|
|
|
// The remaining files we look at will only contain covered keys, so
|
|
|
|
// we stop here for this key
|
|
|
|
file_picker_range.SkipKey(iter);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
switch (get_context.State()) {
|
|
|
|
case GetContext::kNotFound:
|
|
|
|
// Keep searching in other files
|
|
|
|
break;
|
|
|
|
case GetContext::kMerge:
|
|
|
|
// TODO: update per-level perfcontext user_key_return_count for kMerge
|
|
|
|
break;
|
|
|
|
case GetContext::kFound:
|
|
|
|
if (fp.GetHitFileLevel() == 0) {
|
|
|
|
RecordTick(db_statistics_, GET_HIT_L0);
|
|
|
|
} else if (fp.GetHitFileLevel() == 1) {
|
|
|
|
RecordTick(db_statistics_, GET_HIT_L1);
|
|
|
|
} else if (fp.GetHitFileLevel() >= 2) {
|
|
|
|
RecordTick(db_statistics_, GET_HIT_L2_AND_UP);
|
|
|
|
}
|
|
|
|
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
PERF_COUNTER_BY_LEVEL_ADD(user_key_return_count, 1,
|
|
|
|
fp.GetHitFileLevel());
|
|
|
|
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
file_range.MarkKeyDone(iter);
|
|
|
|
|
|
|
|
if (iter->is_blob_index) {
|
|
|
|
if (iter->value) {
|
|
|
|
*status = GetBlob(read_options, iter->ukey_with_ts, *iter->value,
|
|
|
|
iter->value);
|
|
|
|
if (!status->ok()) {
|
|
|
|
if (status->IsIncomplete()) {
|
|
|
|
get_context.MarkKeyMayExist();
|
|
|
|
}
|
|
|
|
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
file_range.AddValueSize(iter->value->size());
|
|
|
|
if (file_range.GetValueSize() > read_options.value_size_soft_limit) {
|
|
|
|
s = Status::Aborted();
|
|
|
|
break;
|
|
|
|
}
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
continue;
|
|
|
|
case GetContext::kDeleted:
|
|
|
|
// Use empty error message for speed
|
|
|
|
*status = Status::NotFound();
|
|
|
|
file_range.MarkKeyDone(iter);
|
|
|
|
continue;
|
|
|
|
case GetContext::kCorrupt:
|
|
|
|
*status =
|
|
|
|
Status::Corruption("corrupted key for ", iter->lkey->user_key());
|
|
|
|
file_range.MarkKeyDone(iter);
|
|
|
|
continue;
|
|
|
|
case GetContext::kUnexpectedBlobIndex:
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
ROCKS_LOG_ERROR(info_log_, "Encounter unexpected blob index.");
|
|
|
|
*status = Status::NotSupported(
|
|
|
|
"Encounter unexpected blob index. Please open DB with "
|
|
|
|
"ROCKSDB_NAMESPACE::blob_db::BlobDB instead.");
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
file_range.MarkKeyDone(iter);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
// Report MultiGet stats per level.
|
|
|
|
if (fp.IsHitFileLastInLevel()) {
|
|
|
|
// Dump the stats if this is the last file of this level and reset for
|
|
|
|
// next level.
|
|
|
|
RecordInHistogram(db_statistics_,
|
|
|
|
NUM_INDEX_AND_FILTER_BLOCKS_READ_PER_LEVEL,
|
|
|
|
num_index_read + num_filter_read);
|
|
|
|
RecordInHistogram(db_statistics_, NUM_DATA_BLOCKS_READ_PER_LEVEL,
|
|
|
|
num_data_read);
|
|
|
|
RecordInHistogram(db_statistics_, NUM_SST_READ_PER_LEVEL, num_sst_read);
|
|
|
|
num_filter_read = 0;
|
|
|
|
num_index_read = 0;
|
|
|
|
num_data_read = 0;
|
|
|
|
num_sst_read = 0;
|
|
|
|
}
|
|
|
|
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
RecordInHistogram(db_statistics_, SST_BATCH_SIZE, batch_size);
|
|
|
|
if (!s.ok() || file_picker_range.empty()) {
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
break;
|
|
|
|
}
|
|
|
|
f = fp.GetNextFile();
|
|
|
|
}
|
|
|
|
|
|
|
|
// Process any left over keys
|
|
|
|
for (auto iter = range->begin(); s.ok() && iter != range->end(); ++iter) {
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
GetContext& get_context = *iter->get_context;
|
|
|
|
Status* status = iter->s;
|
|
|
|
Slice user_key = iter->lkey->user_key();
|
|
|
|
|
|
|
|
if (db_statistics_ != nullptr) {
|
|
|
|
get_context.ReportCounters();
|
|
|
|
}
|
|
|
|
if (GetContext::kMerge == get_context.State()) {
|
|
|
|
if (!merge_operator_) {
|
|
|
|
*status = Status::InvalidArgument(
|
|
|
|
"merge_operator is not properly initialized.");
|
|
|
|
range->MarkKeyDone(iter);
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
// merge_operands are in saver and we hit the beginning of the key history
|
|
|
|
// do a final merge of nullptr and operands;
|
|
|
|
std::string* str_value =
|
|
|
|
iter->value != nullptr ? iter->value->GetSelf() : nullptr;
|
|
|
|
*status = MergeHelper::TimedFullMerge(
|
|
|
|
merge_operator_, user_key, nullptr, iter->merge_context.GetOperands(),
|
|
|
|
str_value, info_log_, db_statistics_, clock_,
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
nullptr /* result_operand */, true);
|
|
|
|
if (LIKELY(iter->value != nullptr)) {
|
|
|
|
iter->value->PinSelf();
|
|
|
|
range->AddValueSize(iter->value->size());
|
|
|
|
range->MarkKeyDone(iter);
|
|
|
|
if (range->GetValueSize() > read_options.value_size_soft_limit) {
|
|
|
|
s = Status::Aborted();
|
|
|
|
break;
|
|
|
|
}
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
}
|
|
|
|
} else {
|
|
|
|
range->MarkKeyDone(iter);
|
|
|
|
*status = Status::NotFound(); // Use an empty error message for speed
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
for (auto iter = range->begin(); iter != range->end(); ++iter) {
|
|
|
|
range->MarkKeyDone(iter);
|
|
|
|
*(iter->s) = s;
|
|
|
|
}
|
Introduce a new MultiGet batching implementation (#5011)
Summary:
This PR introduces a new MultiGet() API, with the underlying implementation grouping keys based on SST file and batching lookups in a file. The reason for the new API is twofold - the definition allows callers to allocate storage for status and values on stack instead of std::vector, as well as return values as PinnableSlices in order to avoid copying, and it keeps the original MultiGet() implementation intact while we experiment with batching.
Batching is useful when there is some spatial locality to the keys being queries, as well as larger batch sizes. The main benefits are due to -
1. Fewer function calls, especially to BlockBasedTableReader::MultiGet() and FullFilterBlockReader::KeysMayMatch()
2. Bloom filter cachelines can be prefetched, hiding the cache miss latency
The next step is to optimize the binary searches in the level_storage_info, index blocks and data blocks, since we could reduce the number of key comparisons if the keys are relatively close to each other. The batching optimizations also need to be extended to other formats, such as PlainTable and filter formats. This also needs to be added to db_stress.
Benchmark results from db_bench for various batch size/locality of reference combinations are given below. Locality was simulated by offsetting the keys in a batch by a stride length. Each SST file is about 8.6MB uncompressed and key/value size is 16/100 uncompressed. To focus on the cpu benefit of batching, the runs were single threaded and bound to the same cpu to eliminate interference from other system events. The results show a 10-25% improvement in micros/op from smaller to larger batch sizes (4 - 32).
Batch Sizes
1 | 2 | 4 | 8 | 16 | 32
Random pattern (Stride length 0)
4.158 | 4.109 | 4.026 | 4.05 | 4.1 | 4.074 - Get
4.438 | 4.302 | 4.165 | 4.122 | 4.096 | 4.075 - MultiGet (no batching)
4.461 | 4.256 | 4.277 | 4.11 | 4.182 | 4.14 - MultiGet (w/ batching)
Good locality (Stride length 16)
4.048 | 3.659 | 3.248 | 2.99 | 2.84 | 2.753
4.429 | 3.728 | 3.406 | 3.053 | 2.911 | 2.781
4.452 | 3.45 | 2.833 | 2.451 | 2.233 | 2.135
Good locality (Stride length 256)
4.066 | 3.786 | 3.581 | 3.447 | 3.415 | 3.232
4.406 | 4.005 | 3.644 | 3.49 | 3.381 | 3.268
4.393 | 3.649 | 3.186 | 2.882 | 2.676 | 2.62
Medium locality (Stride length 4096)
4.012 | 3.922 | 3.768 | 3.61 | 3.582 | 3.555
4.364 | 4.057 | 3.791 | 3.65 | 3.57 | 3.465
4.479 | 3.758 | 3.316 | 3.077 | 2.959 | 2.891
dbbench command used (on a DB with 4 levels, 12 million keys)-
TEST_TMPDIR=/dev/shm numactl -C 10 ./db_bench.tmp -use_existing_db=true -benchmarks="readseq,multireadrandom" -write_buffer_size=4194304 -target_file_size_base=4194304 -max_bytes_for_level_base=16777216 -num=12000000 -reads=12000000 -duration=90 -threads=1 -compression_type=none -cache_size=4194304000 -batch_size=32 -disable_auto_compactions=true -bloom_bits=10 -cache_index_and_filter_blocks=true -pin_l0_filter_and_index_blocks_in_cache=true -multiread_batched=true -multiread_stride=4
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5011
Differential Revision: D14348703
Pulled By: anand1976
fbshipit-source-id: 774406dab3776d979c809522a67bedac6c17f84b
6 years ago
|
|
|
}
|
|
|
|
|
|
|
|
bool Version::IsFilterSkipped(int level, bool is_file_last_in_level) {
|
Skip bottom-level filter block caching when hit-optimized
Summary:
When Get() or NewIterator() trigger file loads, skip caching the filter block if
(1) optimize_filters_for_hits is set and (2) the file is on the bottommost
level. Also skip checking filters under the same conditions, which means that
for a preloaded file or a file that was trivially-moved to the bottom level, its
filter block will eventually expire from the cache.
- added parameters/instance variables in various places in order to propagate the config ("skip_filters") from version_set to block_based_table_reader
- in BlockBasedTable::Rep, this optimization prevents filter from being loaded when the file is opened simply by setting filter_policy = nullptr
- in BlockBasedTable::Get/BlockBasedTable::NewIterator, this optimization prevents filter from being used (even if it was loaded already) by setting filter = nullptr
Test Plan:
updated unit test:
$ ./db_test --gtest_filter=DBTest.OptimizeFiltersForHits
will also run 'make check'
Reviewers: sdong, igor, paultuckfield, anthony, rven, kradhakrishnan, IslamAbdelRahman, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D51633
9 years ago
|
|
|
// Reaching the bottom level implies misses at all upper levels, so we'll
|
|
|
|
// skip checking the filters when we predict a hit.
|
|
|
|
return cfd_->ioptions()->optimize_filters_for_hits &&
|
|
|
|
(level > 0 || is_file_last_in_level) &&
|
Skip bottom-level filter block caching when hit-optimized
Summary:
When Get() or NewIterator() trigger file loads, skip caching the filter block if
(1) optimize_filters_for_hits is set and (2) the file is on the bottommost
level. Also skip checking filters under the same conditions, which means that
for a preloaded file or a file that was trivially-moved to the bottom level, its
filter block will eventually expire from the cache.
- added parameters/instance variables in various places in order to propagate the config ("skip_filters") from version_set to block_based_table_reader
- in BlockBasedTable::Rep, this optimization prevents filter from being loaded when the file is opened simply by setting filter_policy = nullptr
- in BlockBasedTable::Get/BlockBasedTable::NewIterator, this optimization prevents filter from being used (even if it was loaded already) by setting filter = nullptr
Test Plan:
updated unit test:
$ ./db_test --gtest_filter=DBTest.OptimizeFiltersForHits
will also run 'make check'
Reviewers: sdong, igor, paultuckfield, anthony, rven, kradhakrishnan, IslamAbdelRahman, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D51633
9 years ago
|
|
|
level == storage_info_.num_non_empty_levels() - 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
void VersionStorageInfo::GenerateLevelFilesBrief() {
|
|
|
|
level_files_brief_.resize(num_non_empty_levels_);
|
create compressed_levels_ in Version, allocate its space using arena. Make Version::Get, Version::FindFile faster
Summary:
Define CompressedFileMetaData that just contains fd, smallest_slice, largest_slice. Create compressed_levels_ in Version, the space is allocated using arena
Thus increase the file meta data locality, speed up "Get" and "FindFile"
benchmark with in-memory tmpfs, could have 4% improvement under "random read" and 2% improvement under "read while writing"
benchmark command:
./db_bench --db=/mnt/db/rocksdb --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --block_size=4096 --cache_size=17179869184 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=33554432 --max_bytes_for_level_base=1073741824 --disable_wal=0 --sync=0 --disable_data_sync=1 --verify_checksum=1 --delete_obsolete_files_period_micros=314572800 --max_grandparent_overlap_factor=10 --max_background_compactions=4 --max_background_flushes=0 --level0_slowdown_writes_trigger=16 --level0_stop_writes_trigger=24 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --mmap_read=1 --mmap_write=0 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --perf_level=0 --benchmarks=readwhilewriting,readwhilewriting,readwhilewriting --use_existing_db=1 --num=52428800 --threads=1 —writes_per_second=81920
Read Random:
From 1.8363 ms/op, improve to 1.7587 ms/op.
Read while writing:
From 2.985 ms/op, improve to 2.924 ms/op.
Test Plan:
make all check
Reviewers: ljin, haobo, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, igor
Differential Revision: https://reviews.facebook.net/D19419
11 years ago
|
|
|
for (int level = 0; level < num_non_empty_levels_; level++) {
|
|
|
|
DoGenerateLevelFilesBrief(
|
|
|
|
&level_files_brief_[level], files_[level], &arena_);
|
create compressed_levels_ in Version, allocate its space using arena. Make Version::Get, Version::FindFile faster
Summary:
Define CompressedFileMetaData that just contains fd, smallest_slice, largest_slice. Create compressed_levels_ in Version, the space is allocated using arena
Thus increase the file meta data locality, speed up "Get" and "FindFile"
benchmark with in-memory tmpfs, could have 4% improvement under "random read" and 2% improvement under "read while writing"
benchmark command:
./db_bench --db=/mnt/db/rocksdb --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --block_size=4096 --cache_size=17179869184 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=33554432 --max_bytes_for_level_base=1073741824 --disable_wal=0 --sync=0 --disable_data_sync=1 --verify_checksum=1 --delete_obsolete_files_period_micros=314572800 --max_grandparent_overlap_factor=10 --max_background_compactions=4 --max_background_flushes=0 --level0_slowdown_writes_trigger=16 --level0_stop_writes_trigger=24 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --mmap_read=1 --mmap_write=0 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --perf_level=0 --benchmarks=readwhilewriting,readwhilewriting,readwhilewriting --use_existing_db=1 --num=52428800 --threads=1 —writes_per_second=81920
Read Random:
From 1.8363 ms/op, improve to 1.7587 ms/op.
Read while writing:
From 2.985 ms/op, improve to 2.924 ms/op.
Test Plan:
make all check
Reviewers: ljin, haobo, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, igor
Differential Revision: https://reviews.facebook.net/D19419
11 years ago
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
void Version::PrepareApply(
|
|
|
|
const MutableCFOptions& mutable_cf_options,
|
|
|
|
bool update_stats) {
|
|
|
|
TEST_SYNC_POINT_CALLBACK(
|
|
|
|
"Version::PrepareApply:forced_check",
|
|
|
|
reinterpret_cast<void*>(&storage_info_.force_consistency_checks_));
|
|
|
|
UpdateAccumulatedStats(update_stats);
|
|
|
|
storage_info_.UpdateNumNonEmptyLevels();
|
options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size bases of levels dynamically.
Summary:
When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases.
In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier.
Test Plan: New unit tests and pass tests suites including valgrind.
Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo
Reviewed By: ikabiljo
Subscribers: yoshinorim, ikabiljo, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31437
10 years ago
|
|
|
storage_info_.CalculateBaseBytes(*cfd_->ioptions(), mutable_cf_options);
|
|
|
|
storage_info_.UpdateFilesByCompactionPri(cfd_->ioptions()->compaction_pri);
|
|
|
|
storage_info_.GenerateFileIndexer();
|
|
|
|
storage_info_.GenerateLevelFilesBrief();
|
Allowing L0 -> L1 trivial move on sorted data
Summary:
This diff updates the logic of how we do trivial move, now trivial move can run on any number of files in input level as long as they are not overlapping
The conditions for trivial move have been updated
Introduced conditions:
- Trivial move cannot happen if we have a compaction filter (except if the compaction is not manual)
- Input level files cannot be overlapping
Removed conditions:
- Trivial move only run when the compaction is not manual
- Input level should can contain only 1 file
More context on what tests failed because of Trivial move
```
DBTest.CompactionsGenerateMultipleFiles
This test is expecting compaction on a file in L0 to generate multiple files in L1, this test will fail with trivial move because we end up with one file in L1
```
```
DBTest.NoSpaceCompactRange
This test expect compaction to fail when we force environment to report running out of space, of course this is not valid in trivial move situation
because trivial move does not need any extra space, and did not check for that
```
```
DBTest.DropWrites
Similar to DBTest.NoSpaceCompactRange
```
```
DBTest.DeleteObsoleteFilesPendingOutputs
This test expect that a file in L2 is deleted after it's moved to L3, this is not valid with trivial move because although the file was moved it is now used by L3
```
```
CuckooTableDBTest.CompactionIntoMultipleFiles
Same as DBTest.CompactionsGenerateMultipleFiles
```
This diff is based on a work by @sdong https://reviews.facebook.net/D34149
Test Plan: make -j64 check
Reviewers: rven, sdong, igor
Reviewed By: igor
Subscribers: yhchiang, ott, march, dhruba, sdong
Differential Revision: https://reviews.facebook.net/D34797
10 years ago
|
|
|
storage_info_.GenerateLevel0NonOverlapping();
|
|
|
|
storage_info_.GenerateBottommostFiles();
|
|
|
|
}
|
|
|
|
|
|
|
|
bool Version::MaybeInitializeFileMetaData(FileMetaData* file_meta) {
|
|
|
|
if (file_meta->init_stats_from_file ||
|
|
|
|
file_meta->compensated_file_size > 0) {
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
std::shared_ptr<const TableProperties> tp;
|
|
|
|
Status s = GetTableProperties(&tp, file_meta);
|
|
|
|
file_meta->init_stats_from_file = true;
|
|
|
|
if (!s.ok()) {
|
|
|
|
ROCKS_LOG_ERROR(vset_->db_options_->info_log,
|
|
|
|
"Unable to load table properties for file %" PRIu64
|
|
|
|
" --- %s\n",
|
|
|
|
file_meta->fd.GetNumber(), s.ToString().c_str());
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
if (tp.get() == nullptr) return false;
|
|
|
|
file_meta->num_entries = tp->num_entries;
|
|
|
|
file_meta->num_deletions = tp->num_deletions;
|
|
|
|
file_meta->raw_value_size = tp->raw_value_size;
|
|
|
|
file_meta->raw_key_size = tp->raw_key_size;
|
|
|
|
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
void VersionStorageInfo::UpdateAccumulatedStats(FileMetaData* file_meta) {
|
|
|
|
TEST_SYNC_POINT_CALLBACK("VersionStorageInfo::UpdateAccumulatedStats",
|
|
|
|
nullptr);
|
|
|
|
|
|
|
|
assert(file_meta->init_stats_from_file);
|
|
|
|
accumulated_file_size_ += file_meta->fd.GetFileSize();
|
|
|
|
accumulated_raw_key_size_ += file_meta->raw_key_size;
|
|
|
|
accumulated_raw_value_size_ += file_meta->raw_value_size;
|
|
|
|
accumulated_num_non_deletions_ +=
|
|
|
|
file_meta->num_entries - file_meta->num_deletions;
|
|
|
|
accumulated_num_deletions_ += file_meta->num_deletions;
|
|
|
|
|
|
|
|
current_num_non_deletions_ +=
|
|
|
|
file_meta->num_entries - file_meta->num_deletions;
|
|
|
|
current_num_deletions_ += file_meta->num_deletions;
|
|
|
|
current_num_samples_++;
|
|
|
|
}
|
|
|
|
|
|
|
|
void VersionStorageInfo::RemoveCurrentStats(FileMetaData* file_meta) {
|
|
|
|
if (file_meta->init_stats_from_file) {
|
|
|
|
current_num_non_deletions_ -=
|
|
|
|
file_meta->num_entries - file_meta->num_deletions;
|
|
|
|
current_num_deletions_ -= file_meta->num_deletions;
|
|
|
|
current_num_samples_--;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
void Version::UpdateAccumulatedStats(bool update_stats) {
|
|
|
|
if (update_stats) {
|
|
|
|
// maximum number of table properties loaded from files.
|
|
|
|
const int kMaxInitCount = 20;
|
|
|
|
int init_count = 0;
|
|
|
|
// here only the first kMaxInitCount files which haven't been
|
|
|
|
// initialized from file will be updated with num_deletions.
|
|
|
|
// The motivation here is to cap the maximum I/O per Version creation.
|
|
|
|
// The reason for choosing files from lower-level instead of higher-level
|
|
|
|
// is that such design is able to propagate the initialization from
|
|
|
|
// lower-level to higher-level: When the num_deletions of lower-level
|
|
|
|
// files are updated, it will make the lower-level files have accurate
|
|
|
|
// compensated_file_size, making lower-level to higher-level compaction
|
|
|
|
// will be triggered, which creates higher-level files whose num_deletions
|
|
|
|
// will be updated here.
|
|
|
|
for (int level = 0;
|
|
|
|
level < storage_info_.num_levels_ && init_count < kMaxInitCount;
|
|
|
|
++level) {
|
|
|
|
for (auto* file_meta : storage_info_.files_[level]) {
|
|
|
|
if (MaybeInitializeFileMetaData(file_meta)) {
|
|
|
|
// each FileMeta will be initialized only once.
|
|
|
|
storage_info_.UpdateAccumulatedStats(file_meta);
|
|
|
|
// when option "max_open_files" is -1, all the file metadata has
|
|
|
|
// already been read, so MaybeInitializeFileMetaData() won't incur
|
|
|
|
// any I/O cost. "max_open_files=-1" means that the table cache passed
|
|
|
|
// to the VersionSet and then to the ColumnFamilySet has a size of
|
|
|
|
// TableCache::kInfiniteCapacity
|
|
|
|
if (vset_->GetColumnFamilySet()->get_table_cache()->GetCapacity() ==
|
|
|
|
TableCache::kInfiniteCapacity) {
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
if (++init_count >= kMaxInitCount) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
// In case all sampled-files contain only deletion entries, then we
|
|
|
|
// load the table-property of a file in higher-level to initialize
|
|
|
|
// that value.
|
|
|
|
for (int level = storage_info_.num_levels_ - 1;
|
|
|
|
storage_info_.accumulated_raw_value_size_ == 0 && level >= 0;
|
|
|
|
--level) {
|
|
|
|
for (int i = static_cast<int>(storage_info_.files_[level].size()) - 1;
|
|
|
|
storage_info_.accumulated_raw_value_size_ == 0 && i >= 0; --i) {
|
|
|
|
if (MaybeInitializeFileMetaData(storage_info_.files_[level][i])) {
|
|
|
|
storage_info_.UpdateAccumulatedStats(storage_info_.files_[level][i]);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
storage_info_.ComputeCompensatedSizes();
|
|
|
|
}
|
|
|
|
|
|
|
|
void VersionStorageInfo::ComputeCompensatedSizes() {
|
|
|
|
static const int kDeletionWeightOnCompaction = 2;
|
|
|
|
uint64_t average_value_size = GetAverageValueSize();
|
|
|
|
|
|
|
|
// compute the compensated size
|
|
|
|
for (int level = 0; level < num_levels_; level++) {
|
|
|
|
for (auto* file_meta : files_[level]) {
|
|
|
|
// Here we only compute compensated_file_size for those file_meta
|
|
|
|
// which compensated_file_size is uninitialized (== 0). This is true only
|
|
|
|
// for files that have been created right now and no other thread has
|
|
|
|
// access to them. That's why we can safely mutate compensated_file_size.
|
|
|
|
if (file_meta->compensated_file_size == 0) {
|
|
|
|
file_meta->compensated_file_size = file_meta->fd.GetFileSize();
|
|
|
|
// Here we only boost the size of deletion entries of a file only
|
|
|
|
// when the number of deletion entries is greater than the number of
|
|
|
|
// non-deletion entries in the file. The motivation here is that in
|
|
|
|
// a stable workload, the number of deletion entries should be roughly
|
|
|
|
// equal to the number of non-deletion entries. If we compensate the
|
|
|
|
// size of deletion entries in a stable workload, the deletion
|
|
|
|
// compensation logic might introduce unwanted effet which changes the
|
|
|
|
// shape of LSM tree.
|
|
|
|
if (file_meta->num_deletions * 2 >= file_meta->num_entries) {
|
|
|
|
file_meta->compensated_file_size +=
|
|
|
|
(file_meta->num_deletions * 2 - file_meta->num_entries) *
|
|
|
|
average_value_size * kDeletionWeightOnCompaction;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
int VersionStorageInfo::MaxInputLevel() const {
|
|
|
|
if (compaction_style_ == kCompactionStyleLevel) {
|
|
|
|
return num_levels() - 2;
|
|
|
|
}
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
int VersionStorageInfo::MaxOutputLevel(bool allow_ingest_behind) const {
|
|
|
|
if (allow_ingest_behind) {
|
|
|
|
assert(num_levels() > 1);
|
|
|
|
return num_levels() - 2;
|
|
|
|
}
|
|
|
|
return num_levels() - 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
void VersionStorageInfo::EstimateCompactionBytesNeeded(
|
|
|
|
const MutableCFOptions& mutable_cf_options) {
|
|
|
|
// Only implemented for level-based compaction
|
|
|
|
if (compaction_style_ != kCompactionStyleLevel) {
|
When slowdown is triggered, reduce the write rate
Summary: It's usually hard for users to set a value of options.delayed_write_rate. With this diff, after slowdown condition triggers, we greedily reduce write rate if estimated pending compaction bytes increase. If estimated compaction pending bytes drop, we increase the write rate.
Test Plan:
Add a unit test
Test with db_bench setting:
TEST_TMPDIR=/dev/shm/ ./db_bench --benchmarks=fillrandom -num=10000000 --soft_pending_compaction_bytes_limit=1000000000 --hard_pending_compaction_bytes_limit=3000000000 --delayed_write_rate=100000000
and make sure without the commit, write stop will happen, but with the commit, it will not happen.
Reviewers: igor, anthony, rven, yhchiang, kradhakrishnan, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D52131
9 years ago
|
|
|
estimated_compaction_needed_bytes_ = 0;
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
// Start from Level 0, if level 0 qualifies compaction to level 1,
|
|
|
|
// we estimate the size of compaction.
|
|
|
|
// Then we move on to the next level and see whether it qualifies compaction
|
|
|
|
// to the next level. The size of the level is estimated as the actual size
|
|
|
|
// on the level plus the input bytes from the previous level if there is any.
|
|
|
|
// If it exceeds, take the exceeded bytes as compaction input and add the size
|
|
|
|
// of the compaction size to tatal size.
|
|
|
|
// We keep doing it to Level 2, 3, etc, until the last level and return the
|
|
|
|
// accumulated bytes.
|
|
|
|
|
|
|
|
uint64_t bytes_compact_to_next_level = 0;
|
|
|
|
uint64_t level_size = 0;
|
|
|
|
for (auto* f : files_[0]) {
|
|
|
|
level_size += f->fd.GetFileSize();
|
|
|
|
}
|
|
|
|
// Level 0
|
|
|
|
bool level0_compact_triggered = false;
|
|
|
|
if (static_cast<int>(files_[0].size()) >=
|
|
|
|
mutable_cf_options.level0_file_num_compaction_trigger ||
|
|
|
|
level_size >= mutable_cf_options.max_bytes_for_level_base) {
|
|
|
|
level0_compact_triggered = true;
|
|
|
|
estimated_compaction_needed_bytes_ = level_size;
|
|
|
|
bytes_compact_to_next_level = level_size;
|
|
|
|
} else {
|
|
|
|
estimated_compaction_needed_bytes_ = 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
// Level 1 and up.
|
|
|
|
uint64_t bytes_next_level = 0;
|
|
|
|
for (int level = base_level(); level <= MaxInputLevel(); level++) {
|
|
|
|
level_size = 0;
|
|
|
|
if (bytes_next_level > 0) {
|
|
|
|
#ifndef NDEBUG
|
|
|
|
uint64_t level_size2 = 0;
|
|
|
|
for (auto* f : files_[level]) {
|
|
|
|
level_size2 += f->fd.GetFileSize();
|
|
|
|
}
|
|
|
|
assert(level_size2 == bytes_next_level);
|
|
|
|
#endif
|
|
|
|
level_size = bytes_next_level;
|
|
|
|
bytes_next_level = 0;
|
|
|
|
} else {
|
|
|
|
for (auto* f : files_[level]) {
|
|
|
|
level_size += f->fd.GetFileSize();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (level == base_level() && level0_compact_triggered) {
|
|
|
|
// Add base level size to compaction if level0 compaction triggered.
|
|
|
|
estimated_compaction_needed_bytes_ += level_size;
|
|
|
|
}
|
|
|
|
// Add size added by previous compaction
|
|
|
|
level_size += bytes_compact_to_next_level;
|
|
|
|
bytes_compact_to_next_level = 0;
|
|
|
|
uint64_t level_target = MaxBytesForLevel(level);
|
|
|
|
if (level_size > level_target) {
|
|
|
|
bytes_compact_to_next_level = level_size - level_target;
|
|
|
|
// Estimate the actual compaction fan-out ratio as size ratio between
|
|
|
|
// the two levels.
|
|
|
|
|
|
|
|
assert(bytes_next_level == 0);
|
|
|
|
if (level + 1 < num_levels_) {
|
|
|
|
for (auto* f : files_[level + 1]) {
|
|
|
|
bytes_next_level += f->fd.GetFileSize();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (bytes_next_level > 0) {
|
|
|
|
assert(level_size > 0);
|
|
|
|
estimated_compaction_needed_bytes_ += static_cast<uint64_t>(
|
|
|
|
static_cast<double>(bytes_compact_to_next_level) *
|
|
|
|
(static_cast<double>(bytes_next_level) /
|
|
|
|
static_cast<double>(level_size) +
|
|
|
|
1));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
namespace {
|
|
|
|
uint32_t GetExpiredTtlFilesCount(const ImmutableCFOptions& ioptions,
|
|
|
|
const MutableCFOptions& mutable_cf_options,
|
|
|
|
const std::vector<FileMetaData*>& files) {
|
|
|
|
uint32_t ttl_expired_files_count = 0;
|
|
|
|
|
|
|
|
int64_t _current_time;
|
|
|
|
auto status = ioptions.env->GetCurrentTime(&_current_time);
|
|
|
|
if (status.ok()) {
|
|
|
|
const uint64_t current_time = static_cast<uint64_t>(_current_time);
|
|
|
|
for (FileMetaData* f : files) {
|
|
|
|
if (!f->being_compacted) {
|
|
|
|
uint64_t oldest_ancester_time = f->TryGetOldestAncesterTime();
|
|
|
|
if (oldest_ancester_time != 0 &&
|
|
|
|
oldest_ancester_time < (current_time - mutable_cf_options.ttl)) {
|
|
|
|
ttl_expired_files_count++;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return ttl_expired_files_count;
|
|
|
|
}
|
|
|
|
} // anonymous namespace
|
|
|
|
|
|
|
|
void VersionStorageInfo::ComputeCompactionScore(
|
|
|
|
const ImmutableCFOptions& immutable_cf_options,
|
|
|
|
const MutableCFOptions& mutable_cf_options) {
|
|
|
|
for (int level = 0; level <= MaxInputLevel(); level++) {
|
|
|
|
double score;
|
|
|
|
if (level == 0) {
|
|
|
|
// We treat level-0 specially by bounding the number of files
|
|
|
|
// instead of number of bytes for two reasons:
|
|
|
|
//
|
|
|
|
// (1) With larger write-buffer sizes, it is nice not to do too
|
|
|
|
// many level-0 compactions.
|
|
|
|
//
|
|
|
|
// (2) The files in level-0 are merged on every read and
|
|
|
|
// therefore we wish to avoid too many files when the individual
|
|
|
|
// file size is small (perhaps because of a small write-buffer
|
|
|
|
// setting, or very high compression ratios, or lots of
|
|
|
|
// overwrites/deletions).
|
|
|
|
int num_sorted_runs = 0;
|
|
|
|
uint64_t total_size = 0;
|
Add experimental API MarkForCompaction()
Summary:
Some Mongo+Rocks datasets in Parse's environment are not doing compactions very frequently. During the quiet period (with no IO), we'd like to schedule compactions so that our reads become faster. Also, aggressively compacting during quiet periods helps when write bursts happen. In addition, we also want to compact files that are containing deleted key ranges (like old oplog keys).
All of this is currently not possible with CompactRange() because it's single-threaded and blocks all other compactions from happening. Running CompactRange() risks an issue of blocking writes because we generate too much Level 0 files before the compaction is over. Stopping writes is very dangerous because they hold transaction locks. We tried running manual compaction once on Mongo+Rocks and everything fell apart.
MarkForCompaction() solves all of those problems. This is very light-weight manual compaction. It is lower priority than automatic compactions, which means it shouldn't interfere with background process keeping the LSM tree clean. However, if no automatic compactions need to be run (or we have extra background threads available), we will start compacting files that are marked for compaction.
Test Plan: added a new unit test
Reviewers: yhchiang, rven, MarkCallaghan, sdong
Reviewed By: sdong
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D37083
10 years ago
|
|
|
for (auto* f : files_[level]) {
|
|
|
|
if (!f->being_compacted) {
|
|
|
|
total_size += f->compensated_file_size;
|
|
|
|
num_sorted_runs++;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (compaction_style_ == kCompactionStyleUniversal) {
|
|
|
|
// For universal compaction, we use level0 score to indicate
|
|
|
|
// compaction score for the whole DB. Adding other levels as if
|
|
|
|
// they are L0 files.
|
|
|
|
for (int i = 1; i < num_levels(); i++) {
|
|
|
|
// Its possible that a subset of the files in a level may be in a
|
|
|
|
// compaction, due to delete triggered compaction or trivial move.
|
|
|
|
// In that case, the below check may not catch a level being
|
|
|
|
// compacted as it only checks the first file. The worst that can
|
|
|
|
// happen is a scheduled compaction thread will find nothing to do.
|
|
|
|
if (!files_[i].empty() && !files_[i][0]->being_compacted) {
|
|
|
|
num_sorted_runs++;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (compaction_style_ == kCompactionStyleFIFO) {
|
|
|
|
score = static_cast<double>(total_size) /
|
|
|
|
mutable_cf_options.compaction_options_fifo.max_table_files_size;
|
|
|
|
if (mutable_cf_options.compaction_options_fifo.allow_compaction) {
|
|
|
|
score = std::max(
|
|
|
|
static_cast<double>(num_sorted_runs) /
|
|
|
|
mutable_cf_options.level0_file_num_compaction_trigger,
|
|
|
|
score);
|
|
|
|
}
|
|
|
|
if (mutable_cf_options.ttl > 0) {
|
|
|
|
score = std::max(
|
|
|
|
static_cast<double>(GetExpiredTtlFilesCount(
|
|
|
|
immutable_cf_options, mutable_cf_options, files_[level])),
|
|
|
|
score);
|
|
|
|
}
|
|
|
|
|
|
|
|
} else {
|
|
|
|
score = static_cast<double>(num_sorted_runs) /
|
|
|
|
mutable_cf_options.level0_file_num_compaction_trigger;
|
|
|
|
if (compaction_style_ == kCompactionStyleLevel && num_levels() > 1) {
|
|
|
|
// Level-based involves L0->L0 compactions that can lead to oversized
|
|
|
|
// L0 files. Take into account size as well to avoid later giant
|
|
|
|
// compactions to the base level.
|
|
|
|
uint64_t l0_target_size = mutable_cf_options.max_bytes_for_level_base;
|
|
|
|
if (immutable_cf_options.level_compaction_dynamic_level_bytes &&
|
|
|
|
level_multiplier_ != 0.0) {
|
|
|
|
// Prevent L0 to Lbase fanout from growing larger than
|
|
|
|
// `level_multiplier_`. This prevents us from getting stuck picking
|
|
|
|
// L0 forever even when it is hurting write-amp. That could happen
|
|
|
|
// in dynamic level compaction's write-burst mode where the base
|
|
|
|
// level's target size can grow to be enormous.
|
|
|
|
l0_target_size =
|
|
|
|
std::max(l0_target_size,
|
|
|
|
static_cast<uint64_t>(level_max_bytes_[base_level_] /
|
|
|
|
level_multiplier_));
|
|
|
|
}
|
|
|
|
score =
|
|
|
|
std::max(score, static_cast<double>(total_size) / l0_target_size);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
// Compute the ratio of current size to size limit.
|
|
|
|
uint64_t level_bytes_no_compacting = 0;
|
|
|
|
for (auto f : files_[level]) {
|
Add experimental API MarkForCompaction()
Summary:
Some Mongo+Rocks datasets in Parse's environment are not doing compactions very frequently. During the quiet period (with no IO), we'd like to schedule compactions so that our reads become faster. Also, aggressively compacting during quiet periods helps when write bursts happen. In addition, we also want to compact files that are containing deleted key ranges (like old oplog keys).
All of this is currently not possible with CompactRange() because it's single-threaded and blocks all other compactions from happening. Running CompactRange() risks an issue of blocking writes because we generate too much Level 0 files before the compaction is over. Stopping writes is very dangerous because they hold transaction locks. We tried running manual compaction once on Mongo+Rocks and everything fell apart.
MarkForCompaction() solves all of those problems. This is very light-weight manual compaction. It is lower priority than automatic compactions, which means it shouldn't interfere with background process keeping the LSM tree clean. However, if no automatic compactions need to be run (or we have extra background threads available), we will start compacting files that are marked for compaction.
Test Plan: added a new unit test
Reviewers: yhchiang, rven, MarkCallaghan, sdong
Reviewed By: sdong
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D37083
10 years ago
|
|
|
if (!f->being_compacted) {
|
|
|
|
level_bytes_no_compacting += f->compensated_file_size;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
score = static_cast<double>(level_bytes_no_compacting) /
|
options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size bases of levels dynamically.
Summary:
When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases.
In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier.
Test Plan: New unit tests and pass tests suites including valgrind.
Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo
Reviewed By: ikabiljo
Subscribers: yoshinorim, ikabiljo, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31437
10 years ago
|
|
|
MaxBytesForLevel(level);
|
|
|
|
}
|
|
|
|
compaction_level_[level] = level;
|
|
|
|
compaction_score_[level] = score;
|
|
|
|
}
|
|
|
|
|
|
|
|
// sort all the levels based on their score. Higher scores get listed
|
|
|
|
// first. Use bubble sort because the number of entries are small.
|
|
|
|
for (int i = 0; i < num_levels() - 2; i++) {
|
|
|
|
for (int j = i + 1; j < num_levels() - 1; j++) {
|
|
|
|
if (compaction_score_[i] < compaction_score_[j]) {
|
|
|
|
double score = compaction_score_[i];
|
|
|
|
int level = compaction_level_[i];
|
|
|
|
compaction_score_[i] = compaction_score_[j];
|
|
|
|
compaction_level_[i] = compaction_level_[j];
|
|
|
|
compaction_score_[j] = score;
|
|
|
|
compaction_level_[j] = level;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
Add experimental API MarkForCompaction()
Summary:
Some Mongo+Rocks datasets in Parse's environment are not doing compactions very frequently. During the quiet period (with no IO), we'd like to schedule compactions so that our reads become faster. Also, aggressively compacting during quiet periods helps when write bursts happen. In addition, we also want to compact files that are containing deleted key ranges (like old oplog keys).
All of this is currently not possible with CompactRange() because it's single-threaded and blocks all other compactions from happening. Running CompactRange() risks an issue of blocking writes because we generate too much Level 0 files before the compaction is over. Stopping writes is very dangerous because they hold transaction locks. We tried running manual compaction once on Mongo+Rocks and everything fell apart.
MarkForCompaction() solves all of those problems. This is very light-weight manual compaction. It is lower priority than automatic compactions, which means it shouldn't interfere with background process keeping the LSM tree clean. However, if no automatic compactions need to be run (or we have extra background threads available), we will start compacting files that are marked for compaction.
Test Plan: added a new unit test
Reviewers: yhchiang, rven, MarkCallaghan, sdong
Reviewed By: sdong
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D37083
10 years ago
|
|
|
ComputeFilesMarkedForCompaction();
|
|
|
|
ComputeBottommostFilesMarkedForCompaction();
|
|
|
|
if (mutable_cf_options.ttl > 0) {
|
|
|
|
ComputeExpiredTtlFiles(immutable_cf_options, mutable_cf_options.ttl);
|
|
|
|
}
|
|
|
|
if (mutable_cf_options.periodic_compaction_seconds > 0) {
|
Periodic Compactions (#5166)
Summary:
Introducing Periodic Compactions.
This feature allows all the files in a CF to be periodically compacted. It could help in catching any corruptions that could creep into the DB proactively as every file is constantly getting re-compacted. And also, of course, it helps to cleanup data older than certain threshold.
- Introduced a new option `periodic_compaction_time` to control how long a file can live without being compacted in a CF.
- This works across all levels.
- The files are put in the same level after going through the compaction. (Related files in the same level are picked up as `ExpandInputstoCleanCut` is used).
- Compaction filters, if any, are invoked as usual.
- A new table property, `file_creation_time`, is introduced to implement this feature. This property is set to the time at which the SST file was created (and that time is given by the underlying Env/OS).
This feature can be enabled on its own, or in conjunction with `ttl`. It is possible to set a different time threshold for the bottom level when used in conjunction with ttl. Since `ttl` works only on 0 to last but one levels, you could set `ttl` to, say, 1 day, and `periodic_compaction_time` to, say, 7 days. Since `ttl < periodic_compaction_time` all files in last but one levels keep getting picked up based on ttl, and almost never based on periodic_compaction_time. The files in the bottom level get picked up for compaction based on `periodic_compaction_time`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5166
Differential Revision: D14884441
Pulled By: sagar0
fbshipit-source-id: 408426cbacb409c06386a98632dcf90bfa1bda47
6 years ago
|
|
|
ComputeFilesMarkedForPeriodicCompaction(
|
|
|
|
immutable_cf_options, mutable_cf_options.periodic_compaction_seconds);
|
|
|
|
}
|
|
|
|
EstimateCompactionBytesNeeded(mutable_cf_options);
|
Add experimental API MarkForCompaction()
Summary:
Some Mongo+Rocks datasets in Parse's environment are not doing compactions very frequently. During the quiet period (with no IO), we'd like to schedule compactions so that our reads become faster. Also, aggressively compacting during quiet periods helps when write bursts happen. In addition, we also want to compact files that are containing deleted key ranges (like old oplog keys).
All of this is currently not possible with CompactRange() because it's single-threaded and blocks all other compactions from happening. Running CompactRange() risks an issue of blocking writes because we generate too much Level 0 files before the compaction is over. Stopping writes is very dangerous because they hold transaction locks. We tried running manual compaction once on Mongo+Rocks and everything fell apart.
MarkForCompaction() solves all of those problems. This is very light-weight manual compaction. It is lower priority than automatic compactions, which means it shouldn't interfere with background process keeping the LSM tree clean. However, if no automatic compactions need to be run (or we have extra background threads available), we will start compacting files that are marked for compaction.
Test Plan: added a new unit test
Reviewers: yhchiang, rven, MarkCallaghan, sdong
Reviewed By: sdong
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D37083
10 years ago
|
|
|
}
|
|
|
|
|
|
|
|
void VersionStorageInfo::ComputeFilesMarkedForCompaction() {
|
|
|
|
files_marked_for_compaction_.clear();
|
|
|
|
int last_qualify_level = 0;
|
|
|
|
|
|
|
|
// Do not include files from the last level with data
|
|
|
|
// If table properties collector suggests a file on the last level,
|
|
|
|
// we should not move it to a new level.
|
|
|
|
for (int level = num_levels() - 1; level >= 1; level--) {
|
|
|
|
if (!files_[level].empty()) {
|
|
|
|
last_qualify_level = level - 1;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
for (int level = 0; level <= last_qualify_level; level++) {
|
Add experimental API MarkForCompaction()
Summary:
Some Mongo+Rocks datasets in Parse's environment are not doing compactions very frequently. During the quiet period (with no IO), we'd like to schedule compactions so that our reads become faster. Also, aggressively compacting during quiet periods helps when write bursts happen. In addition, we also want to compact files that are containing deleted key ranges (like old oplog keys).
All of this is currently not possible with CompactRange() because it's single-threaded and blocks all other compactions from happening. Running CompactRange() risks an issue of blocking writes because we generate too much Level 0 files before the compaction is over. Stopping writes is very dangerous because they hold transaction locks. We tried running manual compaction once on Mongo+Rocks and everything fell apart.
MarkForCompaction() solves all of those problems. This is very light-weight manual compaction. It is lower priority than automatic compactions, which means it shouldn't interfere with background process keeping the LSM tree clean. However, if no automatic compactions need to be run (or we have extra background threads available), we will start compacting files that are marked for compaction.
Test Plan: added a new unit test
Reviewers: yhchiang, rven, MarkCallaghan, sdong
Reviewed By: sdong
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D37083
10 years ago
|
|
|
for (auto* f : files_[level]) {
|
|
|
|
if (!f->being_compacted && f->marked_for_compaction) {
|
|
|
|
files_marked_for_compaction_.emplace_back(level, f);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
void VersionStorageInfo::ComputeExpiredTtlFiles(
|
|
|
|
const ImmutableCFOptions& ioptions, const uint64_t ttl) {
|
|
|
|
assert(ttl > 0);
|
|
|
|
|
|
|
|
expired_ttl_files_.clear();
|
|
|
|
|
|
|
|
int64_t _current_time;
|
|
|
|
auto status = ioptions.env->GetCurrentTime(&_current_time);
|
|
|
|
if (!status.ok()) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
const uint64_t current_time = static_cast<uint64_t>(_current_time);
|
|
|
|
|
|
|
|
for (int level = 0; level < num_levels() - 1; level++) {
|
|
|
|
for (FileMetaData* f : files_[level]) {
|
|
|
|
if (!f->being_compacted) {
|
|
|
|
uint64_t oldest_ancester_time = f->TryGetOldestAncesterTime();
|
|
|
|
if (oldest_ancester_time > 0 &&
|
|
|
|
oldest_ancester_time < (current_time - ttl)) {
|
|
|
|
expired_ttl_files_.emplace_back(level, f);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
Periodic Compactions (#5166)
Summary:
Introducing Periodic Compactions.
This feature allows all the files in a CF to be periodically compacted. It could help in catching any corruptions that could creep into the DB proactively as every file is constantly getting re-compacted. And also, of course, it helps to cleanup data older than certain threshold.
- Introduced a new option `periodic_compaction_time` to control how long a file can live without being compacted in a CF.
- This works across all levels.
- The files are put in the same level after going through the compaction. (Related files in the same level are picked up as `ExpandInputstoCleanCut` is used).
- Compaction filters, if any, are invoked as usual.
- A new table property, `file_creation_time`, is introduced to implement this feature. This property is set to the time at which the SST file was created (and that time is given by the underlying Env/OS).
This feature can be enabled on its own, or in conjunction with `ttl`. It is possible to set a different time threshold for the bottom level when used in conjunction with ttl. Since `ttl` works only on 0 to last but one levels, you could set `ttl` to, say, 1 day, and `periodic_compaction_time` to, say, 7 days. Since `ttl < periodic_compaction_time` all files in last but one levels keep getting picked up based on ttl, and almost never based on periodic_compaction_time. The files in the bottom level get picked up for compaction based on `periodic_compaction_time`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5166
Differential Revision: D14884441
Pulled By: sagar0
fbshipit-source-id: 408426cbacb409c06386a98632dcf90bfa1bda47
6 years ago
|
|
|
void VersionStorageInfo::ComputeFilesMarkedForPeriodicCompaction(
|
|
|
|
const ImmutableCFOptions& ioptions,
|
|
|
|
const uint64_t periodic_compaction_seconds) {
|
|
|
|
assert(periodic_compaction_seconds > 0);
|
Periodic Compactions (#5166)
Summary:
Introducing Periodic Compactions.
This feature allows all the files in a CF to be periodically compacted. It could help in catching any corruptions that could creep into the DB proactively as every file is constantly getting re-compacted. And also, of course, it helps to cleanup data older than certain threshold.
- Introduced a new option `periodic_compaction_time` to control how long a file can live without being compacted in a CF.
- This works across all levels.
- The files are put in the same level after going through the compaction. (Related files in the same level are picked up as `ExpandInputstoCleanCut` is used).
- Compaction filters, if any, are invoked as usual.
- A new table property, `file_creation_time`, is introduced to implement this feature. This property is set to the time at which the SST file was created (and that time is given by the underlying Env/OS).
This feature can be enabled on its own, or in conjunction with `ttl`. It is possible to set a different time threshold for the bottom level when used in conjunction with ttl. Since `ttl` works only on 0 to last but one levels, you could set `ttl` to, say, 1 day, and `periodic_compaction_time` to, say, 7 days. Since `ttl < periodic_compaction_time` all files in last but one levels keep getting picked up based on ttl, and almost never based on periodic_compaction_time. The files in the bottom level get picked up for compaction based on `periodic_compaction_time`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5166
Differential Revision: D14884441
Pulled By: sagar0
fbshipit-source-id: 408426cbacb409c06386a98632dcf90bfa1bda47
6 years ago
|
|
|
|
|
|
|
files_marked_for_periodic_compaction_.clear();
|
|
|
|
|
|
|
|
int64_t temp_current_time;
|
|
|
|
auto status = ioptions.env->GetCurrentTime(&temp_current_time);
|
|
|
|
if (!status.ok()) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
const uint64_t current_time = static_cast<uint64_t>(temp_current_time);
|
Auto enable Periodic Compactions if a Compaction Filter is used (#5865)
Summary:
- Periodic compactions are auto-enabled if a compaction filter or a compaction filter factory is set, in Level Compaction.
- The default value of `periodic_compaction_seconds` is changed to UINT64_MAX, which lets RocksDB auto-tune periodic compactions as needed. An explicit value of 0 will still work as before ie. to disable periodic compactions completely. For now, on seeing a compaction filter along with a UINT64_MAX value for `periodic_compaction_seconds`, RocksDB will make SST files older than 30 days to go through periodic copmactions.
Some RocksDB users make use of compaction filters to control when their data can be deleted, usually with a custom TTL logic. But it is occasionally possible that the compactions get delayed by considerable time due to factors like low writes to a key range, data reaching bottom level, etc before the TTL expiry. Periodic Compactions feature was originally built to help such cases. Now periodic compactions are auto enabled by default when compaction filters or compaction filter factories are used, as it is generally helpful to all cases to collect garbage.
`periodic_compaction_seconds` is set to a large value, 30 days, in `SanitizeOptions` when RocksDB sees that a `compaction_filter` or `compaction_filter_factory` is used.
This is done only for Level Compaction style.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5865
Test Plan:
- Added a new test `DBCompactionTest.LevelPeriodicCompactionWithCompactionFilters` to make sure that `periodic_compaction_seconds` is set if either `compaction_filter` or `compaction_filter_factory` options are set.
- `COMPILE_WITH_ASAN=1 make check`
Differential Revision: D17659180
Pulled By: sagar0
fbshipit-source-id: 4887b9cf2e53cf2dc93a7b658c6b15e1181217ee
5 years ago
|
|
|
|
|
|
|
// If periodic_compaction_seconds is larger than current time, periodic
|
|
|
|
// compaction can't possibly be triggered.
|
Auto enable Periodic Compactions if a Compaction Filter is used (#5865)
Summary:
- Periodic compactions are auto-enabled if a compaction filter or a compaction filter factory is set, in Level Compaction.
- The default value of `periodic_compaction_seconds` is changed to UINT64_MAX, which lets RocksDB auto-tune periodic compactions as needed. An explicit value of 0 will still work as before ie. to disable periodic compactions completely. For now, on seeing a compaction filter along with a UINT64_MAX value for `periodic_compaction_seconds`, RocksDB will make SST files older than 30 days to go through periodic copmactions.
Some RocksDB users make use of compaction filters to control when their data can be deleted, usually with a custom TTL logic. But it is occasionally possible that the compactions get delayed by considerable time due to factors like low writes to a key range, data reaching bottom level, etc before the TTL expiry. Periodic Compactions feature was originally built to help such cases. Now periodic compactions are auto enabled by default when compaction filters or compaction filter factories are used, as it is generally helpful to all cases to collect garbage.
`periodic_compaction_seconds` is set to a large value, 30 days, in `SanitizeOptions` when RocksDB sees that a `compaction_filter` or `compaction_filter_factory` is used.
This is done only for Level Compaction style.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5865
Test Plan:
- Added a new test `DBCompactionTest.LevelPeriodicCompactionWithCompactionFilters` to make sure that `periodic_compaction_seconds` is set if either `compaction_filter` or `compaction_filter_factory` options are set.
- `COMPILE_WITH_ASAN=1 make check`
Differential Revision: D17659180
Pulled By: sagar0
fbshipit-source-id: 4887b9cf2e53cf2dc93a7b658c6b15e1181217ee
5 years ago
|
|
|
if (periodic_compaction_seconds > current_time) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
Periodic Compactions (#5166)
Summary:
Introducing Periodic Compactions.
This feature allows all the files in a CF to be periodically compacted. It could help in catching any corruptions that could creep into the DB proactively as every file is constantly getting re-compacted. And also, of course, it helps to cleanup data older than certain threshold.
- Introduced a new option `periodic_compaction_time` to control how long a file can live without being compacted in a CF.
- This works across all levels.
- The files are put in the same level after going through the compaction. (Related files in the same level are picked up as `ExpandInputstoCleanCut` is used).
- Compaction filters, if any, are invoked as usual.
- A new table property, `file_creation_time`, is introduced to implement this feature. This property is set to the time at which the SST file was created (and that time is given by the underlying Env/OS).
This feature can be enabled on its own, or in conjunction with `ttl`. It is possible to set a different time threshold for the bottom level when used in conjunction with ttl. Since `ttl` works only on 0 to last but one levels, you could set `ttl` to, say, 1 day, and `periodic_compaction_time` to, say, 7 days. Since `ttl < periodic_compaction_time` all files in last but one levels keep getting picked up based on ttl, and almost never based on periodic_compaction_time. The files in the bottom level get picked up for compaction based on `periodic_compaction_time`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5166
Differential Revision: D14884441
Pulled By: sagar0
fbshipit-source-id: 408426cbacb409c06386a98632dcf90bfa1bda47
6 years ago
|
|
|
const uint64_t allowed_time_limit =
|
|
|
|
current_time - periodic_compaction_seconds;
|
|
|
|
|
|
|
|
for (int level = 0; level < num_levels(); level++) {
|
|
|
|
for (auto f : files_[level]) {
|
|
|
|
if (!f->being_compacted) {
|
|
|
|
// Compute a file's modification time in the following order:
|
|
|
|
// 1. Use file_creation_time table property if it is > 0.
|
|
|
|
// 2. Use creation_time table property if it is > 0.
|
|
|
|
// 3. Use file's mtime metadata if the above two table properties are 0.
|
|
|
|
// Don't consider the file at all if the modification time cannot be
|
|
|
|
// correctly determined based on the above conditions.
|
|
|
|
uint64_t file_modification_time = f->TryGetFileCreationTime();
|
|
|
|
if (file_modification_time == kUnknownFileCreationTime) {
|
|
|
|
file_modification_time = f->TryGetOldestAncesterTime();
|
|
|
|
}
|
|
|
|
if (file_modification_time == kUnknownOldestAncesterTime) {
|
|
|
|
auto file_path = TableFileName(ioptions.cf_paths, f->fd.GetNumber(),
|
|
|
|
f->fd.GetPathId());
|
|
|
|
status = ioptions.env->GetFileModificationTime(
|
|
|
|
file_path, &file_modification_time);
|
|
|
|
if (!status.ok()) {
|
|
|
|
ROCKS_LOG_WARN(ioptions.info_log,
|
|
|
|
"Can't get file modification time: %s: %s",
|
|
|
|
file_path.c_str(), status.ToString().c_str());
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (file_modification_time > 0 &&
|
|
|
|
file_modification_time < allowed_time_limit) {
|
Periodic Compactions (#5166)
Summary:
Introducing Periodic Compactions.
This feature allows all the files in a CF to be periodically compacted. It could help in catching any corruptions that could creep into the DB proactively as every file is constantly getting re-compacted. And also, of course, it helps to cleanup data older than certain threshold.
- Introduced a new option `periodic_compaction_time` to control how long a file can live without being compacted in a CF.
- This works across all levels.
- The files are put in the same level after going through the compaction. (Related files in the same level are picked up as `ExpandInputstoCleanCut` is used).
- Compaction filters, if any, are invoked as usual.
- A new table property, `file_creation_time`, is introduced to implement this feature. This property is set to the time at which the SST file was created (and that time is given by the underlying Env/OS).
This feature can be enabled on its own, or in conjunction with `ttl`. It is possible to set a different time threshold for the bottom level when used in conjunction with ttl. Since `ttl` works only on 0 to last but one levels, you could set `ttl` to, say, 1 day, and `periodic_compaction_time` to, say, 7 days. Since `ttl < periodic_compaction_time` all files in last but one levels keep getting picked up based on ttl, and almost never based on periodic_compaction_time. The files in the bottom level get picked up for compaction based on `periodic_compaction_time`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5166
Differential Revision: D14884441
Pulled By: sagar0
fbshipit-source-id: 408426cbacb409c06386a98632dcf90bfa1bda47
6 years ago
|
|
|
files_marked_for_periodic_compaction_.emplace_back(level, f);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
namespace {
|
|
|
|
|
|
|
|
// used to sort files by size
|
|
|
|
struct Fsize {
|
|
|
|
size_t index;
|
|
|
|
FileMetaData* file;
|
|
|
|
};
|
|
|
|
|
|
|
|
// Compator that is used to sort files based on their size
|
|
|
|
// In normal mode: descending size
|
|
|
|
bool CompareCompensatedSizeDescending(const Fsize& first, const Fsize& second) {
|
|
|
|
return (first.file->compensated_file_size >
|
|
|
|
second.file->compensated_file_size);
|
|
|
|
}
|
|
|
|
} // anonymous namespace
|
|
|
|
|
|
|
|
void VersionStorageInfo::AddFile(int level, FileMetaData* f) {
|
|
|
|
auto& level_files = files_[level];
|
|
|
|
level_files.push_back(f);
|
|
|
|
|
|
|
|
f->refs++;
|
|
|
|
|
|
|
|
const uint64_t file_number = f->fd.GetNumber();
|
|
|
|
|
|
|
|
assert(file_locations_.find(file_number) == file_locations_.end());
|
|
|
|
file_locations_.emplace(file_number,
|
|
|
|
FileLocation(level, level_files.size() - 1));
|
|
|
|
}
|
|
|
|
|
Add blob files to VersionStorageInfo/VersionBuilder (#6597)
Summary:
The patch adds a couple of classes to represent metadata about
blob files: `SharedBlobFileMetaData` contains the information elements
that are immutable (once the blob file is closed), e.g. blob file number,
total number and size of blob files, checksum method/value, while
`BlobFileMetaData` contains attributes that can vary across versions like
the amount of garbage in the file. There is a single `SharedBlobFileMetaData`
for each blob file, which is jointly owned by the `BlobFileMetaData` objects
that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s
and can also be shared if the (immutable _and_ mutable) state of the blob file
is the same in two versions.
In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends
`VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those
containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata
to a new `VersionStorageInfo`. Consistency checks are also extended to ensure
that table files point to blob files that are part of the `Version`, and that all blob files
that are part of any given `Version` have at least some _non_-garbage data in them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D20656803
Pulled By: ltamasi
fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
5 years ago
|
|
|
void VersionStorageInfo::AddBlobFile(
|
|
|
|
std::shared_ptr<BlobFileMetaData> blob_file_meta) {
|
|
|
|
assert(blob_file_meta);
|
|
|
|
|
|
|
|
const uint64_t blob_file_number = blob_file_meta->GetBlobFileNumber();
|
|
|
|
|
|
|
|
auto it = blob_files_.lower_bound(blob_file_number);
|
|
|
|
assert(it == blob_files_.end() || it->first != blob_file_number);
|
|
|
|
|
|
|
|
blob_files_.insert(
|
|
|
|
it, BlobFiles::value_type(blob_file_number, std::move(blob_file_meta)));
|
|
|
|
}
|
|
|
|
|
options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size bases of levels dynamically.
Summary:
When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases.
In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier.
Test Plan: New unit tests and pass tests suites including valgrind.
Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo
Reviewed By: ikabiljo
Subscribers: yoshinorim, ikabiljo, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31437
10 years ago
|
|
|
// Version::PrepareApply() need to be called before calling the function, or
|
|
|
|
// following functions called:
|
|
|
|
// 1. UpdateNumNonEmptyLevels();
|
|
|
|
// 2. CalculateBaseBytes();
|
|
|
|
// 3. UpdateFilesByCompactionPri();
|
options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size bases of levels dynamically.
Summary:
When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases.
In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier.
Test Plan: New unit tests and pass tests suites including valgrind.
Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo
Reviewed By: ikabiljo
Subscribers: yoshinorim, ikabiljo, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31437
10 years ago
|
|
|
// 4. GenerateFileIndexer();
|
|
|
|
// 5. GenerateLevelFilesBrief();
|
Allowing L0 -> L1 trivial move on sorted data
Summary:
This diff updates the logic of how we do trivial move, now trivial move can run on any number of files in input level as long as they are not overlapping
The conditions for trivial move have been updated
Introduced conditions:
- Trivial move cannot happen if we have a compaction filter (except if the compaction is not manual)
- Input level files cannot be overlapping
Removed conditions:
- Trivial move only run when the compaction is not manual
- Input level should can contain only 1 file
More context on what tests failed because of Trivial move
```
DBTest.CompactionsGenerateMultipleFiles
This test is expecting compaction on a file in L0 to generate multiple files in L1, this test will fail with trivial move because we end up with one file in L1
```
```
DBTest.NoSpaceCompactRange
This test expect compaction to fail when we force environment to report running out of space, of course this is not valid in trivial move situation
because trivial move does not need any extra space, and did not check for that
```
```
DBTest.DropWrites
Similar to DBTest.NoSpaceCompactRange
```
```
DBTest.DeleteObsoleteFilesPendingOutputs
This test expect that a file in L2 is deleted after it's moved to L3, this is not valid with trivial move because although the file was moved it is now used by L3
```
```
CuckooTableDBTest.CompactionIntoMultipleFiles
Same as DBTest.CompactionsGenerateMultipleFiles
```
This diff is based on a work by @sdong https://reviews.facebook.net/D34149
Test Plan: make -j64 check
Reviewers: rven, sdong, igor
Reviewed By: igor
Subscribers: yhchiang, ott, march, dhruba, sdong
Differential Revision: https://reviews.facebook.net/D34797
10 years ago
|
|
|
// 6. GenerateLevel0NonOverlapping();
|
|
|
|
// 7. GenerateBottommostFiles();
|
options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size bases of levels dynamically.
Summary:
When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases.
In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier.
Test Plan: New unit tests and pass tests suites including valgrind.
Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo
Reviewed By: ikabiljo
Subscribers: yoshinorim, ikabiljo, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31437
10 years ago
|
|
|
void VersionStorageInfo::SetFinalized() {
|
|
|
|
finalized_ = true;
|
|
|
|
#ifndef NDEBUG
|
|
|
|
if (compaction_style_ != kCompactionStyleLevel) {
|
|
|
|
// Not level based compaction.
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
assert(base_level_ < 0 || num_levels() == 1 ||
|
|
|
|
(base_level_ >= 1 && base_level_ < num_levels()));
|
options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size bases of levels dynamically.
Summary:
When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases.
In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier.
Test Plan: New unit tests and pass tests suites including valgrind.
Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo
Reviewed By: ikabiljo
Subscribers: yoshinorim, ikabiljo, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31437
10 years ago
|
|
|
// Verify all levels newer than base_level are empty except L0
|
|
|
|
for (int level = 1; level < base_level(); level++) {
|
|
|
|
assert(NumLevelBytes(level) == 0);
|
|
|
|
}
|
|
|
|
uint64_t max_bytes_prev_level = 0;
|
|
|
|
for (int level = base_level(); level < num_levels() - 1; level++) {
|
|
|
|
if (LevelFiles(level).size() == 0) {
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
assert(MaxBytesForLevel(level) >= max_bytes_prev_level);
|
|
|
|
max_bytes_prev_level = MaxBytesForLevel(level);
|
|
|
|
}
|
|
|
|
int num_empty_non_l0_level = 0;
|
|
|
|
for (int level = 0; level < num_levels(); level++) {
|
|
|
|
assert(LevelFiles(level).size() == 0 ||
|
|
|
|
LevelFiles(level).size() == LevelFilesBrief(level).num_files);
|
|
|
|
if (level > 0 && NumLevelBytes(level) > 0) {
|
|
|
|
num_empty_non_l0_level++;
|
|
|
|
}
|
|
|
|
if (LevelFiles(level).size() > 0) {
|
|
|
|
assert(level < num_non_empty_levels());
|
|
|
|
}
|
|
|
|
}
|
|
|
|
assert(compaction_level_.size() > 0);
|
|
|
|
assert(compaction_level_.size() == compaction_score_.size());
|
|
|
|
#endif
|
|
|
|
}
|
|
|
|
|
|
|
|
void VersionStorageInfo::UpdateNumNonEmptyLevels() {
|
|
|
|
num_non_empty_levels_ = num_levels_;
|
|
|
|
for (int i = num_levels_ - 1; i >= 0; i--) {
|
|
|
|
if (files_[i].size() != 0) {
|
|
|
|
return;
|
|
|
|
} else {
|
|
|
|
num_non_empty_levels_ = i;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
namespace {
|
|
|
|
// Sort `temp` based on ratio of overlapping size over file size
|
|
|
|
void SortFileByOverlappingRatio(
|
|
|
|
const InternalKeyComparator& icmp, const std::vector<FileMetaData*>& files,
|
|
|
|
const std::vector<FileMetaData*>& next_level_files,
|
|
|
|
std::vector<Fsize>* temp) {
|
|
|
|
std::unordered_map<uint64_t, uint64_t> file_to_order;
|
|
|
|
auto next_level_it = next_level_files.begin();
|
|
|
|
|
|
|
|
for (auto& file : files) {
|
|
|
|
uint64_t overlapping_bytes = 0;
|
|
|
|
// Skip files in next level that is smaller than current file
|
|
|
|
while (next_level_it != next_level_files.end() &&
|
|
|
|
icmp.Compare((*next_level_it)->largest, file->smallest) < 0) {
|
|
|
|
next_level_it++;
|
|
|
|
}
|
|
|
|
|
|
|
|
while (next_level_it != next_level_files.end() &&
|
|
|
|
icmp.Compare((*next_level_it)->smallest, file->largest) < 0) {
|
|
|
|
overlapping_bytes += (*next_level_it)->fd.file_size;
|
|
|
|
|
|
|
|
if (icmp.Compare((*next_level_it)->largest, file->largest) > 0) {
|
|
|
|
// next level file cross large boundary of current file.
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
next_level_it++;
|
|
|
|
}
|
|
|
|
|
|
|
|
assert(file->compensated_file_size != 0);
|
|
|
|
file_to_order[file->fd.GetNumber()] =
|
|
|
|
overlapping_bytes * 1024u / file->compensated_file_size;
|
|
|
|
}
|
|
|
|
|
|
|
|
std::sort(temp->begin(), temp->end(),
|
|
|
|
[&](const Fsize& f1, const Fsize& f2) -> bool {
|
|
|
|
return file_to_order[f1.file->fd.GetNumber()] <
|
|
|
|
file_to_order[f2.file->fd.GetNumber()];
|
|
|
|
});
|
|
|
|
}
|
|
|
|
} // namespace
|
|
|
|
|
|
|
|
void VersionStorageInfo::UpdateFilesByCompactionPri(
|
|
|
|
CompactionPri compaction_pri) {
|
|
|
|
if (compaction_style_ == kCompactionStyleNone ||
|
|
|
|
compaction_style_ == kCompactionStyleFIFO ||
|
|
|
|
compaction_style_ == kCompactionStyleUniversal) {
|
|
|
|
// don't need this
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
// No need to sort the highest level because it is never compacted.
|
|
|
|
for (int level = 0; level < num_levels() - 1; level++) {
|
|
|
|
const std::vector<FileMetaData*>& files = files_[level];
|
|
|
|
auto& files_by_compaction_pri = files_by_compaction_pri_[level];
|
|
|
|
assert(files_by_compaction_pri.size() == 0);
|
|
|
|
|
|
|
|
// populate a temp vector for sorting based on size
|
|
|
|
std::vector<Fsize> temp(files.size());
|
|
|
|
for (size_t i = 0; i < files.size(); i++) {
|
|
|
|
temp[i].index = i;
|
|
|
|
temp[i].file = files[i];
|
|
|
|
}
|
|
|
|
|
|
|
|
// sort the top number_of_files_to_sort_ based on file size
|
|
|
|
size_t num = VersionStorageInfo::kNumberFilesToSort;
|
|
|
|
if (num > temp.size()) {
|
|
|
|
num = temp.size();
|
|
|
|
}
|
|
|
|
switch (compaction_pri) {
|
|
|
|
case kByCompensatedSize:
|
|
|
|
std::partial_sort(temp.begin(), temp.begin() + num, temp.end(),
|
|
|
|
CompareCompensatedSizeDescending);
|
|
|
|
break;
|
|
|
|
case kOldestLargestSeqFirst:
|
|
|
|
std::sort(temp.begin(), temp.end(),
|
|
|
|
[](const Fsize& f1, const Fsize& f2) -> bool {
|
|
|
|
return f1.file->fd.largest_seqno <
|
|
|
|
f2.file->fd.largest_seqno;
|
|
|
|
});
|
|
|
|
break;
|
|
|
|
case kOldestSmallestSeqFirst:
|
|
|
|
std::sort(temp.begin(), temp.end(),
|
|
|
|
[](const Fsize& f1, const Fsize& f2) -> bool {
|
|
|
|
return f1.file->fd.smallest_seqno <
|
|
|
|
f2.file->fd.smallest_seqno;
|
|
|
|
});
|
|
|
|
break;
|
|
|
|
case kMinOverlappingRatio:
|
|
|
|
SortFileByOverlappingRatio(*internal_comparator_, files_[level],
|
|
|
|
files_[level + 1], &temp);
|
|
|
|
break;
|
|
|
|
default:
|
|
|
|
assert(false);
|
|
|
|
}
|
|
|
|
assert(temp.size() == files.size());
|
|
|
|
|
|
|
|
// initialize files_by_compaction_pri_
|
|
|
|
for (size_t i = 0; i < temp.size(); i++) {
|
|
|
|
files_by_compaction_pri.push_back(static_cast<int>(temp[i].index));
|
|
|
|
}
|
|
|
|
next_file_to_compact_by_size_[level] = 0;
|
|
|
|
assert(files_[level].size() == files_by_compaction_pri_[level].size());
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
Allowing L0 -> L1 trivial move on sorted data
Summary:
This diff updates the logic of how we do trivial move, now trivial move can run on any number of files in input level as long as they are not overlapping
The conditions for trivial move have been updated
Introduced conditions:
- Trivial move cannot happen if we have a compaction filter (except if the compaction is not manual)
- Input level files cannot be overlapping
Removed conditions:
- Trivial move only run when the compaction is not manual
- Input level should can contain only 1 file
More context on what tests failed because of Trivial move
```
DBTest.CompactionsGenerateMultipleFiles
This test is expecting compaction on a file in L0 to generate multiple files in L1, this test will fail with trivial move because we end up with one file in L1
```
```
DBTest.NoSpaceCompactRange
This test expect compaction to fail when we force environment to report running out of space, of course this is not valid in trivial move situation
because trivial move does not need any extra space, and did not check for that
```
```
DBTest.DropWrites
Similar to DBTest.NoSpaceCompactRange
```
```
DBTest.DeleteObsoleteFilesPendingOutputs
This test expect that a file in L2 is deleted after it's moved to L3, this is not valid with trivial move because although the file was moved it is now used by L3
```
```
CuckooTableDBTest.CompactionIntoMultipleFiles
Same as DBTest.CompactionsGenerateMultipleFiles
```
This diff is based on a work by @sdong https://reviews.facebook.net/D34149
Test Plan: make -j64 check
Reviewers: rven, sdong, igor
Reviewed By: igor
Subscribers: yhchiang, ott, march, dhruba, sdong
Differential Revision: https://reviews.facebook.net/D34797
10 years ago
|
|
|
void VersionStorageInfo::GenerateLevel0NonOverlapping() {
|
|
|
|
assert(!finalized_);
|
|
|
|
level0_non_overlapping_ = true;
|
|
|
|
if (level_files_brief_.size() == 0) {
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
// A copy of L0 files sorted by smallest key
|
|
|
|
std::vector<FdWithKeyRange> level0_sorted_file(
|
|
|
|
level_files_brief_[0].files,
|
|
|
|
level_files_brief_[0].files + level_files_brief_[0].num_files);
|
|
|
|
std::sort(level0_sorted_file.begin(), level0_sorted_file.end(),
|
|
|
|
[this](const FdWithKeyRange& f1, const FdWithKeyRange& f2) -> bool {
|
|
|
|
return (internal_comparator_->Compare(f1.smallest_key,
|
|
|
|
f2.smallest_key) < 0);
|
|
|
|
});
|
Allowing L0 -> L1 trivial move on sorted data
Summary:
This diff updates the logic of how we do trivial move, now trivial move can run on any number of files in input level as long as they are not overlapping
The conditions for trivial move have been updated
Introduced conditions:
- Trivial move cannot happen if we have a compaction filter (except if the compaction is not manual)
- Input level files cannot be overlapping
Removed conditions:
- Trivial move only run when the compaction is not manual
- Input level should can contain only 1 file
More context on what tests failed because of Trivial move
```
DBTest.CompactionsGenerateMultipleFiles
This test is expecting compaction on a file in L0 to generate multiple files in L1, this test will fail with trivial move because we end up with one file in L1
```
```
DBTest.NoSpaceCompactRange
This test expect compaction to fail when we force environment to report running out of space, of course this is not valid in trivial move situation
because trivial move does not need any extra space, and did not check for that
```
```
DBTest.DropWrites
Similar to DBTest.NoSpaceCompactRange
```
```
DBTest.DeleteObsoleteFilesPendingOutputs
This test expect that a file in L2 is deleted after it's moved to L3, this is not valid with trivial move because although the file was moved it is now used by L3
```
```
CuckooTableDBTest.CompactionIntoMultipleFiles
Same as DBTest.CompactionsGenerateMultipleFiles
```
This diff is based on a work by @sdong https://reviews.facebook.net/D34149
Test Plan: make -j64 check
Reviewers: rven, sdong, igor
Reviewed By: igor
Subscribers: yhchiang, ott, march, dhruba, sdong
Differential Revision: https://reviews.facebook.net/D34797
10 years ago
|
|
|
|
|
|
|
for (size_t i = 1; i < level0_sorted_file.size(); ++i) {
|
|
|
|
FdWithKeyRange& f = level0_sorted_file[i];
|
|
|
|
FdWithKeyRange& prev = level0_sorted_file[i - 1];
|
|
|
|
if (internal_comparator_->Compare(prev.largest_key, f.smallest_key) >= 0) {
|
|
|
|
level0_non_overlapping_ = false;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
void VersionStorageInfo::GenerateBottommostFiles() {
|
|
|
|
assert(!finalized_);
|
|
|
|
assert(bottommost_files_.empty());
|
|
|
|
for (size_t level = 0; level < level_files_brief_.size(); ++level) {
|
|
|
|
for (size_t file_idx = 0; file_idx < level_files_brief_[level].num_files;
|
|
|
|
++file_idx) {
|
|
|
|
const FdWithKeyRange& f = level_files_brief_[level].files[file_idx];
|
|
|
|
int l0_file_idx;
|
|
|
|
if (level == 0) {
|
|
|
|
l0_file_idx = static_cast<int>(file_idx);
|
|
|
|
} else {
|
|
|
|
l0_file_idx = -1;
|
|
|
|
}
|
|
|
|
Slice smallest_user_key = ExtractUserKey(f.smallest_key);
|
|
|
|
Slice largest_user_key = ExtractUserKey(f.largest_key);
|
|
|
|
if (!RangeMightExistAfterSortedRun(smallest_user_key, largest_user_key,
|
|
|
|
static_cast<int>(level),
|
|
|
|
l0_file_idx)) {
|
|
|
|
bottommost_files_.emplace_back(static_cast<int>(level),
|
|
|
|
f.file_metadata);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
void VersionStorageInfo::UpdateOldestSnapshot(SequenceNumber seqnum) {
|
|
|
|
assert(seqnum >= oldest_snapshot_seqnum_);
|
|
|
|
oldest_snapshot_seqnum_ = seqnum;
|
|
|
|
if (oldest_snapshot_seqnum_ > bottommost_files_mark_threshold_) {
|
|
|
|
ComputeBottommostFilesMarkedForCompaction();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
void VersionStorageInfo::ComputeBottommostFilesMarkedForCompaction() {
|
|
|
|
bottommost_files_marked_for_compaction_.clear();
|
|
|
|
bottommost_files_mark_threshold_ = kMaxSequenceNumber;
|
|
|
|
for (auto& level_and_file : bottommost_files_) {
|
|
|
|
if (!level_and_file.second->being_compacted &&
|
|
|
|
level_and_file.second->fd.largest_seqno != 0 &&
|
|
|
|
level_and_file.second->num_deletions > 1) {
|
|
|
|
// largest_seqno might be nonzero due to containing the final key in an
|
|
|
|
// earlier compaction, whose seqnum we didn't zero out. Multiple deletions
|
|
|
|
// ensures the file really contains deleted or overwritten keys.
|
|
|
|
if (level_and_file.second->fd.largest_seqno < oldest_snapshot_seqnum_) {
|
|
|
|
bottommost_files_marked_for_compaction_.push_back(level_and_file);
|
|
|
|
} else {
|
|
|
|
bottommost_files_mark_threshold_ =
|
|
|
|
std::min(bottommost_files_mark_threshold_,
|
|
|
|
level_and_file.second->fd.largest_seqno);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
void Version::Ref() {
|
|
|
|
++refs_;
|
|
|
|
}
|
|
|
|
|
|
|
|
bool Version::Unref() {
|
|
|
|
assert(refs_ >= 1);
|
|
|
|
--refs_;
|
|
|
|
if (refs_ == 0) {
|
|
|
|
delete this;
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
bool VersionStorageInfo::OverlapInLevel(int level,
|
|
|
|
const Slice* smallest_user_key,
|
|
|
|
const Slice* largest_user_key) {
|
|
|
|
if (level >= num_non_empty_levels_) {
|
|
|
|
// empty level, no overlap
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
return SomeFileOverlapsRange(*internal_comparator_, (level > 0),
|
|
|
|
level_files_brief_[level], smallest_user_key,
|
|
|
|
largest_user_key);
|
|
|
|
}
|
|
|
|
|
|
|
|
// Store in "*inputs" all files in "level" that overlap [begin,end]
|
|
|
|
// If hint_index is specified, then it points to a file in the
|
|
|
|
// overlapping range.
|
|
|
|
// The file_index returns a pointer to any file in an overlapping range.
|
|
|
|
void VersionStorageInfo::GetOverlappingInputs(
|
|
|
|
int level, const InternalKey* begin, const InternalKey* end,
|
|
|
|
std::vector<FileMetaData*>* inputs, int hint_index, int* file_index,
|
|
|
|
bool expand_range, InternalKey** next_smallest) const {
|
|
|
|
if (level >= num_non_empty_levels_) {
|
|
|
|
// this level is empty, no overlapping inputs
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
inputs->clear();
|
Assertion failure while running with unit tests with OPT=-g
Summary:
When we expand the range of keys for a level 0 compaction, we
need to invoke ParentFilesInCompaction() only once for the
entire range of keys that is being compacted. We were invoking
it for each file that was being compacted, but this triggers
an assertion because each file's range were contiguous but
non-overlapping.
I renamed ParentFilesInCompaction to ParentRangeInCompaction
to adequately represent that it is the range-of-keys and
not individual files that we compact in a single compaction run.
Here is the assertion that is fixed by this patch.
db_test: db/version_set.cc:585: void leveldb::Version::ExtendOverlappingInputs(int, const leveldb::Slice&, const leveldb::Slice&, std::vector<leveldb::FileMetaData*, std::allocator<leveldb::FileMetaData*> >*, int): Assertion `user_cmp->Compare(flimit, user_begin) >= 0' failed.
Test Plan: make clean check OPT=-g
Reviewers: sheki
Reviewed By: sheki
CC: MarkCallaghan, emayanke, leveldb
Differential Revision: https://reviews.facebook.net/D6963
12 years ago
|
|
|
if (file_index) {
|
|
|
|
*file_index = -1;
|
|
|
|
}
|
|
|
|
const Comparator* user_cmp = user_comparator_;
|
|
|
|
if (level > 0) {
|
|
|
|
GetOverlappingInputsRangeBinarySearch(level, begin, end, inputs, hint_index,
|
|
|
|
file_index, false, next_smallest);
|
|
|
|
return;
|
|
|
|
}
|
level compaction expansion
Summary:
reimplement the compaction expansion on lower level.
Considering such a case:
input level file: 1[B E] 2[F G] 3[H I] 4 [J M]
output level file: 5[A C] 6[D K] 7[L O]
If we initially pick file 2, now we will compact file 2 and 6. But we can safely compact 2, 3 and 6 without expanding the output level.
The previous code is messy and wrong.
In this diff, I first determine the input range [a, b], and output range [c, d],
then we get the range [e,f] = [min(a, c), max(b, d] and put all eligible clean-cut files within [e, f] into this compaction.
**Note: clean-cut means the files don't have the same user key on the boundaries of some files that are not chosen in this compaction**.
Closes https://github.com/facebook/rocksdb/pull/1760
Differential Revision: D4395564
Pulled By: lightmark
fbshipit-source-id: 2dc2c5c
8 years ago
|
|
|
|
|
|
|
if (next_smallest) {
|
|
|
|
// next_smallest key only makes sense for non-level 0, where files are
|
|
|
|
// non-overlapping
|
|
|
|
*next_smallest = nullptr;
|
|
|
|
}
|
|
|
|
|
|
|
|
Slice user_begin, user_end;
|
|
|
|
if (begin != nullptr) {
|
|
|
|
user_begin = begin->user_key();
|
|
|
|
}
|
|
|
|
if (end != nullptr) {
|
|
|
|
user_end = end->user_key();
|
|
|
|
}
|
|
|
|
|
|
|
|
// index stores the file index need to check.
|
|
|
|
std::list<size_t> index;
|
|
|
|
for (size_t i = 0; i < level_files_brief_[level].num_files; i++) {
|
|
|
|
index.emplace_back(i);
|
|
|
|
}
|
|
|
|
|
|
|
|
while (!index.empty()) {
|
|
|
|
bool found_overlapping_file = false;
|
|
|
|
auto iter = index.begin();
|
|
|
|
while (iter != index.end()) {
|
|
|
|
FdWithKeyRange* f = &(level_files_brief_[level].files[*iter]);
|
|
|
|
const Slice file_start = ExtractUserKey(f->smallest_key);
|
|
|
|
const Slice file_limit = ExtractUserKey(f->largest_key);
|
|
|
|
if (begin != nullptr &&
|
|
|
|
user_cmp->CompareWithoutTimestamp(file_limit, user_begin) < 0) {
|
|
|
|
// "f" is completely before specified range; skip it
|
|
|
|
iter++;
|
|
|
|
} else if (end != nullptr &&
|
|
|
|
user_cmp->CompareWithoutTimestamp(file_start, user_end) > 0) {
|
|
|
|
// "f" is completely after specified range; skip it
|
|
|
|
iter++;
|
|
|
|
} else {
|
|
|
|
// if overlap
|
|
|
|
inputs->emplace_back(files_[level][*iter]);
|
|
|
|
found_overlapping_file = true;
|
|
|
|
// record the first file index.
|
|
|
|
if (file_index && *file_index == -1) {
|
|
|
|
*file_index = static_cast<int>(*iter);
|
|
|
|
}
|
|
|
|
// the related file is overlap, erase to avoid checking again.
|
|
|
|
iter = index.erase(iter);
|
|
|
|
if (expand_range) {
|
|
|
|
if (begin != nullptr &&
|
|
|
|
user_cmp->CompareWithoutTimestamp(file_start, user_begin) < 0) {
|
|
|
|
user_begin = file_start;
|
|
|
|
}
|
|
|
|
if (end != nullptr &&
|
|
|
|
user_cmp->CompareWithoutTimestamp(file_limit, user_end) > 0) {
|
|
|
|
user_end = file_limit;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
// if all the files left are not overlap, break
|
|
|
|
if (!found_overlapping_file) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
level compaction expansion
Summary:
reimplement the compaction expansion on lower level.
Considering such a case:
input level file: 1[B E] 2[F G] 3[H I] 4 [J M]
output level file: 5[A C] 6[D K] 7[L O]
If we initially pick file 2, now we will compact file 2 and 6. But we can safely compact 2, 3 and 6 without expanding the output level.
The previous code is messy and wrong.
In this diff, I first determine the input range [a, b], and output range [c, d],
then we get the range [e,f] = [min(a, c), max(b, d] and put all eligible clean-cut files within [e, f] into this compaction.
**Note: clean-cut means the files don't have the same user key on the boundaries of some files that are not chosen in this compaction**.
Closes https://github.com/facebook/rocksdb/pull/1760
Differential Revision: D4395564
Pulled By: lightmark
fbshipit-source-id: 2dc2c5c
8 years ago
|
|
|
// Store in "*inputs" files in "level" that within range [begin,end]
|
|
|
|
// Guarantee a "clean cut" boundary between the files in inputs
|
|
|
|
// and the surrounding files and the maxinum number of files.
|
|
|
|
// This will ensure that no parts of a key are lost during compaction.
|
|
|
|
// If hint_index is specified, then it points to a file in the range.
|
|
|
|
// The file_index returns a pointer to any file in an overlapping range.
|
|
|
|
void VersionStorageInfo::GetCleanInputsWithinInterval(
|
|
|
|
int level, const InternalKey* begin, const InternalKey* end,
|
|
|
|
std::vector<FileMetaData*>* inputs, int hint_index, int* file_index) const {
|
|
|
|
inputs->clear();
|
|
|
|
if (file_index) {
|
|
|
|
*file_index = -1;
|
|
|
|
}
|
|
|
|
if (level >= num_non_empty_levels_ || level == 0 ||
|
|
|
|
level_files_brief_[level].num_files == 0) {
|
level compaction expansion
Summary:
reimplement the compaction expansion on lower level.
Considering such a case:
input level file: 1[B E] 2[F G] 3[H I] 4 [J M]
output level file: 5[A C] 6[D K] 7[L O]
If we initially pick file 2, now we will compact file 2 and 6. But we can safely compact 2, 3 and 6 without expanding the output level.
The previous code is messy and wrong.
In this diff, I first determine the input range [a, b], and output range [c, d],
then we get the range [e,f] = [min(a, c), max(b, d] and put all eligible clean-cut files within [e, f] into this compaction.
**Note: clean-cut means the files don't have the same user key on the boundaries of some files that are not chosen in this compaction**.
Closes https://github.com/facebook/rocksdb/pull/1760
Differential Revision: D4395564
Pulled By: lightmark
fbshipit-source-id: 2dc2c5c
8 years ago
|
|
|
// this level is empty, no inputs within range
|
|
|
|
// also don't support clean input interval within L0
|
level compaction expansion
Summary:
reimplement the compaction expansion on lower level.
Considering such a case:
input level file: 1[B E] 2[F G] 3[H I] 4 [J M]
output level file: 5[A C] 6[D K] 7[L O]
If we initially pick file 2, now we will compact file 2 and 6. But we can safely compact 2, 3 and 6 without expanding the output level.
The previous code is messy and wrong.
In this diff, I first determine the input range [a, b], and output range [c, d],
then we get the range [e,f] = [min(a, c), max(b, d] and put all eligible clean-cut files within [e, f] into this compaction.
**Note: clean-cut means the files don't have the same user key on the boundaries of some files that are not chosen in this compaction**.
Closes https://github.com/facebook/rocksdb/pull/1760
Differential Revision: D4395564
Pulled By: lightmark
fbshipit-source-id: 2dc2c5c
8 years ago
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
GetOverlappingInputsRangeBinarySearch(level, begin, end, inputs,
|
|
|
|
hint_index, file_index,
|
|
|
|
true /* within_interval */);
|
level compaction expansion
Summary:
reimplement the compaction expansion on lower level.
Considering such a case:
input level file: 1[B E] 2[F G] 3[H I] 4 [J M]
output level file: 5[A C] 6[D K] 7[L O]
If we initially pick file 2, now we will compact file 2 and 6. But we can safely compact 2, 3 and 6 without expanding the output level.
The previous code is messy and wrong.
In this diff, I first determine the input range [a, b], and output range [c, d],
then we get the range [e,f] = [min(a, c), max(b, d] and put all eligible clean-cut files within [e, f] into this compaction.
**Note: clean-cut means the files don't have the same user key on the boundaries of some files that are not chosen in this compaction**.
Closes https://github.com/facebook/rocksdb/pull/1760
Differential Revision: D4395564
Pulled By: lightmark
fbshipit-source-id: 2dc2c5c
8 years ago
|
|
|
}
|
|
|
|
|
|
|
|
// Store in "*inputs" all files in "level" that overlap [begin,end]
|
|
|
|
// Employ binary search to find at least one file that overlaps the
|
|
|
|
// specified range. From that file, iterate backwards and
|
|
|
|
// forwards to find all overlapping files.
|
level compaction expansion
Summary:
reimplement the compaction expansion on lower level.
Considering such a case:
input level file: 1[B E] 2[F G] 3[H I] 4 [J M]
output level file: 5[A C] 6[D K] 7[L O]
If we initially pick file 2, now we will compact file 2 and 6. But we can safely compact 2, 3 and 6 without expanding the output level.
The previous code is messy and wrong.
In this diff, I first determine the input range [a, b], and output range [c, d],
then we get the range [e,f] = [min(a, c), max(b, d] and put all eligible clean-cut files within [e, f] into this compaction.
**Note: clean-cut means the files don't have the same user key on the boundaries of some files that are not chosen in this compaction**.
Closes https://github.com/facebook/rocksdb/pull/1760
Differential Revision: D4395564
Pulled By: lightmark
fbshipit-source-id: 2dc2c5c
8 years ago
|
|
|
// if within_range is set, then only store the maximum clean inputs
|
|
|
|
// within range [begin, end]. "clean" means there is a boudnary
|
|
|
|
// between the files in "*inputs" and the surrounding files
|
|
|
|
void VersionStorageInfo::GetOverlappingInputsRangeBinarySearch(
|
|
|
|
int level, const InternalKey* begin, const InternalKey* end,
|
level compaction expansion
Summary:
reimplement the compaction expansion on lower level.
Considering such a case:
input level file: 1[B E] 2[F G] 3[H I] 4 [J M]
output level file: 5[A C] 6[D K] 7[L O]
If we initially pick file 2, now we will compact file 2 and 6. But we can safely compact 2, 3 and 6 without expanding the output level.
The previous code is messy and wrong.
In this diff, I first determine the input range [a, b], and output range [c, d],
then we get the range [e,f] = [min(a, c), max(b, d] and put all eligible clean-cut files within [e, f] into this compaction.
**Note: clean-cut means the files don't have the same user key on the boundaries of some files that are not chosen in this compaction**.
Closes https://github.com/facebook/rocksdb/pull/1760
Differential Revision: D4395564
Pulled By: lightmark
fbshipit-source-id: 2dc2c5c
8 years ago
|
|
|
std::vector<FileMetaData*>* inputs, int hint_index, int* file_index,
|
|
|
|
bool within_interval, InternalKey** next_smallest) const {
|
|
|
|
assert(level > 0);
|
|
|
|
|
|
|
|
auto user_cmp = user_comparator_;
|
|
|
|
const FdWithKeyRange* files = level_files_brief_[level].files;
|
|
|
|
const int num_files = static_cast<int>(level_files_brief_[level].num_files);
|
|
|
|
|
|
|
|
// begin to use binary search to find lower bound
|
|
|
|
// and upper bound.
|
|
|
|
int start_index = 0;
|
|
|
|
int end_index = num_files;
|
|
|
|
|
|
|
|
if (begin != nullptr) {
|
|
|
|
// if within_interval is true, with file_key would find
|
|
|
|
// not overlapping ranges in std::lower_bound.
|
|
|
|
auto cmp = [&user_cmp, &within_interval](const FdWithKeyRange& f,
|
|
|
|
const InternalKey* k) {
|
|
|
|
auto& file_key = within_interval ? f.file_metadata->smallest
|
|
|
|
: f.file_metadata->largest;
|
|
|
|
return sstableKeyCompare(user_cmp, file_key, *k) < 0;
|
|
|
|
};
|
|
|
|
|
|
|
|
start_index = static_cast<int>(
|
|
|
|
std::lower_bound(files,
|
|
|
|
files + (hint_index == -1 ? num_files : hint_index),
|
|
|
|
begin, cmp) -
|
|
|
|
files);
|
|
|
|
|
|
|
|
if (start_index > 0 && within_interval) {
|
|
|
|
bool is_overlapping = true;
|
|
|
|
while (is_overlapping && start_index < num_files) {
|
|
|
|
auto& pre_limit = files[start_index - 1].file_metadata->largest;
|
|
|
|
auto& cur_start = files[start_index].file_metadata->smallest;
|
|
|
|
is_overlapping = sstableKeyCompare(user_cmp, pre_limit, cur_start) == 0;
|
|
|
|
start_index += is_overlapping;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (end != nullptr) {
|
|
|
|
// if within_interval is true, with file_key would find
|
|
|
|
// not overlapping ranges in std::upper_bound.
|
|
|
|
auto cmp = [&user_cmp, &within_interval](const InternalKey* k,
|
|
|
|
const FdWithKeyRange& f) {
|
|
|
|
auto& file_key = within_interval ? f.file_metadata->largest
|
|
|
|
: f.file_metadata->smallest;
|
|
|
|
return sstableKeyCompare(user_cmp, *k, file_key) < 0;
|
|
|
|
};
|
|
|
|
|
|
|
|
end_index = static_cast<int>(
|
|
|
|
std::upper_bound(files + start_index, files + num_files, end, cmp) -
|
|
|
|
files);
|
|
|
|
|
|
|
|
if (end_index < num_files && within_interval) {
|
|
|
|
bool is_overlapping = true;
|
|
|
|
while (is_overlapping && end_index > start_index) {
|
|
|
|
auto& next_start = files[end_index].file_metadata->smallest;
|
|
|
|
auto& cur_limit = files[end_index - 1].file_metadata->largest;
|
|
|
|
is_overlapping =
|
|
|
|
sstableKeyCompare(user_cmp, cur_limit, next_start) == 0;
|
|
|
|
end_index -= is_overlapping;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
assert(start_index <= end_index);
|
|
|
|
|
|
|
|
// If there were no overlapping files, return immediately.
|
|
|
|
if (start_index == end_index) {
|
|
|
|
if (next_smallest) {
|
|
|
|
*next_smallest = nullptr;
|
|
|
|
}
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
assert(start_index < end_index);
|
|
|
|
|
|
|
|
// returns the index where an overlap is found
|
|
|
|
if (file_index) {
|
|
|
|
*file_index = start_index;
|
|
|
|
}
|
level compaction expansion
Summary:
reimplement the compaction expansion on lower level.
Considering such a case:
input level file: 1[B E] 2[F G] 3[H I] 4 [J M]
output level file: 5[A C] 6[D K] 7[L O]
If we initially pick file 2, now we will compact file 2 and 6. But we can safely compact 2, 3 and 6 without expanding the output level.
The previous code is messy and wrong.
In this diff, I first determine the input range [a, b], and output range [c, d],
then we get the range [e,f] = [min(a, c), max(b, d] and put all eligible clean-cut files within [e, f] into this compaction.
**Note: clean-cut means the files don't have the same user key on the boundaries of some files that are not chosen in this compaction**.
Closes https://github.com/facebook/rocksdb/pull/1760
Differential Revision: D4395564
Pulled By: lightmark
fbshipit-source-id: 2dc2c5c
8 years ago
|
|
|
|
|
|
|
// insert overlapping files into vector
|
|
|
|
for (int i = start_index; i < end_index; i++) {
|
level compaction expansion
Summary:
reimplement the compaction expansion on lower level.
Considering such a case:
input level file: 1[B E] 2[F G] 3[H I] 4 [J M]
output level file: 5[A C] 6[D K] 7[L O]
If we initially pick file 2, now we will compact file 2 and 6. But we can safely compact 2, 3 and 6 without expanding the output level.
The previous code is messy and wrong.
In this diff, I first determine the input range [a, b], and output range [c, d],
then we get the range [e,f] = [min(a, c), max(b, d] and put all eligible clean-cut files within [e, f] into this compaction.
**Note: clean-cut means the files don't have the same user key on the boundaries of some files that are not chosen in this compaction**.
Closes https://github.com/facebook/rocksdb/pull/1760
Differential Revision: D4395564
Pulled By: lightmark
fbshipit-source-id: 2dc2c5c
8 years ago
|
|
|
inputs->push_back(files_[level][i]);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (next_smallest != nullptr) {
|
|
|
|
// Provide the next key outside the range covered by inputs
|
|
|
|
if (end_index < static_cast<int>(files_[level].size())) {
|
|
|
|
**next_smallest = files_[level][end_index]->smallest;
|
|
|
|
} else {
|
|
|
|
*next_smallest = nullptr;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
uint64_t VersionStorageInfo::NumLevelBytes(int level) const {
|
|
|
|
assert(level >= 0);
|
|
|
|
assert(level < num_levels());
|
|
|
|
return TotalFileSize(files_[level]);
|
|
|
|
}
|
|
|
|
|
|
|
|
const char* VersionStorageInfo::LevelSummary(
|
|
|
|
LevelSummaryStorage* scratch) const {
|
options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size bases of levels dynamically.
Summary:
When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases.
In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier.
Test Plan: New unit tests and pass tests suites including valgrind.
Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo
Reviewed By: ikabiljo
Subscribers: yoshinorim, ikabiljo, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31437
10 years ago
|
|
|
int len = 0;
|
|
|
|
if (compaction_style_ == kCompactionStyleLevel && num_levels() > 1) {
|
options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size bases of levels dynamically.
Summary:
When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases.
In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier.
Test Plan: New unit tests and pass tests suites including valgrind.
Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo
Reviewed By: ikabiljo
Subscribers: yoshinorim, ikabiljo, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31437
10 years ago
|
|
|
assert(base_level_ < static_cast<int>(level_max_bytes_.size()));
|
|
|
|
if (level_multiplier_ != 0.0) {
|
|
|
|
len = snprintf(
|
|
|
|
scratch->buffer, sizeof(scratch->buffer),
|
|
|
|
"base level %d level multiplier %.2f max bytes base %" PRIu64 " ",
|
|
|
|
base_level_, level_multiplier_, level_max_bytes_[base_level_]);
|
|
|
|
}
|
options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size bases of levels dynamically.
Summary:
When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases.
In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier.
Test Plan: New unit tests and pass tests suites including valgrind.
Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo
Reviewed By: ikabiljo
Subscribers: yoshinorim, ikabiljo, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31437
10 years ago
|
|
|
}
|
|
|
|
len +=
|
|
|
|
snprintf(scratch->buffer + len, sizeof(scratch->buffer) - len, "files[");
|
|
|
|
for (int i = 0; i < num_levels(); i++) {
|
|
|
|
int sz = sizeof(scratch->buffer) - len;
|
|
|
|
int ret = snprintf(scratch->buffer + len, sz, "%d ", int(files_[i].size()));
|
|
|
|
if (ret < 0 || ret >= sz) break;
|
|
|
|
len += ret;
|
|
|
|
}
|
|
|
|
if (len > 0) {
|
|
|
|
// overwrite the last space
|
|
|
|
--len;
|
|
|
|
}
|
Print info message about files need compaction for debuging purpose
Summary:
When there are files marked for compaction after compactions, print extra messages to help debugging. Example:
2015/06/08-23:12:55.212855 7ff5013ff700 [default] [JOB 121] Generated table #75: 54 keys, 4807 bytes (need compaction)
2015/06/08-23:12:55.556194 7ff5013ff700 (Original Log Time 2015/06/08-23:12:55.556160) [default] compacted to: base level 1 max bytes base
10240 files[0 1 9 32 12 0 0 0] max score 0.96 (2 files need compaction), MB/sec: 0.0 rd, 0.1 wr, level 2, files in(1, 3) out(5) MB in(0.0,
0.0) out(0.0), read-write-amplify(11.3) write-amplify(5.7) OK, records in: 40, records dropped: 0
Test Plan:
Run test and see LOG files.
valgrind test DBTest.TablePropertiesNeedCompactTest
Reviewers: rven, yhchiang, kradhakrishnan, IslamAbdelRahman, igor
Reviewed By: igor
Subscribers: yoshinorim, maykov, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D39771
10 years ago
|
|
|
len += snprintf(scratch->buffer + len, sizeof(scratch->buffer) - len,
|
|
|
|
"] max score %.2f", compaction_score_[0]);
|
|
|
|
|
|
|
|
if (!files_marked_for_compaction_.empty()) {
|
|
|
|
snprintf(scratch->buffer + len, sizeof(scratch->buffer) - len,
|
|
|
|
" (%" ROCKSDB_PRIszt " files need compaction)",
|
Print info message about files need compaction for debuging purpose
Summary:
When there are files marked for compaction after compactions, print extra messages to help debugging. Example:
2015/06/08-23:12:55.212855 7ff5013ff700 [default] [JOB 121] Generated table #75: 54 keys, 4807 bytes (need compaction)
2015/06/08-23:12:55.556194 7ff5013ff700 (Original Log Time 2015/06/08-23:12:55.556160) [default] compacted to: base level 1 max bytes base
10240 files[0 1 9 32 12 0 0 0] max score 0.96 (2 files need compaction), MB/sec: 0.0 rd, 0.1 wr, level 2, files in(1, 3) out(5) MB in(0.0,
0.0) out(0.0), read-write-amplify(11.3) write-amplify(5.7) OK, records in: 40, records dropped: 0
Test Plan:
Run test and see LOG files.
valgrind test DBTest.TablePropertiesNeedCompactTest
Reviewers: rven, yhchiang, kradhakrishnan, IslamAbdelRahman, igor
Reviewed By: igor
Subscribers: yoshinorim, maykov, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D39771
10 years ago
|
|
|
files_marked_for_compaction_.size());
|
|
|
|
}
|
|
|
|
|
|
|
|
return scratch->buffer;
|
|
|
|
}
|
|
|
|
|
|
|
|
const char* VersionStorageInfo::LevelFileSummary(FileSummaryStorage* scratch,
|
|
|
|
int level) const {
|
|
|
|
int len = snprintf(scratch->buffer, sizeof(scratch->buffer), "files_size[");
|
|
|
|
for (const auto& f : files_[level]) {
|
|
|
|
int sz = sizeof(scratch->buffer) - len;
|
|
|
|
char sztxt[16];
|
|
|
|
AppendHumanBytes(f->fd.GetFileSize(), sztxt, sizeof(sztxt));
|
|
|
|
int ret = snprintf(scratch->buffer + len, sz,
|
|
|
|
"#%" PRIu64 "(seq=%" PRIu64 ",sz=%s,%d) ",
|
|
|
|
f->fd.GetNumber(), f->fd.smallest_seqno, sztxt,
|
|
|
|
static_cast<int>(f->being_compacted));
|
|
|
|
if (ret < 0 || ret >= sz)
|
|
|
|
break;
|
|
|
|
len += ret;
|
|
|
|
}
|
|
|
|
// overwrite the last space (only if files_[level].size() is non-zero)
|
|
|
|
if (files_[level].size() && len > 0) {
|
|
|
|
--len;
|
|
|
|
}
|
|
|
|
snprintf(scratch->buffer + len, sizeof(scratch->buffer) - len, "]");
|
|
|
|
return scratch->buffer;
|
|
|
|
}
|
|
|
|
|
|
|
|
int64_t VersionStorageInfo::MaxNextLevelOverlappingBytes() {
|
|
|
|
uint64_t result = 0;
|
|
|
|
std::vector<FileMetaData*> overlaps;
|
|
|
|
for (int level = 1; level < num_levels() - 1; level++) {
|
|
|
|
for (const auto& f : files_[level]) {
|
|
|
|
GetOverlappingInputs(level + 1, &f->smallest, &f->largest, &overlaps);
|
|
|
|
const uint64_t sum = TotalFileSize(overlaps);
|
|
|
|
if (sum > result) {
|
|
|
|
result = sum;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return result;
|
|
|
|
}
|
|
|
|
|
options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size bases of levels dynamically.
Summary:
When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases.
In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier.
Test Plan: New unit tests and pass tests suites including valgrind.
Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo
Reviewed By: ikabiljo
Subscribers: yoshinorim, ikabiljo, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31437
10 years ago
|
|
|
uint64_t VersionStorageInfo::MaxBytesForLevel(int level) const {
|
|
|
|
// Note: the result for level zero is not really used since we set
|
|
|
|
// the level-0 compaction threshold based on number of files.
|
|
|
|
assert(level >= 0);
|
|
|
|
assert(level < static_cast<int>(level_max_bytes_.size()));
|
|
|
|
return level_max_bytes_[level];
|
|
|
|
}
|
|
|
|
|
|
|
|
void VersionStorageInfo::CalculateBaseBytes(const ImmutableCFOptions& ioptions,
|
|
|
|
const MutableCFOptions& options) {
|
|
|
|
// Special logic to set number of sorted runs.
|
|
|
|
// It is to match the previous behavior when all files are in L0.
|
|
|
|
int num_l0_count = static_cast<int>(files_[0].size());
|
|
|
|
if (compaction_style_ == kCompactionStyleUniversal) {
|
|
|
|
// For universal compaction, we use level0 score to indicate
|
|
|
|
// compaction score for the whole DB. Adding other levels as if
|
|
|
|
// they are L0 files.
|
|
|
|
for (int i = 1; i < num_levels(); i++) {
|
|
|
|
if (!files_[i].empty()) {
|
|
|
|
num_l0_count++;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
set_l0_delay_trigger_count(num_l0_count);
|
|
|
|
|
options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size bases of levels dynamically.
Summary:
When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases.
In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier.
Test Plan: New unit tests and pass tests suites including valgrind.
Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo
Reviewed By: ikabiljo
Subscribers: yoshinorim, ikabiljo, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31437
10 years ago
|
|
|
level_max_bytes_.resize(ioptions.num_levels);
|
|
|
|
if (!ioptions.level_compaction_dynamic_level_bytes) {
|
|
|
|
base_level_ = (ioptions.compaction_style == kCompactionStyleLevel) ? 1 : -1;
|
options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size bases of levels dynamically.
Summary:
When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases.
In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier.
Test Plan: New unit tests and pass tests suites including valgrind.
Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo
Reviewed By: ikabiljo
Subscribers: yoshinorim, ikabiljo, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31437
10 years ago
|
|
|
|
|
|
|
// Calculate for static bytes base case
|
|
|
|
for (int i = 0; i < ioptions.num_levels; ++i) {
|
|
|
|
if (i == 0 && ioptions.compaction_style == kCompactionStyleUniversal) {
|
|
|
|
level_max_bytes_[i] = options.max_bytes_for_level_base;
|
|
|
|
} else if (i > 1) {
|
|
|
|
level_max_bytes_[i] = MultiplyCheckOverflow(
|
|
|
|
MultiplyCheckOverflow(level_max_bytes_[i - 1],
|
|
|
|
options.max_bytes_for_level_multiplier),
|
|
|
|
options.MaxBytesMultiplerAdditional(i - 1));
|
options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size bases of levels dynamically.
Summary:
When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases.
In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier.
Test Plan: New unit tests and pass tests suites including valgrind.
Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo
Reviewed By: ikabiljo
Subscribers: yoshinorim, ikabiljo, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31437
10 years ago
|
|
|
} else {
|
|
|
|
level_max_bytes_[i] = options.max_bytes_for_level_base;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
uint64_t max_level_size = 0;
|
|
|
|
|
|
|
|
int first_non_empty_level = -1;
|
|
|
|
// Find size of non-L0 level of most data.
|
|
|
|
// Cannot use the size of the last level because it can be empty or less
|
|
|
|
// than previous levels after compaction.
|
|
|
|
for (int i = 1; i < num_levels_; i++) {
|
|
|
|
uint64_t total_size = 0;
|
|
|
|
for (const auto& f : files_[i]) {
|
|
|
|
total_size += f->fd.GetFileSize();
|
|
|
|
}
|
|
|
|
if (total_size > 0 && first_non_empty_level == -1) {
|
|
|
|
first_non_empty_level = i;
|
|
|
|
}
|
|
|
|
if (total_size > max_level_size) {
|
|
|
|
max_level_size = total_size;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
// Prefill every level's max bytes to disallow compaction from there.
|
|
|
|
for (int i = 0; i < num_levels_; i++) {
|
|
|
|
level_max_bytes_[i] = std::numeric_limits<uint64_t>::max();
|
|
|
|
}
|
|
|
|
|
|
|
|
if (max_level_size == 0) {
|
|
|
|
// No data for L1 and up. L0 compacts to last level directly.
|
|
|
|
// No compaction from L1+ needs to be scheduled.
|
|
|
|
base_level_ = num_levels_ - 1;
|
|
|
|
} else {
|
|
|
|
uint64_t l0_size = 0;
|
|
|
|
for (const auto& f : files_[0]) {
|
|
|
|
l0_size += f->fd.GetFileSize();
|
|
|
|
}
|
|
|
|
|
|
|
|
uint64_t base_bytes_max =
|
|
|
|
std::max(options.max_bytes_for_level_base, l0_size);
|
|
|
|
uint64_t base_bytes_min = static_cast<uint64_t>(
|
|
|
|
base_bytes_max / options.max_bytes_for_level_multiplier);
|
options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size bases of levels dynamically.
Summary:
When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases.
In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier.
Test Plan: New unit tests and pass tests suites including valgrind.
Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo
Reviewed By: ikabiljo
Subscribers: yoshinorim, ikabiljo, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31437
10 years ago
|
|
|
|
|
|
|
// Try whether we can make last level's target size to be max_level_size
|
|
|
|
uint64_t cur_level_size = max_level_size;
|
|
|
|
for (int i = num_levels_ - 2; i >= first_non_empty_level; i--) {
|
|
|
|
// Round up after dividing
|
|
|
|
cur_level_size = static_cast<uint64_t>(
|
|
|
|
cur_level_size / options.max_bytes_for_level_multiplier);
|
options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size bases of levels dynamically.
Summary:
When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases.
In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier.
Test Plan: New unit tests and pass tests suites including valgrind.
Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo
Reviewed By: ikabiljo
Subscribers: yoshinorim, ikabiljo, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31437
10 years ago
|
|
|
}
|
|
|
|
|
|
|
|
// Calculate base level and its size.
|
|
|
|
uint64_t base_level_size;
|
options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size bases of levels dynamically.
Summary:
When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases.
In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier.
Test Plan: New unit tests and pass tests suites including valgrind.
Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo
Reviewed By: ikabiljo
Subscribers: yoshinorim, ikabiljo, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31437
10 years ago
|
|
|
if (cur_level_size <= base_bytes_min) {
|
|
|
|
// Case 1. If we make target size of last level to be max_level_size,
|
|
|
|
// target size of the first non-empty level would be smaller than
|
|
|
|
// base_bytes_min. We set it be base_bytes_min.
|
|
|
|
base_level_size = base_bytes_min + 1U;
|
options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size bases of levels dynamically.
Summary:
When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases.
In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier.
Test Plan: New unit tests and pass tests suites including valgrind.
Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo
Reviewed By: ikabiljo
Subscribers: yoshinorim, ikabiljo, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31437
10 years ago
|
|
|
base_level_ = first_non_empty_level;
|
|
|
|
ROCKS_LOG_INFO(ioptions.info_log,
|
|
|
|
"More existing levels in DB than needed. "
|
|
|
|
"max_bytes_for_level_multiplier may not be guaranteed.");
|
options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size bases of levels dynamically.
Summary:
When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases.
In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier.
Test Plan: New unit tests and pass tests suites including valgrind.
Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo
Reviewed By: ikabiljo
Subscribers: yoshinorim, ikabiljo, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31437
10 years ago
|
|
|
} else {
|
|
|
|
// Find base level (where L0 data is compacted to).
|
|
|
|
base_level_ = first_non_empty_level;
|
|
|
|
while (base_level_ > 1 && cur_level_size > base_bytes_max) {
|
|
|
|
--base_level_;
|
|
|
|
cur_level_size = static_cast<uint64_t>(
|
|
|
|
cur_level_size / options.max_bytes_for_level_multiplier);
|
options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size bases of levels dynamically.
Summary:
When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases.
In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier.
Test Plan: New unit tests and pass tests suites including valgrind.
Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo
Reviewed By: ikabiljo
Subscribers: yoshinorim, ikabiljo, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31437
10 years ago
|
|
|
}
|
|
|
|
if (cur_level_size > base_bytes_max) {
|
|
|
|
// Even L1 will be too large
|
|
|
|
assert(base_level_ == 1);
|
|
|
|
base_level_size = base_bytes_max;
|
options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size bases of levels dynamically.
Summary:
When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases.
In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier.
Test Plan: New unit tests and pass tests suites including valgrind.
Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo
Reviewed By: ikabiljo
Subscribers: yoshinorim, ikabiljo, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31437
10 years ago
|
|
|
} else {
|
|
|
|
base_level_size = cur_level_size;
|
options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size bases of levels dynamically.
Summary:
When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases.
In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier.
Test Plan: New unit tests and pass tests suites including valgrind.
Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo
Reviewed By: ikabiljo
Subscribers: yoshinorim, ikabiljo, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31437
10 years ago
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
level_multiplier_ = options.max_bytes_for_level_multiplier;
|
|
|
|
assert(base_level_size > 0);
|
|
|
|
if (l0_size > base_level_size &&
|
|
|
|
(l0_size > options.max_bytes_for_level_base ||
|
|
|
|
static_cast<int>(files_[0].size() / 2) >=
|
|
|
|
options.level0_file_num_compaction_trigger)) {
|
|
|
|
// We adjust the base level according to actual L0 size, and adjust
|
|
|
|
// the level multiplier accordingly, when:
|
|
|
|
// 1. the L0 size is larger than level size base, or
|
|
|
|
// 2. number of L0 files reaches twice the L0->L1 compaction trigger
|
|
|
|
// We don't do this otherwise to keep the LSM-tree structure stable
|
|
|
|
// unless the L0 compation is backlogged.
|
|
|
|
base_level_size = l0_size;
|
|
|
|
if (base_level_ == num_levels_ - 1) {
|
|
|
|
level_multiplier_ = 1.0;
|
|
|
|
} else {
|
|
|
|
level_multiplier_ = std::pow(
|
|
|
|
static_cast<double>(max_level_size) /
|
|
|
|
static_cast<double>(base_level_size),
|
|
|
|
1.0 / static_cast<double>(num_levels_ - base_level_ - 1));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
uint64_t level_size = base_level_size;
|
options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size bases of levels dynamically.
Summary:
When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases.
In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier.
Test Plan: New unit tests and pass tests suites including valgrind.
Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo
Reviewed By: ikabiljo
Subscribers: yoshinorim, ikabiljo, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31437
10 years ago
|
|
|
for (int i = base_level_; i < num_levels_; i++) {
|
|
|
|
if (i > base_level_) {
|
|
|
|
level_size = MultiplyCheckOverflow(level_size, level_multiplier_);
|
options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size bases of levels dynamically.
Summary:
When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases.
In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier.
Test Plan: New unit tests and pass tests suites including valgrind.
Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo
Reviewed By: ikabiljo
Subscribers: yoshinorim, ikabiljo, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31437
10 years ago
|
|
|
}
|
|
|
|
// Don't set any level below base_bytes_max. Otherwise, the LSM can
|
|
|
|
// assume an hourglass shape where L1+ sizes are smaller than L0. This
|
|
|
|
// causes compaction scoring, which depends on level sizes, to favor L1+
|
|
|
|
// at the expense of L0, which may fill up and stall.
|
|
|
|
level_max_bytes_[i] = std::max(level_size, base_bytes_max);
|
options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size bases of levels dynamically.
Summary:
When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases.
In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier.
Test Plan: New unit tests and pass tests suites including valgrind.
Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo
Reviewed By: ikabiljo
Subscribers: yoshinorim, ikabiljo, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31437
10 years ago
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
uint64_t VersionStorageInfo::EstimateLiveDataSize() const {
|
|
|
|
// Estimate the live data size by adding up the size of a maximal set of
|
|
|
|
// sst files with no range overlap in same or higher level. The less
|
|
|
|
// compacted, the more optimistic (smaller) this estimate is. Also,
|
|
|
|
// for multiple sorted runs within a level, file order will matter.
|
|
|
|
uint64_t size = 0;
|
|
|
|
|
|
|
|
auto ikey_lt = [this](InternalKey* x, InternalKey* y) {
|
|
|
|
return internal_comparator_->Compare(*x, *y) < 0;
|
|
|
|
};
|
|
|
|
// (Ordered) map of largest keys in files being included in size estimate
|
|
|
|
std::map<InternalKey*, FileMetaData*, decltype(ikey_lt)> ranges(ikey_lt);
|
|
|
|
|
|
|
|
for (int l = num_levels_ - 1; l >= 0; l--) {
|
|
|
|
bool found_end = false;
|
|
|
|
for (auto file : files_[l]) {
|
|
|
|
// Find the first file already included with largest key is larger than
|
|
|
|
// the smallest key of `file`. If that file does not overlap with the
|
|
|
|
// current file, none of the files in the map does. If there is
|
|
|
|
// no potential overlap, we can safely insert the rest of this level
|
|
|
|
// (if the level is not 0) into the map without checking again because
|
|
|
|
// the elements in the level are sorted and non-overlapping.
|
|
|
|
auto lb = (found_end && l != 0) ?
|
|
|
|
ranges.end() : ranges.lower_bound(&file->smallest);
|
|
|
|
found_end = (lb == ranges.end());
|
|
|
|
if (found_end || internal_comparator_->Compare(
|
|
|
|
file->largest, (*lb).second->smallest) < 0) {
|
|
|
|
ranges.emplace_hint(lb, &file->largest, file);
|
|
|
|
size += file->fd.file_size;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return size;
|
|
|
|
}
|
|
|
|
|
|
|
|
bool VersionStorageInfo::RangeMightExistAfterSortedRun(
|
|
|
|
const Slice& smallest_user_key, const Slice& largest_user_key,
|
|
|
|
int last_level, int last_l0_idx) {
|
|
|
|
assert((last_l0_idx != -1) == (last_level == 0));
|
|
|
|
// TODO(ajkr): this preserves earlier behavior where we considered an L0 file
|
|
|
|
// bottommost only if it's the oldest L0 file and there are no files on older
|
|
|
|
// levels. It'd be better to consider it bottommost if there's no overlap in
|
|
|
|
// older levels/files.
|
|
|
|
if (last_level == 0 &&
|
|
|
|
last_l0_idx != static_cast<int>(LevelFiles(0).size() - 1)) {
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
|
|
|
|
// Checks whether there are files living beyond the `last_level`. If lower
|
|
|
|
// levels have files, it checks for overlap between [`smallest_key`,
|
|
|
|
// `largest_key`] and those files. Bottomlevel optimizations can be made if
|
|
|
|
// there are no files in lower levels or if there is no overlap with the files
|
|
|
|
// in the lower levels.
|
|
|
|
for (int level = last_level + 1; level < num_levels(); level++) {
|
|
|
|
// The range is not in the bottommost level if there are files in lower
|
|
|
|
// levels when the `last_level` is 0 or if there are files in lower levels
|
|
|
|
// which overlap with [`smallest_key`, `largest_key`].
|
|
|
|
if (files_[level].size() > 0 &&
|
|
|
|
(last_level == 0 ||
|
|
|
|
OverlapInLevel(level, &smallest_user_key, &largest_user_key))) {
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
|
|
|
void Version::AddLiveFiles(std::vector<uint64_t>* live_table_files,
|
|
|
|
std::vector<uint64_t>* live_blob_files) const {
|
|
|
|
assert(live_table_files);
|
|
|
|
assert(live_blob_files);
|
|
|
|
|
|
|
|
for (int level = 0; level < storage_info_.num_levels(); ++level) {
|
|
|
|
const auto& level_files = storage_info_.LevelFiles(level);
|
|
|
|
for (const auto& meta : level_files) {
|
|
|
|
assert(meta);
|
|
|
|
|
|
|
|
live_table_files->emplace_back(meta->fd.GetNumber());
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
const auto& blob_files = storage_info_.GetBlobFiles();
|
|
|
|
for (const auto& pair : blob_files) {
|
|
|
|
const auto& meta = pair.second;
|
|
|
|
assert(meta);
|
|
|
|
|
|
|
|
live_blob_files->emplace_back(meta->GetBlobFileNumber());
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
std::string Version::DebugString(bool hex, bool print_stats) const {
|
|
|
|
std::string r;
|
|
|
|
for (int level = 0; level < storage_info_.num_levels_; level++) {
|
|
|
|
// E.g.,
|
|
|
|
// --- level 1 ---
|
|
|
|
// 17:123[1 .. 124]['a' .. 'd']
|
|
|
|
// 20:43[124 .. 128]['e' .. 'g']
|
|
|
|
//
|
|
|
|
// if print_stats=true:
|
|
|
|
// 17:123[1 .. 124]['a' .. 'd'](4096)
|
|
|
|
r.append("--- level ");
|
|
|
|
AppendNumberTo(&r, level);
|
|
|
|
r.append(" --- version# ");
|
|
|
|
AppendNumberTo(&r, version_number_);
|
|
|
|
r.append(" ---\n");
|
|
|
|
const std::vector<FileMetaData*>& files = storage_info_.files_[level];
|
|
|
|
for (size_t i = 0; i < files.size(); i++) {
|
|
|
|
r.push_back(' ');
|
|
|
|
AppendNumberTo(&r, files[i]->fd.GetNumber());
|
|
|
|
r.push_back(':');
|
|
|
|
AppendNumberTo(&r, files[i]->fd.GetFileSize());
|
|
|
|
r.append("[");
|
|
|
|
AppendNumberTo(&r, files[i]->fd.smallest_seqno);
|
|
|
|
r.append(" .. ");
|
|
|
|
AppendNumberTo(&r, files[i]->fd.largest_seqno);
|
|
|
|
r.append("]");
|
|
|
|
r.append("[");
|
|
|
|
r.append(files[i]->smallest.DebugString(hex));
|
|
|
|
r.append(" .. ");
|
|
|
|
r.append(files[i]->largest.DebugString(hex));
|
|
|
|
r.append("]");
|
|
|
|
if (files[i]->oldest_blob_file_number != kInvalidBlobFileNumber) {
|
|
|
|
r.append(" blob_file:");
|
|
|
|
AppendNumberTo(&r, files[i]->oldest_blob_file_number);
|
|
|
|
}
|
|
|
|
if (print_stats) {
|
|
|
|
r.append("(");
|
|
|
|
r.append(ToString(
|
|
|
|
files[i]->stats.num_reads_sampled.load(std::memory_order_relaxed)));
|
|
|
|
r.append(")");
|
|
|
|
}
|
|
|
|
r.append("\n");
|
|
|
|
}
|
|
|
|
}
|
Add blob files to VersionStorageInfo/VersionBuilder (#6597)
Summary:
The patch adds a couple of classes to represent metadata about
blob files: `SharedBlobFileMetaData` contains the information elements
that are immutable (once the blob file is closed), e.g. blob file number,
total number and size of blob files, checksum method/value, while
`BlobFileMetaData` contains attributes that can vary across versions like
the amount of garbage in the file. There is a single `SharedBlobFileMetaData`
for each blob file, which is jointly owned by the `BlobFileMetaData` objects
that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s
and can also be shared if the (immutable _and_ mutable) state of the blob file
is the same in two versions.
In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends
`VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those
containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata
to a new `VersionStorageInfo`. Consistency checks are also extended to ensure
that table files point to blob files that are part of the `Version`, and that all blob files
that are part of any given `Version` have at least some _non_-garbage data in them.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597
Test Plan: `make check`
Reviewed By: riversand963
Differential Revision: D20656803
Pulled By: ltamasi
fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
5 years ago
|
|
|
|
|
|
|
const auto& blob_files = storage_info_.GetBlobFiles();
|
|
|
|
if (!blob_files.empty()) {
|
|
|
|
r.append("--- blob files --- version# ");
|
|
|
|
AppendNumberTo(&r, version_number_);
|
|
|
|
r.append(" ---\n");
|
|
|
|
for (const auto& pair : blob_files) {
|
|
|
|
const auto& blob_file_meta = pair.second;
|
|
|
|
assert(blob_file_meta);
|
|
|
|
|
|
|
|
r.append(blob_file_meta->DebugString());
|
|
|
|
r.push_back('\n');
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return r;
|
|
|
|
}
|
|
|
|
|
|
|
|
// this is used to batch writes to the manifest file
|
|
|
|
struct VersionSet::ManifestWriter {
|
|
|
|
Status status;
|
|
|
|
bool done;
|
|
|
|
InstrumentedCondVar cv;
|
|
|
|
ColumnFamilyData* cfd;
|
|
|
|
const MutableCFOptions mutable_cf_options;
|
|
|
|
const autovector<VersionEdit*>& edit_list;
|
|
|
|
const std::function<void(const Status&)> manifest_write_callback;
|
|
|
|
|
|
|
|
explicit ManifestWriter(
|
|
|
|
InstrumentedMutex* mu, ColumnFamilyData* _cfd,
|
|
|
|
const MutableCFOptions& cf_options, const autovector<VersionEdit*>& e,
|
|
|
|
const std::function<void(const Status&)>& manifest_wcb)
|
|
|
|
: done(false),
|
|
|
|
cv(mu),
|
|
|
|
cfd(_cfd),
|
|
|
|
mutable_cf_options(cf_options),
|
|
|
|
edit_list(e),
|
|
|
|
manifest_write_callback(manifest_wcb) {}
|
|
|
|
~ManifestWriter() { status.PermitUncheckedError(); }
|
|
|
|
|
|
|
|
bool IsAllWalEdits() const {
|
|
|
|
bool all_wal_edits = true;
|
|
|
|
for (const auto& e : edit_list) {
|
|
|
|
if (!e->IsWalManipulation()) {
|
|
|
|
all_wal_edits = false;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return all_wal_edits;
|
|
|
|
}
|
|
|
|
};
|
|
|
|
|
|
|
|
Status AtomicGroupReadBuffer::AddEdit(VersionEdit* edit) {
|
|
|
|
assert(edit);
|
|
|
|
if (edit->is_in_atomic_group_) {
|
|
|
|
TEST_SYNC_POINT("AtomicGroupReadBuffer::AddEdit:AtomicGroup");
|
|
|
|
if (replay_buffer_.empty()) {
|
|
|
|
replay_buffer_.resize(edit->remaining_entries_ + 1);
|
|
|
|
TEST_SYNC_POINT_CALLBACK(
|
|
|
|
"AtomicGroupReadBuffer::AddEdit:FirstInAtomicGroup", edit);
|
|
|
|
}
|
|
|
|
read_edits_in_atomic_group_++;
|
|
|
|
if (read_edits_in_atomic_group_ + edit->remaining_entries_ !=
|
|
|
|
static_cast<uint32_t>(replay_buffer_.size())) {
|
|
|
|
TEST_SYNC_POINT_CALLBACK(
|
|
|
|
"AtomicGroupReadBuffer::AddEdit:IncorrectAtomicGroupSize", edit);
|
|
|
|
return Status::Corruption("corrupted atomic group");
|
|
|
|
}
|
|
|
|
replay_buffer_[read_edits_in_atomic_group_ - 1] = *edit;
|
|
|
|
if (read_edits_in_atomic_group_ == replay_buffer_.size()) {
|
|
|
|
TEST_SYNC_POINT_CALLBACK(
|
|
|
|
"AtomicGroupReadBuffer::AddEdit:LastInAtomicGroup", edit);
|
|
|
|
return Status::OK();
|
|
|
|
}
|
|
|
|
return Status::OK();
|
|
|
|
}
|
|
|
|
|
|
|
|
// A normal edit.
|
|
|
|
if (!replay_buffer().empty()) {
|
|
|
|
TEST_SYNC_POINT_CALLBACK(
|
|
|
|
"AtomicGroupReadBuffer::AddEdit:AtomicGroupMixedWithNormalEdits", edit);
|
|
|
|
return Status::Corruption("corrupted atomic group");
|
|
|
|
}
|
|
|
|
return Status::OK();
|
|
|
|
}
|
|
|
|
|
|
|
|
bool AtomicGroupReadBuffer::IsFull() const {
|
|
|
|
return read_edits_in_atomic_group_ == replay_buffer_.size();
|
|
|
|
}
|
|
|
|
|
|
|
|
bool AtomicGroupReadBuffer::IsEmpty() const { return replay_buffer_.empty(); }
|
|
|
|
|
|
|
|
void AtomicGroupReadBuffer::Clear() {
|
|
|
|
read_edits_in_atomic_group_ = 0;
|
|
|
|
replay_buffer_.clear();
|
|
|
|
}
|
|
|
|
|
|
|
|
VersionSet::VersionSet(const std::string& dbname,
|
Skip deleted WALs during recovery
Summary:
This patch record min log number to keep to the manifest while flushing SST files to ignore them and any WAL older than them during recovery. This is to avoid scenarios when we have a gap between the WAL files are fed to the recovery procedure. The gap could happen by for example out-of-order WAL deletion. Such gap could cause problems in 2PC recovery where the prepared and commit entry are placed into two separate WAL and gap in the WALs could result into not processing the WAL with the commit entry and hence breaking the 2PC recovery logic.
Before the commit, for 2PC case, we determined which log number to keep in FindObsoleteFiles(). We looked at the earliest logs with outstanding prepare entries, or prepare entries whose respective commit or abort are in memtable. With the commit, the same calculation is done while we apply the SST flush. Just before installing the flush file, we precompute the earliest log file to keep after the flush finishes using the same logic (but skipping the memtables just flushed), record this information to the manifest entry for this new flushed SST file. This pre-computed value is also remembered in memory, and will later be used to determine whether a log file can be deleted. This value is unlikely to change until next flush because the commit entry will stay in memtable. (In WritePrepared, we could have removed the older log files as soon as all prepared entries are committed. It's not yet done anyway. Even if we do it, the only thing we loss with this new approach is earlier log deletion between two flushes, which does not guarantee to happen anyway because the obsolete file clean-up function is only executed after flush or compaction)
This min log number to keep is stored in the manifest using the safely-ignore customized field of AddFile entry, in order to guarantee that the DB generated using newer release can be opened by previous releases no older than 4.2.
Closes https://github.com/facebook/rocksdb/pull/3765
Differential Revision: D7747618
Pulled By: siying
fbshipit-source-id: d00c92105b4f83852e9754a1b70d6b64cb590729
7 years ago
|
|
|
const ImmutableDBOptions* _db_options,
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
|
|
|
const FileOptions& storage_options, Cache* table_cache,
|
|
|
|
WriteBufferManager* write_buffer_manager,
|
|
|
|
WriteController* write_controller,
|
|
|
|
BlockCacheTracer* const block_cache_tracer,
|
|
|
|
const std::shared_ptr<IOTracer>& io_tracer)
|
|
|
|
: column_family_set_(
|
|
|
|
new ColumnFamilySet(dbname, _db_options, storage_options, table_cache,
|
|
|
|
write_buffer_manager, write_controller,
|
|
|
|
block_cache_tracer, io_tracer)),
|
|
|
|
table_cache_(table_cache),
|
Skip deleted WALs during recovery
Summary:
This patch record min log number to keep to the manifest while flushing SST files to ignore them and any WAL older than them during recovery. This is to avoid scenarios when we have a gap between the WAL files are fed to the recovery procedure. The gap could happen by for example out-of-order WAL deletion. Such gap could cause problems in 2PC recovery where the prepared and commit entry are placed into two separate WAL and gap in the WALs could result into not processing the WAL with the commit entry and hence breaking the 2PC recovery logic.
Before the commit, for 2PC case, we determined which log number to keep in FindObsoleteFiles(). We looked at the earliest logs with outstanding prepare entries, or prepare entries whose respective commit or abort are in memtable. With the commit, the same calculation is done while we apply the SST flush. Just before installing the flush file, we precompute the earliest log file to keep after the flush finishes using the same logic (but skipping the memtables just flushed), record this information to the manifest entry for this new flushed SST file. This pre-computed value is also remembered in memory, and will later be used to determine whether a log file can be deleted. This value is unlikely to change until next flush because the commit entry will stay in memtable. (In WritePrepared, we could have removed the older log files as soon as all prepared entries are committed. It's not yet done anyway. Even if we do it, the only thing we loss with this new approach is earlier log deletion between two flushes, which does not guarantee to happen anyway because the obsolete file clean-up function is only executed after flush or compaction)
This min log number to keep is stored in the manifest using the safely-ignore customized field of AddFile entry, in order to guarantee that the DB generated using newer release can be opened by previous releases no older than 4.2.
Closes https://github.com/facebook/rocksdb/pull/3765
Differential Revision: D7747618
Pulled By: siying
fbshipit-source-id: d00c92105b4f83852e9754a1b70d6b64cb590729
7 years ago
|
|
|
env_(_db_options->env),
|
|
|
|
fs_(_db_options->fs, io_tracer),
|
|
|
|
clock_(env_->GetSystemClock()),
|
|
|
|
dbname_(dbname),
|
Skip deleted WALs during recovery
Summary:
This patch record min log number to keep to the manifest while flushing SST files to ignore them and any WAL older than them during recovery. This is to avoid scenarios when we have a gap between the WAL files are fed to the recovery procedure. The gap could happen by for example out-of-order WAL deletion. Such gap could cause problems in 2PC recovery where the prepared and commit entry are placed into two separate WAL and gap in the WALs could result into not processing the WAL with the commit entry and hence breaking the 2PC recovery logic.
Before the commit, for 2PC case, we determined which log number to keep in FindObsoleteFiles(). We looked at the earliest logs with outstanding prepare entries, or prepare entries whose respective commit or abort are in memtable. With the commit, the same calculation is done while we apply the SST flush. Just before installing the flush file, we precompute the earliest log file to keep after the flush finishes using the same logic (but skipping the memtables just flushed), record this information to the manifest entry for this new flushed SST file. This pre-computed value is also remembered in memory, and will later be used to determine whether a log file can be deleted. This value is unlikely to change until next flush because the commit entry will stay in memtable. (In WritePrepared, we could have removed the older log files as soon as all prepared entries are committed. It's not yet done anyway. Even if we do it, the only thing we loss with this new approach is earlier log deletion between two flushes, which does not guarantee to happen anyway because the obsolete file clean-up function is only executed after flush or compaction)
This min log number to keep is stored in the manifest using the safely-ignore customized field of AddFile entry, in order to guarantee that the DB generated using newer release can be opened by previous releases no older than 4.2.
Closes https://github.com/facebook/rocksdb/pull/3765
Differential Revision: D7747618
Pulled By: siying
fbshipit-source-id: d00c92105b4f83852e9754a1b70d6b64cb590729
7 years ago
|
|
|
db_options_(_db_options),
|
|
|
|
next_file_number_(2),
|
|
|
|
manifest_file_number_(0), // Filled by Recover()
|
|
|
|
options_file_number_(0),
|
|
|
|
pending_manifest_file_number_(0),
|
|
|
|
last_sequence_(0),
|
|
|
|
last_allocated_sequence_(0),
|
|
|
|
last_published_sequence_(0),
|
|
|
|
prev_log_number_(0),
|
|
|
|
current_version_number_(0),
|
|
|
|
manifest_file_size_(0),
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
|
|
|
file_options_(storage_options),
|
|
|
|
block_cache_tracer_(block_cache_tracer),
|
|
|
|
io_tracer_(io_tracer) {}
|
|
|
|
|
|
|
|
VersionSet::~VersionSet() {
|
|
|
|
// we need to delete column_family_set_ because its destructor depends on
|
|
|
|
// VersionSet
|
|
|
|
column_family_set_.reset();
|
|
|
|
for (auto& file : obsolete_files_) {
|
|
|
|
if (file.metadata->table_reader_handle) {
|
|
|
|
table_cache_->Release(file.metadata->table_reader_handle);
|
|
|
|
TableCache::Evict(table_cache_, file.metadata->fd.GetNumber());
|
|
|
|
}
|
|
|
|
file.DeleteMetadata();
|
|
|
|
}
|
|
|
|
obsolete_files_.clear();
|
|
|
|
io_status_.PermitUncheckedError();
|
|
|
|
}
|
|
|
|
|
|
|
|
void VersionSet::Reset() {
|
|
|
|
if (column_family_set_) {
|
|
|
|
WriteBufferManager* wbm = column_family_set_->write_buffer_manager();
|
|
|
|
WriteController* wc = column_family_set_->write_controller();
|
|
|
|
column_family_set_.reset(
|
|
|
|
new ColumnFamilySet(dbname_, db_options_, file_options_, table_cache_,
|
|
|
|
wbm, wc, block_cache_tracer_, io_tracer_));
|
|
|
|
}
|
|
|
|
db_id_.clear();
|
|
|
|
next_file_number_.store(2);
|
|
|
|
min_log_number_to_keep_2pc_.store(0);
|
|
|
|
manifest_file_number_ = 0;
|
|
|
|
options_file_number_ = 0;
|
|
|
|
pending_manifest_file_number_ = 0;
|
|
|
|
last_sequence_.store(0);
|
|
|
|
last_allocated_sequence_.store(0);
|
|
|
|
last_published_sequence_.store(0);
|
|
|
|
prev_log_number_ = 0;
|
|
|
|
descriptor_log_.reset();
|
|
|
|
current_version_number_ = 0;
|
|
|
|
manifest_writers_.clear();
|
|
|
|
manifest_file_size_ = 0;
|
|
|
|
obsolete_files_.clear();
|
|
|
|
obsolete_manifests_.clear();
|
Define WAL related classes to be used in VersionEdit and VersionSet (#7164)
Summary:
`WalAddition`, `WalDeletion` are defined in `wal_version.h` and used in `VersionEdit`.
`WalAddition` is used to represent events of creating a new WAL (no size, just log number), or closing a WAL (with size).
`WalDeletion` is used to represent events of deleting or archiving a WAL, it means the WAL is no longer alive (won't be replayed during recovery).
`WalSet` is the set of alive WALs kept in `VersionSet`.
1. Why use `WalDeletion` instead of relying on `MinLogNumber` to identify outdated WALs
On recovery, we can compute `MinLogNumber()` based on the log numbers kept in MANIFEST, any log with number < MinLogNumber can be ignored. So it seems that we don't need to persist `WalDeletion` to MANIFEST, since we can ignore the WALs based on MinLogNumber.
But the `MinLogNumber()` is actually a lower bound, it does not exactly mean that logs starting from MinLogNumber must exist. This is because in a corner case, when a column family is empty and never flushed, its log number is set to the largest log number, but not persisted in MANIFEST. So let's say there are 2 column families, when creating the DB, the first WAL has log number 1, so it's persisted to MANIFEST for both column families. Then CF 0 is empty and never flushed, CF 1 is updated and flushed, so a new WAL with log number 2 is created and persisted to MANIFEST for CF 1. But CF 0's log number in MANIFEST is still 1. So on recovery, MinLogNumber is 1, but since log 1 only contains data for CF 1, and CF 1 is flushed, log 1 might have already been deleted from disk.
We can make `MinLogNumber()` be the exactly minimum log number that must exist, by persisting the most recent log number for empty column families that are not flushed. But if there are N such column families, then every time a new WAL is created, we need to add N records to MANIFEST.
In current design, a record is persisted to MANIFEST only when WAL is created, closed, or deleted/archived, so the number of WAL related records are bounded to 3x number of WALs.
2. Why keep `WalSet` in `VersionSet` instead of applying the `VersionEdit`s to `VersionStorageInfo`
`VersionEdit`s are originally designed to track the addition and deletion of SST files. The SST files are related to column families, each column family has a list of `Version`s, and each `Version` keeps the set of active SST files in `VersionStorageInfo`.
But WALs are a concept of DB, they are not bounded to specific column families. So logically it does not make sense to store WALs in a column family's `Version`s.
Also, `Version`'s purpose is to keep reference to SST / blob files, so that they are not deleted until there is no version referencing them. But a WAL is deleted regardless of version references.
So we keep the WALs in `VersionSet` for the purpose of writing out the DB state's snapshot when creating new MANIFESTs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7164
Test Plan:
make version_edit_test && ./version_edit_test
make wal_edit_test && ./wal_edit_test
Reviewed By: ltamasi
Differential Revision: D22677936
Pulled By: cheng-chang
fbshipit-source-id: 5a3b6890140e572ffd79eb37e6e4c3c32361a859
4 years ago
|
|
|
wals_.Reset();
|
|
|
|
}
|
|
|
|
|
|
|
|
void VersionSet::AppendVersion(ColumnFamilyData* column_family_data,
|
|
|
|
Version* v) {
|
|
|
|
// compute new compaction score
|
|
|
|
v->storage_info()->ComputeCompactionScore(
|
|
|
|
*column_family_data->ioptions(),
|
|
|
|
*column_family_data->GetLatestMutableCFOptions());
|
|
|
|
|
|
|
|
// Mark v finalized
|
|
|
|
v->storage_info_.SetFinalized();
|
|
|
|
|
|
|
|
// Make "v" current
|
|
|
|
assert(v->refs_ == 0);
|
|
|
|
Version* current = column_family_data->current();
|
|
|
|
assert(v != current);
|
|
|
|
if (current != nullptr) {
|
|
|
|
assert(current->refs_ > 0);
|
|
|
|
current->Unref();
|
|
|
|
}
|
|
|
|
column_family_data->SetCurrent(v);
|
|
|
|
v->Ref();
|
|
|
|
|
|
|
|
// Append to linked list
|
|
|
|
v->prev_ = column_family_data->dummy_versions()->prev_;
|
|
|
|
v->next_ = column_family_data->dummy_versions();
|
|
|
|
v->prev_->next_ = v;
|
|
|
|
v->next_->prev_ = v;
|
|
|
|
}
|
|
|
|
|
|
|
|
Status VersionSet::ProcessManifestWrites(
|
|
|
|
std::deque<ManifestWriter>& writers, InstrumentedMutex* mu,
|
|
|
|
FSDirectory* db_directory, bool new_descriptor_log,
|
|
|
|
const ColumnFamilyOptions* new_cf_options) {
|
|
|
|
mu->AssertHeld();
|
|
|
|
assert(!writers.empty());
|
|
|
|
ManifestWriter& first_writer = writers.front();
|
|
|
|
ManifestWriter* last_writer = &first_writer;
|
|
|
|
|
|
|
|
assert(!manifest_writers_.empty());
|
|
|
|
assert(manifest_writers_.front() == &first_writer);
|
|
|
|
|
|
|
|
autovector<VersionEdit*> batch_edits;
|
|
|
|
autovector<Version*> versions;
|
|
|
|
autovector<const MutableCFOptions*> mutable_cf_options_ptrs;
|
|
|
|
std::vector<std::unique_ptr<BaseReferencedVersionBuilder>> builder_guards;
|
|
|
|
|
|
|
|
if (first_writer.edit_list.front()->IsColumnFamilyManipulation()) {
|
|
|
|
// No group commits for column family add or drop
|
|
|
|
LogAndApplyCFHelper(first_writer.edit_list.front());
|
|
|
|
batch_edits.push_back(first_writer.edit_list.front());
|
|
|
|
} else {
|
|
|
|
auto it = manifest_writers_.cbegin();
|
|
|
|
size_t group_start = std::numeric_limits<size_t>::max();
|
|
|
|
while (it != manifest_writers_.cend()) {
|
|
|
|
if ((*it)->edit_list.front()->IsColumnFamilyManipulation()) {
|
|
|
|
// no group commits for column family add or drop
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
last_writer = *(it++);
|
|
|
|
assert(last_writer != nullptr);
|
|
|
|
assert(last_writer->cfd != nullptr);
|
|
|
|
if (last_writer->cfd->IsDropped()) {
|
|
|
|
// If we detect a dropped CF at this point, and the corresponding
|
|
|
|
// version edits belong to an atomic group, then we need to find out
|
|
|
|
// the preceding version edits in the same atomic group, and update
|
|
|
|
// their `remaining_entries_` member variable because we are NOT going
|
|
|
|
// to write the version edits' of dropped CF to the MANIFEST. If we
|
|
|
|
// don't update, then Recover can report corrupted atomic group because
|
|
|
|
// the `remaining_entries_` do not match.
|
|
|
|
if (!batch_edits.empty()) {
|
|
|
|
if (batch_edits.back()->is_in_atomic_group_ &&
|
|
|
|
batch_edits.back()->remaining_entries_ > 0) {
|
|
|
|
assert(group_start < batch_edits.size());
|
|
|
|
const auto& edit_list = last_writer->edit_list;
|
|
|
|
size_t k = 0;
|
|
|
|
while (k < edit_list.size()) {
|
|
|
|
if (!edit_list[k]->is_in_atomic_group_) {
|
|
|
|
break;
|
|
|
|
} else if (edit_list[k]->remaining_entries_ == 0) {
|
|
|
|
++k;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
++k;
|
|
|
|
}
|
|
|
|
for (auto i = group_start; i < batch_edits.size(); ++i) {
|
|
|
|
assert(static_cast<uint32_t>(k) <=
|
|
|
|
batch_edits.back()->remaining_entries_);
|
|
|
|
batch_edits[i]->remaining_entries_ -= static_cast<uint32_t>(k);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
// We do a linear search on versions because versions is small.
|
|
|
|
// TODO(yanqin) maybe consider unordered_map
|
|
|
|
Version* version = nullptr;
|
|
|
|
VersionBuilder* builder = nullptr;
|
|
|
|
for (int i = 0; i != static_cast<int>(versions.size()); ++i) {
|
|
|
|
uint32_t cf_id = last_writer->cfd->GetID();
|
|
|
|
if (versions[i]->cfd()->GetID() == cf_id) {
|
|
|
|
version = versions[i];
|
|
|
|
assert(!builder_guards.empty() &&
|
|
|
|
builder_guards.size() == versions.size());
|
|
|
|
builder = builder_guards[i]->version_builder();
|
|
|
|
TEST_SYNC_POINT_CALLBACK(
|
|
|
|
"VersionSet::ProcessManifestWrites:SameColumnFamily", &cf_id);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (version == nullptr) {
|
|
|
|
// WAL manipulations do not need to be applied to versions.
|
|
|
|
if (!last_writer->IsAllWalEdits()) {
|
|
|
|
version = new Version(last_writer->cfd, this, file_options_,
|
|
|
|
last_writer->mutable_cf_options, io_tracer_,
|
|
|
|
current_version_number_++);
|
|
|
|
versions.push_back(version);
|
|
|
|
mutable_cf_options_ptrs.push_back(&last_writer->mutable_cf_options);
|
|
|
|
builder_guards.emplace_back(
|
|
|
|
new BaseReferencedVersionBuilder(last_writer->cfd));
|
|
|
|
builder = builder_guards.back()->version_builder();
|
|
|
|
}
|
|
|
|
assert(last_writer->IsAllWalEdits() || builder);
|
|
|
|
assert(last_writer->IsAllWalEdits() || version);
|
|
|
|
TEST_SYNC_POINT_CALLBACK("VersionSet::ProcessManifestWrites:NewVersion",
|
|
|
|
version);
|
|
|
|
}
|
|
|
|
for (const auto& e : last_writer->edit_list) {
|
|
|
|
if (e->is_in_atomic_group_) {
|
|
|
|
if (batch_edits.empty() || !batch_edits.back()->is_in_atomic_group_ ||
|
|
|
|
(batch_edits.back()->is_in_atomic_group_ &&
|
|
|
|
batch_edits.back()->remaining_entries_ == 0)) {
|
|
|
|
group_start = batch_edits.size();
|
|
|
|
}
|
|
|
|
} else if (group_start != std::numeric_limits<size_t>::max()) {
|
|
|
|
group_start = std::numeric_limits<size_t>::max();
|
|
|
|
}
|
|
|
|
Status s = LogAndApplyHelper(last_writer->cfd, builder, e, mu);
|
|
|
|
if (!s.ok()) {
|
|
|
|
// free up the allocated memory
|
|
|
|
for (auto v : versions) {
|
|
|
|
delete v;
|
|
|
|
}
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
batch_edits.push_back(e);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
for (int i = 0; i < static_cast<int>(versions.size()); ++i) {
|
|
|
|
assert(!builder_guards.empty() &&
|
|
|
|
builder_guards.size() == versions.size());
|
|
|
|
auto* builder = builder_guards[i]->version_builder();
|
|
|
|
Status s = builder->SaveTo(versions[i]->storage_info());
|
|
|
|
if (!s.ok()) {
|
|
|
|
// free up the allocated memory
|
|
|
|
for (auto v : versions) {
|
|
|
|
delete v;
|
|
|
|
}
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
#ifndef NDEBUG
|
|
|
|
// Verify that version edits of atomic groups have correct
|
|
|
|
// remaining_entries_.
|
|
|
|
size_t k = 0;
|
|
|
|
while (k < batch_edits.size()) {
|
|
|
|
while (k < batch_edits.size() && !batch_edits[k]->is_in_atomic_group_) {
|
|
|
|
++k;
|
|
|
|
}
|
|
|
|
if (k == batch_edits.size()) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
size_t i = k;
|
|
|
|
while (i < batch_edits.size()) {
|
|
|
|
if (!batch_edits[i]->is_in_atomic_group_) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
assert(i - k + batch_edits[i]->remaining_entries_ ==
|
|
|
|
batch_edits[k]->remaining_entries_);
|
|
|
|
if (batch_edits[i]->remaining_entries_ == 0) {
|
|
|
|
++i;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
++i;
|
|
|
|
}
|
|
|
|
assert(batch_edits[i - 1]->is_in_atomic_group_);
|
|
|
|
assert(0 == batch_edits[i - 1]->remaining_entries_);
|
|
|
|
std::vector<VersionEdit*> tmp;
|
|
|
|
for (size_t j = k; j != i; ++j) {
|
|
|
|
tmp.emplace_back(batch_edits[j]);
|
|
|
|
}
|
|
|
|
TEST_SYNC_POINT_CALLBACK(
|
|
|
|
"VersionSet::ProcessManifestWrites:CheckOneAtomicGroup", &tmp);
|
|
|
|
k = i;
|
|
|
|
}
|
|
|
|
#endif // NDEBUG
|
|
|
|
|
|
|
|
assert(pending_manifest_file_number_ == 0);
|
|
|
|
if (!descriptor_log_ ||
|
|
|
|
manifest_file_size_ > db_options_->max_manifest_file_size) {
|
|
|
|
TEST_SYNC_POINT("VersionSet::ProcessManifestWrites:BeforeNewManifest");
|
|
|
|
new_descriptor_log = true;
|
|
|
|
} else {
|
|
|
|
pending_manifest_file_number_ = manifest_file_number_;
|
|
|
|
}
|
|
|
|
|
|
|
|
// Local cached copy of state variable(s). WriteCurrentStateToManifest()
|
|
|
|
// reads its content after releasing db mutex to avoid race with
|
|
|
|
// SwitchMemtable().
|
|
|
|
std::unordered_map<uint32_t, MutableCFState> curr_state;
|
|
|
|
VersionEdit wal_additions;
|
|
|
|
if (new_descriptor_log) {
|
Fix MANIFEST name assignment (#6426)
Summary:
Currently, a new MANIFEST file is assigned a new file number when 1) no
MANIFEST is open, or 2) current MANIFEST file size exceeds a threshold. This is
not sufficient. There are cases when the caller explicitly specifies that a new
MANIFEST be created. For example, if user sets options.write_dbid_to_manifest = true,
and there are WAL files, then RocksDB will run into an issue during recovery.
`DBImpl::Recover()` will call `LogAndApply()` to write dbid. At this point, the db being
recovered creates a new MANIFEST, say, MANIFEST-000003. Since there are WALs,
`DBImpl::RecoverLogFiles` will be called. Towards the end of this function, we call
`LogAndApply(new_descriptor_log=true)`, which explicitly creates a new MANIFEST.
However, the manifest_file_number is wrong before this fix. Consequently, RocksDB
opens an existing, non-empty file for append, effectively truncating the file to zero.
If a crash occurs, then there will be data loss.
Test Plan (devserver):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6426
Test Plan: make check
Differential Revision: D19951866
Pulled By: riversand963
fbshipit-source-id: 4b1b9fc28d4fe2ac12764b388ef9e61f05e766da
5 years ago
|
|
|
pending_manifest_file_number_ = NewFileNumber();
|
|
|
|
batch_edits.back()->SetNextFile(next_file_number_.load());
|
|
|
|
|
|
|
|
// if we are writing out new snapshot make sure to persist max column
|
|
|
|
// family.
|
|
|
|
if (column_family_set_->GetMaxColumnFamily() > 0) {
|
|
|
|
first_writer.edit_list.front()->SetMaxColumnFamily(
|
|
|
|
column_family_set_->GetMaxColumnFamily());
|
|
|
|
}
|
|
|
|
for (const auto* cfd : *column_family_set_) {
|
|
|
|
assert(curr_state.find(cfd->GetID()) == curr_state.end());
|
|
|
|
curr_state.emplace(std::make_pair(
|
|
|
|
cfd->GetID(),
|
|
|
|
MutableCFState(cfd->GetLogNumber(), cfd->GetFullHistoryTsLow())));
|
|
|
|
}
|
|
|
|
|
|
|
|
for (const auto& wal : wals_.GetWals()) {
|
|
|
|
wal_additions.AddWal(wal.first, wal.second);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
uint64_t new_manifest_file_size = 0;
|
|
|
|
Status s;
|
|
|
|
IOStatus io_s;
|
|
|
|
{
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
|
|
|
FileOptions opt_file_opts = fs_->OptimizeForManifestWrite(file_options_);
|
|
|
|
mu->Unlock();
|
|
|
|
|
|
|
|
TEST_SYNC_POINT_CALLBACK("VersionSet::LogAndApply:WriteManifest", nullptr);
|
|
|
|
if (!first_writer.edit_list.front()->IsColumnFamilyManipulation()) {
|
|
|
|
for (int i = 0; i < static_cast<int>(versions.size()); ++i) {
|
|
|
|
assert(!builder_guards.empty() &&
|
|
|
|
builder_guards.size() == versions.size());
|
|
|
|
assert(!mutable_cf_options_ptrs.empty() &&
|
|
|
|
builder_guards.size() == versions.size());
|
|
|
|
ColumnFamilyData* cfd = versions[i]->cfd_;
|
|
|
|
s = builder_guards[i]->version_builder()->LoadTableHandlers(
|
|
|
|
cfd->internal_stats(), 1 /* max_threads */,
|
|
|
|
true /* prefetch_index_and_filter_in_cache */,
|
|
|
|
false /* is_initial_load */,
|
|
|
|
mutable_cf_options_ptrs[i]->prefix_extractor.get(),
|
|
|
|
MaxFileSizeForL0MetaPin(*mutable_cf_options_ptrs[i]));
|
|
|
|
if (!s.ok()) {
|
|
|
|
if (db_options_->paranoid_checks) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
s = Status::OK();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (s.ok() && new_descriptor_log) {
|
|
|
|
// This is fine because everything inside of this block is serialized --
|
|
|
|
// only one thread can be here at the same time
|
|
|
|
// create new manifest file
|
|
|
|
ROCKS_LOG_INFO(db_options_->info_log, "Creating manifest %" PRIu64 "\n",
|
|
|
|
pending_manifest_file_number_);
|
|
|
|
std::string descriptor_fname =
|
|
|
|
DescriptorFileName(dbname_, pending_manifest_file_number_);
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
|
|
|
std::unique_ptr<FSWritableFile> descriptor_file;
|
|
|
|
io_s = NewWritableFile(fs_.get(), descriptor_fname, &descriptor_file,
|
|
|
|
opt_file_opts);
|
|
|
|
if (io_s.ok()) {
|
|
|
|
descriptor_file->SetPreallocationBlockSize(
|
|
|
|
db_options_->manifest_preallocation_size);
|
|
|
|
FileTypeSet tmp_set = db_options_->checksum_handoff_file_types;
|
|
|
|
std::unique_ptr<WritableFileWriter> file_writer(new WritableFileWriter(
|
|
|
|
std::move(descriptor_file), descriptor_fname, opt_file_opts, clock_,
|
|
|
|
io_tracer_, nullptr, db_options_->listeners, nullptr,
|
|
|
|
tmp_set.Contains(FileType::kDescriptorFile)));
|
|
|
|
descriptor_log_.reset(
|
|
|
|
new log::Writer(std::move(file_writer), 0, false));
|
|
|
|
s = WriteCurrentStateToManifest(curr_state, wal_additions,
|
|
|
|
descriptor_log_.get(), io_s);
|
|
|
|
} else {
|
|
|
|
s = io_s;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (s.ok()) {
|
|
|
|
if (!first_writer.edit_list.front()->IsColumnFamilyManipulation()) {
|
|
|
|
for (int i = 0; i < static_cast<int>(versions.size()); ++i) {
|
|
|
|
versions[i]->PrepareApply(*mutable_cf_options_ptrs[i], true);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
// Write new records to MANIFEST log
|
|
|
|
#ifndef NDEBUG
|
|
|
|
size_t idx = 0;
|
|
|
|
#endif
|
|
|
|
for (auto& e : batch_edits) {
|
|
|
|
std::string record;
|
|
|
|
if (!e->EncodeTo(&record)) {
|
|
|
|
s = Status::Corruption("Unable to encode VersionEdit:" +
|
|
|
|
e->DebugString(true));
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
TEST_KILL_RANDOM("VersionSet::LogAndApply:BeforeAddRecord",
|
|
|
|
rocksdb_kill_odds * REDUCE_ODDS2);
|
|
|
|
#ifndef NDEBUG
|
|
|
|
if (batch_edits.size() > 1 && batch_edits.size() - 1 == idx) {
|
|
|
|
TEST_SYNC_POINT_CALLBACK(
|
|
|
|
"VersionSet::ProcessManifestWrites:BeforeWriteLastVersionEdit:0",
|
|
|
|
nullptr);
|
|
|
|
TEST_SYNC_POINT(
|
|
|
|
"VersionSet::ProcessManifestWrites:BeforeWriteLastVersionEdit:1");
|
|
|
|
}
|
|
|
|
++idx;
|
|
|
|
#endif /* !NDEBUG */
|
Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
5 years ago
|
|
|
io_s = descriptor_log_->AddRecord(record);
|
|
|
|
if (!io_s.ok()) {
|
|
|
|
s = io_s;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (s.ok()) {
|
|
|
|
io_s = SyncManifest(clock_, db_options_, descriptor_log_->file());
|
First step towards handling MANIFEST write error (#6949)
Summary:
This PR provides preliminary support for handling IO error during MANIFEST write.
File write/sync is not guaranteed to be atomic. If we encounter an IOError while writing/syncing to the MANIFEST file, we cannot be sure about the state of the MANIFEST file. The version edits may or may not have reached the file. During cleanup, if we delete the newly-generated SST files referenced by the pending version edit(s), but the version edit(s) actually are persistent in the MANIFEST, then next recovery attempt will process the version edits(s) and then fail since the SST files have already been deleted.
One approach is to truncate the MANIFEST after write/sync error, so that it is safe to delete the SST files. However, file truncation may not be supported on certain file systems. Therefore, we take the following approach.
If an IOError is detected during MANIFEST write/sync, we disable file deletions for the faulty database. Depending on whether the IOError is retryable (set by underlying file system), either RocksDB or application can call `DB::Resume()`, or simply shutdown and restart. During `Resume()`, RocksDB will try to switch to a new MANIFEST and write all existing in-memory version storage in the new file. If this succeeds, then RocksDB may proceed. If all recovery is completed, then file deletions will be re-enabled.
Note that multiple threads can call `LogAndApply()` at the same time, though only one of them will be going through the process MANIFEST write, possibly batching the version edits of other threads. When the leading MANIFEST writer finishes, all of the MANIFEST writing threads in this batch will have the same IOError. They will all call `ErrorHandler::SetBGError()` in which file deletion will be disabled.
Possible future directions:
- Add an `ErrorContext` structure so that it is easier to pass more info to `ErrorHandler`. Currently, as in this example, a new `BackgroundErrorReason` has to be added.
Test plan (dev server):
make check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6949
Reviewed By: anand1976
Differential Revision: D22026020
Pulled By: riversand963
fbshipit-source-id: f3c68a2ef45d9b505d0d625c7c5e0c88495b91c8
5 years ago
|
|
|
TEST_SYNC_POINT_CALLBACK(
|
|
|
|
"VersionSet::ProcessManifestWrites:AfterSyncManifest", &io_s);
|
|
|
|
}
|
Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
5 years ago
|
|
|
if (!io_s.ok()) {
|
|
|
|
s = io_s;
|
|
|
|
ROCKS_LOG_ERROR(db_options_->info_log, "MANIFEST write %s\n",
|
|
|
|
s.ToString().c_str());
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
// If we just created a new descriptor file, install it by writing a
|
|
|
|
// new CURRENT file that points to it.
|
|
|
|
if (s.ok() && new_descriptor_log) {
|
|
|
|
io_s = SetCurrentFile(fs_.get(), dbname_, pending_manifest_file_number_,
|
Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
5 years ago
|
|
|
db_directory);
|
|
|
|
if (!io_s.ok()) {
|
|
|
|
s = io_s;
|
|
|
|
}
|
|
|
|
TEST_SYNC_POINT("VersionSet::ProcessManifestWrites:AfterNewManifest");
|
|
|
|
}
|
|
|
|
|
|
|
|
if (s.ok()) {
|
|
|
|
// find offset in manifest file where this version is stored.
|
|
|
|
new_manifest_file_size = descriptor_log_->file()->GetFileSize();
|
|
|
|
}
|
|
|
|
|
|
|
|
if (first_writer.edit_list.front()->is_column_family_drop_) {
|
|
|
|
TEST_SYNC_POINT("VersionSet::LogAndApply::ColumnFamilyDrop:0");
|
LogAndApply() should fail if the column family has been dropped
Summary:
This patch finally fixes the ColumnFamilyTest.ReadDroppedColumnFamily test. The test has been failing very sporadically and it was hard to repro. However, I managed to write a new tests that reproes the failure deterministically.
Here's what happens:
1. We start the flush for the column family
2. We check if the column family was dropped here: https://github.com/facebook/rocksdb/blob/a3fc49bfddcdb1ff29409aacd06c04df56c7a1d7/db/flush_job.cc#L149
3. This check goes through, ends up in InstallMemtableFlushResults() and it goes into LogAndApply()
4. At about this time, we start dropping the column family. Dropping the column family process gets to LogAndApply() at about the same time as LogAndApply() from flush process
5. Drop column family goes through LogAndApply() first, marking the column family as dropped.
6. Flush process gets woken up and gets a chance to write to the MANIFEST. However, this is where it gets stuck: https://github.com/facebook/rocksdb/blob/a3fc49bfddcdb1ff29409aacd06c04df56c7a1d7/db/version_set.cc#L1975
7. We see that the column family was dropped, so there is no need to write to the MANIFEST. We return OK.
8. Flush gets OK back from LogAndApply() and it deletes the memtable, thinking that the data is now safely persisted to sst file.
The fix is pretty simple. Instead of OK, we return ShutdownInProgress. This is not really true, but we have been using this status code to also mean "this operation was canceled because the column family has been dropped".
The fix is only one LOC. All other code is related to tests. I added a new test that reproes the failure. I also moved SleepingBackgroundTask to util/testutil.h (because I needed it in column_family_test for my new test). There's plenty of other places where we reimplement SleepingBackgroundTask, but I'll address that in a separate commit.
Test Plan:
1. new test
2. make check
3. Make sure the ColumnFamilyTest.ReadDroppedColumnFamily doesn't fail on Travis: https://travis-ci.org/facebook/rocksdb/jobs/79952386
Reviewers: yhchiang, anthony, IslamAbdelRahman, kradhakrishnan, rven, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D46773
9 years ago
|
|
|
TEST_SYNC_POINT("VersionSet::LogAndApply::ColumnFamilyDrop:1");
|
|
|
|
TEST_SYNC_POINT("VersionSet::LogAndApply::ColumnFamilyDrop:2");
|
|
|
|
}
|
|
|
|
|
|
|
|
LogFlush(db_options_->info_log);
|
|
|
|
TEST_SYNC_POINT("VersionSet::LogAndApply:WriteManifestDone");
|
|
|
|
mu->Lock();
|
|
|
|
}
|
|
|
|
|
|
|
|
if (s.ok()) {
|
|
|
|
// Apply WAL edits, DB mutex must be held.
|
|
|
|
for (auto& e : batch_edits) {
|
|
|
|
if (e->IsWalAddition()) {
|
|
|
|
s = wals_.AddWals(e->GetWalAdditions());
|
|
|
|
} else if (e->IsWalDeletion()) {
|
|
|
|
s = wals_.DeleteWalsBefore(e->GetWalDeletion().GetLogNumber());
|
|
|
|
}
|
|
|
|
if (!s.ok()) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!io_s.ok()) {
|
|
|
|
if (io_status_.ok()) {
|
|
|
|
io_status_ = io_s;
|
|
|
|
}
|
|
|
|
} else if (!io_status_.ok()) {
|
|
|
|
io_status_ = io_s;
|
|
|
|
}
|
|
|
|
|
|
|
|
// Append the old manifest file to the obsolete_manifest_ list to be deleted
|
|
|
|
// by PurgeObsoleteFiles later.
|
|
|
|
if (s.ok() && new_descriptor_log) {
|
|
|
|
obsolete_manifests_.emplace_back(
|
|
|
|
DescriptorFileName("", manifest_file_number_));
|
|
|
|
}
|
|
|
|
|
|
|
|
// Install the new versions
|
|
|
|
if (s.ok()) {
|
|
|
|
if (first_writer.edit_list.front()->is_column_family_add_) {
|
|
|
|
assert(batch_edits.size() == 1);
|
|
|
|
assert(new_cf_options != nullptr);
|
|
|
|
CreateColumnFamily(*new_cf_options, first_writer.edit_list.front());
|
|
|
|
} else if (first_writer.edit_list.front()->is_column_family_drop_) {
|
|
|
|
assert(batch_edits.size() == 1);
|
|
|
|
first_writer.cfd->SetDropped();
|
|
|
|
first_writer.cfd->UnrefAndTryDelete();
|
|
|
|
} else {
|
|
|
|
// Each version in versions corresponds to a column family.
|
|
|
|
// For each column family, update its log number indicating that logs
|
|
|
|
// with number smaller than this should be ignored.
|
|
|
|
uint64_t last_min_log_number_to_keep = 0;
|
|
|
|
for (const auto& e : batch_edits) {
|
|
|
|
ColumnFamilyData* cfd = nullptr;
|
|
|
|
if (!e->IsColumnFamilyManipulation()) {
|
|
|
|
cfd = column_family_set_->GetColumnFamily(e->column_family_);
|
|
|
|
// e would not have been added to batch_edits if its corresponding
|
|
|
|
// column family is dropped.
|
|
|
|
assert(cfd);
|
|
|
|
}
|
|
|
|
if (cfd) {
|
|
|
|
if (e->has_log_number_ && e->log_number_ > cfd->GetLogNumber()) {
|
|
|
|
cfd->SetLogNumber(e->log_number_);
|
|
|
|
}
|
|
|
|
if (e->HasFullHistoryTsLow()) {
|
|
|
|
cfd->SetFullHistoryTsLow(e->GetFullHistoryTsLow());
|
|
|
|
}
|
|
|
|
}
|
Skip deleted WALs during recovery
Summary:
This patch record min log number to keep to the manifest while flushing SST files to ignore them and any WAL older than them during recovery. This is to avoid scenarios when we have a gap between the WAL files are fed to the recovery procedure. The gap could happen by for example out-of-order WAL deletion. Such gap could cause problems in 2PC recovery where the prepared and commit entry are placed into two separate WAL and gap in the WALs could result into not processing the WAL with the commit entry and hence breaking the 2PC recovery logic.
Before the commit, for 2PC case, we determined which log number to keep in FindObsoleteFiles(). We looked at the earliest logs with outstanding prepare entries, or prepare entries whose respective commit or abort are in memtable. With the commit, the same calculation is done while we apply the SST flush. Just before installing the flush file, we precompute the earliest log file to keep after the flush finishes using the same logic (but skipping the memtables just flushed), record this information to the manifest entry for this new flushed SST file. This pre-computed value is also remembered in memory, and will later be used to determine whether a log file can be deleted. This value is unlikely to change until next flush because the commit entry will stay in memtable. (In WritePrepared, we could have removed the older log files as soon as all prepared entries are committed. It's not yet done anyway. Even if we do it, the only thing we loss with this new approach is earlier log deletion between two flushes, which does not guarantee to happen anyway because the obsolete file clean-up function is only executed after flush or compaction)
This min log number to keep is stored in the manifest using the safely-ignore customized field of AddFile entry, in order to guarantee that the DB generated using newer release can be opened by previous releases no older than 4.2.
Closes https://github.com/facebook/rocksdb/pull/3765
Differential Revision: D7747618
Pulled By: siying
fbshipit-source-id: d00c92105b4f83852e9754a1b70d6b64cb590729
7 years ago
|
|
|
if (e->has_min_log_number_to_keep_) {
|
|
|
|
last_min_log_number_to_keep =
|
|
|
|
std::max(last_min_log_number_to_keep, e->min_log_number_to_keep_);
|
Skip deleted WALs during recovery
Summary:
This patch record min log number to keep to the manifest while flushing SST files to ignore them and any WAL older than them during recovery. This is to avoid scenarios when we have a gap between the WAL files are fed to the recovery procedure. The gap could happen by for example out-of-order WAL deletion. Such gap could cause problems in 2PC recovery where the prepared and commit entry are placed into two separate WAL and gap in the WALs could result into not processing the WAL with the commit entry and hence breaking the 2PC recovery logic.
Before the commit, for 2PC case, we determined which log number to keep in FindObsoleteFiles(). We looked at the earliest logs with outstanding prepare entries, or prepare entries whose respective commit or abort are in memtable. With the commit, the same calculation is done while we apply the SST flush. Just before installing the flush file, we precompute the earliest log file to keep after the flush finishes using the same logic (but skipping the memtables just flushed), record this information to the manifest entry for this new flushed SST file. This pre-computed value is also remembered in memory, and will later be used to determine whether a log file can be deleted. This value is unlikely to change until next flush because the commit entry will stay in memtable. (In WritePrepared, we could have removed the older log files as soon as all prepared entries are committed. It's not yet done anyway. Even if we do it, the only thing we loss with this new approach is earlier log deletion between two flushes, which does not guarantee to happen anyway because the obsolete file clean-up function is only executed after flush or compaction)
This min log number to keep is stored in the manifest using the safely-ignore customized field of AddFile entry, in order to guarantee that the DB generated using newer release can be opened by previous releases no older than 4.2.
Closes https://github.com/facebook/rocksdb/pull/3765
Differential Revision: D7747618
Pulled By: siying
fbshipit-source-id: d00c92105b4f83852e9754a1b70d6b64cb590729
7 years ago
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (last_min_log_number_to_keep != 0) {
|
Skip deleted WALs during recovery
Summary:
This patch record min log number to keep to the manifest while flushing SST files to ignore them and any WAL older than them during recovery. This is to avoid scenarios when we have a gap between the WAL files are fed to the recovery procedure. The gap could happen by for example out-of-order WAL deletion. Such gap could cause problems in 2PC recovery where the prepared and commit entry are placed into two separate WAL and gap in the WALs could result into not processing the WAL with the commit entry and hence breaking the 2PC recovery logic.
Before the commit, for 2PC case, we determined which log number to keep in FindObsoleteFiles(). We looked at the earliest logs with outstanding prepare entries, or prepare entries whose respective commit or abort are in memtable. With the commit, the same calculation is done while we apply the SST flush. Just before installing the flush file, we precompute the earliest log file to keep after the flush finishes using the same logic (but skipping the memtables just flushed), record this information to the manifest entry for this new flushed SST file. This pre-computed value is also remembered in memory, and will later be used to determine whether a log file can be deleted. This value is unlikely to change until next flush because the commit entry will stay in memtable. (In WritePrepared, we could have removed the older log files as soon as all prepared entries are committed. It's not yet done anyway. Even if we do it, the only thing we loss with this new approach is earlier log deletion between two flushes, which does not guarantee to happen anyway because the obsolete file clean-up function is only executed after flush or compaction)
This min log number to keep is stored in the manifest using the safely-ignore customized field of AddFile entry, in order to guarantee that the DB generated using newer release can be opened by previous releases no older than 4.2.
Closes https://github.com/facebook/rocksdb/pull/3765
Differential Revision: D7747618
Pulled By: siying
fbshipit-source-id: d00c92105b4f83852e9754a1b70d6b64cb590729
7 years ago
|
|
|
// Should only be set in 2PC mode.
|
|
|
|
MarkMinLogNumberToKeep2PC(last_min_log_number_to_keep);
|
Skip deleted WALs during recovery
Summary:
This patch record min log number to keep to the manifest while flushing SST files to ignore them and any WAL older than them during recovery. This is to avoid scenarios when we have a gap between the WAL files are fed to the recovery procedure. The gap could happen by for example out-of-order WAL deletion. Such gap could cause problems in 2PC recovery where the prepared and commit entry are placed into two separate WAL and gap in the WALs could result into not processing the WAL with the commit entry and hence breaking the 2PC recovery logic.
Before the commit, for 2PC case, we determined which log number to keep in FindObsoleteFiles(). We looked at the earliest logs with outstanding prepare entries, or prepare entries whose respective commit or abort are in memtable. With the commit, the same calculation is done while we apply the SST flush. Just before installing the flush file, we precompute the earliest log file to keep after the flush finishes using the same logic (but skipping the memtables just flushed), record this information to the manifest entry for this new flushed SST file. This pre-computed value is also remembered in memory, and will later be used to determine whether a log file can be deleted. This value is unlikely to change until next flush because the commit entry will stay in memtable. (In WritePrepared, we could have removed the older log files as soon as all prepared entries are committed. It's not yet done anyway. Even if we do it, the only thing we loss with this new approach is earlier log deletion between two flushes, which does not guarantee to happen anyway because the obsolete file clean-up function is only executed after flush or compaction)
This min log number to keep is stored in the manifest using the safely-ignore customized field of AddFile entry, in order to guarantee that the DB generated using newer release can be opened by previous releases no older than 4.2.
Closes https://github.com/facebook/rocksdb/pull/3765
Differential Revision: D7747618
Pulled By: siying
fbshipit-source-id: d00c92105b4f83852e9754a1b70d6b64cb590729
7 years ago
|
|
|
}
|
|
|
|
|
|
|
|
for (int i = 0; i < static_cast<int>(versions.size()); ++i) {
|
|
|
|
ColumnFamilyData* cfd = versions[i]->cfd_;
|
|
|
|
AppendVersion(cfd, versions[i]);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
manifest_file_number_ = pending_manifest_file_number_;
|
|
|
|
manifest_file_size_ = new_manifest_file_size;
|
|
|
|
prev_log_number_ = first_writer.edit_list.front()->prev_log_number_;
|
|
|
|
} else {
|
|
|
|
std::string version_edits;
|
|
|
|
for (auto& e : batch_edits) {
|
|
|
|
version_edits += ("\n" + e->DebugString(true));
|
|
|
|
}
|
|
|
|
ROCKS_LOG_ERROR(db_options_->info_log,
|
|
|
|
"Error in committing version edit to MANIFEST: %s",
|
|
|
|
version_edits.c_str());
|
|
|
|
for (auto v : versions) {
|
|
|
|
delete v;
|
|
|
|
}
|
|
|
|
// If manifest append failed for whatever reason, the file could be
|
|
|
|
// corrupted. So we need to force the next version update to start a
|
|
|
|
// new manifest file.
|
|
|
|
descriptor_log_.reset();
|
|
|
|
if (new_descriptor_log) {
|
|
|
|
ROCKS_LOG_INFO(db_options_->info_log,
|
|
|
|
"Deleting manifest %" PRIu64 " current manifest %" PRIu64
|
|
|
|
"\n",
|
|
|
|
pending_manifest_file_number_, manifest_file_number_);
|
|
|
|
Status manifest_del_status = env_->DeleteFile(
|
|
|
|
DescriptorFileName(dbname_, pending_manifest_file_number_));
|
|
|
|
if (!manifest_del_status.ok()) {
|
|
|
|
ROCKS_LOG_WARN(db_options_->info_log,
|
|
|
|
"Failed to delete manifest %" PRIu64 ": %s",
|
|
|
|
pending_manifest_file_number_,
|
|
|
|
manifest_del_status.ToString().c_str());
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
pending_manifest_file_number_ = 0;
|
|
|
|
|
|
|
|
// wake up all the waiting writers
|
|
|
|
while (true) {
|
|
|
|
ManifestWriter* ready = manifest_writers_.front();
|
|
|
|
manifest_writers_.pop_front();
|
|
|
|
bool need_signal = true;
|
|
|
|
for (const auto& w : writers) {
|
|
|
|
if (&w == ready) {
|
|
|
|
need_signal = false;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
ready->status = s;
|
|
|
|
ready->done = true;
|
|
|
|
if (ready->manifest_write_callback) {
|
|
|
|
(ready->manifest_write_callback)(s);
|
|
|
|
}
|
|
|
|
if (need_signal) {
|
|
|
|
ready->cv.Signal();
|
|
|
|
}
|
|
|
|
if (ready == last_writer) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (!manifest_writers_.empty()) {
|
|
|
|
manifest_writers_.front()->cv.Signal();
|
|
|
|
}
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
|
|
|
|
// 'datas' is gramatically incorrect. We still use this notation to indicate
|
|
|
|
// that this variable represents a collection of column_family_data.
|
|
|
|
Status VersionSet::LogAndApply(
|
|
|
|
const autovector<ColumnFamilyData*>& column_family_datas,
|
|
|
|
const autovector<const MutableCFOptions*>& mutable_cf_options_list,
|
|
|
|
const autovector<autovector<VersionEdit*>>& edit_lists,
|
|
|
|
InstrumentedMutex* mu, FSDirectory* db_directory, bool new_descriptor_log,
|
|
|
|
const ColumnFamilyOptions* new_cf_options,
|
|
|
|
const std::vector<std::function<void(const Status&)>>& manifest_wcbs) {
|
|
|
|
mu->AssertHeld();
|
|
|
|
int num_edits = 0;
|
|
|
|
for (const auto& elist : edit_lists) {
|
|
|
|
num_edits += static_cast<int>(elist.size());
|
|
|
|
}
|
|
|
|
if (num_edits == 0) {
|
|
|
|
return Status::OK();
|
|
|
|
} else if (num_edits > 1) {
|
|
|
|
#ifndef NDEBUG
|
|
|
|
for (const auto& edit_list : edit_lists) {
|
|
|
|
for (const auto& edit : edit_list) {
|
|
|
|
assert(!edit->IsColumnFamilyManipulation());
|
|
|
|
}
|
|
|
|
}
|
|
|
|
#endif /* ! NDEBUG */
|
|
|
|
}
|
|
|
|
|
|
|
|
int num_cfds = static_cast<int>(column_family_datas.size());
|
|
|
|
if (num_cfds == 1 && column_family_datas[0] == nullptr) {
|
|
|
|
assert(edit_lists.size() == 1 && edit_lists[0].size() == 1);
|
|
|
|
assert(edit_lists[0][0]->is_column_family_add_);
|
|
|
|
assert(new_cf_options != nullptr);
|
|
|
|
}
|
|
|
|
std::deque<ManifestWriter> writers;
|
|
|
|
if (num_cfds > 0) {
|
|
|
|
assert(static_cast<size_t>(num_cfds) == mutable_cf_options_list.size());
|
|
|
|
assert(static_cast<size_t>(num_cfds) == edit_lists.size());
|
|
|
|
}
|
|
|
|
for (int i = 0; i < num_cfds; ++i) {
|
|
|
|
const auto wcb =
|
|
|
|
manifest_wcbs.empty() ? [](const Status&) {} : manifest_wcbs[i];
|
|
|
|
writers.emplace_back(mu, column_family_datas[i],
|
|
|
|
*mutable_cf_options_list[i], edit_lists[i], wcb);
|
|
|
|
manifest_writers_.push_back(&writers[i]);
|
|
|
|
}
|
|
|
|
assert(!writers.empty());
|
|
|
|
ManifestWriter& first_writer = writers.front();
|
|
|
|
TEST_SYNC_POINT_CALLBACK("VersionSet::LogAndApply:BeforeWriterWaiting",
|
|
|
|
nullptr);
|
|
|
|
while (!first_writer.done && &first_writer != manifest_writers_.front()) {
|
|
|
|
first_writer.cv.Wait();
|
|
|
|
}
|
|
|
|
if (first_writer.done) {
|
|
|
|
// All non-CF-manipulation operations can be grouped together and committed
|
|
|
|
// to MANIFEST. They should all have finished. The status code is stored in
|
|
|
|
// the first manifest writer.
|
|
|
|
#ifndef NDEBUG
|
|
|
|
for (const auto& writer : writers) {
|
|
|
|
assert(writer.done);
|
|
|
|
}
|
|
|
|
TEST_SYNC_POINT_CALLBACK("VersionSet::LogAndApply:WakeUpAndDone", mu);
|
|
|
|
#endif /* !NDEBUG */
|
|
|
|
return first_writer.status;
|
|
|
|
}
|
|
|
|
|
|
|
|
int num_undropped_cfds = 0;
|
|
|
|
for (auto cfd : column_family_datas) {
|
|
|
|
// if cfd == nullptr, it is a column family add.
|
|
|
|
if (cfd == nullptr || !cfd->IsDropped()) {
|
|
|
|
++num_undropped_cfds;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (0 == num_undropped_cfds) {
|
|
|
|
for (int i = 0; i != num_cfds; ++i) {
|
|
|
|
manifest_writers_.pop_front();
|
|
|
|
}
|
|
|
|
// Notify new head of manifest write queue.
|
|
|
|
if (!manifest_writers_.empty()) {
|
|
|
|
manifest_writers_.front()->cv.Signal();
|
|
|
|
}
|
|
|
|
return Status::ColumnFamilyDropped();
|
|
|
|
}
|
|
|
|
|
|
|
|
return ProcessManifestWrites(writers, mu, db_directory, new_descriptor_log,
|
|
|
|
new_cf_options);
|
|
|
|
}
|
|
|
|
|
|
|
|
void VersionSet::LogAndApplyCFHelper(VersionEdit* edit) {
|
|
|
|
assert(edit->IsColumnFamilyManipulation());
|
|
|
|
edit->SetNextFile(next_file_number_.load());
|
|
|
|
// The log might have data that is not visible to memtbale and hence have not
|
|
|
|
// updated the last_sequence_ yet. It is also possible that the log has is
|
|
|
|
// expecting some new data that is not written yet. Since LastSequence is an
|
|
|
|
// upper bound on the sequence, it is ok to record
|
|
|
|
// last_allocated_sequence_ as the last sequence.
|
|
|
|
edit->SetLastSequence(db_options_->two_write_queues ? last_allocated_sequence_
|
|
|
|
: last_sequence_);
|
|
|
|
if (edit->is_column_family_drop_) {
|
|
|
|
// if we drop column family, we have to make sure to save max column family,
|
|
|
|
// so that we don't reuse existing ID
|
|
|
|
edit->SetMaxColumnFamily(column_family_set_->GetMaxColumnFamily());
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
Status VersionSet::LogAndApplyHelper(ColumnFamilyData* cfd,
|
|
|
|
VersionBuilder* builder, VersionEdit* edit,
|
|
|
|
InstrumentedMutex* mu) {
|
|
|
|
#ifdef NDEBUG
|
|
|
|
(void)cfd;
|
|
|
|
#endif
|
|
|
|
mu->AssertHeld();
|
|
|
|
assert(!edit->IsColumnFamilyManipulation());
|
|
|
|
|
|
|
|
if (edit->has_log_number_) {
|
|
|
|
assert(edit->log_number_ >= cfd->GetLogNumber());
|
|
|
|
assert(edit->log_number_ < next_file_number_.load());
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!edit->has_prev_log_number_) {
|
|
|
|
edit->SetPrevLogNumber(prev_log_number_);
|
|
|
|
}
|
|
|
|
edit->SetNextFile(next_file_number_.load());
|
|
|
|
// The log might have data that is not visible to memtbale and hence have not
|
|
|
|
// updated the last_sequence_ yet. It is also possible that the log has is
|
|
|
|
// expecting some new data that is not written yet. Since LastSequence is an
|
|
|
|
// upper bound on the sequence, it is ok to record
|
|
|
|
// last_allocated_sequence_ as the last sequence.
|
|
|
|
edit->SetLastSequence(db_options_->two_write_queues ? last_allocated_sequence_
|
|
|
|
: last_sequence_);
|
|
|
|
|
|
|
|
// The builder can be nullptr only if edit is WAL manipulation,
|
|
|
|
// because WAL edits do not need to be applied to versions,
|
|
|
|
// we return Status::OK() in this case.
|
|
|
|
assert(builder || edit->IsWalManipulation());
|
|
|
|
return builder ? builder->Apply(edit) : Status::OK();
|
|
|
|
}
|
|
|
|
|
|
|
|
Status VersionSet::ApplyOneVersionEditToBuilder(
|
|
|
|
VersionEdit& edit,
|
|
|
|
const std::unordered_map<std::string, ColumnFamilyOptions>& name_to_options,
|
|
|
|
std::unordered_map<int, std::string>& column_families_not_found,
|
|
|
|
std::unordered_map<uint32_t, std::unique_ptr<BaseReferencedVersionBuilder>>&
|
|
|
|
builders,
|
|
|
|
VersionEditParams* version_edit_params) {
|
|
|
|
// Not found means that user didn't supply that column
|
|
|
|
// family option AND we encountered column family add
|
|
|
|
// record. Once we encounter column family drop record,
|
|
|
|
// we will delete the column family from
|
|
|
|
// column_families_not_found.
|
|
|
|
bool cf_in_not_found = (column_families_not_found.find(edit.column_family_) !=
|
|
|
|
column_families_not_found.end());
|
|
|
|
// in builders means that user supplied that column family
|
|
|
|
// option AND that we encountered column family add record
|
|
|
|
bool cf_in_builders = builders.find(edit.column_family_) != builders.end();
|
|
|
|
|
|
|
|
// they can't both be true
|
|
|
|
assert(!(cf_in_not_found && cf_in_builders));
|
|
|
|
|
|
|
|
ColumnFamilyData* cfd = nullptr;
|
|
|
|
|
|
|
|
if (edit.is_column_family_add_) {
|
|
|
|
if (cf_in_builders || cf_in_not_found) {
|
|
|
|
return Status::Corruption(
|
|
|
|
"Manifest adding the same column family twice: " +
|
|
|
|
edit.column_family_name_);
|
|
|
|
}
|
|
|
|
auto cf_options = name_to_options.find(edit.column_family_name_);
|
|
|
|
// implicitly add persistent_stats column family without requiring user
|
|
|
|
// to specify
|
|
|
|
bool is_persistent_stats_column_family =
|
|
|
|
edit.column_family_name_.compare(kPersistentStatsColumnFamilyName) == 0;
|
|
|
|
if (cf_options == name_to_options.end() &&
|
|
|
|
!is_persistent_stats_column_family) {
|
|
|
|
column_families_not_found.insert(
|
|
|
|
{edit.column_family_, edit.column_family_name_});
|
|
|
|
} else {
|
|
|
|
// recover persistent_stats CF from a DB that already contains it
|
|
|
|
if (is_persistent_stats_column_family) {
|
|
|
|
ColumnFamilyOptions cfo;
|
|
|
|
OptimizeForPersistentStats(&cfo);
|
|
|
|
cfd = CreateColumnFamily(cfo, &edit);
|
|
|
|
} else {
|
|
|
|
cfd = CreateColumnFamily(cf_options->second, &edit);
|
|
|
|
}
|
|
|
|
cfd->set_initialized();
|
|
|
|
builders.insert(std::make_pair(
|
|
|
|
edit.column_family_, std::unique_ptr<BaseReferencedVersionBuilder>(
|
|
|
|
new BaseReferencedVersionBuilder(cfd))));
|
|
|
|
}
|
|
|
|
} else if (edit.is_column_family_drop_) {
|
|
|
|
if (cf_in_builders) {
|
|
|
|
auto builder = builders.find(edit.column_family_);
|
|
|
|
assert(builder != builders.end());
|
|
|
|
builders.erase(builder);
|
|
|
|
cfd = column_family_set_->GetColumnFamily(edit.column_family_);
|
|
|
|
assert(cfd != nullptr);
|
|
|
|
if (cfd->UnrefAndTryDelete()) {
|
|
|
|
cfd = nullptr;
|
|
|
|
} else {
|
|
|
|
// who else can have reference to cfd!?
|
|
|
|
assert(false);
|
|
|
|
}
|
|
|
|
} else if (cf_in_not_found) {
|
|
|
|
column_families_not_found.erase(edit.column_family_);
|
|
|
|
} else {
|
|
|
|
return Status::Corruption(
|
|
|
|
"Manifest - dropping non-existing column family");
|
|
|
|
}
|
|
|
|
} else if (edit.IsWalAddition()) {
|
|
|
|
Status s = wals_.AddWals(edit.GetWalAdditions());
|
|
|
|
if (!s.ok()) {
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
} else if (edit.IsWalDeletion()) {
|
|
|
|
Status s = wals_.DeleteWalsBefore(edit.GetWalDeletion().GetLogNumber());
|
|
|
|
if (!s.ok()) {
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
} else if (!cf_in_not_found) {
|
|
|
|
if (!cf_in_builders) {
|
|
|
|
return Status::Corruption(
|
|
|
|
"Manifest record referencing unknown column family");
|
|
|
|
}
|
|
|
|
|
|
|
|
cfd = column_family_set_->GetColumnFamily(edit.column_family_);
|
|
|
|
// this should never happen since cf_in_builders is true
|
|
|
|
assert(cfd != nullptr);
|
|
|
|
|
|
|
|
// if it is not column family add or column family drop,
|
|
|
|
// then it's a file add/delete, which should be forwarded
|
|
|
|
// to builder
|
|
|
|
auto builder = builders.find(edit.column_family_);
|
|
|
|
assert(builder != builders.end());
|
|
|
|
Status s = builder->second->version_builder()->Apply(&edit);
|
|
|
|
if (!s.ok()) {
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return ExtractInfoFromVersionEdit(cfd, edit, version_edit_params);
|
|
|
|
}
|
|
|
|
|
|
|
|
Status VersionSet::ExtractInfoFromVersionEdit(
|
|
|
|
ColumnFamilyData* cfd, const VersionEdit& from_edit,
|
|
|
|
VersionEditParams* version_edit_params) {
|
|
|
|
if (cfd != nullptr) {
|
|
|
|
if (from_edit.has_db_id_) {
|
|
|
|
version_edit_params->SetDBId(from_edit.db_id_);
|
|
|
|
}
|
|
|
|
if (from_edit.has_log_number_) {
|
|
|
|
if (cfd->GetLogNumber() > from_edit.log_number_) {
|
|
|
|
ROCKS_LOG_WARN(
|
|
|
|
db_options_->info_log,
|
|
|
|
"MANIFEST corruption detected, but ignored - Log numbers in "
|
|
|
|
"records NOT monotonically increasing");
|
|
|
|
} else {
|
|
|
|
cfd->SetLogNumber(from_edit.log_number_);
|
|
|
|
version_edit_params->SetLogNumber(from_edit.log_number_);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (from_edit.has_comparator_ &&
|
|
|
|
from_edit.comparator_ != cfd->user_comparator()->Name()) {
|
|
|
|
return Status::InvalidArgument(
|
|
|
|
cfd->user_comparator()->Name(),
|
|
|
|
"does not match existing comparator " + from_edit.comparator_);
|
|
|
|
}
|
|
|
|
if (from_edit.HasFullHistoryTsLow()) {
|
|
|
|
const std::string& new_ts = from_edit.GetFullHistoryTsLow();
|
|
|
|
cfd->SetFullHistoryTsLow(new_ts);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (from_edit.has_prev_log_number_) {
|
|
|
|
version_edit_params->SetPrevLogNumber(from_edit.prev_log_number_);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (from_edit.has_next_file_number_) {
|
|
|
|
version_edit_params->SetNextFile(from_edit.next_file_number_);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (from_edit.has_max_column_family_) {
|
|
|
|
version_edit_params->SetMaxColumnFamily(from_edit.max_column_family_);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (from_edit.has_min_log_number_to_keep_) {
|
|
|
|
version_edit_params->min_log_number_to_keep_ =
|
|
|
|
std::max(version_edit_params->min_log_number_to_keep_,
|
|
|
|
from_edit.min_log_number_to_keep_);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (from_edit.has_last_sequence_) {
|
|
|
|
version_edit_params->SetLastSequence(from_edit.last_sequence_);
|
|
|
|
}
|
|
|
|
return Status::OK();
|
|
|
|
}
|
|
|
|
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
|
|
|
Status VersionSet::GetCurrentManifestPath(const std::string& dbname,
|
|
|
|
FileSystem* fs,
|
|
|
|
std::string* manifest_path,
|
|
|
|
uint64_t* manifest_file_number) {
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
|
|
|
assert(fs != nullptr);
|
|
|
|
assert(manifest_path != nullptr);
|
|
|
|
assert(manifest_file_number != nullptr);
|
|
|
|
|
|
|
|
std::string fname;
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
|
|
|
Status s = ReadFileToString(fs, CurrentFileName(dbname), &fname);
|
|
|
|
if (!s.ok()) {
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
if (fname.empty() || fname.back() != '\n') {
|
|
|
|
return Status::Corruption("CURRENT file does not end with newline");
|
|
|
|
}
|
|
|
|
// remove the trailing '\n'
|
|
|
|
fname.resize(fname.size() - 1);
|
|
|
|
FileType type;
|
|
|
|
bool parse_ok = ParseFileName(fname, manifest_file_number, &type);
|
|
|
|
if (!parse_ok || type != kDescriptorFile) {
|
|
|
|
return Status::Corruption("CURRENT file corrupted");
|
|
|
|
}
|
|
|
|
*manifest_path = dbname;
|
|
|
|
if (dbname.back() != '/') {
|
|
|
|
manifest_path->push_back('/');
|
|
|
|
}
|
|
|
|
manifest_path->append(fname);
|
|
|
|
return Status::OK();
|
|
|
|
}
|
|
|
|
|
|
|
|
Status VersionSet::ReadAndRecover(
|
|
|
|
log::Reader& reader, AtomicGroupReadBuffer* read_buffer,
|
|
|
|
const std::unordered_map<std::string, ColumnFamilyOptions>& name_to_options,
|
|
|
|
std::unordered_map<int, std::string>& column_families_not_found,
|
|
|
|
std::unordered_map<uint32_t, std::unique_ptr<BaseReferencedVersionBuilder>>&
|
|
|
|
builders,
|
|
|
|
Status* log_read_status, VersionEditParams* version_edit_params,
|
|
|
|
std::string* db_id) {
|
|
|
|
assert(read_buffer != nullptr);
|
|
|
|
assert(log_read_status != nullptr);
|
|
|
|
Status s;
|
|
|
|
Slice record;
|
|
|
|
std::string scratch;
|
|
|
|
size_t recovered_edits = 0;
|
|
|
|
while (s.ok() && reader.ReadRecord(&record, &scratch) &&
|
|
|
|
log_read_status->ok()) {
|
|
|
|
VersionEdit edit;
|
|
|
|
s = edit.DecodeFrom(record);
|
|
|
|
if (!s.ok()) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (edit.has_db_id_) {
|
|
|
|
db_id_ = edit.GetDbId();
|
|
|
|
if (db_id != nullptr) {
|
|
|
|
db_id->assign(edit.GetDbId());
|
|
|
|
}
|
|
|
|
}
|
|
|
|
s = read_buffer->AddEdit(&edit);
|
|
|
|
if (!s.ok()) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
if (edit.is_in_atomic_group_) {
|
|
|
|
if (read_buffer->IsFull()) {
|
|
|
|
// Apply edits in an atomic group when we have read all edits in the
|
|
|
|
// group.
|
|
|
|
for (auto& e : read_buffer->replay_buffer()) {
|
|
|
|
s = ApplyOneVersionEditToBuilder(e, name_to_options,
|
|
|
|
column_families_not_found, builders,
|
|
|
|
version_edit_params);
|
|
|
|
if (!s.ok()) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
recovered_edits++;
|
|
|
|
}
|
|
|
|
if (!s.ok()) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
read_buffer->Clear();
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
// Apply a normal edit immediately.
|
|
|
|
s = ApplyOneVersionEditToBuilder(edit, name_to_options,
|
|
|
|
column_families_not_found, builders,
|
|
|
|
version_edit_params);
|
|
|
|
if (s.ok()) {
|
|
|
|
recovered_edits++;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (!log_read_status->ok()) {
|
|
|
|
s = *log_read_status;
|
|
|
|
}
|
|
|
|
if (!s.ok()) {
|
|
|
|
// Clear the buffer if we fail to decode/apply an edit.
|
|
|
|
read_buffer->Clear();
|
|
|
|
}
|
|
|
|
TEST_SYNC_POINT_CALLBACK("VersionSet::ReadAndRecover:RecoveredEdits",
|
|
|
|
&recovered_edits);
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
|
|
|
|
Status VersionSet::Recover(
|
|
|
|
const std::vector<ColumnFamilyDescriptor>& column_families, bool read_only,
|
|
|
|
std::string* db_id) {
|
|
|
|
// Read "CURRENT" file, which contains a pointer to the current manifest file
|
|
|
|
std::string manifest_path;
|
|
|
|
Status s = GetCurrentManifestPath(dbname_, fs_.get(), &manifest_path,
|
|
|
|
&manifest_file_number_);
|
|
|
|
if (!s.ok()) {
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
|
|
|
|
ROCKS_LOG_INFO(db_options_->info_log, "Recovering from manifest file: %s\n",
|
|
|
|
manifest_path.c_str());
|
|
|
|
|
|
|
|
std::unique_ptr<SequentialFileReader> manifest_file_reader;
|
Move rate_limiter, write buffering, most perf context instrumentation and most random kill out of Env
Summary: We want to keep Env a think layer for better portability. Less platform dependent codes should be moved out of Env. In this patch, I create a wrapper of file readers and writers, and put rate limiting, write buffering, as well as most perf context instrumentation and random kill out of Env. It will make it easier to maintain multiple Env in the future.
Test Plan: Run all existing unit tests.
Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D42321
10 years ago
|
|
|
{
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
|
|
|
std::unique_ptr<FSSequentialFile> manifest_file;
|
|
|
|
s = fs_->NewSequentialFile(manifest_path,
|
|
|
|
fs_->OptimizeForManifestRead(file_options_),
|
|
|
|
&manifest_file, nullptr);
|
Move rate_limiter, write buffering, most perf context instrumentation and most random kill out of Env
Summary: We want to keep Env a think layer for better portability. Less platform dependent codes should be moved out of Env. In this patch, I create a wrapper of file readers and writers, and put rate limiting, write buffering, as well as most perf context instrumentation and random kill out of Env. It will make it easier to maintain multiple Env in the future.
Test Plan: Run all existing unit tests.
Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D42321
10 years ago
|
|
|
if (!s.ok()) {
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
manifest_file_reader.reset(
|
|
|
|
new SequentialFileReader(std::move(manifest_file), manifest_path,
|
|
|
|
db_options_->log_readahead_size, io_tracer_));
|
|
|
|
}
|
|
|
|
uint64_t current_manifest_file_size = 0;
|
|
|
|
uint64_t log_number = 0;
|
|
|
|
{
|
|
|
|
VersionSet::LogReporter reporter;
|
|
|
|
Status log_read_status;
|
|
|
|
reporter.status = &log_read_status;
|
|
|
|
log::Reader reader(nullptr, std::move(manifest_file_reader), &reporter,
|
|
|
|
true /* checksum */, 0 /* log_number */);
|
|
|
|
VersionEditHandler handler(
|
|
|
|
read_only, column_families, const_cast<VersionSet*>(this),
|
|
|
|
/*track_missing_files=*/false,
|
|
|
|
/*no_error_if_table_files_missing=*/false, io_tracer_);
|
|
|
|
handler.Iterate(reader, &log_read_status);
|
|
|
|
s = handler.status();
|
|
|
|
if (s.ok()) {
|
|
|
|
log_number = handler.GetVersionEditParams().log_number_;
|
|
|
|
current_manifest_file_size = reader.GetReadOffset();
|
|
|
|
assert(current_manifest_file_size != 0);
|
|
|
|
handler.GetDbId(db_id);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (s.ok()) {
|
|
|
|
manifest_file_size_ = current_manifest_file_size;
|
|
|
|
ROCKS_LOG_INFO(
|
|
|
|
db_options_->info_log,
|
|
|
|
"Recovered from manifest file:%s succeeded,"
|
|
|
|
"manifest_file_number is %" PRIu64 ", next_file_number is %" PRIu64
|
|
|
|
", last_sequence is %" PRIu64 ", log_number is %" PRIu64
|
|
|
|
",prev_log_number is %" PRIu64 ",max_column_family is %" PRIu32
|
|
|
|
",min_log_number_to_keep is %" PRIu64 "\n",
|
|
|
|
manifest_path.c_str(), manifest_file_number_, next_file_number_.load(),
|
|
|
|
last_sequence_.load(), log_number, prev_log_number_,
|
|
|
|
column_family_set_->GetMaxColumnFamily(), min_log_number_to_keep_2pc());
|
|
|
|
|
|
|
|
for (auto cfd : *column_family_set_) {
|
|
|
|
if (cfd->IsDropped()) {
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
ROCKS_LOG_INFO(db_options_->info_log,
|
|
|
|
"Column family [%s] (ID %" PRIu32
|
|
|
|
"), log number is %" PRIu64 "\n",
|
|
|
|
cfd->GetName().c_str(), cfd->GetID(), cfd->GetLogNumber());
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
|
|
|
|
namespace {
|
|
|
|
class ManifestPicker {
|
|
|
|
public:
|
|
|
|
explicit ManifestPicker(const std::string& dbname,
|
|
|
|
const std::vector<std::string>& files_in_dbname);
|
|
|
|
// REQUIRES Valid() == true
|
|
|
|
std::string GetNextManifest(uint64_t* file_number, std::string* file_name);
|
|
|
|
bool Valid() const { return manifest_file_iter_ != manifest_files_.end(); }
|
|
|
|
|
|
|
|
private:
|
|
|
|
const std::string& dbname_;
|
|
|
|
// MANIFEST file names(s)
|
|
|
|
std::vector<std::string> manifest_files_;
|
|
|
|
std::vector<std::string>::const_iterator manifest_file_iter_;
|
|
|
|
};
|
|
|
|
|
|
|
|
ManifestPicker::ManifestPicker(const std::string& dbname,
|
|
|
|
const std::vector<std::string>& files_in_dbname)
|
|
|
|
: dbname_(dbname) {
|
|
|
|
// populate manifest files
|
|
|
|
assert(!files_in_dbname.empty());
|
|
|
|
for (const auto& fname : files_in_dbname) {
|
|
|
|
uint64_t file_num = 0;
|
|
|
|
FileType file_type;
|
|
|
|
bool parse_ok = ParseFileName(fname, &file_num, &file_type);
|
|
|
|
if (parse_ok && file_type == kDescriptorFile) {
|
|
|
|
manifest_files_.push_back(fname);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
// seek to first manifest
|
|
|
|
std::sort(manifest_files_.begin(), manifest_files_.end(),
|
|
|
|
[](const std::string& lhs, const std::string& rhs) {
|
|
|
|
uint64_t num1 = 0;
|
|
|
|
uint64_t num2 = 0;
|
|
|
|
FileType type1;
|
|
|
|
FileType type2;
|
|
|
|
bool parse_ok1 = ParseFileName(lhs, &num1, &type1);
|
|
|
|
bool parse_ok2 = ParseFileName(rhs, &num2, &type2);
|
|
|
|
#ifndef NDEBUG
|
|
|
|
assert(parse_ok1);
|
|
|
|
assert(parse_ok2);
|
|
|
|
#else
|
|
|
|
(void)parse_ok1;
|
|
|
|
(void)parse_ok2;
|
|
|
|
#endif
|
|
|
|
return num1 > num2;
|
|
|
|
});
|
|
|
|
manifest_file_iter_ = manifest_files_.begin();
|
|
|
|
}
|
|
|
|
|
|
|
|
std::string ManifestPicker::GetNextManifest(uint64_t* number,
|
|
|
|
std::string* file_name) {
|
|
|
|
assert(Valid());
|
|
|
|
std::string ret;
|
|
|
|
if (manifest_file_iter_ != manifest_files_.end()) {
|
|
|
|
ret.assign(dbname_);
|
|
|
|
if (ret.back() != kFilePathSeparator) {
|
|
|
|
ret.push_back(kFilePathSeparator);
|
|
|
|
}
|
|
|
|
ret.append(*manifest_file_iter_);
|
|
|
|
if (number) {
|
|
|
|
FileType type;
|
|
|
|
bool parse = ParseFileName(*manifest_file_iter_, number, &type);
|
|
|
|
assert(type == kDescriptorFile);
|
|
|
|
#ifndef NDEBUG
|
|
|
|
assert(parse);
|
|
|
|
#else
|
|
|
|
(void)parse;
|
|
|
|
#endif
|
|
|
|
}
|
|
|
|
if (file_name) {
|
|
|
|
*file_name = *manifest_file_iter_;
|
|
|
|
}
|
|
|
|
++manifest_file_iter_;
|
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
} // namespace
|
|
|
|
|
|
|
|
Status VersionSet::TryRecover(
|
|
|
|
const std::vector<ColumnFamilyDescriptor>& column_families, bool read_only,
|
|
|
|
const std::vector<std::string>& files_in_dbname, std::string* db_id,
|
|
|
|
bool* has_missing_table_file) {
|
|
|
|
ManifestPicker manifest_picker(dbname_, files_in_dbname);
|
|
|
|
if (!manifest_picker.Valid()) {
|
|
|
|
return Status::Corruption("Cannot locate MANIFEST file in " + dbname_);
|
|
|
|
}
|
|
|
|
Status s;
|
|
|
|
std::string manifest_path =
|
|
|
|
manifest_picker.GetNextManifest(&manifest_file_number_, nullptr);
|
|
|
|
while (!manifest_path.empty()) {
|
|
|
|
s = TryRecoverFromOneManifest(manifest_path, column_families, read_only,
|
|
|
|
db_id, has_missing_table_file);
|
|
|
|
if (s.ok() || !manifest_picker.Valid()) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
Reset();
|
|
|
|
manifest_path =
|
|
|
|
manifest_picker.GetNextManifest(&manifest_file_number_, nullptr);
|
|
|
|
}
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
|
|
|
|
Status VersionSet::TryRecoverFromOneManifest(
|
|
|
|
const std::string& manifest_path,
|
|
|
|
const std::vector<ColumnFamilyDescriptor>& column_families, bool read_only,
|
|
|
|
std::string* db_id, bool* has_missing_table_file) {
|
|
|
|
ROCKS_LOG_INFO(db_options_->info_log, "Trying to recover from manifest: %s\n",
|
|
|
|
manifest_path.c_str());
|
|
|
|
std::unique_ptr<SequentialFileReader> manifest_file_reader;
|
|
|
|
Status s;
|
|
|
|
{
|
|
|
|
std::unique_ptr<FSSequentialFile> manifest_file;
|
|
|
|
s = fs_->NewSequentialFile(manifest_path,
|
|
|
|
fs_->OptimizeForManifestRead(file_options_),
|
|
|
|
&manifest_file, nullptr);
|
|
|
|
if (!s.ok()) {
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
manifest_file_reader.reset(
|
|
|
|
new SequentialFileReader(std::move(manifest_file), manifest_path,
|
|
|
|
db_options_->log_readahead_size, io_tracer_));
|
|
|
|
}
|
|
|
|
|
|
|
|
assert(s.ok());
|
|
|
|
VersionSet::LogReporter reporter;
|
|
|
|
reporter.status = &s;
|
|
|
|
log::Reader reader(nullptr, std::move(manifest_file_reader), &reporter,
|
|
|
|
/*checksum=*/true, /*log_num=*/0);
|
|
|
|
VersionEditHandlerPointInTime handler_pit(
|
|
|
|
read_only, column_families, const_cast<VersionSet*>(this), io_tracer_);
|
|
|
|
|
|
|
|
handler_pit.Iterate(reader, &s);
|
|
|
|
|
|
|
|
handler_pit.GetDbId(db_id);
|
|
|
|
|
|
|
|
assert(nullptr != has_missing_table_file);
|
|
|
|
*has_missing_table_file = handler_pit.HasMissingFiles();
|
|
|
|
|
|
|
|
return handler_pit.status();
|
|
|
|
}
|
|
|
|
|
|
|
|
Status VersionSet::ListColumnFamilies(std::vector<std::string>* column_families,
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
|
|
|
const std::string& dbname,
|
|
|
|
FileSystem* fs) {
|
|
|
|
// these are just for performance reasons, not correcntes,
|
|
|
|
// so we're fine using the defaults
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
|
|
|
FileOptions soptions;
|
|
|
|
// Read "CURRENT" file, which contains a pointer to the current manifest file
|
|
|
|
std::string manifest_path;
|
|
|
|
uint64_t manifest_file_number;
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
|
|
|
Status s =
|
|
|
|
GetCurrentManifestPath(dbname, fs, &manifest_path, &manifest_file_number);
|
|
|
|
if (!s.ok()) {
|
|
|
|
return s;
|
|
|
|
}
|
Move rate_limiter, write buffering, most perf context instrumentation and most random kill out of Env
Summary: We want to keep Env a think layer for better portability. Less platform dependent codes should be moved out of Env. In this patch, I create a wrapper of file readers and writers, and put rate limiting, write buffering, as well as most perf context instrumentation and random kill out of Env. It will make it easier to maintain multiple Env in the future.
Test Plan: Run all existing unit tests.
Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D42321
10 years ago
|
|
|
|
|
|
|
std::unique_ptr<SequentialFileReader> file_reader;
|
Move rate_limiter, write buffering, most perf context instrumentation and most random kill out of Env
Summary: We want to keep Env a think layer for better portability. Less platform dependent codes should be moved out of Env. In this patch, I create a wrapper of file readers and writers, and put rate limiting, write buffering, as well as most perf context instrumentation and random kill out of Env. It will make it easier to maintain multiple Env in the future.
Test Plan: Run all existing unit tests.
Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D42321
10 years ago
|
|
|
{
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
|
|
|
std::unique_ptr<FSSequentialFile> file;
|
|
|
|
s = fs->NewSequentialFile(manifest_path, soptions, &file, nullptr);
|
|
|
|
if (!s.ok()) {
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
file_reader.reset(new SequentialFileReader(std::move(file), manifest_path,
|
|
|
|
nullptr /*IOTracer*/));
|
Move rate_limiter, write buffering, most perf context instrumentation and most random kill out of Env
Summary: We want to keep Env a think layer for better portability. Less platform dependent codes should be moved out of Env. In this patch, I create a wrapper of file readers and writers, and put rate limiting, write buffering, as well as most perf context instrumentation and random kill out of Env. It will make it easier to maintain multiple Env in the future.
Test Plan: Run all existing unit tests.
Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D42321
10 years ago
|
|
|
}
|
|
|
|
|
|
|
|
VersionSet::LogReporter reporter;
|
|
|
|
reporter.status = &s;
|
|
|
|
log::Reader reader(nullptr, std::move(file_reader), &reporter,
|
|
|
|
true /* checksum */, 0 /* log_number */);
|
|
|
|
|
|
|
|
ListColumnFamiliesHandler handler;
|
|
|
|
handler.Iterate(reader, &s);
|
|
|
|
|
|
|
|
assert(column_families);
|
|
|
|
column_families->clear();
|
|
|
|
if (handler.status().ok()) {
|
|
|
|
for (const auto& iter : handler.GetColumnFamilyNames()) {
|
|
|
|
column_families->push_back(iter.second);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return handler.status();
|
|
|
|
}
|
|
|
|
|
|
|
|
#ifndef ROCKSDB_LITE
|
Make VersionSet::ReduceNumberOfLevels() static
Summary:
A lot of our code implicitly assumes number_levels to be static. ReduceNumberOfLevels() breaks that assumption. For example, after calling ReduceNumberOfLevels(), DBImpl::NumberLevels() will be different from VersionSet::NumberLevels(). This is dangerous. Thankfully, it's not in public headers and is only used from LDB cmd tool. LDB tool is only using it statically, i.e. it never calls it with running DB instance. With this diff, we make it explicitly static. This way, we can assume number_levels to be immutable and not break assumption that lot of our code is relying upon. LDB tool can still use the method.
Also, I removed the method from a separate file since it breaks filename completition. version_se<TAB> now completes to "version_set." instead of "version_set" (without the dot). I don't see a big reason that the function should be in a different file.
Test Plan: reduce_levels_test
Reviewers: dhruba, haobo, kailiu, sdong
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15303
11 years ago
|
|
|
Status VersionSet::ReduceNumberOfLevels(const std::string& dbname,
|
|
|
|
const Options* options,
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
|
|
|
const FileOptions& file_options,
|
Make VersionSet::ReduceNumberOfLevels() static
Summary:
A lot of our code implicitly assumes number_levels to be static. ReduceNumberOfLevels() breaks that assumption. For example, after calling ReduceNumberOfLevels(), DBImpl::NumberLevels() will be different from VersionSet::NumberLevels(). This is dangerous. Thankfully, it's not in public headers and is only used from LDB cmd tool. LDB tool is only using it statically, i.e. it never calls it with running DB instance. With this diff, we make it explicitly static. This way, we can assume number_levels to be immutable and not break assumption that lot of our code is relying upon. LDB tool can still use the method.
Also, I removed the method from a separate file since it breaks filename completition. version_se<TAB> now completes to "version_set." instead of "version_set" (without the dot). I don't see a big reason that the function should be in a different file.
Test Plan: reduce_levels_test
Reviewers: dhruba, haobo, kailiu, sdong
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15303
11 years ago
|
|
|
int new_levels) {
|
|
|
|
if (new_levels <= 1) {
|
|
|
|
return Status::InvalidArgument(
|
|
|
|
"Number of levels needs to be bigger than 1");
|
|
|
|
}
|
|
|
|
|
|
|
|
ImmutableDBOptions db_options(*options);
|
[CF] Rethink table cache
Summary:
Adapting table cache to column families is interesting. We want table cache to be global LRU, so if some column families are use not as often as others, we want them to be evicted from cache. However, current TableCache object also constructs tables on its own. If table is not found in the cache, TableCache automatically creates new table. We want each column family to be able to specify different table factory.
To solve the problem, we still have a single LRU, but we provide the LRUCache object to TableCache on construction. We have one TableCache per column family, but the underyling cache is shared by all TableCache objects.
This allows us to have a global LRU, but still be able to support different table factories for different column families. Also, in the future it will also be able to support different directories for different column families.
Test Plan: make check
Reviewers: dhruba, haobo, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15915
11 years ago
|
|
|
ColumnFamilyOptions cf_options(*options);
|
|
|
|
std::shared_ptr<Cache> tc(NewLRUCache(options->max_open_files - 10,
|
|
|
|
options->table_cache_numshardbits));
|
|
|
|
WriteController wc(options->delayed_write_rate);
|
|
|
|
WriteBufferManager wb(options->db_write_buffer_size);
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
|
|
|
VersionSet versions(dbname, &db_options, file_options, tc.get(), &wb, &wc,
|
|
|
|
nullptr /*BlockCacheTracer*/, nullptr /*IOTracer*/);
|
Make VersionSet::ReduceNumberOfLevels() static
Summary:
A lot of our code implicitly assumes number_levels to be static. ReduceNumberOfLevels() breaks that assumption. For example, after calling ReduceNumberOfLevels(), DBImpl::NumberLevels() will be different from VersionSet::NumberLevels(). This is dangerous. Thankfully, it's not in public headers and is only used from LDB cmd tool. LDB tool is only using it statically, i.e. it never calls it with running DB instance. With this diff, we make it explicitly static. This way, we can assume number_levels to be immutable and not break assumption that lot of our code is relying upon. LDB tool can still use the method.
Also, I removed the method from a separate file since it breaks filename completition. version_se<TAB> now completes to "version_set." instead of "version_set" (without the dot). I don't see a big reason that the function should be in a different file.
Test Plan: reduce_levels_test
Reviewers: dhruba, haobo, kailiu, sdong
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15303
11 years ago
|
|
|
Status status;
|
|
|
|
|
|
|
|
std::vector<ColumnFamilyDescriptor> dummy;
|
|
|
|
ColumnFamilyDescriptor dummy_descriptor(kDefaultColumnFamilyName,
|
|
|
|
ColumnFamilyOptions(*options));
|
|
|
|
dummy.push_back(dummy_descriptor);
|
|
|
|
status = versions.Recover(dummy);
|
Make VersionSet::ReduceNumberOfLevels() static
Summary:
A lot of our code implicitly assumes number_levels to be static. ReduceNumberOfLevels() breaks that assumption. For example, after calling ReduceNumberOfLevels(), DBImpl::NumberLevels() will be different from VersionSet::NumberLevels(). This is dangerous. Thankfully, it's not in public headers and is only used from LDB cmd tool. LDB tool is only using it statically, i.e. it never calls it with running DB instance. With this diff, we make it explicitly static. This way, we can assume number_levels to be immutable and not break assumption that lot of our code is relying upon. LDB tool can still use the method.
Also, I removed the method from a separate file since it breaks filename completition. version_se<TAB> now completes to "version_set." instead of "version_set" (without the dot). I don't see a big reason that the function should be in a different file.
Test Plan: reduce_levels_test
Reviewers: dhruba, haobo, kailiu, sdong
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15303
11 years ago
|
|
|
if (!status.ok()) {
|
|
|
|
return status;
|
|
|
|
}
|
|
|
|
|
|
|
|
Version* current_version =
|
|
|
|
versions.GetColumnFamilySet()->GetDefault()->current();
|
|
|
|
auto* vstorage = current_version->storage_info();
|
|
|
|
int current_levels = vstorage->num_levels();
|
Make VersionSet::ReduceNumberOfLevels() static
Summary:
A lot of our code implicitly assumes number_levels to be static. ReduceNumberOfLevels() breaks that assumption. For example, after calling ReduceNumberOfLevels(), DBImpl::NumberLevels() will be different from VersionSet::NumberLevels(). This is dangerous. Thankfully, it's not in public headers and is only used from LDB cmd tool. LDB tool is only using it statically, i.e. it never calls it with running DB instance. With this diff, we make it explicitly static. This way, we can assume number_levels to be immutable and not break assumption that lot of our code is relying upon. LDB tool can still use the method.
Also, I removed the method from a separate file since it breaks filename completition. version_se<TAB> now completes to "version_set." instead of "version_set" (without the dot). I don't see a big reason that the function should be in a different file.
Test Plan: reduce_levels_test
Reviewers: dhruba, haobo, kailiu, sdong
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15303
11 years ago
|
|
|
|
|
|
|
if (current_levels <= new_levels) {
|
|
|
|
return Status::OK();
|
|
|
|
}
|
|
|
|
|
|
|
|
// Make sure there are file only on one level from
|
|
|
|
// (new_levels-1) to (current_levels-1)
|
|
|
|
int first_nonempty_level = -1;
|
|
|
|
int first_nonempty_level_filenum = 0;
|
|
|
|
for (int i = new_levels - 1; i < current_levels; i++) {
|
|
|
|
int file_num = vstorage->NumLevelFiles(i);
|
Make VersionSet::ReduceNumberOfLevels() static
Summary:
A lot of our code implicitly assumes number_levels to be static. ReduceNumberOfLevels() breaks that assumption. For example, after calling ReduceNumberOfLevels(), DBImpl::NumberLevels() will be different from VersionSet::NumberLevels(). This is dangerous. Thankfully, it's not in public headers and is only used from LDB cmd tool. LDB tool is only using it statically, i.e. it never calls it with running DB instance. With this diff, we make it explicitly static. This way, we can assume number_levels to be immutable and not break assumption that lot of our code is relying upon. LDB tool can still use the method.
Also, I removed the method from a separate file since it breaks filename completition. version_se<TAB> now completes to "version_set." instead of "version_set" (without the dot). I don't see a big reason that the function should be in a different file.
Test Plan: reduce_levels_test
Reviewers: dhruba, haobo, kailiu, sdong
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15303
11 years ago
|
|
|
if (file_num != 0) {
|
|
|
|
if (first_nonempty_level < 0) {
|
|
|
|
first_nonempty_level = i;
|
|
|
|
first_nonempty_level_filenum = file_num;
|
|
|
|
} else {
|
|
|
|
char msg[255];
|
|
|
|
snprintf(msg, sizeof(msg),
|
|
|
|
"Found at least two levels containing files: "
|
|
|
|
"[%d:%d],[%d:%d].\n",
|
|
|
|
first_nonempty_level, first_nonempty_level_filenum, i,
|
|
|
|
file_num);
|
|
|
|
return Status::InvalidArgument(msg);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
// we need to allocate an array with the old number of levels size to
|
|
|
|
// avoid SIGSEGV in WriteCurrentStatetoManifest()
|
|
|
|
// however, all levels bigger or equal to new_levels will be empty
|
Make VersionSet::ReduceNumberOfLevels() static
Summary:
A lot of our code implicitly assumes number_levels to be static. ReduceNumberOfLevels() breaks that assumption. For example, after calling ReduceNumberOfLevels(), DBImpl::NumberLevels() will be different from VersionSet::NumberLevels(). This is dangerous. Thankfully, it's not in public headers and is only used from LDB cmd tool. LDB tool is only using it statically, i.e. it never calls it with running DB instance. With this diff, we make it explicitly static. This way, we can assume number_levels to be immutable and not break assumption that lot of our code is relying upon. LDB tool can still use the method.
Also, I removed the method from a separate file since it breaks filename completition. version_se<TAB> now completes to "version_set." instead of "version_set" (without the dot). I don't see a big reason that the function should be in a different file.
Test Plan: reduce_levels_test
Reviewers: dhruba, haobo, kailiu, sdong
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15303
11 years ago
|
|
|
std::vector<FileMetaData*>* new_files_list =
|
|
|
|
new std::vector<FileMetaData*>[current_levels];
|
Make VersionSet::ReduceNumberOfLevels() static
Summary:
A lot of our code implicitly assumes number_levels to be static. ReduceNumberOfLevels() breaks that assumption. For example, after calling ReduceNumberOfLevels(), DBImpl::NumberLevels() will be different from VersionSet::NumberLevels(). This is dangerous. Thankfully, it's not in public headers and is only used from LDB cmd tool. LDB tool is only using it statically, i.e. it never calls it with running DB instance. With this diff, we make it explicitly static. This way, we can assume number_levels to be immutable and not break assumption that lot of our code is relying upon. LDB tool can still use the method.
Also, I removed the method from a separate file since it breaks filename completition. version_se<TAB> now completes to "version_set." instead of "version_set" (without the dot). I don't see a big reason that the function should be in a different file.
Test Plan: reduce_levels_test
Reviewers: dhruba, haobo, kailiu, sdong
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15303
11 years ago
|
|
|
for (int i = 0; i < new_levels - 1; i++) {
|
|
|
|
new_files_list[i] = vstorage->LevelFiles(i);
|
Make VersionSet::ReduceNumberOfLevels() static
Summary:
A lot of our code implicitly assumes number_levels to be static. ReduceNumberOfLevels() breaks that assumption. For example, after calling ReduceNumberOfLevels(), DBImpl::NumberLevels() will be different from VersionSet::NumberLevels(). This is dangerous. Thankfully, it's not in public headers and is only used from LDB cmd tool. LDB tool is only using it statically, i.e. it never calls it with running DB instance. With this diff, we make it explicitly static. This way, we can assume number_levels to be immutable and not break assumption that lot of our code is relying upon. LDB tool can still use the method.
Also, I removed the method from a separate file since it breaks filename completition. version_se<TAB> now completes to "version_set." instead of "version_set" (without the dot). I don't see a big reason that the function should be in a different file.
Test Plan: reduce_levels_test
Reviewers: dhruba, haobo, kailiu, sdong
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15303
11 years ago
|
|
|
}
|
|
|
|
|
|
|
|
if (first_nonempty_level > 0) {
|
|
|
|
auto& new_last_level = new_files_list[new_levels - 1];
|
|
|
|
|
|
|
|
new_last_level = vstorage->LevelFiles(first_nonempty_level);
|
|
|
|
|
|
|
|
for (size_t i = 0; i < new_last_level.size(); ++i) {
|
|
|
|
const FileMetaData* const meta = new_last_level[i];
|
|
|
|
assert(meta);
|
|
|
|
|
|
|
|
const uint64_t file_number = meta->fd.GetNumber();
|
|
|
|
|
|
|
|
vstorage->file_locations_[file_number] =
|
|
|
|
VersionStorageInfo::FileLocation(new_levels - 1, i);
|
|
|
|
}
|
Make VersionSet::ReduceNumberOfLevels() static
Summary:
A lot of our code implicitly assumes number_levels to be static. ReduceNumberOfLevels() breaks that assumption. For example, after calling ReduceNumberOfLevels(), DBImpl::NumberLevels() will be different from VersionSet::NumberLevels(). This is dangerous. Thankfully, it's not in public headers and is only used from LDB cmd tool. LDB tool is only using it statically, i.e. it never calls it with running DB instance. With this diff, we make it explicitly static. This way, we can assume number_levels to be immutable and not break assumption that lot of our code is relying upon. LDB tool can still use the method.
Also, I removed the method from a separate file since it breaks filename completition. version_se<TAB> now completes to "version_set." instead of "version_set" (without the dot). I don't see a big reason that the function should be in a different file.
Test Plan: reduce_levels_test
Reviewers: dhruba, haobo, kailiu, sdong
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15303
11 years ago
|
|
|
}
|
|
|
|
|
|
|
|
delete[] vstorage -> files_;
|
|
|
|
vstorage->files_ = new_files_list;
|
|
|
|
vstorage->num_levels_ = new_levels;
|
Make VersionSet::ReduceNumberOfLevels() static
Summary:
A lot of our code implicitly assumes number_levels to be static. ReduceNumberOfLevels() breaks that assumption. For example, after calling ReduceNumberOfLevels(), DBImpl::NumberLevels() will be different from VersionSet::NumberLevels(). This is dangerous. Thankfully, it's not in public headers and is only used from LDB cmd tool. LDB tool is only using it statically, i.e. it never calls it with running DB instance. With this diff, we make it explicitly static. This way, we can assume number_levels to be immutable and not break assumption that lot of our code is relying upon. LDB tool can still use the method.
Also, I removed the method from a separate file since it breaks filename completition. version_se<TAB> now completes to "version_set." instead of "version_set" (without the dot). I don't see a big reason that the function should be in a different file.
Test Plan: reduce_levels_test
Reviewers: dhruba, haobo, kailiu, sdong
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15303
11 years ago
|
|
|
|
|
|
|
MutableCFOptions mutable_cf_options(*options);
|
Make VersionSet::ReduceNumberOfLevels() static
Summary:
A lot of our code implicitly assumes number_levels to be static. ReduceNumberOfLevels() breaks that assumption. For example, after calling ReduceNumberOfLevels(), DBImpl::NumberLevels() will be different from VersionSet::NumberLevels(). This is dangerous. Thankfully, it's not in public headers and is only used from LDB cmd tool. LDB tool is only using it statically, i.e. it never calls it with running DB instance. With this diff, we make it explicitly static. This way, we can assume number_levels to be immutable and not break assumption that lot of our code is relying upon. LDB tool can still use the method.
Also, I removed the method from a separate file since it breaks filename completition. version_se<TAB> now completes to "version_set." instead of "version_set" (without the dot). I don't see a big reason that the function should be in a different file.
Test Plan: reduce_levels_test
Reviewers: dhruba, haobo, kailiu, sdong
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15303
11 years ago
|
|
|
VersionEdit ve;
|
|
|
|
InstrumentedMutex dummy_mutex;
|
|
|
|
InstrumentedMutexLock l(&dummy_mutex);
|
|
|
|
return versions.LogAndApply(
|
|
|
|
versions.GetColumnFamilySet()->GetDefault(),
|
|
|
|
mutable_cf_options, &ve, &dummy_mutex, nullptr, true);
|
Make VersionSet::ReduceNumberOfLevels() static
Summary:
A lot of our code implicitly assumes number_levels to be static. ReduceNumberOfLevels() breaks that assumption. For example, after calling ReduceNumberOfLevels(), DBImpl::NumberLevels() will be different from VersionSet::NumberLevels(). This is dangerous. Thankfully, it's not in public headers and is only used from LDB cmd tool. LDB tool is only using it statically, i.e. it never calls it with running DB instance. With this diff, we make it explicitly static. This way, we can assume number_levels to be immutable and not break assumption that lot of our code is relying upon. LDB tool can still use the method.
Also, I removed the method from a separate file since it breaks filename completition. version_se<TAB> now completes to "version_set." instead of "version_set" (without the dot). I don't see a big reason that the function should be in a different file.
Test Plan: reduce_levels_test
Reviewers: dhruba, haobo, kailiu, sdong
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15303
11 years ago
|
|
|
}
|
|
|
|
|
|
|
|
// Get the checksum information including the checksum and checksum function
|
|
|
|
// name of all SST files in VersionSet. Store the information in
|
|
|
|
// FileChecksumList which contains a map from file number to its checksum info.
|
|
|
|
// If DB is not running, make sure call VersionSet::Recover() to load the file
|
|
|
|
// metadata from Manifest to VersionSet before calling this function.
|
|
|
|
Status VersionSet::GetLiveFilesChecksumInfo(FileChecksumList* checksum_list) {
|
|
|
|
// Clean the previously stored checksum information if any.
|
|
|
|
Status s;
|
|
|
|
if (checksum_list == nullptr) {
|
|
|
|
s = Status::InvalidArgument("checksum_list is nullptr");
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
checksum_list->reset();
|
|
|
|
|
|
|
|
for (auto cfd : *column_family_set_) {
|
|
|
|
if (cfd->IsDropped() || !cfd->initialized()) {
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
for (int level = 0; level < cfd->NumberLevels(); level++) {
|
|
|
|
for (const auto& file :
|
|
|
|
cfd->current()->storage_info()->LevelFiles(level)) {
|
|
|
|
s = checksum_list->InsertOneFileChecksum(file->fd.GetNumber(),
|
|
|
|
file->file_checksum,
|
|
|
|
file->file_checksum_func_name);
|
|
|
|
if (!s.ok()) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (!s.ok()) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (!s.ok()) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
|
|
|
|
Status VersionSet::DumpManifest(Options& options, std::string& dscname,
|
Added JSON manifest dump option to ldb command
Summary:
Added a new flag --json to the ldb manifest_dump command
that prints out the version edits as JSON objects for easier
reading and parsing of information.
Test Plan:
**Sample usage: **
```
./ldb manifest_dump --json --path=path/to/manifest/file
```
**Sample output:**
```
{"EditNumber": 0, "Comparator": "leveldb.BytewiseComparator", "ColumnFamily": 0}
{"EditNumber": 1, "LogNumber": 0, "ColumnFamily": 0}
{"EditNumber": 2, "LogNumber": 4, "PrevLogNumber": 0, "NextFileNumber": 7, "LastSeq": 35356, "AddedFiles": [{"Level": 0, "FileNumber": 5, "FileSize": 1949284, "SmallestIKey": "'", "LargestIKey": "'"}], "ColumnFamily": 0}
...
{"EditNumber": 13, "PrevLogNumber": 0, "NextFileNumber": 36, "LastSeq": 290994, "DeletedFiles": [{"Level": 0, "FileNumber": 17}, {"Level": 0, "FileNumber": 20}, {"Level": 0, "FileNumber": 22}, {"Level": 0, "FileNumber": 24}, {"Level": 1, "FileNumber": 13}, {"Level": 1, "FileNumber": 14}, {"Level": 1, "FileNumber": 15}, {"Level": 1, "FileNumber": 18}], "AddedFiles": [{"Level": 1, "FileNumber": 25, "FileSize": 2114340, "SmallestIKey": "'", "LargestIKey": "'"}, {"Level": 1, "FileNumber": 26, "FileSize": 2115213, "SmallestIKey": "'", "LargestIKey": "'"}, {"Level": 1, "FileNumber": 27, "FileSize": 2114807, "SmallestIKey": "'", "LargestIKey": "'"}, {"Level": 1, "FileNumber": 30, "FileSize": 2115271, "SmallestIKey": "'", "LargestIKey": "'"}, {"Level": 1, "FileNumber": 31, "FileSize": 2115165, "SmallestIKey": "'", "LargestIKey": "'"}, {"Level": 1, "FileNumber": 32, "FileSize": 2114683, "SmallestIKey": "'", "LargestIKey": "'"}, {"Level": 1, "FileNumber": 35, "FileSize": 1757512, "SmallestIKey": "'", "LargestIKey": "'"}], "ColumnFamily": 0}
...
```
Reviewers: sdong, anthony, yhchiang, igor
Reviewed By: igor
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D41727
10 years ago
|
|
|
bool verbose, bool hex, bool json) {
|
|
|
|
// Open the specified manifest file.
|
|
|
|
std::unique_ptr<SequentialFileReader> file_reader;
|
Move rate_limiter, write buffering, most perf context instrumentation and most random kill out of Env
Summary: We want to keep Env a think layer for better portability. Less platform dependent codes should be moved out of Env. In this patch, I create a wrapper of file readers and writers, and put rate limiting, write buffering, as well as most perf context instrumentation and random kill out of Env. It will make it easier to maintain multiple Env in the future.
Test Plan: Run all existing unit tests.
Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D42321
10 years ago
|
|
|
Status s;
|
|
|
|
{
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
|
|
|
std::unique_ptr<FSSequentialFile> file;
|
|
|
|
const std::shared_ptr<FileSystem>& fs = options.env->GetFileSystem();
|
|
|
|
s = fs->NewSequentialFile(
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
|
|
|
dscname,
|
|
|
|
fs->OptimizeForManifestRead(file_options_), &file,
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
|
|
|
nullptr);
|
Move rate_limiter, write buffering, most perf context instrumentation and most random kill out of Env
Summary: We want to keep Env a think layer for better portability. Less platform dependent codes should be moved out of Env. In this patch, I create a wrapper of file readers and writers, and put rate limiting, write buffering, as well as most perf context instrumentation and random kill out of Env. It will make it easier to maintain multiple Env in the future.
Test Plan: Run all existing unit tests.
Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D42321
10 years ago
|
|
|
if (!s.ok()) {
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
file_reader.reset(new SequentialFileReader(
|
|
|
|
std::move(file), dscname, db_options_->log_readahead_size, io_tracer_));
|
|
|
|
}
|
|
|
|
|
|
|
|
std::vector<ColumnFamilyDescriptor> column_families(
|
|
|
|
1, ColumnFamilyDescriptor(kDefaultColumnFamilyName, options));
|
|
|
|
DumpManifestHandler handler(column_families, this, io_tracer_, verbose, hex,
|
|
|
|
json);
|
|
|
|
{
|
|
|
|
VersionSet::LogReporter reporter;
|
|
|
|
reporter.status = &s;
|
|
|
|
log::Reader reader(nullptr, std::move(file_reader), &reporter,
|
|
|
|
true /* checksum */, 0 /* log_number */);
|
|
|
|
handler.Iterate(reader, &s);
|
|
|
|
}
|
|
|
|
|
|
|
|
return handler.status();
|
|
|
|
}
|
|
|
|
#endif // ROCKSDB_LITE
|
|
|
|
|
|
|
|
void VersionSet::MarkFileNumberUsed(uint64_t number) {
|
|
|
|
// only called during recovery and repair which are single threaded, so this
|
|
|
|
// works because there can't be concurrent calls
|
|
|
|
if (next_file_number_.load(std::memory_order_relaxed) <= number) {
|
|
|
|
next_file_number_.store(number + 1, std::memory_order_relaxed);
|
|
|
|
}
|
|
|
|
}
|
Skip deleted WALs during recovery
Summary:
This patch record min log number to keep to the manifest while flushing SST files to ignore them and any WAL older than them during recovery. This is to avoid scenarios when we have a gap between the WAL files are fed to the recovery procedure. The gap could happen by for example out-of-order WAL deletion. Such gap could cause problems in 2PC recovery where the prepared and commit entry are placed into two separate WAL and gap in the WALs could result into not processing the WAL with the commit entry and hence breaking the 2PC recovery logic.
Before the commit, for 2PC case, we determined which log number to keep in FindObsoleteFiles(). We looked at the earliest logs with outstanding prepare entries, or prepare entries whose respective commit or abort are in memtable. With the commit, the same calculation is done while we apply the SST flush. Just before installing the flush file, we precompute the earliest log file to keep after the flush finishes using the same logic (but skipping the memtables just flushed), record this information to the manifest entry for this new flushed SST file. This pre-computed value is also remembered in memory, and will later be used to determine whether a log file can be deleted. This value is unlikely to change until next flush because the commit entry will stay in memtable. (In WritePrepared, we could have removed the older log files as soon as all prepared entries are committed. It's not yet done anyway. Even if we do it, the only thing we loss with this new approach is earlier log deletion between two flushes, which does not guarantee to happen anyway because the obsolete file clean-up function is only executed after flush or compaction)
This min log number to keep is stored in the manifest using the safely-ignore customized field of AddFile entry, in order to guarantee that the DB generated using newer release can be opened by previous releases no older than 4.2.
Closes https://github.com/facebook/rocksdb/pull/3765
Differential Revision: D7747618
Pulled By: siying
fbshipit-source-id: d00c92105b4f83852e9754a1b70d6b64cb590729
7 years ago
|
|
|
// Called only either from ::LogAndApply which is protected by mutex or during
|
|
|
|
// recovery which is single-threaded.
|
|
|
|
void VersionSet::MarkMinLogNumberToKeep2PC(uint64_t number) {
|
|
|
|
if (min_log_number_to_keep_2pc_.load(std::memory_order_relaxed) < number) {
|
|
|
|
min_log_number_to_keep_2pc_.store(number, std::memory_order_relaxed);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
Status VersionSet::WriteCurrentStateToManifest(
|
|
|
|
const std::unordered_map<uint32_t, MutableCFState>& curr_state,
|
|
|
|
const VersionEdit& wal_additions, log::Writer* log, IOStatus& io_s) {
|
|
|
|
// TODO: Break up into multiple records to reduce memory usage on recovery?
|
|
|
|
|
|
|
|
// WARNING: This method doesn't hold a mutex!!
|
|
|
|
|
|
|
|
// This is done without DB mutex lock held, but only within single-threaded
|
|
|
|
// LogAndApply. Column family manipulations can only happen within LogAndApply
|
|
|
|
// (the same single thread), so we're safe to iterate.
|
|
|
|
|
|
|
|
assert(io_s.ok());
|
|
|
|
if (db_options_->write_dbid_to_manifest) {
|
|
|
|
VersionEdit edit_for_db_id;
|
|
|
|
assert(!db_id_.empty());
|
|
|
|
edit_for_db_id.SetDBId(db_id_);
|
|
|
|
std::string db_id_record;
|
|
|
|
if (!edit_for_db_id.EncodeTo(&db_id_record)) {
|
|
|
|
return Status::Corruption("Unable to Encode VersionEdit:" +
|
|
|
|
edit_for_db_id.DebugString(true));
|
|
|
|
}
|
|
|
|
io_s = log->AddRecord(db_id_record);
|
Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
5 years ago
|
|
|
if (!io_s.ok()) {
|
|
|
|
return io_s;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
// Save WALs.
|
|
|
|
if (!wal_additions.GetWalAdditions().empty()) {
|
|
|
|
TEST_SYNC_POINT_CALLBACK("VersionSet::WriteCurrentStateToManifest:SaveWal",
|
|
|
|
const_cast<VersionEdit*>(&wal_additions));
|
|
|
|
std::string record;
|
|
|
|
if (!wal_additions.EncodeTo(&record)) {
|
|
|
|
return Status::Corruption("Unable to Encode VersionEdit: " +
|
|
|
|
wal_additions.DebugString(true));
|
|
|
|
}
|
|
|
|
io_s = log->AddRecord(record);
|
|
|
|
if (!io_s.ok()) {
|
|
|
|
return io_s;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
for (auto cfd : *column_family_set_) {
|
|
|
|
assert(cfd);
|
|
|
|
|
|
|
|
if (cfd->IsDropped()) {
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
assert(cfd->initialized());
|
|
|
|
{
|
|
|
|
// Store column family info
|
|
|
|
VersionEdit edit;
|
|
|
|
if (cfd->GetID() != 0) {
|
|
|
|
// default column family is always there,
|
|
|
|
// no need to explicitly write it
|
|
|
|
edit.AddColumnFamily(cfd->GetName());
|
|
|
|
edit.SetColumnFamily(cfd->GetID());
|
|
|
|
}
|
|
|
|
edit.SetComparatorName(
|
|
|
|
cfd->internal_comparator().user_comparator()->Name());
|
|
|
|
std::string record;
|
|
|
|
if (!edit.EncodeTo(&record)) {
|
|
|
|
return Status::Corruption(
|
|
|
|
"Unable to Encode VersionEdit:" + edit.DebugString(true));
|
|
|
|
}
|
|
|
|
io_s = log->AddRecord(record);
|
Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
5 years ago
|
|
|
if (!io_s.ok()) {
|
|
|
|
return io_s;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
{
|
|
|
|
// Save files
|
|
|
|
VersionEdit edit;
|
|
|
|
edit.SetColumnFamily(cfd->GetID());
|
|
|
|
|
|
|
|
assert(cfd->current());
|
|
|
|
assert(cfd->current()->storage_info());
|
|
|
|
|
|
|
|
for (int level = 0; level < cfd->NumberLevels(); level++) {
|
|
|
|
for (const auto& f :
|
|
|
|
cfd->current()->storage_info()->LevelFiles(level)) {
|
|
|
|
edit.AddFile(level, f->fd.GetNumber(), f->fd.GetPathId(),
|
|
|
|
f->fd.GetFileSize(), f->smallest, f->largest,
|
|
|
|
f->fd.smallest_seqno, f->fd.largest_seqno,
|
|
|
|
f->marked_for_compaction, f->oldest_blob_file_number,
|
|
|
|
f->oldest_ancester_time, f->file_creation_time,
|
|
|
|
f->file_checksum, f->file_checksum_func_name);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
const auto& blob_files = cfd->current()->storage_info()->GetBlobFiles();
|
|
|
|
for (const auto& pair : blob_files) {
|
|
|
|
const uint64_t blob_file_number = pair.first;
|
|
|
|
const auto& meta = pair.second;
|
|
|
|
|
|
|
|
assert(meta);
|
|
|
|
assert(blob_file_number == meta->GetBlobFileNumber());
|
|
|
|
|
|
|
|
edit.AddBlobFile(blob_file_number, meta->GetTotalBlobCount(),
|
|
|
|
meta->GetTotalBlobBytes(), meta->GetChecksumMethod(),
|
|
|
|
meta->GetChecksumValue());
|
|
|
|
if (meta->GetGarbageBlobCount() > 0) {
|
|
|
|
edit.AddBlobFileGarbage(blob_file_number, meta->GetGarbageBlobCount(),
|
|
|
|
meta->GetGarbageBlobBytes());
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
const auto iter = curr_state.find(cfd->GetID());
|
|
|
|
assert(iter != curr_state.end());
|
|
|
|
uint64_t log_number = iter->second.log_number;
|
|
|
|
edit.SetLogNumber(log_number);
|
|
|
|
|
|
|
|
if (cfd->GetID() == 0) {
|
|
|
|
// min_log_number_to_keep is for the whole db, not for specific column family.
|
|
|
|
// So it does not need to be set for every column family, just need to be set once.
|
|
|
|
// Since default CF can never be dropped, we set the min_log to the default CF here.
|
|
|
|
uint64_t min_log = min_log_number_to_keep_2pc();
|
|
|
|
if (min_log != 0) {
|
|
|
|
edit.SetMinLogNumberToKeep(min_log);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
const std::string& full_history_ts_low = iter->second.full_history_ts_low;
|
|
|
|
if (!full_history_ts_low.empty()) {
|
|
|
|
edit.SetFullHistoryTsLow(full_history_ts_low);
|
|
|
|
}
|
|
|
|
std::string record;
|
|
|
|
if (!edit.EncodeTo(&record)) {
|
|
|
|
return Status::Corruption(
|
|
|
|
"Unable to Encode VersionEdit:" + edit.DebugString(true));
|
|
|
|
}
|
|
|
|
io_s = log->AddRecord(record);
|
Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
5 years ago
|
|
|
if (!io_s.ok()) {
|
|
|
|
return io_s;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return Status::OK();
|
|
|
|
}
|
|
|
|
|
|
|
|
// TODO(aekmekji): in CompactionJob::GenSubcompactionBoundaries(), this
|
|
|
|
// function is called repeatedly with consecutive pairs of slices. For example
|
|
|
|
// if the slice list is [a, b, c, d] this function is called with arguments
|
|
|
|
// (a,b) then (b,c) then (c,d). Knowing this, an optimization is possible where
|
|
|
|
// we avoid doing binary search for the keys b and c twice and instead somehow
|
|
|
|
// maintain state of where they first appear in the files.
|
|
|
|
uint64_t VersionSet::ApproximateSize(const SizeApproximationOptions& options,
|
|
|
|
Version* v, const Slice& start,
|
|
|
|
const Slice& end, int start_level,
|
|
|
|
int end_level, TableReaderCaller caller) {
|
|
|
|
const auto& icmp = v->cfd_->internal_comparator();
|
|
|
|
|
|
|
|
// pre-condition
|
|
|
|
assert(icmp.Compare(start, end) <= 0);
|
|
|
|
|
|
|
|
uint64_t total_full_size = 0;
|
|
|
|
const auto* vstorage = v->storage_info();
|
|
|
|
const int num_non_empty_levels = vstorage->num_non_empty_levels();
|
|
|
|
end_level = (end_level == -1) ? num_non_empty_levels
|
|
|
|
: std::min(end_level, num_non_empty_levels);
|
|
|
|
|
|
|
|
assert(start_level <= end_level);
|
|
|
|
|
|
|
|
// Outline of the optimization that uses options.files_size_error_margin.
|
|
|
|
// When approximating the files total size that is used to store a keys range,
|
|
|
|
// we first sum up the sizes of the files that fully fall into the range.
|
|
|
|
// Then we sum up the sizes of all the files that may intersect with the range
|
|
|
|
// (this includes all files in L0 as well). Then, if total_intersecting_size
|
|
|
|
// is smaller than total_full_size * options.files_size_error_margin - we can
|
|
|
|
// infer that the intersecting files have a sufficiently negligible
|
|
|
|
// contribution to the total size, and we can approximate the storage required
|
|
|
|
// for the keys in range as just half of the intersecting_files_size.
|
|
|
|
// E.g., if the value of files_size_error_margin is 0.1, then the error of the
|
|
|
|
// approximation is limited to only ~10% of the total size of files that fully
|
|
|
|
// fall into the keys range. In such case, this helps to avoid a costly
|
|
|
|
// process of binary searching the intersecting files that is required only
|
|
|
|
// for a more precise calculation of the total size.
|
|
|
|
|
|
|
|
autovector<FdWithKeyRange*, 32> first_files;
|
|
|
|
autovector<FdWithKeyRange*, 16> last_files;
|
|
|
|
|
|
|
|
// scan all the levels
|
|
|
|
for (int level = start_level; level < end_level; ++level) {
|
|
|
|
const LevelFilesBrief& files_brief = vstorage->LevelFilesBrief(level);
|
|
|
|
if (files_brief.num_files == 0) {
|
|
|
|
// empty level, skip exploration
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (level == 0) {
|
|
|
|
// level 0 files are not in sorted order, we need to iterate through
|
|
|
|
// the list to compute the total bytes that require scanning,
|
|
|
|
// so handle the case explicitly (similarly to first_files case)
|
|
|
|
for (size_t i = 0; i < files_brief.num_files; i++) {
|
|
|
|
first_files.push_back(&files_brief.files[i]);
|
|
|
|
}
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
assert(level > 0);
|
|
|
|
assert(files_brief.num_files > 0);
|
|
|
|
|
|
|
|
// identify the file position for start key
|
|
|
|
const int idx_start =
|
|
|
|
FindFileInRange(icmp, files_brief, start, 0,
|
|
|
|
static_cast<uint32_t>(files_brief.num_files - 1));
|
|
|
|
assert(static_cast<size_t>(idx_start) < files_brief.num_files);
|
|
|
|
|
|
|
|
// identify the file position for end key
|
|
|
|
int idx_end = idx_start;
|
|
|
|
if (icmp.Compare(files_brief.files[idx_end].largest_key, end) < 0) {
|
|
|
|
idx_end =
|
|
|
|
FindFileInRange(icmp, files_brief, end, idx_start,
|
|
|
|
static_cast<uint32_t>(files_brief.num_files - 1));
|
|
|
|
}
|
|
|
|
assert(idx_end >= idx_start &&
|
|
|
|
static_cast<size_t>(idx_end) < files_brief.num_files);
|
|
|
|
|
|
|
|
// scan all files from the starting index to the ending index
|
|
|
|
// (inferred from the sorted order)
|
|
|
|
|
|
|
|
// first scan all the intermediate full files (excluding first and last)
|
|
|
|
for (int i = idx_start + 1; i < idx_end; ++i) {
|
|
|
|
uint64_t file_size = files_brief.files[i].fd.GetFileSize();
|
|
|
|
// The entire file falls into the range, so we can just take its size.
|
|
|
|
assert(file_size ==
|
|
|
|
ApproximateSize(v, files_brief.files[i], start, end, caller));
|
|
|
|
total_full_size += file_size;
|
|
|
|
}
|
|
|
|
|
|
|
|
// save the first and the last files (which may be the same file), so we
|
|
|
|
// can scan them later.
|
|
|
|
first_files.push_back(&files_brief.files[idx_start]);
|
|
|
|
if (idx_start != idx_end) {
|
|
|
|
// we need to estimate size for both files, only if they are different
|
|
|
|
last_files.push_back(&files_brief.files[idx_end]);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
// The sum of all file sizes that intersect the [start, end] keys range.
|
|
|
|
uint64_t total_intersecting_size = 0;
|
|
|
|
for (const auto* file_ptr : first_files) {
|
|
|
|
total_intersecting_size += file_ptr->fd.GetFileSize();
|
|
|
|
}
|
|
|
|
for (const auto* file_ptr : last_files) {
|
|
|
|
total_intersecting_size += file_ptr->fd.GetFileSize();
|
|
|
|
}
|
|
|
|
|
|
|
|
// Now scan all the first & last files at each level, and estimate their size.
|
|
|
|
// If the total_intersecting_size is less than X% of the total_full_size - we
|
|
|
|
// want to approximate the result in order to avoid the costly binary search
|
|
|
|
// inside ApproximateSize. We use half of file size as an approximation below.
|
|
|
|
|
|
|
|
const double margin = options.files_size_error_margin;
|
|
|
|
if (margin > 0 && total_intersecting_size <
|
|
|
|
static_cast<uint64_t>(total_full_size * margin)) {
|
|
|
|
total_full_size += total_intersecting_size / 2;
|
|
|
|
} else {
|
For ApproximateSizes, pro-rate table metadata size over data blocks (#6784)
Summary:
The implementation of GetApproximateSizes was inconsistent in
its treatment of the size of non-data blocks of SST files, sometimes
including and sometimes now. This was at its worst with large portion
of table file used by filters and querying a small range that crossed
a table boundary: the size estimate would include large filter size.
It's conceivable that someone might want only to know the size in terms
of data blocks, but I believe that's unlikely enough to ignore for now.
Similarly, there's no evidence the internal function AppoximateOffsetOf
is used for anything other than a one-sided ApproximateSize, so I intend
to refactor to remove redundancy in a follow-up commit.
So to fix this, GetApproximateSizes (and implementation details
ApproximateSize and ApproximateOffsetOf) now consistently include in
their returned sizes a portion of table file metadata (incl filters
and indexes) based on the size portion of the data blocks in range. In
other words, if a key range covers data blocks that are X% by size of all
the table's data blocks, returned approximate size is X% of the total
file size. It would technically be more accurate to attribute metadata
based on number of keys, but that's not computationally efficient with
data available and rarely a meaningful difference.
Also includes miscellaneous comment improvements / clarifications.
Also included is a new approximatesizerandom benchmark for db_bench.
No significant performance difference seen with this change, whether ~700 ops/sec with cache_index_and_filter_blocks and small cache or ~150k ops/sec without cache_index_and_filter_blocks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6784
Test Plan:
Test added to DBTest.ApproximateSizesFilesWithErrorMargin.
Old code running new test...
[ RUN ] DBTest.ApproximateSizesFilesWithErrorMargin
db/db_test.cc:1562: Failure
Expected: (size) <= (11 * 100), actual: 9478 vs 1100
Other tests updated to reflect consistent accounting of metadata.
Reviewed By: siying
Differential Revision: D21334706
Pulled By: pdillinger
fbshipit-source-id: 6f86870e45213334fedbe9c73b4ebb1d8d611185
5 years ago
|
|
|
// Estimate for all the first files (might also be last files), at each
|
|
|
|
// level
|
|
|
|
for (const auto file_ptr : first_files) {
|
|
|
|
total_full_size += ApproximateSize(v, *file_ptr, start, end, caller);
|
|
|
|
}
|
|
|
|
|
|
|
|
// Estimate for all the last files, at each level
|
|
|
|
for (const auto file_ptr : last_files) {
|
|
|
|
// We could use ApproximateSize here, but calling ApproximateOffsetOf
|
|
|
|
// directly is just more efficient.
|
|
|
|
total_full_size += ApproximateOffsetOf(v, *file_ptr, end, caller);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
return total_full_size;
|
|
|
|
}
|
|
|
|
|
|
|
|
uint64_t VersionSet::ApproximateOffsetOf(Version* v, const FdWithKeyRange& f,
|
|
|
|
const Slice& key,
|
|
|
|
TableReaderCaller caller) {
|
|
|
|
// pre-condition
|
|
|
|
assert(v);
|
|
|
|
const auto& icmp = v->cfd_->internal_comparator();
|
|
|
|
|
|
|
|
uint64_t result = 0;
|
|
|
|
if (icmp.Compare(f.largest_key, key) <= 0) {
|
|
|
|
// Entire file is before "key", so just add the file size
|
|
|
|
result = f.fd.GetFileSize();
|
|
|
|
} else if (icmp.Compare(f.smallest_key, key) > 0) {
|
|
|
|
// Entire file is after "key", so ignore
|
|
|
|
result = 0;
|
|
|
|
} else {
|
|
|
|
// "key" falls in the range for this table. Add the
|
|
|
|
// approximate offset of "key" within the table.
|
|
|
|
TableCache* table_cache = v->cfd_->table_cache();
|
|
|
|
if (table_cache != nullptr) {
|
|
|
|
result = table_cache->ApproximateOffsetOf(
|
|
|
|
key, f.file_metadata->fd, caller, icmp,
|
|
|
|
v->GetMutableCFOptions().prefix_extractor.get());
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return result;
|
|
|
|
}
|
|
|
|
|
|
|
|
uint64_t VersionSet::ApproximateSize(Version* v, const FdWithKeyRange& f,
|
|
|
|
const Slice& start, const Slice& end,
|
|
|
|
TableReaderCaller caller) {
|
|
|
|
// pre-condition
|
|
|
|
assert(v);
|
|
|
|
const auto& icmp = v->cfd_->internal_comparator();
|
|
|
|
assert(icmp.Compare(start, end) <= 0);
|
|
|
|
|
|
|
|
if (icmp.Compare(f.largest_key, start) <= 0 ||
|
|
|
|
icmp.Compare(f.smallest_key, end) > 0) {
|
|
|
|
// Entire file is before or after the start/end keys range
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (icmp.Compare(f.smallest_key, start) >= 0) {
|
|
|
|
// Start of the range is before the file start - approximate by end offset
|
|
|
|
return ApproximateOffsetOf(v, f, end, caller);
|
|
|
|
}
|
|
|
|
|
|
|
|
if (icmp.Compare(f.largest_key, end) < 0) {
|
|
|
|
// End of the range is after the file end - approximate by subtracting
|
|
|
|
// start offset from the file size
|
|
|
|
uint64_t start_offset = ApproximateOffsetOf(v, f, start, caller);
|
|
|
|
assert(f.fd.GetFileSize() >= start_offset);
|
|
|
|
return f.fd.GetFileSize() - start_offset;
|
|
|
|
}
|
|
|
|
|
|
|
|
// The interval falls entirely in the range for this file.
|
|
|
|
TableCache* table_cache = v->cfd_->table_cache();
|
|
|
|
if (table_cache == nullptr) {
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
return table_cache->ApproximateSize(
|
|
|
|
start, end, f.file_metadata->fd, caller, icmp,
|
|
|
|
v->GetMutableCFOptions().prefix_extractor.get());
|
|
|
|
}
|
|
|
|
|
|
|
|
void VersionSet::AddLiveFiles(std::vector<uint64_t>* live_table_files,
|
|
|
|
std::vector<uint64_t>* live_blob_files) const {
|
|
|
|
assert(live_table_files);
|
|
|
|
assert(live_blob_files);
|
|
|
|
|
[RocksDB] [Performance] Speed up FindObsoleteFiles
Summary:
FindObsoleteFiles was slow, holding the single big lock, resulted in bad p99 behavior.
Didn't profile anything, but several things could be improved:
1. VersionSet::AddLiveFiles works with std::set, which is by itself slow (a tree).
You also don't know how many dynamic allocations occur just for building up this tree.
switched to std::vector, also added logic to pre-calculate total size and do just one allocation
2. Don't see why env_->GetChildren() needs to be mutex proteced, moved to PurgeObsoleteFiles where
mutex could be unlocked.
3. switched std::set to std:unordered_set, the conversion from vector is also inside PurgeObsoleteFiles
I have a feeling this should pretty much fix it.
Test Plan: make check; db_stress
Reviewers: dhruba, heyongqiang, MarkCallaghan
Reviewed By: dhruba
CC: leveldb, zshao
Differential Revision: https://reviews.facebook.net/D10197
12 years ago
|
|
|
// pre-calculate space requirement
|
|
|
|
size_t total_table_files = 0;
|
|
|
|
size_t total_blob_files = 0;
|
|
|
|
|
|
|
|
assert(column_family_set_);
|
|
|
|
for (auto cfd : *column_family_set_) {
|
|
|
|
assert(cfd);
|
|
|
|
|
|
|
|
if (!cfd->initialized()) {
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
Version* const dummy_versions = cfd->dummy_versions();
|
|
|
|
assert(dummy_versions);
|
|
|
|
|
|
|
|
for (Version* v = dummy_versions->next_; v != dummy_versions;
|
|
|
|
v = v->next_) {
|
|
|
|
assert(v);
|
|
|
|
|
|
|
|
const auto* vstorage = v->storage_info();
|
|
|
|
assert(vstorage);
|
|
|
|
|
|
|
|
for (int level = 0; level < vstorage->num_levels(); ++level) {
|
|
|
|
total_table_files += vstorage->LevelFiles(level).size();
|
|
|
|
}
|
|
|
|
|
|
|
|
total_blob_files += vstorage->GetBlobFiles().size();
|
[RocksDB] [Performance] Speed up FindObsoleteFiles
Summary:
FindObsoleteFiles was slow, holding the single big lock, resulted in bad p99 behavior.
Didn't profile anything, but several things could be improved:
1. VersionSet::AddLiveFiles works with std::set, which is by itself slow (a tree).
You also don't know how many dynamic allocations occur just for building up this tree.
switched to std::vector, also added logic to pre-calculate total size and do just one allocation
2. Don't see why env_->GetChildren() needs to be mutex proteced, moved to PurgeObsoleteFiles where
mutex could be unlocked.
3. switched std::set to std:unordered_set, the conversion from vector is also inside PurgeObsoleteFiles
I have a feeling this should pretty much fix it.
Test Plan: make check; db_stress
Reviewers: dhruba, heyongqiang, MarkCallaghan
Reviewed By: dhruba
CC: leveldb, zshao
Differential Revision: https://reviews.facebook.net/D10197
12 years ago
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
// just one time extension to the right size
|
|
|
|
live_table_files->reserve(live_table_files->size() + total_table_files);
|
|
|
|
live_blob_files->reserve(live_blob_files->size() + total_blob_files);
|
[RocksDB] [Performance] Speed up FindObsoleteFiles
Summary:
FindObsoleteFiles was slow, holding the single big lock, resulted in bad p99 behavior.
Didn't profile anything, but several things could be improved:
1. VersionSet::AddLiveFiles works with std::set, which is by itself slow (a tree).
You also don't know how many dynamic allocations occur just for building up this tree.
switched to std::vector, also added logic to pre-calculate total size and do just one allocation
2. Don't see why env_->GetChildren() needs to be mutex proteced, moved to PurgeObsoleteFiles where
mutex could be unlocked.
3. switched std::set to std:unordered_set, the conversion from vector is also inside PurgeObsoleteFiles
I have a feeling this should pretty much fix it.
Test Plan: make check; db_stress
Reviewers: dhruba, heyongqiang, MarkCallaghan
Reviewed By: dhruba
CC: leveldb, zshao
Differential Revision: https://reviews.facebook.net/D10197
12 years ago
|
|
|
|
|
|
|
assert(column_family_set_);
|
|
|
|
for (auto cfd : *column_family_set_) {
|
|
|
|
assert(cfd);
|
|
|
|
if (!cfd->initialized()) {
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
auto* current = cfd->current();
|
|
|
|
bool found_current = false;
|
|
|
|
|
|
|
|
Version* const dummy_versions = cfd->dummy_versions();
|
|
|
|
assert(dummy_versions);
|
|
|
|
|
|
|
|
for (Version* v = dummy_versions->next_; v != dummy_versions;
|
|
|
|
v = v->next_) {
|
|
|
|
v->AddLiveFiles(live_table_files, live_blob_files);
|
|
|
|
if (v == current) {
|
|
|
|
found_current = true;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (!found_current && current != nullptr) {
|
|
|
|
// Should never happen unless it is a bug.
|
|
|
|
assert(false);
|
|
|
|
current->AddLiveFiles(live_table_files, live_blob_files);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
Compaction Support for Range Deletion
Summary:
This diff introduces RangeDelAggregator, which takes ownership of iterators
provided to it via AddTombstones(). The tombstones are organized in a two-level
map (snapshot stripe -> begin key -> tombstone). Tombstone creation avoids data
copy by holding Slices returned by the iterator, which remain valid thanks to pinning.
For compaction, we create a hierarchical range tombstone iterator with structure
matching the iterator over compaction input data. An aggregator based on that
iterator is used by CompactionIterator to determine which keys are covered by
range tombstones. In case of merge operand, the same aggregator is used by
MergeHelper. Upon finishing each file in the compaction, relevant range tombstones
are added to the output file's range tombstone metablock and file boundaries are
updated accordingly.
To check whether a key is covered by range tombstone, RangeDelAggregator::ShouldDelete()
considers tombstones in the key's snapshot stripe. When this function is used outside of
compaction, it also checks newer stripes, which can contain covering tombstones. Currently
the intra-stripe check involves a linear scan; however, in the future we plan to collapse ranges
within a stripe such that binary search can be used.
RangeDelAggregator::AddToBuilder() adds all range tombstones in the table's key-range
to a new table's range tombstone meta-block. Since range tombstones may fall in the gap
between files, we may need to extend some files' key-ranges. The strategy is (1) first file
extends as far left as possible and other files do not extend left, (2) all files extend right
until either the start of the next file or the end of the last range tombstone in the gap,
whichever comes first.
One other notable change is adding release/move semantics to ScopedArenaIterator
such that it can be used to transfer ownership of an arena-allocated iterator, similar to
how unique_ptr is used for malloc'd data.
Depends on D61473
Test Plan: compaction_iterator_test, mock_table, end-to-end tests in D63927
Reviewers: sdong, IslamAbdelRahman, wanning, yhchiang, lightmark
Reviewed By: lightmark
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62205
8 years ago
|
|
|
InternalIterator* VersionSet::MakeInputIterator(
|
|
|
|
const ReadOptions& read_options, const Compaction* c,
|
|
|
|
RangeDelAggregator* range_del_agg,
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
|
|
|
const FileOptions& file_options_compactions) {
|
|
|
|
auto cfd = c->column_family_data();
|
|
|
|
// Level-0 files have to be merged together. For other levels,
|
|
|
|
// we will make a concatenating iterator per level.
|
|
|
|
// TODO(opt): use concatenating iterator for level-0 if there is no overlap
|
|
|
|
const size_t space = (c->level() == 0 ? c->input_levels(0)->num_files +
|
|
|
|
c->num_input_levels() - 1
|
|
|
|
: c->num_input_levels());
|
|
|
|
InternalIterator** list = new InternalIterator* [space];
|
|
|
|
size_t num = 0;
|
|
|
|
for (size_t which = 0; which < c->num_input_levels(); which++) {
|
|
|
|
if (c->input_levels(which)->num_files != 0) {
|
|
|
|
if (c->level(which) == 0) {
|
|
|
|
const LevelFilesBrief* flevel = c->input_levels(which);
|
|
|
|
for (size_t i = 0; i < flevel->num_files; i++) {
|
|
|
|
list[num++] = cfd->table_cache()->NewIterator(
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
|
|
|
read_options, file_options_compactions,
|
|
|
|
cfd->internal_comparator(), *flevel->files[i].file_metadata,
|
|
|
|
range_del_agg, c->mutable_cf_options()->prefix_extractor.get(),
|
|
|
|
/*table_reader_ptr=*/nullptr,
|
|
|
|
/*file_read_hist=*/nullptr, TableReaderCaller::kCompaction,
|
|
|
|
/*arena=*/nullptr,
|
|
|
|
/*skip_filters=*/false,
|
|
|
|
/*level=*/static_cast<int>(c->level(which)),
|
|
|
|
MaxFileSizeForL0MetaPin(*c->mutable_cf_options()),
|
|
|
|
/*smallest_compaction_key=*/nullptr,
|
Properly report IO errors when IndexType::kBinarySearchWithFirstKey is used (#6621)
Summary:
Context: Index type `kBinarySearchWithFirstKey` added the ability for sst file iterator to sometimes report a key from index without reading the corresponding data block. This is useful when sst blocks are cut at some meaningful boundaries (e.g. one block per key prefix), and many seeks land between blocks (e.g. for each prefix, the ranges of keys in different sst files are nearly disjoint, so a typical seek needs to read a data block from only one file even if all files have the prefix). But this added a new error condition, which rocksdb code was really not equipped to deal with: `InternalIterator::value()` may fail with an IO error or Status::Incomplete, but it's just a method returning a Slice, with no way to report error instead. Before this PR, this type of error wasn't handled at all (an empty slice was returned), and kBinarySearchWithFirstKey implementation was considered a prototype.
Now that we (LogDevice) have experimented with kBinarySearchWithFirstKey for a while and confirmed that it's really useful, this PR is adding the missing error handling.
It's a pretty inconvenient situation implementation-wise. The error needs to be reported from InternalIterator when trying to access value. But there are ~700 call sites of `InternalIterator::value()`, most of which either can't hit the error condition (because the iterator is reading from memtable or from index or something) or wouldn't benefit from the deferred loading of the value (e.g. compaction iterator that reads all values anyway). Adding error handling to all these call sites would needlessly bloat the code. So instead I made the deferred value loading optional: only the call sites that may use deferred loading have to call the new method `PrepareValue()` before calling `value()`. The feature is enabled with a new bool argument `allow_unprepared_value` to a bunch of methods that create iterators (it wouldn't make sense to put it in ReadOptions because it's completely internal to iterators, with virtually no user-visible effect). Lmk if you have better ideas.
Note that the deferred value loading only happens for *internal* iterators. The user-visible iterator (DBIter) always prepares the value before returning from Seek/Next/etc. We could go further and add an API to defer that value loading too, but that's most likely not useful for LogDevice, so it doesn't seem worth the complexity for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6621
Test Plan: make -j5 check . Will also deploy to some logdevice test clusters and look at stats.
Reviewed By: siying
Differential Revision: D20786930
Pulled By: al13n321
fbshipit-source-id: 6da77d918bad3780522e918f17f4d5513d3e99ee
5 years ago
|
|
|
/*largest_compaction_key=*/nullptr,
|
|
|
|
/*allow_unprepared_value=*/false);
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
// Create concatenating iterator for the files from this level
|
|
|
|
list[num++] = new LevelIterator(
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
|
|
|
cfd->table_cache(), read_options, file_options_compactions,
|
|
|
|
cfd->internal_comparator(), c->input_levels(which),
|
|
|
|
c->mutable_cf_options()->prefix_extractor.get(),
|
|
|
|
/*should_sample=*/false,
|
|
|
|
/*no per level latency histogram=*/nullptr,
|
|
|
|
TableReaderCaller::kCompaction, /*skip_filters=*/false,
|
|
|
|
/*level=*/static_cast<int>(c->level(which)), range_del_agg,
|
|
|
|
c->boundaries(which));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
assert(num <= space);
|
|
|
|
InternalIterator* result =
|
|
|
|
NewMergingIterator(&c->column_family_data()->internal_comparator(), list,
|
|
|
|
static_cast<int>(num));
|
|
|
|
delete[] list;
|
|
|
|
return result;
|
|
|
|
}
|
|
|
|
|
|
|
|
// verify that the files listed in this compaction are present
|
|
|
|
// in the current version
|
|
|
|
bool VersionSet::VerifyCompactionFileConsistency(Compaction* c) {
|
|
|
|
#ifndef NDEBUG
|
|
|
|
Version* version = c->column_family_data()->current();
|
|
|
|
const VersionStorageInfo* vstorage = version->storage_info();
|
|
|
|
if (c->input_version() != version) {
|
|
|
|
ROCKS_LOG_INFO(
|
|
|
|
db_options_->info_log,
|
|
|
|
"[%s] compaction output being applied to a different base version from"
|
|
|
|
" input version",
|
|
|
|
c->column_family_data()->GetName().c_str());
|
|
|
|
|
|
|
|
if (vstorage->compaction_style_ == kCompactionStyleLevel &&
|
|
|
|
c->start_level() == 0 && c->num_input_levels() > 2U) {
|
options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size bases of levels dynamically.
Summary:
When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases.
In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier.
Test Plan: New unit tests and pass tests suites including valgrind.
Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo
Reviewed By: ikabiljo
Subscribers: yoshinorim, ikabiljo, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31437
10 years ago
|
|
|
// We are doing a L0->base_level compaction. The assumption is if
|
|
|
|
// base level is not L1, levels from L1 to base_level - 1 is empty.
|
|
|
|
// This is ensured by having one compaction from L0 going on at the
|
|
|
|
// same time in level-based compaction. So that during the time, no
|
|
|
|
// compaction/flush can put files to those levels.
|
|
|
|
for (int l = c->start_level() + 1; l < c->output_level(); l++) {
|
|
|
|
if (vstorage->NumLevelFiles(l) != 0) {
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
for (size_t input = 0; input < c->num_input_levels(); ++input) {
|
CompactFiles, EventListener and GetDatabaseMetaData
Summary:
This diff adds three sets of APIs to RocksDB.
= GetColumnFamilyMetaData =
* This APIs allow users to obtain the current state of a RocksDB instance on one column family.
* See GetColumnFamilyMetaData in include/rocksdb/db.h
= EventListener =
* A virtual class that allows users to implement a set of
call-back functions which will be called when specific
events of a RocksDB instance happens.
* To register EventListener, simply insert an EventListener to ColumnFamilyOptions::listeners
= CompactFiles =
* CompactFiles API inputs a set of file numbers and an output level, and RocksDB
will try to compact those files into the specified level.
= Example =
* Example code can be found in example/compact_files_example.cc, which implements
a simple external compactor using EventListener, GetColumnFamilyMetaData, and
CompactFiles API.
Test Plan:
listener_test
compactor_test
example/compact_files_example
export ROCKSDB_TESTS=CompactFiles
db_test
export ROCKSDB_TESTS=MetaData
db_test
Reviewers: ljin, igor, rven, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D24705
10 years ago
|
|
|
int level = c->level(input);
|
|
|
|
for (size_t i = 0; i < c->num_input_files(input); ++i) {
|
CompactFiles, EventListener and GetDatabaseMetaData
Summary:
This diff adds three sets of APIs to RocksDB.
= GetColumnFamilyMetaData =
* This APIs allow users to obtain the current state of a RocksDB instance on one column family.
* See GetColumnFamilyMetaData in include/rocksdb/db.h
= EventListener =
* A virtual class that allows users to implement a set of
call-back functions which will be called when specific
events of a RocksDB instance happens.
* To register EventListener, simply insert an EventListener to ColumnFamilyOptions::listeners
= CompactFiles =
* CompactFiles API inputs a set of file numbers and an output level, and RocksDB
will try to compact those files into the specified level.
= Example =
* Example code can be found in example/compact_files_example.cc, which implements
a simple external compactor using EventListener, GetColumnFamilyMetaData, and
CompactFiles API.
Test Plan:
listener_test
compactor_test
example/compact_files_example
export ROCKSDB_TESTS=CompactFiles
db_test
export ROCKSDB_TESTS=MetaData
db_test
Reviewers: ljin, igor, rven, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D24705
10 years ago
|
|
|
uint64_t number = c->input(input, i)->fd.GetNumber();
|
|
|
|
bool found = false;
|
|
|
|
for (size_t j = 0; j < vstorage->files_[level].size(); j++) {
|
CompactFiles, EventListener and GetDatabaseMetaData
Summary:
This diff adds three sets of APIs to RocksDB.
= GetColumnFamilyMetaData =
* This APIs allow users to obtain the current state of a RocksDB instance on one column family.
* See GetColumnFamilyMetaData in include/rocksdb/db.h
= EventListener =
* A virtual class that allows users to implement a set of
call-back functions which will be called when specific
events of a RocksDB instance happens.
* To register EventListener, simply insert an EventListener to ColumnFamilyOptions::listeners
= CompactFiles =
* CompactFiles API inputs a set of file numbers and an output level, and RocksDB
will try to compact those files into the specified level.
= Example =
* Example code can be found in example/compact_files_example.cc, which implements
a simple external compactor using EventListener, GetColumnFamilyMetaData, and
CompactFiles API.
Test Plan:
listener_test
compactor_test
example/compact_files_example
export ROCKSDB_TESTS=CompactFiles
db_test
export ROCKSDB_TESTS=MetaData
db_test
Reviewers: ljin, igor, rven, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D24705
10 years ago
|
|
|
FileMetaData* f = vstorage->files_[level][j];
|
|
|
|
if (f->fd.GetNumber() == number) {
|
|
|
|
found = true;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
CompactFiles, EventListener and GetDatabaseMetaData
Summary:
This diff adds three sets of APIs to RocksDB.
= GetColumnFamilyMetaData =
* This APIs allow users to obtain the current state of a RocksDB instance on one column family.
* See GetColumnFamilyMetaData in include/rocksdb/db.h
= EventListener =
* A virtual class that allows users to implement a set of
call-back functions which will be called when specific
events of a RocksDB instance happens.
* To register EventListener, simply insert an EventListener to ColumnFamilyOptions::listeners
= CompactFiles =
* CompactFiles API inputs a set of file numbers and an output level, and RocksDB
will try to compact those files into the specified level.
= Example =
* Example code can be found in example/compact_files_example.cc, which implements
a simple external compactor using EventListener, GetColumnFamilyMetaData, and
CompactFiles API.
Test Plan:
listener_test
compactor_test
example/compact_files_example
export ROCKSDB_TESTS=CompactFiles
db_test
export ROCKSDB_TESTS=MetaData
db_test
Reviewers: ljin, igor, rven, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D24705
10 years ago
|
|
|
if (!found) {
|
|
|
|
return false; // input files non existent in current version
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
#else
|
|
|
|
(void)c;
|
|
|
|
#endif
|
|
|
|
return true; // everything good
|
|
|
|
}
|
|
|
|
|
|
|
|
Status VersionSet::GetMetadataForFile(uint64_t number, int* filelevel,
|
|
|
|
FileMetaData** meta,
|
|
|
|
ColumnFamilyData** cfd) {
|
|
|
|
for (auto cfd_iter : *column_family_set_) {
|
|
|
|
if (!cfd_iter->initialized()) {
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
Version* version = cfd_iter->current();
|
|
|
|
const auto* vstorage = version->storage_info();
|
|
|
|
for (int level = 0; level < vstorage->num_levels(); level++) {
|
|
|
|
for (const auto& file : vstorage->LevelFiles(level)) {
|
|
|
|
if (file->fd.GetNumber() == number) {
|
|
|
|
*meta = file;
|
|
|
|
*filelevel = level;
|
|
|
|
*cfd = cfd_iter;
|
|
|
|
return Status::OK();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return Status::NotFound("File not present in any level");
|
|
|
|
}
|
|
|
|
|
|
|
|
void VersionSet::GetLiveFilesMetaData(std::vector<LiveFileMetaData>* metadata) {
|
|
|
|
for (auto cfd : *column_family_set_) {
|
|
|
|
if (cfd->IsDropped() || !cfd->initialized()) {
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
for (int level = 0; level < cfd->NumberLevels(); level++) {
|
|
|
|
for (const auto& file :
|
|
|
|
cfd->current()->storage_info()->LevelFiles(level)) {
|
|
|
|
LiveFileMetaData filemetadata;
|
|
|
|
filemetadata.column_family_name = cfd->GetName();
|
|
|
|
uint32_t path_id = file->fd.GetPathId();
|
|
|
|
if (path_id < cfd->ioptions()->cf_paths.size()) {
|
|
|
|
filemetadata.db_path = cfd->ioptions()->cf_paths[path_id].path;
|
|
|
|
} else {
|
|
|
|
assert(!cfd->ioptions()->cf_paths.empty());
|
|
|
|
filemetadata.db_path = cfd->ioptions()->cf_paths.back().path;
|
|
|
|
}
|
|
|
|
const uint64_t file_number = file->fd.GetNumber();
|
|
|
|
filemetadata.name = MakeTableFileName("", file_number);
|
|
|
|
filemetadata.file_number = file_number;
|
|
|
|
filemetadata.level = level;
|
|
|
|
filemetadata.size = static_cast<size_t>(file->fd.GetFileSize());
|
|
|
|
filemetadata.smallestkey = file->smallest.user_key().ToString();
|
|
|
|
filemetadata.largestkey = file->largest.user_key().ToString();
|
|
|
|
filemetadata.smallest_seqno = file->fd.smallest_seqno;
|
|
|
|
filemetadata.largest_seqno = file->fd.largest_seqno;
|
|
|
|
filemetadata.num_reads_sampled = file->stats.num_reads_sampled.load(
|
|
|
|
std::memory_order_relaxed);
|
|
|
|
filemetadata.being_compacted = file->being_compacted;
|
|
|
|
filemetadata.num_entries = file->num_entries;
|
|
|
|
filemetadata.num_deletions = file->num_deletions;
|
|
|
|
filemetadata.oldest_blob_file_number = file->oldest_blob_file_number;
|
|
|
|
filemetadata.file_checksum = file->file_checksum;
|
|
|
|
filemetadata.file_checksum_func_name = file->file_checksum_func_name;
|
|
|
|
metadata->push_back(filemetadata);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
void VersionSet::GetObsoleteFiles(std::vector<ObsoleteFileInfo>* files,
|
|
|
|
std::vector<ObsoleteBlobFileInfo>* blob_files,
|
|
|
|
std::vector<std::string>* manifest_filenames,
|
|
|
|
uint64_t min_pending_output) {
|
|
|
|
assert(files);
|
|
|
|
assert(blob_files);
|
|
|
|
assert(manifest_filenames);
|
|
|
|
assert(files->empty());
|
|
|
|
assert(blob_files->empty());
|
|
|
|
assert(manifest_filenames->empty());
|
|
|
|
|
|
|
|
std::vector<ObsoleteFileInfo> pending_files;
|
|
|
|
for (auto& f : obsolete_files_) {
|
|
|
|
if (f.metadata->fd.GetNumber() < min_pending_output) {
|
|
|
|
files->emplace_back(std::move(f));
|
|
|
|
} else {
|
|
|
|
pending_files.emplace_back(std::move(f));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
obsolete_files_.swap(pending_files);
|
|
|
|
|
|
|
|
std::vector<ObsoleteBlobFileInfo> pending_blob_files;
|
|
|
|
for (auto& blob_file : obsolete_blob_files_) {
|
|
|
|
if (blob_file.GetBlobFileNumber() < min_pending_output) {
|
|
|
|
blob_files->emplace_back(std::move(blob_file));
|
|
|
|
} else {
|
|
|
|
pending_blob_files.emplace_back(std::move(blob_file));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
obsolete_blob_files_.swap(pending_blob_files);
|
|
|
|
|
|
|
|
obsolete_manifests_.swap(*manifest_filenames);
|
|
|
|
}
|
|
|
|
|
|
|
|
ColumnFamilyData* VersionSet::CreateColumnFamily(
|
|
|
|
const ColumnFamilyOptions& cf_options, const VersionEdit* edit) {
|
|
|
|
assert(edit->is_column_family_add_);
|
|
|
|
|
|
|
|
MutableCFOptions dummy_cf_options;
|
|
|
|
Version* dummy_versions =
|
|
|
|
new Version(nullptr, this, file_options_, dummy_cf_options, io_tracer_);
|
|
|
|
// Ref() dummy version once so that later we can call Unref() to delete it
|
|
|
|
// by avoiding calling "delete" explicitly (~Version is private)
|
|
|
|
dummy_versions->Ref();
|
|
|
|
auto new_cfd = column_family_set_->CreateColumnFamily(
|
|
|
|
edit->column_family_name_, edit->column_family_, dummy_versions,
|
|
|
|
cf_options);
|
|
|
|
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
|
|
|
Version* v = new Version(new_cfd, this, file_options_,
|
|
|
|
*new_cfd->GetLatestMutableCFOptions(), io_tracer_,
|
|
|
|
current_version_number_++);
|
|
|
|
|
options.level_compaction_dynamic_level_bytes to allow RocksDB to pick size bases of levels dynamically.
Summary:
When having fixed max_bytes_for_level_base, the ratio of size of largest level and the second one can range from 0 to the multiplier. This makes LSM tree frequently irregular and unpredictable. It can also cause poor space amplification in some cases.
In this improvement (proposed by Igor Kabiljo), we introduce a parameter option.level_compaction_use_dynamic_max_bytes. When turning it on, RocksDB is free to pick a level base in the range of (options.max_bytes_for_level_base/options.max_bytes_for_level_multiplier, options.max_bytes_for_level_base] so that real level ratios are close to options.max_bytes_for_level_multiplier.
Test Plan: New unit tests and pass tests suites including valgrind.
Reviewers: MarkCallaghan, rven, yhchiang, igor, ikabiljo
Reviewed By: ikabiljo
Subscribers: yoshinorim, ikabiljo, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31437
10 years ago
|
|
|
// Fill level target base information.
|
|
|
|
v->storage_info()->CalculateBaseBytes(*new_cfd->ioptions(),
|
|
|
|
*new_cfd->GetLatestMutableCFOptions());
|
|
|
|
AppendVersion(new_cfd, v);
|
|
|
|
// GetLatestMutableCFOptions() is safe here without mutex since the
|
|
|
|
// cfd is not available to client
|
|
|
|
new_cfd->CreateNewMemtable(*new_cfd->GetLatestMutableCFOptions(),
|
|
|
|
LastSequence());
|
|
|
|
new_cfd->SetLogNumber(edit->log_number_);
|
|
|
|
return new_cfd;
|
|
|
|
}
|
|
|
|
|
|
|
|
uint64_t VersionSet::GetNumLiveVersions(Version* dummy_versions) {
|
|
|
|
uint64_t count = 0;
|
|
|
|
for (Version* v = dummy_versions->next_; v != dummy_versions; v = v->next_) {
|
|
|
|
count++;
|
|
|
|
}
|
|
|
|
return count;
|
|
|
|
}
|
|
|
|
|
|
|
|
uint64_t VersionSet::GetTotalSstFilesSize(Version* dummy_versions) {
|
|
|
|
std::unordered_set<uint64_t> unique_files;
|
|
|
|
uint64_t total_files_size = 0;
|
|
|
|
for (Version* v = dummy_versions->next_; v != dummy_versions; v = v->next_) {
|
|
|
|
VersionStorageInfo* storage_info = v->storage_info();
|
|
|
|
for (int level = 0; level < storage_info->num_levels_; level++) {
|
|
|
|
for (const auto& file_meta : storage_info->LevelFiles(level)) {
|
|
|
|
if (unique_files.find(file_meta->fd.packed_number_and_path_id) ==
|
|
|
|
unique_files.end()) {
|
|
|
|
unique_files.insert(file_meta->fd.packed_number_and_path_id);
|
|
|
|
total_files_size += file_meta->fd.GetFileSize();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return total_files_size;
|
|
|
|
}
|
|
|
|
|
|
|
|
Status VersionSet::VerifyFileMetadata(const std::string& fpath,
|
|
|
|
const FileMetaData& meta) const {
|
|
|
|
uint64_t fsize = 0;
|
|
|
|
Status status = fs_->GetFileSize(fpath, IOOptions(), &fsize, nullptr);
|
|
|
|
if (status.ok()) {
|
|
|
|
if (fsize != meta.fd.GetFileSize()) {
|
|
|
|
status = Status::Corruption("File size mismatch: " + fpath);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return status;
|
|
|
|
}
|
|
|
|
|
|
|
|
ReactiveVersionSet::ReactiveVersionSet(
|
|
|
|
const std::string& dbname, const ImmutableDBOptions* _db_options,
|
|
|
|
const FileOptions& _file_options, Cache* table_cache,
|
|
|
|
WriteBufferManager* write_buffer_manager, WriteController* write_controller,
|
|
|
|
const std::shared_ptr<IOTracer>& io_tracer)
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
|
|
|
: VersionSet(dbname, _db_options, _file_options, table_cache,
|
|
|
|
write_buffer_manager, write_controller,
|
|
|
|
/*block_cache_tracer=*/nullptr, io_tracer),
|
|
|
|
number_of_edits_to_skip_(0) {}
|
|
|
|
|
|
|
|
ReactiveVersionSet::~ReactiveVersionSet() {}
|
|
|
|
|
|
|
|
Status ReactiveVersionSet::Recover(
|
|
|
|
const std::vector<ColumnFamilyDescriptor>& column_families,
|
|
|
|
std::unique_ptr<log::FragmentBufferedReader>* manifest_reader,
|
|
|
|
std::unique_ptr<log::Reader::Reporter>* manifest_reporter,
|
|
|
|
std::unique_ptr<Status>* manifest_reader_status) {
|
|
|
|
assert(manifest_reader != nullptr);
|
|
|
|
assert(manifest_reporter != nullptr);
|
|
|
|
assert(manifest_reader_status != nullptr);
|
|
|
|
|
|
|
|
std::unordered_map<std::string, ColumnFamilyOptions> cf_name_to_options;
|
|
|
|
for (const auto& cf : column_families) {
|
|
|
|
cf_name_to_options.insert({cf.name, cf.options});
|
|
|
|
}
|
|
|
|
|
|
|
|
// add default column family
|
|
|
|
auto default_cf_iter = cf_name_to_options.find(kDefaultColumnFamilyName);
|
|
|
|
if (default_cf_iter == cf_name_to_options.end()) {
|
|
|
|
return Status::InvalidArgument("Default column family not specified");
|
|
|
|
}
|
|
|
|
VersionEdit default_cf_edit;
|
|
|
|
default_cf_edit.AddColumnFamily(kDefaultColumnFamilyName);
|
|
|
|
default_cf_edit.SetColumnFamily(0);
|
|
|
|
ColumnFamilyData* default_cfd =
|
|
|
|
CreateColumnFamily(default_cf_iter->second, &default_cf_edit);
|
|
|
|
// In recovery, nobody else can access it, so it's fine to set it to be
|
|
|
|
// initialized earlier.
|
|
|
|
default_cfd->set_initialized();
|
|
|
|
VersionBuilderMap builders;
|
|
|
|
std::unordered_map<int, std::string> column_families_not_found;
|
|
|
|
builders.insert(
|
|
|
|
std::make_pair(0, std::unique_ptr<BaseReferencedVersionBuilder>(
|
|
|
|
new BaseReferencedVersionBuilder(default_cfd))));
|
|
|
|
|
|
|
|
manifest_reader_status->reset(new Status());
|
|
|
|
manifest_reporter->reset(new LogReporter());
|
|
|
|
static_cast_with_check<LogReporter>(manifest_reporter->get())->status =
|
|
|
|
manifest_reader_status->get();
|
|
|
|
Status s = MaybeSwitchManifest(manifest_reporter->get(), manifest_reader);
|
|
|
|
log::Reader* reader = manifest_reader->get();
|
|
|
|
|
|
|
|
int retry = 0;
|
|
|
|
VersionEdit version_edit;
|
|
|
|
while (s.ok() && retry < 1) {
|
|
|
|
assert(reader != nullptr);
|
|
|
|
s = ReadAndRecover(*reader, &read_buffer_, cf_name_to_options,
|
|
|
|
column_families_not_found, builders,
|
|
|
|
manifest_reader_status->get(), &version_edit);
|
|
|
|
if (s.ok()) {
|
|
|
|
bool enough = version_edit.has_next_file_number_ &&
|
|
|
|
version_edit.has_log_number_ &&
|
|
|
|
version_edit.has_last_sequence_;
|
|
|
|
if (enough) {
|
|
|
|
for (const auto& cf : column_families) {
|
|
|
|
auto cfd = column_family_set_->GetColumnFamily(cf.name);
|
|
|
|
if (cfd == nullptr) {
|
|
|
|
enough = false;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (enough) {
|
|
|
|
for (const auto& cf : column_families) {
|
|
|
|
auto cfd = column_family_set_->GetColumnFamily(cf.name);
|
|
|
|
assert(cfd != nullptr);
|
|
|
|
if (!cfd->IsDropped()) {
|
|
|
|
auto builder_iter = builders.find(cfd->GetID());
|
|
|
|
assert(builder_iter != builders.end());
|
|
|
|
auto builder = builder_iter->second->version_builder();
|
|
|
|
assert(builder != nullptr);
|
|
|
|
s = builder->LoadTableHandlers(
|
|
|
|
cfd->internal_stats(), db_options_->max_file_opening_threads,
|
|
|
|
false /* prefetch_index_and_filter_in_cache */,
|
|
|
|
true /* is_initial_load */,
|
|
|
|
cfd->GetLatestMutableCFOptions()->prefix_extractor.get(),
|
|
|
|
MaxFileSizeForL0MetaPin(*cfd->GetLatestMutableCFOptions()));
|
|
|
|
if (!s.ok()) {
|
|
|
|
enough = false;
|
|
|
|
if (s.IsPathNotFound()) {
|
|
|
|
s = Status::OK();
|
|
|
|
}
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (enough) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
++retry;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (s.ok()) {
|
|
|
|
if (!version_edit.has_prev_log_number_) {
|
|
|
|
version_edit.prev_log_number_ = 0;
|
|
|
|
}
|
|
|
|
column_family_set_->UpdateMaxColumnFamily(version_edit.max_column_family_);
|
|
|
|
|
|
|
|
MarkMinLogNumberToKeep2PC(version_edit.min_log_number_to_keep_);
|
|
|
|
MarkFileNumberUsed(version_edit.prev_log_number_);
|
|
|
|
MarkFileNumberUsed(version_edit.log_number_);
|
|
|
|
|
|
|
|
for (auto cfd : *column_family_set_) {
|
|
|
|
assert(builders.count(cfd->GetID()) > 0);
|
|
|
|
auto builder = builders[cfd->GetID()]->version_builder();
|
|
|
|
if (!builder->CheckConsistencyForNumLevels()) {
|
|
|
|
s = Status::InvalidArgument(
|
|
|
|
"db has more levels than options.num_levels");
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (s.ok()) {
|
|
|
|
for (auto cfd : *column_family_set_) {
|
|
|
|
if (cfd->IsDropped()) {
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
assert(cfd->initialized());
|
|
|
|
auto builders_iter = builders.find(cfd->GetID());
|
|
|
|
assert(builders_iter != builders.end());
|
|
|
|
auto* builder = builders_iter->second->version_builder();
|
|
|
|
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
|
|
|
Version* v = new Version(cfd, this, file_options_,
|
|
|
|
*cfd->GetLatestMutableCFOptions(), io_tracer_,
|
|
|
|
current_version_number_++);
|
|
|
|
s = builder->SaveTo(v->storage_info());
|
|
|
|
|
|
|
|
if (s.ok()) {
|
|
|
|
// Install recovered version
|
|
|
|
v->PrepareApply(*cfd->GetLatestMutableCFOptions(),
|
|
|
|
!(db_options_->skip_stats_update_on_db_open));
|
|
|
|
AppendVersion(cfd, v);
|
|
|
|
} else {
|
|
|
|
ROCKS_LOG_ERROR(db_options_->info_log,
|
|
|
|
"[%s]: inconsistent version: %s\n",
|
|
|
|
cfd->GetName().c_str(), s.ToString().c_str());
|
|
|
|
delete v;
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (s.ok()) {
|
|
|
|
next_file_number_.store(version_edit.next_file_number_ + 1);
|
|
|
|
last_allocated_sequence_ = version_edit.last_sequence_;
|
|
|
|
last_published_sequence_ = version_edit.last_sequence_;
|
|
|
|
last_sequence_ = version_edit.last_sequence_;
|
|
|
|
prev_log_number_ = version_edit.prev_log_number_;
|
|
|
|
for (auto cfd : *column_family_set_) {
|
|
|
|
if (cfd->IsDropped()) {
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
ROCKS_LOG_INFO(db_options_->info_log,
|
|
|
|
"Column family [%s] (ID %u), log number is %" PRIu64 "\n",
|
|
|
|
cfd->GetName().c_str(), cfd->GetID(), cfd->GetLogNumber());
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
|
|
|
|
Status ReactiveVersionSet::ReadAndApply(
|
|
|
|
InstrumentedMutex* mu,
|
|
|
|
std::unique_ptr<log::FragmentBufferedReader>* manifest_reader,
|
|
|
|
std::unordered_set<ColumnFamilyData*>* cfds_changed) {
|
|
|
|
assert(manifest_reader != nullptr);
|
|
|
|
assert(cfds_changed != nullptr);
|
|
|
|
mu->AssertHeld();
|
|
|
|
|
|
|
|
Status s;
|
|
|
|
uint64_t applied_edits = 0;
|
|
|
|
while (s.ok()) {
|
|
|
|
Slice record;
|
|
|
|
std::string scratch;
|
|
|
|
log::Reader* reader = manifest_reader->get();
|
|
|
|
std::string old_manifest_path = reader->file()->file_name();
|
|
|
|
while (reader->ReadRecord(&record, &scratch)) {
|
|
|
|
VersionEdit edit;
|
|
|
|
s = edit.DecodeFrom(record);
|
|
|
|
if (!s.ok()) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
|
|
|
|
// Skip the first VersionEdits of each MANIFEST generated by
|
|
|
|
// VersionSet::WriteCurrentStatetoManifest.
|
|
|
|
if (number_of_edits_to_skip_ > 0) {
|
|
|
|
ColumnFamilyData* cfd =
|
|
|
|
column_family_set_->GetColumnFamily(edit.column_family_);
|
|
|
|
if (cfd != nullptr && !cfd->IsDropped()) {
|
|
|
|
--number_of_edits_to_skip_;
|
|
|
|
}
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
|
|
|
|
s = read_buffer_.AddEdit(&edit);
|
|
|
|
if (!s.ok()) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
VersionEdit temp_edit;
|
|
|
|
if (edit.is_in_atomic_group_) {
|
|
|
|
if (read_buffer_.IsFull()) {
|
|
|
|
// Apply edits in an atomic group when we have read all edits in the
|
|
|
|
// group.
|
|
|
|
for (auto& e : read_buffer_.replay_buffer()) {
|
|
|
|
s = ApplyOneVersionEditToBuilder(e, cfds_changed, &temp_edit);
|
|
|
|
if (!s.ok()) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
applied_edits++;
|
|
|
|
}
|
|
|
|
if (!s.ok()) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
read_buffer_.Clear();
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
// Apply a normal edit immediately.
|
|
|
|
s = ApplyOneVersionEditToBuilder(edit, cfds_changed, &temp_edit);
|
|
|
|
if (s.ok()) {
|
|
|
|
applied_edits++;
|
|
|
|
} else {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (!s.ok()) {
|
|
|
|
// Clear the buffer if we fail to decode/apply an edit.
|
|
|
|
read_buffer_.Clear();
|
|
|
|
}
|
|
|
|
// It's possible that:
|
|
|
|
// 1) s.IsCorruption(), indicating the current MANIFEST is corrupted.
|
|
|
|
// Or the version(s) rebuilt from tailing the MANIFEST is inconsistent.
|
|
|
|
// 2) we have finished reading the current MANIFEST.
|
|
|
|
// 3) we have encountered an IOError reading the current MANIFEST.
|
|
|
|
// We need to look for the next MANIFEST and start from there. If we cannot
|
|
|
|
// find the next MANIFEST, we should exit the loop.
|
|
|
|
Status tmp_s = MaybeSwitchManifest(reader->GetReporter(), manifest_reader);
|
|
|
|
reader = manifest_reader->get();
|
|
|
|
if (tmp_s.ok()) {
|
|
|
|
if (reader->file()->file_name() == old_manifest_path) {
|
|
|
|
// Still processing the same MANIFEST, thus no need to continue this
|
|
|
|
// loop since no record is available if we have reached here.
|
|
|
|
break;
|
|
|
|
} else {
|
|
|
|
// We have switched to a new MANIFEST whose first records have been
|
|
|
|
// generated by VersionSet::WriteCurrentStatetoManifest. Since the
|
|
|
|
// secondary instance has already finished recovering upon start, there
|
|
|
|
// is no need for the secondary to process these records. Actually, if
|
|
|
|
// the secondary were to replay these records, the secondary may end up
|
|
|
|
// adding the same SST files AGAIN to each column family, causing
|
|
|
|
// consistency checks done by VersionBuilder to fail. Therefore, we
|
|
|
|
// record the number of records to skip at the beginning of the new
|
|
|
|
// MANIFEST and ignore them.
|
|
|
|
number_of_edits_to_skip_ = 0;
|
|
|
|
for (auto* cfd : *column_family_set_) {
|
|
|
|
if (cfd->IsDropped()) {
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
// Increase number_of_edits_to_skip by 2 because
|
|
|
|
// WriteCurrentStatetoManifest() writes 2 version edits for each
|
|
|
|
// column family at the beginning of the newly-generated MANIFEST.
|
|
|
|
// TODO(yanqin) remove hard-coded value.
|
|
|
|
if (db_options_->write_dbid_to_manifest) {
|
|
|
|
number_of_edits_to_skip_ += 3;
|
|
|
|
} else {
|
|
|
|
number_of_edits_to_skip_ += 2;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
s = tmp_s;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
if (s.ok()) {
|
|
|
|
for (auto cfd : *column_family_set_) {
|
|
|
|
auto builder_iter = active_version_builders_.find(cfd->GetID());
|
|
|
|
if (builder_iter == active_version_builders_.end()) {
|
|
|
|
continue;
|
|
|
|
}
|
|
|
|
auto builder = builder_iter->second->version_builder();
|
|
|
|
if (!builder->CheckConsistencyForNumLevels()) {
|
|
|
|
s = Status::InvalidArgument(
|
|
|
|
"db has more levels than options.num_levels");
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
TEST_SYNC_POINT_CALLBACK("ReactiveVersionSet::ReadAndApply:AppliedEdits",
|
|
|
|
&applied_edits);
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
|
|
|
|
Status ReactiveVersionSet::ApplyOneVersionEditToBuilder(
|
|
|
|
VersionEdit& edit, std::unordered_set<ColumnFamilyData*>* cfds_changed,
|
|
|
|
VersionEdit* version_edit) {
|
|
|
|
ColumnFamilyData* cfd =
|
|
|
|
column_family_set_->GetColumnFamily(edit.column_family_);
|
|
|
|
|
|
|
|
// If we cannot find this column family in our column family set, then it
|
|
|
|
// may be a new column family created by the primary after the secondary
|
|
|
|
// starts. It is also possible that the secondary instance opens only a subset
|
|
|
|
// of column families. Ignore it for now.
|
|
|
|
if (nullptr == cfd) {
|
|
|
|
return Status::OK();
|
|
|
|
}
|
|
|
|
if (active_version_builders_.find(edit.column_family_) ==
|
|
|
|
active_version_builders_.end() &&
|
|
|
|
!cfd->IsDropped()) {
|
|
|
|
std::unique_ptr<BaseReferencedVersionBuilder> builder_guard(
|
|
|
|
new BaseReferencedVersionBuilder(cfd));
|
|
|
|
active_version_builders_.insert(
|
|
|
|
std::make_pair(edit.column_family_, std::move(builder_guard)));
|
|
|
|
}
|
|
|
|
|
|
|
|
auto builder_iter = active_version_builders_.find(edit.column_family_);
|
|
|
|
assert(builder_iter != active_version_builders_.end());
|
|
|
|
auto builder = builder_iter->second->version_builder();
|
|
|
|
assert(builder != nullptr);
|
|
|
|
|
|
|
|
if (edit.is_column_family_add_) {
|
|
|
|
// TODO (yanqin) for now the secondary ignores column families created
|
|
|
|
// after Open. This also simplifies handling of switching to a new MANIFEST
|
|
|
|
// and processing the snapshot of the system at the beginning of the
|
|
|
|
// MANIFEST.
|
|
|
|
} else if (edit.is_column_family_drop_) {
|
|
|
|
// Drop the column family by setting it to be 'dropped' without destroying
|
|
|
|
// the column family handle.
|
|
|
|
// TODO (haoyu) figure out how to handle column faimly drop for
|
|
|
|
// secondary instance. (Is it possible that the ref count for cfd is 0 but
|
|
|
|
// the ref count for its versions is higher than 0?)
|
|
|
|
cfd->SetDropped();
|
|
|
|
if (cfd->UnrefAndTryDelete()) {
|
|
|
|
cfd = nullptr;
|
|
|
|
}
|
|
|
|
active_version_builders_.erase(builder_iter);
|
|
|
|
} else {
|
|
|
|
Status s = builder->Apply(&edit);
|
|
|
|
if (!s.ok()) {
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
Status s = ExtractInfoFromVersionEdit(cfd, edit, version_edit);
|
|
|
|
if (!s.ok()) {
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (cfd != nullptr && !cfd->IsDropped()) {
|
|
|
|
s = builder->LoadTableHandlers(
|
|
|
|
cfd->internal_stats(), db_options_->max_file_opening_threads,
|
|
|
|
false /* prefetch_index_and_filter_in_cache */,
|
|
|
|
false /* is_initial_load */,
|
|
|
|
cfd->GetLatestMutableCFOptions()->prefix_extractor.get(),
|
|
|
|
MaxFileSizeForL0MetaPin(*cfd->GetLatestMutableCFOptions()));
|
|
|
|
TEST_SYNC_POINT_CALLBACK(
|
|
|
|
"ReactiveVersionSet::ApplyOneVersionEditToBuilder:"
|
|
|
|
"AfterLoadTableHandlers",
|
|
|
|
&s);
|
|
|
|
|
|
|
|
if (s.ok()) {
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
|
|
|
auto version = new Version(cfd, this, file_options_,
|
|
|
|
*cfd->GetLatestMutableCFOptions(), io_tracer_,
|
|
|
|
current_version_number_++);
|
|
|
|
s = builder->SaveTo(version->storage_info());
|
|
|
|
if (s.ok()) {
|
|
|
|
version->PrepareApply(*cfd->GetLatestMutableCFOptions(), true);
|
|
|
|
AppendVersion(cfd, version);
|
|
|
|
active_version_builders_.erase(builder_iter);
|
|
|
|
if (cfds_changed->count(cfd) == 0) {
|
|
|
|
cfds_changed->insert(cfd);
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
delete version;
|
|
|
|
}
|
|
|
|
} else if (s.IsPathNotFound()) {
|
|
|
|
s = Status::OK();
|
|
|
|
}
|
|
|
|
// Some other error has occurred during LoadTableHandlers.
|
|
|
|
}
|
|
|
|
|
|
|
|
if (s.ok()) {
|
|
|
|
if (version_edit->HasNextFile()) {
|
|
|
|
next_file_number_.store(version_edit->next_file_number_ + 1);
|
|
|
|
}
|
|
|
|
if (version_edit->has_last_sequence_) {
|
|
|
|
last_allocated_sequence_ = version_edit->last_sequence_;
|
|
|
|
last_published_sequence_ = version_edit->last_sequence_;
|
|
|
|
last_sequence_ = version_edit->last_sequence_;
|
|
|
|
}
|
|
|
|
if (version_edit->has_prev_log_number_) {
|
|
|
|
prev_log_number_ = version_edit->prev_log_number_;
|
|
|
|
MarkFileNumberUsed(version_edit->prev_log_number_);
|
|
|
|
}
|
|
|
|
if (version_edit->has_log_number_) {
|
|
|
|
MarkFileNumberUsed(version_edit->log_number_);
|
|
|
|
}
|
|
|
|
column_family_set_->UpdateMaxColumnFamily(version_edit->max_column_family_);
|
|
|
|
MarkMinLogNumberToKeep2PC(version_edit->min_log_number_to_keep_);
|
|
|
|
}
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
|
|
|
|
Status ReactiveVersionSet::MaybeSwitchManifest(
|
|
|
|
log::Reader::Reporter* reporter,
|
|
|
|
std::unique_ptr<log::FragmentBufferedReader>* manifest_reader) {
|
|
|
|
assert(manifest_reader != nullptr);
|
|
|
|
Status s;
|
|
|
|
do {
|
|
|
|
std::string manifest_path;
|
|
|
|
s = GetCurrentManifestPath(dbname_, fs_.get(), &manifest_path,
|
|
|
|
&manifest_file_number_);
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
|
|
|
std::unique_ptr<FSSequentialFile> manifest_file;
|
|
|
|
if (s.ok()) {
|
|
|
|
if (nullptr == manifest_reader->get() ||
|
|
|
|
manifest_reader->get()->file()->file_name() != manifest_path) {
|
|
|
|
TEST_SYNC_POINT(
|
|
|
|
"ReactiveVersionSet::MaybeSwitchManifest:"
|
|
|
|
"AfterGetCurrentManifestPath:0");
|
|
|
|
TEST_SYNC_POINT(
|
|
|
|
"ReactiveVersionSet::MaybeSwitchManifest:"
|
|
|
|
"AfterGetCurrentManifestPath:1");
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
|
|
|
s = fs_->NewSequentialFile(manifest_path,
|
|
|
|
fs_->OptimizeForManifestRead(file_options_),
|
Introduce a new storage specific Env API (#5761)
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
|
|
|
&manifest_file, nullptr);
|
|
|
|
} else {
|
|
|
|
// No need to switch manifest.
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
std::unique_ptr<SequentialFileReader> manifest_file_reader;
|
|
|
|
if (s.ok()) {
|
|
|
|
manifest_file_reader.reset(new SequentialFileReader(
|
|
|
|
std::move(manifest_file), manifest_path,
|
|
|
|
db_options_->log_readahead_size, io_tracer_));
|
|
|
|
manifest_reader->reset(new log::FragmentBufferedReader(
|
|
|
|
nullptr, std::move(manifest_file_reader), reporter,
|
|
|
|
true /* checksum */, 0 /* log_number */));
|
|
|
|
ROCKS_LOG_INFO(db_options_->info_log, "Switched to new manifest: %s\n",
|
|
|
|
manifest_path.c_str());
|
|
|
|
// TODO (yanqin) every time we switch to a new MANIFEST, we clear the
|
|
|
|
// active_version_builders_ map because we choose to construct the
|
|
|
|
// versions from scratch, thanks to the first part of each MANIFEST
|
|
|
|
// written by VersionSet::WriteCurrentStatetoManifest. This is not
|
|
|
|
// necessary, but we choose this at present for the sake of simplicity.
|
|
|
|
active_version_builders_.clear();
|
|
|
|
}
|
|
|
|
} while (s.IsPathNotFound());
|
|
|
|
return s;
|
|
|
|
}
|
|
|
|
|
|
|
|
} // namespace ROCKSDB_NAMESPACE
|