|
|
|
// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
|
|
|
|
// This source code is licensed under both the GPLv2 (found in the
|
|
|
|
// COPYING file in the root directory) and Apache 2.0 License
|
|
|
|
// (found in the LICENSE.Apache file in the root directory).
|
|
|
|
//
|
|
|
|
// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
|
|
|
|
// Use of this source code is governed by a BSD-style license that can be
|
|
|
|
// found in the LICENSE file. See the AUTHORS file for names of contributors.
|
|
|
|
|
|
|
|
#include "table/block_based/block_based_table_builder.h"
|
|
|
|
|
|
|
|
#include <assert.h>
|
|
|
|
#include <stdio.h>
|
|
|
|
#include <list>
|
|
|
|
#include <map>
|
|
|
|
#include <memory>
|
|
|
|
#include <string>
|
|
|
|
#include <unordered_map>
|
|
|
|
#include <utility>
|
|
|
|
|
|
|
|
#include "cache/simple_deleter.h"
|
|
|
|
#include "db/dbformat.h"
|
|
|
|
#include "index_builder.h"
|
|
|
|
|
|
|
|
#include "rocksdb/cache.h"
|
|
|
|
#include "rocksdb/comparator.h"
|
|
|
|
#include "rocksdb/env.h"
|
|
|
|
#include "rocksdb/flush_block_policy.h"
|
|
|
|
#include "rocksdb/merge_operator.h"
|
|
|
|
#include "rocksdb/table.h"
|
|
|
|
|
|
|
|
#include "table/block_based/block.h"
|
|
|
|
#include "table/block_based/block_based_filter_block.h"
|
|
|
|
#include "table/block_based/block_based_table_factory.h"
|
|
|
|
#include "table/block_based/block_based_table_reader.h"
|
|
|
|
#include "table/block_based/block_builder.h"
|
|
|
|
#include "table/block_based/filter_block.h"
|
New Bloom filter implementation for full and partitioned filters (#6007)
Summary:
Adds an improved, replacement Bloom filter implementation (FastLocalBloom) for full and partitioned filters in the block-based table. This replacement is faster and more accurate, especially for high bits per key or millions of keys in a single filter.
Speed
The improved speed, at least on recent x86_64, comes from
* Using fastrange instead of modulo (%)
* Using our new hash function (XXH3 preview, added in a previous commit), which is much faster for large keys and only *slightly* slower on keys around 12 bytes if hashing the same size many thousands of times in a row.
* Optimizing the Bloom filter queries with AVX2 SIMD operations. (Added AVX2 to the USE_SSE=1 build.) Careful design was required to support (a) SIMD-optimized queries, (b) compatible non-SIMD code that's simple and efficient, (c) flexible choice of number of probes, and (d) essentially maximized accuracy for a cache-local Bloom filter. Probes are made eight at a time, so any number of probes up to 8 is the same speed, then up to 16, etc.
* Prefetching cache lines when building the filter. Although this optimization could be applied to the old structure as well, it seems to balance out the small added cost of accumulating 64 bit hashes for adding to the filter rather than 32 bit hashes.
Here's nominal speed data from filter_bench (200MB in filters, about 10k keys each, 10 bits filter data / key, 6 probes, avg key size 24 bytes, includes hashing time) on Skylake DE (relatively low clock speed):
$ ./filter_bench -quick -impl=2 -net_includes_hashing # New Bloom filter
Build avg ns/key: 47.7135
Mixed inside/outside queries...
Single filter net ns/op: 26.2825
Random filter net ns/op: 150.459
Average FP rate %: 0.954651
$ ./filter_bench -quick -impl=0 -net_includes_hashing # Old Bloom filter
Build avg ns/key: 47.2245
Mixed inside/outside queries...
Single filter net ns/op: 63.2978
Random filter net ns/op: 188.038
Average FP rate %: 1.13823
Similar build time but dramatically faster query times on hot data (63 ns to 26 ns), and somewhat faster on stale data (188 ns to 150 ns). Performance differences on batched and skewed query loads are between these extremes as expected.
The only other interesting thing about speed is "inside" (query key was added to filter) vs. "outside" (query key was not added to filter) query times. The non-SIMD implementations are substantially slower when most queries are "outside" vs. "inside". This goes against what one might expect or would have observed years ago, as "outside" queries only need about two probes on average, due to short-circuiting, while "inside" always have num_probes (say 6). The problem is probably the nastily unpredictable branch. The SIMD implementation has few branches (very predictable) and has pretty consistent running time regardless of query outcome.
Accuracy
The generally improved accuracy (re: Issue https://github.com/facebook/rocksdb/issues/5857) comes from a better design for probing indices
within a cache line (re: Issue https://github.com/facebook/rocksdb/issues/4120) and improved accuracy for millions of keys in a single filter from using a 64-bit hash function (XXH3p). Design details in code comments.
Accuracy data (generalizes, except old impl gets worse with millions of keys):
Memory bits per key: FP rate percent old impl -> FP rate percent new impl
6: 5.70953 -> 5.69888
8: 2.45766 -> 2.29709
10: 1.13977 -> 0.959254
12: 0.662498 -> 0.411593
16: 0.353023 -> 0.0873754
24: 0.261552 -> 0.0060971
50: 0.225453 -> ~0.00003 (less than 1 in a million queries are FP)
Fixes https://github.com/facebook/rocksdb/issues/5857
Fixes https://github.com/facebook/rocksdb/issues/4120
Unlike the old implementation, this implementation has a fixed cache line size (64 bytes). At 10 bits per key, the accuracy of this new implementation is very close to the old implementation with 128-byte cache line size. If there's sufficient demand, this implementation could be generalized.
Compatibility
Although old releases would see the new structure as corrupt filter data and read the table as if there's no filter, we've decided only to enable the new Bloom filter with new format_version=5. This provides a smooth path for automatic adoption over time, with an option for early opt-in.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6007
Test Plan: filter_bench has been used thoroughly to validate speed, accuracy, and correctness. Unit tests have been carefully updated to exercise new and old implementations, as well as the logic to select an implementation based on context (format_version).
Differential Revision: D18294749
Pulled By: pdillinger
fbshipit-source-id: d44c9db3696e4d0a17caaec47075b7755c262c5f
5 years ago
|
|
|
#include "table/block_based/filter_policy_internal.h"
|
|
|
|
#include "table/block_based/full_filter_block.h"
|
|
|
|
#include "table/block_based/partitioned_filter_block.h"
|
|
|
|
#include "table/format.h"
|
|
|
|
#include "table/table_builder.h"
|
|
|
|
|
|
|
|
#include "memory/memory_allocator.h"
|
|
|
|
#include "util/coding.h"
|
|
|
|
#include "util/compression.h"
|
|
|
|
#include "util/crc32c.h"
|
|
|
|
#include "util/stop_watch.h"
|
|
|
|
#include "util/string_util.h"
|
|
|
|
#include "util/xxhash.h"
|
|
|
|
|
|
|
|
namespace ROCKSDB_NAMESPACE {
|
|
|
|
|
|
|
|
extern const std::string kHashIndexPrefixesBlock;
|
|
|
|
extern const std::string kHashIndexPrefixesMetadataBlock;
|
|
|
|
|
|
|
|
typedef BlockBasedTableOptions::IndexType IndexType;
|
|
|
|
|
|
|
|
// Without anonymous namespace here, we fail the warning -Wmissing-prototypes
|
|
|
|
namespace {
|
|
|
|
|
|
|
|
// Create a filter block builder based on its type.
|
|
|
|
FilterBlockBuilder* CreateFilterBlockBuilder(
|
|
|
|
const ImmutableCFOptions& /*opt*/, const MutableCFOptions& mopt,
|
|
|
|
const FilterBuildingContext& context,
|
|
|
|
const bool use_delta_encoding_for_index_values,
|
|
|
|
PartitionedIndexBuilder* const p_index_builder) {
|
|
|
|
const BlockBasedTableOptions& table_opt = context.table_options;
|
|
|
|
if (table_opt.filter_policy == nullptr) return nullptr;
|
|
|
|
|
|
|
|
FilterBitsBuilder* filter_bits_builder =
|
|
|
|
BloomFilterPolicy::GetBuilderFromContext(context);
|
|
|
|
if (filter_bits_builder == nullptr) {
|
|
|
|
return new BlockBasedFilterBlockBuilder(mopt.prefix_extractor.get(),
|
|
|
|
table_opt);
|
|
|
|
} else {
|
|
|
|
if (table_opt.partition_filters) {
|
|
|
|
assert(p_index_builder != nullptr);
|
|
|
|
// Since after partition cut request from filter builder it takes time
|
|
|
|
// until index builder actully cuts the partition, we take the lower bound
|
|
|
|
// as partition size.
|
|
|
|
assert(table_opt.block_size_deviation <= 100);
|
|
|
|
auto partition_size =
|
|
|
|
static_cast<uint32_t>(((table_opt.metadata_block_size *
|
|
|
|
(100 - table_opt.block_size_deviation)) +
|
|
|
|
99) /
|
|
|
|
100);
|
|
|
|
partition_size = std::max(partition_size, static_cast<uint32_t>(1));
|
|
|
|
return new PartitionedFilterBlockBuilder(
|
|
|
|
mopt.prefix_extractor.get(), table_opt.whole_key_filtering,
|
|
|
|
filter_bits_builder, table_opt.index_block_restart_interval,
|
|
|
|
use_delta_encoding_for_index_values, p_index_builder, partition_size);
|
|
|
|
} else {
|
|
|
|
return new FullFilterBlockBuilder(mopt.prefix_extractor.get(),
|
|
|
|
table_opt.whole_key_filtering,
|
|
|
|
filter_bits_builder);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
bool GoodCompressionRatio(size_t compressed_size, size_t raw_size) {
|
|
|
|
// Check to see if compressed less than 12.5%
|
|
|
|
return compressed_size < raw_size - (raw_size / 8u);
|
|
|
|
}
|
|
|
|
|
|
|
|
bool CompressBlockInternal(const Slice& raw,
|
|
|
|
const CompressionInfo& compression_info,
|
|
|
|
uint32_t format_version,
|
|
|
|
std::string* compressed_output) {
|
|
|
|
// Will return compressed block contents if (1) the compression method is
|
|
|
|
// supported in this platform and (2) the compression rate is "good enough".
|
|
|
|
switch (compression_info.type()) {
|
|
|
|
case kSnappyCompression:
|
|
|
|
return Snappy_Compress(compression_info, raw.data(), raw.size(),
|
|
|
|
compressed_output);
|
|
|
|
case kZlibCompression:
|
|
|
|
return Zlib_Compress(
|
|
|
|
compression_info,
|
|
|
|
GetCompressFormatForVersion(kZlibCompression, format_version),
|
|
|
|
raw.data(), raw.size(), compressed_output);
|
|
|
|
case kBZip2Compression:
|
|
|
|
return BZip2_Compress(
|
|
|
|
compression_info,
|
|
|
|
GetCompressFormatForVersion(kBZip2Compression, format_version),
|
|
|
|
raw.data(), raw.size(), compressed_output);
|
|
|
|
case kLZ4Compression:
|
|
|
|
return LZ4_Compress(
|
|
|
|
compression_info,
|
|
|
|
GetCompressFormatForVersion(kLZ4Compression, format_version),
|
|
|
|
raw.data(), raw.size(), compressed_output);
|
|
|
|
case kLZ4HCCompression:
|
|
|
|
return LZ4HC_Compress(
|
|
|
|
compression_info,
|
|
|
|
GetCompressFormatForVersion(kLZ4HCCompression, format_version),
|
|
|
|
raw.data(), raw.size(), compressed_output);
|
|
|
|
case kXpressCompression:
|
|
|
|
return XPRESS_Compress(raw.data(), raw.size(), compressed_output);
|
|
|
|
case kZSTD:
|
|
|
|
case kZSTDNotFinalCompression:
|
|
|
|
return ZSTD_Compress(compression_info, raw.data(), raw.size(),
|
|
|
|
compressed_output);
|
|
|
|
default:
|
|
|
|
// Do not recognize this compression type
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
} // namespace
|
|
|
|
|
|
|
|
// format_version is the block format as defined in include/rocksdb/table.h
|
|
|
|
Slice CompressBlock(const Slice& raw, const CompressionInfo& info,
|
|
|
|
CompressionType* type, uint32_t format_version,
|
|
|
|
bool do_sample, std::string* compressed_output,
|
|
|
|
std::string* sampled_output_fast,
|
|
|
|
std::string* sampled_output_slow) {
|
|
|
|
*type = info.type();
|
|
|
|
|
|
|
|
if (info.type() == kNoCompression && !info.SampleForCompression()) {
|
|
|
|
return raw;
|
|
|
|
}
|
|
|
|
|
|
|
|
// If requested, we sample one in every N block with a
|
|
|
|
// fast and slow compression algorithm and report the stats.
|
|
|
|
// The users can use these stats to decide if it is worthwhile
|
|
|
|
// enabling compression and they also get a hint about which
|
|
|
|
// compression algorithm wil be beneficial.
|
|
|
|
if (do_sample && info.SampleForCompression() &&
|
|
|
|
Random::GetTLSInstance()->OneIn((int)info.SampleForCompression()) &&
|
|
|
|
sampled_output_fast && sampled_output_slow) {
|
|
|
|
// Sampling with a fast compression algorithm
|
|
|
|
if (LZ4_Supported() || Snappy_Supported()) {
|
|
|
|
CompressionType c =
|
|
|
|
LZ4_Supported() ? kLZ4Compression : kSnappyCompression;
|
|
|
|
CompressionContext context(c);
|
|
|
|
CompressionOptions options;
|
|
|
|
CompressionInfo info_tmp(options, context,
|
|
|
|
CompressionDict::GetEmptyDict(), c,
|
|
|
|
info.SampleForCompression());
|
|
|
|
|
|
|
|
CompressBlockInternal(raw, info_tmp, format_version, sampled_output_fast);
|
|
|
|
}
|
|
|
|
|
|
|
|
// Sampling with a slow but high-compression algorithm
|
|
|
|
if (ZSTD_Supported() || Zlib_Supported()) {
|
|
|
|
CompressionType c = ZSTD_Supported() ? kZSTD : kZlibCompression;
|
|
|
|
CompressionContext context(c);
|
|
|
|
CompressionOptions options;
|
|
|
|
CompressionInfo info_tmp(options, context,
|
|
|
|
CompressionDict::GetEmptyDict(), c,
|
|
|
|
info.SampleForCompression());
|
|
|
|
CompressBlockInternal(raw, info_tmp, format_version, sampled_output_slow);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
// Actually compress the data
|
|
|
|
if (*type != kNoCompression) {
|
|
|
|
if (CompressBlockInternal(raw, info, format_version, compressed_output) &&
|
|
|
|
GoodCompressionRatio(compressed_output->size(), raw.size())) {
|
|
|
|
return *compressed_output;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
// Compression method is not supported, or not good
|
|
|
|
// compression ratio, so just fall back to uncompressed form.
|
|
|
|
*type = kNoCompression;
|
|
|
|
return raw;
|
|
|
|
}
|
|
|
|
|
|
|
|
// kBlockBasedTableMagicNumber was picked by running
|
|
|
|
// echo rocksdb.table.block_based | sha1sum
|
|
|
|
// and taking the leading 64 bits.
|
|
|
|
// Please note that kBlockBasedTableMagicNumber may also be accessed by other
|
|
|
|
// .cc files
|
|
|
|
// for that reason we declare it extern in the header but to get the space
|
|
|
|
// allocated
|
|
|
|
// it must be not extern in one place.
|
|
|
|
const uint64_t kBlockBasedTableMagicNumber = 0x88e241b785f4cff7ull;
|
|
|
|
// We also support reading and writing legacy block based table format (for
|
|
|
|
// backwards compatibility)
|
|
|
|
const uint64_t kLegacyBlockBasedTableMagicNumber = 0xdb4775248b80fb57ull;
|
|
|
|
|
|
|
|
// A collector that collects properties of interest to block-based table.
|
|
|
|
// For now this class looks heavy-weight since we only write one additional
|
|
|
|
// property.
|
|
|
|
// But in the foreseeable future, we will add more and more properties that are
|
|
|
|
// specific to block-based table.
|
|
|
|
class BlockBasedTableBuilder::BlockBasedTablePropertiesCollector
|
A new call back to TablePropertiesCollector to allow users know the entry is add, delete or merge
Summary:
Currently users have no idea a key is add, delete or merge from TablePropertiesCollector call back. Add a new function to add it.
Also refactor the codes so that
(1) make table property collector and internal table property collector two separate data structures with the later one now exposed
(2) table builders only receive internal table properties
Test Plan: Add cases in table_properties_collector_test to cover both of old and new ways of using TablePropertiesCollector.
Reviewers: yhchiang, igor.sugak, rven, igor
Reviewed By: rven, igor
Subscribers: meyering, yoshinorim, maykov, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D35373
10 years ago
|
|
|
: public IntTblPropCollector {
|
|
|
|
public:
|
|
|
|
explicit BlockBasedTablePropertiesCollector(
|
|
|
|
BlockBasedTableOptions::IndexType index_type, bool whole_key_filtering,
|
|
|
|
bool prefix_filtering)
|
|
|
|
: index_type_(index_type),
|
|
|
|
whole_key_filtering_(whole_key_filtering),
|
|
|
|
prefix_filtering_(prefix_filtering) {}
|
|
|
|
|
|
|
|
Status InternalAdd(const Slice& /*key*/, const Slice& /*value*/,
|
|
|
|
uint64_t /*file_size*/) override {
|
|
|
|
// Intentionally left blank. Have no interest in collecting stats for
|
|
|
|
// individual key/value pairs.
|
|
|
|
return Status::OK();
|
|
|
|
}
|
|
|
|
|
|
|
|
virtual void BlockAdd(uint64_t /* blockRawBytes */,
|
|
|
|
uint64_t /* blockCompressedBytesFast */,
|
|
|
|
uint64_t /* blockCompressedBytesSlow */) override {
|
|
|
|
// Intentionally left blank. No interest in collecting stats for
|
|
|
|
// blocks.
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
Status Finish(UserCollectedProperties* properties) override {
|
|
|
|
std::string val;
|
|
|
|
PutFixed32(&val, static_cast<uint32_t>(index_type_));
|
|
|
|
properties->insert({BlockBasedTablePropertyNames::kIndexType, val});
|
|
|
|
properties->insert({BlockBasedTablePropertyNames::kWholeKeyFiltering,
|
|
|
|
whole_key_filtering_ ? kPropTrue : kPropFalse});
|
|
|
|
properties->insert({BlockBasedTablePropertyNames::kPrefixFiltering,
|
|
|
|
prefix_filtering_ ? kPropTrue : kPropFalse});
|
|
|
|
return Status::OK();
|
|
|
|
}
|
|
|
|
|
|
|
|
// The name of the properties collector can be used for debugging purpose.
|
|
|
|
const char* Name() const override {
|
|
|
|
return "BlockBasedTablePropertiesCollector";
|
|
|
|
}
|
|
|
|
|
|
|
|
UserCollectedProperties GetReadableProperties() const override {
|
|
|
|
// Intentionally left blank.
|
|
|
|
return UserCollectedProperties();
|
|
|
|
}
|
|
|
|
|
|
|
|
private:
|
|
|
|
BlockBasedTableOptions::IndexType index_type_;
|
|
|
|
bool whole_key_filtering_;
|
|
|
|
bool prefix_filtering_;
|
|
|
|
};
|
|
|
|
|
|
|
|
struct BlockBasedTableBuilder::Rep {
|
|
|
|
const ImmutableCFOptions ioptions;
|
|
|
|
const MutableCFOptions moptions;
|
|
|
|
const BlockBasedTableOptions table_options;
|
|
|
|
const InternalKeyComparator& internal_comparator;
|
Move rate_limiter, write buffering, most perf context instrumentation and most random kill out of Env
Summary: We want to keep Env a think layer for better portability. Less platform dependent codes should be moved out of Env. In this patch, I create a wrapper of file readers and writers, and put rate limiting, write buffering, as well as most perf context instrumentation and random kill out of Env. It will make it easier to maintain multiple Env in the future.
Test Plan: Run all existing unit tests.
Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D42321
9 years ago
|
|
|
WritableFileWriter* file;
|
|
|
|
uint64_t offset = 0;
|
|
|
|
Status status;
|
Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
5 years ago
|
|
|
IOStatus io_status;
|
|
|
|
size_t alignment;
|
|
|
|
BlockBuilder data_block;
|
Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
6 years ago
|
|
|
// Buffers uncompressed data blocks and keys to replay later. Needed when
|
|
|
|
// compression dictionary is enabled so we can finalize the dictionary before
|
|
|
|
// compressing any data blocks.
|
|
|
|
// TODO(ajkr): ideally we don't buffer all keys and all uncompressed data
|
|
|
|
// blocks as it's redundant, but it's easier to implement for now.
|
|
|
|
std::vector<std::pair<std::string, std::vector<std::string>>>
|
|
|
|
data_block_and_keys_buffers;
|
|
|
|
BlockBuilder range_del_block;
|
|
|
|
|
|
|
|
InternalKeySliceTransform internal_prefix_transform;
|
|
|
|
std::unique_ptr<IndexBuilder> index_builder;
|
|
|
|
PartitionedIndexBuilder* p_index_builder_ = nullptr;
|
|
|
|
|
|
|
|
std::string last_key;
|
|
|
|
CompressionType compression_type;
|
|
|
|
uint64_t sample_for_compression;
|
|
|
|
CompressionOptions compression_opts;
|
Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
6 years ago
|
|
|
std::unique_ptr<CompressionDict> compression_dict;
|
|
|
|
CompressionContext compression_ctx;
|
|
|
|
std::unique_ptr<UncompressionContext> verify_ctx;
|
Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
6 years ago
|
|
|
std::unique_ptr<UncompressionDict> verify_dict;
|
|
|
|
|
|
|
|
size_t data_begin_offset = 0;
|
|
|
|
|
|
|
|
TableProperties props;
|
|
|
|
|
Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
6 years ago
|
|
|
// States of the builder.
|
|
|
|
//
|
|
|
|
// - `kBuffered`: This is the initial state where zero or more data blocks are
|
|
|
|
// accumulated uncompressed in-memory. From this state, call
|
|
|
|
// `EnterUnbuffered()` to finalize the compression dictionary if enabled,
|
|
|
|
// compress/write out any buffered blocks, and proceed to the `kUnbuffered`
|
|
|
|
// state.
|
|
|
|
//
|
|
|
|
// - `kUnbuffered`: This is the state when compression dictionary is finalized
|
|
|
|
// either because it wasn't enabled in the first place or it's been created
|
|
|
|
// from sampling previously buffered data. In this state, blocks are simply
|
|
|
|
// compressed/written out as they fill up. From this state, call `Finish()`
|
|
|
|
// to complete the file (write meta-blocks, etc.), or `Abandon()` to delete
|
|
|
|
// the partially created file.
|
|
|
|
//
|
|
|
|
// - `kClosed`: This indicates either `Finish()` or `Abandon()` has been
|
|
|
|
// called, so the table builder is no longer usable. We must be in this
|
|
|
|
// state by the time the destructor runs.
|
|
|
|
enum class State {
|
|
|
|
kBuffered,
|
|
|
|
kUnbuffered,
|
|
|
|
kClosed,
|
|
|
|
};
|
|
|
|
State state;
|
|
|
|
|
|
|
|
const bool use_delta_encoding_for_index_values;
|
|
|
|
std::unique_ptr<FilterBlockBuilder> filter_builder;
|
|
|
|
char compressed_cache_key_prefix[BlockBasedTable::kMaxCacheKeyPrefixSize];
|
|
|
|
size_t compressed_cache_key_prefix_size;
|
|
|
|
|
|
|
|
BlockHandle pending_handle; // Handle to add to index block
|
|
|
|
|
|
|
|
std::string compressed_output;
|
|
|
|
std::unique_ptr<FlushBlockPolicy> flush_block_policy;
|
|
|
|
int level_at_creation;
|
|
|
|
uint32_t column_family_id;
|
|
|
|
const std::string& column_family_name;
|
|
|
|
uint64_t creation_time = 0;
|
|
|
|
uint64_t oldest_key_time = 0;
|
Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
6 years ago
|
|
|
const uint64_t target_file_size;
|
Periodic Compactions (#5166)
Summary:
Introducing Periodic Compactions.
This feature allows all the files in a CF to be periodically compacted. It could help in catching any corruptions that could creep into the DB proactively as every file is constantly getting re-compacted. And also, of course, it helps to cleanup data older than certain threshold.
- Introduced a new option `periodic_compaction_time` to control how long a file can live without being compacted in a CF.
- This works across all levels.
- The files are put in the same level after going through the compaction. (Related files in the same level are picked up as `ExpandInputstoCleanCut` is used).
- Compaction filters, if any, are invoked as usual.
- A new table property, `file_creation_time`, is introduced to implement this feature. This property is set to the time at which the SST file was created (and that time is given by the underlying Env/OS).
This feature can be enabled on its own, or in conjunction with `ttl`. It is possible to set a different time threshold for the bottom level when used in conjunction with ttl. Since `ttl` works only on 0 to last but one levels, you could set `ttl` to, say, 1 day, and `periodic_compaction_time` to, say, 7 days. Since `ttl < periodic_compaction_time` all files in last but one levels keep getting picked up based on ttl, and almost never based on periodic_compaction_time. The files in the bottom level get picked up for compaction based on `periodic_compaction_time`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5166
Differential Revision: D14884441
Pulled By: sagar0
fbshipit-source-id: 408426cbacb409c06386a98632dcf90bfa1bda47
6 years ago
|
|
|
uint64_t file_creation_time = 0;
|
|
|
|
|
A new call back to TablePropertiesCollector to allow users know the entry is add, delete or merge
Summary:
Currently users have no idea a key is add, delete or merge from TablePropertiesCollector call back. Add a new function to add it.
Also refactor the codes so that
(1) make table property collector and internal table property collector two separate data structures with the later one now exposed
(2) table builders only receive internal table properties
Test Plan: Add cases in table_properties_collector_test to cover both of old and new ways of using TablePropertiesCollector.
Reviewers: yhchiang, igor.sugak, rven, igor
Reviewed By: rven, igor
Subscribers: meyering, yoshinorim, maykov, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D35373
10 years ago
|
|
|
std::vector<std::unique_ptr<IntTblPropCollector>> table_properties_collectors;
|
TablePropertiesCollectorFactory
Summary:
This diff addresses task #4296714 and rethinks how users provide us with TablePropertiesCollectors as part of Options.
Here's description of task #4296714:
I'm debugging #4295529 and noticed that our count of user properties kDeletedKeys is wrong. We're sharing one single InternalKeyPropertiesCollector with all Table Builders. In LOG Files, we're outputting number of kDeletedKeys as connected with a single table, while it's actually the total count of deleted keys since creation of the DB.
For example, this table has 3155 entries and 1391828 deleted keys.
The problem with current approach that we call methods on a single TablePropertiesCollector for all the tables we create. Even worse, we could do it from multiple threads at the same time and TablePropertiesCollector has no way of knowing which table we're calling it for.
Good part: Looks like nobody inside Facebook is using Options::table_properties_collectors. This means we should be able to painfully change the API.
In this change, I introduce TablePropertiesCollectorFactory. For every table we create, we call `CreateTablePropertiesCollector`, which creates a TablePropertiesCollector for a single table. We then use it sequentially from a single thread, which means it doesn't have to be thread-safe.
Test Plan:
Added a test in table_properties_collector_test that fails on master (build two tables, assert that kDeletedKeys count is correct for the second one).
Also, all other tests
Reviewers: sdong, dhruba, haobo, kailiu
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D18579
11 years ago
|
|
|
|
|
|
|
Rep(const ImmutableCFOptions& _ioptions, const MutableCFOptions& _moptions,
|
|
|
|
const BlockBasedTableOptions& table_opt,
|
A new call back to TablePropertiesCollector to allow users know the entry is add, delete or merge
Summary:
Currently users have no idea a key is add, delete or merge from TablePropertiesCollector call back. Add a new function to add it.
Also refactor the codes so that
(1) make table property collector and internal table property collector two separate data structures with the later one now exposed
(2) table builders only receive internal table properties
Test Plan: Add cases in table_properties_collector_test to cover both of old and new ways of using TablePropertiesCollector.
Reviewers: yhchiang, igor.sugak, rven, igor
Reviewed By: rven, igor
Subscribers: meyering, yoshinorim, maykov, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D35373
10 years ago
|
|
|
const InternalKeyComparator& icomparator,
|
|
|
|
const std::vector<std::unique_ptr<IntTblPropCollectorFactory>>*
|
|
|
|
int_tbl_prop_collector_factories,
|
|
|
|
uint32_t _column_family_id, WritableFileWriter* f,
|
|
|
|
const CompressionType _compression_type,
|
|
|
|
const uint64_t _sample_for_compression,
|
Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
6 years ago
|
|
|
const CompressionOptions& _compression_opts, const bool skip_filters,
|
|
|
|
const int _level_at_creation, const std::string& _column_family_name,
|
|
|
|
const uint64_t _creation_time, const uint64_t _oldest_key_time,
|
|
|
|
const uint64_t _target_file_size, const uint64_t _file_creation_time)
|
|
|
|
: ioptions(_ioptions),
|
|
|
|
moptions(_moptions),
|
|
|
|
table_options(table_opt),
|
|
|
|
internal_comparator(icomparator),
|
|
|
|
file(f),
|
|
|
|
alignment(table_options.block_align
|
|
|
|
? std::min(table_options.block_size, kDefaultPageSize)
|
|
|
|
: 0),
|
|
|
|
data_block(table_options.block_restart_interval,
|
|
|
|
table_options.use_delta_encoding,
|
|
|
|
false /* use_value_delta_encoding */,
|
|
|
|
icomparator.user_comparator()
|
|
|
|
->CanKeysWithDifferentByteContentsBeEqual()
|
|
|
|
? BlockBasedTableOptions::kDataBlockBinarySearch
|
|
|
|
: table_options.data_block_index_type,
|
|
|
|
table_options.data_block_hash_table_util_ratio),
|
|
|
|
range_del_block(1 /* block_restart_interval */),
|
|
|
|
internal_prefix_transform(_moptions.prefix_extractor.get()),
|
|
|
|
compression_type(_compression_type),
|
|
|
|
sample_for_compression(_sample_for_compression),
|
|
|
|
compression_opts(_compression_opts),
|
Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
6 years ago
|
|
|
compression_dict(),
|
|
|
|
compression_ctx(_compression_type),
|
Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
6 years ago
|
|
|
verify_dict(),
|
|
|
|
state((_compression_opts.max_dict_bytes > 0) ? State::kBuffered
|
|
|
|
: State::kUnbuffered),
|
|
|
|
use_delta_encoding_for_index_values(table_opt.format_version >= 4 &&
|
|
|
|
!table_opt.block_align),
|
|
|
|
compressed_cache_key_prefix_size(0),
|
|
|
|
flush_block_policy(
|
|
|
|
table_options.flush_block_policy_factory->NewFlushBlockPolicy(
|
|
|
|
table_options, data_block)),
|
|
|
|
level_at_creation(_level_at_creation),
|
|
|
|
column_family_id(_column_family_id),
|
|
|
|
column_family_name(_column_family_name),
|
|
|
|
creation_time(_creation_time),
|
Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
6 years ago
|
|
|
oldest_key_time(_oldest_key_time),
|
Periodic Compactions (#5166)
Summary:
Introducing Periodic Compactions.
This feature allows all the files in a CF to be periodically compacted. It could help in catching any corruptions that could creep into the DB proactively as every file is constantly getting re-compacted. And also, of course, it helps to cleanup data older than certain threshold.
- Introduced a new option `periodic_compaction_time` to control how long a file can live without being compacted in a CF.
- This works across all levels.
- The files are put in the same level after going through the compaction. (Related files in the same level are picked up as `ExpandInputstoCleanCut` is used).
- Compaction filters, if any, are invoked as usual.
- A new table property, `file_creation_time`, is introduced to implement this feature. This property is set to the time at which the SST file was created (and that time is given by the underlying Env/OS).
This feature can be enabled on its own, or in conjunction with `ttl`. It is possible to set a different time threshold for the bottom level when used in conjunction with ttl. Since `ttl` works only on 0 to last but one levels, you could set `ttl` to, say, 1 day, and `periodic_compaction_time` to, say, 7 days. Since `ttl < periodic_compaction_time` all files in last but one levels keep getting picked up based on ttl, and almost never based on periodic_compaction_time. The files in the bottom level get picked up for compaction based on `periodic_compaction_time`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5166
Differential Revision: D14884441
Pulled By: sagar0
fbshipit-source-id: 408426cbacb409c06386a98632dcf90bfa1bda47
6 years ago
|
|
|
target_file_size(_target_file_size),
|
|
|
|
file_creation_time(_file_creation_time) {
|
|
|
|
if (table_options.index_type ==
|
|
|
|
BlockBasedTableOptions::kTwoLevelIndexSearch) {
|
|
|
|
p_index_builder_ = PartitionedIndexBuilder::CreateIndexBuilder(
|
|
|
|
&internal_comparator, use_delta_encoding_for_index_values,
|
|
|
|
table_options);
|
|
|
|
index_builder.reset(p_index_builder_);
|
|
|
|
} else {
|
|
|
|
index_builder.reset(IndexBuilder::CreateIndexBuilder(
|
|
|
|
table_options.index_type, &internal_comparator,
|
|
|
|
&this->internal_prefix_transform, use_delta_encoding_for_index_values,
|
|
|
|
table_options));
|
|
|
|
}
|
|
|
|
if (skip_filters) {
|
|
|
|
filter_builder = nullptr;
|
|
|
|
} else {
|
|
|
|
FilterBuildingContext context(table_options);
|
|
|
|
context.column_family_name = column_family_name;
|
|
|
|
context.compaction_style = ioptions.compaction_style;
|
|
|
|
context.level_at_creation = level_at_creation;
|
|
|
|
context.info_log = ioptions.info_log;
|
|
|
|
filter_builder.reset(CreateFilterBlockBuilder(
|
|
|
|
ioptions, moptions, context, use_delta_encoding_for_index_values,
|
|
|
|
p_index_builder_));
|
|
|
|
}
|
|
|
|
|
A new call back to TablePropertiesCollector to allow users know the entry is add, delete or merge
Summary:
Currently users have no idea a key is add, delete or merge from TablePropertiesCollector call back. Add a new function to add it.
Also refactor the codes so that
(1) make table property collector and internal table property collector two separate data structures with the later one now exposed
(2) table builders only receive internal table properties
Test Plan: Add cases in table_properties_collector_test to cover both of old and new ways of using TablePropertiesCollector.
Reviewers: yhchiang, igor.sugak, rven, igor
Reviewed By: rven, igor
Subscribers: meyering, yoshinorim, maykov, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D35373
10 years ago
|
|
|
for (auto& collector_factories : *int_tbl_prop_collector_factories) {
|
TablePropertiesCollectorFactory
Summary:
This diff addresses task #4296714 and rethinks how users provide us with TablePropertiesCollectors as part of Options.
Here's description of task #4296714:
I'm debugging #4295529 and noticed that our count of user properties kDeletedKeys is wrong. We're sharing one single InternalKeyPropertiesCollector with all Table Builders. In LOG Files, we're outputting number of kDeletedKeys as connected with a single table, while it's actually the total count of deleted keys since creation of the DB.
For example, this table has 3155 entries and 1391828 deleted keys.
The problem with current approach that we call methods on a single TablePropertiesCollector for all the tables we create. Even worse, we could do it from multiple threads at the same time and TablePropertiesCollector has no way of knowing which table we're calling it for.
Good part: Looks like nobody inside Facebook is using Options::table_properties_collectors. This means we should be able to painfully change the API.
In this change, I introduce TablePropertiesCollectorFactory. For every table we create, we call `CreateTablePropertiesCollector`, which creates a TablePropertiesCollector for a single table. We then use it sequentially from a single thread, which means it doesn't have to be thread-safe.
Test Plan:
Added a test in table_properties_collector_test that fails on master (build two tables, assert that kDeletedKeys count is correct for the second one).
Also, all other tests
Reviewers: sdong, dhruba, haobo, kailiu
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D18579
11 years ago
|
|
|
table_properties_collectors.emplace_back(
|
|
|
|
collector_factories->CreateIntTblPropCollector(column_family_id));
|
TablePropertiesCollectorFactory
Summary:
This diff addresses task #4296714 and rethinks how users provide us with TablePropertiesCollectors as part of Options.
Here's description of task #4296714:
I'm debugging #4295529 and noticed that our count of user properties kDeletedKeys is wrong. We're sharing one single InternalKeyPropertiesCollector with all Table Builders. In LOG Files, we're outputting number of kDeletedKeys as connected with a single table, while it's actually the total count of deleted keys since creation of the DB.
For example, this table has 3155 entries and 1391828 deleted keys.
The problem with current approach that we call methods on a single TablePropertiesCollector for all the tables we create. Even worse, we could do it from multiple threads at the same time and TablePropertiesCollector has no way of knowing which table we're calling it for.
Good part: Looks like nobody inside Facebook is using Options::table_properties_collectors. This means we should be able to painfully change the API.
In this change, I introduce TablePropertiesCollectorFactory. For every table we create, we call `CreateTablePropertiesCollector`, which creates a TablePropertiesCollector for a single table. We then use it sequentially from a single thread, which means it doesn't have to be thread-safe.
Test Plan:
Added a test in table_properties_collector_test that fails on master (build two tables, assert that kDeletedKeys count is correct for the second one).
Also, all other tests
Reviewers: sdong, dhruba, haobo, kailiu
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D18579
11 years ago
|
|
|
}
|
|
|
|
table_properties_collectors.emplace_back(
|
|
|
|
new BlockBasedTablePropertiesCollector(
|
|
|
|
table_options.index_type, table_options.whole_key_filtering,
|
|
|
|
_moptions.prefix_extractor != nullptr));
|
|
|
|
if (table_options.verify_compression) {
|
|
|
|
verify_ctx.reset(new UncompressionContext(UncompressionContext::NoCache(),
|
|
|
|
compression_type));
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
Rep(const Rep&) = delete;
|
|
|
|
Rep& operator=(const Rep&) = delete;
|
|
|
|
|
|
|
|
~Rep() {}
|
|
|
|
};
|
|
|
|
|
|
|
|
BlockBasedTableBuilder::BlockBasedTableBuilder(
|
|
|
|
const ImmutableCFOptions& ioptions, const MutableCFOptions& moptions,
|
|
|
|
const BlockBasedTableOptions& table_options,
|
A new call back to TablePropertiesCollector to allow users know the entry is add, delete or merge
Summary:
Currently users have no idea a key is add, delete or merge from TablePropertiesCollector call back. Add a new function to add it.
Also refactor the codes so that
(1) make table property collector and internal table property collector two separate data structures with the later one now exposed
(2) table builders only receive internal table properties
Test Plan: Add cases in table_properties_collector_test to cover both of old and new ways of using TablePropertiesCollector.
Reviewers: yhchiang, igor.sugak, rven, igor
Reviewed By: rven, igor
Subscribers: meyering, yoshinorim, maykov, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D35373
10 years ago
|
|
|
const InternalKeyComparator& internal_comparator,
|
|
|
|
const std::vector<std::unique_ptr<IntTblPropCollectorFactory>>*
|
|
|
|
int_tbl_prop_collector_factories,
|
|
|
|
uint32_t column_family_id, WritableFileWriter* file,
|
|
|
|
const CompressionType compression_type,
|
|
|
|
const uint64_t sample_for_compression,
|
Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
6 years ago
|
|
|
const CompressionOptions& compression_opts, const bool skip_filters,
|
|
|
|
const std::string& column_family_name, const int level_at_creation,
|
|
|
|
const uint64_t creation_time, const uint64_t oldest_key_time,
|
|
|
|
const uint64_t target_file_size, const uint64_t file_creation_time) {
|
|
|
|
BlockBasedTableOptions sanitized_table_options(table_options);
|
|
|
|
if (sanitized_table_options.format_version == 0 &&
|
|
|
|
sanitized_table_options.checksum != kCRC32c) {
|
|
|
|
ROCKS_LOG_WARN(
|
|
|
|
ioptions.info_log,
|
|
|
|
"Silently converting format_version to 1 because checksum is "
|
|
|
|
"non-default");
|
|
|
|
// silently convert format_version to 1 to keep consistent with current
|
|
|
|
// behavior
|
|
|
|
sanitized_table_options.format_version = 1;
|
|
|
|
}
|
|
|
|
|
|
|
|
rep_ = new Rep(ioptions, moptions, sanitized_table_options,
|
|
|
|
internal_comparator, int_tbl_prop_collector_factories,
|
|
|
|
column_family_id, file, compression_type,
|
|
|
|
sample_for_compression, compression_opts, skip_filters,
|
|
|
|
level_at_creation, column_family_name, creation_time,
|
|
|
|
oldest_key_time, target_file_size, file_creation_time);
|
|
|
|
|
|
|
|
if (rep_->filter_builder != nullptr) {
|
|
|
|
rep_->filter_builder->StartBlock(0);
|
|
|
|
}
|
|
|
|
if (table_options.block_cache_compressed.get() != nullptr) {
|
|
|
|
BlockBasedTable::GenerateCachePrefix(
|
Move rate_limiter, write buffering, most perf context instrumentation and most random kill out of Env
Summary: We want to keep Env a think layer for better portability. Less platform dependent codes should be moved out of Env. In this patch, I create a wrapper of file readers and writers, and put rate limiting, write buffering, as well as most perf context instrumentation and random kill out of Env. It will make it easier to maintain multiple Env in the future.
Test Plan: Run all existing unit tests.
Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D42321
9 years ago
|
|
|
table_options.block_cache_compressed.get(), file->writable_file(),
|
|
|
|
&rep_->compressed_cache_key_prefix[0],
|
|
|
|
&rep_->compressed_cache_key_prefix_size);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
BlockBasedTableBuilder::~BlockBasedTableBuilder() {
|
Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
6 years ago
|
|
|
// Catch errors where caller forgot to call Finish()
|
|
|
|
assert(rep_->state == Rep::State::kClosed);
|
|
|
|
delete rep_;
|
|
|
|
}
|
|
|
|
|
|
|
|
void BlockBasedTableBuilder::Add(const Slice& key, const Slice& value) {
|
|
|
|
Rep* r = rep_;
|
Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
6 years ago
|
|
|
assert(rep_->state != Rep::State::kClosed);
|
|
|
|
if (!ok()) return;
|
|
|
|
ValueType value_type = ExtractValueType(key);
|
|
|
|
if (IsValueType(value_type)) {
|
|
|
|
#ifndef NDEBUG
|
|
|
|
if (r->props.num_entries > r->props.num_range_deletions) {
|
|
|
|
assert(r->internal_comparator.Compare(key, Slice(r->last_key)) > 0);
|
|
|
|
}
|
|
|
|
#endif // NDEBUG
|
|
|
|
|
|
|
|
auto should_flush = r->flush_block_policy->Update(key, value);
|
|
|
|
if (should_flush) {
|
|
|
|
assert(!r->data_block.empty());
|
|
|
|
Flush();
|
|
|
|
|
Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
6 years ago
|
|
|
if (r->state == Rep::State::kBuffered &&
|
|
|
|
r->data_begin_offset > r->target_file_size) {
|
|
|
|
EnterUnbuffered();
|
|
|
|
}
|
|
|
|
|
|
|
|
// Add item to index block.
|
|
|
|
// We do not emit the index entry for a block until we have seen the
|
|
|
|
// first key for the next data block. This allows us to use shorter
|
|
|
|
// keys in the index block. For example, consider a block boundary
|
|
|
|
// between the keys "the quick brown fox" and "the who". We can use
|
|
|
|
// "the r" as the key for the index block entry since it is >= all
|
|
|
|
// entries in the first block and < all entries in subsequent
|
|
|
|
// blocks.
|
Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
6 years ago
|
|
|
if (ok() && r->state == Rep::State::kUnbuffered) {
|
|
|
|
r->index_builder->AddIndexEntry(&r->last_key, &key, r->pending_handle);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
// Note: PartitionedFilterBlockBuilder requires key being added to filter
|
|
|
|
// builder after being added to index builder.
|
Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
6 years ago
|
|
|
if (r->state == Rep::State::kUnbuffered && r->filter_builder != nullptr) {
|
|
|
|
size_t ts_sz = r->internal_comparator.user_comparator()->timestamp_size();
|
|
|
|
r->filter_builder->Add(ExtractUserKeyAndStripTimestamp(key, ts_sz));
|
|
|
|
}
|
|
|
|
|
|
|
|
r->last_key.assign(key.data(), key.size());
|
|
|
|
r->data_block.Add(key, value);
|
Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
6 years ago
|
|
|
if (r->state == Rep::State::kBuffered) {
|
|
|
|
// Buffer keys to be replayed during `Finish()` once compression
|
|
|
|
// dictionary has been finalized.
|
|
|
|
if (r->data_block_and_keys_buffers.empty() || should_flush) {
|
|
|
|
r->data_block_and_keys_buffers.emplace_back();
|
|
|
|
}
|
|
|
|
r->data_block_and_keys_buffers.back().second.emplace_back(key.ToString());
|
|
|
|
} else {
|
|
|
|
r->index_builder->OnKeyAdded(key);
|
|
|
|
}
|
|
|
|
NotifyCollectTableCollectorsOnAdd(key, value, r->offset,
|
|
|
|
r->table_properties_collectors,
|
|
|
|
r->ioptions.info_log);
|
|
|
|
|
|
|
|
} else if (value_type == kTypeRangeDeletion) {
|
|
|
|
r->range_del_block.Add(key, value);
|
|
|
|
NotifyCollectTableCollectorsOnAdd(key, value, r->offset,
|
|
|
|
r->table_properties_collectors,
|
|
|
|
r->ioptions.info_log);
|
|
|
|
} else {
|
|
|
|
assert(false);
|
|
|
|
}
|
|
|
|
|
|
|
|
r->props.num_entries++;
|
|
|
|
r->props.raw_key_size += key.size();
|
|
|
|
r->props.raw_value_size += value.size();
|
|
|
|
if (value_type == kTypeDeletion || value_type == kTypeSingleDeletion) {
|
|
|
|
r->props.num_deletions++;
|
|
|
|
} else if (value_type == kTypeRangeDeletion) {
|
|
|
|
r->props.num_deletions++;
|
|
|
|
r->props.num_range_deletions++;
|
|
|
|
} else if (value_type == kTypeMerge) {
|
|
|
|
r->props.num_merge_operands++;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
void BlockBasedTableBuilder::Flush() {
|
|
|
|
Rep* r = rep_;
|
Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
6 years ago
|
|
|
assert(rep_->state != Rep::State::kClosed);
|
|
|
|
if (!ok()) return;
|
|
|
|
if (r->data_block.empty()) return;
|
Shared dictionary compression using reference block
Summary:
This adds a new metablock containing a shared dictionary that is used
to compress all data blocks in the SST file. The size of the shared dictionary
is configurable in CompressionOptions and defaults to 0. It's currently only
used for zlib/lz4/lz4hc, but the block will be stored in the SST regardless of
the compression type if the user chooses a nonzero dictionary size.
During compaction, computes the dictionary by randomly sampling the first
output file in each subcompaction. It pre-computes the intervals to sample
by assuming the output file will have the maximum allowable length. In case
the file is smaller, some of the pre-computed sampling intervals can be beyond
end-of-file, in which case we skip over those samples and the dictionary will
be a bit smaller. After the dictionary is generated using the first file in a
subcompaction, it is loaded into the compression library before writing each
block in each subsequent file of that subcompaction.
On the read path, gets the dictionary from the metablock, if it exists. Then,
loads that dictionary into the compression library before reading each block.
Test Plan: new unit test
Reviewers: yhchiang, IslamAbdelRahman, cyan, sdong
Reviewed By: sdong
Subscribers: andrewkr, yoshinorim, kradhakrishnan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D52287
9 years ago
|
|
|
WriteBlock(&r->data_block, &r->pending_handle, true /* is_data_block */);
|
|
|
|
}
|
|
|
|
|
|
|
|
void BlockBasedTableBuilder::WriteBlock(BlockBuilder* block,
|
Shared dictionary compression using reference block
Summary:
This adds a new metablock containing a shared dictionary that is used
to compress all data blocks in the SST file. The size of the shared dictionary
is configurable in CompressionOptions and defaults to 0. It's currently only
used for zlib/lz4/lz4hc, but the block will be stored in the SST regardless of
the compression type if the user chooses a nonzero dictionary size.
During compaction, computes the dictionary by randomly sampling the first
output file in each subcompaction. It pre-computes the intervals to sample
by assuming the output file will have the maximum allowable length. In case
the file is smaller, some of the pre-computed sampling intervals can be beyond
end-of-file, in which case we skip over those samples and the dictionary will
be a bit smaller. After the dictionary is generated using the first file in a
subcompaction, it is loaded into the compression library before writing each
block in each subsequent file of that subcompaction.
On the read path, gets the dictionary from the metablock, if it exists. Then,
loads that dictionary into the compression library before reading each block.
Test Plan: new unit test
Reviewers: yhchiang, IslamAbdelRahman, cyan, sdong
Reviewed By: sdong
Subscribers: andrewkr, yoshinorim, kradhakrishnan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D52287
9 years ago
|
|
|
BlockHandle* handle,
|
|
|
|
bool is_data_block) {
|
|
|
|
WriteBlock(block->Finish(), handle, is_data_block);
|
|
|
|
block->Reset();
|
|
|
|
}
|
|
|
|
|
|
|
|
void BlockBasedTableBuilder::WriteBlock(const Slice& raw_block_contents,
|
Shared dictionary compression using reference block
Summary:
This adds a new metablock containing a shared dictionary that is used
to compress all data blocks in the SST file. The size of the shared dictionary
is configurable in CompressionOptions and defaults to 0. It's currently only
used for zlib/lz4/lz4hc, but the block will be stored in the SST regardless of
the compression type if the user chooses a nonzero dictionary size.
During compaction, computes the dictionary by randomly sampling the first
output file in each subcompaction. It pre-computes the intervals to sample
by assuming the output file will have the maximum allowable length. In case
the file is smaller, some of the pre-computed sampling intervals can be beyond
end-of-file, in which case we skip over those samples and the dictionary will
be a bit smaller. After the dictionary is generated using the first file in a
subcompaction, it is loaded into the compression library before writing each
block in each subsequent file of that subcompaction.
On the read path, gets the dictionary from the metablock, if it exists. Then,
loads that dictionary into the compression library before reading each block.
Test Plan: new unit test
Reviewers: yhchiang, IslamAbdelRahman, cyan, sdong
Reviewed By: sdong
Subscribers: andrewkr, yoshinorim, kradhakrishnan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D52287
9 years ago
|
|
|
BlockHandle* handle,
|
|
|
|
bool is_data_block) {
|
|
|
|
// File format contains a sequence of blocks where each block has:
|
|
|
|
// block_data: uint8[n]
|
|
|
|
// type: uint8
|
|
|
|
// crc: uint32
|
|
|
|
assert(ok());
|
|
|
|
Rep* r = rep_;
|
|
|
|
|
|
|
|
auto type = r->compression_type;
|
|
|
|
uint64_t sample_for_compression = r->sample_for_compression;
|
|
|
|
Slice block_contents;
|
|
|
|
bool abort_compression = false;
|
|
|
|
|
|
|
|
StopWatchNano timer(
|
|
|
|
r->ioptions.env,
|
|
|
|
ShouldReportDetailedTime(r->ioptions.env, r->ioptions.statistics));
|
|
|
|
|
Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
6 years ago
|
|
|
if (r->state == Rep::State::kBuffered) {
|
|
|
|
assert(is_data_block);
|
|
|
|
assert(!r->data_block_and_keys_buffers.empty());
|
|
|
|
r->data_block_and_keys_buffers.back().first = raw_block_contents.ToString();
|
|
|
|
r->data_begin_offset += r->data_block_and_keys_buffers.back().first.size();
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (raw_block_contents.size() < kCompressionSizeLimit) {
|
Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
6 years ago
|
|
|
const CompressionDict* compression_dict;
|
|
|
|
if (!is_data_block || r->compression_dict == nullptr) {
|
|
|
|
compression_dict = &CompressionDict::GetEmptyDict();
|
|
|
|
} else {
|
|
|
|
compression_dict = r->compression_dict.get();
|
|
|
|
}
|
|
|
|
assert(compression_dict != nullptr);
|
|
|
|
CompressionInfo compression_info(r->compression_opts, r->compression_ctx,
|
|
|
|
*compression_dict, type,
|
|
|
|
sample_for_compression);
|
|
|
|
|
|
|
|
std::string sampled_output_fast;
|
|
|
|
std::string sampled_output_slow;
|
|
|
|
block_contents = CompressBlock(
|
|
|
|
raw_block_contents, compression_info, &type,
|
|
|
|
r->table_options.format_version, is_data_block /* do_sample */,
|
|
|
|
&r->compressed_output, &sampled_output_fast, &sampled_output_slow);
|
|
|
|
|
|
|
|
// notify collectors on block add
|
|
|
|
NotifyCollectTableCollectorsOnBlockAdd(
|
|
|
|
r->table_properties_collectors, raw_block_contents.size(),
|
|
|
|
sampled_output_fast.size(), sampled_output_slow.size());
|
|
|
|
|
|
|
|
// Some of the compression algorithms are known to be unreliable. If
|
|
|
|
// the verify_compression flag is set then try to de-compress the
|
|
|
|
// compressed data and compare to the input.
|
|
|
|
if (type != kNoCompression && r->table_options.verify_compression) {
|
|
|
|
// Retrieve the uncompressed contents into a new buffer
|
Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
6 years ago
|
|
|
const UncompressionDict* verify_dict;
|
|
|
|
if (!is_data_block || r->verify_dict == nullptr) {
|
|
|
|
verify_dict = &UncompressionDict::GetEmptyDict();
|
|
|
|
} else {
|
|
|
|
verify_dict = r->verify_dict.get();
|
|
|
|
}
|
|
|
|
assert(verify_dict != nullptr);
|
|
|
|
BlockContents contents;
|
Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
6 years ago
|
|
|
UncompressionInfo uncompression_info(*r->verify_ctx, *verify_dict,
|
|
|
|
r->compression_type);
|
|
|
|
Status stat = UncompressBlockContentsForCompressionType(
|
|
|
|
uncompression_info, block_contents.data(), block_contents.size(),
|
|
|
|
&contents, r->table_options.format_version, r->ioptions);
|
|
|
|
|
|
|
|
if (stat.ok()) {
|
|
|
|
bool compressed_ok = contents.data.compare(raw_block_contents) == 0;
|
|
|
|
if (!compressed_ok) {
|
|
|
|
// The result of the compression was invalid. abort.
|
|
|
|
abort_compression = true;
|
|
|
|
ROCKS_LOG_ERROR(r->ioptions.info_log,
|
|
|
|
"Decompressed block did not match raw block");
|
|
|
|
r->status =
|
|
|
|
Status::Corruption("Decompressed block did not match raw block");
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
// Decompression reported an error. abort.
|
|
|
|
r->status = Status::Corruption("Could not decompress");
|
|
|
|
abort_compression = true;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
} else {
|
|
|
|
// Block is too big to be compressed.
|
|
|
|
abort_compression = true;
|
|
|
|
}
|
|
|
|
|
|
|
|
// Abort compression if the block is too big, or did not pass
|
|
|
|
// verification.
|
|
|
|
if (abort_compression) {
|
|
|
|
RecordTick(r->ioptions.statistics, NUMBER_BLOCK_NOT_COMPRESSED);
|
|
|
|
type = kNoCompression;
|
|
|
|
block_contents = raw_block_contents;
|
|
|
|
} else if (type != kNoCompression) {
|
|
|
|
if (ShouldReportDetailedTime(r->ioptions.env, r->ioptions.statistics)) {
|
|
|
|
RecordTimeToHistogram(r->ioptions.statistics, COMPRESSION_TIMES_NANOS,
|
|
|
|
timer.ElapsedNanos());
|
|
|
|
}
|
|
|
|
RecordInHistogram(r->ioptions.statistics, BYTES_COMPRESSED,
|
|
|
|
raw_block_contents.size());
|
|
|
|
RecordTick(r->ioptions.statistics, NUMBER_BLOCK_COMPRESSED);
|
|
|
|
} else if (type != r->compression_type) {
|
|
|
|
RecordTick(r->ioptions.statistics, NUMBER_BLOCK_NOT_COMPRESSED);
|
|
|
|
}
|
|
|
|
|
|
|
|
WriteRawBlock(block_contents, type, handle, is_data_block);
|
|
|
|
r->compressed_output.clear();
|
Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
6 years ago
|
|
|
if (is_data_block) {
|
|
|
|
if (r->filter_builder != nullptr) {
|
|
|
|
r->filter_builder->StartBlock(r->offset);
|
|
|
|
}
|
|
|
|
r->props.data_size = r->offset;
|
|
|
|
++r->props.num_data_blocks;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
void BlockBasedTableBuilder::WriteRawBlock(const Slice& block_contents,
|
|
|
|
CompressionType type,
|
|
|
|
BlockHandle* handle,
|
|
|
|
bool is_data_block) {
|
|
|
|
Rep* r = rep_;
|
|
|
|
StopWatch sw(r->ioptions.env, r->ioptions.statistics, WRITE_RAW_BLOCK_MICROS);
|
|
|
|
handle->set_offset(r->offset);
|
|
|
|
handle->set_size(block_contents.size());
|
|
|
|
assert(r->status.ok());
|
Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
5 years ago
|
|
|
assert(r->io_status.ok());
|
|
|
|
r->io_status = r->file->Append(block_contents);
|
|
|
|
if (r->io_status.ok()) {
|
|
|
|
char trailer[kBlockTrailerSize];
|
|
|
|
trailer[0] = type;
|
|
|
|
char* trailer_without_type = trailer + 1;
|
|
|
|
switch (r->table_options.checksum) {
|
|
|
|
case kNoChecksum:
|
|
|
|
EncodeFixed32(trailer_without_type, 0);
|
|
|
|
break;
|
|
|
|
case kCRC32c: {
|
|
|
|
auto crc = crc32c::Value(block_contents.data(), block_contents.size());
|
|
|
|
crc = crc32c::Extend(crc, trailer, 1); // Extend to cover block type
|
|
|
|
EncodeFixed32(trailer_without_type, crc32c::Mask(crc));
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
case kxxHash: {
|
|
|
|
XXH32_state_t* const state = XXH32_createState();
|
|
|
|
XXH32_reset(state, 0);
|
|
|
|
XXH32_update(state, block_contents.data(),
|
|
|
|
static_cast<uint32_t>(block_contents.size()));
|
|
|
|
XXH32_update(state, trailer, 1); // Extend to cover block type
|
|
|
|
EncodeFixed32(trailer_without_type, XXH32_digest(state));
|
|
|
|
XXH32_freeState(state);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
case kxxHash64: {
|
|
|
|
XXH64_state_t* const state = XXH64_createState();
|
|
|
|
XXH64_reset(state, 0);
|
|
|
|
XXH64_update(state, block_contents.data(),
|
|
|
|
static_cast<uint32_t>(block_contents.size()));
|
|
|
|
XXH64_update(state, trailer, 1); // Extend to cover block type
|
|
|
|
EncodeFixed32(
|
|
|
|
trailer_without_type,
|
|
|
|
static_cast<uint32_t>(XXH64_digest(state) & // lower 32 bits
|
|
|
|
uint64_t{0xffffffff}));
|
|
|
|
XXH64_freeState(state);
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
5 years ago
|
|
|
assert(r->io_status.ok());
|
|
|
|
TEST_SYNC_POINT_CALLBACK(
|
|
|
|
"BlockBasedTableBuilder::WriteRawBlock:TamperWithChecksum",
|
|
|
|
static_cast<char*>(trailer));
|
Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
5 years ago
|
|
|
r->io_status = r->file->Append(Slice(trailer, kBlockTrailerSize));
|
|
|
|
if (r->io_status.ok()) {
|
|
|
|
r->status = InsertBlockInCache(block_contents, type, handle);
|
|
|
|
}
|
Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
5 years ago
|
|
|
if (r->status.ok() && r->io_status.ok()) {
|
|
|
|
r->offset += block_contents.size() + kBlockTrailerSize;
|
|
|
|
if (r->table_options.block_align && is_data_block) {
|
|
|
|
size_t pad_bytes =
|
|
|
|
(r->alignment - ((block_contents.size() + kBlockTrailerSize) &
|
|
|
|
(r->alignment - 1))) &
|
|
|
|
(r->alignment - 1);
|
Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
5 years ago
|
|
|
r->io_status = r->file->Pad(pad_bytes);
|
|
|
|
if (r->io_status.ok()) {
|
|
|
|
r->offset += pad_bytes;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
5 years ago
|
|
|
r->status = r->io_status;
|
|
|
|
}
|
|
|
|
|
|
|
|
Status BlockBasedTableBuilder::status() const { return rep_->status; }
|
|
|
|
|
Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
5 years ago
|
|
|
IOStatus BlockBasedTableBuilder::io_status() const { return rep_->io_status; }
|
|
|
|
|
|
|
|
//
|
|
|
|
// Make a copy of the block contents and insert into compressed block cache
|
|
|
|
//
|
|
|
|
Status BlockBasedTableBuilder::InsertBlockInCache(const Slice& block_contents,
|
|
|
|
const CompressionType type,
|
|
|
|
const BlockHandle* handle) {
|
|
|
|
Rep* r = rep_;
|
|
|
|
Cache* block_cache_compressed = r->table_options.block_cache_compressed.get();
|
|
|
|
|
|
|
|
if (type != kNoCompression && block_cache_compressed != nullptr) {
|
|
|
|
size_t size = block_contents.size();
|
|
|
|
|
|
|
|
auto ubuf =
|
|
|
|
AllocateBlock(size + 1, block_cache_compressed->memory_allocator());
|
|
|
|
memcpy(ubuf.get(), block_contents.data(), size);
|
|
|
|
ubuf[size] = type;
|
|
|
|
|
|
|
|
BlockContents* block_contents_to_cache =
|
|
|
|
new BlockContents(std::move(ubuf), size);
|
|
|
|
#ifndef NDEBUG
|
|
|
|
block_contents_to_cache->is_raw_block = true;
|
|
|
|
#endif // NDEBUG
|
|
|
|
|
|
|
|
// make cache key by appending the file offset to the cache prefix id
|
|
|
|
char* end = EncodeVarint64(
|
|
|
|
r->compressed_cache_key_prefix + r->compressed_cache_key_prefix_size,
|
|
|
|
handle->offset());
|
|
|
|
Slice key(r->compressed_cache_key_prefix,
|
|
|
|
static_cast<size_t>(end - r->compressed_cache_key_prefix));
|
|
|
|
|
|
|
|
// Insert into compressed block cache.
|
|
|
|
block_cache_compressed->Insert(
|
|
|
|
key, block_contents_to_cache,
|
|
|
|
block_contents_to_cache->ApproximateMemoryUsage(),
|
|
|
|
SimpleDeleter<BlockContents>::GetInstance());
|
|
|
|
|
|
|
|
// Invalidate OS cache.
|
|
|
|
r->file->InvalidateCache(static_cast<size_t>(r->offset), size);
|
|
|
|
}
|
|
|
|
return Status::OK();
|
|
|
|
}
|
|
|
|
|
|
|
|
void BlockBasedTableBuilder::WriteFilterBlock(
|
|
|
|
MetaIndexBuilder* meta_index_builder) {
|
|
|
|
BlockHandle filter_block_handle;
|
|
|
|
bool empty_filter_block = (rep_->filter_builder == nullptr ||
|
|
|
|
rep_->filter_builder->NumAdded() == 0);
|
|
|
|
if (ok() && !empty_filter_block) {
|
|
|
|
Status s = Status::Incomplete();
|
|
|
|
while (ok() && s.IsIncomplete()) {
|
|
|
|
Slice filter_content =
|
|
|
|
rep_->filter_builder->Finish(filter_block_handle, &s);
|
|
|
|
assert(s.ok() || s.IsIncomplete());
|
|
|
|
rep_->props.filter_size += filter_content.size();
|
|
|
|
WriteRawBlock(filter_content, kNoCompression, &filter_block_handle);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (ok() && !empty_filter_block) {
|
|
|
|
// Add mapping from "<filter_block_prefix>.Name" to location
|
|
|
|
// of filter data.
|
|
|
|
std::string key;
|
|
|
|
if (rep_->filter_builder->IsBlockBased()) {
|
|
|
|
key = BlockBasedTable::kFilterBlockPrefix;
|
|
|
|
} else {
|
|
|
|
key = rep_->table_options.partition_filters
|
|
|
|
? BlockBasedTable::kPartitionedFilterBlockPrefix
|
|
|
|
: BlockBasedTable::kFullFilterBlockPrefix;
|
|
|
|
}
|
|
|
|
key.append(rep_->table_options.filter_policy->Name());
|
|
|
|
meta_index_builder->Add(key, filter_block_handle);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
void BlockBasedTableBuilder::WriteIndexBlock(
|
|
|
|
MetaIndexBuilder* meta_index_builder, BlockHandle* index_block_handle) {
|
|
|
|
IndexBuilder::IndexBlocks index_blocks;
|
|
|
|
auto index_builder_status = rep_->index_builder->Finish(&index_blocks);
|
|
|
|
if (index_builder_status.IsIncomplete()) {
|
|
|
|
// We we have more than one index partition then meta_blocks are not
|
|
|
|
// supported for the index. Currently meta_blocks are used only by
|
|
|
|
// HashIndexBuilder which is not multi-partition.
|
|
|
|
assert(index_blocks.meta_blocks.empty());
|
|
|
|
} else if (ok() && !index_builder_status.ok()) {
|
|
|
|
rep_->status = index_builder_status;
|
|
|
|
}
|
|
|
|
if (ok()) {
|
|
|
|
for (const auto& item : index_blocks.meta_blocks) {
|
|
|
|
BlockHandle block_handle;
|
|
|
|
WriteBlock(item.second, &block_handle, false /* is_data_block */);
|
|
|
|
if (!ok()) {
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
meta_index_builder->Add(item.first, block_handle);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
if (ok()) {
|
|
|
|
if (rep_->table_options.enable_index_compression) {
|
|
|
|
WriteBlock(index_blocks.index_block_contents, index_block_handle, false);
|
|
|
|
} else {
|
|
|
|
WriteRawBlock(index_blocks.index_block_contents, kNoCompression,
|
|
|
|
index_block_handle);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
// If there are more index partitions, finish them and write them out
|
|
|
|
Status s = index_builder_status;
|
|
|
|
while (ok() && s.IsIncomplete()) {
|
|
|
|
s = rep_->index_builder->Finish(&index_blocks, *index_block_handle);
|
|
|
|
if (!s.ok() && !s.IsIncomplete()) {
|
|
|
|
rep_->status = s;
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
if (rep_->table_options.enable_index_compression) {
|
|
|
|
WriteBlock(index_blocks.index_block_contents, index_block_handle, false);
|
|
|
|
} else {
|
|
|
|
WriteRawBlock(index_blocks.index_block_contents, kNoCompression,
|
|
|
|
index_block_handle);
|
|
|
|
}
|
|
|
|
// The last index_block_handle will be for the partition index block
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
void BlockBasedTableBuilder::WritePropertiesBlock(
|
|
|
|
MetaIndexBuilder* meta_index_builder) {
|
|
|
|
BlockHandle properties_block_handle;
|
|
|
|
if (ok()) {
|
|
|
|
PropertyBlockBuilder property_block_builder;
|
|
|
|
rep_->props.column_family_id = rep_->column_family_id;
|
|
|
|
rep_->props.column_family_name = rep_->column_family_name;
|
|
|
|
rep_->props.filter_policy_name =
|
|
|
|
rep_->table_options.filter_policy != nullptr
|
|
|
|
? rep_->table_options.filter_policy->Name()
|
|
|
|
: "";
|
|
|
|
rep_->props.index_size =
|
|
|
|
rep_->index_builder->IndexSize() + kBlockTrailerSize;
|
|
|
|
rep_->props.comparator_name = rep_->ioptions.user_comparator != nullptr
|
|
|
|
? rep_->ioptions.user_comparator->Name()
|
|
|
|
: "nullptr";
|
|
|
|
rep_->props.merge_operator_name =
|
|
|
|
rep_->ioptions.merge_operator != nullptr
|
|
|
|
? rep_->ioptions.merge_operator->Name()
|
|
|
|
: "nullptr";
|
|
|
|
rep_->props.compression_name =
|
|
|
|
CompressionTypeToString(rep_->compression_type);
|
|
|
|
rep_->props.compression_options =
|
|
|
|
CompressionOptionsToString(rep_->compression_opts);
|
|
|
|
rep_->props.prefix_extractor_name =
|
|
|
|
rep_->moptions.prefix_extractor != nullptr
|
|
|
|
? rep_->moptions.prefix_extractor->Name()
|
|
|
|
: "nullptr";
|
|
|
|
|
|
|
|
std::string property_collectors_names = "[";
|
|
|
|
for (size_t i = 0;
|
|
|
|
i < rep_->ioptions.table_properties_collector_factories.size(); ++i) {
|
|
|
|
if (i != 0) {
|
|
|
|
property_collectors_names += ",";
|
Shared dictionary compression using reference block
Summary:
This adds a new metablock containing a shared dictionary that is used
to compress all data blocks in the SST file. The size of the shared dictionary
is configurable in CompressionOptions and defaults to 0. It's currently only
used for zlib/lz4/lz4hc, but the block will be stored in the SST regardless of
the compression type if the user chooses a nonzero dictionary size.
During compaction, computes the dictionary by randomly sampling the first
output file in each subcompaction. It pre-computes the intervals to sample
by assuming the output file will have the maximum allowable length. In case
the file is smaller, some of the pre-computed sampling intervals can be beyond
end-of-file, in which case we skip over those samples and the dictionary will
be a bit smaller. After the dictionary is generated using the first file in a
subcompaction, it is loaded into the compression library before writing each
block in each subsequent file of that subcompaction.
On the read path, gets the dictionary from the metablock, if it exists. Then,
loads that dictionary into the compression library before reading each block.
Test Plan: new unit test
Reviewers: yhchiang, IslamAbdelRahman, cyan, sdong
Reviewed By: sdong
Subscribers: andrewkr, yoshinorim, kradhakrishnan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D52287
9 years ago
|
|
|
}
|
|
|
|
property_collectors_names +=
|
|
|
|
rep_->ioptions.table_properties_collector_factories[i]->Name();
|
|
|
|
}
|
|
|
|
property_collectors_names += "]";
|
|
|
|
rep_->props.property_collectors_names = property_collectors_names;
|
|
|
|
if (rep_->table_options.index_type ==
|
|
|
|
BlockBasedTableOptions::kTwoLevelIndexSearch) {
|
|
|
|
assert(rep_->p_index_builder_ != nullptr);
|
|
|
|
rep_->props.index_partitions = rep_->p_index_builder_->NumPartitions();
|
|
|
|
rep_->props.top_level_index_size =
|
|
|
|
rep_->p_index_builder_->TopLevelIndexSize(rep_->offset);
|
|
|
|
}
|
|
|
|
rep_->props.index_key_is_user_key =
|
|
|
|
!rep_->index_builder->seperator_is_key_plus_seq();
|
|
|
|
rep_->props.index_value_is_delta_encoded =
|
|
|
|
rep_->use_delta_encoding_for_index_values;
|
|
|
|
rep_->props.creation_time = rep_->creation_time;
|
|
|
|
rep_->props.oldest_key_time = rep_->oldest_key_time;
|
Periodic Compactions (#5166)
Summary:
Introducing Periodic Compactions.
This feature allows all the files in a CF to be periodically compacted. It could help in catching any corruptions that could creep into the DB proactively as every file is constantly getting re-compacted. And also, of course, it helps to cleanup data older than certain threshold.
- Introduced a new option `periodic_compaction_time` to control how long a file can live without being compacted in a CF.
- This works across all levels.
- The files are put in the same level after going through the compaction. (Related files in the same level are picked up as `ExpandInputstoCleanCut` is used).
- Compaction filters, if any, are invoked as usual.
- A new table property, `file_creation_time`, is introduced to implement this feature. This property is set to the time at which the SST file was created (and that time is given by the underlying Env/OS).
This feature can be enabled on its own, or in conjunction with `ttl`. It is possible to set a different time threshold for the bottom level when used in conjunction with ttl. Since `ttl` works only on 0 to last but one levels, you could set `ttl` to, say, 1 day, and `periodic_compaction_time` to, say, 7 days. Since `ttl < periodic_compaction_time` all files in last but one levels keep getting picked up based on ttl, and almost never based on periodic_compaction_time. The files in the bottom level get picked up for compaction based on `periodic_compaction_time`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5166
Differential Revision: D14884441
Pulled By: sagar0
fbshipit-source-id: 408426cbacb409c06386a98632dcf90bfa1bda47
6 years ago
|
|
|
rep_->props.file_creation_time = rep_->file_creation_time;
|
|
|
|
|
|
|
|
// Add basic properties
|
|
|
|
property_block_builder.AddTableProperty(rep_->props);
|
|
|
|
|
|
|
|
// Add use collected properties
|
|
|
|
NotifyCollectTableCollectorsOnFinish(rep_->table_properties_collectors,
|
|
|
|
rep_->ioptions.info_log,
|
|
|
|
&property_block_builder);
|
|
|
|
|
|
|
|
WriteRawBlock(property_block_builder.Finish(), kNoCompression,
|
|
|
|
&properties_block_handle);
|
|
|
|
}
|
|
|
|
if (ok()) {
|
|
|
|
#ifndef NDEBUG
|
|
|
|
{
|
|
|
|
uint64_t props_block_offset = properties_block_handle.offset();
|
|
|
|
uint64_t props_block_size = properties_block_handle.size();
|
|
|
|
TEST_SYNC_POINT_CALLBACK(
|
|
|
|
"BlockBasedTableBuilder::WritePropertiesBlock:GetPropsBlockOffset",
|
|
|
|
&props_block_offset);
|
|
|
|
TEST_SYNC_POINT_CALLBACK(
|
|
|
|
"BlockBasedTableBuilder::WritePropertiesBlock:GetPropsBlockSize",
|
|
|
|
&props_block_size);
|
|
|
|
}
|
|
|
|
#endif // !NDEBUG
|
|
|
|
meta_index_builder->Add(kPropertiesBlock, properties_block_handle);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
void BlockBasedTableBuilder::WriteCompressionDictBlock(
|
|
|
|
MetaIndexBuilder* meta_index_builder) {
|
Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
6 years ago
|
|
|
if (rep_->compression_dict != nullptr &&
|
|
|
|
rep_->compression_dict->GetRawDict().size()) {
|
|
|
|
BlockHandle compression_dict_block_handle;
|
|
|
|
if (ok()) {
|
Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
6 years ago
|
|
|
WriteRawBlock(rep_->compression_dict->GetRawDict(), kNoCompression,
|
|
|
|
&compression_dict_block_handle);
|
Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
6 years ago
|
|
|
#ifndef NDEBUG
|
|
|
|
Slice compression_dict = rep_->compression_dict->GetRawDict();
|
|
|
|
TEST_SYNC_POINT_CALLBACK(
|
|
|
|
"BlockBasedTableBuilder::WriteCompressionDictBlock:RawDict",
|
|
|
|
&compression_dict);
|
|
|
|
#endif // NDEBUG
|
|
|
|
}
|
|
|
|
if (ok()) {
|
|
|
|
meta_index_builder->Add(kCompressionDictBlock,
|
|
|
|
compression_dict_block_handle);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
void BlockBasedTableBuilder::WriteRangeDelBlock(
|
|
|
|
MetaIndexBuilder* meta_index_builder) {
|
|
|
|
if (ok() && !rep_->range_del_block.empty()) {
|
|
|
|
BlockHandle range_del_block_handle;
|
|
|
|
WriteRawBlock(rep_->range_del_block.Finish(), kNoCompression,
|
|
|
|
&range_del_block_handle);
|
|
|
|
meta_index_builder->Add(kRangeDelBlock, range_del_block_handle);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
void BlockBasedTableBuilder::WriteFooter(BlockHandle& metaindex_block_handle,
|
|
|
|
BlockHandle& index_block_handle) {
|
|
|
|
Rep* r = rep_;
|
|
|
|
// No need to write out new footer if we're using default checksum.
|
|
|
|
// We're writing legacy magic number because we want old versions of RocksDB
|
|
|
|
// be able to read files generated with new release (just in case if
|
|
|
|
// somebody wants to roll back after an upgrade)
|
|
|
|
// TODO(icanadi) at some point in the future, when we're absolutely sure
|
|
|
|
// nobody will roll back to RocksDB 2.x versions, retire the legacy magic
|
|
|
|
// number and always write new table files with new magic number
|
|
|
|
bool legacy = (r->table_options.format_version == 0);
|
|
|
|
// this is guaranteed by BlockBasedTableBuilder's constructor
|
|
|
|
assert(r->table_options.checksum == kCRC32c ||
|
|
|
|
r->table_options.format_version != 0);
|
|
|
|
Footer footer(
|
|
|
|
legacy ? kLegacyBlockBasedTableMagicNumber : kBlockBasedTableMagicNumber,
|
|
|
|
r->table_options.format_version);
|
|
|
|
footer.set_metaindex_handle(metaindex_block_handle);
|
|
|
|
footer.set_index_handle(index_block_handle);
|
|
|
|
footer.set_checksum(r->table_options.checksum);
|
|
|
|
std::string footer_encoding;
|
|
|
|
footer.EncodeTo(&footer_encoding);
|
|
|
|
assert(r->status.ok());
|
Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
5 years ago
|
|
|
assert(r->io_status.ok());
|
|
|
|
r->io_status = r->file->Append(footer_encoding);
|
|
|
|
if (r->io_status.ok()) {
|
|
|
|
r->offset += footer_encoding.size();
|
|
|
|
}
|
Pass IOStatus to write path and set retryable IO Error as hard error in BG jobs (#6487)
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
5 years ago
|
|
|
r->status = r->io_status;
|
|
|
|
}
|
|
|
|
|
Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
6 years ago
|
|
|
void BlockBasedTableBuilder::EnterUnbuffered() {
|
|
|
|
Rep* r = rep_;
|
|
|
|
assert(r->state == Rep::State::kBuffered);
|
|
|
|
r->state = Rep::State::kUnbuffered;
|
|
|
|
const size_t kSampleBytes = r->compression_opts.zstd_max_train_bytes > 0
|
|
|
|
? r->compression_opts.zstd_max_train_bytes
|
|
|
|
: r->compression_opts.max_dict_bytes;
|
|
|
|
Random64 generator{r->creation_time};
|
|
|
|
std::string compression_dict_samples;
|
|
|
|
std::vector<size_t> compression_dict_sample_lens;
|
|
|
|
if (!r->data_block_and_keys_buffers.empty()) {
|
|
|
|
while (compression_dict_samples.size() < kSampleBytes) {
|
|
|
|
size_t rand_idx =
|
|
|
|
static_cast<size_t>(
|
|
|
|
generator.Uniform(r->data_block_and_keys_buffers.size()));
|
Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
6 years ago
|
|
|
size_t copy_len =
|
|
|
|
std::min(kSampleBytes - compression_dict_samples.size(),
|
|
|
|
r->data_block_and_keys_buffers[rand_idx].first.size());
|
|
|
|
compression_dict_samples.append(
|
|
|
|
r->data_block_and_keys_buffers[rand_idx].first, 0, copy_len);
|
|
|
|
compression_dict_sample_lens.emplace_back(copy_len);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
// final data block flushed, now we can generate dictionary from the samples.
|
|
|
|
// OK if compression_dict_samples is empty, we'll just get empty dictionary.
|
|
|
|
std::string dict;
|
|
|
|
if (r->compression_opts.zstd_max_train_bytes > 0) {
|
|
|
|
dict = ZSTD_TrainDictionary(compression_dict_samples,
|
|
|
|
compression_dict_sample_lens,
|
|
|
|
r->compression_opts.max_dict_bytes);
|
|
|
|
} else {
|
|
|
|
dict = std::move(compression_dict_samples);
|
|
|
|
}
|
|
|
|
r->compression_dict.reset(new CompressionDict(dict, r->compression_type,
|
|
|
|
r->compression_opts.level));
|
|
|
|
r->verify_dict.reset(new UncompressionDict(
|
|
|
|
dict, r->compression_type == kZSTD ||
|
|
|
|
r->compression_type == kZSTDNotFinalCompression));
|
|
|
|
|
|
|
|
for (size_t i = 0; ok() && i < r->data_block_and_keys_buffers.size(); ++i) {
|
|
|
|
const auto& data_block = r->data_block_and_keys_buffers[i].first;
|
|
|
|
auto& keys = r->data_block_and_keys_buffers[i].second;
|
|
|
|
assert(!data_block.empty());
|
|
|
|
assert(!keys.empty());
|
|
|
|
|
|
|
|
for (const auto& key : keys) {
|
|
|
|
if (r->filter_builder != nullptr) {
|
|
|
|
size_t ts_sz =
|
|
|
|
r->internal_comparator.user_comparator()->timestamp_size();
|
Fix wrong ExtractUserKey usage in BlockBasedTableBuilder::EnterUnbuff… (#6100)
Summary:
BlockBasedTableBuilder uses ExtractUserKey in EnterUnbuffered. This would
cause index filter building error, since user-provided timestamp is supported
by ExtractUserKeyAndStripTimestamp, and it's used in Add. This commit changes
ExtractUserKey to ExtractUserKeyAndStripTimestamp.
A test case is also added by modifying DBBasicTestWithTimestampWithParam_
PutAndGet test in db_basic_test to cover ExtractUserKeyAndStripTimestamp usage
in both kBuffered and kUnbuffered state of BlockBasedTableBuilder.
Before the ExtractUserKeyAndStripTimstamp fix:
```
$ ./db_basic_test --gtest_filter="*PutAndGet*"
Note: Google Test filter = *PutAndGet*
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from Timestamp/DBBasicTestWithTimestampWithParam
[ RUN ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/0
db/db_basic_test.cc:2109: Failure
db_->Get(ropts, cfh, "key" + std::to_string(j), &value)
NotFound:
db/db_basic_test.cc:2109: Failure
db_->Get(ropts, cfh, "key" + std::to_string(j), &value)
NotFound:
db/db_basic_test.cc:2109: Failure
db_->Get(ropts, cfh, "key" + std::to_string(j), &value)
NotFound:
db/db_basic_test.cc:2109: Failure
db_->Get(ropts, cfh, "key" + std::to_string(j), &value)
NotFound:
db/db_basic_test.cc:2109: Failure
db_->Get(ropts, cfh, "key" + std::to_string(j), &value)
NotFound:
[ FAILED ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/0, where GetParam() = false (1177 ms)
[ RUN ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/1
[ OK ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/1 (1056 ms)
[----------] 2 tests from Timestamp/DBBasicTestWithTimestampWithParam (2233 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (2233 ms total)
[ PASSED ] 1 test.
[ FAILED ] 1 test, listed below:
[ FAILED ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/0, where GetParam() = false
1 FAILED TEST
```
After the ExtractUserKeyAndStripTimstamp fix:
```
$ ./db_basic_test --gtest_filter="*PutAndGet*"
Note: Google Test filter = *PutAndGet*
[==========] Running 2 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 2 tests from Timestamp/DBBasicTestWithTimestampWithParam
[ RUN ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/0
[ OK ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/0 (1417 ms)
[ RUN ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/1
[ OK ] Timestamp/DBBasicTestWithTimestampWithParam.PutAndGet/1 (1041 ms)
[----------] 2 tests from Timestamp/DBBasicTestWithTimestampWithParam (2458 ms total)
[----------] Global test environment tear-down
[==========] 2 tests from 1 test case ran. (2458 ms total)
[ PASSED ] 2 tests.
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6100
Differential Revision: D18769654
Pulled By: riversand963
fbshipit-source-id: 76c2cf2c9a5e0d85db95d98e812e6af0c2a15c6b
5 years ago
|
|
|
r->filter_builder->Add(ExtractUserKeyAndStripTimestamp(key, ts_sz));
|
Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
6 years ago
|
|
|
}
|
|
|
|
r->index_builder->OnKeyAdded(key);
|
|
|
|
}
|
|
|
|
WriteBlock(Slice(data_block), &r->pending_handle, true /* is_data_block */);
|
|
|
|
if (ok() && i + 1 < r->data_block_and_keys_buffers.size()) {
|
|
|
|
Slice first_key_in_next_block =
|
|
|
|
r->data_block_and_keys_buffers[i + 1].second.front();
|
|
|
|
Slice* first_key_in_next_block_ptr = &first_key_in_next_block;
|
|
|
|
r->index_builder->AddIndexEntry(&keys.back(), first_key_in_next_block_ptr,
|
|
|
|
r->pending_handle);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
r->data_block_and_keys_buffers.clear();
|
|
|
|
}
|
|
|
|
|
|
|
|
Status BlockBasedTableBuilder::Finish() {
|
|
|
|
Rep* r = rep_;
|
Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
6 years ago
|
|
|
assert(r->state != Rep::State::kClosed);
|
|
|
|
bool empty_data_block = r->data_block.empty();
|
|
|
|
Flush();
|
Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
6 years ago
|
|
|
if (r->state == Rep::State::kBuffered) {
|
|
|
|
EnterUnbuffered();
|
|
|
|
}
|
|
|
|
// To make sure properties block is able to keep the accurate size of index
|
|
|
|
// block, we will finish writing all index entries first.
|
|
|
|
if (ok() && !empty_data_block) {
|
|
|
|
r->index_builder->AddIndexEntry(
|
|
|
|
&r->last_key, nullptr /* no next data block */, r->pending_handle);
|
|
|
|
}
|
|
|
|
|
|
|
|
// Write meta blocks, metaindex block and footer in the following order.
|
|
|
|
// 1. [meta block: filter]
|
|
|
|
// 2. [meta block: index]
|
|
|
|
// 3. [meta block: compression dictionary]
|
|
|
|
// 4. [meta block: range deletion tombstone]
|
|
|
|
// 5. [meta block: properties]
|
|
|
|
// 6. [metaindex block]
|
|
|
|
// 7. Footer
|
|
|
|
BlockHandle metaindex_block_handle, index_block_handle;
|
|
|
|
MetaIndexBuilder meta_index_builder;
|
|
|
|
WriteFilterBlock(&meta_index_builder);
|
|
|
|
WriteIndexBlock(&meta_index_builder, &index_block_handle);
|
|
|
|
WriteCompressionDictBlock(&meta_index_builder);
|
|
|
|
WriteRangeDelBlock(&meta_index_builder);
|
|
|
|
WritePropertiesBlock(&meta_index_builder);
|
|
|
|
if (ok()) {
|
|
|
|
// flush the meta index block
|
|
|
|
WriteRawBlock(meta_index_builder.Finish(), kNoCompression,
|
|
|
|
&metaindex_block_handle);
|
|
|
|
}
|
|
|
|
if (ok()) {
|
|
|
|
WriteFooter(metaindex_block_handle, index_block_handle);
|
|
|
|
}
|
|
|
|
if (r->file != nullptr) {
|
|
|
|
file_checksum_ = r->file->GetFileChecksum();
|
|
|
|
}
|
Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
6 years ago
|
|
|
r->state = Rep::State::kClosed;
|
|
|
|
return r->status;
|
|
|
|
}
|
|
|
|
|
|
|
|
void BlockBasedTableBuilder::Abandon() {
|
Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
6 years ago
|
|
|
assert(rep_->state != Rep::State::kClosed);
|
|
|
|
rep_->state = Rep::State::kClosed;
|
|
|
|
}
|
|
|
|
|
|
|
|
uint64_t BlockBasedTableBuilder::NumEntries() const {
|
|
|
|
return rep_->props.num_entries;
|
|
|
|
}
|
|
|
|
|
Reduce scope of compression dictionary to single SST (#4952)
Summary:
Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio.
So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include:
- The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called.
- After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up.
- Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952
Differential Revision: D13967980
Pulled By: ajkr
fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f
6 years ago
|
|
|
uint64_t BlockBasedTableBuilder::FileSize() const { return rep_->offset; }
|
|
|
|
|
|
|
|
bool BlockBasedTableBuilder::NeedCompact() const {
|
|
|
|
for (const auto& collector : rep_->table_properties_collectors) {
|
|
|
|
if (collector->NeedCompact()) {
|
|
|
|
return true;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
return false;
|
|
|
|
}
|
|
|
|
|
Add more table properties to EventLogger
Summary:
Example output:
{"time_micros": 1431463794310521, "job": 353, "event": "table_file_creation", "file_number": 387, "file_size": 86937, "table_info": {"data_size": "81801", "index_size": "9751", "filter_size": "0", "raw_key_size": "23448", "raw_average_key_size": "24.000000", "raw_value_size": "990571", "raw_average_value_size": "1013.890481", "num_data_blocks": "245", "num_entries": "977", "filter_policy_name": "", "kDeletedKeys": "0"}}
Also fixed a bug where BuildTable() in recovery was passing Env::IOHigh argument into paranoid_checks_file parameter.
Test Plan: make check + check out the output in the log
Reviewers: sdong, rven, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38343
10 years ago
|
|
|
TableProperties BlockBasedTableBuilder::GetTableProperties() const {
|
|
|
|
TableProperties ret = rep_->props;
|
|
|
|
for (const auto& collector : rep_->table_properties_collectors) {
|
|
|
|
for (const auto& prop : collector->GetReadableProperties()) {
|
|
|
|
ret.readable_properties.insert(prop);
|
Add more table properties to EventLogger
Summary:
Example output:
{"time_micros": 1431463794310521, "job": 353, "event": "table_file_creation", "file_number": 387, "file_size": 86937, "table_info": {"data_size": "81801", "index_size": "9751", "filter_size": "0", "raw_key_size": "23448", "raw_average_key_size": "24.000000", "raw_value_size": "990571", "raw_average_value_size": "1013.890481", "num_data_blocks": "245", "num_entries": "977", "filter_policy_name": "", "kDeletedKeys": "0"}}
Also fixed a bug where BuildTable() in recovery was passing Env::IOHigh argument into paranoid_checks_file parameter.
Test Plan: make check + check out the output in the log
Reviewers: sdong, rven, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38343
10 years ago
|
|
|
}
|
|
|
|
collector->Finish(&ret.user_collected_properties);
|
Add more table properties to EventLogger
Summary:
Example output:
{"time_micros": 1431463794310521, "job": 353, "event": "table_file_creation", "file_number": 387, "file_size": 86937, "table_info": {"data_size": "81801", "index_size": "9751", "filter_size": "0", "raw_key_size": "23448", "raw_average_key_size": "24.000000", "raw_value_size": "990571", "raw_average_value_size": "1013.890481", "num_data_blocks": "245", "num_entries": "977", "filter_policy_name": "", "kDeletedKeys": "0"}}
Also fixed a bug where BuildTable() in recovery was passing Env::IOHigh argument into paranoid_checks_file parameter.
Test Plan: make check + check out the output in the log
Reviewers: sdong, rven, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D38343
10 years ago
|
|
|
}
|
|
|
|
return ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
const char* BlockBasedTableBuilder::GetFileChecksumFuncName() const {
|
|
|
|
if (rep_->file != nullptr) {
|
|
|
|
return rep_->file->GetFileChecksumFuncName();
|
|
|
|
} else {
|
|
|
|
return kUnknownFileChecksumFuncName.c_str();
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
const std::string BlockBasedTable::kFilterBlockPrefix = "filter.";
|
|
|
|
const std::string BlockBasedTable::kFullFilterBlockPrefix = "fullfilter.";
|
|
|
|
const std::string BlockBasedTable::kPartitionedFilterBlockPrefix =
|
|
|
|
"partitionedfilter.";
|
|
|
|
} // namespace ROCKSDB_NAMESPACE
|