|
|
|
// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
|
|
|
|
// This source code is licensed under both the GPLv2 (found in the
|
|
|
|
// COPYING file in the root directory) and Apache 2.0 License
|
|
|
|
// (found in the LICENSE.Apache file in the root directory).
|
|
|
|
//
|
|
|
|
// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
|
|
|
|
// Use of this source code is governed by a BSD-style license that can be
|
|
|
|
// found in the LICENSE file. See the AUTHORS file for names of contributors.
|
|
|
|
//
|
|
|
|
// Thread-safe (provides internal synchronization)
|
|
|
|
|
|
|
|
#pragma once
|
|
|
|
#include <string>
|
|
|
|
#include <vector>
|
|
|
|
#include <stdint.h>
|
|
|
|
|
|
|
|
#include "db/dbformat.h"
|
Compaction Support for Range Deletion
Summary:
This diff introduces RangeDelAggregator, which takes ownership of iterators
provided to it via AddTombstones(). The tombstones are organized in a two-level
map (snapshot stripe -> begin key -> tombstone). Tombstone creation avoids data
copy by holding Slices returned by the iterator, which remain valid thanks to pinning.
For compaction, we create a hierarchical range tombstone iterator with structure
matching the iterator over compaction input data. An aggregator based on that
iterator is used by CompactionIterator to determine which keys are covered by
range tombstones. In case of merge operand, the same aggregator is used by
MergeHelper. Upon finishing each file in the compaction, relevant range tombstones
are added to the output file's range tombstone metablock and file boundaries are
updated accordingly.
To check whether a key is covered by range tombstone, RangeDelAggregator::ShouldDelete()
considers tombstones in the key's snapshot stripe. When this function is used outside of
compaction, it also checks newer stripes, which can contain covering tombstones. Currently
the intra-stripe check involves a linear scan; however, in the future we plan to collapse ranges
within a stripe such that binary search can be used.
RangeDelAggregator::AddToBuilder() adds all range tombstones in the table's key-range
to a new table's range tombstone meta-block. Since range tombstones may fall in the gap
between files, we may need to extend some files' key-ranges. The strategy is (1) first file
extends as far left as possible and other files do not extend left, (2) all files extend right
until either the start of the next file or the end of the last range tombstone in the gap,
whichever comes first.
One other notable change is adding release/move semantics to ScopedArenaIterator
such that it can be used to transfer ownership of an arena-allocated iterator, similar to
how unique_ptr is used for malloc'd data.
Depends on D61473
Test Plan: compaction_iterator_test, mock_table, end-to-end tests in D63927
Reviewers: sdong, IslamAbdelRahman, wanning, yhchiang, lightmark
Reviewed By: lightmark
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62205
8 years ago
|
|
|
#include "db/range_del_aggregator.h"
|
|
|
|
#include "options/cf_options.h"
|
|
|
|
#include "port/port.h"
|
|
|
|
#include "rocksdb/cache.h"
|
|
|
|
#include "rocksdb/env.h"
|
|
|
|
#include "rocksdb/options.h"
|
|
|
|
#include "rocksdb/table.h"
|
|
|
|
#include "table/table_reader.h"
|
|
|
|
|
|
|
|
namespace rocksdb {
|
|
|
|
|
|
|
|
class Env;
|
In DB::NewIterator(), try to allocate the whole iterator tree in an arena
Summary:
In this patch, try to allocate the whole iterator tree starting from DBIter from an arena
1. ArenaWrappedDBIter is created when serves as the entry point of an iterator tree, with an arena in it.
2. Add an option to create iterator from arena for following iterators: DBIter, MergingIterator, MemtableIterator, all mem table's iterators, all table reader's iterators and two level iterator.
3. MergeIteratorBuilder is created to incrementally build the tree of internal iterators. It is passed to mem table list and version set and add iterators to it.
Limitations:
(1) Only DB::NewIterator() without tailing uses the arena. Other cases, including readonly DB and compactions are still from malloc
(2) Two level iterator itself is allocated in arena, but not iterators inside it.
Test Plan: make all check
Reviewers: ljin, haobo
Reviewed By: haobo
Subscribers: leveldb, dhruba, yhchiang, igor
Differential Revision: https://reviews.facebook.net/D18513
11 years ago
|
|
|
class Arena;
|
|
|
|
struct FileDescriptor;
|
|
|
|
class GetContext;
|
Measure file read latency histogram per level
Summary: In internal stats, remember read latency histogram, if statistics is enabled. It can be retrieved from DB::GetProperty() with "rocksdb.dbstats" property, if it is enabled.
Test Plan: Manually run db_bench and prints out "rocksdb.dbstats" by hand and make sure it prints out as expected
Reviewers: igor, IslamAbdelRahman, rven, kradhakrishnan, anthony, yhchiang
Reviewed By: yhchiang
Subscribers: MarkCallaghan, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D44193
10 years ago
|
|
|
class HistogramImpl;
|
|
|
|
class InternalIterator;
|
|
|
|
|
|
|
|
class TableCache {
|
|
|
|
public:
|
|
|
|
TableCache(const ImmutableCFOptions& ioptions,
|
|
|
|
const EnvOptions& storage_options, Cache* cache);
|
|
|
|
~TableCache();
|
|
|
|
|
|
|
|
// Return an iterator for the specified file number (the corresponding
|
|
|
|
// file length must be exactly "file_size" bytes). If "tableptr" is
|
|
|
|
// non-nullptr, also sets "*tableptr" to point to the Table object
|
|
|
|
// underlying the returned iterator, or nullptr if no Table object underlies
|
|
|
|
// the returned iterator. The returned "*tableptr" object is owned by
|
|
|
|
// the cache and should not be deleted, and is valid for as long as the
|
|
|
|
// returned iterator is live.
|
|
|
|
// @param range_del_agg If non-nullptr, adds range deletions to the
|
|
|
|
// aggregator. If an error occurs, returns it in a NewErrorInternalIterator
|
Skip bottom-level filter block caching when hit-optimized
Summary:
When Get() or NewIterator() trigger file loads, skip caching the filter block if
(1) optimize_filters_for_hits is set and (2) the file is on the bottommost
level. Also skip checking filters under the same conditions, which means that
for a preloaded file or a file that was trivially-moved to the bottom level, its
filter block will eventually expire from the cache.
- added parameters/instance variables in various places in order to propagate the config ("skip_filters") from version_set to block_based_table_reader
- in BlockBasedTable::Rep, this optimization prevents filter from being loaded when the file is opened simply by setting filter_policy = nullptr
- in BlockBasedTable::Get/BlockBasedTable::NewIterator, this optimization prevents filter from being used (even if it was loaded already) by setting filter = nullptr
Test Plan:
updated unit test:
$ ./db_test --gtest_filter=DBTest.OptimizeFiltersForHits
will also run 'make check'
Reviewers: sdong, igor, paultuckfield, anthony, rven, kradhakrishnan, IslamAbdelRahman, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D51633
9 years ago
|
|
|
// @param skip_filters Disables loading/accessing the filter block
|
Adding pin_l0_filter_and_index_blocks_in_cache feature and related fixes.
Summary:
When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache.
What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention.
Test Plan:
'export TEST_TMPDIR=/dev/shm/ && DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32' is OK.
I didn't run the Java tests, I don't have Java set up on my devserver.
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56133
9 years ago
|
|
|
// @param level The level this table is at, -1 for "not set / don't know"
|
|
|
|
InternalIterator* NewIterator(
|
|
|
|
const ReadOptions& options, const EnvOptions& toptions,
|
|
|
|
const InternalKeyComparator& internal_comparator,
|
|
|
|
const FileDescriptor& file_fd, RangeDelAggregator* range_del_agg,
|
|
|
|
const SliceTransform* prefix_extractor = nullptr,
|
|
|
|
TableReader** table_reader_ptr = nullptr,
|
|
|
|
HistogramImpl* file_read_hist = nullptr, bool for_compaction = false,
|
|
|
|
Arena* arena = nullptr, bool skip_filters = false, int level = -1);
|
|
|
|
|
|
|
|
InternalIterator* NewRangeTombstoneIterator(
|
|
|
|
const ReadOptions& options, const EnvOptions& toptions,
|
|
|
|
const InternalKeyComparator& internal_comparator,
|
|
|
|
const FileDescriptor& file_fd, HistogramImpl* file_read_hist,
|
|
|
|
bool skip_filters, int level,
|
|
|
|
const SliceTransform* prefix_extractor = nullptr);
|
|
|
|
|
|
|
|
// If a seek to internal key "k" in specified file finds an entry,
|
|
|
|
// call (*handle_result)(arg, found_key, found_value) repeatedly until
|
|
|
|
// it returns false.
|
|
|
|
// @param get_context State for get operation. If its range_del_agg() returns
|
|
|
|
// non-nullptr, adds range deletions to the aggregator. If an error occurs,
|
|
|
|
// returns non-ok status.
|
Skip bottom-level filter block caching when hit-optimized
Summary:
When Get() or NewIterator() trigger file loads, skip caching the filter block if
(1) optimize_filters_for_hits is set and (2) the file is on the bottommost
level. Also skip checking filters under the same conditions, which means that
for a preloaded file or a file that was trivially-moved to the bottom level, its
filter block will eventually expire from the cache.
- added parameters/instance variables in various places in order to propagate the config ("skip_filters") from version_set to block_based_table_reader
- in BlockBasedTable::Rep, this optimization prevents filter from being loaded when the file is opened simply by setting filter_policy = nullptr
- in BlockBasedTable::Get/BlockBasedTable::NewIterator, this optimization prevents filter from being used (even if it was loaded already) by setting filter = nullptr
Test Plan:
updated unit test:
$ ./db_test --gtest_filter=DBTest.OptimizeFiltersForHits
will also run 'make check'
Reviewers: sdong, igor, paultuckfield, anthony, rven, kradhakrishnan, IslamAbdelRahman, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D51633
9 years ago
|
|
|
// @param skip_filters Disables loading/accessing the filter block
|
Adding pin_l0_filter_and_index_blocks_in_cache feature and related fixes.
Summary:
When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache.
What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention.
Test Plan:
'export TEST_TMPDIR=/dev/shm/ && DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32' is OK.
I didn't run the Java tests, I don't have Java set up on my devserver.
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56133
9 years ago
|
|
|
// @param level The level this table is at, -1 for "not set / don't know"
|
|
|
|
Status Get(const ReadOptions& options,
|
|
|
|
const InternalKeyComparator& internal_comparator,
|
|
|
|
const FileDescriptor& file_fd, const Slice& k,
|
|
|
|
GetContext* get_context,
|
|
|
|
const SliceTransform* prefix_extractor = nullptr,
|
|
|
|
HistogramImpl* file_read_hist = nullptr, bool skip_filters = false,
|
|
|
|
int level = -1);
|
|
|
|
|
|
|
|
// Evict any entry for the specified file number
|
[CF] Rethink table cache
Summary:
Adapting table cache to column families is interesting. We want table cache to be global LRU, so if some column families are use not as often as others, we want them to be evicted from cache. However, current TableCache object also constructs tables on its own. If table is not found in the cache, TableCache automatically creates new table. We want each column family to be able to specify different table factory.
To solve the problem, we still have a single LRU, but we provide the LRUCache object to TableCache on construction. We have one TableCache per column family, but the underyling cache is shared by all TableCache objects.
This allows us to have a global LRU, but still be able to support different table factories for different column families. Also, in the future it will also be able to support different directories for different column families.
Test Plan: make check
Reviewers: dhruba, haobo, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15915
11 years ago
|
|
|
static void Evict(Cache* cache, uint64_t file_number);
|
|
|
|
|
Adding pin_l0_filter_and_index_blocks_in_cache feature and related fixes.
Summary:
When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache.
What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention.
Test Plan:
'export TEST_TMPDIR=/dev/shm/ && DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32' is OK.
I didn't run the Java tests, I don't have Java set up on my devserver.
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56133
9 years ago
|
|
|
// Clean table handle and erase it from the table cache
|
|
|
|
// Used in DB close, or the file is not live anymore.
|
|
|
|
void EraseHandle(const FileDescriptor& fd, Cache::Handle* handle);
|
|
|
|
|
|
|
|
// Find table reader
|
Skip bottom-level filter block caching when hit-optimized
Summary:
When Get() or NewIterator() trigger file loads, skip caching the filter block if
(1) optimize_filters_for_hits is set and (2) the file is on the bottommost
level. Also skip checking filters under the same conditions, which means that
for a preloaded file or a file that was trivially-moved to the bottom level, its
filter block will eventually expire from the cache.
- added parameters/instance variables in various places in order to propagate the config ("skip_filters") from version_set to block_based_table_reader
- in BlockBasedTable::Rep, this optimization prevents filter from being loaded when the file is opened simply by setting filter_policy = nullptr
- in BlockBasedTable::Get/BlockBasedTable::NewIterator, this optimization prevents filter from being used (even if it was loaded already) by setting filter = nullptr
Test Plan:
updated unit test:
$ ./db_test --gtest_filter=DBTest.OptimizeFiltersForHits
will also run 'make check'
Reviewers: sdong, igor, paultuckfield, anthony, rven, kradhakrishnan, IslamAbdelRahman, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D51633
9 years ago
|
|
|
// @param skip_filters Disables loading/accessing the filter block
|
Adding pin_l0_filter_and_index_blocks_in_cache feature and related fixes.
Summary:
When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache.
What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention.
Test Plan:
'export TEST_TMPDIR=/dev/shm/ && DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32' is OK.
I didn't run the Java tests, I don't have Java set up on my devserver.
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56133
9 years ago
|
|
|
// @param level == -1 means not specified
|
|
|
|
Status FindTable(const EnvOptions& toptions,
|
|
|
|
const InternalKeyComparator& internal_comparator,
|
|
|
|
const FileDescriptor& file_fd, Cache::Handle**,
|
|
|
|
const SliceTransform* prefix_extractor = nullptr,
|
Measure file read latency histogram per level
Summary: In internal stats, remember read latency histogram, if statistics is enabled. It can be retrieved from DB::GetProperty() with "rocksdb.dbstats" property, if it is enabled.
Test Plan: Manually run db_bench and prints out "rocksdb.dbstats" by hand and make sure it prints out as expected
Reviewers: igor, IslamAbdelRahman, rven, kradhakrishnan, anthony, yhchiang
Reviewed By: yhchiang
Subscribers: MarkCallaghan, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D44193
10 years ago
|
|
|
const bool no_io = false, bool record_read_stats = true,
|
Skip bottom-level filter block caching when hit-optimized
Summary:
When Get() or NewIterator() trigger file loads, skip caching the filter block if
(1) optimize_filters_for_hits is set and (2) the file is on the bottommost
level. Also skip checking filters under the same conditions, which means that
for a preloaded file or a file that was trivially-moved to the bottom level, its
filter block will eventually expire from the cache.
- added parameters/instance variables in various places in order to propagate the config ("skip_filters") from version_set to block_based_table_reader
- in BlockBasedTable::Rep, this optimization prevents filter from being loaded when the file is opened simply by setting filter_policy = nullptr
- in BlockBasedTable::Get/BlockBasedTable::NewIterator, this optimization prevents filter from being used (even if it was loaded already) by setting filter = nullptr
Test Plan:
updated unit test:
$ ./db_test --gtest_filter=DBTest.OptimizeFiltersForHits
will also run 'make check'
Reviewers: sdong, igor, paultuckfield, anthony, rven, kradhakrishnan, IslamAbdelRahman, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D51633
9 years ago
|
|
|
HistogramImpl* file_read_hist = nullptr,
|
|
|
|
bool skip_filters = false, int level = -1,
|
|
|
|
bool prefetch_index_and_filter_in_cache = true);
|
|
|
|
|
|
|
|
// Get TableReader from a cache handle.
|
|
|
|
TableReader* GetTableReaderFromHandle(Cache::Handle* handle);
|
|
|
|
|
|
|
|
// Get the table properties of a given table.
|
|
|
|
// @no_io: indicates if we should load table to the cache if it is not present
|
|
|
|
// in table cache yet.
|
|
|
|
// @returns: `properties` will be reset on success. Please note that we will
|
|
|
|
// return Status::Incomplete() if table is not present in cache and
|
|
|
|
// we set `no_io` to be true.
|
|
|
|
Status GetTableProperties(const EnvOptions& toptions,
|
|
|
|
const InternalKeyComparator& internal_comparator,
|
|
|
|
const FileDescriptor& file_meta,
|
|
|
|
std::shared_ptr<const TableProperties>* properties,
|
|
|
|
const SliceTransform* prefix_extractor = nullptr,
|
|
|
|
bool no_io = false);
|
|
|
|
|
|
|
|
// Return total memory usage of the table reader of the file.
|
|
|
|
// 0 if table reader of the file is not loaded.
|
|
|
|
size_t GetMemoryUsageByTableReader(
|
|
|
|
const EnvOptions& toptions,
|
|
|
|
const InternalKeyComparator& internal_comparator,
|
|
|
|
const FileDescriptor& fd,
|
|
|
|
const SliceTransform* prefix_extractor = nullptr);
|
|
|
|
|
|
|
|
// Release the handle from a cache
|
|
|
|
void ReleaseHandle(Cache::Handle* handle);
|
|
|
|
|
|
|
|
// Capacity of the backing Cache that indicates inifinite TableCache capacity.
|
|
|
|
// For example when max_open_files is -1 we set the backing Cache to this.
|
|
|
|
static const int kInfiniteCapacity = 0x400000;
|
|
|
|
|
|
|
|
// The tables opened with this TableCache will be immortal, i.e., their
|
|
|
|
// lifetime is as long as that of the DB.
|
|
|
|
void SetTablesAreImmortal() {
|
|
|
|
if (cache_->GetCapacity() >= kInfiniteCapacity) {
|
|
|
|
immortal_tables_ = true;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
private:
|
|
|
|
// Build a table reader
|
|
|
|
Status GetTableReader(const EnvOptions& env_options,
|
|
|
|
const InternalKeyComparator& internal_comparator,
|
|
|
|
const FileDescriptor& fd, bool sequential_mode,
|
|
|
|
size_t readahead, bool record_read_stats,
|
|
|
|
HistogramImpl* file_read_hist,
|
Skip bottom-level filter block caching when hit-optimized
Summary:
When Get() or NewIterator() trigger file loads, skip caching the filter block if
(1) optimize_filters_for_hits is set and (2) the file is on the bottommost
level. Also skip checking filters under the same conditions, which means that
for a preloaded file or a file that was trivially-moved to the bottom level, its
filter block will eventually expire from the cache.
- added parameters/instance variables in various places in order to propagate the config ("skip_filters") from version_set to block_based_table_reader
- in BlockBasedTable::Rep, this optimization prevents filter from being loaded when the file is opened simply by setting filter_policy = nullptr
- in BlockBasedTable::Get/BlockBasedTable::NewIterator, this optimization prevents filter from being used (even if it was loaded already) by setting filter = nullptr
Test Plan:
updated unit test:
$ ./db_test --gtest_filter=DBTest.OptimizeFiltersForHits
will also run 'make check'
Reviewers: sdong, igor, paultuckfield, anthony, rven, kradhakrishnan, IslamAbdelRahman, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D51633
9 years ago
|
|
|
unique_ptr<TableReader>* table_reader,
|
|
|
|
const SliceTransform* prefix_extractor = nullptr,
|
|
|
|
bool skip_filters = false, int level = -1,
|
|
|
|
bool prefetch_index_and_filter_in_cache = true,
|
|
|
|
bool for_compaction = false);
|
|
|
|
|
|
|
|
const ImmutableCFOptions& ioptions_;
|
|
|
|
const EnvOptions& env_options_;
|
[CF] Rethink table cache
Summary:
Adapting table cache to column families is interesting. We want table cache to be global LRU, so if some column families are use not as often as others, we want them to be evicted from cache. However, current TableCache object also constructs tables on its own. If table is not found in the cache, TableCache automatically creates new table. We want each column family to be able to specify different table factory.
To solve the problem, we still have a single LRU, but we provide the LRUCache object to TableCache on construction. We have one TableCache per column family, but the underyling cache is shared by all TableCache objects.
This allows us to have a global LRU, but still be able to support different table factories for different column families. Also, in the future it will also be able to support different directories for different column families.
Test Plan: make check
Reviewers: dhruba, haobo, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15915
11 years ago
|
|
|
Cache* const cache_;
|
|
|
|
std::string row_cache_id_;
|
|
|
|
bool immortal_tables_;
|
|
|
|
};
|
|
|
|
|
|
|
|
} // namespace rocksdb
|