You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
rocksdb/db/version_builder.cc

1427 lines
46 KiB

// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
// This source code is licensed under both the GPLv2 (found in the
// COPYING file in the root directory) and Apache 2.0 License
// (found in the LICENSE.Apache file in the root directory).
//
// Copyright (c) 2011 The LevelDB Authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file. See the AUTHORS file for names of contributors.
#include "db/version_builder.h"
#include <algorithm>
#include <atomic>
#include <cinttypes>
#include <functional>
#include <map>
Add blob files to VersionStorageInfo/VersionBuilder (#6597) Summary: The patch adds a couple of classes to represent metadata about blob files: `SharedBlobFileMetaData` contains the information elements that are immutable (once the blob file is closed), e.g. blob file number, total number and size of blob files, checksum method/value, while `BlobFileMetaData` contains attributes that can vary across versions like the amount of garbage in the file. There is a single `SharedBlobFileMetaData` for each blob file, which is jointly owned by the `BlobFileMetaData` objects that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s and can also be shared if the (immutable _and_ mutable) state of the blob file is the same in two versions. In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends `VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata to a new `VersionStorageInfo`. Consistency checks are also extended to ensure that table files point to blob files that are part of the `Version`, and that all blob files that are part of any given `Version` have at least some _non_-garbage data in them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20656803 Pulled By: ltamasi fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
5 years ago
#include <memory>
#include <set>
Add blob files to VersionStorageInfo/VersionBuilder (#6597) Summary: The patch adds a couple of classes to represent metadata about blob files: `SharedBlobFileMetaData` contains the information elements that are immutable (once the blob file is closed), e.g. blob file number, total number and size of blob files, checksum method/value, while `BlobFileMetaData` contains attributes that can vary across versions like the amount of garbage in the file. There is a single `SharedBlobFileMetaData` for each blob file, which is jointly owned by the `BlobFileMetaData` objects that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s and can also be shared if the (immutable _and_ mutable) state of the blob file is the same in two versions. In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends `VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata to a new `VersionStorageInfo`. Consistency checks are also extended to ensure that table files point to blob files that are part of the `Version`, and that all blob files that are part of any given `Version` have at least some _non_-garbage data in them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20656803 Pulled By: ltamasi fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
5 years ago
#include <sstream>
#include <thread>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>
Account memory of FileMetaData in global memory limit (#9924) Summary: **Context/Summary:** As revealed by heap profiling, allocation of `FileMetaData` for [newly created file added to a Version](https://github.com/facebook/rocksdb/pull/9924/files#diff-a6aa385940793f95a2c5b39cc670bd440c4547fa54fd44622f756382d5e47e43R774) can consume significant heap memory. This PR is to account that toward our global memory limit based on block cache capacity. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9924 Test Plan: - Previous `make check` verified there are only 2 places where the memory of the allocated `FileMetaData` can be released - New unit test `TEST_P(ChargeFileMetadataTestWithParam, Basic)` - db bench (CPU cost of `charge_file_metadata` in write and compact) - **write micros/op: -0.24%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 (remove this option for pre-PR) -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 | egrep 'fillseq'` - **compact micros/op -0.87%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 -numdistinct=1000 && ./db_bench -benchmarks=compact -db=$TEST_TMPDIR -use_existing_db=1 -charge_file_metadata=1 -disable_auto_compactions=1 | egrep 'compact'` table 1 - write #-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%) -- | -- | -- | -- | -- | -- 10 | 3.9711 | 0.264408 | 3.9914 | 0.254563 | 0.5111933721 20 | 3.83905 | 0.0664488 | 3.8251 | 0.0695456 | -0.3633711465 40 | 3.86625 | 0.136669 | 3.8867 | 0.143765 | 0.5289363078 80 | 3.87828 | 0.119007 | 3.86791 | 0.115674 | **-0.2673865734** 160 | 3.87677 | 0.162231 | 3.86739 | 0.16663 | **-0.2419539978** table 2 - compact #-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%) -- | -- | -- | -- | -- | -- 10 | 2,399,650.00 | 96,375.80 | 2,359,537.00 | 53,243.60 | -1.67 20 | 2,410,480.00 | 89,988.00 | 2,433,580.00 | 91,121.20 | 0.96 40 | 2.41E+06 | 121811 | 2.39E+06 | 131525 | **-0.96** 80 | 2.40E+06 | 134503 | 2.39E+06 | 108799 | **-0.78** - stress test: `python3 tools/db_crashtest.py blackbox --charge_file_metadata=1 --cache_size=1` killed as normal Reviewed By: ajkr Differential Revision: D36055583 Pulled By: hx235 fbshipit-source-id: b60eab94707103cb1322cf815f05810ef0232625
2 years ago
#include "cache/cache_reservation_manager.h"
Add blob files to VersionStorageInfo/VersionBuilder (#6597) Summary: The patch adds a couple of classes to represent metadata about blob files: `SharedBlobFileMetaData` contains the information elements that are immutable (once the blob file is closed), e.g. blob file number, total number and size of blob files, checksum method/value, while `BlobFileMetaData` contains attributes that can vary across versions like the amount of garbage in the file. There is a single `SharedBlobFileMetaData` for each blob file, which is jointly owned by the `BlobFileMetaData` objects that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s and can also be shared if the (immutable _and_ mutable) state of the blob file is the same in two versions. In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends `VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata to a new `VersionStorageInfo`. Consistency checks are also extended to ensure that table files point to blob files that are part of the `Version`, and that all blob files that are part of any given `Version` have at least some _non_-garbage data in them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20656803 Pulled By: ltamasi fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
5 years ago
#include "db/blob/blob_file_meta.h"
#include "db/dbformat.h"
#include "db/internal_stats.h"
#include "db/table_cache.h"
Sort L0 files by newly introduced epoch_num (#10922) Summary: **Context:** Sorting L0 files by `largest_seqno` has at least two inconvenience: - File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap. - For example, consider the following sequence of events ("key@n" indicates key at seqno "n") - insert k1@1 to memtable m1 - ingest file s1 with k2@2, ingest file s2 with k3@3 - insert k4@4 to m1 - compact files s1, s2 and result in new file s3 of seqno range [2, 3] - flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1 - However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption. - Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption - For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr for this example) - an existing SST s1 contains only k1@1 - insert k1@2 to memtable m1 - ingest file s2 with k3@3, ingest file s3 with k4@4 - insert single delete k5@5 in m1 - flush m1 and result in new file s4 of seqno range [2, 5] - compact s1, s2, s3 and result in new file s5 of seqno range [1, 4] - compact s4 and result in new file s6 of seqno range [2] due to single delete - By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno` Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways: - In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number` ordering check. This will result in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more. - In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption. **Summary:** - Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`. - `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`) - Compaction output file is assigned with the minimum `epoch_number` among input files' - Refit level: reuse refitted file's epoch_number - Other paths needing `epoch_number` treatment: - Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo` - Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`. - Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair). - Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder. - Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery - Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more - Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag` - Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above - Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`. - Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR. - Misc: - update existing tests with `epoch_number` so make check will pass - update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases - assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber() Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922 Test Plan: - `make check` - New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc` - Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930 - [Ongoing] Compatibility test: manually run https://github.com/ajkr/rocksdb/commit/36a5686ec012f35a4371e409aa85c404ca1c210d (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox` - [Ongoing] normal db stress test - [Ongoing] db stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761 Reviewed By: ajkr Differential Revision: D41063187 Pulled By: hx235 fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9
2 years ago
#include "db/version_edit.h"
#include "db/version_set.h"
#include "port/port.h"
#include "table/table_reader.h"
#include "util/string_util.h"
namespace ROCKSDB_NAMESPACE {
class VersionBuilder::Rep {
Sort L0 files by newly introduced epoch_num (#10922) Summary: **Context:** Sorting L0 files by `largest_seqno` has at least two inconvenience: - File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap. - For example, consider the following sequence of events ("key@n" indicates key at seqno "n") - insert k1@1 to memtable m1 - ingest file s1 with k2@2, ingest file s2 with k3@3 - insert k4@4 to m1 - compact files s1, s2 and result in new file s3 of seqno range [2, 3] - flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1 - However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption. - Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption - For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr for this example) - an existing SST s1 contains only k1@1 - insert k1@2 to memtable m1 - ingest file s2 with k3@3, ingest file s3 with k4@4 - insert single delete k5@5 in m1 - flush m1 and result in new file s4 of seqno range [2, 5] - compact s1, s2, s3 and result in new file s5 of seqno range [1, 4] - compact s4 and result in new file s6 of seqno range [2] due to single delete - By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno` Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways: - In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number` ordering check. This will result in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more. - In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption. **Summary:** - Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`. - `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`) - Compaction output file is assigned with the minimum `epoch_number` among input files' - Refit level: reuse refitted file's epoch_number - Other paths needing `epoch_number` treatment: - Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo` - Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`. - Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair). - Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder. - Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery - Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more - Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag` - Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above - Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`. - Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR. - Misc: - update existing tests with `epoch_number` so make check will pass - update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases - assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber() Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922 Test Plan: - `make check` - New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc` - Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930 - [Ongoing] Compatibility test: manually run https://github.com/ajkr/rocksdb/commit/36a5686ec012f35a4371e409aa85c404ca1c210d (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox` - [Ongoing] normal db stress test - [Ongoing] db stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761 Reviewed By: ajkr Differential Revision: D41063187 Pulled By: hx235 fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9
2 years ago
class NewestFirstByEpochNumber {
private:
inline static const NewestFirstBySeqNo seqno_cmp;
public:
bool operator()(const FileMetaData* lhs, const FileMetaData* rhs) const {
assert(lhs);
assert(rhs);
Sort L0 files by newly introduced epoch_num (#10922) Summary: **Context:** Sorting L0 files by `largest_seqno` has at least two inconvenience: - File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap. - For example, consider the following sequence of events ("key@n" indicates key at seqno "n") - insert k1@1 to memtable m1 - ingest file s1 with k2@2, ingest file s2 with k3@3 - insert k4@4 to m1 - compact files s1, s2 and result in new file s3 of seqno range [2, 3] - flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1 - However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption. - Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption - For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr for this example) - an existing SST s1 contains only k1@1 - insert k1@2 to memtable m1 - ingest file s2 with k3@3, ingest file s3 with k4@4 - insert single delete k5@5 in m1 - flush m1 and result in new file s4 of seqno range [2, 5] - compact s1, s2, s3 and result in new file s5 of seqno range [1, 4] - compact s4 and result in new file s6 of seqno range [2] due to single delete - By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno` Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways: - In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number` ordering check. This will result in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more. - In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption. **Summary:** - Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`. - `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`) - Compaction output file is assigned with the minimum `epoch_number` among input files' - Refit level: reuse refitted file's epoch_number - Other paths needing `epoch_number` treatment: - Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo` - Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`. - Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair). - Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder. - Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery - Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more - Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag` - Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above - Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`. - Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR. - Misc: - update existing tests with `epoch_number` so make check will pass - update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases - assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber() Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922 Test Plan: - `make check` - New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc` - Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930 - [Ongoing] Compatibility test: manually run https://github.com/ajkr/rocksdb/commit/36a5686ec012f35a4371e409aa85c404ca1c210d (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox` - [Ongoing] normal db stress test - [Ongoing] db stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761 Reviewed By: ajkr Differential Revision: D41063187 Pulled By: hx235 fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9
2 years ago
if (lhs->epoch_number != rhs->epoch_number) {
return lhs->epoch_number > rhs->epoch_number;
} else {
return seqno_cmp(lhs, rhs);
}
}
};
class BySmallestKey {
public:
explicit BySmallestKey(const InternalKeyComparator* cmp) : cmp_(cmp) {}
bool operator()(const FileMetaData* lhs, const FileMetaData* rhs) const {
assert(lhs);
assert(rhs);
assert(cmp_);
const int r = cmp_->Compare(lhs->smallest, rhs->smallest);
if (r != 0) {
return (r < 0);
}
// Break ties by file number
return (lhs->fd.GetNumber() < rhs->fd.GetNumber());
}
private:
const InternalKeyComparator* cmp_;
};
struct LevelState {
std::unordered_set<uint64_t> deleted_files;
// Map from file number to file meta data.
std::unordered_map<uint64_t, FileMetaData*> added_files;
};
Aggregate blob file related changes in VersionBuilder as VersionEdits are applied (#9085) Summary: The current VersionBuilder code on mainline keeps track of blob file related changes ("delta") induced by a series of `VersionEdit`s in the form of `BlobFileMetaDataDelta` objects. Specifically, `BlobFileMetaDataDelta` contains the amount of additional garbage generated by compactions, as well as the set of newly linked/unlinked SSTs. This is very handy for detecting trivial moves, since in that case the newly linked and unlinked SSTs cancel each other out. However, this representation does not allow us to easily tell whether a certain blob file is obsolete after applying a set of `VersionEdit`s or not. In order to solve this issue, the patch introduces `MutableBlobFileMetaData`, which, in addition to the delta, also contains the materialized state after applying a set of version edits (i.e. the total amount of garbage and the resulting set of linked SSTs). This will enable us to add further consistency checks and to improve certain pieces of functionality where knowing up front which blob files get obsoleted is beneficial. (Note: this patch is just the refactoring part; I plan to create separate PRs for the enhancements.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9085 Test Plan: Ran `make check` and the stress tests in BlobDB mode. Reviewed By: riversand963 Differential Revision: D31980867 Pulled By: ltamasi fbshipit-source-id: cc4286778b10900af720423d6b772c77f28a93e3
3 years ago
// A class that represents the accumulated changes (like additional garbage or
// newly linked/unlinked SST files) for a given blob file after applying a
// series of VersionEdits.
class BlobFileMetaDataDelta {
public:
bool IsEmpty() const {
Aggregate blob file related changes in VersionBuilder as VersionEdits are applied (#9085) Summary: The current VersionBuilder code on mainline keeps track of blob file related changes ("delta") induced by a series of `VersionEdit`s in the form of `BlobFileMetaDataDelta` objects. Specifically, `BlobFileMetaDataDelta` contains the amount of additional garbage generated by compactions, as well as the set of newly linked/unlinked SSTs. This is very handy for detecting trivial moves, since in that case the newly linked and unlinked SSTs cancel each other out. However, this representation does not allow us to easily tell whether a certain blob file is obsolete after applying a set of `VersionEdit`s or not. In order to solve this issue, the patch introduces `MutableBlobFileMetaData`, which, in addition to the delta, also contains the materialized state after applying a set of version edits (i.e. the total amount of garbage and the resulting set of linked SSTs). This will enable us to add further consistency checks and to improve certain pieces of functionality where knowing up front which blob files get obsoleted is beneficial. (Note: this patch is just the refactoring part; I plan to create separate PRs for the enhancements.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9085 Test Plan: Ran `make check` and the stress tests in BlobDB mode. Reviewed By: riversand963 Differential Revision: D31980867 Pulled By: ltamasi fbshipit-source-id: cc4286778b10900af720423d6b772c77f28a93e3
3 years ago
return !additional_garbage_count_ && !additional_garbage_bytes_ &&
newly_linked_ssts_.empty() && newly_unlinked_ssts_.empty();
}
uint64_t GetAdditionalGarbageCount() const {
return additional_garbage_count_;
}
uint64_t GetAdditionalGarbageBytes() const {
return additional_garbage_bytes_;
}
const std::unordered_set<uint64_t>& GetNewlyLinkedSsts() const {
return newly_linked_ssts_;
}
const std::unordered_set<uint64_t>& GetNewlyUnlinkedSsts() const {
return newly_unlinked_ssts_;
}
void AddGarbage(uint64_t count, uint64_t bytes) {
additional_garbage_count_ += count;
additional_garbage_bytes_ += bytes;
}
void LinkSst(uint64_t sst_file_number) {
assert(newly_linked_ssts_.find(sst_file_number) ==
newly_linked_ssts_.end());
// Reconcile with newly unlinked SSTs on the fly. (Note: an SST can be
// linked to and unlinked from the same blob file in the case of a trivial
// move.)
auto it = newly_unlinked_ssts_.find(sst_file_number);
if (it != newly_unlinked_ssts_.end()) {
newly_unlinked_ssts_.erase(it);
} else {
newly_linked_ssts_.emplace(sst_file_number);
}
}
void UnlinkSst(uint64_t sst_file_number) {
assert(newly_unlinked_ssts_.find(sst_file_number) ==
newly_unlinked_ssts_.end());
// Reconcile with newly linked SSTs on the fly. (Note: an SST can be
// linked to and unlinked from the same blob file in the case of a trivial
// move.)
auto it = newly_linked_ssts_.find(sst_file_number);
if (it != newly_linked_ssts_.end()) {
newly_linked_ssts_.erase(it);
} else {
newly_unlinked_ssts_.emplace(sst_file_number);
}
}
private:
uint64_t additional_garbage_count_ = 0;
uint64_t additional_garbage_bytes_ = 0;
std::unordered_set<uint64_t> newly_linked_ssts_;
std::unordered_set<uint64_t> newly_unlinked_ssts_;
};
Aggregate blob file related changes in VersionBuilder as VersionEdits are applied (#9085) Summary: The current VersionBuilder code on mainline keeps track of blob file related changes ("delta") induced by a series of `VersionEdit`s in the form of `BlobFileMetaDataDelta` objects. Specifically, `BlobFileMetaDataDelta` contains the amount of additional garbage generated by compactions, as well as the set of newly linked/unlinked SSTs. This is very handy for detecting trivial moves, since in that case the newly linked and unlinked SSTs cancel each other out. However, this representation does not allow us to easily tell whether a certain blob file is obsolete after applying a set of `VersionEdit`s or not. In order to solve this issue, the patch introduces `MutableBlobFileMetaData`, which, in addition to the delta, also contains the materialized state after applying a set of version edits (i.e. the total amount of garbage and the resulting set of linked SSTs). This will enable us to add further consistency checks and to improve certain pieces of functionality where knowing up front which blob files get obsoleted is beneficial. (Note: this patch is just the refactoring part; I plan to create separate PRs for the enhancements.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9085 Test Plan: Ran `make check` and the stress tests in BlobDB mode. Reviewed By: riversand963 Differential Revision: D31980867 Pulled By: ltamasi fbshipit-source-id: cc4286778b10900af720423d6b772c77f28a93e3
3 years ago
// A class that represents the state of a blob file after applying a series of
// VersionEdits. In addition to the resulting state, it also contains the
// delta (see BlobFileMetaDataDelta above). The resulting state can be used to
// identify obsolete blob files, while the delta makes it possible to
// efficiently detect trivial moves.
class MutableBlobFileMetaData {
public:
// To be used for brand new blob files
explicit MutableBlobFileMetaData(
std::shared_ptr<SharedBlobFileMetaData>&& shared_meta)
: shared_meta_(std::move(shared_meta)) {}
// To be used for pre-existing blob files
explicit MutableBlobFileMetaData(
const std::shared_ptr<BlobFileMetaData>& meta)
: shared_meta_(meta->GetSharedMeta()),
linked_ssts_(meta->GetLinkedSsts()),
garbage_blob_count_(meta->GetGarbageBlobCount()),
garbage_blob_bytes_(meta->GetGarbageBlobBytes()) {}
const std::shared_ptr<SharedBlobFileMetaData>& GetSharedMeta() const {
return shared_meta_;
}
uint64_t GetBlobFileNumber() const {
assert(shared_meta_);
return shared_meta_->GetBlobFileNumber();
}
Aggregate blob file related changes in VersionBuilder as VersionEdits are applied (#9085) Summary: The current VersionBuilder code on mainline keeps track of blob file related changes ("delta") induced by a series of `VersionEdit`s in the form of `BlobFileMetaDataDelta` objects. Specifically, `BlobFileMetaDataDelta` contains the amount of additional garbage generated by compactions, as well as the set of newly linked/unlinked SSTs. This is very handy for detecting trivial moves, since in that case the newly linked and unlinked SSTs cancel each other out. However, this representation does not allow us to easily tell whether a certain blob file is obsolete after applying a set of `VersionEdit`s or not. In order to solve this issue, the patch introduces `MutableBlobFileMetaData`, which, in addition to the delta, also contains the materialized state after applying a set of version edits (i.e. the total amount of garbage and the resulting set of linked SSTs). This will enable us to add further consistency checks and to improve certain pieces of functionality where knowing up front which blob files get obsoleted is beneficial. (Note: this patch is just the refactoring part; I plan to create separate PRs for the enhancements.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9085 Test Plan: Ran `make check` and the stress tests in BlobDB mode. Reviewed By: riversand963 Differential Revision: D31980867 Pulled By: ltamasi fbshipit-source-id: cc4286778b10900af720423d6b772c77f28a93e3
3 years ago
bool HasDelta() const { return !delta_.IsEmpty(); }
const std::unordered_set<uint64_t>& GetLinkedSsts() const {
return linked_ssts_;
}
uint64_t GetGarbageBlobCount() const { return garbage_blob_count_; }
uint64_t GetGarbageBlobBytes() const { return garbage_blob_bytes_; }
bool AddGarbage(uint64_t count, uint64_t bytes) {
assert(shared_meta_);
if (garbage_blob_count_ + count > shared_meta_->GetTotalBlobCount() ||
garbage_blob_bytes_ + bytes > shared_meta_->GetTotalBlobBytes()) {
return false;
}
Aggregate blob file related changes in VersionBuilder as VersionEdits are applied (#9085) Summary: The current VersionBuilder code on mainline keeps track of blob file related changes ("delta") induced by a series of `VersionEdit`s in the form of `BlobFileMetaDataDelta` objects. Specifically, `BlobFileMetaDataDelta` contains the amount of additional garbage generated by compactions, as well as the set of newly linked/unlinked SSTs. This is very handy for detecting trivial moves, since in that case the newly linked and unlinked SSTs cancel each other out. However, this representation does not allow us to easily tell whether a certain blob file is obsolete after applying a set of `VersionEdit`s or not. In order to solve this issue, the patch introduces `MutableBlobFileMetaData`, which, in addition to the delta, also contains the materialized state after applying a set of version edits (i.e. the total amount of garbage and the resulting set of linked SSTs). This will enable us to add further consistency checks and to improve certain pieces of functionality where knowing up front which blob files get obsoleted is beneficial. (Note: this patch is just the refactoring part; I plan to create separate PRs for the enhancements.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9085 Test Plan: Ran `make check` and the stress tests in BlobDB mode. Reviewed By: riversand963 Differential Revision: D31980867 Pulled By: ltamasi fbshipit-source-id: cc4286778b10900af720423d6b772c77f28a93e3
3 years ago
delta_.AddGarbage(count, bytes);
garbage_blob_count_ += count;
garbage_blob_bytes_ += bytes;
return true;
Aggregate blob file related changes in VersionBuilder as VersionEdits are applied (#9085) Summary: The current VersionBuilder code on mainline keeps track of blob file related changes ("delta") induced by a series of `VersionEdit`s in the form of `BlobFileMetaDataDelta` objects. Specifically, `BlobFileMetaDataDelta` contains the amount of additional garbage generated by compactions, as well as the set of newly linked/unlinked SSTs. This is very handy for detecting trivial moves, since in that case the newly linked and unlinked SSTs cancel each other out. However, this representation does not allow us to easily tell whether a certain blob file is obsolete after applying a set of `VersionEdit`s or not. In order to solve this issue, the patch introduces `MutableBlobFileMetaData`, which, in addition to the delta, also contains the materialized state after applying a set of version edits (i.e. the total amount of garbage and the resulting set of linked SSTs). This will enable us to add further consistency checks and to improve certain pieces of functionality where knowing up front which blob files get obsoleted is beneficial. (Note: this patch is just the refactoring part; I plan to create separate PRs for the enhancements.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9085 Test Plan: Ran `make check` and the stress tests in BlobDB mode. Reviewed By: riversand963 Differential Revision: D31980867 Pulled By: ltamasi fbshipit-source-id: cc4286778b10900af720423d6b772c77f28a93e3
3 years ago
}
void LinkSst(uint64_t sst_file_number) {
delta_.LinkSst(sst_file_number);
assert(linked_ssts_.find(sst_file_number) == linked_ssts_.end());
linked_ssts_.emplace(sst_file_number);
}
void UnlinkSst(uint64_t sst_file_number) {
delta_.UnlinkSst(sst_file_number);
assert(linked_ssts_.find(sst_file_number) != linked_ssts_.end());
linked_ssts_.erase(sst_file_number);
}
private:
std::shared_ptr<SharedBlobFileMetaData> shared_meta_;
// Accumulated changes
BlobFileMetaDataDelta delta_;
// Resulting state after applying the changes
BlobFileMetaData::LinkedSsts linked_ssts_;
uint64_t garbage_blob_count_ = 0;
uint64_t garbage_blob_bytes_ = 0;
};
Introduce a new storage specific Env API (#5761) Summary: The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc. This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO. The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before. This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection. The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761 Differential Revision: D18868376 Pulled By: anand1976 fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
const FileOptions& file_options_;
const ImmutableCFOptions* const ioptions_;
TableCache* table_cache_;
VersionStorageInfo* base_vstorage_;
VersionSet* version_set_;
int num_levels_;
LevelState* levels_;
Improve consistency checks in VersionBuilder (#6901) Summary: The patch cleans up the code and improves the consistency checks around adding/deleting table files in `VersionBuilder`. Namely, it makes the checks stricter and improves them in the following ways: 1) A table file can now only be deleted from the LSM tree using the level it resides on. Earlier, there was some unnecessary wiggle room for trivially moved files (they could be deleted using a lower level number than the actual one). 2) A table file cannot be added to the tree if it is already present in the tree on any level (not just the target level). The earlier code only had an assertion (which is a no-op in release builds) that the newly added file is not already present on the target level. 3) The above consistency checks around state transitions are now mandatory, as opposed to the earlier `CheckConsistencyForDeletes`, which was a no-op in release mode unless `force_consistency_checks` was set to `true`. The rationale here is that assuming that the initial state is consistent, a valid transition leads to a next state that is also consistent; however, an *invalid* transition offers no such guarantee. Hence it makes sense to validate the transitions unconditionally, and save `force_consistency_checks` for the paranoid checks that re-validate the entire state. 4) The new checks build on the mechanism introduced in https://github.com/facebook/rocksdb/pull/6862, which enables us to efficiently look up the location (level and position within level) of files in a `Version` by file number. This makes the consistency checks much more efficient than the earlier `CheckConsistencyForDeletes`, which essentially performed a linear search. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6901 Test Plan: Extended the unit tests and ran: `make check` `make whitebox_crash_test` Reviewed By: ajkr Differential Revision: D21822714 Pulled By: ltamasi fbshipit-source-id: e2b29c8b6da1bf0f59004acc889e4870b2d18215
5 years ago
// Store sizes of levels larger than num_levels_. We do this instead of
// storing them in levels_ to avoid regression in case there are no files
// on invalid levels. The version is not consistent if in the end the files
// on invalid levels don't cancel out.
Improve consistency checks in VersionBuilder (#6901) Summary: The patch cleans up the code and improves the consistency checks around adding/deleting table files in `VersionBuilder`. Namely, it makes the checks stricter and improves them in the following ways: 1) A table file can now only be deleted from the LSM tree using the level it resides on. Earlier, there was some unnecessary wiggle room for trivially moved files (they could be deleted using a lower level number than the actual one). 2) A table file cannot be added to the tree if it is already present in the tree on any level (not just the target level). The earlier code only had an assertion (which is a no-op in release builds) that the newly added file is not already present on the target level. 3) The above consistency checks around state transitions are now mandatory, as opposed to the earlier `CheckConsistencyForDeletes`, which was a no-op in release mode unless `force_consistency_checks` was set to `true`. The rationale here is that assuming that the initial state is consistent, a valid transition leads to a next state that is also consistent; however, an *invalid* transition offers no such guarantee. Hence it makes sense to validate the transitions unconditionally, and save `force_consistency_checks` for the paranoid checks that re-validate the entire state. 4) The new checks build on the mechanism introduced in https://github.com/facebook/rocksdb/pull/6862, which enables us to efficiently look up the location (level and position within level) of files in a `Version` by file number. This makes the consistency checks much more efficient than the earlier `CheckConsistencyForDeletes`, which essentially performed a linear search. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6901 Test Plan: Extended the unit tests and ran: `make check` `make whitebox_crash_test` Reviewed By: ajkr Differential Revision: D21822714 Pulled By: ltamasi fbshipit-source-id: e2b29c8b6da1bf0f59004acc889e4870b2d18215
5 years ago
std::unordered_map<int, size_t> invalid_level_sizes_;
// Whether there are invalid new files or invalid deletion on levels larger
// than num_levels_.
bool has_invalid_levels_;
Improve consistency checks in VersionBuilder (#6901) Summary: The patch cleans up the code and improves the consistency checks around adding/deleting table files in `VersionBuilder`. Namely, it makes the checks stricter and improves them in the following ways: 1) A table file can now only be deleted from the LSM tree using the level it resides on. Earlier, there was some unnecessary wiggle room for trivially moved files (they could be deleted using a lower level number than the actual one). 2) A table file cannot be added to the tree if it is already present in the tree on any level (not just the target level). The earlier code only had an assertion (which is a no-op in release builds) that the newly added file is not already present on the target level. 3) The above consistency checks around state transitions are now mandatory, as opposed to the earlier `CheckConsistencyForDeletes`, which was a no-op in release mode unless `force_consistency_checks` was set to `true`. The rationale here is that assuming that the initial state is consistent, a valid transition leads to a next state that is also consistent; however, an *invalid* transition offers no such guarantee. Hence it makes sense to validate the transitions unconditionally, and save `force_consistency_checks` for the paranoid checks that re-validate the entire state. 4) The new checks build on the mechanism introduced in https://github.com/facebook/rocksdb/pull/6862, which enables us to efficiently look up the location (level and position within level) of files in a `Version` by file number. This makes the consistency checks much more efficient than the earlier `CheckConsistencyForDeletes`, which essentially performed a linear search. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6901 Test Plan: Extended the unit tests and ran: `make check` `make whitebox_crash_test` Reviewed By: ajkr Differential Revision: D21822714 Pulled By: ltamasi fbshipit-source-id: e2b29c8b6da1bf0f59004acc889e4870b2d18215
5 years ago
// Current levels of table files affected by additions/deletions.
std::unordered_map<uint64_t, int> table_file_levels_;
Add basic kRoundRobin compaction policy (#10107) Summary: Add `kRoundRobin` as a compaction priority. The implementation is as follows. - Define a cursor as the smallest Internal key in the successor of the selected file. Add `vector<InternalKey> compact_cursor_` into `VersionStorageInfo` where each element (`InternalKey`) in `compact_cursor_` represents a cursor. In round-robin compaction policy, we just need to select the first file (assuming files are sorted) and also has the smallest InternalKey larger than/equal to the cursor. After a file is chosen, we create a new `Fsize` vector which puts the selected file is placed at the first position in `temp`, the next cursor is then updated as the smallest InternalKey in successor of the selected file (the above logic is implemented in `SortFileByRoundRobin`). - After a compaction succeeds, typically `InstallCompactionResults()`, we choose the next cursor for the input level and save it to `edit`. When calling `LogAndApply`, we save the next cursor with its level into some local variable and finally apply the change to `vstorage` in `SaveTo` function. - Cursors are persist pair by pair (<level, InternalKey>) in `EncodeTo` so that they can be reconstructed when reopening. An empty cursor will not be encoded to MANIFEST Pull Request resolved: https://github.com/facebook/rocksdb/pull/10107 Test Plan: add unit test (`CompactionPriRoundRobin`) in `compaction_picker_test`, add `kRoundRobin` priority in `CompactionPriTest` from `db_compaction_test`, and add `PersistRoundRobinCompactCursor` in `db_compaction_test` Reviewed By: ajkr Differential Revision: D37316037 Pulled By: littlepig2013 fbshipit-source-id: 9f481748190ace416079139044e00df2968fb1ee
2 years ago
// Current compact cursors that should be changed after the last compaction
std::unordered_map<int, InternalKey> updated_compact_cursors_;
Sort L0 files by newly introduced epoch_num (#10922) Summary: **Context:** Sorting L0 files by `largest_seqno` has at least two inconvenience: - File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap. - For example, consider the following sequence of events ("key@n" indicates key at seqno "n") - insert k1@1 to memtable m1 - ingest file s1 with k2@2, ingest file s2 with k3@3 - insert k4@4 to m1 - compact files s1, s2 and result in new file s3 of seqno range [2, 3] - flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1 - However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption. - Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption - For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr for this example) - an existing SST s1 contains only k1@1 - insert k1@2 to memtable m1 - ingest file s2 with k3@3, ingest file s3 with k4@4 - insert single delete k5@5 in m1 - flush m1 and result in new file s4 of seqno range [2, 5] - compact s1, s2, s3 and result in new file s5 of seqno range [1, 4] - compact s4 and result in new file s6 of seqno range [2] due to single delete - By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno` Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways: - In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number` ordering check. This will result in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more. - In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption. **Summary:** - Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`. - `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`) - Compaction output file is assigned with the minimum `epoch_number` among input files' - Refit level: reuse refitted file's epoch_number - Other paths needing `epoch_number` treatment: - Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo` - Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`. - Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair). - Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder. - Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery - Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more - Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag` - Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above - Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`. - Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR. - Misc: - update existing tests with `epoch_number` so make check will pass - update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases - assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber() Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922 Test Plan: - `make check` - New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc` - Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930 - [Ongoing] Compatibility test: manually run https://github.com/ajkr/rocksdb/commit/36a5686ec012f35a4371e409aa85c404ca1c210d (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox` - [Ongoing] normal db stress test - [Ongoing] db stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761 Reviewed By: ajkr Differential Revision: D41063187 Pulled By: hx235 fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9
2 years ago
NewestFirstByEpochNumber level_zero_cmp_by_epochno_;
NewestFirstBySeqNo level_zero_cmp_by_seqno_;
BySmallestKey level_nonzero_cmp_;
Aggregate blob file related changes in VersionBuilder as VersionEdits are applied (#9085) Summary: The current VersionBuilder code on mainline keeps track of blob file related changes ("delta") induced by a series of `VersionEdit`s in the form of `BlobFileMetaDataDelta` objects. Specifically, `BlobFileMetaDataDelta` contains the amount of additional garbage generated by compactions, as well as the set of newly linked/unlinked SSTs. This is very handy for detecting trivial moves, since in that case the newly linked and unlinked SSTs cancel each other out. However, this representation does not allow us to easily tell whether a certain blob file is obsolete after applying a set of `VersionEdit`s or not. In order to solve this issue, the patch introduces `MutableBlobFileMetaData`, which, in addition to the delta, also contains the materialized state after applying a set of version edits (i.e. the total amount of garbage and the resulting set of linked SSTs). This will enable us to add further consistency checks and to improve certain pieces of functionality where knowing up front which blob files get obsoleted is beneficial. (Note: this patch is just the refactoring part; I plan to create separate PRs for the enhancements.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9085 Test Plan: Ran `make check` and the stress tests in BlobDB mode. Reviewed By: riversand963 Differential Revision: D31980867 Pulled By: ltamasi fbshipit-source-id: cc4286778b10900af720423d6b772c77f28a93e3
3 years ago
// Mutable metadata objects for all blob files affected by the series of
// version edits.
std::map<uint64_t, MutableBlobFileMetaData> mutable_blob_file_metas_;
Add blob files to VersionStorageInfo/VersionBuilder (#6597) Summary: The patch adds a couple of classes to represent metadata about blob files: `SharedBlobFileMetaData` contains the information elements that are immutable (once the blob file is closed), e.g. blob file number, total number and size of blob files, checksum method/value, while `BlobFileMetaData` contains attributes that can vary across versions like the amount of garbage in the file. There is a single `SharedBlobFileMetaData` for each blob file, which is jointly owned by the `BlobFileMetaData` objects that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s and can also be shared if the (immutable _and_ mutable) state of the blob file is the same in two versions. In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends `VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata to a new `VersionStorageInfo`. Consistency checks are also extended to ensure that table files point to blob files that are part of the `Version`, and that all blob files that are part of any given `Version` have at least some _non_-garbage data in them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20656803 Pulled By: ltamasi fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
5 years ago
Account memory of FileMetaData in global memory limit (#9924) Summary: **Context/Summary:** As revealed by heap profiling, allocation of `FileMetaData` for [newly created file added to a Version](https://github.com/facebook/rocksdb/pull/9924/files#diff-a6aa385940793f95a2c5b39cc670bd440c4547fa54fd44622f756382d5e47e43R774) can consume significant heap memory. This PR is to account that toward our global memory limit based on block cache capacity. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9924 Test Plan: - Previous `make check` verified there are only 2 places where the memory of the allocated `FileMetaData` can be released - New unit test `TEST_P(ChargeFileMetadataTestWithParam, Basic)` - db bench (CPU cost of `charge_file_metadata` in write and compact) - **write micros/op: -0.24%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 (remove this option for pre-PR) -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 | egrep 'fillseq'` - **compact micros/op -0.87%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 -numdistinct=1000 && ./db_bench -benchmarks=compact -db=$TEST_TMPDIR -use_existing_db=1 -charge_file_metadata=1 -disable_auto_compactions=1 | egrep 'compact'` table 1 - write #-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%) -- | -- | -- | -- | -- | -- 10 | 3.9711 | 0.264408 | 3.9914 | 0.254563 | 0.5111933721 20 | 3.83905 | 0.0664488 | 3.8251 | 0.0695456 | -0.3633711465 40 | 3.86625 | 0.136669 | 3.8867 | 0.143765 | 0.5289363078 80 | 3.87828 | 0.119007 | 3.86791 | 0.115674 | **-0.2673865734** 160 | 3.87677 | 0.162231 | 3.86739 | 0.16663 | **-0.2419539978** table 2 - compact #-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%) -- | -- | -- | -- | -- | -- 10 | 2,399,650.00 | 96,375.80 | 2,359,537.00 | 53,243.60 | -1.67 20 | 2,410,480.00 | 89,988.00 | 2,433,580.00 | 91,121.20 | 0.96 40 | 2.41E+06 | 121811 | 2.39E+06 | 131525 | **-0.96** 80 | 2.40E+06 | 134503 | 2.39E+06 | 108799 | **-0.78** - stress test: `python3 tools/db_crashtest.py blackbox --charge_file_metadata=1 --cache_size=1` killed as normal Reviewed By: ajkr Differential Revision: D36055583 Pulled By: hx235 fbshipit-source-id: b60eab94707103cb1322cf815f05810ef0232625
2 years ago
std::shared_ptr<CacheReservationManager> file_metadata_cache_res_mgr_;
public:
Rep(const FileOptions& file_options, const ImmutableCFOptions* ioptions,
TableCache* table_cache, VersionStorageInfo* base_vstorage,
Account memory of FileMetaData in global memory limit (#9924) Summary: **Context/Summary:** As revealed by heap profiling, allocation of `FileMetaData` for [newly created file added to a Version](https://github.com/facebook/rocksdb/pull/9924/files#diff-a6aa385940793f95a2c5b39cc670bd440c4547fa54fd44622f756382d5e47e43R774) can consume significant heap memory. This PR is to account that toward our global memory limit based on block cache capacity. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9924 Test Plan: - Previous `make check` verified there are only 2 places where the memory of the allocated `FileMetaData` can be released - New unit test `TEST_P(ChargeFileMetadataTestWithParam, Basic)` - db bench (CPU cost of `charge_file_metadata` in write and compact) - **write micros/op: -0.24%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 (remove this option for pre-PR) -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 | egrep 'fillseq'` - **compact micros/op -0.87%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 -numdistinct=1000 && ./db_bench -benchmarks=compact -db=$TEST_TMPDIR -use_existing_db=1 -charge_file_metadata=1 -disable_auto_compactions=1 | egrep 'compact'` table 1 - write #-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%) -- | -- | -- | -- | -- | -- 10 | 3.9711 | 0.264408 | 3.9914 | 0.254563 | 0.5111933721 20 | 3.83905 | 0.0664488 | 3.8251 | 0.0695456 | -0.3633711465 40 | 3.86625 | 0.136669 | 3.8867 | 0.143765 | 0.5289363078 80 | 3.87828 | 0.119007 | 3.86791 | 0.115674 | **-0.2673865734** 160 | 3.87677 | 0.162231 | 3.86739 | 0.16663 | **-0.2419539978** table 2 - compact #-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%) -- | -- | -- | -- | -- | -- 10 | 2,399,650.00 | 96,375.80 | 2,359,537.00 | 53,243.60 | -1.67 20 | 2,410,480.00 | 89,988.00 | 2,433,580.00 | 91,121.20 | 0.96 40 | 2.41E+06 | 121811 | 2.39E+06 | 131525 | **-0.96** 80 | 2.40E+06 | 134503 | 2.39E+06 | 108799 | **-0.78** - stress test: `python3 tools/db_crashtest.py blackbox --charge_file_metadata=1 --cache_size=1` killed as normal Reviewed By: ajkr Differential Revision: D36055583 Pulled By: hx235 fbshipit-source-id: b60eab94707103cb1322cf815f05810ef0232625
2 years ago
VersionSet* version_set,
std::shared_ptr<CacheReservationManager> file_metadata_cache_res_mgr)
Introduce a new storage specific Env API (#5761) Summary: The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc. This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO. The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before. This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection. The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761 Differential Revision: D18868376 Pulled By: anand1976 fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
5 years ago
: file_options_(file_options),
ioptions_(ioptions),
table_cache_(table_cache),
base_vstorage_(base_vstorage),
version_set_(version_set),
num_levels_(base_vstorage->num_levels()),
has_invalid_levels_(false),
Account memory of FileMetaData in global memory limit (#9924) Summary: **Context/Summary:** As revealed by heap profiling, allocation of `FileMetaData` for [newly created file added to a Version](https://github.com/facebook/rocksdb/pull/9924/files#diff-a6aa385940793f95a2c5b39cc670bd440c4547fa54fd44622f756382d5e47e43R774) can consume significant heap memory. This PR is to account that toward our global memory limit based on block cache capacity. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9924 Test Plan: - Previous `make check` verified there are only 2 places where the memory of the allocated `FileMetaData` can be released - New unit test `TEST_P(ChargeFileMetadataTestWithParam, Basic)` - db bench (CPU cost of `charge_file_metadata` in write and compact) - **write micros/op: -0.24%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 (remove this option for pre-PR) -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 | egrep 'fillseq'` - **compact micros/op -0.87%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 -numdistinct=1000 && ./db_bench -benchmarks=compact -db=$TEST_TMPDIR -use_existing_db=1 -charge_file_metadata=1 -disable_auto_compactions=1 | egrep 'compact'` table 1 - write #-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%) -- | -- | -- | -- | -- | -- 10 | 3.9711 | 0.264408 | 3.9914 | 0.254563 | 0.5111933721 20 | 3.83905 | 0.0664488 | 3.8251 | 0.0695456 | -0.3633711465 40 | 3.86625 | 0.136669 | 3.8867 | 0.143765 | 0.5289363078 80 | 3.87828 | 0.119007 | 3.86791 | 0.115674 | **-0.2673865734** 160 | 3.87677 | 0.162231 | 3.86739 | 0.16663 | **-0.2419539978** table 2 - compact #-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%) -- | -- | -- | -- | -- | -- 10 | 2,399,650.00 | 96,375.80 | 2,359,537.00 | 53,243.60 | -1.67 20 | 2,410,480.00 | 89,988.00 | 2,433,580.00 | 91,121.20 | 0.96 40 | 2.41E+06 | 121811 | 2.39E+06 | 131525 | **-0.96** 80 | 2.40E+06 | 134503 | 2.39E+06 | 108799 | **-0.78** - stress test: `python3 tools/db_crashtest.py blackbox --charge_file_metadata=1 --cache_size=1` killed as normal Reviewed By: ajkr Differential Revision: D36055583 Pulled By: hx235 fbshipit-source-id: b60eab94707103cb1322cf815f05810ef0232625
2 years ago
level_nonzero_cmp_(base_vstorage_->InternalComparator()),
file_metadata_cache_res_mgr_(file_metadata_cache_res_mgr) {
assert(ioptions_);
levels_ = new LevelState[num_levels_];
}
~Rep() {
for (int level = 0; level < num_levels_; level++) {
const auto& added = levels_[level].added_files;
for (auto& pair : added) {
UnrefFile(pair.second);
}
}
delete[] levels_;
}
void UnrefFile(FileMetaData* f) {
f->refs--;
if (f->refs <= 0) {
if (f->table_reader_handle) {
assert(table_cache_ != nullptr);
Major Cache refactoring, CPU efficiency improvement (#10975) Summary: This is several refactorings bundled into one to avoid having to incrementally re-modify uses of Cache several times. Overall, there are breaking changes to Cache class, and it becomes more of low-level interface for implementing caches, especially block cache. New internal APIs make using Cache cleaner than before, and more insulated from block cache evolution. Hopefully, this is the last really big block cache refactoring, because of rather effectively decoupling the implementations from the uses. This change also removes the EXPERIMENTAL designation on the SecondaryCache support in Cache. It seems reasonably mature at this point but still subject to change/evolution (as I warn in the API docs for Cache). The high-level motivation for this refactoring is to minimize code duplication / compounding complexity in adding SecondaryCache support to HyperClockCache (in a later PR). Other benefits listed below. * static_cast lines of code +29 -35 (net removed 6) * reinterpret_cast lines of code +6 -32 (net removed 26) ## cache.h and secondary_cache.h * Always use CacheItemHelper with entries instead of just a Deleter. There are several motivations / justifications: * Simpler for implementations to deal with just one Insert and one Lookup. * Simpler and more efficient implementation because we don't have to track which entries are using helpers and which are using deleters * Gets rid of hack to classify cache entries by their deleter. Instead, the CacheItemHelper includes a CacheEntryRole. This simplifies a lot of code (cache_entry_roles.h almost eliminated). Fixes https://github.com/facebook/rocksdb/issues/9428. * Makes it trivial to adjust SecondaryCache behavior based on kind of block (e.g. don't re-compress filter blocks). * It is arguably less convenient for many direct users of Cache, but direct users of Cache are now rare with introduction of typed_cache.h (below). * I considered and rejected an alternative approach in which we reduce customizability by assuming each secondary cache compatible value starts with a Slice referencing the uncompressed block contents (already true or mostly true), but we apparently intend to stack secondary caches. Saving an entry from a compressed secondary to a lower tier requires custom handling offered by SaveToCallback, etc. * Make CreateCallback part of the helper and introduce CreateContext to work with it (alternative to https://github.com/facebook/rocksdb/issues/10562). This cleans up the interface while still allowing context to be provided for loading/parsing values into primary cache. This model works for async lookup in BlockBasedTable reader (reader owns a CreateContext) under the assumption that it always waits on secondary cache operations to finish. (Otherwise, the CreateContext could be destroyed while async operation depending on it continues.) This likely contributes most to the observed performance improvement because it saves an std::function backed by a heap allocation. * Use char* for serialized data, e.g. in SaveToCallback, where void* was confusingly used. (We use `char*` for serialized byte data all over RocksDB, with many advantages over `void*`. `memcpy` etc. are legacy APIs that should not be mimicked.) * Add a type alias Cache::ObjectPtr = void*, so that we can better indicate the intent of the void* when it is to be the object associated with a Cache entry. Related: started (but did not complete) a refactoring to move away from "value" of a cache entry toward "object" or "obj". (It is confusing to call Cache a key-value store (like DB) when it is really storing arbitrary in-memory objects, not byte strings.) * Remove unnecessary key param from DeleterFn. This is good for efficiency in HyperClockCache, which does not directly store the cache key in memory. (Alternative to https://github.com/facebook/rocksdb/issues/10774) * Add allocator to Cache DeleterFn. This is a kind of future-proofing change in case we get more serious about using the Cache allocator for memory tracked by the Cache. Right now, only the uncompressed block contents are allocated using the allocator, and a pointer to that allocator is saved as part of the cached object so that the deleter can use it. (See CacheAllocationPtr.) If in the future we are able to "flatten out" our Cache objects some more, it would be good not to have to track the allocator as part of each object. * Removes legacy `ApplyToAllCacheEntries` and changes `ApplyToAllEntries` signature for Deleter->CacheItemHelper change. ## typed_cache.h Adds various "typed" interfaces to the Cache as internal APIs, so that most uses of Cache can use simple type safe code without casting and without explicit deleters, etc. Almost all of the non-test, non-glue code uses of Cache have been migrated. (Follow-up work: CompressedSecondaryCache deserves deeper attention to migrate.) This change expands RocksDB's internal usage of metaprogramming and SFINAE (https://en.cppreference.com/w/cpp/language/sfinae). The existing usages of Cache are divided up at a high level into these new interfaces. See updated existing uses of Cache for examples of how these are used. * PlaceholderCacheInterface - Used for making cache reservations, with entries that have a charge but no value. * BasicTypedCacheInterface<TValue> - Used for primary cache storage of objects of type TValue, which can be cleaned up with std::default_delete<TValue>. The role is provided by TValue::kCacheEntryRole or given in an optional template parameter. * FullTypedCacheInterface<TValue, TCreateContext> - Used for secondary cache compatible storage of objects of type TValue. In addition to BasicTypedCacheInterface constraints, we require TValue::ContentSlice() to return persistable data. This simplifies usage for the normal case of simple secondary cache compatibility (can give you a Slice to the data already in memory). In addition to TCreateContext performing the role of Cache::CreateContext, it is also expected to provide a factory function for creating TValue. * For each of these, there's a "Shared" version (e.g. FullTypedSharedCacheInterface) that holds a shared_ptr to the Cache, rather than assuming external ownership by holding only a raw `Cache*`. These interfaces introduce specific handle types for each interface instantiation, so that it's easy to see what kind of object is controlled by a handle. (Ultimately, this might not be worth the extra complexity, but it seems OK so far.) Note: I attempted to make the cache 'charge' automatically inferred from the cache object type, such as by expecting an ApproximateMemoryUsage() function, but this is not so clean because there are cases where we need to compute the charge ahead of time and don't want to re-compute it. ## block_cache.h This header is essentially the replacement for the old block_like_traits.h. It includes various things to support block cache access with typed_cache.h for block-based table. ## block_based_table_reader.cc Before this change, accessing the block cache here was an awkward mix of static polymorphism (template TBlocklike) and switch-case on a dynamic BlockType value. This change mostly unifies on static polymorphism, relying on minor hacks in block_cache.h to distinguish variants of Block. We still check BlockType in some places (especially for stats, which could be improved in follow-up work) but at least the BlockType is a static constant from the template parameter. (No more awkward partial redundancy between static and dynamic info.) This likely contributes to the overall performance improvement, but hasn't been tested in isolation. The other key source of simplification here is a more unified system of creating block cache objects: for directly populating from primary cache and for promotion from secondary cache. Both use BlockCreateContext, for context and for factory functions. ## block_based_table_builder.cc, cache_dump_load_impl.cc Before this change, warming caches was super ugly code. Both of these source files had switch statements to basically transition from the dynamic BlockType world to the static TBlocklike world. None of that mess is needed anymore as there's a new, untyped WarmInCache function that handles all the details just as promotion from SecondaryCache would. (Fixes `TODO akanksha: Dedup below code` in block_based_table_builder.cc.) ## Everything else Mostly just updating Cache users to use new typed APIs when reasonably possible, or changed Cache APIs when not. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10975 Test Plan: tests updated Performance test setup similar to https://github.com/facebook/rocksdb/issues/10626 (by cache size, LRUCache when not "hyper" for HyperClockCache): 34MB 1thread base.hyper -> kops/s: 0.745 io_bytes/op: 2.52504e+06 miss_ratio: 0.140906 max_rss_mb: 76.4844 34MB 1thread new.hyper -> kops/s: 0.751 io_bytes/op: 2.5123e+06 miss_ratio: 0.140161 max_rss_mb: 79.3594 34MB 1thread base -> kops/s: 0.254 io_bytes/op: 1.36073e+07 miss_ratio: 0.918818 max_rss_mb: 45.9297 34MB 1thread new -> kops/s: 0.252 io_bytes/op: 1.36157e+07 miss_ratio: 0.918999 max_rss_mb: 44.1523 34MB 32thread base.hyper -> kops/s: 7.272 io_bytes/op: 2.88323e+06 miss_ratio: 0.162532 max_rss_mb: 516.602 34MB 32thread new.hyper -> kops/s: 7.214 io_bytes/op: 2.99046e+06 miss_ratio: 0.168818 max_rss_mb: 518.293 34MB 32thread base -> kops/s: 3.528 io_bytes/op: 1.35722e+07 miss_ratio: 0.914691 max_rss_mb: 264.926 34MB 32thread new -> kops/s: 3.604 io_bytes/op: 1.35744e+07 miss_ratio: 0.915054 max_rss_mb: 264.488 233MB 1thread base.hyper -> kops/s: 53.909 io_bytes/op: 2552.35 miss_ratio: 0.0440566 max_rss_mb: 241.984 233MB 1thread new.hyper -> kops/s: 62.792 io_bytes/op: 2549.79 miss_ratio: 0.044043 max_rss_mb: 241.922 233MB 1thread base -> kops/s: 1.197 io_bytes/op: 2.75173e+06 miss_ratio: 0.103093 max_rss_mb: 241.559 233MB 1thread new -> kops/s: 1.199 io_bytes/op: 2.73723e+06 miss_ratio: 0.10305 max_rss_mb: 240.93 233MB 32thread base.hyper -> kops/s: 1298.69 io_bytes/op: 2539.12 miss_ratio: 0.0440307 max_rss_mb: 371.418 233MB 32thread new.hyper -> kops/s: 1421.35 io_bytes/op: 2538.75 miss_ratio: 0.0440307 max_rss_mb: 347.273 233MB 32thread base -> kops/s: 9.693 io_bytes/op: 2.77304e+06 miss_ratio: 0.103745 max_rss_mb: 569.691 233MB 32thread new -> kops/s: 9.75 io_bytes/op: 2.77559e+06 miss_ratio: 0.103798 max_rss_mb: 552.82 1597MB 1thread base.hyper -> kops/s: 58.607 io_bytes/op: 1449.14 miss_ratio: 0.0249324 max_rss_mb: 1583.55 1597MB 1thread new.hyper -> kops/s: 69.6 io_bytes/op: 1434.89 miss_ratio: 0.0247167 max_rss_mb: 1584.02 1597MB 1thread base -> kops/s: 60.478 io_bytes/op: 1421.28 miss_ratio: 0.024452 max_rss_mb: 1589.45 1597MB 1thread new -> kops/s: 63.973 io_bytes/op: 1416.07 miss_ratio: 0.0243766 max_rss_mb: 1589.24 1597MB 32thread base.hyper -> kops/s: 1436.2 io_bytes/op: 1357.93 miss_ratio: 0.0235353 max_rss_mb: 1692.92 1597MB 32thread new.hyper -> kops/s: 1605.03 io_bytes/op: 1358.04 miss_ratio: 0.023538 max_rss_mb: 1702.78 1597MB 32thread base -> kops/s: 280.059 io_bytes/op: 1350.34 miss_ratio: 0.023289 max_rss_mb: 1675.36 1597MB 32thread new -> kops/s: 283.125 io_bytes/op: 1351.05 miss_ratio: 0.0232797 max_rss_mb: 1703.83 Almost uniformly improving over base revision, especially for hot paths with HyperClockCache, up to 12% higher throughput seen (1597MB, 32thread, hyper). The improvement for that is likely coming from much simplified code for providing context for secondary cache promotion (CreateCallback/CreateContext), and possibly from less branching in block_based_table_reader. And likely a small improvement from not reconstituting key for DeleterFn. Reviewed By: anand1976 Differential Revision: D42417818 Pulled By: pdillinger fbshipit-source-id: f86bfdd584dce27c028b151ba56818ad14f7a432
2 years ago
// NOTE: have to release in raw cache interface to avoid using a
// TypedHandle for FileMetaData::table_reader_handle
table_cache_->get_cache().get()->Release(f->table_reader_handle);
f->table_reader_handle = nullptr;
}
Account memory of FileMetaData in global memory limit (#9924) Summary: **Context/Summary:** As revealed by heap profiling, allocation of `FileMetaData` for [newly created file added to a Version](https://github.com/facebook/rocksdb/pull/9924/files#diff-a6aa385940793f95a2c5b39cc670bd440c4547fa54fd44622f756382d5e47e43R774) can consume significant heap memory. This PR is to account that toward our global memory limit based on block cache capacity. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9924 Test Plan: - Previous `make check` verified there are only 2 places where the memory of the allocated `FileMetaData` can be released - New unit test `TEST_P(ChargeFileMetadataTestWithParam, Basic)` - db bench (CPU cost of `charge_file_metadata` in write and compact) - **write micros/op: -0.24%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 (remove this option for pre-PR) -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 | egrep 'fillseq'` - **compact micros/op -0.87%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 -numdistinct=1000 && ./db_bench -benchmarks=compact -db=$TEST_TMPDIR -use_existing_db=1 -charge_file_metadata=1 -disable_auto_compactions=1 | egrep 'compact'` table 1 - write #-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%) -- | -- | -- | -- | -- | -- 10 | 3.9711 | 0.264408 | 3.9914 | 0.254563 | 0.5111933721 20 | 3.83905 | 0.0664488 | 3.8251 | 0.0695456 | -0.3633711465 40 | 3.86625 | 0.136669 | 3.8867 | 0.143765 | 0.5289363078 80 | 3.87828 | 0.119007 | 3.86791 | 0.115674 | **-0.2673865734** 160 | 3.87677 | 0.162231 | 3.86739 | 0.16663 | **-0.2419539978** table 2 - compact #-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%) -- | -- | -- | -- | -- | -- 10 | 2,399,650.00 | 96,375.80 | 2,359,537.00 | 53,243.60 | -1.67 20 | 2,410,480.00 | 89,988.00 | 2,433,580.00 | 91,121.20 | 0.96 40 | 2.41E+06 | 121811 | 2.39E+06 | 131525 | **-0.96** 80 | 2.40E+06 | 134503 | 2.39E+06 | 108799 | **-0.78** - stress test: `python3 tools/db_crashtest.py blackbox --charge_file_metadata=1 --cache_size=1` killed as normal Reviewed By: ajkr Differential Revision: D36055583 Pulled By: hx235 fbshipit-source-id: b60eab94707103cb1322cf815f05810ef0232625
2 years ago
if (file_metadata_cache_res_mgr_) {
Status s = file_metadata_cache_res_mgr_->UpdateCacheReservation(
f->ApproximateMemoryUsage(), false /* increase */);
s.PermitUncheckedError();
}
delete f;
}
}
// Mapping used for checking the consistency of links between SST files and
// blob files. It is built using the forward links (table file -> blob file),
// and is subsequently compared with the inverse mapping stored in the
// BlobFileMetaData objects.
using ExpectedLinkedSsts =
std::unordered_map<uint64_t, BlobFileMetaData::LinkedSsts>;
Add blob files to VersionStorageInfo/VersionBuilder (#6597) Summary: The patch adds a couple of classes to represent metadata about blob files: `SharedBlobFileMetaData` contains the information elements that are immutable (once the blob file is closed), e.g. blob file number, total number and size of blob files, checksum method/value, while `BlobFileMetaData` contains attributes that can vary across versions like the amount of garbage in the file. There is a single `SharedBlobFileMetaData` for each blob file, which is jointly owned by the `BlobFileMetaData` objects that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s and can also be shared if the (immutable _and_ mutable) state of the blob file is the same in two versions. In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends `VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata to a new `VersionStorageInfo`. Consistency checks are also extended to ensure that table files point to blob files that are part of the `Version`, and that all blob files that are part of any given `Version` have at least some _non_-garbage data in them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20656803 Pulled By: ltamasi fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
5 years ago
static void UpdateExpectedLinkedSsts(
uint64_t table_file_number, uint64_t blob_file_number,
ExpectedLinkedSsts* expected_linked_ssts) {
assert(expected_linked_ssts);
Add blob files to VersionStorageInfo/VersionBuilder (#6597) Summary: The patch adds a couple of classes to represent metadata about blob files: `SharedBlobFileMetaData` contains the information elements that are immutable (once the blob file is closed), e.g. blob file number, total number and size of blob files, checksum method/value, while `BlobFileMetaData` contains attributes that can vary across versions like the amount of garbage in the file. There is a single `SharedBlobFileMetaData` for each blob file, which is jointly owned by the `BlobFileMetaData` objects that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s and can also be shared if the (immutable _and_ mutable) state of the blob file is the same in two versions. In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends `VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata to a new `VersionStorageInfo`. Consistency checks are also extended to ensure that table files point to blob files that are part of the `Version`, and that all blob files that are part of any given `Version` have at least some _non_-garbage data in them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20656803 Pulled By: ltamasi fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
5 years ago
if (blob_file_number == kInvalidBlobFileNumber) {
return;
Add blob files to VersionStorageInfo/VersionBuilder (#6597) Summary: The patch adds a couple of classes to represent metadata about blob files: `SharedBlobFileMetaData` contains the information elements that are immutable (once the blob file is closed), e.g. blob file number, total number and size of blob files, checksum method/value, while `BlobFileMetaData` contains attributes that can vary across versions like the amount of garbage in the file. There is a single `SharedBlobFileMetaData` for each blob file, which is jointly owned by the `BlobFileMetaData` objects that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s and can also be shared if the (immutable _and_ mutable) state of the blob file is the same in two versions. In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends `VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata to a new `VersionStorageInfo`. Consistency checks are also extended to ensure that table files point to blob files that are part of the `Version`, and that all blob files that are part of any given `Version` have at least some _non_-garbage data in them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20656803 Pulled By: ltamasi fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
5 years ago
}
(*expected_linked_ssts)[blob_file_number].emplace(table_file_number);
Add blob files to VersionStorageInfo/VersionBuilder (#6597) Summary: The patch adds a couple of classes to represent metadata about blob files: `SharedBlobFileMetaData` contains the information elements that are immutable (once the blob file is closed), e.g. blob file number, total number and size of blob files, checksum method/value, while `BlobFileMetaData` contains attributes that can vary across versions like the amount of garbage in the file. There is a single `SharedBlobFileMetaData` for each blob file, which is jointly owned by the `BlobFileMetaData` objects that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s and can also be shared if the (immutable _and_ mutable) state of the blob file is the same in two versions. In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends `VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata to a new `VersionStorageInfo`. Consistency checks are also extended to ensure that table files point to blob files that are part of the `Version`, and that all blob files that are part of any given `Version` have at least some _non_-garbage data in them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20656803 Pulled By: ltamasi fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
5 years ago
}
template <typename Checker>
Status CheckConsistencyDetailsForLevel(
const VersionStorageInfo* vstorage, int level, Checker checker,
const std::string& sync_point,
ExpectedLinkedSsts* expected_linked_ssts) const {
#ifdef NDEBUG
(void)sync_point;
#endif
assert(vstorage);
assert(level >= 0 && level < num_levels_);
assert(expected_linked_ssts);
Add blob files to VersionStorageInfo/VersionBuilder (#6597) Summary: The patch adds a couple of classes to represent metadata about blob files: `SharedBlobFileMetaData` contains the information elements that are immutable (once the blob file is closed), e.g. blob file number, total number and size of blob files, checksum method/value, while `BlobFileMetaData` contains attributes that can vary across versions like the amount of garbage in the file. There is a single `SharedBlobFileMetaData` for each blob file, which is jointly owned by the `BlobFileMetaData` objects that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s and can also be shared if the (immutable _and_ mutable) state of the blob file is the same in two versions. In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends `VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata to a new `VersionStorageInfo`. Consistency checks are also extended to ensure that table files point to blob files that are part of the `Version`, and that all blob files that are part of any given `Version` have at least some _non_-garbage data in them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20656803 Pulled By: ltamasi fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
5 years ago
const auto& level_files = vstorage->LevelFiles(level);
if (level_files.empty()) {
return Status::OK();
}
assert(level_files[0]);
UpdateExpectedLinkedSsts(level_files[0]->fd.GetNumber(),
level_files[0]->oldest_blob_file_number,
expected_linked_ssts);
for (size_t i = 1; i < level_files.size(); ++i) {
assert(level_files[i]);
UpdateExpectedLinkedSsts(level_files[i]->fd.GetNumber(),
level_files[i]->oldest_blob_file_number,
expected_linked_ssts);
auto lhs = level_files[i - 1];
auto rhs = level_files[i];
Add blob files to VersionStorageInfo/VersionBuilder (#6597) Summary: The patch adds a couple of classes to represent metadata about blob files: `SharedBlobFileMetaData` contains the information elements that are immutable (once the blob file is closed), e.g. blob file number, total number and size of blob files, checksum method/value, while `BlobFileMetaData` contains attributes that can vary across versions like the amount of garbage in the file. There is a single `SharedBlobFileMetaData` for each blob file, which is jointly owned by the `BlobFileMetaData` objects that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s and can also be shared if the (immutable _and_ mutable) state of the blob file is the same in two versions. In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends `VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata to a new `VersionStorageInfo`. Consistency checks are also extended to ensure that table files point to blob files that are part of the `Version`, and that all blob files that are part of any given `Version` have at least some _non_-garbage data in them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20656803 Pulled By: ltamasi fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
5 years ago
#ifndef NDEBUG
auto pair = std::make_pair(&lhs, &rhs);
TEST_SYNC_POINT_CALLBACK(sync_point, &pair);
#endif
const Status s = checker(lhs, rhs);
if (!s.ok()) {
return s;
}
}
return Status::OK();
}
// Make sure table files are sorted correctly and that the links between
// table files and blob files are consistent.
Status CheckConsistencyDetails(const VersionStorageInfo* vstorage) const {
assert(vstorage);
ExpectedLinkedSsts expected_linked_ssts;
if (num_levels_ > 0) {
Sort L0 files by newly introduced epoch_num (#10922) Summary: **Context:** Sorting L0 files by `largest_seqno` has at least two inconvenience: - File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap. - For example, consider the following sequence of events ("key@n" indicates key at seqno "n") - insert k1@1 to memtable m1 - ingest file s1 with k2@2, ingest file s2 with k3@3 - insert k4@4 to m1 - compact files s1, s2 and result in new file s3 of seqno range [2, 3] - flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1 - However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption. - Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption - For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr for this example) - an existing SST s1 contains only k1@1 - insert k1@2 to memtable m1 - ingest file s2 with k3@3, ingest file s3 with k4@4 - insert single delete k5@5 in m1 - flush m1 and result in new file s4 of seqno range [2, 5] - compact s1, s2, s3 and result in new file s5 of seqno range [1, 4] - compact s4 and result in new file s6 of seqno range [2] due to single delete - By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno` Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways: - In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number` ordering check. This will result in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more. - In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption. **Summary:** - Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`. - `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`) - Compaction output file is assigned with the minimum `epoch_number` among input files' - Refit level: reuse refitted file's epoch_number - Other paths needing `epoch_number` treatment: - Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo` - Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`. - Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair). - Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder. - Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery - Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more - Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag` - Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above - Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`. - Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR. - Misc: - update existing tests with `epoch_number` so make check will pass - update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases - assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber() Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922 Test Plan: - `make check` - New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc` - Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930 - [Ongoing] Compatibility test: manually run https://github.com/ajkr/rocksdb/commit/36a5686ec012f35a4371e409aa85c404ca1c210d (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox` - [Ongoing] normal db stress test - [Ongoing] db stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761 Reviewed By: ajkr Differential Revision: D41063187 Pulled By: hx235 fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9
2 years ago
const InternalKeyComparator* const icmp = vstorage->InternalComparator();
EpochNumberRequirement epoch_number_requirement =
vstorage->GetEpochNumberRequirement();
assert(icmp);
// Check L0
{
Sort L0 files by newly introduced epoch_num (#10922) Summary: **Context:** Sorting L0 files by `largest_seqno` has at least two inconvenience: - File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap. - For example, consider the following sequence of events ("key@n" indicates key at seqno "n") - insert k1@1 to memtable m1 - ingest file s1 with k2@2, ingest file s2 with k3@3 - insert k4@4 to m1 - compact files s1, s2 and result in new file s3 of seqno range [2, 3] - flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1 - However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption. - Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption - For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr for this example) - an existing SST s1 contains only k1@1 - insert k1@2 to memtable m1 - ingest file s2 with k3@3, ingest file s3 with k4@4 - insert single delete k5@5 in m1 - flush m1 and result in new file s4 of seqno range [2, 5] - compact s1, s2, s3 and result in new file s5 of seqno range [1, 4] - compact s4 and result in new file s6 of seqno range [2] due to single delete - By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno` Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways: - In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number` ordering check. This will result in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more. - In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption. **Summary:** - Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`. - `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`) - Compaction output file is assigned with the minimum `epoch_number` among input files' - Refit level: reuse refitted file's epoch_number - Other paths needing `epoch_number` treatment: - Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo` - Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`. - Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair). - Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder. - Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery - Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more - Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag` - Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above - Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`. - Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR. - Misc: - update existing tests with `epoch_number` so make check will pass - update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases - assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber() Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922 Test Plan: - `make check` - New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc` - Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930 - [Ongoing] Compatibility test: manually run https://github.com/ajkr/rocksdb/commit/36a5686ec012f35a4371e409aa85c404ca1c210d (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox` - [Ongoing] normal db stress test - [Ongoing] db stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761 Reviewed By: ajkr Differential Revision: D41063187 Pulled By: hx235 fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9
2 years ago
auto l0_checker = [this, epoch_number_requirement, icmp](
const FileMetaData* lhs,
const FileMetaData* rhs) {
assert(lhs);
assert(rhs);
Sort L0 files by newly introduced epoch_num (#10922) Summary: **Context:** Sorting L0 files by `largest_seqno` has at least two inconvenience: - File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap. - For example, consider the following sequence of events ("key@n" indicates key at seqno "n") - insert k1@1 to memtable m1 - ingest file s1 with k2@2, ingest file s2 with k3@3 - insert k4@4 to m1 - compact files s1, s2 and result in new file s3 of seqno range [2, 3] - flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1 - However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption. - Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption - For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr for this example) - an existing SST s1 contains only k1@1 - insert k1@2 to memtable m1 - ingest file s2 with k3@3, ingest file s3 with k4@4 - insert single delete k5@5 in m1 - flush m1 and result in new file s4 of seqno range [2, 5] - compact s1, s2, s3 and result in new file s5 of seqno range [1, 4] - compact s4 and result in new file s6 of seqno range [2] due to single delete - By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno` Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways: - In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number` ordering check. This will result in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more. - In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption. **Summary:** - Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`. - `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`) - Compaction output file is assigned with the minimum `epoch_number` among input files' - Refit level: reuse refitted file's epoch_number - Other paths needing `epoch_number` treatment: - Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo` - Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`. - Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair). - Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder. - Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery - Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more - Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag` - Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above - Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`. - Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR. - Misc: - update existing tests with `epoch_number` so make check will pass - update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases - assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber() Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922 Test Plan: - `make check` - New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc` - Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930 - [Ongoing] Compatibility test: manually run https://github.com/ajkr/rocksdb/commit/36a5686ec012f35a4371e409aa85c404ca1c210d (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox` - [Ongoing] normal db stress test - [Ongoing] db stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761 Reviewed By: ajkr Differential Revision: D41063187 Pulled By: hx235 fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9
2 years ago
if (epoch_number_requirement ==
EpochNumberRequirement::kMightMissing) {
if (!level_zero_cmp_by_seqno_(lhs, rhs)) {
std::ostringstream oss;
Sort L0 files by newly introduced epoch_num (#10922) Summary: **Context:** Sorting L0 files by `largest_seqno` has at least two inconvenience: - File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap. - For example, consider the following sequence of events ("key@n" indicates key at seqno "n") - insert k1@1 to memtable m1 - ingest file s1 with k2@2, ingest file s2 with k3@3 - insert k4@4 to m1 - compact files s1, s2 and result in new file s3 of seqno range [2, 3] - flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1 - However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption. - Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption - For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr for this example) - an existing SST s1 contains only k1@1 - insert k1@2 to memtable m1 - ingest file s2 with k3@3, ingest file s3 with k4@4 - insert single delete k5@5 in m1 - flush m1 and result in new file s4 of seqno range [2, 5] - compact s1, s2, s3 and result in new file s5 of seqno range [1, 4] - compact s4 and result in new file s6 of seqno range [2] due to single delete - By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno` Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways: - In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number` ordering check. This will result in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more. - In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption. **Summary:** - Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`. - `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`) - Compaction output file is assigned with the minimum `epoch_number` among input files' - Refit level: reuse refitted file's epoch_number - Other paths needing `epoch_number` treatment: - Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo` - Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`. - Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair). - Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder. - Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery - Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more - Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag` - Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above - Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`. - Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR. - Misc: - update existing tests with `epoch_number` so make check will pass - update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases - assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber() Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922 Test Plan: - `make check` - New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc` - Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930 - [Ongoing] Compatibility test: manually run https://github.com/ajkr/rocksdb/commit/36a5686ec012f35a4371e409aa85c404ca1c210d (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox` - [Ongoing] normal db stress test - [Ongoing] db stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761 Reviewed By: ajkr Differential Revision: D41063187 Pulled By: hx235 fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9
2 years ago
oss << "L0 files are not sorted properly: files #"
<< lhs->fd.GetNumber() << " with seqnos (largest, smallest) "
<< lhs->fd.largest_seqno << " , " << lhs->fd.smallest_seqno
<< ", #" << rhs->fd.GetNumber()
<< " with seqnos (largest, smallest) "
<< rhs->fd.largest_seqno << " , " << rhs->fd.smallest_seqno;
return Status::Corruption("VersionBuilder", oss.str());
}
Sort L0 files by newly introduced epoch_num (#10922) Summary: **Context:** Sorting L0 files by `largest_seqno` has at least two inconvenience: - File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap. - For example, consider the following sequence of events ("key@n" indicates key at seqno "n") - insert k1@1 to memtable m1 - ingest file s1 with k2@2, ingest file s2 with k3@3 - insert k4@4 to m1 - compact files s1, s2 and result in new file s3 of seqno range [2, 3] - flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1 - However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption. - Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption - For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr for this example) - an existing SST s1 contains only k1@1 - insert k1@2 to memtable m1 - ingest file s2 with k3@3, ingest file s3 with k4@4 - insert single delete k5@5 in m1 - flush m1 and result in new file s4 of seqno range [2, 5] - compact s1, s2, s3 and result in new file s5 of seqno range [1, 4] - compact s4 and result in new file s6 of seqno range [2] due to single delete - By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno` Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways: - In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number` ordering check. This will result in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more. - In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption. **Summary:** - Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`. - `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`) - Compaction output file is assigned with the minimum `epoch_number` among input files' - Refit level: reuse refitted file's epoch_number - Other paths needing `epoch_number` treatment: - Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo` - Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`. - Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair). - Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder. - Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery - Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more - Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag` - Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above - Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`. - Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR. - Misc: - update existing tests with `epoch_number` so make check will pass - update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases - assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber() Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922 Test Plan: - `make check` - New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc` - Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930 - [Ongoing] Compatibility test: manually run https://github.com/ajkr/rocksdb/commit/36a5686ec012f35a4371e409aa85c404ca1c210d (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox` - [Ongoing] normal db stress test - [Ongoing] db stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761 Reviewed By: ajkr Differential Revision: D41063187 Pulled By: hx235 fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9
2 years ago
} else if (epoch_number_requirement ==
EpochNumberRequirement::kMustPresent) {
if (lhs->epoch_number == rhs->epoch_number) {
bool range_overlapped =
icmp->Compare(lhs->smallest, rhs->largest) <= 0 &&
icmp->Compare(lhs->largest, rhs->smallest) >= 0;
if (range_overlapped) {
std::ostringstream oss;
oss << "L0 files of same epoch number but overlapping range #"
<< lhs->fd.GetNumber()
<< " , smallest key: " << lhs->smallest.DebugString(true)
<< " , largest key: " << lhs->largest.DebugString(true)
<< " , epoch number: " << lhs->epoch_number << " vs. file #"
<< rhs->fd.GetNumber()
<< " , smallest key: " << rhs->smallest.DebugString(true)
<< " , largest key: " << rhs->largest.DebugString(true)
<< " , epoch number: " << rhs->epoch_number;
return Status::Corruption("VersionBuilder", oss.str());
}
}
Sort L0 files by newly introduced epoch_num (#10922) Summary: **Context:** Sorting L0 files by `largest_seqno` has at least two inconvenience: - File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap. - For example, consider the following sequence of events ("key@n" indicates key at seqno "n") - insert k1@1 to memtable m1 - ingest file s1 with k2@2, ingest file s2 with k3@3 - insert k4@4 to m1 - compact files s1, s2 and result in new file s3 of seqno range [2, 3] - flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1 - However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption. - Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption - For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr for this example) - an existing SST s1 contains only k1@1 - insert k1@2 to memtable m1 - ingest file s2 with k3@3, ingest file s3 with k4@4 - insert single delete k5@5 in m1 - flush m1 and result in new file s4 of seqno range [2, 5] - compact s1, s2, s3 and result in new file s5 of seqno range [1, 4] - compact s4 and result in new file s6 of seqno range [2] due to single delete - By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno` Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways: - In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number` ordering check. This will result in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more. - In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption. **Summary:** - Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`. - `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`) - Compaction output file is assigned with the minimum `epoch_number` among input files' - Refit level: reuse refitted file's epoch_number - Other paths needing `epoch_number` treatment: - Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo` - Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`. - Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair). - Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder. - Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery - Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more - Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag` - Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above - Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`. - Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR. - Misc: - update existing tests with `epoch_number` so make check will pass - update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases - assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber() Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922 Test Plan: - `make check` - New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc` - Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930 - [Ongoing] Compatibility test: manually run https://github.com/ajkr/rocksdb/commit/36a5686ec012f35a4371e409aa85c404ca1c210d (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox` - [Ongoing] normal db stress test - [Ongoing] db stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761 Reviewed By: ajkr Differential Revision: D41063187 Pulled By: hx235 fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9
2 years ago
if (!level_zero_cmp_by_epochno_(lhs, rhs)) {
std::ostringstream oss;
oss << "L0 files are not sorted properly: files #"
<< lhs->fd.GetNumber() << " with epoch number "
<< lhs->epoch_number << ", #" << rhs->fd.GetNumber()
<< " with epoch number " << rhs->epoch_number;
return Status::Corruption("VersionBuilder", oss.str());
}
}
return Status::OK();
};
const Status s = CheckConsistencyDetailsForLevel(
vstorage, /* level */ 0, l0_checker,
"VersionBuilder::CheckConsistency0", &expected_linked_ssts);
if (!s.ok()) {
return s;
}
}
// Check L1 and up
for (int level = 1; level < num_levels_; ++level) {
auto checker = [this, level, icmp](const FileMetaData* lhs,
const FileMetaData* rhs) {
assert(lhs);
assert(rhs);
if (!level_nonzero_cmp_(lhs, rhs)) {
std::ostringstream oss;
oss << 'L' << level << " files are not sorted properly: files #"
<< lhs->fd.GetNumber() << ", #" << rhs->fd.GetNumber();
return Status::Corruption("VersionBuilder", oss.str());
}
// Make sure there is no overlap in level
if (icmp->Compare(lhs->largest, rhs->smallest) >= 0) {
std::ostringstream oss;
oss << 'L' << level << " has overlapping ranges: file #"
<< lhs->fd.GetNumber()
<< " largest key: " << lhs->largest.DebugString(true)
<< " vs. file #" << rhs->fd.GetNumber()
<< " smallest key: " << rhs->smallest.DebugString(true);
return Status::Corruption("VersionBuilder", oss.str());
}
return Status::OK();
};
const Status s = CheckConsistencyDetailsForLevel(
vstorage, level, checker, "VersionBuilder::CheckConsistency1",
&expected_linked_ssts);
if (!s.ok()) {
return s;
}
}
}
Add blob files to VersionStorageInfo/VersionBuilder (#6597) Summary: The patch adds a couple of classes to represent metadata about blob files: `SharedBlobFileMetaData` contains the information elements that are immutable (once the blob file is closed), e.g. blob file number, total number and size of blob files, checksum method/value, while `BlobFileMetaData` contains attributes that can vary across versions like the amount of garbage in the file. There is a single `SharedBlobFileMetaData` for each blob file, which is jointly owned by the `BlobFileMetaData` objects that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s and can also be shared if the (immutable _and_ mutable) state of the blob file is the same in two versions. In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends `VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata to a new `VersionStorageInfo`. Consistency checks are also extended to ensure that table files point to blob files that are part of the `Version`, and that all blob files that are part of any given `Version` have at least some _non_-garbage data in them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20656803 Pulled By: ltamasi fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
5 years ago
// Make sure that all blob files in the version have non-garbage data and
// the links between them and the table files are consistent.
Add blob files to VersionStorageInfo/VersionBuilder (#6597) Summary: The patch adds a couple of classes to represent metadata about blob files: `SharedBlobFileMetaData` contains the information elements that are immutable (once the blob file is closed), e.g. blob file number, total number and size of blob files, checksum method/value, while `BlobFileMetaData` contains attributes that can vary across versions like the amount of garbage in the file. There is a single `SharedBlobFileMetaData` for each blob file, which is jointly owned by the `BlobFileMetaData` objects that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s and can also be shared if the (immutable _and_ mutable) state of the blob file is the same in two versions. In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends `VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata to a new `VersionStorageInfo`. Consistency checks are also extended to ensure that table files point to blob files that are part of the `Version`, and that all blob files that are part of any given `Version` have at least some _non_-garbage data in them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20656803 Pulled By: ltamasi fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
5 years ago
const auto& blob_files = vstorage->GetBlobFiles();
Use a sorted vector instead of a map to store blob file metadata (#9526) Summary: The patch replaces `std::map` with a sorted `std::vector` for `VersionStorageInfo::blob_files_` and preallocates the space for the `vector` before saving the `BlobFileMetaData` into the new `VersionStorageInfo` in `VersionBuilder::Rep::SaveBlobFilesTo`. These changes reduce the time the DB mutex is held while saving new `Version`s, and using a sorted `vector` also makes lookups faster thanks to better memory locality. In addition, the patch introduces helper methods `VersionStorageInfo::GetBlobFileMetaData` and `VersionStorageInfo::GetBlobFileMetaDataLB` that can be used by clients to perform lookups in the `vector`, and does some general cleanup in the parts of code where blob file metadata are used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9526 Test Plan: Ran `make check` and the crash test script for a while. Performance was tested using a load-optimized benchmark (`fillseq` with vector memtable, no WAL) and small file sizes so that a significant number of files are produced: ``` numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/ltamasi-dbbench --wal_dir=/data/ltamasi-dbbench --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --enable_blob_files=1 --blob_file_size=16777216 --min_blob_size=0 --blob_compression_type=lz4 --enable_blob_garbage_collection=1 --seed=<some value> ``` Final statistics before the patch: ``` Cumulative writes: 0 writes, 700M keys, 0 commit groups, 0.0 writes per commit group, ingest: 284.62 GB, 121.27 MB/s Interval writes: 0 writes, 334K keys, 0 commit groups, 0.0 writes per commit group, ingest: 139.28 MB, 72.46 MB/s ``` With the patch: ``` Cumulative writes: 0 writes, 760M keys, 0 commit groups, 0.0 writes per commit group, ingest: 308.66 GB, 131.52 MB/s Interval writes: 0 writes, 445K keys, 0 commit groups, 0.0 writes per commit group, ingest: 185.35 MB, 93.15 MB/s ``` Total time to complete the benchmark is 2611 seconds with the patch, down from 2986 secs. Reviewed By: riversand963 Differential Revision: D34082728 Pulled By: ltamasi fbshipit-source-id: fc598abf676dce436734d06bb9d2d99a26a004fc
3 years ago
for (const auto& blob_file_meta : blob_files) {
Add blob files to VersionStorageInfo/VersionBuilder (#6597) Summary: The patch adds a couple of classes to represent metadata about blob files: `SharedBlobFileMetaData` contains the information elements that are immutable (once the blob file is closed), e.g. blob file number, total number and size of blob files, checksum method/value, while `BlobFileMetaData` contains attributes that can vary across versions like the amount of garbage in the file. There is a single `SharedBlobFileMetaData` for each blob file, which is jointly owned by the `BlobFileMetaData` objects that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s and can also be shared if the (immutable _and_ mutable) state of the blob file is the same in two versions. In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends `VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata to a new `VersionStorageInfo`. Consistency checks are also extended to ensure that table files point to blob files that are part of the `Version`, and that all blob files that are part of any given `Version` have at least some _non_-garbage data in them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20656803 Pulled By: ltamasi fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
5 years ago
assert(blob_file_meta);
Use a sorted vector instead of a map to store blob file metadata (#9526) Summary: The patch replaces `std::map` with a sorted `std::vector` for `VersionStorageInfo::blob_files_` and preallocates the space for the `vector` before saving the `BlobFileMetaData` into the new `VersionStorageInfo` in `VersionBuilder::Rep::SaveBlobFilesTo`. These changes reduce the time the DB mutex is held while saving new `Version`s, and using a sorted `vector` also makes lookups faster thanks to better memory locality. In addition, the patch introduces helper methods `VersionStorageInfo::GetBlobFileMetaData` and `VersionStorageInfo::GetBlobFileMetaDataLB` that can be used by clients to perform lookups in the `vector`, and does some general cleanup in the parts of code where blob file metadata are used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9526 Test Plan: Ran `make check` and the crash test script for a while. Performance was tested using a load-optimized benchmark (`fillseq` with vector memtable, no WAL) and small file sizes so that a significant number of files are produced: ``` numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/ltamasi-dbbench --wal_dir=/data/ltamasi-dbbench --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --enable_blob_files=1 --blob_file_size=16777216 --min_blob_size=0 --blob_compression_type=lz4 --enable_blob_garbage_collection=1 --seed=<some value> ``` Final statistics before the patch: ``` Cumulative writes: 0 writes, 700M keys, 0 commit groups, 0.0 writes per commit group, ingest: 284.62 GB, 121.27 MB/s Interval writes: 0 writes, 334K keys, 0 commit groups, 0.0 writes per commit group, ingest: 139.28 MB, 72.46 MB/s ``` With the patch: ``` Cumulative writes: 0 writes, 760M keys, 0 commit groups, 0.0 writes per commit group, ingest: 308.66 GB, 131.52 MB/s Interval writes: 0 writes, 445K keys, 0 commit groups, 0.0 writes per commit group, ingest: 185.35 MB, 93.15 MB/s ``` Total time to complete the benchmark is 2611 seconds with the patch, down from 2986 secs. Reviewed By: riversand963 Differential Revision: D34082728 Pulled By: ltamasi fbshipit-source-id: fc598abf676dce436734d06bb9d2d99a26a004fc
3 years ago
const uint64_t blob_file_number = blob_file_meta->GetBlobFileNumber();
Add blob files to VersionStorageInfo/VersionBuilder (#6597) Summary: The patch adds a couple of classes to represent metadata about blob files: `SharedBlobFileMetaData` contains the information elements that are immutable (once the blob file is closed), e.g. blob file number, total number and size of blob files, checksum method/value, while `BlobFileMetaData` contains attributes that can vary across versions like the amount of garbage in the file. There is a single `SharedBlobFileMetaData` for each blob file, which is jointly owned by the `BlobFileMetaData` objects that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s and can also be shared if the (immutable _and_ mutable) state of the blob file is the same in two versions. In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends `VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata to a new `VersionStorageInfo`. Consistency checks are also extended to ensure that table files point to blob files that are part of the `Version`, and that all blob files that are part of any given `Version` have at least some _non_-garbage data in them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20656803 Pulled By: ltamasi fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
5 years ago
if (blob_file_meta->GetGarbageBlobCount() >=
blob_file_meta->GetTotalBlobCount()) {
std::ostringstream oss;
oss << "Blob file #" << blob_file_number
Add blob files to VersionStorageInfo/VersionBuilder (#6597) Summary: The patch adds a couple of classes to represent metadata about blob files: `SharedBlobFileMetaData` contains the information elements that are immutable (once the blob file is closed), e.g. blob file number, total number and size of blob files, checksum method/value, while `BlobFileMetaData` contains attributes that can vary across versions like the amount of garbage in the file. There is a single `SharedBlobFileMetaData` for each blob file, which is jointly owned by the `BlobFileMetaData` objects that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s and can also be shared if the (immutable _and_ mutable) state of the blob file is the same in two versions. In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends `VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata to a new `VersionStorageInfo`. Consistency checks are also extended to ensure that table files point to blob files that are part of the `Version`, and that all blob files that are part of any given `Version` have at least some _non_-garbage data in them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20656803 Pulled By: ltamasi fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
5 years ago
<< " consists entirely of garbage";
return Status::Corruption("VersionBuilder", oss.str());
}
if (blob_file_meta->GetLinkedSsts() !=
expected_linked_ssts[blob_file_number]) {
std::ostringstream oss;
oss << "Links are inconsistent between table files and blob file #"
<< blob_file_number;
return Status::Corruption("VersionBuilder", oss.str());
}
Add blob files to VersionStorageInfo/VersionBuilder (#6597) Summary: The patch adds a couple of classes to represent metadata about blob files: `SharedBlobFileMetaData` contains the information elements that are immutable (once the blob file is closed), e.g. blob file number, total number and size of blob files, checksum method/value, while `BlobFileMetaData` contains attributes that can vary across versions like the amount of garbage in the file. There is a single `SharedBlobFileMetaData` for each blob file, which is jointly owned by the `BlobFileMetaData` objects that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s and can also be shared if the (immutable _and_ mutable) state of the blob file is the same in two versions. In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends `VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata to a new `VersionStorageInfo`. Consistency checks are also extended to ensure that table files point to blob files that are part of the `Version`, and that all blob files that are part of any given `Version` have at least some _non_-garbage data in them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20656803 Pulled By: ltamasi fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
5 years ago
}
Status ret_s;
TEST_SYNC_POINT_CALLBACK("VersionBuilder::CheckConsistencyBeforeReturn",
&ret_s);
return ret_s;
}
Status CheckConsistency(const VersionStorageInfo* vstorage) const {
assert(vstorage);
// Always run consistency checks in debug build
#ifdef NDEBUG
if (!vstorage->force_consistency_checks()) {
return Status::OK();
}
#endif
Status s = CheckConsistencyDetails(vstorage);
if (s.IsCorruption() && s.getState()) {
// Make it clear the error is due to force_consistency_checks = 1 or
// debug build
#ifdef NDEBUG
auto prefix = "force_consistency_checks";
#else
auto prefix = "force_consistency_checks(DEBUG)";
#endif
s = Status::Corruption(prefix, s.getState());
} else {
// was only expecting corruption with message, or OK
assert(s.ok());
}
return s;
}
Improve consistency checks in VersionBuilder (#6901) Summary: The patch cleans up the code and improves the consistency checks around adding/deleting table files in `VersionBuilder`. Namely, it makes the checks stricter and improves them in the following ways: 1) A table file can now only be deleted from the LSM tree using the level it resides on. Earlier, there was some unnecessary wiggle room for trivially moved files (they could be deleted using a lower level number than the actual one). 2) A table file cannot be added to the tree if it is already present in the tree on any level (not just the target level). The earlier code only had an assertion (which is a no-op in release builds) that the newly added file is not already present on the target level. 3) The above consistency checks around state transitions are now mandatory, as opposed to the earlier `CheckConsistencyForDeletes`, which was a no-op in release mode unless `force_consistency_checks` was set to `true`. The rationale here is that assuming that the initial state is consistent, a valid transition leads to a next state that is also consistent; however, an *invalid* transition offers no such guarantee. Hence it makes sense to validate the transitions unconditionally, and save `force_consistency_checks` for the paranoid checks that re-validate the entire state. 4) The new checks build on the mechanism introduced in https://github.com/facebook/rocksdb/pull/6862, which enables us to efficiently look up the location (level and position within level) of files in a `Version` by file number. This makes the consistency checks much more efficient than the earlier `CheckConsistencyForDeletes`, which essentially performed a linear search. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6901 Test Plan: Extended the unit tests and ran: `make check` `make whitebox_crash_test` Reviewed By: ajkr Differential Revision: D21822714 Pulled By: ltamasi fbshipit-source-id: e2b29c8b6da1bf0f59004acc889e4870b2d18215
5 years ago
bool CheckConsistencyForNumLevels() const {
// Make sure there are no files on or beyond num_levels().
if (has_invalid_levels_) {
return false;
}
Improve consistency checks in VersionBuilder (#6901) Summary: The patch cleans up the code and improves the consistency checks around adding/deleting table files in `VersionBuilder`. Namely, it makes the checks stricter and improves them in the following ways: 1) A table file can now only be deleted from the LSM tree using the level it resides on. Earlier, there was some unnecessary wiggle room for trivially moved files (they could be deleted using a lower level number than the actual one). 2) A table file cannot be added to the tree if it is already present in the tree on any level (not just the target level). The earlier code only had an assertion (which is a no-op in release builds) that the newly added file is not already present on the target level. 3) The above consistency checks around state transitions are now mandatory, as opposed to the earlier `CheckConsistencyForDeletes`, which was a no-op in release mode unless `force_consistency_checks` was set to `true`. The rationale here is that assuming that the initial state is consistent, a valid transition leads to a next state that is also consistent; however, an *invalid* transition offers no such guarantee. Hence it makes sense to validate the transitions unconditionally, and save `force_consistency_checks` for the paranoid checks that re-validate the entire state. 4) The new checks build on the mechanism introduced in https://github.com/facebook/rocksdb/pull/6862, which enables us to efficiently look up the location (level and position within level) of files in a `Version` by file number. This makes the consistency checks much more efficient than the earlier `CheckConsistencyForDeletes`, which essentially performed a linear search. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6901 Test Plan: Extended the unit tests and ran: `make check` `make whitebox_crash_test` Reviewed By: ajkr Differential Revision: D21822714 Pulled By: ltamasi fbshipit-source-id: e2b29c8b6da1bf0f59004acc889e4870b2d18215
5 years ago
for (const auto& pair : invalid_level_sizes_) {
const size_t level_size = pair.second;
if (level_size != 0) {
return false;
}
}
Improve consistency checks in VersionBuilder (#6901) Summary: The patch cleans up the code and improves the consistency checks around adding/deleting table files in `VersionBuilder`. Namely, it makes the checks stricter and improves them in the following ways: 1) A table file can now only be deleted from the LSM tree using the level it resides on. Earlier, there was some unnecessary wiggle room for trivially moved files (they could be deleted using a lower level number than the actual one). 2) A table file cannot be added to the tree if it is already present in the tree on any level (not just the target level). The earlier code only had an assertion (which is a no-op in release builds) that the newly added file is not already present on the target level. 3) The above consistency checks around state transitions are now mandatory, as opposed to the earlier `CheckConsistencyForDeletes`, which was a no-op in release mode unless `force_consistency_checks` was set to `true`. The rationale here is that assuming that the initial state is consistent, a valid transition leads to a next state that is also consistent; however, an *invalid* transition offers no such guarantee. Hence it makes sense to validate the transitions unconditionally, and save `force_consistency_checks` for the paranoid checks that re-validate the entire state. 4) The new checks build on the mechanism introduced in https://github.com/facebook/rocksdb/pull/6862, which enables us to efficiently look up the location (level and position within level) of files in a `Version` by file number. This makes the consistency checks much more efficient than the earlier `CheckConsistencyForDeletes`, which essentially performed a linear search. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6901 Test Plan: Extended the unit tests and ran: `make check` `make whitebox_crash_test` Reviewed By: ajkr Differential Revision: D21822714 Pulled By: ltamasi fbshipit-source-id: e2b29c8b6da1bf0f59004acc889e4870b2d18215
5 years ago
return true;
}
Aggregate blob file related changes in VersionBuilder as VersionEdits are applied (#9085) Summary: The current VersionBuilder code on mainline keeps track of blob file related changes ("delta") induced by a series of `VersionEdit`s in the form of `BlobFileMetaDataDelta` objects. Specifically, `BlobFileMetaDataDelta` contains the amount of additional garbage generated by compactions, as well as the set of newly linked/unlinked SSTs. This is very handy for detecting trivial moves, since in that case the newly linked and unlinked SSTs cancel each other out. However, this representation does not allow us to easily tell whether a certain blob file is obsolete after applying a set of `VersionEdit`s or not. In order to solve this issue, the patch introduces `MutableBlobFileMetaData`, which, in addition to the delta, also contains the materialized state after applying a set of version edits (i.e. the total amount of garbage and the resulting set of linked SSTs). This will enable us to add further consistency checks and to improve certain pieces of functionality where knowing up front which blob files get obsoleted is beneficial. (Note: this patch is just the refactoring part; I plan to create separate PRs for the enhancements.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9085 Test Plan: Ran `make check` and the stress tests in BlobDB mode. Reviewed By: riversand963 Differential Revision: D31980867 Pulled By: ltamasi fbshipit-source-id: cc4286778b10900af720423d6b772c77f28a93e3
3 years ago
bool IsBlobFileInVersion(uint64_t blob_file_number) const {
auto mutable_it = mutable_blob_file_metas_.find(blob_file_number);
if (mutable_it != mutable_blob_file_metas_.end()) {
return true;
}
assert(base_vstorage_);
Use a sorted vector instead of a map to store blob file metadata (#9526) Summary: The patch replaces `std::map` with a sorted `std::vector` for `VersionStorageInfo::blob_files_` and preallocates the space for the `vector` before saving the `BlobFileMetaData` into the new `VersionStorageInfo` in `VersionBuilder::Rep::SaveBlobFilesTo`. These changes reduce the time the DB mutex is held while saving new `Version`s, and using a sorted `vector` also makes lookups faster thanks to better memory locality. In addition, the patch introduces helper methods `VersionStorageInfo::GetBlobFileMetaData` and `VersionStorageInfo::GetBlobFileMetaDataLB` that can be used by clients to perform lookups in the `vector`, and does some general cleanup in the parts of code where blob file metadata are used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9526 Test Plan: Ran `make check` and the crash test script for a while. Performance was tested using a load-optimized benchmark (`fillseq` with vector memtable, no WAL) and small file sizes so that a significant number of files are produced: ``` numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/ltamasi-dbbench --wal_dir=/data/ltamasi-dbbench --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --enable_blob_files=1 --blob_file_size=16777216 --min_blob_size=0 --blob_compression_type=lz4 --enable_blob_garbage_collection=1 --seed=<some value> ``` Final statistics before the patch: ``` Cumulative writes: 0 writes, 700M keys, 0 commit groups, 0.0 writes per commit group, ingest: 284.62 GB, 121.27 MB/s Interval writes: 0 writes, 334K keys, 0 commit groups, 0.0 writes per commit group, ingest: 139.28 MB, 72.46 MB/s ``` With the patch: ``` Cumulative writes: 0 writes, 760M keys, 0 commit groups, 0.0 writes per commit group, ingest: 308.66 GB, 131.52 MB/s Interval writes: 0 writes, 445K keys, 0 commit groups, 0.0 writes per commit group, ingest: 185.35 MB, 93.15 MB/s ``` Total time to complete the benchmark is 2611 seconds with the patch, down from 2986 secs. Reviewed By: riversand963 Differential Revision: D34082728 Pulled By: ltamasi fbshipit-source-id: fc598abf676dce436734d06bb9d2d99a26a004fc
3 years ago
const auto meta = base_vstorage_->GetBlobFileMetaData(blob_file_number);
Aggregate blob file related changes in VersionBuilder as VersionEdits are applied (#9085) Summary: The current VersionBuilder code on mainline keeps track of blob file related changes ("delta") induced by a series of `VersionEdit`s in the form of `BlobFileMetaDataDelta` objects. Specifically, `BlobFileMetaDataDelta` contains the amount of additional garbage generated by compactions, as well as the set of newly linked/unlinked SSTs. This is very handy for detecting trivial moves, since in that case the newly linked and unlinked SSTs cancel each other out. However, this representation does not allow us to easily tell whether a certain blob file is obsolete after applying a set of `VersionEdit`s or not. In order to solve this issue, the patch introduces `MutableBlobFileMetaData`, which, in addition to the delta, also contains the materialized state after applying a set of version edits (i.e. the total amount of garbage and the resulting set of linked SSTs). This will enable us to add further consistency checks and to improve certain pieces of functionality where knowing up front which blob files get obsoleted is beneficial. (Note: this patch is just the refactoring part; I plan to create separate PRs for the enhancements.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9085 Test Plan: Ran `make check` and the stress tests in BlobDB mode. Reviewed By: riversand963 Differential Revision: D31980867 Pulled By: ltamasi fbshipit-source-id: cc4286778b10900af720423d6b772c77f28a93e3
3 years ago
Use a sorted vector instead of a map to store blob file metadata (#9526) Summary: The patch replaces `std::map` with a sorted `std::vector` for `VersionStorageInfo::blob_files_` and preallocates the space for the `vector` before saving the `BlobFileMetaData` into the new `VersionStorageInfo` in `VersionBuilder::Rep::SaveBlobFilesTo`. These changes reduce the time the DB mutex is held while saving new `Version`s, and using a sorted `vector` also makes lookups faster thanks to better memory locality. In addition, the patch introduces helper methods `VersionStorageInfo::GetBlobFileMetaData` and `VersionStorageInfo::GetBlobFileMetaDataLB` that can be used by clients to perform lookups in the `vector`, and does some general cleanup in the parts of code where blob file metadata are used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9526 Test Plan: Ran `make check` and the crash test script for a while. Performance was tested using a load-optimized benchmark (`fillseq` with vector memtable, no WAL) and small file sizes so that a significant number of files are produced: ``` numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/ltamasi-dbbench --wal_dir=/data/ltamasi-dbbench --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --enable_blob_files=1 --blob_file_size=16777216 --min_blob_size=0 --blob_compression_type=lz4 --enable_blob_garbage_collection=1 --seed=<some value> ``` Final statistics before the patch: ``` Cumulative writes: 0 writes, 700M keys, 0 commit groups, 0.0 writes per commit group, ingest: 284.62 GB, 121.27 MB/s Interval writes: 0 writes, 334K keys, 0 commit groups, 0.0 writes per commit group, ingest: 139.28 MB, 72.46 MB/s ``` With the patch: ``` Cumulative writes: 0 writes, 760M keys, 0 commit groups, 0.0 writes per commit group, ingest: 308.66 GB, 131.52 MB/s Interval writes: 0 writes, 445K keys, 0 commit groups, 0.0 writes per commit group, ingest: 185.35 MB, 93.15 MB/s ``` Total time to complete the benchmark is 2611 seconds with the patch, down from 2986 secs. Reviewed By: riversand963 Differential Revision: D34082728 Pulled By: ltamasi fbshipit-source-id: fc598abf676dce436734d06bb9d2d99a26a004fc
3 years ago
return !!meta;
Aggregate blob file related changes in VersionBuilder as VersionEdits are applied (#9085) Summary: The current VersionBuilder code on mainline keeps track of blob file related changes ("delta") induced by a series of `VersionEdit`s in the form of `BlobFileMetaDataDelta` objects. Specifically, `BlobFileMetaDataDelta` contains the amount of additional garbage generated by compactions, as well as the set of newly linked/unlinked SSTs. This is very handy for detecting trivial moves, since in that case the newly linked and unlinked SSTs cancel each other out. However, this representation does not allow us to easily tell whether a certain blob file is obsolete after applying a set of `VersionEdit`s or not. In order to solve this issue, the patch introduces `MutableBlobFileMetaData`, which, in addition to the delta, also contains the materialized state after applying a set of version edits (i.e. the total amount of garbage and the resulting set of linked SSTs). This will enable us to add further consistency checks and to improve certain pieces of functionality where knowing up front which blob files get obsoleted is beneficial. (Note: this patch is just the refactoring part; I plan to create separate PRs for the enhancements.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9085 Test Plan: Ran `make check` and the stress tests in BlobDB mode. Reviewed By: riversand963 Differential Revision: D31980867 Pulled By: ltamasi fbshipit-source-id: cc4286778b10900af720423d6b772c77f28a93e3
3 years ago
}
MutableBlobFileMetaData* GetOrCreateMutableBlobFileMetaData(
uint64_t blob_file_number) {
auto mutable_it = mutable_blob_file_metas_.find(blob_file_number);
if (mutable_it != mutable_blob_file_metas_.end()) {
return &mutable_it->second;
}
assert(base_vstorage_);
Use a sorted vector instead of a map to store blob file metadata (#9526) Summary: The patch replaces `std::map` with a sorted `std::vector` for `VersionStorageInfo::blob_files_` and preallocates the space for the `vector` before saving the `BlobFileMetaData` into the new `VersionStorageInfo` in `VersionBuilder::Rep::SaveBlobFilesTo`. These changes reduce the time the DB mutex is held while saving new `Version`s, and using a sorted `vector` also makes lookups faster thanks to better memory locality. In addition, the patch introduces helper methods `VersionStorageInfo::GetBlobFileMetaData` and `VersionStorageInfo::GetBlobFileMetaDataLB` that can be used by clients to perform lookups in the `vector`, and does some general cleanup in the parts of code where blob file metadata are used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9526 Test Plan: Ran `make check` and the crash test script for a while. Performance was tested using a load-optimized benchmark (`fillseq` with vector memtable, no WAL) and small file sizes so that a significant number of files are produced: ``` numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/ltamasi-dbbench --wal_dir=/data/ltamasi-dbbench --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --enable_blob_files=1 --blob_file_size=16777216 --min_blob_size=0 --blob_compression_type=lz4 --enable_blob_garbage_collection=1 --seed=<some value> ``` Final statistics before the patch: ``` Cumulative writes: 0 writes, 700M keys, 0 commit groups, 0.0 writes per commit group, ingest: 284.62 GB, 121.27 MB/s Interval writes: 0 writes, 334K keys, 0 commit groups, 0.0 writes per commit group, ingest: 139.28 MB, 72.46 MB/s ``` With the patch: ``` Cumulative writes: 0 writes, 760M keys, 0 commit groups, 0.0 writes per commit group, ingest: 308.66 GB, 131.52 MB/s Interval writes: 0 writes, 445K keys, 0 commit groups, 0.0 writes per commit group, ingest: 185.35 MB, 93.15 MB/s ``` Total time to complete the benchmark is 2611 seconds with the patch, down from 2986 secs. Reviewed By: riversand963 Differential Revision: D34082728 Pulled By: ltamasi fbshipit-source-id: fc598abf676dce436734d06bb9d2d99a26a004fc
3 years ago
const auto meta = base_vstorage_->GetBlobFileMetaData(blob_file_number);
Aggregate blob file related changes in VersionBuilder as VersionEdits are applied (#9085) Summary: The current VersionBuilder code on mainline keeps track of blob file related changes ("delta") induced by a series of `VersionEdit`s in the form of `BlobFileMetaDataDelta` objects. Specifically, `BlobFileMetaDataDelta` contains the amount of additional garbage generated by compactions, as well as the set of newly linked/unlinked SSTs. This is very handy for detecting trivial moves, since in that case the newly linked and unlinked SSTs cancel each other out. However, this representation does not allow us to easily tell whether a certain blob file is obsolete after applying a set of `VersionEdit`s or not. In order to solve this issue, the patch introduces `MutableBlobFileMetaData`, which, in addition to the delta, also contains the materialized state after applying a set of version edits (i.e. the total amount of garbage and the resulting set of linked SSTs). This will enable us to add further consistency checks and to improve certain pieces of functionality where knowing up front which blob files get obsoleted is beneficial. (Note: this patch is just the refactoring part; I plan to create separate PRs for the enhancements.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9085 Test Plan: Ran `make check` and the stress tests in BlobDB mode. Reviewed By: riversand963 Differential Revision: D31980867 Pulled By: ltamasi fbshipit-source-id: cc4286778b10900af720423d6b772c77f28a93e3
3 years ago
Use a sorted vector instead of a map to store blob file metadata (#9526) Summary: The patch replaces `std::map` with a sorted `std::vector` for `VersionStorageInfo::blob_files_` and preallocates the space for the `vector` before saving the `BlobFileMetaData` into the new `VersionStorageInfo` in `VersionBuilder::Rep::SaveBlobFilesTo`. These changes reduce the time the DB mutex is held while saving new `Version`s, and using a sorted `vector` also makes lookups faster thanks to better memory locality. In addition, the patch introduces helper methods `VersionStorageInfo::GetBlobFileMetaData` and `VersionStorageInfo::GetBlobFileMetaDataLB` that can be used by clients to perform lookups in the `vector`, and does some general cleanup in the parts of code where blob file metadata are used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9526 Test Plan: Ran `make check` and the crash test script for a while. Performance was tested using a load-optimized benchmark (`fillseq` with vector memtable, no WAL) and small file sizes so that a significant number of files are produced: ``` numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/ltamasi-dbbench --wal_dir=/data/ltamasi-dbbench --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --enable_blob_files=1 --blob_file_size=16777216 --min_blob_size=0 --blob_compression_type=lz4 --enable_blob_garbage_collection=1 --seed=<some value> ``` Final statistics before the patch: ``` Cumulative writes: 0 writes, 700M keys, 0 commit groups, 0.0 writes per commit group, ingest: 284.62 GB, 121.27 MB/s Interval writes: 0 writes, 334K keys, 0 commit groups, 0.0 writes per commit group, ingest: 139.28 MB, 72.46 MB/s ``` With the patch: ``` Cumulative writes: 0 writes, 760M keys, 0 commit groups, 0.0 writes per commit group, ingest: 308.66 GB, 131.52 MB/s Interval writes: 0 writes, 445K keys, 0 commit groups, 0.0 writes per commit group, ingest: 185.35 MB, 93.15 MB/s ``` Total time to complete the benchmark is 2611 seconds with the patch, down from 2986 secs. Reviewed By: riversand963 Differential Revision: D34082728 Pulled By: ltamasi fbshipit-source-id: fc598abf676dce436734d06bb9d2d99a26a004fc
3 years ago
if (meta) {
Aggregate blob file related changes in VersionBuilder as VersionEdits are applied (#9085) Summary: The current VersionBuilder code on mainline keeps track of blob file related changes ("delta") induced by a series of `VersionEdit`s in the form of `BlobFileMetaDataDelta` objects. Specifically, `BlobFileMetaDataDelta` contains the amount of additional garbage generated by compactions, as well as the set of newly linked/unlinked SSTs. This is very handy for detecting trivial moves, since in that case the newly linked and unlinked SSTs cancel each other out. However, this representation does not allow us to easily tell whether a certain blob file is obsolete after applying a set of `VersionEdit`s or not. In order to solve this issue, the patch introduces `MutableBlobFileMetaData`, which, in addition to the delta, also contains the materialized state after applying a set of version edits (i.e. the total amount of garbage and the resulting set of linked SSTs). This will enable us to add further consistency checks and to improve certain pieces of functionality where knowing up front which blob files get obsoleted is beneficial. (Note: this patch is just the refactoring part; I plan to create separate PRs for the enhancements.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9085 Test Plan: Ran `make check` and the stress tests in BlobDB mode. Reviewed By: riversand963 Differential Revision: D31980867 Pulled By: ltamasi fbshipit-source-id: cc4286778b10900af720423d6b772c77f28a93e3
3 years ago
mutable_it = mutable_blob_file_metas_
Use a sorted vector instead of a map to store blob file metadata (#9526) Summary: The patch replaces `std::map` with a sorted `std::vector` for `VersionStorageInfo::blob_files_` and preallocates the space for the `vector` before saving the `BlobFileMetaData` into the new `VersionStorageInfo` in `VersionBuilder::Rep::SaveBlobFilesTo`. These changes reduce the time the DB mutex is held while saving new `Version`s, and using a sorted `vector` also makes lookups faster thanks to better memory locality. In addition, the patch introduces helper methods `VersionStorageInfo::GetBlobFileMetaData` and `VersionStorageInfo::GetBlobFileMetaDataLB` that can be used by clients to perform lookups in the `vector`, and does some general cleanup in the parts of code where blob file metadata are used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9526 Test Plan: Ran `make check` and the crash test script for a while. Performance was tested using a load-optimized benchmark (`fillseq` with vector memtable, no WAL) and small file sizes so that a significant number of files are produced: ``` numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/ltamasi-dbbench --wal_dir=/data/ltamasi-dbbench --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --enable_blob_files=1 --blob_file_size=16777216 --min_blob_size=0 --blob_compression_type=lz4 --enable_blob_garbage_collection=1 --seed=<some value> ``` Final statistics before the patch: ``` Cumulative writes: 0 writes, 700M keys, 0 commit groups, 0.0 writes per commit group, ingest: 284.62 GB, 121.27 MB/s Interval writes: 0 writes, 334K keys, 0 commit groups, 0.0 writes per commit group, ingest: 139.28 MB, 72.46 MB/s ``` With the patch: ``` Cumulative writes: 0 writes, 760M keys, 0 commit groups, 0.0 writes per commit group, ingest: 308.66 GB, 131.52 MB/s Interval writes: 0 writes, 445K keys, 0 commit groups, 0.0 writes per commit group, ingest: 185.35 MB, 93.15 MB/s ``` Total time to complete the benchmark is 2611 seconds with the patch, down from 2986 secs. Reviewed By: riversand963 Differential Revision: D34082728 Pulled By: ltamasi fbshipit-source-id: fc598abf676dce436734d06bb9d2d99a26a004fc
3 years ago
.emplace(blob_file_number, MutableBlobFileMetaData(meta))
Aggregate blob file related changes in VersionBuilder as VersionEdits are applied (#9085) Summary: The current VersionBuilder code on mainline keeps track of blob file related changes ("delta") induced by a series of `VersionEdit`s in the form of `BlobFileMetaDataDelta` objects. Specifically, `BlobFileMetaDataDelta` contains the amount of additional garbage generated by compactions, as well as the set of newly linked/unlinked SSTs. This is very handy for detecting trivial moves, since in that case the newly linked and unlinked SSTs cancel each other out. However, this representation does not allow us to easily tell whether a certain blob file is obsolete after applying a set of `VersionEdit`s or not. In order to solve this issue, the patch introduces `MutableBlobFileMetaData`, which, in addition to the delta, also contains the materialized state after applying a set of version edits (i.e. the total amount of garbage and the resulting set of linked SSTs). This will enable us to add further consistency checks and to improve certain pieces of functionality where knowing up front which blob files get obsoleted is beneficial. (Note: this patch is just the refactoring part; I plan to create separate PRs for the enhancements.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9085 Test Plan: Ran `make check` and the stress tests in BlobDB mode. Reviewed By: riversand963 Differential Revision: D31980867 Pulled By: ltamasi fbshipit-source-id: cc4286778b10900af720423d6b772c77f28a93e3
3 years ago
.first;
return &mutable_it->second;
}
return nullptr;
}
Add blob files to VersionStorageInfo/VersionBuilder (#6597) Summary: The patch adds a couple of classes to represent metadata about blob files: `SharedBlobFileMetaData` contains the information elements that are immutable (once the blob file is closed), e.g. blob file number, total number and size of blob files, checksum method/value, while `BlobFileMetaData` contains attributes that can vary across versions like the amount of garbage in the file. There is a single `SharedBlobFileMetaData` for each blob file, which is jointly owned by the `BlobFileMetaData` objects that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s and can also be shared if the (immutable _and_ mutable) state of the blob file is the same in two versions. In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends `VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata to a new `VersionStorageInfo`. Consistency checks are also extended to ensure that table files point to blob files that are part of the `Version`, and that all blob files that are part of any given `Version` have at least some _non_-garbage data in them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20656803 Pulled By: ltamasi fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
5 years ago
Status ApplyBlobFileAddition(const BlobFileAddition& blob_file_addition) {
const uint64_t blob_file_number = blob_file_addition.GetBlobFileNumber();
if (IsBlobFileInVersion(blob_file_number)) {
Add blob files to VersionStorageInfo/VersionBuilder (#6597) Summary: The patch adds a couple of classes to represent metadata about blob files: `SharedBlobFileMetaData` contains the information elements that are immutable (once the blob file is closed), e.g. blob file number, total number and size of blob files, checksum method/value, while `BlobFileMetaData` contains attributes that can vary across versions like the amount of garbage in the file. There is a single `SharedBlobFileMetaData` for each blob file, which is jointly owned by the `BlobFileMetaData` objects that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s and can also be shared if the (immutable _and_ mutable) state of the blob file is the same in two versions. In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends `VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata to a new `VersionStorageInfo`. Consistency checks are also extended to ensure that table files point to blob files that are part of the `Version`, and that all blob files that are part of any given `Version` have at least some _non_-garbage data in them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20656803 Pulled By: ltamasi fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
5 years ago
std::ostringstream oss;
oss << "Blob file #" << blob_file_number << " already added";
return Status::Corruption("VersionBuilder", oss.str());
}
// Note: we use C++11 for now but in C++14, this could be done in a more
// elegant way using generalized lambda capture.
VersionSet* const vs = version_set_;
const ImmutableCFOptions* const ioptions = ioptions_;
auto deleter = [vs, ioptions](SharedBlobFileMetaData* shared_meta) {
if (vs) {
assert(ioptions);
assert(!ioptions->cf_paths.empty());
assert(shared_meta);
vs->AddObsoleteBlobFile(shared_meta->GetBlobFileNumber(),
ioptions->cf_paths.front().path);
}
delete shared_meta;
};
auto shared_meta = SharedBlobFileMetaData::Create(
Add blob files to VersionStorageInfo/VersionBuilder (#6597) Summary: The patch adds a couple of classes to represent metadata about blob files: `SharedBlobFileMetaData` contains the information elements that are immutable (once the blob file is closed), e.g. blob file number, total number and size of blob files, checksum method/value, while `BlobFileMetaData` contains attributes that can vary across versions like the amount of garbage in the file. There is a single `SharedBlobFileMetaData` for each blob file, which is jointly owned by the `BlobFileMetaData` objects that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s and can also be shared if the (immutable _and_ mutable) state of the blob file is the same in two versions. In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends `VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata to a new `VersionStorageInfo`. Consistency checks are also extended to ensure that table files point to blob files that are part of the `Version`, and that all blob files that are part of any given `Version` have at least some _non_-garbage data in them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20656803 Pulled By: ltamasi fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
5 years ago
blob_file_number, blob_file_addition.GetTotalBlobCount(),
blob_file_addition.GetTotalBlobBytes(),
blob_file_addition.GetChecksumMethod(),
blob_file_addition.GetChecksumValue(), deleter);
Add blob files to VersionStorageInfo/VersionBuilder (#6597) Summary: The patch adds a couple of classes to represent metadata about blob files: `SharedBlobFileMetaData` contains the information elements that are immutable (once the blob file is closed), e.g. blob file number, total number and size of blob files, checksum method/value, while `BlobFileMetaData` contains attributes that can vary across versions like the amount of garbage in the file. There is a single `SharedBlobFileMetaData` for each blob file, which is jointly owned by the `BlobFileMetaData` objects that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s and can also be shared if the (immutable _and_ mutable) state of the blob file is the same in two versions. In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends `VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata to a new `VersionStorageInfo`. Consistency checks are also extended to ensure that table files point to blob files that are part of the `Version`, and that all blob files that are part of any given `Version` have at least some _non_-garbage data in them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20656803 Pulled By: ltamasi fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
5 years ago
Aggregate blob file related changes in VersionBuilder as VersionEdits are applied (#9085) Summary: The current VersionBuilder code on mainline keeps track of blob file related changes ("delta") induced by a series of `VersionEdit`s in the form of `BlobFileMetaDataDelta` objects. Specifically, `BlobFileMetaDataDelta` contains the amount of additional garbage generated by compactions, as well as the set of newly linked/unlinked SSTs. This is very handy for detecting trivial moves, since in that case the newly linked and unlinked SSTs cancel each other out. However, this representation does not allow us to easily tell whether a certain blob file is obsolete after applying a set of `VersionEdit`s or not. In order to solve this issue, the patch introduces `MutableBlobFileMetaData`, which, in addition to the delta, also contains the materialized state after applying a set of version edits (i.e. the total amount of garbage and the resulting set of linked SSTs). This will enable us to add further consistency checks and to improve certain pieces of functionality where knowing up front which blob files get obsoleted is beneficial. (Note: this patch is just the refactoring part; I plan to create separate PRs for the enhancements.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9085 Test Plan: Ran `make check` and the stress tests in BlobDB mode. Reviewed By: riversand963 Differential Revision: D31980867 Pulled By: ltamasi fbshipit-source-id: cc4286778b10900af720423d6b772c77f28a93e3
3 years ago
mutable_blob_file_metas_.emplace(
blob_file_number, MutableBlobFileMetaData(std::move(shared_meta)));
Add blob files to VersionStorageInfo/VersionBuilder (#6597) Summary: The patch adds a couple of classes to represent metadata about blob files: `SharedBlobFileMetaData` contains the information elements that are immutable (once the blob file is closed), e.g. blob file number, total number and size of blob files, checksum method/value, while `BlobFileMetaData` contains attributes that can vary across versions like the amount of garbage in the file. There is a single `SharedBlobFileMetaData` for each blob file, which is jointly owned by the `BlobFileMetaData` objects that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s and can also be shared if the (immutable _and_ mutable) state of the blob file is the same in two versions. In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends `VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata to a new `VersionStorageInfo`. Consistency checks are also extended to ensure that table files point to blob files that are part of the `Version`, and that all blob files that are part of any given `Version` have at least some _non_-garbage data in them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20656803 Pulled By: ltamasi fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
5 years ago
return Status::OK();
}
Status ApplyBlobFileGarbage(const BlobFileGarbage& blob_file_garbage) {
const uint64_t blob_file_number = blob_file_garbage.GetBlobFileNumber();
Aggregate blob file related changes in VersionBuilder as VersionEdits are applied (#9085) Summary: The current VersionBuilder code on mainline keeps track of blob file related changes ("delta") induced by a series of `VersionEdit`s in the form of `BlobFileMetaDataDelta` objects. Specifically, `BlobFileMetaDataDelta` contains the amount of additional garbage generated by compactions, as well as the set of newly linked/unlinked SSTs. This is very handy for detecting trivial moves, since in that case the newly linked and unlinked SSTs cancel each other out. However, this representation does not allow us to easily tell whether a certain blob file is obsolete after applying a set of `VersionEdit`s or not. In order to solve this issue, the patch introduces `MutableBlobFileMetaData`, which, in addition to the delta, also contains the materialized state after applying a set of version edits (i.e. the total amount of garbage and the resulting set of linked SSTs). This will enable us to add further consistency checks and to improve certain pieces of functionality where knowing up front which blob files get obsoleted is beneficial. (Note: this patch is just the refactoring part; I plan to create separate PRs for the enhancements.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9085 Test Plan: Ran `make check` and the stress tests in BlobDB mode. Reviewed By: riversand963 Differential Revision: D31980867 Pulled By: ltamasi fbshipit-source-id: cc4286778b10900af720423d6b772c77f28a93e3
3 years ago
MutableBlobFileMetaData* const mutable_meta =
GetOrCreateMutableBlobFileMetaData(blob_file_number);
if (!mutable_meta) {
Add blob files to VersionStorageInfo/VersionBuilder (#6597) Summary: The patch adds a couple of classes to represent metadata about blob files: `SharedBlobFileMetaData` contains the information elements that are immutable (once the blob file is closed), e.g. blob file number, total number and size of blob files, checksum method/value, while `BlobFileMetaData` contains attributes that can vary across versions like the amount of garbage in the file. There is a single `SharedBlobFileMetaData` for each blob file, which is jointly owned by the `BlobFileMetaData` objects that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s and can also be shared if the (immutable _and_ mutable) state of the blob file is the same in two versions. In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends `VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata to a new `VersionStorageInfo`. Consistency checks are also extended to ensure that table files point to blob files that are part of the `Version`, and that all blob files that are part of any given `Version` have at least some _non_-garbage data in them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20656803 Pulled By: ltamasi fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
5 years ago
std::ostringstream oss;
oss << "Blob file #" << blob_file_number << " not found";
return Status::Corruption("VersionBuilder", oss.str());
}
if (!mutable_meta->AddGarbage(blob_file_garbage.GetGarbageBlobCount(),
blob_file_garbage.GetGarbageBlobBytes())) {
std::ostringstream oss;
oss << "Garbage overflow for blob file #" << blob_file_number;
return Status::Corruption("VersionBuilder", oss.str());
}
Add blob files to VersionStorageInfo/VersionBuilder (#6597) Summary: The patch adds a couple of classes to represent metadata about blob files: `SharedBlobFileMetaData` contains the information elements that are immutable (once the blob file is closed), e.g. blob file number, total number and size of blob files, checksum method/value, while `BlobFileMetaData` contains attributes that can vary across versions like the amount of garbage in the file. There is a single `SharedBlobFileMetaData` for each blob file, which is jointly owned by the `BlobFileMetaData` objects that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s and can also be shared if the (immutable _and_ mutable) state of the blob file is the same in two versions. In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends `VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata to a new `VersionStorageInfo`. Consistency checks are also extended to ensure that table files point to blob files that are part of the `Version`, and that all blob files that are part of any given `Version` have at least some _non_-garbage data in them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20656803 Pulled By: ltamasi fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
5 years ago
return Status::OK();
}
Improve consistency checks in VersionBuilder (#6901) Summary: The patch cleans up the code and improves the consistency checks around adding/deleting table files in `VersionBuilder`. Namely, it makes the checks stricter and improves them in the following ways: 1) A table file can now only be deleted from the LSM tree using the level it resides on. Earlier, there was some unnecessary wiggle room for trivially moved files (they could be deleted using a lower level number than the actual one). 2) A table file cannot be added to the tree if it is already present in the tree on any level (not just the target level). The earlier code only had an assertion (which is a no-op in release builds) that the newly added file is not already present on the target level. 3) The above consistency checks around state transitions are now mandatory, as opposed to the earlier `CheckConsistencyForDeletes`, which was a no-op in release mode unless `force_consistency_checks` was set to `true`. The rationale here is that assuming that the initial state is consistent, a valid transition leads to a next state that is also consistent; however, an *invalid* transition offers no such guarantee. Hence it makes sense to validate the transitions unconditionally, and save `force_consistency_checks` for the paranoid checks that re-validate the entire state. 4) The new checks build on the mechanism introduced in https://github.com/facebook/rocksdb/pull/6862, which enables us to efficiently look up the location (level and position within level) of files in a `Version` by file number. This makes the consistency checks much more efficient than the earlier `CheckConsistencyForDeletes`, which essentially performed a linear search. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6901 Test Plan: Extended the unit tests and ran: `make check` `make whitebox_crash_test` Reviewed By: ajkr Differential Revision: D21822714 Pulled By: ltamasi fbshipit-source-id: e2b29c8b6da1bf0f59004acc889e4870b2d18215
5 years ago
int GetCurrentLevelForTableFile(uint64_t file_number) const {
auto it = table_file_levels_.find(file_number);
if (it != table_file_levels_.end()) {
return it->second;
}
assert(base_vstorage_);
return base_vstorage_->GetFileLocation(file_number).GetLevel();
}
uint64_t GetOldestBlobFileNumberForTableFile(int level,
uint64_t file_number) const {
assert(level < num_levels_);
const auto& added_files = levels_[level].added_files;
auto it = added_files.find(file_number);
if (it != added_files.end()) {
const FileMetaData* const meta = it->second;
assert(meta);
return meta->oldest_blob_file_number;
}
assert(base_vstorage_);
const FileMetaData* const meta =
base_vstorage_->GetFileMetaDataByNumber(file_number);
assert(meta);
return meta->oldest_blob_file_number;
}
Improve consistency checks in VersionBuilder (#6901) Summary: The patch cleans up the code and improves the consistency checks around adding/deleting table files in `VersionBuilder`. Namely, it makes the checks stricter and improves them in the following ways: 1) A table file can now only be deleted from the LSM tree using the level it resides on. Earlier, there was some unnecessary wiggle room for trivially moved files (they could be deleted using a lower level number than the actual one). 2) A table file cannot be added to the tree if it is already present in the tree on any level (not just the target level). The earlier code only had an assertion (which is a no-op in release builds) that the newly added file is not already present on the target level. 3) The above consistency checks around state transitions are now mandatory, as opposed to the earlier `CheckConsistencyForDeletes`, which was a no-op in release mode unless `force_consistency_checks` was set to `true`. The rationale here is that assuming that the initial state is consistent, a valid transition leads to a next state that is also consistent; however, an *invalid* transition offers no such guarantee. Hence it makes sense to validate the transitions unconditionally, and save `force_consistency_checks` for the paranoid checks that re-validate the entire state. 4) The new checks build on the mechanism introduced in https://github.com/facebook/rocksdb/pull/6862, which enables us to efficiently look up the location (level and position within level) of files in a `Version` by file number. This makes the consistency checks much more efficient than the earlier `CheckConsistencyForDeletes`, which essentially performed a linear search. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6901 Test Plan: Extended the unit tests and ran: `make check` `make whitebox_crash_test` Reviewed By: ajkr Differential Revision: D21822714 Pulled By: ltamasi fbshipit-source-id: e2b29c8b6da1bf0f59004acc889e4870b2d18215
5 years ago
Status ApplyFileDeletion(int level, uint64_t file_number) {
assert(level != VersionStorageInfo::FileLocation::Invalid().GetLevel());
const int current_level = GetCurrentLevelForTableFile(file_number);
if (level != current_level) {
if (level >= num_levels_) {
has_invalid_levels_ = true;
}
std::ostringstream oss;
oss << "Cannot delete table file #" << file_number << " from level "
<< level << " since it is ";
if (current_level ==
VersionStorageInfo::FileLocation::Invalid().GetLevel()) {
oss << "not in the LSM tree";
} else {
oss << "on level " << current_level;
}
return Status::Corruption("VersionBuilder", oss.str());
}
if (level >= num_levels_) {
assert(invalid_level_sizes_[level] > 0);
--invalid_level_sizes_[level];
table_file_levels_[file_number] =
VersionStorageInfo::FileLocation::Invalid().GetLevel();
return Status::OK();
}
const uint64_t blob_file_number =
GetOldestBlobFileNumberForTableFile(level, file_number);
Aggregate blob file related changes in VersionBuilder as VersionEdits are applied (#9085) Summary: The current VersionBuilder code on mainline keeps track of blob file related changes ("delta") induced by a series of `VersionEdit`s in the form of `BlobFileMetaDataDelta` objects. Specifically, `BlobFileMetaDataDelta` contains the amount of additional garbage generated by compactions, as well as the set of newly linked/unlinked SSTs. This is very handy for detecting trivial moves, since in that case the newly linked and unlinked SSTs cancel each other out. However, this representation does not allow us to easily tell whether a certain blob file is obsolete after applying a set of `VersionEdit`s or not. In order to solve this issue, the patch introduces `MutableBlobFileMetaData`, which, in addition to the delta, also contains the materialized state after applying a set of version edits (i.e. the total amount of garbage and the resulting set of linked SSTs). This will enable us to add further consistency checks and to improve certain pieces of functionality where knowing up front which blob files get obsoleted is beneficial. (Note: this patch is just the refactoring part; I plan to create separate PRs for the enhancements.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9085 Test Plan: Ran `make check` and the stress tests in BlobDB mode. Reviewed By: riversand963 Differential Revision: D31980867 Pulled By: ltamasi fbshipit-source-id: cc4286778b10900af720423d6b772c77f28a93e3
3 years ago
if (blob_file_number != kInvalidBlobFileNumber) {
MutableBlobFileMetaData* const mutable_meta =
GetOrCreateMutableBlobFileMetaData(blob_file_number);
if (mutable_meta) {
mutable_meta->UnlinkSst(file_number);
}
}
Improve consistency checks in VersionBuilder (#6901) Summary: The patch cleans up the code and improves the consistency checks around adding/deleting table files in `VersionBuilder`. Namely, it makes the checks stricter and improves them in the following ways: 1) A table file can now only be deleted from the LSM tree using the level it resides on. Earlier, there was some unnecessary wiggle room for trivially moved files (they could be deleted using a lower level number than the actual one). 2) A table file cannot be added to the tree if it is already present in the tree on any level (not just the target level). The earlier code only had an assertion (which is a no-op in release builds) that the newly added file is not already present on the target level. 3) The above consistency checks around state transitions are now mandatory, as opposed to the earlier `CheckConsistencyForDeletes`, which was a no-op in release mode unless `force_consistency_checks` was set to `true`. The rationale here is that assuming that the initial state is consistent, a valid transition leads to a next state that is also consistent; however, an *invalid* transition offers no such guarantee. Hence it makes sense to validate the transitions unconditionally, and save `force_consistency_checks` for the paranoid checks that re-validate the entire state. 4) The new checks build on the mechanism introduced in https://github.com/facebook/rocksdb/pull/6862, which enables us to efficiently look up the location (level and position within level) of files in a `Version` by file number. This makes the consistency checks much more efficient than the earlier `CheckConsistencyForDeletes`, which essentially performed a linear search. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6901 Test Plan: Extended the unit tests and ran: `make check` `make whitebox_crash_test` Reviewed By: ajkr Differential Revision: D21822714 Pulled By: ltamasi fbshipit-source-id: e2b29c8b6da1bf0f59004acc889e4870b2d18215
5 years ago
auto& level_state = levels_[level];
auto& add_files = level_state.added_files;
auto add_it = add_files.find(file_number);
if (add_it != add_files.end()) {
UnrefFile(add_it->second);
add_files.erase(add_it);
}
auto& del_files = level_state.deleted_files;
assert(del_files.find(file_number) == del_files.end());
del_files.emplace(file_number);
Improve consistency checks in VersionBuilder (#6901) Summary: The patch cleans up the code and improves the consistency checks around adding/deleting table files in `VersionBuilder`. Namely, it makes the checks stricter and improves them in the following ways: 1) A table file can now only be deleted from the LSM tree using the level it resides on. Earlier, there was some unnecessary wiggle room for trivially moved files (they could be deleted using a lower level number than the actual one). 2) A table file cannot be added to the tree if it is already present in the tree on any level (not just the target level). The earlier code only had an assertion (which is a no-op in release builds) that the newly added file is not already present on the target level. 3) The above consistency checks around state transitions are now mandatory, as opposed to the earlier `CheckConsistencyForDeletes`, which was a no-op in release mode unless `force_consistency_checks` was set to `true`. The rationale here is that assuming that the initial state is consistent, a valid transition leads to a next state that is also consistent; however, an *invalid* transition offers no such guarantee. Hence it makes sense to validate the transitions unconditionally, and save `force_consistency_checks` for the paranoid checks that re-validate the entire state. 4) The new checks build on the mechanism introduced in https://github.com/facebook/rocksdb/pull/6862, which enables us to efficiently look up the location (level and position within level) of files in a `Version` by file number. This makes the consistency checks much more efficient than the earlier `CheckConsistencyForDeletes`, which essentially performed a linear search. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6901 Test Plan: Extended the unit tests and ran: `make check` `make whitebox_crash_test` Reviewed By: ajkr Differential Revision: D21822714 Pulled By: ltamasi fbshipit-source-id: e2b29c8b6da1bf0f59004acc889e4870b2d18215
5 years ago
table_file_levels_[file_number] =
VersionStorageInfo::FileLocation::Invalid().GetLevel();
return Status::OK();
}
Status ApplyFileAddition(int level, const FileMetaData& meta) {
assert(level != VersionStorageInfo::FileLocation::Invalid().GetLevel());
const uint64_t file_number = meta.fd.GetNumber();
const int current_level = GetCurrentLevelForTableFile(file_number);
if (current_level !=
VersionStorageInfo::FileLocation::Invalid().GetLevel()) {
if (level >= num_levels_) {
has_invalid_levels_ = true;
}
std::ostringstream oss;
oss << "Cannot add table file #" << file_number << " to level " << level
<< " since it is already in the LSM tree on level " << current_level;
return Status::Corruption("VersionBuilder", oss.str());
}
if (level >= num_levels_) {
++invalid_level_sizes_[level];
table_file_levels_[file_number] = level;
return Status::OK();
}
auto& level_state = levels_[level];
auto& del_files = level_state.deleted_files;
auto del_it = del_files.find(file_number);
if (del_it != del_files.end()) {
del_files.erase(del_it);
}
FileMetaData* const f = new FileMetaData(meta);
f->refs = 1;
Account memory of FileMetaData in global memory limit (#9924) Summary: **Context/Summary:** As revealed by heap profiling, allocation of `FileMetaData` for [newly created file added to a Version](https://github.com/facebook/rocksdb/pull/9924/files#diff-a6aa385940793f95a2c5b39cc670bd440c4547fa54fd44622f756382d5e47e43R774) can consume significant heap memory. This PR is to account that toward our global memory limit based on block cache capacity. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9924 Test Plan: - Previous `make check` verified there are only 2 places where the memory of the allocated `FileMetaData` can be released - New unit test `TEST_P(ChargeFileMetadataTestWithParam, Basic)` - db bench (CPU cost of `charge_file_metadata` in write and compact) - **write micros/op: -0.24%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 (remove this option for pre-PR) -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 | egrep 'fillseq'` - **compact micros/op -0.87%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 -numdistinct=1000 && ./db_bench -benchmarks=compact -db=$TEST_TMPDIR -use_existing_db=1 -charge_file_metadata=1 -disable_auto_compactions=1 | egrep 'compact'` table 1 - write #-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%) -- | -- | -- | -- | -- | -- 10 | 3.9711 | 0.264408 | 3.9914 | 0.254563 | 0.5111933721 20 | 3.83905 | 0.0664488 | 3.8251 | 0.0695456 | -0.3633711465 40 | 3.86625 | 0.136669 | 3.8867 | 0.143765 | 0.5289363078 80 | 3.87828 | 0.119007 | 3.86791 | 0.115674 | **-0.2673865734** 160 | 3.87677 | 0.162231 | 3.86739 | 0.16663 | **-0.2419539978** table 2 - compact #-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%) -- | -- | -- | -- | -- | -- 10 | 2,399,650.00 | 96,375.80 | 2,359,537.00 | 53,243.60 | -1.67 20 | 2,410,480.00 | 89,988.00 | 2,433,580.00 | 91,121.20 | 0.96 40 | 2.41E+06 | 121811 | 2.39E+06 | 131525 | **-0.96** 80 | 2.40E+06 | 134503 | 2.39E+06 | 108799 | **-0.78** - stress test: `python3 tools/db_crashtest.py blackbox --charge_file_metadata=1 --cache_size=1` killed as normal Reviewed By: ajkr Differential Revision: D36055583 Pulled By: hx235 fbshipit-source-id: b60eab94707103cb1322cf815f05810ef0232625
2 years ago
if (file_metadata_cache_res_mgr_) {
Status s = file_metadata_cache_res_mgr_->UpdateCacheReservation(
f->ApproximateMemoryUsage(), true /* increase */);
if (!s.ok()) {
delete f;
s = Status::MemoryLimit(
"Can't allocate " +
kCacheEntryRoleToCamelString[static_cast<std::uint32_t>(
CacheEntryRole::kFileMetadata)] +
" due to exceeding the memory limit "
"based on "
"cache capacity");
return s;
}
}
auto& add_files = level_state.added_files;
assert(add_files.find(file_number) == add_files.end());
add_files.emplace(file_number, f);
const uint64_t blob_file_number = f->oldest_blob_file_number;
Aggregate blob file related changes in VersionBuilder as VersionEdits are applied (#9085) Summary: The current VersionBuilder code on mainline keeps track of blob file related changes ("delta") induced by a series of `VersionEdit`s in the form of `BlobFileMetaDataDelta` objects. Specifically, `BlobFileMetaDataDelta` contains the amount of additional garbage generated by compactions, as well as the set of newly linked/unlinked SSTs. This is very handy for detecting trivial moves, since in that case the newly linked and unlinked SSTs cancel each other out. However, this representation does not allow us to easily tell whether a certain blob file is obsolete after applying a set of `VersionEdit`s or not. In order to solve this issue, the patch introduces `MutableBlobFileMetaData`, which, in addition to the delta, also contains the materialized state after applying a set of version edits (i.e. the total amount of garbage and the resulting set of linked SSTs). This will enable us to add further consistency checks and to improve certain pieces of functionality where knowing up front which blob files get obsoleted is beneficial. (Note: this patch is just the refactoring part; I plan to create separate PRs for the enhancements.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9085 Test Plan: Ran `make check` and the stress tests in BlobDB mode. Reviewed By: riversand963 Differential Revision: D31980867 Pulled By: ltamasi fbshipit-source-id: cc4286778b10900af720423d6b772c77f28a93e3
3 years ago
if (blob_file_number != kInvalidBlobFileNumber) {
MutableBlobFileMetaData* const mutable_meta =
GetOrCreateMutableBlobFileMetaData(blob_file_number);
if (mutable_meta) {
mutable_meta->LinkSst(file_number);
}
}
Improve consistency checks in VersionBuilder (#6901) Summary: The patch cleans up the code and improves the consistency checks around adding/deleting table files in `VersionBuilder`. Namely, it makes the checks stricter and improves them in the following ways: 1) A table file can now only be deleted from the LSM tree using the level it resides on. Earlier, there was some unnecessary wiggle room for trivially moved files (they could be deleted using a lower level number than the actual one). 2) A table file cannot be added to the tree if it is already present in the tree on any level (not just the target level). The earlier code only had an assertion (which is a no-op in release builds) that the newly added file is not already present on the target level. 3) The above consistency checks around state transitions are now mandatory, as opposed to the earlier `CheckConsistencyForDeletes`, which was a no-op in release mode unless `force_consistency_checks` was set to `true`. The rationale here is that assuming that the initial state is consistent, a valid transition leads to a next state that is also consistent; however, an *invalid* transition offers no such guarantee. Hence it makes sense to validate the transitions unconditionally, and save `force_consistency_checks` for the paranoid checks that re-validate the entire state. 4) The new checks build on the mechanism introduced in https://github.com/facebook/rocksdb/pull/6862, which enables us to efficiently look up the location (level and position within level) of files in a `Version` by file number. This makes the consistency checks much more efficient than the earlier `CheckConsistencyForDeletes`, which essentially performed a linear search. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6901 Test Plan: Extended the unit tests and ran: `make check` `make whitebox_crash_test` Reviewed By: ajkr Differential Revision: D21822714 Pulled By: ltamasi fbshipit-source-id: e2b29c8b6da1bf0f59004acc889e4870b2d18215
5 years ago
table_file_levels_[file_number] = level;
return Status::OK();
}
Add basic kRoundRobin compaction policy (#10107) Summary: Add `kRoundRobin` as a compaction priority. The implementation is as follows. - Define a cursor as the smallest Internal key in the successor of the selected file. Add `vector<InternalKey> compact_cursor_` into `VersionStorageInfo` where each element (`InternalKey`) in `compact_cursor_` represents a cursor. In round-robin compaction policy, we just need to select the first file (assuming files are sorted) and also has the smallest InternalKey larger than/equal to the cursor. After a file is chosen, we create a new `Fsize` vector which puts the selected file is placed at the first position in `temp`, the next cursor is then updated as the smallest InternalKey in successor of the selected file (the above logic is implemented in `SortFileByRoundRobin`). - After a compaction succeeds, typically `InstallCompactionResults()`, we choose the next cursor for the input level and save it to `edit`. When calling `LogAndApply`, we save the next cursor with its level into some local variable and finally apply the change to `vstorage` in `SaveTo` function. - Cursors are persist pair by pair (<level, InternalKey>) in `EncodeTo` so that they can be reconstructed when reopening. An empty cursor will not be encoded to MANIFEST Pull Request resolved: https://github.com/facebook/rocksdb/pull/10107 Test Plan: add unit test (`CompactionPriRoundRobin`) in `compaction_picker_test`, add `kRoundRobin` priority in `CompactionPriTest` from `db_compaction_test`, and add `PersistRoundRobinCompactCursor` in `db_compaction_test` Reviewed By: ajkr Differential Revision: D37316037 Pulled By: littlepig2013 fbshipit-source-id: 9f481748190ace416079139044e00df2968fb1ee
2 years ago
Status ApplyCompactCursors(int level,
const InternalKey& smallest_uncompacted_key) {
if (level < 0) {
std::ostringstream oss;
oss << "Cannot add compact cursor (" << level << ","
<< smallest_uncompacted_key.Encode().ToString()
<< " due to invalid level (level = " << level << ")";
return Status::Corruption("VersionBuilder", oss.str());
}
if (level < num_levels_) {
// Omit levels (>= num_levels_) when re-open with shrinking num_levels_
updated_compact_cursors_[level] = smallest_uncompacted_key;
}
return Status::OK();
}
// Apply all of the edits in *edit to the current state.
Status Apply(const VersionEdit* edit) {
Improve consistency checks in VersionBuilder (#6901) Summary: The patch cleans up the code and improves the consistency checks around adding/deleting table files in `VersionBuilder`. Namely, it makes the checks stricter and improves them in the following ways: 1) A table file can now only be deleted from the LSM tree using the level it resides on. Earlier, there was some unnecessary wiggle room for trivially moved files (they could be deleted using a lower level number than the actual one). 2) A table file cannot be added to the tree if it is already present in the tree on any level (not just the target level). The earlier code only had an assertion (which is a no-op in release builds) that the newly added file is not already present on the target level. 3) The above consistency checks around state transitions are now mandatory, as opposed to the earlier `CheckConsistencyForDeletes`, which was a no-op in release mode unless `force_consistency_checks` was set to `true`. The rationale here is that assuming that the initial state is consistent, a valid transition leads to a next state that is also consistent; however, an *invalid* transition offers no such guarantee. Hence it makes sense to validate the transitions unconditionally, and save `force_consistency_checks` for the paranoid checks that re-validate the entire state. 4) The new checks build on the mechanism introduced in https://github.com/facebook/rocksdb/pull/6862, which enables us to efficiently look up the location (level and position within level) of files in a `Version` by file number. This makes the consistency checks much more efficient than the earlier `CheckConsistencyForDeletes`, which essentially performed a linear search. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6901 Test Plan: Extended the unit tests and ran: `make check` `make whitebox_crash_test` Reviewed By: ajkr Differential Revision: D21822714 Pulled By: ltamasi fbshipit-source-id: e2b29c8b6da1bf0f59004acc889e4870b2d18215
5 years ago
{
const Status s = CheckConsistency(base_vstorage_);
if (!s.ok()) {
return s;
}
}
// Note: we process the blob file related changes first because the
// table file addition/deletion logic depends on the blob files
// already being there.
// Add new blob files
for (const auto& blob_file_addition : edit->GetBlobFileAdditions()) {
const Status s = ApplyBlobFileAddition(blob_file_addition);
if (!s.ok()) {
return s;
}
}
// Increase the amount of garbage for blob files affected by GC
for (const auto& blob_file_garbage : edit->GetBlobFileGarbages()) {
const Status s = ApplyBlobFileGarbage(blob_file_garbage);
if (!s.ok()) {
return s;
}
}
// Delete table files
Improve consistency checks in VersionBuilder (#6901) Summary: The patch cleans up the code and improves the consistency checks around adding/deleting table files in `VersionBuilder`. Namely, it makes the checks stricter and improves them in the following ways: 1) A table file can now only be deleted from the LSM tree using the level it resides on. Earlier, there was some unnecessary wiggle room for trivially moved files (they could be deleted using a lower level number than the actual one). 2) A table file cannot be added to the tree if it is already present in the tree on any level (not just the target level). The earlier code only had an assertion (which is a no-op in release builds) that the newly added file is not already present on the target level. 3) The above consistency checks around state transitions are now mandatory, as opposed to the earlier `CheckConsistencyForDeletes`, which was a no-op in release mode unless `force_consistency_checks` was set to `true`. The rationale here is that assuming that the initial state is consistent, a valid transition leads to a next state that is also consistent; however, an *invalid* transition offers no such guarantee. Hence it makes sense to validate the transitions unconditionally, and save `force_consistency_checks` for the paranoid checks that re-validate the entire state. 4) The new checks build on the mechanism introduced in https://github.com/facebook/rocksdb/pull/6862, which enables us to efficiently look up the location (level and position within level) of files in a `Version` by file number. This makes the consistency checks much more efficient than the earlier `CheckConsistencyForDeletes`, which essentially performed a linear search. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6901 Test Plan: Extended the unit tests and ran: `make check` `make whitebox_crash_test` Reviewed By: ajkr Differential Revision: D21822714 Pulled By: ltamasi fbshipit-source-id: e2b29c8b6da1bf0f59004acc889e4870b2d18215
5 years ago
for (const auto& deleted_file : edit->GetDeletedFiles()) {
const int level = deleted_file.first;
const uint64_t file_number = deleted_file.second;
Improve consistency checks in VersionBuilder (#6901) Summary: The patch cleans up the code and improves the consistency checks around adding/deleting table files in `VersionBuilder`. Namely, it makes the checks stricter and improves them in the following ways: 1) A table file can now only be deleted from the LSM tree using the level it resides on. Earlier, there was some unnecessary wiggle room for trivially moved files (they could be deleted using a lower level number than the actual one). 2) A table file cannot be added to the tree if it is already present in the tree on any level (not just the target level). The earlier code only had an assertion (which is a no-op in release builds) that the newly added file is not already present on the target level. 3) The above consistency checks around state transitions are now mandatory, as opposed to the earlier `CheckConsistencyForDeletes`, which was a no-op in release mode unless `force_consistency_checks` was set to `true`. The rationale here is that assuming that the initial state is consistent, a valid transition leads to a next state that is also consistent; however, an *invalid* transition offers no such guarantee. Hence it makes sense to validate the transitions unconditionally, and save `force_consistency_checks` for the paranoid checks that re-validate the entire state. 4) The new checks build on the mechanism introduced in https://github.com/facebook/rocksdb/pull/6862, which enables us to efficiently look up the location (level and position within level) of files in a `Version` by file number. This makes the consistency checks much more efficient than the earlier `CheckConsistencyForDeletes`, which essentially performed a linear search. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6901 Test Plan: Extended the unit tests and ran: `make check` `make whitebox_crash_test` Reviewed By: ajkr Differential Revision: D21822714 Pulled By: ltamasi fbshipit-source-id: e2b29c8b6da1bf0f59004acc889e4870b2d18215
5 years ago
const Status s = ApplyFileDeletion(level, file_number);
if (!s.ok()) {
return s;
}
}
// Add new table files
for (const auto& new_file : edit->GetNewFiles()) {
const int level = new_file.first;
Improve consistency checks in VersionBuilder (#6901) Summary: The patch cleans up the code and improves the consistency checks around adding/deleting table files in `VersionBuilder`. Namely, it makes the checks stricter and improves them in the following ways: 1) A table file can now only be deleted from the LSM tree using the level it resides on. Earlier, there was some unnecessary wiggle room for trivially moved files (they could be deleted using a lower level number than the actual one). 2) A table file cannot be added to the tree if it is already present in the tree on any level (not just the target level). The earlier code only had an assertion (which is a no-op in release builds) that the newly added file is not already present on the target level. 3) The above consistency checks around state transitions are now mandatory, as opposed to the earlier `CheckConsistencyForDeletes`, which was a no-op in release mode unless `force_consistency_checks` was set to `true`. The rationale here is that assuming that the initial state is consistent, a valid transition leads to a next state that is also consistent; however, an *invalid* transition offers no such guarantee. Hence it makes sense to validate the transitions unconditionally, and save `force_consistency_checks` for the paranoid checks that re-validate the entire state. 4) The new checks build on the mechanism introduced in https://github.com/facebook/rocksdb/pull/6862, which enables us to efficiently look up the location (level and position within level) of files in a `Version` by file number. This makes the consistency checks much more efficient than the earlier `CheckConsistencyForDeletes`, which essentially performed a linear search. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6901 Test Plan: Extended the unit tests and ran: `make check` `make whitebox_crash_test` Reviewed By: ajkr Differential Revision: D21822714 Pulled By: ltamasi fbshipit-source-id: e2b29c8b6da1bf0f59004acc889e4870b2d18215
5 years ago
const FileMetaData& meta = new_file.second;
const Status s = ApplyFileAddition(level, meta);
if (!s.ok()) {
return s;
}
}
Add blob files to VersionStorageInfo/VersionBuilder (#6597) Summary: The patch adds a couple of classes to represent metadata about blob files: `SharedBlobFileMetaData` contains the information elements that are immutable (once the blob file is closed), e.g. blob file number, total number and size of blob files, checksum method/value, while `BlobFileMetaData` contains attributes that can vary across versions like the amount of garbage in the file. There is a single `SharedBlobFileMetaData` for each blob file, which is jointly owned by the `BlobFileMetaData` objects that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s and can also be shared if the (immutable _and_ mutable) state of the blob file is the same in two versions. In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends `VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata to a new `VersionStorageInfo`. Consistency checks are also extended to ensure that table files point to blob files that are part of the `Version`, and that all blob files that are part of any given `Version` have at least some _non_-garbage data in them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20656803 Pulled By: ltamasi fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
5 years ago
Add basic kRoundRobin compaction policy (#10107) Summary: Add `kRoundRobin` as a compaction priority. The implementation is as follows. - Define a cursor as the smallest Internal key in the successor of the selected file. Add `vector<InternalKey> compact_cursor_` into `VersionStorageInfo` where each element (`InternalKey`) in `compact_cursor_` represents a cursor. In round-robin compaction policy, we just need to select the first file (assuming files are sorted) and also has the smallest InternalKey larger than/equal to the cursor. After a file is chosen, we create a new `Fsize` vector which puts the selected file is placed at the first position in `temp`, the next cursor is then updated as the smallest InternalKey in successor of the selected file (the above logic is implemented in `SortFileByRoundRobin`). - After a compaction succeeds, typically `InstallCompactionResults()`, we choose the next cursor for the input level and save it to `edit`. When calling `LogAndApply`, we save the next cursor with its level into some local variable and finally apply the change to `vstorage` in `SaveTo` function. - Cursors are persist pair by pair (<level, InternalKey>) in `EncodeTo` so that they can be reconstructed when reopening. An empty cursor will not be encoded to MANIFEST Pull Request resolved: https://github.com/facebook/rocksdb/pull/10107 Test Plan: add unit test (`CompactionPriRoundRobin`) in `compaction_picker_test`, add `kRoundRobin` priority in `CompactionPriTest` from `db_compaction_test`, and add `PersistRoundRobinCompactCursor` in `db_compaction_test` Reviewed By: ajkr Differential Revision: D37316037 Pulled By: littlepig2013 fbshipit-source-id: 9f481748190ace416079139044e00df2968fb1ee
2 years ago
// Populate compact cursors for round-robin compaction, leave
// the cursor to be empty to indicate it is invalid
for (const auto& cursor : edit->GetCompactCursors()) {
const int level = cursor.first;
const InternalKey smallest_uncompacted_key = cursor.second;
const Status s = ApplyCompactCursors(level, smallest_uncompacted_key);
if (!s.ok()) {
return s;
}
}
return Status::OK();
}
// Helper function template for merging the blob file metadata from the base
// version with the mutable metadata representing the state after applying the
// edits. The function objects process_base and process_mutable are
// respectively called to handle a base version object when there is no
// matching mutable object, and a mutable object when there is no matching
// base version object. process_both is called to perform the merge when a
// given blob file appears both in the base version and the mutable list. The
// helper stops processing objects if a function object returns false. Blob
// files with a file number below first_blob_file are not processed.
template <typename ProcessBase, typename ProcessMutable, typename ProcessBoth>
void MergeBlobFileMetas(uint64_t first_blob_file, ProcessBase process_base,
ProcessMutable process_mutable,
ProcessBoth process_both) const {
Add blob files to VersionStorageInfo/VersionBuilder (#6597) Summary: The patch adds a couple of classes to represent metadata about blob files: `SharedBlobFileMetaData` contains the information elements that are immutable (once the blob file is closed), e.g. blob file number, total number and size of blob files, checksum method/value, while `BlobFileMetaData` contains attributes that can vary across versions like the amount of garbage in the file. There is a single `SharedBlobFileMetaData` for each blob file, which is jointly owned by the `BlobFileMetaData` objects that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s and can also be shared if the (immutable _and_ mutable) state of the blob file is the same in two versions. In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends `VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata to a new `VersionStorageInfo`. Consistency checks are also extended to ensure that table files point to blob files that are part of the `Version`, and that all blob files that are part of any given `Version` have at least some _non_-garbage data in them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20656803 Pulled By: ltamasi fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
5 years ago
assert(base_vstorage_);
Clean up blob files based on the linked SST set (#7001) Summary: The earlier `VersionBuilder` code only cleaned up blob files that were marked as entirely consisting of garbage using `VersionEdits` with `BlobFileGarbage`. This covers the cases when table files go through regular compaction, where we iterate through the KVs and thus have an opportunity to calculate the amount of garbage (that is, most cases). However, it does not help when table files are simply dropped (e.g. deletion compactions or the `DeleteFile` API). To deal with such cases, the patch adds logic that cleans up all blob files at the head of the list until the first one with linked SSTs is found. (As an example, let's assume we have blob files with numbers 1..10, and the first one with any linked SSTs is number 8. This means that SSTs in the `Version` only rely on blob files with numbers >= 8, and thus 1..7 are no longer needed.) The code change itself is pretty small; however, changing the logic like this necessitated changes to some tests that have been added recently (namely to the ones that use blob files in isolation, i.e. without any table files referring to them). Some of these cases were fixed by bypassing `VersionBuilder` altogether in order to keep the tests simple (which actually makes them more proper unit tests as well), while the `VersionBuilder` unit tests were fixed by adding dummy table files to the test cases as needed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7001 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D22119474 Pulled By: ltamasi fbshipit-source-id: c6547141355667d4291d9661d6518eb741e7b54a
4 years ago
Use a sorted vector instead of a map to store blob file metadata (#9526) Summary: The patch replaces `std::map` with a sorted `std::vector` for `VersionStorageInfo::blob_files_` and preallocates the space for the `vector` before saving the `BlobFileMetaData` into the new `VersionStorageInfo` in `VersionBuilder::Rep::SaveBlobFilesTo`. These changes reduce the time the DB mutex is held while saving new `Version`s, and using a sorted `vector` also makes lookups faster thanks to better memory locality. In addition, the patch introduces helper methods `VersionStorageInfo::GetBlobFileMetaData` and `VersionStorageInfo::GetBlobFileMetaDataLB` that can be used by clients to perform lookups in the `vector`, and does some general cleanup in the parts of code where blob file metadata are used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9526 Test Plan: Ran `make check` and the crash test script for a while. Performance was tested using a load-optimized benchmark (`fillseq` with vector memtable, no WAL) and small file sizes so that a significant number of files are produced: ``` numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/ltamasi-dbbench --wal_dir=/data/ltamasi-dbbench --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --enable_blob_files=1 --blob_file_size=16777216 --min_blob_size=0 --blob_compression_type=lz4 --enable_blob_garbage_collection=1 --seed=<some value> ``` Final statistics before the patch: ``` Cumulative writes: 0 writes, 700M keys, 0 commit groups, 0.0 writes per commit group, ingest: 284.62 GB, 121.27 MB/s Interval writes: 0 writes, 334K keys, 0 commit groups, 0.0 writes per commit group, ingest: 139.28 MB, 72.46 MB/s ``` With the patch: ``` Cumulative writes: 0 writes, 760M keys, 0 commit groups, 0.0 writes per commit group, ingest: 308.66 GB, 131.52 MB/s Interval writes: 0 writes, 445K keys, 0 commit groups, 0.0 writes per commit group, ingest: 185.35 MB, 93.15 MB/s ``` Total time to complete the benchmark is 2611 seconds with the patch, down from 2986 secs. Reviewed By: riversand963 Differential Revision: D34082728 Pulled By: ltamasi fbshipit-source-id: fc598abf676dce436734d06bb9d2d99a26a004fc
3 years ago
auto base_it = base_vstorage_->GetBlobFileMetaDataLB(first_blob_file);
const auto base_it_end = base_vstorage_->GetBlobFiles().end();
Add blob files to VersionStorageInfo/VersionBuilder (#6597) Summary: The patch adds a couple of classes to represent metadata about blob files: `SharedBlobFileMetaData` contains the information elements that are immutable (once the blob file is closed), e.g. blob file number, total number and size of blob files, checksum method/value, while `BlobFileMetaData` contains attributes that can vary across versions like the amount of garbage in the file. There is a single `SharedBlobFileMetaData` for each blob file, which is jointly owned by the `BlobFileMetaData` objects that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s and can also be shared if the (immutable _and_ mutable) state of the blob file is the same in two versions. In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends `VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata to a new `VersionStorageInfo`. Consistency checks are also extended to ensure that table files point to blob files that are part of the `Version`, and that all blob files that are part of any given `Version` have at least some _non_-garbage data in them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20656803 Pulled By: ltamasi fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
5 years ago
auto mutable_it = mutable_blob_file_metas_.lower_bound(first_blob_file);
Aggregate blob file related changes in VersionBuilder as VersionEdits are applied (#9085) Summary: The current VersionBuilder code on mainline keeps track of blob file related changes ("delta") induced by a series of `VersionEdit`s in the form of `BlobFileMetaDataDelta` objects. Specifically, `BlobFileMetaDataDelta` contains the amount of additional garbage generated by compactions, as well as the set of newly linked/unlinked SSTs. This is very handy for detecting trivial moves, since in that case the newly linked and unlinked SSTs cancel each other out. However, this representation does not allow us to easily tell whether a certain blob file is obsolete after applying a set of `VersionEdit`s or not. In order to solve this issue, the patch introduces `MutableBlobFileMetaData`, which, in addition to the delta, also contains the materialized state after applying a set of version edits (i.e. the total amount of garbage and the resulting set of linked SSTs). This will enable us to add further consistency checks and to improve certain pieces of functionality where knowing up front which blob files get obsoleted is beneficial. (Note: this patch is just the refactoring part; I plan to create separate PRs for the enhancements.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9085 Test Plan: Ran `make check` and the stress tests in BlobDB mode. Reviewed By: riversand963 Differential Revision: D31980867 Pulled By: ltamasi fbshipit-source-id: cc4286778b10900af720423d6b772c77f28a93e3
3 years ago
const auto mutable_it_end = mutable_blob_file_metas_.end();
Add blob files to VersionStorageInfo/VersionBuilder (#6597) Summary: The patch adds a couple of classes to represent metadata about blob files: `SharedBlobFileMetaData` contains the information elements that are immutable (once the blob file is closed), e.g. blob file number, total number and size of blob files, checksum method/value, while `BlobFileMetaData` contains attributes that can vary across versions like the amount of garbage in the file. There is a single `SharedBlobFileMetaData` for each blob file, which is jointly owned by the `BlobFileMetaData` objects that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s and can also be shared if the (immutable _and_ mutable) state of the blob file is the same in two versions. In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends `VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata to a new `VersionStorageInfo`. Consistency checks are also extended to ensure that table files point to blob files that are part of the `Version`, and that all blob files that are part of any given `Version` have at least some _non_-garbage data in them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20656803 Pulled By: ltamasi fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
5 years ago
Aggregate blob file related changes in VersionBuilder as VersionEdits are applied (#9085) Summary: The current VersionBuilder code on mainline keeps track of blob file related changes ("delta") induced by a series of `VersionEdit`s in the form of `BlobFileMetaDataDelta` objects. Specifically, `BlobFileMetaDataDelta` contains the amount of additional garbage generated by compactions, as well as the set of newly linked/unlinked SSTs. This is very handy for detecting trivial moves, since in that case the newly linked and unlinked SSTs cancel each other out. However, this representation does not allow us to easily tell whether a certain blob file is obsolete after applying a set of `VersionEdit`s or not. In order to solve this issue, the patch introduces `MutableBlobFileMetaData`, which, in addition to the delta, also contains the materialized state after applying a set of version edits (i.e. the total amount of garbage and the resulting set of linked SSTs). This will enable us to add further consistency checks and to improve certain pieces of functionality where knowing up front which blob files get obsoleted is beneficial. (Note: this patch is just the refactoring part; I plan to create separate PRs for the enhancements.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9085 Test Plan: Ran `make check` and the stress tests in BlobDB mode. Reviewed By: riversand963 Differential Revision: D31980867 Pulled By: ltamasi fbshipit-source-id: cc4286778b10900af720423d6b772c77f28a93e3
3 years ago
while (base_it != base_it_end && mutable_it != mutable_it_end) {
Use a sorted vector instead of a map to store blob file metadata (#9526) Summary: The patch replaces `std::map` with a sorted `std::vector` for `VersionStorageInfo::blob_files_` and preallocates the space for the `vector` before saving the `BlobFileMetaData` into the new `VersionStorageInfo` in `VersionBuilder::Rep::SaveBlobFilesTo`. These changes reduce the time the DB mutex is held while saving new `Version`s, and using a sorted `vector` also makes lookups faster thanks to better memory locality. In addition, the patch introduces helper methods `VersionStorageInfo::GetBlobFileMetaData` and `VersionStorageInfo::GetBlobFileMetaDataLB` that can be used by clients to perform lookups in the `vector`, and does some general cleanup in the parts of code where blob file metadata are used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9526 Test Plan: Ran `make check` and the crash test script for a while. Performance was tested using a load-optimized benchmark (`fillseq` with vector memtable, no WAL) and small file sizes so that a significant number of files are produced: ``` numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/ltamasi-dbbench --wal_dir=/data/ltamasi-dbbench --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --enable_blob_files=1 --blob_file_size=16777216 --min_blob_size=0 --blob_compression_type=lz4 --enable_blob_garbage_collection=1 --seed=<some value> ``` Final statistics before the patch: ``` Cumulative writes: 0 writes, 700M keys, 0 commit groups, 0.0 writes per commit group, ingest: 284.62 GB, 121.27 MB/s Interval writes: 0 writes, 334K keys, 0 commit groups, 0.0 writes per commit group, ingest: 139.28 MB, 72.46 MB/s ``` With the patch: ``` Cumulative writes: 0 writes, 760M keys, 0 commit groups, 0.0 writes per commit group, ingest: 308.66 GB, 131.52 MB/s Interval writes: 0 writes, 445K keys, 0 commit groups, 0.0 writes per commit group, ingest: 185.35 MB, 93.15 MB/s ``` Total time to complete the benchmark is 2611 seconds with the patch, down from 2986 secs. Reviewed By: riversand963 Differential Revision: D34082728 Pulled By: ltamasi fbshipit-source-id: fc598abf676dce436734d06bb9d2d99a26a004fc
3 years ago
const auto& base_meta = *base_it;
assert(base_meta);
const uint64_t base_blob_file_number = base_meta->GetBlobFileNumber();
Aggregate blob file related changes in VersionBuilder as VersionEdits are applied (#9085) Summary: The current VersionBuilder code on mainline keeps track of blob file related changes ("delta") induced by a series of `VersionEdit`s in the form of `BlobFileMetaDataDelta` objects. Specifically, `BlobFileMetaDataDelta` contains the amount of additional garbage generated by compactions, as well as the set of newly linked/unlinked SSTs. This is very handy for detecting trivial moves, since in that case the newly linked and unlinked SSTs cancel each other out. However, this representation does not allow us to easily tell whether a certain blob file is obsolete after applying a set of `VersionEdit`s or not. In order to solve this issue, the patch introduces `MutableBlobFileMetaData`, which, in addition to the delta, also contains the materialized state after applying a set of version edits (i.e. the total amount of garbage and the resulting set of linked SSTs). This will enable us to add further consistency checks and to improve certain pieces of functionality where knowing up front which blob files get obsoleted is beneficial. (Note: this patch is just the refactoring part; I plan to create separate PRs for the enhancements.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9085 Test Plan: Ran `make check` and the stress tests in BlobDB mode. Reviewed By: riversand963 Differential Revision: D31980867 Pulled By: ltamasi fbshipit-source-id: cc4286778b10900af720423d6b772c77f28a93e3
3 years ago
const uint64_t mutable_blob_file_number = mutable_it->first;
Add blob files to VersionStorageInfo/VersionBuilder (#6597) Summary: The patch adds a couple of classes to represent metadata about blob files: `SharedBlobFileMetaData` contains the information elements that are immutable (once the blob file is closed), e.g. blob file number, total number and size of blob files, checksum method/value, while `BlobFileMetaData` contains attributes that can vary across versions like the amount of garbage in the file. There is a single `SharedBlobFileMetaData` for each blob file, which is jointly owned by the `BlobFileMetaData` objects that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s and can also be shared if the (immutable _and_ mutable) state of the blob file is the same in two versions. In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends `VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata to a new `VersionStorageInfo`. Consistency checks are also extended to ensure that table files point to blob files that are part of the `Version`, and that all blob files that are part of any given `Version` have at least some _non_-garbage data in them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20656803 Pulled By: ltamasi fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
5 years ago
Aggregate blob file related changes in VersionBuilder as VersionEdits are applied (#9085) Summary: The current VersionBuilder code on mainline keeps track of blob file related changes ("delta") induced by a series of `VersionEdit`s in the form of `BlobFileMetaDataDelta` objects. Specifically, `BlobFileMetaDataDelta` contains the amount of additional garbage generated by compactions, as well as the set of newly linked/unlinked SSTs. This is very handy for detecting trivial moves, since in that case the newly linked and unlinked SSTs cancel each other out. However, this representation does not allow us to easily tell whether a certain blob file is obsolete after applying a set of `VersionEdit`s or not. In order to solve this issue, the patch introduces `MutableBlobFileMetaData`, which, in addition to the delta, also contains the materialized state after applying a set of version edits (i.e. the total amount of garbage and the resulting set of linked SSTs). This will enable us to add further consistency checks and to improve certain pieces of functionality where knowing up front which blob files get obsoleted is beneficial. (Note: this patch is just the refactoring part; I plan to create separate PRs for the enhancements.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9085 Test Plan: Ran `make check` and the stress tests in BlobDB mode. Reviewed By: riversand963 Differential Revision: D31980867 Pulled By: ltamasi fbshipit-source-id: cc4286778b10900af720423d6b772c77f28a93e3
3 years ago
if (base_blob_file_number < mutable_blob_file_number) {
if (!process_base(base_meta)) {
return;
}
Add blob files to VersionStorageInfo/VersionBuilder (#6597) Summary: The patch adds a couple of classes to represent metadata about blob files: `SharedBlobFileMetaData` contains the information elements that are immutable (once the blob file is closed), e.g. blob file number, total number and size of blob files, checksum method/value, while `BlobFileMetaData` contains attributes that can vary across versions like the amount of garbage in the file. There is a single `SharedBlobFileMetaData` for each blob file, which is jointly owned by the `BlobFileMetaData` objects that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s and can also be shared if the (immutable _and_ mutable) state of the blob file is the same in two versions. In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends `VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata to a new `VersionStorageInfo`. Consistency checks are also extended to ensure that table files point to blob files that are part of the `Version`, and that all blob files that are part of any given `Version` have at least some _non_-garbage data in them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20656803 Pulled By: ltamasi fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
5 years ago
++base_it;
Aggregate blob file related changes in VersionBuilder as VersionEdits are applied (#9085) Summary: The current VersionBuilder code on mainline keeps track of blob file related changes ("delta") induced by a series of `VersionEdit`s in the form of `BlobFileMetaDataDelta` objects. Specifically, `BlobFileMetaDataDelta` contains the amount of additional garbage generated by compactions, as well as the set of newly linked/unlinked SSTs. This is very handy for detecting trivial moves, since in that case the newly linked and unlinked SSTs cancel each other out. However, this representation does not allow us to easily tell whether a certain blob file is obsolete after applying a set of `VersionEdit`s or not. In order to solve this issue, the patch introduces `MutableBlobFileMetaData`, which, in addition to the delta, also contains the materialized state after applying a set of version edits (i.e. the total amount of garbage and the resulting set of linked SSTs). This will enable us to add further consistency checks and to improve certain pieces of functionality where knowing up front which blob files get obsoleted is beneficial. (Note: this patch is just the refactoring part; I plan to create separate PRs for the enhancements.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9085 Test Plan: Ran `make check` and the stress tests in BlobDB mode. Reviewed By: riversand963 Differential Revision: D31980867 Pulled By: ltamasi fbshipit-source-id: cc4286778b10900af720423d6b772c77f28a93e3
3 years ago
} else if (mutable_blob_file_number < base_blob_file_number) {
const auto& mutable_meta = mutable_it->second;
if (!process_mutable(mutable_meta)) {
return;
}
Add blob files to VersionStorageInfo/VersionBuilder (#6597) Summary: The patch adds a couple of classes to represent metadata about blob files: `SharedBlobFileMetaData` contains the information elements that are immutable (once the blob file is closed), e.g. blob file number, total number and size of blob files, checksum method/value, while `BlobFileMetaData` contains attributes that can vary across versions like the amount of garbage in the file. There is a single `SharedBlobFileMetaData` for each blob file, which is jointly owned by the `BlobFileMetaData` objects that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s and can also be shared if the (immutable _and_ mutable) state of the blob file is the same in two versions. In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends `VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata to a new `VersionStorageInfo`. Consistency checks are also extended to ensure that table files point to blob files that are part of the `Version`, and that all blob files that are part of any given `Version` have at least some _non_-garbage data in them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20656803 Pulled By: ltamasi fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
5 years ago
Aggregate blob file related changes in VersionBuilder as VersionEdits are applied (#9085) Summary: The current VersionBuilder code on mainline keeps track of blob file related changes ("delta") induced by a series of `VersionEdit`s in the form of `BlobFileMetaDataDelta` objects. Specifically, `BlobFileMetaDataDelta` contains the amount of additional garbage generated by compactions, as well as the set of newly linked/unlinked SSTs. This is very handy for detecting trivial moves, since in that case the newly linked and unlinked SSTs cancel each other out. However, this representation does not allow us to easily tell whether a certain blob file is obsolete after applying a set of `VersionEdit`s or not. In order to solve this issue, the patch introduces `MutableBlobFileMetaData`, which, in addition to the delta, also contains the materialized state after applying a set of version edits (i.e. the total amount of garbage and the resulting set of linked SSTs). This will enable us to add further consistency checks and to improve certain pieces of functionality where knowing up front which blob files get obsoleted is beneficial. (Note: this patch is just the refactoring part; I plan to create separate PRs for the enhancements.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9085 Test Plan: Ran `make check` and the stress tests in BlobDB mode. Reviewed By: riversand963 Differential Revision: D31980867 Pulled By: ltamasi fbshipit-source-id: cc4286778b10900af720423d6b772c77f28a93e3
3 years ago
++mutable_it;
Add blob files to VersionStorageInfo/VersionBuilder (#6597) Summary: The patch adds a couple of classes to represent metadata about blob files: `SharedBlobFileMetaData` contains the information elements that are immutable (once the blob file is closed), e.g. blob file number, total number and size of blob files, checksum method/value, while `BlobFileMetaData` contains attributes that can vary across versions like the amount of garbage in the file. There is a single `SharedBlobFileMetaData` for each blob file, which is jointly owned by the `BlobFileMetaData` objects that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s and can also be shared if the (immutable _and_ mutable) state of the blob file is the same in two versions. In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends `VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata to a new `VersionStorageInfo`. Consistency checks are also extended to ensure that table files point to blob files that are part of the `Version`, and that all blob files that are part of any given `Version` have at least some _non_-garbage data in them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20656803 Pulled By: ltamasi fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
5 years ago
} else {
Aggregate blob file related changes in VersionBuilder as VersionEdits are applied (#9085) Summary: The current VersionBuilder code on mainline keeps track of blob file related changes ("delta") induced by a series of `VersionEdit`s in the form of `BlobFileMetaDataDelta` objects. Specifically, `BlobFileMetaDataDelta` contains the amount of additional garbage generated by compactions, as well as the set of newly linked/unlinked SSTs. This is very handy for detecting trivial moves, since in that case the newly linked and unlinked SSTs cancel each other out. However, this representation does not allow us to easily tell whether a certain blob file is obsolete after applying a set of `VersionEdit`s or not. In order to solve this issue, the patch introduces `MutableBlobFileMetaData`, which, in addition to the delta, also contains the materialized state after applying a set of version edits (i.e. the total amount of garbage and the resulting set of linked SSTs). This will enable us to add further consistency checks and to improve certain pieces of functionality where knowing up front which blob files get obsoleted is beneficial. (Note: this patch is just the refactoring part; I plan to create separate PRs for the enhancements.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9085 Test Plan: Ran `make check` and the stress tests in BlobDB mode. Reviewed By: riversand963 Differential Revision: D31980867 Pulled By: ltamasi fbshipit-source-id: cc4286778b10900af720423d6b772c77f28a93e3
3 years ago
assert(base_blob_file_number == mutable_blob_file_number);
Add blob files to VersionStorageInfo/VersionBuilder (#6597) Summary: The patch adds a couple of classes to represent metadata about blob files: `SharedBlobFileMetaData` contains the information elements that are immutable (once the blob file is closed), e.g. blob file number, total number and size of blob files, checksum method/value, while `BlobFileMetaData` contains attributes that can vary across versions like the amount of garbage in the file. There is a single `SharedBlobFileMetaData` for each blob file, which is jointly owned by the `BlobFileMetaData` objects that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s and can also be shared if the (immutable _and_ mutable) state of the blob file is the same in two versions. In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends `VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata to a new `VersionStorageInfo`. Consistency checks are also extended to ensure that table files point to blob files that are part of the `Version`, and that all blob files that are part of any given `Version` have at least some _non_-garbage data in them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20656803 Pulled By: ltamasi fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
5 years ago
Aggregate blob file related changes in VersionBuilder as VersionEdits are applied (#9085) Summary: The current VersionBuilder code on mainline keeps track of blob file related changes ("delta") induced by a series of `VersionEdit`s in the form of `BlobFileMetaDataDelta` objects. Specifically, `BlobFileMetaDataDelta` contains the amount of additional garbage generated by compactions, as well as the set of newly linked/unlinked SSTs. This is very handy for detecting trivial moves, since in that case the newly linked and unlinked SSTs cancel each other out. However, this representation does not allow us to easily tell whether a certain blob file is obsolete after applying a set of `VersionEdit`s or not. In order to solve this issue, the patch introduces `MutableBlobFileMetaData`, which, in addition to the delta, also contains the materialized state after applying a set of version edits (i.e. the total amount of garbage and the resulting set of linked SSTs). This will enable us to add further consistency checks and to improve certain pieces of functionality where knowing up front which blob files get obsoleted is beneficial. (Note: this patch is just the refactoring part; I plan to create separate PRs for the enhancements.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9085 Test Plan: Ran `make check` and the stress tests in BlobDB mode. Reviewed By: riversand963 Differential Revision: D31980867 Pulled By: ltamasi fbshipit-source-id: cc4286778b10900af720423d6b772c77f28a93e3
3 years ago
const auto& mutable_meta = mutable_it->second;
if (!process_both(base_meta, mutable_meta)) {
return;
Aggregate blob file related changes in VersionBuilder as VersionEdits are applied (#9085) Summary: The current VersionBuilder code on mainline keeps track of blob file related changes ("delta") induced by a series of `VersionEdit`s in the form of `BlobFileMetaDataDelta` objects. Specifically, `BlobFileMetaDataDelta` contains the amount of additional garbage generated by compactions, as well as the set of newly linked/unlinked SSTs. This is very handy for detecting trivial moves, since in that case the newly linked and unlinked SSTs cancel each other out. However, this representation does not allow us to easily tell whether a certain blob file is obsolete after applying a set of `VersionEdit`s or not. In order to solve this issue, the patch introduces `MutableBlobFileMetaData`, which, in addition to the delta, also contains the materialized state after applying a set of version edits (i.e. the total amount of garbage and the resulting set of linked SSTs). This will enable us to add further consistency checks and to improve certain pieces of functionality where knowing up front which blob files get obsoleted is beneficial. (Note: this patch is just the refactoring part; I plan to create separate PRs for the enhancements.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9085 Test Plan: Ran `make check` and the stress tests in BlobDB mode. Reviewed By: riversand963 Differential Revision: D31980867 Pulled By: ltamasi fbshipit-source-id: cc4286778b10900af720423d6b772c77f28a93e3
3 years ago
}
Add blob files to VersionStorageInfo/VersionBuilder (#6597) Summary: The patch adds a couple of classes to represent metadata about blob files: `SharedBlobFileMetaData` contains the information elements that are immutable (once the blob file is closed), e.g. blob file number, total number and size of blob files, checksum method/value, while `BlobFileMetaData` contains attributes that can vary across versions like the amount of garbage in the file. There is a single `SharedBlobFileMetaData` for each blob file, which is jointly owned by the `BlobFileMetaData` objects that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s and can also be shared if the (immutable _and_ mutable) state of the blob file is the same in two versions. In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends `VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata to a new `VersionStorageInfo`. Consistency checks are also extended to ensure that table files point to blob files that are part of the `Version`, and that all blob files that are part of any given `Version` have at least some _non_-garbage data in them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20656803 Pulled By: ltamasi fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
5 years ago
++base_it;
Aggregate blob file related changes in VersionBuilder as VersionEdits are applied (#9085) Summary: The current VersionBuilder code on mainline keeps track of blob file related changes ("delta") induced by a series of `VersionEdit`s in the form of `BlobFileMetaDataDelta` objects. Specifically, `BlobFileMetaDataDelta` contains the amount of additional garbage generated by compactions, as well as the set of newly linked/unlinked SSTs. This is very handy for detecting trivial moves, since in that case the newly linked and unlinked SSTs cancel each other out. However, this representation does not allow us to easily tell whether a certain blob file is obsolete after applying a set of `VersionEdit`s or not. In order to solve this issue, the patch introduces `MutableBlobFileMetaData`, which, in addition to the delta, also contains the materialized state after applying a set of version edits (i.e. the total amount of garbage and the resulting set of linked SSTs). This will enable us to add further consistency checks and to improve certain pieces of functionality where knowing up front which blob files get obsoleted is beneficial. (Note: this patch is just the refactoring part; I plan to create separate PRs for the enhancements.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9085 Test Plan: Ran `make check` and the stress tests in BlobDB mode. Reviewed By: riversand963 Differential Revision: D31980867 Pulled By: ltamasi fbshipit-source-id: cc4286778b10900af720423d6b772c77f28a93e3
3 years ago
++mutable_it;
Add blob files to VersionStorageInfo/VersionBuilder (#6597) Summary: The patch adds a couple of classes to represent metadata about blob files: `SharedBlobFileMetaData` contains the information elements that are immutable (once the blob file is closed), e.g. blob file number, total number and size of blob files, checksum method/value, while `BlobFileMetaData` contains attributes that can vary across versions like the amount of garbage in the file. There is a single `SharedBlobFileMetaData` for each blob file, which is jointly owned by the `BlobFileMetaData` objects that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s and can also be shared if the (immutable _and_ mutable) state of the blob file is the same in two versions. In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends `VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata to a new `VersionStorageInfo`. Consistency checks are also extended to ensure that table files point to blob files that are part of the `Version`, and that all blob files that are part of any given `Version` have at least some _non_-garbage data in them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20656803 Pulled By: ltamasi fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
5 years ago
}
}
while (base_it != base_it_end) {
Use a sorted vector instead of a map to store blob file metadata (#9526) Summary: The patch replaces `std::map` with a sorted `std::vector` for `VersionStorageInfo::blob_files_` and preallocates the space for the `vector` before saving the `BlobFileMetaData` into the new `VersionStorageInfo` in `VersionBuilder::Rep::SaveBlobFilesTo`. These changes reduce the time the DB mutex is held while saving new `Version`s, and using a sorted `vector` also makes lookups faster thanks to better memory locality. In addition, the patch introduces helper methods `VersionStorageInfo::GetBlobFileMetaData` and `VersionStorageInfo::GetBlobFileMetaDataLB` that can be used by clients to perform lookups in the `vector`, and does some general cleanup in the parts of code where blob file metadata are used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9526 Test Plan: Ran `make check` and the crash test script for a while. Performance was tested using a load-optimized benchmark (`fillseq` with vector memtable, no WAL) and small file sizes so that a significant number of files are produced: ``` numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/ltamasi-dbbench --wal_dir=/data/ltamasi-dbbench --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --enable_blob_files=1 --blob_file_size=16777216 --min_blob_size=0 --blob_compression_type=lz4 --enable_blob_garbage_collection=1 --seed=<some value> ``` Final statistics before the patch: ``` Cumulative writes: 0 writes, 700M keys, 0 commit groups, 0.0 writes per commit group, ingest: 284.62 GB, 121.27 MB/s Interval writes: 0 writes, 334K keys, 0 commit groups, 0.0 writes per commit group, ingest: 139.28 MB, 72.46 MB/s ``` With the patch: ``` Cumulative writes: 0 writes, 760M keys, 0 commit groups, 0.0 writes per commit group, ingest: 308.66 GB, 131.52 MB/s Interval writes: 0 writes, 445K keys, 0 commit groups, 0.0 writes per commit group, ingest: 185.35 MB, 93.15 MB/s ``` Total time to complete the benchmark is 2611 seconds with the patch, down from 2986 secs. Reviewed By: riversand963 Differential Revision: D34082728 Pulled By: ltamasi fbshipit-source-id: fc598abf676dce436734d06bb9d2d99a26a004fc
3 years ago
const auto& base_meta = *base_it;
Add blob files to VersionStorageInfo/VersionBuilder (#6597) Summary: The patch adds a couple of classes to represent metadata about blob files: `SharedBlobFileMetaData` contains the information elements that are immutable (once the blob file is closed), e.g. blob file number, total number and size of blob files, checksum method/value, while `BlobFileMetaData` contains attributes that can vary across versions like the amount of garbage in the file. There is a single `SharedBlobFileMetaData` for each blob file, which is jointly owned by the `BlobFileMetaData` objects that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s and can also be shared if the (immutable _and_ mutable) state of the blob file is the same in two versions. In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends `VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata to a new `VersionStorageInfo`. Consistency checks are also extended to ensure that table files point to blob files that are part of the `Version`, and that all blob files that are part of any given `Version` have at least some _non_-garbage data in them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20656803 Pulled By: ltamasi fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
5 years ago
if (!process_base(base_meta)) {
return;
}
Clean up blob files based on the linked SST set (#7001) Summary: The earlier `VersionBuilder` code only cleaned up blob files that were marked as entirely consisting of garbage using `VersionEdits` with `BlobFileGarbage`. This covers the cases when table files go through regular compaction, where we iterate through the KVs and thus have an opportunity to calculate the amount of garbage (that is, most cases). However, it does not help when table files are simply dropped (e.g. deletion compactions or the `DeleteFile` API). To deal with such cases, the patch adds logic that cleans up all blob files at the head of the list until the first one with linked SSTs is found. (As an example, let's assume we have blob files with numbers 1..10, and the first one with any linked SSTs is number 8. This means that SSTs in the `Version` only rely on blob files with numbers >= 8, and thus 1..7 are no longer needed.) The code change itself is pretty small; however, changing the logic like this necessitated changes to some tests that have been added recently (namely to the ones that use blob files in isolation, i.e. without any table files referring to them). Some of these cases were fixed by bypassing `VersionBuilder` altogether in order to keep the tests simple (which actually makes them more proper unit tests as well), while the `VersionBuilder` unit tests were fixed by adding dummy table files to the test cases as needed. Pull Request resolved: https://github.com/facebook/rocksdb/pull/7001 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D22119474 Pulled By: ltamasi fbshipit-source-id: c6547141355667d4291d9661d6518eb741e7b54a
4 years ago
Add blob files to VersionStorageInfo/VersionBuilder (#6597) Summary: The patch adds a couple of classes to represent metadata about blob files: `SharedBlobFileMetaData` contains the information elements that are immutable (once the blob file is closed), e.g. blob file number, total number and size of blob files, checksum method/value, while `BlobFileMetaData` contains attributes that can vary across versions like the amount of garbage in the file. There is a single `SharedBlobFileMetaData` for each blob file, which is jointly owned by the `BlobFileMetaData` objects that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s and can also be shared if the (immutable _and_ mutable) state of the blob file is the same in two versions. In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends `VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata to a new `VersionStorageInfo`. Consistency checks are also extended to ensure that table files point to blob files that are part of the `Version`, and that all blob files that are part of any given `Version` have at least some _non_-garbage data in them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20656803 Pulled By: ltamasi fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
5 years ago
++base_it;
}
Aggregate blob file related changes in VersionBuilder as VersionEdits are applied (#9085) Summary: The current VersionBuilder code on mainline keeps track of blob file related changes ("delta") induced by a series of `VersionEdit`s in the form of `BlobFileMetaDataDelta` objects. Specifically, `BlobFileMetaDataDelta` contains the amount of additional garbage generated by compactions, as well as the set of newly linked/unlinked SSTs. This is very handy for detecting trivial moves, since in that case the newly linked and unlinked SSTs cancel each other out. However, this representation does not allow us to easily tell whether a certain blob file is obsolete after applying a set of `VersionEdit`s or not. In order to solve this issue, the patch introduces `MutableBlobFileMetaData`, which, in addition to the delta, also contains the materialized state after applying a set of version edits (i.e. the total amount of garbage and the resulting set of linked SSTs). This will enable us to add further consistency checks and to improve certain pieces of functionality where knowing up front which blob files get obsoleted is beneficial. (Note: this patch is just the refactoring part; I plan to create separate PRs for the enhancements.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9085 Test Plan: Ran `make check` and the stress tests in BlobDB mode. Reviewed By: riversand963 Differential Revision: D31980867 Pulled By: ltamasi fbshipit-source-id: cc4286778b10900af720423d6b772c77f28a93e3
3 years ago
while (mutable_it != mutable_it_end) {
const auto& mutable_meta = mutable_it->second;
if (!process_mutable(mutable_meta)) {
return;
}
Add blob files to VersionStorageInfo/VersionBuilder (#6597) Summary: The patch adds a couple of classes to represent metadata about blob files: `SharedBlobFileMetaData` contains the information elements that are immutable (once the blob file is closed), e.g. blob file number, total number and size of blob files, checksum method/value, while `BlobFileMetaData` contains attributes that can vary across versions like the amount of garbage in the file. There is a single `SharedBlobFileMetaData` for each blob file, which is jointly owned by the `BlobFileMetaData` objects that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s and can also be shared if the (immutable _and_ mutable) state of the blob file is the same in two versions. In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends `VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata to a new `VersionStorageInfo`. Consistency checks are also extended to ensure that table files point to blob files that are part of the `Version`, and that all blob files that are part of any given `Version` have at least some _non_-garbage data in them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20656803 Pulled By: ltamasi fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
5 years ago
Aggregate blob file related changes in VersionBuilder as VersionEdits are applied (#9085) Summary: The current VersionBuilder code on mainline keeps track of blob file related changes ("delta") induced by a series of `VersionEdit`s in the form of `BlobFileMetaDataDelta` objects. Specifically, `BlobFileMetaDataDelta` contains the amount of additional garbage generated by compactions, as well as the set of newly linked/unlinked SSTs. This is very handy for detecting trivial moves, since in that case the newly linked and unlinked SSTs cancel each other out. However, this representation does not allow us to easily tell whether a certain blob file is obsolete after applying a set of `VersionEdit`s or not. In order to solve this issue, the patch introduces `MutableBlobFileMetaData`, which, in addition to the delta, also contains the materialized state after applying a set of version edits (i.e. the total amount of garbage and the resulting set of linked SSTs). This will enable us to add further consistency checks and to improve certain pieces of functionality where knowing up front which blob files get obsoleted is beneficial. (Note: this patch is just the refactoring part; I plan to create separate PRs for the enhancements.) Pull Request resolved: https://github.com/facebook/rocksdb/pull/9085 Test Plan: Ran `make check` and the stress tests in BlobDB mode. Reviewed By: riversand963 Differential Revision: D31980867 Pulled By: ltamasi fbshipit-source-id: cc4286778b10900af720423d6b772c77f28a93e3
3 years ago
++mutable_it;
Add blob files to VersionStorageInfo/VersionBuilder (#6597) Summary: The patch adds a couple of classes to represent metadata about blob files: `SharedBlobFileMetaData` contains the information elements that are immutable (once the blob file is closed), e.g. blob file number, total number and size of blob files, checksum method/value, while `BlobFileMetaData` contains attributes that can vary across versions like the amount of garbage in the file. There is a single `SharedBlobFileMetaData` for each blob file, which is jointly owned by the `BlobFileMetaData` objects that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s and can also be shared if the (immutable _and_ mutable) state of the blob file is the same in two versions. In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends `VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata to a new `VersionStorageInfo`. Consistency checks are also extended to ensure that table files point to blob files that are part of the `Version`, and that all blob files that are part of any given `Version` have at least some _non_-garbage data in them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20656803 Pulled By: ltamasi fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
5 years ago
}
}
// Helper function template for finding the first blob file that has linked
// SSTs.
template <typename Meta>
static bool CheckLinkedSsts(const Meta& meta,
uint64_t* min_oldest_blob_file_num) {
assert(min_oldest_blob_file_num);
if (!meta.GetLinkedSsts().empty()) {
assert(*min_oldest_blob_file_num == kInvalidBlobFileNumber);
*min_oldest_blob_file_num = meta.GetBlobFileNumber();
return false;
}
return true;
}
// Find the oldest blob file that has linked SSTs.
uint64_t GetMinOldestBlobFileNumber() const {
uint64_t min_oldest_blob_file_num = kInvalidBlobFileNumber;
auto process_base =
[&min_oldest_blob_file_num](
const std::shared_ptr<BlobFileMetaData>& base_meta) {
assert(base_meta);
return CheckLinkedSsts(*base_meta, &min_oldest_blob_file_num);
};
auto process_mutable = [&min_oldest_blob_file_num](
const MutableBlobFileMetaData& mutable_meta) {
return CheckLinkedSsts(mutable_meta, &min_oldest_blob_file_num);
};
auto process_both = [&min_oldest_blob_file_num](
const std::shared_ptr<BlobFileMetaData>& base_meta,
const MutableBlobFileMetaData& mutable_meta) {
#ifndef NDEBUG
assert(base_meta);
assert(base_meta->GetSharedMeta() == mutable_meta.GetSharedMeta());
#else
(void)base_meta;
#endif
// Look at mutable_meta since it supersedes *base_meta
return CheckLinkedSsts(mutable_meta, &min_oldest_blob_file_num);
};
MergeBlobFileMetas(kInvalidBlobFileNumber, process_base, process_mutable,
process_both);
return min_oldest_blob_file_num;
}
static std::shared_ptr<BlobFileMetaData> CreateBlobFileMetaData(
const MutableBlobFileMetaData& mutable_meta) {
return BlobFileMetaData::Create(
mutable_meta.GetSharedMeta(), mutable_meta.GetLinkedSsts(),
mutable_meta.GetGarbageBlobCount(), mutable_meta.GetGarbageBlobBytes());
}
// Add the blob file specified by meta to *vstorage if it is determined to
// contain valid data (blobs).
template <typename Meta>
static void AddBlobFileIfNeeded(VersionStorageInfo* vstorage, Meta&& meta) {
assert(vstorage);
assert(meta);
if (meta->GetLinkedSsts().empty() &&
meta->GetGarbageBlobCount() >= meta->GetTotalBlobCount()) {
return;
}
vstorage->AddBlobFile(std::forward<Meta>(meta));
}
// Merge the blob file metadata from the base version with the changes (edits)
// applied, and save the result into *vstorage.
void SaveBlobFilesTo(VersionStorageInfo* vstorage) const {
assert(vstorage);
Use a sorted vector instead of a map to store blob file metadata (#9526) Summary: The patch replaces `std::map` with a sorted `std::vector` for `VersionStorageInfo::blob_files_` and preallocates the space for the `vector` before saving the `BlobFileMetaData` into the new `VersionStorageInfo` in `VersionBuilder::Rep::SaveBlobFilesTo`. These changes reduce the time the DB mutex is held while saving new `Version`s, and using a sorted `vector` also makes lookups faster thanks to better memory locality. In addition, the patch introduces helper methods `VersionStorageInfo::GetBlobFileMetaData` and `VersionStorageInfo::GetBlobFileMetaDataLB` that can be used by clients to perform lookups in the `vector`, and does some general cleanup in the parts of code where blob file metadata are used. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9526 Test Plan: Ran `make check` and the crash test script for a while. Performance was tested using a load-optimized benchmark (`fillseq` with vector memtable, no WAL) and small file sizes so that a significant number of files are produced: ``` numactl --interleave=all ./db_bench --benchmarks=fillseq --allow_concurrent_memtable_write=false --level0_file_num_compaction_trigger=4 --level0_slowdown_writes_trigger=20 --level0_stop_writes_trigger=30 --max_background_jobs=8 --max_write_buffer_number=8 --db=/data/ltamasi-dbbench --wal_dir=/data/ltamasi-dbbench --num=800000000 --num_levels=8 --key_size=20 --value_size=400 --block_size=8192 --cache_size=51539607552 --cache_numshardbits=6 --compression_max_dict_bytes=0 --compression_ratio=0.5 --compression_type=lz4 --bytes_per_sync=8388608 --cache_index_and_filter_blocks=1 --cache_high_pri_pool_ratio=0.5 --benchmark_write_rate_limit=0 --write_buffer_size=16777216 --target_file_size_base=16777216 --max_bytes_for_level_base=67108864 --verify_checksum=1 --delete_obsolete_files_period_micros=62914560 --max_bytes_for_level_multiplier=8 --statistics=0 --stats_per_interval=1 --stats_interval_seconds=20 --histogram=1 --memtablerep=skip_list --bloom_bits=10 --open_files=-1 --subcompactions=1 --compaction_style=0 --min_level_to_compress=3 --level_compaction_dynamic_level_bytes=true --pin_l0_filter_and_index_blocks_in_cache=1 --soft_pending_compaction_bytes_limit=167503724544 --hard_pending_compaction_bytes_limit=335007449088 --min_level_to_compress=0 --use_existing_db=0 --sync=0 --threads=1 --memtablerep=vector --allow_concurrent_memtable_write=false --disable_wal=1 --enable_blob_files=1 --blob_file_size=16777216 --min_blob_size=0 --blob_compression_type=lz4 --enable_blob_garbage_collection=1 --seed=<some value> ``` Final statistics before the patch: ``` Cumulative writes: 0 writes, 700M keys, 0 commit groups, 0.0 writes per commit group, ingest: 284.62 GB, 121.27 MB/s Interval writes: 0 writes, 334K keys, 0 commit groups, 0.0 writes per commit group, ingest: 139.28 MB, 72.46 MB/s ``` With the patch: ``` Cumulative writes: 0 writes, 760M keys, 0 commit groups, 0.0 writes per commit group, ingest: 308.66 GB, 131.52 MB/s Interval writes: 0 writes, 445K keys, 0 commit groups, 0.0 writes per commit group, ingest: 185.35 MB, 93.15 MB/s ``` Total time to complete the benchmark is 2611 seconds with the patch, down from 2986 secs. Reviewed By: riversand963 Differential Revision: D34082728 Pulled By: ltamasi fbshipit-source-id: fc598abf676dce436734d06bb9d2d99a26a004fc
3 years ago
assert(base_vstorage_);
vstorage->ReserveBlob(base_vstorage_->GetBlobFiles().size() +
mutable_blob_file_metas_.size());
const uint64_t oldest_blob_file_with_linked_ssts =
GetMinOldestBlobFileNumber();
auto process_base =
[vstorage](const std::shared_ptr<BlobFileMetaData>& base_meta) {
assert(base_meta);
AddBlobFileIfNeeded(vstorage, base_meta);
return true;
};
auto process_mutable =
[vstorage](const MutableBlobFileMetaData& mutable_meta) {
AddBlobFileIfNeeded(vstorage, CreateBlobFileMetaData(mutable_meta));
return true;
};
auto process_both = [vstorage](
const std::shared_ptr<BlobFileMetaData>& base_meta,
const MutableBlobFileMetaData& mutable_meta) {
assert(base_meta);
assert(base_meta->GetSharedMeta() == mutable_meta.GetSharedMeta());
if (!mutable_meta.HasDelta()) {
assert(base_meta->GetGarbageBlobCount() ==
mutable_meta.GetGarbageBlobCount());
assert(base_meta->GetGarbageBlobBytes() ==
mutable_meta.GetGarbageBlobBytes());
assert(base_meta->GetLinkedSsts() == mutable_meta.GetLinkedSsts());
AddBlobFileIfNeeded(vstorage, base_meta);
return true;
}
AddBlobFileIfNeeded(vstorage, CreateBlobFileMetaData(mutable_meta));
return true;
};
MergeBlobFileMetas(oldest_blob_file_with_linked_ssts, process_base,
process_mutable, process_both);
}
void MaybeAddFile(VersionStorageInfo* vstorage, int level,
FileMetaData* f) const {
const uint64_t file_number = f->fd.GetNumber();
const auto& level_state = levels_[level];
const auto& del_files = level_state.deleted_files;
const auto del_it = del_files.find(file_number);
if (del_it != del_files.end()) {
// f is to-be-deleted table file
vstorage->RemoveCurrentStats(f);
} else {
const auto& add_files = level_state.added_files;
const auto add_it = add_files.find(file_number);
// Note: if the file appears both in the base version and in the added
// list, the added FileMetaData supersedes the one in the base version.
if (add_it != add_files.end() && add_it->second != f) {
vstorage->RemoveCurrentStats(f);
} else {
vstorage->AddFile(level, f);
}
}
}
template <typename Cmp>
void SaveSSTFilesTo(VersionStorageInfo* vstorage, int level, Cmp cmp) const {
// Merge the set of added files with the set of pre-existing files.
// Drop any deleted files. Store the result in *vstorage.
const auto& base_files = base_vstorage_->LevelFiles(level);
const auto& unordered_added_files = levels_[level].added_files;
vstorage->Reserve(level, base_files.size() + unordered_added_files.size());
// Sort added files for the level.
std::vector<FileMetaData*> added_files;
added_files.reserve(unordered_added_files.size());
for (const auto& pair : unordered_added_files) {
added_files.push_back(pair.second);
}
std::sort(added_files.begin(), added_files.end(), cmp);
auto base_iter = base_files.begin();
auto base_end = base_files.end();
auto added_iter = added_files.begin();
auto added_end = added_files.end();
while (added_iter != added_end || base_iter != base_end) {
if (base_iter == base_end ||
(added_iter != added_end && cmp(*added_iter, *base_iter))) {
MaybeAddFile(vstorage, level, *added_iter++);
} else {
MaybeAddFile(vstorage, level, *base_iter++);
}
}
}
Sort L0 files by newly introduced epoch_num (#10922) Summary: **Context:** Sorting L0 files by `largest_seqno` has at least two inconvenience: - File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap. - For example, consider the following sequence of events ("key@n" indicates key at seqno "n") - insert k1@1 to memtable m1 - ingest file s1 with k2@2, ingest file s2 with k3@3 - insert k4@4 to m1 - compact files s1, s2 and result in new file s3 of seqno range [2, 3] - flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1 - However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption. - Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption - For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr for this example) - an existing SST s1 contains only k1@1 - insert k1@2 to memtable m1 - ingest file s2 with k3@3, ingest file s3 with k4@4 - insert single delete k5@5 in m1 - flush m1 and result in new file s4 of seqno range [2, 5] - compact s1, s2, s3 and result in new file s5 of seqno range [1, 4] - compact s4 and result in new file s6 of seqno range [2] due to single delete - By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno` Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways: - In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number` ordering check. This will result in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more. - In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption. **Summary:** - Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`. - `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`) - Compaction output file is assigned with the minimum `epoch_number` among input files' - Refit level: reuse refitted file's epoch_number - Other paths needing `epoch_number` treatment: - Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo` - Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`. - Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair). - Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder. - Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery - Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more - Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag` - Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above - Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`. - Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR. - Misc: - update existing tests with `epoch_number` so make check will pass - update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases - assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber() Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922 Test Plan: - `make check` - New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc` - Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930 - [Ongoing] Compatibility test: manually run https://github.com/ajkr/rocksdb/commit/36a5686ec012f35a4371e409aa85c404ca1c210d (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox` - [Ongoing] normal db stress test - [Ongoing] db stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761 Reviewed By: ajkr Differential Revision: D41063187 Pulled By: hx235 fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9
2 years ago
bool PromoteEpochNumberRequirementIfNeeded(
VersionStorageInfo* vstorage) const {
if (vstorage->HasMissingEpochNumber()) {
return false;
}
for (int level = 0; level < num_levels_; ++level) {
for (const auto& pair : levels_[level].added_files) {
const FileMetaData* f = pair.second;
if (f->epoch_number == kUnknownEpochNumber) {
return false;
}
}
}
vstorage->SetEpochNumberRequirement(EpochNumberRequirement::kMustPresent);
return true;
}
void SaveSSTFilesTo(VersionStorageInfo* vstorage) const {
assert(vstorage);
if (!num_levels_) {
return;
}
Sort L0 files by newly introduced epoch_num (#10922) Summary: **Context:** Sorting L0 files by `largest_seqno` has at least two inconvenience: - File ingestion and compaction involving ingested files can create files of overlapping seqno range with the existing files. `force_consistency_check=true` will catch such overlap seqno range even those harmless overlap. - For example, consider the following sequence of events ("key@n" indicates key at seqno "n") - insert k1@1 to memtable m1 - ingest file s1 with k2@2, ingest file s2 with k3@3 - insert k4@4 to m1 - compact files s1, s2 and result in new file s3 of seqno range [2, 3] - flush m1 and result in new file s4 of seqno range [1, 4]. And `force_consistency_check=true` will think s4 and s3 has file reordering corruption that might cause retuning an old value of k1 - However such caught corruption is a false positive since s1, s2 will not have overlapped keys with k1 or whatever inserted into m1 before ingest file s1 by the requirement of file ingestion (otherwise the m1 will be flushed first before any of the file ingestion completes). Therefore there in fact isn't any file reordering corruption. - Single delete can decrease a file's largest seqno and ordering by `largest_seqno` can introduce a wrong ordering hence file reordering corruption - For example, consider the following sequence of events ("key@n" indicates key at seqno "n", Credit to ajkr for this example) - an existing SST s1 contains only k1@1 - insert k1@2 to memtable m1 - ingest file s2 with k3@3, ingest file s3 with k4@4 - insert single delete k5@5 in m1 - flush m1 and result in new file s4 of seqno range [2, 5] - compact s1, s2, s3 and result in new file s5 of seqno range [1, 4] - compact s4 and result in new file s6 of seqno range [2] due to single delete - By the last step, we have file ordering by largest seqno (">" means "newer") : s5 > s6 while s6 contains a newer version of the k1's value (i.e, k1@2) than s5, which is a real reordering corruption. While this can be caught by `force_consistency_check=true`, there isn't a good way to prevent this from happening if ordering by `largest_seqno` Therefore, we are redesigning the sorting criteria of L0 files and avoid above inconvenience. Credit to ajkr , we now introduce `epoch_num` which describes the order of a file being flushed or ingested/imported (compaction output file will has the minimum `epoch_num` among input files'). This will avoid the above inconvenience in the following ways: - In the first case above, there will no longer be overlap seqno range check in `force_consistency_check=true` but `epoch_number` ordering check. This will result in file ordering s1 < s2 < s4 (pre-compaction) and s3 < s4 (post-compaction) which won't trigger false positive corruption. See test class `DBCompactionTestL0FilesMisorderCorruption*` for more. - In the second case above, this will result in file ordering s1 < s2 < s3 < s4 (pre-compacting s1, s2, s3), s5 < s4 (post-compacting s1, s2, s3), s5 < s6 (post-compacting s4), which are correct file ordering without causing any corruption. **Summary:** - Introduce `epoch_number` stored per `ColumnFamilyData` and sort CF's L0 files by their assigned `epoch_number` instead of `largest_seqno`. - `epoch_number` is increased and assigned upon `VersionEdit::AddFile()` for flush (or similarly for WriteLevel0TableForRecovery) and file ingestion (except for allow_behind_true, which will always get assigned as the `kReservedEpochNumberForFileIngestedBehind`) - Compaction output file is assigned with the minimum `epoch_number` among input files' - Refit level: reuse refitted file's epoch_number - Other paths needing `epoch_number` treatment: - Import column families: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo` - Repair: reuse file's epoch_number if exists. If not, assign one based on `NewestFirstBySeqNo`. - Assigning new epoch_number to a file and adding this file to LSM tree should be atomic. This is guaranteed by us assigning epoch_number right upon `VersionEdit::AddFile()` where this version edit will be apply to LSM tree shape right after by holding the db mutex (e.g, flush, file ingestion, import column family) or by there is only 1 ongoing edit per CF (e.g, WriteLevel0TableForRecovery, Repair). - Assigning the minimum input epoch number to compaction output file won't misorder L0 files (even through later `Refit(target_level=0)`). It's due to for every key "k" in the input range, a legit compaction will cover a continuous epoch number range of that key. As long as we assign the key "k" the minimum input epoch number, it won't become newer or older than the versions of this key that aren't included in this compaction hence no misorder. - Persist `epoch_number` of each file in manifest and recover `epoch_number` on db recovery - Backward compatibility with old db without `epoch_number` support is guaranteed by assigning `epoch_number` to recovered files by `NewestFirstBySeqno` order. See `VersionStorageInfo::RecoverEpochNumbers()` for more - Forward compatibility with manifest is guaranteed by flexibility of `NewFileCustomTag` - Replace `force_consistent_check` on L0 with `epoch_number` and remove false positive check like case 1 with `largest_seqno` above - Due to backward compatibility issue, we might encounter files with missing epoch number at the beginning of db recovery. We will still use old L0 sorting mechanism (`NewestFirstBySeqno`) to check/sort them till we infer their epoch number. See usages of `EpochNumberRequirement`. - Remove fix https://github.com/facebook/rocksdb/pull/5958#issue-511150930 and their outdated tests to file reordering corruption because such fix can be replaced by this PR. - Misc: - update existing tests with `epoch_number` so make check will pass - update https://github.com/facebook/rocksdb/pull/5958#issue-511150930 tests to verify corruption is fixed using `epoch_number` and cover universal/fifo compaction/CompactRange/CompactFile cases - assert db_mutex is held for a few places before calling ColumnFamilyData::NewEpochNumber() Pull Request resolved: https://github.com/facebook/rocksdb/pull/10922 Test Plan: - `make check` - New unit tests under `db/db_compaction_test.cc`, `db/db_test2.cc`, `db/version_builder_test.cc`, `db/repair_test.cc` - Updated tests (i.e, `DBCompactionTestL0FilesMisorderCorruption*`) under https://github.com/facebook/rocksdb/pull/5958#issue-511150930 - [Ongoing] Compatibility test: manually run https://github.com/ajkr/rocksdb/commit/36a5686ec012f35a4371e409aa85c404ca1c210d (with file ingestion off for running the `.orig` binary to prevent this bug affecting upgrade/downgrade formality checking) for 1 hour on `simple black/white box`, `cf_consistency/txn/enable_ts with whitebox + test_best_efforts_recovery with blackbox` - [Ongoing] normal db stress test - [Ongoing] db stress test with aggressive value https://github.com/facebook/rocksdb/pull/10761 Reviewed By: ajkr Differential Revision: D41063187 Pulled By: hx235 fbshipit-source-id: 826cb23455de7beaabe2d16c57682a82733a32a9
2 years ago
EpochNumberRequirement epoch_number_requirement =
vstorage->GetEpochNumberRequirement();
if (epoch_number_requirement == EpochNumberRequirement::kMightMissing) {
bool promoted = PromoteEpochNumberRequirementIfNeeded(vstorage);
if (promoted) {
epoch_number_requirement = vstorage->GetEpochNumberRequirement();
}
}
if (epoch_number_requirement == EpochNumberRequirement::kMightMissing) {
SaveSSTFilesTo(vstorage, /* level */ 0, level_zero_cmp_by_seqno_);
} else {
SaveSSTFilesTo(vstorage, /* level */ 0, level_zero_cmp_by_epochno_);
}
for (int level = 1; level < num_levels_; ++level) {
SaveSSTFilesTo(vstorage, level, level_nonzero_cmp_);
}
}
Add basic kRoundRobin compaction policy (#10107) Summary: Add `kRoundRobin` as a compaction priority. The implementation is as follows. - Define a cursor as the smallest Internal key in the successor of the selected file. Add `vector<InternalKey> compact_cursor_` into `VersionStorageInfo` where each element (`InternalKey`) in `compact_cursor_` represents a cursor. In round-robin compaction policy, we just need to select the first file (assuming files are sorted) and also has the smallest InternalKey larger than/equal to the cursor. After a file is chosen, we create a new `Fsize` vector which puts the selected file is placed at the first position in `temp`, the next cursor is then updated as the smallest InternalKey in successor of the selected file (the above logic is implemented in `SortFileByRoundRobin`). - After a compaction succeeds, typically `InstallCompactionResults()`, we choose the next cursor for the input level and save it to `edit`. When calling `LogAndApply`, we save the next cursor with its level into some local variable and finally apply the change to `vstorage` in `SaveTo` function. - Cursors are persist pair by pair (<level, InternalKey>) in `EncodeTo` so that they can be reconstructed when reopening. An empty cursor will not be encoded to MANIFEST Pull Request resolved: https://github.com/facebook/rocksdb/pull/10107 Test Plan: add unit test (`CompactionPriRoundRobin`) in `compaction_picker_test`, add `kRoundRobin` priority in `CompactionPriTest` from `db_compaction_test`, and add `PersistRoundRobinCompactCursor` in `db_compaction_test` Reviewed By: ajkr Differential Revision: D37316037 Pulled By: littlepig2013 fbshipit-source-id: 9f481748190ace416079139044e00df2968fb1ee
2 years ago
void SaveCompactCursorsTo(VersionStorageInfo* vstorage) const {
for (auto iter = updated_compact_cursors_.begin();
iter != updated_compact_cursors_.end(); iter++) {
vstorage->AddCursorForOneLevel(iter->first, iter->second);
}
}
// Save the current state in *vstorage.
Status SaveTo(VersionStorageInfo* vstorage) const {
Status s;
#ifndef NDEBUG
// The same check is done within Apply() so we skip it in release mode.
s = CheckConsistency(base_vstorage_);
if (!s.ok()) {
return s;
}
#endif // NDEBUG
s = CheckConsistency(vstorage);
if (!s.ok()) {
return s;
}
SaveSSTFilesTo(vstorage);
Add blob files to VersionStorageInfo/VersionBuilder (#6597) Summary: The patch adds a couple of classes to represent metadata about blob files: `SharedBlobFileMetaData` contains the information elements that are immutable (once the blob file is closed), e.g. blob file number, total number and size of blob files, checksum method/value, while `BlobFileMetaData` contains attributes that can vary across versions like the amount of garbage in the file. There is a single `SharedBlobFileMetaData` for each blob file, which is jointly owned by the `BlobFileMetaData` objects that point to it; `BlobFileMetaData` objects, in turn, are owned by `Version`s and can also be shared if the (immutable _and_ mutable) state of the blob file is the same in two versions. In addition, the patch adds the blob file metadata to `VersionStorageInfo`, and extends `VersionBuilder` so that it can apply blob file related `VersionEdit`s (i.e. those containing `BlobFileAddition`s and/or `BlobFileGarbage`), and save blob file metadata to a new `VersionStorageInfo`. Consistency checks are also extended to ensure that table files point to blob files that are part of the `Version`, and that all blob files that are part of any given `Version` have at least some _non_-garbage data in them. Pull Request resolved: https://github.com/facebook/rocksdb/pull/6597 Test Plan: `make check` Reviewed By: riversand963 Differential Revision: D20656803 Pulled By: ltamasi fbshipit-source-id: f1f74d135045b3b42d0146f03ee576ef0a4bfd80
5 years ago
SaveBlobFilesTo(vstorage);
Add basic kRoundRobin compaction policy (#10107) Summary: Add `kRoundRobin` as a compaction priority. The implementation is as follows. - Define a cursor as the smallest Internal key in the successor of the selected file. Add `vector<InternalKey> compact_cursor_` into `VersionStorageInfo` where each element (`InternalKey`) in `compact_cursor_` represents a cursor. In round-robin compaction policy, we just need to select the first file (assuming files are sorted) and also has the smallest InternalKey larger than/equal to the cursor. After a file is chosen, we create a new `Fsize` vector which puts the selected file is placed at the first position in `temp`, the next cursor is then updated as the smallest InternalKey in successor of the selected file (the above logic is implemented in `SortFileByRoundRobin`). - After a compaction succeeds, typically `InstallCompactionResults()`, we choose the next cursor for the input level and save it to `edit`. When calling `LogAndApply`, we save the next cursor with its level into some local variable and finally apply the change to `vstorage` in `SaveTo` function. - Cursors are persist pair by pair (<level, InternalKey>) in `EncodeTo` so that they can be reconstructed when reopening. An empty cursor will not be encoded to MANIFEST Pull Request resolved: https://github.com/facebook/rocksdb/pull/10107 Test Plan: add unit test (`CompactionPriRoundRobin`) in `compaction_picker_test`, add `kRoundRobin` priority in `CompactionPriTest` from `db_compaction_test`, and add `PersistRoundRobinCompactCursor` in `db_compaction_test` Reviewed By: ajkr Differential Revision: D37316037 Pulled By: littlepig2013 fbshipit-source-id: 9f481748190ace416079139044e00df2968fb1ee
2 years ago
SaveCompactCursorsTo(vstorage);
s = CheckConsistency(vstorage);
return s;
}
Fast path for detecting unchanged prefix_extractor (#9407) Summary: Fixes a major performance regression in 6.26, where extra CPU is spent in SliceTransform::AsString when reads involve a prefix_extractor (Get, MultiGet, Seek). Common case performance is now better than 6.25. This change creates a "fast path" for verifying that the current prefix extractor is unchanged and compatible with what was used to generate a table file. This fast path detects the common case by pointer comparison on the current prefix_extractor and a "known good" prefix extractor (if applicable) that is saved at the time the table reader is opened. The "known good" prefix extractor is saved as another shared_ptr copy (in an existing field, however) to ensure the pointer is not recycled. When the prefix_extractor has changed to a different instance but same compatible configuration (rare, odd), performance is still a regression compared to 6.25, but this is likely acceptable because of the oddity of such a case. The performance of incompatible prefix_extractor is essentially unchanged. Also fixed a minor case (ForwardIterator) where a prefix_extractor could be used via a raw pointer after being freed as a shared_ptr, if replaced via SetOptions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9407 Test Plan: ## Performance Populate DB with `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12` Running head-to-head comparisons simultaneously with `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -use_existing_db -readonly -benchmarks=seekrandom -num=10000000 -duration=20 -disable_wal=1 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12` Below each is compared by ops/sec vs. baseline which is version 6.25 (multiple baseline runs because of variable machine load) v6.26: 4833 vs. 6698 (<- major regression!) v6.27: 4737 vs. 6397 (still) New: 6704 vs. 6461 (better than baseline in common case) Disabled fastpath: 4843 vs. 6389 (e.g. if prefix extractor instance changes but is still compatible) Changed prefix size (no usable filter) in new: 787 vs. 5927 Changed prefix size (no usable filter) in new & baseline: 773 vs. 784 Reviewed By: mrambacher Differential Revision: D33677812 Pulled By: pdillinger fbshipit-source-id: 571d9711c461fb97f957378a061b7e7dbc4d6a76
3 years ago
Status LoadTableHandlers(
InternalStats* internal_stats, int max_threads,
bool prefetch_index_and_filter_in_cache, bool is_initial_load,
const std::shared_ptr<const SliceTransform>& prefix_extractor,
Block per key-value checksum (#11287) Summary: add option `block_protection_bytes_per_key` and implementation for block per key-value checksum. The main changes are 1. checksum construction and verification in block.cc/h 2. pass the option `block_protection_bytes_per_key` around (mainly for methods defined in table_cache.h) 3. unit tests/crash test updates Tests: * Added unit tests * Crash test: `python3 tools/db_crashtest.py blackbox --simple --block_protection_bytes_per_key=1 --write_buffer_size=1048576` Follow up (maybe as a separate PR): make sure corruption status returned from BlockIters are correctly handled. Performance: Turning on block per KV protection has a non-trivial negative impact on read performance and costs additional memory. For memory, each block includes additional 24 bytes for checksum-related states beside checksum itself. For CPU, I set up a DB of size ~1.2GB with 5M keys (32 bytes key and 200 bytes value) which compacts to ~5 SST files (target file size 256 MB) in L6 without compression. I tested readrandom performance with various block cache size (to mimic various cache hit rates): ``` SETUP make OPTIMIZE_LEVEL="-O3" USE_LTO=1 DEBUG_LEVEL=0 -j32 db_bench ./db_bench -benchmarks=fillseq,compact0,waitforcompaction,compact,waitforcompaction -write_buffer_size=33554432 -level_compaction_dynamic_level_bytes=true -max_background_jobs=8 -target_file_size_base=268435456 --num=5000000 --key_size=32 --value_size=200 --compression_type=none BENCHMARK ./db_bench --use_existing_db -benchmarks=readtocache,readrandom[-X10] --num=5000000 --key_size=32 --disable_auto_compactions --reads=1000000 --block_protection_bytes_per_key=[0|1] --cache_size=$CACHESIZE The readrandom ops/sec looks like the following: Block cache size: 2GB 1.2GB * 0.9 1.2GB * 0.8 1.2GB * 0.5 8MB Main 240805 223604 198176 161653 139040 PR prot_bytes=0 238691 226693 200127 161082 141153 PR prot_bytes=1 214983 193199 178532 137013 108211 prot_bytes=1 vs -10% -15% -10.8% -15% -23% prot_bytes=0 ``` The benchmark has a lot of variance, but there was a 5% to 25% regression in this benchmark with different cache hit rates. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11287 Reviewed By: ajkr Differential Revision: D43970708 Pulled By: cbi42 fbshipit-source-id: ef98d898b71779846fa74212b9ec9e08b7183940
2 years ago
size_t max_file_size_for_l0_meta_pin, const ReadOptions& read_options,
uint8_t block_protection_bytes_per_key) {
assert(table_cache_ != nullptr);
Major Cache refactoring, CPU efficiency improvement (#10975) Summary: This is several refactorings bundled into one to avoid having to incrementally re-modify uses of Cache several times. Overall, there are breaking changes to Cache class, and it becomes more of low-level interface for implementing caches, especially block cache. New internal APIs make using Cache cleaner than before, and more insulated from block cache evolution. Hopefully, this is the last really big block cache refactoring, because of rather effectively decoupling the implementations from the uses. This change also removes the EXPERIMENTAL designation on the SecondaryCache support in Cache. It seems reasonably mature at this point but still subject to change/evolution (as I warn in the API docs for Cache). The high-level motivation for this refactoring is to minimize code duplication / compounding complexity in adding SecondaryCache support to HyperClockCache (in a later PR). Other benefits listed below. * static_cast lines of code +29 -35 (net removed 6) * reinterpret_cast lines of code +6 -32 (net removed 26) ## cache.h and secondary_cache.h * Always use CacheItemHelper with entries instead of just a Deleter. There are several motivations / justifications: * Simpler for implementations to deal with just one Insert and one Lookup. * Simpler and more efficient implementation because we don't have to track which entries are using helpers and which are using deleters * Gets rid of hack to classify cache entries by their deleter. Instead, the CacheItemHelper includes a CacheEntryRole. This simplifies a lot of code (cache_entry_roles.h almost eliminated). Fixes https://github.com/facebook/rocksdb/issues/9428. * Makes it trivial to adjust SecondaryCache behavior based on kind of block (e.g. don't re-compress filter blocks). * It is arguably less convenient for many direct users of Cache, but direct users of Cache are now rare with introduction of typed_cache.h (below). * I considered and rejected an alternative approach in which we reduce customizability by assuming each secondary cache compatible value starts with a Slice referencing the uncompressed block contents (already true or mostly true), but we apparently intend to stack secondary caches. Saving an entry from a compressed secondary to a lower tier requires custom handling offered by SaveToCallback, etc. * Make CreateCallback part of the helper and introduce CreateContext to work with it (alternative to https://github.com/facebook/rocksdb/issues/10562). This cleans up the interface while still allowing context to be provided for loading/parsing values into primary cache. This model works for async lookup in BlockBasedTable reader (reader owns a CreateContext) under the assumption that it always waits on secondary cache operations to finish. (Otherwise, the CreateContext could be destroyed while async operation depending on it continues.) This likely contributes most to the observed performance improvement because it saves an std::function backed by a heap allocation. * Use char* for serialized data, e.g. in SaveToCallback, where void* was confusingly used. (We use `char*` for serialized byte data all over RocksDB, with many advantages over `void*`. `memcpy` etc. are legacy APIs that should not be mimicked.) * Add a type alias Cache::ObjectPtr = void*, so that we can better indicate the intent of the void* when it is to be the object associated with a Cache entry. Related: started (but did not complete) a refactoring to move away from "value" of a cache entry toward "object" or "obj". (It is confusing to call Cache a key-value store (like DB) when it is really storing arbitrary in-memory objects, not byte strings.) * Remove unnecessary key param from DeleterFn. This is good for efficiency in HyperClockCache, which does not directly store the cache key in memory. (Alternative to https://github.com/facebook/rocksdb/issues/10774) * Add allocator to Cache DeleterFn. This is a kind of future-proofing change in case we get more serious about using the Cache allocator for memory tracked by the Cache. Right now, only the uncompressed block contents are allocated using the allocator, and a pointer to that allocator is saved as part of the cached object so that the deleter can use it. (See CacheAllocationPtr.) If in the future we are able to "flatten out" our Cache objects some more, it would be good not to have to track the allocator as part of each object. * Removes legacy `ApplyToAllCacheEntries` and changes `ApplyToAllEntries` signature for Deleter->CacheItemHelper change. ## typed_cache.h Adds various "typed" interfaces to the Cache as internal APIs, so that most uses of Cache can use simple type safe code without casting and without explicit deleters, etc. Almost all of the non-test, non-glue code uses of Cache have been migrated. (Follow-up work: CompressedSecondaryCache deserves deeper attention to migrate.) This change expands RocksDB's internal usage of metaprogramming and SFINAE (https://en.cppreference.com/w/cpp/language/sfinae). The existing usages of Cache are divided up at a high level into these new interfaces. See updated existing uses of Cache for examples of how these are used. * PlaceholderCacheInterface - Used for making cache reservations, with entries that have a charge but no value. * BasicTypedCacheInterface<TValue> - Used for primary cache storage of objects of type TValue, which can be cleaned up with std::default_delete<TValue>. The role is provided by TValue::kCacheEntryRole or given in an optional template parameter. * FullTypedCacheInterface<TValue, TCreateContext> - Used for secondary cache compatible storage of objects of type TValue. In addition to BasicTypedCacheInterface constraints, we require TValue::ContentSlice() to return persistable data. This simplifies usage for the normal case of simple secondary cache compatibility (can give you a Slice to the data already in memory). In addition to TCreateContext performing the role of Cache::CreateContext, it is also expected to provide a factory function for creating TValue. * For each of these, there's a "Shared" version (e.g. FullTypedSharedCacheInterface) that holds a shared_ptr to the Cache, rather than assuming external ownership by holding only a raw `Cache*`. These interfaces introduce specific handle types for each interface instantiation, so that it's easy to see what kind of object is controlled by a handle. (Ultimately, this might not be worth the extra complexity, but it seems OK so far.) Note: I attempted to make the cache 'charge' automatically inferred from the cache object type, such as by expecting an ApproximateMemoryUsage() function, but this is not so clean because there are cases where we need to compute the charge ahead of time and don't want to re-compute it. ## block_cache.h This header is essentially the replacement for the old block_like_traits.h. It includes various things to support block cache access with typed_cache.h for block-based table. ## block_based_table_reader.cc Before this change, accessing the block cache here was an awkward mix of static polymorphism (template TBlocklike) and switch-case on a dynamic BlockType value. This change mostly unifies on static polymorphism, relying on minor hacks in block_cache.h to distinguish variants of Block. We still check BlockType in some places (especially for stats, which could be improved in follow-up work) but at least the BlockType is a static constant from the template parameter. (No more awkward partial redundancy between static and dynamic info.) This likely contributes to the overall performance improvement, but hasn't been tested in isolation. The other key source of simplification here is a more unified system of creating block cache objects: for directly populating from primary cache and for promotion from secondary cache. Both use BlockCreateContext, for context and for factory functions. ## block_based_table_builder.cc, cache_dump_load_impl.cc Before this change, warming caches was super ugly code. Both of these source files had switch statements to basically transition from the dynamic BlockType world to the static TBlocklike world. None of that mess is needed anymore as there's a new, untyped WarmInCache function that handles all the details just as promotion from SecondaryCache would. (Fixes `TODO akanksha: Dedup below code` in block_based_table_builder.cc.) ## Everything else Mostly just updating Cache users to use new typed APIs when reasonably possible, or changed Cache APIs when not. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10975 Test Plan: tests updated Performance test setup similar to https://github.com/facebook/rocksdb/issues/10626 (by cache size, LRUCache when not "hyper" for HyperClockCache): 34MB 1thread base.hyper -> kops/s: 0.745 io_bytes/op: 2.52504e+06 miss_ratio: 0.140906 max_rss_mb: 76.4844 34MB 1thread new.hyper -> kops/s: 0.751 io_bytes/op: 2.5123e+06 miss_ratio: 0.140161 max_rss_mb: 79.3594 34MB 1thread base -> kops/s: 0.254 io_bytes/op: 1.36073e+07 miss_ratio: 0.918818 max_rss_mb: 45.9297 34MB 1thread new -> kops/s: 0.252 io_bytes/op: 1.36157e+07 miss_ratio: 0.918999 max_rss_mb: 44.1523 34MB 32thread base.hyper -> kops/s: 7.272 io_bytes/op: 2.88323e+06 miss_ratio: 0.162532 max_rss_mb: 516.602 34MB 32thread new.hyper -> kops/s: 7.214 io_bytes/op: 2.99046e+06 miss_ratio: 0.168818 max_rss_mb: 518.293 34MB 32thread base -> kops/s: 3.528 io_bytes/op: 1.35722e+07 miss_ratio: 0.914691 max_rss_mb: 264.926 34MB 32thread new -> kops/s: 3.604 io_bytes/op: 1.35744e+07 miss_ratio: 0.915054 max_rss_mb: 264.488 233MB 1thread base.hyper -> kops/s: 53.909 io_bytes/op: 2552.35 miss_ratio: 0.0440566 max_rss_mb: 241.984 233MB 1thread new.hyper -> kops/s: 62.792 io_bytes/op: 2549.79 miss_ratio: 0.044043 max_rss_mb: 241.922 233MB 1thread base -> kops/s: 1.197 io_bytes/op: 2.75173e+06 miss_ratio: 0.103093 max_rss_mb: 241.559 233MB 1thread new -> kops/s: 1.199 io_bytes/op: 2.73723e+06 miss_ratio: 0.10305 max_rss_mb: 240.93 233MB 32thread base.hyper -> kops/s: 1298.69 io_bytes/op: 2539.12 miss_ratio: 0.0440307 max_rss_mb: 371.418 233MB 32thread new.hyper -> kops/s: 1421.35 io_bytes/op: 2538.75 miss_ratio: 0.0440307 max_rss_mb: 347.273 233MB 32thread base -> kops/s: 9.693 io_bytes/op: 2.77304e+06 miss_ratio: 0.103745 max_rss_mb: 569.691 233MB 32thread new -> kops/s: 9.75 io_bytes/op: 2.77559e+06 miss_ratio: 0.103798 max_rss_mb: 552.82 1597MB 1thread base.hyper -> kops/s: 58.607 io_bytes/op: 1449.14 miss_ratio: 0.0249324 max_rss_mb: 1583.55 1597MB 1thread new.hyper -> kops/s: 69.6 io_bytes/op: 1434.89 miss_ratio: 0.0247167 max_rss_mb: 1584.02 1597MB 1thread base -> kops/s: 60.478 io_bytes/op: 1421.28 miss_ratio: 0.024452 max_rss_mb: 1589.45 1597MB 1thread new -> kops/s: 63.973 io_bytes/op: 1416.07 miss_ratio: 0.0243766 max_rss_mb: 1589.24 1597MB 32thread base.hyper -> kops/s: 1436.2 io_bytes/op: 1357.93 miss_ratio: 0.0235353 max_rss_mb: 1692.92 1597MB 32thread new.hyper -> kops/s: 1605.03 io_bytes/op: 1358.04 miss_ratio: 0.023538 max_rss_mb: 1702.78 1597MB 32thread base -> kops/s: 280.059 io_bytes/op: 1350.34 miss_ratio: 0.023289 max_rss_mb: 1675.36 1597MB 32thread new -> kops/s: 283.125 io_bytes/op: 1351.05 miss_ratio: 0.0232797 max_rss_mb: 1703.83 Almost uniformly improving over base revision, especially for hot paths with HyperClockCache, up to 12% higher throughput seen (1597MB, 32thread, hyper). The improvement for that is likely coming from much simplified code for providing context for secondary cache promotion (CreateCallback/CreateContext), and possibly from less branching in block_based_table_reader. And likely a small improvement from not reconstituting key for DeleterFn. Reviewed By: anand1976 Differential Revision: D42417818 Pulled By: pdillinger fbshipit-source-id: f86bfdd584dce27c028b151ba56818ad14f7a432
2 years ago
size_t table_cache_capacity =
table_cache_->get_cache().get()->GetCapacity();
bool always_load = (table_cache_capacity == TableCache::kInfiniteCapacity);
size_t max_load = std::numeric_limits<size_t>::max();
if (!always_load) {
// If it is initial loading and not set to always loading all the
// files, we only load up to kInitialLoadLimit files, to limit the
// time reopening the DB.
const size_t kInitialLoadLimit = 16;
size_t load_limit;
// If the table cache is not 1/4 full, we pin the table handle to
// file metadata to avoid the cache read costs when reading the file.
// The downside of pinning those files is that LRU won't be followed
// for those files. This doesn't matter much because if number of files
// of the DB excceeds table cache capacity, eventually no table reader
// will be pinned and LRU will be followed.
if (is_initial_load) {
load_limit = std::min(kInitialLoadLimit, table_cache_capacity / 4);
} else {
load_limit = table_cache_capacity / 4;
}
Major Cache refactoring, CPU efficiency improvement (#10975) Summary: This is several refactorings bundled into one to avoid having to incrementally re-modify uses of Cache several times. Overall, there are breaking changes to Cache class, and it becomes more of low-level interface for implementing caches, especially block cache. New internal APIs make using Cache cleaner than before, and more insulated from block cache evolution. Hopefully, this is the last really big block cache refactoring, because of rather effectively decoupling the implementations from the uses. This change also removes the EXPERIMENTAL designation on the SecondaryCache support in Cache. It seems reasonably mature at this point but still subject to change/evolution (as I warn in the API docs for Cache). The high-level motivation for this refactoring is to minimize code duplication / compounding complexity in adding SecondaryCache support to HyperClockCache (in a later PR). Other benefits listed below. * static_cast lines of code +29 -35 (net removed 6) * reinterpret_cast lines of code +6 -32 (net removed 26) ## cache.h and secondary_cache.h * Always use CacheItemHelper with entries instead of just a Deleter. There are several motivations / justifications: * Simpler for implementations to deal with just one Insert and one Lookup. * Simpler and more efficient implementation because we don't have to track which entries are using helpers and which are using deleters * Gets rid of hack to classify cache entries by their deleter. Instead, the CacheItemHelper includes a CacheEntryRole. This simplifies a lot of code (cache_entry_roles.h almost eliminated). Fixes https://github.com/facebook/rocksdb/issues/9428. * Makes it trivial to adjust SecondaryCache behavior based on kind of block (e.g. don't re-compress filter blocks). * It is arguably less convenient for many direct users of Cache, but direct users of Cache are now rare with introduction of typed_cache.h (below). * I considered and rejected an alternative approach in which we reduce customizability by assuming each secondary cache compatible value starts with a Slice referencing the uncompressed block contents (already true or mostly true), but we apparently intend to stack secondary caches. Saving an entry from a compressed secondary to a lower tier requires custom handling offered by SaveToCallback, etc. * Make CreateCallback part of the helper and introduce CreateContext to work with it (alternative to https://github.com/facebook/rocksdb/issues/10562). This cleans up the interface while still allowing context to be provided for loading/parsing values into primary cache. This model works for async lookup in BlockBasedTable reader (reader owns a CreateContext) under the assumption that it always waits on secondary cache operations to finish. (Otherwise, the CreateContext could be destroyed while async operation depending on it continues.) This likely contributes most to the observed performance improvement because it saves an std::function backed by a heap allocation. * Use char* for serialized data, e.g. in SaveToCallback, where void* was confusingly used. (We use `char*` for serialized byte data all over RocksDB, with many advantages over `void*`. `memcpy` etc. are legacy APIs that should not be mimicked.) * Add a type alias Cache::ObjectPtr = void*, so that we can better indicate the intent of the void* when it is to be the object associated with a Cache entry. Related: started (but did not complete) a refactoring to move away from "value" of a cache entry toward "object" or "obj". (It is confusing to call Cache a key-value store (like DB) when it is really storing arbitrary in-memory objects, not byte strings.) * Remove unnecessary key param from DeleterFn. This is good for efficiency in HyperClockCache, which does not directly store the cache key in memory. (Alternative to https://github.com/facebook/rocksdb/issues/10774) * Add allocator to Cache DeleterFn. This is a kind of future-proofing change in case we get more serious about using the Cache allocator for memory tracked by the Cache. Right now, only the uncompressed block contents are allocated using the allocator, and a pointer to that allocator is saved as part of the cached object so that the deleter can use it. (See CacheAllocationPtr.) If in the future we are able to "flatten out" our Cache objects some more, it would be good not to have to track the allocator as part of each object. * Removes legacy `ApplyToAllCacheEntries` and changes `ApplyToAllEntries` signature for Deleter->CacheItemHelper change. ## typed_cache.h Adds various "typed" interfaces to the Cache as internal APIs, so that most uses of Cache can use simple type safe code without casting and without explicit deleters, etc. Almost all of the non-test, non-glue code uses of Cache have been migrated. (Follow-up work: CompressedSecondaryCache deserves deeper attention to migrate.) This change expands RocksDB's internal usage of metaprogramming and SFINAE (https://en.cppreference.com/w/cpp/language/sfinae). The existing usages of Cache are divided up at a high level into these new interfaces. See updated existing uses of Cache for examples of how these are used. * PlaceholderCacheInterface - Used for making cache reservations, with entries that have a charge but no value. * BasicTypedCacheInterface<TValue> - Used for primary cache storage of objects of type TValue, which can be cleaned up with std::default_delete<TValue>. The role is provided by TValue::kCacheEntryRole or given in an optional template parameter. * FullTypedCacheInterface<TValue, TCreateContext> - Used for secondary cache compatible storage of objects of type TValue. In addition to BasicTypedCacheInterface constraints, we require TValue::ContentSlice() to return persistable data. This simplifies usage for the normal case of simple secondary cache compatibility (can give you a Slice to the data already in memory). In addition to TCreateContext performing the role of Cache::CreateContext, it is also expected to provide a factory function for creating TValue. * For each of these, there's a "Shared" version (e.g. FullTypedSharedCacheInterface) that holds a shared_ptr to the Cache, rather than assuming external ownership by holding only a raw `Cache*`. These interfaces introduce specific handle types for each interface instantiation, so that it's easy to see what kind of object is controlled by a handle. (Ultimately, this might not be worth the extra complexity, but it seems OK so far.) Note: I attempted to make the cache 'charge' automatically inferred from the cache object type, such as by expecting an ApproximateMemoryUsage() function, but this is not so clean because there are cases where we need to compute the charge ahead of time and don't want to re-compute it. ## block_cache.h This header is essentially the replacement for the old block_like_traits.h. It includes various things to support block cache access with typed_cache.h for block-based table. ## block_based_table_reader.cc Before this change, accessing the block cache here was an awkward mix of static polymorphism (template TBlocklike) and switch-case on a dynamic BlockType value. This change mostly unifies on static polymorphism, relying on minor hacks in block_cache.h to distinguish variants of Block. We still check BlockType in some places (especially for stats, which could be improved in follow-up work) but at least the BlockType is a static constant from the template parameter. (No more awkward partial redundancy between static and dynamic info.) This likely contributes to the overall performance improvement, but hasn't been tested in isolation. The other key source of simplification here is a more unified system of creating block cache objects: for directly populating from primary cache and for promotion from secondary cache. Both use BlockCreateContext, for context and for factory functions. ## block_based_table_builder.cc, cache_dump_load_impl.cc Before this change, warming caches was super ugly code. Both of these source files had switch statements to basically transition from the dynamic BlockType world to the static TBlocklike world. None of that mess is needed anymore as there's a new, untyped WarmInCache function that handles all the details just as promotion from SecondaryCache would. (Fixes `TODO akanksha: Dedup below code` in block_based_table_builder.cc.) ## Everything else Mostly just updating Cache users to use new typed APIs when reasonably possible, or changed Cache APIs when not. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10975 Test Plan: tests updated Performance test setup similar to https://github.com/facebook/rocksdb/issues/10626 (by cache size, LRUCache when not "hyper" for HyperClockCache): 34MB 1thread base.hyper -> kops/s: 0.745 io_bytes/op: 2.52504e+06 miss_ratio: 0.140906 max_rss_mb: 76.4844 34MB 1thread new.hyper -> kops/s: 0.751 io_bytes/op: 2.5123e+06 miss_ratio: 0.140161 max_rss_mb: 79.3594 34MB 1thread base -> kops/s: 0.254 io_bytes/op: 1.36073e+07 miss_ratio: 0.918818 max_rss_mb: 45.9297 34MB 1thread new -> kops/s: 0.252 io_bytes/op: 1.36157e+07 miss_ratio: 0.918999 max_rss_mb: 44.1523 34MB 32thread base.hyper -> kops/s: 7.272 io_bytes/op: 2.88323e+06 miss_ratio: 0.162532 max_rss_mb: 516.602 34MB 32thread new.hyper -> kops/s: 7.214 io_bytes/op: 2.99046e+06 miss_ratio: 0.168818 max_rss_mb: 518.293 34MB 32thread base -> kops/s: 3.528 io_bytes/op: 1.35722e+07 miss_ratio: 0.914691 max_rss_mb: 264.926 34MB 32thread new -> kops/s: 3.604 io_bytes/op: 1.35744e+07 miss_ratio: 0.915054 max_rss_mb: 264.488 233MB 1thread base.hyper -> kops/s: 53.909 io_bytes/op: 2552.35 miss_ratio: 0.0440566 max_rss_mb: 241.984 233MB 1thread new.hyper -> kops/s: 62.792 io_bytes/op: 2549.79 miss_ratio: 0.044043 max_rss_mb: 241.922 233MB 1thread base -> kops/s: 1.197 io_bytes/op: 2.75173e+06 miss_ratio: 0.103093 max_rss_mb: 241.559 233MB 1thread new -> kops/s: 1.199 io_bytes/op: 2.73723e+06 miss_ratio: 0.10305 max_rss_mb: 240.93 233MB 32thread base.hyper -> kops/s: 1298.69 io_bytes/op: 2539.12 miss_ratio: 0.0440307 max_rss_mb: 371.418 233MB 32thread new.hyper -> kops/s: 1421.35 io_bytes/op: 2538.75 miss_ratio: 0.0440307 max_rss_mb: 347.273 233MB 32thread base -> kops/s: 9.693 io_bytes/op: 2.77304e+06 miss_ratio: 0.103745 max_rss_mb: 569.691 233MB 32thread new -> kops/s: 9.75 io_bytes/op: 2.77559e+06 miss_ratio: 0.103798 max_rss_mb: 552.82 1597MB 1thread base.hyper -> kops/s: 58.607 io_bytes/op: 1449.14 miss_ratio: 0.0249324 max_rss_mb: 1583.55 1597MB 1thread new.hyper -> kops/s: 69.6 io_bytes/op: 1434.89 miss_ratio: 0.0247167 max_rss_mb: 1584.02 1597MB 1thread base -> kops/s: 60.478 io_bytes/op: 1421.28 miss_ratio: 0.024452 max_rss_mb: 1589.45 1597MB 1thread new -> kops/s: 63.973 io_bytes/op: 1416.07 miss_ratio: 0.0243766 max_rss_mb: 1589.24 1597MB 32thread base.hyper -> kops/s: 1436.2 io_bytes/op: 1357.93 miss_ratio: 0.0235353 max_rss_mb: 1692.92 1597MB 32thread new.hyper -> kops/s: 1605.03 io_bytes/op: 1358.04 miss_ratio: 0.023538 max_rss_mb: 1702.78 1597MB 32thread base -> kops/s: 280.059 io_bytes/op: 1350.34 miss_ratio: 0.023289 max_rss_mb: 1675.36 1597MB 32thread new -> kops/s: 283.125 io_bytes/op: 1351.05 miss_ratio: 0.0232797 max_rss_mb: 1703.83 Almost uniformly improving over base revision, especially for hot paths with HyperClockCache, up to 12% higher throughput seen (1597MB, 32thread, hyper). The improvement for that is likely coming from much simplified code for providing context for secondary cache promotion (CreateCallback/CreateContext), and possibly from less branching in block_based_table_reader. And likely a small improvement from not reconstituting key for DeleterFn. Reviewed By: anand1976 Differential Revision: D42417818 Pulled By: pdillinger fbshipit-source-id: f86bfdd584dce27c028b151ba56818ad14f7a432
2 years ago
size_t table_cache_usage = table_cache_->get_cache().get()->GetUsage();
if (table_cache_usage >= load_limit) {
Support for single-primary, multi-secondary instances (#4899) Summary: This PR allows RocksDB to run in single-primary, multi-secondary process mode. The writer is a regular RocksDB (e.g. an `DBImpl`) instance playing the role of a primary. Multiple `DBImplSecondary` processes (secondaries) share the same set of SST files, MANIFEST, WAL files with the primary. Secondaries tail the MANIFEST of the primary and apply updates to their own in-memory state of the file system, e.g. `VersionStorageInfo`. This PR has several components: 1. (Originally in #4745). Add a `PathNotFound` subcode to `IOError` to denote the failure when a secondary tries to open a file which has been deleted by the primary. 2. (Similar to #4602). Add `FragmentBufferedReader` to handle partially-read, trailing record at the end of a log from where future read can continue. 3. (Originally in #4710 and #4820). Add implementation of the secondary, i.e. `DBImplSecondary`. 3.1 Tail the primary's MANIFEST during recovery. 3.2 Tail the primary's MANIFEST during normal processing by calling `ReadAndApply`. 3.3 Tailing WAL will be in a future PR. 4. Add an example in 'examples/multi_processes_example.cc' to demonstrate the usage of secondary RocksDB instance in a multi-process setting. Instructions to run the example can be found at the beginning of the source code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4899 Differential Revision: D14510945 Pulled By: riversand963 fbshipit-source-id: 4ac1c5693e6012ad23f7b4b42d3c374fecbe8886
6 years ago
// TODO (yanqin) find a suitable status code.
return Status::OK();
} else {
max_load = load_limit - table_cache_usage;
}
}
// <file metadata, level>
std::vector<std::pair<FileMetaData*, int>> files_meta;
Support for single-primary, multi-secondary instances (#4899) Summary: This PR allows RocksDB to run in single-primary, multi-secondary process mode. The writer is a regular RocksDB (e.g. an `DBImpl`) instance playing the role of a primary. Multiple `DBImplSecondary` processes (secondaries) share the same set of SST files, MANIFEST, WAL files with the primary. Secondaries tail the MANIFEST of the primary and apply updates to their own in-memory state of the file system, e.g. `VersionStorageInfo`. This PR has several components: 1. (Originally in #4745). Add a `PathNotFound` subcode to `IOError` to denote the failure when a secondary tries to open a file which has been deleted by the primary. 2. (Similar to #4602). Add `FragmentBufferedReader` to handle partially-read, trailing record at the end of a log from where future read can continue. 3. (Originally in #4710 and #4820). Add implementation of the secondary, i.e. `DBImplSecondary`. 3.1 Tail the primary's MANIFEST during recovery. 3.2 Tail the primary's MANIFEST during normal processing by calling `ReadAndApply`. 3.3 Tailing WAL will be in a future PR. 4. Add an example in 'examples/multi_processes_example.cc' to demonstrate the usage of secondary RocksDB instance in a multi-process setting. Instructions to run the example can be found at the beginning of the source code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4899 Differential Revision: D14510945 Pulled By: riversand963 fbshipit-source-id: 4ac1c5693e6012ad23f7b4b42d3c374fecbe8886
6 years ago
std::vector<Status> statuses;
for (int level = 0; level < num_levels_; level++) {
for (auto& file_meta_pair : levels_[level].added_files) {
auto* file_meta = file_meta_pair.second;
Support for single-primary, multi-secondary instances (#4899) Summary: This PR allows RocksDB to run in single-primary, multi-secondary process mode. The writer is a regular RocksDB (e.g. an `DBImpl`) instance playing the role of a primary. Multiple `DBImplSecondary` processes (secondaries) share the same set of SST files, MANIFEST, WAL files with the primary. Secondaries tail the MANIFEST of the primary and apply updates to their own in-memory state of the file system, e.g. `VersionStorageInfo`. This PR has several components: 1. (Originally in #4745). Add a `PathNotFound` subcode to `IOError` to denote the failure when a secondary tries to open a file which has been deleted by the primary. 2. (Similar to #4602). Add `FragmentBufferedReader` to handle partially-read, trailing record at the end of a log from where future read can continue. 3. (Originally in #4710 and #4820). Add implementation of the secondary, i.e. `DBImplSecondary`. 3.1 Tail the primary's MANIFEST during recovery. 3.2 Tail the primary's MANIFEST during normal processing by calling `ReadAndApply`. 3.3 Tailing WAL will be in a future PR. 4. Add an example in 'examples/multi_processes_example.cc' to demonstrate the usage of secondary RocksDB instance in a multi-process setting. Instructions to run the example can be found at the beginning of the source code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4899 Differential Revision: D14510945 Pulled By: riversand963 fbshipit-source-id: 4ac1c5693e6012ad23f7b4b42d3c374fecbe8886
6 years ago
// If the file has been opened before, just skip it.
if (!file_meta->table_reader_handle) {
files_meta.emplace_back(file_meta, level);
statuses.emplace_back(Status::OK());
}
if (files_meta.size() >= max_load) {
break;
}
}
if (files_meta.size() >= max_load) {
break;
}
}
std::atomic<size_t> next_file_meta_idx(0);
std::function<void()> load_handlers_func([&]() {
while (true) {
size_t file_idx = next_file_meta_idx.fetch_add(1);
if (file_idx >= files_meta.size()) {
break;
}
auto* file_meta = files_meta[file_idx].first;
int level = files_meta[file_idx].second;
Major Cache refactoring, CPU efficiency improvement (#10975) Summary: This is several refactorings bundled into one to avoid having to incrementally re-modify uses of Cache several times. Overall, there are breaking changes to Cache class, and it becomes more of low-level interface for implementing caches, especially block cache. New internal APIs make using Cache cleaner than before, and more insulated from block cache evolution. Hopefully, this is the last really big block cache refactoring, because of rather effectively decoupling the implementations from the uses. This change also removes the EXPERIMENTAL designation on the SecondaryCache support in Cache. It seems reasonably mature at this point but still subject to change/evolution (as I warn in the API docs for Cache). The high-level motivation for this refactoring is to minimize code duplication / compounding complexity in adding SecondaryCache support to HyperClockCache (in a later PR). Other benefits listed below. * static_cast lines of code +29 -35 (net removed 6) * reinterpret_cast lines of code +6 -32 (net removed 26) ## cache.h and secondary_cache.h * Always use CacheItemHelper with entries instead of just a Deleter. There are several motivations / justifications: * Simpler for implementations to deal with just one Insert and one Lookup. * Simpler and more efficient implementation because we don't have to track which entries are using helpers and which are using deleters * Gets rid of hack to classify cache entries by their deleter. Instead, the CacheItemHelper includes a CacheEntryRole. This simplifies a lot of code (cache_entry_roles.h almost eliminated). Fixes https://github.com/facebook/rocksdb/issues/9428. * Makes it trivial to adjust SecondaryCache behavior based on kind of block (e.g. don't re-compress filter blocks). * It is arguably less convenient for many direct users of Cache, but direct users of Cache are now rare with introduction of typed_cache.h (below). * I considered and rejected an alternative approach in which we reduce customizability by assuming each secondary cache compatible value starts with a Slice referencing the uncompressed block contents (already true or mostly true), but we apparently intend to stack secondary caches. Saving an entry from a compressed secondary to a lower tier requires custom handling offered by SaveToCallback, etc. * Make CreateCallback part of the helper and introduce CreateContext to work with it (alternative to https://github.com/facebook/rocksdb/issues/10562). This cleans up the interface while still allowing context to be provided for loading/parsing values into primary cache. This model works for async lookup in BlockBasedTable reader (reader owns a CreateContext) under the assumption that it always waits on secondary cache operations to finish. (Otherwise, the CreateContext could be destroyed while async operation depending on it continues.) This likely contributes most to the observed performance improvement because it saves an std::function backed by a heap allocation. * Use char* for serialized data, e.g. in SaveToCallback, where void* was confusingly used. (We use `char*` for serialized byte data all over RocksDB, with many advantages over `void*`. `memcpy` etc. are legacy APIs that should not be mimicked.) * Add a type alias Cache::ObjectPtr = void*, so that we can better indicate the intent of the void* when it is to be the object associated with a Cache entry. Related: started (but did not complete) a refactoring to move away from "value" of a cache entry toward "object" or "obj". (It is confusing to call Cache a key-value store (like DB) when it is really storing arbitrary in-memory objects, not byte strings.) * Remove unnecessary key param from DeleterFn. This is good for efficiency in HyperClockCache, which does not directly store the cache key in memory. (Alternative to https://github.com/facebook/rocksdb/issues/10774) * Add allocator to Cache DeleterFn. This is a kind of future-proofing change in case we get more serious about using the Cache allocator for memory tracked by the Cache. Right now, only the uncompressed block contents are allocated using the allocator, and a pointer to that allocator is saved as part of the cached object so that the deleter can use it. (See CacheAllocationPtr.) If in the future we are able to "flatten out" our Cache objects some more, it would be good not to have to track the allocator as part of each object. * Removes legacy `ApplyToAllCacheEntries` and changes `ApplyToAllEntries` signature for Deleter->CacheItemHelper change. ## typed_cache.h Adds various "typed" interfaces to the Cache as internal APIs, so that most uses of Cache can use simple type safe code without casting and without explicit deleters, etc. Almost all of the non-test, non-glue code uses of Cache have been migrated. (Follow-up work: CompressedSecondaryCache deserves deeper attention to migrate.) This change expands RocksDB's internal usage of metaprogramming and SFINAE (https://en.cppreference.com/w/cpp/language/sfinae). The existing usages of Cache are divided up at a high level into these new interfaces. See updated existing uses of Cache for examples of how these are used. * PlaceholderCacheInterface - Used for making cache reservations, with entries that have a charge but no value. * BasicTypedCacheInterface<TValue> - Used for primary cache storage of objects of type TValue, which can be cleaned up with std::default_delete<TValue>. The role is provided by TValue::kCacheEntryRole or given in an optional template parameter. * FullTypedCacheInterface<TValue, TCreateContext> - Used for secondary cache compatible storage of objects of type TValue. In addition to BasicTypedCacheInterface constraints, we require TValue::ContentSlice() to return persistable data. This simplifies usage for the normal case of simple secondary cache compatibility (can give you a Slice to the data already in memory). In addition to TCreateContext performing the role of Cache::CreateContext, it is also expected to provide a factory function for creating TValue. * For each of these, there's a "Shared" version (e.g. FullTypedSharedCacheInterface) that holds a shared_ptr to the Cache, rather than assuming external ownership by holding only a raw `Cache*`. These interfaces introduce specific handle types for each interface instantiation, so that it's easy to see what kind of object is controlled by a handle. (Ultimately, this might not be worth the extra complexity, but it seems OK so far.) Note: I attempted to make the cache 'charge' automatically inferred from the cache object type, such as by expecting an ApproximateMemoryUsage() function, but this is not so clean because there are cases where we need to compute the charge ahead of time and don't want to re-compute it. ## block_cache.h This header is essentially the replacement for the old block_like_traits.h. It includes various things to support block cache access with typed_cache.h for block-based table. ## block_based_table_reader.cc Before this change, accessing the block cache here was an awkward mix of static polymorphism (template TBlocklike) and switch-case on a dynamic BlockType value. This change mostly unifies on static polymorphism, relying on minor hacks in block_cache.h to distinguish variants of Block. We still check BlockType in some places (especially for stats, which could be improved in follow-up work) but at least the BlockType is a static constant from the template parameter. (No more awkward partial redundancy between static and dynamic info.) This likely contributes to the overall performance improvement, but hasn't been tested in isolation. The other key source of simplification here is a more unified system of creating block cache objects: for directly populating from primary cache and for promotion from secondary cache. Both use BlockCreateContext, for context and for factory functions. ## block_based_table_builder.cc, cache_dump_load_impl.cc Before this change, warming caches was super ugly code. Both of these source files had switch statements to basically transition from the dynamic BlockType world to the static TBlocklike world. None of that mess is needed anymore as there's a new, untyped WarmInCache function that handles all the details just as promotion from SecondaryCache would. (Fixes `TODO akanksha: Dedup below code` in block_based_table_builder.cc.) ## Everything else Mostly just updating Cache users to use new typed APIs when reasonably possible, or changed Cache APIs when not. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10975 Test Plan: tests updated Performance test setup similar to https://github.com/facebook/rocksdb/issues/10626 (by cache size, LRUCache when not "hyper" for HyperClockCache): 34MB 1thread base.hyper -> kops/s: 0.745 io_bytes/op: 2.52504e+06 miss_ratio: 0.140906 max_rss_mb: 76.4844 34MB 1thread new.hyper -> kops/s: 0.751 io_bytes/op: 2.5123e+06 miss_ratio: 0.140161 max_rss_mb: 79.3594 34MB 1thread base -> kops/s: 0.254 io_bytes/op: 1.36073e+07 miss_ratio: 0.918818 max_rss_mb: 45.9297 34MB 1thread new -> kops/s: 0.252 io_bytes/op: 1.36157e+07 miss_ratio: 0.918999 max_rss_mb: 44.1523 34MB 32thread base.hyper -> kops/s: 7.272 io_bytes/op: 2.88323e+06 miss_ratio: 0.162532 max_rss_mb: 516.602 34MB 32thread new.hyper -> kops/s: 7.214 io_bytes/op: 2.99046e+06 miss_ratio: 0.168818 max_rss_mb: 518.293 34MB 32thread base -> kops/s: 3.528 io_bytes/op: 1.35722e+07 miss_ratio: 0.914691 max_rss_mb: 264.926 34MB 32thread new -> kops/s: 3.604 io_bytes/op: 1.35744e+07 miss_ratio: 0.915054 max_rss_mb: 264.488 233MB 1thread base.hyper -> kops/s: 53.909 io_bytes/op: 2552.35 miss_ratio: 0.0440566 max_rss_mb: 241.984 233MB 1thread new.hyper -> kops/s: 62.792 io_bytes/op: 2549.79 miss_ratio: 0.044043 max_rss_mb: 241.922 233MB 1thread base -> kops/s: 1.197 io_bytes/op: 2.75173e+06 miss_ratio: 0.103093 max_rss_mb: 241.559 233MB 1thread new -> kops/s: 1.199 io_bytes/op: 2.73723e+06 miss_ratio: 0.10305 max_rss_mb: 240.93 233MB 32thread base.hyper -> kops/s: 1298.69 io_bytes/op: 2539.12 miss_ratio: 0.0440307 max_rss_mb: 371.418 233MB 32thread new.hyper -> kops/s: 1421.35 io_bytes/op: 2538.75 miss_ratio: 0.0440307 max_rss_mb: 347.273 233MB 32thread base -> kops/s: 9.693 io_bytes/op: 2.77304e+06 miss_ratio: 0.103745 max_rss_mb: 569.691 233MB 32thread new -> kops/s: 9.75 io_bytes/op: 2.77559e+06 miss_ratio: 0.103798 max_rss_mb: 552.82 1597MB 1thread base.hyper -> kops/s: 58.607 io_bytes/op: 1449.14 miss_ratio: 0.0249324 max_rss_mb: 1583.55 1597MB 1thread new.hyper -> kops/s: 69.6 io_bytes/op: 1434.89 miss_ratio: 0.0247167 max_rss_mb: 1584.02 1597MB 1thread base -> kops/s: 60.478 io_bytes/op: 1421.28 miss_ratio: 0.024452 max_rss_mb: 1589.45 1597MB 1thread new -> kops/s: 63.973 io_bytes/op: 1416.07 miss_ratio: 0.0243766 max_rss_mb: 1589.24 1597MB 32thread base.hyper -> kops/s: 1436.2 io_bytes/op: 1357.93 miss_ratio: 0.0235353 max_rss_mb: 1692.92 1597MB 32thread new.hyper -> kops/s: 1605.03 io_bytes/op: 1358.04 miss_ratio: 0.023538 max_rss_mb: 1702.78 1597MB 32thread base -> kops/s: 280.059 io_bytes/op: 1350.34 miss_ratio: 0.023289 max_rss_mb: 1675.36 1597MB 32thread new -> kops/s: 283.125 io_bytes/op: 1351.05 miss_ratio: 0.0232797 max_rss_mb: 1703.83 Almost uniformly improving over base revision, especially for hot paths with HyperClockCache, up to 12% higher throughput seen (1597MB, 32thread, hyper). The improvement for that is likely coming from much simplified code for providing context for secondary cache promotion (CreateCallback/CreateContext), and possibly from less branching in block_based_table_reader. And likely a small improvement from not reconstituting key for DeleterFn. Reviewed By: anand1976 Differential Revision: D42417818 Pulled By: pdillinger fbshipit-source-id: f86bfdd584dce27c028b151ba56818ad14f7a432
2 years ago
TableCache::TypedHandle* handle = nullptr;
Support for single-primary, multi-secondary instances (#4899) Summary: This PR allows RocksDB to run in single-primary, multi-secondary process mode. The writer is a regular RocksDB (e.g. an `DBImpl`) instance playing the role of a primary. Multiple `DBImplSecondary` processes (secondaries) share the same set of SST files, MANIFEST, WAL files with the primary. Secondaries tail the MANIFEST of the primary and apply updates to their own in-memory state of the file system, e.g. `VersionStorageInfo`. This PR has several components: 1. (Originally in #4745). Add a `PathNotFound` subcode to `IOError` to denote the failure when a secondary tries to open a file which has been deleted by the primary. 2. (Similar to #4602). Add `FragmentBufferedReader` to handle partially-read, trailing record at the end of a log from where future read can continue. 3. (Originally in #4710 and #4820). Add implementation of the secondary, i.e. `DBImplSecondary`. 3.1 Tail the primary's MANIFEST during recovery. 3.2 Tail the primary's MANIFEST during normal processing by calling `ReadAndApply`. 3.3 Tailing WAL will be in a future PR. 4. Add an example in 'examples/multi_processes_example.cc' to demonstrate the usage of secondary RocksDB instance in a multi-process setting. Instructions to run the example can be found at the beginning of the source code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4899 Differential Revision: D14510945 Pulled By: riversand963 fbshipit-source-id: 4ac1c5693e6012ad23f7b4b42d3c374fecbe8886
6 years ago
statuses[file_idx] = table_cache_->FindTable(
Group rocksdb.sst.read.micros stat by IOActivity flush and compaction (#11288) Summary: **Context:** The existing stat rocksdb.sst.read.micros does not reflect each of compaction and flush cases but aggregate them, which is not so helpful for us to understand IO read behavior of each of them. **Summary** - Update `StopWatch` and `RandomAccessFileReader` to record `rocksdb.sst.read.micros` and `rocksdb.file.{flush/compaction}.read.micros` - Fixed the default histogram in `RandomAccessFileReader` - New field `ReadOptions/IOOptions::io_activity`; Pass `ReadOptions` through paths under db open, flush and compaction to where we can prepare `IOOptions` and pass it to `RandomAccessFileReader` - Use `thread_status_util` for assertion in `DbStressFSWrapper` for continuous testing on we are passing correct `io_activity` under db open, flush and compaction Pull Request resolved: https://github.com/facebook/rocksdb/pull/11288 Test Plan: - **Stress test** - **Db bench 1: rocksdb.sst.read.micros COUNT ≈ sum of rocksdb.file.read.flush.micros's and rocksdb.file.read.compaction.micros's.** (without blob) - May not be exactly the same due to `HistogramStat::Add` only guarantees atomic not accuracy across threads. ``` ./db_bench -db=/dev/shm/testdb/ -statistics=true -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -target_file_size_base=655 -disable_auto_compactions=false -compression_type=none -bloom_bits=3 (-use_plain_table=1 -prefix_size=10) ``` ``` // BlockBasedTable rocksdb.sst.read.micros P50 : 2.009374 P95 : 4.968548 P99 : 8.110362 P100 : 43.000000 COUNT : 40456 SUM : 114805 rocksdb.file.read.flush.micros P50 : 1.871841 P95 : 3.872407 P99 : 5.540541 P100 : 43.000000 COUNT : 2250 SUM : 6116 rocksdb.file.read.compaction.micros P50 : 2.023109 P95 : 5.029149 P99 : 8.196910 P100 : 26.000000 COUNT : 38206 SUM : 108689 // PlainTable Does not apply ``` - **Db bench 2: performance** **Read** SETUP: db with 900 files ``` ./db_bench -db=/dev/shm/testdb/ -benchmarks="fillseq" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=true -target_file_size_base=655 -compression_type=none ```run till convergence ``` ./db_bench -seed=1678564177044286 -use_existing_db=true -db=/dev/shm/testdb -benchmarks=readrandom[-X60] -statistics=true -num=1000000 -disable_auto_compactions=true -compression_type=none -bloom_bits=3 ``` Pre-change `readrandom [AVG 60 runs] : 21568 (± 248) ops/sec` Post-change (no regression, -0.3%) `readrandom [AVG 60 runs] : 21486 (± 236) ops/sec` **Compaction/Flush**run till convergence ``` ./db_bench -db=/dev/shm/testdb2/ -seed=1678564177044286 -benchmarks="fillseq[-X60]" -key_size=32 -value_size=512 -num=50000 -write_buffer_size=655 -disable_auto_compactions=false -target_file_size_base=655 -compression_type=none rocksdb.sst.read.micros COUNT : 33820 rocksdb.sst.read.flush.micros COUNT : 1800 rocksdb.sst.read.compaction.micros COUNT : 32020 ``` Pre-change `fillseq [AVG 46 runs] : 1391 (± 214) ops/sec; 0.7 (± 0.1) MB/sec` Post-change (no regression, ~-0.4%) `fillseq [AVG 46 runs] : 1385 (± 216) ops/sec; 0.7 (± 0.1) MB/sec` Reviewed By: ajkr Differential Revision: D44007011 Pulled By: hx235 fbshipit-source-id: a54c89e4846dfc9a135389edf3f3eedfea257132
2 years ago
read_options, file_options_,
Major Cache refactoring, CPU efficiency improvement (#10975) Summary: This is several refactorings bundled into one to avoid having to incrementally re-modify uses of Cache several times. Overall, there are breaking changes to Cache class, and it becomes more of low-level interface for implementing caches, especially block cache. New internal APIs make using Cache cleaner than before, and more insulated from block cache evolution. Hopefully, this is the last really big block cache refactoring, because of rather effectively decoupling the implementations from the uses. This change also removes the EXPERIMENTAL designation on the SecondaryCache support in Cache. It seems reasonably mature at this point but still subject to change/evolution (as I warn in the API docs for Cache). The high-level motivation for this refactoring is to minimize code duplication / compounding complexity in adding SecondaryCache support to HyperClockCache (in a later PR). Other benefits listed below. * static_cast lines of code +29 -35 (net removed 6) * reinterpret_cast lines of code +6 -32 (net removed 26) ## cache.h and secondary_cache.h * Always use CacheItemHelper with entries instead of just a Deleter. There are several motivations / justifications: * Simpler for implementations to deal with just one Insert and one Lookup. * Simpler and more efficient implementation because we don't have to track which entries are using helpers and which are using deleters * Gets rid of hack to classify cache entries by their deleter. Instead, the CacheItemHelper includes a CacheEntryRole. This simplifies a lot of code (cache_entry_roles.h almost eliminated). Fixes https://github.com/facebook/rocksdb/issues/9428. * Makes it trivial to adjust SecondaryCache behavior based on kind of block (e.g. don't re-compress filter blocks). * It is arguably less convenient for many direct users of Cache, but direct users of Cache are now rare with introduction of typed_cache.h (below). * I considered and rejected an alternative approach in which we reduce customizability by assuming each secondary cache compatible value starts with a Slice referencing the uncompressed block contents (already true or mostly true), but we apparently intend to stack secondary caches. Saving an entry from a compressed secondary to a lower tier requires custom handling offered by SaveToCallback, etc. * Make CreateCallback part of the helper and introduce CreateContext to work with it (alternative to https://github.com/facebook/rocksdb/issues/10562). This cleans up the interface while still allowing context to be provided for loading/parsing values into primary cache. This model works for async lookup in BlockBasedTable reader (reader owns a CreateContext) under the assumption that it always waits on secondary cache operations to finish. (Otherwise, the CreateContext could be destroyed while async operation depending on it continues.) This likely contributes most to the observed performance improvement because it saves an std::function backed by a heap allocation. * Use char* for serialized data, e.g. in SaveToCallback, where void* was confusingly used. (We use `char*` for serialized byte data all over RocksDB, with many advantages over `void*`. `memcpy` etc. are legacy APIs that should not be mimicked.) * Add a type alias Cache::ObjectPtr = void*, so that we can better indicate the intent of the void* when it is to be the object associated with a Cache entry. Related: started (but did not complete) a refactoring to move away from "value" of a cache entry toward "object" or "obj". (It is confusing to call Cache a key-value store (like DB) when it is really storing arbitrary in-memory objects, not byte strings.) * Remove unnecessary key param from DeleterFn. This is good for efficiency in HyperClockCache, which does not directly store the cache key in memory. (Alternative to https://github.com/facebook/rocksdb/issues/10774) * Add allocator to Cache DeleterFn. This is a kind of future-proofing change in case we get more serious about using the Cache allocator for memory tracked by the Cache. Right now, only the uncompressed block contents are allocated using the allocator, and a pointer to that allocator is saved as part of the cached object so that the deleter can use it. (See CacheAllocationPtr.) If in the future we are able to "flatten out" our Cache objects some more, it would be good not to have to track the allocator as part of each object. * Removes legacy `ApplyToAllCacheEntries` and changes `ApplyToAllEntries` signature for Deleter->CacheItemHelper change. ## typed_cache.h Adds various "typed" interfaces to the Cache as internal APIs, so that most uses of Cache can use simple type safe code without casting and without explicit deleters, etc. Almost all of the non-test, non-glue code uses of Cache have been migrated. (Follow-up work: CompressedSecondaryCache deserves deeper attention to migrate.) This change expands RocksDB's internal usage of metaprogramming and SFINAE (https://en.cppreference.com/w/cpp/language/sfinae). The existing usages of Cache are divided up at a high level into these new interfaces. See updated existing uses of Cache for examples of how these are used. * PlaceholderCacheInterface - Used for making cache reservations, with entries that have a charge but no value. * BasicTypedCacheInterface<TValue> - Used for primary cache storage of objects of type TValue, which can be cleaned up with std::default_delete<TValue>. The role is provided by TValue::kCacheEntryRole or given in an optional template parameter. * FullTypedCacheInterface<TValue, TCreateContext> - Used for secondary cache compatible storage of objects of type TValue. In addition to BasicTypedCacheInterface constraints, we require TValue::ContentSlice() to return persistable data. This simplifies usage for the normal case of simple secondary cache compatibility (can give you a Slice to the data already in memory). In addition to TCreateContext performing the role of Cache::CreateContext, it is also expected to provide a factory function for creating TValue. * For each of these, there's a "Shared" version (e.g. FullTypedSharedCacheInterface) that holds a shared_ptr to the Cache, rather than assuming external ownership by holding only a raw `Cache*`. These interfaces introduce specific handle types for each interface instantiation, so that it's easy to see what kind of object is controlled by a handle. (Ultimately, this might not be worth the extra complexity, but it seems OK so far.) Note: I attempted to make the cache 'charge' automatically inferred from the cache object type, such as by expecting an ApproximateMemoryUsage() function, but this is not so clean because there are cases where we need to compute the charge ahead of time and don't want to re-compute it. ## block_cache.h This header is essentially the replacement for the old block_like_traits.h. It includes various things to support block cache access with typed_cache.h for block-based table. ## block_based_table_reader.cc Before this change, accessing the block cache here was an awkward mix of static polymorphism (template TBlocklike) and switch-case on a dynamic BlockType value. This change mostly unifies on static polymorphism, relying on minor hacks in block_cache.h to distinguish variants of Block. We still check BlockType in some places (especially for stats, which could be improved in follow-up work) but at least the BlockType is a static constant from the template parameter. (No more awkward partial redundancy between static and dynamic info.) This likely contributes to the overall performance improvement, but hasn't been tested in isolation. The other key source of simplification here is a more unified system of creating block cache objects: for directly populating from primary cache and for promotion from secondary cache. Both use BlockCreateContext, for context and for factory functions. ## block_based_table_builder.cc, cache_dump_load_impl.cc Before this change, warming caches was super ugly code. Both of these source files had switch statements to basically transition from the dynamic BlockType world to the static TBlocklike world. None of that mess is needed anymore as there's a new, untyped WarmInCache function that handles all the details just as promotion from SecondaryCache would. (Fixes `TODO akanksha: Dedup below code` in block_based_table_builder.cc.) ## Everything else Mostly just updating Cache users to use new typed APIs when reasonably possible, or changed Cache APIs when not. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10975 Test Plan: tests updated Performance test setup similar to https://github.com/facebook/rocksdb/issues/10626 (by cache size, LRUCache when not "hyper" for HyperClockCache): 34MB 1thread base.hyper -> kops/s: 0.745 io_bytes/op: 2.52504e+06 miss_ratio: 0.140906 max_rss_mb: 76.4844 34MB 1thread new.hyper -> kops/s: 0.751 io_bytes/op: 2.5123e+06 miss_ratio: 0.140161 max_rss_mb: 79.3594 34MB 1thread base -> kops/s: 0.254 io_bytes/op: 1.36073e+07 miss_ratio: 0.918818 max_rss_mb: 45.9297 34MB 1thread new -> kops/s: 0.252 io_bytes/op: 1.36157e+07 miss_ratio: 0.918999 max_rss_mb: 44.1523 34MB 32thread base.hyper -> kops/s: 7.272 io_bytes/op: 2.88323e+06 miss_ratio: 0.162532 max_rss_mb: 516.602 34MB 32thread new.hyper -> kops/s: 7.214 io_bytes/op: 2.99046e+06 miss_ratio: 0.168818 max_rss_mb: 518.293 34MB 32thread base -> kops/s: 3.528 io_bytes/op: 1.35722e+07 miss_ratio: 0.914691 max_rss_mb: 264.926 34MB 32thread new -> kops/s: 3.604 io_bytes/op: 1.35744e+07 miss_ratio: 0.915054 max_rss_mb: 264.488 233MB 1thread base.hyper -> kops/s: 53.909 io_bytes/op: 2552.35 miss_ratio: 0.0440566 max_rss_mb: 241.984 233MB 1thread new.hyper -> kops/s: 62.792 io_bytes/op: 2549.79 miss_ratio: 0.044043 max_rss_mb: 241.922 233MB 1thread base -> kops/s: 1.197 io_bytes/op: 2.75173e+06 miss_ratio: 0.103093 max_rss_mb: 241.559 233MB 1thread new -> kops/s: 1.199 io_bytes/op: 2.73723e+06 miss_ratio: 0.10305 max_rss_mb: 240.93 233MB 32thread base.hyper -> kops/s: 1298.69 io_bytes/op: 2539.12 miss_ratio: 0.0440307 max_rss_mb: 371.418 233MB 32thread new.hyper -> kops/s: 1421.35 io_bytes/op: 2538.75 miss_ratio: 0.0440307 max_rss_mb: 347.273 233MB 32thread base -> kops/s: 9.693 io_bytes/op: 2.77304e+06 miss_ratio: 0.103745 max_rss_mb: 569.691 233MB 32thread new -> kops/s: 9.75 io_bytes/op: 2.77559e+06 miss_ratio: 0.103798 max_rss_mb: 552.82 1597MB 1thread base.hyper -> kops/s: 58.607 io_bytes/op: 1449.14 miss_ratio: 0.0249324 max_rss_mb: 1583.55 1597MB 1thread new.hyper -> kops/s: 69.6 io_bytes/op: 1434.89 miss_ratio: 0.0247167 max_rss_mb: 1584.02 1597MB 1thread base -> kops/s: 60.478 io_bytes/op: 1421.28 miss_ratio: 0.024452 max_rss_mb: 1589.45 1597MB 1thread new -> kops/s: 63.973 io_bytes/op: 1416.07 miss_ratio: 0.0243766 max_rss_mb: 1589.24 1597MB 32thread base.hyper -> kops/s: 1436.2 io_bytes/op: 1357.93 miss_ratio: 0.0235353 max_rss_mb: 1692.92 1597MB 32thread new.hyper -> kops/s: 1605.03 io_bytes/op: 1358.04 miss_ratio: 0.023538 max_rss_mb: 1702.78 1597MB 32thread base -> kops/s: 280.059 io_bytes/op: 1350.34 miss_ratio: 0.023289 max_rss_mb: 1675.36 1597MB 32thread new -> kops/s: 283.125 io_bytes/op: 1351.05 miss_ratio: 0.0232797 max_rss_mb: 1703.83 Almost uniformly improving over base revision, especially for hot paths with HyperClockCache, up to 12% higher throughput seen (1597MB, 32thread, hyper). The improvement for that is likely coming from much simplified code for providing context for secondary cache promotion (CreateCallback/CreateContext), and possibly from less branching in block_based_table_reader. And likely a small improvement from not reconstituting key for DeleterFn. Reviewed By: anand1976 Differential Revision: D42417818 Pulled By: pdillinger fbshipit-source-id: f86bfdd584dce27c028b151ba56818ad14f7a432
2 years ago
*(base_vstorage_->InternalComparator()), *file_meta, &handle,
Block per key-value checksum (#11287) Summary: add option `block_protection_bytes_per_key` and implementation for block per key-value checksum. The main changes are 1. checksum construction and verification in block.cc/h 2. pass the option `block_protection_bytes_per_key` around (mainly for methods defined in table_cache.h) 3. unit tests/crash test updates Tests: * Added unit tests * Crash test: `python3 tools/db_crashtest.py blackbox --simple --block_protection_bytes_per_key=1 --write_buffer_size=1048576` Follow up (maybe as a separate PR): make sure corruption status returned from BlockIters are correctly handled. Performance: Turning on block per KV protection has a non-trivial negative impact on read performance and costs additional memory. For memory, each block includes additional 24 bytes for checksum-related states beside checksum itself. For CPU, I set up a DB of size ~1.2GB with 5M keys (32 bytes key and 200 bytes value) which compacts to ~5 SST files (target file size 256 MB) in L6 without compression. I tested readrandom performance with various block cache size (to mimic various cache hit rates): ``` SETUP make OPTIMIZE_LEVEL="-O3" USE_LTO=1 DEBUG_LEVEL=0 -j32 db_bench ./db_bench -benchmarks=fillseq,compact0,waitforcompaction,compact,waitforcompaction -write_buffer_size=33554432 -level_compaction_dynamic_level_bytes=true -max_background_jobs=8 -target_file_size_base=268435456 --num=5000000 --key_size=32 --value_size=200 --compression_type=none BENCHMARK ./db_bench --use_existing_db -benchmarks=readtocache,readrandom[-X10] --num=5000000 --key_size=32 --disable_auto_compactions --reads=1000000 --block_protection_bytes_per_key=[0|1] --cache_size=$CACHESIZE The readrandom ops/sec looks like the following: Block cache size: 2GB 1.2GB * 0.9 1.2GB * 0.8 1.2GB * 0.5 8MB Main 240805 223604 198176 161653 139040 PR prot_bytes=0 238691 226693 200127 161082 141153 PR prot_bytes=1 214983 193199 178532 137013 108211 prot_bytes=1 vs -10% -15% -10.8% -15% -23% prot_bytes=0 ``` The benchmark has a lot of variance, but there was a 5% to 25% regression in this benchmark with different cache hit rates. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11287 Reviewed By: ajkr Differential Revision: D43970708 Pulled By: cbi42 fbshipit-source-id: ef98d898b71779846fa74212b9ec9e08b7183940
2 years ago
block_protection_bytes_per_key, prefix_extractor, false /*no_io */,
true /* record_read_stats */,
internal_stats->GetFileReadHist(level), false, level,
prefetch_index_and_filter_in_cache, max_file_size_for_l0_meta_pin,
file_meta->temperature);
Major Cache refactoring, CPU efficiency improvement (#10975) Summary: This is several refactorings bundled into one to avoid having to incrementally re-modify uses of Cache several times. Overall, there are breaking changes to Cache class, and it becomes more of low-level interface for implementing caches, especially block cache. New internal APIs make using Cache cleaner than before, and more insulated from block cache evolution. Hopefully, this is the last really big block cache refactoring, because of rather effectively decoupling the implementations from the uses. This change also removes the EXPERIMENTAL designation on the SecondaryCache support in Cache. It seems reasonably mature at this point but still subject to change/evolution (as I warn in the API docs for Cache). The high-level motivation for this refactoring is to minimize code duplication / compounding complexity in adding SecondaryCache support to HyperClockCache (in a later PR). Other benefits listed below. * static_cast lines of code +29 -35 (net removed 6) * reinterpret_cast lines of code +6 -32 (net removed 26) ## cache.h and secondary_cache.h * Always use CacheItemHelper with entries instead of just a Deleter. There are several motivations / justifications: * Simpler for implementations to deal with just one Insert and one Lookup. * Simpler and more efficient implementation because we don't have to track which entries are using helpers and which are using deleters * Gets rid of hack to classify cache entries by their deleter. Instead, the CacheItemHelper includes a CacheEntryRole. This simplifies a lot of code (cache_entry_roles.h almost eliminated). Fixes https://github.com/facebook/rocksdb/issues/9428. * Makes it trivial to adjust SecondaryCache behavior based on kind of block (e.g. don't re-compress filter blocks). * It is arguably less convenient for many direct users of Cache, but direct users of Cache are now rare with introduction of typed_cache.h (below). * I considered and rejected an alternative approach in which we reduce customizability by assuming each secondary cache compatible value starts with a Slice referencing the uncompressed block contents (already true or mostly true), but we apparently intend to stack secondary caches. Saving an entry from a compressed secondary to a lower tier requires custom handling offered by SaveToCallback, etc. * Make CreateCallback part of the helper and introduce CreateContext to work with it (alternative to https://github.com/facebook/rocksdb/issues/10562). This cleans up the interface while still allowing context to be provided for loading/parsing values into primary cache. This model works for async lookup in BlockBasedTable reader (reader owns a CreateContext) under the assumption that it always waits on secondary cache operations to finish. (Otherwise, the CreateContext could be destroyed while async operation depending on it continues.) This likely contributes most to the observed performance improvement because it saves an std::function backed by a heap allocation. * Use char* for serialized data, e.g. in SaveToCallback, where void* was confusingly used. (We use `char*` for serialized byte data all over RocksDB, with many advantages over `void*`. `memcpy` etc. are legacy APIs that should not be mimicked.) * Add a type alias Cache::ObjectPtr = void*, so that we can better indicate the intent of the void* when it is to be the object associated with a Cache entry. Related: started (but did not complete) a refactoring to move away from "value" of a cache entry toward "object" or "obj". (It is confusing to call Cache a key-value store (like DB) when it is really storing arbitrary in-memory objects, not byte strings.) * Remove unnecessary key param from DeleterFn. This is good for efficiency in HyperClockCache, which does not directly store the cache key in memory. (Alternative to https://github.com/facebook/rocksdb/issues/10774) * Add allocator to Cache DeleterFn. This is a kind of future-proofing change in case we get more serious about using the Cache allocator for memory tracked by the Cache. Right now, only the uncompressed block contents are allocated using the allocator, and a pointer to that allocator is saved as part of the cached object so that the deleter can use it. (See CacheAllocationPtr.) If in the future we are able to "flatten out" our Cache objects some more, it would be good not to have to track the allocator as part of each object. * Removes legacy `ApplyToAllCacheEntries` and changes `ApplyToAllEntries` signature for Deleter->CacheItemHelper change. ## typed_cache.h Adds various "typed" interfaces to the Cache as internal APIs, so that most uses of Cache can use simple type safe code without casting and without explicit deleters, etc. Almost all of the non-test, non-glue code uses of Cache have been migrated. (Follow-up work: CompressedSecondaryCache deserves deeper attention to migrate.) This change expands RocksDB's internal usage of metaprogramming and SFINAE (https://en.cppreference.com/w/cpp/language/sfinae). The existing usages of Cache are divided up at a high level into these new interfaces. See updated existing uses of Cache for examples of how these are used. * PlaceholderCacheInterface - Used for making cache reservations, with entries that have a charge but no value. * BasicTypedCacheInterface<TValue> - Used for primary cache storage of objects of type TValue, which can be cleaned up with std::default_delete<TValue>. The role is provided by TValue::kCacheEntryRole or given in an optional template parameter. * FullTypedCacheInterface<TValue, TCreateContext> - Used for secondary cache compatible storage of objects of type TValue. In addition to BasicTypedCacheInterface constraints, we require TValue::ContentSlice() to return persistable data. This simplifies usage for the normal case of simple secondary cache compatibility (can give you a Slice to the data already in memory). In addition to TCreateContext performing the role of Cache::CreateContext, it is also expected to provide a factory function for creating TValue. * For each of these, there's a "Shared" version (e.g. FullTypedSharedCacheInterface) that holds a shared_ptr to the Cache, rather than assuming external ownership by holding only a raw `Cache*`. These interfaces introduce specific handle types for each interface instantiation, so that it's easy to see what kind of object is controlled by a handle. (Ultimately, this might not be worth the extra complexity, but it seems OK so far.) Note: I attempted to make the cache 'charge' automatically inferred from the cache object type, such as by expecting an ApproximateMemoryUsage() function, but this is not so clean because there are cases where we need to compute the charge ahead of time and don't want to re-compute it. ## block_cache.h This header is essentially the replacement for the old block_like_traits.h. It includes various things to support block cache access with typed_cache.h for block-based table. ## block_based_table_reader.cc Before this change, accessing the block cache here was an awkward mix of static polymorphism (template TBlocklike) and switch-case on a dynamic BlockType value. This change mostly unifies on static polymorphism, relying on minor hacks in block_cache.h to distinguish variants of Block. We still check BlockType in some places (especially for stats, which could be improved in follow-up work) but at least the BlockType is a static constant from the template parameter. (No more awkward partial redundancy between static and dynamic info.) This likely contributes to the overall performance improvement, but hasn't been tested in isolation. The other key source of simplification here is a more unified system of creating block cache objects: for directly populating from primary cache and for promotion from secondary cache. Both use BlockCreateContext, for context and for factory functions. ## block_based_table_builder.cc, cache_dump_load_impl.cc Before this change, warming caches was super ugly code. Both of these source files had switch statements to basically transition from the dynamic BlockType world to the static TBlocklike world. None of that mess is needed anymore as there's a new, untyped WarmInCache function that handles all the details just as promotion from SecondaryCache would. (Fixes `TODO akanksha: Dedup below code` in block_based_table_builder.cc.) ## Everything else Mostly just updating Cache users to use new typed APIs when reasonably possible, or changed Cache APIs when not. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10975 Test Plan: tests updated Performance test setup similar to https://github.com/facebook/rocksdb/issues/10626 (by cache size, LRUCache when not "hyper" for HyperClockCache): 34MB 1thread base.hyper -> kops/s: 0.745 io_bytes/op: 2.52504e+06 miss_ratio: 0.140906 max_rss_mb: 76.4844 34MB 1thread new.hyper -> kops/s: 0.751 io_bytes/op: 2.5123e+06 miss_ratio: 0.140161 max_rss_mb: 79.3594 34MB 1thread base -> kops/s: 0.254 io_bytes/op: 1.36073e+07 miss_ratio: 0.918818 max_rss_mb: 45.9297 34MB 1thread new -> kops/s: 0.252 io_bytes/op: 1.36157e+07 miss_ratio: 0.918999 max_rss_mb: 44.1523 34MB 32thread base.hyper -> kops/s: 7.272 io_bytes/op: 2.88323e+06 miss_ratio: 0.162532 max_rss_mb: 516.602 34MB 32thread new.hyper -> kops/s: 7.214 io_bytes/op: 2.99046e+06 miss_ratio: 0.168818 max_rss_mb: 518.293 34MB 32thread base -> kops/s: 3.528 io_bytes/op: 1.35722e+07 miss_ratio: 0.914691 max_rss_mb: 264.926 34MB 32thread new -> kops/s: 3.604 io_bytes/op: 1.35744e+07 miss_ratio: 0.915054 max_rss_mb: 264.488 233MB 1thread base.hyper -> kops/s: 53.909 io_bytes/op: 2552.35 miss_ratio: 0.0440566 max_rss_mb: 241.984 233MB 1thread new.hyper -> kops/s: 62.792 io_bytes/op: 2549.79 miss_ratio: 0.044043 max_rss_mb: 241.922 233MB 1thread base -> kops/s: 1.197 io_bytes/op: 2.75173e+06 miss_ratio: 0.103093 max_rss_mb: 241.559 233MB 1thread new -> kops/s: 1.199 io_bytes/op: 2.73723e+06 miss_ratio: 0.10305 max_rss_mb: 240.93 233MB 32thread base.hyper -> kops/s: 1298.69 io_bytes/op: 2539.12 miss_ratio: 0.0440307 max_rss_mb: 371.418 233MB 32thread new.hyper -> kops/s: 1421.35 io_bytes/op: 2538.75 miss_ratio: 0.0440307 max_rss_mb: 347.273 233MB 32thread base -> kops/s: 9.693 io_bytes/op: 2.77304e+06 miss_ratio: 0.103745 max_rss_mb: 569.691 233MB 32thread new -> kops/s: 9.75 io_bytes/op: 2.77559e+06 miss_ratio: 0.103798 max_rss_mb: 552.82 1597MB 1thread base.hyper -> kops/s: 58.607 io_bytes/op: 1449.14 miss_ratio: 0.0249324 max_rss_mb: 1583.55 1597MB 1thread new.hyper -> kops/s: 69.6 io_bytes/op: 1434.89 miss_ratio: 0.0247167 max_rss_mb: 1584.02 1597MB 1thread base -> kops/s: 60.478 io_bytes/op: 1421.28 miss_ratio: 0.024452 max_rss_mb: 1589.45 1597MB 1thread new -> kops/s: 63.973 io_bytes/op: 1416.07 miss_ratio: 0.0243766 max_rss_mb: 1589.24 1597MB 32thread base.hyper -> kops/s: 1436.2 io_bytes/op: 1357.93 miss_ratio: 0.0235353 max_rss_mb: 1692.92 1597MB 32thread new.hyper -> kops/s: 1605.03 io_bytes/op: 1358.04 miss_ratio: 0.023538 max_rss_mb: 1702.78 1597MB 32thread base -> kops/s: 280.059 io_bytes/op: 1350.34 miss_ratio: 0.023289 max_rss_mb: 1675.36 1597MB 32thread new -> kops/s: 283.125 io_bytes/op: 1351.05 miss_ratio: 0.0232797 max_rss_mb: 1703.83 Almost uniformly improving over base revision, especially for hot paths with HyperClockCache, up to 12% higher throughput seen (1597MB, 32thread, hyper). The improvement for that is likely coming from much simplified code for providing context for secondary cache promotion (CreateCallback/CreateContext), and possibly from less branching in block_based_table_reader. And likely a small improvement from not reconstituting key for DeleterFn. Reviewed By: anand1976 Differential Revision: D42417818 Pulled By: pdillinger fbshipit-source-id: f86bfdd584dce27c028b151ba56818ad14f7a432
2 years ago
if (handle != nullptr) {
file_meta->table_reader_handle = handle;
// Load table_reader
Major Cache refactoring, CPU efficiency improvement (#10975) Summary: This is several refactorings bundled into one to avoid having to incrementally re-modify uses of Cache several times. Overall, there are breaking changes to Cache class, and it becomes more of low-level interface for implementing caches, especially block cache. New internal APIs make using Cache cleaner than before, and more insulated from block cache evolution. Hopefully, this is the last really big block cache refactoring, because of rather effectively decoupling the implementations from the uses. This change also removes the EXPERIMENTAL designation on the SecondaryCache support in Cache. It seems reasonably mature at this point but still subject to change/evolution (as I warn in the API docs for Cache). The high-level motivation for this refactoring is to minimize code duplication / compounding complexity in adding SecondaryCache support to HyperClockCache (in a later PR). Other benefits listed below. * static_cast lines of code +29 -35 (net removed 6) * reinterpret_cast lines of code +6 -32 (net removed 26) ## cache.h and secondary_cache.h * Always use CacheItemHelper with entries instead of just a Deleter. There are several motivations / justifications: * Simpler for implementations to deal with just one Insert and one Lookup. * Simpler and more efficient implementation because we don't have to track which entries are using helpers and which are using deleters * Gets rid of hack to classify cache entries by their deleter. Instead, the CacheItemHelper includes a CacheEntryRole. This simplifies a lot of code (cache_entry_roles.h almost eliminated). Fixes https://github.com/facebook/rocksdb/issues/9428. * Makes it trivial to adjust SecondaryCache behavior based on kind of block (e.g. don't re-compress filter blocks). * It is arguably less convenient for many direct users of Cache, but direct users of Cache are now rare with introduction of typed_cache.h (below). * I considered and rejected an alternative approach in which we reduce customizability by assuming each secondary cache compatible value starts with a Slice referencing the uncompressed block contents (already true or mostly true), but we apparently intend to stack secondary caches. Saving an entry from a compressed secondary to a lower tier requires custom handling offered by SaveToCallback, etc. * Make CreateCallback part of the helper and introduce CreateContext to work with it (alternative to https://github.com/facebook/rocksdb/issues/10562). This cleans up the interface while still allowing context to be provided for loading/parsing values into primary cache. This model works for async lookup in BlockBasedTable reader (reader owns a CreateContext) under the assumption that it always waits on secondary cache operations to finish. (Otherwise, the CreateContext could be destroyed while async operation depending on it continues.) This likely contributes most to the observed performance improvement because it saves an std::function backed by a heap allocation. * Use char* for serialized data, e.g. in SaveToCallback, where void* was confusingly used. (We use `char*` for serialized byte data all over RocksDB, with many advantages over `void*`. `memcpy` etc. are legacy APIs that should not be mimicked.) * Add a type alias Cache::ObjectPtr = void*, so that we can better indicate the intent of the void* when it is to be the object associated with a Cache entry. Related: started (but did not complete) a refactoring to move away from "value" of a cache entry toward "object" or "obj". (It is confusing to call Cache a key-value store (like DB) when it is really storing arbitrary in-memory objects, not byte strings.) * Remove unnecessary key param from DeleterFn. This is good for efficiency in HyperClockCache, which does not directly store the cache key in memory. (Alternative to https://github.com/facebook/rocksdb/issues/10774) * Add allocator to Cache DeleterFn. This is a kind of future-proofing change in case we get more serious about using the Cache allocator for memory tracked by the Cache. Right now, only the uncompressed block contents are allocated using the allocator, and a pointer to that allocator is saved as part of the cached object so that the deleter can use it. (See CacheAllocationPtr.) If in the future we are able to "flatten out" our Cache objects some more, it would be good not to have to track the allocator as part of each object. * Removes legacy `ApplyToAllCacheEntries` and changes `ApplyToAllEntries` signature for Deleter->CacheItemHelper change. ## typed_cache.h Adds various "typed" interfaces to the Cache as internal APIs, so that most uses of Cache can use simple type safe code without casting and without explicit deleters, etc. Almost all of the non-test, non-glue code uses of Cache have been migrated. (Follow-up work: CompressedSecondaryCache deserves deeper attention to migrate.) This change expands RocksDB's internal usage of metaprogramming and SFINAE (https://en.cppreference.com/w/cpp/language/sfinae). The existing usages of Cache are divided up at a high level into these new interfaces. See updated existing uses of Cache for examples of how these are used. * PlaceholderCacheInterface - Used for making cache reservations, with entries that have a charge but no value. * BasicTypedCacheInterface<TValue> - Used for primary cache storage of objects of type TValue, which can be cleaned up with std::default_delete<TValue>. The role is provided by TValue::kCacheEntryRole or given in an optional template parameter. * FullTypedCacheInterface<TValue, TCreateContext> - Used for secondary cache compatible storage of objects of type TValue. In addition to BasicTypedCacheInterface constraints, we require TValue::ContentSlice() to return persistable data. This simplifies usage for the normal case of simple secondary cache compatibility (can give you a Slice to the data already in memory). In addition to TCreateContext performing the role of Cache::CreateContext, it is also expected to provide a factory function for creating TValue. * For each of these, there's a "Shared" version (e.g. FullTypedSharedCacheInterface) that holds a shared_ptr to the Cache, rather than assuming external ownership by holding only a raw `Cache*`. These interfaces introduce specific handle types for each interface instantiation, so that it's easy to see what kind of object is controlled by a handle. (Ultimately, this might not be worth the extra complexity, but it seems OK so far.) Note: I attempted to make the cache 'charge' automatically inferred from the cache object type, such as by expecting an ApproximateMemoryUsage() function, but this is not so clean because there are cases where we need to compute the charge ahead of time and don't want to re-compute it. ## block_cache.h This header is essentially the replacement for the old block_like_traits.h. It includes various things to support block cache access with typed_cache.h for block-based table. ## block_based_table_reader.cc Before this change, accessing the block cache here was an awkward mix of static polymorphism (template TBlocklike) and switch-case on a dynamic BlockType value. This change mostly unifies on static polymorphism, relying on minor hacks in block_cache.h to distinguish variants of Block. We still check BlockType in some places (especially for stats, which could be improved in follow-up work) but at least the BlockType is a static constant from the template parameter. (No more awkward partial redundancy between static and dynamic info.) This likely contributes to the overall performance improvement, but hasn't been tested in isolation. The other key source of simplification here is a more unified system of creating block cache objects: for directly populating from primary cache and for promotion from secondary cache. Both use BlockCreateContext, for context and for factory functions. ## block_based_table_builder.cc, cache_dump_load_impl.cc Before this change, warming caches was super ugly code. Both of these source files had switch statements to basically transition from the dynamic BlockType world to the static TBlocklike world. None of that mess is needed anymore as there's a new, untyped WarmInCache function that handles all the details just as promotion from SecondaryCache would. (Fixes `TODO akanksha: Dedup below code` in block_based_table_builder.cc.) ## Everything else Mostly just updating Cache users to use new typed APIs when reasonably possible, or changed Cache APIs when not. Pull Request resolved: https://github.com/facebook/rocksdb/pull/10975 Test Plan: tests updated Performance test setup similar to https://github.com/facebook/rocksdb/issues/10626 (by cache size, LRUCache when not "hyper" for HyperClockCache): 34MB 1thread base.hyper -> kops/s: 0.745 io_bytes/op: 2.52504e+06 miss_ratio: 0.140906 max_rss_mb: 76.4844 34MB 1thread new.hyper -> kops/s: 0.751 io_bytes/op: 2.5123e+06 miss_ratio: 0.140161 max_rss_mb: 79.3594 34MB 1thread base -> kops/s: 0.254 io_bytes/op: 1.36073e+07 miss_ratio: 0.918818 max_rss_mb: 45.9297 34MB 1thread new -> kops/s: 0.252 io_bytes/op: 1.36157e+07 miss_ratio: 0.918999 max_rss_mb: 44.1523 34MB 32thread base.hyper -> kops/s: 7.272 io_bytes/op: 2.88323e+06 miss_ratio: 0.162532 max_rss_mb: 516.602 34MB 32thread new.hyper -> kops/s: 7.214 io_bytes/op: 2.99046e+06 miss_ratio: 0.168818 max_rss_mb: 518.293 34MB 32thread base -> kops/s: 3.528 io_bytes/op: 1.35722e+07 miss_ratio: 0.914691 max_rss_mb: 264.926 34MB 32thread new -> kops/s: 3.604 io_bytes/op: 1.35744e+07 miss_ratio: 0.915054 max_rss_mb: 264.488 233MB 1thread base.hyper -> kops/s: 53.909 io_bytes/op: 2552.35 miss_ratio: 0.0440566 max_rss_mb: 241.984 233MB 1thread new.hyper -> kops/s: 62.792 io_bytes/op: 2549.79 miss_ratio: 0.044043 max_rss_mb: 241.922 233MB 1thread base -> kops/s: 1.197 io_bytes/op: 2.75173e+06 miss_ratio: 0.103093 max_rss_mb: 241.559 233MB 1thread new -> kops/s: 1.199 io_bytes/op: 2.73723e+06 miss_ratio: 0.10305 max_rss_mb: 240.93 233MB 32thread base.hyper -> kops/s: 1298.69 io_bytes/op: 2539.12 miss_ratio: 0.0440307 max_rss_mb: 371.418 233MB 32thread new.hyper -> kops/s: 1421.35 io_bytes/op: 2538.75 miss_ratio: 0.0440307 max_rss_mb: 347.273 233MB 32thread base -> kops/s: 9.693 io_bytes/op: 2.77304e+06 miss_ratio: 0.103745 max_rss_mb: 569.691 233MB 32thread new -> kops/s: 9.75 io_bytes/op: 2.77559e+06 miss_ratio: 0.103798 max_rss_mb: 552.82 1597MB 1thread base.hyper -> kops/s: 58.607 io_bytes/op: 1449.14 miss_ratio: 0.0249324 max_rss_mb: 1583.55 1597MB 1thread new.hyper -> kops/s: 69.6 io_bytes/op: 1434.89 miss_ratio: 0.0247167 max_rss_mb: 1584.02 1597MB 1thread base -> kops/s: 60.478 io_bytes/op: 1421.28 miss_ratio: 0.024452 max_rss_mb: 1589.45 1597MB 1thread new -> kops/s: 63.973 io_bytes/op: 1416.07 miss_ratio: 0.0243766 max_rss_mb: 1589.24 1597MB 32thread base.hyper -> kops/s: 1436.2 io_bytes/op: 1357.93 miss_ratio: 0.0235353 max_rss_mb: 1692.92 1597MB 32thread new.hyper -> kops/s: 1605.03 io_bytes/op: 1358.04 miss_ratio: 0.023538 max_rss_mb: 1702.78 1597MB 32thread base -> kops/s: 280.059 io_bytes/op: 1350.34 miss_ratio: 0.023289 max_rss_mb: 1675.36 1597MB 32thread new -> kops/s: 283.125 io_bytes/op: 1351.05 miss_ratio: 0.0232797 max_rss_mb: 1703.83 Almost uniformly improving over base revision, especially for hot paths with HyperClockCache, up to 12% higher throughput seen (1597MB, 32thread, hyper). The improvement for that is likely coming from much simplified code for providing context for secondary cache promotion (CreateCallback/CreateContext), and possibly from less branching in block_based_table_reader. And likely a small improvement from not reconstituting key for DeleterFn. Reviewed By: anand1976 Differential Revision: D42417818 Pulled By: pdillinger fbshipit-source-id: f86bfdd584dce27c028b151ba56818ad14f7a432
2 years ago
file_meta->fd.table_reader = table_cache_->get_cache().Value(handle);
}
}
});
std::vector<port::Thread> threads;
for (int i = 1; i < max_threads; i++) {
threads.emplace_back(load_handlers_func);
}
load_handlers_func();
for (auto& t : threads) {
t.join();
}
Status ret;
Support for single-primary, multi-secondary instances (#4899) Summary: This PR allows RocksDB to run in single-primary, multi-secondary process mode. The writer is a regular RocksDB (e.g. an `DBImpl`) instance playing the role of a primary. Multiple `DBImplSecondary` processes (secondaries) share the same set of SST files, MANIFEST, WAL files with the primary. Secondaries tail the MANIFEST of the primary and apply updates to their own in-memory state of the file system, e.g. `VersionStorageInfo`. This PR has several components: 1. (Originally in #4745). Add a `PathNotFound` subcode to `IOError` to denote the failure when a secondary tries to open a file which has been deleted by the primary. 2. (Similar to #4602). Add `FragmentBufferedReader` to handle partially-read, trailing record at the end of a log from where future read can continue. 3. (Originally in #4710 and #4820). Add implementation of the secondary, i.e. `DBImplSecondary`. 3.1 Tail the primary's MANIFEST during recovery. 3.2 Tail the primary's MANIFEST during normal processing by calling `ReadAndApply`. 3.3 Tailing WAL will be in a future PR. 4. Add an example in 'examples/multi_processes_example.cc' to demonstrate the usage of secondary RocksDB instance in a multi-process setting. Instructions to run the example can be found at the beginning of the source code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4899 Differential Revision: D14510945 Pulled By: riversand963 fbshipit-source-id: 4ac1c5693e6012ad23f7b4b42d3c374fecbe8886
6 years ago
for (const auto& s : statuses) {
if (!s.ok()) {
if (ret.ok()) {
ret = s;
}
Support for single-primary, multi-secondary instances (#4899) Summary: This PR allows RocksDB to run in single-primary, multi-secondary process mode. The writer is a regular RocksDB (e.g. an `DBImpl`) instance playing the role of a primary. Multiple `DBImplSecondary` processes (secondaries) share the same set of SST files, MANIFEST, WAL files with the primary. Secondaries tail the MANIFEST of the primary and apply updates to their own in-memory state of the file system, e.g. `VersionStorageInfo`. This PR has several components: 1. (Originally in #4745). Add a `PathNotFound` subcode to `IOError` to denote the failure when a secondary tries to open a file which has been deleted by the primary. 2. (Similar to #4602). Add `FragmentBufferedReader` to handle partially-read, trailing record at the end of a log from where future read can continue. 3. (Originally in #4710 and #4820). Add implementation of the secondary, i.e. `DBImplSecondary`. 3.1 Tail the primary's MANIFEST during recovery. 3.2 Tail the primary's MANIFEST during normal processing by calling `ReadAndApply`. 3.3 Tailing WAL will be in a future PR. 4. Add an example in 'examples/multi_processes_example.cc' to demonstrate the usage of secondary RocksDB instance in a multi-process setting. Instructions to run the example can be found at the beginning of the source code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4899 Differential Revision: D14510945 Pulled By: riversand963 fbshipit-source-id: 4ac1c5693e6012ad23f7b4b42d3c374fecbe8886
6 years ago
}
}
return ret;
}
};
Account memory of FileMetaData in global memory limit (#9924) Summary: **Context/Summary:** As revealed by heap profiling, allocation of `FileMetaData` for [newly created file added to a Version](https://github.com/facebook/rocksdb/pull/9924/files#diff-a6aa385940793f95a2c5b39cc670bd440c4547fa54fd44622f756382d5e47e43R774) can consume significant heap memory. This PR is to account that toward our global memory limit based on block cache capacity. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9924 Test Plan: - Previous `make check` verified there are only 2 places where the memory of the allocated `FileMetaData` can be released - New unit test `TEST_P(ChargeFileMetadataTestWithParam, Basic)` - db bench (CPU cost of `charge_file_metadata` in write and compact) - **write micros/op: -0.24%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 (remove this option for pre-PR) -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 | egrep 'fillseq'` - **compact micros/op -0.87%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 -numdistinct=1000 && ./db_bench -benchmarks=compact -db=$TEST_TMPDIR -use_existing_db=1 -charge_file_metadata=1 -disable_auto_compactions=1 | egrep 'compact'` table 1 - write #-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%) -- | -- | -- | -- | -- | -- 10 | 3.9711 | 0.264408 | 3.9914 | 0.254563 | 0.5111933721 20 | 3.83905 | 0.0664488 | 3.8251 | 0.0695456 | -0.3633711465 40 | 3.86625 | 0.136669 | 3.8867 | 0.143765 | 0.5289363078 80 | 3.87828 | 0.119007 | 3.86791 | 0.115674 | **-0.2673865734** 160 | 3.87677 | 0.162231 | 3.86739 | 0.16663 | **-0.2419539978** table 2 - compact #-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%) -- | -- | -- | -- | -- | -- 10 | 2,399,650.00 | 96,375.80 | 2,359,537.00 | 53,243.60 | -1.67 20 | 2,410,480.00 | 89,988.00 | 2,433,580.00 | 91,121.20 | 0.96 40 | 2.41E+06 | 121811 | 2.39E+06 | 131525 | **-0.96** 80 | 2.40E+06 | 134503 | 2.39E+06 | 108799 | **-0.78** - stress test: `python3 tools/db_crashtest.py blackbox --charge_file_metadata=1 --cache_size=1` killed as normal Reviewed By: ajkr Differential Revision: D36055583 Pulled By: hx235 fbshipit-source-id: b60eab94707103cb1322cf815f05810ef0232625
2 years ago
VersionBuilder::VersionBuilder(
const FileOptions& file_options, const ImmutableCFOptions* ioptions,
TableCache* table_cache, VersionStorageInfo* base_vstorage,
VersionSet* version_set,
std::shared_ptr<CacheReservationManager> file_metadata_cache_res_mgr)
: rep_(new Rep(file_options, ioptions, table_cache, base_vstorage,
Account memory of FileMetaData in global memory limit (#9924) Summary: **Context/Summary:** As revealed by heap profiling, allocation of `FileMetaData` for [newly created file added to a Version](https://github.com/facebook/rocksdb/pull/9924/files#diff-a6aa385940793f95a2c5b39cc670bd440c4547fa54fd44622f756382d5e47e43R774) can consume significant heap memory. This PR is to account that toward our global memory limit based on block cache capacity. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9924 Test Plan: - Previous `make check` verified there are only 2 places where the memory of the allocated `FileMetaData` can be released - New unit test `TEST_P(ChargeFileMetadataTestWithParam, Basic)` - db bench (CPU cost of `charge_file_metadata` in write and compact) - **write micros/op: -0.24%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 (remove this option for pre-PR) -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 | egrep 'fillseq'` - **compact micros/op -0.87%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 -numdistinct=1000 && ./db_bench -benchmarks=compact -db=$TEST_TMPDIR -use_existing_db=1 -charge_file_metadata=1 -disable_auto_compactions=1 | egrep 'compact'` table 1 - write #-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%) -- | -- | -- | -- | -- | -- 10 | 3.9711 | 0.264408 | 3.9914 | 0.254563 | 0.5111933721 20 | 3.83905 | 0.0664488 | 3.8251 | 0.0695456 | -0.3633711465 40 | 3.86625 | 0.136669 | 3.8867 | 0.143765 | 0.5289363078 80 | 3.87828 | 0.119007 | 3.86791 | 0.115674 | **-0.2673865734** 160 | 3.87677 | 0.162231 | 3.86739 | 0.16663 | **-0.2419539978** table 2 - compact #-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%) -- | -- | -- | -- | -- | -- 10 | 2,399,650.00 | 96,375.80 | 2,359,537.00 | 53,243.60 | -1.67 20 | 2,410,480.00 | 89,988.00 | 2,433,580.00 | 91,121.20 | 0.96 40 | 2.41E+06 | 121811 | 2.39E+06 | 131525 | **-0.96** 80 | 2.40E+06 | 134503 | 2.39E+06 | 108799 | **-0.78** - stress test: `python3 tools/db_crashtest.py blackbox --charge_file_metadata=1 --cache_size=1` killed as normal Reviewed By: ajkr Differential Revision: D36055583 Pulled By: hx235 fbshipit-source-id: b60eab94707103cb1322cf815f05810ef0232625
2 years ago
version_set, file_metadata_cache_res_mgr)) {}
VersionBuilder::~VersionBuilder() = default;
bool VersionBuilder::CheckConsistencyForNumLevels() {
return rep_->CheckConsistencyForNumLevels();
}
Status VersionBuilder::Apply(const VersionEdit* edit) {
return rep_->Apply(edit);
}
Status VersionBuilder::SaveTo(VersionStorageInfo* vstorage) const {
return rep_->SaveTo(vstorage);
}
Support for single-primary, multi-secondary instances (#4899) Summary: This PR allows RocksDB to run in single-primary, multi-secondary process mode. The writer is a regular RocksDB (e.g. an `DBImpl`) instance playing the role of a primary. Multiple `DBImplSecondary` processes (secondaries) share the same set of SST files, MANIFEST, WAL files with the primary. Secondaries tail the MANIFEST of the primary and apply updates to their own in-memory state of the file system, e.g. `VersionStorageInfo`. This PR has several components: 1. (Originally in #4745). Add a `PathNotFound` subcode to `IOError` to denote the failure when a secondary tries to open a file which has been deleted by the primary. 2. (Similar to #4602). Add `FragmentBufferedReader` to handle partially-read, trailing record at the end of a log from where future read can continue. 3. (Originally in #4710 and #4820). Add implementation of the secondary, i.e. `DBImplSecondary`. 3.1 Tail the primary's MANIFEST during recovery. 3.2 Tail the primary's MANIFEST during normal processing by calling `ReadAndApply`. 3.3 Tailing WAL will be in a future PR. 4. Add an example in 'examples/multi_processes_example.cc' to demonstrate the usage of secondary RocksDB instance in a multi-process setting. Instructions to run the example can be found at the beginning of the source code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4899 Differential Revision: D14510945 Pulled By: riversand963 fbshipit-source-id: 4ac1c5693e6012ad23f7b4b42d3c374fecbe8886
6 years ago
Status VersionBuilder::LoadTableHandlers(
InternalStats* internal_stats, int max_threads,
bool prefetch_index_and_filter_in_cache, bool is_initial_load,
Fast path for detecting unchanged prefix_extractor (#9407) Summary: Fixes a major performance regression in 6.26, where extra CPU is spent in SliceTransform::AsString when reads involve a prefix_extractor (Get, MultiGet, Seek). Common case performance is now better than 6.25. This change creates a "fast path" for verifying that the current prefix extractor is unchanged and compatible with what was used to generate a table file. This fast path detects the common case by pointer comparison on the current prefix_extractor and a "known good" prefix extractor (if applicable) that is saved at the time the table reader is opened. The "known good" prefix extractor is saved as another shared_ptr copy (in an existing field, however) to ensure the pointer is not recycled. When the prefix_extractor has changed to a different instance but same compatible configuration (rare, odd), performance is still a regression compared to 6.25, but this is likely acceptable because of the oddity of such a case. The performance of incompatible prefix_extractor is essentially unchanged. Also fixed a minor case (ForwardIterator) where a prefix_extractor could be used via a raw pointer after being freed as a shared_ptr, if replaced via SetOptions. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9407 Test Plan: ## Performance Populate DB with `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -benchmarks=fillrandom -num=10000000 -disable_wal=1 -write_buffer_size=10000000 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12` Running head-to-head comparisons simultaneously with `TEST_TMPDIR=/dev/shm/rocksdb ./db_bench -use_existing_db -readonly -benchmarks=seekrandom -num=10000000 -duration=20 -disable_wal=1 -bloom_bits=16 -compaction_style=2 -fifo_compaction_max_table_files_size_mb=10000 -fifo_compaction_allow_compaction=0 -prefix_size=12` Below each is compared by ops/sec vs. baseline which is version 6.25 (multiple baseline runs because of variable machine load) v6.26: 4833 vs. 6698 (<- major regression!) v6.27: 4737 vs. 6397 (still) New: 6704 vs. 6461 (better than baseline in common case) Disabled fastpath: 4843 vs. 6389 (e.g. if prefix extractor instance changes but is still compatible) Changed prefix size (no usable filter) in new: 787 vs. 5927 Changed prefix size (no usable filter) in new & baseline: 773 vs. 784 Reviewed By: mrambacher Differential Revision: D33677812 Pulled By: pdillinger fbshipit-source-id: 571d9711c461fb97f957378a061b7e7dbc4d6a76
3 years ago
const std::shared_ptr<const SliceTransform>& prefix_extractor,
Block per key-value checksum (#11287) Summary: add option `block_protection_bytes_per_key` and implementation for block per key-value checksum. The main changes are 1. checksum construction and verification in block.cc/h 2. pass the option `block_protection_bytes_per_key` around (mainly for methods defined in table_cache.h) 3. unit tests/crash test updates Tests: * Added unit tests * Crash test: `python3 tools/db_crashtest.py blackbox --simple --block_protection_bytes_per_key=1 --write_buffer_size=1048576` Follow up (maybe as a separate PR): make sure corruption status returned from BlockIters are correctly handled. Performance: Turning on block per KV protection has a non-trivial negative impact on read performance and costs additional memory. For memory, each block includes additional 24 bytes for checksum-related states beside checksum itself. For CPU, I set up a DB of size ~1.2GB with 5M keys (32 bytes key and 200 bytes value) which compacts to ~5 SST files (target file size 256 MB) in L6 without compression. I tested readrandom performance with various block cache size (to mimic various cache hit rates): ``` SETUP make OPTIMIZE_LEVEL="-O3" USE_LTO=1 DEBUG_LEVEL=0 -j32 db_bench ./db_bench -benchmarks=fillseq,compact0,waitforcompaction,compact,waitforcompaction -write_buffer_size=33554432 -level_compaction_dynamic_level_bytes=true -max_background_jobs=8 -target_file_size_base=268435456 --num=5000000 --key_size=32 --value_size=200 --compression_type=none BENCHMARK ./db_bench --use_existing_db -benchmarks=readtocache,readrandom[-X10] --num=5000000 --key_size=32 --disable_auto_compactions --reads=1000000 --block_protection_bytes_per_key=[0|1] --cache_size=$CACHESIZE The readrandom ops/sec looks like the following: Block cache size: 2GB 1.2GB * 0.9 1.2GB * 0.8 1.2GB * 0.5 8MB Main 240805 223604 198176 161653 139040 PR prot_bytes=0 238691 226693 200127 161082 141153 PR prot_bytes=1 214983 193199 178532 137013 108211 prot_bytes=1 vs -10% -15% -10.8% -15% -23% prot_bytes=0 ``` The benchmark has a lot of variance, but there was a 5% to 25% regression in this benchmark with different cache hit rates. Pull Request resolved: https://github.com/facebook/rocksdb/pull/11287 Reviewed By: ajkr Differential Revision: D43970708 Pulled By: cbi42 fbshipit-source-id: ef98d898b71779846fa74212b9ec9e08b7183940
2 years ago
size_t max_file_size_for_l0_meta_pin, const ReadOptions& read_options,
uint8_t block_protection_bytes_per_key) {
return rep_->LoadTableHandlers(
internal_stats, max_threads, prefetch_index_and_filter_in_cache,
is_initial_load, prefix_extractor, max_file_size_for_l0_meta_pin,
read_options, block_protection_bytes_per_key);
}
uint64_t VersionBuilder::GetMinOldestBlobFileNumber() const {
return rep_->GetMinOldestBlobFileNumber();
}
BaseReferencedVersionBuilder::BaseReferencedVersionBuilder(
ColumnFamilyData* cfd)
: version_builder_(new VersionBuilder(
cfd->current()->version_set()->file_options(), cfd->ioptions(),
cfd->table_cache(), cfd->current()->storage_info(),
Account memory of FileMetaData in global memory limit (#9924) Summary: **Context/Summary:** As revealed by heap profiling, allocation of `FileMetaData` for [newly created file added to a Version](https://github.com/facebook/rocksdb/pull/9924/files#diff-a6aa385940793f95a2c5b39cc670bd440c4547fa54fd44622f756382d5e47e43R774) can consume significant heap memory. This PR is to account that toward our global memory limit based on block cache capacity. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9924 Test Plan: - Previous `make check` verified there are only 2 places where the memory of the allocated `FileMetaData` can be released - New unit test `TEST_P(ChargeFileMetadataTestWithParam, Basic)` - db bench (CPU cost of `charge_file_metadata` in write and compact) - **write micros/op: -0.24%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 (remove this option for pre-PR) -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 | egrep 'fillseq'` - **compact micros/op -0.87%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 -numdistinct=1000 && ./db_bench -benchmarks=compact -db=$TEST_TMPDIR -use_existing_db=1 -charge_file_metadata=1 -disable_auto_compactions=1 | egrep 'compact'` table 1 - write #-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%) -- | -- | -- | -- | -- | -- 10 | 3.9711 | 0.264408 | 3.9914 | 0.254563 | 0.5111933721 20 | 3.83905 | 0.0664488 | 3.8251 | 0.0695456 | -0.3633711465 40 | 3.86625 | 0.136669 | 3.8867 | 0.143765 | 0.5289363078 80 | 3.87828 | 0.119007 | 3.86791 | 0.115674 | **-0.2673865734** 160 | 3.87677 | 0.162231 | 3.86739 | 0.16663 | **-0.2419539978** table 2 - compact #-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%) -- | -- | -- | -- | -- | -- 10 | 2,399,650.00 | 96,375.80 | 2,359,537.00 | 53,243.60 | -1.67 20 | 2,410,480.00 | 89,988.00 | 2,433,580.00 | 91,121.20 | 0.96 40 | 2.41E+06 | 121811 | 2.39E+06 | 131525 | **-0.96** 80 | 2.40E+06 | 134503 | 2.39E+06 | 108799 | **-0.78** - stress test: `python3 tools/db_crashtest.py blackbox --charge_file_metadata=1 --cache_size=1` killed as normal Reviewed By: ajkr Differential Revision: D36055583 Pulled By: hx235 fbshipit-source-id: b60eab94707103cb1322cf815f05810ef0232625
2 years ago
cfd->current()->version_set(),
cfd->GetFileMetadataCacheReservationManager())),
version_(cfd->current()) {
version_->Ref();
}
BaseReferencedVersionBuilder::BaseReferencedVersionBuilder(
ColumnFamilyData* cfd, Version* v)
: version_builder_(new VersionBuilder(
cfd->current()->version_set()->file_options(), cfd->ioptions(),
Account memory of FileMetaData in global memory limit (#9924) Summary: **Context/Summary:** As revealed by heap profiling, allocation of `FileMetaData` for [newly created file added to a Version](https://github.com/facebook/rocksdb/pull/9924/files#diff-a6aa385940793f95a2c5b39cc670bd440c4547fa54fd44622f756382d5e47e43R774) can consume significant heap memory. This PR is to account that toward our global memory limit based on block cache capacity. Pull Request resolved: https://github.com/facebook/rocksdb/pull/9924 Test Plan: - Previous `make check` verified there are only 2 places where the memory of the allocated `FileMetaData` can be released - New unit test `TEST_P(ChargeFileMetadataTestWithParam, Basic)` - db bench (CPU cost of `charge_file_metadata` in write and compact) - **write micros/op: -0.24%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 (remove this option for pre-PR) -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 | egrep 'fillseq'` - **compact micros/op -0.87%** : `TEST_TMPDIR=/dev/shm/testdb ./db_bench -benchmarks=fillseq -db=$TEST_TMPDIR -charge_file_metadata=1 -disable_auto_compactions=1 -write_buffer_size=100000 -num=4000000 -numdistinct=1000 && ./db_bench -benchmarks=compact -db=$TEST_TMPDIR -use_existing_db=1 -charge_file_metadata=1 -disable_auto_compactions=1 | egrep 'compact'` table 1 - write #-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%) -- | -- | -- | -- | -- | -- 10 | 3.9711 | 0.264408 | 3.9914 | 0.254563 | 0.5111933721 20 | 3.83905 | 0.0664488 | 3.8251 | 0.0695456 | -0.3633711465 40 | 3.86625 | 0.136669 | 3.8867 | 0.143765 | 0.5289363078 80 | 3.87828 | 0.119007 | 3.86791 | 0.115674 | **-0.2673865734** 160 | 3.87677 | 0.162231 | 3.86739 | 0.16663 | **-0.2419539978** table 2 - compact #-run | (pre-PR) avg micros/op | std micros/op | (post-PR) micros/op | std micros/op | change (%) -- | -- | -- | -- | -- | -- 10 | 2,399,650.00 | 96,375.80 | 2,359,537.00 | 53,243.60 | -1.67 20 | 2,410,480.00 | 89,988.00 | 2,433,580.00 | 91,121.20 | 0.96 40 | 2.41E+06 | 121811 | 2.39E+06 | 131525 | **-0.96** 80 | 2.40E+06 | 134503 | 2.39E+06 | 108799 | **-0.78** - stress test: `python3 tools/db_crashtest.py blackbox --charge_file_metadata=1 --cache_size=1` killed as normal Reviewed By: ajkr Differential Revision: D36055583 Pulled By: hx235 fbshipit-source-id: b60eab94707103cb1322cf815f05810ef0232625
2 years ago
cfd->table_cache(), v->storage_info(), v->version_set(),
cfd->GetFileMetadataCacheReservationManager())),
version_(v) {
assert(version_ != cfd->current());
}
BaseReferencedVersionBuilder::~BaseReferencedVersionBuilder() {
version_->Unref();
}
} // namespace ROCKSDB_NAMESPACE