// Copyright (c) 2011-present, Facebook, Inc. All rights reserved.
// This source code is licensed under both the GPLv2 (found in the
// COPYING file in the root directory) and Apache 2.0 License
// (found in the LICENSE.Apache file in the root directory).

#include "db/write_controller.h"

#include <algorithm>
#include <atomic>
#include <cassert>
#include <ratio>

#include "rocksdb/system_clock.h"

namespace ROCKSDB_NAMESPACE {

std::unique_ptr<WriteControllerToken> WriteController::GetStopToken() {
  ++total_stopped_;
  return std::unique_ptr<WriteControllerToken>(new StopWriteToken(this));
}

std::unique_ptr<WriteControllerToken> WriteController::GetDelayToken(
    uint64_t write_rate) {
  if (0 == total_delayed_++) {
    // Starting delay, so reset counters.
    next_refill_time_ = 0;
    credit_in_bytes_ = 0;
  }
  // NOTE: for simplicity, any current credit_in_bytes_ or "debt" in
  // next_refill_time_ will be based on an old rate. This rate will apply
  // for subsequent additional debts and for the next refill.
  set_delayed_write_rate(write_rate);
  return std::unique_ptr<WriteControllerToken>(new DelayWriteToken(this));
}
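
// Illustrative usage (not part of this file): a caller typically holds the
// returned token for as long as the corresponding stall condition persists,
// for example:
//
//   std::unique_ptr<WriteControllerToken> token =
//       controller.GetDelayToken(16 << 20);  // illustrative 16 MB/s rate
//   ...  // writers consult GetDelay() and sleep for the micros it returns
//   token.reset();  // condition cleared; the delay accounting is undone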

std::unique_ptr<WriteControllerToken>
WriteController::GetCompactionPressureToken() {
  ++total_compaction_pressure_;
  return std::unique_ptr<WriteControllerToken>(
      new CompactionPressureToken(this));
}

bool WriteController::IsStopped() const {
  return total_stopped_.load(std::memory_order_relaxed) > 0;
}

// This is called while holding the DB mutex, so we can't sleep and need to
// minimize how often the clock is read.
// If it turns out to be a performance issue, we can redesign the thread
// synchronization model here.
// The function trusts that the caller will sleep for the number of
// microseconds it returns.
uint64_t WriteController::GetDelay(SystemClock* clock, uint64_t num_bytes) {
  if (total_stopped_.load(std::memory_order_relaxed) > 0) {
    return 0;
  }
  if (total_delayed_.load(std::memory_order_relaxed) == 0) {
    return 0;
  }

  if (credit_in_bytes_ >= num_bytes) {
    credit_in_bytes_ -= num_bytes;
    return 0;
  }
  // The clock is read inside the DB mutex less than once per refill interval.
  auto time_now = NowMicrosMonotonic(clock);

  const uint64_t kMicrosPerSecond = 1000000;
  // Refill every 1 ms
  const uint64_t kMicrosPerRefill = 1000;
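
  // Credit accounting sketch (derived from the code below): credit_in_bytes_
  // behaves like a token bucket refilled at delayed_write_rate_ bytes per
  // second in kMicrosPerRefill-sized slices; writes consume credit, and a
  // write that exceeds the available credit is told how long to sleep.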

  if (next_refill_time_ == 0) {
    // Start with an initial allotment of bytes for one interval
    next_refill_time_ = time_now;
  }
  if (next_refill_time_ <= time_now) {
    // Refill based on time interval plus any extra elapsed
    uint64_t elapsed = time_now - next_refill_time_ + kMicrosPerRefill;
    credit_in_bytes_ += static_cast<uint64_t>(
        1.0 * elapsed / kMicrosPerSecond * delayed_write_rate_ + 0.999999);
    next_refill_time_ = time_now + kMicrosPerRefill;
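
    // Example with illustrative numbers: at delayed_write_rate_ = 16 MB/s
    // (16777216 bytes/s) and elapsed = 1000 us, the refill adds
    // ceil(16777216 * 1000 / 1000000) = 16778 bytes of credit.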

    if (credit_in_bytes_ >= num_bytes) {
      // Avoid delay if possible, to reduce DB mutex release & re-acquire.
      credit_in_bytes_ -= num_bytes;
      return 0;
    }
  }

  // We need to delay to avoid exceeding write rate.
  assert(num_bytes > credit_in_bytes_);
  uint64_t bytes_over_budget = num_bytes - credit_in_bytes_;
  uint64_t needed_delay = static_cast<uint64_t>(
      1.0 * bytes_over_budget / delayed_write_rate_ * kMicrosPerSecond);
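
  // Example with illustrative numbers: bytes_over_budget = 1048576 bytes
  // (1 MB) at delayed_write_rate_ = 16777216 bytes/s (16 MB/s) yields
  // needed_delay = 1048576 / 16777216 * 1000000 = 62500 microseconds.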

  credit_in_bytes_ = 0;
  next_refill_time_ += needed_delay;

  // Delay for at least one refill interval, to reduce DB mutex contention.
  return std::max(next_refill_time_ - time_now, kMicrosPerRefill);
}

uint64_t WriteController::NowMicrosMonotonic(SystemClock* clock) {
  // Convert nanoseconds to microseconds (std::milli::den == 1000).
  return clock->NowNanos() / std::milli::den;
}

StopWriteToken::~StopWriteToken() {
  assert(controller_->total_stopped_ >= 1);
  --controller_->total_stopped_;
}

DelayWriteToken::~DelayWriteToken() {
  controller_->total_delayed_--;
  assert(controller_->total_delayed_.load() >= 0);
}

CompactionPressureToken::~CompactionPressureToken() {
  controller_->total_compaction_pressure_--;
  assert(controller_->total_compaction_pressure_ >= 0);
}

}  // namespace ROCKSDB_NAMESPACE