|
|
|
of point reads of small values may wish to switch to a smaller block
|
|
|
|
size if performance measurements indicate an improvement. There isn't |
|
|
|
|
much benefit in using blocks smaller than one kilobyte, or larger than |
|
|
|
|
a few megabytes. Also note that compression will be more effective |
|
|
|
|
with larger block sizes. To change the block size parameter, use
|
|
|
|
<code>Options::block_size</code>. |
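<p>
For example, a minimal sketch (the 4KB value is an illustrative assumption, not a recommendation):
<p>
<pre>
#include "rocksdb/options.h"

rocksdb::Options options;
options.block_size = 4 * 1024;  // illustrative value; measure before changing it
</pre>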
|
|
|
|
<p> |
|
|
|
|
<h2>Write buffer</h2> |
|
|
|
|
<p> |
|
|
|
filesystem and each file stores a sequence of compressed blocks. If
|
|
|
|
used uncompressed block contents. If <code>options.block_cache_compressed</code> |
|
|
|
|
is non-NULL, it is used to cache frequently used compressed blocks. Compressed |
|
|
|
|
cache is an alternative to the OS cache, which also caches compressed blocks. If
|
|
|
|
compressed cache is used, the OS cache will be disabled automatically by setting |
|
|
|
|
<code>options.allow_os_buffer</code> to false. |
|
|
|
|
<p> |
|
|
|
|
<pre> |
|
|
|
Here we give an overview of the options that impact the behavior of compactions:
|
|
|
|
<ul> |
|
|
|
|
<p> |
|
|
|
|
<li><code>Options::compaction_style</code> - RocksDB currently supports two |
|
|
|
|
compaction algorithms - Universal style and Level style. This option switches |
|
|
|
|
between the two. Can be kCompactionStyleUniversal or kCompactionStyleLevel. |
|
|
|
|
If this is kCompactionStyleUniversal, then you can configure universal style |
|
|
|
|
parameters with <code>Options::compaction_options_universal</code> (see the sketch just after this list).
|
|
|
<li> <code>Options::compaction_filter</code> - Allows an application to modify/delete a
key-value during background compaction.
|
|
|
|
</ul> |
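<p>
A minimal sketch of switching the compaction style (choosing universal style here is purely illustrative):
<p>
<pre>
#include "rocksdb/options.h"

rocksdb::Options options;
options.compaction_style = rocksdb::kCompactionStyleUniversal;
// Universal style parameters can then be tuned through
// options.compaction_options_universal, covered in a later section.
</pre>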
|
|
|
|
<p> |
|
|
|
|
Other options that impact the performance of compactions and when they get triggered are
(a short configuration sketch follows this list):
|
|
|
|
<ul> |
|
|
|
|
<p> |
|
|
|
|
<li> <code>Options::access_hint_on_compaction_start</code> - Specify the file access |
|
|
|
|
pattern once a compaction is started. It will be applied to all input files of a compaction. Default: NORMAL |
|
|
|
|
<p> |
|
|
|
|
<li> <code>Options::level0_file_num_compaction_trigger</code> - Number of files to trigger level-0 compaction. |
|
|
|
|
A negative value means that level-0 compaction will not be triggered by the number of files at all.
|
|
|
|
<p> |
|
|
|
|
<li> <code>Options::max_mem_compaction_level</code> - Maximum level to which a new compacted memtable is pushed if it |
|
|
|
|
does not create overlap. We try to push to level 2 to avoid the relatively expensive level 0=>1 compactions and to avoid some |
|
|
|
|
expensive manifest file operations. We do not push all the way to the largest level since that can generate a lot of wasted disk |
|
|
|
|
space if the same key space is being repeatedly overwritten. |
|
|
|
|
<p> |
|
|
|
|
<li> <code>Options::target_file_size_base</code> and <code>Options::target_file_size_multiplier</code> - |
|
|
|
|
Target file size for compaction. target_file_size_base is per-file size for level-1. |
|
|
|
|
Target file size for level L can be calculated by target_file_size_base * (target_file_size_multiplier ^ (L-1)).
For example, if target_file_size_base is 2MB and target_file_size_multiplier is 10, then each file on level-1 will
be 2MB, each file on level-2 will be 20MB, and each file on level-3 will be 200MB. The default target_file_size_base is 2MB
and the default target_file_size_multiplier is 1.
|
|
|
|
<p> |
|
|
|
|
<li> <code>Options::expanded_compaction_factor</code> - Maximum number of bytes in all compacted files. We avoid expanding |
|
|
|
|
the lower level file set of a compaction if it would make the total compaction cover more than |
|
|
|
|
(expanded_compaction_factor * targetFileSizeLevel()) many bytes. |
|
|
|
|
<p> |
|
|
|
|
<li> <code>Options::source_compaction_factor</code> - Maximum number of bytes in all source files to be compacted in a |
|
|
|
|
single compaction run. We avoid picking too many files in the source level so that the total source bytes
of the compaction do not exceed (source_compaction_factor * targetFileSizeLevel()) many bytes.
Default: 1, i.e. pick maxfilesize amount of data as the source of a compaction.
|
|
|
|
<p> |
|
|
|
|
<li> <code>Options::max_grandparent_overlap_factor</code> - Control maximum bytes of overlaps in grandparent (i.e., level+2) before we |
|
|
|
|
stop building a single file in a level->level+1 compaction. |
|
|
|
|
<p> |
|
|
|
|
<li> <code>Options::disable_seek_compaction</code> - Disable compaction triggered by seek. |
|
|
|
|
With a bloom filter and fast storage, a miss on one level is very cheap if the file handle is cached in the table cache
|
|
|
|
(which is true if max_open_files is large). |
|
|
|
|
<p> |
|
|
|
|
<li> <code>Options::max_background_compactions</code> - Maximum number of concurrent background jobs, submitted to |
|
|
|
|
the default LOW priority thread pool.
|
|
|
|
</ul> |
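<p>
As a sketch, a few of these options might be set together as follows; every value below is an illustrative
assumption (several echo the defaults quoted above), not tuning advice:
<p>
<pre>
#include "rocksdb/options.h"

rocksdb::Options options;
options.level0_file_num_compaction_trigger = 4;  // assumed example value
options.max_mem_compaction_level = 2;            // push new memtables up to level 2 when possible
options.target_file_size_base = 2 * 1048576;     // 2MB, the default quoted above
options.target_file_size_multiplier = 1;         // default quoted above
options.source_compaction_factor = 1;            // default quoted above
options.max_background_compactions = 2;          // assumed example concurrency
</pre>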
|
|
|
|
|
|
|
|
|
<p> |
|
|
|
|
You can learn more about all of these options in <code>rocksdb/options.h</code>.
|
|
|
|
|
|
|
|
|
<h2>Universal style compaction specific settings</h2>
|
|
|
|
<p> |
|
|
|
|
If you're using Universal style compaction, there is an object <code>CompactionOptionsUniversal</code> |
|
|
|
|
that holds all the different options for that compaction style. The exact definition is in
|
|
|
|
<code>rocksdb/universal_compaction.h</code> and you can set it in <code>Options::compaction_options_universal</code>. |
|
|
|
|
Here we give a short overview of the options in <code>CompactionOptionsUniversal</code> (a configuration sketch follows this list):
|
|
|
|
<ul> |
|
|
|
|
<p> |
|
|
|
|
<li> <code>CompactionOptionsUniversal::size_ratio</code> - Percentage flexibility while comparing file size. If the candidate file(s)
size is 1% smaller than the next file's size, then include the next file into
|
|
|
|
this candidate set. Default: 1 |
|
|
|
|
<p> |
|
|
|
|
<li> <code>CompactionOptionsUniversal::min_merge_width</code> - The minimum number of files in a single compaction run. Default: 2 |
|
|
|
|
<p> |
|
|
|
|
<li> <code>CompactionOptionsUniversal::max_merge_width</code> - The maximum number of files in a single compaction run. Default: UINT_MAX |
|
|
|
|
<p> |
|
|
|
|
<li> <code>CompactionOptionsUniversal::max_size_amplification_percent</code> - The size amplification is defined as the amount (in percentage) of |
|
|
|
|
additional storage needed to store a single byte of data in the database. For example, a size amplification of 2% means that a database that |
|
|
|
|
contains 100 bytes of user data may occupy up to 102 bytes of physical storage. By this definition, a fully compacted database has
a size amplification of 0%. RocksDB uses the following heuristic to calculate size amplification: it assumes that all files excluding
the earliest file contribute to the size amplification. Default: 200, which means that a 100 byte database could require up to
|
|
|
|
300 bytes of storage. |
|
|
|
|
<p> |
|
|
|
|
<li> <code>CompactionOptionsUniversal::compression_size_percent</code> - If this option is set to be -1 (the default value), all the output files |
|
|
|
|
will follow the compression type specified. If this option is not negative, we will try to make sure the compressed
|
|
|
|
size is just above this value. In normal cases, at least this percentage |
|
|
|
|
of data will be compressed. |
|
|
|
|
When we are compacting to a new file, here is the criterion for whether
it needs to be compressed: assume the list of files sorted
by generation time is [ A1...An B1...Bm C1...Ct ],
where A1 is the newest and Ct is the oldest, and we are going to compact
B1...Bm. We calculate the total size of all the files as total_size and
the total size of C1...Ct as total_C; the compaction output file
will be compressed iff total_C / total_size < this percentage.
|
|
|
|
<p> |
|
|
|
|
<li> <code>CompactionOptionsUniversal::stop_style</code> - The algorithm used to stop picking files into a single compaction run. |
|
|
|
|
Can be kCompactionStopStyleSimilarSize (pick files of similar size) or kCompactionStopStyleTotalSize (total size of picked files > next file). |
|
|
|
|
Default: kCompactionStopStyleTotalSize |
|
|
|
|
</ul> |
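<p>
As a sketch, these fields might be set together as follows; the values shown echo the documented defaults
above, so treat them as illustrations rather than tuning advice:
<p>
<pre>
#include <climits>
#include "rocksdb/options.h"
#include "rocksdb/universal_compaction.h"

rocksdb::Options options;
options.compaction_style = rocksdb::kCompactionStyleUniversal;
rocksdb::CompactionOptionsUniversal& u = options.compaction_options_universal;
u.size_ratio = 1;                         // default: 1% size flexibility
u.min_merge_width = 2;                    // default
u.max_merge_width = UINT_MAX;             // default
u.max_size_amplification_percent = 200;   // default: 100 bytes of data may use up to 300 bytes on disk
u.compression_size_percent = -1;          // default: compress every output file
u.stop_style = rocksdb::kCompactionStopStyleTotalSize;  // default
</pre>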
|
|
|
|
|
|
|
|
|
<h1>Thread pools</h1> |
|
|
|
|
<p> |
|
|
|
|
A thread pool is associated with an Env environment object. The client has to create a thread pool by setting the number of background
threads using the method <code>Env::SetBackgroundThreads()</code> defined in <code>rocksdb/env.h</code>.
|
|
|
|
We use the thread pool for compactions and memtable flushes. |
|
|
|
|
Since memtable flushes are in the critical code path (a stalled memtable flush can stall writes, increasing p99 latency), we suggest
having two thread pools, with priorities HIGH and LOW. Memtable flushes can be set up to be scheduled on the HIGH thread pool.
|
|
|
|
There are two options available for configuration of background compactions and flushes: |
|
|
|
|
<ul> |
|
|
|
|
<p> |
|
|
|
|
<li> <code>Options::max_background_compactions</code> - Maximum number of concurrent background jobs, |
|
|
|
|
submitted to the default LOW priority thread pool.
|
|
|
|
<p> |
|
|
|
|
<li> <code>Options::max_background_flushes</code> - Maximum number of concurrent background memtable flush jobs, submitted to |
|
|
|
|
the HIGH priority thread pool. By default, all background jobs (major compaction and memtable flush) go |
|
|
|
|
to the LOW priority pool. If this option is set to a positive number, memtable flush jobs will be submitted to the HIGH priority pool. |
|
|
|
|
This is important when the same Env is shared by multiple db instances. Without a separate pool, long-running major compaction jobs could
|
|
|
|
potentially block memtable flush jobs of other db instances, leading to unnecessary Put stalls. |
|
|
|
|
</ul> |
|
|
|
|
<p> |
|
|
|
|
<pre> |
|
|
|
|
#include "rocksdb/env.h" |
|
|
|
|
#include "rocksdb/db.h" |
|
|
|
|
|
|
|
|
|
auto env = rocksdb::Env::Default(); |
|
|
|
|
env->SetBackgroundThreads(2, rocksdb::Env::LOW); |
|
|
|
|
env->SetBackgroundThreads(1, rocksdb::Env::HIGH); |
|
|
|
|
rocksdb::DB* db; |
|
|
|
|
rocksdb::Options options; |
|
|
|
|
options.env = env; |
|
|
|
|
options.max_background_compactions = 2; |
|
|
|
|
options.max_background_flushes = 1; |
|
|
|
|
rocksdb::Status status = rocksdb::DB::Open(options, "/tmp/testdb", &db); |
|
|
|
|
assert(status.ok()); |
|
|
|
|
... |
|
|
|
|
</pre> |
|
|
|
|
<h1>Approximate Sizes</h1> |
|
|
|
|
<p> |
|
|
|
|
The <code>GetApproximateSizes</code> method can be used to get the approximate
|
|
|
|