Summary:
User gets undefinied error since the definition is not exposed.
Also re-enable the db test with only upper bound check
Test Plan: db_test, rate_limit_test
Reviewers: igor, yhchiang, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D20403
Summary:
Add a function GetOptions(), where based on four parameters users give: read/write amplification threshold, memory budget for mem tables and target DB size, it picks up a compaction style and parameters for them. Background threads are not touched yet.
One limit of this algorithm: since compression rate and key/value size are hard to predict, it's hard to predict level 0 file size from write buffer size. Simply make 1:1 ratio here.
Sample results: https://reviews.facebook.net/P477
Test Plan: Will add some a unit test where some sample scenarios are given and see they pick the results that make sense
Reviewers: yhchiang, dhruba, haobo, igor, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D18741
Summary:
Suggested by @dhruba from the other diff, here is the improved
comments for parameters of the function
Test Plan: none
Reviewers: dhruba, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D19623
Summary:
This patch adds a target size parameter in options.db_paths and universal compaction will base it to determine which DB path to place a new file.
Level-style stays the same.
Test Plan: Add new unit tests
Reviewers: ljin, yhchiang
Reviewed By: yhchiang
Subscribers: MarkCallaghan, dhruba, igor, leveldb
Differential Revision: https://reviews.facebook.net/D19869
Summary: Browsing through the code, looks like StatsLogger is not used at all!
Test Plan: compiles
Reviewers: ljin, sdong, yhchiang, dhruba
Reviewed By: dhruba
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D19827
Summary:
Add option and plugin rate limiter for PosixWritableFile. The rate
limiter only applies to flush and compaction. WAL and MANIFEST are
excluded from this enforcement.
Test Plan: db_test
Reviewers: igor, yhchiang, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19425
Summary:
A generic rate limiter that can be shared by threads and rocksdb
instances. Will use this to smooth out write traffic generated by
compaction and flush. This will help us get better p99 behavior on flash
storage.
Test Plan:
unit test output
==== Test RateLimiterTest.Rate
request size [1 - 1023], limit 10 KB/sec, actual rate: 10.374969 KB/sec, elapsed 2002265
request size [1 - 2047], limit 20 KB/sec, actual rate: 20.771242 KB/sec, elapsed 2002139
request size [1 - 4095], limit 40 KB/sec, actual rate: 41.285299 KB/sec, elapsed 2202424
request size [1 - 8191], limit 80 KB/sec, actual rate: 81.371605 KB/sec, elapsed 2402558
request size [1 - 16383], limit 160 KB/sec, actual rate: 162.541268 KB/sec, elapsed 3303500
Reviewers: yhchiang, igor, sdong
Reviewed By: sdong
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19359
Summary:
This diff adds timeout_hint_us to WriteOptions. If it's non-zero, then
1) writes associated with this options MAY be aborted when it has been
waiting for longer than the specified time. If an abortion happens,
associated writes will return Status::TimeOut.
2) the stall time of the associated write caused by flush or compaction
will be limited by timeout_hint_us.
The default value of timeout_hint_us is 0 (i.e., OFF.)
The statistics of timeout writes will be recorded in WRITE_TIMEDOUT.
Test Plan:
export ROCKSDB_TESTS=WriteTimeoutAndDelayTest
make db_test
./db_test
Reviewers: igor, ljin, haobo, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D18837
Summary:
In this patch, we allow RocksDB to support multiple DB paths internally.
No user interface is supported yet so this patch is silent to users.
Test Plan: make all check
Reviewers: igor, haobo, ljin, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D18921
Summary:
As discussed in our internal group, we don't get much use of seek compaction at the moment, while it's making code more complicated and slower in some cases.
This diff removes seek compaction and (hopefully) all code that was introduced to support seek compaction.
There is one test case that relied on didIO information. I'll try to find another way to implement it.
Test Plan: make check
Reviewers: sdong, haobo, yhchiang, ljin, dhruba
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19161
Summary: Include max_write_buffer_number >= 2 to SanitizeOptions.
Test Plan: make all check
Reviewers: haobo, sdong, igor, ljin
Reviewed By: ljin
Subscribers: leveldb
Differential Revision: https://reviews.facebook.net/D19077
Summary: Provide an convenience option to create column families if they are missing from the DB. Task #4460490
Test Plan: added unit test. also, stress test for some time
Reviewers: sdong, haobo, dhruba, ljin, yhchiang
Reviewed By: yhchiang
Subscribers: yhchiang, leveldb
Differential Revision: https://reviews.facebook.net/D18951
Summary:
This patch changes meaning of options.bloom_locality: 0 means disable cache line optimization and any positive number means use CACHE_LINE_SIZE as block size (the previous behavior is the block size will be CACHE_LINE_SIZE*options.bloom_locality). By doing it, the divide operations inside a block can be replaced by a shift.
Performance is improved:
https://reviews.facebook.net/P471
Also, improve the basic algorithm in two ways:
(1) make sure num of blocks is an odd number
(2) rotate bytes after every probe in locality mode. Since the divider is 2^n, unless doing it, we are never able to use all the bits.
Improvements of false positive: https://reviews.facebook.net/P459
Test Plan: make all check
Reviewers: ljin, haobo
Reviewed By: haobo
Subscribers: dhruba, yhchiang, igor, leveldb
Differential Revision: https://reviews.facebook.net/D18843
Summary:
Introducing new compaction style -- FIFO.
FIFO compaction style has write amplification of 1 (+1 for WAL) and it deletes the oldest files when the total DB size exceeds pre-configured values.
FIFO compaction style is suited for storing high-frequency event logs.
Test Plan: Added a unit test
Reviewers: dhruba, haobo, sdong
Reviewed By: dhruba
Subscribers: alberts, leveldb
Differential Revision: https://reviews.facebook.net/D18765
Summary:
This diff addresses task #4296714 and rethinks how users provide us with TablePropertiesCollectors as part of Options.
Here's description of task #4296714:
I'm debugging #4295529 and noticed that our count of user properties kDeletedKeys is wrong. We're sharing one single InternalKeyPropertiesCollector with all Table Builders. In LOG Files, we're outputting number of kDeletedKeys as connected with a single table, while it's actually the total count of deleted keys since creation of the DB.
For example, this table has 3155 entries and 1391828 deleted keys.
The problem with current approach that we call methods on a single TablePropertiesCollector for all the tables we create. Even worse, we could do it from multiple threads at the same time and TablePropertiesCollector has no way of knowing which table we're calling it for.
Good part: Looks like nobody inside Facebook is using Options::table_properties_collectors. This means we should be able to painfully change the API.
In this change, I introduce TablePropertiesCollectorFactory. For every table we create, we call `CreateTablePropertiesCollector`, which creates a TablePropertiesCollector for a single table. We then use it sequentially from a single thread, which means it doesn't have to be thread-safe.
Test Plan:
Added a test in table_properties_collector_test that fails on master (build two tables, assert that kDeletedKeys count is correct for the second one).
Also, all other tests
Reviewers: sdong, dhruba, haobo, kailiu
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D18579
Summary:
This diff is addressing multiple things with a single goal -- to make RocksDB easier to use:
* Add some functions to Options that make RocksDB easier to tune.
* Add example code for both simple RocksDB and RocksDB with Column Families.
* Rewrite our README.md
Regarding Options, I took a stab at something we talked about for a long time:
* https://www.facebook.com/groups/rocksdb.dev/permalink/563169950448190/
I added functions:
* IncreaseParallelism() -- easy, increases the thread pool and max_background_compactions
* OptimizeLevelStyleCompaction(memtable_memory_budget) -- the easiest way to optimize rocksdb for less stalls with level style compaction. This is very likely not ideal configuration. Feel free to suggest improvements. I used some of Mark's suggestions from here: https://github.com/facebook/rocksdb/issues/54
* OptimizeUniversalStyleCompaction(memtable_memory_budget) -- optimize for universal compaction.
Test Plan: compiled rocksdb. ran examples.
Reviewers: dhruba, MarkCallaghan, haobo, sdong, yhchiang
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D18621
Summary:
Added a new option `max_total_wal_size`. Once the total WAL size goes over that, we make an attempt to flush all column families that still have data in the earliest WAL file.
By default, I calculate `max_total_wal_size` dynamically, that should be good-enough for non-advanced customers.
Test Plan: Added a test
Reviewers: dhruba, haobo, sdong, ljin, yhchiang
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D18345
Summary: Add an option to allocate a piece of memory from huge page TLB. Add options to trigger it in dynamic bloom, plain table indexes andhash linked list hash table.
Test Plan: make all check
Reviewers: haobo, ljin
Reviewed By: haobo
CC: nkg-, dhruba, leveldb, igor, yhchiang
Differential Revision: https://reviews.facebook.net/D18357
Summary:
also add an override option total_order_iteration if you want to use full
iterator with prefix_extractor
Test Plan: make all check
Reviewers: igor, haobo, sdong, yhchiang
Reviewed By: haobo
CC: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D17805
Summary:
Introducing RocksDBLite! Removes all the non-essential features and reduces the binary size. This effort should help our adoption on mobile.
Binary size when compiling for IOS (`TARGET_OS=IOS m static_lib`) is down to 9MB from 15MB (without stripping)
Test Plan: compiles :)
Reviewers: dhruba, haobo, ljin, sdong, yhchiang
Reviewed By: yhchiang
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17835
Summary:
it writes ~10M data, default L0 compaction trigger is 4, plus 2 writer
buffer, so that can accommodate ~6M data before compaction happens for
sure. I guess encoding is doing a good job to shrink the data so that
sometime, compaction does not get triggered. I get test failure quite
often.
Test Plan: ran it multiple times and all got pass
Reviewers: igor, sdong
Reviewed By: sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17775
Summary: Since we are optimizing for server workloads, some default values are not optimized any more. We change some of those values that I feel it's less prone to regression bugs.
Test Plan: make all check
Reviewers: dhruba, haobo, ljin, igor, yhchiang
Reviewed By: igor
CC: leveldb, MarkCallaghan
Differential Revision: https://reviews.facebook.net/D16995
Summary:
By constraining the probes within cache line(s), we can improve the
cache miss rate thus performance. This probably only makes sense for
in-memory workload so defaults the option to off.
Numbers and comparision can be found in wiki:
https://our.intern.facebook.com/intern/wiki/index.php/Ljin/rocksdb_perf/2014_03_17#Bloom_Filter_Study
Test Plan: benchmarked this change substantially. Will run make all check as well
Reviewers: haobo, igor, dhruba, sdong, yhchiang
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17133
Summary:
NewFixedPrefixTransform is leaked in default options. Broken by b47812fba6
Also included in the diff some code cleanup
Test Plan:
valgrind env_test
also make check
Reviewers: haobo, danguo, yhchiang
Reviewed By: danguo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17211
Summary:
This diff adds a new CompactionFilterV2 API that roll up the
decisions of kv pairs during compactions. These kv pairs must share the
same key prefix. They are buffered inside the db.
typedef std::vector<Slice> SliceVector;
virtual std::vector<bool> Filter(int level,
const SliceVector& keys,
const SliceVector& existing_values,
std::vector<std::string>* new_values,
std::vector<bool>* values_changed
) const = 0;
Application can override the Filter() function to operate
on the buffered kv pairs. More details in the inline documentation.
Test Plan:
make check. Added unit tests to make sure Keep, Delete,
Change all works.
Reviewers: haobo
CCs: leveldb
Differential Revision: https://reviews.facebook.net/D15087
Summary:
* PartialMerge api now takes a list of operands instead of two operands.
* Add min_pertial_merge_operands to Options, indicating the minimum
number of operands to trigger partial merge.
* This diff is based on Schalk's previous diff (D14601), but it also
includes necessary changes such as updating the pure C api for
partial merge.
Test Plan:
* make check all
* develop tests for cases where partial merge takes more than two
operands.
TODOs (from Schalk):
* Add test with min_partial_merge_operands > 2.
* Perform benchmarks to measure the performance improvements (can probably
use results of task #2837810.)
* Add description of problem to doc/index.html.
* Change wiki pages to reflect the interface changes.
Reviewers: haobo, igor, vamsi
Reviewed By: haobo
CC: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D16815
Summary:
(1) Fix SanitizeOptions() to also check HashLinkList. The current
dynamic case just happens to work because the 2 classes have the same
layout.
(2) Do not delete SliceTransform object in HashSkipListFactory and
HashLinkListFactory destructor. Reason: SanitizeOptions() enforces
prefix_extractor and SliceTransform to be the same object when
Hash**Factory is used. This makes the behavior strange: when
Hash**Factory is used, prefix_extractor will be released by RocksDB. If
other memtable factory is used, prefix_extractor should be released by
user.
Test Plan: db_bench && make asan_check
Reviewers: haobo, igor, sdong
Reviewed By: igor
CC: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D16587
Summary:
If verify_checksums_in_compaction is true, compaction will verify checksums. This is default.
If it's false, compaction doesn't verify checksums. This is useful for in-memory workloads.
Test Plan: corruption_test
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16695
Summary:
Currently, there is no easy way for user to change log level of info log. Add a parameter in options to specify that.
Also make the default level to INFO level. Removing the [INFO] tag if it is INFO level as I don't want to cause performance regression. (add [LOG] means another mem-copy and string formatting).
Test Plan:
make all check
manual check the levels work as expected.
Reviewers: dhruba, yhchiang
Reviewed By: yhchiang
CC: dhruba, igor, i.am.jin.lei, ljin, haobo, leveldb
Differential Revision: https://reviews.facebook.net/D16563
Summary: as title
Test Plan:
asan_check
will post results later
Reviewers: haobo, igor, dhruba, sdong
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16257
Summary: When we open a DB, we should dump only DBOptions and then when we create a new column family, we dump ColumnFamilyOptions for each one.
Test Plan: make check, confirm contents of the LOG
Reviewers: dhruba, haobo, sdong, kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16011
Summary:
This is not column-family related diff. It is in columnfamily branch because the change is significant and we want to push it with next major release (3.0).
It removes the leveldb notion of one thread pool and expands it to two thread pools by default (HIGH and LOW). Flush process is removed from compaction process and all flush threads are executed on HIGH thread pool, since we don't want long-running compactions to influence flush latency.
Test Plan: make check
Reviewers: dhruba, haobo, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15987
Summary:
There are three SanitizeOption-s now : one for DBOptions, one for ColumnFamilyOptions and one for Options (which just calls the other two)
I have also reshuffled some options -- table_cache options and info_log should live in DBOptions, for example.
Test Plan: make check doesn't complain
Reviewers: dhruba, haobo, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15873
Summary: In PlainTable, use one single byte to represent 8 bytes of internal bytes, if seqID = 0 and it is value type (which should be common for bottom most files). It is to save 7 bytes for uncompressed cases.
Test Plan: make all check
Reviewers: haobo, dhruba, kailiu
Reviewed By: haobo
CC: igor, leveldb
Differential Revision: https://reviews.facebook.net/D15489
Summary: Compaction picker and internal key comparator are different for each column family (not global), so they should live in ColumnFamilyData
Test Plan: make check
Reviewers: dhruba, haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15801
Summary: The default settings enable checksum verification on every read.
Test Plan: make check
Reviewers: haobo
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15591
Summary: All memtables and immutable memtables are moved from DBImpl to ColumnFamilyData. For now, they are all referenced from default column family in DBImpl. It shouldn't be hard to get them from custom column family.
Test Plan: make check
Reviewers: dhruba, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15459
Summary: By removing some includes form options.h and reply on forward declaration, we can more easily reason the dependencies.
Test Plan: make all check
Reviewers: kailiu, haobo, igor, dhruba
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15411
Summary:
This diff implements a special type of iterator that doesn't create a snapshot
(can be used to read newly inserted data) and is optimized for doing sequential
reads.
TailingIterator uses current superversion number to determine whether to
invalidate its internal iterators. If the version hasn't changed, it can often
avoid doing expensive seeks over immutable structures (sst files and immutable
memtables).
Test Plan:
* new unit tests
* running LD with this patch
Reviewers: igor, dhruba, haobo, sdong, kailiu
Reviewed By: sdong
CC: leveldb, lovro, march
Differential Revision: https://reviews.facebook.net/D15285
Summary:
This diff takes an even more aggressive way to inline the functions. A decent rule that I followed is "not inline a function if it is more than 10 lines long."
Normally optimizing code by inline is ugly and hard to control, but since one of our usecase has significant amount of CPU used in functions from coding.cc, I'd like to try this diff out.
Test Plan:
1. the size for some .o file increased a little bit, but most less than 1%. So I think the negative impact of inline is negligible.
2. As the regression test shows (ran for 10 times and I calculated the average number)
Metrics Befor After
========================================================================
rocksdb.build.fillseq.qps 426595 444515 (+4.6%)
rocksdb.build.memtablefillrandom.qps 121739 123110
rocksdb.build.memtablereadrandom.qps 1285103 1280520
rocksdb.build.overwrite.qps 125816 135570 (+9%)
rocksdb.build.readrandom_fillunique_random.qps 285995 296863
rocksdb.build.readrandom_memtable_sst.qps 1027132 1027279
rocksdb.build.readrandom.qps 1041427 1054665
rocksdb.build.readrandom_smallblockcache.qps 1028631 1038433
rocksdb.build.readwhilewriting.qps 918352 914629
Reviewers: haobo, sdong, igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15291
Summary:
This diff fixes 2 hacks:
* The callback function can modify the existing value inplace, if the merged value fits within the existing buffer size. But currently the existing buffer size is not being modified. Now the callback recieves a int* allowing the size to be modified. Since size is encoded as a varint in the internal key for memtable. It might happen that the entire value might have be copied to the new location if the new size varint is smaller than the existing size varint.
* The callback function has 3 functionalities
1. Modify existing buffer inplace, and update size correspondingly. Now to indicate that, Returns 1.
2. Generate a new buffer indicating merged value. Returns 2.
3. Fails to do either of above, based on whatever application logic. Returns 0.
Test Plan: Just make all for now. I'm adding another unit test to test each scenario.
Reviewers: dhruba, haobo
Reviewed By: haobo
CC: leveldb, sdong, kailiu, xinyaohu, sumeet, danguo
Differential Revision: https://reviews.facebook.net/D15195
Summary: The application can set a callback function, which is applied on the previous value. And calculates the new value. This new value can be set, either inplace, if the previous value existed in memtable, and new value is smaller than previous value. Otherwise the new value is added normally.
Test Plan: fbmake. Added unit tests. All unit tests pass.
Reviewers: dhruba, haobo
Reviewed By: haobo
CC: sdong, kailiu, xinyaohu, sumeet, leveldb
Differential Revision: https://reviews.facebook.net/D14745
Summary:
Added an option (max_successive_merges) that can be used to specify the
maximum number of successive merge operations on a key in the memtable.
This can be used to improve performance of the "get" operation. If many
successive merge operations are performed on a key, the performance of "get"
operations on the key deteriorates, as the value has to be computed for each
"get" operation by applying all the successive merge operations.
FB Task ID: #3428853
Test Plan:
make all check
db_bench --benchmarks=readrandommergerandom
counter_stress_test
Reviewers: haobo, vamsi, dhruba, sdong
Reviewed By: haobo
CC: zshao
Differential Revision: https://reviews.facebook.net/D14991