Summary:
So far we only have key/value pairs as well as bloom filter stored in the
sst file. It will be great if we are able to store more metadata about
this table itself, for example, the entry size, bloom filter name, etc.
This diff is the first step of this effort. It allows table to keep the
basic statistics mentioned in http://fburl.com/14995441, as well as
allowing writing user-collected stats to stats block.
After this diff, we will figure out the interface of how to allow user to collect their interested statistics.
Test Plan:
1. Added several unit tests.
2. Ran `make check` to ensure it doesn't break other tests.
Reviewers: dhruba, haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13419
Summary: When I debug the unit test failures when enabling background flush thread, I feel the function names can be made clearer for people to understand. Also, if the names are fixed, in many places, some tests' bugs are obvious (and some of those tests are failing). This patch is to clean it up for future maintenance.
Test Plan: Run test suites.
Reviewers: haobo, dhruba, xjin
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13431
Summary:
With this patch, when LRUCache.Insert() is called and the cache is full, it will first try to free up entries whose reference counter is 1 (would become 0 after remo\
ving from the cache). We do it in two passes, in the first pass, we only try to release those unreferenced entries. If we cannot free enough space after traversing t\
he first remove_scan_cnt_ entries, we start from the beginning again and remove those entries being used.
Test Plan: add two unit tests to cover the codes
Reviewers: dhruba, haobo, emayanke
Reviewed By: emayanke
CC: leveldb, emayanke, xjin
Differential Revision: https://reviews.facebook.net/D13377
Summary:
This careless error was causing ASSERT_OK(DestroyDB) to fail in db_test.
Basically .. was being returned as a child of db/archive and ParseFileName returned false on that,
but 'type' was set to LogFile from earlier and not reset. The return of ParseFileName was not being checked to delete the log file or not.
Test Plan: make all check
Reviewers: dhruba, haobo, xjin, kailiu, nkg-
Reviewed By: nkg-
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13413
Summary: In some cases, you might not want to store the data log (write ahead log) files in the same dir as the sst files. An example use case is leaf, which stores sst files in tmpfs. And would like to save the log files in a separate dir (disk) to save memory.
Test Plan: make all. Ran db_test test. A few test failing. P2785018. If you guys don't see an obvious problem with the code, maybe somebody from the rocksdb team could help me debug the issue here. Running this on leaf worked well. I could see logs stored on disk, and deleted appropriately after compactions. Obviously this is only one set of options. The unit tests cover different options. Seems like I'm missing some edge cases.
Reviewers: dhruba, haobo, leveldb
CC: xinyaohu, sumeet
Differential Revision: https://reviews.facebook.net/D13239
Summary: While working on D13239, I noticed that the same options are not used for opening and destroying at db. So adding that. Also added asserts for successful DestroyDB calls.
Test Plan: Ran unit tests. Atleast 1 unit test is failing. They failures are a result of some past logic change. I'm not really planning to fix those. But I would like to check this in. And hopefully the respective unit test owners can fix the broken tests
Reviewers: leveldb, haobo
CC: xinyaohu, sumeet, dhruba
Differential Revision: https://reviews.facebook.net/D13329
Summary:
Previous patch introduced a unit test failure in
DBTest.NumImmutableMemTable because of change in property names.
Test Plan:
Reviewers:
CC:
Task ID: #
Blame Rev:
Summary:
Change namespace from leveldb to rocksdb. This allows a single
application to link in open-source leveldb code as well as
rocksdb code into the same process.
Test Plan: compile rocksdb
Reviewers: emayanke
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13287
Summary: This is useful to keep track of refreshes in transaction log iterator
Test Plan: make; db_stress --statistics=1 shows it
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13281
Summary:
As explained in comments in GetLiveFiles in db.h, this option will cause flush to be skipped in GetLiveFiles because some use-cases use GetSortedWalFiles after GetLiveFiles to generate more complete snapshots.
Using GetSortedWalFiles after GetLiveFiles allows us to not Flush in GetLiveFiles first because wals have everything.
Note: file deletions will be disabled before calling GLF or GSWF so live logs will not move to archive logs or get delted.
Note: Manifest file is truncated to a proper value in GLF, so it will always reply from the proper wal files on a restart
Test Plan: make
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13257
Summary: as title
Test Plan: make check
Reviewers: emayanke
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13263
Summary: As title. This is just a quick hack and not ready for commit. fails a lot of unit test. I will test/debug it directly in ViewState shadow .
Test Plan: Try it in shadow test.
Reviewers: dhruba, xjin
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12933
Summary:
We saw SIGSEGV when set options.num_levels=1 in universal compaction
style. Dug into this issue for a while, and finally found the root cause (thank Haobo for discussion).
Test Plan: Add new unit test. It throws SIGSEGV without this change. Also run "make all check".
Reviewers: haobo, dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13251
Summary: as title. unit test not polished. this is for a quick live test
Test Plan: live
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13221
Summary: A previous diff moved these outside of lock protected area. Moved back in now. Also moved tmp_batch_ update outside of lock protected area, as only the single write thread can access it.
Test Plan: make check
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13137
Summary: The original optimization missed updating links other than the lowest level.
Test Plan: make check; perf_context_test
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb, adsharma
Differential Revision: https://reviews.facebook.net/D13119
Summary:
The constructor for Vector memtable has a parameter called 'count'
that specifies the capacity of the vector to be reserved at allocation
time. It was incorrectly used to initialize the size of the vector.
Test Plan: Enhanced db_test.
Reviewers: haobo, xjin, emayanke
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13083
Summary:
There is a use-case where we want to insert data into rocksdb as
fast as possible. Vector rep is used for this purpose.
The background flush thread needs to flush the vectorrep to
storage. It acquires the dblock then sorts the vector, releases
the dblock and then writes the sorted vector to storage. This is
suboptimal because the lock is held during the sort, which
prevents new writes for occuring.
This patch moves the sorting of the vector rep to outside the
db mutex. Performance is now as fastas the underlying storage
system. If you are doing buffered writes to rocksdb files, then
you can observe throughput upwards of 200 MB/sec writes.
This is an early draft and not yet ready to be reviewed.
Test Plan:
make check
Task ID: #
Blame Rev:
Reviewers: haobo
Reviewed By: haobo
CC: leveldb, haobo
Differential Revision: https://reviews.facebook.net/D12987
Summary: as title
Test Plan: make db_test; ./db_test
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13005
Summary: as title
Test Plan: make db_test; ./db_test
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12981
Summary: Currently, when total number of files reaches level0_file_num_compaction_trigger, universal compaction will schedule a compaction job, but the job will not honor the compaction until the total number of files is level0_file_num_compaction_trigger+1. Fixed the condition for consistent behavior (start compaction on reaching level0_file_num_compaction_trigger).
Test Plan: make check; db_stress
Reviewers: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12945
Summary:
Added a new field called max_size_amplification_ratio in the
CompactionOptionsUniversal structure. This determines the maximum
percentage overhead of space amplification.
The size amplification is defined to be the ratio between the size of
the oldest file to the sum of the sizes of all other files. If the
size amplification exceeds the specified value, then min_merge_width
and max_merge_width are ignored and a full compaction of all files is done.
A value of 10 means that the size a database that stores 100 bytes
of user data could occupy 110 bytes of physical storage.
Test Plan: Unit test DBTest.UniversalCompactionSpaceAmplification added.
Reviewers: haobo, emayanke, xjin
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12825
Summary: As title. The DB log file life cycle is tied up with the memtable it backs. Once the memtable is flushed to sst and committed, we should be able to delete the log file, without holding the mutex. This is part of the bigger change to avoid FindObsoleteFiles at runtime. It deals with log files. sst files will be dealt with later.
Test Plan: make check; db_bench
Reviewers: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11709
Summary: The pupose of this diff is to expose per user-call level precise timing of block read, so that we can answer questions like: a Get() costs me 100ms, is that somehow related to loading blocks from file system, or sth else? We will answer that with EXACTLY how many blocks have been read, how much time was spent on transfering the bytes from os, how much time was spent on checksum verification and how much time was spent on block decompression, just for that one Get. A nano second stopwatch was introduced to track time with higher precision. The cost/precision of the stopwatch is also measured in unit-test. On my dev box, retrieving one time instance costs about 30ns, on average. The deviation of timing results is good enough to track 100ns-1us level events. And the overhead could be safely ignored for 100us level events (10000 instances/s), for example, a viewstate thrift call.
Test Plan: perf_context_test, also testing with viewstate shadow traffic.
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb, xjin
Differential Revision: https://reviews.facebook.net/D12351
Summary:
There is an config option called Options.min_write_buffer_number_to_merge
that specifies the minimum number of write buffers to merge in memory
before flushing to a file in L0. But in the the case when the db is
being closed, we should not be using this config, instead we should
flush whatever write buffers were available at that time.
Test Plan: Unit test attached.
Reviewers: haobo, emayanke
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12717
Summary:
An iterator invokes reseek if the number of sequential skips over the
same userkey exceeds a configured number. This makes iter->Next()
faster (bacause of fewer key compares) if a large number of
adjacent internal keys in a table (sst or memtable) have the
same userkey.
Test Plan: Unit test DBTest.IterReseek.
Reviewers: emayanke, haobo, xjin
Reviewed By: xjin
CC: leveldb, xjin
Differential Revision: https://reviews.facebook.net/D11865
Summary: So that replication can just download from wherever LogFile.Pathname is pointing them.
Test Plan: make all check;./db_repl_stress
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12609
Summary:
Add new command "change_compaction_style" to ldb tool. For
universal->level, it shows "nothing to do". For level->universal, it
compacts all files into a single one and moves the file to level 0.
Also add check for number of files at level 1+ when opening db with
universal compaction style.
Test Plan:
'make all check'. New unit test for internal convertion function. Also manully test various
cmd like:
./ldb change_compaction_style --old_compaction_style=0
--new_compaction_style=1 --db=/tmp/leveldbtest-3088/db_test
Reviewers: haobo, dhruba
Reviewed By: haobo
CC: vamsi, emayanke
Differential Revision: https://reviews.facebook.net/D12603
Summary:
The way counters/statistics are implemented in rocksdb demands that enum Tickers and TickerNameMap follow the same order, otherwise statistics exposed from fbcode/rocks get out-of-sync. 2 counters for prefix had violated this order and when I built counters for fbcode/mcrocksdb, statistics for sequence number were appearing out-of-sync.
The other change is to record sequence-number using setTickerCount only and not recordTick. This is because of difference in statistics as understood by rocks/utils which uses ServiceData::statistics function and rocksdb statistics. In rocksdb there is just 1 counter for a countername. But in ServiceData there are 4 independent buckets for every countername-Count, Sum, Average and Rate. SetTickerCount and RecordTick update the same variable in rocksdb but different buckets in ServiceData. Therefore, I had to choose one consistent function from RecordTick or SetTickerCount for sequence number in rocksdb. I chose SetTickerCount because the statistics object in options passed during rocksdb-open is user-dependent and SetTickerCount makes sense there.
There will be a corresponding diff to mcorcksdb in fbcode shortly.
Test Plan: make all check; check ticker value using fprintfs
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12669
Summary: db->DeleteFile calls ParseFileName to check name that was returned for sst file. Now, sst filename is returned using TableFileName which uses MakeFileName. This puts a / at the front of the name and ParseFileName doesn't like that. Changed ParseFileName to tolerate /s at the beginning. The test delet_file_test used to pass earlier because this behaviour of MakeFileName had been changed a while back to not return a / during which delete_file_test was checked in. But MakeFileName had to be reverted to add / at the front because GetLiveFiles used at many places outside rocksdb used the previous behaviour of MakeFileName.
Test Plan: make;./delete_filetest;make all check
Reviewers: dhruba, haobo, vamsi
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12663
Summary: // won't hurt but a missing / hurts sometimes
Test Plan: make all check; ./db_repl_stress
Reviewers: vamsi
Reviewed By: vamsi
CC: dhruba
Differential Revision: https://reviews.facebook.net/D12621
Summary:
The DeleteFile API was removing files inside the db-lock. This
is now changed to remove files outside the db-lock.
The GetLiveFilesMetadata() returns the smallest and largest
seqnuence number of each file as well.
Test Plan: deletefile_test
Reviewers: emayanke, haobo
Reviewed By: haobo
CC: leveldb
Maniphest Tasks: T63
Differential Revision: https://reviews.facebook.net/D12567
Summary: Let TransformRepFactory own the passed in transform. Also make it better encapsulated.
Test Plan: make valgrind_check;
Reviewers: dhruba, emayanke
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12591
Summary:
If ReadOptions.non_blocking_io is set to true, then KeyMayExists
and Iterators will return data that is cached in RAM.
If the Iterator needs to do IO from storage to serve the data,
then the Iterator.status() will return Status::IsRetry().
Test Plan:
Enhanced unit test DBTest.KeyMayExist to detect if there were are IOs
issues from storage. Added DBTest.NonBlockingIteration to verify
nonblocking Iterations.
Reviewers: emayanke, haobo
Reviewed By: haobo
CC: leveldb
Maniphest Tasks: T63
Differential Revision: https://reviews.facebook.net/D12531
Summary:
As title. This is possible as tickers are atomic now.
db_bench on high qps in-memory muti-thread random get workload, showed ~5% throughput improvement.
Test Plan: make check; db_bench; db_stress
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12555
Summary: In KeyMayExist.db_test we do a Flush which causes sst file to be written and added as open file in TableCache, but block cache for the file is not populated. So value_found should have been false where it was true and KeyMayExist.db_test should not have passed earlier. But it passed because BlockReader in table/table.cc takes 2 default arguments at the end called for_compaction and no_io. Although I passed no_io=true from InternalGet to BlockReader, but it understood for_compaction=true and defaulted no_io to false. This is a bug and although will be removed by Dhruba's new patch to incorporate no_io in readoptions, I'm submitting this patch to fix this bug independently of that patch.
Test Plan: make all check
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12537
Summary: initialiszer list is fasteri/preferable because it can straightaway call the constructor for this object, otherwise it will be created first and then again initialized. Although gain may not be much in this case because files_ is just a pointer and not a complex object, this is recommended practice.
Test Plan: make all check
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12519
Summary: Fix code so that the filter_block layer only assumes keys are internal when prefix_extractor is set.
Test Plan: ./filter_block_test
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12501
Summary: Replace include/leveldb with include/rocksdb.
Test Plan:
make clean; make check
make clean; make release
Differential Revision: https://reviews.facebook.net/D12489
Summary:
This patch adds three new MemTableRep's: UnsortedRep, PrefixHashRep, and VectorRep.
UnsortedRep stores keys in an std::unordered_map of std::sets. When an iterator is requested, it dumps the keys into an std::set and iterates over that.
VectorRep stores keys in an std::vector. When an iterator is requested, it creates a copy of the vector and sorts it using std::sort. The iterator accesses that new vector.
PrefixHashRep stores keys in an unordered_map mapping prefixes to ordered sets.
I also added one API change. I added a function MemTableRep::MarkImmutable. This function is called when the rep is added to the immutable list. It doesn't do anything yet, but it seems like that could be useful. In particular, for the vectorrep, it means we could elide the extra copy and just sort in place. The only reason I haven't done that yet is because the use of the ArenaAllocator complicates things (I can elaborate on this if needed).
Test Plan:
make -j32 check
./db_stress --memtablerep=vector
./db_stress --memtablerep=unsorted
./db_stress --memtablerep=prefixhash --prefix_size=10
Reviewers: dhruba, haobo, emayanke
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12117