Summary:
A recent commit e37dd216f9 makes sure hash index can be used when reading existing files. This patch uses another way to achieve the approach:
(1) Currently, always writing kBinarySearch to files, despite of BlockBasedTableOptions.IndexType setting.
(2) When reading a file, read out the field, and make sure it is kBinarySearch, while always use index type by users.
The reason for doing it is, to reserve kHashSearch property on disk to future. If now we write out binary index for both of kHashSearch and kBinarySearch. We have to use a new flag in the future for hash index on disk, otherwise compatibility would break. Also, we want the real index type and type shown in properties block to be consistent.
Test Plan: make all check
Reviewers: haobo, kailiu
Reviewed By: kailiu
CC: igor, ljin, yhchiang, xjin, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D18009
Summary:
With the recent changes, there is no need to check the property block about the index block type.
If user want to use it, they don't really need any disk format change; everything happens in the fly.
Also another team encountered an error while reading the index type from properties.
Test Plan:
ran all the tests
Reviewers: sdong
CC:
Task ID: #
Blame Rev:
Summary:
From 2.6 to 2.7, property block name is renamed from rocksdb.stats to rocksdb.properties. Older properties were not able to be loaded. In 2.8, we seem to have added some logic that uses property block without checking null pointers, which create segment faults.
In this patch, we fix it by:
(1) try rocksdb.stats if rocksdb.properties is not found
(2) add some null checking before consuming rep->table_properties
Test Plan: make sure a file generated in 2.7 couldn't be opened now can be opened.
Reviewers: haobo, igor, yhchiang
Reviewed By: igor
CC: ljin, xjin, dhruba, kailiu, leveldb
Differential Revision: https://reviews.facebook.net/D17961
Summary: Based on previous patches, this diff eventually provides the end-to-end mechanism for users to specify the hash-index.
Test Plan: Wrote several new unit tests.
Reviewers: sdong, haobo, dhruba
Reviewed By: sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16539
Summary: Compiling for iOS has by default turned on -Wmissing-prototypes, which causes rocksdb to fail compiling. This diff turns on -Wmissing-prototypes in our compile options and cleans up all functions with missing prototypes.
Test Plan: compiles
Reviewers: dhruba, haobo, ljin, sdong
Reviewed By: ljin
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17649
Summary: This will allow us to disable them completely for iOS or for better performance
Test Plan: will run make all check
Reviewers: igor, haobo, dhruba
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D17511
Summary:
BinarySearchIndex didn't use unique_ptr to guard the block object nor
delete it in destructor, leading to valgrind failure for "definite
memory leak".
Test Plan:
re-ran the failed valgrind test cases
Summary:
My last diff was developed in MacOS but in devserver environment error occurs.
I dug into the problem and found the way we calcuate approximate data size is pretty out-of-date. We can use table properties to get more accurate results.
Test Plan: ran ./table_test and passed
Reviewers: igor, dhruba, haobo, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16509
Summary:
This patch introduced a new table options that allows us to make
block-based table's index pluggable.
To support that new features:
* Code has been refacotred to be more flexible and supports this option well.
* More documentation is added for the existing obsecure functionalities.
* Big surgeon on DataBlockReader(), where the logic was really convoluted.
* Other small code cleanups.
The pluggablility will mostly affect development of internal modules
and won't change frequently, as a result I intentionally avoid
heavy-weight patterns (like factory) and try to make it simple.
Test Plan: make all check
Reviewers: haobo, sdong
Reviewed By: sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16395
Summary:
Found some function follows camel style. When naming funciton, we have two styles:
Trivially expose internal data in readonly mode: `all_lower_case()`
Regular function: `CapitalizeFirstLetter()`
I renames these functions.
Test Plan: make -j32
Reviewers: haobo, sdong, dhruba, igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16383
Summary: This bug caused server crash issues because the filter block is too big and kept purging out of cache.
Test Plan: Wrote a new unit tests to make sure it works.
Reviewers: dhruba, haobo, igor, sdong
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D16221
Summary:
We are going to expose properties of all tables to end users through "some" db interface.
However, current design doesn't naturally fit for this need, which is because:
1. If a table presents in table cache, we cannot simply return the reference to its table properties, because the table may be destroy after compaction (and we don't want to hold the ref of the version).
2. Copy table properties is OK, but it's slow.
Thus in this diff, I change the table reader's interface to return a shared pointer (for const table properties), instead a const refernce.
Test Plan: `make check` passed
Reviewers: haobo, sdong, dhruba
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15999
Summary: In PlainTable, use one single byte to represent 8 bytes of internal bytes, if seqID = 0 and it is value type (which should be common for bottom most files). It is to save 7 bytes for uncompressed cases.
Test Plan: make all check
Reviewers: haobo, dhruba, kailiu
Reviewed By: haobo
CC: igor, leveldb
Differential Revision: https://reviews.facebook.net/D15489
Summary:
Mixing index/filter blocks with data blocks resulted in some known
issues. To make sure in next release our users won't be affected,
we added a new option in BlockBasedTableFactory::TableOption to
conceal this functionality for now.
This patch also introduced a BlockBasedTableReader::OpenOptions,
which avoids the "infinite" growth of parameters in
BlockBasedTableReader::Open().
Test Plan: make check
Reviewers: haobo, sdong, igor, dhruba
Reviewed By: igor
CC: leveldb, tnovak
Differential Revision: https://reviews.facebook.net/D15327
Summary:
This is the last diff that adds the property block to plain table.
The format resembles that of the block-based table: https://github.com/facebook/rocksdb/wiki/Rocksdb-table-format
[data block]
[meta block 1: stats block]
[meta block 2: future extended block]
...
[meta block K: future extended block] (we may add more meta blocks in the future)
[metaindex block]
[index block: we only have the placeholder here, we can add persistent index block in the future]
[Footer: contains magic number, handle to metaindex block and index block]
<end_of_file>
Test Plan: extended existing property block test.
Reviewers: haobo, sdong, dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14523
Summary: This change will allow other table to reuse the code for meta blocks.
Test Plan: all existing unit tests passed
Reviewers: dhruba, haobo, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14475
Summary:
As we are having different types of tables and they all might share the same structure in block-based table:
[metaindex block]
[index block]
[Footer]
To be able to identify differnt types of tables, we need to parameterize the "magic number" in the `Footer`.
Test Plan:
make check
Summary:
I went through all remaining shared_ptrs and removed the ones that I found not-necessary. Only GenerateCachePrefix() is called fairly often, so don't expect much perf wins.
The ones that are left are accessed infrequently and I think we're fine with keeping them.
Test Plan: make asan_check
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14427
Summary: liveness of the statistics object is already ensured by the shared pointer in DB options. There's no reason to pass again shared pointer among internal functions. Raw pointer is sufficient and efficient.
Test Plan: make check
Reviewers: dhruba, MarkCallaghan, igor
Reviewed By: dhruba
CC: leveldb, reconnect.grayhat
Differential Revision: https://reviews.facebook.net/D14289
Summary:
The primary motivation of the changes is to make it easier to figure out the inside of the tables.
* rename "table stats" to "table properties" since now we have more than "integers" to store in the property block.
* Add filter block size to the basic table properties.
* Whenever a table is built, we'll log the table properties (the sample output is in Test Plan).
* Make an api to expose deleted keys.
Test Plan:
Passed all existing test. and the sample output of table stats:
==================================================================
Basic Properties
------------------------------------------------------------------
# data blocks: 1
# entries: 1
raw key size: 9
raw average key size: 9
raw value size: 9
raw average value size: 0
data block size: 25
index block size: 27
filter block size: 18
(estimated) table size: 70
filter policy: rocksdb.BuiltinBloomFilter
==================================================================
User collected properties: InternalKeyPropertiesCollector
------------------------------------------------------------------
kDeletedKeys: 1
==================================================================
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14187
Summary: This diff invoves some more complicated issues in the posix environment.
Test Plan: works under mac os. will need to verify dev box.
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14061
Summary: https://reviews.facebook.net/D13167 broke bloom filters. If filter is not in cache, we want to return true (safe thing). Am I right?
Test Plan: when benchmarking https://reviews.facebook.net/D14031 I got different results when using bloom filters vs. when not using them. This fixed the issue. I will also be putting this change to the other diff, but that one will probably be in review for longer time.
Reviewers: kailiu, dhruba, haobo
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14085
Summary: This diff leverage the existing block cache and extend it to cache index/filter block.
Test Plan:
Added new tests in db_test and table_test
The correctness is checked by:
1. make check
2. make valgrind_check
Performance is test by:
1. 10 times of build_tools/regression_build_test.sh on two versions of rocksdb before/after the code change. Test results suggests no significant difference between them. For the two key operatons `overwrite` and `readrandom`, the average iops are both 20k and ~260k, with very small variance).
2. db_stress.
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb, haobo, xjin
Differential Revision: https://reviews.facebook.net/D13167
Summary: The work to make sure mac os compiles rocksdb is not completed yet. But at least we can start cleaning some warnings captured only by g++ from mac os..
Test Plan: ran make in mac os
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14049
Summary:
Rocksdb can now support a uncompressed block cache, or a compressed
block cache or both. Lookups first look for a block in the
uncompressed cache, if it is not found only then it is looked up
in the compressed cache. If it is found in the compressed cache,
then it is uncompressed and inserted into the uncompressed cache.
It is possible that the same block resides in the compressed cache
as well as the uncompressed cache at the same time. Both caches
have their own individual LRU policy.
Test Plan: Unit test case attached.
Reviewers: kailiu, sdong, haobo, leveldb
Reviewed By: haobo
CC: xjin, haobo
Differential Revision: https://reviews.facebook.net/D12675
Summary:
This patch is to address @haobo's comments on D13521:
1. rename Table to be TableReader and make its factory function to be GetTableReader
2. move the compression type selection logic out of TableBuilder but to compaction logic
3. more accurate comments
4. Move stat name constants into BlockBasedTable implementation.
5. remove some uncleaned codes in simple_table_db_test
Test Plan: pass test suites.
Reviewers: haobo, dhruba, kailiu
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13785
Summary: This patch makes Table and TableBuilder a abstract class and make all the implementation of the current table into BlockedBasedTable and BlockedBasedTable Builder.
Test Plan: Make db_test.cc to work with block based table. Add a new test simple_table_db_test.cc where a different simple table format is implemented.
Reviewers: dhruba, haobo, kailiu, emayanke, vamsi
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13521
Summary: As title.
Test Plan: Updated the unit tests to make sure new statistic is correctly written/read.
Reviewers: dhruba, haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13497
Summary:
So far we only have key/value pairs as well as bloom filter stored in the
sst file. It will be great if we are able to store more metadata about
this table itself, for example, the entry size, bloom filter name, etc.
This diff is the first step of this effort. It allows table to keep the
basic statistics mentioned in http://fburl.com/14995441, as well as
allowing writing user-collected stats to stats block.
After this diff, we will figure out the interface of how to allow user to collect their interested statistics.
Test Plan:
1. Added several unit tests.
2. Ran `make check` to ensure it doesn't break other tests.
Reviewers: dhruba, haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13419
Summary:
Change namespace from leveldb to rocksdb. This allows a single
application to link in open-source leveldb code as well as
rocksdb code into the same process.
Test Plan: compile rocksdb
Reviewers: emayanke
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D13287
Summary: The pupose of this diff is to expose per user-call level precise timing of block read, so that we can answer questions like: a Get() costs me 100ms, is that somehow related to loading blocks from file system, or sth else? We will answer that with EXACTLY how many blocks have been read, how much time was spent on transfering the bytes from os, how much time was spent on checksum verification and how much time was spent on block decompression, just for that one Get. A nano second stopwatch was introduced to track time with higher precision. The cost/precision of the stopwatch is also measured in unit-test. On my dev box, retrieving one time instance costs about 30ns, on average. The deviation of timing results is good enough to track 100ns-1us level events. And the overhead could be safely ignored for 100us level events (10000 instances/s), for example, a viewstate thrift call.
Test Plan: perf_context_test, also testing with viewstate shadow traffic.
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb, xjin
Differential Revision: https://reviews.facebook.net/D12351
Summary:
In InternalGet, BlockReader returns an Iterator which is legitimately freed at the end of the 'else' scope. BUT there is a break statement in between and must be freed there too!
The best solution would be to move to unique_ptr and let it handle. Changed it to a unique_ptr.
Test Plan: valgrind ./db_test;make all check
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12681
Summary:
If ReadOptions.non_blocking_io is set to true, then KeyMayExists
and Iterators will return data that is cached in RAM.
If the Iterator needs to do IO from storage to serve the data,
then the Iterator.status() will return Status::IsRetry().
Test Plan:
Enhanced unit test DBTest.KeyMayExist to detect if there were are IOs
issues from storage. Added DBTest.NonBlockingIteration to verify
nonblocking Iterations.
Reviewers: emayanke, haobo
Reviewed By: haobo
CC: leveldb
Maniphest Tasks: T63
Differential Revision: https://reviews.facebook.net/D12531
Summary: In KeyMayExist.db_test we do a Flush which causes sst file to be written and added as open file in TableCache, but block cache for the file is not populated. So value_found should have been false where it was true and KeyMayExist.db_test should not have passed earlier. But it passed because BlockReader in table/table.cc takes 2 default arguments at the end called for_compaction and no_io. Although I passed no_io=true from InternalGet to BlockReader, but it understood for_compaction=true and defaulted no_io to false. This is a bug and although will be removed by Dhruba's new patch to incorporate no_io in readoptions, I'm submitting this patch to fix this bug independently of that patch.
Test Plan: make all check
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12537
Summary: Fix code so that the filter_block layer only assumes keys are internal when prefix_extractor is set.
Test Plan: ./filter_block_test
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D12501
Summary: Replace include/leveldb with include/rocksdb.
Test Plan:
make clean; make check
make clean; make release
Differential Revision: https://reviews.facebook.net/D12489
Summary: If use_prefix_filters is set and read_range>1, then the random seeks will set a the prefix filter to be the prefix of the key which was randomly selected as the target. Still need to add statistics (perhaps in a separate diff).
Test Plan: ./db_bench --benchmarks=fillseq,prefixscanrandom --num=10000000 --statistics=1 --use_prefix_blooms=1 --use_prefix_api=1 --bloom_bits=10
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb, haobo
Differential Revision: https://reviews.facebook.net/D12273
Summary: Similar to v2 (db and table code understands prefixes), but use ReadOptions as in v3. Also, make the CreateFilter code faster and cleaner.
Test Plan: make db_test; export LEVELDB_TESTS=PrefixScan; ./db_test
Reviewers: dhruba
Reviewed By: dhruba
CC: haobo, emayanke
Differential Revision: https://reviews.facebook.net/D12027
Summary:
Here are the major changes to the Merge Interface. It has been expanded
to handle cases where the MergeOperator is not associative. It does so by stacking
up merge operations while scanning through the key history (i.e.: during Get() or
Compaction), until a valid Put/Delete/end-of-history is encountered; it then
applies all of the merge operations in the correct sequence starting with the
base/sentinel value.
I have also introduced an "AssociativeMerge" function which allows the user to
take advantage of associative merge operations (such as in the case of counters).
The implementation will always attempt to merge the operations/operands themselves
together when they are encountered, and will resort to the "stacking" method if
and only if the "associative-merge" fails.
This implementation is conjectured to allow MergeOperator to handle the general
case, while still providing the user with the ability to take advantage of certain
efficiencies in their own merge-operator / data-structure.
NOTE: This is a preliminary diff. This must still go through a lot of review,
revision, and testing. Feedback welcome!
Test Plan:
-This is a preliminary diff. I have only just begun testing/debugging it.
-I will be testing this with the existing MergeOperator use-cases and unit-tests
(counters, string-append, and redis-lists)
-I will be "desk-checking" and walking through the code with the help gdb.
-I will find a way of stress-testing the new interface / implementation using
db_bench, db_test, merge_test, and/or db_stress.
-I will ensure that my tests cover all cases: Get-Memtable,
Get-Immutable-Memtable, Get-from-Disk, Iterator-Range-Scan, Flush-Memtable-to-L0,
Compaction-L0-L1, Compaction-Ln-L(n+1), Put/Delete found, Put/Delete not-found,
end-of-history, end-of-file, etc.
-A lot of feedback from the reviewers.
Reviewers: haobo, dhruba, zshao, emayanke
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11499
Summary: Removed KeyMayExistImpl because KeyMayExist demanded Get like semantics now. Removed no_io from memtable and imm because we need the proper value now and shouldn't just stop when we see Merge in memtable. Added checks to block_cache. Updated documentation and unit-test
Test Plan: make all check;db_stress for 1 hour
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11853
Summary:
Introduced KeyMayExist checking during writebatch-delete and removed from Outer Delete API because it uses writebatch-delete.
Added code to skip getting Table from disk if not already present in table_cache.
Some renaming of variables.
Introduced KeyMayExistImpl which allows checking since specified sequence number in GetImpl useful to check partially written writebatch.
Changed KeyMayExist to not be pure virtual and provided a default implementation.
Expanded unit-tests in db_test to check appropriately.
Ran db_stress for 1 hour with ./db_stress --max_key=100000 --ops_per_thread=10000000 --delpercent=50 --filter_deletes=1 --statistics=1.
Test Plan: db_stress;make check
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb, xjin
Differential Revision: https://reviews.facebook.net/D11745
Summary:
Wrote a new function in db_impl.c-CheckKeyMayExist that calls Get but with a new parameter turned on which makes Get return false only if bloom filters can guarantee that key is not in database. Delete calls this function and if the option- deletes_use_filter is turned on and CheckKeyMayExist returns false, the delete will be dropped saving:
1. Put of delete type
2. Space in the db,and
3. Compaction time
Test Plan:
make all check;
will run db_stress and db_bench and enhance unit-test once the basic design gets approved
Reviewers: dhruba, haobo, vamsi
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D11607
Summary: Table is setup for compaction using Table::SetupForCompaction. So read block calls can be differentiated b/w Gets/Compaction. Use this and measure times.
Test Plan: db_bench --statistics=1
Reviewers: dhruba, haobo
Reviewed By: haobo
CC: leveldb, MarkCallaghan
Differential Revision: https://reviews.facebook.net/D11217
Summary:
Current posix advice implementation ties up the access pattern hint with the creation of a file.
It is not possible to apply different advice for different access (random get vs compaction read),
without keeping two open files for the same table. This patch extended the RandomeAccessFile interface
to accept new access hint at anytime. Particularly, we are able to set different access hint on the same
table file based on when/how the file is used.
Two options are added to set the access hint, after the file is first opened and after the file is being
compacted.
Test Plan: make check; db_stress; db_bench
Reviewers: dhruba
Reviewed By: dhruba
CC: MarkCallaghan, leveldb
Differential Revision: https://reviews.facebook.net/D10905