rocksdb

Commit Graph

Author	SHA1	Message	Date
Dhruba Borthakur	fc0c399d2e	Introduced a new flag non_blocking_io in ReadOptions. Summary: If ReadOptions.non_blocking_io is set to true, then KeyMayExists and Iterators will return data that is cached in RAM. If the Iterator needs to do IO from storage to serve the data, then the Iterator.status() will return Status::IsRetry(). Test Plan: Enhanced unit test DBTest.KeyMayExist to detect if there were are IOs issues from storage. Added DBTest.NonBlockingIteration to verify nonblocking Iterations. Reviewers: emayanke, haobo Reviewed By: haobo CC: leveldb Maniphest Tasks: T63 Differential Revision: https://reviews.facebook.net/D12531	12 years ago
Jim Paton	5c3b254ef2	Fix memory leak Summary: There is a memory leak because TransformRepFactory does not delete its SliceTransform pointer. This patch adds a delete to the destructor. Test Plan: make check make valgrind_check Reviewers: dhruba, emayanke, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12513	12 years ago
Tyler Harter	4504c99030	Internal/user key bug fix. Summary: Fix code so that the filter_block layer only assumes keys are internal when prefix_extractor is set. Test Plan: ./filter_block_test Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12501	12 years ago
Dhruba Borthakur	1186192ed1	Replace include/leveldb with include/rocksdb. Summary: Replace include/leveldb with include/rocksdb. Test Plan: make clean; make check make clean; make release Differential Revision: https://reviews.facebook.net/D12489	12 years ago
Jim Paton	74781a0c49	Add three new MemTableRep's Summary: This patch adds three new MemTableRep's: UnsortedRep, PrefixHashRep, and VectorRep. UnsortedRep stores keys in an std::unordered_map of std::sets. When an iterator is requested, it dumps the keys into an std::set and iterates over that. VectorRep stores keys in an std::vector. When an iterator is requested, it creates a copy of the vector and sorts it using std::sort. The iterator accesses that new vector. PrefixHashRep stores keys in an unordered_map mapping prefixes to ordered sets. I also added one API change. I added a function MemTableRep::MarkImmutable. This function is called when the rep is added to the immutable list. It doesn't do anything yet, but it seems like that could be useful. In particular, for the vectorrep, it means we could elide the extra copy and just sort in place. The only reason I haven't done that yet is because the use of the ArenaAllocator complicates things (I can elaborate on this if needed). Test Plan: make -j32 check ./db_stress --memtablerep=vector ./db_stress --memtablerep=unsorted ./db_stress --memtablerep=prefixhash --prefix_size=10 Reviewers: dhruba, haobo, emayanke Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12117	12 years ago
Tyler Harter	94cf218720	Revert "Prefix scan: db_bench and bug fixes" This reverts commit `c2bd8f4824`.	12 years ago
Tyler Harter	c2bd8f4824	Prefix scan: db_bench and bug fixes Summary: If use_prefix_filters is set and read_range>1, then the random seeks will set a the prefix filter to be the prefix of the key which was randomly selected as the target. Still need to add statistics (perhaps in a separate diff). Test Plan: ./db_bench --benchmarks=fillseq,prefixscanrandom --num=10000000 --statistics=1 --use_prefix_blooms=1 --use_prefix_api=1 --bloom_bits=10 Reviewers: dhruba Reviewed By: dhruba CC: leveldb, haobo Differential Revision: https://reviews.facebook.net/D12273	12 years ago
Simha Venkataramaiah	60bf2b7d4a	Add APIs to query SST file metadata and to delete specific SST files Summary: An api to query the level, key ranges, size etc for each SST file and an api to delete a specific file from the db and all associated state in the bookkeeping datastructures. Notes: Editing the manifest version does not release the obsolete files right away. However deleting the file directly will mess up the iterator. We may need a more aggressive/timely file deletion api. I have used std::unique_ptr - will switch to boost:: since this is external. thoughts? Unit test is fragile right now as it expects the compaction at certain levels. Test Plan: unittest Reviewers: dhruba, vamsi, emayanke CC: zshao, leveldb, haobo Task ID: # Blame Rev:	12 years ago
Jim Paton	cb703c9d03	Allow WriteBatch::Handler to abort iteration Summary: Sometimes you don't need to iterate through the whole WriteBatch. This diff makes the Handler member functions return a bool that indicates whether to abort or not. If they return true, the iteration stops. One thing I just thought of is that this will break backwards-compability. Maybe it would be better to add a virtual member function WriteBatch::Handler::ShouldAbort() that returns false by default. Comments requested. I still have to add a new unit test for the abort code, but let's finalize the API first. Test Plan: make -j32 check Reviewers: dhruba, haobo, vamsi, emayanke Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12339	12 years ago
Mayank Agarwal	404d63ac3e	Add TODO for DBToStackableDB function which doesn't work yet Summary: Test Plan: Reviewers: CC: Task ID: # Blame Rev:	12 years ago
Deon Nicholas	b87dcae1a3	Made merge_oprator a shared_ptr; and added TTL unit tests Test Plan: - make all check; - make release; - make stringappend_test; ./stringappend_test Reviewers: haobo, emayanke Reviewed By: haobo CC: leveldb, kailiu Differential Revision: https://reviews.facebook.net/D12381	12 years ago
Mayank Agarwal	3ab2792f93	Add default info to comment in leveldb/options.h for no_block_cache Summary: to ley clients know Test Plan: visual	12 years ago
Mayank Agarwal	28e6fe5e9f	Correct documentation for no_block_cache in leveldb/options.h Summary: false shoudl have been true in comment Test Plan: visual	12 years ago
Mayank Agarwal	8a3547d38e	API for getting archived log files Summary: Also expanded class LogFile to have startSequene and FileSize and exposed it publicly Test Plan: make all check Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12087	12 years ago
Deon Nicholas	e1346968d8	Merge operator fixes part 1. Summary: -Added null checks and revisions to DBIter::MergeValuesNewToOld() -Added DBIter test to stringappend_test -Major fix with Merge and TTL More plans for fixes later. Test Plan: -make clean; make stringappend_test -j 32; ./stringappend_test -make all check; Reviewers: haobo, emayanke, vamsi, dhruba Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D12315	12 years ago
Mayank Agarwal	387ac0f1e1	Expose statistic for sequence number and implement setTickerCount Summary: statistic for sequence number is needed by wormhole. setTickerCount is demanded for this statistic. I can't simply recordTick(max_sequence) when db recovers because the statistic iobject is owned by client and may/may not be reset during reopen. Eg. statistic is reset in mcrocksdb whereas it is not in db_stress. Therefore it is best to go with setTickerCount Test Plan: ./db_stress ... --statistics=1 and observed expected sequence number Reviewers: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12327	12 years ago
Jim Paton	0307c5fe3a	Implement log blobs Summary: This patch adds the ability for the user to add sequences of arbitrary data (blobs) to write batches. These blobs are saved to the log along with everything else in the write batch. You can add multiple blobs per WriteBatch and the ordering of blobs, puts, merges, and deletes are preserved. Blobs are not saves to SST files. RocksDB ignores blobs in every way except for writing them to the log. Before committing this patch, I need to add some test code. But I'm submitting it now so people can comment on the API. Test Plan: make -j32 check Reviewers: dhruba, haobo, vamsi Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12195	12 years ago
Haobo Xu	d9dd2a1926	[RocksDB] Expose thread local perf counter for low overhead, per call level performance statistics. Summary: As title. No locking/atomic is needed due to thread local. There is also no need to modify the existing client interface, in order to expose related counters. perf_context_test shows a simple example of retrieving the number of user key comparison done for each put and get call. More counters could be added later. Sample output ./perf_context_test 1000000 ==== Test PerfContextTest.KeyComparisonCount Inserting 1000000 key/value pairs ... total user key comparison get: 43446523 total user key comparison put: 8017877 max user key comparison get: 88939 avg user key comparison get:43 Basically, the current skiplist does well on average, but could perform poorly in extreme cases. Test Plan: run perf_context_test <total number of entries to put/get> Reviewers: dhruba Differential Revision: https://reviews.facebook.net/D12225	12 years ago
Mayank Agarwal	f1bf169484	Counter for merge failure Summary: With Merge returning bool, it can keep failing silently(eg. While faling to fetch timestamp in TTL). We need to detect this through a rocksdb counter which can get bumped whenever Merge returns false. This will also be super-useful for the mcrocksdb-counter service where Merge may fail. Added a counter NUMBER_MERGE_FAILURES and appropriately updated db/merge_helper.cc I felt that it would be better to directly add counter-bumping in Merge as a default function of MergeOperator class but user should not be aware of this, so this approach seems better to me. Test Plan: make all check Reviewers: dnicholas, haobo, dhruba, vamsi CC: leveldb Differential Revision: https://reviews.facebook.net/D12129	12 years ago
Tyler Harter	f5f1842282	Prefix filters for scans (v4) Summary: Similar to v2 (db and table code understands prefixes), but use ReadOptions as in v3. Also, make the CreateFilter code faster and cleaner. Test Plan: make db_test; export LEVELDB_TESTS=PrefixScan; ./db_test Reviewers: dhruba Reviewed By: dhruba CC: haobo, emayanke Differential Revision: https://reviews.facebook.net/D12027	12 years ago
sumeet	3b81df34bd	Separate compaction filter for each compaction Summary: If we have same compaction filter for each compaction, application cannot know about the different compaction processes. Later on, we can put in more details in compaction filter for the application to consume and use it according to its needs. For e.g. In the universal compaction, we have a compaction process involving all the files while others don't involve all the files. Applications may want to collect some stats only when during full compaction. Test Plan: run existing unit tests Reviewers: haobo, dhruba Reviewed By: dhruba CC: xinyaohu, leveldb Differential Revision: https://reviews.facebook.net/D12057	12 years ago
Deon Nicholas	c2d7826ced	[RocksDB] [MergeOperator] The new Merge Interface! Uses merge sequences. Summary: Here are the major changes to the Merge Interface. It has been expanded to handle cases where the MergeOperator is not associative. It does so by stacking up merge operations while scanning through the key history (i.e.: during Get() or Compaction), until a valid Put/Delete/end-of-history is encountered; it then applies all of the merge operations in the correct sequence starting with the base/sentinel value. I have also introduced an "AssociativeMerge" function which allows the user to take advantage of associative merge operations (such as in the case of counters). The implementation will always attempt to merge the operations/operands themselves together when they are encountered, and will resort to the "stacking" method if and only if the "associative-merge" fails. This implementation is conjectured to allow MergeOperator to handle the general case, while still providing the user with the ability to take advantage of certain efficiencies in their own merge-operator / data-structure. NOTE: This is a preliminary diff. This must still go through a lot of review, revision, and testing. Feedback welcome! Test Plan: -This is a preliminary diff. I have only just begun testing/debugging it. -I will be testing this with the existing MergeOperator use-cases and unit-tests (counters, string-append, and redis-lists) -I will be "desk-checking" and walking through the code with the help gdb. -I will find a way of stress-testing the new interface / implementation using db_bench, db_test, merge_test, and/or db_stress. -I will ensure that my tests cover all cases: Get-Memtable, Get-Immutable-Memtable, Get-from-Disk, Iterator-Range-Scan, Flush-Memtable-to-L0, Compaction-L0-L1, Compaction-Ln-L(n+1), Put/Delete found, Put/Delete not-found, end-of-history, end-of-file, etc. -A lot of feedback from the reviewers. Reviewers: haobo, dhruba, zshao, emayanke Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D11499	12 years ago
Jim Paton	8e792e5896	Add soft_rate_limit stats Summary: This diff adds histogram stats for soft_rate_limit stalls. It also renames the old rate_limit stats to hard_rate_limit. Test Plan: make -j32 check Reviewers: dhruba, haobo, MarkCallaghan Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12021	12 years ago
Mayank Agarwal	1d7b4765c3	Expose base db object from ttl wrapper Summary: rocksdb replicaiton will need this when writing value+TS from master to slave 'as is' Test Plan: make Reviewers: dhruba, vamsi, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11919	12 years ago
Jim Paton	1036537c94	Add soft and hard rate limit support Summary: This diff adds support for both soft and hard rate limiting. The following changes are included: 1) Options.rate_limit is renamed to Options.hard_rate_limit. 2) Options.rate_limit_delay_milliseconds is renamed to Options.rate_limit_delay_max_milliseconds. 3) Options.soft_rate_limit is added. 4) If the maximum compaction score is > hard_rate_limit and rate_limit_delay_max_milliseconds == 0, then writes are delayed by 1 ms at a time until the max compaction score falls below hard_rate_limit. 5) If the max compaction score is > soft_rate_limit but <= hard_rate_limit, then writes are delayed by 0-1 ms depending on how close we are to hard_rate_limit. 6) Users can disable 4 by setting hard_rate_limit = 0. They can add a limit to the maximum amount of time waited by setting rate_limit_delay_max_milliseconds > 0. Thus, the old behavior can be preserved by setting soft_rate_limit = 0, which is the default. Test Plan: make -j32 check ./db_stress Reviewers: dhruba, haobo, MarkCallaghan Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D12003	12 years ago
Mayank Agarwal	cacd812fb2	Support user's compaction filter in TTL logic Summary: TTL uses compaction filter to purge key-values and required the user to not pass one. This diff makes it accommodating of user's compaciton filter. Added test to ttl_test Test Plan: make; ./ttl_test Reviewers: dhruba, haobo, vamsi Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11973	12 years ago
Mayank Agarwal	59d0b02f8b	Expand KeyMayExist to return the proper value if it can be found in memory and also check block_cache Summary: Removed KeyMayExistImpl because KeyMayExist demanded Get like semantics now. Removed no_io from memtable and imm because we need the proper value now and shouldn't just stop when we see Merge in memtable. Added checks to block_cache. Updated documentation and unit-test Test Plan: make all check;db_stress for 1 hour Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11853	12 years ago
Jim Paton	9700677a2b	Slow down writes gradually rather than suddenly Summary: Currently, when a certain number of level0 files (level0_slowdown_writes_trigger) are present, RocksDB will slow down each write by 1ms. There is a second limit of level0 files at which RocksDB will stop writes altogether (level0_stop_writes_trigger). This patch enables the user to supply a third parameter specifying the number of files at which Rocks will start slowing down writes (level0_start_slowdown_writes). When this number is reached, Rocks will slow down writes as a quadratic function of level0_slowdown_writes_trigger - num_level0_files. For some workloads, this improves latency and throughput. I will post some stats momentarily in https://our.intern.facebook.com/intern/tasks/?t=2613384. Test Plan: make -j32 check ./db_stress ./db_bench Reviewers: dhruba, haobo, MarkCallaghan, xjin Reviewed By: xjin CC: leveldb, xjin, zshao Differential Revision: https://reviews.facebook.net/D11859	12 years ago
Xing Jin	0f0a24e298	Make arena block size configurable Summary: Add an option for arena block size, default value 4096 bytes. Arena will allocate blocks with such size. I am not sure about passing parameter to skiplist in the new virtualized framework, though I talked to Jim a bit. So add Jim as reviewer. Test Plan: new unit test, I am running db_test. For passing paramter from configured option to Arena, I tried tests like: TEST(DBTest, Arena_Option) { std::string dbname = test::TmpDir() + "/db_arena_option_test"; DestroyDB(dbname, Options()); DB* db = nullptr; Options opts; opts.create_if_missing = true; opts.arena_block_size = 1000000; // tested 99, 999999 Status s = DB::Open(opts, dbname, &db); db->Put(WriteOptions(), "a", "123"); } and printed some debug info. The results look good. Any suggestion for such a unit-test? Reviewers: haobo, dhruba, emayanke, jpaton Reviewed By: dhruba CC: leveldb, zshao Differential Revision: https://reviews.facebook.net/D11799	12 years ago
Jim Paton	18afff2e63	Add stall counts to statistics Summary: Previously, statistics are kept on how much time is spent on stalls of different types. This patch adds support for keeping number of stalls of each type. For example, instead of just reporting how many microseconds are spent waiting for memtables to be compacted, it will also report how many times a write stalled for that to occur. Test Plan: make -j32 check ./db_stress # Not really sure what else should be done... Reviewers: dhruba, MarkCallaghan, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11841	12 years ago
Jim Paton	52d7ecfc78	Virtualize SkipList Interface Summary: This diff virtualizes the skiplist interface so that users can provide their own implementation of a backing store for MemTables. Eventually, the backing store will be responsible for its own synchronization, allowing users (and us) to experiment with different lockless implementations. Test Plan: make clean make -j32 check ./db_stress Reviewers: dhruba, emayanke, haobo Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11739	12 years ago
Mayank Agarwal	bf66c10b13	Use KeyMayExist for WriteBatch-Deletes Summary: Introduced KeyMayExist checking during writebatch-delete and removed from Outer Delete API because it uses writebatch-delete. Added code to skip getting Table from disk if not already present in table_cache. Some renaming of variables. Introduced KeyMayExistImpl which allows checking since specified sequence number in GetImpl useful to check partially written writebatch. Changed KeyMayExist to not be pure virtual and provided a default implementation. Expanded unit-tests in db_test to check appropriately. Ran db_stress for 1 hour with ./db_stress --max_key=100000 --ops_per_thread=10000000 --delpercent=50 --filter_deletes=1 --statistics=1. Test Plan: db_stress;make check Reviewers: dhruba, haobo Reviewed By: dhruba CC: leveldb, xjin Differential Revision: https://reviews.facebook.net/D11745	12 years ago
Haobo Xu	9ee68871dc	[RocksDB] Enable manual compaction to move files back to an appropriate level. Summary: As title. This diff added an option reduce_level to CompactRange. When set to true, it will try to move the files back to the minimum level sufficient to hold the data set. Note that the default is set to true now, just to excerise it in all existing tests. Will set the default to false before check-in, for backward compatibility. Test Plan: make check; Reviewers: dhruba, emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D11553	12 years ago
Mayank Agarwal	2a986919d6	Make rocksdb-deletes faster using bloom filter Summary: Wrote a new function in db_impl.c-CheckKeyMayExist that calls Get but with a new parameter turned on which makes Get return false only if bloom filters can guarantee that key is not in database. Delete calls this function and if the option- deletes_use_filter is turned on and CheckKeyMayExist returns false, the delete will be dropped saving: 1. Put of delete type 2. Space in the db,and 3. Compaction time Test Plan: make all check; will run db_stress and db_bench and enhance unit-test once the basic design gets approved Reviewers: dhruba, haobo, vamsi Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D11607	12 years ago
Dhruba Borthakur	7cb8d462d5	Rename PickCompactionHybrid to PickCompactionUniversal. Summary: Rename PickCompactionHybrid to PickCompactionUniversal. Changed a few LOG message from "Hybrid:" to "Universal:". Test Plan: Reviewers: CC: Task ID: # Blame Rev:	12 years ago
Haobo Xu	a8d5f8dde2	[RocksDB] Remove old readahead options Summary: As title. Test Plan: make check; db_bench Reviewers: dhruba, MarkCallaghan CC: leveldb Differential Revision: https://reviews.facebook.net/D11643	12 years ago
Dhruba Borthakur	116ec527f2	Renamed 'hybrid_compaction' tp be "Universal Compaction'. Summary: All the universal compaction parameters are encapsulated in a new file universal_compaction.h Test Plan: make check	12 years ago
Mayank Agarwal	d56523c49c	Update rocksdb version Summary: rocksdb-2.0 released to third party Test Plan: visual inspection Reviewers: dhruba, haobo, sheki Reviewed By: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11559	12 years ago
Dhruba Borthakur	47c4191fe8	Reduce write amplification by merging files in L0 back into L0 Summary: There is a new option called hybrid_mode which, when switched on, causes HBase style compactions. Files from L0 are compacted back into L0. This meat of this compaction algorithm is in PickCompactionHybrid(). All files reside in L0. That means all files have overlapping keys. Each file has a time-bound, i.e. each file contains a range of keys that were inserted around the same time. The start-seqno and the end-seqno refers to the timeframe when these keys were inserted. Files that have contiguous seqno are compacted together into a larger file. All files are ordered from most recent to the oldest. The current compaction algorithm starts to look for candidate files starting from the most recent file. It continues to add more files to the same compaction run as long as the sum of the files chosen till now is smaller than the next candidate file size. This logic needs to be debated and validated. The above logic should reduce write amplification to a large extent... will publish numbers shortly. Test Plan: dbstress runs for 6 hours with no data corruption (tested so far). Differential Revision: https://reviews.facebook.net/D11289	12 years ago
Dhruba Borthakur	554c06dd18	Reduce write amplification by merging files in L0 back into L0 Summary: There is a new option called hybrid_mode which, when switched on, causes HBase style compactions. Files from L0 are compacted back into L0. This meat of this compaction algorithm is in PickCompactionHybrid(). All files reside in L0. That means all files have overlapping keys. Each file has a time-bound, i.e. each file contains a range of keys that were inserted around the same time. The start-seqno and the end-seqno refers to the timeframe when these keys were inserted. Files that have contiguous seqno are compacted together into a larger file. All files are ordered from most recent to the oldest. The current compaction algorithm starts to look for candidate files starting from the most recent file. It continues to add more files to the same compaction run as long as the sum of the files chosen till now is smaller than the next candidate file size. This logic needs to be debated and validated. The above logic should reduce write amplification to a large extent... will publish numbers shortly. Test Plan: dbstress runs for 6 hours with no data corruption (tested so far). Differential Revision: https://reviews.facebook.net/D11289	12 years ago
Haobo Xu	71e0f695c1	[RocksDB] Expose count for WriteBatch Summary: As title. Exposed a Count function that returns the number of updates in a batch. Could be handy for replication sequence number check. Test Plan: make check; Reviewers: emayanke, sheki, dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11523	12 years ago
Haobo Xu	4deaa0d48b	[RocksDB] Minor change to statistics.h Summary: as title, use initialize list so that lines fit in 80 chars. Test Plan: make check; Reviewers: sheki, dhruba Differential Revision: https://reviews.facebook.net/D11385	12 years ago
Haobo Xu	96be2c4ee0	[RocksDB] Add mmap_read option for db_stress Summary: as title, also removed an incorrect assertion Test Plan: make check; db_stress --mmap_read=1; db_stress --mmap_read=0 Reviewers: dhruba, emayanke CC: leveldb Differential Revision: https://reviews.facebook.net/D11367	12 years ago
Abhishek Kona	5ef6bb8c37	[rocksdb][refactor] statistic printing code to one place Summary: $title Test Plan: db_bench --statistics=1 Reviewers: haobo Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D11373	12 years ago
Haobo Xu	3cc1af2062	[RocksDB] Option for incremental sync Summary: This diff added an option to control the incremenal sync frequency. db_bench has a new flag bytes_per_sync for easy tuning exercise. Test Plan: make check; db_bench Reviewers: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11295	12 years ago
Dhruba Borthakur	6acbe0fc45	Compact multiple memtables before flushing to storage. Summary: Merge multiple multiple memtables in memory before writing it out to a file in L0. There is a new config parameter min_write_buffer_number_to_merge that specifies the number of write buffers that should be merged together to a single file in storage. The system will not flush wrte buffers to storage unless at least these many buffers have accumulated in memory. The default value of this new parameter is 1, which means that a write buffer will be immediately flushed to disk as soon it is ready. Test Plan: make check Differential Revision: https://reviews.facebook.net/D11241	12 years ago
Abhishek Kona	f561b3a324	[Rocksdb] Rename one stat key from leveldb to rocksdb	12 years ago
Abhishek Kona	39ee47fbf4	[Rocksdb] Record WriteBlock Times into a histogram Summary: Add a histogram to track WriteBlock times Test Plan: db_bench and print Reviewers: haobo, dhruba Reviewed By: haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D11319	12 years ago
Abhishek Kona	7a5f71d19a	[Rocksdb] measure table open io in a histogram Summary: Table is setup for compaction using Table::SetupForCompaction. So read block calls can be differentiated b/w Gets/Compaction. Use this and measure times. Test Plan: db_bench --statistics=1 Reviewers: dhruba, haobo Reviewed By: haobo CC: leveldb, MarkCallaghan Differential Revision: https://reviews.facebook.net/D11217	12 years ago
Haobo Xu	778e179046	[RocksDB] Sync file to disk incrementally Summary: During compaction, we sync the output files after they are fully written out. This causes unnecessary blocking of the compaction thread and burstiness of the write traffic. This diff simply asks the OS to sync data incrementally as they are written, on the background. The hope is that, at the final sync, most of the data are already on disk and we would block less on the sync call. Thus, each compaction runs faster and we could use fewer number of compaction threads to saturate IO. In addition, the write traffic will be smoothed out, hopefully reducing the IO P99 latency too. Some quick tests show 10~20% improvement in per thread compaction throughput. Combined with posix advice on compaction read, just 5 threads are enough to almost saturate the udb flash bandwidth for 800 bytes write only benchmark. What's more promising is that, with saturated IO, iostat shows average wait time is actually smoother and much smaller. For the write only test 800bytes test: Before the change: await occillate between 10ms and 3ms After the change: await ranges 1-3ms Will test against read-modify-write workload too, see if high read latency P99 could be resolved. Will introduce a parameter to control the sync interval in a follow up diff after cleaning up EnvOptions. Test Plan: make check; db_bench; db_stress Reviewers: dhruba CC: leveldb Differential Revision: https://reviews.facebook.net/D11115	12 years ago

... 6 7 8 9 10 ...

572 Commits (a40ce219b9d3009331f4984529d92b499b03d454)