rocksdb

Commit Graph

Author	SHA1	Message	Date
Mike Kolupaev	d18dd2c41f	Abort compactions more reliably when closing DB Summary: DB shutdown aborts running compactions by setting an atomic shutting_down=true that CompactionJob periodically checks. Without this PR it checks it before processing every _output_ value. If compaction filter filters everything out, the compaction is uninterruptible. This PR adds checks for shutting_down on every _input_ value (in CompactionIterator and MergeHelper). There's also some minor code cleanup along the way. Closes https://github.com/facebook/rocksdb/pull/1639 Differential Revision: D4306571 Pulled By: yiwu-arbug fbshipit-source-id: f050890	9 years ago
Andrew Kryczka	fe395fb63d	Allow incrementing refcount on cache handles Summary: Previously the only way to increment a handle's refcount was to invoke Lookup(), which (1) did hash table lookup to get cache handle, (2) incremented that handle's refcount. For a future DeleteRange optimization, I added a function, Ref(), for when the caller already has a cache handle and only needs to do (2). Closes https://github.com/facebook/rocksdb/pull/1761 Differential Revision: D4397114 Pulled By: ajkr fbshipit-source-id: 9addbe5	9 years ago
Maysam Yabandeh	d0ba8ec8f9	Revert "PinnableSlice" Summary: This reverts commit `54d94e9c2c`. The pull request was landed by mistake. Closes https://github.com/facebook/rocksdb/pull/1755 Differential Revision: D4391678 Pulled By: maysamyabandeh fbshipit-source-id: 36d5149	9 years ago
Maysam Yabandeh	54d94e9c2c	PinnableSlice Summary: Currently the point lookup values are copied to a string provided by the user. This incures an extra memcpy cost. This patch allows doing point lookup via a PinnableSlice which pins the source memory location (instead of copying their content) and releases them after the content is consumed by the user. The old API of Get(string) is translated to the new API underneath. Here is the summary for improvements: 1. value 100 byte: 1.8% regular, 1.2% merge values 2. value 1k byte: 11.5% regular, 7.5% merge values 3. value 10k byte: 26% regular, 29.9% merge values The improvement for merge could be more if we extend this approach to pin the merge output and delay the full merge operation until the user actually needs it. We have put that for future work. PS: Sometimes we observe a small decrease in performance when switching from t5452014 to this patch but with the old Get(string) API. The difference is a little and could be noise. More importantly it is safely cancelled Closes https://github.com/facebook/rocksdb/pull/1732 Differential Revision: D4374613 Pulled By: maysamyabandeh fbshipit-source-id: a077f1a	9 years ago
Islam AbdelRahman	ac73d7558b	Add GetSupportedCompressions() convenience function Summary: This function will return a list of supported compression types in RocksDB This is needed for MyRocks https://github.com/facebook/mysql-5.6/pull/446 Closes https://github.com/facebook/rocksdb/pull/1747 Differential Revision: D4385921 Pulled By: IslamAbdelRahman fbshipit-source-id: 2f5b59f	9 years ago
siddontang	653ac1f9c6	C API: support total_order_mode Summary: Closes https://github.com/facebook/rocksdb/pull/1687 Differential Revision: D4349210 Pulled By: IslamAbdelRahman fbshipit-source-id: 32d0fbd	9 years ago
Adam Retter	85ac1a320a	Fix rocksdb::Status::getState Summary: This fixes the Java API for Status#getState use in Native code and also simplifies the implementation of rocksdb::Status::getState. Closes https://github.com/facebook/rocksdb/issues/1688 Closes https://github.com/facebook/rocksdb/pull/1714 Differential Revision: D4364181 Pulled By: yiwu-arbug fbshipit-source-id: 8e073b4	9 years ago
Maysam Yabandeh	0712d541d1	Delegate Cleanables Summary: Cleanable objects will perform the registered cleanups when they are destructed. We however rather to delay this cleaning like when we are gathering the merge operands. Current approach is to create the Cleanable object on heap (instead of on stack) and delay deleting it. By allowing Cleanables to delegate their cleanups to another cleanable object we can delay the cleaning without however the need to craete the cleanable object on heap and keeping it around. This patch applies this technique for the cleanups of BlockIter and shows improved performance for some in-memory benchmarks: +1.8% for merge worklaod, +6.4% for non-merge workload when the merge operator is specified. https://our.intern.facebook.com/intern/tasks?t=15168163 Non-merge benchmark: TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench --benchmarks=fillrandom --num=1000000 -value_size=100 -compression_type=none Reading random with no merge operator specified: TEST_TMPDIR=/dev/shm/v100nocomp/ ./db_bench --benchmarks="read Closes https://github.com/facebook/rocksdb/pull/1711 Differential Revision: D4361163 Pulled By: maysamyabandeh fbshipit-source-id: 9801e07	9 years ago
Islam AbdelRahman	d58ef52ba6	Allow SstFileWriter to Fadvise the file away from page cache Summary: Add `fadvise_trigger` option to `SstFileWriter` If fadvise_trigger is passed with a non-zero value, SstFileWriter will invalidate the os page cache every `fadvise_trigger` bytes for the sst file Closes https://github.com/facebook/rocksdb/pull/1731 Differential Revision: D4371246 Pulled By: IslamAbdelRahman fbshipit-source-id: 91caff1	9 years ago
Siying Dong	438f22bc56	Fix bug of Checkpoint loses recent transactions with 2PC Summary: If 2PC is enabled, checkpoint may not copy previous log files that contain uncommitted prepare records. In this diff we keep those files. Closes https://github.com/facebook/rocksdb/pull/1724 Differential Revision: D4368319 Pulled By: siying fbshipit-source-id: cc2c746	9 years ago
Sage Weil	4e07b08eff	include/rocksdb/utilities/env_librados: fix typo Summary: Broken by `972f96b3fb` Signed-off-by: Sage Weil <sage@redhat.com> Closes https://github.com/facebook/rocksdb/pull/1719 Differential Revision: D4366123 Pulled By: IslamAbdelRahman fbshipit-source-id: a11e535	9 years ago
Yi Wu	ab48c165a9	Print cache options to info log Summary: Improve cache options logging to info log. Also print the value of cache_index_and_filter_blocks_with_high_priority. Closes https://github.com/facebook/rocksdb/pull/1709 Differential Revision: D4358776 Pulled By: yiwu-arbug fbshipit-source-id: 8f030a0	9 years ago
Aaron Gao	972f96b3fb	direct io write support Summary: rocksdb direct io support ``` [gzh@dev11575.prn2 ~/rocksdb] ./db_bench -benchmarks=fillseq --num=1000000 Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags RocksDB: version 5.0 Date: Wed Nov 23 13:17:43 2016 CPU: 40 * Intel(R) Xeon(R) CPU E5-2660 v2 @ 2.20GHz CPUCache: 25600 KB Keys: 16 bytes each Values: 100 bytes each (50 bytes after compression) Entries: 1000000 Prefix: 0 bytes Keys per prefix: 0 RawSize: 110.6 MB (estimated) FileSize: 62.9 MB (estimated) Write rate: 0 bytes/second Compression: Snappy Memtablerep: skip_list Perf Level: 1 WARNING: Assertions are enabled; benchmarks unnecessarily slow ------------------------------------------------ Initializing RocksDB Options from the specified file Initializing RocksDB Options from command-line flags DB path: [/tmp/rocksdbtest-112628/dbbench] fillseq : 4.393 micros/op 227639 ops/sec; 25.2 MB/s [gzh@dev11575.prn2 ~/roc Closes https://github.com/facebook/rocksdb/pull/1564 Differential Revision: D4241093 Pulled By: lightmark fbshipit-source-id: 98c29e3	9 years ago
Yi Wu	5d1457dbbf	Dump persistent cache options Summary: Dump persistent cache options Closes https://github.com/facebook/rocksdb/pull/1679 Differential Revision: D4337019 Pulled By: yiwu-arbug fbshipit-source-id: 3812f8a	9 years ago
Ding Ma	1a136c1f13	Expose file size Summary: add a new function to SstFileWriter that will tell the user how big is there file right now. Closes https://github.com/facebook/rocksdb/pull/1686 Differential Revision: D4338868 Pulled By: mdyuki1016 fbshipit-source-id: c1ee16a	9 years ago
Daniel Black	816c1e30ca	gcc-7 requires include <functional> for std::function Summary: Fixes compile error: In file included from ./util/statistics.h:17:0, from ./util/stop_watch.h:8, from ./util/perf_step_timer.h:9, from ./util/iostats_context_imp.h:8, from ./util/posix_logger.h:27, from ./port/util_logger.h:18, from ./db/auto_roll_logger.h:15, from db/auto_roll_logger.cc:6: ./util/thread_local.h:65:16: error: 'function' in namespace 'std' does not name a template type typedef std::function<void(void, void)> FoldFunc; Closes https://github.com/facebook/rocksdb/pull/1656 Differential Revision: D4318702 Pulled By: yiwu-arbug fbshipit-source-id: 8c5d17a	9 years ago
siddontang	8f5d24ae68	C API: support get usage and pinned_usage for cache Summary: Closes https://github.com/facebook/rocksdb/pull/1671 Differential Revision: D4327453 Pulled By: yiwu-arbug fbshipit-source-id: bcdbc65	9 years ago
Daniel Black	bfbcec2339	Gcc 7 error expansion to defined Summary: sorry if these gcc-7/clang-4 cleanups are getting tedious. Closes https://github.com/facebook/rocksdb/pull/1658 Differential Revision: D4318792 Pulled By: yiwu-arbug fbshipit-source-id: 8e85891	9 years ago
siddontang	b57dd9262a	C API: support writebatch delete range Summary: Seem that writebatch delete range can work now, so I add C API for later use. Btw, can we use this feature in production now? Closes https://github.com/facebook/rocksdb/pull/1647 Differential Revision: D4314534 Pulled By: ajkr fbshipit-source-id: e835165	9 years ago
Andrew Kryczka	f0c509e2c8	Return finer-granularity status from Env::GetChildren* Summary: It'd be nice to use the error status type to distinguish between user error and system error. For example, GetChildren can fail listing a backup directory's contents either because a bad path was provided (user error) or because an operation failed, e.g., a remote storage service call failed (system error). In the former case, we want to continue and treat the backup directory as empty; in the latter case, we want to immediately propagate the error to the caller. This diff uses NotFound to indicate user error and IOError to indicate system error. Previously IOError indicated both. Closes https://github.com/facebook/rocksdb/pull/1644 Differential Revision: D4312157 Pulled By: ajkr fbshipit-source-id: 51b4f24	9 years ago
zhangjinpeng1987	45c7ce1377	CompactRangeOptions C API Summary: Add C API for CompactRangeOptions. Closes https://github.com/facebook/rocksdb/pull/1596 Differential Revision: D4252339 Pulled By: yiwu-arbug fbshipit-source-id: f768f93	9 years ago
Islam AbdelRahman	ed8fbdb560	Add EventListener::OnExternalFileIngested() event Summary: Add EventListener::OnExternalFileIngested() to allow user to subscribe to external file ingestion events Closes https://github.com/facebook/rocksdb/pull/1623 Differential Revision: D4285844 Pulled By: IslamAbdelRahman fbshipit-source-id: 0b95a88	9 years ago
Manuel Ung	2005c88a75	Implement non-exclusive locks Summary: This is an implementation of non-exclusive locks for pessimistic transactions. It is relatively simple and does not prevent starvation (ie. it's possible that request for exclusive access will never be granted if there are always threads holding shared access). It is done by changing `KeyLockInfo` to hold an set a transaction ids, instead of just one, and adding a flag specifying whether this lock is currently held with exclusive access or not. Some implementation notes: - Some lock diagnostic functions had to be updated to return a set of transaction ids for a given lock, eg. `GetWaitingTxn` and `GetLockStatusData`. - Deadlock detection is a bit more complicated since a transaction can now wait on multiple other transactions. A BFS is done in this case, and deadlock detection depth is now just a limit on the number of transactions we visit. - Expirable transactions do not work efficiently with shared locks at the moment, but that's okay for now. Closes https://github.com/facebook/rocksdb/pull/1573 Differential Revision: D4239097 Pulled By: lth fbshipit-source-id: da7c074	9 years ago
Islam AbdelRahman	67f37cf198	Allow user to specify a CF for SST files generated by SstFileWriter Summary: Allow user to explicitly specify that the generated file by SstFileWriter will be ingested in a specific CF. This allow us to persist the CF id in the generated file Closes https://github.com/facebook/rocksdb/pull/1615 Differential Revision: D4270422 Pulled By: IslamAbdelRahman fbshipit-source-id: 7fb954e	9 years ago
Igor Canadi	3f407b065c	Kill flashcache code in RocksDB Summary: Now that we have userspace persisted cache, we don't need flashcache anymore. Closes https://github.com/facebook/rocksdb/pull/1588 Differential Revision: D4245114 Pulled By: igorcanadi fbshipit-source-id: e2c1c72	9 years ago
Mike Kolupaev	247d0979aa	Support for range skips in compaction filter Summary: This adds the ability for compaction filter to say "drop this key-value, and also drop everything up to key x". This will cause the compaction to seek input iterator to x, without reading the data. This can make compaction much faster when large consecutive chunks of data are filtered out. See the changes in include/rocksdb/compaction_filter.h for the new API. Along the way this diff also adds ability for compaction filter changing merge operands, similar to how it can change values; we're not going to use this feature, it just seemed easier and cleaner to implement it than to document that it's not implemented :) The diff is not as big as it may seem, about half of the lines are a test. Closes https://github.com/facebook/rocksdb/pull/1599 Differential Revision: D4252092 Pulled By: al13n321 fbshipit-source-id: 41e1e48	9 years ago
Panagiotis Ktistakis	96fcefbf1d	c api: expose option for dynamic level size target Summary: Closes https://github.com/facebook/rocksdb/pull/1587 Differential Revision: D4245923 Pulled By: yiwu-arbug fbshipit-source-id: 6ee7291	9 years ago
zhangjinpeng1987	00197cff39	Add C API to set base_backgroud_compactions Summary: Add C API to set base_backgroud_compactions Closes https://github.com/facebook/rocksdb/pull/1571 Differential Revision: D4245709 Pulled By: yiwu-arbug fbshipit-source-id: 792c6b8	9 years ago
Andrew Kryczka	7ffb10fc1a	DeleteRange compaction statistics Summary: - "rocksdb.compaction.key.drop.range_del" - number of keys dropped during compaction due to a range tombstone covering them - "rocksdb.compaction.range_del.drop.obsolete" - number of range tombstones dropped due to compaction to bottom level and no snapshot saving them - s/CompactionIteratorStats/CompactionIterationStats/g since this class is no longer specific to CompactionIterator -- it's also updated for range tombstone iteration during compaction - Move the above class into a separate .h file to avoid circular dependency. Closes https://github.com/facebook/rocksdb/pull/1520 Differential Revision: D4187179 Pulled By: ajkr fbshipit-source-id: 10c2103	9 years ago
Mike Kolupaev	236d4c67e9	Less linear search in DBIter::Seek() when keys are overwritten a lot Summary: In one deployment we saw high latencies (presumably from slow iterator operations) and a lot of CPU time reported by perf with this stack: ``` rocksdb::MergingIterator::Next rocksdb::DBIter::FindNextUserEntryInternal rocksdb::DBIter::Seek ``` I think what's happening is: 1. we create a snapshot iterator, 2. we do lots of Put()s for the same key x; this creates lots of entries in memtable, 3. we seek the iterator to a key slightly smaller than x, 4. the seek walks over lots of entries in memtable for key x, skipping them because of high sequence numbers. CC IslamAbdelRahman Closes https://github.com/facebook/rocksdb/pull/1413 Differential Revision: D4083879 Pulled By: IslamAbdelRahman fbshipit-source-id: a83ddae	9 years ago
Yi Wu	dfb6fe6755	Unified InlineSkipList::Insert algorithm with hinting Summary: This PR is based on nbronson's diff with small modifications to wire it up with existing interface. Comparing to previous version, this approach works better for inserting keys in decreasing order or updating the same key, and impose less restriction to the prefix extractor. ---- Summary from original diff ---- This diff introduces a single InlineSkipList::Insert that unifies the existing sequential insert optimization (prev_), concurrent insertion, and insertion using externally-managed insertion point hints. There's a deep symmetry between insertion hints (cursors) and the concurrent algorithm. In both cases we have partial information from the recent past that is likely but not certain to be accurate. This diff introduces the struct InlineSkipList::Splice, which encodes predecessor and successor information in the same form that was previously only used within a single call to InsertConcurrently. Splice holds information about an insertion point that can be used to levera Closes https://github.com/facebook/rocksdb/pull/1561 Differential Revision: D4217283 Pulled By: yiwu-arbug fbshipit-source-id: 33ee437	9 years ago
Maysam Yabandeh	182b940e70	Add WriteOptions.no_slowdown Summary: If the WriteOptions.no_slowdown flag is set AND we need to wait or sleep for the write request, then fail immediately with Status::Incomplete(). Closes https://github.com/facebook/rocksdb/pull/1527 Differential Revision: D4191405 Pulled By: maysamyabandeh fbshipit-source-id: 7f3ce3f	9 years ago
Karthikeyan Radhakrishnan	4118e13330	Persistent Cache: Expose stats to user via public API Summary: Exposing persistent cache stats (counters) to the user via public API. Closes https://github.com/facebook/rocksdb/pull/1485 Differential Revision: D4155274 Pulled By: siying fbshipit-source-id: 30a9f50	9 years ago
Maysam Yabandeh	9d60151b04	Implement PositionedAppend for PosixWritableFile Summary: This patch clarifies the contract of PositionedAppend with some unit tests and also implements it for PosixWritableFile. (Tasks: 14524071) Closes https://github.com/facebook/rocksdb/pull/1514 Differential Revision: D4204907 Pulled By: maysamyabandeh fbshipit-source-id: 06eabd2	9 years ago
Islam AbdelRahman	c1038d2837	Release RocksDB 5.0 Summary: Update HISTORY.md and version.h Closes https://github.com/facebook/rocksdb/pull/1536 Differential Revision: D4202987 Pulled By: IslamAbdelRahman fbshipit-source-id: 94985e3	9 years ago
Islam AbdelRahman	f39452e81f	Fix heap use after free ASAN/Valgrind Summary: Dont use c_str() of temp std::string in RocksLuaCompactionFilter::Name() Closes https://github.com/facebook/rocksdb/pull/1535 Differential Revision: D4199094 Pulled By: IslamAbdelRahman fbshipit-source-id: e56ce62	9 years ago
Yi Wu	36e4762ce0	Remove Ticker::SEQUENCE_NUMBER Summary: Remove the ticker count because: * Having to reset the ticker count in WriteImpl is ineffiecent; * It doesn't make sense to have it as a ticker count if multiple db instance share a statistics object. Closes https://github.com/facebook/rocksdb/pull/1531 Differential Revision: D4194442 Pulled By: yiwu-arbug fbshipit-source-id: e2110a9	9 years ago
Yueh-Hsuan Chiang	647eafdc21	Introduce Lua Extension: RocksLuaCompactionFilter Summary: This diff includes an implementation of CompactionFilter that allows users to write CompactionFilter in Lua. With this ability, users can dynamically change compaction filter logic without requiring building the rocksdb binary and restarting the database. To compile, WITH_LUA_PATH must be specified to the base directory of lua. Closes https://github.com/facebook/rocksdb/pull/1478 Differential Revision: D4150138 Pulled By: yhchiang fbshipit-source-id: ed84222	9 years ago
Siying Dong	972e3ff295	Enable allow_concurrent_memtable_write and enable_write_thread_adaptive_yield by default Summary: Closes https://github.com/facebook/rocksdb/pull/1496 Differential Revision: D4168080 Pulled By: siying fbshipit-source-id: 056ae62	9 years ago
Andrew Kryczka	489d142808	DeleteRange interface Summary: Expose DeleteRange() interface since we think the implementation is functionally correct now. Closes https://github.com/facebook/rocksdb/pull/1503 Differential Revision: D4171921 Pulled By: ajkr fbshipit-source-id: 5e21c98	9 years ago
Yi Wu	1ea79a78c9	Optimize sequential insert into memtable - Part 1: Interface Summary: Currently our skip-list have an optimization to speedup sequential inserts from a single stream, by remembering the last insert position. We extend the idea to support sequential inserts from multiple streams, and even tolerate small reordering wihtin each stream. This PR is the interface part adding the following: - Add `memtable_insert_prefix_extractor` to allow specifying prefix for each key. - Add `InsertWithHint()` interface to memtable, to allow underlying implementation to return a hint of insert position, which can be later pass back to optimize inserts. - Memtable will maintain a map from prefix to hints and pass the hint via `InsertWithHint()` if `memtable_insert_prefix_extractor` is non-null. Closes https://github.com/facebook/rocksdb/pull/1419 Differential Revision: D4079367 Pulled By: yiwu-arbug fbshipit-source-id: 3555326	9 years ago
Maysam Yabandeh	361010d447	Exporting compaction stats in the form of a map Summary: Currently the compaction stats are printed to stdout. We want to export the compaction stats in a map format so that the upper layer apps (e.g., MySQL) could present the stats in any format required by the them. Closes https://github.com/facebook/rocksdb/pull/1477 Differential Revision: D4149836 Pulled By: maysamyabandeh fbshipit-source-id: b3df19f	9 years ago
Jay Lee	a7875272d7	c: support seek_for_prev Summary: support seek_for_prev in c abi. Closes https://github.com/facebook/rocksdb/pull/1457 Differential Revision: D4135360 Pulled By: lightmark fbshipit-source-id: 61256b0	9 years ago
Andrew Kryczka	f998c9790f	DeleteRange Get support Summary: During Get()/MultiGet(), build up a RangeDelAggregator with range tombstones as we search through live memtable, immutable memtables, and SST files. This aggregator is then used by memtable.cc's SaveValue() and GetContext::SaveValue() to check whether keys are covered. added tests for Get on memtables/files; end-to-end tests mainly in https://reviews.facebook.net/D64761 Closes https://github.com/facebook/rocksdb/pull/1456 Differential Revision: D4111271 Pulled By: ajkr fbshipit-source-id: 6e388d4	9 years ago
zhangjinpeng1987	879f366366	Add C api for RateLimiter Summary: Add C api for RateLimiter. Closes https://github.com/facebook/rocksdb/pull/1455 Differential Revision: D4116362 Pulled By: yiwu-arbug fbshipit-source-id: cb05a8d	9 years ago
Yi Wu	437942e481	Add avoid_flush_during_shutdown DB option Summary: Add avoid_flush_during_shutdown DB option. Closes https://github.com/facebook/rocksdb/pull/1451 Differential Revision: D4108643 Pulled By: yiwu-arbug fbshipit-source-id: abdaf4d	9 years ago
Benoit Girard	2b16d664cb	Change max_bytes_for_level_multiplier to double Summary: Closes https://github.com/facebook/rocksdb/pull/1427 Differential Revision: D4094732 Pulled By: yiwu-arbug fbshipit-source-id: b9b79e9	9 years ago
Jay Lee	16fb04434f	expose IngestExternalFile to c abi Summary: IngestExternalFile is very useful when doing bulk load. This pr expose this API to c so many bindings can benefit from it too. Closes https://github.com/facebook/rocksdb/pull/1454 Differential Revision: D4113420 Pulled By: yiwu-arbug fbshipit-source-id: 307c6ae	9 years ago
Kien-hung Li	eeb27e1bbd	Add handy option to turn on direct I/O in db_bench (#1424 )	9 years ago
Jan Doms	c6168d13ab	removed some declarations from c.h which resulted in undefined symbols (#1407 )	9 years ago

... 26 27 28 29 30 ...

2566 Commits (736b3c4909d858fc261034d94947226db99c361b)