Summary:
This patch records the minimum log number to keep in the manifest while flushing SST files, so that recovery can ignore that log and any WAL older than it. This avoids scenarios where there is a gap in the sequence of WAL files fed to the recovery procedure. Such a gap can happen, for example, through out-of-order WAL deletion, and it causes problems in 2PC recovery, where the prepare and commit entries may be placed in two separate WALs: a gap in the WALs can result in the WAL with the commit entry not being processed, breaking the 2PC recovery logic.
Before this commit, for the 2PC case, we determined which log number to keep in FindObsoleteFiles(): we looked at the earliest logs with outstanding prepare entries, or prepare entries whose respective commit or abort is still in a memtable. With this commit, the same calculation is done while we apply the SST flush. Just before installing the flushed file, we precompute the earliest log file to keep after the flush finishes, using the same logic (but skipping the memtables just flushed), and record this information in the manifest entry for the new flushed SST file. This precomputed value is also remembered in memory and is later used to determine whether a log file can be deleted. The value is unlikely to change until the next flush because the commit entry stays in the memtable. (In WritePrepared, we could remove older log files as soon as all prepared entries are committed; that is not done yet. Even if we did, the only thing we lose with this new approach is earlier log deletion between two flushes, which is not guaranteed to happen anyway because the obsolete-file clean-up function is only executed after a flush or compaction.)
The min log number to keep is stored in the manifest using the safely-ignorable custom field of the AddFile entry, in order to guarantee that a DB generated by a newer release can still be opened by previous releases no older than 4.2.
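A simplified sketch of the precomputation described above; the helper name and its inputs are hypothetical, not the patch's actual code:
```lang=cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper: after a flush, the earliest WAL that must be kept is the
// minimum over (a) logs that still hold outstanding prepare entries (or prepare
// entries whose commit/abort is still in a memtable) and (b) logs referenced by
// memtables that were NOT part of this flush.
uint64_t PrecomputeMinLogNumberToKeep2PC(
    const std::vector<uint64_t>& logs_with_outstanding_prep,
    const std::vector<uint64_t>& logs_of_unflushed_memtables) {
  uint64_t min_log = std::numeric_limits<uint64_t>::max();
  for (uint64_t log : logs_with_outstanding_prep) {
    min_log = std::min(min_log, log);
  }
  for (uint64_t log : logs_of_unflushed_memtables) {
    min_log = std::min(min_log, log);
  }
  // The result is written into the manifest entry (VersionEdit) of the newly
  // flushed SST file and remembered in memory for later log-deletion decisions.
  return min_log;
}
```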
Closes https://github.com/facebook/rocksdb/pull/3765
Differential Revision: D7747618
Pulled By: siying
fbshipit-source-id: d00c92105b4f83852e9754a1b70d6b64cb590729
Summary:
Add a new histogram stat called rocksdb.db.flush.micros for memtable
flush
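For illustration only, here is one way to observe the new histogram through the public Statistics object (the DB path and workload are placeholders):
```lang=cpp
#include <iostream>
#include "rocksdb/db.h"
#include "rocksdb/statistics.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.statistics = rocksdb::CreateDBStatistics();
  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/flush_micros_demo", &db);
  if (!s.ok()) return 1;
  // ... write enough data to make the flush meaningful ...
  db->Flush(rocksdb::FlushOptions());
  // The dump includes a line for rocksdb.db.flush.micros.
  std::cout << options.statistics->ToString() << std::endl;
  delete db;
  return 0;
}
```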
Closes https://github.com/facebook/rocksdb/pull/3269
Differential Revision: D6559496
Pulled By: anand1976
fbshipit-source-id: f5c771ba2568630458751795e8c37a493ff9b14d
Summary:
Update Compaction/Flush to support WritePreparedTxnDB: add SnapshotChecker, which is a proxy to query WritePreparedTxnDB::IsInSnapshot. Pass SnapshotChecker to DBImpl on WritePreparedTxnDB open. CompactionIterator uses it to check whether a key has been committed and whether it is visible to a snapshot. In CompactionIterator:
* check if the key has been committed; if not, output uncommitted keys as-is.
* use SnapshotChecker to check whether the key is visible to a snapshot when needed.
* do not output a key with seq = 0 if the key is not committed. (A simplified sketch of the check follows.)
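A minimal sketch of the checker interface and how the iterator consults it; the class body and the pseudocode are illustrative, not this diff's exact code:
```lang=cpp
#include <cstdint>

// Simplified stand-in for the checker described above; the real one proxies
// WritePreparedTxnDB::IsInSnapshot for the compaction/flush iterator.
class SnapshotCheckerSketch {
 public:
  virtual ~SnapshotCheckerSketch() {}
  // True if the write with `sequence` is committed and visible to the snapshot
  // taken at `snapshot_sequence`.
  virtual bool IsInSnapshot(uint64_t sequence,
                            uint64_t snapshot_sequence) const = 0;
};

// Rough shape of the rules listed above, inside the compaction iterator:
//   if key is not committed:
//     output it as-is and never zero out its sequence number;
//   else if a snapshot exists and !checker->IsInSnapshot(key_seq, snapshot_seq):
//     the key is invisible to that snapshot, so older versions must be kept.
```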
Closes https://github.com/facebook/rocksdb/pull/2926
Differential Revision: D5902907
Pulled By: yiwu-arbug
fbshipit-source-id: 945e037fdf0aa652dc5ba0ad879461040baa0320
Summary:
Moves the bytes_per_sync and wal_bytes_per_sync options from immutable options to mutable options. Also, if wal_bytes_per_sync is changed, the WAL file and memtables are flushed.
Test Plan:
ran make check; all tests passed.
Two new tests, SetBytesPerSync and SetWalBytesPerSync, check that after issuing SetOptions with a new value for the variable, the DB options reflect the new value.
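A minimal usage sketch, assuming the string-based DB::SetDBOptions interface (the values are arbitrary):
```lang=cpp
#include "rocksdb/db.h"

// Illustrative only: change the now-mutable sync options on an open DB.
rocksdb::Status UpdateSyncOptions(rocksdb::DB* db) {
  return db->SetDBOptions({{"bytes_per_sync", "1048576"},       // 1 MB
                           {"wal_bytes_per_sync", "1048576"}});
}
```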
Closes https://github.com/facebook/rocksdb/pull/2893
Reviewed By: yiwu-arbug
Differential Revision: D5845814
Pulled By: TheRushingWookie
fbshipit-source-id: 93b52d779ce623691b546679dcd984a06d2ad1bd
Summary:
This makes it easier to implement future optimizations like range collapsing.
Closes https://github.com/facebook/rocksdb/pull/1504
Differential Revision: D4172214
Pulled By: ajkr
fbshipit-source-id: ac4942f
Summary:
Changed BuildTable() (used for flush) to (1) add range
tombstones to the aggregator, which is used by CompactionIterator to
determine which keys can be removed; and (2) add aggregator's range
tombstones to the table that is output for the flush.
Closes https://github.com/facebook/rocksdb/pull/1438
Differential Revision: D4100025
Pulled By: ajkr
fbshipit-source-id: cb01a70
Summary: Use ImmutableDBOptions/MutableDBOptions internally and DBOptions only for user-facing APIs. MutableDBOptions is just a placeholder for now; I'll start moving options into MutableDBOptions in following diffs.
Test Plan:
make all check
Reviewers: yhchiang, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D64065
Summary: Multiput atomicity is broken across multiple column families if we don't sync the WAL before flushing one column family. The WAL file may contain a write batch with writes to a key in the CF to be flushed and a key in another CF. If we don't sync the WAL before flushing and the machine crashes after the flush, the write batch will only be partially recovered: data for the other CFs is lost.
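A hedged sketch of the scenario described above; db, cf1, and cf2 are assumed handles and the code is illustrative only:
```lang=cpp
#include "rocksdb/db.h"
#include "rocksdb/write_batch.h"

// One atomic batch touches two column families, written without WriteOptions::sync.
void WriteAcrossColumnFamilies(rocksdb::DB* db,
                               rocksdb::ColumnFamilyHandle* cf1,
                               rocksdb::ColumnFamilyHandle* cf2) {
  rocksdb::WriteBatch batch;
  batch.Put(cf1, "k1", "v1");
  batch.Put(cf2, "k2", "v2");
  db->Write(rocksdb::WriteOptions(), &batch);
  // If cf1 is now flushed without syncing the WAL first and the machine crashes,
  // recovery may replay only part of the batch: cf1's update survives in its SST
  // while cf2's update existed only in the unsynced WAL tail and is lost.
}
```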
Test Plan: Add a new unit test which will fail without the diff.
Reviewers: yhchiang, IslamAbdelRahman, igor, yiwu
Reviewed By: yiwu
Subscribers: yiwu, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60915
Summary: Add option write_buffer_manager to help users control total memory spent on memtables across multiple DB instances.
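A minimal usage sketch assuming the public WriteBufferManager class and the write_buffer_manager option; the sizes are arbitrary:
```lang=cpp
#include <memory>
#include "rocksdb/options.h"
#include "rocksdb/write_buffer_manager.h"

// Cap total memtable memory across several DB instances by sharing one manager.
void ShareWriteBufferBudget(rocksdb::Options* options1, rocksdb::Options* options2) {
  auto wbm = std::make_shared<rocksdb::WriteBufferManager>(512 << 20);  // 512 MB total
  options1->write_buffer_manager = wbm;
  options2->write_buffer_manager = wbm;
  // DBs opened with options1/options2 now draw from one memtable memory budget.
}
```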
Test Plan: Add a new unit test.
Reviewers: yhchiang, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: adela, benj, sumeet, muthu, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59925
Summary: It is useful to print out IO stats in flush jobs too. Extend options.compaction_measure_io_stats to flush jobs and rename it.
Test Plan: Try db_bench and see the stats are printed out.
Reviewers: yhchiang
Reviewed By: yhchiang
Subscribers: kradhakrishnan, yiwu, IslamAbdelRahman, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56769
Summary:
Currently, transactions can fail even if there is no actual write conflict. This is due to relying on only the memtables to check for write-conflicts. Users have to tune memtable settings to try to avoid this, but it's hard to figure out exactly how to tune these settings.
With this diff, TransactionDB will use both memtables and SST files to determine if there are any write conflicts. This relies on the fact that BlockBasedTable stores sequence numbers for all writes that happen after any open snapshot. Also, D50295 is needed to prevent SingleDelete from disappearing writes (the TODOs in this test code will be fixed once the other diff is approved and merged).
Note that Optimistic transactions will still rely on tuning memtable settings as we do not want to read from SST while on the write thread. Also, memtable settings can still be used to reduce how often TransactionDB needs to read SST files.
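An illustrative (not authoritative) sketch of the pessimistic-TransactionDB flow with snapshot validation; the exact status returned on a conflict is not spelled out here:
```lang=cpp
#include "rocksdb/utilities/transaction_db.h"

// txn_db is assumed to be already opened via TransactionDB::Open. With conflict
// checking backed by SST sequence numbers, the validation still works after the
// conflicting write has been flushed, and a flush no longer makes the check fail
// merely because the memtable history is gone.
rocksdb::Status TryTransactionalWrite(rocksdb::TransactionDB* txn_db) {
  rocksdb::TransactionOptions txn_opts;
  txn_opts.set_snapshot = true;  // validate writes against the snapshot
  rocksdb::Transaction* txn =
      txn_db->BeginTransaction(rocksdb::WriteOptions(), txn_opts);
  // Another thread may write "key" and trigger a flush before this point.
  rocksdb::Status s = txn->Put("key", "value");  // checks memtables and SST files
  if (s.ok()) {
    s = txn->Commit();
  } else {
    txn->Rollback();
  }
  delete txn;
  return s;
}
```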
Test Plan: unit tests, db bench
Reviewers: rven, yhchiang, kradhakrishnan, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb, yoshinorim
Differential Revision: https://reviews.facebook.net/D50475
When we recycle log files, we need to mix the log number into the CRC
for each record. Note that for logs that don't get recycled (like the
manifest), we always pass a log_number of 0 and false for the recycle flag.
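A minimal sketch of the idea, using a stand-in checksum rather than RocksDB's actual CRC32C code:
```lang=cpp
#include <cstdint>
#include <functional>
#include <string>

// Extend the record checksum over the log number when recycling, so a stale
// record left over from the file's previous life fails its checksum and is
// ignored during recovery instead of being replayed.
uint32_t RecordChecksum(uint64_t log_number, bool recycled,
                        const std::string& payload) {
  std::string buf;
  if (recycled) {
    buf.append(reinterpret_cast<const char*>(&log_number), sizeof(log_number));
  }
  buf.append(payload);
  // Stand-in for the real CRC32C; any checksum over (log_number || payload)
  // illustrates the point.
  return static_cast<uint32_t>(std::hash<std::string>{}(buf));
}
```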
Signed-off-by: Sage Weil <sage@redhat.com>
Summary:
This diff is a collection of cleanups that were initially part of D43179.
Additionally it adds a unified way of defining key-value maps that use a
Comparator for sorting (this was previously implemented in four different
places).
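One way such a comparator-ordered map can be defined; this sketch only illustrates the pattern the diff unifies:
```lang=cpp
#include <map>
#include <string>
#include "rocksdb/comparator.h"
#include "rocksdb/slice.h"

// A std::map keyed by std::string but ordered by a rocksdb::Comparator.
struct ComparatorLess {
  const rocksdb::Comparator* cmp;
  bool operator()(const std::string& a, const std::string& b) const {
    return cmp->Compare(a, b) < 0;
  }
};
using KVMap = std::map<std::string, std::string, ComparatorLess>;

// Usage: KVMap kv(ComparatorLess{rocksdb::BytewiseComparator()});
```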
Test Plan: make clean check all
Reviewers: rven, anthony, yhchiang, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D45993
Summary:
Currently, we only purge duplicate keys and deletions during flush if `earliest_seqno_in_memtable <= newest_snapshot`. This means that the newest snapshot happened before we first created the memtable. This is almost never true for MyRocks and MongoRocks.
This patch makes purging during flush able to understand snapshots. The main logic is copied from compaction_job.cc, although the logic over there is much more complicated and extensive. However, we should try to merge the common functionality at some point.
I need this patch to implement no_overwrite_i_promise functionality for flush. We'll also need this to support SingleDelete() during Flush(). @yoshinorim requested the feature.
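A simplified sketch of the snapshot bucketing this kind of purge relies on; the helper name is hypothetical:
```lang=cpp
#include <cstdint>
#include <vector>

// Find the earliest snapshot that can see a write with sequence number `seq`.
// Two versions of a key may only be collapsed during flush if no snapshot
// separates them, i.e. both map to the same earliest visible snapshot.
uint64_t EarliestVisibleSnapshot(uint64_t seq,
                                 const std::vector<uint64_t>& snapshots_ascending) {
  for (uint64_t snap : snapshots_ascending) {
    if (seq <= snap) {
      return snap;
    }
  }
  return UINT64_MAX;  // visible only to future reads
}
```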
Test Plan:
make check
I had to adjust some unit tests to understand this new behavior
Reviewers: yhchiang, yoshinorim, anthony, sdong, noetzli
Reviewed By: noetzli
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D42087
Summary:
Changed compaction_job_test to support better/more thorough
tests and added two tests. Also changed MockFileContents
to order using InternalKeyComparator.
Test Plan: make compaction_job_test && ./compaction_job_test; make all && make check
Reviewers: sdong, rven, igor, yhchiang, anthony
Reviewed By: anthony
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D42837
Summary: We want to keep Env a thin layer for better portability; code that is not inherently platform-dependent should be moved out of Env. In this patch, I create wrappers for file readers and writers and move rate limiting, write buffering, as well as most perf context instrumentation and random kill points, out of Env. This will make it easier to maintain multiple Envs in the future.
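A rough sketch of the wrapper idea (buffering layered on top of Env's WritableFile); this is illustrative, not the patch's actual classes:
```lang=cpp
#include <memory>
#include <string>
#include "rocksdb/env.h"
#include "rocksdb/slice.h"
#include "rocksdb/status.h"

// Keep Env's file object thin; the wrapper owns buffering (and, in the real
// patch, rate limiting and instrumentation as well).
class BufferedWriterSketch {
 public:
  explicit BufferedWriterSketch(std::unique_ptr<rocksdb::WritableFile> file)
      : file_(std::move(file)) {}
  rocksdb::Status Append(const rocksdb::Slice& data) {
    buffer_.append(data.data(), data.size());
    if (buffer_.size() >= kBufferSize) {
      return Flush();
    }
    return rocksdb::Status::OK();
  }
  rocksdb::Status Flush() {
    rocksdb::Status s = file_->Append(buffer_);  // one large write to the Env file
    buffer_.clear();
    return s;
  }

 private:
  static constexpr size_t kBufferSize = 64 * 1024;
  std::unique_ptr<rocksdb::WritableFile> file_;
  std::string buffer_;
};
```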
Test Plan: Run all existing unit tests.
Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D42321
Summary: Optimistic transactions supporting begin/commit/rollback semantics. Currently relies on checking the memtable to determine if there are any collisions at commit time. Not yet implemented would be a way of ensuring the memtable has some minimum amount of history so that we won't fail to commit when the memtable is empty. You should probably start with transaction.h to get an overview of what is currently supported.
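An illustrative begin/commit/rollback flow; the header and class names follow the present-day public API and may differ slightly from this diff:
```lang=cpp
#include <cassert>
#include "rocksdb/utilities/optimistic_transaction_db.h"
#include "rocksdb/utilities/transaction.h"

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::OptimisticTransactionDB* txn_db = nullptr;
  rocksdb::Status s = rocksdb::OptimisticTransactionDB::Open(
      options, "/tmp/optimistic_txn_demo", &txn_db);
  assert(s.ok());

  rocksdb::Transaction* txn = txn_db->BeginTransaction(rocksdb::WriteOptions());
  txn->Put("key", "value");
  s = txn->Commit();  // fails if a conflicting write is found in the memtable
  if (!s.ok()) {
    txn->Rollback();
  }
  delete txn;
  delete txn_db;
  return 0;
}
```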
Test Plan: Added a new test, but still need to look into stress testing.
Reviewers: yhchiang, igor, rven, sdong
Reviewed By: sdong
Subscribers: adamretter, MarkCallaghan, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D33435
Summary:
For transactions, we use the memtables to validate that there are no write conflicts. But after flushing, we don't have any memtables, and transactions could fail to commit. So we want to somehow keep around some extra history to use for conflict checking. In addition, we want to provide a way to increase the size of this history if too many transactions fail to commit.
After chatting with people, it seems like everyone prefers just using memtables to store this history (instead of a separate history structure). It seems like the best place for this is abstracted inside the memtable_list. I decided to create a separate list in MemtableListVersion, as using the same list complicated the flush / install-flush-results logic too much.
This diff adds a new parameter to control how much memtable history to keep around after flushing. However, it sounds like people aren't too fond of adding new parameters, so I am making the default size of flushed + not-flushed memtables equal to max_write_buffers. This should not change the maximum amount of memory used, but makes it more likely that we're using close to the limit. (We now postpone deleting flushed memtables until the max_write_buffer limit is reached.) So while we might use more memory on average, we still obey the configured limit (and you could argue it's better to go ahead and use the memory now instead of waiting for a write stall to test this limit).
However, if people are opposed to this default behavior, we can easily set it to 0 and require this parameter be set in order to use transactions.
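An illustrative configuration; the option name used below (max_write_buffer_number_to_maintain) is an assumption on my part, since this summary does not name the new parameter:
```lang=cpp
#include "rocksdb/options.h"

void ConfigureMemtableHistory(rocksdb::ColumnFamilyOptions* cf_options) {
  cf_options->max_write_buffer_number = 4;
  // Keep up to 4 memtables (flushed + unflushed) around so transactions can
  // still check recent write history for conflicts after a flush.
  cf_options->max_write_buffer_number_to_maintain = 4;  // assumed option name
}
```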
Test Plan: Added an xfunc test to play around with setting different values of this parameter in all tests. Added testing in memtablelist_test; planning on adding more testing here.
Reviewers: sdong, rven, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D37443
Summary: It is no longer used by the implementation, so we should also remove it from the public API.
Test Plan: make check
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D34971
Summary:
Our existing test notation is very similar to what is used in gtest, which makes it easy to adopt the parts that differ.
In this diff I modify existing [[ https://code.google.com/p/googletest/wiki/Primer#Test_Fixtures:_Using_the_Same_Data_Configuration_for_Multiple_Te | test fixture ]] classes to inherit from `testing::Test`. Also for unit tests that use fixture class, `TEST` is replaced with `TEST_F` as required in gtest.
There are several custom `main` functions in our existing tests. To make this transition easier, I modify all `main` functions to follow gtest notation, but eventually we can remove them and use the implementation of `main` that gtest provides.
```lang=bash
% cat ~/transform
#!/bin/sh
files=$(git ls-files '*test\.cc')
for file in $files
do
if grep -q "rocksdb::test::RunAllTests()" $file
then
if grep -Eq '^class \w+Test {' $file
then
perl -pi -e 's/^(class \w+Test) {/${1}: public testing::Test {/g' $file
perl -pi -e 's/^(TEST)/${1}_F/g' $file
fi
perl -pi -e 's/(int main.*\{)/${1}::testing::InitGoogleTest(&argc, argv);/g' $file
perl -pi -e 's/rocksdb::test::RunAllTests/RUN_ALL_TESTS/g' $file
fi
done
% sh ~/transform
% make format
```
Second iteration of this diff contains only scripted changes.
Third iteration contains manual changes to fix last errors and make it compilable.
Test Plan:
Build and notice no errors.
```lang=bash
% USE_CLANG=1 make check -j55
```
Tests still pass.
Reviewers: meyering, sdong, rven, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D35157
Summary:
gtest does not use exceptions to fail a unit test by design, and `ASSERT*`s are implemented using `return`. As a consequence we cannot use `ASSERT*` in a function that does not return a `void` value ([[ https://code.google.com/p/googletest/wiki/AdvancedGuide#Assertion_Placement | 1]]), and have to fix our existing code. This diff does this in a generic way, with no manual changes.
In order to detect all existing `ASSERT*` uses in functions that don't return a `void` value, I changed the code to generate compile errors for such cases.
In `util/testharness.h` I defined `EXPECT*` assertions, the same way as `ASSERT*`, and redefined `ASSERT*` to return `void`. Then executed:
```lang=bash
% USE_CLANG=1 make all -j55 -k 2> build.log
% perl -naF: -e 'print "-- -number=".$F[1]." ".$F[0]."\n" if /: error:/' \
build.log | xargs -L 1 perl -spi -e 's/ASSERT/EXPECT/g if $. == $number'
% make format
```
After that, I reverted the change to `ASSERT*` in `util/testharness.h`, but preserved the introduced `EXPECT*`, which is the same as `ASSERT*`. These will be deleted once we switch to gtest.
This diff is independent and contains manual changes only in `util/testharness.h`.
Test Plan:
Make sure all tests are passing.
```lang=bash
% USE_CLANG=1 make check
```
Reviewers: igor, lgalanis, sdong, yufei.zhu, rven, meyering
Reviewed By: meyering
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D33333
Summary:
Here's my proposal for making our LOGs easier to read by machines.
The idea is to dump all events as JSON objects. JSON is easy to read by humans, but more importantly, it's easy to read by machines. That way, we can parse this, load into SQLite/mongo and then query or visualize.
I started with table_create and table_delete events, but if everybody agrees, I'll continue by adding more events (flush/compaction/etc etc)
Test Plan:
Ran db_bench. Observed:
2015/01/15-14:13:25.788019 1105ef000 EVENT_LOG_v1 {"time_micros": 1421360005788015, "event": "table_file_creation", "file_number": 12, "file_size": 1909699}
2015/01/15-14:13:25.956500 110740000 EVENT_LOG_v1 {"time_micros": 1421360005956498, "event": "table_file_deletion", "file_number": 12}
Reviewers: yhchiang, rven, dhruba, MarkCallaghan, lgalanis, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31647
Summary:
It would be good to assign background jobs their own IDs. Two benefits:
1) makes LOGs more readable
2) I might use it in my EventLogger, which will try to make our LOG easier to read/query/visualize
Test Plan: ran rocksdb, read the LOG
Reviewers: sdong, rven, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D31617
Summary:
Add a counter for collecting the wait time on db mutex.
Also add MutexWrapper and CondVarWrapper for measuring wait time.
Test Plan:
./db_test
export ROCKSDB_TESTS=MutexWaitStats
./db_test
verify stats output using db_bench
make clean
make release
./db_bench --statistics=1 --benchmarks=fillseq,readwhilewriting --num=10000 --threads=10
Sample output:
rocksdb.db.mutex.wait.micros COUNT : 7546866
Reviewers: MarkCallaghan, rven, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D32787
Summary:
1. If the WAL directory is different from the DB directory, sync the WAL directory after creating a log file under it.
2. After creating an SST file, sync its parent directory instead of the DB directory (a minimal sketch of this directory-sync idea follows the list).
3. Change the check of kResetDeleteUnsyncedFiles in fault_injection_test. Since we changed the behavior to sync a log file's parent directory after the first WAL sync, instead of on creation, kResetDeleteUnsyncedFiles is no longer guaranteed to show post-sync updates.
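A minimal sketch of the directory-sync idea behind items 1 and 2, using Env's Directory handle (the helper name is hypothetical):
```lang=cpp
#include <memory>
#include <string>
#include "rocksdb/env.h"

// After creating a file, fsync its *parent* directory so the new directory
// entry itself is durable, not just the file's contents.
rocksdb::Status SyncParentDir(rocksdb::Env* env, const std::string& dir_path) {
  std::unique_ptr<rocksdb::Directory> dir;
  rocksdb::Status s = env->NewDirectory(dir_path, &dir);
  if (s.ok()) {
    s = dir->Fsync();
  }
  return s;
}
```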
Test Plan: make all check
Reviewers: yhchiang, rven, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D32067
Summary:
Introduces a new class for managing write buffer memory across column
families. We supplement ColumnFamilyOptions::write_buffer_size with
ColumnFamilyOptions::write_buffer, a shared pointer to a WriteBuffer
instance that enforces memory limits before flushing out to disk.
Test Plan: Added SharedWriteBuffer unit test to db_test.cc
Reviewers: sdong, rven, ljin, igor
Reviewed By: igor
Subscribers: tnovak, yhchiang, dhruba, xjin, MarkCallaghan, yoshinorim
Differential Revision: https://reviews.facebook.net/D22581
Summary:
In some environments, such as Android, the C++ library does not have std::to_string. This patch adds rocksdb::ToString(), which wraps std::to_string when it is available and implements the conversion itself otherwise.
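A minimal sketch of such a wrapper, assuming the OS_ANDROID macro mentioned in the test plan guards the fallback:
```lang=cpp
#include <cstdio>
#include <string>

namespace rocksdb {
// Use std::to_string where the C++ library provides it; otherwise fall back to
// snprintf. (Placement and exact form of the real helper may differ.)
inline std::string ToString(int value) {
#if !defined(OS_ANDROID)
  return std::to_string(value);
#else
  char buf[32];
  std::snprintf(buf, sizeof(buf), "%d", value);
  return std::string(buf);
#endif
}
}  // namespace rocksdb
```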
Test Plan:
make dbg -j32
./db_test
make clean
make dbg OPT=-DOS_ANDROID -j32
./db_test
Reviewers: ljin, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D29181
Summary:
This is just a simple test that passes two files through a compaction. It shows the framework so that people can continue building new compaction *unit* tests.
In the future we might want to move some Compaction* tests from DBTest here. For example, CompactBetweenSnapshot seems a good candidate.
Hopefully this test can be simpler when we mock out VersionSet.
Test Plan: this is a test
Reviewers: ljin, rven, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D28449
Summary:
Here's a prototype of redesigning pending_outputs_. This way, we don't have to expose pending_outputs_ to other classes (CompactionJob, FlushJob, MemtableList). DBImpl takes care of it.
Still have to write some comments, but should be good enough to start the discussion.
Test Plan: make check, will also run stress test
Reviewers: ljin, sdong, rven, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D28353
Summary:
This diff replaces BlockBasedTable in flush_job_test with TableMock, making it depend on fewer things and making it closer to a unit test than an integration test.
It also introduces a framework for compiling mock classes -- any file named *mock.cc will not be compiled into the main build; it will only get compiled into the tests. That way, we can mock out most other classes: Version, VersionSet, DBImpl, etc.
Test Plan: flush_job_test
Reviewers: ljin, rven, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D27681
Summary:
Abstract out FlushProcess and take it out of DBImpl.
This also includes taking DeletionState outside of DBImpl.
Currently this diff is only doing the refactoring. Future work includes:
1. Decoupling flush_process.cc, make it depend on less state
2. Write flush_process_test, which will mock out everything that FlushProcess depends on and test it in isolation
Test Plan: make check
Reviewers: rven, yhchiang, sdong, ljin
Reviewed By: ljin
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D27561