Summary:
<This diff is for Column Family branch>
Sharing some of the work I've done so far. This diff compiles and passes the tests.
The biggest change is in options.h - I broke down Options into two parts - DBOptions and ColumnFamilyOptions. DBOptions is DB-specific (env, create_if_missing, block_cache, etc.) and ColumnFamilyOptions is column family-specific (all compaction options, compresion options, etc.). Note that this does not break backwards compatibility at all.
Further, I created DBWithColumnFamily which inherits DB interface and adds new functions with column family support. Clients can transparently switch to DBWithColumnFamily and it will not break their backwards compatibility.
There are few methods worth checking out: ListColumnFamilies(), MultiNewIterator(), MultiGet() and GetSnapshot(). [GetSnapshot() returns the snapshot across all column families for now - I think that's what we agreed on]
Finally, I made small changes to WriteBatch so we are able to atomically insert data across column families.
Please provide feedback.
Test Plan: make check works, the code is backward compatible
Reviewers: dhruba, haobo, sdong, kailiu, emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14445
Summary: I realized that "D14409 Avoid sorting in Version::Get() by presorting them in VersionSet::Builder::SaveTo()" is not done in an optimized place. SaveTo() is usually inside mutex. Move it to Finalize(), which is called out of mutex.
Test Plan: make all check
Reviewers: dhruba, haobo, kailiu
Reviewed By: dhruba
CC: igor, leveldb
Differential Revision: https://reviews.facebook.net/D14607
Summary: It seems to be a decision tradeoff in current codes: we make a malloc for every Get() to reduce one malloc for a flush inside mutex. It takes about 5% of CPU time in readrandom tests. We might consider the tradeoff to be the other way around.
Test Plan: make all check
Reviewers: dhruba, haobo, igor
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14697
Summary: A bug to fix. IT's already fixed in D14457, but want to check it in sooner to unblock tests
Test Plan: plain_table_db_test
Reviewers: nkg-, haobo
Reviewed By: nkg-
CC: kailiu, leveldb
Differential Revision: https://reviews.facebook.net/D14673
Summary:
This is the last diff that adds the property block to plain table.
The format resembles that of the block-based table: https://github.com/facebook/rocksdb/wiki/Rocksdb-table-format
[data block]
[meta block 1: stats block]
[meta block 2: future extended block]
...
[meta block K: future extended block] (we may add more meta blocks in the future)
[metaindex block]
[index block: we only have the placeholder here, we can add persistent index block in the future]
[Footer: contains magic number, handle to metaindex block and index block]
<end_of_file>
Test Plan: extended existing property block test.
Reviewers: haobo, sdong, dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14523
Summary:
By disassemble the function, we found that the atomic variables do invoke the `lock` that locks the memory bus.
As a tradeoff, we protect the GetUsage by mutex and leave usage_ as plain size_t.
Test Plan: passed `cache_test`
Reviewers: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14667
Summary: make release complains signed unsigned comparison.
Test Plan: make release
Reviewers: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14661
Summary: This diff will help us to figure out the memory usage for the cache part.
Test Plan: added a new memory usage test for cache
Reviewers: haobo, sdong, dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14559
Summary:
PlainTable now has a bug of the ordering of indexes for the prefixes in the same bucket. I thought std::map guaranteed key order but it didn't, probably because I didn't use it properly. But seems to me that we don't need to make extra sorting as input prefixes are already sorted. Found by problem by running leaf4 against plain table. Replace the map with a vector. It should performs better too.
After the fix, leaf4 unit tests are passing.
Test Plan:
run plain_table_db_test
Also going to run db_test with plain table in the uncommitted branch.
Reviewers: haobo, kailiu
Reviewed By: haobo
CC: nkg-, leveldb
Differential Revision: https://reviews.facebook.net/D14649
Summary:
I realized that manifest will get deleted by PurgeObsoleteFiles in DBImpl, but it is sill cleaner to delete
files before we restore the backup
Test Plan: backupable_db_test
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14619
Summary: To reduce mutex contention caused by DBImpl.NewInternalIterator(), in this function, move all the iteration creation works out of mutex, only leaving object ref and get.
Test Plan:
make all check
will run db_stress for a while too to make sure no problem.
Reviewers: haobo, dhruba, kailiu
Reviewed By: haobo
CC: igor, leveldb
Differential Revision: https://reviews.facebook.net/D14589
Summary: When deconstructing an iterator, no need to check obsolete file if it doesn't hold last reference of any version.
Test Plan: make all check
Reviewers: haobo, igor, dhruba, kailiu
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14595
Summary: @MarkCallaghan's tests indicate that performance with 8k rows in memtable is much worse than empty memtable. I wanted to add a regression tests that measures this effect, so we could optimize it. However, current config shows 634461 QPS on my devbox. Mark, any idea why this is so much faster than your measurements?
Test Plan: Ran the regression test.
Reviewers: MarkCallaghan, dhruba, haobo
Reviewed By: MarkCallaghan
CC: leveldb, MarkCallaghan
Differential Revision: https://reviews.facebook.net/D14511
Summary: In get operations, merge_operands is only used in few cases. Lazily initialize it can reduce average latency in some cases
Test Plan: make all check
Reviewers: haobo, kailiu, dhruba
Reviewed By: haobo
CC: igor, nkg-, leveldb
Differential Revision: https://reviews.facebook.net/D14415
Conflicts:
db/db_impl.cc
db/memtable.cc
Summary: Pre-sort files in VersionSet::Builder::SaveTo() so that when getting the value, no need to sort them. It can avoid the costs of vector operations and sorting in Version::Get().
Test Plan: make all check
Reviewers: haobo, kailiu, dhruba
Reviewed By: dhruba
CC: nkg-, igor, leveldb
Differential Revision: https://reviews.facebook.net/D14409
Summary: Pre-sort files in VersionSet::Builder::SaveTo() so that when getting the value, no need to sort them. It can avoid the costs of vector operations and sorting in Version::Get().
Test Plan: make all check
Reviewers: haobo, kailiu, dhruba
Reviewed By: dhruba
CC: nkg-, igor, leveldb
Differential Revision: https://reviews.facebook.net/D14409
Summary:
creating new iterators of mem tables can be expensive. Move them out of mutex.
DBImpl::WriteLevel0Table()'s mems seems to be a local vector and is only used by flushing. memtables to flush are also immutable, so it should be safe to do so.
Test Plan: make all check
Reviewers: haobo, dhruba, kailiu
Reviewed By: dhruba
CC: igor, leveldb
Differential Revision: https://reviews.facebook.net/D14577
Conflicts:
db/db_impl.cc
Summary:
creating new iterators of mem tables can be expensive. Move them out of mutex.
DBImpl::WriteLevel0Table()'s mems seems to be a local vector and is only used by flushing. memtables to flush are also immutable, so it should be safe to do so.
Test Plan: make all check
Reviewers: haobo, dhruba, kailiu
Reviewed By: dhruba
CC: igor, leveldb
Differential Revision: https://reviews.facebook.net/D14577
Summary:
I have ran a get benchmark where all the data is in the cache and observed that most of the time is spent on waiting for lock in LRUCache.
This is an effort to optimize LRUCache.
Test Plan:
The data was loaded with fillseq. Then, I ran a benchmark:
/db_bench --db=/tmp/rocksdb_stat_bench --num=1000000 --benchmarks=readrandom --statistics=1 --use_existing_db=1 --threads=16 --disable_seek_compaction=1 --cache_size=20000000000 --cache_numshardbits=8 --table_cache_numshardbits=8
I ran the benchmark three times. Here are the results:
AFTER THE PATCH: 798072, 803998, 811807
BEFORE THE PATCH: 782008, 815593, 763017
Reviewers: dhruba, haobo, kailiu
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14571
Summary: as title
Test Plan: dynamic_bloom_test
Reviewers: dhruba, sdong, kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14385
Summary: We now delete backups with newer sequence number, so the clients don't have to handle confusing situations when they restore from backup.
Test Plan: added a unit test
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14547
Summary: This will allow us to access constant via `DB::GetOptions().table_cache.GetCapacity()` or `DB::GetOptions().block_cache.GetCapacity()` since GetOptions() is also constant method.
Summary: So fflush() takes a lock which is heavyweight. I added flush_pending_, but more importantly, I removed LogFlush() from foreground threads.
Test Plan: ./db_test
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14535
Summary: Valgrind complained about BackupableDB. This fixes valgrind errors. Also, I cleaned up some code.
Test Plan: valgrind does not complain anymore
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14529
Summary:
In this diff I present you BackupableDB v1. You can easily use it to backup your DB and it will do incremental snapshots for you.
Let's first describe how you would use BackupableDB. It's inheriting StackableDB interface so you can easily construct it with your DB object -- it will add a method RollTheSnapshot() to the DB object. When you call RollTheSnapshot(), current snapshot of the DB will be stored in the backup dir. To restore, you can just call RestoreDBFromBackup() on a BackupableDB (which is a static method) and it will restore all files from the backup dir. In the next version, it will even support automatic backuping every X minutes.
There are multiple things you can configure:
1. backup_env and db_env can be different, which is awesome because then you can easily backup to HDFS or wherever you feel like.
2. sync - if true, it *guarantees* backup consistency on machine reboot
3. number of snapshots to keep - this will keep last N snapshots around if you want, for some reason, be able to restore from an earlier snapshot. All the backuping is done in incremental fashion - if we already have 00010.sst, we will not copy it again. *IMPORTANT* -- This is based on assumption that 00010.sst never changes - two files named 00010.sst from the same DB will always be exactly the same. Is this true? I always copy manifest, current and log files.
4. You can decide if you want to flush the memtables before you backup, or you're fine with backing up the log files -- either way, you get a complete and consistent view of the database at a time of backup.
5. More things you can find in BackupableDBOptions
Here is the directory structure I use:
backup_dir/CURRENT_SNAPSHOT - just 4 bytes holding the latest snapshot
0, 1, 2, ... - files containing serialized version of each snapshot - containing a list of files
files/*.sst - sst files shared between snapshots - if one snapshot references 00010.sst and another one needs to backup it from the DB, it will just reference the same file
files/ 0/, 1/, 2/, ... - snapshot directories containing private snapshot files - current, manifest and log files
All the files are ref counted and deleted immediatelly when they get out of scope.
Some other stuff in this diff:
1. Added GetEnv() method to the DB. Discussed with @haobo and we agreed that it seems right thing to do.
2. Fixed StackableDB interface. The way it was set up before, I was not able to implement BackupableDB.
Test Plan:
I have a unittest, but please don't look at this yet. I just hacked it up to help me with debugging. I will write a lot of good tests and update the diff.
Also, `make asan_check`
Reviewers: dhruba, haobo, emayanke
Reviewed By: dhruba
CC: leveldb, haobo
Differential Revision: https://reviews.facebook.net/D14295
Branch detection did not work in Jenkins. I realized that it set
GIT_BRANCH env variable to point to the current branch, so let's try
using this for branch detection.
Summary:
This will help me a lot! When we hit an assertion in unittest, we get the whole stack trace now.
Also, changed stack trace a bit, we now include actual demangled C++ class::function symbols!
Test Plan: Added ASSERT_TRUE(false) to a test, observed a stack trace
Reviewers: haobo, dhruba, kailiu
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14499
Summary: When running regression tests on other branches, this will push values to entity rocksdb_build.$git_branch
Test Plan: Ran regression test on regression branch, observed values send to ODS in entity rocksdb_build.regression
Reviewers: kailiu
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14493
Summary: Now DBWithTTL takes DB* and can behave more like StackableDB. This saves us a lot of duplicate work by defining interfaces
Test Plan: ttl_test with ASAN - OK
Reviewers: emayanke
Reviewed By: emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14481