Summary:
Instead of locking and saving a DB state, we can cache a DB state and update it only when it changes. This change reduces lock contention and speeds up read operations on the DB.
Performance improvements are substantial, although there is some cost in no-read workloads. I ran the regression tests on my devserver and here are the numbers:
overwrite 56345 -> 63001
fillseq 193730 -> 185296
readrandom 771301 -> 1219803 (58% improvement!)
readrandom_smallblockcache 677609 -> 862850
readrandom_memtable_sst 710440 -> 1109223
readrandom_fillunique_random 221589 -> 247869
memtablefillrandom 105286 -> 92643
memtablereadrandom 763033 -> 1288862
Test Plan:
make asan_check
I am also running db_stress
Reviewers: dhruba, haobo, sdong, kailiu
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14679
Summary:
For some tests I want to cache the database prior to running other tests on the same invocation
of db_bench. The readtocache test ignores --threads and --reads so those can be used by other tests
and it will still do a full read of --num rows with one thread. It might be invoked like:
db_bench --benchmarks=readtocache,readrandom --reads 100 --num 10000 --threads 8
Task ID: #
Blame Rev:
Test Plan:
run db_bench
Revert Plan:
Database Impact:
Memcache Impact:
Other Notes:
EImportant:
- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -
Reviewers: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14739
Summary:
<This diff is for Column Family branch>
Sharing some of the work I've done so far. This diff compiles and passes the tests.
The biggest change is in options.h - I broke down Options into two parts - DBOptions and ColumnFamilyOptions. DBOptions is DB-specific (env, create_if_missing, block_cache, etc.) and ColumnFamilyOptions is column family-specific (all compaction options, compresion options, etc.). Note that this does not break backwards compatibility at all.
Further, I created DBWithColumnFamily which inherits DB interface and adds new functions with column family support. Clients can transparently switch to DBWithColumnFamily and it will not break their backwards compatibility.
There are few methods worth checking out: ListColumnFamilies(), MultiNewIterator(), MultiGet() and GetSnapshot(). [GetSnapshot() returns the snapshot across all column families for now - I think that's what we agreed on]
Finally, I made small changes to WriteBatch so we are able to atomically insert data across column families.
Please provide feedback.
Test Plan: make check works, the code is backward compatible
Reviewers: dhruba, haobo, sdong, kailiu, emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14445
Summary: I realized that "D14409 Avoid sorting in Version::Get() by presorting them in VersionSet::Builder::SaveTo()" is not done in an optimized place. SaveTo() is usually inside mutex. Move it to Finalize(), which is called out of mutex.
Test Plan: make all check
Reviewers: dhruba, haobo, kailiu
Reviewed By: dhruba
CC: igor, leveldb
Differential Revision: https://reviews.facebook.net/D14607
Summary: It seems to be a decision tradeoff in current codes: we make a malloc for every Get() to reduce one malloc for a flush inside mutex. It takes about 5% of CPU time in readrandom tests. We might consider the tradeoff to be the other way around.
Test Plan: make all check
Reviewers: dhruba, haobo, igor
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14697
Summary:
This is the last diff that adds the property block to plain table.
The format resembles that of the block-based table: https://github.com/facebook/rocksdb/wiki/Rocksdb-table-format
[data block]
[meta block 1: stats block]
[meta block 2: future extended block]
...
[meta block K: future extended block] (we may add more meta blocks in the future)
[metaindex block]
[index block: we only have the placeholder here, we can add persistent index block in the future]
[Footer: contains magic number, handle to metaindex block and index block]
<end_of_file>
Test Plan: extended existing property block test.
Reviewers: haobo, sdong, dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14523
Summary: To reduce mutex contention caused by DBImpl.NewInternalIterator(), in this function, move all the iteration creation works out of mutex, only leaving object ref and get.
Test Plan:
make all check
will run db_stress for a while too to make sure no problem.
Reviewers: haobo, dhruba, kailiu
Reviewed By: haobo
CC: igor, leveldb
Differential Revision: https://reviews.facebook.net/D14589
Summary: When deconstructing an iterator, no need to check obsolete file if it doesn't hold last reference of any version.
Test Plan: make all check
Reviewers: haobo, igor, dhruba, kailiu
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14595
Summary: In get operations, merge_operands is only used in few cases. Lazily initialize it can reduce average latency in some cases
Test Plan: make all check
Reviewers: haobo, kailiu, dhruba
Reviewed By: haobo
CC: igor, nkg-, leveldb
Differential Revision: https://reviews.facebook.net/D14415
Conflicts:
db/db_impl.cc
db/memtable.cc
Summary: Pre-sort files in VersionSet::Builder::SaveTo() so that when getting the value, no need to sort them. It can avoid the costs of vector operations and sorting in Version::Get().
Test Plan: make all check
Reviewers: haobo, kailiu, dhruba
Reviewed By: dhruba
CC: nkg-, igor, leveldb
Differential Revision: https://reviews.facebook.net/D14409
Summary: Pre-sort files in VersionSet::Builder::SaveTo() so that when getting the value, no need to sort them. It can avoid the costs of vector operations and sorting in Version::Get().
Test Plan: make all check
Reviewers: haobo, kailiu, dhruba
Reviewed By: dhruba
CC: nkg-, igor, leveldb
Differential Revision: https://reviews.facebook.net/D14409
Summary:
creating new iterators of mem tables can be expensive. Move them out of mutex.
DBImpl::WriteLevel0Table()'s mems seems to be a local vector and is only used by flushing. memtables to flush are also immutable, so it should be safe to do so.
Test Plan: make all check
Reviewers: haobo, dhruba, kailiu
Reviewed By: dhruba
CC: igor, leveldb
Differential Revision: https://reviews.facebook.net/D14577
Conflicts:
db/db_impl.cc
Summary:
creating new iterators of mem tables can be expensive. Move them out of mutex.
DBImpl::WriteLevel0Table()'s mems seems to be a local vector and is only used by flushing. memtables to flush are also immutable, so it should be safe to do so.
Test Plan: make all check
Reviewers: haobo, dhruba, kailiu
Reviewed By: dhruba
CC: igor, leveldb
Differential Revision: https://reviews.facebook.net/D14577
Summary: as title
Test Plan: dynamic_bloom_test
Reviewers: dhruba, sdong, kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14385
Summary: So fflush() takes a lock which is heavyweight. I added flush_pending_, but more importantly, I removed LogFlush() from foreground threads.
Test Plan: ./db_test
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14535
Summary:
In this diff I present you BackupableDB v1. You can easily use it to backup your DB and it will do incremental snapshots for you.
Let's first describe how you would use BackupableDB. It's inheriting StackableDB interface so you can easily construct it with your DB object -- it will add a method RollTheSnapshot() to the DB object. When you call RollTheSnapshot(), current snapshot of the DB will be stored in the backup dir. To restore, you can just call RestoreDBFromBackup() on a BackupableDB (which is a static method) and it will restore all files from the backup dir. In the next version, it will even support automatic backuping every X minutes.
There are multiple things you can configure:
1. backup_env and db_env can be different, which is awesome because then you can easily backup to HDFS or wherever you feel like.
2. sync - if true, it *guarantees* backup consistency on machine reboot
3. number of snapshots to keep - this will keep last N snapshots around if you want, for some reason, be able to restore from an earlier snapshot. All the backuping is done in incremental fashion - if we already have 00010.sst, we will not copy it again. *IMPORTANT* -- This is based on assumption that 00010.sst never changes - two files named 00010.sst from the same DB will always be exactly the same. Is this true? I always copy manifest, current and log files.
4. You can decide if you want to flush the memtables before you backup, or you're fine with backing up the log files -- either way, you get a complete and consistent view of the database at a time of backup.
5. More things you can find in BackupableDBOptions
Here is the directory structure I use:
backup_dir/CURRENT_SNAPSHOT - just 4 bytes holding the latest snapshot
0, 1, 2, ... - files containing serialized version of each snapshot - containing a list of files
files/*.sst - sst files shared between snapshots - if one snapshot references 00010.sst and another one needs to backup it from the DB, it will just reference the same file
files/ 0/, 1/, 2/, ... - snapshot directories containing private snapshot files - current, manifest and log files
All the files are ref counted and deleted immediatelly when they get out of scope.
Some other stuff in this diff:
1. Added GetEnv() method to the DB. Discussed with @haobo and we agreed that it seems right thing to do.
2. Fixed StackableDB interface. The way it was set up before, I was not able to implement BackupableDB.
Test Plan:
I have a unittest, but please don't look at this yet. I just hacked it up to help me with debugging. I will write a lot of good tests and update the diff.
Also, `make asan_check`
Reviewers: dhruba, haobo, emayanke
Reviewed By: dhruba
CC: leveldb, haobo
Differential Revision: https://reviews.facebook.net/D14295
Summary: In get operations, merge_operands is only used in few cases. Lazily initialize it can reduce average latency in some cases
Test Plan: make all check
Reviewers: haobo, kailiu, dhruba
Reviewed By: haobo
CC: igor, nkg-, leveldb
Differential Revision: https://reviews.facebook.net/D14415
Summary: This change will allow other table to reuse the code for meta blocks.
Test Plan: all existing unit tests passed
Reviewers: dhruba, haobo, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14475
Summary: As title
Test Plan: make clean and make
Reviewers: igor
Reviewed By: igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14469
Summary: This would enable rocksdb users to get the db identity without depending on implementation details(storing that in IDENTITY file)
Test Plan: db/db_test (has identity checks)
Reviewers: dhruba, haobo, igor, kailiu
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14463
Summary:
This adds 2 options for compression to db_bench:
* universal_compression_size_percent
* compression_level - to set zlib compression level
It also logs compression_size_percent at startup in LOG
Task ID: #
Blame Rev:
Test Plan:
make check, run db_bench
Revert Plan:
Database Impact:
Memcache Impact:
Other Notes:
EImportant:
- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -
Reviewers: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14439
Summary:
Let's get rid of TransformRep and it's children. We have confirmed that HashSkipListRep works better with multifeed, so there is no benefit to keeping this around.
This diff is mostly just deleting references to obsoleted functions. I also have a diff for fbcode that we'll need to push when we switch to new release.
I had to expose HashSkipListRepFactory in the client header files because db_impl.cc needs access to GetTransform() function for SanitizeOptions.
Test Plan: make check
Reviewers: dhruba, haobo, kailiu, sdong
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14397
Summary:
I went through all remaining shared_ptrs and removed the ones that I found not-necessary. Only GenerateCachePrefix() is called fairly often, so don't expect much perf wins.
The ones that are left are accessed infrequently and I think we're fine with keeping them.
Test Plan: make asan_check
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14427
Summary:
The commit at 27bbef1180 had a memory leak
that was detected by valgrind. The memtable that has a refcount decrement
in MemTableList::InstallMemtableFlushResults was not freed.
Test Plan: valgrind ./db_test --leak-check=full
Reviewers: igor
Reviewed By: igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14391
Summary: These tests fail if compression libraries are not installed.
Test Plan: Manually disabled snappy, observed tests not ran.
Reviewers: dhruba, kailiu
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14379
Summary: As title. Especially, HashSkipListRepFactory will be able to specify a relatively small height, to reduce the memory overhead of one skiplist per bucket.
Test Plan: make check and test it on leaf4
Reviewers: dhruba, sdong, kailiu
CC: reconnect.grayhat, leveldb
Differential Revision: https://reviews.facebook.net/D14307
Summary:
This code path can potentially accumulate multiple important_files for level 0.
But for other levels, it should have only one file in the
important_files, so it is ok not to reserve excessive space, is it not?
Test Plan: make check
Reviewers: haobo
Reviewed By: haobo
CC: reconnect.grayhat, leveldb
Differential Revision: https://reviews.facebook.net/D14349
Summary:
Large memory allocations and frees are costly and best done outside the
db-mutex. The memtables are already allocated outside the db-mutex but
they were being freed while holding the db-mutex.
This patch frees obsolete memtables outside the db-mutex.
Test Plan:
make check
db_stress
Unit tests pass, I am in the process of running stress tests.
Reviewers: haobo, igor, emayanke
Reviewed By: haobo
CC: reconnect.grayhat, leveldb
Differential Revision: https://reviews.facebook.net/D14319
Summary: We need access to options for BackupableDB
Test Plan: make check
Reviewers: dhruba
Reviewed By: dhruba
CC: leveldb, reconnect.grayhat
Differential Revision: https://reviews.facebook.net/D14331
Summary: This is part of https://reviews.facebook.net/D14295 -- smaller diff that is easier to review
Test Plan: make asan_check
Reviewers: dhruba, haobo, emayanke
Reviewed By: emayanke
CC: leveldb, kailiu, reconnect.grayhat
Differential Revision: https://reviews.facebook.net/D14301
Summary:
All filesystem Io should be done outside the dbmutex. There was one place
when we have to roll the transaction log that we were creating the new log file
while holding the dbmutex.
I rearranged this code so that the act of creating the new transaction log
file is done without holding the dbmutex. I also allocate the new memtable
outside the dbmutex, this is important because creating the memtable
could be heavyweight.
Test Plan: make check and dbstress
Reviewers: haobo, igor
Reviewed By: haobo
CC: leveldb, reconnect.grayhat
Differential Revision: https://reviews.facebook.net/D14283
Summary: liveness of the statistics object is already ensured by the shared pointer in DB options. There's no reason to pass again shared pointer among internal functions. Raw pointer is sufficient and efficient.
Test Plan: make check
Reviewers: dhruba, MarkCallaghan, igor
Reviewed By: dhruba
CC: leveldb, reconnect.grayhat
Differential Revision: https://reviews.facebook.net/D14289
Summary:
Provide a framework to profile a query in detail to figure out latency bottleneck. Currently, in Get(), Put() and iterators, 2-3 simple timing is used. We can easily add more profile counters to the framework later.
Test Plan: Enable this profiling in seveal existing tests.
Reviewers: haobo, dhruba, kailiu, emayanke, vamsi, igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14001
Conflicts:
table/merger.cc
Summary:
Provide a framework to profile a query in detail to figure out latency bottleneck. Currently, in Get(), Put() and iterators, 2-3 simple timing is used. We can easily add more profile counters to the framework later.
Test Plan: Enable this profiling in seveal existing tests.
Reviewers: haobo, dhruba, kailiu, emayanke, vamsi, igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14001
Summary:
For prefix mem tables, encoding mem table key may be unnecessary if the prefix doesn't have any key. This patch is a little bit hacky but I want to try out the performance gain of removing this lazy initialization.
In longer term, we might want to revisit the way we abstract mem tables implementations.
Test Plan: make all check
Reviewers: haobo, igor, kailiu
Reviewed By: igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14265
Summary:
A Simple plain table format. No block structure. When creating the table reader, scanning the full table to create indexes.
Test Plan:Add unit test
Reviewers:haobo,dhruba,kailiu
CC:
Task ID: #
Blame Rev: