Summary: This one is big. It adds ability to write to and read from different column families (see the unit test). It also supports recovery of different column families from log, which was the hardest part to reason about. We need to make sure to never delete the log file which has unflushed data from any column family. To support that, I added another concept, which is versions_->MinLogNumber()
Test Plan: Added a unit test in column_family_test
Reviewers: dhruba, haobo, sdong, kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15537
Summary: This removes the default implementation of LogAndApply that applied the changed to the default column family by default. It is mostly simple reformatting.
Test Plan: make check
Reviewers: dhruba, kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15465
Summary: All memtables and immutable memtables are moved from DBImpl to ColumnFamilyData. For now, they are all referenced from default column family in DBImpl. It shouldn't be hard to get them from custom column family.
Test Plan: make check
Reviewers: dhruba, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15459
Summary:
@dhruba, I'm not sure where we need to sync the directory. I implemented the function in Env() and added the dir sync just after we close the newly created file in the builder.
Should I also add FsyncDir() to new files that get created by a compaction?
Test Plan: Confirmed that FsyncDir is returning Status::OK()
Reviewers: dhruba, haobo
Reviewed By: dhruba
CC: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D14751
Summary: There is no reason to have functions NeedCompaction(), MaxCompactionScore() and MaxCompactionScoreLevel() in VersionSet, since they don't access any data in VersionSet.
Test Plan: make check
Reviewers: kailiu, haobo, sdong
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15333
Summary:
MemTableListVersion is to MemTableList what Version is to VersionSet. I took almost the same ideas to develop MemTableListVersion. The reason is to have copying std::list done in background, while flushing, rather than in foreground (MultiGet() and NewIterator()) under a mutex! Also, whenever we copied MemTableList, we copied also some MemTableList metadata (flush_requested_, commit_in_progress_, etc.), which was wasteful.
This diff avoids std::list copy under a mutex in both MultiGet() and NewIterator(). I created a small database with some number of immutable memtables, and creating 100.000 iterators in a single-thread (!) decreased from {188739, 215703, 198028} to {154352, 164035, 159817}. A lot of the savings come from code under a mutex, so we should see much higher savings with multiple threads. Creating new iterator is very important to LogDevice team.
I also think this diff will make SuperVersion obsolete for performance reasons. I will try it in the next diff. SuperVersion gave us huge savings on Get() code path, but I think that most of the savings came from copying MemTableList under a mutex. If we had MemTableListVersion, we would never need to copy the entire object (like we still do in NewIterator() and MultiGet())
Test Plan: `make check` works. I will also do `make valgrind_check` before commit
Reviewers: dhruba, haobo, kailiu, sdong, emayanke, tnovak
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15255
Summary: as title
Test Plan:
make all check
What else tests shall I cover?
Reviewers: igor, haobo
CC:
Differential Revision: https://reviews.facebook.net/D15339
Summary:
This diff implements a special type of iterator that doesn't create a snapshot
(can be used to read newly inserted data) and is optimized for doing sequential
reads.
TailingIterator uses current superversion number to determine whether to
invalidate its internal iterators. If the version hasn't changed, it can often
avoid doing expensive seeks over immutable structures (sst files and immutable
memtables).
Test Plan:
* new unit tests
* running LD with this patch
Reviewers: igor, dhruba, haobo, sdong, kailiu
Reviewed By: sdong
CC: leveldb, lovro, march
Differential Revision: https://reviews.facebook.net/D15285
Summary:
I created a separate class ColumnFamilySet to keep track of column families. Before we did this in VersionSet and I believe this approach is cleaner.
Let me know if you have any comments. I will commit tomorrow.
Test Plan: make check
Reviewers: dhruba, haobo, kailiu, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15357
Summary:
This diff does two things:
* Rethinks how we call Recover() with read_only option. Before, we call it with pointer to memtable where we'd like to apply those changes to. This memtable is set in db_impl_readonly.cc and it's actually DBImpl::mem_. Why don't we just apply updates to mem_ right away? It seems more intuitive.
* Changes when we apply updates to manifest. Before, the process is to recover all the logs, flush it to sst files and then do one giant commit that atomically adds all recovered sst files and sets the next log number. This works good enough, but causes some small troubles for my column family approach, since I can't have one VersionEdit apply to more than single column family[1]. The change here is to commit the files recovered from logs right away. Here is the state of the world before the change:
1. Recover log 5, add new sst files to edit
2. Recover log 7, add new sst files to edit
3. Recover log 8, add new sst files to edit
4. Commit all added sst files to manifest and mark log files 5, 7 and 8 as recoverd (via SetLogNumber(9) function)
After the change, we'll do:
1. Recover log 5, commit the new sst files and set log 5 as recovered
2. Recover log 7, commit the new sst files and set log 7 as recovered
3. Recover log 8, commit the new sst files and set log 8 as recovered
The added (small) benefit is that if we fail after (2), the new recovery will only have to recover log 8. In previous case, we'll have to restart the recovery from the beginning. The bigger benefit will be to enable easier integration of multiple column families in Recovery code path.
[1] I'm happy to dicuss this decison, but I believe this is the cleanest way to go. It also makes backward compatibility much easier. We don't have a requirement of adding multiple column families atomically.
Test Plan: make check
Reviewers: dhruba, haobo, kailiu, sdong
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15237
Summary:
This moves the use of versions_ to before the mutex is unlocked
to avoid a possible race.
Task ID: #
Blame Rev:
Test Plan:
make check
Revert Plan:
Database Impact:
Memcache Impact:
Other Notes:
EImportant:
- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -
Reviewers: haobo, dhruba
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15279
Summary: I'm separating code-cleanup part of https://reviews.facebook.net/D14517. This will make D14517 easier to understand and this diff easier to review.
Test Plan: make check
Reviewers: haobo, kailiu, sdong, dhruba, tnovak
Reviewed By: tnovak
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15099
Summary:
This had a few bugs.
1) bottom and top were reversed. top is for the max value but the callers were passing the max
value to bottom. The result is that the max sleep is used when n >= bottom.
2) one of the callers passed values with type double and these values are frequently between
1.0 and 2.0 so rounding will do some bad things
3) sometimes the function returned 0 when there should be a stall
With this change and one other diff (out for review soon) there are slightly fewer stalls on one workload.
With the fix.
Stalls(secs): 160.166 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 58.495 leveln_slowdown
Stalls(count): 910261 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 54526 leveln_slowdown
Without the fix.
Stalls(secs): 172.227 level0_slowdown, 0.000 level0_numfiles, 0.000 memtable_compaction, 56.538 leveln_slowdown
Stalls(count): 160831 level0_slowdown, 0 level0_numfiles, 0 memtable_compaction, 52845 leveln_slowdown
Task ID: #
Blame Rev:
Test Plan:
run db_bench for --benchmarks=overwrite with IO-bound database
Revert Plan:
Database Impact:
Memcache Impact:
Other Notes:
EImportant:
- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -
Reviewers: haobo
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15243
Summary:
There were some functions in VersionSet that had no reason to be there instead of Version. Moving them to Version will make column families implementation easier.
The functions moved are:
* NumLevelBytes
* LevelSummary
* LevelFileSummary
* MaxNextLevelOverlappingBytes
* AddLiveFiles (previously AddLiveFilesCurrentVersion())
* NeedSlowdownForNumLevel0Files
The diff continues on (and depends on) D15171
Test Plan: make check
Reviewers: dhruba, haobo, kailiu, sdong, emayanke
Reviewed By: sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15183
Summary:
With column families VersionSet will not have a constant number of levels (each CF can have different options), so we'll need to eliminate call to VersionSet::NumberLevels()
This diff decreases number of callsites, but we're not there yet. It associates number of levels with Version (each version is associated with single CF) instead of VersionSet.
I have also slightly changed how VersionSet keeps track of manifest size.
This diff also modifies constructor of Compaction such that it takes input_version and automatically Ref()s it. Before this was done outside of constructor.
In next diffs I will continue to decrease number of callsites of VersionSet::NumberLevels() and also references to current_
Test Plan: make check
Reviewers: haobo, dhruba, kailiu, sdong
Reviewed By: sdong
Differential Revision: https://reviews.facebook.net/D15171
Summary: To reduce mutex contention caused by DBImpl.NewInternalIterator(), in this function, move all the iteration creation works out of mutex, only leaving object ref and get.
Test Plan:
make all check
will run db_stress for a while too to make sure no problem.
Reviewers: haobo, dhruba, kailiu
Reviewed By: haobo
CC: igor, leveldb
Differential Revision: https://reviews.facebook.net/D14589
Conflicts:
db/db_impl.cc
Summary:
When doing CompactRange(), we should first flush the memtable and then calculate max_level_with_files. Also, we want to compact all the levels that have files, including level `max_level_with_files`.
This patch fixed the unit test.
Test Plan: Added a failing unit test and a fix, so it's not failing anymore.
Reviewers: dhruba, haobo, sdong
Reviewed By: haobo
CC: leveldb, xjin
Differential Revision: https://reviews.facebook.net/D14421
Summary:
I will submit a sequence of diffs that are preparing master branch for column families. There are a lot of implicit assumptions in the code that are making column family implementation hard. If I make the change only in column family branch, it will make merging back to master impossible.
Most of the diffs will be simple code refactorings, so I hope we can have fast turnaround time. Feel free to grab me in person to discuss any of them.
This diff removes number of level check from VersionEdit. It is used only when VersionEdit is read, not written, but has to be set when it is written. I believe it is a right thing to make VersionEdit dumb and check consistency on the caller side. This will also make it much easier to implement Column Families, since different column families can have different number of levels.
Test Plan: make check
Reviewers: dhruba, haobo, sdong, kailiu
Reviewed By: kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15159
Summary: When building batch group, don't actually build a new batch since it requires heavy-weight mem copy and malloc. Only store references to the batches and build the batch group without lock held.
Test Plan:
`make check`
I am also planning to run performance tests. The workload that will benefit from this change is readwhilewriting. I will post the results once I have them.
Reviewers: dhruba, haobo, kailiu
Reviewed By: haobo
CC: leveldb, xjin
Differential Revision: https://reviews.facebook.net/D15063
Summary: We use SanitizeOptions() to set appropriate values for some options, based on other options. So we should use the sanitized options by default. Luckily it hasn't caused a bug yet, but can result in a bug in the fugture.
Test Plan: make check
Reviewers: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14103
Summary:
In one of CPU profiles, we see some CPU costs of string::reserve() inside Batch.Put(). This patch should be able to reduce some of the costs by allocating sufficient buffer before hand.
Since it is a trivial percentage of CPU costs, I didn't find a way to show the improvement in one of the benchmarks. I'll deploy it to same application and do the same CPU profiling to make sure those CPU costs are reduced.
Test Plan: make all check
Reviewers: haobo, kailiu, igor
Reviewed By: haobo
CC: leveldb, nkg-
Differential Revision: https://reviews.facebook.net/D15135
Summary: Currently in DBImpl::MakeRoomForWrite(), we do "versions_->NumLevelFiles(0) >= options_.level0_slowdown_writes_trigger" to check whether the writer thread needs to slow down. However, versions_->NumLevelFiles(0) is slightly more expensive than we expected. By caching the result of the comparison when installing a new version, we can avoid this function call every time.
Test Plan:
make all check
Manually trigger this behavior by applying universal compaction style and make sure inserts are made slow after there are certain number of files.
Reviewers: haobo, kailiu, igor
Reviewed By: kailiu
CC: nkg-, leveldb
Differential Revision: https://reviews.facebook.net/D15141
Summary:
In one of CPU profiles, we see some CPU costs of string::reserve() inside Batch.Put(). This patch should be able to reduce some of the costs by allocating sufficient buffer before hand.
Since it is a trivial percentage of CPU costs, I didn't find a way to show the improvement in one of the benchmarks. I'll deploy it to same application and do the same CPU profiling to make sure those CPU costs are reduced.
Test Plan: make all check
Reviewers: haobo, kailiu, igor
Reviewed By: haobo
CC: leveldb, nkg-
Differential Revision: https://reviews.facebook.net/D15135
Summary:
The biggest change here is getting rid of current_ Version and adding a column_family_data->current Version to each column family.
I have also fixed some smaller things in VersionSet that made it easier to implement Column family support.
Test Plan: make check
Reviewers: dhruba, haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15105
Summary:
In some use cases, table readers for all live files should always be cached. In that case, there will be an opportunity to avoid the table cache look-up while Get() and NewIterator().
We define options.max_open_files = -1 to be the mode that table readers for live files will always be kept. In that mode, table readers are cached in FileMetaData (with a reference count hold in table cache). So that when executing table_cache.Get() and table_cache.newInterator(), LRU cache checking can be by-passed, to reduce latency.
Test Plan: add a test case in db_test
Reviewers: haobo, kailiu
Reviewed By: haobo
CC: dhruba, igor, leveldb
Differential Revision: https://reviews.facebook.net/D15039
Summary: Currently, even if statistics is not enabled, StopWatch only for the stats still gets the time of the day, which is wasteful. This patch adds a new option to StopWatch to disable this get in this case.
Test Plan: make all check
Reviewers: dhruba, haobo, igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14703
Conflicts:
db/db_impl.cc
Summary: Currently, even if statistics is not enabled, StopWatch only for the stats still gets the time of the day, which is wasteful. This patch adds a new option to StopWatch to disable this get in this case.
Test Plan: make all check
Reviewers: dhruba, haobo, igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14703
Summary:
I have added three new value types:
* kTypeColumnFamilyDeletion
* kTypeColumnFamilyValue
* kTypeColumnFamilyMerge
which include column family Varint32 before the data (value, deletion and merge). These values are used only in WAL (not in memtables yet).
This endeavour required changing some WriteBatch internals.
Test Plan: Added a unittest
Reviewers: dhruba, haobo, sdong, kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15045
Summary:
Code was always compressing L0 files written by a memtable flush
when compression was enabled. Now this is done when
min_level_to_compress=0 for leveled compaction and when
universal_compaction_size_percent=-1 for universal compaction.
Task ID: #3416472
Blame Rev:
Test Plan:
ran db_bench with compression options
Revert Plan:
Database Impact:
Memcache Impact:
Other Notes:
EImportant:
- begin *PUBLIC* platform impact section -
Bugzilla: #
- end platform impact -
Reviewers: dhruba, igor, sdong
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14757
Summary:
In addition to implementing OpenWithColumnFamilies, this diff also includes some minor changes:
* Changed all column family names from Slice() to std::string. The performance of column family name handling is not critical, and it's more convenient and cleaner to have names as std::strings
* Implemented ColumnFamilyOptions(const Options&) and DBOptions(const Options&)
* Added ColumnFamilyOptions to VersionSet::ColumnFamilyData. ColumnFamilyOptions are specified on OpenWithColumnFamilies() and CreateColumnFamily()
I will keep the diff in the Phabricator for a day or two and will push to the branch then. Feel free to comment even after the diff has been pushed.
Test Plan: Added a simple unit test
Reviewers: dhruba, haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15033
Summary:
The way DBImpl::TEST_CompactRange() throttles down the number of bg compactions
can cause it to deadlock when CompactRange() is called concurrently from
multiple threads. Imagine a following scenario with only two threads
(max_background_compactions is 10 and bg_compaction_scheduled_ is initially 0):
1. Thread #1 increments bg_compaction_scheduled_ (to LargeNumber), sets
bg_compaction_scheduled_ to 9 (newvalue), schedules the compaction
(bg_compaction_scheduled_ is now 10) and waits for it to complete.
2. Thread #2 calls TEST_CompactRange(), increments bg_compaction_scheduled_
(now LargeNumber + 10) and waits on a cv for bg_compaction_scheduled_ to
drop to LargeNumber.
3. BG thread completes the first manual compaction, decrements
bg_compaction_scheduled_ and wakes up all threads waiting on bg_cv_.
Thread #1 runs, increments bg_compaction_scheduled_ by LargeNumber again
(now 2*LargeNumber + 9). Since that's more than LargeNumber + newvalue,
thread #2 also goes to sleep (waiting on bg_cv_), without resetting
bg_compaction_scheduled_.
This diff attempts to address the problem by introducing a new counter
bg_manual_only_ (when positive, MaybeScheduleFlushOrCompaction() will only
schedule manual compactions).
Test Plan:
I could pretty much consistently reproduce the deadlock with a program that
calls CompactRange(nullptr, nullptr) immediately after Write() from multiple
threads. This no longer happens with this patch.
Tests (make check) pass.
Reviewers: dhruba, igor, sdong, haobo
Reviewed By: igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14799
Summary:
This diff provides basic implementations of CreateColumnFamily(), DropColumnFamily() and ListColumnFamilies(). It builds on top of https://reviews.facebook.net/D14733
It also includes a bug fix for DBImplReadOnly, where Get implementation would be redirected to DBImpl instead of DBImplReadOnly.
Test Plan: Added unit test
Reviewers: dhruba, haobo, kailiu
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15021
Summary: this diff only replace the cases when we need to frequently create vector with small amount of entries. This diff doesn't aim to improve performance of a specific area, but more like a small scale test for the autovector and see how it works in real life.
Test Plan:
make check
I also ran the performance tests, however there is no performance gain/loss. All performance numbers are pretty much the same before/after the change.
Reviewers: dhruba, haobo, sdong, igor
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14985
Summary: I made some cleanup while reading the source code in `db`. Most changes are about style, naming or C++ 11 new features.
Test Plan: ran `make check`
Reviewers: haobo, dhruba, sdong
CC: leveldb
Differential Revision: https://reviews.facebook.net/D15009
Summary:
We don't want two threads to clash if they concurrently call DisableFileDeletions() and EnableFileDeletions(). I'm adding a counter that will enable file deletions only after all DisableFileDeletions() calls have been negated with EnableFileDeletions().
However, we also don't want to break the old behavior, so I added a parameter force to EnableFileDeletions(). If force is true, we will still enable file deletions after every call to EnableFileDeletions(), which is what is happening now.
Test Plan: make check
Reviewers: dhruba, haobo, sanketh
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14781
Summary:
Instead of locking and saving a DB state, we can cache a DB state and update it only when it changes. This change reduces lock contention and speeds up read operations on the DB.
Performance improvements are substantial, although there is some cost in no-read workloads. I ran the regression tests on my devserver and here are the numbers:
overwrite 56345 -> 63001
fillseq 193730 -> 185296
readrandom 771301 -> 1219803 (58% improvement!)
readrandom_smallblockcache 677609 -> 862850
readrandom_memtable_sst 710440 -> 1109223
readrandom_fillunique_random 221589 -> 247869
memtablefillrandom 105286 -> 92643
memtablereadrandom 763033 -> 1288862
Test Plan:
make asan_check
I am also running db_stress
Reviewers: dhruba, haobo, sdong, kailiu
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14679
Summary:
<This diff is for Column Family branch>
Sharing some of the work I've done so far. This diff compiles and passes the tests.
The biggest change is in options.h - I broke down Options into two parts - DBOptions and ColumnFamilyOptions. DBOptions is DB-specific (env, create_if_missing, block_cache, etc.) and ColumnFamilyOptions is column family-specific (all compaction options, compresion options, etc.). Note that this does not break backwards compatibility at all.
Further, I created DBWithColumnFamily which inherits DB interface and adds new functions with column family support. Clients can transparently switch to DBWithColumnFamily and it will not break their backwards compatibility.
There are few methods worth checking out: ListColumnFamilies(), MultiNewIterator(), MultiGet() and GetSnapshot(). [GetSnapshot() returns the snapshot across all column families for now - I think that's what we agreed on]
Finally, I made small changes to WriteBatch so we are able to atomically insert data across column families.
Please provide feedback.
Test Plan: make check works, the code is backward compatible
Reviewers: dhruba, haobo, sdong, kailiu, emayanke
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14445
Summary: It seems to be a decision tradeoff in current codes: we make a malloc for every Get() to reduce one malloc for a flush inside mutex. It takes about 5% of CPU time in readrandom tests. We might consider the tradeoff to be the other way around.
Test Plan: make all check
Reviewers: dhruba, haobo, igor
Reviewed By: dhruba
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14697
Summary: To reduce mutex contention caused by DBImpl.NewInternalIterator(), in this function, move all the iteration creation works out of mutex, only leaving object ref and get.
Test Plan:
make all check
will run db_stress for a while too to make sure no problem.
Reviewers: haobo, dhruba, kailiu
Reviewed By: haobo
CC: igor, leveldb
Differential Revision: https://reviews.facebook.net/D14589
Summary: When deconstructing an iterator, no need to check obsolete file if it doesn't hold last reference of any version.
Test Plan: make all check
Reviewers: haobo, igor, dhruba, kailiu
Reviewed By: haobo
CC: leveldb
Differential Revision: https://reviews.facebook.net/D14595
Summary: In get operations, merge_operands is only used in few cases. Lazily initialize it can reduce average latency in some cases
Test Plan: make all check
Reviewers: haobo, kailiu, dhruba
Reviewed By: haobo
CC: igor, nkg-, leveldb
Differential Revision: https://reviews.facebook.net/D14415
Conflicts:
db/db_impl.cc
db/memtable.cc
Summary:
creating new iterators of mem tables can be expensive. Move them out of mutex.
DBImpl::WriteLevel0Table()'s mems seems to be a local vector and is only used by flushing. memtables to flush are also immutable, so it should be safe to do so.
Test Plan: make all check
Reviewers: haobo, dhruba, kailiu
Reviewed By: dhruba
CC: igor, leveldb
Differential Revision: https://reviews.facebook.net/D14577
Conflicts:
db/db_impl.cc
Summary:
creating new iterators of mem tables can be expensive. Move them out of mutex.
DBImpl::WriteLevel0Table()'s mems seems to be a local vector and is only used by flushing. memtables to flush are also immutable, so it should be safe to do so.
Test Plan: make all check
Reviewers: haobo, dhruba, kailiu
Reviewed By: dhruba
CC: igor, leveldb
Differential Revision: https://reviews.facebook.net/D14577