Migrate the RocksDB Worpdress blog over to Jekyll

Summary:
Tried to:

- preserve existing links
- move existing images over (there were 2)
- preserve codeblocks (modified where apprporiate)
- etc.

Also as agreed upon:

- All blog posts are preserved.
- Comments are not preserved.
- Not turning on comments for future blog posts (use the FB developer group instead).
- Like button at the end of the blog post.

Depends on https://reviews.facebook.net/D63051

Test Plan: Visual

Reviewers: IslamAbdelRahman, lgalanis

Reviewed By: lgalanis

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D63105
main
Joel Marcey 8 years ago
parent ee0e2201e0
commit 3c2262400f
  1. 2
      docs/_includes/post.html
  2. 133
      docs/_posts/2014-03-27-how-to-backup-rocksdb.markdown
  3. 50
      docs/_posts/2014-03-27-how-to-persist-in-memory-rocksdb-database.markdown
  4. 39
      docs/_posts/2014-04-02-the-1st-rocksdb-local-meetup-held-on-march-27-2014.markdown
  5. 37
      docs/_posts/2014-04-07-rocksdb-2-8-release.markdown
  6. 24
      docs/_posts/2014-04-21-indexing-sst-files-for-better-lookup-performance.markdown
  7. 52
      docs/_posts/2014-05-14-lock.markdown
  8. 25
      docs/_posts/2014-05-19-rocksdb-3-0-release.markdown
  9. 22
      docs/_posts/2014-05-22-rocksdb-3-1-release.markdown
  10. 37
      docs/_posts/2014-06-23-plaintable-a-new-file-format.markdown
  11. 88
      docs/_posts/2014-06-27-avoid-expensive-locks-in-get.markdown
  12. 30
      docs/_posts/2014-06-27-rocksdb-3-2-release.markdown
  13. 34
      docs/_posts/2014-07-29-rocksdb-3-3-release.markdown
  14. 132
      docs/_posts/2014-09-12-cuckoo.markdown
  15. 105
      docs/_posts/2014-09-12-new-bloom-filter-format.markdown
  16. 48
      docs/_posts/2014-09-15-rocksdb-3-5-release.markdown
  17. 108
      docs/_posts/2015-01-16-migrating-from-leveldb-to-rocksdb-2.markdown
  18. 37
      docs/_posts/2015-02-24-reading-rocksdb-options-from-a-file.markdown
  19. 16
      docs/_posts/2015-02-27-write-batch-with-index.markdown
  20. 12
      docs/_posts/2015-04-22-integrating-rocksdb-with-mongodb-2.markdown
  21. 8
      docs/_posts/2015-06-12-rocksdb-in-osquery.markdown
  22. 81
      docs/_posts/2015-07-15-rocksdb-2015-h2-roadmap.markdown
  23. 74
      docs/_posts/2015-07-17-spatial-indexing-in-rocksdb.markdown
  24. 12
      docs/_posts/2015-07-22-rocksdb-is-now-available-in-windows-platform.markdown
  25. 36
      docs/_posts/2015-07-23-dynamic-level.markdown
  26. 189
      docs/_posts/2015-10-27-getthreadlist.markdown
  27. 43
      docs/_posts/2015-11-10-use-checkpoints-for-efficient-snapshots.markdown
  28. 148
      docs/_posts/2015-11-16-analysis-file-read-latency-by-level.markdown
  29. 41
      docs/_posts/2016-01-29-compaction_pri.markdown
  30. 51
      docs/_posts/2016-02-24-rocksdb-4-2-release.markdown
  31. 18
      docs/_posts/2016-02-25-rocksdb-ama.markdown
  32. 26
      docs/_posts/2016-03-07-rocksdb-options-file.markdown
  33. 12
      docs/_posts/2016-04-07-blog-post-example.md
  34. 56
      docs/_posts/2016-04-26-rocksdb-4-5-1-released.markdown
  35. 44
      docs/_posts/2016-07-26-rocksdb-4-8-released.markdown
  36. BIN
      docs/static/images/Resize-of-20140327_200754-300x225.jpg
  37. BIN
      docs/static/images/tree_example1.png

@ -29,7 +29,7 @@
{{ content }}
{% endif %}
{% unless include.truncate %}
{% include plugins/all_share.html %}
{% include plugins/like_button.html %}
{% endunless %}
</article>
</div>

@ -0,0 +1,133 @@
---
title: How to backup RocksDB?
layout: post
author: icanadi
category: blog
---
In RocksDB, we have implemented an easy way to backup your DB. Here is a simple example:
#include "rocksdb/db.h"
#include "utilities/backupable_db.h"
using namespace rocksdb;
DB* db;
DB::Open(Options(), "/tmp/rocksdb", &db);
BackupableDB* backupable_db = new BackupableDB(db, BackupableDBOptions("/tmp/rocksdb_backup"));
backupable_db->Put(...); // do your thing
backupable_db->CreateNewBackup();
delete backupable_db; // no need to also delete db
This simple example will create a backup of your DB in "/tmp/rocksdb_backup". Creating new BackupableDB consumes DB* and you should be calling all the DB methods on object `backupable_db` going forward.
Restoring is also easy:
RestoreBackupableDB* restore = new RestoreBackupableDB(Env::Default(), BackupableDBOptions("/tmp/rocksdb_backup"));
restore->RestoreDBFromLatestBackup("/tmp/rocksdb", "/tmp/rocksdb");
delete restore;
This code will restore the backup back to "/tmp/rocksdb". The second parameter is the location of log files (In some DBs they are different from DB directory, but usually they are the same. See Options::wal_dir for more info).
An alternative API for backups is to use BackupEngine directly:
#include "rocksdb/db.h"
#include "utilities/backupable_db.h"
using namespace rocksdb;
DB* db;
DB::Open(Options(), "/tmp/rocksdb", &db);
db->Put(...); // do your thing
BackupEngine* backup_engine = BackupEngine::NewBackupEngine(Env::Default(), BackupableDBOptions("/tmp/rocksdb_backup"));
backup_engine->CreateNewBackup(db);
delete db;
delete backup_engine;
Restoring with BackupEngine is similar to RestoreBackupableDB:
BackupEngine* backup_engine = BackupEngine::NewBackupEngine(Env::Default(), BackupableDBOptions("/tmp/rocksdb_backup"));
backup_engine->RestoreDBFromLatestBackup("/tmp/rocksdb", "/tmp/rocksdb");
delete backup_engine;
Backups are incremental. You can create a new backup with `CreateNewBackup()` and only the new data will be copied to backup directory (for more details on what gets copied, see "Under the hood"). Checksum is always calculated for any backuped file (including sst, log, and etc). It is used to make sure files are kept sound in the file system. Checksum is also verified for files from the previous backups even though they do not need to be copied. A checksum mismatch aborts the current backup (see "Under the hood" for more details). Once you have more backups saved, you can issue `GetBackupInfo()` call to get a list of all backups together with information on timestamp of the backup and the size (please note that sum of all backups' sizes is bigger than the actual size of the backup directory because some data is shared by multiple backups). Backups are identified by their always-increasing IDs. `GetBackupInfo()` is available both in `BackupableDB` and `RestoreBackupableDB`.
You probably want to keep around only small number of backups. To delete old backups, just call `PurgeOldBackups(N)`, where N is how many backups you'd like to keep. All backups except the N newest ones will be deleted. You can also choose to delete arbitrary backup with call `DeleteBackup(id)`.
`RestoreDBFromLatestBackup()` will restore the DB from the latest consistent backup. An alternative is `RestoreDBFromBackup()` which takes a backup ID and restores that particular backup. Checksum is calculated for any restored file and compared against the one stored during the backup time. If a checksum mismatch is detected, the restore process is aborted and `Status::Corruption` is returned. Very important thing to note here: Let's say you have backups 1, 2, 3, 4. If you restore from backup 2 and start writing more data to your database, newly created backup will delete old backups 3 and 4 and create new backup 3 on top of 2.
## Advanced usage
Let's say you want to backup your DB to HDFS. There is an option in `BackupableDBOptions` to set `backup_env`, which will be used for all file I/O related to backup dir (writes when backuping, reads when restoring). If you set it to HDFS Env, all the backups will be stored in HDFS.
`BackupableDBOptions::info_log` is a Logger object that is used to print out LOG messages if not-nullptr.
If `BackupableDBOptions::sync` is true, we will sync data to disk after every file write, guaranteeing that backups will be consistent after a reboot or if machine crashes. Setting it to false will speed things up a bit, but some (newer) backups might be inconsistent. In most cases, everything should be fine, though.
If you set `BackupableDBOptions::destroy_old_data` to true, creating new `BackupableDB` will delete all the old backups in the backup directory.
`BackupableDB::CreateNewBackup()` method takes a parameter `flush_before_backup`, which is false by default. When `flush_before_backup` is true, `BackupableDB` will first issue a memtable flush and only then copy the DB files to the backup directory. Doing so will prevent log files from being copied to the backup directory (since flush will delete them). If `flush_before_backup` is false, backup will not issue flush before starting the backup. In that case, the backup will also include log files corresponding to live memtables. Backup will be consistent with current state of the database regardless of `flush_before_backup` parameter.
## Under the hood
`BackupableDB` implements `DB` interface and adds four methods to it: `CreateNewBackup()`, `GetBackupInfo()`, `PurgeOldBackups()`, `DeleteBackup()`. Any `DB` interface calls will get forwarded to underlying `DB` object.
When you call `BackupableDB::CreateNewBackup()`, it does the following:
1. Disable file deletions
2. Get live files (this includes table files, current and manifest file).
3. Copy live files to the backup directory. Since table files are immutable and filenames unique, we don't copy a table file that is already present in the backup directory. For example, if there is a file `00050.sst` already backed up and `GetLiveFiles()` returns `00050.sst`, we will not copy that file to the backup directory. However, checksum is calculated for all files regardless if a file needs to be copied or not. If a file is already present, the calculated checksum is compared against previously calculated checksum to make sure nothing crazy happened between backups. If a mismatch is detected, backup is aborted and the system is restored back to the state before `BackupableDB::CreateNewBackup()` is called. One thing to note is that a backup abortion could mean a corruption from a file in backup directory or the corresponding live file in current DB. Both manifest and current files are copied, since they are not immutable.
4. If `flush_before_backup` was set to false, we also need to copy log files to the backup directory. We call `GetSortedWalFiles()` and copy all live files to the backup directory.
5. Enable file deletions
Backup IDs are always increasing and we have a file `LATEST_BACKUP` that contains the ID of the latest backup. If we crash in middle of backing up, on a restart we will detect that there are newer backup files than `LATEST_BACKUP` claims there are. In that case, we will delete any backup newer than `LATEST_BACKUP` and clean up all the files since some of the table files might be corrupted. Having corrupted table files in the backup directory is dangerous because of our deduplication strategy.
## Further reading
For the API details, see `include/utilities/backupable_db.h`. For the implementation, see `utilities/backupable/backupable_db.cc`.

@ -0,0 +1,50 @@
---
title: How to persist in-memory RocksDB database?
layout: post
author: icanadi
category: blog
---
In recent months, we have focused on optimizing RocksDB for in-memory workloads. With growing RAM sizes and strict low-latency requirements, lots of applications decide to keep their entire data in memory. Running in-memory database with RocksDB is easy -- just mount your RocksDB directory on tmpfs or ramfs [1]. Even if the process crashes, RocksDB can recover all of your data from in-memory filesystem. However, what happens if the machine reboots?
In this article we will explain how you can recover your in-memory RocksDB database even after a machine reboot.
Every update to RocksDB is written to two places - one is an in-memory data structure called memtable and second is write-ahead log. Write-ahead log can be used to completely recover the data in memtable. By default, when we flush the memtable to table file, we also delete the current log, since we don't need it anymore for recovery (the data from the log is "persisted" in the table file -- we say that the log file is obsolete). However, if your table file is stored in in-memory file system, you may need the obsolete write-ahead log to recover the data after the machine reboots. Here's how you can do that.
Options::wal_dir is the directory where RocksDB stores write-ahead log files. If you configure this directory to be on flash or disk, you will not lose current log file on machine reboot.
Options::WAL_ttl_seconds is the timeout when we delete the archived log files. If the timeout is non-zero, obsolete log files will be moved to `archive/` directory under Options::wal_dir. Those archived log files will only be deleted after the specified timeout.
Let's assume Options::wal_dir is a directory on persistent storage and Options::WAL_ttl_seconds is set to one day. To fully recover the DB, we also need to backup the current snapshot of the database (containing table and metadata files) with a frequency of less than one day. RocksDB provides an utility that enables you to easily backup the snapshot of your database. You can learn more about it here: [How to backup RocksDB?](https://github.com/facebook/rocksdb/wiki/How-to-backup-RocksDB%3F)
You should configure the backup process to avoid backing up log files, since they are already stored in persistent storage. To do that, set BackupableDBOptions::backup_log_files to false.
Restore process by default cleans up entire DB and WAL directory. Since we didn't include log files in the backup, we need to make sure that restoring the database doesn't delete log files in WAL directory. When restoring, configure RestoreOptions::keep_log_file to true. That option will also move any archived log files back to WAL directory, enabling RocksDB to replay all archived log files and rebuild the in-memory database state.
To reiterate, here's what you have to do:
* Set DB directory to tmpfs or ramfs mounted drive
* Set Options::wal_log to a directory on persistent storage
* Set Options::WAL_ttl_seconds to T seconds
* Backup RocksDB every T/2 seconds, with BackupableDBOptions::backup_log_files = false
* When you lose data, restore from backup with RestoreOptions::keep_log_file = true
[1] You might also want to consider using [PlainTable format](https://github.com/facebook/rocksdb/wiki/PlainTable-Format) for table files

@ -0,0 +1,39 @@
---
title: The 1st RocksDB Local Meetup Held on March 27, 2014
layout: post
author: xjin
category: blog
---
On Mar 27, 2014, RocksDB team @ Facebook held the 1st RocksDB local meetup in FB HQ (Menlo Park, California). We invited around 80 guests from 20+ local companies, including LinkedIn, Twitter, Dropbox, Square, Pinterest, MapR, Microsoft and IBM. Finally around 50 guests showed up, totaling around 60% show-up rate.
[![Resize of 20140327_200754](/static/images/Resize-of-20140327_200754-300x225.jpg)](/static/images/Resize-of-20140327_200754-300x225.jpg)
RocksDB team @ Facebook gave four talks about the latest progress and experience on RocksDB:
* [Supporting a 1PB In-Memory Workload](https://github.com/facebook/rocksdb/raw/gh-pages/talks/2014-03-27-RocksDB-Meetup-Haobo-RocksDB-In-Memory.pdf)
* [Column Families in RocksDB](https://github.com/facebook/rocksdb/raw/gh-pages/talks/2014-03-27-RocksDB-Meetup-Igor-Column-Families.pdf)
* ["Lockless" Get() in RocksDB?](https://github.com/facebook/rocksdb/raw/gh-pages/talks/2014-03-27-RocksDB-Meetup-Lei-Lockless-Get.pdf)
* [Prefix Hashing in RocksDB](https://github.com/facebook/rocksdb/raw/gh-pages/talks/2014-03-27-RocksDB-Meetup-Siying-Prefix-Hash.pdf)
A very interesting question asked by a massive number of guests is: does RocksDB plan to provide replication functionality? Obviously, many applications need a resilient and distributed storage solution, not just single-node storage. We are considering how to approach this issue.
When will be the next meetup? We haven't decided yet. We will see whether the community is interested in it and how it can help RocksDB grow.
If you have any questions or feedback for the meetup or RocksDB, please let us know in [our Facebook group](https://www.facebook.com/groups/rocksdb.dev/).

@ -0,0 +1,37 @@
---
title: RocksDB 2.8 release
layout: post
author: icanadi
category: blog
---
Check out the new RocksDB 2.8 release on [Github](https://github.com/facebook/rocksdb/releases/tag/2.8.fb).
RocksDB 2.8. is mostly focused on improving performance for in-memory workloads. We are seeing read QPS as high as 5M (we will write a separate blog post on this). Here is the summary of new features:
* Added a new table format called PlainTable, which is optimized for RAM storage (ramfs or tmpfs). You can read more details about it on [our wiki](https://github.com/facebook/rocksdb/wiki/PlainTable-Format).
* New prefixed memtable format HashLinkedList, which is optimized for cases where there are only a few keys for each prefix.
* Merge operator supports a new function PartialMergeMulti() that allows users to do partial merges against multiple operands. This function enables big speedups for workloads that use merge operators.
* Added a V2 compaction filter interface. It buffers the kv-pairs sharing the same key prefix, process them in batches, and return the batched results back to DB.
* Geo-spatial support for locations and radial-search.
* Improved read performance using thread local cache for frequently accessed data.
* Stability improvements -- we're now ignoring partially written tailing record to MANIFEST or WAL files.
We have also introduced small incompatible API changes (mostly for advanced users). You can see full release notes in our [HISTORY.my](https://github.com/facebook/rocksdb/blob/2.8.fb/HISTORY.md) file.

@ -0,0 +1,24 @@
---
title: Indexing SST Files for Better Lookup Performance
layout: post
author: leijin
category: blog
---
For a `Get()` request, RocksDB goes through mutable memtable, list of immutable memtables, and SST files to look up the target key. SST files are organized in levels.
On level 0, files are sorted based on the time they are flushed. Their key range (as defined by FileMetaData.smallest and FileMetaData.largest) are mostly overlapped with each other. So it needs to look up every L0 file.
Compaction is scheduled periodically to pick up files from an upper level and merges them with files from lower level. As a result, key/values are moved from L0 down the LSM tree gradually. Compaction sorts key/values and split them into files. From level 1 and below, SST files are sorted based on key. Their key range are mutually exclusive. Instead of scanning through each SST file and checking if a key falls into its range, RocksDB performs a binary search based on FileMetaData.largest to locate a candidate file that can potentially contain the target key. This reduces complexity from O(N) to O(log(N)). However, log(N) can still be large for bottom levels. For a fan-out ratio of 10, level 3 can have 1000 files. That requires 10 comparisons to locate a candidate file. This is a significant cost for an in-memory database when you can do [several million gets per second](https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks).
One observation to this problem is that: after the LSM tree is built, an SST file's position in its level is fixed. Furthermore, its order relative to files from the next level is also fixed. Based on this idea, we can perform [fractional cascading](http://en.wikipedia.org/wiki/Fractional_cascading) kind of optimization to narrow down the binary search range. Here is an example:
[![tree_example](/static/images/tree_example1.png)](/static/images/tree_example1.png)
Level 1 has 2 files and level 2 has 8 files. Now, we want to look up key 80. A binary search based FileMetaData.largest tells you file 1 is the candidate. Then key 80 is compared with its FileMetaData.smallest and FileMetaData.largest to decide if it falls into the range. The comparison shows 80 is less than FileMetaData.smallest (100), so file 1 does not possibly contain key 80. We to proceed to check level 2. Usually, we need to do binary search among all 8 files on level 2. But since we already know target key 80 is less than 100 and only file 1 to file 3 can contain key less than 100, we can safely exclude other files from the search. As a result we cut down the search space from 8 files to 3 files.
Let's look at another example. We want to get key 230. A binary search on level 1 locates to file 2 (this also implies key 230 is larger than file 1's FileMetaData.largest 200). A comparison with file 2's range shows the target key is smaller than file 2's FileMetaData.smallest 300. Even though, we couldn't find key on level 1, we have derived hints that target key is in range between 200 and 300. Any files on level 2 that cannot overlap with [200, 300] can be safely excluded. As a result, we only need to look at file 5 and file 6 on level 2.
Inspired by this concept, we pre-build pointers at compaction time on level 1 files that point to a range of files on level 2. For example, file 1 on level 1 points to file 3 (on level 2) on the left and file 4 on the right. File 2 will point to level 2 files 6 and 7. At query time, these pointers are used to determine the actual binary search range based on comparison result.
Our benchmark shows that this optimization improves lookup QPS by ~5% for similar setup mentioned [here](https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks).

@ -0,0 +1,52 @@
---
title: Reducing Lock Contention in RocksDB
layout: post
author: sdong
category: blog
---
In this post, we briefly introduce the recent improvements we did to RocksDB to improve the issue of lock contention costs.
RocksDB has a simple thread synchronization mechanism (See [RocksDB Architecture Guide](https://github.com/facebook/rocksdb/wiki/Rocksdb-Architecture-Guide)  to understand terms used below, like SST tables or mem tables). SST tables are immutable after being written and mem tables are lock-free data structures supporting single writer and multiple readers. There is only one single major lock, the DB mutex (DBImpl.mutex_) protecting all the meta operations, including:
* Increase or decrease reference counters of mem tables and SST tables
* Change and check meta data structures, before and after finishing compactions, flushes and new mem table creations
* Coordinating writers
This DB mutex used to be scalability bottleneck preventing us from scaling to more than 16 threads. To address the issue, we improved RocksDB in several ways.
1. Consolidate reference counters and introduce "super version". For every read operation, mutex was acquired, and reference counters for each mem table and each SST table were increased. One such operation is not expensive but if you are building a high throughput server with lots of reads, the lock contention will become the bottleneck. This is especially true if you store all your data in RAM.
To solve this problem, we created a meta-meta data structure called “[super version](https://reviews.facebook.net/rROCKSDB1fdb3f7dc60e96394e3e5b69a46ede5d67fb976c)”, which holds reference counters to all those mem table and SST tables, so that readers only need to increase the reference counters for this single data structure. In RocksDB, list of live mem tables and SST tables only changes infrequently, which would happen when new mem tables are created or flush/compaction happens. Now, at those times, a new super version is created with their reference counters increased. A super version lists live mem tables and SST tables so a reader only needs acquire the lock in order to find the latest super version and increase its reference counter. From the super version, the reader can find all the mem and SST tables which are safety accessible as long as the reader holds the reference count for the super version.
2. We replace some reference counters to stc::atomic objects, so that decreasing reference count of an object usually doesn’t need to be inside the mutex any more.
3. Make fetching super version and reference counting lock-free in read queries. After consolidating reference counting to one single super version and removing the locking for decreasing reference counts, in read case, we only acquire mutex for one thing: fetch the latest super version and increase the reference count for that (dereference the counter is done in an atomic decrease). We designed and implemented a (mostly) lock-free approach to do it. See [details](https://github.com/facebook/rocksdb/raw/gh-pages/talks/2014-03-27-RocksDB-Meetup-Lei-Lockless-Get.pdf). We will write a separate blog post for that.
4. Avoid disk I/O inside the mutex. As we know, each disk I/O to hard drives takes several milliseconds. It can be even longer if file system journal is involved or I/Os are queued. Even occasional disk I/O within mutex can cause huge performance outliers.
We identified in two situations, we might do disk I/O inside mutex and we removed them:
(1) Opening and closing transactional log files. We moved those operations out of the mutex.
(2) Information logging. In multiple places we write to logs within mutex. There is a chance that file write will wait for disk I/O to finish before finishing, even if fsync() is not issued, especially in EXT systems. We occasionally see 100+ milliseconds write() latency on EXT. Instead of removing those logging, we came up with a solution of delay logging. When inside mutex, instead of directly writing to the log file, we write to a log buffer, with the timing information. As soon as mutex is released, we flush the log buffer to log files.
5. Reduce object creation inside the mutex.
Object creation can be slow because it involves malloc (in our case). Malloc sometimes is slow because it needs to lock some shared data structures. Allocating can also be slow because we sometimes do expensive operations in some of our classes' constructors. For these reasons, we try to reduce object creations inside the mutex. Here are two examples:
(1) std::vector uses malloc inside. We introduced “[autovector](https://reviews.facebook.net/rROCKSDBc01676e46d3be08c3c140361ef1f5884f47d3b3c)” data structure, in which memory for first a few elements are pre-allocated as members of the autovector class. When an autovector is used as a stack variable, no malloc will be needed unless the pre-allocated buffer is used up. This autovector is quite useful for manipulating those meta data structures. Those meta operations are often locked inside DB mutex.
(2) When building an iterator, we used to creating iterator of every live men table and SST table within the mutex and a merging iterator on top of them. Besides malloc, some of those iterators can be quite expensive to create, like sorting. Now, instead of doing that, we simply increase the reference counters of them, and release the mutex before creating any iterator.
6. Deal with mutexes in LRU caches.
When I said there was only one single major lock, I was lying. In RocksDB, all LRU caches had exclusive mutexes within to protect writes to the LRU lists, which are done in both of read and write operations. LRU caches are used in block cache and table cache. Both of them are accessed more frequently than DB data structures. Lock contention of these two locks are as intense as the DB mutex. Even if LRU cache is sharded into ShardedLRUCache, we can still see lock contentions, especially table caches. We further address this issue in two way:
(1) Bypassing table caches. A table cache maintains list of SST table’s read handlers. Those handlers contain SST files’ descriptors, table metadata, and possibly data indexes, as well as bloom filters. When the table handler needs to be evicted based on LRU, those information is cleared. When the SST table needs to be read and its table handler is not in LRU cache, the table is opened and those metadata is loaded. In some cases, users want to tune the system in a way that table handler evictions should never happen. It is common for high-throughput, low-latency servers. We introduce a mode where table cache is bypassed in read queries. In this mode, all table handlers are cached and accessed directly, so there is no need to query and adjust table caches for reading the database. It is the users’ responsibility to reserve enough resource for it. This mode can be turned on by setting options.max_open_files=-1.
(2) [New PlainTable format](//github.com/facebook/rocksdb/wiki/PlainTable-Format) (optimized for SST in ramfs/tmpfs) does not organize data by blocks. Data are located by memory addresses so no block cache is needed.
With all of those improvements, lock contention is not a bottleneck anymore, which is shown in our [memory-only benchmark](https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks) . Furthermore, lock contentions are not causing some huge (50 milliseconds+) latency outliers they used to cause.

@ -0,0 +1,25 @@
---
title: RocksDB 3.0 release
layout: post
author: icanadi
category: blog
---
Check out new RocksDB release on [Github](https://github.com/facebook/rocksdb/releases/tag/3.0.fb)!
New features in RocksDB 3.0:
* [Column Family support](https://github.com/facebook/rocksdb/wiki/Column-Families)
* [Ability to chose different checksum function](https://github.com/facebook/rocksdb/commit/0afc8bc29a5800e3212388c327c750d32e31f3d6)
* Deprecated ReadOptions::prefix_seek and ReadOptions::prefix
Check out the full [change log](https://github.com/facebook/rocksdb/blob/3.0.fb/HISTORY.md).

@ -0,0 +1,22 @@
---
title: RocksDB 3.1 release
layout: post
author: icanadi
category: blog
---
Check out the new release on [Github](https://github.com/facebook/rocksdb/releases/tag/rocksdb-3.1)!
New features in RocksDB 3.1:
* [Materialized hash index](https://github.com/facebook/rocksdb/commit/0b3d03d026a7248e438341264b4c6df339edc1d7)
* [FIFO compaction style](https://github.com/facebook/rocksdb/wiki/FIFO-compaction-style)
We released 3.1 so fast after 3.0 because one of our internal customers needed materialized hash index.

@ -0,0 +1,37 @@
---
title: PlainTable — A New File Format
layout: post
author: sdong
category: blog
---
In this post, we are introducing "PlainTable" -- a file format we designed for RocksDB, initially to satisfy a production use case at Facebook.
Design goals:
1. All data stored in memory, in files stored in tmpfs/ramfs. Support DBs larger than 100GB (may be sharded across multiple RocksDB instance).
1. Optimize for [prefix hashing](https://github.com/facebook/rocksdb/raw/gh-pages/talks/2014-03-27-RocksDB-Meetup-Siying-Prefix-Hash.pdf)
1. Less than or around 1 micro-second average latency for single Get() or Seek().
1. Minimize memory consumption.
1. Queries efficiently return empty results
Notice that our priority was not to maximize query performance, but to strike a balance between query performance and memory consumption. PlainTable query performance is not as good as you would see with a nicely-designed hash table, but they are of the same order of magnitude, while keeping memory overhead to a minimum.
Since we are targeting micro-second latency, it is on the level of the number of CPU cache misses (if they cannot be parallellized, which are usually the case for index look-ups). On our target hardware with Intel CPUs of multiple sockets with NUMA, we can only allow 4-5 CPU cache misses (including costs of data TLB).
To meet our requirements, given that only hash prefix iterating is needed, we made two decisions:
1. to use a hash index, which is
1. directly addressed to rows, with no block structure.
Having addressed our latency goal, the next task was to design a very compact hash index to minimize memory consumption. Some tricks we used to meet this goal:
1. We only use 32-bit integers for data and index offsets.The first bit serves as a flag, so we can avoid using 8-byte pointers.
1. We never copy keys or parts of keys to index search structures. We store only offsets from which keys can be retrieved, to make comparisons with search keys.
1. Since our file is immutable, we can accurately estimate the number of hash buckets needed.
To make sure the format works efficiently with empty queries, we added a bloom filter check before the query. This adds only one cache miss for non-empty cases [1], but avoids multiple cache misses for most empty results queries. This is a good trade-off for use cases with a large percentage of empty results.
These are the design goals and basic ideas of PlainTable file format. For detailed information, see [this wiki page](https://github.com/facebook/rocksdb/wiki/PlainTable-Format).
[1] Bloom filter checks typically require multiple memory access. However, because they are independent, they usually do not make the CPU pipeline stale. In any case, we improved the bloom filter to improve data locality - we may cover this further in a future blog post.

@ -0,0 +1,88 @@
---
title: Avoid Expensive Locks in Get()
layout: post
author: leijin
category: blog
---
As promised in the previous [blog post](blog/2014/05/14/lock.html)!
RocksDB employs a multiversion concurrency control strategy. Before reading data, it needs to grab the current version, which is encapsulated in a data structure called [SuperVersion](https://reviews.facebook.net/rROCKSDB1fdb3f7dc60e96394e3e5b69a46ede5d67fb976c).
At the beginning of `GetImpl()`, it used to do this:
<span class="zw-portion">mutex_.Lock();
</span>auto* s = super_version_->Ref();
mutex_.Unlock();
The lock is necessary because pointer super_version_ may be updated, the corresponding SuperVersion may be deleted while Ref() is in progress.
`Ref()` simply increases the reference counter and returns “this” pointer. However, this simple operation posed big challenges for in-memory workload and stopped RocksDB from scaling read throughput beyond 8 cores. Running 32 read threads on a 32-core CPU leads to [70% system CPU usage](https://github.com/facebook/rocksdb/raw/gh-pages/talks/2014-03-27-RocksDB-Meetup-Lei-Lockless-Get.pdf). This is outrageous!
Luckily, we found a way to circumvent this problem by using [thread local storage](http://en.wikipedia.org/wiki/Thread-local_storage). Version change is a rare event comparable to millions of read requests. On the very first Get() request, each thread pays the mutex cost to acquire a reference to the new super version. Instead of releasing the reference after use, the reference is cached in thread’s local storage. An atomic variable is used to track global super version number. Subsequent reads simply compare the local super version number against the global super version number. If they are the same, the cached super version reference may be used directly, at no cost. If a version change is detected, mutex must be acquired to update the reference. The cost of mutex lock is amortized among millions of reads and becomes negligible.
The code looks something like this:
SuperVersion* s = thread_local_->Get();
if (s->version_number != super_version_number_.load()) {
// slow path, cleanup of current super version is omitted
mutex_.Lock();
s = super_version_->Ref();
mutex_.Unlock();
}
The result is quite amazing. RocksDB can nicely [scale to 32 cores](https://github.com/facebook/rocksdb/raw/gh-pages/talks/2014-03-27-RocksDB-Meetup-Lei-Lockless-Get.pdf) and most CPU time is spent in user land.
Daryl Grove gives a pretty good [comparison between mutex and atomic](https://blogs.oracle.com/d/entry/the_cost_of_mutexes). However, the real cost difference lies beyond what is shown in the assembly code. Mutex can keep threads spinning on CPU or even trigger thread context switches in which all readers compete to access the critical area. Our approach prevents mutual competition by directing threads to check against a global version which does not change at high frequency, and is therefore much more cache-friendly.
The new approach entails one issue: a thread can visit GetImpl() once but can never come back again. SuperVersion is referenced and cached in its thread local storage. All resources (e.g., memtables, files) which belong to that version are frozen. A “supervisor” is required to visit each thread’s local storage and free its resources without incurring a lock. We designed a lockless sweep using CAS (compare and switch instruction). Here is how it works:
(1) A reader thread uses CAS to acquire SuperVersion from its local storage and to put in a special flag (SuperVersion::kSVInUse).
(2) Upon completion of GetImpl(), the reader thread tries to return SuperVersion to local storage by CAS, expecting the special flag (SuperVersion::kSVInUse) in its local storage. If it does not see SuperVersion::kSVInUse, that means a “sweep” was done and the reader thread is responsible for cleanup (this is expensive, but does not happen often on the hot path).
(3) After any flush/compaction, the background thread performs a sweep (CAS) across all threads’ local storage and frees encountered SuperVersion. A reader thread must re-acquire a new SuperVersion reference on its next visit.

@ -0,0 +1,30 @@
---
title: RocksDB 3.2 release
layout: post
author: leijin
category: blog
---
Check out new RocksDB release on [GitHub](https://github.com/facebook/rocksdb/releases/tag/rocksdb-3.2)!
New Features in RocksDB 3.2:
* PlainTable now supports a new key encoding: for keys of the same prefix, the prefix is only written once. It can be enabled through encoding_type paramter of NewPlainTableFactory()
* Add AdaptiveTableFactory, which is used to convert from a DB of PlainTable to BlockBasedTabe, or vise versa. It can be created using NewAdaptiveTableFactory()
Public API changes:
* We removed seek compaction as a concept from RocksDB
* Add two paramters to NewHashLinkListRepFactory() for logging on too many entries in a hash bucket when flushing
* Added new option BlockBasedTableOptions::hash_index_allow_collision. When enabled, prefix hash index for block-based table will not store prefix and allow hash collision, reducing memory consumption

@ -0,0 +1,34 @@
---
title: RocksDB 3.3 Release
layout: post
author: yhciang
category: blog
---
Check out new RocksDB release on [GitHub](https://github.com/facebook/rocksdb/releases/tag/rocksdb-3.3)!
New Features in RocksDB 3.3:
* **JSON API prototype**.
* **Performance improvement on HashLinkList**: We addressed performance outlier of HashLinkList caused by skewed bucket by switching data in the bucket from linked list to skip list. Add parameter threshold_use_skiplist in NewHashLinkListRepFactory().
* **More effective on storage space reclaim**: RocksDB is now able to reclaim storage space more effectively during the compaction process. This is done by compensating the size of each deletion entry by the 2X average value size, which makes compaction to be triggerred by deletion entries more easily.
* **TimeOut API to write**: Now WriteOptions have a variable called timeout_hint_us. With timeout_hint_us set to non-zero, any write associated with this timeout_hint_us may be aborted when it runs longer than the specified timeout_hint_us, and it is guaranteed that any write completes earlier than the specified time-out will not be aborted due to the time-out condition.
* **rate_limiter option**: We added an option that controls total throughput of flush and compaction. The throughput is specified in bytes/sec. Flush always has precedence over compaction when available bandwidth is constrained.
Public API changes:
* Removed NewTotalOrderPlainTableFactory because it is not used and implemented semantically incorrect.

@ -0,0 +1,132 @@
---
title: Cuckoo Hashing Table Format
layout: post
author: radheshyam
category: blog
---
## Introduction
We recently introduced a new [Cuckoo Hashing](http://en.wikipedia.org/wiki/Cuckoo_hashing) based SST file format which is optimized for fast point lookups. The new format was built for applications which require very high point lookup rates (~4Mqps) in read only mode but do not use operations like range scan, merge operator, etc. But, the existing RocksDB file formats were built to support range scan and other operations and the current best point lookup in RocksDB is 1.2 Mqps given by [PlainTable](https://github.com/facebook/rocksdb/wiki/PlainTable-Format)[ format](https://github.com/facebook/rocksdb/wiki/PlainTable-Format). This prompted a hashing based file format, which we present here. The new table format uses a cache friendly version of Cuckoo Hashing algorithm with only 1 or 2 memory accesses per lookup.
Goals:
* Reduce memory accesses per lookup to 1 or 2
* Get an end to end point lookup rate of at least 4 Mqps
* Minimize database size
Assumptions:
* Key length and value length are fixed
* The database is operated in read only mode
Non-goals:
While optimizing the performance of Get() operation was our primary goal, compaction and build times were secondary. We may work on improving them in future.
Details for setting up the table format can be found in [GitHub](https://github.com/facebook/rocksdb/wiki/CuckooTable-Format).
## Cuckoo Hashing Algorithm
In order to achieve high lookup speeds, we did multiple optimizations, including a cache friendly cuckoo hash algorithm. Cuckoo Hashing uses multiple hash functions, _h1, ..., __hn._
### Original Cuckoo Hashing
To insert any new key _k_, we compute hashes of the key _h1(k), ..., __hn__(k)_. We insert the key in the first hash location that is free. If all the locations are blocked, we try to move one of the colliding keys to a different location by trying to re-insert it.
Finding smallest set of keys to displace in order to accommodate the new key is naturally a shortest path problem in a directed graph where nodes are buckets of hash table and there is an edge from bucket _A_ to bucket _B_ if the element stored in bucket _A_ can be accommodated in bucket _B_ using one of the hash functions. The source nodes are the possible hash locations for the given key _k_ and destination is any one of the empty buckets. We use this algorithm to handle collision.
To retrieve a key _k_, we compute hashes, _h1(k), ..., __hn__(k)_ and the key must be present in one of these locations.
Our goal is to minimize average (and maximum) number of hash functions required and hence the number of memory accesses. In our experiments, with a hash utilization of 90%, we found that the average number of lookups is 1.8 and maximum is 3. Around 44% of keys are accommodated in first hash location and 33% in second location.
### Cache Friendly Cuckoo Hashing
We noticed the following two sub-optimal properties in original Cuckoo implementation:
* If the key is not present in first hash location, we jump to second hash location which may not be in cache. This results in many cache misses.
* Because only 44% of keys are located in first cuckoo block, we couldn't have an optimal prefetching strategy - prefetching all hash locations for a key is wasteful. But prefetching only the first hash location helps only 44% of cases.
The solution is to insert more keys near first location. In case of collision in the first hash location - _h1(k)_, we try to insert it in next few buckets, _h1(k)+1, _h1(k)+2, _..., h1(k)+t-1_. If all of these _t_ locations are occupied, we skip over to next hash function _h2_ and repeat the process. We call the set of _t_ buckets as a _Cuckoo Block_. We chose _t_ such that size of a block is not bigger than a cache line and we prefetch the first cuckoo block.
With the new algorithm, for 90% hash utilization, we found that 85% of keys are accommodated in first Cuckoo Block. Prefetching the first cuckoo block yields best results. For a database of 100 million keys with key length 8 and value length 4, the hash algorithm alone can achieve 9.6 Mqps and we are working on improving it further. End to end RocksDB performance results can be found [here](https://github.com/facebook/rocksdb/wiki/CuckooTable-Format).

@ -0,0 +1,105 @@
---
title: New Bloom Filter Format
layout: post
author: zagfox
category: blog
---
## Introduction
In this post, we are introducing "full filter block" --- a new bloom filter format for [block based table](https://github.com/facebook/rocksdb/wiki/Rocksdb-BlockBasedTable-Format). This could bring about 40% of improvement for key query under in-memory (all data stored in memory, files stored in tmpfs/ramfs, an [example](https://github.com/facebook/rocksdb/wiki/RocksDB-In-Memory-Workload-Performance-Benchmarks) workload. The main idea behind is to generate a big filter that covers all the keys in SST file to avoid lots of unnecessary memory look ups.
## What is Bloom Filter
In brief, [bloom filter](https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter) is a bits array generated for a set of keys that could tell if an arbitrary key may exist in that set.
In RocksDB, we generate such a bloom filter for each SST file. When we conduct a query for a key, we first goes to the bloom filter block of SST file. If key may exist in filter, we goes into data block in SST file to search for the key. If not, we would return directly. So it could help speed up point look up operation a lot.
## Original Bloom Filter Format
Original bloom filter creates filters for each individual data block in SST file. It has complex structure (ref [here](https://github.com/facebook/rocksdb/wiki/Rocksdb-BlockBasedTable-Format#filter-meta-block)) which results in a lot of non-adjacent memory look ups.
Here's the work flow for checking original bloom filter in block based table:
1. Given the target key, we goes to the index block to get the "data block ID" where this key may reside.
1. Using the "data block ID", we goes to the filter block and get the correct "offset of filter".
1. Using the "offset of filter", we goes to the actual filter and do the checking.
## New Bloom Filter Format
New bloom filter creates filter for all keys in SST file and we name it "full filter". The data structure of full filter is very simple, there is just one big filter:
    [ full filter ]
In this way, the work flow of bloom filter checking is much simplified.
(1) Given the target key, we goes directly to the filter block and conduct the filter checking.
To be specific, there would be no checking for index block and no address jumping inside of filter block.
Though it is a big filter, the total filter size would be the same as the original filter.
One little draw back is that the new bloom filter introduces more memory consumption when building SST file because we need to buffer keys (or their hashes) before generating filter. Original filter just creates a bunch of small filters so it just buffer a small amount of keys. For full filter, we buffer hashes of all keys, which would take more memory when SST file size increases.
## Usage & Customization
You can refer to the document here for [usage](https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter#usage-of-new-bloom-filter) and [customization](https://github.com/facebook/rocksdb/wiki/RocksDB-Bloom-Filter#customize-your-own-filterpolicy).
 

@ -0,0 +1,48 @@
---
title: RocksDB 3.5 Release!
layout: post
author: leijin
category: blog
---
New RocksDB release - 3.5!
**New Features**
1. Add include/utilities/write_batch_with_index.h, providing a utility class to query data out of WriteBatch when building it.
2. new ReadOptions.total_order_seek to force total order seek when block-based table is built with hash index.
**Public API changes**
1. The Prefix Extractor used with V2 compaction filters is now passed user key to SliceTransform::Transform instead of unparsed RocksDB key.
2. Move BlockBasedTable related options to BlockBasedTableOptions from Options. Change corresponding JNI interface. Options affected include: no_block_cache, block_cache, block_cache_compressed, block_size, block_size_deviation, block_restart_interval, filter_policy, whole_key_filtering. filter_policy is changed to shared_ptr from a raw pointer.
3. Remove deprecated options: disable_seek_compaction and db_stats_log_interval
4. OptimizeForPointLookup() takes one parameter for block cache size. It now builds hash index, bloom filter, and block cache.
[https://github.com/facebook/rocksdb/releases/tag/v3.5](https://github.com/facebook/rocksdb/releases/tag/rocksdb-3.5)

@ -0,0 +1,108 @@
---
title: Migrating from LevelDB to RocksDB
layout: post
author: lgalanis
category: blog
---
If you have an existing application that uses LevelDB and would like to migrate to using RocksDB, one problem you need to overcome is to map the options for LevelDB to proper options for RocksDB. As of release 3.9 this can be automatically done by using our option conversion utility found in rocksdb/utilities/leveldb_options.h. What is needed, is to first replace `leveldb::Options` with `rocksdb::LevelDBOptions`. Then, use `rocksdb::ConvertOptions( )` to convert the `LevelDBOptions` struct into appropriate RocksDB options. Here is an example:
LevelDB code:
```c++
#include <string>
#include "leveldb/db.h"
using namespace leveldb;
int main(int argc, char** argv) {
DB *db;
Options opt;
opt.create_if_missing = true;
opt.max_open_files = 1000;
opt.block_size = 4096;
Status s = DB::Open(opt, "/tmp/mydb", &db);
delete db;
}
```
RocksDB code:
```c++
#include <string>
#include "rocksdb/db.h"
#include "rocksdb/utilities/leveldb_options.h"
using namespace rocksdb;
int main(int argc, char** argv) {
DB *db;
LevelDBOptions opt;
opt.create_if_missing = true;
opt.max_open_files = 1000;
opt.block_size = 4096;
Options rocksdb_options = ConvertOptions(opt);
// add rocksdb specific options here
Status s = DB::Open(rocksdb_options, "/tmp/mydb_rocks", &db);
delete db;
}
```
The difference is:
```diff
-#include "leveldb/db.h"
+#include "rocksdb/db.h"
+#include "rocksdb/utilities/leveldb_options.h"
-using namespace leveldb;
+using namespace rocksdb;
- Options opt;
+ LevelDBOptions opt;
- Status s = DB::Open(opt, "/tmp/mydb", &db);
+ Options rocksdb_options = ConvertOptions(opt);
+ // add rockdb specific options here
+
+ Status s = DB::Open(rocksdb_options, "/tmp/mydb_rocks", &db);
```
Once you get up and running with RocksDB you can then focus on tuning RocksDB further by modifying the converted options struct.
The reason why ConvertOptions is handy is because a lot of individual options in RocksDB have moved to other structures in different components. For example, block_size is not available in struct rocksdb::Options. It resides in struct rocksdb::BlockBasedTableOptions, which is used to create a TableFactory object that RocksDB uses internally to create the proper TableBuilder objects. If you were to write your application from scratch it would look like this:
RocksDB code from scratch:
```c++
#include <string>
#include "rocksdb/db.h"
#include "rocksdb/table.h"
using namespace rocksdb;
int main(int argc, char** argv) {
DB *db;
Options opt;
opt.create_if_missing = true;
opt.max_open_files = 1000;
BlockBasedTableOptions topt;
topt.block_size = 4096;
opt.table_factory.reset(NewBlockBasedTableFactory(topt));
Status s = DB::Open(opt, "/tmp/mydb_rocks", &db);
delete db;
}
```
The LevelDBOptions utility can ease migration to RocksDB from LevelDB and allows us to break down the various options across classes as it is needed.

@ -0,0 +1,37 @@
---
title: Reading RocksDB options from a file
layout: post
author: lgalanis
category: blog
---
RocksDB options can be provided using a file or any string to RocksDB. The format is straightforward: `write_buffer_size=1024;max_write_buffer_number=2`. Any whitespace around `=` and `;` is OK. Moreover, options can be nested as necessary. For example `BlockBasedTableOptions` can be nested as follows: `write_buffer_size=1024; max_write_buffer_number=2; block_based_table_factory={block_size=4k};`. Similarly any white space around `{` or `}` is ok. Here is what it looks like in code:
```c++
#include <string>
#include "rocksdb/db.h"
#include "rocksdb/table.h"
#include "rocksdb/utilities/convenience.h"
using namespace rocksdb;
int main(int argc, char** argv) {
DB *db;
Options opt;
std::string options_string =
"create_if_missing=true;max_open_files=1000;"
"block_based_table_factory={block_size=4096}";
Status s = GetDBOptionsFromString(opt, options_string, &opt);
s = DB::Open(opt, "/tmp/mydb_rocks", &db);
// use db
delete db;
}
```
Using `GetDBOptionsFromString` is a convenient way of changing options for your RocksDB application without needing to resort to recompilation or tedious command line parsing.

@ -0,0 +1,16 @@
---
title: 'WriteBatchWithIndex: Utility for Implementing Read-Your-Own-Writes'
layout: post
author: sdong
category: blog
---
RocksDB can be used as a storage engine of a higher level database. In fact, we are currently plugging RocksDB into MySQL and MongoDB as one of their storage engines. RocksDB can help with guaranteeing some of the ACID properties: durability is guaranteed by RocksDB by design; while consistency and isolation need to be enforced by concurrency controls on top of RocksDB; Atomicity can be implemented by committing a transaction's writes with one write batch to RocksDB in the end.
However, if we enforce atomicity by only committing all writes in the end of the transaction in one batch, you cannot get the updated value from RocksDB previously written by the same transaction (read-your-own-write). To read the updated value, the databases on top of RocksDB need to maintain an internal buffer for all the written keys, and when a read happens they need to merge the result from RocksDB and from this buffer. This is a problem we faced when building the RocksDB storage engine in MongoDB. We solved it by creating a utility class, WriteBatchWithIndex (a write batch with a searchable index) and made it part of public API so that the community can also benefit from it.
Before talking about the index part, let me introduce write batch first. The write batch class, `WriteBatch`, is a RocksDB data structure for atomic writes of multiple keys. Users can buffer their updates to a `WriteBatch` by calling `write_batch.Put("key1", "value1")` or `write_batch.Delete("key2")`, similar as calling RocksDB's functions of the same names. In the end, they call `db->Write(write_batch)` to atomically update all those batched operations to the DB. It is how a database can guarantee atomicity, as shown above. Adding a searchable index to `WriteBatch`, we now have `WriteBatchWithIndex`. Users can put updates to WriteBatchIndex in the same way as to `WriteBatch`. In the end, users can get a `WriteBatch` object from it and issue `db->Write()`. Additionally, users can create an iterator of a WriteBatchWithIndex, seek to any key location and iterate from there.
To implement read-your-own-write using `WriteBatchWithIndex`, every time the user creates a transaction, we create a `WriteBatchWithIndex` attached to it. All the writes of the transaction go to the `WriteBatchWithIndex` first. When we commit the transaction, we atomically write the batch to RocksDB. When the user wants to call `Get()`, we first check if the value exists in the `WriteBatchWithIndex` and return the value if existing, by seeking and reading from an iterator of the write batch, before checking data in RocksDB. For example, here is the we implement it in MongoDB's RocksDB storage engine: [link](https://github.com/mongodb/mongo/blob/a31cc114a89a3645e97645805ba77db32c433dce/src/mongo/db/storage/rocks/rocks_recovery_unit.cpp#L245-L260). If a range query comes, we pass a DB's iterator to `WriteBatchWithIndex`, which creates a super iterator which combines the results from the DB iterator with the batch's iterator. Using this super iterator, we can iterate the DB with the transaction's own writes. Here is the iterator creation codes in MongoDB's RocksDB storage engine: [link](https://github.com/mongodb/mongo/blob/a31cc114a89a3645e97645805ba77db32c433dce/src/mongo/db/storage/rocks/rocks_recovery_unit.cpp#L266-L269). In this way, the database can solve the read-your-own-write problem by using RocksDB to handle a transaction's uncommitted writes.
Using `WriteBatchWithIndex`, we successfully implemented read-your-own-writes in the RocksDB storage engine of MongoDB. If you also have a read-your-own-write problem, `WriteBatchWithIndex` can help you implement it quickly and correctly.

@ -0,0 +1,12 @@
---
title: Integrating RocksDB with MongoDB
layout: post
author: icanadi
category: blog
---
Over the last couple of years, we have been busy integrating RocksDB with various services here at Facebook that needed to store key-value pairs locally. We have also seen other companies using RocksDB as local storage components of their distributed systems.
The next big challenge for us is to bring RocksDB storage engine to general purpose databases. Today we have an exciting milestone to share with our community! We're running MongoDB with RocksDB in production and seeing great results! You can read more about it here: [http://blog.parse.com/announcements/mongodb-rocksdb-parse/](http://blog.parse.com/announcements/mongodb-rocksdb-parse/)
Keep tuned for benchmarks and more stability and performance improvements.

@ -0,0 +1,8 @@
---
title: RocksDB in osquery
layout: post
author: icanadi
category: lgalanis
---
Check out [this](https://code.facebook.com/posts/1411870269134471/how-rocksdb-is-used-in-osquery/) blog post by [Mike Arpaia](https://www.facebook.com/mike.arpaia) and [Ted Reed](https://www.facebook.com/treeded) about how osquery leverages RocksDB to build an embedded pub-sub system. This article is a great read and contains insights on how to properly use RocksDB.

@ -0,0 +1,81 @@
---
title: RocksDB 2015 H2 roadmap
layout: post
author: icanadi
category: blog
---
Every 6 months, RocksDB team gets together to prioritize the work ahead of us. We just went through this exercise and we wanted to share the results with the community. Here's what RocksDB team will be focusing on for the next 6 months:
**MyRocks**
As you might know, we're working hard to integrate RocksDB as a storage engine for MySQL. This project is pretty important for us because we're heavy users of MySQL. We're already getting pretty good performance results, but there is more work to be done. We need to focus on both performance and stability. The most high priority items on are list are:
1. Reduce CPU costs of RocksDB as a MySQL storage engine
2. Implement pessimistic concurrency control to support repeatable read isolation level in MyRocks
3. Reduce P99 read latency, which is high mostly because of lingering tombstones
4. Port ZSTD compression
**MongoRocks**
Another database that we're working on is MongoDB. The project of integrating MongoDB with RocksDB storage engine is called MongoRocks. It's already running in production at Parse [1] and we're seeing surprisingly few issues. Our plans for the next half:
1. Keep improving performance and stability, possibly reuse work done on MyRocks (workloads are pretty similar).
2. Increase internal and external adoption.
3. Support new MongoDB 3.2.
**RocksDB on cheaper storage media**
Up to now, our mission was to build the best key-value store “for fast storage” (flash and in-memory). However, there are some use-cases at Facebook that don't need expensive high-end storage. In the next six months, we plan to deploy RocksDB on cheaper storage media. We will optimize performance to RocksDB on either or both:
1. Hard drive storage array.
2. Tiered Storage.
**Quality of Service**
When talking to our customers, there are couple of issues that keep reoccurring. We need to fix them to make our customers happy. We will improve RocksDB to provide better assurance of performance and resource usage. Non-exhaustive list includes:
1. Iterate P99 can be high due to the presence of tombstones.
2. Write stalls can happen during high write loads.
3. Better control of memory and disk usage.
4. Service quality and performance of backup engine.
**Operation's user experience**
As we increase deployment of RocksDB, engineers are spending more time on debugging RocksDB issues. We plan to improve user experience when running RocksDB. The goal is to reduce TTD (time-to-debug). The work includes monitoring, visualizations and documentations.
[1][ http://blog.parse.com/announcements/mongodb-rocksdb-parse/](http://blog.parse.com/announcements/mongodb-rocksdb-parse/)

@ -0,0 +1,74 @@
---
title: Spatial indexing in RocksDB
layout: post
author: icanadi
category: blog
---
About a year ago, there was a need to develop a spatial database at Facebook. We needed to store and index Earth's map data. Before building our own, we looked at the existing spatial databases. They were all very good technology, but also general purpose. We could sacrifice a general-purpose API, so we thought we could build a more performant database, since it would be specifically designed for our use-case. Furthermore, we decided to build the spatial database on top of RocksDB, because we have a lot of operational experience with running and tuning RocksDB at a large scale.
When we started looking at this project, the first thing that surprised us was that our planet is not that big. Earth's entire map data can fit in memory on a reasonably high-end machine. Thus, we also decided to build a spatial database optimized for memory-resident dataset.
The first use-case of our spatial database was an experimental map renderer. As part of our project, we successfully loaded [Open Street Maps](https://www.openstreetmap.org/) dataset and hooked it up with [Mapnik](http://mapnik.org/), a map rendering engine.
The usual Mapnik workflow is to load the map data into a SQL-based database and then define map layers with SQL statements. To render a tile, Mapnik needs to execute a couple of SQL queries. The benefit of this approach is that you don't need to reload your database when you change your map style. You can just change your SQL query and Mapnik picks it up. In our model, we decided to precompute the features we need for each tile. We need to know the map style before we create the database. However, when rendering the map tile, we only fetch the features that we need to render.
We haven't open sourced the RocksDB Mapnik plugin or the database loading pipeline. However, the spatial indexing is available in RocksDB under a name [SpatialDB](https://github.com/facebook/rocksdb/blob/master/include/rocksdb/utilities/spatial_db.h). The API is focused on map rendering use-case, but we hope that it can also be used for other spatial-based applications.
Let's take a tour of the API. When you create a spatial database, you specify the spatial indexes that need to be built. Each spatial index is defined by a bounding box and granularity. For map rendering, we create a spatial index for each zoom levels. Higher zoom levels have more granularity.
SpatialDB::Create(
SpatialDBOptions(),
"/data/map", {
SpatialIndexOptions("zoom10", BoundingBox(0, 0, 100, 100), 10),
SpatialIndexOptions("zoom16", BoundingBox(0, 0, 100, 100), 16)
}
);
When you insert a feature (building, street, country border) into SpatialDB, you need to specify the list of spatial indexes that will index the feature. In the loading phase we process the map style to determine the list of zoom levels on which we'll render the feature. For example, we will not render the building on zoom level that shows an entire country. Building will only be indexed on higher zoom level's index. Country borders will be indexes on all zoom levels.
FeatureSet feature;
feature.Set("type", "building");
feature.Set("height", 6);
db->Insert(WriteOptions(), BoundingBox<double>(5, 5, 10, 10),
well_known_binary_blob, feature, {"zoom16"});
The indexing part is pretty simple. For each feature, we first find a list of index tiles that it intersects. Then, we add a link from the tile's [quad key](https://msdn.microsoft.com/en-us/library/bb259689.aspx) to the feature's primary key. Using quad keys improves data locality, i.e. features closer together geographically will have similar quad keys. Even though we're optimizing for a memory-resident dataset, data locality is still very important due to different caching effects.
After you're done inserting all the features, you can call an API Compact() that will compact the dataset and speed up read queries.
db->Compact();
SpatialDB's query specifies: 1) bounding box we're interested in, and 2) a zoom level. We find all tiles that intersect with the query's bounding box and return all features in those tiles.
Cursor* c = db_->Query(ReadOptions(), BoundingBox<double>(1, 1, 7, 7), "zoom16");
for (c->Valid(); c->Next()) {
Render(c->blob(), c->feature_set());
}
Note: `Render()` function is not part of RocksDB. You will need to use one of many open source map renderers, for example check out [Mapnik](http://mapnik.org/).
TL;DR If you need an embedded spatial database, check out RocksDB's SpatialDB. [Let us know](https://www.facebook.com/groups/rocksdb.dev/) how we can make it better.
If you're interested in learning more, check out this [talk](https://www.youtube.com/watch?v=T1jWsDMONM8).

@ -0,0 +1,12 @@
---
title: RocksDB is now available in Windows Platform
layout: post
author: dmitrism
category: blog
---
Over the past 6 months we have seen a number of use cases where RocksDB is successfully used by the community and various companies to achieve high throughput and volume in a modern server environment.
We at Microsoft Bing could not be left behind. As a result we are happy to [announce](http://bit.ly/1OmWBT9) the availability of the Windows Port created here at Microsoft which we intend to use as a storage option for one of our key/value data stores.
We are happy to make this available for the community. Keep tuned for more announcements to come.

@ -0,0 +1,36 @@
---
title: Dynamic Level Size for Level-Based Compaction
layout: post
author: sdong
category: blog
---
In this article, we follow up on the first part of an answer to one of the questions in our [AMA](https://www.reddit.com/r/IAmA/comments/3de3cv/we_are_rocksdb_engineering_team_ask_us_anything/ct4a8tb), the dynamic level size in level-based compaction.
Level-based compaction is the original LevelDB compaction style and one of the two major compaction styles in RocksDB (See [our wiki](https://github.com/facebook/rocksdb/wiki/RocksDB-Basics#multi-threaded-compactions)). In RocksDB we introduced parallelism and more configurable options to it but the main algorithm stayed the same, until we recently introduced the dynamic level size mode.
In level-based compaction, we organize data to different sorted runs, called levels. Each level has a target size.  Usually target size of levels increases by the same size multiplier. For example, you can set target size of level 1 to be 1GB, and size multiplier to be 10, and the target size of level 1, 2, 3, 4 will be 1GB, 10GB, 100GB and 1000GB. Before level 1, there will be some staging file flushed from mem tables, called Level 0 files, which will later be merged to level 1. Compactions will be triggered as soon as actual size of a level exceeds its target size. We will merge a subset of data of that level to next level, to reduce size of the level. More compactions will be triggered until sizes of all the levels are lower than their target sizes. In a steady state, the size of each level will be around the same size of the size of level targets.
Level-based compaction’s advantage is its good space efficiency. We usually use the metric space amplification to measure the space efficiency. In this article ignore the effects of data compression so space amplification= size_on_file_system / size_of_user_data.
How do we estimate space amplification of level-based compaction? We focus specifically on the databases in steady state, which means database size is stable or grows slowly over time. This means updates will add roughly the same or little more data than what is removed by deletes. Given that, if we compact all the data all to the last level, the size of level will be equal as the size of last level before the compaction. On the other hand, the size of user data will be approximately the size of DB if we compact all the levels down to the last level. So the size of the last level will be a good estimation of user data size. So total size of the DB divided by the size of the last level will be a good estimation of space amplification.
Applying the equation, if we have four non-zero levels, their sizes are 1GB, 10GB, 100GB, 1000GB, the size amplification will be approximately (1000GB + 100GB + 10GB + 1GB) / 1000GB = 1.111, which is a very good number. However, there is a catch here: how to make sure the last level’s size is 1000GB, the same as the level’s size target? A user has to fine tune level sizes to achieve this number and will need to re-tune if DB size changes. The theoretic number 1.11 is hard to achieve in practice. In a worse case, if you have the target size of last level to be 1000GB but the user data is only 200GB, then the actual space amplification will be (200GB + 100GB + 10GB + 1GB) / 200GB = 1.555, a much worse number.
To solve this problem, my colleague Igor Kabiljo came up with a solution of dynamic level size target mode. You can enable it by setting options.level_compaction_dynamic_level_bytes=true. In this mode, size target of levels are changed dynamically based on size of the last level. Suppose the level size multiplier to be 10, and the DB size is 200GB. The target size of the last level is automatically set to be the actual size of the level, which is 200GB, the second to last level’s size target will be automatically set to be size_last_level / 10 = 20GB, the third last level’s will be size_last_level/100 = 2GB, and next level to be size_last_level/1000 = 200MB. We stop here because 200MB is within the range of the first level. In this way, we can achieve the 1.111 space amplification, without fine tuning of the level size targets. More details can be found in [code comments of the option](https://github.com/facebook/rocksdb/blob/v3.11/include/rocksdb/options.h#L366-L423) in the header file.

@ -0,0 +1,189 @@
---
title: GetThreadList
layout: post
author: yhciang
category: blog
---
We recently added a new API, called `GetThreadList()`, that exposes the RocksDB background thread activity. With this feature, developers will be able to obtain the real-time information about the currently running compactions and flushes such as the input / output size, elapsed time, the number of bytes it has written. Below is an example output of `GetThreadList`. To better illustrate the example, we have put a sample output of `GetThreadList` into a table where each column represents a thread status:
<table width="637" >
<tbody >
<tr style="border:2px solid #000000" >
<td style="padding:3px" >ThreadID
</td>
<td >140716395198208
</td>
<td >140716416169728
</td>
</tr>
<tr >
<td style="padding:3px" >DB
</td>
<td >db1
</td>
<td >db2
</td>
</tr>
<tr >
<td style="padding:3px" >CF
</td>
<td >default
</td>
<td >picachu
</td>
</tr>
<tr >
<td style="padding:3px" >ThreadType
</td>
<td >High Pri
</td>
<td >Low Pri
</td>
</tr>
<tr >
<td style="padding:3px" >Operation
</td>
<td >Flush
</td>
<td >Compaction
</td>
</tr>
<tr >
<td style="padding:3px" >ElapsedTime
</td>
<td >143.459 ms
</td>
<td >607.538 ms
</td>
</tr>
<tr >
<td style="padding:3px" >Stage
</td>
<td >FlushJob::WriteLevel0Table
</td>
<td >CompactionJob::Install
</td>
</tr>
<tr >
<td style="vertical-align:top;padding:3px" >OperationProperties
</td>
<td style="vertical-align:top;padding:3px" >
BytesMemtables 4092938
BytesWritten 1050701
</td>
<td style="vertical-align:top" >
BaseInputLevel 1
BytesRead 4876417
BytesWritten 4140109
IsDeletion 0
IsManual 0
IsTrivialMove 0
JobID 146
OutputLevel 2
TotalInputBytes 4883044
</td>
</tr>
</tbody>
</table>
In the above output, we can see `GetThreadList()` reports the activity of two threads: one thread running flush job (middle column) and the other thread running a compaction job (right-most column). In each thread status, it shows basic information about the thread such as thread id, it's target db / column family, and the job it is currently doing and the current status of the job. For instance, we can see thread 140716416169728 is doing compaction on the `picachu` column family in database `db2`. In addition, we can see the compaction has been running for 600 ms, and it has read 4876417 bytes out of 4883044 bytes. This indicates the compaction is about to complete. The stage property indicates which code block the thread is currently executing. For instance, thread 140716416169728 is currently running `CompactionJob::Install`, which further indicates the compaction job is almost done.
Below we briefly describe its API.
## How to Enable it?
To enable thread-tracking of a rocksdb instance, simply set `enable_thread_tracking` to true in its DBOptions:
```c++
// If true, then the status of the threads involved in this DB will
// be tracked and available via GetThreadList() API.
//
// Default: false
bool enable_thread_tracking;
```
## The API
The GetThreadList API is defined in [include/rocksdb/env.h](https://github.com/facebook/rocksdb/blob/master/include/rocksdb/env.h#L317-L318), which is an Env
function:
```c++
virtual Status GetThreadList(std::vector* thread_list)
```
Since an Env can be shared across multiple rocksdb instances, the output of
`GetThreadList()` include the background activity of all the rocksdb instances
that using the same Env.
The `GetThreadList()` API simply returns a vector of `ThreadStatus`, each describes
the current status of a thread. The `ThreadStatus` structure, defined in
[include/rocksdb/thread_status.h](https://github.com/facebook/rocksdb/blob/master/include/rocksdb/thread_status.h), contains the following information:
```c++
// An unique ID for the thread.
const uint64_t thread_id;
// The type of the thread, it could be HIGH_PRIORITY,
// LOW_PRIORITY, and USER
const ThreadType thread_type;
// The name of the DB instance where the thread is currently
// involved with. It would be set to empty string if the thread
// does not involve in any DB operation.
const std::string db_name;
// The name of the column family where the thread is currently
// It would be set to empty string if the thread does not involve
// in any column family.
const std::string cf_name;
// The operation (high-level action) that the current thread is involved.
const OperationType operation_type;
// The elapsed time in micros of the current thread operation.
const uint64_t op_elapsed_micros;
// An integer showing the current stage where the thread is involved
// in the current operation.
const OperationStage operation_stage;
// A list of properties that describe some details about the current
// operation. Same field in op_properties[] might have different
// meanings for different operations.
uint64_t op_properties[kNumOperationProperties];
// The state (lower-level action) that the current thread is involved.
const StateType state_type;
```
If you are interested in the background thread activity of your RocksDB application, please feel free to give `GetThreadList()` a try :)

@ -0,0 +1,43 @@
---
title: Use Checkpoints for Efficient Snapshots
layout: post
author: rven2
category: blog
---
**Checkpoint** is a feature in RocksDB which provides the ability to take a snapshot of a running RocksDB database in a separate directory. Checkpoints can be used as a point in time snapshot, which can be opened Read-only to query rows as of the point in time or as a Writeable snapshot by opening it Read-Write. Checkpoints can be used for both full and incremental backups.
The Checkpoint feature enables RocksDB to create a consistent snapshot of a given RocksDB database in the specified directory. If the snapshot is on the same filesystem as the original database, the SST files will be hard-linked, otherwise SST files will be copied. The manifest and CURRENT files will be copied. In addition, if there are multiple column families, log files will be copied for the period covering the start and end of the checkpoint, in order to provide a consistent snapshot across column families.
A Checkpoint object needs to be created for a database before checkpoints are created. The API is as follows:
`Status Create(DB* db, Checkpoint** checkpoint_ptr);`
Given a checkpoint object and a directory, the CreateCheckpoint function creates a consistent snapshot of the database in the given directory.
`Status CreateCheckpoint(const std::string& checkpoint_dir);`
The directory should not already exist and will be created by this API. The directory will be an absolute path. The checkpoint can be used as a read-only copy of the DB or can be opened as a standalone DB. When opened read/write, the SST files continue to be hard links and these links are removed when the files are obsoleted. When the user is done with the snapshot, the user can delete the directory to remove the snapshot.
Checkpoints are used for online backup in MyRocks. which is MySQL using RocksDB as the storage engine . ([MySQL on RocksDB](https://github.com/facebook/mysql-5.6))

@ -0,0 +1,148 @@
---
title: Analysis File Read Latency by Level
layout: post
author: sdong
category: blog
---
In many use cases of RocksDB, people rely on OS page cache for caching compressed data. With this approach, verifying effective of the OS page caching is challenging, because file system is a black box to users.
As an example, a user can tune the DB as following: use level-based compaction, with L1 - L4 sizes to be 1GB, 10GB, 100GB and 1TB. And they reserve about 20GB memory as OS page cache, expecting level 0, 1 and 2 are mostly cached in memory, leaving only reads from level 3 and 4 requiring disk I/Os. However, in practice, it's not easy to verify whether OS page cache does exactly what we expect. For example, if we end up with doing 4 instead of 2 I/Os per query, it's not easy for users to figure out whether the it's because of efficiency of OS page cache or reading multiple blocks for a level. Analysis like it is especially important if users run RocksDB on hard drive disks, for the gap of latency between hard drives and memory is much higher than flash-based SSDs.
In order to make tuning easier, we added new instrumentation to help users analysis latency distribution of file reads in different levels. If users turn DB statistics on, we always keep track of distribution of file read latency for each level. Users can retrieve the information by querying DB property “rocksdb.stats” ( [https://github.com/facebook/rocksdb/blob/v3.13.1/include/rocksdb/db.h#L315-L316](https://github.com/facebook/rocksdb/blob/v3.13.1/include/rocksdb/db.h#L315-L316) ). It will also printed out as a part of compaction summary in info logs periodically.
The output looks like this:
```bash
** Level 0 read latency histogram (micros):
Count: 696 Average: 489.8118 StdDev: 222.40
Min: 3.0000 Median: 452.3077 Max: 1896.0000
Percentiles: P50: 452.31 P75: 641.30 P99: 1068.00 P99.9: 1860.80 P99.99: 1896.00
------------------------------------------------------
[ 2, 3 ) 1 0.144% 0.144%
[ 18, 20 ) 1 0.144% 0.287%
[ 45, 50 ) 5 0.718% 1.006%
[ 50, 60 ) 26 3.736% 4.741% #
[ 60, 70 ) 6 0.862% 5.603%
[ 90, 100 ) 1 0.144% 5.747%
[ 120, 140 ) 2 0.287% 6.034%
[ 140, 160 ) 1 0.144% 6.178%
[ 160, 180 ) 1 0.144% 6.322%
[ 200, 250 ) 9 1.293% 7.615%
[ 250, 300 ) 45 6.466% 14.080% #
[ 300, 350 ) 88 12.644% 26.724% ###
[ 350, 400 ) 88 12.644% 39.368% ###
[ 400, 450 ) 71 10.201% 49.569% ##
[ 450, 500 ) 65 9.339% 58.908% ##
[ 500, 600 ) 74 10.632% 69.540% ##
[ 600, 700 ) 92 13.218% 82.759% ###
[ 700, 800 ) 64 9.195% 91.954% ##
[ 800, 900 ) 35 5.029% 96.983% #
[ 900, 1000 ) 12 1.724% 98.707%
[ 1000, 1200 ) 6 0.862% 99.569%
[ 1200, 1400 ) 2 0.287% 99.856%
[ 1800, 2000 ) 1 0.144% 100.000%
** Level 1 read latency histogram (micros):
(......not pasted.....)
** Level 2 read latency histogram (micros):
(......not pasted.....)
** Level 3 read latency histogram (micros):
(......not pasted.....)
** Level 4 read latency histogram (micros):
(......not pasted.....)
** Level 5 read latency histogram (micros):
Count: 25583746 Average: 421.1326 StdDev: 385.11
Min: 1.0000 Median: 376.0011 Max: 202444.0000
Percentiles: P50: 376.00 P75: 438.00 P99: 1421.68 P99.9: 4164.43 P99.99: 9056.52
------------------------------------------------------
[ 0, 1 ) 2351 0.009% 0.009%
[ 1, 2 ) 6077 0.024% 0.033%
[ 2, 3 ) 8471 0.033% 0.066%
[ 3, 4 ) 788 0.003% 0.069%
[ 4, 5 ) 393 0.002% 0.071%
[ 5, 6 ) 786 0.003% 0.074%
[ 6, 7 ) 1709 0.007% 0.080%
[ 7, 8 ) 1769 0.007% 0.087%
[ 8, 9 ) 1573 0.006% 0.093%
[ 9, 10 ) 1495 0.006% 0.099%
[ 10, 12 ) 3043 0.012% 0.111%
[ 12, 14 ) 2259 0.009% 0.120%
[ 14, 16 ) 1233 0.005% 0.125%
[ 16, 18 ) 762 0.003% 0.128%
[ 18, 20 ) 451 0.002% 0.130%
[ 20, 25 ) 794 0.003% 0.133%
[ 25, 30 ) 1279 0.005% 0.138%
[ 30, 35 ) 1172 0.005% 0.142%
[ 35, 40 ) 1363 0.005% 0.148%
[ 40, 45 ) 409 0.002% 0.149%
[ 45, 50 ) 105 0.000% 0.150%
[ 50, 60 ) 80 0.000% 0.150%
[ 60, 70 ) 280 0.001% 0.151%
[ 70, 80 ) 1583 0.006% 0.157%
[ 80, 90 ) 4245 0.017% 0.174%
[ 90, 100 ) 6572 0.026% 0.200%
[ 100, 120 ) 9724 0.038% 0.238%
[ 120, 140 ) 3713 0.015% 0.252%
[ 140, 160 ) 2383 0.009% 0.261%
[ 160, 180 ) 18344 0.072% 0.333%
[ 180, 200 ) 51873 0.203% 0.536%
[ 200, 250 ) 631722 2.469% 3.005%
[ 250, 300 ) 2721970 10.639% 13.644% ##
[ 300, 350 ) 5909249 23.098% 36.742% #####
[ 350, 400 ) 6522507 25.495% 62.237% #####
[ 400, 450 ) 4296332 16.793% 79.030% ###
[ 450, 500 ) 2130323 8.327% 87.357% ##
[ 500, 600 ) 1553208 6.071% 93.428% #
[ 600, 700 ) 642129 2.510% 95.938% #
[ 700, 800 ) 372428 1.456% 97.394%
[ 800, 900 ) 187561 0.733% 98.127%
[ 900, 1000 ) 85858 0.336% 98.462%
[ 1000, 1200 ) 82730 0.323% 98.786%
[ 1200, 1400 ) 50691 0.198% 98.984%
[ 1400, 1600 ) 38026 0.149% 99.133%
[ 1600, 1800 ) 32991 0.129% 99.261%
[ 1800, 2000 ) 30200 0.118% 99.380%
[ 2000, 2500 ) 62195 0.243% 99.623%
[ 2500, 3000 ) 36684 0.143% 99.766%
[ 3000, 3500 ) 21317 0.083% 99.849%
[ 3500, 4000 ) 10216 0.040% 99.889%
[ 4000, 4500 ) 8351 0.033% 99.922%
[ 4500, 5000 ) 4152 0.016% 99.938%
[ 5000, 6000 ) 6328 0.025% 99.963%
[ 6000, 7000 ) 3253 0.013% 99.976%
[ 7000, 8000 ) 2082 0.008% 99.984%
[ 8000, 9000 ) 1546 0.006% 99.990%
[ 9000, 10000 ) 1055 0.004% 99.994%
[ 10000, 12000 ) 1566 0.006% 100.000%
[ 12000, 14000 ) 761 0.003% 100.003%
[ 14000, 16000 ) 462 0.002% 100.005%
[ 16000, 18000 ) 226 0.001% 100.006%
[ 18000, 20000 ) 126 0.000% 100.006%
[ 20000, 25000 ) 107 0.000% 100.007%
[ 25000, 30000 ) 43 0.000% 100.007%
[ 30000, 35000 ) 15 0.000% 100.007%
[ 35000, 40000 ) 14 0.000% 100.007%
[ 40000, 45000 ) 16 0.000% 100.007%
[ 45000, 50000 ) 1 0.000% 100.007%
[ 50000, 60000 ) 22 0.000% 100.007%
[ 60000, 70000 ) 10 0.000% 100.007%
[ 70000, 80000 ) 5 0.000% 100.007%
[ 80000, 90000 ) 14 0.000% 100.007%
[ 90000, 100000 ) 11 0.000% 100.007%
[ 100000, 120000 ) 33 0.000% 100.007%
[ 120000, 140000 ) 6 0.000% 100.007%
[ 140000, 160000 ) 3 0.000% 100.007%
[ 160000, 180000 ) 7 0.000% 100.007%
[ 200000, 250000 ) 2 0.000% 100.007%
```
In this example, you can see we only issued 696 reads from level 0 while issued 25 million reads from level 5. The latency distribution is also clearly shown among those reads. This will be helpful for users to analysis OS page cache efficiency.
Currently the read latency per level includes reads from data blocks, index blocks, as well as bloom filter blocks. We are also working on a feature to break down those three type of blocks.

@ -0,0 +1,41 @@
---
title: Option of Compaction Priority
layout: post
author: sdong
category: blog
---
The most popular compaction style of RocksDB is level-based compaction, which is an improved version of LevelDB's compaction algorithm. Page 9- 16 of this [slides](https://github.com/facebook/rocksdb/blob/gh-pages/talks/2015-09-29-HPTS-Siying-RocksDB.pdf) gives an illustrated introduction of this compaction style. The basic idea that: data is organized by multiple levels with exponential increasing target size. Except a special level 0, every level is key-range partitioned into many files. When size of a level exceeds its target size, we pick one or more of its files, and merge the file into the next level.
Which file to pick to compact is an interesting question. LevelDB only uses one thread for compaction and it always picks files in round robin manner. We implemented multi-thread compaction in RocksDB by picking multiple files from the same level and compact them in parallel. We had to move away from LevelDB's file picking approach. Recently, we created an option [options.compaction_pri](https://github.com/facebook/rocksdb/blob/d6c838f1e130d8860407bc771fa6d4ac238859ba/include/rocksdb/options.h#L83-L93), which indicated three different algorithms to pick files to compact.
Why do we need to multiple algorithms to choose from? Because there are different factors to consider when picking the files, and we now don't yet know how to balance them automatically, so we expose it to users to choose. Here are factors to consider:
**Write amplification**
When we estimate write amplification, we usually simplify the problem by assuming keys are uniformly distributed inside each level. In reality, it is not the case, even if user updates are uniformly distributed across the whole key range. For instance, when we compact one file of a level to the next level, it creates a hole. Over time, incoming compaction will fill data to the hole, but the density will still be lower for a while. Picking a file with keys least densely populated is more expensive to get the file to the next level, because there will be more overlapping files in the next level so we need to rewrite more data. For example, assume a file is 100MB, if an L2 file overlaps with 8 L3 files, we need to rewrite about 800MB of data to get the file to L3. If the file overlaps with 12 L3 files, we'll need to rewrite about 1200MB to get a file of the same size out of L2. It uses 50% more writes. (This analysis ignores the key density of the next level, because the range covers N times of files in that level so one hole only impacts write amplification by 1/N)
If all the updates are uniformly distributed, LevelDB's approach optimizes write amplification, because a file being picked covers a range whose last compaction time to the next level is the oldest, so the range will accumulated keys from incoming compactions for the longest and the density is the highest.
We created a compaction priority **kOldestSmallestSeqFirst **for the same effect. With this mode, we always pick the file covers the oldest updates in the level, which usually is contains the densest key range. If you have a use case where writes are uniformly distributed across the key space and you want to reduce write amplification, you should set options.compaction_pri=kOldestSmallestSeqFirst.
**Optimize for small working set**
We are assuming updates are uniformly distributed across the whole key space in previous analysis. However, in many use cases, there are subset of keys that are frequently updated while other key ranges are very cold. In this case, keeping hot key ranges from compacting to deeper levels will benefit write amplification, as well as space amplification. For example, if in a DB only key 150-160 are updated and other keys are seldom updated. If level 1 contains 20 keys, we want to keep 150-160 all stay in level 1. Because when next level 0 -> 1 compaction comes, it will simply overwrite existing keys so size level 1 doesn't increase, so no need to schedule further compaction for level 1->2. On the other hand, if we compact key 150-155 to level2, when a new Level 1->2 compaction comes, it increases the size of level 1, making size of level 1 exceed target size and more compactions will be needed, which generates more writes.
The compaction priority **kOldestLargestSeqFirst **optimizes this use case. In this mode, we will pick a file whose latest update is the oldest. It means there is no incoming data for the range for the longest. Usually it is the coldest range. By compacting coldest range first, we leave the hot ranges in the level. If your use case is to overwrite existing keys in a small range, try options.compaction_pri=kOldestLargestSeqFirst**.**
**Drop delete marker sooner**
If one file contains a lot of delete markers, it may slow down iterating over this area, because we still need to iterate those deleted keys just to ignore them. Furthermore, the sooner we compact delete keys into the last level, the sooner the disk space is reclaimed, so it is good for space efficiency.
Our default compaction priority **kByCompensatedSize **considers the case. If number of deletes in a file exceeds number of inserts, it is more likely to be picked for compaction. The more number of deletes exceed inserts, the more likely it is being compacted. The optimization is added to avoid the worst performance of space efficiency and query performance when a large percentage of the DB is deleted.
**Efficiency of compaction filter**
Usually people use [compaction filters](https://github.com/facebook/rocksdb/blob/v4.1/include/rocksdb/options.h#L201-L226) to clean up old data to free up space. Picking files to compact may impact space efficiency. We don't yet have a a compaction priority to optimize this case. In some of our use cases, we solved the problem in a different way: we have an external service checking modify time of all SST files. If any of the files is too old, we force the single file to compaction by calling DB::CompactFiles() using the single file. In this way, we can provide a time bound of data passing through compaction filters.
In all, there three choices of compaction priority modes optimizing different scenarios. if you have a new use case, we suggest you start with options.compaction_pri=kOldestSmallestSeqFirst (note it is not the default one for backward compatible reason). If you want to further optimize your use case, you can try other two use cases if your use cases apply.
If you have good ideas about better compaction picker approach, you are welcome to implement and benchmark it. We'll be glad to review and merge your a pull requests.

@ -0,0 +1,51 @@
---
title: RocksDB 4.2 Release!
layout: post
author: sdong
category: blog
---
New RocksDB release - 4.2!
**New Features**
1. Introduce CreateLoggerFromOptions(), this function create a Logger for provided DBOptions.
2. Add GetAggregatedIntProperty(), which returns the sum of the GetIntProperty of all the column families.
3. Add MemoryUtil in rocksdb/utilities/memory.h. It currently offers a way to get the memory usage by type from a list rocksdb instances.
**Public API changes**
1. CompactionFilter::Context includes information of Column Family ID
2. The need-compaction hint given by TablePropertiesCollector::NeedCompact() will be persistent and recoverable after DB recovery. This introduces a breaking format change. If you use this experimental feature, including NewCompactOnDeletionCollectorFactory() in the new version, you may not be able to directly downgrade the DB back to version 4.0 or lower.
3. TablePropertiesCollectorFactory::CreateTablePropertiesCollector() now takes an option Context, containing the information of column family ID for the file being written.
4. Remove DefaultCompactionFilterFactory.
[https://github.com/facebook/rocksdb/releases/tag/v4.2](https://github.com/facebook/rocksdb/releases/tag/v4.2)

@ -0,0 +1,18 @@
---
title: RocksDB AMA
layout: post
author: yhchiang
category: blog
---
RocksDB developers are doing a Reddit Ask-Me-Anything now at 10AM – 11AM PDT! We welcome you to stop by and ask any RocksDB related questions, including existing / upcoming features, tuning tips, or database design.
Here are some enhancements that we'd like to focus on over the next six months:
* 2-Phase Commit
* Lua support in some custom functions
* Backup and repair tools
* Direct I/O to bypass OS cache
* RocksDB Java API
[https://www.reddit.com/r/IAmA/comments/47k1si/we_are_rocksdb_developers_ask_us_anything/](https://www.reddit.com/r/IAmA/comments/47k1si/we_are_rocksdb_developers_ask_us_anything/)

@ -0,0 +1,26 @@
---
title: RocksDB Options File
layout: post
author: yhciang
category: blog
---
In RocksDB 4.3, we added a new set of features that makes managing RocksDB options easier. Specifically:
* **Persisting Options Automatically**: Each RocksDB database will now automatically persist its current set of options into an INI file on every successful call of DB::Open(), SetOptions(), and CreateColumnFamily() / DropColumnFamily().
* **Load Options from File**: We added [LoadLatestOptions() / LoadOptionsFromFile()](https://github.com/facebook/rocksdb/blob/4.3.fb/include/rocksdb/utilities/options_util.h#L48-L58) that enables developers to construct RocksDB options object from an options file.
* **Sanity Check Options**: We added [CheckOptionsCompatibility](https://github.com/facebook/rocksdb/blob/4.3.fb/include/rocksdb/utilities/options_util.h#L64-L77) that performs compatibility check on two sets of RocksDB options.
Want to know more about how to use this new features? Check out the [RocksDB Options File wiki page](https://github.com/facebook/rocksdb/wiki/RocksDB-Options-File) and start using this new feature today!

@ -1,12 +0,0 @@
---
title: Blog Post Example
layout: post
author: exampleauthor
category: blog
---
This is an example blog post introduction, try to keep it short and about a paragraph long, to encourage people to click through to read the entire post.
<!--truncate-->
Everything below the `<!--truncate-->` tag will only show on the actual blog post page, not on the /blog/ index.

@ -0,0 +1,56 @@
---
title: RocksDB 4.5.1 Released!
layout: post
author: sdong
category: blog
---
## 4.5.1 (3/25/2016)
### Bug Fixes
*  Fix failures caused by the destorying order of singleton objects.
<br/>
## 4.5.0 (2/5/2016)
### Public API Changes
* Add a new perf context level between kEnableCount and kEnableTime. Level 2 now does not include timers for mutexes.
* Statistics of mutex operation durations will not be measured by default. If you want to have them enabled, you need to set Statistics::stats_level_ to kAll.
* DBOptions::delete_scheduler and NewDeleteScheduler() are removed, please use DBOptions::sst_file_manager and NewSstFileManager() instead
### New Features
* ldb tool now supports operations to non-default column families.
* Add kPersistedTier to ReadTier. This option allows Get and MultiGet to read only the persited data and skip mem-tables if writes were done with disableWAL = true.
* Add DBOptions::sst_file_manager. Use NewSstFileManager() in include/rocksdb/sst_file_manager.h to create a SstFileManager that can be used to track the total size of SST files and control the SST files deletion rate.
<br/>
## 4.4.0 (1/14/2016)
### Public API Changes
* Change names in CompactionPri and add a new one.
* Deprecate options.soft_rate_limit and add options.soft_pending_compaction_bytes_limit.
* If options.max_write_buffer_number > 3, writes will be slowed down when writing to the last write buffer to delay a full stop.
* Introduce CompactionJobInfo::compaction_reason, this field include the reason to trigger the compaction.
* After slow down is triggered, if estimated pending compaction bytes keep increasing, slowdown more.
* Increase default options.delayed_write_rate to 2MB/s.
* Added a new parameter --path to ldb tool. --path accepts the name of either MANIFEST, SST or a WAL file. Either --db or --path can be used when calling ldb.
<br/>
## 4.3.0 (12/8/2015)
### New Features
* CompactionFilter has new member function called IgnoreSnapshots which allows CompactionFilter to be called even if there are snapshots later than the key.
* RocksDB will now persist options under the same directory as the RocksDB database on successful DB::Open, CreateColumnFamily, DropColumnFamily, and SetOptions.
* Introduce LoadLatestOptions() in rocksdb/utilities/options_util.h. This function can construct the latest DBOptions / ColumnFamilyOptions used by the specified RocksDB intance.
* Introduce CheckOptionsCompatibility() in rocksdb/utilities/options_util.h. This function checks whether the input set of options is able to open the specified DB successfully.
### Public API Changes
* When options.db_write_buffer_size triggers, only the column family with the largest column family size will be flushed, not all the column families.

@ -0,0 +1,44 @@
---
title: RocksDB 4.8 Released!
layout: post
author: yiwu
category: blog
---
## 4.8.0 (5/2/2016)
### [](https://github.com/facebook/rocksdb/blob/master/HISTORY.md#public-api-change-1)Public API Change
* Allow preset compression dictionary for improved compression of block-based tables. This is supported for zlib, zstd, and lz4. The compression dictionary's size is configurable via CompressionOptions::max_dict_bytes.
* Delete deprecated classes for creating backups (BackupableDB) and restoring from backups (RestoreBackupableDB). Now, BackupEngine should be used for creating backups, and BackupEngineReadOnly should be used for restorations. For more details, see [https://github.com/facebook/rocksdb/wiki/How-to-backup-RocksDB%3F](https://github.com/facebook/rocksdb/wiki/How-to-backup-RocksDB%3F)
* Expose estimate of per-level compression ratio via DB property: "rocksdb.compression-ratio-at-levelN".
* Added EventListener::OnTableFileCreationStarted. EventListener::OnTableFileCreated will be called on failure case. User can check creation status via TableFileCreationInfo::status.
### [](https://github.com/facebook/rocksdb/blob/master/HISTORY.md#new-features-2)New Features
* Add ReadOptions::readahead_size. If non-zero, NewIterator will create a new table reader which performs reads of the given size.
<br/>
## [](https://github.com/facebook/rocksdb/blob/master/HISTORY.md#470-482016)4.7.0 (4/8/2016)
### [](https://github.com/facebook/rocksdb/blob/master/HISTORY.md#public-api-change-2)Public API Change
* rename options compaction_measure_io_stats to report_bg_io_stats and include flush too.
* Change some default options. Now default options will optimize for server-workloads. Also enable slowdown and full stop triggers for pending compaction bytes. These changes may cause sub-optimal performance or significant increase of resource usage. To avoid these risks, users can open existing RocksDB with options extracted from RocksDB option files. See [https://github.com/facebook/rocksdb/wiki/RocksDB-Options-File](https://github.com/facebook/rocksdb/wiki/RocksDB-Options-File) for how to use RocksDB option files. Or you can call Options.OldDefaults() to recover old defaults. DEFAULT_OPTIONS_HISTORY.md will track change history of default options.
<br/>
## [](https://github.com/facebook/rocksdb/blob/master/HISTORY.md#460-3102016)4.6.0 (3/10/2016)
### [](https://github.com/facebook/rocksdb/blob/master/HISTORY.md#public-api-changes-1)Public API Changes
* Change default of BlockBasedTableOptions.format_version to 2. It means default DB created by 4.6 or up cannot be opened by RocksDB version 3.9 or earlier
* Added strict_capacity_limit option to NewLRUCache. If the flag is set to true, insert to cache will fail if no enough capacity can be free. Signature of Cache::Insert() is updated accordingly.
* Tickers [NUMBER_DB_NEXT, NUMBER_DB_PREV, NUMBER_DB_NEXT_FOUND, NUMBER_DB_PREV_FOUND, ITER_BYTES_READ] are not updated immediately. The are updated when the Iterator is deleted.
* Add monotonically increasing counter (DB property "rocksdb.current-super-version-number") that increments upon any change to the LSM tree.
### [](https://github.com/facebook/rocksdb/blob/master/HISTORY.md#new-features-3)New Features
* Add CompactionPri::kMinOverlappingRatio, a compaction picking mode friendly to write amplification.
* Deprecate Iterator::IsKeyPinned() and replace it with Iterator::GetProperty() with prop_name="rocksdb.iterator.is.key.pinned"

Binary file not shown.

After

Width:  |  Height:  |  Size: 26 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 17 KiB

Loading…
Cancel
Save