Summary:
Tried to:
- preserve existing links
- move existing images over (there were 2)
- preserve code blocks (modified where appropriate)
- etc.

Also, as agreed upon:
- All blog posts are preserved.
- Comments are not preserved.
- Comments will not be turned on for future blog posts (use the FB developer group instead).
- A Like button appears at the end of each blog post.

Depends on https://reviews.facebook.net/D63051

Test Plan: Visual

Reviewers: IslamAbdelRahman, lgalanis

Reviewed By: lgalanis

Subscribers: andrewkr, dhruba, leveldb

Differential Revision: https://reviews.facebook.net/D63105
main
parent
ee0e2201e0
commit
3c2262400f
@ -0,0 +1,133 @@ |
|||||||
|
--- |
||||||
|
title: How to backup RocksDB? |
||||||
|
layout: post |
||||||
|
author: icanadi |
||||||
|
category: blog |
||||||
|
--- |
||||||
|
|
||||||
|
In RocksDB, we have implemented an easy way to back up your DB. Here is a simple example: |
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
#include "rocksdb/db.h" |
||||||
|
#include "utilities/backupable_db.h" |
||||||
|
using namespace rocksdb; |
||||||
|
|
||||||
|
DB* db; |
||||||
|
DB::Open(Options(), "/tmp/rocksdb", &db); |
||||||
|
BackupableDB* backupable_db = new BackupableDB(db, BackupableDBOptions("/tmp/rocksdb_backup")); |
||||||
|
backupable_db->Put(...); // do your thing |
||||||
|
backupable_db->CreateNewBackup(); |
||||||
|
delete backupable_db; // no need to also delete db |
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
This simple example will create a backup of your DB in "/tmp/rocksdb_backup". Creating a new `BackupableDB` consumes the `DB*`, so you should call all DB methods on the `backupable_db` object from then on. |
||||||
|
|
||||||
|
Restoring is also easy: |
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
RestoreBackupableDB* restore = new RestoreBackupableDB(Env::Default(), BackupableDBOptions("/tmp/rocksdb_backup")); |
||||||
|
restore->RestoreDBFromLatestBackup("/tmp/rocksdb", "/tmp/rocksdb"); |
||||||
|
delete restore; |
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
This code restores the backup to "/tmp/rocksdb". The second parameter is the location of the log files (in some DBs it differs from the DB directory, but usually the two are the same; see `Options::wal_dir` for more info). |
||||||
|
|
||||||
|
An alternative API for backups is to use BackupEngine directly: |
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
#include "rocksdb/db.h" |
||||||
|
#include "utilities/backupable_db.h" |
||||||
|
using namespace rocksdb; |
||||||
|
|
||||||
|
DB* db; |
||||||
|
DB::Open(Options(), "/tmp/rocksdb", &db); |
||||||
|
db->Put(...); // do your thing |
||||||
|
BackupEngine* backup_engine = BackupEngine::NewBackupEngine(Env::Default(), BackupableDBOptions("/tmp/rocksdb_backup")); |
||||||
|
backup_engine->CreateNewBackup(db); |
||||||
|
delete db; |
||||||
|
delete backup_engine; |
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
Restoring with BackupEngine is similar to RestoreBackupableDB: |
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
BackupEngine* backup_engine = BackupEngine::NewBackupEngine(Env::Default(), BackupableDBOptions("/tmp/rocksdb_backup")); |
||||||
|
backup_engine->RestoreDBFromLatestBackup("/tmp/rocksdb", "/tmp/rocksdb"); |
||||||
|
delete backup_engine; |
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
Backups are incremental. You can create a new backup with `CreateNewBackup()`, and only the new data will be copied to the backup directory (for more details on what gets copied, see "Under the hood"). A checksum is always calculated for every backed-up file (sst, log, etc.) and is used to make sure files stay intact in the file system. Checksums are also verified for files from previous backups, even though those files do not need to be copied again. A checksum mismatch aborts the current backup (see "Under the hood" for more details). Once you have several backups saved, you can call `GetBackupInfo()` to get a list of all backups, together with the timestamp and size of each (note that the sum of all backups' sizes is larger than the actual size of the backup directory, because some data is shared by multiple backups). Backups are identified by their always-increasing IDs. `GetBackupInfo()` is available in both `BackupableDB` and `RestoreBackupableDB`. |
||||||
|
|
||||||
|
You probably want to keep only a small number of backups around. To delete old backups, just call `PurgeOldBackups(N)`, where N is how many backups you'd like to keep. All backups except the N newest ones will be deleted. You can also delete an arbitrary backup by calling `DeleteBackup(id)`. |
||||||
|
|
||||||
|
`RestoreDBFromLatestBackup()` will restore the DB from the latest consistent backup. An alternative is `RestoreDBFromBackup()`, which takes a backup ID and restores that particular backup. A checksum is calculated for every restored file and compared against the one stored at backup time. If a checksum mismatch is detected, the restore process is aborted and `Status::Corruption` is returned. One very important thing to note: let's say you have backups 1, 2, 3, 4. If you restore from backup 2 and start writing more data to your database, the next backup you create will delete old backups 3 and 4 and create a new backup 3 on top of 2. |
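
The ID behaviour described above (purging, and a restore followed by a new backup) can be modelled with a toy catalog. This is a sketch under stated assumptions, not RocksDB code: `ToyBackupCatalog` and all of its members are hypothetical names that only track backup IDs, and the rule that a backup created after a restore first drops the newer backups is taken directly from the paragraph above.

```c++
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Toy model of backup IDs only. Hypothetical class, not the RocksDB implementation.
class ToyBackupCatalog {
 public:
  // Creates a new backup and returns its ID. If the DB was restored from an
  // older backup, all backups newer than that one are dropped first.
  uint32_t CreateNewBackup() {
    if (restored_from_ != 0) {
      ids_.erase(ids_.upper_bound(restored_from_), ids_.end());
      restored_from_ = 0;
    }
    const uint32_t id = ids_.empty() ? 1 : *ids_.rbegin() + 1;
    ids_.insert(id);
    return id;
  }

  // Keeps only the n newest backups.
  void PurgeOldBackups(size_t n) {
    while (ids_.size() > n) ids_.erase(ids_.begin());
  }

  // Records that the DB now holds the state of backup `id`.
  bool RestoreDBFromBackup(uint32_t id) {
    if (ids_.count(id) == 0) return false;
    restored_from_ = id;
    return true;
  }

  std::vector<uint32_t> Ids() const {
    return std::vector<uint32_t>(ids_.begin(), ids_.end());
  }

 private:
  std::set<uint32_t> ids_;      // sorted, so the newest backup is last
  uint32_t restored_from_ = 0;  // 0 means "not restored from a backup"
};
```

With backups 1 to 4 in the catalog, restoring from 2 and then creating a backup leaves IDs 1, 2, 3 in this model, which is why you may want to copy backups 3 and 4 elsewhere before restoring from an older one.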
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
## Advanced usage |
||||||
|
|
||||||
|
|
||||||
|
Let's say you want to back up your DB to HDFS. There is an option in `BackupableDBOptions` to set `backup_env`, which will be used for all file I/O related to the backup directory (writes when backing up, reads when restoring). If you set it to an HDFS Env, all the backups will be stored in HDFS. |
||||||
|
|
||||||
|
`BackupableDBOptions::info_log` is a Logger object that, if it is not nullptr, is used to print LOG messages. |
||||||
|
|
||||||
|
If `BackupableDBOptions::sync` is true, we will sync data to disk after every file write, guaranteeing that backups will be consistent after a reboot or a machine crash. Setting it to false will speed things up a bit, but some (newer) backups might be inconsistent. In most cases, everything should be fine, though. |
||||||
|
|
||||||
|
If you set `BackupableDBOptions::destroy_old_data` to true, creating a new `BackupableDB` will delete all the old backups in the backup directory. |
||||||
|
|
||||||
|
The `BackupableDB::CreateNewBackup()` method takes a parameter `flush_before_backup`, which is false by default. When `flush_before_backup` is true, `BackupableDB` will first issue a memtable flush and only then copy the DB files to the backup directory. Doing so prevents log files from being copied to the backup directory (since the flush deletes them). If `flush_before_backup` is false, the backup will not issue a flush before starting. In that case, the backup will also include the log files corresponding to live memtables. The backup will be consistent with the current state of the database regardless of the `flush_before_backup` parameter. |
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
## Under the hood |
||||||
|
|
||||||
|
|
||||||
|
`BackupableDB` implements the `DB` interface and adds four methods to it: `CreateNewBackup()`, `GetBackupInfo()`, `PurgeOldBackups()`, `DeleteBackup()`. Any `DB` interface calls will get forwarded to the underlying `DB` object. |
||||||
|
|
||||||
|
When you call `BackupableDB::CreateNewBackup()`, it does the following: |
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
1. Disable file deletions |
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
2. Get live files (this includes table files, current and manifest file). |
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
3. Copy live files to the backup directory. Since table files are immutable and filenames are unique, we don't copy a table file that is already present in the backup directory. For example, if a file `00050.sst` is already backed up and `GetLiveFiles()` returns `00050.sst`, we will not copy that file to the backup directory. However, a checksum is calculated for every file, whether or not it needs to be copied. If a file is already present, the calculated checksum is compared against the previously calculated checksum to make sure the file has not changed between backups. If a mismatch is detected, the backup is aborted and the system is restored to the state it was in before `BackupableDB::CreateNewBackup()` was called. Note that an aborted backup can indicate corruption either in a file in the backup directory or in the corresponding live file in the current DB. Both the manifest and current files are always copied, since they are not immutable. |
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
4. If `flush_before_backup` was set to false, we also need to copy log files to the backup directory. We call `GetSortedWalFiles()` and copy all live log files to the backup directory. |
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
5. Enable file deletions |
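
The deduplication and checksum logic of step 3 can be sketched with standard containers. This is a minimal illustration under stated assumptions, not the RocksDB implementation: `ToyChecksum`, `ToyCreateNewBackup`, `LiveFiles` and `BackupDir` are hypothetical names, files are modelled as in-memory strings, FNV-1a stands in for whatever checksum the real backup uses, and only immutable table files are modelled (the manifest and current files, which are always copied, are left out).

```c++
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for a file checksum (FNV-1a over the contents).
inline uint32_t ToyChecksum(const std::string& contents) {
  uint32_t h = 2166136261u;
  for (unsigned char c : contents) {
    h ^= c;
    h *= 16777619u;
  }
  return h;
}

using LiveFiles = std::map<std::string, std::string>;  // filename -> contents
using BackupDir = std::map<std::string, uint32_t>;     // filename -> stored checksum

enum class BackupResult { kOk, kCorruption };

// Copies files that are missing from the backup and verifies the checksum of
// files that are already present. On a mismatch the backup is aborted and the
// backup directory is left exactly as it was before the call.
inline BackupResult ToyCreateNewBackup(const LiveFiles& live, BackupDir* backup,
                                       int* files_copied) {
  BackupDir staged = *backup;  // work on a copy so an abort changes nothing
  int copied = 0;
  for (const auto& file : live) {
    const uint32_t sum = ToyChecksum(file.second);
    auto it = staged.find(file.first);
    if (it == staged.end()) {
      staged.emplace(file.first, sum);  // new file: "copy" it
      ++copied;
    } else if (it->second != sum) {
      *files_copied = 0;
      return BackupResult::kCorruption;  // existing file changed: abort
    }
  }
  *backup = std::move(staged);
  *files_copied = copied;
  return BackupResult::kOk;
}
```

Staging the changes in a copy is one simple way to get the "restored to the previous state" behaviour on abort; a real implementation working on a file system has to clean up partially copied files instead.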
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
Backup IDs are always increasing, and a file named `LATEST_BACKUP` contains the ID of the latest backup. If we crash in the middle of backing up, on restart we will detect that there are newer backup files than `LATEST_BACKUP` claims there are. In that case, we will delete any backup newer than `LATEST_BACKUP` and clean up all the files, since some of the table files might be corrupted. Having corrupted table files in the backup directory is dangerous because of our deduplication strategy. |
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
## Further reading |
||||||
|
|
||||||
|
|
||||||
|
For the API details, see `include/utilities/backupable_db.h`. For the implementation, see `utilities/backupable/backupable_db.cc`. |
@ -0,0 +1,50 @@ |
|||||||
|
--- |
||||||
|
title: How to persist in-memory RocksDB database? |
||||||
|
layout: post |
||||||
|
author: icanadi |
||||||
|
category: blog |
||||||
|
--- |
||||||
|
|
||||||
|
In recent months, we have focused on optimizing RocksDB for in-memory workloads. With growing RAM sizes and strict low-latency requirements, lots of applications decide to keep their entire data in memory. Running an in-memory database with RocksDB is easy -- just mount your RocksDB directory on tmpfs or ramfs [1]. Even if the process crashes, RocksDB can recover all of your data from the in-memory filesystem. However, what happens if the machine reboots? |
||||||
|
|
||||||
|
In this article we will explain how you can recover your in-memory RocksDB database even after a machine reboot. |
||||||
|
|
||||||
|
Every update to RocksDB is written to two places: an in-memory data structure called the memtable, and the write-ahead log. The write-ahead log can be used to completely recover the data in the memtable. By default, when we flush the memtable to a table file, we also delete the current log, since we no longer need it for recovery (the data from the log is "persisted" in the table file -- we say that the log file is obsolete). However, if your table file is stored in an in-memory file system, you may need the obsolete write-ahead log to recover the data after the machine reboots. Here's how you can do that. |
||||||
|
|
||||||
|
Options::wal_dir is the directory where RocksDB stores write-ahead log files. If you configure this directory to be on flash or disk, you will not lose the current log file on machine reboot. |
||||||
|
Options::WAL_ttl_seconds is the timeout after which archived log files are deleted. If the timeout is non-zero, obsolete log files will be moved to the `archive/` directory under Options::wal_dir. Those archived log files will only be deleted after the specified timeout. |
||||||
|
|
||||||
|
Let's assume Options::wal_dir is a directory on persistent storage and Options::WAL_ttl_seconds is set to one day. To fully recover the DB, we also need to back up the current snapshot of the database (containing table and metadata files) at intervals of less than one day. RocksDB provides a utility that enables you to easily back up the snapshot of your database. You can learn more about it here: [How to backup RocksDB?](https://github.com/facebook/rocksdb/wiki/How-to-backup-RocksDB%3F) |
||||||
|
|
||||||
|
You should configure the backup process to avoid backing up log files, since they are already stored in persistent storage. To do that, set BackupableDBOptions::backup_log_files to false. |
||||||
|
|
||||||
|
By default, the restore process cleans up the entire DB and WAL directory. Since we didn't include log files in the backup, we need to make sure that restoring the database doesn't delete the log files in the WAL directory. When restoring, set RestoreOptions::keep_log_files to true. That option will also move any archived log files back to the WAL directory, enabling RocksDB to replay all archived log files and rebuild the in-memory database state. |
||||||
|
|
||||||
|
To reiterate, here's what you have to do: |
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
* Set DB directory to tmpfs or ramfs mounted drive |
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
* Set Options::wal_dir to a directory on persistent storage |
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
* Set Options::WAL_ttl_seconds to T seconds |
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
* Backup RocksDB every T/2 seconds, with BackupableDBOptions::backup_log_files = false |
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
* When you lose data, restore from backup with RestoreOptions::keep_log_files = true |
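
The reason for backing up every T/2 seconds with a WAL TTL of T can be checked with a little arithmetic. The sketch below is an illustration only: `ArchivedLogsCoverGap` and `WorstCaseBackupAge` are hypothetical helpers, not RocksDB functions, and they assume that any log needed for recovery became obsolete after the last backup started, so it stays in `archive/` for at least T seconds after that point.

```c++
#include <cassert>
#include <cstdint>

// All times are in seconds. Hypothetical helper, not part of RocksDB.
// Logs written after the last backup started are archived no earlier than
// that moment, so they are all still present while less than
// wal_ttl_seconds has passed since the backup started.
inline bool ArchivedLogsCoverGap(uint64_t last_backup_start, uint64_t now,
                                 uint64_t wal_ttl_seconds) {
  assert(now >= last_backup_start);
  return now - last_backup_start < wal_ttl_seconds;
}

// Worst-case age of the newest completed backup when a backup starts every
// `interval` seconds and takes at most `max_backup_duration` seconds.
inline uint64_t WorstCaseBackupAge(uint64_t interval,
                                   uint64_t max_backup_duration) {
  return interval + max_backup_duration;
}
```

With T set to one day and a backup every T/2, the newest completed backup is never older than half a day plus the backup duration, which leaves a comfortable margin before archived logs start to expire.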
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
[1] You might also want to consider using [PlainTable format](https://github.com/facebook/rocksdb/wiki/PlainTable-Format) for table files |
@ -0,0 +1,39 @@ |
|||||||
|
--- |
||||||
|
title: The 1st RocksDB Local Meetup Held on March 27, 2014 |
||||||
|
layout: post |
||||||
|
author: xjin |
||||||
|
category: blog |
||||||
|
--- |
||||||
|
|
||||||
|
On Mar 27, 2014, the RocksDB team @ Facebook held the 1st RocksDB local meetup at FB HQ (Menlo Park, California). We invited around 80 guests from 20+ local companies, including LinkedIn, Twitter, Dropbox, Square, Pinterest, MapR, Microsoft and IBM. In the end, around 50 guests attended, a show-up rate of roughly 60%. |
||||||
|
|
||||||
|
[![Resize of 20140327_200754](/static/images/Resize-of-20140327_200754-300x225.jpg)](/static/images/Resize-of-20140327_200754-300x225.jpg) |
||||||
|
|
||||||
|
RocksDB team @ Facebook gave four talks about the latest progress and experience on RocksDB: |
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
* [Supporting a 1PB In-Memory Workload](https://github.com/facebook/rocksdb/raw/gh-pages/talks/2014-03-27-RocksDB-Meetup-Haobo-RocksDB-In-Memory.pdf) |
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
* [Column Families in RocksDB](https://github.com/facebook/rocksdb/raw/gh-pages/talks/2014-03-27-RocksDB-Meetup-Igor-Column-Families.pdf) |
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
* ["Lockless" Get() in RocksDB?](https://github.com/facebook/rocksdb/raw/gh-pages/talks/2014-03-27-RocksDB-Meetup-Lei-Lockless-Get.pdf) |
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
* [Prefix Hashing in RocksDB](https://github.com/facebook/rocksdb/raw/gh-pages/talks/2014-03-27-RocksDB-Meetup-Siying-Prefix-Hash.pdf) |
||||||
|
|
||||||
|
|
||||||
|
A very interesting question, asked by many guests, was: does RocksDB plan to provide replication functionality? Obviously, many applications need a resilient and distributed storage solution, not just single-node storage. We are considering how to approach this issue. |
||||||
|
|
||||||
|
When will the next meetup be? We haven't decided yet. We will see whether the community is interested in it and how it can help RocksDB grow. |
||||||
|
|
||||||
|
If you have any questions or feedback for the meetup or RocksDB, please let us know in [our Facebook group](https://www.facebook.com/groups/rocksdb.dev/). |
@ -0,0 +1,37 @@ |
|||||||
|
--- |
||||||
|
title: RocksDB 2.8 release |
||||||
|
layout: post |
||||||
|
author: icanadi |
||||||
|
category: blog |
||||||
|
--- |
||||||
|
|
||||||
|
Check out the new RocksDB 2.8 release on [Github](https://github.com/facebook/rocksdb/releases/tag/2.8.fb). |
||||||
|
|
||||||
|
RocksDB 2.8 is mostly focused on improving performance for in-memory workloads. We are seeing read QPS as high as 5M (we will write a separate blog post on this). Here is a summary of the new features: |
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
* Added a new table format called PlainTable, which is optimized for RAM storage (ramfs or tmpfs). You can read more details about it on [our wiki](https://github.com/facebook/rocksdb/wiki/PlainTable-Format). |
||||||
|
|
||||||
|
|
||||||
|
* New prefixed memtable format HashLinkedList, which is optimized for cases where there are only a few keys for each prefix. |
||||||
|
|
||||||
|
|
||||||
|
* Merge operator supports a new function PartialMergeMulti() that allows users to do partial merges against multiple operands. This function enables big speedups for workloads that use merge operators. |
||||||
|
|
||||||
|
|
||||||
|
* Added a V2 compaction filter interface. It buffers the kv-pairs sharing the same key prefix, processes them in batches, and returns the batched results to the DB. |
||||||
|
|
||||||
|
|
||||||
|
* Geo-spatial support for locations and radial-search. |
||||||
|
|
||||||
|
|
||||||
|
* Improved read performance using thread local cache for frequently accessed data. |
||||||
|
|
||||||
|
|
||||||
|
* Stability improvements -- we now ignore a partially written trailing record in MANIFEST or WAL files. |
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
We have also introduced small incompatible API changes (mostly for advanced users). You can see the full release notes in our [HISTORY.md](https://github.com/facebook/rocksdb/blob/2.8.fb/HISTORY.md) file. |
@ -0,0 +1,25 @@ |
|||||||
|
--- |
||||||
|
title: RocksDB 3.0 release |
||||||
|
layout: post |
||||||
|
author: icanadi |
||||||
|
category: blog |
||||||
|
--- |
||||||
|
|
||||||
|
Check out the new RocksDB release on [Github](https://github.com/facebook/rocksdb/releases/tag/3.0.fb)! |
||||||
|
|
||||||
|
New features in RocksDB 3.0: |
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
* [Column Family support](https://github.com/facebook/rocksdb/wiki/Column-Families) |
||||||
|
|
||||||
|
|
||||||
|
* [Ability to choose a different checksum function](https://github.com/facebook/rocksdb/commit/0afc8bc29a5800e3212388c327c750d32e31f3d6) |
||||||
|
|
||||||
|
|
||||||
|
* Deprecated ReadOptions::prefix_seek and ReadOptions::prefix |
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
Check out the full [change log](https://github.com/facebook/rocksdb/blob/3.0.fb/HISTORY.md). |
@ -0,0 +1,22 @@ |
|||||||
|
--- |
||||||
|
title: RocksDB 3.1 release |
||||||
|
layout: post |
||||||
|
author: icanadi |
||||||
|
category: blog |
||||||
|
--- |
||||||
|
|
||||||
|
Check out the new release on [Github](https://github.com/facebook/rocksdb/releases/tag/rocksdb-3.1)! |
||||||
|
|
||||||
|
New features in RocksDB 3.1: |
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
* [Materialized hash index](https://github.com/facebook/rocksdb/commit/0b3d03d026a7248e438341264b4c6df339edc1d7) |
||||||
|
|
||||||
|
|
||||||
|
* [FIFO compaction style](https://github.com/facebook/rocksdb/wiki/FIFO-compaction-style) |
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
We released 3.1 so soon after 3.0 because one of our internal customers needed the materialized hash index. |
@ -0,0 +1,37 @@ |
|||||||
|
--- |
||||||
|
title: PlainTable — A New File Format |
||||||
|
layout: post |
||||||
|
author: sdong |
||||||
|
category: blog |
||||||
|
--- |
||||||
|
|
||||||
|
In this post, we are introducing "PlainTable" -- a file format we designed for RocksDB, initially to satisfy a production use case at Facebook. |
||||||
|
|
||||||
|
Design goals: |
||||||
|
|
||||||
|
1. All data stored in memory, in files stored in tmpfs/ramfs. Support DBs larger than 100GB (may be sharded across multiple RocksDB instances). |
||||||
|
1. Optimize for [prefix hashing](https://github.com/facebook/rocksdb/raw/gh-pages/talks/2014-03-27-RocksDB-Meetup-Siying-Prefix-Hash.pdf) |
||||||
|
1. Less than or around 1 micro-second average latency for single Get() or Seek(). |
||||||
|
1. Minimize memory consumption. |
||||||
|
1. Queries efficiently return empty results |
||||||
|
|
||||||
|
Notice that our priority was not to maximize query performance, but to strike a balance between query performance and memory consumption. PlainTable query performance is not as good as you would see with a nicely-designed hash table, but they are of the same order of magnitude, while keeping memory overhead to a minimum. |
||||||
|
|
||||||
|
Since we are targeting microsecond latency, our budget is on the order of a handful of CPU cache misses (assuming they cannot be parallelized, which is usually the case for index look-ups). On our target hardware, multi-socket Intel CPUs with NUMA, we can only afford 4-5 CPU cache misses (including the cost of the data TLB). |
||||||
|
|
||||||
|
To meet our requirements, given that only hash prefix iterating is needed, we made two decisions: |
||||||
|
|
||||||
|
1. to use a hash index, and |
||||||
|
1. to have it address rows directly, with no block structure. |
||||||
|
|
||||||
|
Having addressed our latency goal, the next task was to design a very compact hash index to minimize memory consumption. Some tricks we used to meet this goal: |
||||||
|
|
||||||
|
1. We only use 32-bit integers for data and index offsets. The first bit serves as a flag, so we can avoid using 8-byte pointers. |
||||||
|
1. We never copy keys or parts of keys to index search structures. We store only offsets from which keys can be retrieved, to make comparisons with search keys. |
||||||
|
1. Since our file is immutable, we can accurately estimate the number of hash buckets needed. |
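
The first trick, a 32-bit entry whose first bit is a flag, can be sketched as follows. This is an illustration under assumptions rather than the PlainTable source: the names `PackEntry`, `EntryFlag` and `EntryOffset` are hypothetical, and the sketch assumes that "first bit" means the most significant bit, leaving 31 bits for the offset.

```c++
#include <cassert>
#include <cstdint>

// Assumed layout: the highest bit is the flag, the low 31 bits are the offset.
constexpr uint32_t kFlagMask = 0x80000000u;
constexpr uint32_t kOffsetMask = 0x7FFFFFFFu;

// Packs an offset and a one-bit flag into a single 32-bit index entry.
inline uint32_t PackEntry(uint32_t offset, bool flag) {
  assert(offset <= kOffsetMask);  // the offset must fit in 31 bits
  return offset | (flag ? kFlagMask : 0u);
}

inline bool EntryFlag(uint32_t entry) { return (entry & kFlagMask) != 0; }

inline uint32_t EntryOffset(uint32_t entry) { return entry & kOffsetMask; }
```

Each entry takes 4 bytes instead of the 8 bytes a pointer needs on a 64-bit machine, at the price of limiting addressable offsets to 2^31 - 1 under this layout.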
||||||
|
|
||||||
|
To make sure the format works efficiently with empty queries, we added a bloom filter check before the query. This adds only one cache miss for non-empty cases [1], but avoids multiple cache misses for most empty-result queries. This is a good trade-off for use cases with a large percentage of empty results. |
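
The idea of checking a bloom filter before doing the real lookup can be shown with a small, generic filter. This is a textbook sketch and not the bloom filter RocksDB ships: `ToyBloomFilter` is a hypothetical class, and the double-hashing scheme built on `std::hash` is an assumption chosen for brevity.

```c++
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Generic bloom filter sketch. Hypothetical class, not RocksDB code.
class ToyBloomFilter {
 public:
  ToyBloomFilter(size_t num_bits, int num_probes)
      : bits_(num_bits, false), num_probes_(num_probes) {
    assert(num_bits > 0 && num_probes > 0);
  }

  void Add(const std::string& key) {
    for (int i = 0; i < num_probes_; ++i) bits_[Probe(key, i)] = true;
  }

  // false: the key is definitely absent, so the index lookup can be skipped.
  // true:  the key may be present, so the index lookup must still be done.
  bool MayContain(const std::string& key) const {
    for (int i = 0; i < num_probes_; ++i) {
      if (!bits_[Probe(key, i)]) return false;
    }
    return true;
  }

 private:
  // Double hashing: probe i lands on (h1 + i * h2) mod number of bits.
  size_t Probe(const std::string& key, int i) const {
    const size_t h1 = std::hash<std::string>{}(key);
    const size_t h2 = (h1 >> 17) | (h1 << 15);  // cheap derived second hash
    return (h1 + static_cast<size_t>(i) * h2) % bits_.size();
  }

  std::vector<bool> bits_;
  int num_probes_;
};
```

A reader would call `MayContain(key)` first and go to the hash index only when it returns true. False positives are possible, false negatives are not, so a key that was added is never skipped.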
||||||
|
|
||||||
|
These are the design goals and basic ideas of PlainTable file format. For detailed information, see [this wiki page](https://github.com/facebook/rocksdb/wiki/PlainTable-Format). |
||||||
|
|
||||||
|
[1] Bloom filter checks typically require multiple memory accesses. However, because they are independent, they usually do not stall the CPU pipeline. In any case, we changed the bloom filter to improve data locality - we may cover this further in a future blog post. |
@ -0,0 +1,48 @@ |
|||||||
|
--- |
||||||
|
title: RocksDB 3.5 Release! |
||||||
|
layout: post |
||||||
|
author: leijin |
||||||
|
category: blog |
||||||
|
--- |
||||||
|
|
||||||
|
New RocksDB release - 3.5! |
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
**New Features** |
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
1. Add include/utilities/write_batch_with_index.h, providing a utility class to query data out of WriteBatch when building it. |
||||||
|
|
||||||
|
|
||||||
|
2. New ReadOptions.total_order_seek option to force total order seek when a block-based table is built with a hash index. |
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
**Public API changes** |
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
1. The Prefix Extractor used with V2 compaction filters is now passed user key to SliceTransform::Transform instead of unparsed RocksDB key. |
||||||
|
|
||||||
|
|
||||||
|
2. Move BlockBasedTable related options to BlockBasedTableOptions from Options. Change corresponding JNI interface. Options affected include: no_block_cache, block_cache, block_cache_compressed, block_size, block_size_deviation, block_restart_interval, filter_policy, whole_key_filtering. filter_policy is changed to shared_ptr from a raw pointer. |
||||||
|
|
||||||
|
|
||||||
|
3. Remove deprecated options: disable_seek_compaction and db_stats_log_interval |
||||||
|
|
||||||
|
|
||||||
|
4. OptimizeForPointLookup() takes one parameter for block cache size. It now builds hash index, bloom filter, and block cache. |
||||||
|
|
||||||
|
|
||||||
|
[https://github.com/facebook/rocksdb/releases/tag/v3.5](https://github.com/facebook/rocksdb/releases/tag/rocksdb-3.5) |
@ -0,0 +1,108 @@ |
|||||||
|
--- |
||||||
|
title: Migrating from LevelDB to RocksDB |
||||||
|
layout: post |
||||||
|
author: lgalanis |
||||||
|
category: blog |
||||||
|
--- |
||||||
|
|
||||||
|
If you have an existing application that uses LevelDB and would like to migrate to RocksDB, one problem you need to overcome is mapping the LevelDB options to the proper RocksDB options. As of release 3.9, this can be done automatically with our option conversion utility, found in rocksdb/utilities/leveldb_options.h. First, replace `leveldb::Options` with `rocksdb::LevelDBOptions`. Then use `rocksdb::ConvertOptions()` to convert the `LevelDBOptions` struct into appropriate RocksDB options. Here is an example: |
||||||
|
|
||||||
|
LevelDB code: |
||||||
|
|
||||||
|
```c++ |
||||||
|
#include <string> |
||||||
|
#include "leveldb/db.h" |
||||||
|
|
||||||
|
using namespace leveldb; |
||||||
|
|
||||||
|
int main(int argc, char** argv) { |
||||||
|
DB *db; |
||||||
|
|
||||||
|
Options opt; |
||||||
|
opt.create_if_missing = true; |
||||||
|
opt.max_open_files = 1000; |
||||||
|
opt.block_size = 4096; |
||||||
|
|
||||||
|
Status s = DB::Open(opt, "/tmp/mydb", &db); |
||||||
|
|
||||||
|
delete db; |
||||||
|
} |
||||||
|
``` |
||||||
|
|
||||||
|
RocksDB code: |
||||||
|
|
||||||
|
```c++ |
||||||
|
#include <string> |
||||||
|
#include "rocksdb/db.h" |
||||||
|
#include "rocksdb/utilities/leveldb_options.h" |
||||||
|
|
||||||
|
using namespace rocksdb; |
||||||
|
|
||||||
|
int main(int argc, char** argv) { |
||||||
|
DB *db; |
||||||
|
|
||||||
|
LevelDBOptions opt; |
||||||
|
opt.create_if_missing = true; |
||||||
|
opt.max_open_files = 1000; |
||||||
|
opt.block_size = 4096; |
||||||
|
|
||||||
|
Options rocksdb_options = ConvertOptions(opt); |
||||||
|
// add rocksdb specific options here |
||||||
|
|
||||||
|
Status s = DB::Open(rocksdb_options, "/tmp/mydb_rocks", &db); |
||||||
|
|
||||||
|
delete db; |
||||||
|
} |
||||||
|
``` |
||||||
|
|
||||||
|
The difference is: |
||||||
|
|
||||||
|
```diff |
||||||
|
-#include "leveldb/db.h" |
||||||
|
+#include "rocksdb/db.h" |
||||||
|
+#include "rocksdb/utilities/leveldb_options.h" |
||||||
|
|
||||||
|
-using namespace leveldb; |
||||||
|
+using namespace rocksdb; |
||||||
|
|
||||||
|
- Options opt; |
||||||
|
+ LevelDBOptions opt; |
||||||
|
|
||||||
|
- Status s = DB::Open(opt, "/tmp/mydb", &db); |
||||||
|
+ Options rocksdb_options = ConvertOptions(opt); |
||||||
|
+ // add rocksdb specific options here |
||||||
|
+ |
||||||
|
+ Status s = DB::Open(rocksdb_options, "/tmp/mydb_rocks", &db); |
||||||
|
``` |
||||||
|
|
||||||
|
Once you get up and running with RocksDB you can then focus on tuning RocksDB further by modifying the converted options struct. |
||||||
|
|
||||||
|
ConvertOptions is handy because many individual options in RocksDB have moved to other structures in different components. For example, block_size is not available in struct rocksdb::Options. It resides in struct rocksdb::BlockBasedTableOptions, which is used to create a TableFactory object that RocksDB uses internally to create the proper TableBuilder objects. If you were to write your application from scratch, it would look like this: |
||||||
|
|
||||||
|
RocksDB code from scratch: |
||||||
|
|
||||||
|
```c++ |
||||||
|
#include <string> |
||||||
|
#include "rocksdb/db.h" |
||||||
|
#include "rocksdb/table.h" |
||||||
|
|
||||||
|
using namespace rocksdb; |
||||||
|
|
||||||
|
int main(int argc, char** argv) { |
||||||
|
DB *db; |
||||||
|
|
||||||
|
Options opt; |
||||||
|
opt.create_if_missing = true; |
||||||
|
opt.max_open_files = 1000; |
||||||
|
|
||||||
|
BlockBasedTableOptions topt; |
||||||
|
topt.block_size = 4096; |
||||||
|
opt.table_factory.reset(NewBlockBasedTableFactory(topt)); |
||||||
|
|
||||||
|
Status s = DB::Open(opt, "/tmp/mydb_rocks", &db); |
||||||
|
|
||||||
|
delete db; |
||||||
|
} |
||||||
|
``` |
||||||
|
|
||||||
|
The LevelDBOptions utility eases migration from LevelDB to RocksDB and lets us split the various options across classes as needed. |
@ -0,0 +1,37 @@ |
|||||||
|
--- |
||||||
|
title: Reading RocksDB options from a file |
||||||
|
layout: post |
||||||
|
author: lgalanis |
||||||
|
category: blog |
||||||
|
--- |
||||||
|
|
||||||
|
RocksDB options can be provided to RocksDB as a file or as any string. The format is straightforward: `write_buffer_size=1024;max_write_buffer_number=2`. Any whitespace around `=` and `;` is OK. Moreover, options can be nested as necessary. For example, `BlockBasedTableOptions` can be nested as follows: `write_buffer_size=1024; max_write_buffer_number=2; block_based_table_factory={block_size=4k};`. Similarly, any whitespace around `{` or `}` is OK. Here is what it looks like in code: |
||||||
|
|
||||||
|
```c++ |
||||||
|
#include <string> |
||||||
|
#include "rocksdb/db.h" |
||||||
|
#include "rocksdb/table.h" |
||||||
|
#include "rocksdb/utilities/convenience.h" |
||||||
|
|
||||||
|
using namespace rocksdb; |
||||||
|
|
||||||
|
int main(int argc, char** argv) { |
||||||
|
DB *db; |
||||||
|
|
||||||
|
Options opt; |
||||||
|
|
||||||
|
std::string options_string = |
||||||
|
"create_if_missing=true;max_open_files=1000;" |
||||||
|
"block_based_table_factory={block_size=4096}"; |
||||||
|
|
||||||
|
Status s = GetDBOptionsFromString(opt, options_string, &opt); |
||||||
|
|
||||||
|
s = DB::Open(opt, "/tmp/mydb_rocks", &db); |
||||||
|
|
||||||
|
// use db |
||||||
|
|
||||||
|
delete db; |
||||||
|
} |
||||||
|
``` |
||||||
|
|
||||||
|
Using `GetDBOptionsFromString` is a convenient way of changing options for your RocksDB application without needing to resort to recompilation or tedious command line parsing. |
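
To make the string format concrete, here is a small standalone parser for the grammar described above (`key=value` pairs separated by `;`, optional whitespace, and `{...}` for nested groups). It is a sketch for illustration only and is not how RocksDB parses options: `Trim` and `ParseOptionsString` are hypothetical helpers, values are kept as raw strings, and suffixes such as `4k` are not interpreted.

```c++
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

// Removes leading and trailing whitespace.
inline std::string Trim(const std::string& s) {
  const char* ws = " \t\r\n";
  const size_t b = s.find_first_not_of(ws);
  if (b == std::string::npos) return "";
  const size_t e = s.find_last_not_of(ws);
  return s.substr(b, e - b + 1);
}

// Splits "k1=v1; k2={a=b; c=d}; k3=v3" into top-level key/value pairs.
// A braced value is returned without its outer braces, so it can be parsed
// again with the same function.
inline std::map<std::string, std::string> ParseOptionsString(
    const std::string& input) {
  std::map<std::string, std::string> out;
  size_t pos = 0;
  while (pos < input.size()) {
    // Find the end of this entry: the next ';' that is not inside braces.
    int depth = 0;
    size_t end = pos;
    for (; end < input.size(); ++end) {
      const char c = input[end];
      if (c == ';' && depth == 0) break;
      if (c == '{') ++depth;
      if (c == '}') --depth;
      if (depth < 0) throw std::invalid_argument("unbalanced '}'");
    }
    if (depth != 0) throw std::invalid_argument("unbalanced '{'");
    const std::string entry = Trim(input.substr(pos, end - pos));
    pos = end + 1;
    if (entry.empty()) continue;  // tolerate a trailing or doubled ';'
    const size_t eq = entry.find('=');
    if (eq == std::string::npos) {
      throw std::invalid_argument("missing '=' in: " + entry);
    }
    const std::string key = Trim(entry.substr(0, eq));
    std::string value = Trim(entry.substr(eq + 1));
    if (value.size() >= 2 && value.front() == '{' && value.back() == '}') {
      value = Trim(value.substr(1, value.size() - 2));  // strip outer braces
    }
    if (key.empty()) throw std::invalid_argument("empty key in: " + entry);
    out[key] = value;
  }
  return out;
}
```

Parsing the nested example from the text yields three top-level keys, and calling `ParseOptionsString` again on the value of `block_based_table_factory` yields `block_size` mapped to `4k`.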
@ -0,0 +1,12 @@ |
|||||||
|
--- |
||||||
|
title: Integrating RocksDB with MongoDB |
||||||
|
layout: post |
||||||
|
author: icanadi |
||||||
|
category: blog |
||||||
|
--- |
||||||
|
|
||||||
|
Over the last couple of years, we have been busy integrating RocksDB with various services here at Facebook that needed to store key-value pairs locally. We have also seen other companies using RocksDB as local storage components of their distributed systems. |
||||||
|
|
||||||
|
The next big challenge for us is to bring RocksDB storage engine to general purpose databases. Today we have an exciting milestone to share with our community! We're running MongoDB with RocksDB in production and seeing great results! You can read more about it here: [http://blog.parse.com/announcements/mongodb-rocksdb-parse/](http://blog.parse.com/announcements/mongodb-rocksdb-parse/) |
||||||
|
|
||||||
|
Stay tuned for benchmarks and more stability and performance improvements. |
@ -0,0 +1,8 @@ |
|||||||
|
--- |
||||||
|
title: RocksDB in osquery |
||||||
|
layout: post |
||||||
|
author: icanadi |
||||||
|
category: blog |
||||||
|
--- |
||||||
|
|
||||||
|
Check out [this](https://code.facebook.com/posts/1411870269134471/how-rocksdb-is-used-in-osquery/) blog post by [Mike Arpaia](https://www.facebook.com/mike.arpaia) and [Ted Reed](https://www.facebook.com/treeded) about how osquery leverages RocksDB to build an embedded pub-sub system. This article is a great read and contains insights on how to properly use RocksDB. |
@ -0,0 +1,74 @@ |
|||||||
|
--- |
||||||
|
title: Spatial indexing in RocksDB |
||||||
|
layout: post |
||||||
|
author: icanadi |
||||||
|
category: blog |
||||||
|
--- |
||||||
|
|
||||||
|
About a year ago, there was a need to develop a spatial database at Facebook. We needed to store and index Earth's map data. Before building our own, we looked at the existing spatial databases. They were all very good technology, but also general purpose. We could sacrifice a general-purpose API, so we thought we could build a more performant database, since it would be specifically designed for our use-case. Furthermore, we decided to build the spatial database on top of RocksDB, because we have a lot of operational experience with running and tuning RocksDB at a large scale. |
||||||
|
|
||||||
|
When we started looking at this project, the first thing that surprised us was that our planet is not that big. Earth's entire map data can fit in memory on a reasonably high-end machine. Thus, we also decided to build a spatial database optimized for a memory-resident dataset.
||||||
|
|
||||||
|
The first use case of our spatial database was an experimental map renderer. As part of our project, we successfully loaded the [OpenStreetMap](https://www.openstreetmap.org/) dataset and hooked it up with [Mapnik](http://mapnik.org/), a map rendering engine.
||||||
|
|
||||||
|
The usual Mapnik workflow is to load the map data into a SQL-based database and then define map layers with SQL statements. To render a tile, Mapnik needs to execute a couple of SQL queries. The benefit of this approach is that you don't need to reload your database when you change your map style. You can just change your SQL query and Mapnik picks it up. In our model, we decided to precompute the features we need for each tile. We need to know the map style before we create the database. However, when rendering the map tile, we only fetch the features that we need to render. |
||||||
|
|
||||||
|
We haven't open sourced the RocksDB Mapnik plugin or the database loading pipeline. However, the spatial indexing is available in RocksDB under the name [SpatialDB](https://github.com/facebook/rocksdb/blob/master/include/rocksdb/utilities/spatial_db.h). The API is focused on the map rendering use case, but we hope that it can also be used for other spatial applications.
||||||
|
|
||||||
|
Let's take a tour of the API. When you create a spatial database, you specify the spatial indexes that need to be built. Each spatial index is defined by a bounding box and granularity. For map rendering, we create a spatial index for each zoom level. Higher zoom levels have more granularity.
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
    SpatialDB::Create(
        SpatialDBOptions(),
        "/data/map", {
            SpatialIndexOptions("zoom10", BoundingBox<double>(0, 0, 100, 100), 10),
            SpatialIndexOptions("zoom16", BoundingBox<double>(0, 0, 100, 100), 16)
        }
    );
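
To make the granularity parameter concrete, here is a minimal, self-contained sketch of how a coordinate could be mapped to a tile. It assumes that the third argument of `SpatialIndexOptions` is a number of tile bits, so that each axis of the bounding box is split into 2^bits tiles. `Box`, `TileFromCoord`, `TileRange` and `TilesCovering` are hypothetical names used only for illustration; they are not part of the SpatialDB API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a bounding box (not rocksdb's BoundingBox<T>).
struct Box {
  double min_x, min_y, max_x, max_y;
};

// Maps a coordinate to a tile number along one axis. The axis [start, end)
// is split into 2^tile_bits equal tiles; coordinates outside the range are
// clamped to the first or last tile.
inline uint64_t TileFromCoord(double x, double start, double end,
                              uint32_t tile_bits) {
  assert(end > start && tile_bits < 32);
  if (x < start) return 0;
  const uint64_t tiles = 1ULL << tile_bits;
  const uint64_t t = static_cast<uint64_t>((x - start) / (end - start) * tiles);
  return std::min(t, tiles - 1);
}

// Inclusive range of tiles covered by a feature.
struct TileRange {
  uint64_t min_x, min_y, max_x, max_y;
  uint64_t Count() const { return (max_x - min_x + 1) * (max_y - min_y + 1); }
};

// Computes the tiles of an index (given by its bounding box and tile bits)
// that a feature's bounding box intersects.
inline TileRange TilesCovering(const Box& index, const Box& feature,
                               uint32_t tile_bits) {
  return {TileFromCoord(feature.min_x, index.min_x, index.max_x, tile_bits),
          TileFromCoord(feature.min_y, index.min_y, index.max_y, tile_bits),
          TileFromCoord(feature.max_x, index.min_x, index.max_x, tile_bits),
          TileFromCoord(feature.max_y, index.min_y, index.max_y, tile_bits)};
}
```

Under these assumptions, with the 0 to 100 bounding box above and 10 tile bits, a feature with box (5, 5, 10, 10) covers tiles 51 through 102 on each axis. More tile bits mean smaller tiles, which is why a higher zoom level gets a finer index.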
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
When you insert a feature (building, street, country border) into SpatialDB, you need to specify the list of spatial indexes that will index the feature. In the loading phase we process the map style to determine the list of zoom levels on which we'll render the feature. For example, we will not render a building at a zoom level that shows an entire country. Buildings will only be indexed in the indexes of higher zoom levels. Country borders will be indexed at all zoom levels.
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
    FeatureSet feature;
    feature.Set("type", "building");
    feature.Set("height", 6);
    db->Insert(WriteOptions(), BoundingBox<double>(5, 5, 10, 10),
               well_known_binary_blob, feature, {"zoom16"});
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
The indexing part is pretty simple. For each feature, we first find a list of index tiles that it intersects. Then, we add a link from the tile's [quad key](https://msdn.microsoft.com/en-us/library/bb259689.aspx) to the feature's primary key. Using quad keys improves data locality, i.e. features closer together geographically will have similar quad keys. Even though we're optimizing for a memory-resident dataset, data locality is still very important due to different caching effects. |
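
As an illustration of why quad keys preserve locality, the sketch below builds a quad key string the way the linked tile system article describes it: the bits of the tile's x and y coordinates are interleaved, one base-4 digit per level. This is a standalone example, not SpatialDB's internal encoding; `QuadKey` and `CommonPrefix` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Builds a quad key for tile (tile_x, tile_y) at the given level by
// interleaving the bits of the two coordinates, most significant bit first.
// Each digit selects one quadrant: x contributes 1, y contributes 2.
inline std::string QuadKey(uint32_t tile_x, uint32_t tile_y, uint32_t level) {
  assert(level >= 1 && level <= 31);
  std::string key;
  key.reserve(level);
  for (uint32_t i = level; i > 0; --i) {
    const uint32_t mask = 1u << (i - 1);
    char digit = '0';
    if (tile_x & mask) digit += 1;
    if (tile_y & mask) digit += 2;
    key.push_back(digit);
  }
  return key;
}

// Length of the prefix shared by two quad keys. A longer shared prefix
// means the two tiles sit inside a smaller common parent tile.
inline std::size_t CommonPrefix(const std::string& a, const std::string& b) {
  std::size_t n = 0;
  while (n < a.size() && n < b.size() && a[n] == b[n]) ++n;
  return n;
}
```

For example, tile (3, 5) at level 3 gets the key "213" and its neighbor (2, 5) gets "212", so the two keys share a two-digit prefix and sort next to each other. The far-away tile (7, 0) gets "111" and shares no prefix with either.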
||||||
|
|
||||||
|
After you're done inserting all the features, you can call `Compact()`, which compacts the dataset and speeds up read queries.
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
    db->Compact();
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
A SpatialDB query specifies: 1) the bounding box we're interested in, and 2) a zoom level. We find all tiles that intersect the query's bounding box and return all features in those tiles.
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
    Cursor* c = db->Query(ReadOptions(), BoundingBox<double>(1, 1, 7, 7), "zoom16");
    for (; c->Valid(); c->Next()) {
      Render(c->blob(), c->feature_set());
    }
    delete c;  // release the cursor when done
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
Note: The `Render()` function is not part of RocksDB. You will need to use one of the many open source map renderers, for example [Mapnik](http://mapnik.org/).
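
The tile-based indexing and querying described above can be sketched with a toy in-memory index. This is only an illustration under simplifying assumptions: it keeps everything in a `std::map`, uses row-major tile ids instead of quad keys, and all names (`Box`, `ToyTileIndex`) are hypothetical rather than SpatialDB's.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Axis-aligned box; a hypothetical stand-in for BoundingBox<double>.
struct Box {
  double min_x, min_y, max_x, max_y;
};

// Toy in-memory tile index. Not SpatialDB code; all names are illustrative.
class ToyTileIndex {
 public:
  ToyTileIndex(const Box& bounds, uint32_t tile_bits)
      : bounds_(bounds), tile_bits_(tile_bits) {
    assert(tile_bits <= 16);
    assert(bounds.max_x > bounds.min_x && bounds.max_y > bounds.min_y);
  }

  // Links every tile touched by the feature's box to the feature id.
  void Insert(uint64_t feature_id, const Box& box) {
    ForEachTile(box, [&](uint64_t tile) { tiles_[tile].push_back(feature_id); });
  }

  // Returns the ids stored in every tile touched by the query box. The set
  // removes duplicates for features that span several tiles.
  std::set<uint64_t> Query(const Box& box) const {
    std::set<uint64_t> ids;
    ForEachTile(box, [&](uint64_t tile) {
      auto it = tiles_.find(tile);
      if (it != tiles_.end()) ids.insert(it->second.begin(), it->second.end());
    });
    return ids;
  }

 private:
  // Tile number along one axis, clamped to the index bounds.
  uint64_t Axis(double v, double lo, double hi) const {
    if (v < lo) return 0;
    const uint64_t n = 1ULL << tile_bits_;
    return std::min(static_cast<uint64_t>((v - lo) / (hi - lo) * n), n - 1);
  }

  // Calls fn(tile_id) for each tile intersecting the box (row-major ids).
  template <typename Fn>
  void ForEachTile(const Box& box, Fn fn) const {
    const uint64_t x0 = Axis(box.min_x, bounds_.min_x, bounds_.max_x);
    const uint64_t x1 = Axis(box.max_x, bounds_.min_x, bounds_.max_x);
    const uint64_t y0 = Axis(box.min_y, bounds_.min_y, bounds_.max_y);
    const uint64_t y1 = Axis(box.max_y, bounds_.min_y, bounds_.max_y);
    for (uint64_t y = y0; y <= y1; ++y)
      for (uint64_t x = x0; x <= x1; ++x) fn((y << tile_bits_) | x);
  }

  Box bounds_;
  uint32_t tile_bits_;
  std::map<uint64_t, std::vector<uint64_t>> tiles_;
};
```

With a 0 to 100 index split into 4 x 4 tiles, a query for (1, 1, 7, 7) returns every feature stored in the first tile, including one whose own box is (20, 20, 30, 30) and does not overlap the query. Results are per tile, so a finer granularity returns fewer unrelated features at the cost of more index entries.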
||||||
|
|
||||||
|
TL;DR If you need an embedded spatial database, check out RocksDB's SpatialDB. [Let us know](https://www.facebook.com/groups/rocksdb.dev/) how we can make it better. |
||||||
|
|
||||||
|
If you're interested in learning more, check out this [talk](https://www.youtube.com/watch?v=T1jWsDMONM8). |
@ -0,0 +1,12 @@ |
|||||||
|
--- |
||||||
|
title: RocksDB is now available in Windows Platform |
||||||
|
layout: post |
||||||
|
author: dmitrism |
||||||
|
category: blog |
||||||
|
--- |
||||||
|
|
||||||
|
Over the past 6 months we have seen a number of use cases where RocksDB is successfully used by the community and various companies to achieve high throughput and volume in a modern server environment. |
||||||
|
|
||||||
|
We at Microsoft Bing could not be left behind. As a result we are happy to [announce](http://bit.ly/1OmWBT9) the availability of the Windows Port created here at Microsoft which we intend to use as a storage option for one of our key/value data stores. |
||||||
|
|
||||||
|
We are happy to make this available for the community. Stay tuned for more announcements to come.
@ -0,0 +1,148 @@ |
|||||||
|
--- |
||||||
|
title: Analysis File Read Latency by Level |
||||||
|
layout: post |
||||||
|
author: sdong |
||||||
|
category: blog |
||||||
|
--- |
||||||
|
|
||||||
|
In many use cases of RocksDB, people rely on the OS page cache for caching compressed data. With this approach, verifying the effectiveness of OS page caching is challenging, because the file system is a black box to users.
||||||
|
|
||||||
|
As an example, a user can tune the DB as follows: use level-based compaction, with L1 - L4 sizes of 1GB, 10GB, 100GB and 1TB. They reserve about 20GB of memory for the OS page cache, expecting levels 0, 1 and 2 to be mostly cached in memory, so that only reads from levels 3 and 4 require disk I/O. However, in practice, it's not easy to verify whether the OS page cache does exactly what we expect. For example, if we end up doing 4 instead of 2 I/Os per query, it's not easy for users to figure out whether that is because of poor OS page cache efficiency or because multiple blocks are read for a level. Analysis like this is especially important if users run RocksDB on hard disk drives, because the latency gap between hard drives and memory is much larger than the gap between flash-based SSDs and memory.
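
The expectation in this example is simple arithmetic, sketched below. The code assumes, optimistically, that the page cache keeps the upper levels resident before anything else, which is exactly the behavior that is hard to verify in practice. The size of level 0 is not given above, so it is left out; `LevelsFullyCached` and `CachedFraction` are hypothetical helper names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Counts how many leading levels fit entirely in the cache, assuming the
// cache is filled from the top level down. If leftover_gb is non-null, it
// receives the cache space remaining after those levels.
inline std::size_t LevelsFullyCached(const std::vector<uint64_t>& level_sizes_gb,
                                     uint64_t cache_gb,
                                     uint64_t* leftover_gb = nullptr) {
  uint64_t used = 0;
  std::size_t levels = 0;
  for (uint64_t size : level_sizes_gb) {
    if (used + size > cache_gb) break;
    used += size;
    ++levels;
  }
  if (leftover_gb != nullptr) *leftover_gb = cache_gb - used;
  return levels;
}

// Fraction of a level that the leftover cache could hold, at most 1.0.
inline double CachedFraction(uint64_t level_size_gb, uint64_t leftover_gb) {
  assert(level_size_gb > 0);
  return std::min(1.0, static_cast<double>(leftover_gb) / level_size_gb);
}
```

For level sizes of 1, 10, 100 and 1000 GB and a 20GB cache, the first two levels (11GB) fit, and the remaining 9GB could hold at most 9% of the 100GB level. Under this model, almost every read from the two largest levels is expected to go to disk.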
||||||
|
|
||||||
|
In order to make tuning easier, we added new instrumentation to help users analyze the latency distribution of file reads at different levels. If users turn DB statistics on, we always keep track of the distribution of file read latency for each level. Users can retrieve the information by querying the DB property "rocksdb.stats" ( [https://github.com/facebook/rocksdb/blob/v3.13.1/include/rocksdb/db.h#L315-L316](https://github.com/facebook/rocksdb/blob/v3.13.1/include/rocksdb/db.h#L315-L316) ). It is also printed periodically as part of the compaction summary in the info logs.
||||||
|
|
||||||
|
The output looks like this: |
||||||
|
|
||||||
|
|
||||||
|
```bash |
||||||
|
** Level 0 read latency histogram (micros): |
||||||
|
Count: 696 Average: 489.8118 StdDev: 222.40 |
||||||
|
Min: 3.0000 Median: 452.3077 Max: 1896.0000 |
||||||
|
Percentiles: P50: 452.31 P75: 641.30 P99: 1068.00 P99.9: 1860.80 P99.99: 1896.00 |
||||||
|
------------------------------------------------------ |
||||||
|
[ 2, 3 ) 1 0.144% 0.144% |
||||||
|
[ 18, 20 ) 1 0.144% 0.287% |
||||||
|
[ 45, 50 ) 5 0.718% 1.006% |
||||||
|
[ 50, 60 ) 26 3.736% 4.741% # |
||||||
|
[ 60, 70 ) 6 0.862% 5.603% |
||||||
|
[ 90, 100 ) 1 0.144% 5.747% |
||||||
|
[ 120, 140 ) 2 0.287% 6.034% |
||||||
|
[ 140, 160 ) 1 0.144% 6.178% |
||||||
|
[ 160, 180 ) 1 0.144% 6.322% |
||||||
|
[ 200, 250 ) 9 1.293% 7.615% |
||||||
|
[ 250, 300 ) 45 6.466% 14.080% # |
||||||
|
[ 300, 350 ) 88 12.644% 26.724% ### |
||||||
|
[ 350, 400 ) 88 12.644% 39.368% ### |
||||||
|
[ 400, 450 ) 71 10.201% 49.569% ## |
||||||
|
[ 450, 500 ) 65 9.339% 58.908% ## |
||||||
|
[ 500, 600 ) 74 10.632% 69.540% ## |
||||||
|
[ 600, 700 ) 92 13.218% 82.759% ### |
||||||
|
[ 700, 800 ) 64 9.195% 91.954% ## |
||||||
|
[ 800, 900 ) 35 5.029% 96.983% # |
||||||
|
[ 900, 1000 ) 12 1.724% 98.707% |
||||||
|
[ 1000, 1200 ) 6 0.862% 99.569% |
||||||
|
[ 1200, 1400 ) 2 0.287% 99.856% |
||||||
|
[ 1800, 2000 ) 1 0.144% 100.000% |
||||||
|
|
||||||
|
** Level 1 read latency histogram (micros): |
||||||
|
(......not pasted.....) |
||||||
|
|
||||||
|
** Level 2 read latency histogram (micros): |
||||||
|
(......not pasted.....) |
||||||
|
|
||||||
|
** Level 3 read latency histogram (micros): |
||||||
|
(......not pasted.....) |
||||||
|
|
||||||
|
** Level 4 read latency histogram (micros): |
||||||
|
(......not pasted.....) |
||||||
|
|
||||||
|
** Level 5 read latency histogram (micros): |
||||||
|
Count: 25583746 Average: 421.1326 StdDev: 385.11 |
||||||
|
Min: 1.0000 Median: 376.0011 Max: 202444.0000 |
||||||
|
Percentiles: P50: 376.00 P75: 438.00 P99: 1421.68 P99.9: 4164.43 P99.99: 9056.52 |
||||||
|
------------------------------------------------------ |
||||||
|
[ 0, 1 ) 2351 0.009% 0.009% |
||||||
|
[ 1, 2 ) 6077 0.024% 0.033% |
||||||
|
[ 2, 3 ) 8471 0.033% 0.066% |
||||||
|
[ 3, 4 ) 788 0.003% 0.069% |
||||||
|
[ 4, 5 ) 393 0.002% 0.071% |
||||||
|
[ 5, 6 ) 786 0.003% 0.074% |
||||||
|
[ 6, 7 ) 1709 0.007% 0.080% |
||||||
|
[ 7, 8 ) 1769 0.007% 0.087% |
||||||
|
[ 8, 9 ) 1573 0.006% 0.093% |
||||||
|
[ 9, 10 ) 1495 0.006% 0.099% |
||||||
|
[ 10, 12 ) 3043 0.012% 0.111% |
||||||
|
[ 12, 14 ) 2259 0.009% 0.120% |
||||||
|
[ 14, 16 ) 1233 0.005% 0.125% |
||||||
|
[ 16, 18 ) 762 0.003% 0.128% |
||||||
|
[ 18, 20 ) 451 0.002% 0.130% |
||||||
|
[ 20, 25 ) 794 0.003% 0.133% |
||||||
|
[ 25, 30 ) 1279 0.005% 0.138% |
||||||
|
[ 30, 35 ) 1172 0.005% 0.142% |
||||||
|
[ 35, 40 ) 1363 0.005% 0.148% |
||||||
|
[ 40, 45 ) 409 0.002% 0.149% |
||||||
|
[ 45, 50 ) 105 0.000% 0.150% |
||||||
|
[ 50, 60 ) 80 0.000% 0.150% |
||||||
|
[ 60, 70 ) 280 0.001% 0.151% |
||||||
|
[ 70, 80 ) 1583 0.006% 0.157% |
||||||
|
[ 80, 90 ) 4245 0.017% 0.174% |
||||||
|
[ 90, 100 ) 6572 0.026% 0.200% |
||||||
|
[ 100, 120 ) 9724 0.038% 0.238% |
||||||
|
[ 120, 140 ) 3713 0.015% 0.252% |
||||||
|
[ 140, 160 ) 2383 0.009% 0.261% |
||||||
|
[ 160, 180 ) 18344 0.072% 0.333% |
||||||
|
[ 180, 200 ) 51873 0.203% 0.536% |
||||||
|
[ 200, 250 ) 631722 2.469% 3.005% |
||||||
|
[ 250, 300 ) 2721970 10.639% 13.644% ## |
||||||
|
[ 300, 350 ) 5909249 23.098% 36.742% ##### |
||||||
|
[ 350, 400 ) 6522507 25.495% 62.237% ##### |
||||||
|
[ 400, 450 ) 4296332 16.793% 79.030% ### |
||||||
|
[ 450, 500 ) 2130323 8.327% 87.357% ## |
||||||
|
[ 500, 600 ) 1553208 6.071% 93.428% # |
||||||
|
[ 600, 700 ) 642129 2.510% 95.938% # |
||||||
|
[ 700, 800 ) 372428 1.456% 97.394% |
||||||
|
[ 800, 900 ) 187561 0.733% 98.127% |
||||||
|
[ 900, 1000 ) 85858 0.336% 98.462% |
||||||
|
[ 1000, 1200 ) 82730 0.323% 98.786% |
||||||
|
[ 1200, 1400 ) 50691 0.198% 98.984% |
||||||
|
[ 1400, 1600 ) 38026 0.149% 99.133% |
||||||
|
[ 1600, 1800 ) 32991 0.129% 99.261% |
||||||
|
[ 1800, 2000 ) 30200 0.118% 99.380% |
||||||
|
[ 2000, 2500 ) 62195 0.243% 99.623% |
||||||
|
[ 2500, 3000 ) 36684 0.143% 99.766% |
||||||
|
[ 3000, 3500 ) 21317 0.083% 99.849% |
||||||
|
[ 3500, 4000 ) 10216 0.040% 99.889% |
||||||
|
[ 4000, 4500 ) 8351 0.033% 99.922% |
||||||
|
[ 4500, 5000 ) 4152 0.016% 99.938% |
||||||
|
[ 5000, 6000 ) 6328 0.025% 99.963% |
||||||
|
[ 6000, 7000 ) 3253 0.013% 99.976% |
||||||
|
[ 7000, 8000 ) 2082 0.008% 99.984% |
||||||
|
[ 8000, 9000 ) 1546 0.006% 99.990% |
||||||
|
[ 9000, 10000 ) 1055 0.004% 99.994% |
||||||
|
[ 10000, 12000 ) 1566 0.006% 100.000% |
||||||
|
[ 12000, 14000 ) 761 0.003% 100.003% |
||||||
|
[ 14000, 16000 ) 462 0.002% 100.005% |
||||||
|
[ 16000, 18000 ) 226 0.001% 100.006% |
||||||
|
[ 18000, 20000 ) 126 0.000% 100.006% |
||||||
|
[ 20000, 25000 ) 107 0.000% 100.007% |
||||||
|
[ 25000, 30000 ) 43 0.000% 100.007% |
||||||
|
[ 30000, 35000 ) 15 0.000% 100.007% |
||||||
|
[ 35000, 40000 ) 14 0.000% 100.007% |
||||||
|
[ 40000, 45000 ) 16 0.000% 100.007% |
||||||
|
[ 45000, 50000 ) 1 0.000% 100.007% |
||||||
|
[ 50000, 60000 ) 22 0.000% 100.007% |
||||||
|
[ 60000, 70000 ) 10 0.000% 100.007% |
||||||
|
[ 70000, 80000 ) 5 0.000% 100.007% |
||||||
|
[ 80000, 90000 ) 14 0.000% 100.007% |
||||||
|
[ 90000, 100000 ) 11 0.000% 100.007% |
||||||
|
[ 100000, 120000 ) 33 0.000% 100.007% |
||||||
|
[ 120000, 140000 ) 6 0.000% 100.007% |
||||||
|
[ 140000, 160000 ) 3 0.000% 100.007% |
||||||
|
[ 160000, 180000 ) 7 0.000% 100.007% |
||||||
|
[ 200000, 250000 ) 2 0.000% 100.007% |
||||||
|
``` |
||||||
|
|
||||||
|
|
||||||
|
In this example, you can see that we issued only 696 reads from level 0, while we issued about 25 million reads from level 5. The latency distribution among those reads is also clearly shown. This helps users analyze OS page cache efficiency.
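
When reading the output, it helps to know how the percentiles relate to the buckets. The printed Level 0 numbers are consistent with walking the buckets until the cumulative count reaches the requested share of all reads and then interpolating linearly inside that bucket. The sketch below reproduces that arithmetic; it is an illustration written for this post, not RocksDB's histogram code, and `Bucket`, `Percentile` and `Level0Buckets` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct Bucket {
  double lo;       // inclusive lower bound, in microseconds
  double hi;       // exclusive upper bound, in microseconds
  uint64_t count;  // number of reads that fell into [lo, hi)
};

// Estimates the p-th percentile (0 < p <= 100): finds the bucket where the
// cumulative count reaches p% of the total, then interpolates linearly
// between that bucket's bounds.
inline double Percentile(const std::vector<Bucket>& buckets, double p) {
  assert(p > 0.0 && p <= 100.0);
  uint64_t total = 0;
  for (const Bucket& b : buckets) total += b.count;
  assert(total > 0);
  const double threshold = total * (p / 100.0);
  uint64_t seen = 0;
  for (const Bucket& b : buckets) {
    if (b.count > 0 && seen + b.count >= threshold) {
      const double pos = (threshold - seen) / b.count;
      return b.lo + (b.hi - b.lo) * pos;
    }
    seen += b.count;
  }
  return buckets.back().hi;
}

// The Level 0 buckets copied from the sample output above.
inline std::vector<Bucket> Level0Buckets() {
  return {{2, 3, 1},       {18, 20, 1},     {45, 50, 5},     {50, 60, 26},
          {60, 70, 6},     {90, 100, 1},    {120, 140, 2},   {140, 160, 1},
          {160, 180, 1},   {200, 250, 9},   {250, 300, 45},  {300, 350, 88},
          {350, 400, 88},  {400, 450, 71},  {450, 500, 65},  {500, 600, 74},
          {600, 700, 92},  {700, 800, 64},  {800, 900, 35},  {900, 1000, 12},
          {1000, 1200, 6}, {1200, 1400, 2}, {1800, 2000, 1}};
}
```

For Level 0, half of the 696 reads is 348. The buckets below 450 hold 345 reads, so the median lies 3/65 of the way into the [450, 500) bucket, at about 452.31 microseconds, which matches the printed P50. The sketch does not clamp results to the observed minimum and maximum, so very high percentiles can come out above the printed Max.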
||||||
|
|
||||||
|
Currently the read latency per level includes reads from data blocks, index blocks, as well as bloom filter blocks. We are also working on a feature to break down the latency by those three types of blocks.
@ -0,0 +1,51 @@ |
|||||||
|
--- |
||||||
|
title: RocksDB 4.2 Release! |
||||||
|
layout: post |
||||||
|
author: sdong |
||||||
|
category: blog |
||||||
|
--- |
||||||
|
|
||||||
|
New RocksDB release - 4.2! |
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
**New Features** |
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
1. Introduce CreateLoggerFromOptions(). This function creates a Logger for the provided DBOptions.
||||||
|
|
||||||
|
|
||||||
|
2. Add GetAggregatedIntProperty(), which returns the sum of the GetIntProperty of all the column families. |
||||||
|
|
||||||
|
|
||||||
|
3. Add MemoryUtil in rocksdb/utilities/memory.h. It currently offers a way to get the memory usage by type from a list of RocksDB instances.
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
**Public API changes** |
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
1. CompactionFilter::Context now includes the column family ID.
||||||
|
|
||||||
|
|
||||||
|
2. The need-compaction hint given by TablePropertiesCollector::NeedCompact() will be persistent and recoverable after DB recovery. This introduces a breaking format change. If you use this experimental feature, including NewCompactOnDeletionCollectorFactory() in the new version, you may not be able to directly downgrade the DB back to version 4.0 or lower. |
||||||
|
|
||||||
|
|
||||||
|
3. TablePropertiesCollectorFactory::CreateTablePropertiesCollector() now takes an option Context, containing the information of column family ID for the file being written. |
||||||
|
|
||||||
|
|
||||||
|
4. Remove DefaultCompactionFilterFactory. |
||||||
|
|
||||||
|
|
||||||
|
[https://github.com/facebook/rocksdb/releases/tag/v4.2](https://github.com/facebook/rocksdb/releases/tag/v4.2) |
@ -0,0 +1,18 @@ |
|||||||
|
--- |
||||||
|
title: RocksDB AMA |
||||||
|
layout: post |
||||||
|
author: yhchiang |
||||||
|
category: blog |
||||||
|
--- |
||||||
|
|
||||||
|
RocksDB developers are doing a Reddit Ask-Me-Anything now at 10AM – 11AM PDT! We welcome you to stop by and ask any RocksDB related questions, including existing / upcoming features, tuning tips, or database design. |
||||||
|
|
||||||
|
Here are some enhancements that we'd like to focus on over the next six months: |
||||||
|
|
||||||
|
* 2-Phase Commit |
||||||
|
* Lua support in some custom functions |
||||||
|
* Backup and repair tools |
||||||
|
* Direct I/O to bypass OS cache |
||||||
|
* RocksDB Java API |
||||||
|
|
||||||
|
[https://www.reddit.com/r/IAmA/comments/47k1si/we_are_rocksdb_developers_ask_us_anything/](https://www.reddit.com/r/IAmA/comments/47k1si/we_are_rocksdb_developers_ask_us_anything/) |
@ -0,0 +1,26 @@ |
|||||||
|
--- |
||||||
|
title: RocksDB Options File |
||||||
|
layout: post |
||||||
|
author: yhchiang
||||||
|
category: blog |
||||||
|
--- |
||||||
|
|
||||||
|
In RocksDB 4.3, we added a new set of features that makes managing RocksDB options easier. Specifically: |
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
* **Persisting Options Automatically**: Each RocksDB database will now automatically persist its current set of options into an INI file on every successful call of DB::Open(), SetOptions(), and CreateColumnFamily() / DropColumnFamily(). |
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
* **Load Options from File**: We added [LoadLatestOptions() / LoadOptionsFromFile()](https://github.com/facebook/rocksdb/blob/4.3.fb/include/rocksdb/utilities/options_util.h#L48-L58) that enable developers to construct a RocksDB options object from an options file.
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
* **Sanity Check Options**: We added [CheckOptionsCompatibility](https://github.com/facebook/rocksdb/blob/4.3.fb/include/rocksdb/utilities/options_util.h#L64-L77) that performs a compatibility check on two sets of RocksDB options.
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
Want to know more about how to use these new features? Check out the [RocksDB Options File wiki page](https://github.com/facebook/rocksdb/wiki/RocksDB-Options-File) and start using them today!
@ -1,12 +0,0 @@ |
|||||||
--- |
|
||||||
title: Blog Post Example |
|
||||||
layout: post |
|
||||||
author: exampleauthor |
|
||||||
category: blog |
|
||||||
--- |
|
||||||
|
|
||||||
This is an example blog post introduction, try to keep it short and about a paragraph long, to encourage people to click through to read the entire post. |
|
||||||
|
|
||||||
<!--truncate--> |
|
||||||
|
|
||||||
Everything below the `<!--truncate-->` tag will only show on the actual blog post page, not on the /blog/ index. |
|
@ -0,0 +1,44 @@ |
|||||||
|
--- |
||||||
|
title: RocksDB 4.8 Released! |
||||||
|
layout: post |
||||||
|
author: yiwu |
||||||
|
category: blog |
||||||
|
--- |
||||||
|
|
||||||
|
## 4.8.0 (5/2/2016) |
||||||
|
|
||||||
|
### [](https://github.com/facebook/rocksdb/blob/master/HISTORY.md#public-api-change-1)Public API Change |
||||||
|
|
||||||
|
* Allow preset compression dictionary for improved compression of block-based tables. This is supported for zlib, zstd, and lz4. The compression dictionary's size is configurable via CompressionOptions::max_dict_bytes. |
||||||
|
* Delete deprecated classes for creating backups (BackupableDB) and restoring from backups (RestoreBackupableDB). Now, BackupEngine should be used for creating backups, and BackupEngineReadOnly should be used for restorations. For more details, see [https://github.com/facebook/rocksdb/wiki/How-to-backup-RocksDB%3F](https://github.com/facebook/rocksdb/wiki/How-to-backup-RocksDB%3F) |
||||||
|
* Expose estimate of per-level compression ratio via DB property: "rocksdb.compression-ratio-at-levelN". |
||||||
|
* Added EventListener::OnTableFileCreationStarted. EventListener::OnTableFileCreated will be called on failure case. User can check creation status via TableFileCreationInfo::status. |
||||||
|
|
||||||
|
### [](https://github.com/facebook/rocksdb/blob/master/HISTORY.md#new-features-2)New Features |
||||||
|
|
||||||
|
* Add ReadOptions::readahead_size. If non-zero, NewIterator will create a new table reader which performs reads of the given size. |
||||||
|
|
||||||
|
<br/> |
||||||
|
|
||||||
|
## [](https://github.com/facebook/rocksdb/blob/master/HISTORY.md#470-482016)4.7.0 (4/8/2016) |
||||||
|
|
||||||
|
### [](https://github.com/facebook/rocksdb/blob/master/HISTORY.md#public-api-change-2)Public API Change |
||||||
|
|
||||||
|
* Rename option compaction_measure_io_stats to report_bg_io_stats and include flush too.
||||||
|
* Change some default options. Now default options will optimize for server-workloads. Also enable slowdown and full stop triggers for pending compaction bytes. These changes may cause sub-optimal performance or significant increase of resource usage. To avoid these risks, users can open existing RocksDB with options extracted from RocksDB option files. See [https://github.com/facebook/rocksdb/wiki/RocksDB-Options-File](https://github.com/facebook/rocksdb/wiki/RocksDB-Options-File) for how to use RocksDB option files. Or you can call Options.OldDefaults() to recover old defaults. DEFAULT_OPTIONS_HISTORY.md will track change history of default options. |
||||||
|
|
||||||
|
<br/> |
||||||
|
|
||||||
|
## [](https://github.com/facebook/rocksdb/blob/master/HISTORY.md#460-3102016)4.6.0 (3/10/2016) |
||||||
|
|
||||||
|
### [](https://github.com/facebook/rocksdb/blob/master/HISTORY.md#public-api-changes-1)Public API Changes |
||||||
|
|
||||||
|
* Change default of BlockBasedTableOptions.format_version to 2. It means default DB created by 4.6 or up cannot be opened by RocksDB version 3.9 or earlier |
||||||
|
* Added strict_capacity_limit option to NewLRUCache. If the flag is set to true, inserting into the cache will fail if not enough capacity can be freed. The signature of Cache::Insert() is updated accordingly.
||||||
|
* Tickers [NUMBER_DB_NEXT, NUMBER_DB_PREV, NUMBER_DB_NEXT_FOUND, NUMBER_DB_PREV_FOUND, ITER_BYTES_READ] are not updated immediately. They are updated when the Iterator is deleted.
||||||
|
* Add monotonically increasing counter (DB property "rocksdb.current-super-version-number") that increments upon any change to the LSM tree. |
||||||
|
|
||||||
|
### [](https://github.com/facebook/rocksdb/blob/master/HISTORY.md#new-features-3)New Features |
||||||
|
|
||||||
|
* Add CompactionPri::kMinOverlappingRatio, a compaction picking mode friendly to write amplification. |
||||||
|
* Deprecate Iterator::IsKeyPinned() and replace it with Iterator::GetProperty() with prop_name="rocksdb.iterator.is.key.pinned" |
After Width: | Height: | Size: 26 KiB |
After Width: | Height: | Size: 17 KiB |