EnvLibrados is a customized RocksDB Env to use RADOS as the backend file system of RocksDB. It overrides all file system related API of default Env. The easiest way to use it is just like following:
std::string db_name = "test_db";
std::string config_path = "path/to/ceph/config";
DB* db;
Options options;
options.env = EnvLibrados(db_name, config_path);
Status s = DB::Open(options, kDBPath, &db);
Then EnvLibrados will forward all file read/write operation to the RADOS cluster assigned by config_path. Default pool is db_name+"_pool".
There are some options that users could set for EnvLibrados.
- write_buffer_size. This variable is the max buffer size for WritableFile. After reaching the buffer_max_size, EnvLibrados will sync buffer content to RADOS, then clear buffer.
- db_pool. Rather than using default pool, users could set their own db pool name
- wal_dir. The dir for WAL files. Because RocksDB only has 2-level structure (dir_name/file_name), the format of wal_dir is "/dir_name"(CAN'T be "/dir1/dir2"). Default wal_dir is "/wal".
- wal_pool. Corresponding pool name for WAL files. Default value is db_name+"_wal_pool"
The example of setting options looks like following:
db_name = "test_db";
db_pool = db_name+"_pool";
wal_dir = "/wal";
wal_pool = db_name+"_wal_pool";
write_buffer_size = 1 << 20;
env_ = new EnvLibrados(db_name, config, db_pool, wal_dir, wal_pool, write_buffer_size);
DB* db;
Options options;
options.env = env_;
// The last level dir name should match the dir name in prefix_pool_map
options.wal_dir = "/tmp/wal";
// open DB
Status s = DB::Open(options, kDBPath, &db);
Librados is required to compile EnvLibrados. Then use "$make LIBRADOS=1" to compile RocksDB. If you want to only compile EnvLibrados test, just run "$ make env_librados_test LIBRADOS=1". To run env_librados_test, you need to have a running RADOS cluster with the configure file located in "../ceph/src/ceph.conf" related to "rocksdb/".
* Replace %zu format specifier with Windows-compatible macro 'ROCKSDB_PRIszt'
* Added "port/port.h" include to sim_cache.cc for call to snprintf().
* Applied cleaner fix to windows build, reverting part of 7bedd94
Summary: In T8216281 we decided to disable prefetching the index and filter during opening table handlers during startup (max_open_files = -1).
Test Plan: Rely on `IndexAndFilterBlocksOfNewTableAddedToCache` to guarantee L0 indexes and filters are still cached and change `PinL0IndexAndFilterBlocksTest` to make sure other levels are not cached (maybe add one more test to test we don't cache other levels?)
Reviewers: sdong, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59913
Summary:
(1) Integer size correction (mac build break)
(2) snprint usage in Windows (windows build break)
Test Plan: Build in windows and mac
Reviewers: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60927
Summary:
Persistent cache tier is the tier abstraction that can work for any block
device based device mounted on a file system. The design/implementation can
handle any generic block device.
Any generic block support is achieved by generalizing the access patten as
{io-size, q-depth, direct-io/buffered}.
We have specifically tested and adapted the IO path for NVM and SSD.
Persistent cache tier consists of there parts :
1) File layout
Provides the implementation for handling IO path for reading and writing data
(key/value pair).
2) Meta-data
Provides the implementation for handling the index for persistent read cache.
3) Implementation
It binds (1) and (2) and flushed out the PersistentCacheTier interface
This patch provides implementation for (1)(2). Follow up patch will provide (3)
and tests.
Test Plan: Compile and run check
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D57117
Summary: Refactor cache.cc so that I can plugin clock cache (D55581). Mainly move `ShardedCache` to separate file, move `LRUHandle` back to cache.cc and rename it lru_cache.cc.
Test Plan:
make check -j64
Reviewers: lightmark, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D59655
Summary: use of dynamic_cast<TransactionImpl*> is unnecessary and also introduce difficulty for fbrocksdb support of TransactionDB
Test Plan: ./transaction_test
Reviewers: sdong, IslamAbdelRahman, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D60501
Summary: LockFile is unnecessary in unit test
Test Plan: env_basic_test.cc
Reviewers: andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D60285
Summary:
We have alot of code duplication whenever we call FullMerge we keep duplicating the instrumentation and statistics code
This is a simple diff to refactor the code to use TimedFullMerge instead of FullMerge
Test Plan: COMPILE_WITH_ASAN=1 make check -j64
Reviewers: andrewkr, yhchiang, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59577
Summary:
The tsan error was because the random implementation we have is not
thread safe, using Random::GetTLSInstance
Test Plan: Run tests in Linux
Reviewers: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D59559
Summary: Backup options file to private directory
Test Plan:
backupable_db_test.cc, BackupOptions
Modify DB options by calling OpenDB for 3 times. Check the latest options file is in the right place. Also check no redundent files are backuped.
Reviewers: andrewkr
Reviewed By: andrewkr
Subscribers: leveldb, dhruba, andrewkr
Differential Revision: https://reviews.facebook.net/D59373
Summary: Enabled build in Windows and corresponding fixes
Test Plan:
Compile and run persistent_cache_test in Windows and make check in
Linux
Reviewers: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D59307
Summary:
PersistentCacheOptions class in persistent_cache_tier.h is not used any where yet in the code base
but it break the unity build because it have the same name as PersistentCacheOptions in table/persistent_cache_helper.h
Remove it temporarily, and the @krad can add it again with a different name when we start using it
Test Plan:
make unity_test -j64
make check -j64
Reviewers: kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59349
Summary: Disable the test under TSAN temporary to temporarily the build
Test Plan: run the test under TSAN
Reviewers: kradhakrishnan, andrewkr, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59337
Summary:
This provides provides an implementation of PersistentCacheTier that is
specialized for RAM. This tier does not persist data though.
Why do we need this tier ?
This is ideal as tier 0. This tier can host data that is too hot.
Why can't we use Cache variants ?
Yes you can use them instead. This tier can potentially outperform BlockCache
in RAW mode by virtue of compression and compressed cache in block cache doesn't
seem very popular. Potentially this tier can be modified to under stand the
disadvantage of the tier below and retain data that the tier below is bad at
handling (for example index and bloom data that is huge in size)
Test Plan: Run unit tests added
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D57069
Summary:
Fix 2 issues that was breaking Windows build
1) double to size_t potential downcast warning
2) port_posix is not ready for windows, avoiding building hash_table_bench to
avoid build break
Test Plan: compile in Windoes and make check
Reviewers: sdong, andrewkr
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D59265
Summary:
This diff provides the basic interface definitions of persistent read
cache system
PersistentCacheOptions captures the persistent read cache options used to
configure and control the system
PersistentCacheTier provides the basic building block for constructing tiered
cache
PersistentTieredCache provides a logical abstraction of tiers of cache layered
over one another
Test Plan: Compile
Reviewers: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D57051
Summary:
Rocksdb backup and restore rate limiting is currently done per backup/restore.
So, it is difficult to control rate across multiple backup/restores. With this
change, a throttler can be provided. If a throttler is provided, it is used.
Otherwise, a new throttler is created based on the actual rate limits specified
in the options.
Test Plan: Added unit tests
Reviewers: ldemailly, andrewkr, sdong
Reviewed By: andrewkr
Subscribers: igor, yiwu, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56265
Summary:
This enables configurable Envs without recompiling. For example, my
next diff will make env_test test an Env created by NewEnvFromUri(). Then,
users can determine which Env is tested simply by providing the URI for
NewEnvFromUri() (e.g., through a CLI argument or environment variable).
The registration process allows us to register any Env that is linked with the
RocksDB library, so we can register our internal Envs as well.
The registration code is inspired by our internal InitRegistry.
Test Plan: new unit test
Reviewers: IslamAbdelRahman, lightmark, ldemailly, sdong
Reviewed By: sdong
Subscribers: leveldb, dhruba, andrewkr
Differential Revision: https://reviews.facebook.net/D58449
Summary: Add hash table (under persistent cache) to CMake list
Test Plan: Run hash_test in windows and make check in Linux
Reviewers: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D59151
Summary:
Persistent read cache isn't very applicable for lite builds. Wrapping
the code with #ifndef ROCKSDB_LITE .. #endif
Test Plan: Run unit, lite, lite_test
Reviewers: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D58563
Summary:
Google C++ Style writes: In particular, prefer to write lambda captures explicitly when capturing this or if the lambda will escape the current scope.
Here it is the case for both.
Test Plan: Run all test suites.
Reviewers: andrewkr, dhruba
Reviewed By: andrewkr, dhruba
Subscribers: yhchiang, IslamAbdelRahman, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D58515
Summary:
Introduce MaxOperator a simple merge operator that return the max of all operands.
This merge operand help me in benchmarking
Test Plan: Add new unitttests
Reviewers: sdong, andrewkr, yhchiang
Reviewed By: yhchiang
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D57873
Summary: This tests that a prepared transaction is not lost after several crashes, restarts, and memtable flushes.
Test Plan: TwoPhaseLongPrepareTest
Reviewers: sdong
Subscribers: hermanlee4, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D58185
Summary:
TransactionTest.TwoPhaseMultiThreadTest runs forever under TSAN and our CI builds time out
looks like the reason is that some threads keep running and other threads dont get a chance to increment the counter
Test Plan: run the test under TSAN
Reviewers: sdong, horuff
Reviewed By: horuff
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D58359
Summary:
We expect the persistent read cache to perform at speeds upto 8 GB/s. In order
to accomplish that, we need build a index mechanism which operate in the order
of multiple millions per sec rate.
This patch provide the basic data structure to accomplish that:
(1) Hash table implementation with lock contention spread
It is based on the StripedHashSet<T> implementation in
The Art of multiprocessor programming by Maurice Henry & Nir Shavit
(2) LRU implementation
Place holder algorithm for further optimizing
(3) Evictable Hash Table implementation
Building block for building index data structure that evicts data like files
etc
TODO:
(1) Figure if the sharded hash table and LRU can be used instead
(2) Figure if we need to support configurable eviction algorithm for
EvictableHashTable
Test Plan: Run unit tests
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D55785
Summary:
- Make sure we clean up recovered_transactions_ on DBImpl destructor
- delete leaked txns and env in TransactionTest
Test Plan: Run transaction_test under valgrind
Reviewers: sdong, andrewkr, yhchiang, horuff
Reviewed By: horuff
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D58263
Summary: Disable backupable_db_test.cc on Windows since EnvChroot is not supported
Test Plan: check ROCKSDB_LITE
Reviewers: andrewkr
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D58035
Summary:
- Used ChrootEnv so the database and backup Envs are isolated in the filesystem.
- Removed DifferentEnvs test since now every test uses different Envs
Depends on D57543
Test Plan:
- ran backupable_db_test
- verified backupable_db_test now catches the bug when D57159 is backed out (this bug previously passed through the test cases, which motivated this change)
Reviewers: sdong, lightmark, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D57615
Summary:
1. prepare()
2. crash
3. recover
4. commit()
5. crash
6. data is lost
This is due to the transaction data still only residing in the WAL but because the logs were flushed on the first recovery the data is ignored on the second recovery. We must scan all logs found on recovery and only ignore redundant data at the time of replay. It is not possible to know which logs still contain relevant data at time of recovery. We cannot simply ignore a log because all of the non-2pc data it contains has already been written to L0.
The changes made to MemTableInserter are to ensure that prepared sections are still recovered even if all of the non-2pc data in that log has already been flushed to L0.
Test Plan: Provided test.
Reviewers: sdong
Subscribers: andrewkr, hermanlee4, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D57729
Summary:
Consider the following WAL with 4 batch entries prefixed with their sequence at time of memtable insert.
[1: BEGIN_PREPARE, PUT, PUT, PUT, PUT, END_PREPARE(a)]
[1: BEGIN_PREPARE, PUT, PUT, PUT, PUT, END_PREPARE(b)]
[4: COMMIT(a)]
[7: COMMIT(b)]
The first two batches do not consume any sequence numbers so are both prefixed with seq=1.
For 2pc commit, memtable insertion takes place before COMMIT batch is written to WAL.
We can see that sequence number consumption takes place between WAL entries giving us the seemingly sparse sequence prefix for WAL entries.
This is a valid WAL.
Because with 2PC markers one WriteBatch points to another batch containing its inserts a writebatch can consume more or less sequence numbers than the number of sequence consuming entries that it contains.
We can see that, given the entries in the WAL, 6 sequence ids were consumed. Yet on recovery the maximum sequence consumed would be 7 + 3 (the number of sequence numbers consumed by COMMIT(b))
So, now upon recovery we must track the actual consumption of sequence numbers.
In the provided scenario there will be no sequence gaps, but it is possible to produce a sequence gap. This should not be a problem though. correct?
Test Plan: provided test.
Reviewers: sdong
Subscribers: andrewkr, leveldb, dhruba, hermanlee4
Differential Revision: https://reviews.facebook.net/D57645
Summary: Adds three new WriteBatch data types: Prepare(xid), Commit(xid), Rollback(xid). Prepare(xid) should precede the (single) operation to which is applies. There can obviously be multiple Prepare(xid) markers. There should only be one Rollback(xid) or Commit(xid) marker yet not both. None of this logic is currently enforced and will most likely be implemented further up such as in the memtableinserter. All three markers are similar to PutLogData in that they are writebatch meta-data, ie stored but not counted. All three markers differ from PutLogData in that they will actually be written to disk. As for WriteBatchWithIndex, Prepare, Commit, Rollback are all implemented just as PutLogData and none are tested just as PutLogData.
Test Plan: single unit test in write_batch_test.
Reviewers: hermanlee4, sdong, anthony
Subscribers: leveldb, dhruba, vasilep, andrewkr
Differential Revision: https://reviews.facebook.net/D57867
Summary: Adds three new WriteBatch data types: Prepare(xid), Commit(xid), Rollback(xid). Prepare(xid) should precede the (single) operation to which is applies. There can obviously be multiple Prepare(xid) markers. There should only be one Rollback(xid) or Commit(xid) marker yet not both. None of this logic is currently enforced and will most likely be implemented further up such as in the memtableinserter. All three markers are similar to PutLogData in that they are writebatch meta-data, ie stored but not counted. All three markers differ from PutLogData in that they will actually be written to disk. As for WriteBatchWithIndex, Prepare, Commit, Rollback are all implemented just as PutLogData and none are tested just as PutLogData.
Test Plan: single unit test in write_batch_test.
Reviewers: hermanlee4, sdong, anthony
Subscribers: andrewkr, vasilep, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D54093