Summary: add ColumnFamilyHandleDeletionStarted listener which can be called when user deletes handler.
Test Plan: ./listener_test
Reviewers: yiwu, IslamAbdelRahman, sdong, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D60717
Summary:
Add Env::RandomRWFile in env.h and implement it for POSIX
RandomRWFile is a file that allow us to read from / write to random offsets in the file
I will implement it for other Envs later after finishing the whole task for AddFile()
Test Plan: unit tests
Reviewers: andrewkr, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D62433
Summary:
ZSTD 1.0.0 is coming. We can finally add a support of ZSTD without worrying about compatibility.
Still keep ZSTDNotFinal for compatibility reason.
Test Plan: Run all tests. Run db_bench with ZSTD version with RocksDB built with ZSTD 1.0 and older.
Reviewers: andrewkr, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: cyan, igor, IslamAbdelRahman, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D63141
Summary: There's no reference to ImmutableCFOptions elsewhere in /include/rocksdb. ImmutableCFOptions was introduced in this commit (5665e5e285) but later its reference in /include/rocksdb/table.h is removed.
Test Plan:
make all check
Reviewers: IslamAbdelRahman, sdong, yhchiang
Reviewed By: yhchiang
Subscribers: yhchiang, andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D63177
* Fix StatsLevel so that kExceptTimeForMutex leaves compression stats enabled and kExceptDetailedTimers disables mutex lock stats. Also change default stats level to kExceptDetailedTimers (disabling both compression and mutex timing).
* Changed order of StatsLevel enum to simplify logic for determining what stats to record.
Summary: To reduce number of options, merge source_compaction_factor, max_grandparent_overlap_bytes and expanded_compaction_factor into max_compaction_bytes.
Test Plan: Add two new unit tests. Run all existing tests, including jtest.
Reviewers: yhchiang, igor, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59829
Summary: basically for SimCache stats. I find most times it is hard to pass Statistics* to SimCache constructor.
Test Plan: make all check
Reviewers: andrewkr, sdong, yiwu
Reviewed By: yiwu
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62193
Summary:
Add ReadOptions::read_amp_bytes_per_bit option which allow us to create a bitmap for every data block we read
the bitmap will contain (block_size / read_amp_bytes_per_bit) bits.
We will use this bitmap to mark which bytes have been used of the block so we can calculate the read amplification
Test Plan: added new tests
Reviewers: andrewkr, yhchiang, sdong
Reviewed By: sdong
Subscribers: yiwu, leveldb, march, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D58707
Summary:
Make sure prefix extractor name is stored in SST files and if DB is opened with a prefix extractor of a different name, prefix bloom is skipped when read the file.
Also add unit tests for that.
Test Plan:
before change:
```
Note: Google Test filter = BlockBasedTableTest.SkipPrefixBloomFilter
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from BlockBasedTableTest
[ RUN ] BlockBasedTableTest.SkipPrefixBloomFilter
table/table_test.cc:1421: Failure
Value of: db_iter->Valid()
Actual: false
Expected: true
[ FAILED ] BlockBasedTableTest.SkipPrefixBloomFilter (1 ms)
[----------] 1 test from BlockBasedTableTest (1 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (1 ms total)
[ PASSED ] 0 tests.
[ FAILED ] 1 test, listed below:
[ FAILED ] BlockBasedTableTest.SkipPrefixBloomFilter
1 FAILED TEST
```
after:
```
Note: Google Test filter = BlockBasedTableTest.SkipPrefixBloomFilter
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from BlockBasedTableTest
[ RUN ] BlockBasedTableTest.SkipPrefixBloomFilter
[ OK ] BlockBasedTableTest.SkipPrefixBloomFilter (0 ms)
[----------] 1 test from BlockBasedTableTest (0 ms total)
[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (0 ms total)
[ PASSED ] 1 test.
```
Reviewers: sdong, andrewkr, yiwu, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61215
Summary: as title
Test Plan: make all check
Reviewers: yiwu
Reviewed By: yiwu
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D62493
Summary:
This diff split ThreadPool to
-ThreadPool (abstract interface exposed in include/rocksdb/threadpool.h)
-ThreadPoolImpl (actual implementation in util/threadpool_imp.h)
This allow us to expose ThreadPool to the user so we can use it as an option later
Test Plan: existing unit tests
Reviewers: andrewkr, yiwu, yhchiang, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D62085
Summary:
Add option to block based table to insert index/filter blocks to block cache with priority. Combined with LRUCache with high_pri_pool_ratio, we can reserved space for index/filter blocks, make them less likely to be evicted.
Depends on D61977.
Test Plan: See unit test.
Reviewers: lightmark, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, march, leveldb
Differential Revision: https://reviews.facebook.net/D62241
Summary:
Add mid-point insertion functionality to LRU cache. Caller of `Cache::Insert()` can set an additional parameter to make a cache entry have higher priority. The LRU cache will reserve at most `capacity * high_pri_pool_pct` bytes for high-pri cache entries. If `high_pri_pool_pct` is zero, the cache degenerates to normal LRU cache.
Context: If we are to put index and filter blocks into RocksDB block cache, index/filter block can be swap out too early. We want to add an option to RocksDB to reserve some capacity in block cache just for index/filter blocks, to mitigate the issue.
In later diffs I'll update block based table reader to use the interface to cache index/filter blocks at high priority, and expose the option to `DBOptions` and make it dynamic changeable.
Test Plan: unit test.
Reviewers: IslamAbdelRahman, sdong, lightmark
Reviewed By: lightmark
Subscribers: andrewkr, dhruba, march, leveldb
Differential Revision: https://reviews.facebook.net/D61977
Summary: 1. Range Deletion Tombstone structure 2. Modify Add() in table_builder to make it usable for adding range del tombstones 3. Expose NewTombstoneIterator() API in table_reader
Test Plan: table_test.cc (now BlockBasedTableBuilder::Add() only accepts InternalKey. I make table_test only pass InternalKey to BlockBasedTableBuidler. Also test writing/reading range deletion tombstones in table_test )
Reviewers: sdong, IslamAbdelRahman, lightmark, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61473
Summary:
Clock-based cache implemenetation aim to have better concurreny than
default LRU cache. See inline comments for implementation details.
Test Plan:
Update cache_test to run on both LRUCache and ClockCache. Adding some
new tests to catch some of the bugs that I fixed while implementing the
cache.
Reviewers: kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61647
Summary:
Added 2 statistics in compaction job statistics, to
identify if single deletes are not meeting a matching key
(fallthrough) or single deletes are meeting a merge, delete or
another single delete (i.e. not the expected case of put).
Test Plan: Tested the statistics using write_stress and compaction_job_stats_test
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61749
Summary:
Add API to WriteBatch to store range deletions in its buffer
which are later added to memtable. In the WriteBatch buffer, a range
deletion is encoded as "<optype><CF ID (optional)><begin key><end key>".
With this diff, the range tombstones are stored inline with the data in
the memtable. It's useful for now because the test cases rely on the
data being accessible via memtable. My next step is to store range
tombstones in a separate area in the memtable.
Test Plan: unit tests
Reviewers: IslamAbdelRahman, sdong, wanning
Reviewed By: wanning
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61401
Summary: preparation for detecting Cache type. If SimCache, we then may trigger some command like "setSimCapacity()" with setOptions()
Test Plan: make all check
Reviewers: yiwu, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61953
Summary:
This diff improves the documentation for GetOptionsFromMap APIs and
fixes a bug in GetOptionsFromMap functions in convenience.h
where new_options will still be changed when the function
call is not successful.
Test Plan: options_test
Reviewers: IslamAbdelRahman, kradhakrishnan, andrewkr, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61731
Summary: make transactionDB working with StackableDB
Test Plan: make all check -j64
Reviewers: andrewkr, yiwu, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D60705
Summary:
add SIM_BLOCK_CACHE_HIT and SIM_BLOCK_CACHE_MISS tickers.
maybe can be combined with Histograms like DB_GET to evaluate the current setting of the size of block cache.
Test Plan: make all check
Reviewers: sdong, andrewkr, IslamAbdelRahman, yiwu
Reviewed By: yiwu
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61803
Summary: RocksDB 4.11 has been branched out. Now need to move to 4.12.
Test Plan: N/A
Reviewers: kradhakrishnan, andrewkr, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61845
* Added check_snapshot option in the DB's AddFile function
* change check_snapshot to skip_snapshot_check
* add unit test for skip_snapshot_check
* Add skip_snapshot_check comment
Summary: Implement a time series database that supports DateTieredCompactionStrategy. It wraps a db object and separate SST files in different column families (time windows).
Test Plan: Add `date_tiered_test`.
Reviewers: dhruba, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61653
Summary: Add a utility function that trigger necessary full compaction and put output to the correct level by looking at new options and old options.
Test Plan: Add unit tests for it.
Reviewers: andrewkr, igor, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: muthu, sumeet, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60783
Summary:
patch for diff https://reviews.facebook.net/D58587
Also change StopWatch class to add a fifth param named overwrite which decides whether to overwrite *elapse or add on it.
Test Plan: make all check -j64
Reviewers: sdong, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D61239
Summary: MyRocks is adding support for the user of the SstFileWriter which needs a comparator. It would be more convenient to get the comparator from the column family (which already has to have it) than to have caller keep track of it.
Test Plan: Standard tests (adding one for the new method)
Reviewers: IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61155
Summary: Extend the option memtable_prefix_bloom_huge_page_tlb_size from just putting memtable bloom filter to huge page to memtable itself too.
Test Plan: Run all existing tests.
Reviewers: IslamAbdelRahman, yhchiang, andrewkr
Reviewed By: andrewkr
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60513
Summary:
TickersNameMap is not consistent with Tickers enum.
this cause us to report wrong statistics and sometimes to access TickersNameMap outside it's boundary causing crashes (in Fb303 statistics)
Test Plan: added new unit test
Reviewers: sdong, kradhakrishnan
Reviewed By: kradhakrishnan
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61083
Summary: Bump master version to the next potential release version of 4.11
Test Plan: Compile
Reviewers: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60993
EnvLibrados is a customized RocksDB Env to use RADOS as the backend file system of RocksDB. It overrides all file system related API of default Env. The easiest way to use it is just like following:
std::string db_name = "test_db";
std::string config_path = "path/to/ceph/config";
DB* db;
Options options;
options.env = EnvLibrados(db_name, config_path);
Status s = DB::Open(options, kDBPath, &db);
Then EnvLibrados will forward all file read/write operation to the RADOS cluster assigned by config_path. Default pool is db_name+"_pool".
There are some options that users could set for EnvLibrados.
- write_buffer_size. This variable is the max buffer size for WritableFile. After reaching the buffer_max_size, EnvLibrados will sync buffer content to RADOS, then clear buffer.
- db_pool. Rather than using default pool, users could set their own db pool name
- wal_dir. The dir for WAL files. Because RocksDB only has 2-level structure (dir_name/file_name), the format of wal_dir is "/dir_name"(CAN'T be "/dir1/dir2"). Default wal_dir is "/wal".
- wal_pool. Corresponding pool name for WAL files. Default value is db_name+"_wal_pool"
The example of setting options looks like following:
db_name = "test_db";
db_pool = db_name+"_pool";
wal_dir = "/wal";
wal_pool = db_name+"_wal_pool";
write_buffer_size = 1 << 20;
env_ = new EnvLibrados(db_name, config, db_pool, wal_dir, wal_pool, write_buffer_size);
DB* db;
Options options;
options.env = env_;
// The last level dir name should match the dir name in prefix_pool_map
options.wal_dir = "/tmp/wal";
// open DB
Status s = DB::Open(options, kDBPath, &db);
Librados is required to compile EnvLibrados. Then use "$make LIBRADOS=1" to compile RocksDB. If you want to only compile EnvLibrados test, just run "$ make env_librados_test LIBRADOS=1". To run env_librados_test, you need to have a running RADOS cluster with the configure file located in "../ceph/src/ceph.conf" related to "rocksdb/".
Summary: In T8216281 we decided to disable prefetching the index and filter during opening table handlers during startup (max_open_files = -1).
Test Plan: Rely on `IndexAndFilterBlocksOfNewTableAddedToCache` to guarantee L0 indexes and filters are still cached and change `PinL0IndexAndFilterBlocksTest` to make sure other levels are not cached (maybe add one more test to test we don't cache other levels?)
Reviewers: sdong, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59913
* Added new statistics and refactored to allow ioptions to be passed around as required to access environment and statistics pointers (and, as a convenient side effect, info_log pointer).
* Prevent incrementing compression counter when compression is turned off in options.
* Prevent incrementing compression counter when compression is turned off in options.
* Added two more supported compression types to test code in db_test.cc
* Prevent incrementing compression counter when compression is turned off in options.
* Added new StatsLevel that excludes compression timing.
* Fixed casting error in coding.h
* Fixed CompressionStatsTest for new StatsLevel.
* Removed unused variable that was breaking the Linux build
Summary: Refactor cache.cc so that I can plugin clock cache (D55581). Mainly move `ShardedCache` to separate file, move `LRUHandle` back to cache.cc and rename it lru_cache.cc.
Test Plan:
make check -j64
Reviewers: lightmark, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D59655
Summary: If options.write_buffer_size is not set, nor options.write_buffer_manager, no need to update the bytes allocated counter in MemTableAllocator, which is expensive in parallel memtable insert case. Remove it can improve parallel memtable insert throughput by 10% with write batch size 128.
Test Plan:
Run benchmarks
TEST_TMPDIR=/dev/shm/ ./db_bench --benchmarks=fillrandom -disable_auto_compactions -level0_slowdown_writes_trigger=9999 -level0_stop_writes_trigger=9999 -num=10000000 --writes=1000000 -max_background_flushes=16 -max_write_buffer_number=16 --threads=32 --batch_size=128 -allow_concurrent_memtable_write -enable_write_thread_adaptive_yield
The throughput grows 10% with the benchmark.
Reviewers: andrewkr, yiwu, IslamAbdelRahman, igor, ngbronson
Reviewed By: ngbronson
Subscribers: ngbronson, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60465
Summary:
add DestroyColumnFamilyHandle(ColumnFamilyHandle**) to close column family instead of deleting cfh*
User should call this to close a cf and then we can detect the deletion in this function.
Test Plan: make all check -j64
Reviewers: andrewkr, yiwu, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D60765
Summary:
DB::AddFile(std::string file_path) API that allow them to ingest an SST file created using SstFileWriter
We want to update this interface to be able to accept a list of files that will be ingested, DB::AddFile(std::vector<std::string> file_path_list).
Test Plan:
Add test case `AddExternalSstFileList` in `DBSSTTest`. To make sure:
1. files key ranges are not overlapping with each other
2. each file key range dont overlap with the DB key range
3. make sure no snapshots are held
Reviewers: andrewkr, sdong, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D58587