Summary:
Changes in the diff
API changes:
- Introduce IngestExternalFile to replace AddFile (I think this make the API more clear)
- Introduce IngestExternalFileOptions (This struct will encapsulate the options for ingesting the external file)
- Deprecate AddFile() API
Logic changes:
- If our file overlap with the memtable we will flush the memtable
- We will find the first level in the LSM tree that our file key range overlap with the keys in it
- We will find the lowest level in the LSM tree above the the level we found in step 2 that our file can fit in and ingest our file in it
- We will assign a global sequence number to our new file
- Remove AddFile restrictions by using global sequence numbers
Other changes:
- Refactor all AddFile logic to be encapsulated in ExternalSstFileIngestionJob
Test Plan:
unit tests (still need to add more)
addfile_stress (https://reviews.facebook.net/D65037)
Reviewers: yiwu, andrewkr, lightmark, yhchiang, sdong
Reviewed By: sdong
Subscribers: jkedgar, hcz, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D65061
Summary:
reland https://reviews.facebook.net/D62523
- Update SstFileWriter to include a property for a global sequence number in the SST file `rocksdb.external_sst_file.global_seqno`
- Update TableProperties to be aware of the offset of each property in the file
- Update BlockBasedTableReader and Block to be able to honor the sequence number in `rocksdb.external_sst_file.global_seqno` property and use it to overwrite all sequence number in the file
Something worth mentioning is that we don't update the seqno in the index block since and when doing a binary search, the reason for that is that it's guaranteed that SST files with global seqno will have only one user_key and each key will have seqno=0 encoded in it, This mean that this key is greater than any other key with seqno> 0. That mean that we can actually keep the current logic for these blocks
Test Plan: unit tests
Reviewers: sdong, yhchiang
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D65211
Summary: Currently the WAL file preallocation size is 1.1 * write_buffer_size. This, however, will be over-estimated if options.db_write_buffer_size or options.max_total_wal_size is set and is much smaller.
Test Plan: Add a unit test.
Reviewers: andrewkr, yiwu
Reviewed By: yiwu
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D63957
Summary:
This diff update ForwardIterator to support pinning keys and values, which will allow DBIter to take advantage of that and eliminate memcpy when executing merge operators
This diff is stacked on D61305
Test Plan:
existing tests (updated them to test tailing iterator)
new test
Reviewers: andrewkr, yhchiang, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D60009
Summary: With read_options.background_purge_on_iterator_cleanup=true, File deletion and closing can still happen in forward iterator, or WAL file closing. Cover those cases too.
Test Plan: I am adding unit tests.
Reviewers: andrewkr, IslamAbdelRahman, yiwu
Reviewed By: yiwu
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D61503
Summary:
DB::AddFile() right now always add the ingested file to L0
update the logic to add the file to the lowest possible level
Test Plan: unit tests
Reviewers: jkedgar, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, yoshinorim
Differential Revision: https://reviews.facebook.net/D59637
Summary: filter_deltes is not a frequently used feature. Remove it.
Test Plan: Run all test suites.
Reviewers: igor, yhchiang, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D59427
Summary: Fixing error with win build where we compare int64_t with size_t.
Test Plan: make check
Reviewers: andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D57885
Summary:
Add an option `iterator_readahead_size` to `ReadOptions` to enable
configurable readahead for iterators similar to the corresponding
option for compaction.
Test Plan:
```
make commit_prereq
```
Reviewers: kumar.rangarajan, ott, igor, sdong
Reviewed By: sdong
Subscribers: yiwu, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D55419
Summary:
This is needed so we can measure compression ratio improvements
achieved by D52287.
The property compares raw data size against the total file size for a given
level. If the level is empty it should return 0.0.
Test Plan: new unit test
Reviewers: IslamAbdelRahman, yhchiang, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D56967
Summary:
Introude SstFileManager::SetMaxAllowedSpaceUsage() that can be used to limit the maximum space usage allowed for RocksDB.
When this limit is exceeded WriteImpl() will fail and return Status::Aborted()
Test Plan: unit testing
Reviewers: yhchiang, anthony, andrewkr, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D53763
Summary: Add a new option to BlockBasedTableOptions that will allow us to change the restart interval for the index block
Test Plan: unit tests
Reviewers: yhchiang, anthony, andrewkr, sdong
Reviewed By: sdong
Subscribers: march, dhruba
Differential Revision: https://reviews.facebook.net/D53721
Summary:
InlineSkipList::InsertConcurrently should invalidate the
sequential-insertion cache prev_[] for all inserts of multi-level nodes,
not just those that increase the height of the skip list. The invariant
for prev_ is that prev_[i] (i > 0) is supposed to be the predecessor of
prev_[0] at level i. Before this diff InsertConcurrently could violate
this constraint when inserting a multi-level node after prev_[i] but
before prev_[0].
This diff also reenables kConcurrentSkipList as db_test's
MultiThreaded/MultiThreadedDBTest.MultiThreaded/29.
Test Plan:
1. unit tests
2. temporarily hack kConcurrentSkipList timing so that it is fast but has a 1.5% failure rate on my dev box (1ms stagger on thread launch, 1s test duration, failure rate baseline over 1000 runs)
3. observe 1000 passes post-fix
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, dhruba
Differential Revision: https://reviews.facebook.net/D53751
Summary: Disable test that is intermittently failing
Test Plan: unit tests
Reviewers: igor, andrewkr, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D53715
Summary:
Concurrent memtable adds were incorrectly computing
the last sequence number for a write batch group when the
write batches were not solitary. This is the cause of
https://github.com/facebook/mysql-5.6/issues/155
Test Plan:
1. unit tests
2. new unit test
3. parallel db_bench stress tests with batch size of 10 and asserts enabled
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: IslamAbdelRahman, MarkCallaghan, dhruba
Differential Revision: https://reviews.facebook.net/D53595
Summary:
Add a new class SstFileTracker that will be notified whenever a DB add/delete/move and sst file, it will also replace DeleteScheduler
SstFileTracker can be used later to abort writes when we exceed a specific size
Test Plan: unit tests
Reviewers: rven, anthony, yhchiang, sdong
Reviewed By: sdong
Subscribers: igor, lovro, march, dhruba
Differential Revision: https://reviews.facebook.net/D50469
Summary: Break down DBTest.Randomized to multiple gtest tests based on config type
Test Plan: Run the test and all tests. Make sure configurations are correctly set
Reviewers: yhchiang, IslamAbdelRahman, rven, kradhakrishnan, andrewkr, anthony
Reviewed By: anthony
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D53247
Summary:
Moved all the tests that verify property correctness into a separate
file. The goal is to reduce compile time and complexity of db_test. I didn't
add parallelism for db_properties_test, even though these tests were
parallelized in db_test, since the file is small enough that it won't matter.
Some of these moves may be controversial since it's hard to say whether the
test is "verifying property correctness," or "using properties to verify
rocksdb's correctness." I'm interested in any opinions.
Test Plan: ran db_properties_test, also waiting on "make commit-prereq -j32"
Reviewers: yhchiang, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D52995
Summary: Skip list now cannot estimate memory across allocators
consistently and hence triggers flush at different time. This breaks certain
unit tests.
The fix is to adopt key count instead of size for flush.
Test Plan: Ran test on dev box and mac (where it used to fail)
Reviewers: sdong
CC: leveldb@
Task ID: #9273334
Blame Rev:
Summary: DBTest.SuggestCompactRangeTest fails for the case when jemalloc is disabled, including ASAN and valgrind builds. It is caused by the improvement of skip list, which allocates different size of nodes for a new records. Fix it by using a special mem table that triggers a flush by number of entries. In that way the behavior will be consistent for all allocators.
Test Plan: Run the test with both of DISABLE_JEMALLOC=1 and 0
Reviewers: anthony, rven, yhchiang, kradhakrishnan, igor, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D51423
* conversion from 'size_t' to 'type', by add static_cast
Tested:
* by build solution on Windows, Linux locally,
* run tests
* build CI system successful
Summary: DBTest.MergeTestTime is a test verifying timing counters. Depending on real time may cause non-determinstic results. Change to fake time to be determinsitic.
Test Plan: Run the test and make sure it passes
Reviewers: yhchiang, anthony, rven, kradhakrishnan, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D50883
Summary:
We are cleaning up dependencies.
This diff takes a first step at moving memtable files to their own
directory called memtable. In future diffs, we will move other memtable
files from db to memtable.
Test Plan: make check
Reviewers: sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D48915
Summary: db_test_util is used in multiple test files but it dont compile under ROCKSDB_LITE
Test Plan:
make check
make static_lib
OPT=-DROCKSDB_LITE make db_wal_test
Reviewers: igor, yhchiang, rven, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D48579
Summary:
Separate a new class InternalIterator from class Iterator, when the look-up is done internally, which also means they operate on key with sequence ID and type.
This change will enable potential future optimizations but for now InternalIterator's functions are still the same as Iterator's.
At the same time, separate the cleanup function to a separate class and let both of InternalIterator and Iterator inherit from it.
Test Plan: Run all existing tests.
Reviewers: igor, yhchiang, anthony, kradhakrishnan, IslamAbdelRahman, rven
Reviewed By: rven
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D48549
Summary:
As part of tech debt week, we are cleaning up dependencies.
This diff moves db_test_util.[h,cc] from util to db directory.
Test Plan: make check
Reviewers: igor, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D48543
Summary:
Handle SST files with both ".sst" and ".ldb" suffix.
This enables user to migrate from leveldb to rocksdb.
Test Plan:
Added unit test with DB operating on SSTs with names schema.
See db/dc_test.cc:SSTsWithLdbSuffixHandling for details
Reviewers: yhchiang, sdong, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D48003
Summary:
PlainTableReader now only allows mmap-mode. Add the support to non-mmap mode for more flexibility.
Refactor the codes to move all logic of reading data to PlainTableKeyDecoder, and consolidate the calls to Read() call and ReadVarint32() call. Implement the calls for both of mmap and non-mmap case seperately. For non-mmap mode, make copy of keys in several places when we need to move the buffer after reading the keys.
Test Plan: Add the mode of non-mmap case in plain_table_db_test. Run it in valgrind mode too.
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D47187
Summary:
This patch fixes#7460559. It introduces SingleDelete as a new database
operation. This operation can be used to delete keys that were never
overwritten (no put following another put of the same key). If an overwritten
key is single deleted the behavior is undefined. Single deletion of a
non-existent key has no effect but multiple consecutive single deletions are
not allowed (see limitations).
In contrast to the conventional Delete() operation, the deletion entry is
removed along with the value when the two are lined up in a compaction. Note:
The semantics are similar to @igor's prototype that allowed to have this
behavior on the granularity of a column family (
https://reviews.facebook.net/D42093 ). This new patch, however, is more
aggressive when it comes to removing tombstones: It removes the SingleDelete
together with the value whenever there is no snapshot between them while the
older patch only did this when the sequence number of the deletion was older
than the earliest snapshot.
Most of the complex additions are in the Compaction Iterator, all other changes
should be relatively straightforward. The patch also includes basic support for
single deletions in db_stress and db_bench.
Limitations:
- Not compatible with cuckoo hash tables
- Single deletions cannot be used in combination with merges and normal
deletions on the same key (other keys are not affected by this)
- Consecutive single deletions are currently not allowed (and older version of
this patch supported this so it could be resurrected if needed)
Test Plan: make all check
Reviewers: yhchiang, sdong, rven, anthony, yoshinorim, igor
Reviewed By: igor
Subscribers: maykov, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D43179
Summary: Fix two constants in WaitFor() that multiply a value with 000 instead of 1000.
Test Plan: make clean all check
Reviewers: rven, anthony, yhchiang, aekmekji, sdong, MarkCallaghan, igor
Reviewed By: igor
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47091
Summary. A change https://reviews.facebook.net/differential/diff/224721/
Has attempted to move common functionality out of platform dependent
code to a new facility called file_reader_writer.
This includes:
- perf counters
- Buffering
- RateLimiting
However, the change did not attempt to refactor Windows code.
To mitigate, we introduce new quering interfaces such as UseOSBuffer(),
GetRequiredBufferAlignment() and ReaderWriterForward()
for pure forwarding where required.
Introduce WritableFile got a new method Truncate(). This is to communicate
to the file as to how much data it has on close.
- When space is pre-allocated on Linux it is filled with zeros implicitly,
no such thing exist on Windows so we must truncate file on close.
- When operating in unbuffered mode the last page is filled with zeros but we still want to truncate.
Previously, Close() would take care of it but now buffer management is shifted to the wrappers and the file has
no idea about the file true size.
This means that Close() on the wrapper level must always include
Truncate() as well as wrapper __dtor should call Close() and
against double Close().
Move buffered/unbuffered write logic to the wrapper.
Utilize Aligned buffer class.
Adjust tests and implement Truncate() where necessary.
Come up with reasonable defaults for new virtual interfaces.
Forward calls for RandomAccessReadAhead class to avoid double
buffering and locking (double locking in unbuffered mode on WIndows).
Summary:
Now that the approach to parallelizing L0-L1 level-based
compactions by breaking the compaction job into subcompactions is
being extended to apply to universal compactions as well, the unit
tests need to account for this and run the universal compaction
tests with subcompactions both enabled and disabled.
Test Plan: make all && make check
Reviewers: sdong, igor, noetzli, anthony, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D45657
Summary: malloc_usable_size() gets a better estimation of memory usage. It is already used to calculate block cache memory usage. Use it in arena too.
Test Plan: Run all unit tests
Reviewers: anthony, kradhakrishnan, rven, IslamAbdelRahman, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D43317
Summary: Currently compaction inputs share the same file descriptor and table reader as other foreground threads. It makes fadvise works less predictable. Add options.new_table_reader_for_compaction_inputs to enforce to create a new file descriptor and new table reader for it.
Test Plan: Add the option.
Reviewers: rven, anthony, kradhakrishnan, IslamAbdelRahman, igor, yhchiang
Reviewed By: igor
Subscribers: igor, MarkCallaghan, leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D43311
Summary:
Add two unit tests for SyncWAL(). One makes sure SyncWAL() doesn't block writes in the other thread. Another one makes sure SyncWAL() doesn't wait ongoing writes to finish before being executed.
Create a new test file db_wal_test and move two WAL related tests from db_test to here.
Test Plan: Run the new tests
Reviewers: IslamAbdelRahman, rven, kradhakrishnan, kolmike, tnovak, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D43605
Summary:
Subj. We really need this feature.
Previous diff D40899 has most of the changes to make this possible, this diff just adds the method.
Test Plan: `make check`, the new test fails without this diff; ran with ASAN, TSAN and valgrind.
Reviewers: igor, rven, IslamAbdelRahman, anthony, kradhakrishnan, tnovak, yhchiang, sdong
Reviewed By: sdong
Subscribers: MarkCallaghan, maykov, hermanlee4, yoshinorim, tnovak, dhruba
Differential Revision: https://reviews.facebook.net/D40905
Summary:
UpdateAccumulatedStats() is used to optimize compaction decision
esp. when the number of deletion entries are high, but this function
can slowdown DBOpen esp. in disk environment.
This patch adds DBOptions::skip_sats_update_on_db_open, which skips
UpdateAccumulatedStats() in DB::Open() time when it's set to true.
Test Plan: Add DBCompactionTest.SkipStatsUpdateTest
Reviewers: igor, anthony, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: tnovak, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D42843
Summary:
As of now compactions involving files from Level 0 and Level 1 are single
threaded because the files in L0, although sorted, are not range partitioned like
the other levels. This means that during L0-L1 compaction each file from L1
needs to be merged with potentially all the files from L0.
This attempt to parallelize the L0-L1 compaction assigns a thread and a
corresponding iterator to each L1 file that then considers only the key range
found in that L1 file and only the L0 files that have those keys (and only the
specific portion of those L0 files in which those keys are found). In this way
the overlap is minimized and potentially eliminated between different iterators
focusing on the same files.
The first step is to restructure the compaction logic to break L0-L1 compactions
into multiple, smaller, sequential compactions. Eventually each of these smaller
jobs will be run simultaneously. Areas to pay extra attention to are
# Correct aggregation of compaction job statistics across multiple threads
# Proper opening/closing of output files (make sure each thread's is unique)
# Keys that span multiple L1 files
# Skewed distributions of keys within L0 files
Test Plan: Make and run db_test (newer version has separate compaction tests) and compaction_job_stats_test
Reviewers: igor, noetzli, anthony, sdong, yhchiang
Reviewed By: yhchiang
Subscribers: MarkCallaghan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D42699
Summary: Directly using TMPDIR can cause problems when running tests using parallel option. Fix them.
Test Plan: Run all tests in parallel
Reviewers: kradhakrishnan, yhchiang, IslamAbdelRahman, anthony
Reviewed By: anthony
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D42807
Summary: We want to keep Env a think layer for better portability. Less platform dependent codes should be moved out of Env. In this patch, I create a wrapper of file readers and writers, and put rate limiting, write buffering, as well as most perf context instrumentation and random kill out of Env. It will make it easier to maintain multiple Env in the future.
Test Plan: Run all existing unit tests.
Reviewers: anthony, kradhakrishnan, IslamAbdelRahman, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D42321
Summary:
When we first started, max_background_flushes was 0 by default and compaction thread was executing flushes (since there was no flush thread). Then, we switched the default max_background_flushes to 1. However, we still support the case where there is no flush thread and flushes are done in compaction. This is making our code a bit more complicated. By not supporting this use-case we can make our code simpler.
We have a special case that when you set max_background_flushes to 0, we
schedule the flush to execute on the compaction thread.
Test Plan: make check (there might be some unit tests that depend on this behavior)
Reviewers: IslamAbdelRahman, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D41931
Summary:
ROCKSDB_WARNING is only defined if either ROCKSDB_PLATFORM_POSIX or OS_WIN is defined. This works well for building rocksdb with its own build scripts. But this won't work when an outside project(like mongodb) doesn't define ROCKSDB_PLATFORM_POSIX.
This fix defines ROCKSDB_WARNING for all platforms. No idea if its defined correctly on non-posix,non-windows platforms but this is no worse that the current situation where this macro is missing on unexpected platforms.
This fix should hopefully fix anyone whose build broke now that we've switched from using #warning to Pragma (to support windows). Unfortunately, while mongo-rocks compiles, it ignores the Pragma and doesn't print a warning. I have not been able to figure out a way to implement this portably on all platforms.
Of course, an alternate solution would be to just get rid of ROCKSDB_WARNING and live with include file redirects indefinitely. Thoughts?
Test Plan: build rocks, build mongorocks
Reviewers: igor, kradhakrishnan, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D42477
- Remove make file defines from public headers and use _WIN32 because it is compiler defined
- use __GNUC__ and __clang__ to guard non-portable attributes
- add #include "port/port.h" to some new .cc files.
- minor changes in CMakeLists to reflect recent changes