Summary:
This adds a new metablock containing a shared dictionary that is used
to compress all data blocks in the SST file. The size of the shared dictionary
is configurable in CompressionOptions and defaults to 0. It's currently only
used for zlib/lz4/lz4hc, but the block will be stored in the SST regardless of
the compression type if the user chooses a nonzero dictionary size.
During compaction, computes the dictionary by randomly sampling the first
output file in each subcompaction. It pre-computes the intervals to sample
by assuming the output file will have the maximum allowable length. In case
the file is smaller, some of the pre-computed sampling intervals can be beyond
end-of-file, in which case we skip over those samples and the dictionary will
be a bit smaller. After the dictionary is generated using the first file in a
subcompaction, it is loaded into the compression library before writing each
block in each subsequent file of that subcompaction.
On the read path, gets the dictionary from the metablock, if it exists. Then,
loads that dictionary into the compression library before reading each block.
Test Plan: new unit test
Reviewers: yhchiang, IslamAbdelRahman, cyan, sdong
Reviewed By: sdong
Subscribers: andrewkr, yoshinorim, kradhakrishnan, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D52287
Summary:
Introduced option to dump malloc statistics using new option flag.
Added new command line option to db_bench tool to enable this
funtionality.
Also extended build to support environments with/without jemalloc.
Test Plan:
1) Build rocksdb using `make` command. Launch the following command
`./db_bench --benchmarks=fillrandom --dump_malloc_stats=true
--num=10000000` end verified that jemalloc dump is present in LOG file.
2) Build rocksdb using `DISABLE_JEMALLOC=1 make db_bench -j32` and ran
the same db_bench tool and found the following message in LOG file:
"Please compile with jemalloc to enable malloc dump".
3) Also built rocksdb using `make` command on MacOS to verify behavior
in non-FB environment.
Also to debug build configuration change temporary changed
AM_DEFAULT_VERBOSITY = 1 in Makefile to see compiler and build
tools output. For case 1) -DROCKSDB_JEMALLOC was present in compiler
command line. For both 2) and 3) this flag was not present.
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D57321
Summary: CompactedDB skips memtable. So we shouldn't use compacted DB if there is outstanding WAL files.
Test Plan: Change to options.max_open_files = -1 perf context test to create a compacted DB, which we shouldn't do.
Reviewers: yhchiang, kradhakrishnan, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D57057
Summary:
While trying to reuse PinData() / ReleasePinnedData() .. to optimize away some memcpys I realized that there is a significant overhead for using PinData() / ReleasePinnedData if they were called many times.
This diff refactor the pinning logic by introducing PinnedIteratorsManager a centralized component that will be created once and will be notified whenever we need to Pin an Iterator. This implementation have much less overhead than the original implementation
Test Plan:
make check -j64
COMPILE_WITH_ASAN=1 make check -j64
Reviewers: yhchiang, sdong, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56493
Summary:
The current implementation find the first different byte and try to increment it, if it cannot it return the original key
we can improve this by keep going after the first different byte to find the first non 0xFF byte and increment it
After trying this patch on some logdevice sst files I see decrease in there index block size by 8.5%
Test Plan: existing tests and updated test
Reviewers: yhchiang, andrewkr, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56241
Summary:
In this test some times automatic compactions do everything and Manual compaction become a no-op.
Update the test to make sure manual compaction is not a no-op
Test Plan: run the test
Reviewers: andrewkr, yhchiang, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D57189
Summary:
Now we collect compaction stats per column family, but report default colum family's stat as compaction stats for DB.
Fix it by reporting compaction stats per column family instead.
Test Plan: Run db_bench with --num_column_families=4 and see the number fixed.
Reviewers: IslamAbdelRahman, yhchiang
Reviewed By: yhchiang
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D57063
Summary:
In https://reviews.facebook.net/D56271, we fixed an issue where
we consider flush as compaction. However, that makes us mistakenly
count FLUSH_WRITE_BYTES twice (one in flush_job and one in db_impl.)
This patch removes the one incremented in db_impl.
Test Plan: db_test
Reviewers: yiwu, andrewkr, IslamAbdelRahman, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D57111
* Musl libc does not provide adaptive mutex. Added feature test for PTHREAD_MUTEX_ADAPTIVE_NP.
* Musl libc does not provide backtrace(3). Added a feature check for backtrace(3).
* Fixed compiler error.
* Musl libc does not implement backtrace(3). Added platform check for libexecinfo.
* Alpine does not appear to support gcc -pg option. By default (gcc has PIE option enabled) it fails with:
gcc: error: -pie and -pg|p|profile are incompatible when linking
When -fno-PIE and -nopie are used it fails with:
/usr/lib/gcc/x86_64-alpine-linux-musl/5.3.0/../../../../x86_64-alpine-linux-musl/bin/ld: cannot find gcrt1.o: No such file or directory
Added gcc -pg platform test and output PROFILING_FLAGS accordingly. Replaced pg var in Makefile with PROFILING_FLAGS.
* fix segfault when TEST_IOCTL_FRIENDLY_TMPDIR is undefined and default candidates are not suitable
* use ASSERT_DOUBLE_EQ instead of ASSERT_EQ
* When compiled with ROCKSDB_MALLOC_USABLE_SIZE UniversalCompactionFourPaths and UniversalCompactionSecondPathRatio tests fail due to premature memtable flushes on systems with 16-byte alignment. Arena runs out of block space before GenerateNewFile() completes.
Increased options.write_buffer_size.
Summary:
This is needed so we can measure compression ratio improvements
achieved by D52287.
The property compares raw data size against the total file size for a given
level. If the level is empty it should return 0.0.
Test Plan: new unit test
Reviewers: IslamAbdelRahman, yhchiang, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D56967
Comparable with Snappy on comp ratio.
Implemented using Windows API, does not require external package.
Avaiable since Windows 8 and server 2012.
Use -DXPRESS=1 with CMake to enable.
Summary: We need to enable sync_point processing before creating the SstFileManager to ensure that we are holding the bg delete scheduler thread from running
Test Plan:
run the test
debug using printf
Reviewers: sdong, yhchiang, yiwu, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56871
Summary: D56715 move some of the tests from db_test to db_block_cache_test. Some of them should be disabled in lite build.
Test Plan:
make check -j32
OPT='-DROCKSDB_LITE' make check -j32
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D56907
Summary: write_callback_test fails if previous run didn't finish cleanly. Clean the DB before runing the test.
Test Plan: Run the test that see it doesn't fail any more.
Reviewers: andrewkr, yhchiang, yiwu, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: kradhakrishnan, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56859
Summary:
Solution is not to change db sequence number to start from 1 because 0 value is used in multiple other places.
Fix covers only compact_iterator::findEarliestVisibleSnapshot with updated logic to support snapshot's numbering starting from 0.
Test Plan:
run:
make all check
it should pass all tests
Reviewers: IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: lgalanis, mgalushka, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56601
Summary: In DBTest.HardLimit, multiple flushes may merge into one, based on thread scheduling. Avoid it by waiting each flush to finish before generating the next one.
Test Plan: Run test in parallel several times and see it doesn't fail any more.
Reviewers: yhchiang, kradhakrishnan, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: yiwu, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56853
Summary: DBTestUniversalCompaction, IncreaseUniversalCompactionNumLevels fails one in about 30 runs when running in parallel. We wait for compaction after each flush to make the compaction behavior deterministic.
Test Plan: Run the test 1000 times in parallel and it still passes.
Reviewers: yhchiang, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: kradhakrishnan, yiwu, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56841
Summary: It is useful to print out IO stats in flush jobs too. Extend options.compaction_measure_io_stats to flush jobs and raname it.
Test Plan: Try db_bench and see the stats are printed out.
Reviewers: yhchiang
Reviewed By: yhchiang
Subscribers: kradhakrishnan, yiwu, IslamAbdelRahman, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56769
Summary:
In the case where we can't find a filter block, there is not much benefit of doing the binary search and see whether the index key has the prefix. With the change, we blindly return true if we can't get the filter.
It also fixes missing row cases for reverse comparator with full bloom.
Test Plan: Add a test case that used to fail.
Reviewers: yhchiang, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: kradhakrishnan, yiwu, hermanlee4, yoshinorim, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56697
Summary: The reason for this test flakiness is that we try to verify that number of files in L0 is 3 after flushing the 3rd file although we may have a compaction running in the background that may finish before we do the check and the 3 L0 files are converted to 1 L1 file
Test Plan: Run a modified version of the test that sleep before doing the check
Reviewers: sdong, andrewkr, kradhakrishnan, yhchiang
Reviewed By: yhchiang
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56643
Summary:
Full block checking should be a good enough indication of prefix existance. No need to further check data block.
This also fixes wrong results when using prefix bloom and reverse bitwise comparator.
Test Plan: Will add a unit test.
Reviewers: yhchiang, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: hermanlee4, yoshinorim, yiwu, kradhakrishnan, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56625
Summary:
In some case, it is possible to have two concesutive SST files might sharing
same boundary keys. However, in the assertion in Compaction::ShouldStopBefore,
it exclude such possibility.
This patch fix this issue by relaxing the assertion to allow the equal case.
Test Plan: rocksdb tests
Reviewers: IslamAbdelRahman, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D55875
Summary: Fixed a bug in RocksDB Statistics where flush is considered as compaction
Test Plan: unit test
Reviewers: sdong, IslamAbdelRahman, rven, kradhakrishnan, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D56271
Summary: The code assumes that if use_mmap_reads is on then use_os_buffer is also on. This make sense as by using memory mapped files for reading you are expecting the OS to cache what it needs. Add code to make sure the user does not turn off use_os_buffer when they turn on use_mmap_reads
Test Plan: New test: DBTest.MMapAndBufferOptions
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56397
Summary:
Added the column family name to the properties block. This property
is omitted only if the property is unavailable, such as when RepairDB()
writes SST files.
In a next diff, I will change RepairDB to use this new property for
deciding to which column family an existing SST file belongs. If this
property is missing, it will add it to the "unknown" column family (same
as its existing behavior).
Test Plan:
New unit test:
$ ./db_table_properties_test --gtest_filter=DBTablePropertiesTest.GetColumnFamilyNameProperty
Reviewers: IslamAbdelRahman, yhchiang, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D55605
Summary: We use object `v` in the error message, which is not initialized if the edit is column family manipulation. This doesn't provide much useful info, so this diff is removing it. Instead, it dumps actual VersionEdit contents.
Test Plan: compiles. would be great to get tests in version_set_test.cc that cover cases where a file write fails
Reviewers: sdong, yhchiang, andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D56349
Summary:
There is a hardcoded constraint in our statistics collection that prevents reading properties from more than 20 SST files. This means our statistics will be very inaccurate for databases with > 20 files since additional files are just ignored. The purpose of constraining the number of files used is to bound the I/O performed during statistics collection, since these statistics need to be recomputed every time the database reopened.
However, this constraint doesn't take into account the case where option "max_open_files" is -1. In that case, all the file metadata has already been read, so MaybeInitializeFileMetaData() won't incur any I/O cost. so this diff gets rid of the 20-file constraint in case max_open_files == -1.
Test Plan:
write into unit test db/db_properties_test.cc - "ValidateSampleNumber".
We generate 20 files with 2 rows and 10 files with 1 row.
If max_open_files !=-1, the `rocksdb.estimate-num-keys` should be (10*1 + 10*2)/20 * 30 = 45. Otherwise, it should be the ground truth, 50.
{F1089153}
Reviewers: andrewkr
Reviewed By: andrewkr
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D56253
Summary:
- Put key offset and key size in WriteBatchIndexEntry
- Use vector for comparators in WriteBatchEntryComparator
I use a slightly modified version of @yoshinorim code to benchmark
https://gist.github.com/IslamAbdelRahman/b120f4fba8d6ff7d58d2
For Put I create a transaction that put a 1000000 keys and measure the time spent without commit.
For GetForUpdate I read the keys that I added in the Put transaction.
Original time:
```
rm -rf /dev/shm/rocksdb-example/
./txn_bench put 1000000
1000000 OK Ops | took 3.679 seconds
./txn_bench get_for_update 1000000
1000000 OK Ops | took 3.940 seconds
```
New Time
```
rm -rf /dev/shm/rocksdb-example/
./txn_bench put 1000000
1000000 OK Ops | took 2.727 seconds
./txn_bench get_for_update 1000000
1000000 OK Ops | took 3.880 seconds
```
It looks like there is no significant improvement in GetForUpdate() but we can see ~30% improvement in Put()
Test Plan: unittests
Reviewers: yhchiang, anthony, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, yoshinorim
Differential Revision: https://reviews.facebook.net/D55539
Summary:
When a block based table file is opened, if prefetch_index_and_filter is true, it will prefetch the index and filter blocks, putting them into the block cache.
What this feature adds: when a L0 block based table file is opened, if pin_l0_filter_and_index_blocks_in_cache is true in the options (and prefetch_index_and_filter is true), then the filter and index blocks aren't released back to the block cache at the end of BlockBasedTableReader::Open(). Instead the table reader takes ownership of them, hence pinning them, ie. the LRU cache will never push them out. Meanwhile in the table reader, further accesses will not hit the block cache, thus avoiding lock contention.
Test Plan:
'export TEST_TMPDIR=/dev/shm/ && DISABLE_JEMALLOC=1 OPT=-g make all valgrind_check -j32' is OK.
I didn't run the Java tests, I don't have Java set up on my devserver.
Reviewers: sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D56133
Summary: Change some RocksDB default options to make it more friendly to server workloads.
Test Plan: Run all existing tests
Reviewers: yhchiang, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: sumeet, muthu, benj, MarkCallaghan, igor, leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D55941
record.size() should not be less than 12.
This "magic number" seems to be the WriteBatch header (8 byte sequence
and 4 byte count). Replaced all the places where "12" was used
by WriteBatchInternal::kHeader.
Summary: We were not measuring the time spent in merge_operator when called from Version::Get()
Test Plan: added a unittest
Reviewers: sdong, yhchiang
Reviewed By: yhchiang
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D55905
Summary:
When subcompaction is used, all subcompactions share the same Compaction
pointer in CompactionJob while each subcompaction all keeps their mutable
stats in SubcompactionState. However, there're still some mutable part
that is currently store in the shared Compaction pointer.
This patch makes two changes:
1. Make the shared Compaction pointer const so that it can never be modified
during the compaction.
2. Move necessary states from Compaction to SubcompactionState.
3. Make functions of Compaction const if the function does not modify
its internal state.
Test Plan: rocksdb and MyRocks test
Reviewers: sdong, kradhakrishnan, andrewkr, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba, yoshinorim, gunnarku, leveldb
Differential Revision: https://reviews.facebook.net/D55923
Summary:
I ran into this assert when stress testing transactions. It's pretty easy to repro.
Changing VersionSet::last_sequence_ to start at 1 seems pretty straightforward. We would just need to change the 4 callers of SetLastSequence(), including recovery code. I'd make this change myself, but I do not have enough time to test changes to recovery code-paths this week. But checking in this test case (disabled) for future fixing.
Test Plan: n/a
Reviewers: yhchiang, kradhakrishnan, andrewkr, anthony, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D55311
Summary:
Basic test cases:
- Manifest is lost or corrupt
- Manifest refers to too many or too few SST files
- SST file is corrupt
- Unflushed data is present when RepairDB is called
Depends on D55065 for its CreateFile() function in file_utils
Test Plan: Ran the tests.
Reviewers: IslamAbdelRahman, yhchiang, yoshinorim, sdong
Reviewed By: sdong
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D55485