rocksdb

Commit Graph

Author	SHA1	Message	Date
Siying Dong	8843129ece	Move some memory related files from util/ to memory/ (#5382 ) Summary: Move arena, allocator, and memory tools under util to a separate memory/ directory. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5382 Differential Revision: D15564655 Pulled By: siying fbshipit-source-id: 9cd6b5d0d3d52b39606e19221fa154596e5852a5	6 years ago
Yanqin Jin	9358178edc	Support for single-primary, multi-secondary instances (#4899 ) Summary: This PR allows RocksDB to run in single-primary, multi-secondary process mode. The writer is a regular RocksDB (e.g. an `DBImpl`) instance playing the role of a primary. Multiple `DBImplSecondary` processes (secondaries) share the same set of SST files, MANIFEST, WAL files with the primary. Secondaries tail the MANIFEST of the primary and apply updates to their own in-memory state of the file system, e.g. `VersionStorageInfo`. This PR has several components: 1. (Originally in #4745). Add a `PathNotFound` subcode to `IOError` to denote the failure when a secondary tries to open a file which has been deleted by the primary. 2. (Similar to #4602). Add `FragmentBufferedReader` to handle partially-read, trailing record at the end of a log from where future read can continue. 3. (Originally in #4710 and #4820). Add implementation of the secondary, i.e. `DBImplSecondary`. 3.1 Tail the primary's MANIFEST during recovery. 3.2 Tail the primary's MANIFEST during normal processing by calling `ReadAndApply`. 3.3 Tailing WAL will be in a future PR. 4. Add an example in 'examples/multi_processes_example.cc' to demonstrate the usage of secondary RocksDB instance in a multi-process setting. Instructions to run the example can be found at the beginning of the source code. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4899 Differential Revision: D14510945 Pulled By: riversand963 fbshipit-source-id: 4ac1c5693e6012ad23f7b4b42d3c374fecbe8886	6 years ago
Siying Dong	5d9a623e2c	Add a unit test to Ignorable manfiest record (#4964 ) Summary: https://github.com/facebook/rocksdb/pull/4960 introduced ignorable manfiest record. Adding a test to it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4964 Differential Revision: D14012667 Pulled By: siying fbshipit-source-id: e5f10ecc68dec2716e178d44f0fe2b76c3d857ef	6 years ago
Yanqin Jin	d116a1725d	Update recovery code for version edits group commit. (#3945 ) Summary: During recovery, RocksDB is able to handle version edits that belong to group commits. This PR is a subset of [PR 3752](https://github.com/facebook/rocksdb/pull/3752) Pull Request resolved: https://github.com/facebook/rocksdb/pull/3945 Differential Revision: D8529122 Pulled By: riversand963 fbshipit-source-id: 57cb0f9cc55ecca684a837742d6626dc9c07f37e	6 years ago
Yanqin Jin	54de56844d	Remove random writes from SST file ingestion (#4172 ) Summary: RocksDB used to store global_seqno in external SST files written by SstFileWriter. During file ingestion, RocksDB uses `pwrite` to update the `global_seqno`. Since random write is not supported in some non-POSIX compliant file systems, external SST file ingestion is not supported on these file systems. To address this limitation, we no longer update `global_seqno` during file ingestion. Later RocksDB uses the MANIFEST and other information in table properties to deduce global seqno for externally-ingested SST files. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4172 Differential Revision: D8961465 Pulled By: riversand963 fbshipit-source-id: 4382ec85270a96be5bc0cf33758ca2b167b05071	6 years ago
Nikhil Benesch	5f3088d565	Range deletion performance improvements + cleanup (#4014 ) Summary: This fixes the same performance issue that #3992 fixes but with much more invasive cleanup. I'm more excited about this PR because it paves the way for fixing another problem we uncovered at Cockroach where range deletion tombstones can cause massive compactions. For example, suppose L4 contains deletions from [a, c) and [x, z) and no other keys, and L5 is entirely empty. L6, however, is full of data. When compacting L4 -> L5, we'll end up with one file that spans, massively, from [a, z). When we go to compact L5 -> L6, we'll have to rewrite all of L6! If, instead of range deletions in L4, we had keys a, b, x, y, and z, RocksDB would have been smart enough to create two files in L5: one for a and b and another for x, y, and z. With the changes in this PR, it will be possible to adjust the compaction logic to split tombstones/start new output files when they would span too many files in the grandparent level. ajkr please take a look when you have a minute! Pull Request resolved: https://github.com/facebook/rocksdb/pull/4014 Differential Revision: D8773253 Pulled By: ajkr fbshipit-source-id: ec62fa85f648fdebe1380b83ed997f9baec35677	6 years ago
Siying Dong	d59549298f	Skip deleted WALs during recovery Summary: This patch record min log number to keep to the manifest while flushing SST files to ignore them and any WAL older than them during recovery. This is to avoid scenarios when we have a gap between the WAL files are fed to the recovery procedure. The gap could happen by for example out-of-order WAL deletion. Such gap could cause problems in 2PC recovery where the prepared and commit entry are placed into two separate WAL and gap in the WALs could result into not processing the WAL with the commit entry and hence breaking the 2PC recovery logic. Before the commit, for 2PC case, we determined which log number to keep in FindObsoleteFiles(). We looked at the earliest logs with outstanding prepare entries, or prepare entries whose respective commit or abort are in memtable. With the commit, the same calculation is done while we apply the SST flush. Just before installing the flush file, we precompute the earliest log file to keep after the flush finishes using the same logic (but skipping the memtables just flushed), record this information to the manifest entry for this new flushed SST file. This pre-computed value is also remembered in memory, and will later be used to determine whether a log file can be deleted. This value is unlikely to change until next flush because the commit entry will stay in memtable. (In WritePrepared, we could have removed the older log files as soon as all prepared entries are committed. It's not yet done anyway. Even if we do it, the only thing we loss with this new approach is earlier log deletion between two flushes, which does not guarantee to happen anyway because the obsolete file clean-up function is only executed after flush or compaction) This min log number to keep is stored in the manifest using the safely-ignore customized field of AddFile entry, in order to guarantee that the DB generated using newer release can be opened by previous releases no older than 4.2. Closes https://github.com/facebook/rocksdb/pull/3765 Differential Revision: D7747618 Pulled By: siying fbshipit-source-id: d00c92105b4f83852e9754a1b70d6b64cb590729	7 years ago
Siying Dong	d5afa73789	Revert "Skip deleted WALs during recovery" Summary: This reverts commit `73f21a7b21`. It breaks compatibility. When created a DB using a build with this new change, opening the DB and reading the data will fail with this error: "Corruption: Can't access /000000.sst: IO error: while stat a file for size: /tmp/xxxx/000000.sst: No such file or directory" This is because the dummy AddFile4 entry generated by the new code will be treated as a real entry by an older build. The older build will think there is a real file with number 0, but there isn't such a file. Closes https://github.com/facebook/rocksdb/pull/3762 Differential Revision: D7730035 Pulled By: siying fbshipit-source-id: f2051859eff20ef1837575ecb1e1bb96b3751e77	7 years ago
Maysam Yabandeh	73f21a7b21	Skip deleted WALs during recovery Summary: This patch record the deleted WAL numbers in the manifest to ignore them and any WAL older than them during recovery. This is to avoid scenarios when we have a gap between the WAL files are fed to the recovery procedure. The gap could happen by for example out-of-order WAL deletion. Such gap could cause problems in 2PC recovery where the prepared and commit entry are placed into two separate WAL and gap in the WALs could result into not processing the WAL with the commit entry and hence breaking the 2PC recovery logic. Closes https://github.com/facebook/rocksdb/pull/3488 Differential Revision: D6967893 Pulled By: maysamyabandeh fbshipit-source-id: 13119feb155a08ab6d4909f437c7a750480dc8a1	7 years ago
Prashant D	34aa245dd8	Fix coverity issues version, write_batch Summary: db/version_builder.cc: 117 base_vstorage_->InternalComparator(); CID 1351713 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR) 2. uninit_member: Non-static class member field level_zero_cmp_.internal_comparator is not initialized in this constructor nor in any functions that it calls. db/version_edit.h: 145 FdWithKeyRange() 146 : fd(), 147 smallest_key(), 148 largest_key() { CID 1418254 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR) 2. uninit_member: Non-static class member file_metadata is not initialized in this constructor nor in any functions that it calls. 149 } db/version_set.cc: 120 } CID 1322789 (#1 of 1): Uninitialized pointer field (UNINIT_CTOR) 4. uninit_member: Non-static class member curr_file_level_ is not initialized in this constructor nor in any functions that it calls. 121 } db/write_batch.cc: 939 assert(cf_mems_); CID 1419862 (#1 of 1): Uninitialized scalar field (UNINIT_CTOR) 3. uninit_member: Non-static class member rebuilding_trx_seq_ is not initialized in this constructor nor in any functions that it calls. 940 } Closes https://github.com/facebook/rocksdb/pull/3092 Differential Revision: D6505666 Pulled By: yiwu-arbug fbshipit-source-id: fd2c68948a0280772691a419d72ac7e190951d86	7 years ago
Siying Dong	3c327ac2d0	Change RocksDB License Summary: Closes https://github.com/facebook/rocksdb/pull/2589 Differential Revision: D5431502 Pulled By: siying fbshipit-source-id: 8ebf8c87883daa9daa54b2303d11ce01ab1f6f75	7 years ago
Siying Dong	5582123dee	Sample number of reads per SST file Summary: We estimate number of reads per SST files, by updating the counter per file in sampled read requests. This information can later be used to trigger compactions to improve read performacne. Closes https://github.com/facebook/rocksdb/pull/2417 Differential Revision: D5193528 Pulled By: siying fbshipit-source-id: b4241c5ad0eaf444b61afb53f8e6290d9f5da2df	8 years ago
Siying Dong	d616ebea23	Add GPLv2 as an alternative license. Summary: Closes https://github.com/facebook/rocksdb/pull/2226 Differential Revision: D4967547 Pulled By: siying fbshipit-source-id: dd3b58ae1e7a106ab6bb6f37ab5c88575b125ab4	8 years ago
Andrew Kryczka	56222f57df	Avoid FileMetaData copy Summary: as titled Test Plan: unit tests Reviewers: sdong, lightmark Reviewed By: lightmark Subscribers: andrewkr, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D60597	8 years ago
Baraa Hamodi	21e95811d1	Updated all copyright headers to the new format.	9 years ago
Dmitri Smirnov	236fe21c92	Enable MS compiler warning c4244. Mostly due to the fact that there are differences in sizes of int,long on 64 bit systems vs GNU.	9 years ago
sdong	b77eb16aba	New Manifest format to allow customized fields in NewFile. Summary: With this commit, we add a new format in manifest when adding a new file. Now path ID and need-compaction hint are first two customized fields. Test Plan: Add a test case in version_edit_test to verify the encoding and decoding logic. Add a unit test in db_test to verify need compaction is persistent after DB restarting. Reviewers: kradhakrishnan, anthony, IslamAbdelRahman, yhchiang, rven, igor Reviewed By: igor Subscribers: javigon, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D48123	9 years ago
Andres Noetzli	8aa1f15197	Refactored common code of Builder/CompactionJob out into a CompactionIterator Summary: Builder and CompactionJob share a lot of fairly complex code. This patch refactors this code into a separate class, the CompactionIterator. Because the shared code is fairly complex, this patch hopefully improves maintainability. While there are is a lot of potential for further improvements, the patch is intentionally pretty close to the original structure because the change is already complex enough. Test Plan: make clean all check && ./db_stress Reviewers: rven, anthony, yhchiang, sdong, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D46197	9 years ago
Ari Ekmekji	74c755c552	Added JSON manifest dump option to ldb command Summary: Added a new flag --json to the ldb manifest_dump command that prints out the version edits as JSON objects for easier reading and parsing of information. Test Plan: Sample usage: ``` ./ldb manifest_dump --json --path=path/to/manifest/file ``` Sample output: ``` {"EditNumber": 0, "Comparator": "leveldb.BytewiseComparator", "ColumnFamily": 0} {"EditNumber": 1, "LogNumber": 0, "ColumnFamily": 0} {"EditNumber": 2, "LogNumber": 4, "PrevLogNumber": 0, "NextFileNumber": 7, "LastSeq": 35356, "AddedFiles": [{"Level": 0, "FileNumber": 5, "FileSize": 1949284, "SmallestIKey": "'", "LargestIKey": "'"}], "ColumnFamily": 0} ... {"EditNumber": 13, "PrevLogNumber": 0, "NextFileNumber": 36, "LastSeq": 290994, "DeletedFiles": [{"Level": 0, "FileNumber": 17}, {"Level": 0, "FileNumber": 20}, {"Level": 0, "FileNumber": 22}, {"Level": 0, "FileNumber": 24}, {"Level": 1, "FileNumber": 13}, {"Level": 1, "FileNumber": 14}, {"Level": 1, "FileNumber": 15}, {"Level": 1, "FileNumber": 18}], "AddedFiles": [{"Level": 1, "FileNumber": 25, "FileSize": 2114340, "SmallestIKey": "'", "LargestIKey": "'"}, {"Level": 1, "FileNumber": 26, "FileSize": 2115213, "SmallestIKey": "'", "LargestIKey": "'"}, {"Level": 1, "FileNumber": 27, "FileSize": 2114807, "SmallestIKey": "'", "LargestIKey": "'"}, {"Level": 1, "FileNumber": 30, "FileSize": 2115271, "SmallestIKey": "'", "LargestIKey": "'"}, {"Level": 1, "FileNumber": 31, "FileSize": 2115165, "SmallestIKey": "'", "LargestIKey": "'"}, {"Level": 1, "FileNumber": 32, "FileSize": 2114683, "SmallestIKey": "'", "LargestIKey": "'"}, {"Level": 1, "FileNumber": 35, "FileSize": 1757512, "SmallestIKey": "'", "LargestIKey": "'"}], "ColumnFamily": 0} ... ``` Reviewers: sdong, anthony, yhchiang, igor Reviewed By: igor Subscribers: dhruba Differential Revision: https://reviews.facebook.net/D41727	9 years ago
sdong	6df589b446	Add TablePropertiesCollector::NeedCompact() to suggest DB to further compact output files Summary: It is experimental. Allow users to return from a call back function TablePropertiesCollector::NeedCompact(), based on the data in the file. It can be used to allow users to suggest DB to clear up delete tombstones faster. Test Plan: Add a unit test. Reviewers: igor, yhchiang, kradhakrishnan, rven Reviewed By: rven Subscribers: yoshinorim, MarkCallaghan, maykov, leveldb, dhruba Differential Revision: https://reviews.facebook.net/D39585	10 years ago
Igor Canadi	6059bdf86a	Add experimental API MarkForCompaction() Summary: Some Mongo+Rocks datasets in Parse's environment are not doing compactions very frequently. During the quiet period (with no IO), we'd like to schedule compactions so that our reads become faster. Also, aggressively compacting during quiet periods helps when write bursts happen. In addition, we also want to compact files that are containing deleted key ranges (like old oplog keys). All of this is currently not possible with CompactRange() because it's single-threaded and blocks all other compactions from happening. Running CompactRange() risks an issue of blocking writes because we generate too much Level 0 files before the compaction is over. Stopping writes is very dangerous because they hold transaction locks. We tried running manual compaction once on Mongo+Rocks and everything fell apart. MarkForCompaction() solves all of those problems. This is very light-weight manual compaction. It is lower priority than automatic compactions, which means it shouldn't interfere with background process keeping the LSM tree clean. However, if no automatic compactions need to be run (or we have extra background threads available), we will start compacting files that are marked for compaction. Test Plan: added a new unit test Reviewers: yhchiang, rven, MarkCallaghan, sdong Reviewed By: sdong Subscribers: yoshinorim, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D37083	10 years ago
Igor Canadi	2a979822b6	Fix deleting obsolete files Summary: This diff basically reverts D30249 and also adds a unit test that was failing before this patch. I have no idea how I didn't catch this terrible bug when writing a diff, sorry about that :( I think we should redesign our system of keeping track of and deleting files. This is already a second bug in this critical piece of code. I'll think of few ideas. BTW this diff is also a regression when running lots of column families. I plan to revisit this separately. Test Plan: added a unit test Reviewers: yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D33045	10 years ago
Igor Canadi	e39f4f6cf9	Fix data race #3 Summary: Added requirement that ComputeCompactionScore() be executed in mutex, since it's accessing being_compacted bool, which can be mutated by other threads. Also added more comments about thread safety of FileMetaData, since it was a bit confusing. However, it seems that FileMetaData doesn't have data races (except being_compacted) Test Plan: Ran 100 ConvertCompactionStyle tests with thread sanitizer. On master -- some failures. With this patch -- none. Reviewers: yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D32283	10 years ago
Igor Canadi	0acc738810	Speed up FindObsoleteFiles() Summary: There are two versions of FindObsoleteFiles(): * full scan, which is executed every 6 hours (and it's terribly slow) * no full scan, which is executed every time a background process finishes and iterator is deleted This diff is optimizing the second case (no full scan). Here's what we do before the diff: * Get the list of obsolete files (files with ref==0). Some files in obsolete_files set might actually be live. * Get the list of live files to avoid deleting files that are live. * Delete files that are in obsolete_files and not in live_files. After this diff: * The only files with ref==0 that are still live are files that have been part of move compaction. Don't include moved files in obsolete_files. * Get the list of obsolete files (which exclude moved files). * No need to get the list of live files, since all files in obsolete_files need to be deleted. I'll post the benchmark results, but you can get the feel of it here: https://reviews.facebook.net/D30123 This depends on D30123. P.S. We should do full scan only in failure scenarios, not every 6 hours. I'll do this in a follow-up diff. Test Plan: One new unit test. Made sure that unit test fails if we don't have a `if (!f->moved)` safeguard in ~Version. make check Big number of compactions and flushes: ./db_stress --threads=30 --ops_per_thread=20000000 --max_key=10000 --column_families=20 --clear_column_family_one_in=10000000 --verify_before_write=0 --reopen=15 --max_background_compactions=10 --max_background_flushes=10 --db=/fast-rocksdb-tmp/db_stress --prefixpercent=0 --iterpercent=0 --writepercent=75 --db_write_buffer_size=2000000 Reviewers: yhchiang, rven, sdong Reviewed By: sdong Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D30249	10 years ago
Igor Canadi	767777c2bd	Turn on -Wshorten-64-to-32 and fix all the errors Summary: We need to turn on -Wshorten-64-to-32 for mobile. See D1671432 (internal phabricator) for details. This diff turns on the warning flag and fixes all the errors. There were also some interesting errors that I might call bugs, especially in plain table. Going forward, I think it makes sense to have this flag turned on and be very very careful when converting 64-bit to 32-bit variables. Test Plan: compiles Reviewers: ljin, rven, yhchiang, sdong Reviewed By: yhchiang Subscribers: bobbaldwin, dhruba, leveldb Differential Revision: https://reviews.facebook.net/D28689	10 years ago
Igor Canadi	9f7fc3ac45	Turn on -Wshadow Summary: ...and fix all the errors :) Jim suggested turning on -Wshadow because it helped him fix number of critical bugs in fbcode. I think it's a good idea to be -Wshadow clean. Test Plan: compiles Reviewers: yhchiang, rven, sdong, ljin Reviewed By: ljin Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D27711	10 years ago
sdong	4d2ba38b65	Make VersionBuilder unit testable Summary: Rename Version::Builder to VersionBuilder and expose its definition to a header. Make VerisonBuilder not reference Version or ColumnFamilyData, only working with VersionStorageInfo. Add version_builder_test which has a simple test. Test Plan: make all check Reviewers: rven, yhchiang, igor, ljin Reviewed By: igor Subscribers: leveldb, dhruba Differential Revision: https://reviews.facebook.net/D27969	10 years ago
Yueh-Hsuan Chiang	3772a3d09d	Fix the bug where compaction does not fail when RocksDB can't create a new file. Summary: This diff has two fixes. 1. Fix the bug where compaction does not fail when RocksDB can't create a new file. 2. When NewWritableFiles() fails in OpenCompactionOutputFiles(), previously such fail-to-created file will be still be included as a compaction output. This patch also fixes this bug. 3. Allow VersionEdit::EncodeTo() to return Status and add basic check. Test Plan: ./version_edit_test export ROCKSDB_TESTS=FileCreationRandomFailure ./db_test Reviewers: ljin, sdong, nkg-, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D25581	10 years ago
Lei Jin	834c67d77f	rename FileLevel to LevelFilesBrief / unfriend CompactedDBImpl Summary: We have several different types of data structures for file information. FileLevel is kinda of confusing since it only contains file range and fd. Rename it to LevelFilesBrief to make it clear. Unfriend CompactedDBImpl as a by product Test Plan: make release / make all will run full test with all stacked diffs Reviewers: sdong, yhchiang, rven, igor Reviewed By: igor Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D27585	10 years ago
Yueh-Hsuan Chiang	6c66918645	Speed up DB::Open() and Version creation by limiting the number of FileMetaData initialization. Summary: This diff speeds up DB::Open() and Version creation by limiting the number of FileMetaData initialization. The behavior of Version::UpdateAccumulatedStats() is changed as follows: * It only initializes the first 20 uninitialized FileMetaData from file. This guarantees the size of the latest 20 files will always be compensated when they have any deletion entries. Previously it may initialize all FileMetaData by loading all files at DB::Open(). * In case none the first 20 files has any data entry, UpdateAccumulatedStats() will initialize the FileMetaData of the oldest file. Test Plan: db_test Reviewers: igor, sdong, ljin Reviewed By: ljin Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D24255	10 years ago
Nik Bougalis	4329d74e05	Fix swapped variable names to accurately reflect usage	10 years ago
Yueh-Hsuan Chiang	570ba5aca8	Avoid retrying to read property block from a table when it does not exist. Summary: Avoid retrying to read property block from a table when it does not exist in updating stats for compensating deletion entries. In addition, ReadTableProperties() now returns Status::NotFound instead of Status::Corruption when table properties does not exist in the file. Test Plan: make db_test -j32 export ROCKSDB_TESTS=CompactionDeleteionTrigger ./db_test Reviewers: ljin, sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D21867	10 years ago
Feng Zhu	178fd6f9db	use FileLevel in LevelFileNumIterator Summary: Use FileLevel in LevelFileNumIterator, thus use new version of findFile. Old version of findFile function is deleted. Write a function in version_set.cc to generate FileLevel from files_. Add GenerateFileLevelTest in version_set_test.cc Test Plan: make all check Reviewers: ljin, haobo, yhchiang, sdong Reviewed By: sdong Subscribers: igor, dhruba Differential Revision: https://reviews.facebook.net/D19659	10 years ago
Feng Zhu	222cf2555a	change the init parameter for FileDescriptor Summary: fix a bug in improve_file_key_search, change the parameter for FileDescriptor Test Plan: make all check Reviewers: sdong Reviewed By: sdong Differential Revision: https://reviews.facebook.net/D19611	11 years ago
Feng Zhu	f697cad15c	create compressed_levels_ in Version, allocate its space using arena. Make Version::Get, Version::FindFile faster Summary: Define CompressedFileMetaData that just contains fd, smallest_slice, largest_slice. Create compressed_levels_ in Version, the space is allocated using arena Thus increase the file meta data locality, speed up "Get" and "FindFile" benchmark with in-memory tmpfs, could have 4% improvement under "random read" and 2% improvement under "read while writing" benchmark command: ./db_bench --db=/mnt/db/rocksdb --num_levels=6 --key_size=20 --prefix_size=20 --keys_per_prefix=0 --value_size=100 --block_size=4096 --cache_size=17179869184 --cache_numshardbits=6 --compression_type=none --compression_ratio=1 --min_level_to_compress=-1 --disable_seek_compaction=1 --hard_rate_limit=2 --write_buffer_size=134217728 --max_write_buffer_number=2 --level0_file_num_compaction_trigger=8 --target_file_size_base=33554432 --max_bytes_for_level_base=1073741824 --disable_wal=0 --sync=0 --disable_data_sync=1 --verify_checksum=1 --delete_obsolete_files_period_micros=314572800 --max_grandparent_overlap_factor=10 --max_background_compactions=4 --max_background_flushes=0 --level0_slowdown_writes_trigger=16 --level0_stop_writes_trigger=24 --statistics=0 --stats_per_interval=0 --stats_interval=1048576 --histogram=0 --use_plain_table=1 --open_files=-1 --mmap_read=1 --mmap_write=0 --memtablerep=prefix_hash --bloom_bits=10 --bloom_locality=1 --perf_level=0 --benchmarks=readwhilewriting,readwhilewriting,readwhilewriting --use_existing_db=1 --num=52428800 --threads=1 —writes_per_second=81920 Read Random: From 1.8363 ms/op, improve to 1.7587 ms/op. Read while writing: From 2.985 ms/op, improve to 2.924 ms/op. Test Plan: make all check Reviewers: ljin, haobo, yhchiang, sdong Reviewed By: sdong Subscribers: dhruba, igor Differential Revision: https://reviews.facebook.net/D19419	11 years ago
Yueh-Hsuan Chiang	70828557ef	Some fixes on size compensation logic for deletion entry in compaction Summary: This patch include two fixes: 1. newly created Version will now takes the aggregated stats for average-value-size from the latest Version. 2. compensated size of a file is now computed only for newly created / loaded file, this addresses the issue where files are already sorted by their compensated file size but might sometimes observe some out-of-order due to later update on compensated file size. Test Plan: export ROCKSDB_TESTS=CompactionDele ./db_test Reviewers: ljin, igor, sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19557	11 years ago
sdong	2459f7ec4e	Support Multiple DB paths (without having an interface to expose to users) Summary: In this patch, we allow RocksDB to support multiple DB paths internally. No user interface is supported yet so this patch is silent to users. Test Plan: make all check Reviewers: igor, haobo, ljin, yhchiang Reviewed By: yhchiang Subscribers: dhruba, leveldb Differential Revision: https://reviews.facebook.net/D18921	11 years ago
Yueh-Hsuan Chiang	8898a0a0d1	Reorder the member variables of FileMetaData to improve cache locality. Summary: Move stats related member variables of FileMetaData to the bottom to improve cache locality of normal DB operations. Test Plan: make Reviewers: haobo, ljin, sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19287	11 years ago
Yueh-Hsuan Chiang	e813f5b6d9	Allow compaction to reclaim storage more effectively. Summary: This diff allows compaction to reclaim storage more effectively. In the current design, compactions are mainly triggered based on the file sizes. However, since deletion entries does not have value, files which have many deletion entries are less likely to be compacted. As a result, it may took a while to make deletion entries to be compacted. This diff address issue by compensating the size of deletion entries during compaction process: the size of each deletion entry in the compaction process is augmented by 2x average value size. The diff applies to both leveled and universal compacitons. Test Plan: develop CompactionDeletionTrigger make db_test ./db_test Reviewers: haobo, igor, ljin, sdong Reviewed By: sdong Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19029	11 years ago
Igor Canadi	d4a8423334	Remove seek compaction Summary: As discussed in our internal group, we don't get much use of seek compaction at the moment, while it's making code more complicated and slower in some cases. This diff removes seek compaction and (hopefully) all code that was introduced to support seek compaction. There is one test case that relied on didIO information. I'll try to find another way to implement it. Test Plan: make check Reviewers: sdong, haobo, yhchiang, ljin, dhruba Reviewed By: ljin Subscribers: leveldb Differential Revision: https://reviews.facebook.net/D19161	11 years ago
sdong	cadc1adffa	Refactor: group metadata needed to open an SST file to a separate copyable struct Summary: We added multiple fields to FileMetaData recently and are planning to add more. This refactoring separate the minimum information for accessing the file. This object is copyable (FileMetaData is not copyable since the ref counter). I hope this refactoring can enable further improvements: (1) use it to design a more efficient data structure to speed up read queries. (2) in the future, when we add information of storage level, we can easily do the encoding, instead of enlarge this structure, which might expand memory work set for file meta data. The definition is same as current EncodedFileMetaData used in two level iterator, so now the logic in two level iterator is easier to understand. Test Plan: make all check Reviewers: haobo, igor, ljin Reviewed By: ljin Subscribers: leveldb, dhruba, yhchiang Differential Revision: https://reviews.facebook.net/D18933	11 years ago
sdong	fa430bfd04	Minimize accessing multiple objects in Version::Get() Summary: One of our profilings shows that Version::Get() sometimes is slow when getting pointer of user comparators or other global objects. In this patch: (1) we keep pointers of immutable objects in Version to avoid accesses them though option objects or cfd objects (2) table_reader is directly cached in FileMetaData so that table cache don't have to go through handle first to fetch it (3) If level 0 has less than 3 files, skip the filtering logic based on SST tables' key range. Smallest and largest key are stored in separated memory locations, which has potential cache misses Test Plan: make all check Reviewers: haobo, ljin Reviewed By: haobo CC: igor, yhchiang, nkg-, leveldb Differential Revision: https://reviews.facebook.net/D17739	11 years ago
Igor Canadi	577556d5f9	Don't store version number in MANIFEST Summary: Talked to <insert internal project name> folks and they found it really scary that they won't be able to roll back once they upgrade to 2.8. We should fix this. Test Plan: make check Reviewers: haobo, ljin Reviewed By: ljin CC: leveldb Differential Revision: https://reviews.facebook.net/D17343	11 years ago
Lei Jin	63cef90078	disable the log_number check in Recover() Summary: There is a chance that an old MANIFEST is corrupted in 2.7 but just not noticed. This check would fail them. Change it to log instead of returning a Corruption status. Test Plan: make Reviewers: haobo, igor Reviewed By: igor CC: leveldb Differential Revision: https://reviews.facebook.net/D16923	11 years ago
Igor Canadi	9625acbf70	[CF] Dont reuse dropped column family IDs Summary: Column family IDs should be unique, even if column family is dropped. To achieve this, we save max column family in manifest. Note that the diff is still not ready. I'm only using differential to move the patch to my Mac machine. Test Plan: added a test to column_family_test Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D16581	11 years ago
Igor Canadi	8ea21a778b	[CF] Rething LogAndApply for column families Summary: I though I might get away with as little changes to LogAndApply() as possible. It turns out this is not the case. This diff introduces different behavior of LogAndApply() for three cases: 1. column family add 2. column family drop 3. no-column family manipulation (1) and (2) don't support group commit yet. There were a lot of problems with old version od LogAndApply, detected by db_stress. The biggest was non-atomicity of manifest writes and metadata changes (i.e. if column family add is in manifest, it also has to be in in-memory data structure). Test Plan: db_stress Reviewers: dhruba, haobo CC: leveldb Differential Revision: https://reviews.facebook.net/D16491	11 years ago
Igor Canadi	514e42c7cc	Fix some lint warnings	11 years ago
Igor Canadi	6d6fb70960	Remove compaction pointers Summary: The only thing we do with compaction pointers is set them to some values, we never actually read them. I don't know what we used them for, but it doesn't look like we use them anymore. Test Plan: make check Reviewers: dhruba, haobo, kailiu, sdong Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15225	11 years ago
Igor Canadi	055e6df45b	VersionEdit not to take NumLevels() Summary: I will submit a sequence of diffs that are preparing master branch for column families. There are a lot of implicit assumptions in the code that are making column family implementation hard. If I make the change only in column family branch, it will make merging back to master impossible. Most of the diffs will be simple code refactorings, so I hope we can have fast turnaround time. Feel free to grab me in person to discuss any of them. This diff removes number of level check from VersionEdit. It is used only when VersionEdit is read, not written, but has to be set when it is written. I believe it is a right thing to make VersionEdit dumb and check consistency on the caller side. This will also make it much easier to implement Column Families, since different column families can have different number of levels. Test Plan: make check Reviewers: dhruba, haobo, sdong, kailiu Reviewed By: kailiu CC: leveldb Differential Revision: https://reviews.facebook.net/D15159	11 years ago
Siying Dong	aa0ef6602d	[Performance Branch] If options.max_open_files set to be -1, cache table readers in FileMetadata for Get() and NewIterator() Summary: In some use cases, table readers for all live files should always be cached. In that case, there will be an opportunity to avoid the table cache look-up while Get() and NewIterator(). We define options.max_open_files = -1 to be the mode that table readers for live files will always be kept. In that mode, table readers are cached in FileMetaData (with a reference count hold in table cache). So that when executing table_cache.Get() and table_cache.newInterator(), LRU cache checking can be by-passed, to reduce latency. Test Plan: add a test case in db_test Reviewers: haobo, kailiu Reviewed By: haobo CC: dhruba, igor, leveldb Differential Revision: https://reviews.facebook.net/D15039	11 years ago

1 2

78 Commits (b9f590065872db9b818874ba4bf4402ddd476cc3)