rocksdb

Commit Graph

Author	SHA1	Message	Date
Siying Dong	5e298f865b	Add two more StatsLevel (#5027 ) Summary: Statistics cost too much CPU for some use cases. Add two stats levels so that people can choose to skip two types of expensive stats, timers and histograms. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5027 Differential Revision: D14252765 Pulled By: siying fbshipit-source-id: 75ecec9eaa44c06118229df4f80c366115346592	7 years ago
Maysam Yabandeh	a661c0d208	WritePrepared: optimize read path by avoiding virtual (#5018 ) Summary: The read path includes a callback function, ReadCallback, which would eventually calls IsInSnapshot to figure if a particular seq is in the reading snapshot or not. This callback is virtual, which adds the cost of multiple virtual function call to each read. The first few checks in IsInSnapshot, however, are quite trivial and take care of majority of the cases. The patch moves those to a non-virtual function in the the parent class, ReadCallback, to lower the virtual callback cost. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5018 Differential Revision: D14226562 Pulled By: maysamyabandeh fbshipit-source-id: 6feed5b34f3b082e52092c5ef143e29b49c46b44	7 years ago
Zhongyi Xie	c4f5d0aa15	add GetStatsHistory to retrieve stats snapshots (#4748 ) Summary: This PR adds public `GetStatsHistory` API to retrieve stats history in the form of an std map. The key of the map is the timestamp in microseconds when the stats snapshot is taken, the value is another std map from stats name to stats value (stored in std string). Two DBOptions are introduced: `stats_persist_period_sec` (default 10 minutes) controls the intervals between two snapshots are taken; `max_stats_history_count` (default 10) controls the max number of history snapshots to keep in memory. RocksDB will stop collecting stats snapshots if `stats_persist_period_sec` is set to 0. (This PR is the in-memory part of https://github.com/facebook/rocksdb/pull/4535) Pull Request resolved: https://github.com/facebook/rocksdb/pull/4748 Differential Revision: D13961471 Pulled By: miasantreble fbshipit-source-id: ac836d401ecb84ea92216bf9966f969dedf4ad04	7 years ago
Siying Dong	93f7e7a450	Temporarily Disable DBTest2.PresetCompressionDict (#5003 ) Summary: DBTest2.PresetCompressionDict is flaky. Temparily disable it for now. Pull Request resolved: https://github.com/facebook/rocksdb/pull/5003 Differential Revision: D14139505 Pulled By: siying fbshipit-source-id: ebf1872d364b76b2cb021b489ea2f17ee997116a	7 years ago
Michael Liu	3c5d1b16b1	Apply modernize-use-override (3) Summary: Use C++11’s override and remove virtual where applicable. Change are automatically generated. bypass-lint drop-conflicts Reviewed By: igorsugak Differential Revision: D14131816 fbshipit-source-id: f20e7f7cecf2e699d70f5fa036f72c0e3f59b50e	7 years ago
Zhongyi Xie	ed995c6a69	add whole key bloom filter support in memtables (#4985 ) Summary: MyRocks calls `GetForUpdate` on `INSERT`, for unique key check, and in almost all cases GetForUpdate returns empty result. For such cases, whole key bloom filter is helpful. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4985 Differential Revision: D14118257 Pulled By: miasantreble fbshipit-source-id: d35cb7109c62fd5ad541a26968e3a3e16d3e85ea	7 years ago
Siying Dong	c2affccc18	Header logger should call LogHeader() (#4980 ) Summary: The info log header feature never worked well, because log level Header was not translated to Logger::LogHeader() call. Fix it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4980 Differential Revision: D14087283 Pulled By: siying fbshipit-source-id: 7e7d03ce35fa8d13d4ee549f46f7326f7bc0006d	7 years ago
Siying Dong	26a33ee5bd	flush_job logs data size too (#4979 ) Summary: Right now when a flush is triggered, the memory consumption is logged but data size is not. It's useful to log both when we debug unexpected small flushed file size. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4979 Differential Revision: D14071979 Pulled By: siying fbshipit-source-id: 0cd60449c5205eb00e0fbc299084418f609904ed	7 years ago
Aubin Sanyal	3231a2e581	Deprecate ttl option from CompactionOptionsFIFO (#4965 ) Summary: We introduced ttl option in CompactionOptionsFIFO when ttl-based file deletion (compaction) was supported only as part of FIFO Compaction. But with the extension of ttl semantics even to Level compaction, CompactionOptionsFIFO.ttl can now be deprecated. Instead we will start using ColumnFamilyOptions.ttl for FIFO compaction as well. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4965 Differential Revision: D14072960 Pulled By: sagar0 fbshipit-source-id: c98cc2ae695a28136295787cd88d36a220fc219e	7 years ago
Michael Liu	ca89ac2ba9	Apply modernize-use-override (2nd iteration) Summary: Use C++11’s override and remove virtual where applicable. Change are automatically generated. Reviewed By: Orvid Differential Revision: D14090024 fbshipit-source-id: 1e9432e87d2657e1ff0028e15370a85d1739ba2a	7 years ago
Andrew Kryczka	c8c8104d7e	Dictionary compression for files written by SstFileWriter (#4978 ) Summary: If `CompressionOptions::max_dict_bytes` and/or `CompressionOptions::zstd_max_train_bytes` are set, `SstFileWriter` will now generate files respecting those options. I refactored the logic a bit for deciding when to use dictionary compression. Previously we plumbed `is_bottommost_level` down to the table builder and used that. However it was kind of confusing in `SstFileWriter`'s context since we don't know what level the file will be ingested to. Instead, now the higher-level callers (e.g., flush, compaction, file writer) are responsible for building the right `CompressionOptions` to give the table builder. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4978 Differential Revision: D14060763 Pulled By: ajkr fbshipit-source-id: dc802c327896df2b319dc162d6acc82b9cdb452a	7 years ago
Yanqin Jin	4fc442029a	Avoid using kInAtomicGroup tag for single-cf op (#4981 ) Summary: if an operation just involves a single column family, then we do not have to set the kInAtomicGroup tag when writing to MANIFEST. This change can fix a compatibility test failure, i.e. 5.15 and earlier cannot recognize kInAtomicGroup tag. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4981 Differential Revision: D14072687 Pulled By: riversand963 fbshipit-source-id: 46b0c61e399f16c6b7169de0b33430d0ed90d6d4	7 years ago
Yanqin Jin	a69d4deefb	Atomic ingest (#4895 ) Summary: Make file ingestion atomic. as title. Ingesting external SST files into multiple column families should be atomic. If a crash occurs and db reopens, either all column families have successfully ingested the files before the crash, or non of the ingestions have any effect on the state of the db. Also add unit tests for atomic ingestion. Note that the unit test here does not cover the case of incomplete atomic group in the MANIFEST, which is covered in VersionSetTest already. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4895 Differential Revision: D13718245 Pulled By: riversand963 fbshipit-source-id: 7df97cc483af73ad44dd6993008f99b083852198	7 years ago
Siying Dong	49ddd7ec4f	Stats should be logged in INFO level (#4977 ) Summary: Previously, stats were logged in warning level. This was done in that way because people reported that it wasn't logged in MyRocks. However, later we learned that it turns out to be due to a bug in MyRocks, which is fixed in `79bb705e74` Now we revert the stats logging to INFO level, so that it doesn't pollute the warning level logging. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4977 Differential Revision: D14058485 Pulled By: siying fbshipit-source-id: 19fab323c19d9bc88184287f209551f9a77ca0e6	7 years ago
Yanqin Jin	c5a64cffd2	Avoid fsync on the same directory in atomic flush (#4817 ) Summary: In `DBImpl::AtomicFlushMemTablesToOutputFiles`, we need to call fsync only once on the same data directory. If two column families share a common directory for their data, we call fsync only once. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4817 Differential Revision: D13543689 Pulled By: riversand963 fbshipit-source-id: 4701d77c96a47802fbf6cb9f3337ee65d46b95f5	7 years ago
Andrew Kryczka	62f70f6d14	Reduce scope of compression dictionary to single SST (#4952 ) Summary: Our previous approach was to train one compression dictionary per compaction, using the first output SST to train a dictionary, and then applying it on subsequent SSTs in the same compaction. While this was great for minimizing CPU/memory/I/O overhead, it did not achieve good compression ratios in practice. In our most promising potential use case, moderate reductions in a dictionary's scope make a major difference on compression ratio. So, this PR changes compression dictionary to be scoped per-SST. It accepts the tradeoff during table building to use more memory and CPU. Important changes include: - The `BlockBasedTableBuilder` has a new state when dictionary compression is in-use: `kBuffered`. In that state it accumulates uncompressed data in-memory whenever `Add` is called. - After accumulating target file size bytes or calling `BlockBasedTableBuilder::Finish`, a `BlockBasedTableBuilder` moves to the `kUnbuffered` state. The transition (`EnterUnbuffered()`) involves sampling the buffered data, training a dictionary, and compressing/writing out all buffered data. In the `kUnbuffered` state, a `BlockBasedTableBuilder` behaves the same as before -- blocks are compressed/written out as soon as they fill up. - Samples are now whole uncompressed data blocks, except the final sample may be a partial data block so we don't breach the user's configured `max_dict_bytes` or `zstd_max_train_bytes`. The dictionary trainer is supposed to work better when we pass it real units of compression. Previously we were passing 64-byte KV samples which was not realistic. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4952 Differential Revision: D13967980 Pulled By: ajkr fbshipit-source-id: 82bea6f7537e1529c7a1a4cdee84585f5949300f	7 years ago
Maysam Yabandeh	576d2d6c60	WritePrepared: relax assert in compaction iterator (#4969 ) Summary: If IsInSnapshot(seq2, snapshot) determines that the snapshot is released, the future queries IsInSnapshot(seq1, snapshot) could still return a definitive answer of true if for example seq1 is too old that is determined visible in all snapshots. This violates a recently added assert statement to compaction iterator. The patch relaxes the assert. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4969 Differential Revision: D14030998 Pulled By: maysamyabandeh fbshipit-source-id: 6db53db0e37d0a20e8997ef2c1004b8627614ab9	7 years ago
Yanqin Jin	2d049ab7e8	Checksum properties block for block-based table (#4956 ) Summary: Always enable properties block checksum verification for block-based table. For external SST file ingested with 'write_global_seqno==true', we use 'DecodeEntrySlow' to parse its blocks' contents so that the process will not die upon failing the assertion possibly caused by corruption. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4956 Differential Revision: D14012741 Pulled By: riversand963 fbshipit-source-id: 8b766e6f54b36f8f9e074c0e19e0926ec3cce186	7 years ago
Siying Dong	5d9a623e2c	Add a unit test to Ignorable manfiest record (#4964 ) Summary: https://github.com/facebook/rocksdb/pull/4960 introduced ignorable manfiest record. Adding a test to it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4964 Differential Revision: D14012667 Pulled By: siying fbshipit-source-id: e5f10ecc68dec2716e178d44f0fe2b76c3d857ef	7 years ago
tang-jianfeng	08809f5e6c	Implement trace sampling (#4963 ) Summary: Implement trace sampling to allow user to specify the sampling frequency, i.e. save one per how many requests, so that a user does not need to log all if he/she is interested in only a sampled set. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4963 Differential Revision: D14011190 Pulled By: tang-jianfeng fbshipit-source-id: 078b631d9319b67cb089dd2c30e21d0df8dc406a	7 years ago
Siying Dong	1a761e6a6c	Add a placeholder in manifest indicating ignorable record (#4960 ) Summary: We want to reserve some right that some extra information added manifest in the future can be forward compatible by previous versions. Now we create a place holder for that. A bit in tag is added to indicate that a field can be safely ignored. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4960 Differential Revision: D14000484 Pulled By: siying fbshipit-source-id: cbf5bad3f9d5ec798f789806f244d1c20d3b66d6	7 years ago
Siying Dong	f48758e939	Deprecate CompactionFilter::IgnoreSnapshots() = false (#4954 ) Summary: We found that the behavior of CompactionFilter::IgnoreSnapshots() = false isn't what we have expected. We thought that snapshot will always be preserved. However, we just realized that, if no snapshot is created while compaction starts, and a snapshot is created after that, the data seen from the snapshot can successfully be dropped by the compaction. This creates a strange behavior to the feature, which is hard to explain. Like what is documented in code comment, this feature is not very useful with snapshot anyway. The decision is to deprecate the feature. We keep the function to avoid to break users code. However, we will fail compactions if false is returned. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4954 Differential Revision: D13981900 Pulled By: siying fbshipit-source-id: 2db8c2c3865acd86a28dca625945d1481b1d1e36	7 years ago
Siying Dong	cf3a671733	Remove cuckoo hash memtable (#4953 ) Summary: Cuckoo Hash is less useful than we initially expected. Remove it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4953 Differential Revision: D13979264 Pulled By: siying fbshipit-source-id: 2a60afdaa989f045357398b43a1cc5d46f4492ed	7 years ago
Zhongyi Xie	71cae59a99	exclude test CompactFilesShouldTriggerAutoCompaction from ROCKSDB_LITE (#4950 ) Summary: This will fix the following build error: > db/db_test.cc: In member function ‘virtual void rocksdb::DBTest_CompactFilesShouldTriggerAutoCompaction_Test::TestBody()’: > db/db_test.cc:5462:8: error: ‘class rocksdb::DB’ has no member named ‘GetColumnFamilyMetaData’ > db_->GetColumnFamilyMetaData(db_->DefaultColumnFamily(), &cf_meta_data); > db/db_test.cc:5490:8: error: ‘class rocksdb::DB’ has no member named ‘GetColumnFamilyMetaData’ > db_->GetColumnFamilyMetaData(db_->DefaultColumnFamily(), &cf_meta_data); > db/db_test.cc:5499:8: error: ‘class rocksdb::DB’ has no member named ‘GetColumnFamilyMetaData’ > db_->GetColumnFamilyMetaData(db_->DefaultColumnFamily(), &cf_meta_data); Pull Request resolved: https://github.com/facebook/rocksdb/pull/4950 Differential Revision: D13965378 Pulled By: miasantreble fbshipit-source-id: a975435476fe555b1cd9d5da263ee3da3acdea56	7 years ago
Zhongyi Xie	00ed41daee	Allow copy for PerfContext objects (#4919 ) Summary: Existing implementation of PerfContext does not define copy constructor or assignment operator, which could potentially cause problems when user create copies and resets the builtin one. This PR address the issue by providing these two constructors with deep copy semantics. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4919 Differential Revision: D13960406 Pulled By: miasantreble fbshipit-source-id: 36aab5aaee65d4480f537e4e22148faa45e8e334	7 years ago
Jay Zhuang	c9a52cbdc8	Fix potential DB hang while using CompactFiles (#4940 ) Summary: CompactFiles() may block auto compaction which could cuase DB hang when it reachs level0_stop_writes_trigger. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4940 Differential Revision: D13929648 Pulled By: cooldoger fbshipit-source-id: 10842df38df3bebf862cd1a120a88ce961fdd381	7 years ago
Siying Dong	8fe073324f	BYTES_READ stats miscount for NotFound cases (#4938 ) Summary: In NotFound cases, stats BYTES_READ and perf_context.get_read_bytes is still be increased. The amount increased will be whatever size of the string or PinnableSlice that users passed in as the output data structure. This is wrong. Fix this by not increasing these two counters. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4938 Differential Revision: D13908963 Pulled By: siying fbshipit-source-id: 60bce42e4fbb9862bba3da36dbc27b2963ea6162	7 years ago
yangzhijia	31221bb7e8	Properly set upper bound of subcompaction output (#4879 ) (#4898 ) Summary: Fix the ouput overlap bug when using subcompactions, the upper bound of output file was extended incorrectly. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4898 Differential Revision: D13736107 Pulled By: ajkr fbshipit-source-id: 21dca09f81d5f07bf2766bf566f9b50dcab7d8e3	7 years ago
Maysam Yabandeh	30468d8eb4	Fix analyze error on possible un-initialized value (#4937 ) Summary: The patch fixes the following analyze error by checking the return status of ParseInternalKey. ``` db/merge_helper.cc:306:23: warning: The right operand of '==' is a garbage value assert(kTypeMerge == orig_ikey.type); ``` Pull Request resolved: https://github.com/facebook/rocksdb/pull/4937 Differential Revision: D13908506 Pulled By: maysamyabandeh fbshipit-source-id: 68d7771e75519da3d4bd807fd231675ec12093f6	7 years ago
Ming Zhao	59244447e3	Zero seqnum of final key / drop final tombstone when compacting to bottommost level Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4927 Differential Revision: D13889458 Pulled By: mzhaom fbshipit-source-id: d6b66db85901a9eb90748fba6a9dc4e7457b9c5e	7 years ago
Yanqin Jin	842cdc11dd	Use correct FileMeta for atomic flush result install (#4932 ) Summary: 1. this commit fixes our handling of a combination of two separate edge cases. If a flush job does not pick any memtable to flush (because another flush job has already picked the same memtables), and the column family assigned to the flush job is dropped right before RocksDB calls rocksdb::InstallMemtableAtomicFlushResults, our original code passes a FileMetaData object whose file number is 0, failing the assertion in rocksdb::InstallMemtableAtomicFlushResults (assert(m->GetFileNumber() > 0)). 2. Also piggyback a small change: since we already create a local copy of column family's mutable CF options to eliminate potential race condition with `SetOptions` call, we might as well use the local copy in other function calls in the same scope. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4932 Differential Revision: D13901322 Pulled By: riversand963 fbshipit-source-id: b936580af7c127ea0c6c19ea10cd5fcede9fb0f9	7 years ago
Maysam Yabandeh	35e5689e11	Take snapshots once for all cf flushes (#4934 ) Summary: FlushMemTablesToOutputFiles calls FlushMemTableToOutputFile for each column family. The patch moves the take-snapshot logic to outside FlushMemTableToOutputFile so that it does it once for all the flushes. This also addresses a deadlock issue for resetting the managed snapshot of job_snapshot in the 2nd call to FlushMemTableToOutputFile. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4934 Differential Revision: D13900747 Pulled By: maysamyabandeh fbshipit-source-id: f3cd650c5fff24cf95c1aaf8a10c149d42bf042c	7 years ago
Alexander Zinoviev	32a6dd9a41	Add a new CPU time counter to compaction report (#4889 ) Summary: Measure CPU time consumed for a compaction and report it in the stats report Enable NowCPUNanos() to work for MacOS Pull Request resolved: https://github.com/facebook/rocksdb/pull/4889 Differential Revision: D13701276 Pulled By: zinoale fbshipit-source-id: 5024e5bbccd4dd10fd90d947870237f436445055	7 years ago
Yanqin Jin	158da7a6ee	Verify checksum before ingestion (#4916 ) Summary: before file ingestion (in preparation phase), verify the checksums of the blocks of the external SST file, including properties block with global seqno. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4916 Differential Revision: D13863501 Pulled By: riversand963 fbshipit-source-id: dc54697f970e3807832e2460f7228fcc7efe81ee	7 years ago
Sagar Vemuri	4978caaa6f	Remove a redundant call to TableFileName in CompactionJob::FinishCompactionOutputFile (#4925 ) Summary: While stepping through the code I noticed that there is a redundant call to TableFileName. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4925 Differential Revision: D13845749 Pulled By: sagar0 fbshipit-source-id: 31db45716b4d720e0e0350dd457b49d6f1848e7d	7 years ago
Siying Dong	ee1818081f	Remove PlainTable's feature store_index_in_file (#4914 ) Summary: Store_index_in_file is a less useful feature. To simplify the code to maintain, we are dropping the feature. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4914 Differential Revision: D13791883 Pulled By: siying fbshipit-source-id: d187c5d662584866103e4b77d09dfb925509ae2e	7 years ago
Siying Dong	bc7d1661a8	Fix test name typo in PlainTableDBTest Summary: Pull Request resolved: https://github.com/facebook/rocksdb/pull/4926 Differential Revision: D13830196 Pulled By: siying fbshipit-source-id: e06bf2a6cd273b5eb18dfd82bdd35ffce197d021	7 years ago
Siying Dong	f184bee77b	PlainTable should avoid copying Get() results from immortal source. (#4924 ) Summary: https://github.com/facebook/rocksdb/pull/4053 avoids memcopy for Get() results if files are immortable (read-only DB, max_open_files=-1) and the file is ammaped. The same optimization is being applied to PlainTable here. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4924 Differential Revision: D13827749 Pulled By: siying fbshipit-source-id: 1f2cbfc530b40ce08ccd53f95f6e78de4d1c2f96	7 years ago
Siying Dong	fc53839bfa	Disallow customized hash function in DynamicBloom (#4915 ) Summary: I didn't find where customized hash function is used in DynamicBloom. This can only reduce performance. Remove it. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4915 Differential Revision: D13794452 Pulled By: siying fbshipit-source-id: e38669b11e01444d2d782da11c7decabbd851819	7 years ago
Dmitry Fink	e07aa8669d	Allow full merge when root of history for a key is reached (#4909 ) Summary: Previously compaction was not collapsing operands for a first key on a layer, even in cases when it was its root of history. Some tests (CompactionJobTest.NonAssocMerge) was actually accounting for that bug, Pull Request resolved: https://github.com/facebook/rocksdb/pull/4909 Differential Revision: D13781169 Pulled By: finik fbshipit-source-id: d2de353ecf05bec39b942cd8d5b97a8dc445f336	7 years ago
Andrew Kryczka	8ec3e72551	Cache dictionary used for decompressing data blocks (#4881 ) Summary: - If block cache disabled or not used for meta-blocks, `BlockBasedTableReader::Rep::uncompression_dict` owns the `UncompressionDict`. It is preloaded during `PrefetchIndexAndFilterBlocks`. - If block cache is enabled and used for meta-blocks, block cache owns the `UncompressionDict`, which holds dictionary and digested dictionary when needed. It is never prefetched though there is a TODO for this in the code. The cache key is simply the compression dictionary block handle. - New stats for compression dictionary accesses in block cache: "BLOCK_CACHE_COMPRESSION_DICT_*" and "compression_dict_block_read_count" Pull Request resolved: https://github.com/facebook/rocksdb/pull/4881 Differential Revision: D13663801 Pulled By: ajkr fbshipit-source-id: bdcc54044e180855cdcc57639b493b0e016c9a3f	7 years ago
PeifengSi	43defe9872	Correct the code comment in Compaction::KeyNotExistsBeyondOutputLevel (#4902 ) Summary: Even one key falls in a file's range, we can not infer it definitely exists in this file. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4902 Differential Revision: D13795018 Pulled By: siying fbshipit-source-id: 590956f727e9440fcdee55ad9541ace934c64914	7 years ago
Siying Dong	d94aa2f7db	Make compaction_pri = kMinOverlappingRatio to be default (#4911 ) Summary: compaction_pri = kMinOverlappingRatio usually provides much better write amplification than the default. https://github.com/facebook/rocksdb/pull/4907 fixes one shortcome of this option. Make it default. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4911 Differential Revision: D13789262 Pulled By: siying fbshipit-source-id: d90acf8c4dede44f00d183ca4c7a210259378269	7 years ago
Siying Dong	5bf941966b	CompactionPri = kMinOverlappingRatio also uses compensated file size (#4907 ) Summary: Right now, CompactionPri = kMinOverlappingRatio provides best write amplification, but it doesn't prioritize files with more tombstones. We combine the two good features: make kMinOverlappingRatio to boost files with lots of tombstones too. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4907 Differential Revision: D13788774 Pulled By: siying fbshipit-source-id: 1991cbb495fb76c8b529de69896e38d81ed9d9b3	7 years ago
Andrew Kryczka	01013ae766	Digest ZSTD compression dictionary once when writing SST file (#4849 ) Summary: This is essentially a re-submission of #4251 with a few improvements: - Split `CompressionDict` into two separate classes: `CompressionDict` and `UncompressionDict` - Eliminated `Init` functions. Instead do all initialization work in constructors. - Added test case for parallel DB open, which is the scenario where #4251 failed under TSAN. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4849 Differential Revision: D13606039 Pulled By: ajkr fbshipit-source-id: 08c236059798c710db9cbf545fce0f371232d447	7 years ago
Yi Wu	b1ad6ebba8	WritePrepared: fix two versions in compaction see different status for released snapshots (#4890 ) Summary: Fix how CompactionIterator::findEarliestVisibleSnapshots handles released snapshot. It fixing the two scenarios: Scenario 1: key1 has two values v1 and v2. There're two snapshots s1 and s2 taken after v1 and v2 are committed. Right after compaction output v2, s1 is released. Now findEarliestVisibleSnapshot may see s1 being released, and return the next snapshot, which is s2. That's larger than v2's earliest visible snapshot, which was s1. The fix: the only place we check against last snapshot and current key snapshot is when we decide whether to compact out a value if it is hidden by a later value. In the check if we see current snapshot is even larger than last snapshot, we know last snapshot is released, and we are safe to compact out current key. Scenario 2: key1 has two values v1 and v2. there are two snapshots s1 and s2 taken after v1 and v2 are committed. During compaction before we process the key, s1 is released. When compaction process v2, snapshot checker may return kSnapshotReleased, and the earliest visible snapshot for v2 become s2. When compaction process v1, snapshot checker may return kIsInSnapshot (for WritePrepared transaction, it could be because v1 is still in commit cache). The result will become inconsistent here. The fix: remember the set of released snapshots ever reported by snapshot checker, and ignore them when finding result for findEarliestVisibleSnapshot. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4890 Differential Revision: D13705538 Pulled By: maysamyabandeh fbshipit-source-id: e577f0d9ee1ff5a6035f26859e56902ecc85a5a4	7 years ago
Yi Wu	128f532858	WritePrepared: fix issue with snapshot released during compaction (#4858 ) Summary: Compaction iterator keep a copy of list of live snapshots at the beginning of compaction, and then query snapshot checker to verify if values of a sequence number is visible to these snapshots. However when the snapshot is released in the middle of compaction, the snapshot checker implementation (i.e. WritePreparedSnapshotChecker) may remove info with the snapshot and may report incorrect result, which lead to values being compacted out when it shouldn't. This patch conservatively keep the values if snapshot checker determines that the snapshots is released. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4858 Differential Revision: D13617146 Pulled By: maysamyabandeh fbshipit-source-id: cf18a94f6f61a94bcff73c280f117b224af5fbc3	7 years ago
Yanqin Jin	e79df377c5	Use chrono::time_point instead of time_t (#4868 ) Summary: By convention, time_t almost always stores the integral number of seconds since 00:00 hours, Jan 1, 1970 UTC, according to http://www.cplusplus.com/reference/ctime/time_t/. We surely want more precision than seconds. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4868 Differential Revision: D13633046 Pulled By: riversand963 fbshipit-source-id: 4e01e23a22e8838023c51a91247a286dbf3a5396	7 years ago
Yi Wu	5d4fddfa52	WritePrepared: Fix visible key compacted out by compaction (#4883 ) Summary: With WritePrepared transaction, flush/compaction can contain uncommitted keys, and those keys can get committed during compaction. If a snapshot is taken before the key is committed, it should not see the key. On the other hand, compaction grab the list of snapshots at its beginning, and only consider those snapshots to dedup keys. Consider the case: ``` seq = 1: put "foo" = "bar" seq = 2: transaction T: delete "foo", prepare seq = 3: compaction start seq = 4: take snapshot S seq = 5: transaction T: commit. ... seq = N: compaction iterator reached key "foo". ``` When compaction start, the list of snapshot is empty. Compaction doesn't take snapshot S into account. When it reached "foo", transaction T is committed. Compaction may think the value "foo=bar" is not visible by any snapshot (which is wrong), and compact the value out. The fix is to explicitly take a snapshot before compaction grabbing the list of snapshots. Compaction will then has to keep keys visible to this snapshot. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4883 Differential Revision: D13668775 Pulled By: maysamyabandeh fbshipit-source-id: 1cab9615f94b7d3e8522cc3d44c3a14c7d4720e4	7 years ago
Maysam Yabandeh	cad99a6031	WritePrepared: snapshot should be larger than max_evicted_seq_ (#4886 ) Summary: The AdvanceMaxEvictedSeq algorithm assumes that new snapshots always have sequence number larger than the last max_evicted_seq_. To enforce this assumption we make two changes: i) max is not advanced beyond the last published seq, with the exception that the evicted commit entry itself is not published yet, which is quite rare. ii) When obtaining the snapshot if the max_evicted_seq_ is not published yet, commit a dummy entry so that it waits for it to be published and also increased the latest published seq by one above the max. To test these non-realistic corner cases we create a commit cache with size 1 so that every single commit results into eviction. Pull Request resolved: https://github.com/facebook/rocksdb/pull/4886 Differential Revision: D13685270 Pulled By: maysamyabandeh fbshipit-source-id: 5461bc09c2a9b75798bfcb9853a256c81cdac0b0	7 years ago

... 2 3 4 5 6 ...

3638 Commits (5dc9fbd1175ad10454b877d9044c4b909d00ae3b)