Summary:
In include/rocksdb/db.h, enum EntryType is redeclared even though
original declaration in types.h in included.
Closes https://github.com/facebook/rocksdb/pull/3766
Differential Revision: D7765504
Pulled By: anand1976
fbshipit-source-id: 622a8ecb306993915be1b9dd5cdd79dbc6a4ea05
Summary:
In this change, an option to set different paths for different column families is added.
This option is set via cf_paths setting of ColumnFamilyOptions. This option will work in a similar fashion to db_paths setting. Cf_paths is a vector of Dbpath values which contains a pair of the absolute path and target size. Multiple levels in a Column family can go to different paths if cf_paths has more than one path.
To maintain backward compatibility, if cf_paths is not specified for a column family, db_paths setting will be used. Note that, if db_paths setting is also not specified, RocksDB already has code to use db_name as the only path.
Changes :
1) A new member "cf_paths" is added to ImmutableCfOptions. This is set, based on cf_paths setting of ColumnFamilyOptions and db_paths setting of ImmutableDbOptions. This member is used to identify the path information whenever files are accessed.
2) Validation checks are added for cf_paths setting based on existing checks for db_paths setting.
3) DestroyDB, PurgeObsoleteFiles etc. are edited to support multiple cf_paths.
4) Unit tests are added appropriately.
Closes https://github.com/facebook/rocksdb/pull/3102
Differential Revision: D6951697
Pulled By: ajkr
fbshipit-source-id: 60d2262862b0a8fd6605b09ccb0da32bb331787d
Summary:
This is a small API extension to allow the CompactFiles method to return the names of files that were created during the compaction.
Closes https://github.com/facebook/rocksdb/pull/3608
Differential Revision: D7275789
Pulled By: siying
fbshipit-source-id: 1ec0c3954a0f10cd877efb5f29f9be6c7b59e9ba
Summary:
... so that people are not confused.
Closes https://github.com/facebook/rocksdb/pull/3580
Differential Revision: D7187175
Pulled By: sagar0
fbshipit-source-id: bce70093d52e38cd24c9432fd708885d7c2c013e
Summary:
Add "rocksdb.live-sst-files-size" DB property which only include files of latest version. Existing "rocksdb.total-sst-files-size" include files from all versions and thus include files that's obsolete but not yet deleted. I'm going to use this new property to cap blob db sst + blob files size.
Closes https://github.com/facebook/rocksdb/pull/3548
Differential Revision: D7116939
Pulled By: yiwu-arbug
fbshipit-source-id: c6a52e45ce0f24ef78708156e1a923c1dd6bc79a
Summary:
CompactRange has a call to Flush because we guarantee that, at the time it's called, all existing keys in the range will be pushed through the user's compaction filter. However, previously the flush was done blindly, so it'd happen even if the memtable does not contain keys in the range specified by the user. This caused unnecessarily many L0 files to be created, leading to write stalls in some cases. This PR checks the memtable's contents, and decides to flush only if it overlaps with `CompactRange`'s range.
- Move the memtable overlap check logic from `ExternalSstFileIngestionJob` to `ColumnFamilyData::RangesOverlapWithMemtables`
- Reuse the above logic in `CompactRange` and skip flushing if no overlap
Closes https://github.com/facebook/rocksdb/pull/3520
Differential Revision: D7018897
Pulled By: ajkr
fbshipit-source-id: a3c6b1cfae56687b49dd89ccac7c948e53545934
Summary:
The recent Logger::Close() and DBImpl::Close() implementation rely on
calling the CloseImpl() virtual function from the destructor, which will
not work. Refactor the implementation to have a private close helper
function in derived classes that can be called by both CloseImpl() and
the destructor.
Closes https://github.com/facebook/rocksdb/pull/3528
Reviewed By: gfosco
Differential Revision: D7049303
Pulled By: anand1976
fbshipit-source-id: 76a64cbf403209216dfe4864ecf96b5d7f3db9f4
Summary:
There are a couple of places where we swallow any error from
WriteBuffer() - in SwitchMemtable() and DBImpl::CloseImpl(). Propagate
the error up in those cases rather than ignoring it.
Closes https://github.com/facebook/rocksdb/pull/3404
Differential Revision: D6879954
Pulled By: anand1976
fbshipit-source-id: 2ef88b554be5286b0a8bad7384ba17a105395bdb
Summary:
Using `DeleteFilesInRange` to delete files in a lot of ranges can be slow, because
`VersionSet::LogAndApply` is expensive.
This PR adds a new `DeleteFilesInRange` function to delete files in multiple
ranges at once.
Close https://github.com/facebook/rocksdb/issues/2951
Closes https://github.com/facebook/rocksdb/pull/3431
Differential Revision: D6849228
Pulled By: ajkr
fbshipit-source-id: daeedcabd8def4b1d9ee95a58266dee77b5d68cb
Summary:
Currently, the only way to close an open DB is to destroy the DB
object. There is no way for the caller to know the status. In one
instance, the destructor encountered an error due to failure to
close a log file on HDFS. In order to prevent silent failures, we add
DB::Close() that calls CloseImpl() which must be implemented by its
descendants.
The main failure point in the destructor is closing the log file. This
patch also adds a Close() entry point to Logger in order to get status.
When DBOptions::info_log is allocated and owned by the DBImpl, it is
explicitly closed by DBImpl::CloseImpl().
Closes https://github.com/facebook/rocksdb/pull/3348
Differential Revision: D6698158
Pulled By: anand1976
fbshipit-source-id: 9468e2892553eb09c4c41b8723f590c0dbd8ab7d
Summary:
The sequence number was not properly advanced after a rollback marker. The patch extends the existing unit tests to detect the bug and also fixes it.
Closes https://github.com/facebook/rocksdb/pull/3157
Differential Revision: D6304291
Pulled By: maysamyabandeh
fbshipit-source-id: 1b519c44a5371b802da49c9e32bd00087a8da401
Summary:
The motivation for this PR is to add to RocksDB support for differential (incremental) snapshots, as snapshot of the DB changes between two points in time (one can think of it as diff between to sequence numbers, or the diff D which can be thought of as an SST file or just set of KVs that can be applied to sequence number S1 to get the database to the state at sequence number S2).
This feature would be useful for various distributed storages layers built on top of RocksDB, as it should help reduce resources (time and network bandwidth) needed to recover and rebuilt DB instances as replicas in the context of distributed storages.
From the API standpoint that would like client app requesting iterator between (start seqnum) and current DB state, and reading the "diff".
This is a very draft PR for initial review in the discussion on the approach, i'm going to rework some parts and keep updating the PR.
For now, what's done here according to initial discussions:
Preserving deletes:
- We want to be able to optionally preserve recent deletes for some defined period of time, so that if a delete came in recently and might need to be included in the next incremental snapshot it would't get dropped by a compaction. This is done by adding new param to Options (preserve deletes flag) and new variable to DB Impl where we keep track of the sequence number after which we don't want to drop tombstones, even if they are otherwise eligible for deletion.
- I also added a new API call for clients to be able to advance this cutoff seqnum after which we drop deletes; i assume it's more flexible to let clients control this, since otherwise we'd need to keep some kind of timestamp < -- > seqnum mapping inside the DB, which sounds messy and painful to support. Clients could make use of it by periodically calling GetLatestSequenceNumber(), noting the timestamp, doing some calculation and figuring out by how much we need to advance the cutoff seqnum.
- Compaction codepath in compaction_iterator.cc has been modified to avoid dropping tombstones with seqnum > cutoff seqnum.
Iterator changes:
- couple params added to ReadOptions, to optionally allow client to request internal keys instead of user keys (so that client can get the latest value of a key, be it delete marker or a put), as well as min timestamp and min seqnum.
TableCache changes:
- I modified table_cache code to be able to quickly exclude SST files from iterators heep if creation_time on the file is less then iter_start_ts as passed in ReadOptions. That would help a lot in some DB settings (like reading very recent data only or using FIFO compactions), but not so much for universal compaction with more or less long iterator time span.
What's left:
- Still looking at how to best plug that inside DBIter codepath. So far it seems that FindNextUserKeyInternal only parses values as UserKeys, and iter->key() call generally returns user key. Can we add new API to DBIter as internal_key(), and modify this internal method to optionally set saved_key_ to point to the full internal key? I don't need to store actual seqnum there, but I do need to store type.
Closes https://github.com/facebook/rocksdb/pull/2999
Differential Revision: D6175602
Pulled By: mikhail-antonov
fbshipit-source-id: c779a6696ee2d574d86c69cec866a3ae095aa900
Summary:
With FIFO compaction we would like to get the oldest data time for monitoring. The problem is we don't have timestamp for each key in the DB. As an approximation, we expose the earliest of sst file "creation_time" property.
My plan is to override the property with a more accurate value with blob db, where we actually have timestamp.
Closes https://github.com/facebook/rocksdb/pull/2842
Differential Revision: D5770600
Pulled By: yiwu-arbug
fbshipit-source-id: 03833c8f10bbfbee62f8ea5c0d03c0cafb5d853a
Summary:
This branch extends existing property map which keeps values in doubles to keep values in strings so that it can be used to provide wider range of properties. The immediate need for that is to provide IO stall stats in an easy parseable way to MyRocks which is also part of this branch.
Closes https://github.com/facebook/rocksdb/pull/2794
Differential Revision: D5717676
Pulled By: Tema
fbshipit-source-id: e34ba5b79ba774697f7b97ce1138d8fd55471b8a
Summary:
We need a tool to check any sst file corruption in the db.
It will check all the sst files in current version and read all the blocks (data, meta, index) with checksum verification. If any verification fails, the function will return non-OK status.
Closes https://github.com/facebook/rocksdb/pull/2498
Differential Revision: D5324269
Pulled By: lightmark
fbshipit-source-id: 6f8a272008b722402a772acfc804524c9d1a483b
Summary:
Replace dynamic_cast<> so that users can choose to build with RTTI off, so that they can save several bytes per object, and get tiny more memory available.
Some nontrivial changes:
1. Add Comparator::GetRootComparator() to get around the internal comparator hack
2. Add the two experiemental functions to DB
3. Add TableFactory::GetOptionString() to avoid unnecessary casting to get the option string
4. Since 3 is done, move the parsing option functions for table factory to table factory files too, to be symmetric.
Closes https://github.com/facebook/rocksdb/pull/2645
Differential Revision: D5502723
Pulled By: siying
fbshipit-source-id: fd13cec5601cf68a554d87bfcf056f2ffa5fbf7c
Summary:
This reverts the previous commit 1d7048c598, which broke the build.
Did a `git revert 1d7048c`.
Closes https://github.com/facebook/rocksdb/pull/2627
Differential Revision: D5476473
Pulled By: sagar0
fbshipit-source-id: 4756ff5c0dfc88c17eceb00e02c36176de728d06
Summary: This uses `clang-tidy` to comment out unused parameters (in functions, methods and lambdas) in fbcode. Cases that the tool failed to handle are fixed manually.
Reviewed By: igorsugak
Differential Revision: D5454343
fbshipit-source-id: 5dee339b4334e25e963891b519a5aa81fbf627b2
Summary:
Throughput: 46k tps in our sysbench settings (filling the details later)
The idea is to have the simplest change that gives us a reasonable boost
in 2PC throughput.
Major design changes:
1. The WAL file internal buffer is not flushed after each write. Instead
it is flushed before critical operations (WAL copy via fs) or when
FlushWAL is called by MySQL. Flushing the WAL buffer is also protected
via mutex_.
2. Use two sequence numbers: last seq, and last seq for write. Last seq
is the last visible sequence number for reads. Last seq for write is the
next sequence number that should be used to write to WAL/memtable. This
allows to have a memtable write be in parallel to WAL writes.
3. BatchGroup is not used for writes. This means that we can have
parallel writers which changes a major assumption in the code base. To
accommodate for that i) allow only 1 WriteImpl that intends to write to
memtable via mem_mutex_--which is fine since in 2PC almost all of the memtable writes
come via group commit phase which is serial anyway, ii) make all the
parts in the code base that assumed to be the only writer (via
EnterUnbatched) to also acquire mem_mutex_, iii) stat updates are
protected via a stat_mutex_.
Note: the first commit has the approach figured out but is not clean.
Submitting the PR anyway to get the early feedback on the approach. If
we are ok with the approach I will go ahead with this updates:
0) Rebase with Yi's pipelining changes
1) Currently batching is disabled by default to make sure that it will be
consistent with all unit tests. Will make this optional via a config.
2) A couple of unit tests are disabled. They need to be updated with the
serial commit of 2PC taken into account.
3) Replacing BatchGroup with mem_mutex_ got a bit ugly as it requires
releasing mutex_ beforehand (the same way EnterUnbatched does). This
needs to be cleaned up.
Closes https://github.com/facebook/rocksdb/pull/2345
Differential Revision: D5210732
Pulled By: maysamyabandeh
fbshipit-source-id: 78653bd95a35cd1e831e555e0e57bdfd695355a4
Summary:
First cut for early review; there are few conceptual points to answer and some code structure issues.
For conceptual points -
- restriction-wise, we're going to disallow ingest_behind if (use_seqno_zero_out=true || disable_auto_compaction=false), the user is responsible to properly open and close DB with required params
- we wanted to ingest into reserved bottom most level. Should we fail fast if bottom level isn't empty, or should we attempt to ingest if file fits there key-ranges-wise?
- Modifying AssignLevelForIngestedFile seems the place we we'd handle that.
On code structure - going to refactor GenerateAndAddExternalFile call in the test class to allow passing instance of IngestionOptions, that's just going to incur lots of changes at callsites.
Closes https://github.com/facebook/rocksdb/pull/2144
Differential Revision: D4873732
Pulled By: lightmark
fbshipit-source-id: 81cb698106b68ef8797f564453651d50900e153a
Summary:
Adding DB::CreateColumnFamilie() and DB::DropColumnFamilies() to bulk create/drop column families. This is to address the problem creating/dropping 1k column families takes minutes. The bottleneck is we persist options files for every single column family create/drop, and it parses the persisted options file for verification, which take a lot CPU time.
The new APIs simply create/drop column families individually, and persist options file once at the end. This improves create 1k column families to within ~0.1s. Further improvement can be merge manifest write to one IO.
Closes https://github.com/facebook/rocksdb/pull/2248
Differential Revision: D5001578
Pulled By: yiwu-arbug
fbshipit-source-id: d4e00bda671451e0b314c13e12ad194b1704aa03
Summary:
Add a function to allow users to reset internal stats without restarting the DB.
Closes https://github.com/facebook/rocksdb/pull/2167
Differential Revision: D4907939
Pulled By: siying
fbshipit-source-id: ab2dd85b88aabe9380da7485320a1d460d3e1f68
Summary:
Currently level histogram is only printed out for DB stats and for default CF. This is confusing. Change to print for every CF instead.
Closes https://github.com/facebook/rocksdb/pull/2126
Differential Revision: D4865373
Pulled By: siying
fbshipit-source-id: 1c853e0ac66e00120ee931cabc9daf69ccc2d577
Summary:
some fbcode services override it, we need to keep it virtual.
original change: #1756
Closes https://github.com/facebook/rocksdb/pull/2065
Differential Revision: D4808123
Pulled By: ajkr
fbshipit-source-id: 5eaeea7
Summary:
Add two DB properties: rocksdb.actual_delayed_write_rate and rocksdb.is_write_stooped, for people to know whether current writes are being throttled.
Closes https://github.com/facebook/rocksdb/pull/2043
Differential Revision: D4782975
Pulled By: siying
fbshipit-source-id: 6b2f5cf
Summary:
PinnableSlice
Summary:
Currently the point lookup values are copied to a string provided by the
user. This incures an extra memcpy cost. This patch allows doing point lookup
via a PinnableSlice which pins the source memory location (instead of
copying their content) and releases them after the content is consumed
by the user. The old API of Get(string) is translated to the new API
underneath.
Here is the summary for improvements:
value 100 byte: 1.8% regular, 1.2% merge values
value 1k byte: 11.5% regular, 7.5% merge values
value 10k byte: 26% regular, 29.9% merge values
The improvement for merge could be more if we extend this approach to
pin the merge output and delay the full merge operation until the user
actually needs it. We have put that for future work.
PS:
Sometimes we observe a small decrease in performance when switching from
t5452014 to this patch but with the old Get(string) API. The d
Closes https://github.com/facebook/rocksdb/pull/1756
Differential Revision: D4391738
Pulled By: maysamyabandeh
fbshipit-source-id: 6f3edd3
Summary:
MongoRocks is still using some deprecated functions, return them temporarily
Closes https://github.com/facebook/rocksdb/pull/1892
Differential Revision: D4592451
Pulled By: IslamAbdelRahman
fbshipit-source-id: 5e6be3e
Summary:
Remove functions that we deprecated long time ago in db.h
Closes https://github.com/facebook/rocksdb/pull/1878
Differential Revision: D4576521
Pulled By: IslamAbdelRahman
fbshipit-source-id: dfddad1
Summary:
Added method that returns approx num of entries as well as size for memtables.
Closes https://github.com/facebook/rocksdb/pull/1841
Differential Revision: D4511990
Pulled By: VitaliyLi
fbshipit-source-id: 9a4576e
Summary:
- rocksdb_property_int (so that we don't have to parse strings)
- and rocksdb_set_options (to allow controlling options via strings)
- a few other missing options exposed
- a documentation comment fix
Closes https://github.com/facebook/rocksdb/pull/1793
Differential Revision: D4456569
Pulled By: yiwu-arbug
fbshipit-source-id: 9f1fac1
Summary:
Added an option to GetApproximateSizes to exclude file stats, as MyRocks has those counted exactly and we need only stats from memtables.
Closes https://github.com/facebook/rocksdb/pull/1787
Differential Revision: D4441111
Pulled By: IslamAbdelRahman
fbshipit-source-id: c11f4c3
Summary:
Currently the point lookup values are copied to a string provided by the user.
This incures an extra memcpy cost. This patch allows doing point lookup
via a PinnableSlice which pins the source memory location (instead of
copying their content) and releases them after the content is consumed
by the user. The old API of Get(string) is translated to the new API
underneath.
Here is the summary for improvements:
1. value 100 byte: 1.8% regular, 1.2% merge values
2. value 1k byte: 11.5% regular, 7.5% merge values
3. value 10k byte: 26% regular, 29.9% merge values
The improvement for merge could be more if we extend this approach to
pin the merge output and delay the full merge operation until the user
actually needs it. We have put that for future work.
PS:
Sometimes we observe a small decrease in performance when switching from
t5452014 to this patch but with the old Get(string) API. The difference
is a little and could be noise. More importantly it is safely
cancelled
Closes https://github.com/facebook/rocksdb/pull/1732
Differential Revision: D4374613
Pulled By: maysamyabandeh
fbshipit-source-id: a077f1a
Summary:
If 2PC is enabled, checkpoint may not copy previous log files that contain uncommitted prepare records. In this diff we keep those files.
Closes https://github.com/facebook/rocksdb/pull/1724
Differential Revision: D4368319
Pulled By: siying
fbshipit-source-id: cc2c746
Summary:
Currently the compaction stats are printed to stdout. We want to export the compaction stats in a map format so that the upper layer apps (e.g., MySQL) could present
the stats in any format required by the them.
Closes https://github.com/facebook/rocksdb/pull/1477
Differential Revision: D4149836
Pulled By: maysamyabandeh
fbshipit-source-id: b3df19f
Summary:
Changes in the diff
API changes:
- Introduce IngestExternalFile to replace AddFile (I think this make the API more clear)
- Introduce IngestExternalFileOptions (This struct will encapsulate the options for ingesting the external file)
- Deprecate AddFile() API
Logic changes:
- If our file overlap with the memtable we will flush the memtable
- We will find the first level in the LSM tree that our file key range overlap with the keys in it
- We will find the lowest level in the LSM tree above the the level we found in step 2 that our file can fit in and ingest our file in it
- We will assign a global sequence number to our new file
- Remove AddFile restrictions by using global sequence numbers
Other changes:
- Refactor all AddFile logic to be encapsulated in ExternalSstFileIngestionJob
Test Plan:
unit tests (still need to add more)
addfile_stress (https://reviews.facebook.net/D65037)
Reviewers: yiwu, andrewkr, lightmark, yhchiang, sdong
Reviewed By: sdong
Subscribers: jkedgar, hcz, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D65061
Summary: Use ImmutableDBOptions/MutableDBOptions internally and DBOptions only for user-facing APIs. MutableDBOptions is barely a placeholder for now. I'll start to move options to MutableDBOptions in following diffs.
Test Plan:
make all check
Reviewers: yhchiang, IslamAbdelRahman, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D64065