Summary:
For performance purposes, the lower level routines were changed to use a SystemClock* instead of a std::shared_ptr<SystemClock>. The shared ptr has some performance degradation on certain hardware classes.
For most of the system, there is no risk of the pointer being deleted/invalid because the shared_ptr will be stored elsewhere. For example, the ImmutableDBOptions stores the Env which has a std::shared_ptr<SystemClock> in it. The SystemClock* within the ImmutableDBOptions is essentially a "short cut" to gain access to this constant resource.
There were a few classes (PeriodicWorkScheduler?) where the "short cut" property did not hold. In those cases, the shared pointer was preserved.
Using db_bench readrandom perf_level=3 on my EC2 box, this change performed as well or better than 6.17:
6.17: readrandom : 28.046 micros/op 854902 ops/sec; 61.3 MB/s (355999 of 355999 found)
6.18: readrandom : 32.615 micros/op 735306 ops/sec; 52.7 MB/s (290999 of 290999 found)
PR: readrandom : 27.500 micros/op 871909 ops/sec; 62.5 MB/s (367999 of 367999 found)
(Note that the times for 6.18 are prior to revert of the SystemClock).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8033
Reviewed By: pdillinger
Differential Revision: D27014563
Pulled By: mrambacher
fbshipit-source-id: ad0459eba03182e454391b5926bf5cdd45657b67
Summary:
Introduces and uses a SystemClock class to RocksDB. This class contains the time-related functions of an Env and these functions can be redirected from the Env to the SystemClock.
Many of the places that used an Env (Timer, PerfStepTimer, RepeatableThread, RateLimiter, WriteController) for time-related functions have been changed to use SystemClock instead. There are likely more places that can be changed, but this is a start to show what can/should be done. Over time it would be nice to migrate most (if not all) of the uses of the time functions from the Env to the SystemClock.
There are several Env classes that implement these functions. Most of these have not been converted yet to SystemClock implementations; that will come in a subsequent PR. It would be good to unify many of the Mock Timer implementations, so that they behave similarly and be tested similarly (some override Sleep, some use a MockSleep, etc).
Additionally, this change will allow new methods to be introduced to the SystemClock (like https://github.com/facebook/rocksdb/issues/7101 WaitFor) in a consistent manner across a smaller number of classes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7858
Reviewed By: pdillinger
Differential Revision: D26006406
Pulled By: mrambacher
fbshipit-source-id: ed10a8abbdab7ff2e23d69d85bd25b3e7e899e90
Summary:
Create a thread in DeleteScheduler only when delete rate limit is set
because when there is no rate limit on deletion, a thread per DeleteScheduler
consumes unnecessary resources.
Test Plan: make -j64 check
Reviewed By: riversand963
Differential Revision: D20538138
Pulled By: akankshamahajan15
fbshipit-source-id: 137499e810e817156345c30d627f8678b9adadf7
Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433
Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.
Differential Revision: D19977691
fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f
Summary:
util/ means for lower level libraries, so it's a good idea to move the files which requires knowledge to DB out. Create a file/ and move some files there.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5375
Differential Revision: D15550935
Pulled By: siying
fbshipit-source-id: 61a9715dcde5386eebfb43e93f847bba1ae0d3f2
Summary:
Blob DB files are not tracked by the SFM, so they currently don't get
deleted in the background. Force them to be deleted in background so
rate limiting can be applied
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4928
Differential Revision: D13854649
Pulled By: anand1976
fbshipit-source-id: 8031ce66842ff0af440c715d886b377983dad7d8
Summary:
Right now slow deletion with ftruncate doesn't work well with checkpoints because it ruin hard linked files in checkpoints. To fix it, check the file has no other hard link before ftruncate it.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/4093
Differential Revision: D8730360
Pulled By: siying
fbshipit-source-id: 756eea5bce8a87b9a2ea3a5bfa190b2cab6f75df
Summary:
sync parent directory after deleting a file in delete scheduler. Otherwise, trim speed may not be as smooth as what we want.
Closes https://github.com/facebook/rocksdb/pull/3767
Differential Revision: D7760136
Pulled By: siying
fbshipit-source-id: ec131d53b61953f09c60d67e901e5eeb2716b05f
Summary:
Add `bytes_max_delete_chunk` in SstFileManager so that we can drop a large file in multiple batches.
Closes https://github.com/facebook/rocksdb/pull/3640
Differential Revision: D7358679
Pulled By: siying
fbshipit-source-id: ef17f0da2f5723dbece2669485a9b91b3edc0bb7
Summary:
Allow users to configure the trash-to-DB size ratio limit, so
that ratelimits for deletes can be enforced even when larger portions of
the database are being deleted.
Closes https://github.com/facebook/rocksdb/pull/3158
Differential Revision: D6304897
Pulled By: gdavidsson
fbshipit-source-id: a28dd13059ebab7d4171b953ed91ce383a84d6b3
Summary:
SstFileManager move files that need to be deleted into a trash directory.
Deprecate this behaviour and instead add ".trash" extension to files that need to be deleted
Closes https://github.com/facebook/rocksdb/pull/2970
Differential Revision: D5976805
Pulled By: IslamAbdelRahman
fbshipit-source-id: 27374ece4315610b2792c30ffcd50232d4c9a343
Summary:
Move some files under util/ to new directories env/, monitoring/ options/ and cache/
Closes https://github.com/facebook/rocksdb/pull/2090
Differential Revision: D4833681
Pulled By: siying
fbshipit-source-id: 2fd8bef
Summary:
introduce new methods into a public threadpool interface,
- allow submission of std::functions as they allow greater flexibility.
- add Joining methods to the implementation to join scheduled and submitted jobs with
an option to cancel jobs that did not start executing.
- Remove ugly `#ifdefs` between pthread and std implementation, make it uniform.
- introduce pimpl for a drop in replacement of the implementation
- Introduce rocksdb::port::Thread typedef which is a replacement for std::thread. On Posix Thread defaults as before std::thread.
- Implement WindowsThread that allocates memory in a more controllable manner than windows std::thread with a replaceable implementation.
- should be no functionality changes.
Closes https://github.com/facebook/rocksdb/pull/1823
Differential Revision: D4492902
Pulled By: siying
fbshipit-source-id: c74cb11
Summary: Using real clock causes failures of DBSSTTest.RateLimitedDelete in some cases. Turn away from the real time. Use fake time instead.
Test Plan: Run the tests and all existing tests.
Reviewers: yiwu, IslamAbdelRahman
Reviewed By: IslamAbdelRahman
Subscribers: leveldb, andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D65145
Summary:
Add a new class SstFileTracker that will be notified whenever a DB add/delete/move and sst file, it will also replace DeleteScheduler
SstFileTracker can be used later to abort writes when we exceed a specific size
Test Plan: unit tests
Reviewers: rven, anthony, yhchiang, sdong
Reviewed By: sdong
Subscribers: igor, lovro, march, dhruba
Differential Revision: https://reviews.facebook.net/D50469
Summary: Update DestroyDB so that all SST files in the first path id go through DeleteScheduler instead of being deleted immediately
Test Plan: added a unittest
Reviewers: igor, yhchiang, anthony, kradhakrishnan, rven, sdong
Reviewed By: sdong
Subscribers: jeanxu2012, dhruba
Differential Revision: https://reviews.facebook.net/D44955
Summary:
Introduce DeleteScheduler that allow enforcing a rate limit on file deletion
Instead of deleting files immediately, files are moved to trash directory and deleted in a background thread that apply sleep penalty between deletes if needed.
I have updated PurgeObsoleteFiles and PurgeObsoleteWALFiles to use the delete_scheduler instead of env_->DeleteFile
Test Plan:
added delete_scheduler_test
existing unit tests
Reviewers: kradhakrishnan, anthony, rven, yhchiang, sdong
Reviewed By: sdong
Subscribers: dhruba
Differential Revision: https://reviews.facebook.net/D43221