Summary:
in PR https://github.com/facebook/rocksdb/issues/7419 , we introduce the new Append and PositionedAppend APIs to WritableFile at File System, which enable RocksDB to pass the data verification information (e.g., checksum of the data) to the lower layer. In this PR, we use the new API in WritableFileWriter, such that the file created via WritableFileWrite can pass the checksum to the storage layer. To control which types file should apply the checksum handoff, we add checksum_handoff_file_types to DBOptions. User can use this option to control which file types (Currently supported file tyes: kLogFile, kTableFile, kDescriptorFile.) should use the new Append and PositionedAppend APIs to handoff the verification information.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7523
Test Plan: add new unit test, pass make check/ make asan_check
Reviewed By: pdillinger
Differential Revision: D24313271
Pulled By: zhichao-cao
fbshipit-source-id: aafd69091ae85c3318e3e17cbb96fe7338da11d0
Summary:
This PR does the following:
-> Creates a WinFileSystem class. This class is the Windows equivalent of the PosixFileSystem and will be used on Windows systems.
-> Introduces a CustomEnv class. A CustomEnv is an Env that takes a FileSystem as constructor argument. I believe there will only ever be two implementations of this class (PosixEnv and WinEnv). There is still a CustomEnvWrapper class that takes an Env and a FileSystem and wraps the Env calls with the input Env but uses the FileSystem for the FileSystem calls
-> Eliminates the public uses of the LegacyFileSystemWrapper.
With this change in place, there are effectively the following patterns of Env:
- "Base Env classes" (PosixEnv, WinEnv). These classes implement the core Env functions (e.g. Threads) and have a hard-coded input FileSystem. These classes inherit from CompositeEnv, implement the core Env functions (threads) and delegate the FileSystem-like calls to the input file system.
- Wrapped Composite Env classes (MemEnv). These classes take in an Env and a FileSystem. The core env functions are re-directed to the wrapped env. The file system calls are redirected to the input file system
- Legacy Wrapped Env classes. These classes take in an Env input (but no FileSystem). The core env functions are re-directed to the wrapped env. A "Legacy File System" is created using this env and the file system calls directed to the env itself.
With these changes in place, the PosixEnv becomes a singleton -- there is only ever one created. Any other use of the PosixEnv is via another wrapped env. This cleans up some of the issues with the env construction and destruction.
Additionally, there were places in the code that required had an Env when they required a FileSystem. Many of these places would wrap the Env with a LegacyFileSystemWrapper instead of using the env->GetFileSystem(). These places were changed, thereby removing layers of additional redirection (LegacyFileSystem --> Env --> Env::FileSystem).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7703
Reviewed By: zhichao-cao
Differential Revision: D25762190
Pulled By: anand1976
fbshipit-source-id: 1a088e97fc916f28ac69c149cd1dcad0ab31704b
Summary:
This PR does a few things:
1. The MockFileSystem class was split out from the MockEnv. This change would theoretically allow a MockFileSystem to be used by other Environments as well (if we created a means of constructing one). The MockFileSystem implements a FileSystem in its entirety and does not rely on any Wrapper implementation.
2. Make the RocksDB test suite work when MOCK_ENV=1 and ENCRYPTED_ENV=1 are set. To accomplish this, a few things were needed:
- The tests that tried to use the "wrong" environment (Env::Default() instead of env_) were updated
- The MockFileSystem was changed to support the features it was missing or mishandled (such as recursively deleting files in a directory or supporting renaming of a directory).
3. Updated the test framework to have a ROCKSDB_GTEST_SKIP macro. This can be used to flag tests that are skipped. Currently, this defaults to doing nothing (marks the test as SUCCESS) but will mark the tests as SKIPPED when RocksDB is upgraded to a version of gtest that supports this (gtest-1.10).
I have run a full "make check" with MEM_ENV, ENCRYPTED_ENV, both, and neither under both MacOS and RedHat. A few tests were disabled/skipped for the MEM/ENCRYPTED cases. The error_handler_fs_test fails/hangs for MEM_ENV (presumably a timing problem) and I will introduce another PR/issue to track that problem. (I will also push a change to disable those tests soon). There is one more test in DBTest2 that also fails which I need to investigate or skip before this PR is merged.
Theoretically, this PR should also allow the test suite to run against an Env loaded from the registry, though I do not have one to try it with currently.
Finally, once this is accepted, it would be nice if there was a CircleCI job to run these tests on a checkin so this effort does not become stale. I do not know how to do that, so if someone could write that job, it would be appreciated :)
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7566
Reviewed By: zhichao-cao
Differential Revision: D24408980
Pulled By: jay-zhuang
fbshipit-source-id: 911b1554a4d0da06fd51feca0c090a4abdcb4a5f
Summary:
Add new AppendWithVerify and PositionedAppendWithVerify APIs to Env and FileSystem to bring the data verification information (data checksum information) from upper layer (e.g., WritableFileWriter) to the storage layer. This PR only include the API definition, no functional codes are added to unblock other developers which depend on these APIs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7419
Test Plan: make -j32
Reviewed By: pdillinger
Differential Revision: D23883196
Pulled By: zhichao-cao
fbshipit-source-id: 94676c26bc56144cc32e3661f84f21eccd790411
Summary:
A new file interface `SupportPrefetch()` is added. When the user overrides it to `false`, an internal prefetch buffer will be used for readahead. Useful for non-directIO but FS doesn't have readahead support.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7312
Reviewed By: anand1976
Differential Revision: D23329847
Pulled By: jay-zhuang
fbshipit-source-id: 71cd4ce6f4a820840294e4e6aec111ab76175527
Summary:
Make (most of) the env*_test pass when ASSERT_STATUS_CHECKED is enabled.
One test that opens a database is currently disabled in this mode, as there are many errors that need revisited for DB tests and status checks.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7176
Reviewed By: cheng-chang
Differential Revision: D22799278
Pulled By: ajkr
fbshipit-source-id: 16d8a02eaeecd6df1060249b6a5811292801f2ed
Summary:
1. Add the wrapper classes FileSystemTracingWrapper, FSSequentialFileTracingWrapper, FSRandomAccessFileTracingWrapper, FSWritableFileTracingWrapper, FSRandomRWFileTracingWrapper that forward the calls to underlying storage system and then pass the file operation information to IOTracer. IOTracer dumps the record in binary format for tracing.
2. Add the wrapper classes FileSystemPtr, FSSequentialFilePtr, FSRandomAccessFilePtr, FSWritableFilePtr and FSRandomRWFilePtr that overload operator-> and return ptr to underlying storage system or Tracing wrapper class based on enabling/disabling of IO tracing. These classes are added to bypass Tracing Wrapper classes when we disable tracing.
3. Add enums in trace.h that distinguish which options need to be added for different file operations(Read, close, write etc) as part of tracing record.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7002
Test Plan: make check -j64
Reviewed By: anand1976
Differential Revision: D22127897
Pulled By: akankshamahajan15
fbshipit-source-id: 74cff58ce5661c9a3832dfaa52483f3b2d8565e0
Summary:
Calculate ```IOOptions::timeout``` using ```ReadOptions::deadline``` and pass it to ```FileSystem::Read/FileSystem::MultiRead```. This allows us to impose a tighter bound on the time taken by Get/MultiGet on FileSystem/Envs that support IO timeouts. Even on those that don't support, check in ```RandomAccessFileReader::Read``` and ```MultiRead``` and return ```Status::TimedOut()``` if the deadline is exceeded.
For now, TableReader creation, which might do file opens and reads, are not covered. It will be implemented in another PR.
Tests:
Update existing unit tests to verify the correct timeout value is being passed
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6751
Reviewed By: riversand963
Differential Revision: D21285631
Pulled By: anand1976
fbshipit-source-id: d89af843e5a91ece866e87aa29438b52a65a8567
Summary:
Based on https://github.com/facebook/rocksdb/issues/6648 (CLA Signed), but heavily modified / extended:
* Implicit capture of this via [=] deprecated in C++20, and [=,this] not standard before C++20 -> now using explicit capture lists
* Implicit copy operator deprecated in gcc 9 -> add explicit '= default' definition
* std::random_shuffle deprecated in C++17 and removed in C++20 -> migrated to a replacement in RocksDB random.h API
* Add the ability to build with different std version though -DCMAKE_CXX_STANDARD=11/14/17/20 on the cmake command line
* Minimal rebuild flag of MSVC is deprecated and is forbidden with /std:c++latest (C++20)
* Added MSVC 2019 C++11 & MSVC 2019 C++20 in AppVeyor
* Added GCC 9 C++11 & GCC9 C++20 in Travis
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6697
Test Plan: make check and CI
Reviewed By: cheng-chang
Differential Revision: D21020318
Pulled By: pdillinger
fbshipit-source-id: 12311be5dbd8675a0e2c817f7ec50fa11c18ab91
Summary:
IsDirectory() is a common API to check whether a path is a regular file or
directory.
POSIX: call stat() and use S_ISDIR(st_mode)
Windows: PathIsDirectoryA() and PathIsDirectoryW()
HDFS: FileSystem.IsDirectory()
Java: File.IsDirectory()
...
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6711
Test Plan: make check
Reviewed By: anand1976
Differential Revision: D21053520
Pulled By: riversand963
fbshipit-source-id: 680aadfd8ce982b63689190cf31b3145d5a89e27
Summary:
In the current code base, we use Status to get and store the returned status from the call. Specifically, for IO related functions, the current Status cannot reflect the IO Error details such as error scope, error retryable attribute, and others. With the implementation of https://github.com/facebook/rocksdb/issues/5761, we have the new Wrapper for IO, which returns IOStatus instead of Status. However, the IOStatus is purged at the lower level of write path and transferred to Status.
The first job of this PR is to pass the IOStatus to the write path (flush, WAL write, and Compaction). The second job is to identify the Retryable IO Error as HardError, and set the bg_error_ as HardError. In this case, the DB Instance becomes read only. User is informed of the Status and need to take actions to deal with it (e.g., call db->Resume()).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6487
Test Plan: Added the testing case to error_handler_fs_test. Pass make asan_check
Reviewed By: anand1976
Differential Revision: D20685017
Pulled By: zhichao-cao
fbshipit-source-id: ff85f042896243abcd6ef37877834e26f36b6eb0
Summary:
The current Env/FileSystem API separation has a couple of issues -
1. It requires the user to specify 2 options - ```Options::env``` and ```Options::file_system``` - which means they have to make code changes to benefit from the new APIs. Furthermore, there is a risk of accessing the same APIs in two different ways, through Env in the old way and through FileSystem in the new way. The two may not always match, for example, if env is ```PosixEnv``` and FileSystem is a custom implementation. Any stray RocksDB calls to env will use the ```PosixEnv``` implementation rather than the file_system implementation.
2. There needs to be a simple way for the FileSystem developer to instantiate an Env for backward compatibility purposes.
This PR solves the above issues and simplifies the migration in the following ways -
1. Embed a shared_ptr to the ```FileSystem``` in the ```Env```, and remove ```Options::file_system``` as a configurable option. This way, no code changes will be required in application code to benefit from the new API. The default Env constructor uses a ```LegacyFileSystemWrapper``` as the embedded ```FileSystem```.
1a. - This also makes it more robust by ensuring that even if RocksDB
has some stray calls to Env APIs rather than FileSystem, they will go
through the same object and thus there is no risk of getting out of
sync.
2. Provide a ```NewCompositeEnv()``` API that can be used to construct a
PosixEnv with a custom FileSystem implementation. This eliminates an
indirection to call Env APIs, and relieves the FileSystem developer of
the burden of having to implement wrappers for the Env APIs.
3. Add a couple of missing FileSystem APIs - ```SanitizeEnvOptions()``` and
```NewLogger()```
Tests:
1. New unit tests
2. make check and make asan_check
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6552
Reviewed By: riversand963
Differential Revision: D20592038
Pulled By: anand1976
fbshipit-source-id: c3801ad4153f96d21d5a3ae26c92ba454d1bf1f7
Summary:
In Linux, when reopening DB with many SST files, profiling shows that 100% system cpu time spent for a couple of seconds for `GetLogicalBufferSize`. This slows down MyRocks' recovery time when site is down.
This PR introduces two new APIs:
1. `Env::RegisterDbPaths` and `Env::UnregisterDbPaths` lets `DB` tell the env when it starts or stops using its database directories . The `PosixFileSystem` takes this opportunity to set up a cache from database directories to the corresponding logical block sizes.
2. `LogicalBlockSizeCache` is defined only for OS_LINUX to cache the logical block sizes.
Other modifications:
1. rename `logical buffer size` to `logical block size` to be consistent with Linux terms.
2. declare `GetLogicalBlockSize` in `PosixHelper` to expose it to `PosixFileSystem`.
3. change the functions `IOError` and `IOStatus` in `env/io_posix.h` to have external linkage since they are used in other translation units too.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6457
Test Plan:
1. A new unit test is added for `LogicalBlockSizeCache` in `env/io_posix_test.cc`.
2. A new integration test is added for `DB` operations related to the cache in `db/db_logical_block_size_cache_test.cc`.
`make check`
Differential Revision: D20131243
Pulled By: cheng-chang
fbshipit-source-id: 3077c50f8065c0bffb544d8f49fb10bba9408d04
Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433
Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.
Differential Revision: D19977691
fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
Summary:
The current Env API encompasses both storage/file operations, as well as OS related operations. Most of the APIs return a Status, which does not have enough metadata about an error, such as whether its retry-able or not, scope (i.e fault domain) of the error etc., that may be required in order to properly handle a storage error. The file APIs also do not provide enough control over the IO SLA, such as timeout, prioritization, hinting about placement and redundancy etc.
This PR separates out the file/storage APIs from Env into a new FileSystem class. The APIs are updated to return an IOStatus with metadata about the error, as well as to take an IOOptions structure as input in order to allow more control over the IO.
The user can set both ```options.env``` and ```options.file_system``` to specify that RocksDB should use the former for OS related operations and the latter for storage operations. Internally, a ```CompositeEnvWrapper``` has been introduced that inherits from ```Env``` and redirects individual methods to either an ```Env``` implementation or the ```FileSystem``` as appropriate. When options are sanitized during ```DB::Open```, ```options.env``` is replaced with a newly allocated ```CompositeEnvWrapper``` instance if both env and file_system have been specified. This way, the rest of the RocksDB code can continue to function as before.
This PR also ports PosixEnv to the new API by splitting it into two - PosixEnv and PosixFileSystem. PosixEnv is defined as a sub-class of CompositeEnvWrapper, and threading/time functions are overridden with Posix specific implementations in order to avoid an extra level of indirection.
The ```CompositeEnvWrapper``` translates ```IOStatus``` return code to ```Status```, and sets the severity to ```kSoftError``` if the io_status is retryable. The error handling code in RocksDB can then recover the DB automatically.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/5761
Differential Revision: D18868376
Pulled By: anand1976
fbshipit-source-id: 39efe18a162ea746fabac6360ff529baba48486f