Summary:
```PosixMmapReadableFile::fd_``` is closed after created, but needs to remain open for the lifetime of `PosixMmapReadableFile` since it is used whenever `InvalidateCache` is called.
Closes https://github.com/facebook/rocksdb/pull/2764
Differential Revision: D8152515
Pulled By: ajkr
fbshipit-source-id: b738a6a55ba4e392f9b0f374ff396a1e61c64f65
Summary:
The only use of RandomRW is to change seqno when bulkloading, and in this use case, the file should exist. We should fail the file opening in this case.
Closes https://github.com/facebook/rocksdb/pull/3827
Differential Revision: D7913719
Pulled By: siying
fbshipit-source-id: 62cf6734f1a6acb9e14f715b927da388131c3492
Summary:
this is a repeat commit of a8a28da215, which got reverted together with 6afe22db2e, but forgotten about when that commit was un-reverted in 46152d53bf.
Closes https://github.com/facebook/rocksdb/pull/3796
Differential Revision: D7826077
Pulled By: ajkr
fbshipit-source-id: edb22375da56e2feda50c5b35f942f4d2d52b19c
Summary:
- Original commit: a4fb1f8c04
- Revert commit (we reverted as a quick fix to get crash tests passing): 6afe22db2e
This PR includes the contents of the original commit plus two bug fixes, which are:
- In whitebox crash test, only set `--expected_values_path` for `db_stress` runs in the first half of the crash test's duration. In the second half, a fresh DB is created for each `db_stress` run, so we cannot maintain expected state across `db_stress` runs.
- Made `Exists()` return true for `UNKNOWN_SENTINEL` values. I previously had an assert in `Exists()` that value was not `UNKNOWN_SENTINEL`. But it is possible for post-crash-recovery expected values to be `UNKNOWN_SENTINEL` (i.e., if the crash happens in the middle of an update), in which case this assertion would be tripped. The effect of returning true in this case is there may be cases where a `SingleDelete` deletes no data. But if we had returned false, the effect would be calling `SingleDelete` on a key with multiple older versions, which is not supported.
Closes https://github.com/facebook/rocksdb/pull/3793
Differential Revision: D7811671
Pulled By: ajkr
fbshipit-source-id: 67e0295bfb1695ff9674837f2e05bb29c50efc30
Summary:
crash-recovery verification is failing in the whitebox testing, which may or may not be a valid correctness issue -- need more time to investigate. In the meantime, reverting so we don't mask other failures.
Closes https://github.com/facebook/rocksdb/pull/3786
Differential Revision: D7794516
Pulled By: ajkr
fbshipit-source-id: 28ccdfdb9ec9b3b0fb08c15cbf9d2e282201ff33
Summary:
It seems clear to me that the variable is initialized before line 492, but it wasn't clear to UBSAN. The failure was:
```
In file included from ./env/io_posix.h:14:0,
from env/env_posix.cc:44:
./include/rocksdb/env.h: In member function ‘virtual rocksdb::Status rocksdb::{anonymous}::PosixEnv::NewMemoryMappedFileBuffer(const string&, std::unique_ptr<rocksdb::MemoryMappedFileBuffer>*)’:
./include/rocksdb/env.h:822:36: error: ‘base’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
: base(_base), length(_length) {}
^
env/env_posix.cc:482:11: note: ‘base’ was declared here
void* base;
```
We can just initialize to nullptr to keep UBSAN happy.
Closes https://github.com/facebook/rocksdb/pull/3770
Differential Revision: D7756287
Pulled By: ajkr
fbshipit-source-id: 0f2efb9594e2d3a30706a4ca7e1d4a6328031bf2
Summary:
Previously, our `db_stress` tool held the expected state of the DB in-memory, so after crash-recovery, there was no way to verify data correctness. This PR adds an option, `--expected_values_file`, which specifies a file holding the expected values.
In black-box testing, the `db_stress` process can be killed arbitrarily, so updates to the `--expected_values_file` must be atomic. We achieve this by `mmap`ing the file and relying on `std::atomic<uint32_t>` for atomicity. Actually this doesn't provide a total guarantee on what we want as `std::atomic<uint32_t>` could, in theory, be translated into multiple stores surrounded by a mutex. We can verify our assumption by looking at `std::atomic::is_always_lock_free`.
For the `mmap`'d file, we didn't have an existing way to expose its contents as a raw memory buffer. This PR adds it in the `Env::NewMemoryMappedFileBuffer` function, and `MemoryMappedFileBuffer` class.
`db_crashtest.py` is updated to use an expected values file for black-box testing. On the first iteration (when the DB is created), an empty file is provided as `db_stress` will populate it when it runs. On subsequent iterations, that same filename is provided so `db_stress` can check the data is as expected on startup.
Closes https://github.com/facebook/rocksdb/pull/3629
Differential Revision: D7463144
Pulled By: ajkr
fbshipit-source-id: c8f3e82c93e045a90055e2468316be155633bd8b
Summary:
Background activities like compaction can negatively affect
latency of higher-priority tasks like request processing. To avoid this,
rocksdb already lowers the IO priority of background threads on Linux
systems. While this takes care of typical IO-bound systems, it does not
help much when CPU (temporarily) becomes the bottleneck. This is
especially likely when using more expensive compression settings.
This patch adds an API to allow for lowering the CPU priority of
background threads, modeled on the IO priority API. Benchmarks (see
below) show significant latency and throughput improvements when CPU
bound. As a result, workloads with some CPU usage bursts should benefit
from lower latencies at a given utilization, or should be able to push
utilization higher at a given request latency target.
A useful side effect is that compaction CPU usage is now easily visible
in common tools, allowing for an easier estimation of the contribution
of compaction vs. request processing threads.
As with IO priority, the implementation is limited to Linux, degrading
to a no-op on other systems.
Closes https://github.com/facebook/rocksdb/pull/3763
Differential Revision: D7740096
Pulled By: gwicke
fbshipit-source-id: e5d32373e8dc403a7b0c2227023f9ce4f22b413c
Summary:
This PR comments out the rest of the unused arguments which allow us to turn on the -Wunused-parameter flag. This is the second part of a codemod relating to https://github.com/facebook/rocksdb/pull/3557.
Closes https://github.com/facebook/rocksdb/pull/3662
Differential Revision: D7426121
Pulled By: Dayvedde
fbshipit-source-id: 223994923b42bd4953eb016a0129e47560f7e352
Summary:
- removed a few unneeded variables
- fused some variable declarations and their assignments
- fixed right-trimming code in string_util.cc to not underflow
- simplifed an assertion
- move non-nullptr check assertion before dereferencing of that pointer
- pass an std::string function parameter by const reference instead of by value (avoiding potential copy)
Closes https://github.com/facebook/rocksdb/pull/3507
Differential Revision: D7004679
Pulled By: sagar0
fbshipit-source-id: 52944952d9b56dfcac3bea3cd7878e315bb563c4
Summary:
There seems to be a typo mistake in env ReuseWritableFile func
where status is being returned twice.
Closes https://github.com/facebook/rocksdb/pull/3099
Differential Revision: D6196204
Pulled By: ajkr
fbshipit-source-id: abb6e3e1c1e772dd485fc39e7f1b9d502fa188fe
Summary:
Problem:
- `DB::SanitizeOptions` strips trailing slash from `wal_dir` but not `dbname`
- We check whether `wal_dir` and `dbname` refer to the same directory using string equality: https://github.com/facebook/rocksdb/blob/master/db/repair.cc#L258
- Providing `dbname` with trailing slash causes default `wal_dir` to be misidentified as a separate directory.
- Then the repair tries to add all SST files to the `VersionEdit` twice (once for `dbname` dir, once for `wal_dir`) and fails with coredump.
Solution:
- Add a new `Env` function, `AreFilesSame`, which uses device and inode number to check whether files are the same. It's currently only implemented in `PosixEnv`.
- Migrate repair to use `AreFilesSame` to check whether `dbname` and `wal_dir` are same. If unsupported, falls back to string comparison.
Closes https://github.com/facebook/rocksdb/pull/2827
Differential Revision: D5761349
Pulled By: ajkr
fbshipit-source-id: c839d548678b742af1166d60b09abd94e5476238
Summary:
When we had a single thread pool for compactions, a thread could be busy for a long time (minutes) executing a compaction involving the bottom level. In multi-instance setups, the entire thread pool could be consumed by such bottom-level compactions. Then, top-level compactions (e.g., a few L0 files) would be blocked for a long time ("head-of-line blocking"). Such top-level compactions are critical to prevent compaction stalls as they can quickly reduce number of L0 files / sorted runs.
This diff introduces a bottom-priority queue for universal compactions including the bottom level. This alleviates the head-of-line blocking situation for fast, top-level compactions.
- Added `Env::Priority::BOTTOM` thread pool. This feature is only enabled if user explicitly configures it to have a positive number of threads.
- Changed `ThreadPoolImpl`'s default thread limit from one to zero. This change is invisible to users as we call `IncBackgroundThreadsIfNeeded` on the low-pri/high-pri pools during `DB::Open` with values of at least one. It is necessary, though, for bottom-pri to start with zero threads so the feature is disabled by default.
- Separated `ManualCompaction` into two parts in `PrepickedCompaction`. `PrepickedCompaction` is used for any compaction that's picked outside of its execution thread, either manual or automatic.
- Forward universal compactions involving last level to the bottom pool (worker thread's entry point is `BGWorkBottomCompaction`).
- Track `bg_bottom_compaction_scheduled_` so we can wait for bottom-level compactions to finish. We don't count them against the background jobs limits. So users of this feature will get an extra compaction for free.
Closes https://github.com/facebook/rocksdb/pull/2580
Differential Revision: D5422916
Pulled By: ajkr
fbshipit-source-id: a74bd11f1ea4933df3739b16808bb21fcd512333
Summary:
Force people to write something other than file name while returning status for IOError.
Closes https://github.com/facebook/rocksdb/pull/2493
Differential Revision: D5321309
Pulled By: siying
fbshipit-source-id: 38bcf6c19e80831cd3e300a047e975cbb131d822
Summary:
Previously users could set `max_background_flushes=0` to force rocksdb to use a single thread pool for both background flushes and compactions. That'll no longer be possible since I'm going to deprecate `max_background_flushes` and `max_background_compactions` in favor of a single option. This diff introduces a new way to force a single thread pool: when high-pri pool has zero threads, all background jobs will be submitted to low-pri pool.
Note the majority of the code change is adding `Env::GetBackgroundThreads()`, which is necessary to check whether the user has provided a zero-sized thread pool.
Closes https://github.com/facebook/rocksdb/pull/2204
Differential Revision: D4936256
Pulled By: ajkr
fbshipit-source-id: 929a07a0c0705f7766f5339cd013ff74e90d6e01
Summary:
Replacement of #2147
The change was squashed due to a lot of conflicts.
Closes https://github.com/facebook/rocksdb/pull/2194
Differential Revision: D4929799
Pulled By: siying
fbshipit-source-id: 5cd49c254737a1d5ac13f3c035f128e86524c581
Summary:
OpenBSD doesn't have `O_DIRECT`, so avoid it. (RocksDB compiles successfully on
OpenBSD with this patch.)
Closes https://github.com/facebook/rocksdb/pull/2106
Differential Revision: D4847833
Pulled By: siying
fbshipit-source-id: 214b785
Summary:
Move some files under util/ to new directories env/, monitoring/ options/ and cache/
Closes https://github.com/facebook/rocksdb/pull/2090
Differential Revision: D4833681
Pulled By: siying
fbshipit-source-id: 2fd8bef
Summary:
fix lite bugs
disable direct io in lite mode
Closes https://github.com/facebook/rocksdb/pull/1870
Differential Revision: D4559866
Pulled By: yiwu-arbug
fbshipit-source-id: 3761c51
Summary:
introduce new methods into a public threadpool interface,
- allow submission of std::functions as they allow greater flexibility.
- add Joining methods to the implementation to join scheduled and submitted jobs with
an option to cancel jobs that did not start executing.
- Remove ugly `#ifdefs` between pthread and std implementation, make it uniform.
- introduce pimpl for a drop in replacement of the implementation
- Introduce rocksdb::port::Thread typedef which is a replacement for std::thread. On Posix Thread defaults as before std::thread.
- Implement WindowsThread that allocates memory in a more controllable manner than windows std::thread with a replaceable implementation.
- should be no functionality changes.
Closes https://github.com/facebook/rocksdb/pull/1823
Differential Revision: D4492902
Pulled By: siying
fbshipit-source-id: c74cb11
Summary:
direct IO reads refactoring
remove unnecessary classes and unified interfaces
tested with db_bench
need more change for options and ON/OFF for different files.
Since disabled is default, it should be fine now
Closes https://github.com/facebook/rocksdb/pull/1636
Differential Revision: D4307189
Pulled By: lightmark
fbshipit-source-id: 6991e22
Summary:
It'd be nice to use the error status type to distinguish
between user error and system error. For example, GetChildren can fail
listing a backup directory's contents either because a bad path was provided
(user error) or because an operation failed, e.g., a remote storage service
call failed (system error). In the former case, we want to continue and treat
the backup directory as empty; in the latter case, we want to immediately
propagate the error to the caller.
This diff uses NotFound to indicate user error and IOError to indicate
system error. Previously IOError indicated both.
Closes https://github.com/facebook/rocksdb/pull/1644
Differential Revision: D4312157
Pulled By: ajkr
fbshipit-source-id: 51b4f24
Summary:
We should close the fd, before overriding it. This bug was
introduced by f89caa127b
Closes https://github.com/facebook/rocksdb/pull/1553
Differential Revision: D4214101
Pulled By: siying
fbshipit-source-id: 0d65de0
Summary:
This patch clarifies the contract of PositionedAppend with some unit
tests and also implements it for PosixWritableFile. (Tasks: 14524071)
Closes https://github.com/facebook/rocksdb/pull/1514
Differential Revision: D4204907
Pulled By: maysamyabandeh
fbshipit-source-id: 06eabd2
Summary:
Add Env::RandomRWFile in env.h and implement it for POSIX
RandomRWFile is a file that allow us to read from / write to random offsets in the file
I will implement it for other Envs later after finishing the whole task for AddFile()
Test Plan: unit tests
Reviewers: andrewkr, kradhakrishnan, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D62433
Summary:
This diff split ThreadPool to
-ThreadPool (abstract interface exposed in include/rocksdb/threadpool.h)
-ThreadPoolImpl (actual implementation in util/threadpool_imp.h)
This allow us to expose ThreadPool to the user so we can use it as an option later
Test Plan: existing unit tests
Reviewers: andrewkr, yiwu, yhchiang, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba
Differential Revision: https://reviews.facebook.net/D62085
Summary:
Env holds a pointer of ThreadStatusUpdater, which will be deleted when
Env is deleted. However, in case a rocksdb database is deleted after
Env is deleted. Then this will introduce a free-after-use of this
ThreadStatusUpdater.
This patch fix this by never deleting the ThreadStatusUpdater in Env,
which is in general safe as Env is a singleton in most cases.
Test Plan: thread_list_test
Reviewers: andrewkr, sdong
Reviewed By: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D59187
Summary:
O_DIRECT is not available in Mac as a flag for open. The fix is to make
use of fctl after the file is opened
Test Plan: Run the tests on mac and Linux
Reviewers: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D58665
Summary:
This patch adds direct IO capability to RocksDB Env.
The direct IO capability is required for persistent cache since NVM is best
accessed as 4K direct IO. SSDs can leverage direct IO for reading.
Direct IO requires the offset and size be sector size aligned, and memory to
be kernel page aligned. Since neither RocksDB/Persistent read cache data
layout is aligned to sector size, the code can accommodate reading unaligned IO size
(or unaligned memory) at the cost of an alloc/copy.
The write code path expects the size and memory to be aligned.
Test Plan: Run RocksDB unit tests
Reviewers: sdong
Subscribers: andrewkr, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D57393
Conditionally retrofit thread_posix for use with std::thread
and reuse the same logic. Posix users continue using Posix interfaces.
Enable XPRESS compression in test runs.
Fix master introduced signed/unsigned mismatch.
Summary:
Remove the SyncPoint usage in the destructor of PosixEnv as none
of any active tests is using it.
SyncPoint is a test-only utility class, and it's a static varible.
As a result, using SyncPoint in the destructor of PosixEnv will
make default Env depends on SyncPoint. Removing such dependency
could solve the problem crash issue only reproducable in Mac
environment.
Test Plan: OPT=-DTRAVIS V=1 make -j4 check on Mac environment
Reviewers: sdong, anthony
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D54333
Summary:
When a child thread that uses ThreadLocalPtr, ThreadLocalPtr::OnThreadExit
will be called when that child thread is destroyed. However,
OnThreadExit will try to access a static singleton of ThreadLocalPtr,
which will be destroyed when the main thread exit. As a result,
when a child thread that uses ThreadLocalPtr exits AFTER the main thread
exits, illegal memory access will occur.
This diff includes a test that reproduce this legacy bug.
==2095206==ERROR: AddressSanitizer: heap-use-after-free on address
0x608000007fa0 at pc 0x959b79 bp 0x7f5fa7426b60 sp 0x7f5fa7426b58
READ of size 8 at 0x608000007fa0 thread T1
This patch fix this issue by having the thread local mutex never be deleted
(but will leak small piece of memory at the end.) The patch also describe
a better solution (thread_local) in the comment that requires gcc 4.8.1 and
in latest clang as a future work once we agree to move toward gcc 4.8.
Test Plan:
COMPILE_WITH_ASAN=1 make thread_local_test -j32
./thread_local_test --gtest_filter="*MainThreadDiesFirst"
Reviewers: anthony, hermanlee4, sdong
Reviewed By: sdong
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D53013
Summary:
We can avoid the dependency by forward-declaring ColumnFamilyData and
then treating it as a black box. That means callers of ThreadStatusUtil need to
explicitly provide more options, even if they can be derived from the
ColumnFamilyData, since ThreadStatusUtil doesn't include the definition.
This is part of a series of diffs to eliminate circular dependencies between
directories (e.g., db/* files depending on util/* files and vice-versa).
Test Plan:
$ ./db_test --gtest_filter=DBTest.GetThreadStatus
$ make -j32 commit-prereq
Reviewers: sdong, yhchiang, IslamAbdelRahman
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D53361
Summary:
This diff provides a framework for doing manual
compactions in parallel with other compactions. We now have a deque of manual compactions. We also pass manual compactions as an argument from RunManualCompactions down to
BackgroundCompactions, so that RunManualCompactions can be reentrant.
Parallelism is controlled by the two routines
ConflictingManualCompaction to allow/disallow new parallel/manual
compactions based on already existing ManualCompactions. In this diff, by default manual compactions still have to run exclusive of other compactions. However, by setting the compaction option, exclusive_manual_compaction to false, it is possible to run other compactions in parallel with a manual compaction. However, we are still restricted to one manual compaction per column family at a time. All of these restrictions will be relaxed in future diffs.
I will be adding more tests later.
Test Plan: Rocksdb regression + new tests + valgrind
Reviewers: igor, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang, sdong
Reviewed By: sdong
Subscribers: yoshinorim, dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D47973
Summary:
By default, RocksDB initializes the singletons of ThreadLocalPtr first, then initializes PosixEnv
via static initializer. Destructor terminates objects in reverse order, so terminating PosixEnv
(calling pthread_mutex_lock), then ThreadLocal (calling pthread_mutex_destroy).
However, in certain case, application might initialize PosixEnv first, then ThreadLocalPtr.
This will cause core dump at the end of the program (eg. https://github.com/facebook/mysql-5.6/issues/122)
This patch fix this issue by ensuring the destruction order by moving the global static singletons
to function static singletons. Since function static singletons are initialized when the function is first
called, this property allows us invoke to enforce the construction of the static PosixEnv and the
singletons of ThreadLocalPtr by calling the function where the ThreadLocalPtr singletons belongs
right before we initialize the static PosixEnv.
Test Plan: Verified in the MyRocks.
Reviewers: yoshinorim, IslamAbdelRahman, rven, kradhakrishnan, anthony, sdong, MarkCallaghan
Reviewed By: anthony
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D51789
Summary: This patch moves all posix thread logic to a separate library.
The motivation is to allow another environments to easily reuse posix
threads. HDFS wraps already posix threads; this split would simplify
this code.
Test Plan: No new functionality is added to posix Env or the threading
library, thus the current tests should suffice.
Summary: include/posix/io_posix.h is not a public API. Although include/posix/ is not a public header directory, it is confusing to put non-public headers to under include/. Move it to util/ to be clearer.
Test Plan: Run all tests
Reviewers: rven, IslamAbdelRahman, anthony, kradhakrishnan, yhchiang, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D49611
Summary: IO Posix depends on too many .h files. Move most of them to .cc files.
Test Plan: make all
Reviewers: anthony, rven, IslamAbdelRahman, yhchiang, kradhakrishnan, igor
Reviewed By: igor
Subscribers: leveldb, dhruba
Differential Revision: https://reviews.facebook.net/D49479
Summary: This patch splits the posix storage backend into Env and
the actual *File implementations. The motivation is to allow other Envs
to use posix as a library. This enables a storage backend different from
posix to split its secondary storage between a normal file system
partition managed by posix, and it own media.
Test Plan: No new functionality is added to posix Env or the library,
thus the current tests should suffice.
Summary: We don't yet have a CI build for iOS, so our iOS compile gets broken sometimes. Most of the errors are from assumption that size_t is 64-bit, while it's actually 32-bit on some (all?) iOS platforms. This diff fixes the compile.
Test Plan:
TARGET_OS=IOS make static_lib
Observe there are no warnings
Reviewers: sdong, anthony, IslamAbdelRahman, kradhakrishnan, yhchiang
Reviewed By: yhchiang
Subscribers: dhruba, leveldb
Differential Revision: https://reviews.facebook.net/D49029
Add an environment method to reuse an existing file. Provide a generic
implementation that does a simple rename + open (writeable), and also a
posix variant that is more careful about error handling (if we fail to
open, do not rename, etc.).
Signed-off-by: Sage Weil <sage@redhat.com>