Summary:
In rare cases, optimistic transaction commit returns TryAgain. This change tolerates that intentional behavior in db_stress, up to a small limit in a row. This way, we don't miss a possible regression with excessive TryAgain, and trying again (rolling back the transaction) should have a well renewed chance of success as the writes will be associated with fresh sequence numbers.
Also, some of the APIs were not clear about Transaction semantics, so I have clarified:
* (Best I can tell....) Destroying a Transaction is safe without calling Rollback() (or at least should be). I don't know why it's a common pattern in our test code and examples to rollback before unconditional destruction. Stress test updated not to call Rollback unnecessarily (to test safe destruction).
* Despite essentially doing what is asked, simply trying Commit() again when it returns TryAgain does not have a chance of success, because of the transaction being bound to the DB state at the time of operations before Commit. Similar logic applies to Busy AFAIK. Commit() API comments updated, and expanded unit test in optimistic_transaction_test.
Also also, because I can't stop myself, I refactored a good portion of the transaction handling code in db_stress.
* Avoid existing and new copy-paste for most transaction interactions with a new ExecuteTransaction (higher-order) function.
* Use unique_ptr (nicely complements removing unnecessary Rollbacks)
* Abstract out a pattern for safely calling std::terminate() and use it in more places. (The TryAgain errors we saw did not have stack traces because of "terminate called recursively".)
Intended follow-up: resurrect use of `FLAGS_rollback_one_in` but also include non-trivial cases
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11653
Test Plan:
this is the test :)
Also, temporarily bypassed the new retry logic and boosted the chance of hitting TryAgain. Quickly reproduced the TryAgain error. Then re-enabled the new retry logic, and was not able to hit the error after running for tens of minutes, even with the boosted chances.
Reviewed By: cbi42
Differential Revision: D47882995
Pulled By: pdillinger
fbshipit-source-id: 21eadb1525423340dbf28d17cf166b9583311a0d
Summary:
**Context:**
Current `NonBatchedOpsStressTest` does not allow multi-thread read (i.e, Get, Iterator) and write (i.e, Put, Merge) or delete to the same key. Every read or write/delete operation will acquire lock (`GetLocksForKeyRange`) on the target key to gain exclusive access to it. This does not align with RocksDB's nature of allowing multi-thread read and write/delete to the same key, that is concurrent threads can issue read/write/delete to RocksDB without external locking. Therefore this is a gap in our testing coverage.
To close the gap, biggest challenge remains in verifying db value against expected state in presence of parallel read and write/delete. The challenge is due to read/write/delete to the db and read/write to expected state is not within one atomic operation. Therefore we may not know the exact expected state of a certain db read, as by the time we read the expected state for that db read, another write to expected state for another db write to the same key might have changed the expected state.
**Summary:**
Credited to ajkr's idea, we now solve this challenge by breaking the 32-bits expected value of a key into different parts that can be read and write to in parallel.
Basically we divide the 32-bits expected value into `value_base` (corresponding to the previous whole 32 bits but now with some shrinking in the value base range we allow), `pending_write` (i.e, whether there is an ongoing concurrent write), `del_counter` (i.e, number of times a value has been deleted, analogous to value_base for write), `pending_delete` (similar to pending_write) and `deleted` (i.e whether a key is deleted).
Also, we need to use incremental `value_base` instead of random value base as before because we want to control the range of value base a correct db read result can possibly be in presence of parallel read and write. In that way, we can verify the correctness of the read against expected state more easily. This is at the cost of reducing the randomness of the value generated in NonBatchedOpsStressTest we are willing to accept.
(For detailed algorithm of how to use these parts to infer expected state of a key, see the PR)
Misc: hide value_base detail from callers of ExpectedState by abstracting related logics into ExpectedValue class
Pull Request resolved: https://github.com/facebook/rocksdb/pull/11058
Test Plan:
- Manual test of small number of keys (i.e, high chances of parallel read and write/delete to same key) with equally distributed read/write/deleted for 30 min
```
python3 tools/db_crashtest.py --simple {blackbox|whitebox} --sync_fault_injection=1 --skip_verifydb=0 --continuous_verification_interval=1000 --clear_column_family_one_in=0 --max_key=10 --column_families=1 --threads=32 --readpercent=25 --writepercent=25 --nooverwritepercent=0 --iterpercent=25 --verify_iterator_with_expected_state_one_in=1 --num_iterations=5 --delpercent=15 --delrangepercent=10 --range_deletion_width=5 --use_merge={0|1} --use_put_entity_one_in=0 --use_txn=0 --verify_before_write=0 --user_timestamp_size=0 --compact_files_one_in=1000 --compact_range_one_in=1000 --flush_one_in=1000 --get_property_one_in=1000 --ingest_external_file_one_in=100 --backup_one_in=100 --checkpoint_one_in=100 --approximate_size_one_in=0 --acquire_snapshot_one_in=100 --use_multiget=0 --prefixpercent=0 --get_live_files_one_in=1000 --manual_wal_flush_one_in=1000 --pause_background_one_in=1000 --target_file_size_base=524288 --write_buffer_size=524288 --verify_checksum_one_in=1000 --verify_db_one_in=1000
```
- Rehearsal stress test for normal parameter and aggressive parameter to see if such change can find what existing stress test can find (i.e, no regression in testing capability)
- [Ongoing]Try to find new bugs with this change that are not found by current NonBatchedOpsStressTest with no parallel read and write/delete to same key
Reviewed By: ajkr
Differential Revision: D42257258
Pulled By: hx235
fbshipit-source-id: e6fdc18f1fad3753e5ac91731483a644d9b5b6eb
Summary:
Some lines of .h and .cc files are not properly fomatted. Clear them up with clang format.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10868
Test Plan: Watch existing CI to pass
Reviewed By: ajkr
Differential Revision: D40683485
fbshipit-source-id: 491fbb78b2cdcb948164f306829909ad816d5d0b
Summary:
Before this PR, we call `now()` to get the wall time before performing point-lookup and range
scans when user-defined timestamp is enabled.
With this PR, we expand the coverage to:
- read with an older timestamp which is larger then the wall time when the process starts but potentially smaller than now()
- add coverage for `ReadOptions::iter_start_ts != nullptr`
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10280
Test Plan:
```bash
make check
```
Also,
```bash
TEST_TMPDIR=/dev/shm/rocksdb make crash_test_with_ts
```
So far, we have had four successful runs of the above
In addition,
```bash
TEST_TMPDIR=/dev/shm/rocksdb make crash_test
```
Succeeded twice showing no regression.
Reviewed By: ltamasi
Differential Revision: D37539805
Pulled By: riversand963
fbshipit-source-id: f2d9887ad95245945ce17a014d55bb93f00e1cb5
Summary:
ROCKSDB_SUPPORT_THREAD_LOCAL definition has been removed.
`__thread`(#define) has been replaced with `thread_local`(C++ keyword) across the code base.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/10015
Reviewed By: siying
Differential Revision: D36485491
Pulled By: pdillinger
fbshipit-source-id: 6522d212514ee190b90b4e2750c80c7e34013c78
Summary:
Previously all fault injection was ignored in release mode. This PR adds it back except for read fault injection (`--read_fault_one_in > 0`) since its dependency (`IGNORE_STATUS_IF_ERROR`) is unavailable in release mode.
Other notable changes include:
- Moved `EnableWriteErrorInjection()` for `--write_fault_one_in > 0` so it's after `DB::Open()` without depending on `SyncPoint`
- Made `--read_fault_one_in > 0` return an error in release mode
- Updated `db_crashtest.py` to always set `--read_fault_one_in=0` in release mode
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9957
Test Plan:
```
$ DEBUG_LEVEL=0 make -j24 db_stress
$ DEBUG_LEVEL=0 TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox
```
Reviewed By: anand1976
Differential Revision: D36193830
Pulled By: ajkr
fbshipit-source-id: 0b97946b4e3f06e3e0f6e7833c2763da08ec5321
Summary:
When compaction filter determines that a key should be removed, it updates the internal key's type
to `Delete`. If this internal key is preserved in current compaction but seen by a later compaction
together with `SingleDelete`, it will cause compaction iterator to return Corruption.
To fix the issue, compaction filter should return more information in addition to the intention of removing
a key. Therefore, we add a new `kRemoveWithSingleDelete` to `CompactionFilter::Decision`. Seeing
`kRemoveWithSingleDelete`, compaction iterator will update the op type of the internal key to `kTypeSingleDelete`.
In addition, I updated db_stress_shared_state.[cc|h] so that `no_overwrite_ids_` becomes `const`. It is easier to
reason about thread-safety if accessed from multiple threads. This information is passed to `PrepareTxnDBOptions()`
when calling from `Open()` so that we can set up the rollback deletion type callback for transactions.
Finally, disable compaction filter for multiops_txn because the key removal logic of `DbStressCompactionFilter` does
not quite work with `MultiOpsTxnsStressTest`.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9929
Test Plan:
make check
make crash_test
make crash_test_with_txn
Reviewed By: anand1976
Differential Revision: D36069678
Pulled By: riversand963
fbshipit-source-id: cedd2f1ba958af59ad3916f1ba6f424307955f92
Summary:
With the code on main, `RunStressTest` increments the number of threads
one by one as the threads are created and started. This results in a
data race with `NonBatchedOpsStressTest::VerifyDb`, which reads this
value without synchronization, and is also not correct in the sense
that `VerifyDb` assumes that the number of threads already has its final
value set (e.g. it's checking whether the current thread is the last
one). The patch fixes this by setting the number of threads before
creating/starting any threads. This also eliminates the need for locking
the mutex during thread startup.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9466
Test Plan: Ran the blackbox crash test under TSAN for a while.
Reviewed By: ajkr
Differential Revision: D33858856
Pulled By: ltamasi
fbshipit-source-id: 8a6515a83fd1808b8b8dca61978777c4404f04cc
Summary:
The `SharedState` constructor had an early return in case of
`-test_batches_snapshots=1`. This early return caused `num_bg_threads_`
to never be incremented. Consequently, the driver thread could cleanup
objects like the `SharedState` while BG threads were still running and
accessing it, leading to crash.
The fix is to move the logic for counting threads (both FG and BG) to
the place they are launched. That way we can be sure the counts are
consistent, at least for now.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/9313
Test Plan:
below command used to fail, now it passes.
```
$ ./db_stress --db=./test-db/ --expected_values_dir=./test-db-expected/ --max_key=100000 --ops_per_thread=1000 --sync_fault_injection=1 --clear_column_family_one_in=0 --destroy_db_initially=0 -reopen=0 -test_batches_snapshots=1
```
Reviewed By: jay-zhuang
Differential Revision: D33198670
Pulled By: ajkr
fbshipit-source-id: 126592dc1eb31998bc8f82ffbf5a0d4eb8dec317
Summary:
When a previous run left behind historical state/trace files (implying it was run with --sync_fault_injection set), this PR uses them to restore the expected state according to the DB's recovered sequence number. That way, a tail of latest unsynced operations are permitted to be dropped, as is the case when data in page cache or certain `Env`s is lost. The point of the verification in this scenario is just to ensure there is no hole in the recovered data.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8966
Test Plan:
- ran it a while, made sure it is restoring expected values using the historical state/trace files:
```
$ rm -rf ./tmp-db/ ./exp/ && mkdir -p ./tmp-db/ ./exp/ && while ./db_stress -compression_type=none -clear_column_family_one_in=0 -expected_values_dir=./exp -sync_fault_injection=1 -destroy_db_initially=0 -db=./tmp-db -max_key=1000000 -ops_per_thread=10000 -reopen=0 -threads=32 ; do : ; done
```
Reviewed By: pdillinger
Differential Revision: D31219445
Pulled By: ajkr
fbshipit-source-id: f0e1d51fe5b35465b00565c33331190ea38ba0ad
Summary:
When `--sync_fault_injection` is set, this PR takes a snapshot of the expected values and starts an operation trace when the DB is opened. These files are stored in `--expected_values_dir`. They will be used for recovering the expected state of the DB following a crash where a suffix of unsynced operations are allowed to be lost.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8960
Test Plan: injected crashed at various points in `FileExpectedStateManager` and verified the next run recovers the state/trace file with highest seqno and removes all older/temporary files. Note we don't use sync_fault_injection in CI crash tests yet.
Reviewed By: pdillinger
Differential Revision: D31194941
Pulled By: ajkr
fbshipit-source-id: b0f935a529a0186c5a9c7709fcaa8829de8a84cf
Summary:
This is a precursor refactoring to enable an upcoming feature: persistence failure correctness testing.
- Changed `--expected_values_path` to `--expected_values_dir` and migrated "db_crashtest.py" to use the new flag. For persistence failure correctness testing there are multiple possible correct states since unsynced data is allowed to be dropped. Making it possible to restore all these possible correct states will eventually involve files containing snapshots of expected values and DB trace files.
- The expected values directory is managed by an `ExpectedStateManager` instance. Managing expected state files is separated out of `SharedState` to prevent `SharedState` from becoming too complex when the new files and features (snapshotting, tracing, and restoring) are introduced.
- Migrated expected values file access/management out of `SharedState` into a separate class called `ExpectedState`. This is not exposed directly to the test but rather the `ExpectedState` for the latest values file is accessed via a pass-through API on `ExpectedStateManager`. This forces the test to always access the single latest `ExpectedState`.
- Changed the initialization of the latest expected values file to use a tempfile followed by rename, and also add cleanup logic for possible stranded tempfiles.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8913
Test Plan:
run in several ways; try to make sure it's not obviously broken.
- crashtest blackbox without TEST_TMPDIR
```
$ python3 tools/db_crashtest.py blackbox --simple --write_buffer_size=1048576 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --max_key=100000 --value_size_mult=33 --compression_type=none --duration=120 --interval=10 --compression_type=none --blob_compression_type=none
```
- crashtest blackbox with TEST_TMPDIR
```
$ TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py blackbox --simple --write_buffer_size=1048576 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --max_key=100000 --value_size_mult=33 --compression_type=none --duration=120 --interval=10 --compression_type=none --blob_compression_type=none
```
- crashtest whitebox with TEST_TMPDIR
```
$ TEST_TMPDIR=/dev/shm python3 tools/db_crashtest.py whitebox --simple --write_buffer_size=1048576 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --max_key=100000 --value_size_mult=33 --compression_type=none --duration=120 --interval=10 --compression_type=none --blob_compression_type=none --random_kill_odd=88887
```
- db_stress without expected_values_dir
```
$ ./db_stress --write_buffer_size=1048576 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --max_key=100000 --value_size_mult=33 --compression_type=none --ops_per_thread=10000 --clear_column_family_one_in=0 --destroy_db_initially=true
```
- db_stress with expected_values_dir and manual corruption
```
$ ./db_stress --write_buffer_size=1048576 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --max_key=100000 --value_size_mult=33 --compression_type=none --ops_per_thread=10000 --clear_column_family_one_in=0 --destroy_db_initially=true --expected_values_dir=./
// modify one byte in "./LATEST.state"
$ ./db_stress --write_buffer_size=1048576 --target_file_size_base=1048576 --max_bytes_for_level_base=4194304 --max_key=100000 --value_size_mult=33 --compression_type=none --ops_per_thread=10000 --clear_column_family_one_in=0 --destroy_db_initially=false --expected_values_dir=./
...
Verification failed for column family 0 key 0000000000000000 (0): Value not found: NotFound:
...
```
Reviewed By: riversand963
Differential Revision: D30921951
Pulled By: ajkr
fbshipit-source-id: babfe218062e55d018c9b046536c0289fb78f41c
Summary:
Inject read failures in DB reopen, just as what we do for metadata writes and writes.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8476
Test Plan: Some manual tests and make sure failures are triggered.
Reviewed By: anand1976
Differential Revision: D29507283
fbshipit-source-id: d04da0163973447041038bd87701686a417c4e0c
Summary:
add the injest_error_severity to control if it is a retryable IO Error or a fatal or unrecoverable error. Use a flag to indicate, if fatal error comes, the flag is set and db is stopped (but not corrupted).
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8479
Test Plan: run ./db_stress --reopen=0 --read_fault_one_in=1000 --write_fault_one_in=5 --disable_wal=true --write_buffer_size=3000000 -writepercent=5 -readpercent=50 --injest_error_severity=2 --column_families=1, make check
Reviewed By: anand1976
Differential Revision: D29524271
Pulled By: zhichao-cao
fbshipit-source-id: 1aa9fb9b5655b0adba6f5ad12005ca8c074c795b
Summary:
Add a new test ```fbcode_crash_test``` to rocksdb-lego-determinator. This test allows the crash test to be run on Facebook Sandcastle infra using fbcode components. Also use the default Env in db_stress to access the expected values path as it requires a memory mapped file and may not work with custom Envs.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8471
Reviewed By: ajkr
Differential Revision: D29474722
Pulled By: anand1976
fbshipit-source-id: 7d086d82dd7091ae48e08cb4ace763ce3e3b87ef
Summary:
Previously Stress can inject metadata write failures when reopening a DB. We extend it to file append too, in the same way.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8474
Test Plan: manually run crash test with various setting and make sure the failures are triggered as expected.
Reviewed By: zhichao-cao
Differential Revision: D29503116
fbshipit-source-id: e73a446e80ccbd09301a579280e56ff949381fab
Summary:
DB Stress to add --open_metadata_write_fault_one_in which would randomly fail in some file metadata modification operations during DB Open, including file creation, close, renaming and directory sync. Some operations can fail before and after the operations take place.
If DB open fails, db_stress would retry without the failure ingestion, and DB is expected to open successfully.
This option is enabled in crash test in half of the time.
Some follow up changes would allow write failures in open time, and ingesting those failures in non-DB open cases.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8235
Test Plan: Run stress tests for a while and see failures got triggered. This can reproduce the bug fixed by https://github.com/facebook/rocksdb/pull/8192 and a similar one that fails when fsyncing parent directory.
Reviewed By: anand1976
Differential Revision: D28010944
fbshipit-source-id: 36a96da4dc3633e5f7680cef3ea0a900fcdb5558
Summary:
Add some basic test for user-defined timestamp to db_stress. Currently,
read with timestamp always tries to read using the current timestamp.
Due to the per-key timestamp-sequence ordering constraint, we only add timestamp-
related tests to the `NonBatchedOpsStressTest` since this test serializes accesses
to the same key and uses a file to cross-check data correctness.
The timestamp feature is not supported in a number of components, e.g. Merge, SingleDelete,
DeleteRange, CompactionFilter, Readonly instance, secondary instance, SST file ingestion, transaction,
etc. Therefore, db_stress should exit if user enables both timestamp and these features at the same
time. The (currently) incompatible features can be found in
`CheckAndSetOptionsForUserTimestamp`.
This PR also fixes a bug triggered when timestamp is enabled together with
`index_type=kBinarySearchWithFirstKey`. This bug fix will also be in another separate PR
with more unit tests coverage. Fixing it here because I do not want to exclude the index type
from crash test.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/8061
Test Plan: make crash_test_with_ts
Reviewed By: jay-zhuang
Differential Revision: D27056282
Pulled By: riversand963
fbshipit-source-id: c3e00ad1023fdb9ebbdf9601ec18270c5e2925a9
Summary:
Inject the random write error to stress test, it requires set reopen=0 and disable_wal=true.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/7653
Test Plan: pass db_stress and python3 db_crashtest.py blackbox
Reviewed By: ajkr
Differential Revision: D25354132
Pulled By: zhichao-cao
fbshipit-source-id: 44721104eecb416e27f65f854912c40e301dd669
Summary:
This PR implements a fault injection mechanism for injecting errors in reads in db_stress. The FaultInjectionTestFS is used for this purpose. A thread local structure is used to track the errors, so that each db_stress thread can independently enable/disable error injection and verify observed errors against expected errors. This is initially enabled only for Get and MultiGet, but can be extended to iterator as well once its proven stable.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6538
Test Plan:
crash_test
make check
Reviewed By: riversand963
Differential Revision: D20714347
Pulled By: anand1976
fbshipit-source-id: d7598321d4a2d72bda0ced57411a337a91d87dc7
Summary:
When dynamically linking two binaries together, different builds of RocksDB from two sources might cause errors. To provide a tool for user to solve the problem, the RocksDB namespace is changed to a flag which can be overridden in build time.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6433
Test Plan: Build release, all and jtest. Try to build with ROCKSDB_NAMESPACE with another flag.
Differential Revision: D19977691
fbshipit-source-id: aa7f2d0972e1c31d75339ac48478f34f6cfcfb3e
Summary:
Currently, db_stress performs verification by calling `VerifyDb()` at the end of test and optionally before tests start. In case of corruption or incorrect result, it will be too late. This PR adds more verification in two ways.
1. For cf consistency test, each test thread takes a snapshot and verifies every N ops. N is configurable via `-verify_db_one_in`. This option is not supported in other stress tests.
2. For cf consistency test, we use another background thread in which a secondary instance periodically tails the primary (interval is configurable). We verify the secondary. Once an error is detected, we terminate the test and report. This does not affect other stress tests.
Test plan (devserver)
```
$./db_stress -test_cf_consistency -verify_db_one_in=0 -ops_per_thread=100000 -continuous_verification_interval=100
$./db_stress -test_cf_consistency -verify_db_one_in=1000 -ops_per_thread=10000 -continuous_verification_interval=0
$make crash_test
```
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6173
Differential Revision: D19047367
Pulled By: riversand963
fbshipit-source-id: aeed584ad71f9310c111445f34975e5ab47a0615
Summary:
db_stress_tool.cc now is a giant file. In order to main it easier to improve and maintain, break it down to multiple source files.
Most classes are turned into their own files. Separate .h and .cc files are created for gflag definiations. Another .h and .cc files are created for some common functions. Some test execution logic that is only loosely related to class StressTest is moved to db_stress_driver.h and db_stress_driver.cc. All the files are located under db_stress_tool/. The directory name is created as such because if we end it with either stress or test, .gitignore will ignore any file under it and makes it prone to issues in developements.
Pull Request resolved: https://github.com/facebook/rocksdb/pull/6134
Test Plan: Build under GCC7 with and without LITE on using GNU Make. Build with GCC 4.8. Build with cmake with -DWITH_TOOL=1
Differential Revision: D18876064
fbshipit-source-id: b25d0a7451840f31ac0f5ebb0068785f783fdf7d