CPU corruption injector: gdb register flip into one db_stress op #14857
Closed
hx235 wants to merge 2 commits into
added 2 commits
June 15, 2026 23:46
…acebook#14852)

Summary:
Detection layer of the CPU corruption injector (the injector itself is coming up). With `--verify_cpu_corruption_dir=<dir>`, db_stress reads back the full keyspace after every write/flush/compaction op and compares it to the expected-values model, classifying any mismatch by `kind`: `lost` / `resurrected` / `wrong-value` (silent data corruption) or `detected-corruption` (an error caught by a status or checksum). Each finding is written to `<dir>/data_corruption.<tid>.json` ({kind, cf, key, value_from_db, value_from_expected, op_status}) and routed through db_stress's standard `VerificationAbort` for a clean exit-1. A startup guard requires `--threads=1` and all fault injection off, so that the read-back is single-writer and the only corruption present is the injected one.

**Test plan:**

1. Startup guard rejects misconfiguration:
```
--threads=2 -> exit 1: "--verify_cpu_corruption_dir requires --threads=1"
--read_fault_one_in=5 -> exit 1: "requires all fault injection off"
```

2. No false positive (clean CORE preset run, no injection):
```
$ db_stress --verify_cpu_corruption_dir=<dir> --threads=1 (full protections, all *_fault_one_in=0)
...
exit 0; no data_corruption.<tid>.json produced; "Verification successful"
```

3. Write-path CPU corruption injection (coming up; e.g., gdb flips a register inside MemTable::Add), after which the immediate post-op read-back catches it. Real `<dir>/data_corruption.<tid>.json`:

Silent data corruption (the write returned OK but the key is gone on read-back):
```
{"kind":"lost","cf":0,"key":9814,"value_from_db":"","value_from_expected":"010000000504070609080B0A0D0C0F0E","op_status":"Get: NotFound"}
```

Detected corruption (the read-back Get returns Corruption via the memtable per-key checksum):
```
{"kind":"detected-corruption","cf":0,"key":139,"value_from_db":"","value_from_expected":"","op_status":"Get: Corruption: Corrupted memtable entry, per key-value checksum verification failed."}
```

Differential Revision: D107999834
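The `kind` classification described in this commit can be illustrated with a minimal Python sketch. This is not the db_stress implementation (which lives in C++); the function names `classify` and `make_finding`, the `read_error` flag, and the exact decision order are assumptions made for illustration. Only the `kind` values and the JSON field names come from the commit message.

```python
import json


def classify(db_value, expected_value, read_error=False):
    """Classify one key's read-back against the expected-values model.

    db_value / expected_value are bytes, or None when the key is absent.
    read_error is True when the read itself failed with a corruption-type
    status (hypothetical flag). Returns None when DB and model agree.
    """
    if read_error:
        return "detected-corruption"  # caught by a status/checksum
    if db_value == expected_value:
        return None                   # no mismatch
    if db_value is None:
        return "lost"                 # expected a value, DB has none
    if expected_value is None:
        return "resurrected"          # DB has a value the model deleted
    return "wrong-value"              # both present but different


def make_finding(kind, cf, key, db_value, expected_value, op_status):
    """Render one finding using the field names from the commit message."""
    def to_hex(v):
        return v.hex().upper() if v else ""

    return json.dumps(
        {
            "kind": kind,
            "cf": cf,
            "key": key,
            "value_from_db": to_hex(db_value),
            "value_from_expected": to_hex(expected_value),
            "op_status": op_status,
        },
        separators=(",", ":"),
    )
```

For example, `classify(None, expected)` yields `lost`, matching the first real finding above, where the write returned OK but the read-back reported NotFound.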
Summary:
Injection layer of the CPU corruption injector (tools/cpu_corruption_injector/injector.py). It runs inside gdb and corrupts a register by bit flip in exactly one db_stress op (i.e., write, foreground compaction, or flush) per stress test run. Detection is in db_stress (facebook#14852); orchestration is coming up.

How one run works

- The orchestration layer (coming up) randomly picks, per run, which op instance to corrupt (so corruption lands at different points in the LSM's life) and which target_fn (so there is a reasonable number of instructions to step through within a reasonable time limit); injector.py picks which instruction within target_fn.
- Attach: gdb starts with injector.py's parameters passed via -iex and the db_stress command after --args, so db_stress runs unmodified. Example:
```
gdb --batch --nx \
  -iex "py import sys; sys.argv=['injector.py','--op','write','--op_index','42','--entry_fn','rocksdb::MemTable::Add','--target_fn','rocksdb::MemTable::Add','--corruptions_per_op','1','--seed','7','--dir','<rundir>']" \
  -x tools/cpu_corruption_injector/injector.py \
  --args <db_stress> --threads=1 --verify_cpu_corruption_dir=<rundir> ...
```
- Reach the op: entry_fn is called exactly once per op in a stress test run, so the op_index-th op is the op_index-th call of entry_fn. The orchestration layer picks op_index. `injector_navigate.py` breaks on entry_fn and sets a gdb ignore-count of op_index-1 to fast-forward to the op_index-th call.
- Warm up: `injector_critical_instruction.py` chooses a "critical instruction" (one that moves key/value bytes through general-purpose or vector registers, or sets a branch flag) uniformly within the `target_fn` chosen by the orchestration layer (`target_fn` runs within `entry_fn`). To do so, it needs to approximate how many such instructions `target_fn` executes, hence the warm-up phase: it single-steps the first call of target_fn to count critical instructions and pick an index, then corrupts the instruction at that index on a later call.
- Corrupt: on a later call of target_fn, `injector_critical_instruction.py` single-steps to the m-th critical instruction and bit-flips the register through `injector_register_corruption.py`. How the register is corrupted depends on the instruction.
- Record: `injector_telemetry.py` provides telemetry that captures the corruption for later analysis.

Differential Revision: D107999835
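The arithmetic behind the "Reach the op", "Warm up", and "Corrupt" steps can be sketched in plain Python. This is a hedged illustration, not code from injector.py: the helper names (`ignore_count_for`, `pick_critical_index`, `flip_bit`, `plan_corruption`), the 64-bit register width, and the use of `random.Random(seed)` are all assumptions. Inside gdb, the ignore count would be applied to a breakpoint (gdb's Python `Breakpoint` objects expose an `ignore_count` attribute), but the sketch deliberately avoids the gdb API so it runs standalone.

```python
import random


def ignore_count_for(op_index):
    """Breakpoint hits to skip so execution stops at the op_index-th call.

    op_index is 1-based, as in the description: stopping at the 42nd call
    means ignoring the first 41 hits.
    """
    if op_index < 1:
        raise ValueError("op_index is 1-based")
    return op_index - 1


def pick_critical_index(num_critical, rng):
    """Pick a 0-based critical-instruction index uniformly at random.

    num_critical is the count obtained during the warm-up call.
    """
    if num_critical <= 0:
        raise ValueError("warm-up found no critical instructions")
    return rng.randrange(num_critical)


def flip_bit(value, bit, width=64):
    """Return value with one bit inverted, truncated to the register width."""
    if not 0 <= bit < width:
        raise ValueError("bit out of range for register width")
    return (value ^ (1 << bit)) & ((1 << width) - 1)


def plan_corruption(num_critical, seed, width=64):
    """Derive (instruction index, bit position) reproducibly from a seed."""
    rng = random.Random(seed)  # same seed gives the same plan
    return pick_critical_index(num_critical, rng), rng.randrange(width)
```

Seeding a private `random.Random` instance keeps a run reproducible from its `--seed` value without touching global random state, which is one plausible reason for passing a seed on the command line.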
@hx235 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D107999835.
✅ clang-tidy: No findings on changed lines. Completed in 98.2s.