CPU corruption injector: gdb register flip into one db_stress op#14857

Closed
hx235 wants to merge 2 commits into
facebook:mainfrom
hx235:export-D107999835

@hx235 hx235 commented Jun 16, 2026

Contributor

Summary:
Injection layer of the CPU corruption injector (tools/cpu_corruption_injector/injector.py). It runs inside gdb and corrupts a register with a bit flip in exactly one db_stress op (i.e., a write, foreground compaction, or flush) per stress test run. Detection lives in db_stress (#14852); the orchestration layer is coming up.

How one run works

  • The orchestration layer, coming up, randomly picks two things per run: which op instance to corrupt (so corruption lands at different points in the LSM's life) and which target_fn (so the function has a reasonable number of instructions to step through within a reasonable time limit). injector.py picks which instruction within target_fn.
  • Attach: gdb starts with injector.py's parameters passed via -iex and the db_stress command after --args, so db_stress runs unmodified. Example:
```
gdb --batch --nx \
  -iex "py import sys; sys.argv=['injector.py','--op','write','--op_index','42','--entry_fn','rocksdb::MemTable::Add','--target_fn','rocksdb::MemTable::Add','--corruptions_per_op','1','--seed','7','--dir','<rundir>']" \
  -x tools/cpu_corruption_injector/injector.py \
  --args <db_stress> --threads=1 --verify_cpu_corruption_dir=<rundir>  ...
```
  • Reach the op: entry_fn is called exactly once per op in a stress test run, so the op_index-th op is the op_index-th call of entry_fn. The orchestration layer picks op_index. injector_navigate.py breaks on entry_fn and sets a gdb ignore count of op_index-1 to fast-forward to the op_index-th call.

  • Warm up: injector_critical_instruction.py chooses a "critical instruction" (one that moves key/value bytes through general-purpose or vector registers, or that sets a branch flag) uniformly at random within the target_fn (itself within entry_fn) chosen by the orchestration layer. To do that, it needs an approximate count of such instructions in target_fn, hence this warm-up phase: it single-steps the first call of target_fn to count the critical instructions and pick an index, then corrupts the instruction at that index on a later call.

  • Corrupt: on a later call of target_fn, injector_critical_instruction.py single-steps to the m-th critical instruction and bit-flips the register through injector_register_corruption.py. How the register is corrupted depends on the instruction.

  • Record: injector_telemetry.py provides telemetry to capture the corruption for later analysis.
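
The navigate, warm-up, and corrupt steps above reduce to three small calculations. The following is a minimal sketch of them, not the code in injector.py: the function names (`ignore_count_for`, `pick_critical_index`, `flip_bit`), the 1-based indexing, and the 64-bit register width are assumptions made for illustration, and the real injector would drive these through gdb rather than plain Python.

```python
import random


def ignore_count_for(op_index: int) -> int:
    """Breakpoint hits to skip so execution stops on the op_index-th call (1-based)."""
    if op_index < 1:
        raise ValueError("op_index is 1-based")
    return op_index - 1


def pick_critical_index(critical_count: int, seed: int) -> int:
    """Uniformly pick a 1-based index among the critical instructions counted in warm-up."""
    if critical_count < 1:
        raise ValueError("no critical instruction was counted")
    # A private Random instance keeps the pick reproducible for a given seed.
    return random.Random(seed).randint(1, critical_count)


def flip_bit(value: int, bit: int, width: int = 64) -> int:
    """Return value with one bit flipped, truncated to a register of `width` bits."""
    if not 0 <= bit < width:
        raise ValueError("bit out of range")
    mask = (1 << width) - 1
    return (value ^ (1 << bit)) & mask
```

Seeding the pick means a run can be replayed with the same `--seed`, and flipping the same bit twice restores the original register value.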

Differential Revision: D107999835

Hui Xiao added 2 commits June 15, 2026 23:46
…acebook#14852)

Summary:

Detection layer of the CPU corruption injector (coming up). With `--verify_cpu_corruption_dir=<dir>`, db_stress reads back the full keyspace after every write/flush/compaction op and compares it to the expected-values model, classifying any mismatch by `kind`: `lost` / `resurrected` / `wrong-value` (silent data corruption) or `detected-corruption` (an error caught by a status or checksum). Each finding is written to `<dir>/data_corruption.<tid>.json` ({kind, cf, key, value_from_db, value_from_expected, op_status}) and routed through db_stress's standard `VerificationAbort` for a clean exit with status 1. A startup guard requires `--threads=1` and all fault injection off, so that the read-back is single-writer and the only corruption present is the injected one.
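
The classification above can be sketched as a small decision function. This is an illustrative reading, not db_stress's C++ implementation: the name `classify`, the use of `None` for an absent key, and the exact meaning assigned to `resurrected` and `wrong-value` (inferred from their names) are assumptions.

```python
from typing import Optional


def classify(db_value: Optional[bytes], expected: Optional[bytes],
             status_ok: bool = True) -> Optional[str]:
    """Classify one read-back against the expected value; None means no mismatch.

    None for db_value or expected stands for "key absent". status_ok is False
    when the read returned an error such as Corruption (NotFound is not an
    error here; it is represented by db_value=None).
    """
    if not status_ok:
        return "detected-corruption"
    if db_value == expected:
        return None
    if db_value is None:
        return "lost"          # expected a value, but the key is gone
    if expected is None:
        return "resurrected"   # expected no key, but a value came back
    return "wrong-value"       # both present, contents differ
```

The first three non-error outcomes are the silent ones: the op reported success, and only the read-back comparison reveals the damage.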

**Test plan:**
1. Startup guard rejects misconfiguration:
```
--threads=2           -> exit 1: "--verify_cpu_corruption_dir requires --threads=1"
--read_fault_one_in=5 -> exit 1: "requires all fault injection off"
```
2. No false positive (clean CORE preset run, no injection):
```
$ db_stress --verify_cpu_corruption_dir=<dir> --threads=1 (full protections, all *_fault_one_in=0) ...
exit 0; no data_corruption.<tid>.json produced; "Verification successful"
```
3. Write-path CPU corruption injection (coming up; e.g., gdb flips a register inside MemTable::Add): the immediate post-op read-back catches it. Real `<dir>/data_corruption.<tid>.json`:

silent data corruption -- write returned OK but the key is gone on read-back:
```
{"kind":"lost","cf":0,"key":9814,"value_from_db":"","value_from_expected":"010000000504070609080B0A0D0C0F0E","op_status":"Get: NotFound"}
```
detected corruption -- read-back Get returns Corruption via the memtable per-key checksum:
```
{"kind":"detected-corruption","cf":0,"key":139,"value_from_db":"","value_from_expected":"","op_status":"Get: Corruption: Corrupted memtable entry, per key-value checksum verification failed."}
```
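
For later analysis, findings like the two above can be tallied by `kind`. The sketch below is a hypothetical helper, not part of this change: the name `tally_findings` is invented, and it assumes each `data_corruption.<tid>.json` file holds one JSON object per non-empty line, which also covers a file containing a single finding.

```python
import json
from collections import Counter
from pathlib import Path


def tally_findings(run_dir: str) -> Counter:
    """Count findings by `kind` across all data_corruption.<tid>.json files in run_dir."""
    counts: Counter = Counter()
    for path in sorted(Path(run_dir).glob("data_corruption.*.json")):
        for line in path.read_text().splitlines():
            line = line.strip()
            if line:  # assumption: one JSON object per non-empty line
                counts[json.loads(line)["kind"]] += 1
    return counts
```

A directory with no findings yields an empty `Counter`, which matches the no-false-positive case in step 2.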

Differential Revision: D107999834
@meta-cla meta-cla Bot added the CLA Signed label Jun 16, 2026
@meta-codesync

meta-codesync Bot commented Jun 16, 2026


@hx235 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D107999835.

@github-actions


✅ clang-tidy: No findings on changed lines

Completed in 98.2s.
