CPU corruption injector: runner + db_stress preset flags (#14866) by hx235 · Pull Request #14866 · facebook/rocksdb

hx235 · 2026-06-18T22:17:43Z

Summary:

Orchestration layer of the CPU corruption injector -- the glue between detection (#14852) and injection (#14858
).

One CPU-corruption injection says little on its own. What matters is the OUTCOME DISTRIBUTION over many injections -- how often a corruption is silently absorbed (NO_EFFECT), crashes the process (CRASH), is caught by an integrity check (CORRUPTION), or more importantly slips through as a silent data corruption (SDC) and the paths frequently leading to those outcomes. A trustworthy distribution needs a (somewhat) repeatable and reproducible harness as well as a db_stress configuration in which an injected corruption is both reachable and attributable to the chosen stress test op (i.e, write, foreground compaction or flush).

Therefore this PR implements a runner that can launch N independent runs for a chosen op type (i.e, write, foreground compaction or flush). Each run picks where to inject, runs db_stress under gdb via the injector,py (#14858), and is classified into one outcome bucket (#14852).

The runner has DB_STRESS_PRESET -- the pinned db_stress config that isolates a single injected corruption (single-threaded, integrity checks on, other fault injection off, auto-compaction off). The runner also does gdb preflight that fails fast when the build or gdb cannot support injection because, for example, the hard-coded target_fn has changed its name, provides parallel launching of many runs and one summary.json per campaign (the outcome distribution plus each run's record). The whole run set is reproducible from one logged base_seed (run i uses base_seed + i).

Test plan:

Build: make DEBUG_LEVEL=0 EXTRA_CXXFLAGS="-g -fno-inline" db_stress

1. Preflight (`verify_injection_site`) catches a build that can't support injection

Before doing any work, the runner has gdb confirm — on this exact binary — that every injection-site function resolves and gdb can read its source line. A good build logs:

INFO gdb check OK for op=compaction: rocksdb::CompactionJob::Run, rocksdb::CompactionIterator::NextFromInput, rocksdb::BlockBuilder::Add

A build that cannot support injection (functions renamed, fully inlined, or absent) fails fast with exit 2 before any run — forced here by pointing --stress_cmd at a non-db_stress binary:

$ python3 tools/cpu_corruption_injector/runner.py --op compaction --runs 1 --stress_cmd /bin/ls --report_dir /tmp/icc/preflight_demo
ERROR gdb could not set a breakpoint on these functions in ls (renamed, fully inlined, or not in this build?): rocksdb::CompactionJob::Run, rocksdb::CompactionIterator::NextFromInput, rocksdb::BlockBuilder::Add
Function "rocksdb::CompactionJob::Run" not defined.
Function "rocksdb::CompactionIterator::NextFromInput" not defined.
# exit code 2

So a broken/inlined build is rejected up front instead of silently producing NO_INJECTION runs.

2. Compaction op -- 100 runs

$ python3 tools/cpu_corruption_injector/runner.py --op compaction --runs 100 --stress_cmd /data/users/huixiao/rocksdb/db_stress  --report_dir /tmp/icc/preflight_demo

Runs' outcomes (summary.json):

SDC	CORRUPTION	CRASH	NO_EFFECT	NO_INJECTION	ERROR
9	5	6	79	1	0

Spread:

target_fn x outcome: NextFromInput {SDC 9, CORRUPTION 3, CRASH 2, NO_EFFECT 34, NO_INJECTION 1}; BlockBuilder::Add {CORRUPTION 2, CRASH 4, NO_EFFECT 45}.
corruption_type x outcome: bit_flip {SDC 8, CORRUPTION 3, CRASH 6, NO_EFFECT 32}; flag_flip {SDC 1, CORRUPTION 2, NO_EFFECT 42}; lane_bit_flip {NO_EFFECT 5}.

Analysis: all 9 SDCs land on the read/iterate path (NextFromInput); corrupting the output writer (BlockBuilder::Add) never produced an SDC (its blocks are checksummed -- corruption there is caught or inert). The 5 detected CORRUPTIONs are compaction's key-order and record-count cross-checks firing (both CompactRange and CompactFiles origins appear), correctly bucketed by the fix.

A representative compaction SDC: `run_00000`

What we corrupted (inject.json):

{"op":"compaction","op_index":17,"entry_fn":"rocksdb::CompactionJob::Run","target_fn":"rocksdb::CompactionIterator::NextFromInput","injection_result":"injected","db_stress_crash_signal":null,
 "corruptions":[{"instruction":"mov    %rdx,0x160(%r12)","register":"rdx","corruption_type":"bit_flip","before":"0x10","after":"0x18",
   "details":{"source":"rocksdb::CompactionIterator::NextFromInput @ db/compaction/compaction_iterator.cc:719"}}]}

The recorded silent corruption (data_corruption.<tid>.json):

{"kind":"wrong-value","cf":0,"key":70,"value_from_db":"010000000504070609080B0A0D0C0F0E070F105E78787878","value_from_expected":"010000000504070609080B0A0D0C0F0E","op_status":"Get: OK"}

Walkthrough: a single bit flip on rdx (0x10 -> 0x18, 16 -> 24) at CompactionIterator::NextFromInput (compaction_iterator.cc:719) -- rdx holds the value LENGTH stored into the iterator's record (offset 0x160). The length reads 24 instead of 16, so compaction copies a value 8 bytes too long into the output SST, absorbing adjacent bytes. The internal key is untouched (ParseInternalKey passes); the over-long value is the single Slice fed to both the paranoid validator and the SST builder, so the file is self-consistent and every checksum agrees. On read-back Get(key=70) returns OK with the wrong bytes -- value_from_db is the expected value (...0F0E) plus 8 trailing bytes (070F105E78787878). Silent: read OK, all checks pass, the value visibly grew. classify() routes kind=wrong-value to the SDC bucket.

A representative compaction CORRUPTION (detected): `run_00007`

What we corrupted (inject.json):

{"op":"compaction","op_index":41,"entry_fn":"rocksdb::CompactionJob::Run","target_fn":"rocksdb::CompactionIterator::NextFromInput","injection_result":"injected","db_stress_crash_signal":"SIGABRT",
 "corruptions":[{"instruction":"mov    (%rbx),%rdi","register":"rdi","corruption_type":"bit_flip","before":"0x7fffeee8c1c0","after":"0x7fffeee8c1c2",
   "details":{"source":"rocksdb::IterKey::SetKeyImpl @ ./db/dbformat.h:941","call_chain":["rocksdb::IterKey::SetKeyImpl @ ./db/dbformat.h:941","rocksdb::CompactionIterator::NextFromInput @ db/compaction/compaction_iterator.cc:781"]}}]}

The recorded detection (data_corruption.<tid>.json):

{"kind":"detected-corruption","cf":-1,"key":-1,"value_from_db":"","value_from_expected":"","op_status":"compactfiles: Corruption: Compaction sees out-of-order keys."}

Walkthrough: a bit flip on rdi (...c1c0 -> ...c1c2, a key pointer) at IterKey::SetKeyImpl (dbformat.h:941), reached from NextFromInput, mis-sets the iterator's key so the next emitted key is out of order. Compaction's key-order check catches it and returns compactfiles: Corruption: Compaction sees out-of-order keys. The op then takes SIGABRT, but classify() reads the recorded data_corruption result before the crash signal, so the run is correctly bucketed CORRUPTION (the bucketization fix; pre-fix this surfaced as CRASH). classify() routes kind=detected-corruption to the CORRUPTION bucket.

3. Flush op -- 100 runs

$ python3 tools/cpu_corruption_injector/runner.py --op flush --runs 100 --stress_cmd /data/users/huixiao/rocksdb/db_stress  --report_dir /tmp/icc/preflight_demo

Runs' outcomes (summary.json):

SDC	CORRUPTION	CRASH	NO_EFFECT	NO_INJECTION	ERROR
2	12	11	74	1	0

Spread:

target_fn x outcome: BlockBuilder::Add {SDC 1, CORRUPTION 5, CRASH 5, NO_EFFECT 49}; NextFromInput {SDC 1, CORRUPTION 7, CRASH 6, NO_EFFECT 25, NO_INJECTION 1}.
corruption_type x outcome: bit_flip {SDC 2, CORRUPTION 9, CRASH 9, NO_EFFECT 32}; flag_flip {CORRUPTION 3, CRASH 2, NO_EFFECT 42}.

Analysis: flush mirrors compaction's mechanisms (shared iterator/builder). The 2 SDCs are a value/key-pointer corruption that slips past the checksums; the 12 corruptions are caught by the flush-time key-order / key-size integrity checks.

A representative flush SDC: `run_00027`

What we corrupted (inject.json):

{"op":"flush","op_index":16,"entry_fn":"rocksdb::FlushJob::Run","target_fn":"rocksdb::BlockBuilder::Add","injection_result":"injected","db_stress_crash_signal":null,
 "corruptions":[{"instruction":"mov    (%rdi),%rax","register":"rax","corruption_type":"bit_flip","before":"0x7fffef059400","after":"0x7fffef059440",
   "details":{"source":"rocksdb::Slice::data @ ./include/rocksdb/slice.h:58","call_chain":["rocksdb::Slice::data @ ./include/rocksdb/slice.h:58","rocksdb::BlockBuilder::AddWithLastKeyImpl @ table/block_based/block_builder.cc:351"]}}]}

The recorded silent corruption (data_corruption.<tid>.json):

{"kind":"lost","cf":0,"key":763,"value_from_db":"","value_from_expected":"010000000504070609080B0A0D0C0F0E","op_status":"Get: NotFound"}

Walkthrough: a bit flip on rax (a key/value data pointer, ...9400 -> ...9440) at Slice::data (slice.h:58), reached from BlockBuilder::AddWithLastKeyImpl while the flush builds the output block, makes the builder read key bytes from the wrong address, so key 763's entry is written wrong and the key is dropped from the flushed SST. On read-back Get(key=763) returns NotFound for a committed key -- silent. classify() routes kind=lost to the SDC bucket.

A representative flush CORRUPTION (detected): `run_00047`

What we corrupted (inject.json):

{"op":"flush","op_index":7,"entry_fn":"rocksdb::FlushJob::Run","target_fn":"rocksdb::CompactionIterator::NextFromInput","injection_result":"injected","db_stress_crash_signal":"SIGABRT",
 "corruptions":[{"instruction":"cmp    $0x7,%rax","register":"eflags","corruption_type":"flag_flip","before":"0x216","after":"0x256",
   "details":{"source":"rocksdb::ParseInternalKey @ ./db/dbformat.h:523","call_chain":["rocksdb::ParseInternalKey @ ./db/dbformat.h:523","rocksdb::CompactionIterator::NextFromInput @ db/compaction/compaction_iterator.cc:731"]}}]}

The recorded detection (data_corruption.<tid>.json):

{"kind":"detected-corruption","cf":-1,"key":-1,"value_from_db":"","value_from_expected":"","op_status":"flush: Corruption: Corrupted Key: Internal Key too small. Size=16. "}

Walkthrough: a flag flip (eflags 0x216 -> 0x256) on a cmp $0x7,%rax branch in ParseInternalKey (dbformat.h:523), reached from NextFromInput, makes the parser mis-judge the internal-key size, so the flush emits a malformed key and the key-size integrity check returns flush: Corruption: Corrupted Key: Internal Key too small. The op takes SIGABRT; classify() reads the recorded data_corruption before the signal and buckets CORRUPTION (bucketization fix). classify() routes kind=detected-corruption to the CORRUPTION bucket.

4. Write op (`MemTable::Add`) -- two key spaces

$ python3 tools/cpu_corruption_injector/runner.py --op write --runs 100 --stress_cmd /data/users/huixiao/rocksdb/db_stress  --report_dir /tmp/icc/preflight_demo

A write injection corrupts a single MemTable::Add (a Put/Delete/DeleteRange). The corruption is reachable and attributable, but whether it surfaces as a silent write SDC depends heavily on the key space. A silent write SDC needs the affected/mispositioned key to have other live versions to fall through to -- which only happens in a dense, multi-version memtable. We therefore run two write campaigns: the default max_key=1000, then a small max_key=8. The contrast is what motivates randomizing max_key (see PR #14867 for --randomize_stress_flags).

4a. Default `max_key=1000` -- 100 runs (no silent write SDC)

Runs' outcomes (summary.json):

SDC	CORRUPTION	CRASH	NO_EFFECT	NO_INJECTION	ERROR
0	31	13	56	0	0

With a 1000-key space almost every write touches a distinct key, so a corrupted entry has no older live version to mask it: a value/key byte flip is caught at write by the per-key checksum (VerifyEncodedEntry -> CORRUPTION), and a structural flip tends to crash (CRASH) rather than silently mis-read. ERROR=0, NO_INJECTION=0. No write op silently corrupted data -- every reachable corruption was caught or crashed.

4b. Small `max_key=8` -- 100 runs (surfaces 2 silent write SDCs)

Runs' outcomes (summary.json):

SDC	CORRUPTION	CRASH	NO_EFFECT	NO_INJECTION	ERROR
2	33	8	57	0	0

Shrinking the key space makes each key hold ~125 versions (ops_per_thread / max_key), so a misplaced entry can fall through to an older version of the same key and be returned silently -- the per-key checksum (bytes intact) and on-seek verify cannot see a pure link-position error.

A representative write SDC: `run_00028` (skiplist misposition -> silent stale read, flush catches)

What we corrupted (inject.json):

{"op":"write","op_index":317,"entry_fn":"rocksdb::MemTable::Add","target_fn":"rocksdb::MemTable::Add","injection_result":"injected","db_stress_crash_signal":null,
 "corruptions":[{"instruction":"cmp    %rbx,-0xb8(%rbp)","register":"eflags","corruption_type":"flag_flip","before":"0x216","after":"0x217",
   "details":{"source":"rocksdb::MemTable::Add @ db/memtable.cc:1319",
              "call_chain":["rocksdb::MemTable::Add @ db/memtable.cc:1319"]}}]}

The recorded silent corruption (data_corruption.<tid>.json):

{"kind":"resurrected","cf":0,"key":1,"value_from_db":"110000001514171619181B1A1D1C1F1E0100030205040706","value_from_expected":"","op_status":"Get: OK"}

Walkthrough: a flag flip (CF, eflags 0x216 -> 0x217) on the cmp that produces KeyIsAfterNode inside InlineSkipList::Insert (inlineskiplist.h:1253; the @ memtable.cc:1319 in the record is inlining line-drift) inverts the key comparison, so the Delete tombstone for key 1 is linked at the wrong position. The stored bytes and per-key checksum are intact, so neither the checksum nor on-seek verify sees anything wrong -- on read-back Get(key=1) returns OK with key 1's live value for a key that was Deleted (kind=resurrected, silent). A follow-up Flush() in unit test repro does catch it: the full-scan order check returns Corruption: Out-of-order keys found in skiplist -- caught only after the silent read, not during it.

A representative write CORRUPTION (detected) `max_key=1000 or 8` : `run_00018`

Where run_00028's pure link-position error is invisible to the per-key checksum, this run shows a byte-level corruption that the checksum catches at write time. What we corrupted (inject.json):

{"op":"write","op_index":106,"entry_fn":"rocksdb::MemTable::Add","target_fn":"rocksdb::MemTable::Add","injection_result":"injected","db_stress_crash_signal":null,
 "corruptions":[{"instruction":"mov    %rsi,(%rdi)","register":"rsi","corruption_type":"bit_flip","before":"0x7fffeec2a21c","after":"0x7fffeec2a25c",
   "details":{"source":"rocksdb::Slice::Slice @ ./include/rocksdb/slice.h:39",
              "call_chain":["rocksdb::Slice::Slice @ ./include/rocksdb/slice.h:39","rocksdb::GetVarint32 @ ./util/coding.h:280","rocksdb::MemTable::VerifyEncodedEntry @ db/memtable.cc:1102","rocksdb::MemTable::Add @ db/memtable.cc:1189"]}}]}

The recorded detection (data_corruption.<tid>.json):

{"kind":"detected-corruption","cf":-1,"key":-1,"value_from_db":"","value_from_expected":"","op_status":"put: Corruption: ProtectionInfo mismatch"}

Walkthrough: a bit flip on rsi (0x7fffeec2a21c -> 0x7fffeec2a25c) at Slice::Slice (slice.h:39) while MemTable::Add re-parses the just-encoded entry through VerifyEncodedEntry (memtable.cc:1102) corrupts the Slice the verifier reads, so the recomputed per-key protection info no longer matches and the put returns Corruption: ProtectionInfo mismatch.

Reviewed By: pdillinger

Differential Revision: D108367345

meta-codesync · 2026-06-18T22:17:51Z

@hx235 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D108367345.

github-actions · 2026-06-18T22:21:03Z

✅ clang-tidy: No findings on changed lines

Completed in 0.0s.

…ook#14852) Summary: Detection layer of the CPU corruption injector (facebook#14858). With `--verify_cpu_corruption_dir=<dir>`, db_stress reads back the full keyspace after every write/manual flush/manual compaction op and compares it to the expected-values model, classifying any mismatch by `kind`: `lost` / `resurrected` / `wrong-value` (silent data corruption) or `detected-corruption` (a status/checksum-caught error). Each finding is written to `<dir>/data_corruption.<tid>.json` ({kind, cf, key, value_from_db, value_from_expected, op_status}) and routed through db_stress's standard `VerificationAbort` for a clean exit-1. A startup guard requires `--threads=1` and all fault injection off so the read-back is single-writer and the only corruption present is the injected one Bonus: a minor refactoring into the surrounding error handling code in these ops **Test plan:** 1.Startup guard rejects misconfiguration: ``` --threads=2 -> exit 1: "--verify_cpu_corruption_dir requires --threads=1" --read_fault_one_in=5 -> exit 1: "requires all fault injection off" ``` 2.No false positive (clean CORE preset run, no injection): ``` $ db_stress --verify_cpu_corruption_dir=<dir> --threads=1 (full protections, all *_fault_one_in=0) ... exit 0; no data_corruption.<tid>.json produced; "Verification successful" ``` 3.Write-path cpu corruption injection (coming up, e.g, gdb flips a register inside MemTable::Add), then the immediate post-op read-back catches it. Real `<dir>/data_corruption.<tid>.json`: silent data corruption -- write returned OK but the key is gone on read-back: ``` {"kind":"lost","cf":0,"key":9814,"value_from_db":"","value_from_expected":"010000000504070609080B0A0D0C0F0E","op_status":"Get: NotFound"} ``` detected corruption -- read-back Get returns Corruption via the memtable per-key checksum: ``` {"kind":"detected-corruption","cf":0,"key":139,"value_from_db":"","value_from_expected":"","op_status":"Get: Corruption: Corrupted memtable entry, per key-value checksum verification failed." ``` 4.See PR facebook#14866 test plan's spread in the outcome for verification of detection Differential Revision: D107999834

…ook#14852) Summary: Detection layer of the CPU corruption injector (facebook#14858). With `--verify_cpu_corruption_dir=<dir>`, db_stress reads back the full keyspace after every write/manual flush/manual compaction op and compares it to the expected-values model, classifying any mismatch by `kind`: `lost` / `resurrected` / `wrong-value` (silent data corruption) or `detected-corruption` (a status/checksum-caught error). Each finding is written to `<dir>/data_corruption.<tid>.json` ({kind, cf, key, value_from_db, value_from_expected, op_status}) and routed through db_stress's standard `VerificationAbort` for a clean exit-1. A startup guard requires `--threads=1` and all fault injection off so the read-back is single-writer and the only corruption present is the injected one Bonus: a minor refactoring into the surrounding error handling code in these ops **Test plan:** 1.Startup guard rejects misconfiguration: ``` --threads=2 -> exit 1: "--verify_cpu_corruption_dir requires --threads=1" --read_fault_one_in=5 -> exit 1: "requires all fault injection off" ``` 2.No false positive (clean CORE preset run, no injection): ``` $ db_stress --verify_cpu_corruption_dir=<dir> --threads=1 (full protections, all *_fault_one_in=0) ... exit 0; no data_corruption.<tid>.json produced; "Verification successful" ``` 3.Write-path cpu corruption injection (coming up, e.g, gdb flips a register inside MemTable::Add), then the immediate post-op read-back catches it. Real `<dir>/data_corruption.<tid>.json`: silent data corruption -- write returned OK but the key is gone on read-back: ``` {"kind":"lost","cf":0,"key":9814,"value_from_db":"","value_from_expected":"010000000504070609080B0A0D0C0F0E","op_status":"Get: NotFound"} ``` detected corruption -- read-back Get returns Corruption via the memtable per-key checksum: ``` {"kind":"detected-corruption","cf":0,"key":139,"value_from_db":"","value_from_expected":"","op_status":"Get: Corruption: Corrupted memtable entry, per key-value checksum verification failed." ``` 4.See PR facebook#14866 test plan's spread in the outcome for verification of detection Reviewed By: pdillinger Differential Revision: D107999834

Summary: Pull Request resolved: #14852 Detection layer of the CPU corruption injector (#14858). With `--verify_cpu_corruption_dir=<dir>`, db_stress reads back the full keyspace after every write/manual flush/manual compaction op and compares it to the expected-values model, classifying any mismatch by `kind`: `lost` / `resurrected` / `wrong-value` (silent data corruption) or `detected-corruption` (a status/checksum-caught error). Each finding is written to `<dir>/data_corruption.<tid>.json` ({kind, cf, key, value_from_db, value_from_expected, op_status}) and routed through db_stress's standard `VerificationAbort` for a clean exit-1. A startup guard requires `--threads=1` and all fault injection off so the read-back is single-writer and the only corruption present is the injected one Bonus: a minor refactoring into the surrounding error handling code in these ops **Test plan:** 1.Startup guard rejects misconfiguration: ``` --threads=2 -> exit 1: "--verify_cpu_corruption_dir requires --threads=1" --read_fault_one_in=5 -> exit 1: "requires all fault injection off" ``` 2.No false positive (clean CORE preset run, no injection): ``` $ db_stress --verify_cpu_corruption_dir=<dir> --threads=1 (full protections, all *_fault_one_in=0) ... exit 0; no data_corruption.<tid>.json produced; "Verification successful" ``` 3.Write-path cpu corruption injection (coming up, e.g, gdb flips a register inside MemTable::Add), then the immediate post-op read-back catches it. Real `<dir>/data_corruption.<tid>.json`: silent data corruption -- write returned OK but the key is gone on read-back: ``` {"kind":"lost","cf":0,"key":9814,"value_from_db":"","value_from_expected":"010000000504070609080B0A0D0C0F0E","op_status":"Get: NotFound"} ``` detected corruption -- read-back Get returns Corruption via the memtable per-key checksum: ``` {"kind":"detected-corruption","cf":0,"key":139,"value_from_db":"","value_from_expected":"","op_status":"Get: Corruption: Corrupted memtable entry, per key-value checksum verification failed." ``` 4.See PR #14866 test plan's spread in the outcome for verification of detection Reviewed By: pdillinger Differential Revision: D107999834 fbshipit-source-id: 18decbf51dc56eec62b735136e2dfb4e1175b773

Summary: Orchestration layer of the CPU corruption injector -- the glue between detection (#14852) and injection (#14858 ). One CPU-corruption injection says little on its own. What matters is the OUTCOME DISTRIBUTION over many injections -- how often a corruption is silently absorbed (NO_EFFECT), crashes the process (CRASH), is caught by an integrity check (CORRUPTION), or more importantly slips through as a silent data corruption (SDC) and the paths frequently leading to those outcomes. A trustworthy distribution needs a (somewhat) repeatable and reproducible harness as well as a db_stress configuration in which an injected corruption is both reachable and attributable to the chosen stress test op (i.e, write, foreground compaction or flush). Therefore this PR implements a runner that can launch N independent runs for a chosen op type (i.e, write, foreground compaction or flush). Each run picks where to inject, runs db_stress under gdb via the `injector,py` (#14858), and is classified into one outcome bucket (#14852). The runner has DB_STRESS_PRESET -- the pinned db_stress config that isolates a single injected corruption (single-threaded, integrity checks on, other fault injection off, auto-compaction off). The runner also does gdb preflight that fails fast when the build or gdb cannot support injection because, for example, the hard-coded `target_fn` has changed its name, provides parallel launching of many runs and one summary.json per campaign (the outcome distribution plus each run's record). The whole run set is reproducible from one logged base_seed (run i uses base_seed + i). **Test plan:** Build: `make DEBUG_LEVEL=0 EXTRA_CXXFLAGS="-g -fno-inline" db_stress` # 1. Preflight (`verify_injection_site`) catches a build that can't support injection Before doing any work, the runner has gdb confirm — on this exact binary — that every injection-site function resolves and gdb can read its source line. A good build logs: ``` INFO gdb check OK for op=compaction: rocksdb::CompactionJob::Run, rocksdb::CompactionIterator::NextFromInput, rocksdb::BlockBuilder::Add ``` A build that cannot support injection (functions renamed, fully inlined, or absent) fails fast with exit 2 before any run — forced here by pointing `--stress_cmd` at a non-db_stress binary: ``` $ python3 tools/cpu_corruption_injector/runner.py --op compaction --runs 1 --stress_cmd /bin/ls --report_dir /tmp/icc/preflight_demo ERROR gdb could not set a breakpoint on these functions in ls (renamed, fully inlined, or not in this build?): rocksdb::CompactionJob::Run, rocksdb::CompactionIterator::NextFromInput, rocksdb::BlockBuilder::Add Function "rocksdb::CompactionJob::Run" not defined. Function "rocksdb::CompactionIterator::NextFromInput" not defined. # exit code 2 ``` So a broken/inlined build is rejected up front instead of silently producing `NO_INJECTION` runs. # 2. Compaction op -- 100 runs ``` $ python3 tools/cpu_corruption_injector/runner.py --op compaction --runs 100 --stress_cmd /data/users/huixiao/rocksdb/db_stress --report_dir /tmp/icc/preflight_demo ``` **Runs' outcomes (`summary.json`):** | SDC | CORRUPTION | CRASH | NO_EFFECT | NO_INJECTION | ERROR | | --- | --- | --- | --- | --- | --- | | 9 | 5 | 6 | 79 | 1 | 0 | **Spread:** - target_fn x outcome: `NextFromInput` {SDC 9, CORRUPTION 3, CRASH 2, NO_EFFECT 34, NO_INJECTION 1}; `BlockBuilder::Add` {CORRUPTION 2, CRASH 4, NO_EFFECT 45}. - corruption_type x outcome: `bit_flip` {SDC 8, CORRUPTION 3, CRASH 6, NO_EFFECT 32}; `flag_flip` {SDC 1, CORRUPTION 2, NO_EFFECT 42}; `lane_bit_flip` {NO_EFFECT 5}. **Analysis:** all 9 SDCs land on the read/iterate path (`NextFromInput`); corrupting the output writer (`BlockBuilder::Add`) never produced an SDC (its blocks are checksummed -- corruption there is caught or inert). The 5 detected `CORRUPTION`s are compaction's key-order and record-count cross-checks firing (both CompactRange and CompactFiles origins appear), correctly bucketed by the fix. ### A representative compaction SDC: `run_00000` What we corrupted (`inject.json`): ```json {"op":"compaction","op_index":17,"entry_fn":"rocksdb::CompactionJob::Run","target_fn":"rocksdb::CompactionIterator::NextFromInput","injection_result":"injected","db_stress_crash_signal":null, "corruptions":[{"instruction":"mov %rdx,0x160(%r12)","register":"rdx","corruption_type":"bit_flip","before":"0x10","after":"0x18", "details":{"source":"rocksdb::CompactionIterator::NextFromInput @ db/compaction/compaction_iterator.cc:719"}}]} ``` The recorded silent corruption (`data_corruption.<tid>.json`): ```json {"kind":"wrong-value","cf":0,"key":70,"value_from_db":"010000000504070609080B0A0D0C0F0E070F105E78787878","value_from_expected":"010000000504070609080B0A0D0C0F0E","op_status":"Get: OK"} ``` **Walkthrough:** a single bit flip on `rdx` (`0x10 -> 0x18`, 16 -> 24) at `CompactionIterator::NextFromInput` (`compaction_iterator.cc:719`) -- `rdx` holds the value LENGTH stored into the iterator's record (offset `0x160`). The length reads 24 instead of 16, so compaction copies a value 8 bytes too long into the output SST, absorbing adjacent bytes. The internal key is untouched (`ParseInternalKey` passes); the over-long value is the single Slice fed to both the paranoid validator and the SST builder, so the file is self-consistent and every checksum agrees. On read-back `Get(key=70)` returns OK with the wrong bytes -- `value_from_db` is the expected value (`...0F0E`) **plus 8 trailing bytes** (`070F105E78787878`). Silent: read OK, all checks pass, the value visibly grew. `classify()` routes `kind=wrong-value` to the SDC bucket. ### A representative compaction CORRUPTION (detected): `run_00007` What we corrupted (`inject.json`): ```json {"op":"compaction","op_index":41,"entry_fn":"rocksdb::CompactionJob::Run","target_fn":"rocksdb::CompactionIterator::NextFromInput","injection_result":"injected","db_stress_crash_signal":"SIGABRT", "corruptions":[{"instruction":"mov (%rbx),%rdi","register":"rdi","corruption_type":"bit_flip","before":"0x7fffeee8c1c0","after":"0x7fffeee8c1c2", "details":{"source":"rocksdb::IterKey::SetKeyImpl @ ./db/dbformat.h:941","call_chain":["rocksdb::IterKey::SetKeyImpl @ ./db/dbformat.h:941","rocksdb::CompactionIterator::NextFromInput @ db/compaction/compaction_iterator.cc:781"]}}]} ``` The recorded detection (`data_corruption.<tid>.json`): ```json {"kind":"detected-corruption","cf":-1,"key":-1,"value_from_db":"","value_from_expected":"","op_status":"compactfiles: Corruption: Compaction sees out-of-order keys."} ``` **Walkthrough:** a bit flip on `rdi` (`...c1c0 -> ...c1c2`, a key pointer) at `IterKey::SetKeyImpl` (`dbformat.h:941`), reached from `NextFromInput`, mis-sets the iterator's key so the next emitted key is out of order. Compaction's key-order check catches it and returns `compactfiles: Corruption: Compaction sees out-of-order keys`. The op then takes `SIGABRT`, but `classify()` reads the recorded `data_corruption` result before the crash signal, so the run is correctly bucketed `CORRUPTION` (the bucketization fix; pre-fix this surfaced as `CRASH`). `classify()` routes `kind=detected-corruption` to the `CORRUPTION` bucket. # 3. Flush op -- 100 runs ``` $ python3 tools/cpu_corruption_injector/runner.py --op flush --runs 100 --stress_cmd /data/users/huixiao/rocksdb/db_stress --report_dir /tmp/icc/preflight_demo ``` **Runs' outcomes (`summary.json`):** | SDC | CORRUPTION | CRASH | NO_EFFECT | NO_INJECTION | ERROR | | --- | --- | --- | --- | --- | --- | | 2 | 12 | 11 | 74 | 1 | 0 | **Spread:** - target_fn x outcome: `BlockBuilder::Add` {SDC 1, CORRUPTION 5, CRASH 5, NO_EFFECT 49}; `NextFromInput` {SDC 1, CORRUPTION 7, CRASH 6, NO_EFFECT 25, NO_INJECTION 1}. - corruption_type x outcome: `bit_flip` {SDC 2, CORRUPTION 9, CRASH 9, NO_EFFECT 32}; `flag_flip` {CORRUPTION 3, CRASH 2, NO_EFFECT 42}. **Analysis:** flush mirrors compaction's mechanisms (shared iterator/builder). The 2 SDCs are a value/key-pointer corruption that slips past the checksums; the 12 corruptions are caught by the flush-time key-order / key-size integrity checks. ### A representative flush SDC: `run_00027` What we corrupted (`inject.json`): ```json {"op":"flush","op_index":16,"entry_fn":"rocksdb::FlushJob::Run","target_fn":"rocksdb::BlockBuilder::Add","injection_result":"injected","db_stress_crash_signal":null, "corruptions":[{"instruction":"mov (%rdi),%rax","register":"rax","corruption_type":"bit_flip","before":"0x7fffef059400","after":"0x7fffef059440", "details":{"source":"rocksdb::Slice::data @ ./include/rocksdb/slice.h:58","call_chain":["rocksdb::Slice::data @ ./include/rocksdb/slice.h:58","rocksdb::BlockBuilder::AddWithLastKeyImpl @ table/block_based/block_builder.cc:351"]}}]} ``` The recorded silent corruption (`data_corruption.<tid>.json`): ```json {"kind":"lost","cf":0,"key":763,"value_from_db":"","value_from_expected":"010000000504070609080B0A0D0C0F0E","op_status":"Get: NotFound"} ``` **Walkthrough:** a bit flip on `rax` (a key/value data pointer, `...9400 -> ...9440`) at `Slice::data` (`slice.h:58`), reached from `BlockBuilder::AddWithLastKeyImpl` while the flush builds the output block, makes the builder read key bytes from the wrong address, so key 763's entry is written wrong and the key is dropped from the flushed SST. On read-back `Get(key=763)` returns `NotFound` for a committed key -- silent. `classify()` routes `kind=lost` to the SDC bucket. ### A representative flush CORRUPTION (detected): `run_00047` What we corrupted (`inject.json`): ```json {"op":"flush","op_index":7,"entry_fn":"rocksdb::FlushJob::Run","target_fn":"rocksdb::CompactionIterator::NextFromInput","injection_result":"injected","db_stress_crash_signal":"SIGABRT", "corruptions":[{"instruction":"cmp $0x7,%rax","register":"eflags","corruption_type":"flag_flip","before":"0x216","after":"0x256", "details":{"source":"rocksdb::ParseInternalKey @ ./db/dbformat.h:523","call_chain":["rocksdb::ParseInternalKey @ ./db/dbformat.h:523","rocksdb::CompactionIterator::NextFromInput @ db/compaction/compaction_iterator.cc:731"]}}]} ``` The recorded detection (`data_corruption.<tid>.json`): ```json {"kind":"detected-corruption","cf":-1,"key":-1,"value_from_db":"","value_from_expected":"","op_status":"flush: Corruption: Corrupted Key: Internal Key too small. Size=16. "} ``` **Walkthrough:** a flag flip (`eflags 0x216 -> 0x256`) on a `cmp $0x7,%rax` branch in `ParseInternalKey` (`dbformat.h:523`), reached from `NextFromInput`, makes the parser mis-judge the internal-key size, so the flush emits a malformed key and the key-size integrity check returns `flush: Corruption: Corrupted Key: Internal Key too small`. The op takes `SIGABRT`; `classify()` reads the recorded `data_corruption` before the signal and buckets `CORRUPTION` (bucketization fix). `classify()` routes `kind=detected-corruption` to the `CORRUPTION` bucket. # 4. Write op (`MemTable::Add`) -- two key spaces ``` $ python3 tools/cpu_corruption_injector/runner.py --op write --runs 100 --stress_cmd /data/users/huixiao/rocksdb/db_stress --report_dir /tmp/icc/preflight_demo ``` A write injection corrupts a single `MemTable::Add` (a Put/Delete/DeleteRange). The corruption is reachable and attributable, but whether it surfaces as a *silent* write SDC depends heavily on the key space. A silent write SDC needs the affected/mispositioned key to have other live versions to fall through to -- which only happens in a dense, multi-version memtable. We therefore run two write campaigns: the default `max_key=1000`, then a small `max_key=8`. The contrast is what motivates randomizing `max_key` (see PR #14867 for `--randomize_stress_flags`). ### 4a. Default `max_key=1000` -- 100 runs (no silent write SDC) **Runs' outcomes (`summary.json`):** | SDC | CORRUPTION | CRASH | NO_EFFECT | NO_INJECTION | ERROR | | --- | --- | --- | --- | --- | --- | | 0 | 31 | 13 | 56 | 0 | 0 | With a 1000-key space almost every write touches a distinct key, so a corrupted entry has no older live version to mask it: a value/key byte flip is caught at write by the per-key checksum (`VerifyEncodedEntry` -> `CORRUPTION`), and a structural flip tends to crash (`CRASH`) rather than silently mis-read. `ERROR=0`, `NO_INJECTION=0`. No write op silently corrupted data -- every reachable corruption was caught or crashed. ### 4b. Small `max_key=8` -- 100 runs (surfaces 2 silent write SDCs) **Runs' outcomes (`summary.json`):** | SDC | CORRUPTION | CRASH | NO_EFFECT | NO_INJECTION | ERROR | | --- | --- | --- | --- | --- | --- | | 2 | 33 | 8 | 57 | 0 | 0 | Shrinking the key space makes each key hold ~125 versions (`ops_per_thread` / `max_key`), so a misplaced entry can fall through to an older version of the *same* key and be returned silently -- the per-key checksum (bytes intact) and on-seek verify cannot see a pure link-position error. ### A representative write SDC: `run_00028` (skiplist misposition -> silent stale read, flush catches) What we corrupted (`inject.json`): ```json {"op":"write","op_index":317,"entry_fn":"rocksdb::MemTable::Add","target_fn":"rocksdb::MemTable::Add","injection_result":"injected","db_stress_crash_signal":null, "corruptions":[{"instruction":"cmp %rbx,-0xb8(%rbp)","register":"eflags","corruption_type":"flag_flip","before":"0x216","after":"0x217", "details":{"source":"rocksdb::MemTable::Add @ db/memtable.cc:1319", "call_chain":["rocksdb::MemTable::Add @ db/memtable.cc:1319"]}}]} ``` The recorded silent corruption (`data_corruption.<tid>.json`): ```json {"kind":"resurrected","cf":0,"key":1,"value_from_db":"110000001514171619181B1A1D1C1F1E0100030205040706","value_from_expected":"","op_status":"Get: OK"} ``` **Walkthrough:** a flag flip (CF, `eflags 0x216 -> 0x217`) on the `cmp` that produces `KeyIsAfterNode` inside `InlineSkipList::Insert` (`inlineskiplist.h:1253`; the `@ memtable.cc:1319` in the record is inlining line-drift) inverts the key comparison, so the Delete tombstone for key 1 is linked at the wrong position. The stored bytes and per-key checksum are intact, so neither the checksum nor on-seek verify sees anything wrong -- on read-back `Get(key=1)` returns OK with key 1's live value for a key that was Deleted (`kind=resurrected`, silent). A follow-up `Flush()` in unit test repro *does* catch it: the full-scan order check returns `Corruption: Out-of-order keys found in skiplist` -- caught only after the silent read, not during it. ### A representative write CORRUPTION (detected) `max_key=1000 or 8` : `run_00018` Where `run_00028`'s pure link-position error is invisible to the per-key checksum, this run shows a byte-level corruption that the checksum *catches* at write time. What we corrupted (`inject.json`): ```json {"op":"write","op_index":106,"entry_fn":"rocksdb::MemTable::Add","target_fn":"rocksdb::MemTable::Add","injection_result":"injected","db_stress_crash_signal":null, "corruptions":[{"instruction":"mov %rsi,(%rdi)","register":"rsi","corruption_type":"bit_flip","before":"0x7fffeec2a21c","after":"0x7fffeec2a25c", "details":{"source":"rocksdb::Slice::Slice @ ./include/rocksdb/slice.h:39", "call_chain":["rocksdb::Slice::Slice @ ./include/rocksdb/slice.h:39","rocksdb::GetVarint32 @ ./util/coding.h:280","rocksdb::MemTable::VerifyEncodedEntry @ db/memtable.cc:1102","rocksdb::MemTable::Add @ db/memtable.cc:1189"]}}]} ``` The recorded detection (`data_corruption.<tid>.json`): ```json {"kind":"detected-corruption","cf":-1,"key":-1,"value_from_db":"","value_from_expected":"","op_status":"put: Corruption: ProtectionInfo mismatch"} ``` **Walkthrough:** a bit flip on `rsi` (`0x7fffeec2a21c -> 0x7fffeec2a25c`) at `Slice::Slice` (`slice.h:39`) while `MemTable::Add` re-parses the just-encoded entry through `VerifyEncodedEntry` (`memtable.cc:1102`) corrupts the Slice the verifier reads, so the recomputed per-key protection info no longer matches and the put returns `Corruption: ProtectionInfo mismatch`. Reviewed By: pdillinger Differential Revision: D108367345

github-actions · 2026-07-01T03:30:32Z

Codex Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

🟡 Codex Code Review

Auto-triggered after CI reached the early-review threshold — reviewing commit f2b6fbf

❌ Codex review failed before producing findings.

WARNING: proceeding, even though we could not create PATH aliases: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

Codex may miss context from files not in the diff
Large PRs may be truncated
Always apply human judgment to AI suggestions

Commands:

/codex-review [context] — Request a code review
/codex-query <question> — Ask about the PR or codebase

github-actions · 2026-07-01T03:36:14Z

Claude Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

✅ Claude Code Review

Auto-triggered after CI reached the early-review threshold — reviewing commit f2b6fbf

Summary

Well-structured orchestration layer for the CPU corruption injector. The module decomposition is clean, the DB_STRESS_PRESET is thorough, and the classification logic correctly handles all injector outcomes. The code follows existing RocksDB Python tooling patterns (e.g., db_crashtest.py's seed generation).

High-severity findings (0):

No high-severity findings.

Full review (click to expand)

Findings

🔴 HIGH

None.

🟡 MEDIUM

M1. `_data_corruption_kind()` uses `assert` for a runtime invariant — `runner_execute.py:107`

Issue: assert len(names) <= 1 will be silently skipped under python3 -O. If db_stress somehow writes two data_corruption.*.json files (e.g., a race in a future multi-thread extension), this would silently pick the first file instead of failing loudly.
Root cause: assert is a debug check, but this invariant guards data correctness (wrong classification if multiple files exist).
Suggested fix: Replace with an explicit if len(names) > 1: raise RuntimeError(...) or log a warning and pick the first. Alternatively, since --threads=1 is enforced, the invariant is guaranteed today — but a defensive check is safer.

M2. `_data_corruption_kind()` can `KeyError` on unknown `kind` values — `runner_execute.py:86` via `classify()` line 93

Issue: classify() does return _CORRUPTION_BUCKET[kind] without checking that kind is a key in _CORRUPTION_BUCKET. If db_stress ever adds a new kind value (e.g., a new corruption type), this raises an unhandled KeyError, which the broad except Exception in execute_all() catches and maps to ERROR — but the error message would be cryptic.
Root cause: No defensive check for unknown kind values.
Suggested fix: Use _CORRUPTION_BUCKET.get(kind) with a fallback, or log a clear error for unknown kinds.

M3. `InjectionSite` is `frozen=True` but contains a mutable `list[str]` for `target_fns` — `runner_build.py:14`

Issue: @dataclass(frozen=True) prevents reassignment of fields but does NOT prevent mutation of the list (e.g., INJECTION_SITES["write"].target_fns.append("foo")). This is a minor immutability hole.
Root cause: Python's frozen dataclass only prevents attribute reassignment, not deep immutability.
Suggested fix: Use tuple[str, ...] instead of list[str] for target_fns to make it truly immutable.

🟢 LOW / NIT

L1. `report()` return value is unused — `runner_report.py:23` / `runner.py:63`

Issue: report() returns a dict but runner.py:main() ignores the return value.
Suggested fix: Either use the return value or change the return type to None.

L2. Logging setup replaces root logger handlers — `runner.py:122`

Issue: root_logger.handlers = [console, log_file] replaces ALL existing handlers on the root logger. Standard for a standalone CLI but could surprise library callers.
Suggested fix: Use root_logger.addHandler() instead, or scope to the "runner" logger.

L3. `_get_run_timeout_sec()` doesn't validate non-numeric `ICC_RUN_TIMEOUT` — `runner_execute.py:28`

Issue: A non-numeric ICC_RUN_TIMEOUT env var raises ValueError with no context.
Suggested fix: Add a try/except around float(...) with a descriptive error message.

L4. No unit tests for the Python runner code

Issue: No unit tests for classify(), build_runs(), op_count_estimate(). These are pure functions amenable to testing.
Suggested fix: Consider adding unit tests, especially for classify().

Cross-Component Analysis

Context	Applies?	Assessment
injection_result values	YES	classify() handles all 5 values from injector.py correctly.
data_corruption kind values	YES	_CORRUPTION_BUCKET maps all 4 kinds from db_stress correctly.
DB_STRESS_PRESET vs startup validation	YES	Percent sum = 100. All startup checks pass.
Process lifecycle	YES	start_new_session + os.killpg + proc.wait() — no zombie risk.
Seed reproducibility	YES	(base_seed + index) % 2**64 wraps correctly. Deterministic per run.

Positive Observations

Clean module decomposition: build/execute/report separation with single responsibilities.
Robust process management: Correct pattern for reliably cleaning up gdb+db_stress process groups.
Thorough DB_STRESS_PRESET: Well-documented 4-group config with load-bearing comments.
Consistent with codebase: Seed generation matches db_crashtest.py.
Good preflight: verify_injection_site() catches build/gdb incompatibilities before wasting time.
Excellent README: Thorough documentation with concrete 3-layer examples.

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

Claude may miss context from files not in the diff
Large PRs may be truncated
Always apply human judgment to AI suggestions

Commands:

/claude-review [context] — Request a code review
/claude-query <question> — Ask about the PR or codebase

meta-codesync · 2026-07-01T03:40:25Z

This pull request has been merged in eddfee8.

github-actions · 2026-07-01T03:40:26Z

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit f2b6fbf

❌ Codex review failed before producing findings.

WARNING: proceeding, even though we could not create PATH aliases: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

Codex may miss context from files not in the diff
Large PRs may be truncated
Always apply human judgment to AI suggestions

Commands:

/codex-review [context] — Request a code review
/codex-query <question> — Ask about the PR or codebase

github-actions · 2026-07-01T03:48:05Z

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit f2b6fbf

Summary

Well-structured orchestration layer with clean module separation, good reproducibility design, and thorough preflight validation. The code follows existing injector patterns and is well-documented.

High-severity findings (2):

[runner_execute.py:classify] _CORRUPTION_BUCKET[kind] will raise KeyError if db_stress ever emits an unknown kind value, crashing the classifier instead of gracefully handling it.
[runner_build.py:build_runs] Stale data_corruption.*.json from a previous campaign in a reused --report_dir will be picked up by _data_corruption_kind(), causing misclassification of runs.

Full review (click to expand)

Findings

🔴 HIGH

H1. Unknown `kind` in data_corruption.json causes KeyError -- `runner_execute.py:classify`

Issue: classify() calls _CORRUPTION_BUCKET[kind] where kind comes from db_stress's data_corruption.*.json. If db_stress ever adds a new kind value (or writes an unexpected one due to a bug), this raises an unhandled KeyError, which propagates up to execute_all() and marks the run as ERROR -- losing the actual outcome.
Root cause: No .get() fallback or unknown-kind handling in the _CORRUPTION_BUCKET lookup.
Suggested fix: Use _CORRUPTION_BUCKET.get(kind) and fall back to a sensible default (e.g., "CORRUPTION" for unknown kinds, or log a warning and return "ERROR").

H2. Stale files from reused `--report_dir` cause misclassification -- `runner_build.py:build_runs` / `runner_execute.py:_data_corruption_kind`

Issue: build_runs() creates run_XXXXX/ directories with exist_ok=True. If a user reruns with the same --report_dir, stale data_corruption.*.json files from the previous campaign survive in the reused run_XXXXX/ directories. _data_corruption_kind() will find these stale files and misclassify the new run (e.g., a NO_EFFECT run becomes SDC).
Root cause: No cleanup of prior artifacts in run directories before a new campaign.
Suggested fix: Either (a) clean data_corruption.*.json and inject.json from each run_dir before launching, or (b) refuse to reuse a non-empty --report_dir, or (c) document that --report_dir must be fresh.

🟡 MEDIUM

M1. `random.randint(1, 2**64)` produces value outside uint64 range -- `runner.py:37`

Issue: random.randint(a, b) is inclusive on both ends. 2**64 = 18446744073709551616 is one beyond the uint64 max. When base_seed = 2**64, the seed passed to db_stress via --seed exceeds uint64 range. db_stress uses gflags uint64, which may truncate or reject it.
Root cause: Off-by-one in the random range.
Suggested fix: Use random.randint(1, 2**64 - 1).

M2. `--seed 0` cannot reproduce a campaign that randomly got seed 0 -- `runner.py:37`

Issue: base_seed = args.seed or random.randint(1, 2**64) uses Python's or operator, which treats 0 as falsy. So --seed 0 generates a fresh seed instead of using seed 0. While seed 0 is excluded from randint(1, ...), the comment says "0 -> a fresh seed each run" which is the intended behavior. However, (base_seed + index) % 2**64 with large base_seed could wrap to 0, so individual runs could have seed=0, which the injector accepts.
Root cause: Using truthiness instead of explicit is None or sentinel for "no seed provided".
Suggested fix: This is acceptable given randint starts at 1, but consider args.seed if args.seed is not None else ... with default=None for clarity, or document that seed=0 is reserved as "auto".

M3. `detect_filter_construct_corruption` uses string `"true"` while all other bool flags use `0`/`1` -- `runner_build.py`

Issue: In _SDC_PROTECTIONS, detect_filter_construct_corruption is set to the string "true" while sync_fault_injection and all other bool flags use integers (0 or 1). Both are DEFINE_bool in db_stress. gflags accepts both forms, so this works, but it's inconsistent and could confuse future maintainers.
Root cause: Inconsistent flag value representation.
Suggested fix: Use 1 instead of "true" for consistency with all other flags in the preset.

M4. No unit tests for runner code -- test coverage gap

Issue: The PR adds ~680 lines of Python logic across 4 modules with no unit tests. The existing injector code has tests/test_injector_critical_instruction.py as a precedent for testing. Key functions like classify(), build_runs(), op_count_estimate(), _function_not_found(), and _no_source_lines() are all unit-testable without gdb or db_stress.
Root cause: Tests not included in PR.
Suggested fix: Add a tests/test_runner.py covering at minimum: classify() with all outcome paths (including missing inject.json, unknown kind), build_runs() seed reproducibility, op_count_estimate() for each op type, and the gdb output parsing functions.

🟢 LOW / NIT

L1. `_setup_logging` replaces all root logger handlers -- `runner.py:_setup_logging`

Issue: root_logger.handlers = [console, log_file] replaces any existing handlers on the root logger. If runner.py is imported as a library (e.g., for testing), this silently clobbers the caller's logging setup.
Suggested fix: Use root_logger.addHandler() or scope to the "runner" logger instead of the root logger.

L2. `report()` return value is unused -- `runner_report.py` / `runner.py`

Issue: report() returns the summary dict, but runner.py:main() ignores the return value. The return type annotation says dict[str, object] suggesting it's part of the interface.
Suggested fix: Either use the return value or make it -> None.

L3. `_no_source_lines` cold-split false positive is only documented in comment -- `runner.py`

Issue: The comment explains that .cold split functions legitimately have "No line number information", but the code only checks for a positive Line N of match. This seems correct as documented, but worth a test case.
Suggested fix: Add a unit test with gdb output containing both a valid Line N of and a cold-split No line number information to confirm the logic.

L4. `assert` in `_data_corruption_kind` runs in production -- `runner_execute.py`

Issue: The assert len(names) <= 1 will raise AssertionError in production if violated (Python asserts are not stripped with -O unless explicitly invoked). For a testing tool this is acceptable, but it would be better to log a warning and handle the case (e.g., pick the first file).
Suggested fix: Consider using an explicit if + raise or logger.warning instead of assert.

L5. Import of private function `_get_gdb` across module boundary -- `runner.py`

Issue: runner.py imports _get_gdb from runner_execute.py. The underscore prefix conventionally marks it as private/internal. It's used in logging the preflight info message.
Suggested fix: Either rename to get_gdb() (no underscore) since it's part of the cross-module interface, or move the log message into execute or a shared constants module.

Cross-Component Analysis

Context	Applies?	Issue?	Notes
Reused report_dir	YES	YES	H2: stale data_corruption files cause misclassification
Path with spaces	YES	NO	json.dumps for injector args, list-based Popen -- both handle spaces correctly
gdb version mismatch	YES	NO	Preflight catches old gdb via source line check
db_stress flag validation	YES	NO	DB_STRESS_PRESET satisfies db_stress's verify_cpu_corruption_dir requirements
Concurrent file access	NO	-	Each run has its own directory
Large run counts (>10000)	YES	LOW	May exhaust filesystem inodes but is a usage issue, not a bug

Positive Observations

Clean module separation (build/execute/report) with single-responsibility design
Reproducibility via deterministic seeding is well-designed and documented
Preflight validation catches common failure modes (missing binary, wrong gdb, inlined functions) before doing expensive work
The _CORRUPTION_BUCKET priority (data_corruption > crash_signal) correctly handles the case where db_stress detects corruption and then crashes
start_new_session=True + os.killpg correctly handles timeout cleanup of the gdb+db_stress process group
Comprehensive README with real-world examples and output walkthroughs

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

Claude may miss context from files not in the diff
Large PRs may be truncated
Always apply human judgment to AI suggestions

Commands:

/claude-review [context] — Request a code review
/claude-query <question> — Ask about the PR or codebase

meta-cla Bot added the CLA Signed label Jun 18, 2026

meta-codesync Bot added the meta-exported label Jun 18, 2026

hx235 changed the title ~~CPU corruption injector: runner + db_stress preset~~ Jun 18, 2026

hx235 mentioned this pull request Jun 18, 2026

Verify CPU corruption after op via --verify_cpu_corruption_dir (#14852) #14852

Closed

meta-codesync Bot changed the title ~~CPU corruption injector: runner + db_stress preset flags~~ Jul 1, 2026

meta-codesync Bot force-pushed the export-D108367345 branch from 1eb772d to f2b6fbf Compare July 1, 2026 03:21

meta-codesync Bot closed this in eddfee8 Jul 1, 2026

meta-codesync Bot added the Merged label Jul 1, 2026

Uh oh!

Conversation

hx235 commented Jun 18, 2026 • edited by meta-codesync Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

1. Preflight (verify_injection_site) catches a build that can't support injection

2. Compaction op -- 100 runs

A representative compaction SDC: run_00000

A representative compaction CORRUPTION (detected): run_00007

3. Flush op -- 100 runs

A representative flush SDC: run_00027

A representative flush CORRUPTION (detected): run_00047

4. Write op (MemTable::Add) -- two key spaces

4a. Default max_key=1000 -- 100 runs (no silent write SDC)

4b. Small max_key=8 -- 100 runs (surfaces 2 silent write SDCs)

A representative write SDC: run_00028 (skiplist misposition -> silent stale read, flush catches)

A representative write CORRUPTION (detected) max_key=1000 or 8 : run_00018

meta-codesync Bot commented Jun 18, 2026

github-actions Bot commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ clang-tidy: No findings on changed lines

github-actions Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codex Code Review - OBSOLETE

🟡 Codex Code Review

github-actions Bot commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Claude Code Review - OBSOLETE

✅ Claude Code Review

Summary

Findings

🔴 HIGH

🟡 MEDIUM

M1. _data_corruption_kind() uses assert for a runtime invariant — runner_execute.py:107

M2. _data_corruption_kind() can KeyError on unknown kind values — runner_execute.py:86 via classify() line 93

M3. InjectionSite is frozen=True but contains a mutable list[str] for target_fns — runner_build.py:14

🟢 LOW / NIT

L1. report() return value is unused — runner_report.py:23 / runner.py:63

L2. Logging setup replaces root logger handlers — runner.py:122

L3. _get_run_timeout_sec() doesn't validate non-numeric ICC_RUN_TIMEOUT — runner_execute.py:28

L4. No unit tests for the Python runner code

Cross-Component Analysis

Positive Observations

meta-codesync Bot commented Jul 1, 2026

github-actions Bot commented Jul 1, 2026

🟡 Codex Code Review

github-actions Bot commented Jul 1, 2026

✅ Claude Code Review

Summary

Findings

🔴 HIGH

H1. Unknown kind in data_corruption.json causes KeyError -- runner_execute.py:classify

H2. Stale files from reused --report_dir cause misclassification -- runner_build.py:build_runs / runner_execute.py:_data_corruption_kind

🟡 MEDIUM

M1. random.randint(1, 2**64) produces value outside uint64 range -- runner.py:37

M2. --seed 0 cannot reproduce a campaign that randomly got seed 0 -- runner.py:37

M3. detect_filter_construct_corruption uses string "true" while all other bool flags use 0/1 -- runner_build.py

M4. No unit tests for runner code -- test coverage gap

🟢 LOW / NIT

L1. _setup_logging replaces all root logger handlers -- runner.py:_setup_logging

L2. report() return value is unused -- runner_report.py / runner.py

L3. _no_source_lines cold-split false positive is only documented in comment -- runner.py

L4. assert in _data_corruption_kind runs in production -- runner_execute.py

L5. Import of private function _get_gdb across module boundary -- runner.py

Cross-Component Analysis

Positive Observations

Labels

1 participant

hx235 commented Jun 18, 2026 •

edited by meta-codesync Bot

Loading

1. Preflight (`verify_injection_site`) catches a build that can't support injection

A representative compaction SDC: `run_00000`

A representative compaction CORRUPTION (detected): `run_00007`

A representative flush SDC: `run_00027`

A representative flush CORRUPTION (detected): `run_00047`

4. Write op (`MemTable::Add`) -- two key spaces

4a. Default `max_key=1000` -- 100 runs (no silent write SDC)

4b. Small `max_key=8` -- 100 runs (surfaces 2 silent write SDCs)

A representative write SDC: `run_00028` (skiplist misposition -> silent stale read, flush catches)

A representative write CORRUPTION (detected) `max_key=1000 or 8` : `run_00018`

github-actions Bot commented Jun 18, 2026 •

edited

Loading

github-actions Bot commented Jul 1, 2026 •

edited

Loading

github-actions Bot commented Jul 1, 2026 •

edited

Loading

M1. `_data_corruption_kind()` uses `assert` for a runtime invariant — `runner_execute.py:107`

M2. `_data_corruption_kind()` can `KeyError` on unknown `kind` values — `runner_execute.py:86` via `classify()` line 93

M3. `InjectionSite` is `frozen=True` but contains a mutable `list[str]` for `target_fns` — `runner_build.py:14`

L1. `report()` return value is unused — `runner_report.py:23` / `runner.py:63`

L2. Logging setup replaces root logger handlers — `runner.py:122`

L3. `_get_run_timeout_sec()` doesn't validate non-numeric `ICC_RUN_TIMEOUT` — `runner_execute.py:28`

H1. Unknown `kind` in data_corruption.json causes KeyError -- `runner_execute.py:classify`

H2. Stale files from reused `--report_dir` cause misclassification -- `runner_build.py:build_runs` / `runner_execute.py:_data_corruption_kind`

M1. `random.randint(1, 2**64)` produces value outside uint64 range -- `runner.py:37`

M2. `--seed 0` cannot reproduce a campaign that randomly got seed 0 -- `runner.py:37`

M3. `detect_filter_construct_corruption` uses string `"true"` while all other bool flags use `0`/`1` -- `runner_build.py`

L1. `_setup_logging` replaces all root logger handlers -- `runner.py:_setup_logging`

L2. `report()` return value is unused -- `runner_report.py` / `runner.py`

L3. `_no_source_lines` cold-split false positive is only documented in comment -- `runner.py`

L4. `assert` in `_data_corruption_kind` runs in production -- `runner_execute.py`

L5. Import of private function `_get_gdb` across module boundary -- `runner.py`