[db_bench] Graceful shutdown on fatal benchmark errors by pdillinger · Pull Request #14855 · facebook/rocksdb

pdillinger · 2026-06-16T00:22:37Z

Summary:
db_bench worker operations historically responded to a fatal error (failed Get/Put/Merge/Delete, checksum failure, etc.) by calling db_bench_exit() -> exit() in place. When that happened on a worker thread while sibling workers and background (compaction/flush) threads were still running, exit() began static/global destruction underneath those live threads. A thread mid-call into a user callback (e.g. a custom Env/ FileSystem) could then touch a global/singleton that had already been destroyed -> cross-thread use-after-free. This is the same class of bug fixed for db_stress in #14850, but it also matters for db_bench's secondary role as an integration validation tool: we want production-like processes to shut down cleanly after an error (so a custom FileSystem can log the adverse condition and run its teardown) rather than crash or _exit at the first sign of trouble.

This change makes fatal errors on benchmark worker threads unwind instead of exiting in place, and performs an ordered shutdown on the main thread:

SharedState gains a cooperative fatal flag + first-error Status/message and SetFatal(). A worker that hits a fatal error records it and returns out of its operation (releasing its DbUseGuard) instead of exiting.
Op loops stop promptly via Duration: the loop-termination helper now consults the run's fatal flag at its existing throttled check boundary (~every FLAGS_ops_between_duration_checks ops), so peers wind down with no added per-key cost.
After RunBenchmark joins all workers, the (guard-free) main thread runs the ordered shutdown -- DeleteDBs() closes all DBs, joining background threads and running user Env/FileSystem teardown -- and only then exits nonzero. No static destructors run while RocksDB threads or user callbacks are still live.
The worker fatal sites that previously called db_bench_exit() (BGWriter, DoDelete, RandomWithVerify, UpdateRandom, XORUpdateRandom, MergeRandom, ReadRandomMergeRandom, VerifyChecksum, VerifyFileChecksums, RandomReplaceKeys, Replay, DoDeterministicCompact) now SetFatal + return. Flush (main-thread) routes to ErrorExit() which closes DBs first.

Supporting changes:

Duration is now a nested member type of SharedState, constructible only via SharedState::MakeDuration() (movable, non-copyable). This binds every Duration to its run's fatal flag by construction and removes the previous mutable global (g_cooperative_abort_flag) and its set/clear lifecycle. A using Duration = SharedState::Duration; alias keeps call sites tidy.
New test-only flag --fault_injection_read_after_reads: wraps the FileSystem so random-access reads start returning a non-retryable IOError after N reads, to exercise the fatal/shutdown path on demand.
New flag --fatal_shutdown_watchdog_seconds (default 180, 0 disables): arms a one-shot detached watchdog on the first fatal error (and before ErrorExit's cleanup) that forces port::ImmediateExit() if the careful shutdown itself hangs (e.g. a background thread or misbehaving Env that never returns). This is a backstop, not the path under test.
Added an "Error handling background" comment documenting the rationale and the machinery, and an invariant comment at db_bench_exit covering pre-Open vs worker vs main-thread exit rules.

Existing _Exit()/abort() sites are intentionally left unchanged.

Future work:

Worker ops that still abort() on error (ReadRandom, ReadRandomFast, ReadToRowCache, MultiReadRandom, MixGraph, AppendRandom, UpdateRandom's read path, BGScan, RandomTransaction, CreateNewCf) are not the use-after-free class (abort() skips static destruction) but are not graceful either; converting them would extend clean-shutdown coverage. RandomTransaction needs a rollback decision; MultiReadRandom needs a structured break-out.
Post-Open main-thread db_bench_exit() sites in Open()/Run()/ VerifyDBFromDB() are the harder, Use _exit(1) instead of exit(1) after DB is open to avoid UAF #14850-class case: Open() is reachable both under and outside a DbStateMutationGuard, so it cannot unconditionally call ErrorExit() without deadlock. Needs guard-aware handling or restructuring Open() to return Status.
Pre-Open setup exits (option/flag parsing, cache setup) correctly stay plain exit() -- no live threads -- and are out of scope.
Fault injection currently covers read faults only; verifying graceful shutdown for write/open/transaction error paths would need write-fault, open-fault, and conflict injection.

Test Plan:
Built debug db_bench (make db_bench) and validated manually:

Correctness of loop termination (unchanged by the Duration refactor):
- Count-bounded is exact: db_bench --benchmarks=fillrandom --num=300 reports "300 operations".
- Time-bounded honors duration and is not capped by max_ops: db_bench --benchmarks=readrandom --use_existing_db --duration=5 --reads=3000 ran 5.002 s / ~2.0-2.7M operations.
Graceful shutdown on a fatal error (multi-threaded, peers live): db_bench --benchmarks=xorupdaterandom --use_existing_db --threads=8 --duration=30 --fault_injection_read_after_reads=5000 -> exit code 1, ZERO "Received signal", prints 'Fatal error during "xorupdaterandom"' after the ordered shutdown.
Confirmed this is a real fix, not just masking: rebuilt pristine main + only the fault-injection flag (graceful-shutdown changes reverted) and ran the same command -> SIGSEGV (rc=139), crash stack in __run_exit_handlers -> jemalloc free, triggered from XORUpdateRandom -> db_bench_exit() -> exit() with sibling threads still live. With the fix in place the same workload is graceful (above).
Watchdog:
- Default (180 s): graceful runs above complete with no watchdog message (no spurious firing).
- --fatal_shutdown_watchdog_seconds=0 disables it (still graceful).
- Positive: temporarily inserted a 60 s sleep before DeleteDBs in the fatal path, ran with --fatal_shutdown_watchdog_seconds=3 -> process exited at ~3 s (not 60 s) with the watchdog message and rc=1; reverted the temporary sleep and rebuilt clean.
No benchmark performance impact (normal, error-free runs):
- Structural: the cooperative-abort check adds no per-key work. It is evaluated only at Duration's pre-existing throttled boundary (~every FLAGS_ops_between_duration_checks ops, same site as the existing time check); there is no atomic load on the per-op hot path. Fatal recording (SetFatal) lives only in cold error branches. A Duration is constructed once per benchmark via the factory (returned by value, elided), not per op.
- Observed: error-free fillrandom / fillseq / readrandom throughput (e.g. fillrandom ~300K ops/s, fillseq ~800K-870K ops/s) was in line with pre-change runs; no regression observed.
make format-auto applied; debug build is clean.

Summary: db_bench worker operations historically responded to a fatal error (failed Get/Put/Merge/Delete, checksum failure, etc.) by calling db_bench_exit() -> exit() in place. When that happened on a worker thread while sibling workers and background (compaction/flush) threads were still running, exit() began static/global destruction underneath those live threads. A thread mid-call into a user callback (e.g. a custom Env/ FileSystem) could then touch a global/singleton that had already been destroyed -> cross-thread use-after-free. This is the same class of bug fixed for db_stress in facebook#14850, but it also matters for db_bench's secondary role as an *integration validation* tool: we want production-like processes to shut down cleanly after an error (so a custom FileSystem can log the adverse condition and run its teardown) rather than crash or _exit at the first sign of trouble. This change makes fatal errors on benchmark worker threads unwind instead of exiting in place, and performs an ordered shutdown on the main thread: * SharedState gains a cooperative fatal flag + first-error Status/message and SetFatal(). A worker that hits a fatal error records it and returns out of its operation (releasing its DbUseGuard) instead of exiting. * Op loops stop promptly via Duration: the loop-termination helper now consults the run's fatal flag at its existing throttled check boundary (~every FLAGS_ops_between_duration_checks ops), so peers wind down with no added per-key cost. * After RunBenchmark joins all workers, the (guard-free) main thread runs the ordered shutdown -- DeleteDBs() closes all DBs, joining background threads and running user Env/FileSystem teardown -- and only then exits nonzero. No static destructors run while RocksDB threads or user callbacks are still live. * The worker fatal sites that previously called db_bench_exit() (BGWriter, DoDelete, RandomWithVerify, UpdateRandom, XORUpdateRandom, MergeRandom, ReadRandomMergeRandom, VerifyChecksum, VerifyFileChecksums, RandomReplaceKeys, Replay, DoDeterministicCompact) now SetFatal + return. Flush (main-thread) routes to ErrorExit() which closes DBs first. Supporting changes: * Duration is now a nested member type of SharedState, constructible only via SharedState::MakeDuration() (movable, non-copyable). This binds every Duration to its run's fatal flag by construction and removes the previous mutable global (g_cooperative_abort_flag) and its set/clear lifecycle. A `using Duration = SharedState::Duration;` alias keeps call sites tidy. * New test-only flag --fault_injection_read_after_reads: wraps the FileSystem so random-access reads start returning a non-retryable IOError after N reads, to exercise the fatal/shutdown path on demand. * New flag --fatal_shutdown_watchdog_seconds (default 180, 0 disables): arms a one-shot detached watchdog on the first fatal error (and before ErrorExit's cleanup) that forces port::ImmediateExit() if the careful shutdown itself hangs (e.g. a background thread or misbehaving Env that never returns). This is a backstop, not the path under test. * Added an "Error handling background" comment documenting the rationale and the machinery, and an invariant comment at db_bench_exit covering pre-Open vs worker vs main-thread exit rules. Existing _Exit()/abort() sites are intentionally left unchanged. Future work: * Worker ops that still abort() on error (ReadRandom, ReadRandomFast, ReadToRowCache, MultiReadRandom, MixGraph, AppendRandom, UpdateRandom's read path, BGScan, RandomTransaction, CreateNewCf) are not the use-after-free class (abort() skips static destruction) but are not graceful either; converting them would extend clean-shutdown coverage. RandomTransaction needs a rollback decision; MultiReadRandom needs a structured break-out. * Post-Open main-thread db_bench_exit() sites in Open()/Run()/ VerifyDBFromDB() are the harder, facebook#14850-class case: Open() is reachable both under and outside a DbStateMutationGuard, so it cannot unconditionally call ErrorExit() without deadlock. Needs guard-aware handling or restructuring Open() to return Status. * Pre-Open setup exits (option/flag parsing, cache setup) correctly stay plain exit() -- no live threads -- and are out of scope. * Fault injection currently covers read faults only; verifying graceful shutdown for write/open/transaction error paths would need write-fault, open-fault, and conflict injection. Test Plan: Built debug db_bench (make db_bench) and validated manually: * Correctness of loop termination (unchanged by the Duration refactor): - Count-bounded is exact: `db_bench --benchmarks=fillrandom --num=300` reports "300 operations". - Time-bounded honors duration and is not capped by max_ops: `db_bench --benchmarks=readrandom --use_existing_db --duration=5 --reads=3000` ran 5.002 s / ~2.0-2.7M operations. * Graceful shutdown on a fatal error (multi-threaded, peers live): `db_bench --benchmarks=xorupdaterandom --use_existing_db --threads=8 --duration=30 --fault_injection_read_after_reads=5000` -> exit code 1, ZERO "Received signal", prints 'Fatal error during "xorupdaterandom"' after the ordered shutdown. * Confirmed this is a real fix, not just masking: rebuilt pristine main + only the fault-injection flag (graceful-shutdown changes reverted) and ran the same command -> SIGSEGV (rc=139), crash stack in __run_exit_handlers -> jemalloc free, triggered from XORUpdateRandom -> db_bench_exit() -> exit() with sibling threads still live. With the fix in place the same workload is graceful (above). * Watchdog: - Default (180 s): graceful runs above complete with no watchdog message (no spurious firing). - --fatal_shutdown_watchdog_seconds=0 disables it (still graceful). - Positive: temporarily inserted a 60 s sleep before DeleteDBs in the fatal path, ran with --fatal_shutdown_watchdog_seconds=3 -> process exited at ~3 s (not 60 s) with the watchdog message and rc=1; reverted the temporary sleep and rebuilt clean. * No benchmark performance impact (normal, error-free runs): - Structural: the cooperative-abort check adds no per-key work. It is evaluated only at Duration's pre-existing throttled boundary (~every FLAGS_ops_between_duration_checks ops, same site as the existing time check); there is no atomic load on the per-op hot path. Fatal recording (SetFatal) lives only in cold error branches. A Duration is constructed once per benchmark via the factory (returned by value, elided), not per op. - Observed: error-free fillrandom / fillseq / readrandom throughput (e.g. fillrandom ~300K ops/s, fillseq ~800K-870K ops/s) was in line with pre-change runs; no regression observed. * make format-auto applied; debug build is clean.

meta-codesync · 2026-06-16T00:23:18Z

@pdillinger has imported this pull request. If you are a Meta employee, you can view this in D108690137.

github-actions · 2026-06-16T00:26:48Z

⚠️ clang-tidy: 1 warning(s) on changed lines

Completed in 156.6s.

Summary by check

Check	Count
`cppcoreguidelines-special-member-functions`	1
Total	1

Details

tools/db_bench_tool.cc (1 warning(s))

tools/db_bench_tool.cc:3028:9: warning: class 'Duration' defines a copy constructor, a copy assignment operator, a move constructor and a move assignment operator but does not define a destructor [cppcoreguidelines-special-member-functions]

github-actions · 2026-06-16T00:42:20Z

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit 1afb27a

❌ Codex review failed before producing findings.

WARNING: proceeding, even though we could not create PATH aliases: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

Codex may miss context from files not in the diff
Large PRs may be truncated
Always apply human judgment to AI suggestions

Commands:

/codex-review [context] — Request a code review
/codex-query <question> — Ask about the PR or codebase

github-actions · 2026-06-16T00:50:56Z

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit 1afb27a

Summary

Well-designed PR that solves a real use-after-free class of bugs in db_bench's fatal error handling. The cooperative shutdown via atomic flag + poll is the right approach. The Duration refactoring is clean and eliminates the global mutable state.

High-severity findings (0):

No high-severity findings.

Full review (click to expand)

Findings

🔴 HIGH

None.

🟡 MEDIUM

M1. Duration::Done() performance regression for ops-only benchmarks -- `tools/db_bench_tool.cc` (new Duration::Done)

Issue: The old code entered the throttled-check block only when max_seconds_ was nonzero. For ops-only benchmarks (max_seconds_==0, --duration=0), it went straight to return ops_ > max_ops_. The new code's condition max_seconds_ > 0 || abort_flag_ != nullptr is ALWAYS true (since abort_flag_ is always set by MakeDuration()). This means ops-only benchmarks now unconditionally compute prev_ops, enter the if-block, and perform two integer divisions by FLAGS_ops_between_duration_checks (default 1000) on every Done() call.
Root cause: The fatal-flag poll was piggybacked onto the time-check throttle, but the condition that gates entry was broadened to always enter.
Impact assessment: Two int64 divisions by a non-power-of-2 constant (1000) per op. The boundary-crossing check exits quickly (~999/1000 iterations just fall through), so actual overhead is small. Given db_bench already includes function pointer dispatch and DB calls, this is likely in the noise. But it IS a semantic change from the old behavior.

Suggested fix: Restructure to preserve the old fast-path for ops-only benchmarks:

if (max_seconds_ > 0) {
  // existing throttled time + abort check
} else {
  if (abort_flag_ != nullptr) {
    auto granularity = FLAGS_ops_between_duration_checks;
    if ((ops_ / granularity) != (prev_ops / granularity)) {
      if (abort_flag_->Load()) return true;
    }
  }
  return ops_ > max_ops_;
}

M2. ReadFaultInjectionFS off-by-one in read count -- `tools/db_bench_tool.cc` (ReadFaultFile::Read)

Issue: The flag --fault_injection_read_after_reads=N is documented as "reads start returning IOError after this many reads." But FetchSubRelaxed(1) returns the pre-decrement value. With initial=N: 1st call returns N (>0, success), ..., Nth call returns 1 (>0, success), (N+1)th call returns 0 (<=0, FAIL). So the fault triggers after N+1 successful reads, not N.
Suggested fix: Change to FetchSubRelaxed(1) <= 1, or document the count as approximate (which it is anyway due to concurrent threads).

M3. No automated test coverage -- `tools/db_bench_tool.cc`

Issue: The PR adds significant new functionality but no automated tests. There is an existing tools/db_bench_tool_test.cc that could be extended.
Suggested fix: Add at least one test that runs db_bench with --fault_injection_read_after_reads and verifies exit code 1 without crash signals.

🟢 LOW / NIT

L1. `inline` on `ArmFatalShutdownWatchdog` -- `tools/db_bench_tool.cc`

Unnecessary in a .cc file. Harmless but could be removed.

L2. Comment style -- `tools/db_bench_tool.cc:2888-2930`

The ======== decorator style is unusual for the codebase.

Cross-Component Analysis

Context	Assumptions hold?	Action needed?
Worker threads with DbUseGuard	YES - SetFatal + return releases guard	None
Main thread (Flush, ErrorExit)	YES - ErrorExit handles guard state	None
BGWriter (tid==0)	YES - peers wind down via Duration	None
Ops-only benchmarks	MOSTLY - see M1 perf concern	Minor
RandomWithVerify (no Duration)	YES - SetFatal + return exits	Peers run to completion
DoDeterministicCompact callers	YES - ignore Status but SetFatal flagged	None

Positive Observations

Clean design: Duration-as-nested-class eliminates global mutable state and binds every Duration to its SharedState's fatal flag by construction.
Thorough conversion: All 21 Duration construction sites updated systematically.
Correct synchronization: Proper acquire/release on the atomic flag, mutex-protected first-writer-wins, thread-join happens-before for post-join reads.
Well-scoped: Clear documentation of what's changed, what's out of scope, and future work.
Pragmatic watchdog: Idempotent once_flag backstop against teardown hangs.

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

Claude may miss context from files not in the diff
Large PRs may be truncated
Always apply human judgment to AI suggestions

Commands:

/claude-review [context] — Request a code review
/claude-query <question> — Ask about the PR or codebase

meta-codesync · 2026-06-16T02:56:12Z

@pdillinger merged this pull request in 123030f.

meta-cla Bot added the CLA Signed label Jun 16, 2026

meta-codesync Bot closed this in 123030f Jun 16, 2026

meta-codesync Bot added the Merged label Jun 16, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[db_bench] Graceful shutdown on fatal benchmark errors#14855

[db_bench] Graceful shutdown on fatal benchmark errors#14855
pdillinger wants to merge 1 commit into
facebook:mainfrom
pdillinger:db_bench_error_handling

pdillinger commented Jun 16, 2026

meta-codesync Bot commented Jun 16, 2026

github-actions Bot commented Jun 16, 2026

github-actions Bot commented Jun 16, 2026

github-actions Bot commented Jun 16, 2026

Findings

🔴 HIGH

🟡 MEDIUM

M1. Duration::Done() performance regression for ops-only benchmarks -- `tools/db_bench_tool.cc` (new Duration::Done)

M2. ReadFaultInjectionFS off-by-one in read count -- `tools/db_bench_tool.cc` (ReadFaultFile::Read)

M3. No automated test coverage -- `tools/db_bench_tool.cc`

🟢 LOW / NIT

L1. `inline` on `ArmFatalShutdownWatchdog` -- `tools/db_bench_tool.cc`

L2. Comment style -- `tools/db_bench_tool.cc:2888-2930`

Cross-Component Analysis

Positive Observations

meta-codesync Bot commented Jun 16, 2026

Labels

1 participant

Uh oh!

Conversation

pdillinger commented Jun 16, 2026

meta-codesync Bot commented Jun 16, 2026

github-actions Bot commented Jun 16, 2026

⚠️ clang-tidy: 1 warning(s) on changed lines

Summary by check

Details

github-actions Bot commented Jun 16, 2026

🟡 Codex Code Review

github-actions Bot commented Jun 16, 2026

✅ Claude Code Review

Summary

Findings

🔴 HIGH

🟡 MEDIUM

M1. Duration::Done() performance regression for ops-only benchmarks -- tools/db_bench_tool.cc (new Duration::Done)

M2. ReadFaultInjectionFS off-by-one in read count -- tools/db_bench_tool.cc (ReadFaultFile::Read)

M3. No automated test coverage -- tools/db_bench_tool.cc

🟢 LOW / NIT

L1. inline on ArmFatalShutdownWatchdog -- tools/db_bench_tool.cc

L2. Comment style -- tools/db_bench_tool.cc:2888-2930

Cross-Component Analysis

Positive Observations

meta-codesync Bot commented Jun 16, 2026

Labels

1 participant

M1. Duration::Done() performance regression for ops-only benchmarks -- `tools/db_bench_tool.cc` (new Duration::Done)

M2. ReadFaultInjectionFS off-by-one in read count -- `tools/db_bench_tool.cc` (ReadFaultFile::Read)

M3. No automated test coverage -- `tools/db_bench_tool.cc`

L1. `inline` on `ArmFatalShutdownWatchdog` -- `tools/db_bench_tool.cc`

L2. Comment style -- `tools/db_bench_tool.cc:2888-2930`