Skip to content

Use _exit(1) instead of exit(1) after DB is open to avoid UAF#14850

Closed
pdillinger wants to merge 2 commits into
facebook:mainfrom
pdillinger:export-D108298839
Closed

Use _exit(1) instead of exit(1) after DB is open to avoid UAF#14850
pdillinger wants to merge 2 commits into
facebook:mainfrom
pdillinger:export-D108298839

Conversation

@pdillinger

Copy link
Copy Markdown
Contributor

Summary:
When FinishInitDb() or Open() calls exit(1) after the DB has been opened, background compaction/flush threads are still running. exit() triggers static object destruction (including the KillPoint singleton and its rocksdb_kill_exclude_prefixes vector) while those threads are still accessing them via TestKillRandom(), causing a heap-use-after-free detected by ASAN.

This became more likely to trigger after the multi-DB support commit (3d0d60101e7f) which runs RunStressTestImpl on worker threads, making the race window larger when one DB fails initialization while other DBs background threads are active.

The fix replaces exit(1) with _exit(1) in all error paths that fire after the DB has been opened. _exit() terminates immediately without running atexit handlers or destroying static objects, avoiding the race with background threads.

Differential Revision: D108298839

Summary:
When `FinishInitDb()` or `Open()` calls `exit(1)` after the DB has been opened, background compaction/flush threads are still running. `exit()` triggers static object destruction (including the `KillPoint` singleton and its `rocksdb_kill_exclude_prefixes` vector) while those threads are still accessing them via `TestKillRandom()`, causing a heap-use-after-free detected by ASAN.

This became more likely to trigger after the multi-DB support commit (3d0d60101e7f) which runs `RunStressTestImpl` on worker threads, making the race window larger when one DB fails initialization while other DBs background threads are active.

The fix replaces `exit(1)` with `_exit(1)` in all error paths that fire after the DB has been opened. `_exit()` terminates immediately without running atexit handlers or destroying static objects, avoiding the race with background threads.

Differential Revision: D108298839
@meta-cla meta-cla Bot added the CLA Signed label Jun 11, 2026
@meta-codesync

meta-codesync Bot commented Jun 11, 2026

Copy link
Copy Markdown

@pdillinger has exported this pull request. If you are a Meta employee, you can view the originating Diff in D108298839.

@github-actions

github-actions Bot commented Jun 11, 2026

Copy link
Copy Markdown

✅ clang-tidy: No findings on changed lines

Completed in 81.8s.

@meta-codesync

meta-codesync Bot commented Jun 11, 2026

Copy link
Copy Markdown

@pdillinger has imported this pull request. If you are a Meta employee, you can view this in D108298839.

@github-actions

Copy link
Copy Markdown

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit b503ace


Codex review failed before producing findings.

WARNING: proceeding, even though we could not create PATH aliases: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase
@github-actions

Copy link
Copy Markdown

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit b503ace


Summary

Correct fix for a real ASAN-detected UAF bug. The approach (_exit() via port abstraction) is sound and the implementation is clean. However, the PR only converts 4 of many exit(1) calls that fire after a DB is open, leaving the same UAF risk in dozens of other error paths.

High-severity findings (1):

  • [db_stress_tool/*.cc] Incomplete scope: many remaining exit(1) calls fire after DB is open with background threads active, sharing the same UAF risk as the 4 that are fixed.
Full review (click to expand)

Findings

🔴 HIGH

H1. Incomplete scope -- many post-DB-open exit(1) calls remain unfixed -- db_stress_tool/*.cc
  • Issue: The PR converts only 4 exit(1) calls in db_stress_test_base.cc, but dozens more fire when a DB is open and background threads are active. In the multi-DB stress test scenario, even exit(1) calls where the local DB failed to open are unsafe because other DB instances on other worker threads have active background threads accessing shared statics.
  • Root cause: The PR's selection criterion appears to be "after the DB is opened in this call stack," but the real criterion should be "after any DB in the process has been opened" -- which includes virtually all exit(1) calls except early flag validation in db_stress_tool.cc.
  • Key remaining risky call sites:
    • db_stress_driver.cc:183 -- cleanup failure after crash-recovery verification (DB is open)
    • db_stress_test_base.cc:1048 -- preload failure
    • db_stress_test_base.cc:4715,4749,4788,4841,4880 -- MANIFEST verification failures
    • db_stress_test_base.cc:4974-5524 -- various verification/operational phases
    • db_stress_listener.cc:54 -- listener callback (DB definitely open)
    • no_batched_ops_stress.cc:493 -- during operations
    • multi_ops_txns_stress.cc:1782-1834 -- transaction validation during operations
  • Suggested fix: Convert all exit(1) calls in the db_stress_tool to port::ImmediateExit(1), or at minimum all those reachable after RunStressTestImpl begins (which is when DBs are opened on worker threads). The flag-validation exits in db_stress_tool.cc (before any DB opens) can safely remain as exit(1).

🟡 MEDIUM

M1. Potential stderr output loss from _exit() -- db_stress_test_base.cc
  • Issue: _exit() does not flush stdio buffers. While stderr is typically unbuffered, the C/C++ standard does not guarantee this. If stderr is redirected to a file or pipe, the error messages printed via fprintf(stderr, ...) just before port::ImmediateExit(1) could be lost.
  • Suggested fix: The existing code at some call sites already calls fflush(stderr) (e.g., lines 4443, 4481, 4495). Consider adding fflush(stderr) before the 4 port::ImmediateExit(1) calls, or better yet, have ImmediateExit itself call fflush(stderr) and fflush(stdout) before _exit().
M2. CLAUDE.md portability section scope -- CLAUDE.md
  • Issue: The CLAUDE.md addition about cross-platform portability is useful and accurate, but it is a standalone documentation improvement unrelated to the UAF fix. Bundling it into the same PR makes the commit harder to revert or cherry-pick independently.
  • Suggested fix: Consider splitting the CLAUDE.md changes into a separate commit or PR for cleaner git history.

🟢 LOW / NIT

L1. Documentation cross-reference in port_win.h is minimal
  • Issue: port_win.h says only // See ImmediateExit in port/port_posix.h for documentation. A one-line summary of what the function does would help Windows-focused developers who may not routinely read the POSIX header.
  • Suggested fix: Add a brief summary, e.g., // Terminates immediately without static destructors or atexit handlers. See port/port_posix.h for full documentation.
L2. Naming alternative consideration
  • Issue: ImmediateExit could be confused with quick_exit(). The name doesn't explicitly convey the "no destructors" property which is the key safety guarantee.
  • Suggested fix: Not blocking -- the current name is fine and well-documented. Just noting that ExitWithoutCleanup or RawExit were alternatives. The existing name with good documentation is acceptable.
L3. No new tests added
  • Issue: No regression test is included. This is understandable since testing _exit() vs exit() requires process-level testing, and the ASAN stress test infrastructure already serves as implicit coverage.
  • Suggested fix: Acceptable as-is. The existing ASAN stress test coverage found the original bug and will catch regressions.

Cross-Component Analysis

Context Does code execute? Assumptions hold? Action needed?
Windows/MSVC YES YES -- _exit() via <process.h> None
MinGW cross-compile YES YES -- same <process.h> None
ASAN builds YES YES -- _exit() skips ASAN leak report but that's acceptable for error exits None
TSAN builds YES YES -- skips TSAN cleanup but acceptable for error exits None
Multi-DB worker threads YES Partially -- only 4 of many exit calls fixed Fix remaining calls (H1)

Positive Observations

  • The port abstraction pattern (port::ImmediateExit) is well-chosen and consistent with existing patterns like port::Crash.
  • [[noreturn]] attribute is correctly applied and has precedent in db_bench_tool.cc:1380.
  • The documentation comment in port_posix.h is thorough and clearly explains the motivation.
  • Include dependencies are already satisfied (<unistd.h> in port_posix.cc, <process.h> in port_win.h).
  • The fix correctly addresses the root cause (static destructor UAF) rather than trying to join threads before exit, which would be far more complex and error-prone.

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase
@meta-codesync meta-codesync Bot closed this in 77d9ed7 Jun 12, 2026
@meta-codesync meta-codesync Bot added the Merged label Jun 12, 2026
@meta-codesync

meta-codesync Bot commented Jun 12, 2026

Copy link
Copy Markdown

@pdillinger merged this pull request in 77d9ed7.

meta-codesync Bot pushed a commit that referenced this pull request Jun 16, 2026
Summary:
db_bench worker operations historically responded to a fatal error (failed Get/Put/Merge/Delete, checksum failure, etc.) by calling db_bench_exit() -> exit() in place. When that happened on a worker thread while sibling workers and background (compaction/flush) threads were still running, exit() began static/global destruction underneath those live threads. A thread mid-call into a user callback (e.g. a custom Env/ FileSystem) could then touch a global/singleton that had already been destroyed -> cross-thread use-after-free. This is the same class of bug fixed for db_stress in #14850, but it also matters for db_bench's secondary role as an *integration validation* tool: we want production-like processes to shut down cleanly after an error (so a custom FileSystem can log the adverse condition and run its teardown) rather than crash or _exit at the first sign of trouble.

This change makes fatal errors on benchmark worker threads unwind instead of exiting in place, and performs an ordered shutdown on the main thread:

  * SharedState gains a cooperative fatal flag + first-error Status/message and SetFatal(). A worker that hits a fatal error records it and returns out of its operation (releasing its DbUseGuard) instead of exiting.
  * Op loops stop promptly via Duration: the loop-termination helper now consults the run's fatal flag at its existing throttled check boundary (~every FLAGS_ops_between_duration_checks ops), so peers wind down with no added per-key cost.
  * After RunBenchmark joins all workers, the (guard-free) main thread runs the ordered shutdown -- DeleteDBs() closes all DBs, joining background threads and running user Env/FileSystem teardown -- and only then exits nonzero. No static destructors run while RocksDB threads or user callbacks are still live.
  * The worker fatal sites that previously called db_bench_exit() (BGWriter, DoDelete, RandomWithVerify, UpdateRandom, XORUpdateRandom, MergeRandom, ReadRandomMergeRandom, VerifyChecksum, VerifyFileChecksums, RandomReplaceKeys, Replay, DoDeterministicCompact) now SetFatal + return. Flush (main-thread) routes to ErrorExit() which closes DBs first.

Supporting changes:

  * Duration is now a nested member type of SharedState, constructible only via SharedState::MakeDuration() (movable, non-copyable). This binds every Duration to its run's fatal flag by construction and removes the previous mutable global (g_cooperative_abort_flag) and its set/clear lifecycle. A `using Duration = SharedState::Duration;` alias keeps call sites tidy.
  * New test-only flag --fault_injection_read_after_reads: wraps the FileSystem so random-access reads start returning a non-retryable IOError after N reads, to exercise the fatal/shutdown path on demand.
  * New flag --fatal_shutdown_watchdog_seconds (default 180, 0 disables): arms a one-shot detached watchdog on the first fatal error (and before ErrorExit's cleanup) that forces port::ImmediateExit() if the careful shutdown itself hangs (e.g. a background thread or misbehaving Env that never returns). This is a backstop, not the path under test.
  * Added an "Error handling background" comment documenting the rationale and the machinery, and an invariant comment at db_bench_exit covering pre-Open vs worker vs main-thread exit rules.

Existing _Exit()/abort() sites are intentionally left unchanged.

Future work:
  * Worker ops that still abort() on error (ReadRandom, ReadRandomFast, ReadToRowCache, MultiReadRandom, MixGraph, AppendRandom, UpdateRandom's read path, BGScan, RandomTransaction, CreateNewCf) are not the use-after-free class (abort() skips static destruction) but are not graceful either; converting them would extend clean-shutdown coverage. RandomTransaction needs a rollback decision; MultiReadRandom needs a structured break-out.
  * Post-Open main-thread db_bench_exit() sites in Open()/Run()/ VerifyDBFromDB() are the harder, #14850 case: Open() is reachable both under and outside a DbStateMutationGuard, so it cannot unconditionally call ErrorExit() without deadlock. Needs guard-aware handling or restructuring Open() to return Status.
  * Pre-Open setup exits (option/flag parsing, cache setup) correctly stay plain exit() -- no live threads -- and are out of scope.
  * Fault injection currently covers read faults only; verifying graceful shutdown for write/open/transaction error paths would need write-fault, open-fault, and conflict injection.

Pull Request resolved: #14855

Test Plan:
Built debug db_bench (make db_bench) and validated manually:

  * Correctness of loop termination (unchanged by the Duration refactor):
    - Count-bounded is exact: `db_bench --benchmarks=fillrandom --num=300` reports "300 operations".
    - Time-bounded honors duration and is not capped by max_ops: `db_bench --benchmarks=readrandom --use_existing_db --duration=5 --reads=3000` ran 5.002 s / ~2.0-2.7M operations.

  * Graceful shutdown on a fatal error (multi-threaded, peers live): `db_bench --benchmarks=xorupdaterandom --use_existing_db --threads=8 --duration=30 --fault_injection_read_after_reads=5000` -> exit code 1, ZERO "Received signal", prints 'Fatal error during "xorupdaterandom"' after the ordered shutdown.

  * Confirmed this is a real fix, not just masking: rebuilt pristine main + only the fault-injection flag (graceful-shutdown changes reverted) and ran the same command -> SIGSEGV (rc=139), crash stack in __run_exit_handlers -> jemalloc free, triggered from XORUpdateRandom -> db_bench_exit() -> exit() with sibling threads still live. With the fix in place the same workload is graceful (above).

  * Watchdog:
    - Default (180 s): graceful runs above complete with no watchdog message (no spurious firing).
    - --fatal_shutdown_watchdog_seconds=0 disables it (still graceful).
    - Positive: temporarily inserted a 60 s sleep before DeleteDBs in the fatal path, ran with --fatal_shutdown_watchdog_seconds=3 -> process exited at ~3 s (not 60 s) with the watchdog message and rc=1; reverted the temporary sleep and rebuilt clean.

  * No benchmark performance impact (normal, error-free runs):
    - Structural: the cooperative-abort check adds no per-key work. It is evaluated only at Duration's pre-existing throttled boundary (~every FLAGS_ops_between_duration_checks ops, same site as the existing time check); there is no atomic load on the per-op hot path. Fatal recording (SetFatal) lives only in cold error branches. A Duration is constructed once per benchmark via the factory (returned by value, elided), not per op.
    - Observed: error-free fillrandom / fillseq / readrandom throughput (e.g. fillrandom ~300K ops/s, fillseq ~800K-870K ops/s) was in line with pre-change runs; no regression observed.

  * make format-auto applied; debug build is clean.

Reviewed By: anand1976

Differential Revision: D108690137

Pulled By: pdillinger

fbshipit-source-id: 7ffdf4abc665b4c2172b718670c5c2fa1da58500
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment