Skip to content

Fix db_stress rollback retry exhaustion under concurrent Resume Busy#14877

Closed
pdillinger wants to merge 1 commit into
facebook:mainfrom
pdillinger:export-D108561314
Closed

Fix db_stress rollback retry exhaustion under concurrent Resume Busy#14877
pdillinger wants to merge 1 commit into
facebook:mainfrom
pdillinger:export-D108561314

Conversation

@pdillinger

Copy link
Copy Markdown
Contributor

Summary:
The rollback-after-failed-commit retry loop in db_stress_test_base.cc (added in b0b52e1d14f3) counted every loop iteration — including iterations where Resume() returned Busy — against a single 100-count limit. With 32 concurrent threads, when write fault injection causes all prepared transaction commits and rollbacks to fail simultaneously, Resume() returns Busy (recovery in progress from another thread) for the entire 100-iteration budget (100 × 10ms = 1 second). Recovery itself keeps failing due to ongoing fault injection causing repeated background flush errors, so no thread ever gets a chance to perform an actual rollback.

The fix separates two budgets:

  • kMaxRollbackRetries (100): counts only actual rollback attempts after Resume() returns OK
  • kMaxResumeBusyWaits (3000): allows up to 30 seconds of busy-waiting for recovery to complete on another thread

Key triggering options: write_fault_one_in=1000 + txn_write_policy=1 (write-prepared) + two_write_queues=1 + 32 threads + error_recovery_with_no_fault_injection=1.

Differential Revision: D108561314

Summary:
The rollback-after-failed-commit retry loop in `db_stress_test_base.cc` (added in b0b52e1d14f3) counted every loop iteration — including iterations where `Resume()` returned Busy — against a single 100-count limit. With 32 concurrent threads, when write fault injection causes all prepared transaction commits and rollbacks to fail simultaneously, `Resume()` returns Busy (recovery in progress from another thread) for the entire 100-iteration budget (100 × 10ms = 1 second). Recovery itself keeps failing due to ongoing fault injection causing repeated background flush errors, so no thread ever gets a chance to perform an actual rollback.

The fix separates two budgets:
- `kMaxRollbackRetries` (100): counts only actual rollback attempts after `Resume()` returns OK
- `kMaxResumeBusyWaits` (3000): allows up to 30 seconds of busy-waiting for recovery to complete on another thread

Key triggering options: `write_fault_one_in=1000` + `txn_write_policy=1` (write-prepared) + `two_write_queues=1` + 32 threads + `error_recovery_with_no_fault_injection=1`.

Differential Revision: D108561314
@meta-cla meta-cla Bot added the CLA Signed label Jun 23, 2026
@meta-codesync

meta-codesync Bot commented Jun 23, 2026

Copy link
Copy Markdown

@pdillinger has exported this pull request. If you are a Meta employee, you can view the originating Diff in D108561314.

@github-actions

Copy link
Copy Markdown

✅ clang-tidy: No findings on changed lines

Completed in 79.1s.

@github-actions

Copy link
Copy Markdown

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit f38e738


Codex review failed before producing findings.

WARNING: proceeding, even though we could not create PATH aliases: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase
@github-actions

Copy link
Copy Markdown

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit f38e738


Summary

Clean, well-motivated fix to db_stress test infrastructure that separates retry budgets for rollback attempts vs. Resume-Busy waits. The logic is correct and the approach is sound. The change prevents false-positive stress test failures when 32 concurrent threads exhaust a shared-concept retry budget on Resume() Busy responses.

High-severity findings (0):
No high-severity findings.

Full review (click to expand)

Findings

🟡 MEDIUM

M1. Worst-case 30-second stall per thread — db_stress_test_base.cc:1224
  • Issue: With kMaxResumeBusyWaits=3000 at 10ms each, a thread can block for up to 30 seconds in the busy-wait loop. With 32 threads potentially all in this loop simultaneously, the stress test could experience a coordinated 30-second stall. While threads are sleeping (not spinning), this delays the stress test's progress and could interact poorly with external test timeouts.
  • Root cause: The 30x increase in busy-wait budget (from effective ~1s to 30s) is necessary for correctness but represents a significant increase in worst-case latency.
  • Suggested fix: This is likely acceptable given the alternative (false test failure). Consider adding a log message when resume_busy_count exceeds a threshold (e.g., 1000) to help diagnose slow stress test runs, though this is optional.

🟢 LOW / NIT

L1. Empty for-loop increment expression — db_stress_test_base.cc:1224
  • Issue: The for loop now has an empty increment expression: for (; ...; ). This is functionally a while loop. Using while (rollback_recovery_retries < kMaxRollbackRetries && resume_busy_count < kMaxResumeBusyWaits && !rollback_s.ok()) would be slightly more idiomatic.
  • Suggested fix: Consider converting to a while loop for clarity, though this is purely stylistic and the current form is fine.
L2. rollback_recovery_retries increment placement — db_stress_test_base.cc:1229
  • Issue: The increment ++rollback_recovery_retries is placed before the Rollback() call. It counts attempts before knowing the result. This is correct (counting attempts, not successes), but moving it after the Rollback() call would read more naturally.
  • Suggested fix: None needed — the semantics are correct.

Cross-Component Analysis

Context Relevant? Assessment
Multiple concurrent threads YES Primary motivation — 32 threads competing for Resume(). Fix correctly gives each thread independent budgets.
Resume() returning Busy YES Resume() returns Busy when auto-recovery is in progress on another thread. Fix correctly waits longer.
Fault injection interaction YES Loop only entered when IsErrorInjectedAndRetryable(rollback_s) — only during fault injection.
Test timeout POSSIBLE 30-second stall could interact with CI timeouts, but false failure is worse.

Loop termination analysis:
Every iteration either increments one of the two counters or breaks. The loop is guaranteed to terminate. There is no path that skips both increment and break.

Error reporting: The fprintf correctly reports both counters. The semantic change (retries now counts only actual rollback attempts) improves diagnostic value.

Positive Observations

  • Minimal and surgical fix
  • Comment explains the "why" not just the "what"
  • Descriptive constant names (kMaxRollbackRetries, kMaxResumeBusyWaits)
  • Existing error reporting works correctly without modification
  • PR description thoroughly documents the triggering scenario

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase

@joshkang97 joshkang97 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review automatically exported from Phabricator review in Meta.

@meta-codesync meta-codesync Bot closed this in 30295a4 Jun 23, 2026
@meta-codesync

meta-codesync Bot commented Jun 23, 2026

Copy link
Copy Markdown

This pull request has been merged in 30295a4.

@meta-codesync meta-codesync Bot added the Merged label Jun 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment