Fix db_stress rollback retry exhaustion under concurrent Resume Busy#14877
Fix db_stress rollback retry exhaustion under concurrent Resume Busy#14877pdillinger wants to merge 1 commit into
Conversation
Summary: The rollback-after-failed-commit retry loop in `db_stress_test_base.cc` (added in b0b52e1d14f3) counted every loop iteration — including iterations where `Resume()` returned Busy — against a single 100-count limit. With 32 concurrent threads, when write fault injection causes all prepared transaction commits and rollbacks to fail simultaneously, `Resume()` returns Busy (recovery in progress from another thread) for the entire 100-iteration budget (100 × 10ms = 1 second). Recovery itself keeps failing due to ongoing fault injection causing repeated background flush errors, so no thread ever gets a chance to perform an actual rollback. The fix separates two budgets: - `kMaxRollbackRetries` (100): counts only actual rollback attempts after `Resume()` returns OK - `kMaxResumeBusyWaits` (3000): allows up to 30 seconds of busy-waiting for recovery to complete on another thread Key triggering options: `write_fault_one_in=1000` + `txn_write_policy=1` (write-prepared) + `two_write_queues=1` + 32 threads + `error_recovery_with_no_fault_injection=1`. Differential Revision: D108561314
|
@pdillinger has exported this pull request. If you are a Meta employee, you can view the originating Diff in D108561314. |
✅ clang-tidy: No findings on changed linesCompleted in 79.1s. |
🟡 Codex Code ReviewAuto-triggered after CI passed — reviewing commit f38e738 ❌ Codex review failed before producing findings. ℹ️ About this responseGenerated by Codex CLI. Limitations:
Commands:
|
✅ Claude Code ReviewAuto-triggered after CI passed — reviewing commit f38e738 SummaryClean, well-motivated fix to db_stress test infrastructure that separates retry budgets for rollback attempts vs. Resume-Busy waits. The logic is correct and the approach is sound. The change prevents false-positive stress test failures when 32 concurrent threads exhaust a shared-concept retry budget on Resume() Busy responses. High-severity findings (0): Full review (click to expand)Findings🟡 MEDIUMM1. Worst-case 30-second stall per thread —
|
| Context | Relevant? | Assessment |
|---|---|---|
| Multiple concurrent threads | YES | Primary motivation — 32 threads competing for Resume(). Fix correctly gives each thread independent budgets. |
| Resume() returning Busy | YES | Resume() returns Busy when auto-recovery is in progress on another thread. Fix correctly waits longer. |
| Fault injection interaction | YES | Loop only entered when IsErrorInjectedAndRetryable(rollback_s) — only during fault injection. |
| Test timeout | POSSIBLE | 30-second stall could interact with CI timeouts, but false failure is worse. |
Loop termination analysis:
Every iteration either increments one of the two counters or breaks. The loop is guaranteed to terminate. There is no path that skips both increment and break.
Error reporting: The fprintf correctly reports both counters. The semantic change (retries now counts only actual rollback attempts) improves diagnostic value.
Positive Observations
- Minimal and surgical fix
- Comment explains the "why" not just the "what"
- Descriptive constant names (
kMaxRollbackRetries,kMaxResumeBusyWaits) - Existing error reporting works correctly without modification
- PR description thoroughly documents the triggering scenario
ℹ️ About this response
Generated by Claude Code.
Review methodology: claude_md/code_review.md
Limitations:
- Claude may miss context from files not in the diff
- Large PRs may be truncated
- Always apply human judgment to AI suggestions
Commands:
/claude-review [context]— Request a code review/claude-query <question>— Ask about the PR or codebase
joshkang97
left a comment
There was a problem hiding this comment.
Review automatically exported from Phabricator review in Meta.
|
This pull request has been merged in 30295a4. |
Summary:
The rollback-after-failed-commit retry loop in
db_stress_test_base.cc(added in b0b52e1d14f3) counted every loop iteration — including iterations whereResume()returned Busy — against a single 100-count limit. With 32 concurrent threads, when write fault injection causes all prepared transaction commits and rollbacks to fail simultaneously,Resume()returns Busy (recovery in progress from another thread) for the entire 100-iteration budget (100 × 10ms = 1 second). Recovery itself keeps failing due to ongoing fault injection causing repeated background flush errors, so no thread ever gets a chance to perform an actual rollback.The fix separates two budgets:
kMaxRollbackRetries(100): counts only actual rollback attempts afterResume()returns OKkMaxResumeBusyWaits(3000): allows up to 30 seconds of busy-waiting for recovery to complete on another threadKey triggering options:
write_fault_one_in=1000+txn_write_policy=1(write-prepared) +two_write_queues=1+ 32 threads +error_recovery_with_no_fault_injection=1.Differential Revision: D108561314