Skip to content

Fix crash test Restore() failure with empty trace when replay_write_ops == 0 (#14840)#14840

Open
anand1976 wants to merge 1 commit into
facebook:mainfrom
anand1976:export-D107968797
Open

Fix crash test Restore() failure with empty trace when replay_write_ops == 0 (#14840)#14840
anand1976 wants to merge 1 commit into
facebook:mainfrom
anand1976:export-D107968797

Conversation

@anand1976

@anand1976 anand1976 commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Summary:
ExpectedState::Restore() always attempted to open and replay the post-saved_seqno_ trace. In the crash-test failure, recovery had already returned the DB exactly to saved_seqno_, so replay_write_ops == 0 and the trace was legitimately empty. FileTraceReader reported Status::Incomplete() on that empty trace, which incorrectly failed restore.

Skip the trace replay work when replay_write_ops == 0, but still run the normal state promotion, cleanup, and saved_seqno_ reset path so later saves continue to work.

Differential Revision: D107968797

@meta-cla meta-cla Bot added the CLA Signed label Jun 9, 2026
@meta-codesync

meta-codesync Bot commented Jun 9, 2026

Copy link
Copy Markdown

@anand1976 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D107968797.

@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

✅ clang-tidy: No findings on changed lines

Completed in 44.0s.

@meta-codesync meta-codesync Bot changed the title Fix crash test Restore() failure with empty trace when replay_write_ops == 0 Jun 9, 2026
anand1976 added a commit to anand1976/rocksdb that referenced this pull request Jun 9, 2026
…ps == 0 (facebook#14840)

Summary: Pull Request resolved: facebook#14840

Differential Revision: D107968797
@anand1976 anand1976 force-pushed the export-D107968797 branch from a4e4ab3 to 88a1016 Compare June 9, 2026 17:15
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

Codex Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit 88a1016


Codex review failed before producing findings.

WARNING: proceeding, even though we could not create PATH aliases: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase
…ps == 0 (facebook#14840)

Summary:
`ExpectedState::Restore()` always attempted to open and replay the post-`saved_seqno_` trace. In the crash-test failure, recovery had already returned the DB exactly to `saved_seqno_`, so `replay_write_ops == 0` and the trace was legitimately empty. `FileTraceReader` reported `Status::Incomplete()` on that empty trace, which incorrectly failed restore.

Skip the trace replay work when `replay_write_ops == 0`, but still run the normal state promotion, cleanup, and `saved_seqno_` reset path so later saves continue to work.


Differential Revision: D107968797
@anand1976 anand1976 force-pushed the export-D107968797 branch from 88a1016 to b00624a Compare June 9, 2026 18:54
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit b00624a


Codex review failed before producing findings.

WARNING: proceeding, even though we could not create PATH aliases: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit b00624a


Summary

Clean, well-scoped fix for a legitimate crash test failure. The change correctly skips trace replay when replay_write_ops == 0 (DB recovered exactly to saved_seqno_), while preserving all state promotion, cleanup, and saved_seqno_ reset logic. No correctness issues found.

High-severity findings (0):

No high-severity findings.

Full review (click to expand)

Findings

🟡 MEDIUM

M1. Scattered guards could be simplified with early block skip -- expected_state.cc:1114
  • Issue: The fix adds replay_write_ops > 0 guards at 5 separate locations throughout the replay section. This works correctly but increases cognitive load and maintenance burden. A single if (replay_write_ops > 0) { ... } block wrapping the entire replay section (handler creation, replayer setup, replay loop, completion check) would be cleaner while preserving the same state-promotion/cleanup flow below.
  • Root cause: The replay logic and state-promotion logic are interleaved in a single scope rather than being separated into distinct blocks.
  • Suggested fix: Wrap lines 1147-1193 in a single if (replay_write_ops > 0) { ... } block instead of adding the condition to each individual if statement. The trace_reader conditional on line 1116 is already naturally separate and can remain as-is.
M2. No dedicated test coverage for replay_write_ops == 0 path -- expected_state.cc:1081
  • Issue: There are no unit tests for FileExpectedStateManager::Restore(), and the replay_write_ops == 0 edge case is only exercised by db_stress under specific crash-recovery timing. The new code path introduces null handler/replayer/trace_reader pointers that are handled correctly via short-circuit evaluation and null checks, but should have explicit test coverage to prevent regressions.
  • Root cause: FileExpectedStateManager is test infrastructure that itself lacks unit tests.
  • Suggested fix: Consider adding a test that sets up a saved state at seqno N, opens a DB at seqno N, and verifies Restore() succeeds without trace replay. This is a suggestion for follow-up, not a merge blocker for a bug fix.

🟢 LOW / NIT

L1. Comment could be more precise -- expected_state.cc:1116
  • Issue: The comment says "skip opening/replaying the trace and reuse the normal state promotion and cleanup flow below." This is accurate but could note why it's safe: because replay_write_ops == 0 means no writes survived recovery past saved_seqno_, so the saved state file already matches the DB.
  • Suggested fix: Optional: extend comment to say "...the saved state already matches the DB, so no replay is needed."

Cross-Component Analysis

Context Does code execute? Assumptions hold? Action needed?
WritePreparedTxnDB YES (db_stress supports --use_txn) YES - recovered prepared txns are committed/rolled back in FinishInitDb() before SaveAtAndAfter() is called Safe
WriteUnpreparedTxnDB YES YES - same resolution path as WritePrepared Safe
ReadOnly DB NO - db_stress doesn't use ReadOnly for Restore N/A N/A
Multiple column families YES YES - sequence numbers are global Safe
User-defined timestamps YES (if configured) YES - timestamps don't affect seqno arithmetic Safe
Sequence number underflow Guarded by existing check at line 1077-1078 YES Pre-existing protection, not affected by this PR

Assumption stress-test results:

Claim Preconditions Counterexample attempted Result
"replay_write_ops == 0 means no writes survived" SaveAtAndAfter() called with DB quiesced (API contract) Race condition during SaveAtAndAfter INVALID - API contract requires no concurrent mutation
"Trace file is legitimately empty" Writes traced after save are lost if DB recovers to saved_seqno_ WAL replay bringing DB to saved_seqno_ with traced-but-lost writes SAFE - lost writes shouldn't be replayed against a DB that doesn't have them
"2PC transactions create false replay_write_ops == 0" Prepared txns resolved before SaveAtAndAfter Prepared txn auto-committed during recovery INVALID - db_stress resolves all prepared txns in FinishInitDb() before SaveAtAndAfter()
"Skipping Prepare() loses validation" Trace file not used when replay_write_ops == 0 Corrupted trace file IRRELEVANT - trace is deleted during cleanup, validation serves no purpose

Positive Observations

  • The fix correctly preserves the entire state-promotion and cleanup flow (CopyFile, RenameFile, Open, DeleteFile, saved_seqno_ reset), which is the most important invariant.
  • Short-circuit evaluation in s.ok() && replay_write_ops > 0 && !handler->IsDone() is idiomatic C++ and prevents null dereference.
  • The existing if (handler) null check at line 1195 naturally handles the no-replay case for statistics extraction.
  • The fix eliminates unnecessary file I/O when no replay is needed.

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

1 participant