[rocksdb] Sync recovery SST directory before reused MANIFEST append#14780
[rocksdb] Sync recovery SST directory before reused MANIFEST append#14780xingbowang wants to merge 1 commit into
Conversation
|
@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D106201774. |
✅ clang-tidy: No findings on changed linesCompleted in 380.4s. |
Codex Code Review - OBSOLETESuperseded by a newer AI review. Expand to see the original review.🟡 Codex Code ReviewAuto-triggered after CI passed — reviewing commit 1c0d2f5 ❌ Codex review failed before producing findings. ℹ️ About this responseGenerated by Codex CLI. Limitations:
Commands:
|
Claude Code Review - OBSOLETESuperseded by a newer AI review. Expand to see the original review.✅ Claude Code ReviewAuto-triggered after CI passed — reviewing commit 1c0d2f5 SummaryCorrect and well-targeted fix for a real crash-safety gap in the High-severity findings (0): No high-severity findings. Full review (click to expand)Findings🟡 MEDIUMM1. Missing null check on
|
| Context | Affected? | Safe? | Notes |
|---|---|---|---|
| WritePreparedTxnDB | Yes (same recovery) | Yes | Same code path |
| ReadOnly DB | No | N/A | No recovery flush in read-only mode |
| best_efforts_recovery | No | N/A | Mutually exclusive with reuse_manifest_on_open |
| User-defined timestamps | Yes | Yes | Orthogonal to dir sync |
| btrfs filesystem | Yes | Yes | btrfs skips dir fsync for kNewFileSynced intentionally -- btrfs guarantees new file visibility after file fsync without dir fsync. This is a documented RocksDB optimization (io_posix.cc:1959-1963), not a bug. |
| avoid_flush_during_recovery | No | N/A | Recovery SSTs not created |
| Multiple db_paths | Yes | Yes | GetPathId() selects correct dir |
Positive Observations
- Correct fix location: The fsync is placed after SST creation and before
edit->AddFile(), matching the established pattern in FlushJob and CompactionJob. - Minimal and focused scope: Only the necessary code path is modified (reused MANIFEST during recovery).
- Proper condition: Checking both
reuse_manifest_on_openanddescriptor_log_ != nullptrcorrectly identifies exactly the case needing the fix. - Good test design: The regression test uses FaultInjectionTestFS and SyncPoint injection (not sleep), making it deterministic and non-flaky. The test correctly simulates the crash scenario using
DeleteFilesCreatedAfterLastDirSync. - Cold path: This runs only during
DB::Openrecovery, so there is zero performance impact on steady-state operations. - Error propagation is correct: A failed fsync properly prevents
edit->AddFile()from executing (the subsequentif (s.ok() && has_output)block is gated ons).
ℹ️ About this response
Generated by Claude Code.
Review methodology: claude_md/code_review.md
Limitations:
- Claude may miss context from files not in the diff
- Large PRs may be truncated
- Always apply human judgment to AI suggestions
Commands:
/claude-review [context]— Request a code review/claude-query <question>— Ask about the PR or codebase
Summary: DB::Open recovery can flush WAL data into a new L0 SST and then publish a MANIFEST edit referencing it. When reuse_manifest_on_open appends to an already-current MANIFEST, there is no CURRENT update to fsync the DB directory. The same ordering assumption is unsafe for fresh MANIFEST publication: syncing CURRENT does not make the recovered SST directory entry durable, especially when the SST lives in a separate data path. Fix by fsyncing the recovered SST data directory before adding the file to the recovery VersionEdit. Add regression coverage for both reused-MANIFEST append and fresh-MANIFEST publish; the fresh-MANIFEST test fails without the broader fsync because crash cleanup deletes the referenced SST. Test Plan: make clean make format-auto make -j128 db_basic_test timeout 60 ./db_basic_test --gtest_filter='DBBasicTest.RecoverySstDirSyncedBeforeFreshManifestPublish:DBBasicTest.ReuseManifestOnOpenSyncsRecoverySstDirBeforeManifestAppend' timeout 60 ./db_basic_test --gtest_filter='DBBasicTest.RecoverySstDirSyncedBeforeFreshManifestPublish' # expected failure after temporarily restoring the old reused-MANIFEST-only condition timeout 60 ./db_basic_test --gtest_filter='DBBasicTest.RecoverySstDirSyncedBeforeFreshManifestPublish:DBBasicTest.ReuseManifestOnOpenSyncsRecoverySstDirBeforeManifestAppend' --gtest_repeat=5 timeout 60 ./db_basic_test --gtest_filter='DBBasicTest.ReuseManifestOnOpen*' make check-sources git diff --check Reviewers: anand76 Subscribers: Tasks: 272584339 Tags:
1c0d2f5 to
cad0d2f
Compare
|
@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D106201774. |
🟡 Codex Code ReviewAuto-triggered after CI passed — reviewing commit cad0d2f ❌ Codex review failed before producing findings. ℹ️ About this responseGenerated by Codex CLI. Limitations:
Commands:
|
✅ Claude Code ReviewAuto-triggered after CI passed — reviewing commit cad0d2f SummaryWell-targeted fix for a genuine durability race condition during DB recovery with High-severity findings (1):
Full review (click to expand)Findings🔴 HIGHH1. Missing blob file directory fsync --
|
| Context | Affected? | Assumptions Hold? | Notes |
|---|---|---|---|
| WritePreparedTxnDB | No custom override | Yes | Inherits DBImpl recovery |
| ReadOnly DB | Not executed | N/A | read_only=true skips flush |
| SecondaryInstance | Not executed | N/A | No WAL recovery flush |
| Multiple db_paths | Correctly handled | Yes | path_id=0 always, Test 1 covers |
| BlobDB | Gap | No | Blob files not synced (H1) |
| User-defined timestamps | Compatible | Yes | Orthogonal to dir sync |
| MemPurge | Not applicable | N/A | Different code path |
| FIFO/Universal compaction | Compatible | Yes | Compaction style irrelevant for recovery |
Assumption stress-test results:
- "Reused MANIFESTs have no CURRENT update" -- holds for typical recovery. Theoretically violated if MANIFEST exceeds size limit mid-recovery, but this would only add redundant protection (CURRENT sync + the new dir sync).
- "Directory fsync before MANIFEST write is correct ordering" -- confirmed by examining FlushJob (sync at line 1104, MANIFEST at line 1299) and CompactionJob (sync at line 1147, MANIFEST at line 2405). The PR follows the identical pattern.
- Null pointer risk from
GetDataDir-- disproven. Robust 4-level fallback chain ending atdb_dir_.get()which is always initialized during DB open.
Positive Observations
- The fix follows the exact same pattern used by FlushJob and CompactionJob for directory durability.
- Error handling is correct: fsync failure in
spreventsedit->AddFilefrom executing. - The new
AfterSetCurrentFiletest sync point enables testing of the fresh-MANIFEST crash scenario. - Recovery SST always uses
path_id=0(hardcoded at line 2079), making the behavior fully deterministic. - Performance impact is minimal: one directory fsync per recovered SST on the cold recovery path.
ℹ️ About this response
Generated by Claude Code.
Review methodology: claude_md/code_review.md
Limitations:
- Claude may miss context from files not in the diff
- Large PRs may be truncated
- Always apply human judgment to AI suggestions
Commands:
/claude-review [context]— Request a code review/claude-query <question>— Ask about the PR or codebase
|
@xingbowang merged this pull request in d7afd3b. |
Summary
reuse_manifest_on_openreuses the current MANIFEST,DB::Openrecovery can flush WAL data into a new L0 SST and append the correspondingVersionEditto that already-current MANIFEST.Task
Test Plan
CI