Skip to content

[rocksdb] Sync recovery SST directory before reused MANIFEST append#14780

Closed
xingbowang wants to merge 1 commit into
facebook:mainfrom
xingbowang:2026_05_23_oncall_T272584339
Closed

[rocksdb] Sync recovery SST directory before reused MANIFEST append#14780
xingbowang wants to merge 1 commit into
facebook:mainfrom
xingbowang:2026_05_23_oncall_T272584339

Conversation

@xingbowang

@xingbowang xingbowang commented May 23, 2026

Copy link
Copy Markdown
Contributor

Summary

  • When reuse_manifest_on_open reuses the current MANIFEST, DB::Open recovery can flush WAL data into a new L0 SST and append the corresponding VersionEdit to that already-current MANIFEST.
  • If open later fails and the process crashes, the MANIFEST edit can be durable while the recovered SST directory entry is not, leaving the DB pointing at a missing SST.
  • Fsync the recovered SST's data directory before adding the file to the recovery edit when appending to a reused MANIFEST.
  • Add a regression test that injects failure after MANIFEST sync, simulates crash cleanup of files created after the last directory sync, and verifies the recovered key remains readable.

Task

  • T272584339

Test Plan

CI

@meta-cla meta-cla Bot added the CLA Signed label May 23, 2026
@meta-codesync

meta-codesync Bot commented May 23, 2026

Copy link
Copy Markdown

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D106201774.

@github-actions

github-actions Bot commented May 23, 2026

Copy link
Copy Markdown

✅ clang-tidy: No findings on changed lines

Completed in 380.4s.

@github-actions

github-actions Bot commented May 23, 2026

Copy link
Copy Markdown

Codex Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit 1c0d2f5


Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase
@github-actions

github-actions Bot commented May 23, 2026

Copy link
Copy Markdown

Claude Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit 1c0d2f5


Summary

Correct and well-targeted fix for a real crash-safety gap in the reuse_manifest_on_open recovery path. The approach follows established RocksDB patterns (FlushJob, CompactionJob). One medium-severity defensive coding issue found.

High-severity findings (0):

No high-severity findings.

Full review (click to expand)

Findings

🟡 MEDIUM

M1. Missing null check on GetDataDir() return value -- db/db_impl/db_impl_open.cc:2240
  • Issue: cfd->GetDataDir(meta.fd.GetPathId()) can return nullptr when data_dirs_ is empty (see column_family.cc:1752-1754). The code dereferences the result without a null check, which would crash.
  • Root cause: GetDataDir() returns nullptr when data_dirs_ is empty. In normal recovery flow, AddDirectories() (line 719) runs before RecoverLogFiles() (line 851), so data_dirs_ is populated. However, both FlushJob (flush_job.cc:1104) and CompactionJob (compaction_job.cc:864) defensively null-check before calling FsyncWithDirOptions, establishing a clear codebase convention. Edge cases (partial AddDirectories failure, unusual configurations) could theoretically leave data_dirs_ empty.
  • Suggested fix:
    if (s.ok() && has_output) {
      if (immutable_db_options_.reuse_manifest_on_open &&
          versions_->descriptor_log_ != nullptr) {
        FSDirectory* data_dir = cfd->GetDataDir(meta.fd.GetPathId());
        if (data_dir != nullptr) {
          s = data_dir->FsyncWithDirOptions(
                  IOOptions(), nullptr,
                  DirFsyncOptions(
                      DirFsyncOptions::FsyncReason::kNewFileSynced));
        }
      }
    }

🟢 LOW / NIT

L1. Blob file directory sync not addressed -- db/db_impl/db_impl_open.cc:2248
  • Issue: BuildTable can produce blob file additions during recovery (line 2074, 2170, 2248-2250). The PR syncs the SST directory but not the blob file directory. CompactionJob handles this separately (compaction_job.cc:870-873).
  • Root cause: In practice, during recovery blob files are created in cf_paths.front().path (blob_file_builder.cc:191), which is typically the same directory as the SST. The risk is low because blob files share the SST directory in the common case, and the same directory fsync covers both. This would only matter if blob files used a different directory, which doesn't happen via BuildTable during recovery.
  • Suggested fix: Consider adding a comment noting that blob files share the SST directory during recovery, or adding a separate blob dir fsync if blob_output_directory != data_dir (matching compaction_job pattern) for completeness.
L2. Test covers only single-CF, single-SST recovery -- db/db_basic_test.cc:770
  • Issue: The regression test exercises the fix for a single column family with a single recovered key. Multi-CF recovery and multiple recovered SSTs are not tested.
  • Suggested fix: Consider adding a variant with multiple column families to ensure directory sync works correctly across CFs.

Cross-Component Analysis

Context Affected? Safe? Notes
WritePreparedTxnDB Yes (same recovery) Yes Same code path
ReadOnly DB No N/A No recovery flush in read-only mode
best_efforts_recovery No N/A Mutually exclusive with reuse_manifest_on_open
User-defined timestamps Yes Yes Orthogonal to dir sync
btrfs filesystem Yes Yes btrfs skips dir fsync for kNewFileSynced intentionally -- btrfs guarantees new file visibility after file fsync without dir fsync. This is a documented RocksDB optimization (io_posix.cc:1959-1963), not a bug.
avoid_flush_during_recovery No N/A Recovery SSTs not created
Multiple db_paths Yes Yes GetPathId() selects correct dir

Positive Observations

  • Correct fix location: The fsync is placed after SST creation and before edit->AddFile(), matching the established pattern in FlushJob and CompactionJob.
  • Minimal and focused scope: Only the necessary code path is modified (reused MANIFEST during recovery).
  • Proper condition: Checking both reuse_manifest_on_open and descriptor_log_ != nullptr correctly identifies exactly the case needing the fix.
  • Good test design: The regression test uses FaultInjectionTestFS and SyncPoint injection (not sleep), making it deterministic and non-flaky. The test correctly simulates the crash scenario using DeleteFilesCreatedAfterLastDirSync.
  • Cold path: This runs only during DB::Open recovery, so there is zero performance impact on steady-state operations.
  • Error propagation is correct: A failed fsync properly prevents edit->AddFile() from executing (the subsequent if (s.ok() && has_output) block is gated on s).

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase
Summary: DB::Open recovery can flush WAL data into a new L0 SST and then publish a MANIFEST edit referencing it.

When reuse_manifest_on_open appends to an already-current MANIFEST, there is no CURRENT update to fsync the DB directory. The same ordering assumption is unsafe for fresh MANIFEST publication: syncing CURRENT does not make the recovered SST directory entry durable, especially when the SST lives in a separate data path.

Fix by fsyncing the recovered SST data directory before adding the file to the recovery VersionEdit. Add regression coverage for both reused-MANIFEST append and fresh-MANIFEST publish; the fresh-MANIFEST test fails without the broader fsync because crash cleanup deletes the referenced SST.

Test Plan: make clean

make format-auto

make -j128 db_basic_test

timeout 60 ./db_basic_test --gtest_filter='DBBasicTest.RecoverySstDirSyncedBeforeFreshManifestPublish:DBBasicTest.ReuseManifestOnOpenSyncsRecoverySstDirBeforeManifestAppend'

timeout 60 ./db_basic_test --gtest_filter='DBBasicTest.RecoverySstDirSyncedBeforeFreshManifestPublish' # expected failure after temporarily restoring the old reused-MANIFEST-only condition

timeout 60 ./db_basic_test --gtest_filter='DBBasicTest.RecoverySstDirSyncedBeforeFreshManifestPublish:DBBasicTest.ReuseManifestOnOpenSyncsRecoverySstDirBeforeManifestAppend' --gtest_repeat=5

timeout 60 ./db_basic_test --gtest_filter='DBBasicTest.ReuseManifestOnOpen*'

make check-sources

git diff --check

Reviewers: anand76

Subscribers:

Tasks: 272584339

Tags:
@xingbowang xingbowang force-pushed the 2026_05_23_oncall_T272584339 branch from 1c0d2f5 to cad0d2f Compare May 26, 2026 17:33
@meta-codesync

meta-codesync Bot commented May 26, 2026

Copy link
Copy Markdown

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D106201774.

@github-actions

Copy link
Copy Markdown

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit cad0d2f


Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase
@github-actions

Copy link
Copy Markdown

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit cad0d2f


Summary

Well-targeted fix for a genuine durability race condition during DB recovery with reuse_manifest_on_open. The core approach is correct and follows established FlushJob/CompactionJob patterns. One substantive gap identified around blob file coverage.

High-severity findings (1):

  • [db/db_impl/db_impl_open.cc:~2248] Blob files created during recovery (blob_file_additions) are added to the VersionEdit without a prior directory fsync, creating the same race condition the PR fixes for SST files.
Full review (click to expand)

Findings

🔴 HIGH

H1. Missing blob file directory fsync -- db/db_impl/db_impl_open.cc:2248
  • Issue: The PR adds directory fsync before edit->AddFile() for SST files, but blob files created during recovery via BuildTable (line 2170: &blob_file_additions) are added to the same VersionEdit (lines 2248-2250: edit->AddBlobFile(blob)) without any directory fsync. Blob files go to cf_paths.front().path (see blob_file_builder.cc:191), which may or may not be the same directory as the SST.
  • Root cause: The PR only considered SST files when adding the durability barrier. BuildTable can produce both SST and blob file additions during recovery when enable_blob_files=true.
  • Suggested fix: Add blob file directory fsync before edit->AddBlobFile(), or consolidate by syncing all relevant directories before any edit->Add* calls. CompactionJob already has this pattern (lines 870-874: separate blob_output_directory_ fsync when it differs from the SST output directory).

🟡 MEDIUM

M1. Test 1 may not exercise the fix for the stated bug -- db/db_basic_test.cc
  • Issue: RecoverySstDirSyncedBeforeFreshManifestPublish uses reuse_manifest_on_open = false (fresh MANIFEST). The PR's core bug is about reused MANIFESTs. However, this test IS valid for a related scenario: when using separate db_paths, the recovery SST goes to a different directory than the MANIFEST, and SetCurrentFile only syncs the MANIFEST directory -- not the SST data directory. So this test covers a complementary durability gap for the fresh-MANIFEST + separate-db_paths case.
  • Root cause: The test comment could be clearer about WHY this scenario needs the fix (the SST is in dbname_ + "_2", not in dbname_ where CURRENT/MANIFEST live).
  • Suggested fix: Improve the test comment to explicitly state the cross-directory aspect is what makes this test necessary even with fresh MANIFESTs.
M2. Redundant directory fsyncs for multi-CF recovery -- db/db_impl/db_impl_open.cc
  • Issue: When multiple column families flush during recovery (common with many CFs and WAL data), each calls WriteLevel0TableForRecovery independently, producing redundant fsyncs of the same directory (since recovery SSTs always use path_id=0).
  • Root cause: No deduplication of directory fsyncs across CFs.
  • Suggested fix: This is a minor performance nit for the cold recovery path. Could be addressed in a follow-up by tracking which directories have been synced, but not blocking.

🟢 LOW / NIT

L1. Comment accuracy -- db/db_impl/db_impl_open.cc:~2238
  • Issue: The comment says "fresh MANIFEST publication should not be relied on to order the recovered SST entry." This is accurate for the common case where SSTs and MANIFEST share a directory, but the stronger reason (which Test 1 demonstrates) is that with separate db_paths, SetCurrentFile syncs the wrong directory entirely.
  • Suggested fix: Consider expanding the comment to mention the separate-directory case.
L2. Test deduplication opportunity -- db/db_basic_test.cc
  • Issue: The two new tests share substantial setup code (FaultInjectionTestFS creation, options setup, put/flush/WAL flush/close pattern, fault injection cleanup, and verification). Per CLAUDE.md test dedup guidelines, common patterns should be extracted.
  • Suggested fix: Extract a helper function for the shared setup/teardown pattern.

Cross-Component Analysis

Context Affected? Assumptions Hold? Notes
WritePreparedTxnDB No custom override Yes Inherits DBImpl recovery
ReadOnly DB Not executed N/A read_only=true skips flush
SecondaryInstance Not executed N/A No WAL recovery flush
Multiple db_paths Correctly handled Yes path_id=0 always, Test 1 covers
BlobDB Gap No Blob files not synced (H1)
User-defined timestamps Compatible Yes Orthogonal to dir sync
MemPurge Not applicable N/A Different code path
FIFO/Universal compaction Compatible Yes Compaction style irrelevant for recovery

Assumption stress-test results:

  • "Reused MANIFESTs have no CURRENT update" -- holds for typical recovery. Theoretically violated if MANIFEST exceeds size limit mid-recovery, but this would only add redundant protection (CURRENT sync + the new dir sync).
  • "Directory fsync before MANIFEST write is correct ordering" -- confirmed by examining FlushJob (sync at line 1104, MANIFEST at line 1299) and CompactionJob (sync at line 1147, MANIFEST at line 2405). The PR follows the identical pattern.
  • Null pointer risk from GetDataDir -- disproven. Robust 4-level fallback chain ending at db_dir_.get() which is always initialized during DB open.

Positive Observations

  • The fix follows the exact same pattern used by FlushJob and CompactionJob for directory durability.
  • Error handling is correct: fsync failure in s prevents edit->AddFile from executing.
  • The new AfterSetCurrentFile test sync point enables testing of the fresh-MANIFEST crash scenario.
  • Recovery SST always uses path_id=0 (hardcoded at line 2079), making the behavior fully deterministic.
  • Performance impact is minimal: one directory fsync per recovered SST on the cold recovery path.

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase
@meta-codesync

meta-codesync Bot commented May 27, 2026

Copy link
Copy Markdown

@xingbowang merged this pull request in d7afd3b.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

2 participants