Add reuse_manifest_on_open DBOption (#14704)#14704
Conversation
|
@anand1976 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D103568447. |
✅ clang-tidy: No findings on changed linesCompleted in 572.7s. |
Summary: Pull Request resolved: facebook#14704 Add an opt-in immutable DBOption reuse_manifest_on_open (default false). When true, VersionSet::Recover reopens the existing MANIFEST file for append at the end of recovery, so the next LogAndApply call appends to it instead of creating a fresh MANIFEST. The default behavior creates a fresh MANIFEST because descriptor_log_ is null after Recover. Avoids the cost of fresh-MANIFEST creation on the first post-open write — both the file system create call and the WriteCurrentStateToManifest preamble that serializes the entire in-memory database state into the new file. For long-running DBs with many SST files, that preamble can be sizeable. Safety gate — manifest_last_valid_record_end_ tracking: During MANIFEST recovery, VersionEditHandlerBase::Iterate tracks the file offset at the end of the last successfully decoded logical record (reader.LastRecordEnd() after each ReadRecord + DecodeFrom). This value is stored as manifest_last_valid_record_end_ on VersionSet, distinct from manifest_file_size_ (the reader's I/O high-water mark, which includes any tolerated tail garbage such as torn writes or partial records). ReopenManifestForAppend compares the on-disk file size against manifest_last_valid_record_end_ — not manifest_file_size_ — so even intra-block tail corruption that doesn't extend the physical file is detected and causes a fall-back to fresh MANIFEST creation. The writer's SetFileSize and initial_block_offset are also set from manifest_last_valid_record_end_, so if reuse proceeds the writer overwrites any trailing garbage rather than appending after it. Three additional defensive safety gates in ReopenManifestForAppend; failure of any returns OK and falls back to fresh-MANIFEST creation on the next LogAndApply: 1. on-disk file size != manifest_last_valid_record_end_ — tail corruption from a prior crash. 2. opt_file_opts.use_direct_writes — direct writes against an already-populated file are hard to reason about (alignment + tail-pad). POSIX OptimizeForManifestWrite forces it off, but custom FileSystems may not. 3. ReopenWritableFile() failure. Disabled when best_efforts_recovery=true (that mode rebuilds CURRENT/MANIFEST as a salvage operation; reusing the prior MANIFEST contradicts the contract). Mirrors the same exclusion applied to avoid_noop_manifest_edits_on_recovery. WritableFileWriter::filesize_ root cause fix: The standard WritableFileWriter constructor unconditionally zeros filesize_, flushed_size_, and next_write_offset_, which is wrong for a writer wrapping a file from FileSystem::ReopenWritableFile (GetFileSize returns 0, ProcessManifestWrites' size-limit check compares against 0, Close under direct I/O would Truncate the existing bytes, Flush's RangeSync targets the wrong window). Add a public WritableFileWriter::SetFileSize(uint64_t) method and call it from ReopenManifestForAppend with manifest_last_valid_record_end_ before any Append. Helper extraction: Two private VersionSet helpers factor out the shared MANIFEST-writer setup that was duplicated between the fresh-MANIFEST path (ProcessManifestWrites) and the new reuse path (ReopenManifestForAppend): - GetFileOptionsForManifestWrite — applies FS's OptimizeForManifestWrite tuning + re-applies the user-configured temperature. - CreateManifestWriter — builds the WritableFileWriter + log::Writer with all standard MANIFEST setup (preallocation block size, checksum-handoff classification, listeners). Takes existing_file_size as a parameter; when > 0, calls SetFileSize and computes the resumed initial_block_offset. Both call sites collapse to a single call each. Supporting changes: * log::Writer constructor accepts an optional initial_block_offset (default 0; preserves all existing call sites). Required so the resumed writer aligns its record framing to the existing MANIFEST's last 32 KiB block. * DBImpl::LogAndApplyForRecovery's assert is relaxed to allow descriptor_log_ non-null when reuse_manifest_on_open is set. * reuse_manifest_on_open added to db_stress (default false, not enabled in crash tests). Default false. Placed in ImmutableDBOptions because the recovery-time hook is decided at Open. Composes with avoid_noop_manifest_edits_on_recovery and write_wal_markers_to_manifest_on_close. A TEST_SYNC_POINT VersionSet::ReopenManifestForAppend:Reopened is added inside the success path so tests can directly observe the reopen. Differential Revision: D103568447
01e62c1 to
0eee675
Compare
Summary: Pull Request resolved: facebook#14704 Add an opt-in immutable DBOption reuse_manifest_on_open (default false). When true, VersionSet::Recover reopens the existing MANIFEST file for append at the end of recovery, so the next LogAndApply call appends to it instead of creating a fresh MANIFEST. The default behavior creates a fresh MANIFEST because descriptor_log_ is null after Recover. Avoids the cost of fresh-MANIFEST creation on the first post-open write — both the file system create call and the WriteCurrentStateToManifest preamble that serializes the entire in-memory database state into the new file. For long-running DBs with many SST files, that preamble can be sizeable. Safety gate — manifest_last_valid_record_end_ tracking: During MANIFEST recovery, VersionEditHandlerBase::Iterate tracks the file offset at the end of the last successfully decoded logical record (reader.LastRecordEnd() after each ReadRecord + DecodeFrom). This value is stored as manifest_last_valid_record_end_ on VersionSet, distinct from manifest_file_size_ (the reader's I/O high-water mark, which includes any tolerated tail garbage such as torn writes or partial records). ReopenManifestForAppend compares the on-disk file size against manifest_last_valid_record_end_ — not manifest_file_size_ — so even intra-block tail corruption that doesn't extend the physical file is detected and causes a fall-back to fresh MANIFEST creation. The writer's SetFileSize and initial_block_offset are also set from manifest_last_valid_record_end_, so if reuse proceeds the writer overwrites any trailing garbage rather than appending after it. Three additional defensive safety gates in ReopenManifestForAppend; failure of any returns OK and falls back to fresh-MANIFEST creation on the next LogAndApply: 1. on-disk file size != manifest_last_valid_record_end_ — tail corruption from a prior crash. 2. opt_file_opts.use_direct_writes — direct writes against an already-populated file are hard to reason about (alignment + tail-pad). POSIX OptimizeForManifestWrite forces it off, but custom FileSystems may not. 3. ReopenWritableFile() failure. Disabled when best_efforts_recovery=true (that mode rebuilds CURRENT/MANIFEST as a salvage operation; reusing the prior MANIFEST contradicts the contract). Mirrors the same exclusion applied to avoid_noop_manifest_edits_on_recovery. WritableFileWriter::filesize_ root cause fix: The standard WritableFileWriter constructor unconditionally zeros filesize_, flushed_size_, and next_write_offset_, which is wrong for a writer wrapping a file from FileSystem::ReopenWritableFile (GetFileSize returns 0, ProcessManifestWrites' size-limit check compares against 0, Close under direct I/O would Truncate the existing bytes, Flush's RangeSync targets the wrong window). Add a public WritableFileWriter::SetFileSize(uint64_t) method and call it from ReopenManifestForAppend with manifest_last_valid_record_end_ before any Append. Helper extraction: Two private VersionSet helpers factor out the shared MANIFEST-writer setup that was duplicated between the fresh-MANIFEST path (ProcessManifestWrites) and the new reuse path (ReopenManifestForAppend): - GetFileOptionsForManifestWrite — applies FS's OptimizeForManifestWrite tuning + re-applies the user-configured temperature. - CreateManifestWriter — builds the WritableFileWriter + log::Writer with all standard MANIFEST setup (preallocation block size, checksum-handoff classification, listeners). Takes existing_file_size as a parameter; when > 0, calls SetFileSize and computes the resumed initial_block_offset. Both call sites collapse to a single call each. Supporting changes: * log::Writer constructor accepts an optional initial_block_offset (default 0; preserves all existing call sites). Required so the resumed writer aligns its record framing to the existing MANIFEST's last 32 KiB block. * DBImpl::LogAndApplyForRecovery's assert is relaxed to allow descriptor_log_ non-null when reuse_manifest_on_open is set. * reuse_manifest_on_open added to db_stress (default false, not enabled in crash tests). Default false. Placed in ImmutableDBOptions because the recovery-time hook is decided at Open. Composes with avoid_noop_manifest_edits_on_recovery and write_wal_markers_to_manifest_on_close. A TEST_SYNC_POINT VersionSet::ReopenManifestForAppend:Reopened is added inside the success path so tests can directly observe the reopen. Differential Revision: D103568447
0eee675 to
f8f9c0e
Compare
🟡 Codex Code ReviewAuto-triggered after CI passed — reviewing commit f8f9c0e ❌ Codex review failed before producing any output. ℹ️ About this responseGenerated by Codex CLI. Limitations:
Commands:
|
Claude Code Review - OBSOLETESuperseded by a newer AI review. Expand to see the original review.��� Claude Code ReviewAuto-triggered after CI passed — reviewing commit f8f9c0e SummaryWell-structured PR adding three composable "warm start" DBOptions ( High-severity findings (1):
Full review (click to expand)Findings🔴 HIGHH1. Duplicate assignment of
|
| Context | Assumptions hold? | Action needed? |
|---|---|---|
| ReadOnly DB | N/A (guarded) | Safe |
| best_efforts_recovery | Disabled explicitly | Correct |
| 2PC / WritePreparedTxnDB | MinLogNumberToKeep skipped | Correct |
| ReactiveVersionSet | Different Recover path | Safe |
| User-defined timestamps | No interaction | Safe |
| verify_manifest_content_on_close | May trigger rotation | Acceptable |
| Manifest rotation | State cleaned up correctly | Correct |
Positive Observations
- Defensive three-check fallback in
ReopenManifestForAppendis well-designed GetFileOptionsForManifestWriteandCreateManifestWritercleanly deduplicate fresh/reuse paths- 25+ new tests covering individual options, composition, edge cases (tail corruption, dropped CFs, 2PC, best_efforts_recovery)
best_efforts_recoveryexclusion correctly preserves the rebuild contract- 2PC safety correctly preserves WAL availability for uncommitted prepared transactions
ℹ️ About this response
Generated by Claude Code.
Review methodology: claude_md/code_review.md
Limitations:
- Claude may miss context from files not in the diff
- Large PRs may be truncated
- Always apply human judgment to AI suggestions
Commands:
/claude-review [context]— Request a code review/claude-query <question>— Ask about the PR or codebase
f8f9c0e to
f4d1bfb
Compare
Summary: Pull Request resolved: facebook#14704 Add an immutable DBOption `reuse_manifest_on_open` (default false). When enabled, `DB::Open` can keep using the recovered MANIFEST for the first post-open metadata update instead of rebuilding a fresh MANIFEST, which can reduce warm-open latency for DBs whose MANIFEST is expensive to regenerate. Reuse is still best-effort. If RocksDB cannot safely resume appending to the recovered MANIFEST, it falls back to the existing fresh-MANIFEST path. The option is also disabled under `best_efforts_recovery`. This diff also teaches the reopened MANIFEST writer to adopt the existing file size before appending, documents the small-`max_manifest_file_size` caveat for the reused path, and keeps the full warm-reopen composition working with `optimize_manifest_for_recovery`. Differential Revision: D103568447
Summary: Pull Request resolved: facebook#14704 Add an immutable DBOption `reuse_manifest_on_open` (default false). When enabled, `DB::Open` can keep using the recovered MANIFEST for the first post-open metadata update instead of rebuilding a fresh MANIFEST, which can reduce warm-open latency for DBs whose MANIFEST is expensive to regenerate. Reuse is still best-effort. If RocksDB cannot safely resume appending to the recovered MANIFEST, it falls back to the existing fresh-MANIFEST path. The option is also disabled under `best_efforts_recovery`. This diff also teaches the reopened MANIFEST writer to adopt the existing file size before appending, documents the small-`max_manifest_file_size` caveat for the reused path, and keeps the full warm-reopen composition working with `optimize_manifest_for_recovery`. Differential Revision: D103568447
1499d45 to
a1b6818
Compare
Claude Code Review - OBSOLETESuperseded by a newer AI review. Expand to see the original review.✅ Claude Code ReviewAuto-triggered after CI passed — reviewing commit a1b6818 SummaryWell-structured PR that adds two complementary warm-reopen optimizations for MANIFEST handling. The implementation shows careful attention to safety guards (best_efforts_recovery, 2PC, dropped CFs, tail corruption) and includes thorough test coverage. The code is reviewed by an experienced Meta reviewer (pdillinger). A few medium-severity items warrant attention. High-severity findings (0): No high-severity findings. Full review (click to expand)Findings🔴 HIGHNone. 🟡 MEDIUMM1.
|
| Context | Safe? | Notes |
|---|---|---|
| WritePreparedTxnDB | YES | allow_2pc() guard prevents WAL marker advancement |
| ReadOnly DB | YES | read_only checks prevent all writes |
| SecondaryInstance | YES | ReactiveVersionSet doesn't call ReopenManifestForAppend |
| best_efforts_recovery | YES | 5 independent guard sites |
| User-defined timestamps | YES | HasPersistUserDefinedTimestamps() checked in predicate |
| Dropped CFs | YES | IsDropped() guard in close-time writer |
Block offset correctness: block_offset = initial_file_size % kBlockSize is correct — verified against EmitPhysicalRecord which maintains block_offset_ += header_size + n, and padding resets to 0 at block boundaries.
FetchAddFileNumber underflow: Protected by if (next_file_number <= new_log_num) guard.
IsEmpty() semantics: Verified checks both mem()->GetFirstSequenceNumber() == 0 and imm()->NumNotFlushed() == 0.
Positive Observations
- Defensive fallback design — any reuse failure silently falls back to fresh MANIFEST
- Clean refactoring of
CreateManifestWriterandGetFileOptionsForManifestWrite - 20+ test cases covering option combinations, 2PC, multi-CF, corruption, rotation
WritableFileWriterinitial_file_sizecorrectly seeds all 4 size-tracking fields- Close-time write is correctly positioned after background work shutdown and obsolete file cleanup
ℹ️ About this response
Generated by Claude Code.
Review methodology: claude_md/code_review.md
Limitations:
- Claude may miss context from files not in the diff
- Large PRs may be truncated
- Always apply human judgment to AI suggestions
Commands:
/claude-review [context]— Request a code review/claude-query <question>— Ask about the PR or codebase
Summary: Pull Request resolved: facebook#14702 Add a mutable DBOption `optimize_manifest_for_recovery` (default false) as a temporary rollout / kill switch for warm-reopen MANIFEST optimizations. In this diff, enabling the option lets recovery skip MANIFEST updates during `DB::Open` when the recovered state is already reflected on disk, which reduces metadata appends after a clean shutdown and can lower warm-reopen latency on storage where MANIFEST appends are expensive. If the option is disabled, RocksDB follows the existing recovery path unchanged. The optimization is disabled under `best_efforts_recovery`, where recovery intentionally rewrites metadata as part of salvage, and the option is mutable so later diffs in this stack can share the same rollout knob. Differential Revision: D103568448
Summary: Pull Request resolved: facebook#14703 Extend `optimize_manifest_for_recovery` so a clean `DB::Close` can persist up-to-date WAL recovery markers when that can be done safely. Combined with the recovery-side optimization in the previous diff, a clean close/reopen can avoid recovery-time MANIFEST appends. This remains best-effort: if the close-time write is disabled, skipped, or fails, RocksDB falls back to the standard recovery path on the next open. The option stays mutable so it can be turned off before close to suppress the optimization without restarting the DB. The close-time path respects the existing recovery constraints for 2PC, non-empty column families, dropped column families, and WAL tracking, and preserves the existing file-number invariants. Differential Revision: D103568449
Summary: Pull Request resolved: facebook#14704 Add an immutable DBOption `reuse_manifest_on_open` (default false). When enabled, `DB::Open` can keep using the recovered MANIFEST for the first post-open metadata update instead of rebuilding a fresh MANIFEST, which can reduce warm-open latency for DBs whose MANIFEST is expensive to regenerate. Reuse is still best-effort. If RocksDB cannot safely resume appending to the recovered MANIFEST, it falls back to the existing fresh-MANIFEST path. The option is also disabled under `best_efforts_recovery`. This diff also teaches the reopened MANIFEST writer to adopt the existing file size before appending, documents the small-`max_manifest_file_size` caveat for the reused path, and keeps the full warm-reopen composition working with `optimize_manifest_for_recovery`. Reviewed By: pdillinger Differential Revision: D103568447
a1b6818 to
1fee285
Compare
✅ Claude Code ReviewAuto-triggered after CI passed — reviewing commit 1fee285 SummaryWell-designed, best-effort optimization that reduces MANIFEST recovery overhead on warm reopens. The dual-option approach is sound, fallback handling is robust, and test coverage is comprehensive. No high-severity correctness bugs found. High-severity findings (0): Full review (click to expand)Findings🔴 HIGHNo high-severity findings. 🟡 MEDIUMM1.
|
| Context | Assumptions hold? | Action needed? |
|---|---|---|
| WritePreparedTxnDB | YES — allow_2pc() guard prevents WAL marker advancement |
Safe |
| ReadOnly DB | N/A — code not reached | Safe |
| best_efforts_recovery | YES — both options explicitly disabled | Safe |
| Secondary Instance | N/A — different recovery path | Safe |
| Direct I/O | YES — explicit guard and fallback | Safe |
| MemPurge / FIFO / Universal | N/A — no interaction | Safe |
Key invariants verified:
- next_file_number, min_log_number_to_keep, WalSet monotonicity all preserved
- MANIFEST record framing correct (block_offset = file_size % kBlockSize)
- WritableFileWriter size accounting correct with initial_file_size seeding
Debunked findings: Crash-during-close data loss (CFs are empty), ShouldEmit field coverage (NumEntries covers file ops), block alignment at boundaries (offset=0 is start-of-block), preallocation size check (stat.st_size is logical), recyclable header mismatch (MANIFEST never recycled) — all false positives.
Positive Observations
- Excellent fallback design — both features degrade gracefully
- Comprehensive test coverage (14+ cases including edge cases)
- Clean refactoring of
CreateManifestWriterandGetFileOptionsForManifestWrite - Correct
LastRecordEnd()vsGetReadOffset()semantics for MANIFEST reuse boundary std::deque<VersionEdit>for pointer stability in edit_lists construction
ℹ️ About this response
Generated by Claude Code.
Review methodology: claude_md/code_review.md
Limitations:
- Claude may miss context from files not in the diff
- Large PRs may be truncated
- Always apply human judgment to AI suggestions
Commands:
/claude-review [context]— Request a code review/claude-query <question>— Ask about the PR or codebase
|
This pull request has been merged in c734b7c. |
Summary: Pull Request resolved: #14704 Add an immutable DBOption `reuse_manifest_on_open` (default false). When enabled, `DB::Open` can keep using the recovered MANIFEST for the first post-open metadata update instead of rebuilding a fresh MANIFEST, which can reduce warm-open latency for DBs whose MANIFEST is expensive to regenerate. Reuse is still best-effort. If RocksDB cannot safely resume appending to the recovered MANIFEST, it falls back to the existing fresh-MANIFEST path. The option is also disabled under `best_efforts_recovery`. This diff also teaches the reopened MANIFEST writer to adopt the existing file size before appending, documents the small-`max_manifest_file_size` caveat for the reused path, and keeps the full warm-reopen composition working with `optimize_manifest_for_recovery`. Reviewed By: hx235, pdillinger Differential Revision: D103568447 fbshipit-source-id: f4f5c35ea3ef0b80a0d52d94be40c6bd11505999
Summary:
Add an immutable DBOption
reuse_manifest_on_open(default false). When enabled,DB::Opencan keep using the recovered MANIFEST for the first post-open metadata update instead of rebuilding a fresh MANIFEST, which can reduce warm-open latency for DBs whose MANIFEST is expensive to regenerate.Reuse is still best-effort. If RocksDB cannot safely resume appending to the recovered MANIFEST, it falls back to the existing fresh-MANIFEST path. The option is also disabled under
best_efforts_recovery.This diff also teaches the reopened MANIFEST writer to adopt the existing file size before appending, documents the small-
max_manifest_file_sizecaveat for the reused path, and keeps the full warm-reopen composition working withoptimize_manifest_for_recovery.Reviewed By: pdillinger
Differential Revision: D103568447