Skip to content

Add reuse_manifest_on_open DBOption (#14704)#14704

Closed
anand1976 wants to merge 3 commits into
facebook:mainfrom
anand1976:export-D103568447
Closed

Add reuse_manifest_on_open DBOption (#14704)#14704
anand1976 wants to merge 3 commits into
facebook:mainfrom
anand1976:export-D103568447

Conversation

@anand1976

@anand1976 anand1976 commented May 5, 2026

Copy link
Copy Markdown
Contributor

Summary:

Add an immutable DBOption reuse_manifest_on_open (default false). When enabled, DB::Open can keep using the recovered MANIFEST for the first post-open metadata update instead of rebuilding a fresh MANIFEST, which can reduce warm-open latency for DBs whose MANIFEST is expensive to regenerate.

Reuse is still best-effort. If RocksDB cannot safely resume appending to the recovered MANIFEST, it falls back to the existing fresh-MANIFEST path. The option is also disabled under best_efforts_recovery.

This diff also teaches the reopened MANIFEST writer to adopt the existing file size before appending, documents the small-max_manifest_file_size caveat for the reused path, and keeps the full warm-reopen composition working with optimize_manifest_for_recovery.

Reviewed By: pdillinger

Differential Revision: D103568447

@meta-codesync

meta-codesync Bot commented May 5, 2026

Copy link
Copy Markdown

@anand1976 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D103568447.

@github-actions

github-actions Bot commented May 5, 2026

Copy link
Copy Markdown

✅ clang-tidy: No findings on changed lines

Completed in 572.7s.

anand1976 added a commit to anand1976/rocksdb that referenced this pull request May 5, 2026
Summary:
Pull Request resolved: facebook#14704

Add an opt-in immutable DBOption reuse_manifest_on_open (default false). When true, VersionSet::Recover reopens the existing MANIFEST file for append at the end of recovery, so the next LogAndApply call appends to it instead of creating a fresh MANIFEST. The default behavior creates a fresh MANIFEST because descriptor_log_ is null after Recover.

Avoids the cost of fresh-MANIFEST creation on the first post-open write — both the file system create call and the WriteCurrentStateToManifest preamble that serializes the entire in-memory database state into the new file. For long-running DBs with many SST files, that preamble can be sizeable.

Safety gate — manifest_last_valid_record_end_ tracking:
During MANIFEST recovery, VersionEditHandlerBase::Iterate tracks the file offset at the end of the last successfully decoded logical record (reader.LastRecordEnd() after each ReadRecord + DecodeFrom). This value is stored as manifest_last_valid_record_end_ on VersionSet, distinct from manifest_file_size_ (the reader's I/O high-water mark, which includes any tolerated tail garbage such as torn writes or partial records). ReopenManifestForAppend compares the on-disk file size against manifest_last_valid_record_end_ — not manifest_file_size_ — so even intra-block tail corruption that doesn't extend the physical file is detected and causes a fall-back to fresh MANIFEST creation. The writer's SetFileSize and initial_block_offset are also set from manifest_last_valid_record_end_, so if reuse proceeds the writer overwrites any trailing garbage rather than appending after it.

Three additional defensive safety gates in ReopenManifestForAppend; failure of any returns OK and falls back to fresh-MANIFEST creation on the next LogAndApply:
  1. on-disk file size != manifest_last_valid_record_end_ — tail corruption from a prior crash.
  2. opt_file_opts.use_direct_writes — direct writes against an already-populated file are hard to reason about (alignment + tail-pad). POSIX OptimizeForManifestWrite forces it off, but custom FileSystems may not.
  3. ReopenWritableFile() failure.

Disabled when best_efforts_recovery=true (that mode rebuilds CURRENT/MANIFEST as a salvage operation; reusing the prior MANIFEST contradicts the contract). Mirrors the same exclusion applied to avoid_noop_manifest_edits_on_recovery.

WritableFileWriter::filesize_ root cause fix:
The standard WritableFileWriter constructor unconditionally zeros filesize_, flushed_size_, and next_write_offset_, which is wrong for a writer wrapping a file from FileSystem::ReopenWritableFile (GetFileSize returns 0, ProcessManifestWrites' size-limit check compares against 0, Close under direct I/O would Truncate the existing bytes, Flush's RangeSync targets the wrong window). Add a public WritableFileWriter::SetFileSize(uint64_t) method and call it from ReopenManifestForAppend with manifest_last_valid_record_end_ before any Append.

Helper extraction:
Two private VersionSet helpers factor out the shared MANIFEST-writer setup that was duplicated between the fresh-MANIFEST path (ProcessManifestWrites) and the new reuse path (ReopenManifestForAppend):
  - GetFileOptionsForManifestWrite — applies FS's OptimizeForManifestWrite tuning + re-applies the user-configured temperature.
  - CreateManifestWriter — builds the WritableFileWriter + log::Writer with all standard MANIFEST setup (preallocation block size, checksum-handoff classification, listeners). Takes existing_file_size as a parameter; when > 0, calls SetFileSize and computes the resumed initial_block_offset.
Both call sites collapse to a single call each.

Supporting changes:
  * log::Writer constructor accepts an optional initial_block_offset (default 0; preserves all existing call sites). Required so the resumed writer aligns its record framing to the existing MANIFEST's last 32 KiB block.
  * DBImpl::LogAndApplyForRecovery's assert is relaxed to allow descriptor_log_ non-null when reuse_manifest_on_open is set.
  * reuse_manifest_on_open added to db_stress (default false, not enabled in crash tests).

Default false. Placed in ImmutableDBOptions because the recovery-time hook is decided at Open. Composes with avoid_noop_manifest_edits_on_recovery and write_wal_markers_to_manifest_on_close.

A TEST_SYNC_POINT VersionSet::ReopenManifestForAppend:Reopened is added inside the success path so tests can directly observe the reopen.

Differential Revision: D103568447
@anand1976 anand1976 force-pushed the export-D103568447 branch from 01e62c1 to 0eee675 Compare May 5, 2026 05:19
@meta-codesync meta-codesync Bot changed the title Add reuse_manifest_on_open DBOption May 5, 2026
anand1976 added a commit to anand1976/rocksdb that referenced this pull request May 5, 2026
Summary:
Pull Request resolved: facebook#14704

Add an opt-in immutable DBOption reuse_manifest_on_open (default false). When true, VersionSet::Recover reopens the existing MANIFEST file for append at the end of recovery, so the next LogAndApply call appends to it instead of creating a fresh MANIFEST. The default behavior creates a fresh MANIFEST because descriptor_log_ is null after Recover.

Avoids the cost of fresh-MANIFEST creation on the first post-open write — both the file system create call and the WriteCurrentStateToManifest preamble that serializes the entire in-memory database state into the new file. For long-running DBs with many SST files, that preamble can be sizeable.

Safety gate — manifest_last_valid_record_end_ tracking:
During MANIFEST recovery, VersionEditHandlerBase::Iterate tracks the file offset at the end of the last successfully decoded logical record (reader.LastRecordEnd() after each ReadRecord + DecodeFrom). This value is stored as manifest_last_valid_record_end_ on VersionSet, distinct from manifest_file_size_ (the reader's I/O high-water mark, which includes any tolerated tail garbage such as torn writes or partial records). ReopenManifestForAppend compares the on-disk file size against manifest_last_valid_record_end_ — not manifest_file_size_ — so even intra-block tail corruption that doesn't extend the physical file is detected and causes a fall-back to fresh MANIFEST creation. The writer's SetFileSize and initial_block_offset are also set from manifest_last_valid_record_end_, so if reuse proceeds the writer overwrites any trailing garbage rather than appending after it.

Three additional defensive safety gates in ReopenManifestForAppend; failure of any returns OK and falls back to fresh-MANIFEST creation on the next LogAndApply:
  1. on-disk file size != manifest_last_valid_record_end_ — tail corruption from a prior crash.
  2. opt_file_opts.use_direct_writes — direct writes against an already-populated file are hard to reason about (alignment + tail-pad). POSIX OptimizeForManifestWrite forces it off, but custom FileSystems may not.
  3. ReopenWritableFile() failure.

Disabled when best_efforts_recovery=true (that mode rebuilds CURRENT/MANIFEST as a salvage operation; reusing the prior MANIFEST contradicts the contract). Mirrors the same exclusion applied to avoid_noop_manifest_edits_on_recovery.

WritableFileWriter::filesize_ root cause fix:
The standard WritableFileWriter constructor unconditionally zeros filesize_, flushed_size_, and next_write_offset_, which is wrong for a writer wrapping a file from FileSystem::ReopenWritableFile (GetFileSize returns 0, ProcessManifestWrites' size-limit check compares against 0, Close under direct I/O would Truncate the existing bytes, Flush's RangeSync targets the wrong window). Add a public WritableFileWriter::SetFileSize(uint64_t) method and call it from ReopenManifestForAppend with manifest_last_valid_record_end_ before any Append.

Helper extraction:
Two private VersionSet helpers factor out the shared MANIFEST-writer setup that was duplicated between the fresh-MANIFEST path (ProcessManifestWrites) and the new reuse path (ReopenManifestForAppend):
  - GetFileOptionsForManifestWrite — applies FS's OptimizeForManifestWrite tuning + re-applies the user-configured temperature.
  - CreateManifestWriter — builds the WritableFileWriter + log::Writer with all standard MANIFEST setup (preallocation block size, checksum-handoff classification, listeners). Takes existing_file_size as a parameter; when > 0, calls SetFileSize and computes the resumed initial_block_offset.
Both call sites collapse to a single call each.

Supporting changes:
  * log::Writer constructor accepts an optional initial_block_offset (default 0; preserves all existing call sites). Required so the resumed writer aligns its record framing to the existing MANIFEST's last 32 KiB block.
  * DBImpl::LogAndApplyForRecovery's assert is relaxed to allow descriptor_log_ non-null when reuse_manifest_on_open is set.
  * reuse_manifest_on_open added to db_stress (default false, not enabled in crash tests).

Default false. Placed in ImmutableDBOptions because the recovery-time hook is decided at Open. Composes with avoid_noop_manifest_edits_on_recovery and write_wal_markers_to_manifest_on_close.

A TEST_SYNC_POINT VersionSet::ReopenManifestForAppend:Reopened is added inside the success path so tests can directly observe the reopen.

Differential Revision: D103568447
@anand1976 anand1976 force-pushed the export-D103568447 branch from 0eee675 to f8f9c0e Compare May 5, 2026 05:34
@github-actions

github-actions Bot commented May 5, 2026

Copy link
Copy Markdown

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit f8f9c0e


❌ Codex review failed before producing any output.


ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase
@github-actions

github-actions Bot commented May 5, 2026

Copy link
Copy Markdown

Claude Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

��� Claude Code Review

Auto-triggered after CI passed — reviewing commit f8f9c0e


Summary

Well-structured PR adding three composable "warm start" DBOptions (reuse_manifest_on_open, avoid_noop_manifest_edits_on_recovery, write_wal_markers_to_manifest_on_close) with comprehensive test coverage and sound safety gates. One confirmed code bug; the remaining findings are suggestions and minor concerns.

High-severity findings (1):

  • [options_helper.cc] Duplicate assignment of avoid_noop_manifest_edits_on_recovery in BuildDBOptions.
Full review (click to expand)

Findings

🔴 HIGH

H1. Duplicate assignment of avoid_noop_manifest_edits_on_recovery in BuildDBOptionsoptions/options_helper.cc
  • Issue: The diff adds options.avoid_noop_manifest_edits_on_recovery = immutable_db_options.avoid_noop_manifest_edits_on_recovery; twice — once before write_identity_file and again after it.
  • Root cause: Copy-paste error in the options_helper.cc changes.
  • Suggested fix: Remove the first occurrence, keeping only the one grouped with reuse_manifest_on_open.

🟡 MEDIUM

M1. has_meaningful_state lambda may need future maintenance — db/db_impl/db_impl_open.cc:1893
  • Issue: The positive-check lambda lists specific VersionEdit::Has*() methods. If a future VersionEdit setter is added but not added to this lambda, the edit will be silently dropped when avoid_noop_manifest_edits_on_recovery is enabled. The PR comment acknowledges this ("if a future change adds a new VersionEdit setter…add it here too") but the check is not enforced programmatically.
  • Suggested fix: Consider adding a compile-time or test-time assertion that the lambda covers all Has* methods, or invert the logic to use a negative check instead.
M2. last_valid_record_end_ updated for records in incomplete atomic groups — db/version_edit_handler.cc:38
  • Issue: last_valid_record_end_ is updated after every successfully decoded record, including records part of an incomplete atomic group at the MANIFEST tail. If the physical file ends exactly at this point, ReopenManifestForAppend will proceed, and the incomplete atomic group records remain in the file.
  • Risk assessment: LOW in practice — on next recovery the incomplete group is re-read, buffered, and discarded. For clean close (primary use case), atomic groups are always complete.
  • Suggested fix: Consider only updating last_valid_record_end_ when a full atomic group is committed or a non-group record is applied, so the size mismatch check triggers fallback in the incomplete case.
M3. Close-time WAL marker write could trigger MANIFEST rotation — db/db_impl/db_impl.cc:818
  • Issue: The LogAndApply in CloseHelper goes through ProcessManifestWrites, which may rotate to a fresh MANIFEST if the size cap is hit. After rotation, the next open cannot benefit from reuse.
  • Risk assessment: LOW — the system remains correct, just suboptimal in this edge case.
  • Suggested fix: Consider passing new_descriptor_log=false to prevent rotation during close.
M4. Design: Three separate options create a complex configuration surface
  • Issue: Users must understand the composition of three options to get the full warm-start benefit.
  • Suggested fix: Document the recommended "all three on" configuration more prominently.

🟢 LOW / NIT

L1. Mutable vs immutable inconsistency among the three options
L2. std::deque<VersionEdit> heap allocation during close (cold path, acceptable)
L3. Missing assert(initial_block_offset < kBlockSize) in log::Writer constructor
L4. Size-only integrity check in ReopenManifestForAppend (pragmatic, acceptable)
L5. Redundant option entries in options_settable_test.cc

Cross-Component Analysis

Context Assumptions hold? Action needed?
ReadOnly DB N/A (guarded) Safe
best_efforts_recovery Disabled explicitly Correct
2PC / WritePreparedTxnDB MinLogNumberToKeep skipped Correct
ReactiveVersionSet Different Recover path Safe
User-defined timestamps No interaction Safe
verify_manifest_content_on_close May trigger rotation Acceptable
Manifest rotation State cleaned up correctly Correct

Positive Observations

  • Defensive three-check fallback in ReopenManifestForAppend is well-designed
  • GetFileOptionsForManifestWrite and CreateManifestWriter cleanly deduplicate fresh/reuse paths
  • 25+ new tests covering individual options, composition, edge cases (tail corruption, dropped CFs, 2PC, best_efforts_recovery)
  • best_efforts_recovery exclusion correctly preserves the rebuild contract
  • 2PC safety correctly preserves WAL availability for uncommitted prepared transactions

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase
@anand1976 anand1976 force-pushed the export-D103568447 branch from f8f9c0e to f4d1bfb Compare May 6, 2026 22:21
anand1976 added a commit to anand1976/rocksdb that referenced this pull request May 6, 2026
Summary:
Pull Request resolved: facebook#14704

Add an immutable DBOption `reuse_manifest_on_open` (default false). When enabled, `DB::Open` can keep using the recovered MANIFEST for the first post-open metadata update instead of rebuilding a fresh MANIFEST, which can reduce warm-open latency for DBs whose MANIFEST is expensive to regenerate.

Reuse is still best-effort. If RocksDB cannot safely resume appending to the recovered MANIFEST, it falls back to the existing fresh-MANIFEST path. The option is also disabled under `best_efforts_recovery`.

This diff also teaches the reopened MANIFEST writer to adopt the existing file size before appending, documents the small-`max_manifest_file_size` caveat for the reused path, and keeps the full warm-reopen composition working with `optimize_manifest_for_recovery`.

Differential Revision: D103568447
anand1976 added a commit to anand1976/rocksdb that referenced this pull request May 6, 2026
Summary:
Pull Request resolved: facebook#14704

Add an immutable DBOption `reuse_manifest_on_open` (default false). When enabled, `DB::Open` can keep using the recovered MANIFEST for the first post-open metadata update instead of rebuilding a fresh MANIFEST, which can reduce warm-open latency for DBs whose MANIFEST is expensive to regenerate.

Reuse is still best-effort. If RocksDB cannot safely resume appending to the recovered MANIFEST, it falls back to the existing fresh-MANIFEST path. The option is also disabled under `best_efforts_recovery`.

This diff also teaches the reopened MANIFEST writer to adopt the existing file size before appending, documents the small-`max_manifest_file_size` caveat for the reused path, and keeps the full warm-reopen composition working with `optimize_manifest_for_recovery`.

Differential Revision: D103568447
@anand1976 anand1976 force-pushed the export-D103568447 branch 2 times, most recently from 1499d45 to a1b6818 Compare May 7, 2026 16:03
@github-actions

github-actions Bot commented May 7, 2026

Copy link
Copy Markdown

Claude Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit a1b6818


Summary

Well-structured PR that adds two complementary warm-reopen optimizations for MANIFEST handling. The implementation shows careful attention to safety guards (best_efforts_recovery, 2PC, dropped CFs, tail corruption) and includes thorough test coverage. The code is reviewed by an experienced Meta reviewer (pdillinger). A few medium-severity items warrant attention.

High-severity findings (0):

No high-severity findings.

Full review (click to expand)

Findings

🔴 HIGH

None.

🟡 MEDIUM

M1. ShouldEmitPerColumnFamilyRecoveryEdit may incorrectly skip edits with WAL additions — db/version_edit.cc:112
  • Issue: The predicate checks NumEntries() > 0 to detect file-level mutations, but NumEntries() includes wal_additions_.size() and !wal_deletion_.IsEmpty(). The semantic coupling is fragile: if NumEntries() definition changes in the future to exclude WAL additions, this function would silently suppress needed edits.
  • Root cause: The function relies on NumEntries() as a proxy for "has meaningful file-level content" but NumEntries() counts WAL manipulations too.
  • Suggested fix: Add a clarifying comment on the NumEntries() check, or add explicit HasWalAdditions() || HasWalDeletion() checks for clarity.
M2. Double size-check in ReopenManifestForAppend may give inconsistent results — db/version_set.cc
  • Issue: The function does fs_->GetFileSize() then manifest_file->GetFileSize() and compares both against manifest_last_valid_record_end_. FSWritableFile::GetFileSize() behavior is filesystem-dependent and may return different values than FileSystem::GetFileSize() for custom FS implementations.
  • Suggested fix: Consider keeping only the FSWritableFile::GetFileSize() check (after reopen). The pre-reopen stat check could be removed or made advisory-only.
M3. Close-time MANIFEST write failure leaves FetchAddFileNumber side-effect — db/db_impl/db_impl.cc:670
  • Issue: If FetchAddFileNumber is called but subsequent LogAndApply fails, next_file_number_ has been advanced atomically but the MANIFEST doesn't reflect it. This is self-healing (next open re-derives via MaybeUpdateNextFileNumber), but the in-memory state has a gap.
  • Suggested fix: Add a comment noting the self-healing property.
M4. ReactiveVersionSet doesn't initialize manifest_last_valid_record_end_
  • Issue: New field in base class only set through VersionEditHandler::CheckIterationResult. Secondary instances use ManifestTailer, not VersionEditHandler. Currently safe since ReopenManifestForAppend is only called from VersionSet::Recover.
  • Suggested fix: Document that the field is only valid after VersionSet::Recover.
M5. Stress test doesn't randomize the new options — db_stress_tool/db_stress_gflags.cc
  • Issue: Both options use static defaults (false). Without randomization in stress test setup, these code paths won't get stress coverage.
  • Suggested fix: Add randomization similar to other boolean options.

🟢 LOW / NIT

L1. Option naming: optimize_manifest_for_recovery doesn't clearly convey close-time behavior
L2. Normalized file path construction micro-optimization is a nice cleanup — db/db_impl/db_impl_files.cc:1202

Cross-Component Analysis

Context Safe? Notes
WritePreparedTxnDB YES allow_2pc() guard prevents WAL marker advancement
ReadOnly DB YES read_only checks prevent all writes
SecondaryInstance YES ReactiveVersionSet doesn't call ReopenManifestForAppend
best_efforts_recovery YES 5 independent guard sites
User-defined timestamps YES HasPersistUserDefinedTimestamps() checked in predicate
Dropped CFs YES IsDropped() guard in close-time writer

Block offset correctness: block_offset = initial_file_size % kBlockSize is correct — verified against EmitPhysicalRecord which maintains block_offset_ += header_size + n, and padding resets to 0 at block boundaries.

FetchAddFileNumber underflow: Protected by if (next_file_number <= new_log_num) guard.

IsEmpty() semantics: Verified checks both mem()->GetFirstSequenceNumber() == 0 and imm()->NumNotFlushed() == 0.

Positive Observations

  • Defensive fallback design — any reuse failure silently falls back to fresh MANIFEST
  • Clean refactoring of CreateManifestWriter and GetFileOptionsForManifestWrite
  • 20+ test cases covering option combinations, 2PC, multi-CF, corruption, rotation
  • WritableFileWriter initial_file_size correctly seeds all 4 size-tracking fields
  • Close-time write is correctly positioned after background work shutdown and obsolete file cleanup

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase
anand1976 added 3 commits May 7, 2026 10:26
Summary:
Pull Request resolved: facebook#14702

Add a mutable DBOption `optimize_manifest_for_recovery` (default false) as a temporary rollout / kill switch for warm-reopen MANIFEST optimizations.

In this diff, enabling the option lets recovery skip MANIFEST updates during `DB::Open` when the recovered state is already reflected on disk, which reduces metadata appends after a clean shutdown and can lower warm-reopen latency on storage where MANIFEST appends are expensive.

If the option is disabled, RocksDB follows the existing recovery path unchanged. The optimization is disabled under `best_efforts_recovery`, where recovery intentionally rewrites metadata as part of salvage, and the option is mutable so later diffs in this stack can share the same rollout knob.

Differential Revision: D103568448
Summary:
Pull Request resolved: facebook#14703

Extend `optimize_manifest_for_recovery` so a clean `DB::Close` can persist up-to-date WAL recovery markers when that can be done safely. Combined with the recovery-side optimization in the previous diff, a clean close/reopen can avoid recovery-time MANIFEST appends.

This remains best-effort: if the close-time write is disabled, skipped, or fails, RocksDB falls back to the standard recovery path on the next open. The option stays mutable so it can be turned off before close to suppress the optimization without restarting the DB.

The close-time path respects the existing recovery constraints for 2PC, non-empty column families, dropped column families, and WAL tracking, and preserves the existing file-number invariants.

Differential Revision: D103568449
Summary:
Pull Request resolved: facebook#14704

Add an immutable DBOption `reuse_manifest_on_open` (default false). When enabled, `DB::Open` can keep using the recovered MANIFEST for the first post-open metadata update instead of rebuilding a fresh MANIFEST, which can reduce warm-open latency for DBs whose MANIFEST is expensive to regenerate.

Reuse is still best-effort. If RocksDB cannot safely resume appending to the recovered MANIFEST, it falls back to the existing fresh-MANIFEST path. The option is also disabled under `best_efforts_recovery`.

This diff also teaches the reopened MANIFEST writer to adopt the existing file size before appending, documents the small-`max_manifest_file_size` caveat for the reused path, and keeps the full warm-reopen composition working with `optimize_manifest_for_recovery`.

Reviewed By: pdillinger

Differential Revision: D103568447
@anand1976 anand1976 force-pushed the export-D103568447 branch from a1b6818 to 1fee285 Compare May 7, 2026 17:56
@github-actions

github-actions Bot commented May 7, 2026

Copy link
Copy Markdown

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit 1fee285


Summary

Well-designed, best-effort optimization that reduces MANIFEST recovery overhead on warm reopens. The dual-option approach is sound, fallback handling is robust, and test coverage is comprehensive. No high-severity correctness bugs found.

High-severity findings (0):
No high-severity findings.

Full review (click to expand)

Findings

🔴 HIGH

No high-severity findings.

🟡 MEDIUM

M1. ReopenWritableFile may not truncate preallocated tail — version_set.cc:6733
  • Issue: ReopenManifestForAppend compares physical_size (from fs_->GetFileSize) with manifest_last_valid_record_end_. POSIX GetFileSize returns stat.st_size which reflects the logical end of file (after the last write), not preallocated extent. However, FSWritableFile::GetFileSize (the second check after reopen) may behave differently across filesystem implementations. If any FS implementation reports preallocated size rather than written size, the double-check would pass but appended data would follow garbage bytes.
  • Root cause: The code does two size checks (filesystem-level and file-handle-level), which is good defensive coding. However, the FS abstraction contract for GetFileSize on reopened files is not well-specified.
  • Suggested fix: Add a brief comment documenting the assumption that FSWritableFile::GetFileSize returns logical file size, not preallocated extent. Consider adding a Truncate(manifest_last_valid_record_end_) call after reopen as a safety measure.
M2. Leaked file numbers on LogAndApply failure during close — db_impl.cc:682
  • Issue: MaybeWriteWalMarkersToManifestOnClose calls FetchAddFileNumber to reserve file numbers, then calls LogAndApply. If LogAndApply fails, the reserved file numbers are permanently leaked (never reclaimed). While this is cosmetic (file numbers are 64-bit and won't exhaust), it's a minor correctness imperfection.
  • Root cause: FetchAddFileNumber is an atomic increment with no rollback mechanism.
  • Suggested fix: This is acceptable as-is since the DB is closing, but worth a comment noting the intentional leak on failure.
M3. optimize_manifest_for_recovery mutable option safety — db_impl.cc:640
  • Issue: Initially flagged as a concern: mutable option could be toggled between close and crash. On deeper analysis, this is safe because the skip logic in recovery is state-based (compares actual CF log numbers and min_log_number_to_keep), not flag-based. Skipping only happens when the edit truly doesn't advance state.
  • Suggested fix: None needed.

🟢 LOW / NIT

L1. PR scope is large — consider splitting
  • Issue: 24 files, 1215 insertions across two independent features. Could split into optimize_manifest_for_recovery (Phase 1) and reuse_manifest_on_open (Phase 2).
  • Suggested fix: Consider splitting in a follow-up.
L2. Missing stress test randomization for new options — db_stress_gflags.cc
  • Issue: New flags are added to db_stress but need randomization in the stress test runner for effective crash testing.
  • Suggested fix: Add randomization in crash_test_common.py or equivalent.
L3. ShouldEmitPerColumnFamilyRecoveryEdit future-proofing — version_edit.cc:112
  • Issue: Comprehensive today (NumEntries() > 0 + explicit Has* checks), but a future VersionEdit field could be missed if it doesn't go through NumEntries() or any existing check.
  • Suggested fix: Add a comment noting new fields should be considered for this check.
L4. ReopenWritableFile default returns NotSupportedfile_system.h:468
  • Issue: Custom FS implementations that don't override this will silently fall back to fresh MANIFEST. Already documented in the option description.

Cross-Component Analysis

Context Assumptions hold? Action needed?
WritePreparedTxnDB YES — allow_2pc() guard prevents WAL marker advancement Safe
ReadOnly DB N/A — code not reached Safe
best_efforts_recovery YES — both options explicitly disabled Safe
Secondary Instance N/A — different recovery path Safe
Direct I/O YES — explicit guard and fallback Safe
MemPurge / FIFO / Universal N/A — no interaction Safe

Key invariants verified:

  • next_file_number, min_log_number_to_keep, WalSet monotonicity all preserved
  • MANIFEST record framing correct (block_offset = file_size % kBlockSize)
  • WritableFileWriter size accounting correct with initial_file_size seeding

Debunked findings: Crash-during-close data loss (CFs are empty), ShouldEmit field coverage (NumEntries covers file ops), block alignment at boundaries (offset=0 is start-of-block), preallocation size check (stat.st_size is logical), recyclable header mismatch (MANIFEST never recycled) — all false positives.

Positive Observations

  • Excellent fallback design — both features degrade gracefully
  • Comprehensive test coverage (14+ cases including edge cases)
  • Clean refactoring of CreateManifestWriter and GetFileOptionsForManifestWrite
  • Correct LastRecordEnd() vs GetReadOffset() semantics for MANIFEST reuse boundary
  • std::deque<VersionEdit> for pointer stability in edit_lists construction

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase
@meta-codesync

meta-codesync Bot commented May 7, 2026

Copy link
Copy Markdown

This pull request has been merged in c734b7c.

anand1976 added a commit that referenced this pull request May 7, 2026
Summary:
Pull Request resolved: #14704

Add an immutable DBOption `reuse_manifest_on_open` (default false). When enabled, `DB::Open` can keep using the recovered MANIFEST for the first post-open metadata update instead of rebuilding a fresh MANIFEST, which can reduce warm-open latency for DBs whose MANIFEST is expensive to regenerate.

Reuse is still best-effort. If RocksDB cannot safely resume appending to the recovered MANIFEST, it falls back to the existing fresh-MANIFEST path. The option is also disabled under `best_efforts_recovery`.

This diff also teaches the reopened MANIFEST writer to adopt the existing file size before appending, documents the small-`max_manifest_file_size` caveat for the reused path, and keeps the full warm-reopen composition working with `optimize_manifest_for_recovery`.

Reviewed By: hx235, pdillinger

Differential Revision: D103568447

fbshipit-source-id: f4f5c35ea3ef0b80a0d52d94be40c6bd11505999
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment