Skip to content

Enable MANIFEST optimization options in db_crashtest.py and add db_stress verification#14742

Closed
anand1976 wants to merge 1 commit into
facebook:mainfrom
anand1976:export-D105224068
Closed

Enable MANIFEST optimization options in db_crashtest.py and add db_stress verification#14742
anand1976 wants to merge 1 commit into
facebook:mainfrom
anand1976:export-D105224068

Conversation

@anand1976

@anand1976 anand1976 commented May 14, 2026

Copy link
Copy Markdown
Contributor

Summary:
This change enables testing of the two new MANIFEST optimization options introduced in D103568447:

  1. optimize_manifest_for_recovery - Skips unnecessary MANIFEST edits during recovery
  2. reuse_manifest_on_open - Reuses existing MANIFEST file on DB open

Changes:

  • tools/db_crashtest.py: Add both options with 20% probability to default_params
  • db_stress_tool/db_stress_test_base.h: Add ManifestVerifyMode enum and member variables for tracking MANIFEST state
  • db_stress_tool/db_stress_test_base.cc: Implement RecordManifestStateBeforeReopen() and VerifyManifestNotRewritten() methods to validate MANIFEST reuse on DB reopen

The verification logic handles 4 combinations:

  • Both disabled: No verification (baseline)
  • Only optimize enabled: No verification (hard to measure without sync points)
  • Only reuse enabled: Verify MANIFEST file is reused
  • Both enabled: Verify MANIFEST reused AND CURRENT unchanged

Verification is warning-only (not fatal) to account for legitimate fallback cases like corruption or size limits.

Reviewed By: hx235

Differential Revision: D105224068

@meta-cla meta-cla Bot added the CLA Signed label May 14, 2026
@meta-codesync

meta-codesync Bot commented May 14, 2026

Copy link
Copy Markdown

@anand1976 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D105224068.

@github-actions

github-actions Bot commented May 14, 2026

Copy link
Copy Markdown

⚠️ clang-tidy: 5 warning(s) on changed lines

Completed in 67.4s.

Summary by check

Check Count
concurrency-mt-unsafe 3
performance-inefficient-string-concatenation 2
Total 5

Details

db_stress_tool/db_stress_test_base.cc (5 warning(s))
db_stress_tool/db_stress_test_base.cc:4337:52: warning: string concatenation results in allocation of unnecessary temporary strings; consider using 'operator+=' or 'string::append()' instead [performance-inefficient-string-concatenation]
db_stress_tool/db_stress_test_base.cc:4393:52: warning: string concatenation results in allocation of unnecessary temporary strings; consider using 'operator+=' or 'string::append()' instead [performance-inefficient-string-concatenation]
db_stress_tool/db_stress_test_base.cc:4410:7: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_test_base.cc:4444:11: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_test_base.cc:4482:11: warning: function is not thread safe [concurrency-mt-unsafe]
@anand1976 anand1976 force-pushed the export-D105224068 branch from 7372093 to 3750446 Compare May 20, 2026 16:48
@meta-codesync meta-codesync Bot changed the title Enable MANIFEST optimization options in db_crashtest.py and add db_stress verification May 20, 2026
anand1976 added a commit to anand1976/rocksdb that referenced this pull request May 20, 2026
…ress verification (facebook#14742)

Summary:

This change enables testing of the two new MANIFEST optimization options introduced in D103568447:
1. optimize_manifest_for_recovery - Skips unnecessary MANIFEST edits during recovery
2. reuse_manifest_on_open - Reuses existing MANIFEST file on DB open

Changes:
- tools/db_crashtest.py: Add both options with 20% probability to default_params
- db_stress_tool/db_stress_test_base.h: Add ManifestVerifyMode enum and member variables for tracking MANIFEST state
- db_stress_tool/db_stress_test_base.cc: Implement RecordManifestStateBeforeReopen() and VerifyManifestNotRewritten() methods to validate MANIFEST reuse on DB reopen

The verification logic handles 4 combinations:
- Both disabled: No verification (baseline)
- Only optimize enabled: No verification (hard to measure without sync points)
- Only reuse enabled: Verify MANIFEST file is reused
- Both enabled: Verify MANIFEST reused AND CURRENT unchanged

Verification is warning-only (not fatal) to account for legitimate fallback cases like corruption or size limits.

Reviewed By: hx235

Differential Revision: D105224068
@anand1976 anand1976 force-pushed the export-D105224068 branch from 3750446 to 33e3d07 Compare May 20, 2026 16:53
…ress verification

Summary:
This change enables testing of the two new MANIFEST optimization options introduced in D103568447:
1. optimize_manifest_for_recovery - Skips unnecessary MANIFEST edits during recovery
2. reuse_manifest_on_open - Reuses existing MANIFEST file on DB open

Changes:
- tools/db_crashtest.py: Add both options with 20% probability to default_params
- db_stress_tool/db_stress_test_base.h: Add ManifestVerifyMode enum and member variables for tracking MANIFEST state
- db_stress_tool/db_stress_test_base.cc: Implement RecordManifestStateBeforeReopen() and VerifyManifestNotRewritten() methods to validate MANIFEST reuse on DB reopen

The verification logic handles 4 combinations:
- Both disabled: No verification (baseline)
- Only optimize enabled: No verification (hard to measure without sync points)
- Only reuse enabled: Verify MANIFEST file is reused
- Both enabled: Verify MANIFEST reused AND CURRENT unchanged

Verification is warning-only (not fatal) to account for legitimate fallback cases like corruption or size limits.

Reviewed By: hx235

Differential Revision: D105224068
@meta-codesync meta-codesync Bot changed the title Enable MANIFEST optimization options in db_crashtest.py and add db_stress verification (#14742) May 20, 2026
@anand1976 anand1976 force-pushed the export-D105224068 branch from 33e3d07 to 864bbb6 Compare May 20, 2026 17:02
@github-actions

Copy link
Copy Markdown

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit 864bbb6


Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase
@github-actions

Copy link
Copy Markdown

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit 864bbb6


Summary

This PR adds crash test coverage for two MANIFEST optimization options (optimize_manifest_for_recovery and reuse_manifest_on_open) with 20% probability each, and implements verification logic in db_stress to validate MANIFEST behavior during DB reopen. The approach is sound and follows existing db_stress verification patterns. The verification logic is well-structured with appropriate strictness levels.

High-severity findings (1):

  • [db_stress_test_base.cc: VerifyManifestNotRewritten] GetFileSize() return status is unchecked when reading current_manifest_size, which could silently use a stale zero value and produce false verification results.
Full review (click to expand)

Findings

🔴 HIGH

H1. Unchecked GetFileSize() Status in VerifyManifestNotRewritten() -- db_stress_test_base.cc
  • Issue: In VerifyManifestNotRewritten(), the call Env::Default()->GetFileSize(manifest_path, &current_manifest_size) does not check the returned Status. If the call fails, current_manifest_size remains 0, which could cause false verification results (e.g., incorrectly concluding zero growth when the size simply wasn't read). This contrasts with RecordManifestStateBeforeReopen() where GetFileSize failure is properly handled by falling back to MANIFEST_VERIFY_NONE.
  • Root cause: Inconsistent error handling between the two methods for the same operation.
  • Suggested fix: Check the Status and either skip verification or log a warning:
    Status size_s = Env::Default()->GetFileSize(manifest_path, &current_manifest_size);
    if (!size_s.ok()) {
      fprintf(stderr, "Warning: Failed to get MANIFEST file size for verification: %s\n",
              size_s.ToString().c_str());
      manifest_verify_mode_ = MANIFEST_VERIFY_NONE;
      return;
    }

🟡 MEDIUM

M1. STRICT mode probability is extremely low -- db_stress_test_base.cc
  • Issue: STRICT mode requires all of: reuse_manifest_on_open=1 (~20%), optimize_manifest_for_recovery=1 (~20%), avoid_flush_during_recovery=true (~12.5%), write_dbid_to_manifest=0 (~50%), metadata_write_fault_one_in=0 (~33%), and open_metadata_write_fault_one_in=0 (~67%). The compound probability is roughly 0.04%, meaning STRICT mode will rarely activate in practice and may not catch regressions effectively.
  • Suggested fix: Consider adding a dedicated crashtest configuration or test variant that explicitly enables all STRICT mode preconditions to ensure this critical verification path gets regular exercise.
M2. Multiple MANIFEST files could cause inconsistent file selection -- db_stress_test_base.cc
  • Issue: Both RecordManifestStateBeforeReopen() and VerifyManifestNotRewritten() iterate through GetChildren() results and break on the first MANIFEST file found. GetChildren() does not guarantee ordering. If multiple MANIFEST files exist temporarily (e.g., during cleanup races), the pre-reopen and post-reopen scans could select different files, causing false verification failures or missed violations.
  • Suggested fix: Select the highest-numbered MANIFEST file (scan all, keep max) to ensure deterministic selection.
M3. Unguarded comment reformatting changes mixed with functional changes -- db_stress_test_base.cc:4261-4263
  • Issue: The diff includes a non-functional reformatting of an existing comment (breaking "access" to next line, "upcoming" to next line) mixed with the functional changes. This makes the diff harder to review and clutters the change history.
  • Suggested fix: Separate formatting-only changes from functional changes, or avoid reformatting existing comments.

🟢 LOW / NIT

L1. write_dbid_to_manifest uses non-lambda evaluation in db_crashtest.py:346
  • Issue: Pre-existing inconsistency: write_dbid_to_manifest is random.randint(0, 1) (evaluated once at import), while the new options use lambdas. Not a bug in this PR, but STRICT mode's write_dbid_to_manifest=0 check will be constant per session.
L2. Arbitrary 10KB growth threshold for REUSE mode
  • Issue: The 10240 byte threshold lacks documentation explaining the rationale.
  • Suggested fix: Add a brief comment.
L3. No handling for manifest_file_number_before_reopen_ == 0
  • Issue: If no MANIFEST found pre-reopen, verification could false-trigger. Unlikely in practice.
  • Suggested fix: Early return to NONE mode if no MANIFEST found after scanning.

Cross-Component Analysis

Context Assumptions hold? Action needed?
WritePreparedTxnDB YES - verification is observational Safe
best_efforts_recovery YES - explicitly returns NONE Safe
Fault injection (open) YES - STRICT checks fault flags Safe
CF operations during reopen YES - DB is closed, no ops possible Safe

Positive Observations

  • The 4-mode verification design provides graduated strictness matching confidence levels.
  • Using Env::Default() for verification correctly avoids fault injection interference.
  • Warning-only for non-STRICT is appropriate given legitimate fallback scenarios.
  • Plain enum is consistent with LastIterateOp in the same class.
  • Lambda syntax in db_crashtest.py is consistent with other probabilistic options.

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase
@meta-codesync

meta-codesync Bot commented May 20, 2026

Copy link
Copy Markdown

This pull request has been merged in 88d7d2d.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment