Skip to content

Fix false-positive corruption error with remote compaction#14808

Closed
joshkang97 wants to merge 1 commit into
facebook:mainfrom
joshkang97:debug/crashtest-sc45035996706498448
Closed

Fix false-positive corruption error with remote compaction#14808
joshkang97 wants to merge 1 commit into
facebook:mainfrom
joshkang97:debug/crashtest-sc45035996706498448

Conversation

@joshkang97

Copy link
Copy Markdown
Contributor

Summary

Fix a false-positive compaction corruption error in the remote compaction path. Remote workers can mark CompactionJobStats::num_input_records as unreliable when input iteration uses seek/skip behavior, but the remote result serialization was dropping has_accurate_num_input_records. The primary then deserialized the flag as its default true, trusted an unreliable zero input-record count, and raised Compaction number of input keys does not match number of keys processed.

This change serializes the accuracy flag with the rest of CompactionJobStats so the primary preserves the remote worker's "do not verify this count" signal. It also adds a targeted regression test and a release note for the bug fix.

Test Plan

New regression test that triggers corruption error without the fix by modifying has_accurate_num_input_records=false

@joshkang97 joshkang97 requested a review from xingbowang June 1, 2026 18:23
@meta-cla meta-cla Bot added the CLA Signed label Jun 1, 2026
@meta-codesync

meta-codesync Bot commented Jun 1, 2026

Copy link
Copy Markdown

@joshkang97 has imported this pull request. If you are a Meta employee, you can view this in D107128357.

@github-actions

github-actions Bot commented Jun 1, 2026

Copy link
Copy Markdown

✅ clang-tidy: No findings on changed lines

Completed in 288.8s.

@github-actions

github-actions Bot commented Jun 1, 2026

Copy link
Copy Markdown

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit e7cc99e


Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase
@github-actions

github-actions Bot commented Jun 1, 2026

Copy link
Copy Markdown

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit e7cc99e


Summary

Correct, well-targeted fix for a real false-positive corruption error in the remote compaction path. The serialization change and test are sound. However, the same systemic issue (missing serialization entries) affects 4 additional CompactionJobStats fields, and the test could be strengthened with a data verification call.

High-severity findings (0):

No high-severity findings.

Full review (click to expand)

Findings

🟡 MEDIUM

M1. Four additional CompactionJobStats fields missing from serialization map -- compaction_service_job.cc:645-766
  • Issue: Comparing the CompactionJobStats struct (include/rocksdb/compaction_job_stats.h) against the compaction_job_stats_type_info serialization map reveals 4 other fields that are not serialized for remote compaction results:
    1. num_input_files_trivially_moved (size_t, default 0)
    2. num_filtered_input_files (size_t, default 0)
    3. num_filtered_input_files_at_output_level (size_t, default 0)
    4. total_skipped_input_bytes (uint64_t, default 0)
  • Root cause: Same systemic issue as the bug being fixed -- new struct fields are added without updating the remote compaction serialization map. No CI or compile-time check enforces completeness.
  • Impact: These fields default to 0 on the primary after deserialization, causing incorrect statistics/metrics for remote compactions. Unlike has_accurate_num_input_records, these do not cause false-positive errors -- they only affect observability and monitoring accuracy.
  • Suggested fix: Add the 4 missing fields to compaction_job_stats_type_info in a follow-up PR. Consider adding a static assertion or test that validates serialization map completeness against the struct.
M2. Test should call VerifyTestData() after compaction -- compaction_service_test.cc:971
  • Issue: The new InaccurateInputRecordCount test verifies that CompactRange succeeds (ASSERT_OK) but does not verify that the data is correct after compaction. Other tests in the same file (e.g., BasicCompactions, ManualCompaction) call VerifyTestData() after compaction to confirm data integrity.
  • Root cause: Test omission.
  • Suggested fix: Add VerifyTestData(); after the ASSERT_GE line to confirm the compaction produced correct results despite the inaccurate input record count.

🟢 LOW / NIT

L1. Consider adding a serialization round-trip unit test
  • Issue: The test uses a SyncPoint callback to inject has_accurate_num_input_records=false into the result after Run() but before serialization, which effectively tests the full path. However, a focused round-trip test (serialize with false, deserialize, verify false is preserved) would provide more direct coverage of the serialization fix itself and would be easier to debug if the serialization ever regresses.
  • Suggested fix: Add a small test that creates a CompactionServiceResult, sets stats.has_accurate_num_input_records = false, calls Write() then Read(), and asserts the flag is preserved. The existing TEST_Equals infrastructure can be leveraged.
L2. No assertion that verification was actually skipped
  • Issue: The test proves that CompactRange does not return a corruption error, but does not directly assert that VerifyInputRecordCount was skipped (as opposed to running and coincidentally passing). A SyncPoint on the verification path could strengthen confidence.
  • Suggested fix: Optional -- current coverage is adequate since num_input_records=0 would definitely fail verification if it ran.

Cross-Component Analysis

Context Does code execute? Assumptions hold? Action needed?
WritePreparedTxnDB No (no compaction_service usage) N/A None
ReadOnly DB No (no compaction in read-only) N/A None
User-defined timestamps Possible Yes (UDT has separate verification bypass) None
FIFO / Universal compaction Possible via remote Yes (flag is compaction-style agnostic) None
Old primary + new worker Yes Yes (ignore_unknown_options=true skips unknown field) None
New primary + old worker Yes Yes (defaults to true, same as current behavior) None

Backward/forward compatibility: Fully compatible. The ConfigOptions with ignore_unknown_options = true in CompactionServiceResult::Read() ensures old readers silently ignore the new field, and new readers use the safe default true when reading old data.

Aggregation correctness: CompactionJobStats::Add() uses has_accurate_num_input_records &= stats.has_accurate_num_input_records, so if any subcompaction signals inaccuracy, the aggregate correctly becomes false.

Positive Observations

  • The fix is minimal, well-scoped, and addresses the exact root cause.
  • The implementation follows existing serialization patterns precisely (OptionType::kBoolean, OptionVerificationType::kNormal, OptionTypeFlags::kNone).
  • The test reuses established infrastructure (CompactionServiceTest fixture, GenerateTestData(), existing SyncPoint).
  • The release note is clear and appropriately concise.
  • No performance impact -- serialization happens once per remote compaction (cold path).

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase
@meta-codesync

meta-codesync Bot commented Jun 2, 2026

Copy link
Copy Markdown

@joshkang97 merged this pull request in 768de6e.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

1 participant