Fix false-positive corruption error with remote compaction by joshkang97 · Pull Request #14808 · facebook/rocksdb

joshkang97 · 2026-06-01T18:23:44Z

Summary

Fix a false-positive compaction corruption error in the remote compaction path. Remote workers can mark CompactionJobStats::num_input_records as unreliable when input iteration uses seek/skip behavior, but the remote result serialization was dropping has_accurate_num_input_records. The primary then deserialized the flag as its default true, trusted an unreliable zero input-record count, and raised Compaction number of input keys does not match number of keys processed.

This change serializes the accuracy flag with the rest of CompactionJobStats so the primary preserves the remote worker's "do not verify this count" signal. It also adds a targeted regression test and a release note for the bug fix.

Test Plan

New regression test that triggers corruption error without the fix by modifying has_accurate_num_input_records=false

meta-codesync · 2026-06-01T18:24:39Z

@joshkang97 has imported this pull request. If you are a Meta employee, you can view this in D107128357.

github-actions · 2026-06-01T18:30:08Z

✅ clang-tidy: No findings on changed lines

Completed in 288.8s.

github-actions · 2026-06-01T19:07:02Z

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit e7cc99e

❌ Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

Codex may miss context from files not in the diff
Large PRs may be truncated
Always apply human judgment to AI suggestions

Commands:

/codex-review [context] — Request a code review
/codex-query <question> — Ask about the PR or codebase

github-actions · 2026-06-01T19:15:00Z

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit e7cc99e

Summary

Correct, well-targeted fix for a real false-positive corruption error in the remote compaction path. The serialization change and test are sound. However, the same systemic issue (missing serialization entries) affects 4 additional CompactionJobStats fields, and the test could be strengthened with a data verification call.

High-severity findings (0):

No high-severity findings.

Full review (click to expand)

Findings

🟡 MEDIUM

M1. Four additional `CompactionJobStats` fields missing from serialization map -- `compaction_service_job.cc:645-766`

Issue: Comparing the CompactionJobStats struct (include/rocksdb/compaction_job_stats.h) against the compaction_job_stats_type_info serialization map reveals 4 other fields that are not serialized for remote compaction results:
1. num_input_files_trivially_moved (size_t, default 0)
2. num_filtered_input_files (size_t, default 0)
3. num_filtered_input_files_at_output_level (size_t, default 0)
4. total_skipped_input_bytes (uint64_t, default 0)
Root cause: Same systemic issue as the bug being fixed -- new struct fields are added without updating the remote compaction serialization map. No CI or compile-time check enforces completeness.
Impact: These fields default to 0 on the primary after deserialization, causing incorrect statistics/metrics for remote compactions. Unlike has_accurate_num_input_records, these do not cause false-positive errors -- they only affect observability and monitoring accuracy.
Suggested fix: Add the 4 missing fields to compaction_job_stats_type_info in a follow-up PR. Consider adding a static assertion or test that validates serialization map completeness against the struct.

M2. Test should call `VerifyTestData()` after compaction -- `compaction_service_test.cc:971`

Issue: The new InaccurateInputRecordCount test verifies that CompactRange succeeds (ASSERT_OK) but does not verify that the data is correct after compaction. Other tests in the same file (e.g., BasicCompactions, ManualCompaction) call VerifyTestData() after compaction to confirm data integrity.
Root cause: Test omission.
Suggested fix: Add VerifyTestData(); after the ASSERT_GE line to confirm the compaction produced correct results despite the inaccurate input record count.

🟢 LOW / NIT

L1. Consider adding a serialization round-trip unit test

Issue: The test uses a SyncPoint callback to inject has_accurate_num_input_records=false into the result after Run() but before serialization, which effectively tests the full path. However, a focused round-trip test (serialize with false, deserialize, verify false is preserved) would provide more direct coverage of the serialization fix itself and would be easier to debug if the serialization ever regresses.
Suggested fix: Add a small test that creates a CompactionServiceResult, sets stats.has_accurate_num_input_records = false, calls Write() then Read(), and asserts the flag is preserved. The existing TEST_Equals infrastructure can be leveraged.

L2. No assertion that verification was actually skipped

Issue: The test proves that CompactRange does not return a corruption error, but does not directly assert that VerifyInputRecordCount was skipped (as opposed to running and coincidentally passing). A SyncPoint on the verification path could strengthen confidence.
Suggested fix: Optional -- current coverage is adequate since num_input_records=0 would definitely fail verification if it ran.

Cross-Component Analysis

Context	Does code execute?	Assumptions hold?	Action needed?
WritePreparedTxnDB	No (no compaction_service usage)	N/A	None
ReadOnly DB	No (no compaction in read-only)	N/A	None
User-defined timestamps	Possible	Yes (UDT has separate verification bypass)	None
FIFO / Universal compaction	Possible via remote	Yes (flag is compaction-style agnostic)	None
Old primary + new worker	Yes	Yes (`ignore_unknown_options=true` skips unknown field)	None
New primary + old worker	Yes	Yes (defaults to `true`, same as current behavior)	None

Backward/forward compatibility: Fully compatible. The ConfigOptions with ignore_unknown_options = true in CompactionServiceResult::Read() ensures old readers silently ignore the new field, and new readers use the safe default true when reading old data.

Aggregation correctness: CompactionJobStats::Add() uses has_accurate_num_input_records &= stats.has_accurate_num_input_records, so if any subcompaction signals inaccuracy, the aggregate correctly becomes false.

Positive Observations

The fix is minimal, well-scoped, and addresses the exact root cause.
The implementation follows existing serialization patterns precisely (OptionType::kBoolean, OptionVerificationType::kNormal, OptionTypeFlags::kNone).
The test reuses established infrastructure (CompactionServiceTest fixture, GenerateTestData(), existing SyncPoint).
The release note is clear and appropriately concise.
No performance impact -- serialization happens once per remote compaction (cold path).

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

Claude may miss context from files not in the diff
Large PRs may be truncated
Always apply human judgment to AI suggestions

Commands:

/claude-review [context] — Request a code review
/claude-query <question> — Ask about the PR or codebase

meta-codesync · 2026-06-02T18:17:30Z

@joshkang97 merged this pull request in 768de6e.

fix

e7cc99e

joshkang97 requested a review from xingbowang June 1, 2026 18:23

meta-cla Bot added the CLA Signed label Jun 1, 2026

meta-codesync Bot closed this in 768de6e Jun 2, 2026

facebook-github-tools Bot added the Merged label Jun 2, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix false-positive corruption error with remote compaction#14808

Fix false-positive corruption error with remote compaction#14808
joshkang97 wants to merge 1 commit into
facebook:mainfrom
joshkang97:debug/crashtest-sc45035996706498448

joshkang97 commented Jun 1, 2026

meta-codesync Bot commented Jun 1, 2026

github-actions Bot commented Jun 1, 2026

github-actions Bot commented Jun 1, 2026

github-actions Bot commented Jun 1, 2026

Findings

🟡 MEDIUM

M1. Four additional `CompactionJobStats` fields missing from serialization map -- `compaction_service_job.cc:645-766`

M2. Test should call `VerifyTestData()` after compaction -- `compaction_service_test.cc:971`

🟢 LOW / NIT

L1. Consider adding a serialization round-trip unit test

L2. No assertion that verification was actually skipped

Cross-Component Analysis

Positive Observations

meta-codesync Bot commented Jun 2, 2026

Labels

1 participant

Uh oh!

Conversation

joshkang97 commented Jun 1, 2026

Summary

Test Plan

meta-codesync Bot commented Jun 1, 2026

github-actions Bot commented Jun 1, 2026

✅ clang-tidy: No findings on changed lines

github-actions Bot commented Jun 1, 2026

🟡 Codex Code Review

github-actions Bot commented Jun 1, 2026

✅ Claude Code Review

Summary

Findings

🟡 MEDIUM

M1. Four additional CompactionJobStats fields missing from serialization map -- compaction_service_job.cc:645-766

M2. Test should call VerifyTestData() after compaction -- compaction_service_test.cc:971

🟢 LOW / NIT

L1. Consider adding a serialization round-trip unit test

L2. No assertion that verification was actually skipped

Cross-Component Analysis

Positive Observations

meta-codesync Bot commented Jun 2, 2026

Labels

1 participant

M1. Four additional `CompactionJobStats` fields missing from serialization map -- `compaction_service_job.cc:645-766`

M2. Test should call `VerifyTestData()` after compaction -- `compaction_service_test.cc:971`