Skip to content

Size-tolerant deserialization for compaction service response's counts array (#14807)#14807

Open
hx235 wants to merge 2 commits into
facebook:mainfrom
hx235:export-D106321320
Open

Size-tolerant deserialization for compaction service response's counts array (#14807)#14807
hx235 wants to merge 2 commits into
facebook:mainfrom
hx235:export-D106321320

Conversation

@hx235

@hx235 hx235 commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Summary:

The "counts" array field in compaction service response is non-critical information. During deserialization, primary DB can disagree on the size of this array with secondary DB if they run different RocksDB versions during cross-process remote compaction. We don't want to fail the compaction due to this disagreement as missing some non-critical info is fine. Prior to this PR, a kNumOfReasons - 1 array hack was used for this purpose when kNumOfReasons was recently increased by 1.

This PR replace the hack with a more generalSizeTolerantDeserializeArray, which overrides only the parse callback to tolerate size mismatches during deserialization: extra elements are ignored, missing elements are zero-filled.

Serialization and comparison are inherited from Array<T, kSize> unchanged as they don't cross the process boundary.

Test plan:

Test 1: Bug repro — primary with the extra new enum (broken_ldb) + worker with previous enum (old_ldb) → FAIL

$ rm -rf /tmp/cs_test/t1 && mkdir -p /tmp/cs_test/t1
$ /tmp/cs_test/old_ldb remote_compaction_worker --db=/tmp/cs_test/t1/db --job_dir=/tmp/cs_test/t1 &
$ /tmp/cs_test/broken_ldb remote_compaction_primary --db=/tmp/cs_test/t1/db --job_dir=/tmp/cs_test/t1
remote_compaction_worker: OK (4716 bytes result)
Failed: CompactRange: Invalid argument: Serialized value has less elements than array size: counts

Test 2: Bug repro — primary with previous enum (old_ldb) + worker with the extra new enum (broken_ldb) → FAIL

$ rm -rf /tmp/cs_test/t2 && mkdir -p /tmp/cs_test/t2
$ /tmp/cs_test/broken_ldb remote_compaction_worker --db=/tmp/cs_test/t2/db --job_dir=/tmp/cs_test/t2 &
$ /tmp/cs_test/old_ldb remote_compaction_primary --db=/tmp/cs_test/t2/db --job_dir=/tmp/cs_test/t2
remote_compaction_worker: OK (4844 bytes result)
Failed: CompactRange: Invalid argument: Serialized value has more elements than array size: counts

Test 3: Fix handles fewer enums — primary with the extra new enum and the fix (fixed_ldb) + worker with previous enum (old_ldb) → PASS (vs Test 1)

$ rm -rf /tmp/cs_test/t3 && mkdir -p /tmp/cs_test/t3
$ /tmp/cs_test/old_ldb remote_compaction_worker --db=/tmp/cs_test/t3/db --job_dir=/tmp/cs_test/t3 &
$ /tmp/cs_test/fixed_ldb remote_compaction_primary --db=/tmp/cs_test/t3/db --job_dir=/tmp/cs_test/t3
remote_compaction_worker: OK (4717 bytes result)
remote_compaction_primary: OK

Test 4: Fix handles more enums — primary with previous enum and the fix (fixed_old_ldb) + worker with the extra new enum (broken_ldb) → PASS (vs Test 2)

$ rm -rf /tmp/cs_test/t4 && mkdir -p /tmp/cs_test/t4
$ /tmp/cs_test/broken_ldb remote_compaction_worker --db=/tmp/cs_test/t4/db --job_dir=/tmp/cs_test/t4 &
$ /tmp/cs_test/fixed_old_ldb remote_compaction_primary --db=/tmp/cs_test/t4/db --job_dir=/tmp/cs_test/t4
remote_compaction_worker: OK (4848 bytes result)
remote_compaction_primary: OK

Test 5: Sanity — primary with with the extra new enum and the fix + worker with with the extra new enum and the fix (same version of fixed_ldb) → PASS

$ rm -rf /tmp/cs_test/t5 && mkdir -p /tmp/cs_test/t5
$ /tmp/cs_test/fixed_ldb remote_compaction_worker --db=/tmp/cs_test/t5/db --job_dir=/tmp/cs_test/t5 &
$ /tmp/cs_test/fixed_ldb remote_compaction_primary --db=/tmp/cs_test/t5/db --job_dir=/tmp/cs_test/t5
remote_compaction_worker: OK (4842 bytes result)
remote_compaction_primary: OK

CI plan: Once PR #14798 lands and a new release branch ref is cut after that and the current release branch is added to check_format_compatible.sh, the nightly or manual-triggered check_format_compatible.sh job will automatically test this PR's fix against that current release branch ref via cross-version remote compaction tests.

@meta-cla meta-cla Bot added the CLA Signed label Jun 1, 2026
@meta-codesync

meta-codesync Bot commented Jun 1, 2026

Copy link
Copy Markdown

@hx235 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D106321320.

@github-actions

github-actions Bot commented Jun 1, 2026

Copy link
Copy Markdown

⚠️ clang-tidy: 4 warning(s) on changed lines

Completed in 389.0s.

Summary by check

Check Count
cert-err58-cpp 4
Total 4

Details

tools/ldb_cmd.cc (4 warning(s))
tools/ldb_cmd.cc:5469:19: warning: initialization of 'kInputFile' with static storage duration may throw an exception that cannot be caught [cert-err58-cpp]
tools/ldb_cmd.cc:5470:19: warning: initialization of 'kResultFile' with static storage duration may throw an exception that cannot be caught [cert-err58-cpp]
tools/ldb_cmd.cc:5537:51: warning: initialization of 'ARG_JOB_DIR' with static storage duration may throw an exception that cannot be caught [cert-err58-cpp]
tools/ldb_cmd.cc:5617:50: warning: initialization of 'ARG_JOB_DIR' with static storage duration may throw an exception that cannot be caught [cert-err58-cpp]
@meta-codesync meta-codesync Bot changed the title Size-tolerant deserialization for compaction service response's counts array Jun 1, 2026
hx235 pushed a commit to hx235/rocksdb that referenced this pull request Jun 1, 2026
…s array (facebook#14807)

Summary:

The "counts" array field in compaction service response is non-critical information. During deserialization, primary DB can disagree on the size of this array with secondary DB if they run different RocksDB versions during cross-process remote compaction. We don't want to fail the compaction due to this disagreement as missing some non-critical info is fine. Prior to this PR,  a `kNumOfReasons - 1` array hack was used for this purpose when kNumOfReasons was recently increased by 1.

This PR replace the hack with a more general`SizeTolerantDeserializeArray`, which overrides only the parse callback to tolerate size mismatches during deserialization: extra elements are ignored, missing elements are zero-filled.

Serialization and comparison are inherited from `Array<T, kSize>` unchanged as they don't cross the process boundary.

Differential Revision: D106321320
@hx235 hx235 force-pushed the export-D106321320 branch from 4032f4a to 05fb4f9 Compare June 1, 2026 12:13
Hui Xiao added 2 commits June 1, 2026 06:01
…tible.sh (facebook#14798)

Summary:

Add cross-version format compatibility testing for remote compaction to `check_format_compatible.sh`.

Two new ldb commands (`remote_compaction_primary`, `remote_compaction_worker`) coordinate via local files: primary's `CompactRange()` writes `input.bin` through `Schedule()` and polls for `result.bin` through `Wait()`; worker polls for `input.bin`, calls `OpenAndCompact()`, writes `result.bin`. For each old ref in `db_forward_with_options_refs` (10.4.fb+), the script tests both directions — current primary + old worker and old primary + current worker — to catch serialization incompatibilities. Old refs lacking the commands are skipped gracefully.

Differential Revision: D106321150
…s array (facebook#14807)

Summary:

The "counts" array field in compaction service response is non-critical information. During deserialization, primary DB can disagree on the size of this array with secondary DB if they run different RocksDB versions during cross-process remote compaction. We don't want to fail the compaction due to this disagreement as missing some non-critical info is fine. Prior to this PR,  a `kNumOfReasons - 1` array hack was used for this purpose when kNumOfReasons was recently increased by 1.

This PR replace the hack with a more general`SizeTolerantDeserializeArray`, which overrides only the parse callback to tolerate size mismatches during deserialization: extra elements are ignored, missing elements are zero-filled.

Serialization and comparison are inherited from `Array<T, kSize>` unchanged as they don't cross the process boundary.

Differential Revision: D106321320
@hx235 hx235 force-pushed the export-D106321320 branch from 05fb4f9 to b0aa42e Compare June 1, 2026 13:02
@github-actions

github-actions Bot commented Jun 1, 2026

Copy link
Copy Markdown

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit b0aa42e


Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase
@github-actions

github-actions Bot commented Jun 1, 2026

Copy link
Copy Markdown

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit b0aa42e


Summary

Solid PR that replaces a brittle kNumOfReasons - 1 hack with a proper SizeTolerantDeserializeArray template for cross-version remote compaction compatibility. The core serialization fix is correct and well-designed. The ldb-based cross-version testing infrastructure is a valuable addition.

High-severity findings (0):
No high-severity findings.

Full review (click to expand)

Findings

🔴 HIGH

None.

🟡 MEDIUM

M1. Forward compatibility: new writer vs old reader still fails — db/compaction/compaction_service_job.cc
  • Issue: The old code serialized kNumOfReasons - 1 elements. The new code serializes kNumOfReasons elements (full array via inherited SerializeArray). An old reader (with kNumOfReasons - 1 in ParseArray) will reject the extra element with InvalidArgument("Serialized value has more elements than array size") at options_type.h:1022-1024. The ignore_unknown_options = true flag does NOT suppress this error — it only handles unknown field names, not array size mismatches within known fields (see options_helper.cc:1128-1131).
  • Root cause: SizeTolerantDeserializeArray fixes the READER but does not help old readers that don't have the fix.
  • Mitigating factors: (1) The fallback path at compaction_service_job.cc:192-209 safely falls back to local compaction when deserialization fails. (2) This is NOT a regression — the same failure already existed for any cross-version scenario where kNumOfReasons differed. The old -1 hack only papered over the 11.1→11.2 transition. (3) The check_format_compatible.sh tests explicitly guard with grep -q remote_compaction_primary to skip versions that lack the new commands.
  • Suggested fix: Consider documenting this expected behavior (old reader falls back to local compaction when reading from new writer) in the PR description or code comments. Alternatively, for maximum compatibility, serialize only min(kNumOfReasons, kOldSize) elements on the write path, but this adds complexity for diminishing returns.
M2. Test only verifies output_level_stats, not proximal_level_statsdb/compaction/compaction_service_test.cc
  • Issue: The CompatCheckCountsWidthTolerance test only sets and verifies result.internal_stats.output_level_stats.counts. Both output_level_stats and proximal_level_stats share the same compaction_stats_type_info and thus both use SizeTolerantDeserializeArray, but only one is explicitly tested.
  • Root cause: Test coverage gap.
  • Suggested fix: Add assertions for proximal_level_stats.counts as well, or add a brief comment noting that both use the same type_info.

🟢 LOW / NIT

L1. scheduled_ flag lacks synchronization — tools/ldb_cmd.cc:5524
  • Issue: LocalFileCompactionService::scheduled_ is a plain bool accessed without synchronization. Schedule() sets it, and it's only used as a debug assertion.
  • Mitigating factors: This is in a test-only ldb command, not production code. CompactRange calls Schedule and Wait from the same thread context. The flag is only used in assert, so it's stripped in release builds.
  • Suggested fix: Consider using std::atomic<bool> or removing the flag if the assert isn't providing value.
L2. RemoveLastFromCountsFields assumes at least two colon-separated elements — db/compaction/compaction_service_test.cc:41
  • Issue: If the "counts" field has only one element (e.g., "counts=5;"), rfind(':', semi - 1) would find no separator within the value range, and the assert(last_sep >= val_begin) would fire. This is fine for current usage since kNumOfReasons is well above 1, but the helper is fragile for reuse.
  • Suggested fix: Add a comment noting this assumption, or skip erasing if no separator is found.
L3. Minor: #include <memory> added to compaction_service_test.cc may be unnecessary — db/compaction/compaction_service_test.cc:5
  • Issue: The diff adds #include <memory> but the test doesn't appear to directly use std::unique_ptr, std::shared_ptr, or other <memory> facilities in the newly added code. It may have been added speculatively.
  • Suggested fix: Verify whether <memory> is actually needed by the new test code. If it's transitively included elsewhere, remove it.

Cross-Component Analysis

Context Does code execute? Assumptions hold? Action needed?
WritePreparedTxnDB No — remote compaction is a separate service N/A None
ReadOnly DB No — no compaction N/A None
db_bench Yes — OpenAndCompact path Yes — same Read/Write path None
db_stress Yes — stress compaction service Yes — fallback to local on failure None
Old snapshots N/A — counts are statistics, not key-level N/A None
User-defined timestamps N/A — counts are per-CompactionReason N/A None

Assumption stress-test results:

  • Claim: "counts is non-critical" — Verified. counts tracks per-CompactionReason compaction frequency for observability. Zero-fill for missing reasons is semantically correct (the old version didn't have that reason). The sum(counts) == count invariant is validated only on locally-computed stats, not cross-version deserialized stats.
  • Claim: "extra elements are silently ignored" — Verified. ParseVector parses all elements, then only the first kSize are copied to the fixed array.
  • Claim: "serialization and comparison are inherited unchanged" — Verified. SetParseFunc replaces only the parse callback; serialize and equals callbacks from Array<T, kSize> remain.

Positive Observations

  • Clean design: SizeTolerantDeserializeArray elegantly reuses the existing Array infrastructure, overriding only what's needed.
  • Proper TODO resolution: This directly addresses the documented TODO in the codebase.
  • Safe fallback: Parse failures fall back to local compaction, ensuring no data corruption or crashes.
  • Good testing strategy: The ldb-based cross-version testing fills a real gap in the compatibility test suite.
  • Minimal scope: The change is focused and doesn't introduce unnecessary complexity.

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment