Size-tolerant deserialization for compaction service response's counts array (#14807)#14807
Size-tolerant deserialization for compaction service response's counts array (#14807)#14807hx235 wants to merge 2 commits into
Conversation
|
@hx235 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D106321320. |
|
| Check | Count |
|---|---|
cert-err58-cpp |
4 |
| Total | 4 |
Details
tools/ldb_cmd.cc (4 warning(s))
tools/ldb_cmd.cc:5469:19: warning: initialization of 'kInputFile' with static storage duration may throw an exception that cannot be caught [cert-err58-cpp]
tools/ldb_cmd.cc:5470:19: warning: initialization of 'kResultFile' with static storage duration may throw an exception that cannot be caught [cert-err58-cpp]
tools/ldb_cmd.cc:5537:51: warning: initialization of 'ARG_JOB_DIR' with static storage duration may throw an exception that cannot be caught [cert-err58-cpp]
tools/ldb_cmd.cc:5617:50: warning: initialization of 'ARG_JOB_DIR' with static storage duration may throw an exception that cannot be caught [cert-err58-cpp]
…s array (facebook#14807) Summary: The "counts" array field in compaction service response is non-critical information. During deserialization, primary DB can disagree on the size of this array with secondary DB if they run different RocksDB versions during cross-process remote compaction. We don't want to fail the compaction due to this disagreement as missing some non-critical info is fine. Prior to this PR, a `kNumOfReasons - 1` array hack was used for this purpose when kNumOfReasons was recently increased by 1. This PR replace the hack with a more general`SizeTolerantDeserializeArray`, which overrides only the parse callback to tolerate size mismatches during deserialization: extra elements are ignored, missing elements are zero-filled. Serialization and comparison are inherited from `Array<T, kSize>` unchanged as they don't cross the process boundary. Differential Revision: D106321320
…tible.sh (facebook#14798) Summary: Add cross-version format compatibility testing for remote compaction to `check_format_compatible.sh`. Two new ldb commands (`remote_compaction_primary`, `remote_compaction_worker`) coordinate via local files: primary's `CompactRange()` writes `input.bin` through `Schedule()` and polls for `result.bin` through `Wait()`; worker polls for `input.bin`, calls `OpenAndCompact()`, writes `result.bin`. For each old ref in `db_forward_with_options_refs` (10.4.fb+), the script tests both directions — current primary + old worker and old primary + current worker — to catch serialization incompatibilities. Old refs lacking the commands are skipped gracefully. Differential Revision: D106321150
…s array (facebook#14807) Summary: The "counts" array field in compaction service response is non-critical information. During deserialization, primary DB can disagree on the size of this array with secondary DB if they run different RocksDB versions during cross-process remote compaction. We don't want to fail the compaction due to this disagreement as missing some non-critical info is fine. Prior to this PR, a `kNumOfReasons - 1` array hack was used for this purpose when kNumOfReasons was recently increased by 1. This PR replace the hack with a more general`SizeTolerantDeserializeArray`, which overrides only the parse callback to tolerate size mismatches during deserialization: extra elements are ignored, missing elements are zero-filled. Serialization and comparison are inherited from `Array<T, kSize>` unchanged as they don't cross the process boundary. Differential Revision: D106321320
🟡 Codex Code ReviewAuto-triggered after CI passed — reviewing commit b0aa42e ❌ Codex review failed before producing findings. ℹ️ About this responseGenerated by Codex CLI. Limitations:
Commands:
|
✅ Claude Code ReviewAuto-triggered after CI passed — reviewing commit b0aa42e SummarySolid PR that replaces a brittle High-severity findings (0): Full review (click to expand)Findings🔴 HIGHNone. 🟡 MEDIUMM1. Forward compatibility: new writer vs old reader still fails —
|
| Context | Does code execute? | Assumptions hold? | Action needed? |
|---|---|---|---|
| WritePreparedTxnDB | No — remote compaction is a separate service | N/A | None |
| ReadOnly DB | No — no compaction | N/A | None |
| db_bench | Yes — OpenAndCompact path | Yes — same Read/Write path | None |
| db_stress | Yes — stress compaction service | Yes — fallback to local on failure | None |
| Old snapshots | N/A — counts are statistics, not key-level | N/A | None |
| User-defined timestamps | N/A — counts are per-CompactionReason | N/A | None |
Assumption stress-test results:
- Claim: "counts is non-critical" — Verified.
countstracks per-CompactionReason compaction frequency for observability. Zero-fill for missing reasons is semantically correct (the old version didn't have that reason). Thesum(counts) == countinvariant is validated only on locally-computed stats, not cross-version deserialized stats. - Claim: "extra elements are silently ignored" — Verified.
ParseVectorparses all elements, then only the firstkSizeare copied to the fixed array. - Claim: "serialization and comparison are inherited unchanged" — Verified.
SetParseFuncreplaces only the parse callback; serialize and equals callbacks fromArray<T, kSize>remain.
Positive Observations
- Clean design:
SizeTolerantDeserializeArrayelegantly reuses the existingArrayinfrastructure, overriding only what's needed. - Proper TODO resolution: This directly addresses the documented TODO in the codebase.
- Safe fallback: Parse failures fall back to local compaction, ensuring no data corruption or crashes.
- Good testing strategy: The ldb-based cross-version testing fills a real gap in the compatibility test suite.
- Minimal scope: The change is focused and doesn't introduce unnecessary complexity.
ℹ️ About this response
Generated by Claude Code.
Review methodology: claude_md/code_review.md
Limitations:
- Claude may miss context from files not in the diff
- Large PRs may be truncated
- Always apply human judgment to AI suggestions
Commands:
/claude-review [context]— Request a code review/claude-query <question>— Ask about the PR or codebase
Summary:
The "counts" array field in compaction service response is non-critical information. During deserialization, primary DB can disagree on the size of this array with secondary DB if they run different RocksDB versions during cross-process remote compaction. We don't want to fail the compaction due to this disagreement as missing some non-critical info is fine. Prior to this PR, a
kNumOfReasons - 1array hack was used for this purpose when kNumOfReasons was recently increased by 1.This PR replace the hack with a more general
SizeTolerantDeserializeArray, which overrides only the parse callback to tolerate size mismatches during deserialization: extra elements are ignored, missing elements are zero-filled.Serialization and comparison are inherited from
Array<T, kSize>unchanged as they don't cross the process boundary.Test plan:
Test 1: Bug repro — primary with the extra new enum (
broken_ldb) + worker with previous enum (old_ldb) → FAILTest 2: Bug repro — primary with previous enum (
old_ldb) + worker with the extra new enum (broken_ldb) → FAILTest 3: Fix handles fewer enums — primary with the extra new enum and the fix (
fixed_ldb) + worker with previous enum (old_ldb) → PASS (vs Test 1)Test 4: Fix handles more enums — primary with previous enum and the fix (
fixed_old_ldb) + worker with the extra new enum (broken_ldb) → PASS (vs Test 2)Test 5: Sanity — primary with with the extra new enum and the fix + worker with with the extra new enum and the fix (same version of
fixed_ldb) → PASSCI plan: Once PR #14798 lands and a new release branch ref is cut after that and the current release branch is added to
check_format_compatible.sh, the nightly or manual-triggeredcheck_format_compatible.shjob will automatically test this PR's fix against that current release branch ref via cross-version remote compaction tests.