Skip to content

Add remote compaction format compatibility test to check_format_compatible.sh (#14798)#14798

Closed
hx235 wants to merge 1 commit into
facebook:mainfrom
hx235:export-D106321150
Closed

Add remote compaction format compatibility test to check_format_compatible.sh (#14798)#14798
hx235 wants to merge 1 commit into
facebook:mainfrom
hx235:export-D106321150

Conversation

@hx235

@hx235 hx235 commented May 28, 2026

Copy link
Copy Markdown
Contributor

Summary:

Add cross-version format compatibility testing for remote compaction to check_format_compatible.sh as primary and worker RocksDB instances can run different versions.

Two new ldb commands coordinate via local files:

  • remote_compaction_primary: opens an existing DB and runs CompactRange() through LocalFileCompactionService, which writes input.bin via Schedule() and polls for result.bin via Wait().
  • remote_compaction_worker: polls for input.bin, calls OpenAndCompact(), writes result.bin.

The test script creates a DB using generate_random_db.sh with the primary's ldb binary (new optional 3rd argument) so the OPTIONS file matches the primary's version. An overlap key is written to ensure CompactRange triggers a real compaction (not a trivial move). For each old ref in db_forward_with_options_refs, the script tests both directions -- current primary + old worker and old primary + current worker -- to catch wire-format incompatibilities in CompactionServiceInput/CompactionServiceResult. Old refs lacking the commands are skipped gracefully.

Test plan:

1. generate_random_db.sh 3rd arg (custom ldb) -- PASSED:

$ bash tools/generate_random_db.sh /tmp/cs_t1/input /tmp/cs_t1/db1 "./ldb"
OK
$ bash tools/generate_random_db.sh /tmp/cs_t1/input /tmp/cs_t1/db2
OK

2. Simulated noexec: error message with exit code -- VERIFIED:

$ echo "not a binary" > /tmp/cs_t2/current_ldb && chmod +x /tmp/cs_t2/current_ldb
$ cs_ldb_output=$(env LD_LIBRARY_PATH=/tmp/cs_t2 /tmp/cs_t2/current_ldb --version 2>&1)
$ cs_ldb_exit=$?
$ echo $cs_ldb_exit
127
$ echo $cs_ldb_output
/tmp/cs_t2/current_ldb: line 1: not: command not found

3. SANITY_CHECK mode -- PASSED:

$ SANITY_CHECK=1 LONG_TEST=1 tools/check_format_compatible.sh
...
==== check_format_compatible.sh sanity check PASSED ====

4. SHORT_TEST mode -- PASSED:

$ SHORT_TEST=1 tools/check_format_compatible.sh
...
==== Compatibility Test PASSED ====

5. Cross-version ldb tests (sibling PR fix verification) -- ALL PASSED:

Four ldb binaries built from OSS repo:

  • old_ldb: 11.1.fb (no kReadTriggered) + this PR ldb commands, strict ParseArray
  • broken_ldb: main (has kReadTriggered) + this PR ldb commands, strict ParseArray, -1 hack removed
  • fixed_ldb: main (has kReadTriggered) + this PR + sibling PR SizeTolerantDeserializeArray
  • fixed_old_ldb: 11.1.fb (no kReadTriggered) + this PR + sibling PR SizeTolerantDeserializeArray

Test 1: Bug repro -- primary without fix (main, has kReadTriggered) + worker without fix (11.1.fb, no kReadTriggered)

$ bash tools/generate_random_db.sh /tmp/cs_test/t1/input /tmp/cs_test/t1/db "/tmp/cs_test/broken_ldb"
...
OK
$ printf "a ==> overlap_start\nzzzzzzz ==> overlap_end\n" | /tmp/cs_test/broken_ldb load --db=/tmp/cs_test/t1/db --auto_compaction=false
$ /tmp/cs_test/old_ldb remote_compaction_worker --db=/tmp/cs_test/t1/db --job_dir=/tmp/cs_test/t1 &
[2] 3831836
$ /tmp/cs_test/broken_ldb remote_compaction_primary --db=/tmp/cs_test/t1/db --job_dir=/tmp/cs_test/t1
remote_compaction_worker: OK (4697 bytes result)
Failed: CompactRange: Invalid argument: Serialized value has less elements than array size: counts
[2]-  Done                    /tmp/cs_test/old_ldb remote_compaction_worker --db=/tmp/cs_test/t1/db --job_dir=/tmp/cs_test/t1

Test 2: Bug repro -- primary without fix (11.1.fb, no kReadTriggered) + worker without fix (main, has kReadTriggered)

$ bash tools/generate_random_db.sh /tmp/cs_test/t2/input /tmp/cs_test/t2/db "/tmp/cs_test/old_ldb"
...
OK
$ printf "a ==> overlap_start\nzzzzzzz ==> overlap_end\n" | /tmp/cs_test/old_ldb load --db=/tmp/cs_test/t2/db --auto_compaction=false
$ /tmp/cs_test/broken_ldb remote_compaction_worker --db=/tmp/cs_test/t2/db --job_dir=/tmp/cs_test/t2 &
[2] 3854300
$ /tmp/cs_test/old_ldb remote_compaction_primary --db=/tmp/cs_test/t2/db --job_dir=/tmp/cs_test/t2
remote_compaction_worker: OK (4854 bytes result)
Failed: CompactRange: Invalid argument: Serialized value has more elements than array size: counts
[2]-  Done                    /tmp/cs_test/broken_ldb remote_compaction_worker --db=/tmp/cs_test/t2/db --job_dir=/tmp/cs_test/t2

Test 3: Fix -- primary with fix (main, has kReadTriggered) + worker without fix (11.1.fb, no kReadTriggered)

$ bash tools/generate_random_db.sh /tmp/cs_test/t3/input /tmp/cs_test/t3/db "/tmp/cs_test/fixed_ldb"
...
OK
$ printf "a ==> overlap_start\nzzzzzzz ==> overlap_end\n" | /tmp/cs_test/fixed_ldb load --db=/tmp/cs_test/t3/db --auto_compaction=false
$ /tmp/cs_test/old_ldb remote_compaction_worker --db=/tmp/cs_test/t3/db --job_dir=/tmp/cs_test/t3 &
[2] 3859945
$ /tmp/cs_test/fixed_ldb remote_compaction_primary --db=/tmp/cs_test/t3/db --job_dir=/tmp/cs_test/t3
remote_compaction_worker: OK (4692 bytes result)
remote_compaction_primary: OK
[2]-  Done                    /tmp/cs_test/old_ldb remote_compaction_worker --db=/tmp/cs_test/t3/db --job_dir=/tmp/cs_test/t3

Test 4: Fix -- primary with fix (11.1.fb, no kReadTriggered) + worker without fix (main, has kReadTriggered)

$ bash tools/generate_random_db.sh /tmp/cs_test/t4/input /tmp/cs_test/t4/db "/tmp/cs_test/fixed_old_ldb"
...
OK
$ printf "a ==> overlap_start\nzzzzzzz ==> overlap_end\n" | /tmp/cs_test/fixed_old_ldb load --db=/tmp/cs_test/t4/db --auto_compaction=false
$ /tmp/cs_test/broken_ldb remote_compaction_worker --db=/tmp/cs_test/t4/db --job_dir=/tmp/cs_test/t4 &
[2] 3883534
$ /tmp/cs_test/fixed_old_ldb remote_compaction_primary --db=/tmp/cs_test/t4/db --job_dir=/tmp/cs_test/t4
remote_compaction_worker: OK (4854 bytes result)
remote_compaction_primary: OK
[2]-  Done                    /tmp/cs_test/broken_ldb remote_compaction_worker --db=/tmp/cs_test/t4/db --job_dir=/tmp/cs_test/t4

Test 5: Sanity -- primary with fix (main) + worker with fix (main)

$ bash tools/generate_random_db.sh /tmp/cs_test/t5/input /tmp/cs_test/t5/db
...
OK
$ printf "a ==> overlap_start\nzzzzzzz ==> overlap_end\n" | /tmp/cs_test/fixed_ldb load --db=/tmp/cs_test/t5/db --auto_compaction=false
$ /tmp/cs_test/fixed_ldb remote_compaction_worker --db=/tmp/cs_test/t5/db --job_dir=/tmp/cs_test/t5 &
[2] 3892674
$ /tmp/cs_test/fixed_ldb remote_compaction_primary --db=/tmp/cs_test/t5/db --job_dir=/tmp/cs_test/t5
remote_compaction_worker: OK (4819 bytes result)
remote_compaction_primary: OK
[2]-  Done                    /tmp/cs_test/fixed_ldb remote_compaction_worker --db=/tmp/cs_test/t5/db --job_dir=/tmp/cs_test/t5

6. CI break-compat rehearsal -- FAIL (expected: FAIL):

Branches pushed to upstream (facebook/rocksdb):

CI job: https://github.com/facebook/rocksdb/actions/runs/26896515583/job/79336958934

7. CI smoke test -- PASS (expected: PASS):

Branch rc-ldb-smoke-test: upstream/main + this PR only (5 files).

CI job: https://github.com/facebook/rocksdb/actions/runs/26891267696

@meta-cla meta-cla Bot added the CLA Signed label May 28, 2026
@meta-codesync

meta-codesync Bot commented May 28, 2026

Copy link
Copy Markdown

@hx235 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D106321150.

@github-actions

github-actions Bot commented May 28, 2026

Copy link
Copy Markdown

⚠️ clang-tidy: 4 warning(s) on changed lines

Completed in 347.4s.

Summary by check

Check Count
cert-err58-cpp 4
Total 4

Details

tools/ldb_cmd.cc (4 warning(s))
tools/ldb_cmd.cc:5469:19: warning: initialization of 'kInputFile' with static storage duration may throw an exception that cannot be caught [cert-err58-cpp]
tools/ldb_cmd.cc:5470:19: warning: initialization of 'kResultFile' with static storage duration may throw an exception that cannot be caught [cert-err58-cpp]
tools/ldb_cmd.cc:5537:51: warning: initialization of 'ARG_JOB_DIR' with static storage duration may throw an exception that cannot be caught [cert-err58-cpp]
tools/ldb_cmd.cc:5590:50: warning: initialization of 'ARG_JOB_DIR' with static storage duration may throw an exception that cannot be caught [cert-err58-cpp]
@hx235 hx235 force-pushed the export-D106321150 branch from db040a0 to 192aa95 Compare May 28, 2026 10:58
@meta-codesync meta-codesync Bot changed the title Add remote compaction format compatibility test to check_format_compatible.sh May 28, 2026
hx235 pushed a commit to hx235/rocksdb that referenced this pull request May 28, 2026
…tible.sh (facebook#14798)

Summary:

Add cross-version format compatibility testing for remote compaction to `check_format_compatible.sh`.

Two new ldb commands (`remote_compaction_primary`, `remote_compaction_worker`) coordinate via local files: primary's `CompactRange()` writes `input.bin` through `Schedule()` and polls for `result.bin` through `Wait()`; worker polls for `input.bin`, calls `OpenAndCompact()`, writes `result.bin`. For each old ref in `db_forward_with_options_refs` (10.4.fb+), the script tests both directions — current primary + old worker and old primary + current worker — to catch serialization incompatibilities. Old refs lacking the commands are skipped gracefully.

Differential Revision: D106321150
@hx235 hx235 force-pushed the export-D106321150 branch from 192aa95 to c74dd23 Compare May 28, 2026 11:18
@github-actions

github-actions Bot commented May 28, 2026

Copy link
Copy Markdown

Codex Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit c74dd23


Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase
@github-actions

github-actions Bot commented May 28, 2026

Copy link
Copy Markdown

Claude Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit c74dd23


Summary

Well-structured PR that adds cross-version remote compaction format compatibility testing via two new ldb commands and shell script integration. The design follows existing check_format_compatible.sh patterns appropriately. The file-based IPC approach is pragmatic for a test tool.

High-severity findings (1):

  • [ldb_cmd.cc:~5653] CreateDirIfMissing failure is silently ignored in worker command, which could lead to a confusing OpenAndCompact failure downstream.
Full review (click to expand)

Findings

🔴 HIGH

H1. Worker silently ignores CreateDirIfMissing failure -- ldb_cmd.cc:~5653
  • Issue: In RemoteCompactionWorkerCommand::DoCommand(), the output directory creation error is silently ignored:
    Env::Default()->CreateDirIfMissing(output_dir).PermitUncheckedError();
    If this fails (permissions, disk full), OpenAndCompact will fail later with a cryptic file-system error rather than a clear "cannot create output directory" message.
  • Root cause: Using PermitUncheckedError() on an operation whose failure should abort the command.
  • Suggested fix:
    Status dir_s = Env::Default()->CreateDirIfMissing(output_dir);
    if (!dir_s.ok()) {
      exec_state_ = LDBCommandExecuteResult::Failed(
          "CreateDirIfMissing: " + dir_s.ToString());
      return;
    }

🟡 MEDIUM

M1. Worker process can hang up to 120s on primary failure -- ldb_cmd.cc:~5497
  • Issue: If the primary fails before writing input.bin, the worker polls for 120 seconds before timing out. Combined with wait $worker_pid in the shell script, a single test failure adds up to 2 minutes of wall-clock time.
  • Suggested fix: Consider reducing kPollTimeoutMs to 30-60 seconds since the primary should write input.bin within seconds of starting, or add a sentinel file written by the primary at startup so the worker can detect primary death.
M2. DestroyDB Status unchecked -- ldb_cmd.cc:~5583
  • Issue: DestroyDB(db_path_, options).PermitUncheckedError() -- if DestroyDB fails (e.g., leftover locks from a previous crash), the subsequent DB::Open with create_if_missing may behave unexpectedly on a partially-destroyed DB.
  • Suggested fix: Check the Status and fail with a clear message, or at minimum log a warning.
M3. No process cleanup on early exit in run_cs_compat_test -- check_format_compatible.sh:~345
  • Issue: If the primary or worker command crashes with a signal, or if the rm -rf / mkdir -p at the top of run_cs_compat_test fails, the function calls exit 1 without killing the background worker. The global cleanup trap handles git state but does not track or kill per-test worker PIDs.
  • Suggested fix: Add kill $worker_pid 2>/dev/null || true before the exit 1 in the failure branch of run_cs_compat_test.
M4. RemoteCompactionPrimaryCommand::Help and RemoteCompactionWorkerCommand::Help not registered in PrintHelp under a section header -- ldb_tool.cc:148-149
  • Issue: The new help entries are appended after UnsafeRemoveSstFileCommand::Help(ret) without a section separator or category label. All other commands are grouped under "Data Access Commands" or "Admin Commands". The new commands appear orphaned at the end with no heading.
  • Suggested fix: Add them under "Admin Commands" or add a "Testing Commands" heading.

🟢 LOW / NIT

L1. assert(!scheduled_) could be a runtime check -- ldb_cmd.cc:~5516
  • Issue: LocalFileCompactionService::Schedule() uses assert(!scheduled_) which is stripped in release builds. Since this is test infrastructure built with DEBUG_LEVEL=0 for compat testing, the assert is a no-op during actual use.
  • Suggested fix: Replace with an if (scheduled_) return Failure; check, or accept that this is only meaningful in debug test runs. Low severity because with max_subcompactions=1 and num_levels=2, only one Schedule() call occurs.
L2. AtomicWriteStringToFile uses std::ofstream instead of RocksDB Env -- ldb_cmd.cc:~5482
  • Issue: The write uses std::ofstream + std::rename instead of RocksDB's WritableFile + Env::RenameFile. This is acceptable for test tooling and avoids dependency complexity, but is inconsistent with other ldb commands that use Env.
  • Suggested fix: No action needed; documenting the intentional deviation.
L3. Trailing whitespace-only diff in second loop -- check_format_compatible.sh:~541
  • Issue: The diff adds a blank line in the second for checkout_ref loop body with no functional change. Minor noise.

Cross-Component Analysis

Context Applies? Safe? Notes
LD_LIBRARY_PATH for old binary No N/A OLD_LDB="./ldb" runs without env wrapper; only CURRENT_LDB uses env
DB lock conflict (primary + worker) Yes Yes OpenAndCompact opens as secondary instance; no lock conflict
Stale input.bin from previous run No N/A rm -rf "$test_dir" && mkdir -p "$test_dir" cleans before each test
Multi-subcompaction race No N/A Default max_subcompactions=1 with num_levels=2 produces exactly one Schedule/Wait cycle
SANITY_CHECK mode Yes Yes Guarded by if [ ! "$SANITY_CHECK" ] around the ldb save and test execution
SHORT_TEST mode Yes Yes Runs only first ref in db_forward_with_options_refs; old branch may lack command, gets skipped
Cross-filesystem rename in AtomicWriteStringToFile No N/A tmp file and target are in the same directory (job_dir)
Word splitting of CURRENT_LDB Yes Yes Path /tmp/rocksdb_cs_compat_${USER:-$$}/current_ldb never contains spaces
Both directions tested Yes Yes First loop calls run_cs_compat_test twice: current-primary+old-worker AND old-primary+current-worker

Positive Observations

  • The design elegantly reuses the existing check_format_compatible.sh infrastructure without requiring changes to the build system or test framework.
  • Testing both serialization directions (current-to-old and old-to-current) is thorough.
  • The grep -q remote_compaction_primary check gracefully skips old branches that lack the commands.
  • The env LD_LIBRARY_PATH=... pattern for running the saved current binary is a clean solution for dynamic linking.
  • The test data generation (4 flushes x 20 keys) is sufficient to produce a real L0->L1 compaction that exercises the core CompactionServiceInput/Result serialization paths.
  • The PR includes comprehensive manual testing across multiple scenarios (same-version, cross-version, SANITY_CHECK, SHORT_TEST, CI break-compat rehearsal).

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase
@hx235 hx235 force-pushed the export-D106321150 branch from c74dd23 to ab96950 Compare May 29, 2026 05:30
hx235 pushed a commit to hx235/rocksdb that referenced this pull request May 29, 2026
…tible.sh (facebook#14798)

Summary:

Add cross-version format compatibility testing for remote compaction to `check_format_compatible.sh`.

Two new ldb commands (`remote_compaction_primary`, `remote_compaction_worker`) coordinate via local files: primary's `CompactRange()` writes `input.bin` through `Schedule()` and polls for `result.bin` through `Wait()`; worker polls for `input.bin`, calls `OpenAndCompact()`, writes `result.bin`. For each old ref in `db_forward_with_options_refs` (10.4.fb+), the script tests both directions — current primary + old worker and old primary + current worker — to catch serialization incompatibilities. Old refs lacking the commands are skipped gracefully.

Differential Revision: D106321150
@hx235

hx235 commented May 29, 2026

Copy link
Copy Markdown
Contributor Author
  • H1 — CreateDirIfMissing silently ignored: Fixed. Now checks Status and fails the command with a clear error message.
  • M1 — Worker 120s timeout: Pushed back. 120s is a reasonable safety margin for CI environments with variable load. The primary writes input.bin almost
    immediately, so this timeout only triggers on real failures.
  • M2 — DestroyDB unchecked: Fixed. Now checks Status and logs a warning. Not a hard failure since the subsequent Open will catch a truly broken state.
  • M3 — Worker not killed on early exit: Fixed. Added kill $worker_pid before exit 1.
  • M4 — No section header for new commands: Pushed back. These are admin-adjacent commands; adding a separate section for just two commands is unnecessary.
  • L1 — assert in test code: Pushed back. Single Schedule() call is structurally guaranteed; assert is adequate for a test tool.
  • L2 — std::ofstream vs Env: Pushed back. Intentional for test tooling simplicity.
hx235 pushed a commit to hx235/rocksdb that referenced this pull request May 29, 2026
…tible.sh (facebook#14798)

Summary:

Add cross-version format compatibility testing for remote compaction to `check_format_compatible.sh`.

Two new ldb commands (`remote_compaction_primary`, `remote_compaction_worker`) coordinate via local files: primary's `CompactRange()` writes `input.bin` through `Schedule()` and polls for `result.bin` through `Wait()`; worker polls for `input.bin`, calls `OpenAndCompact()`, writes `result.bin`. For each old ref in `db_forward_with_options_refs` (10.4.fb+), the script tests both directions — current primary + old worker and old primary + current worker — to catch serialization incompatibilities. Old refs lacking the commands are skipped gracefully.

Differential Revision: D106321150
@hx235 hx235 force-pushed the export-D106321150 branch from ab96950 to 7b5018d Compare May 29, 2026 05:35
@github-actions

github-actions Bot commented May 29, 2026

Copy link
Copy Markdown

Codex Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit 7b5018d


Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase
@github-actions

Copy link
Copy Markdown

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit 7b5018d


Summary

Well-structured PR that adds cross-version format compatibility testing for remote compaction serialization (CompactionServiceInput/Result). The design is sound -- using actual CompactRange + OpenAndCompact exercises the real serialization path rather than synthetic round-trips, which is the right choice for a format compatibility test. The file-based coordination protocol is simple and effective.

High-severity findings (1):

  • [check_format_compatible.sh:cleanup] Background worker processes are not killed on script interruption (Ctrl+C / SIGINT), leading to orphaned processes that poll for up to 120 seconds.
Full review (click to expand)

Findings

🔴 HIGH

H1. Orphaned worker processes on script interruption -- check_format_compatible.sh:57
  • Issue: The cleanup() trap function does not track or kill background worker processes started by run_cs_compat_test(). If the script is interrupted (Ctrl+C, CI timeout, etc.) while a worker is running, the worker continues polling for input.bin for up to 120 seconds.
  • Root cause: worker_pid is a local variable in run_cs_compat_test() and not accessible to the global cleanup() function. The trap only handles git reset, branch cleanup, and temp directory removal.
  • Suggested fix: Track background PIDs in a global variable and kill them in cleanup():
    _bg_pids=()
    cleanup() {
      echo "== Cleaning up"
      for pid in "${_bg_pids[@]}"; do
        kill "$pid" 2>/dev/null || true
      done
      rm -rf "/tmp/rocksdb_cs_compat_${USER:-$$}" || true
      git reset --hard || true
      # ...
    }
    And in run_cs_compat_test():
    $worker_ldb remote_compaction_worker ... &
    local worker_pid=$!
    _bg_pids+=($worker_pid)

🟡 MEDIUM

M1. scheduled_ field is not protected for hypothetical multi-subcompaction -- ldb_cmd.cc:LocalFileCompactionService
  • Issue: bool scheduled_ in LocalFileCompactionService is not atomic. If max_subcompactions > 1 were ever set, multiple threads could call Schedule() concurrently, causing a data race and overwriting input.bin.
  • Root cause: The code relies on the default max_subcompactions=1 to prevent concurrent Schedule() calls. This is correct for the current usage but fragile.
  • Suggested fix: Use std::atomic<bool> instead of bool, or add a comment documenting the single-subcompaction assumption. Since this is test-only code with controlled options (num_levels=2, default max_subcompactions=1), the risk is low. A comment is sufficient.
M2. No early detection of worker failure -- check_format_compatible.sh:run_cs_compat_test
  • Issue: If the worker process exits immediately (e.g., old ldb binary crashes on startup), the primary continues running, flushes data, and then CompactRange() writes input.bin. Wait() then polls for result.bin for up to 120 seconds before timing out. This wastes ~2 minutes per failed test.
  • Root cause: No mechanism to detect early worker death. The protocol is purely file-based with no process health check.
  • Suggested fix: Add a quick kill -0 $worker_pid check after a short sleep before running the primary, or reduce the timeout for format compat tests.
M3. Missing comment on why remote compaction test is absent from second loop -- check_format_compatible.sh
  • Issue: The remote compaction compat test only runs in the first loop (old-ref checkout loop). This is actually correct -- both directions (current primary + old worker AND old primary + current worker) are tested in the first loop. However, a comment documenting this design decision would prevent future maintainers from trying to add it to the second loop.

🟢 LOW / NIT

L1. Unquoted $worker_ldb / $primary_ldb relies on intentional word-splitting -- check_format_compatible.sh:run_cs_compat_test
  • Issue: $primary_ldb and $worker_ldb are used unquoted, relying on word-splitting to expand env LD_LIBRARY_PATH=... /path/ldb into separate arguments. This is intentional and correct for the env command pattern, but breaks if $cs_test_dir contains spaces.
  • Suggested fix: Acceptable since paths are controlled. A comment noting the intentional word-splitting would improve clarity.
L2. kill $worker_pid after wait is redundant -- check_format_compatible.sh:run_cs_compat_test
  • Issue: After wait $worker_pid, the process has been reaped. The subsequent kill $worker_pid is a no-op.
L3. DestroyDB failure is only a warning -- ldb_cmd.cc:RemoteCompactionPrimaryCommand::DoCommand
  • Issue: Acceptable for test tooling. The subsequent DB::Open with create_if_missing handles it.
L4. Help text should be under a separate section -- ldb_tool.cc
  • Issue: New commands are appended after admin commands. Consider a "Testing Commands" section.

Cross-Component Analysis

Context Executes? Safe? Notes
SANITY_CHECK mode No Yes Guarded
SHORT_TEST mode Yes (first ref) Yes Works correctly
Old refs without commands No Yes grep -q skips gracefully
Default max_subcompactions=1 Yes Yes Single Schedule() call

Positive Observations

  • Atomic write via tmp+rename prevents partial reads
  • Bidirectional testing covers both serialization directions
  • Graceful skip for old refs lacking the commands
  • Consistent error handling following LDBCommand patterns
  • num_levels=2 cleanly constrains to single compaction job

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase
hx235 pushed a commit to hx235/rocksdb that referenced this pull request Jun 1, 2026
…tible.sh (facebook#14798)

Summary:

Add cross-version format compatibility testing for remote compaction to `check_format_compatible.sh`.

Two new ldb commands (`remote_compaction_primary`, `remote_compaction_worker`) coordinate via local files: primary's `CompactRange()` writes `input.bin` through `Schedule()` and polls for `result.bin` through `Wait()`; worker polls for `input.bin`, calls `OpenAndCompact()`, writes `result.bin`. For each old ref in `db_forward_with_options_refs` (10.4.fb+), the script tests both directions — current primary + old worker and old primary + current worker — to catch serialization incompatibilities. Old refs lacking the commands are skipped gracefully.

Differential Revision: D106321150
hx235 pushed a commit to hx235/rocksdb that referenced this pull request Jun 1, 2026
…tible.sh (facebook#14798)

Summary:

Add cross-version format compatibility testing for remote compaction to `check_format_compatible.sh`.

Two new ldb commands (`remote_compaction_primary`, `remote_compaction_worker`) coordinate via local files: primary's `CompactRange()` writes `input.bin` through `Schedule()` and polls for `result.bin` through `Wait()`; worker polls for `input.bin`, calls `OpenAndCompact()`, writes `result.bin`. For each old ref in `db_forward_with_options_refs` (10.4.fb+), the script tests both directions — current primary + old worker and old primary + current worker — to catch serialization incompatibilities. Old refs lacking the commands are skipped gracefully.

Differential Revision: D106321150
@hx235 hx235 force-pushed the export-D106321150 branch from 7b5018d to ff06d50 Compare June 3, 2026 16:00
hx235 pushed a commit to hx235/rocksdb that referenced this pull request Jun 3, 2026
…tible.sh (facebook#14798)

Summary:

Add cross-version format compatibility testing for remote compaction to `check_format_compatible.sh`.

Two new ldb commands coordinate via local files:
- `remote_compaction_primary`: opens an existing DB and runs `CompactRange()` through `LocalFileCompactionService`, which writes `input.bin` via `Schedule()` and polls for `result.bin` via `Wait()`.
- `remote_compaction_worker`: polls for `input.bin`, calls `OpenAndCompact()`, writes `result.bin`.

The test script creates a DB using `generate_random_db.sh` with the primary's ldb binary (new optional 3rd argument) so the OPTIONS file matches the primary's version. An overlap key is written to ensure `CompactRange` triggers a real compaction (not a trivial move). For each old ref in `db_forward_with_options_refs`, the script tests both directions -- current primary + old worker and old primary + current worker -- to catch wire-format incompatibilities in `CompactionServiceInput`/`CompactionServiceResult`. Old refs lacking the commands are skipped gracefully.

A same-version smoke test runs before cross-version tests to catch integration failures early. The saved ldb binary falls back from `test_dir` to `$PWD/tmp/cs` if `test_dir` is on a noexec filesystem (e.g. /dev/shm in CI Docker containers). Failure to run the saved binary is a hard error with exit code and error message printed.

Differential Revision: D106321150
hx235 pushed a commit to hx235/rocksdb that referenced this pull request Jun 3, 2026
…tible.sh (facebook#14798)

Summary:

Add cross-version format compatibility testing for remote compaction to `check_format_compatible.sh`.

Two new ldb commands coordinate via local files:
- `remote_compaction_primary`: opens an existing DB and runs `CompactRange()` through `LocalFileCompactionService`, which writes `input.bin` via `Schedule()` and polls for `result.bin` via `Wait()`.
- `remote_compaction_worker`: polls for `input.bin`, calls `OpenAndCompact()`, writes `result.bin`.

The test script creates a DB using `generate_random_db.sh` with the primary's ldb binary (new optional 3rd argument) so the OPTIONS file matches the primary's version. An overlap key is written to ensure `CompactRange` triggers a real compaction (not a trivial move). For each old ref in `db_forward_with_options_refs`, the script tests both directions -- current primary + old worker and old primary + current worker -- to catch wire-format incompatibilities in `CompactionServiceInput`/`CompactionServiceResult`. Old refs lacking the commands are skipped gracefully.

A same-version smoke test runs before cross-version tests to catch integration failures early. The saved ldb binary falls back from `test_dir` to `$PWD/tmp/cs` if `test_dir` is on a noexec filesystem (e.g. /dev/shm in CI Docker containers). Failure to run the saved binary is a hard error with exit code and error message printed.

Differential Revision: D106321150
@hx235 hx235 force-pushed the export-D106321150 branch from ff06d50 to 24a580a Compare June 3, 2026 16:31
…tible.sh (facebook#14798)

Summary:

Add cross-version format compatibility testing for remote compaction to `check_format_compatible.sh`.

Two new ldb commands coordinate via local files:
- `remote_compaction_primary`: opens an existing DB and runs `CompactRange()` through `LocalFileCompactionService`, which writes `input.bin` via `Schedule()` and polls for `result.bin` via `Wait()`.
- `remote_compaction_worker`: polls for `input.bin`, calls `OpenAndCompact()`, writes `result.bin`.

The test script creates a DB using `generate_random_db.sh` with the primary's ldb binary (new optional 3rd argument) so the OPTIONS file matches the primary's version. An overlap key is written to ensure `CompactRange` triggers a real compaction (not a trivial move). For each old ref in `db_forward_with_options_refs`, the script tests both directions -- current primary + old worker and old primary + current worker -- to catch wire-format incompatibilities in `CompactionServiceInput`/`CompactionServiceResult`. Old refs lacking the commands are skipped gracefully.

A same-version smoke test runs before cross-version tests to catch integration failures early. The saved ldb binary falls back from `test_dir` to `$PWD/tmp/cs` if `test_dir` is on a noexec filesystem (e.g. /dev/shm in CI Docker containers). Failure to run the saved binary is a hard error with exit code and error message printed.

Differential Revision: D106321150
@hx235 hx235 force-pushed the export-D106321150 branch from 24a580a to c02deb5 Compare June 3, 2026 17:06
@github-actions

github-actions Bot commented Jun 3, 2026

Copy link
Copy Markdown

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit c02deb5


Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase
@meta-codesync meta-codesync Bot closed this in 5177a8e Jun 3, 2026
@meta-codesync

meta-codesync Bot commented Jun 3, 2026

Copy link
Copy Markdown

This pull request has been merged in 5177a8e.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment