Skip to content

Fix HyperClockCache naive Lookup fast-path chain indexing to skip full fallback#14901

Open
pdillinger wants to merge 1 commit into
facebook:mainfrom
pdillinger:export-D109906106
Open

Fix HyperClockCache naive Lookup fast-path chain indexing to skip full fallback#14901
pdillinger wants to merge 1 commit into
facebook:mainfrom
pdillinger:export-D109906106

Conversation

@pdillinger

Copy link
Copy Markdown
Contributor

Differential Revision: D109906106

…l fallback

Differential Revision: D109906106
@meta-cla meta-cla Bot added the CLA Signed label Jun 30, 2026
@meta-codesync

meta-codesync Bot commented Jun 30, 2026

Copy link
Copy Markdown

@pdillinger has exported this pull request. If you are a Meta employee, you can view the originating Diff in D109906106.

@github-actions

Copy link
Copy Markdown

✅ clang-tidy: No findings on changed lines

Completed in 96.7s.

@github-actions

Copy link
Copy Markdown

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit 35d97ee


Codex review failed before producing findings.

WARNING: proceeding, even though we could not create PATH aliases: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase
@github-actions

Copy link
Copy Markdown

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit 35d97ee


Summary

Correct and important one-line fix for a typo-class bug that rendered the AutoHyperClockCache naive fast-path Lookup completely non-functional: IsEnd() (returns bool) was used where GetNext() (returns array index) was needed, causing every fast-path iteration to access arr[0] instead of the actual chain element. This was a performance-only bug — the full Lookup fallback guaranteed correctness. The fix matches the pattern used in 10+ other call sites throughout the same file.

High-severity findings (0):

No high-severity findings.

Full review (click to expand)

Findings

🟡 MEDIUM

M1. Residual damage from the bug on arr[0] — clock_cache.cc:3126
  • Issue: While the bug was active, the fast-path loop could spuriously increment the acquire counter on arr[0]'s metadata via h->meta.Apply(AcquireCounter::PlusTransformPromiseNoOverflow(1), ...) whenever the TSAN random path triggered probably_equal or (unlikely) arr[0] happened to have a matching hashed_key. In non-shareable states the increment is documented as having "no effect" (line 3136-3141), and in shareable states the subsequent Unref(*h) undoes it. So the ref counts are self-correcting and no lasting damage is expected.
  • Root cause: The bug caused operations on the wrong element; those operations are designed to be undoable.
  • Suggested fix: No action needed. Documenting for completeness — the existing undo logic (Unref on mismatch, no-op on non-shareable) handles this correctly.
M2. No test specifically validates fast-path effectiveness — cache/clock_cache_test.cc
  • Issue: Existing tests exercise AutoHyperClockCache Lookup through the public Cache API, which would succeed regardless of whether the fast path or full Lookup was used. There is no test that validates the fast path actually finds keys (e.g., by measuring that Lookup avoids the full algorithm). This allowed the bug to go undetected.
  • Root cause: The fast path is a pure optimization with no externally observable behavioral difference — only performance differs.
  • Suggested fix: Consider adding a performance regression test or a targeted unit test with instrumentation (e.g., a counter for fast-path hits vs. full-path fallbacks) to catch similar regressions in the future. This is optional given the fix is straightforward.

🟢 LOW / NIT

L1. The bug is a copy-paste or auto-complete error — clock_cache.cc:3111
  • Issue: next_with_shift.IsEnd() and next_with_shift.GetNext() are visually similar method names on the same object. This appears to be a typo introduced when the fast path was originally written. The loop condition on the previous line (3110) correctly uses IsEnd(), making it plausible that the author's cursor was on IsEnd when writing line 3111.
  • Suggested fix: The fix correctly changes to GetNext(). No further action needed.

Cross-Component Analysis

Context Does code execute? Assumptions hold? Action needed?
WritePreparedTxnDB Yes (cache used for block reads) Yes None — cache-internal fix
ReadOnly DB Yes Yes None
User-defined timestamps Yes Yes None
Concurrent writers Yes Yes — fast path uses relaxed loads by design None
FIFO / Universal compaction Yes Yes None
Table Grow (resize) Yes Yes — stale GetNext() values are handled by the full Lookup fallback; fast-path false negatives are safe by design None

Assumption stress-test:

  • Claim: "False negatives in the fast path are safe." Preconditions: (1) the full Lookup must always run after the fast path misses. Verified: the fast path loop simply falls through to line 3148 which starts the full algorithm from scratch. (2) The fast path must not corrupt state. Verified: acquire counter increments are undone by Unref on mismatch, and non-shareable states are documented as no-ops.
  • Claim: "GetNext() returns a valid index." Preconditions: the chain was constructed with valid indices. Verified: all chain construction paths (Insert, Grow) use Make() / MakeEnd() which set the Next field to valid array indices. Concurrent modifications could produce stale indices, but the existing design (comments at lines 3090-3106) explicitly accounts for this — "falling off the chain" is safe because memory safety is guaranteed and the full Lookup re-traverses correctly.

Positive Observations

  1. Correct fix: GetNext() matches the exact pattern used in all other chain traversal code in the file (lines 2052, 2065, 2441, 2578, 2637, 2698, 2731, 3213, 3229).
  2. Robust design: The two-phase Lookup design (naive fast path + correct full algorithm) made this bug a performance issue rather than a correctness issue. The defensive architecture worked exactly as intended.
  3. Minimal change: The fix is surgically precise — exactly one token changed, with no risk of introducing new bugs.

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

1 participant