Skip to content

Fix compaction abort rescheduling before queue pick#14800

Closed
xingbowang wants to merge 2 commits into
facebook:mainfrom
xingbowang:2026_05_28_compaction_abort_bug
Closed

Fix compaction abort rescheduling before queue pick#14800
xingbowang wants to merge 2 commits into
facebook:mainfrom
xingbowang:2026_05_28_compaction_abort_bug

Conversation

@xingbowang

@xingbowang xingbowang commented May 28, 2026

Copy link
Copy Markdown
Contributor

Summary

  • AbortAllCompactions can race with an already-scheduled automatic compaction before the background worker calls PickCompactionFromQueue(). MaybeScheduleFlushOrCompaction() has already consumed one unscheduled_compactions_ credit when it scheduled the worker, but the CF is still present in compaction_queue_ with queued_for_compaction=true. If the worker returns kCompactionAborted before popping the CF, ResumeAllCompactions() can see unscheduled_compactions_ == 0 and fail to schedule the still-queued work, leaving compaction permanently stalled until DB restart.

  • Restore the unscheduled compaction credit in the non-prepicked abort path, matching the existing BG-work-stopped handling for the same scheduled-before-pick state. Prepicked/manual compactions are not adjusted because they do not represent an unpopped automatic compaction_queue_ entry whose scheduling credit was consumed.

  • Add AbortScheduledAutomaticCompactionBeforePick to deterministically reproduce the lost-credit race with a sync point at BackgroundCallCompaction:0. The test verifies that ResumeAllCompactions() schedules the still-queued automatic compaction by checking COMPACT_WRITE_BYTES advances and L0 file count drops.

Test Plan

  • Without the implementation fix, DBCompactionAbortTest.AbortScheduledAutomaticCompactionBeforePick fails because COMPACT_WRITE_BYTES remains unchanged after ResumeAllCompactions().
AbortAllCompactions can race with an already-scheduled automatic compaction before the background worker calls PickCompactionFromQueue(). MaybeScheduleFlushOrCompaction() has already consumed one unscheduled_compactions_ credit when it scheduled the worker, but the CF is still present in compaction_queue_ with queued_for_compaction=true. If the worker returns kCompactionAborted before popping the CF, ResumeAllCompactions() can see unscheduled_compactions_ == 0 and fail to schedule the still-queued work, leaving compaction permanently stalled until DB restart.

Restore the unscheduled compaction credit in the non-prepicked abort path, matching the existing BG-work-stopped handling for the same scheduled-before-pick state. Prepicked/manual compactions are not adjusted because they do not represent an unpopped automatic compaction_queue_ entry whose scheduling credit was consumed.

Add AbortScheduledAutomaticCompactionBeforePick to deterministically reproduce the lost-credit race with a sync point at BackgroundCallCompaction:0. The test verifies that ResumeAllCompactions() schedules the still-queued automatic compaction by checking COMPACT_WRITE_BYTES advances and L0 file count drops.

Add a bug-fix release note under unreleased_history/bug_fixes for the next release.

Test Plan: without the implementation fix, AbortScheduledAutomaticCompactionBeforePick fails because COMPACT_WRITE_BYTES remains 0 after ResumeAllCompactions(). Verified final tree with make format-auto, make -j128 db_compaction_abort_test, timeout 60s ./db_compaction_abort_test --gtest_filter=DBCompactionAbortTest.AbortScheduledAutomaticCompactionBeforePick, timeout 60s ./db_compaction_abort_test, COERCE_CONTEXT_SWITCH=1 make -j128 db_compaction_abort_test && timeout 60s ./db_compaction_abort_test --gtest_filter=DBCompactionAbortTest.AbortScheduledAutomaticCompactionBeforePick --gtest_repeat=5, and make check-sources. After adding the release note, reran make check-sources.
@meta-cla meta-cla Bot added the CLA Signed label May 28, 2026
@github-actions

github-actions Bot commented May 28, 2026

Copy link
Copy Markdown

✅ clang-tidy: No findings on changed lines

Completed in 108.4s.

@github-actions

Copy link
Copy Markdown

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit 81e4eb2


Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase
@github-actions

Copy link
Copy Markdown

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit 81e4eb2


Summary

This PR correctly fixes a real race condition where AbortAllCompactions() can permanently stall compaction by losing an unscheduled_compactions_ credit when abort fires before PickCompactionFromQueue(). The fix is minimal, correctly scoped, thread-safe, and matches the existing BG-work-stopped pattern. No high-severity issues found.

High-severity findings (0):

No high-severity findings.

Full review (click to expand)

Findings

🔴 HIGH

None.

🟡 MEDIUM

M1. Inconsistent !is_prepicked guard in BG-work-stopped path -- db_impl_compaction_flush.cc:4080
  • Issue: The existing BG-work-stopped path (line 4080) increments unscheduled_compactions_++ unconditionally (no !is_prepicked guard), while the new abort path correctly guards with !is_prepicked. For prepicked/manual compactions, no unscheduled_compactions_ credit was consumed by MaybeScheduleFlushOrCompaction(), so the unconditional increment in the BG-work-stopped path could create a spurious credit.
  • Root cause: The BG-work-stopped path predates the abort fix and was written without considering the prepicked distinction. Its comment says "Since we didn't pop a cfd from the compaction queue" -- but prepicked compactions never had a queue entry to begin with.
  • Suggested fix: Add the same !is_prepicked guard to the BG-work-stopped path for consistency:
    } else {
        status = error_handler_.GetBGError();
        if (!is_prepicked) {
            unscheduled_compactions_++;
        }
    }
    This is a pre-existing issue, not introduced by this PR. Could be addressed as a follow-up.
M2. Test coverage gaps for edge cases -- db_compaction_abort_test.cc
  • Issue: The new test covers the primary scenario (single CF, single background thread, automatic compaction) but does not cover:
    • Multiple CFs in compaction_queue_ when abort fires (multiple credits lost and restored)
    • max_background_compactions > 1 (multiple workers aborting simultaneously)
    • Verifying that manual/prepicked compactions do NOT incorrectly increment unscheduled_compactions_ during abort
  • Root cause: Test scope limited to the minimal reproduction case.
  • Suggested fix: Consider adding a multi-CF or multi-thread variant in a follow-up. The current test is sufficient to validate the core fix.

🟢 LOW / NIT

L1. Consider centralizing credit restoration -- db_impl_compaction_flush.cc
  • Issue: Credit restoration (unscheduled_compactions_++) now appears in four separate locations (lines 4080, 4070, 4161, 4174), each with slightly different guard conditions. This fragmentation makes the credit system harder to reason about.
  • Suggested fix: Consider extracting a helper method in a follow-up:
    void RestoreUnscheduledCompactionCredit(bool is_prepicked) {
        if (!is_prepicked) { unscheduled_compactions_++; }
    }

Cross-Component Analysis

Context Affected? Verified?
WritePreparedTxnDB No Same scheduling path
FIFO/Universal compaction Yes, works correctly Style-agnostic scheduling
Multiple CFs Yes, works correctly Per-worker credit accounting
Multiple BG threads Yes, works correctly Each worker restores own credit
Nested abort/resume Yes, works correctly Counter semantics preserved
Manual compaction No (guarded by !is_prepicked) Verified via caller audit
Forwarded-to-bottom compaction No (guarded by !is_prepicked) Verified: original pick consumes credit, forwarded worker is prepicked
ReadOnly DB No No compaction scheduling

Key invariant verified: MaybeScheduleFlushOrCompaction() at line 3995 (in BackgroundCallCompaction cleanup) cannot consume the restored credit because compaction_aborted_ is still > 0 at that point. AbortAllCompactions() waits for bg_compaction_scheduled_ == 0 (line 3188), which happens AFTER line 3986's decrement, so ResumeAllCompactions() cannot be called until after the cleanup completes.

Positive Observations

  • The fix precisely mirrors the existing BG-work-stopped pattern (line 4080), demonstrating good codebase awareness.
  • The !is_prepicked guard is more correct than the existing BG-work-stopped path, showing attention to the credit accounting model.
  • The test uses SyncPointAbortHelper consistently with other tests in the file.
  • The comment on the new code clearly explains the invariant being maintained.
  • The sync point "BackgroundCallCompaction:0" precisely targets the window between worker scheduling and queue pick, making the test deterministic.

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase
@xingbowang xingbowang force-pushed the 2026_05_28_compaction_abort_bug branch from 81e4eb2 to 578e1d9 Compare May 28, 2026 18:15
@meta-codesync

meta-codesync Bot commented May 28, 2026

Copy link
Copy Markdown

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D106684155.

@meta-codesync

meta-codesync Bot commented Jun 1, 2026

Copy link
Copy Markdown

@xingbowang merged this pull request in 30dba7f.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

2 participants