Skip to content

Fix BackgroundJobPressure flakiness with TEST_WaitForBackgroundWork (#14708)#14708

Closed
hx235 wants to merge 1 commit into
facebook:mainfrom
hx235:export-D103784101
Closed

Fix BackgroundJobPressure flakiness with TEST_WaitForBackgroundWork (#14708)#14708
hx235 wants to merge 1 commit into
facebook:mainfrom
hx235:export-D103784101

Conversation

@hx235

@hx235 hx235 commented May 5, 2026

Copy link
Copy Markdown
Contributor

Summary:
BackgroundJobPressure test was flaky because Phase 3 used TEST_WaitForCompact() which waits for bg_compaction_scheduled_ but NOT bg_pressure_callback_in_progress_. The final "healthy" pressure callback could still be in-flight when the test checked snapshots.back(), causing compaction_scheduled=1 instead of 0.

Fix: replace TEST_WaitForCompact() with TEST_WaitForBackgroundWork() which explicitly checks bg_pressure_callback_in_progress_ (db_impl.cc:478-484). Also add TEST_WaitForBackgroundWork() before Phase 1 and Phase 2 snapshot checks for consistency, ensuring all pressure callbacks are delivered before assertions.

Reviewed By: jaykorean

Differential Revision: D103784101

@meta-cla meta-cla Bot added the CLA Signed label May 5, 2026
@meta-codesync

meta-codesync Bot commented May 5, 2026

Copy link
Copy Markdown

@hx235 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D103784101.

@github-actions

github-actions Bot commented May 5, 2026

Copy link
Copy Markdown

✅ clang-tidy: No findings on changed lines

Completed in 128.8s.

@meta-codesync meta-codesync Bot changed the title Make BackgroundJobPressure unit test coverage more deterministic May 5, 2026
@hx235 hx235 force-pushed the export-D103784101 branch from d0a25ab to 2f24455 Compare May 5, 2026 23:46
hx235 pushed a commit to hx235/rocksdb that referenced this pull request May 5, 2026
…ebook#14708)

Summary:

`BackgroundJobPressure` assumed `snapshots.back()` after `TEST_WaitForCompact()` must already be the final healthy callback after unblocking compaction. That ordering is not guaranteed: `WaitForCompact()` waits for background work, but not for final `OnBackgroundJobPressureChanged()` delivery.

A local repro-only patch that delays only healthy callbacks widens that same callback-ordering gap and reproduces the flaky failure https://github.com/facebook/rocksdb/actions/runs/24914086454/job/72962285597?pr=14668

```cpp
if (snapshot.compaction_running == 0 && snapshot.compaction_scheduled == 0 &&
    !snapshot.compaction_speedup_active &&
    snapshot.write_stall_proximity_pct < 100) {
  Env::Default()->SleepForMicroseconds(50 * 1000);
}

[ RUN      ] EventListenerTest.BackgroundJobPressure
db/listener_test.cc:1764: Failure
Expected equality of these values:
  latest3.compaction_scheduled
    Which is: 1
  0
[  FAILED  ] EventListenerTest.BackgroundJobPressure (320 ms)
[----------] 8 tests from EventListenerTest (8188 ms total)

[----------] Global test environment tear-down
[==========] 8 tests from 1 test case ran. (8188 ms total)
[  PASSED  ] 7 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] EventListenerTest.BackgroundJobPressure

```


Fix the test by not doing such assertion and refactor the unit test coverage into focused healthy-vs-pressure checks

Differential Revision: D103784101
@hx235 hx235 requested a review from mszeszko-meta May 5, 2026 23:48
@github-actions

github-actions Bot commented May 6, 2026

Copy link
Copy Markdown

Codex Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit 2f24455


❌ Codex review failed before producing any output.


ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase
@github-actions

github-actions Bot commented May 6, 2026

Copy link
Copy Markdown

Claude Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit 2f24455


Summary

Clean, well-motivated test fix that correctly eliminates a flaky Phase 3 assertion caused by TEST_WaitForCompact() not waiting for callback delivery. The refactoring into helpers and split tests is solid. One medium-severity coverage concern about the removed pressure-relief verification.

High-severity findings (0):
No high-severity findings.

Full review (click to expand)

Findings

🟡 MEDIUM

M1. Lost pressure-relief coverage — db/listener_test.cc
  • Issue: The original Phase 3 verified that after compaction completes, BackgroundJobPressure returns to a healthy state (compaction_running=0, compaction_scheduled=0, speedup_active=false, proximity_pct < 100). This end-to-end lifecycle test is now entirely removed rather than replaced.
  • Root cause: TEST_WaitForCompact() does not wait for OnBackgroundJobPressureChanged() callback delivery (the callback fires after bg_cv_.SignalAll() at db_impl_compaction_flush.cc:3818), making the snapshots.back() assertion racy.
  • Suggested fix: Consider a follow-up that adds a non-flaky way to test pressure relief. Options include: (a) a polling loop with timeout that waits for a healthy snapshot to appear, (b) a test sync point after NotifyOnBackgroundJobPressureChanged(), or (c) a TEST_WaitForCallbacks() API. This is not blocking for this PR since the existing assertion was unreliable.
M2. BackgroundJobPressure helpers added to shared base fixture — db/listener_test.cc:50-61
  • Issue: Four BackgroundJobPressure-specific helpers (AssertBackgroundJobPressureInvariants, ReopenWithBackgroundJobPressureListener, BlockLowPriCompaction, UnblockLowPriCompactionAndWait) are added to the EventListenerTest class, which is the base fixture for ~20 unrelated tests in this file.
  • Suggested fix: Consider a dedicated BackgroundJobPressureTest fixture inheriting from EventListenerTest to keep the base class lean. This is a style nit, not a correctness issue.

🟢 LOW / NIT

L1. Forward declaration could be replaced by reordering — db/listener_test.cc:37
  • Issue: class BackgroundJobPressureTestListener; is forward-declared before EventListenerTest so the helper method signatures can reference it. An alternative is to move the BackgroundJobPressureTestListener class definition above EventListenerTest, eliminating the forward declaration.
  • Suggested fix: Either approach works; forward declaration is acceptable.
L2. Reset() removal limits future extensibility — db/listener_test.cc:1659-1662
  • Issue: Removing BackgroundJobPressureTestListener::Reset() is correct since it's no longer used. If a future test needs to clear snapshots mid-test, it would need to recreate the listener (which requires DB reopen). This is fine given the current design where each test creates a fresh DB.
L3. Loop range change is intentional and correct — db/listener_test.cc:1773
  • Issue: The second test uses i=0..9 (10 files) instead of the original i=3..9 (7 files). This is correct because the second test operates on a fresh DB (no prior L0 files from Phase 1), so it needs to generate all files from scratch. 10 files is well above the trigger of 4, ensuring pressure appears.

Cross-Component Analysis

Concern Assessment
3 L0 files triggering compaction Safe: level0_file_num_compaction_trigger=4, NeedsCompaction() returns false for 3 < 4. Same assertion existed in the original passing test.
SleepingBackgroundTask cleanup on failure Safe: ~SleepingBackgroundTask() destructor wakes the task and waits for completion (testutil.h:407-415), providing RAII cleanup.
ASSERT_* in static method Valid: gtest ASSERT_* macros work in any void-returning function, not just TEST_F bodies.
Thread safety of GetSnapshots() Safe: Both OnBackgroundJobPressureChanged and GetSnapshots are mutex-protected. The race was in the timing assumption (back() == final callback), not the data access.
Platform-dependent behavior Low risk: The flakiness was a logical race (callback vs WaitForCompact ordering), not a platform-specific issue. The fix correctly eliminates the race.

Positive Observations

  • The root cause analysis is precise: WaitForCompact() waits for background jobs, not callback delivery. The PR description includes a concrete repro patch.
  • Splitting into focused tests improves debuggability — failures map to specific scenarios.
  • UnblockLowPriCompactionAndWait() ensures proper cleanup in both tests, preventing resource leaks.
  • Comments explain non-obvious assertions (e.g., why compaction_scheduled == 0 with 3 L0 files).

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase
hx235 pushed a commit to hx235/rocksdb that referenced this pull request May 6, 2026
…ebook#14708)

Summary:

`BackgroundJobPressure` assumed `snapshots.back()` after `TEST_WaitForCompact()` must already be the final healthy callback after unblocking compaction. That ordering is not guaranteed: `WaitForCompact()` waits for background work, but not for final `OnBackgroundJobPressureChanged()` delivery.

A local repro-only patch that delays only healthy callbacks widens that same callback-ordering gap and reproduces the flaky failure https://github.com/facebook/rocksdb/actions/runs/24914086454/job/72962285597?pr=14668

```cpp
if (snapshot.compaction_running == 0 && snapshot.compaction_scheduled == 0 &&
    !snapshot.compaction_speedup_active &&
    snapshot.write_stall_proximity_pct < 100) {
  Env::Default()->SleepForMicroseconds(50 * 1000);
}

[ RUN      ] EventListenerTest.BackgroundJobPressure
db/listener_test.cc:1764: Failure
Expected equality of these values:
  latest3.compaction_scheduled
    Which is: 1
  0
[  FAILED  ] EventListenerTest.BackgroundJobPressure (320 ms)
[----------] 8 tests from EventListenerTest (8188 ms total)

[----------] Global test environment tear-down
[==========] 8 tests from 1 test case ran. (8188 ms total)
[  PASSED  ] 7 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] EventListenerTest.BackgroundJobPressure

```


Fix the test by not doing such assertion and refactor the unit test coverage into focused healthy-vs-pressure checks

Differential Revision: D103784101
@hx235 hx235 force-pushed the export-D103784101 branch from 2f24455 to 8aa61c4 Compare May 6, 2026 02:39
@hx235

hx235 commented May 6, 2026

Copy link
Copy Markdown
Contributor Author

Fixed meaningful feedback

@github-actions

github-actions Bot commented May 6, 2026

Copy link
Copy Markdown

Codex Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit 8aa61c4


❌ Codex review failed before producing any output.


ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase
@github-actions

github-actions Bot commented May 6, 2026

Copy link
Copy Markdown

Claude Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit 8aa61c4


Summary

Clean test refactoring that correctly fixes a real race condition. The fix is sound — Flush() is synchronous and pressure callbacks fire before bg_cv_.SignalAll(), so the remaining tests have no new races. The only concern is a modest coverage gap from removing Phase 3.

High-severity findings (0):

No high-severity findings.

Full review (click to expand)

Findings

🟡 MEDIUM

M1. Coverage gap: "pressure relief" transition untested — db/listener_test.cc
  • Issue: The old Phase 3 verified that after compaction completes, pressure indicators return to healthy state (compaction_running == 0, compaction_scheduled == 0, compaction_speedup_active == false, write_stall_proximity_pct < 100). This transition is no longer tested.
  • Root cause: Phase 3 was removed because TEST_WaitForCompact() doesn't guarantee callback delivery. However, the pressure-to-healthy transition is meaningful functionality — a bug where callbacks are never sent for the "relieved" state could impact monitoring.
  • Suggested fix: Consider adding a third test (BackgroundJobPressureRelievedAfterCompaction) that uses a polling/retry loop or sync point to wait for the expected healthy callback, rather than assuming snapshots.back() is final.

🟢 LOW / NIT

L1. Loop counter change i=3..10i=0..10 is correct but increases flush count — db/listener_test.cc:1777
  • Issue: The pressure test now does 10 flushes (was 7 in the old Phase 2). This is fine since DestroyAndReopen() creates a fresh DB, but it's a slight behavioral change.
L2. Consider documenting why Phase 3 was removed rather than fixed — db/listener_test.cc
  • Issue: A future developer might re-add the Phase 3 check without understanding the race. A brief comment near UnblockLowPriCompactionAndWait explaining why no assertions follow would help.

Cross-Component Analysis

Concern Status
Race in flush callbacks SAFENotifyOnBackgroundJobPressureChanged() fires before bg_cv_.SignalAll() in BackgroundCallFlush, so Flush() returns only after callbacks are delivered
Race in compaction callbacks KNOWNTEST_WaitForCompact() does not wait for callbacks; this is precisely the race the PR fixes by removing Phase 3
SleepingBackgroundTask lifetime SAFE — destructor waits for completion
Test isolation SAFEDestroyAndReopen() ensures no cross-test state leakage

Positive Observations

  • Good refactoring: extracting ReopenWithBackgroundJobPressureListener(), BlockLowPriCompaction(), and UnblockLowPriCompactionAndWait() into reusable fixture methods
  • Moving the listener class to file scope improves readability
  • Removing Reset() is correct — each test gets its own listener instance via DestroyAndReopen
  • The structural invariant check is properly preserved as AssertBackgroundJobPressureInvariants

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase
hx235 pushed a commit to hx235/rocksdb that referenced this pull request May 13, 2026
…ebook#14708)

Summary:

`BackgroundJobPressure` assumed `snapshots.back()` after `TEST_WaitForCompact()` must already be the final healthy callback after unblocking compaction. That ordering is not guaranteed: `WaitForCompact()` waits for background work, but not for final `OnBackgroundJobPressureChanged()` delivery.

A local repro-only patch that delays only healthy callbacks widens that same callback-ordering gap and reproduces the flaky failure https://github.com/facebook/rocksdb/actions/runs/24914086454/job/72962285597?pr=14668

```cpp
if (snapshot.compaction_running == 0 && snapshot.compaction_scheduled == 0 &&
    !snapshot.compaction_speedup_active &&
    snapshot.write_stall_proximity_pct < 100) {
  Env::Default()->SleepForMicroseconds(50 * 1000);
}

[ RUN      ] EventListenerTest.BackgroundJobPressure
db/listener_test.cc:1764: Failure
Expected equality of these values:
  latest3.compaction_scheduled
    Which is: 1
  0
[  FAILED  ] EventListenerTest.BackgroundJobPressure (320 ms)
[----------] 8 tests from EventListenerTest (8188 ms total)

[----------] Global test environment tear-down
[==========] 8 tests from 1 test case ran. (8188 ms total)
[  PASSED  ] 7 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] EventListenerTest.BackgroundJobPressure

```

Fix the test by not doing such assertion and refactor the unit test coverage into focused healthy-vs-pressure checks

Differential Revision: D103784101
@hx235 hx235 force-pushed the export-D103784101 branch from 8aa61c4 to ac55229 Compare May 13, 2026 22:18
@meta-codesync

meta-codesync Bot commented May 13, 2026

Copy link
Copy Markdown

@hx235 has imported this pull request. If you are a Meta employee, you can view this in D103784101.

@github-actions

github-actions Bot commented May 13, 2026

Copy link
Copy Markdown

Codex Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit ac55229


Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase
@github-actions

github-actions Bot commented May 13, 2026

Copy link
Copy Markdown

Claude Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit ac55229


Summary

Good refactoring of a flaky test, but the fix removes important coverage rather than addressing the root cause. The same file already uses TEST_WaitForBackgroundWork() (lines 339, 461, 525) which waits for pressure callbacks — Phase 3 should be fixed with that function rather than removed.

High-severity findings (2):

  • [db/listener_test.cc] Phase 3 (pressure relief) is removed instead of being fixed with TEST_WaitForBackgroundWork(), losing critical coverage of the "return to healthy" path.
  • [db/listener_test.cc] UnblockLowPriCompactionAndWait uses TEST_WaitForCompact() which doesn't wait for pressure callbacks, leaving a teardown race.
Full review (click to expand)

Findings

🔴 HIGH

H1. Phase 3 should be fixed, not removed — db/listener_test.cc
  • Issue: The PR removes Phase 3 entirely — the part that verifies pressure indicators return to healthy after compaction completes. This loses test coverage for a critical feature path: the transition from pressure back to healthy state (compaction_running == 0, compaction_scheduled == 0, compaction_speedup_active == false, write_stall_proximity_pct < 100).
  • Root cause: The flakiness occurs because TEST_WaitForCompact() doesn't wait for bg_pressure_callback_in_progress_ to reach zero. The final "healthy" callback may still be in-flight when the test checks snapshots.back().
  • Suggested fix: Keep Phase 3 as a third test (e.g., BackgroundJobPressurePressureRelievedAfterCompaction) and replace TEST_WaitForCompact() with TEST_WaitForBackgroundWork(). This is the established pattern in the same file:
    // Line 337-339 already do this:
    // Ensure background work is fully finished including listener callbacks
    // before accessing listener state.
    ASSERT_OK(dbfull()->TEST_WaitForBackgroundWork());
    WaitForBackgroundWork() (db/db_impl/db_impl.cc:478-484) explicitly checks bg_pressure_callback_in_progress_, which is the exact counter incremented/decremented around callback delivery in NotifyOnBackgroundJobPressureChanged() (db/db_impl/db_impl_compaction_flush.cc:3386-3391).
H2. UnblockLowPriCompactionAndWait uses wrong wait function — db/listener_test.cc
  • Issue: The helper calls TEST_WaitForCompact() for cleanup after unblocking compaction. This doesn't wait for pressure callbacks, leaving a timing window where callbacks are still in-flight when the DB destructor runs.
  • Root cause: WaitForCompact() (db/db_impl/db_impl_compaction_flush.cc:5191-5237) checks bg_compaction_scheduled_, bg_flush_scheduled_, etc. but NOT bg_pressure_callback_in_progress_. The DB destructor does wait for this counter, but between TEST_WaitForCompact() returning and the destructor running, callback state may be inconsistent.
  • Suggested fix: Replace TEST_WaitForCompact() with TEST_WaitForBackgroundWork() in UnblockLowPriCompactionAndWait:
    void BackgroundJobPressureTest::UnblockLowPriCompactionAndWait(
        test::SleepingBackgroundTask* sleeping_task) {
      sleeping_task->WakeUp();
      sleeping_task->WaitUntilDone();
      ASSERT_OK(dbfull()->TEST_WaitForBackgroundWork());
    }

🟡 MEDIUM

M1. No documentation of design decision
  • Issue: The PR description explains the flakiness but doesn't discuss why Phase 3 was removed instead of fixed with TEST_WaitForBackgroundWork(). Given that the fix exists and is already used in the same file, the removal needs justification.
  • Suggested fix: If there's a reason TEST_WaitForBackgroundWork() is insufficient for Phase 3, document it. Otherwise, prefer fixing over removing.

🟢 LOW / NIT

L1. Good refactoring
  • The test fixture extraction, helper methods, and test splitting are well-structured improvements that should be kept regardless of the Phase 3 decision.
  • AssertBackgroundJobPressureInvariants, ReopenWithBackgroundJobPressureListener, BlockLowPriCompaction, UnblockLowPriCompactionAndWait are clean, reusable helpers.
  • Moving BackgroundJobPressureTestListener to file scope and removing Reset() simplifies the code.
L2. Phase 2 test correctly starts from i=0
  • The original loop started at i=3 (continuing from Phase 1). Since each test is now independent with DestroyAndReopen, starting from i=0 is correct.

Cross-Component Analysis

Context Relevant? Assessment
Callback delivery timing YES TEST_WaitForCompact()TEST_WaitForBackgroundWork() — the former doesn't wait for pressure callbacks
DB destructor YES Destructor waits for bg_pressure_callback_in_progress_, so no crash risk, but test assertions may race
Other listener tests in same file YES Lines 339, 461, 525 all use TEST_WaitForBackgroundWork() with comment: "Ensure background work is fully finished including listener callbacks before accessing listener state"

Positive Observations

  • The PR correctly identifies the root cause of the flakiness (callback delivery not guaranteed after TEST_WaitForCompact).
  • The refactoring into focused, isolated tests is a good pattern.
  • Helper method extraction reduces duplication and improves readability.
  • Thread safety of the listener (mutex + copy-on-read) is correct.
  • The comment explaining why compaction_scheduled == 0 with 3 L0 files is helpful.

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase
@hx235

hx235 commented May 14, 2026

Copy link
Copy Markdown
Contributor Author

Found one more issue i need to fix now

@meta-codesync meta-codesync Bot changed the title Make BackgroundJobPressure unit test coverage more deterministic (#14708) May 14, 2026
hx235 pushed a commit to hx235/rocksdb that referenced this pull request May 14, 2026
…acebook#14708)

Summary:
BackgroundJobPressure test was flaky because Phase 3 used TEST_WaitForCompact() which waits for bg_compaction_scheduled_ but NOT bg_pressure_callback_in_progress_. The final "healthy" pressure callback could still be in-flight when the test checked snapshots.back(), causing compaction_scheduled=1 instead of 0.

Fix: replace TEST_WaitForCompact() with TEST_WaitForBackgroundWork() which explicitly checks bg_pressure_callback_in_progress_ (db_impl.cc:478-484). Also add TEST_WaitForBackgroundWork() before Phase 1 and Phase 2 snapshot checks for consistency, ensuring all pressure callbacks are delivered before assertions.


Differential Revision: D103784101
@hx235 hx235 force-pushed the export-D103784101 branch from ac55229 to d98b728 Compare May 14, 2026 21:35
hx235 pushed a commit to hx235/rocksdb that referenced this pull request May 14, 2026
…acebook#14708)

Summary:
BackgroundJobPressure test was flaky because Phase 3 used TEST_WaitForCompact() which waits for bg_compaction_scheduled_ but NOT bg_pressure_callback_in_progress_. The final "healthy" pressure callback could still be in-flight when the test checked snapshots.back(), causing compaction_scheduled=1 instead of 0.

Fix: replace TEST_WaitForCompact() with TEST_WaitForBackgroundWork() which explicitly checks bg_pressure_callback_in_progress_ (db_impl.cc:478-484). Also add TEST_WaitForBackgroundWork() before Phase 1 and Phase 2 snapshot checks for consistency, ensuring all pressure callbacks are delivered before assertions.


Differential Revision: D103784101
@hx235 hx235 force-pushed the export-D103784101 branch from d98b728 to dd34022 Compare May 14, 2026 21:51
@hx235

hx235 commented May 14, 2026

Copy link
Copy Markdown
Contributor Author

Address AI feedback

@github-actions

github-actions Bot commented May 14, 2026

Copy link
Copy Markdown

Codex Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit dd34022


Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase
@github-actions

github-actions Bot commented May 14, 2026

Copy link
Copy Markdown

Claude Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit dd34022


Summary

Correct, minimal fix for a real race condition. The change from TEST_WaitForCompact() to TEST_WaitForBackgroundWork() ensures Phase 3 waits for in-flight pressure callbacks before asserting final state.

High-severity findings (0):

No high-severity findings.

Full review (click to expand)

Findings

🟡 MEDIUM

No medium-severity findings.

🟢 LOW / NIT

L1. Phases 1 and 2 have a theoretical similar race — db/listener_test.cc:1705,1729
  • Issue: Phases 1 and 2 call GetSnapshots() without first calling TEST_WaitForBackgroundWork(). A pressure callback triggered by Flush() could theoretically still be in-flight when snapshots are read. The PR description mentions adding waits before Phase 1 and Phase 2 checks "for consistency," but the diff only modifies Phase 3.
  • Root cause: Flush() (via WaitForFlushMemTables) waits for the memtable flush to complete but not for bg_pressure_callback_in_progress_ to reach zero.
  • Why this is LOW, not MEDIUM: Phase 1 asserts all snapshots have compaction_scheduled == 0, which holds because compaction is blocked. Phase 2 asserts existence of pressure indicators (any-match), so a missing final snapshot doesn't cause failure. Phase 3 is the only phase checking snapshots.back() for a specific "healthy" state, making it uniquely vulnerable. The current fix addresses the only practically flaky assertion.
  • Suggested fix: For robustness, consider adding TEST_WaitForBackgroundWork() before Phase 1 and Phase 2 snapshot checks as the PR description originally described, though this is not strictly necessary for correctness.

Cross-Component Analysis

Context Relevant? Analysis
WaitForBackgroundWork() thread safety Yes Properly holds mutex_ (via InstrumentedMutexLock in TEST_WaitForBackgroundWork). bg_cv_.SignalAll() is called after NotifyOnBackgroundJobPressureChanged() returns in both flush (line 3839) and compaction (line 3974) paths, ensuring the waiter is woken after bg_pressure_callback_in_progress_ is decremented.
Hang risk from waiting on bg_flush_scheduled_ No Phase 3 does not trigger new flushes, so bg_flush_scheduled_ should already be 0. No hang risk.

Positive Observations

  • The fix uses the correct existing API (TEST_WaitForBackgroundWork) rather than inventing a new wait mechanism.
  • The root cause analysis in the PR description is thorough — includes a reproducer with instrumental delay and verification with --gtest_repeat=5.
  • The fix is minimal: exactly one line changed to address a specific race condition.

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase
@hx235 hx235 requested a review from anand1976 May 15, 2026 01:34
@hx235 hx235 force-pushed the export-D103784101 branch from dd34022 to aa4f934 Compare May 18, 2026 20:43
hx235 pushed a commit to hx235/rocksdb that referenced this pull request May 18, 2026
…acebook#14708)

Summary:
BackgroundJobPressure test was flaky because Phase 3 used TEST_WaitForCompact() which waits for bg_compaction_scheduled_ but NOT bg_pressure_callback_in_progress_. The final "healthy" pressure callback could still be in-flight when the test checked snapshots.back(), causing compaction_scheduled=1 instead of 0.

Fix: replace TEST_WaitForCompact() with TEST_WaitForBackgroundWork() which explicitly checks bg_pressure_callback_in_progress_ (db_impl.cc:478-484). Also add TEST_WaitForBackgroundWork() before Phase 1 and Phase 2 snapshot checks for consistency, ensuring all pressure callbacks are delivered before assertions.


Differential Revision: D103784101
@github-actions

github-actions Bot commented May 18, 2026

Copy link
Copy Markdown

Codex Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit aa4f934


Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase
@github-actions

github-actions Bot commented May 18, 2026

Copy link
Copy Markdown

Claude Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit aa4f934


Summary

Correct and minimal fix for a real test flakiness bug. The one-line change from TEST_WaitForCompact() to TEST_WaitForBackgroundWork() properly ensures the pressure callback completes before assertions run.

No high-severity findings.

Full review (click to expand)

Findings

🟡 MEDIUM

M1. PR description mentions additional changes not present in diff — db/listener_test.cc
  • Issue: The PR description states "Also add TEST_WaitForBackgroundWork() before Phase 1 and Phase 2 snapshot checks for consistency, ensuring all pressure callbacks are delivered before assertions." However, the diff only modifies Phase 3. Phase 1 and Phase 2 still have no explicit wait for bg_pressure_callback_in_progress_ before checking snapshots.
  • Root cause: Either the description is stale/aspirational, or the additional changes were lost.
  • Impact: Phase 1 and Phase 2 are less likely to be flaky because they don't assert on snapshots.back() specifically for compaction_scheduled == 0, but a similar race is theoretically possible — a pressure callback in-flight during GetSnapshots() could produce an unexpected snapshot entry.
  • Suggested fix: Verify whether the Phase 1/Phase 2 changes were intentionally omitted. If not, add ASSERT_OK(dbfull()->TEST_WaitForBackgroundWork()) before Phase 1 and Phase 2 snapshot checks for consistency.

🟢 LOW / NIT

No low-severity findings.

Cross-Component Analysis

Context Relevant? Assessment
TEST_WaitForBackgroundWork scope Yes Waits on bg_bottom_compaction_scheduled_, bg_compaction_scheduled_, bg_flush_scheduled_, AND bg_pressure_callback_in_progress_. Strict superset of TEST_WaitForCompact(). Safe replacement.
Thread safety Yes Both functions acquire mutex_ and wait on bg_cv_. No new concurrency concerns.
Return value semantics Yes Both return error_handler_.GetBGError(). Behavioral contract preserved.

Positive Observations

  • The fix is minimal and precisely targeted at the root cause (pressure callback race).
  • TEST_WaitForBackgroundWork() is the correct function for this use case since the test cares about pressure callbacks, not just compaction completion.
  • The diagnosis in the PR description accurately identifies the race condition.

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase
hx235 pushed a commit to hx235/rocksdb that referenced this pull request May 18, 2026
…acebook#14708)

Summary:
BackgroundJobPressure test was flaky because Phase 3 used TEST_WaitForCompact() which waits for bg_compaction_scheduled_ but NOT bg_pressure_callback_in_progress_. The final "healthy" pressure callback could still be in-flight when the test checked snapshots.back(), causing compaction_scheduled=1 instead of 0.

Fix: replace TEST_WaitForCompact() with TEST_WaitForBackgroundWork() which explicitly checks bg_pressure_callback_in_progress_ (db_impl.cc:478-484). Also add TEST_WaitForBackgroundWork() before Phase 1 and Phase 2 snapshot checks for consistency, ensuring all pressure callbacks are delivered before assertions.


Reviewed By: jaykorean

Differential Revision: D103784101
@hx235 hx235 force-pushed the export-D103784101 branch from aa4f934 to 5517499 Compare May 18, 2026 23:12
…acebook#14708)

Summary:
BackgroundJobPressure test was flaky because Phase 3 used TEST_WaitForCompact() which waits for bg_compaction_scheduled_ but NOT bg_pressure_callback_in_progress_. The final "healthy" pressure callback could still be in-flight when the test checked snapshots.back(), causing compaction_scheduled=1 instead of 0.

Fix: replace TEST_WaitForCompact() with TEST_WaitForBackgroundWork() which explicitly checks bg_pressure_callback_in_progress_ (db_impl.cc:478-484). Also add TEST_WaitForBackgroundWork() before Phase 1 and Phase 2 snapshot checks for consistency, ensuring all pressure callbacks are delivered before assertions.


Reviewed By: jaykorean

Differential Revision: D103784101
@hx235 hx235 force-pushed the export-D103784101 branch from 5517499 to 936c294 Compare May 18, 2026 23:14
@github-actions

Copy link
Copy Markdown

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit 936c294


Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase
@github-actions

Copy link
Copy Markdown

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit 936c294


Summary

Correct one-line fix for a real race condition. The root cause analysis is sound: TEST_WaitForCompact() does not wait for bg_pressure_callback_in_progress_ to drain, so the final pressure callback can still be in-flight when the test asserts on snapshots.back(). TEST_WaitForBackgroundWork() adds the missing wait condition.

High-severity findings (0):
No high-severity findings.

Full review (click to expand)

Findings

🟡 MEDIUM

M1. PR description/diff mismatch — Phase 1 and Phase 2 consistency — listener_test.cc:1820,1844
  • Issue: The PR description states: "Also add TEST_WaitForBackgroundWork() before Phase 1 and Phase 2 snapshot checks for consistency, ensuring all pressure callbacks are delivered before assertions." However, the diff only modifies Phase 3 (line 1872). Phase 1 (line 1820) and Phase 2 (line 1844) snapshot checks still lack a TEST_WaitForBackgroundWork() call.
  • Root cause: Either the diff is incomplete (missing hunks), or the description over-states what the PR does.
  • Impact: Phase 1 and Phase 2 could theoretically have the same race if a pressure callback from a preceding flush is still in-flight when snapshots are read. In practice this is less likely because Phase 1 has no compaction scheduled and Phase 2's callbacks are triggered by the flushes themselves (which are synchronous via Flush()), but the inconsistency is worth addressing.
  • Suggested fix: Either add ASSERT_OK(dbfull()->TEST_WaitForBackgroundWork()) before the Phase 1 and Phase 2 snapshot reads (as the description claims), or update the PR description to match the actual change.

🟢 LOW / NIT

L1. The fix is correct and targeted — listener_test.cc:1872
  • Observation: WaitForBackgroundWork() (db/db_impl/db_impl.cc:478-484) waits on bg_pressure_callback_in_progress_ in addition to all the conditions WaitForCompact() covers (compaction, flush). It is strictly stronger for this test's purposes. The fix correctly eliminates the race between the pressure callback completing and the snapshot assertion.
L2. TEST_WaitForBackgroundWork does not wait for unscheduled_compactions_ or unscheduled_flushes_
  • Observation: Unlike WaitForCompact(), WaitForBackgroundWork() does not check unscheduled_compactions_ or unscheduled_flushes_. This is acceptable here because Phase 3's goal is to verify that all running work is done (compaction unblocked and completed), not that no future work is scheduled. The assertions check compaction_scheduled == 0 and compaction_running == 0, which are about the pressure snapshot, not the scheduler queue.

Cross-Component Analysis

Context Relevant? Assessment
Normal compaction path YES Fix directly addresses the race between compaction completion and pressure callback delivery
bg_cv_ signaling YES bg_pressure_callback_in_progress_-- signals bg_cv_, so WaitForBackgroundWork() will unblock correctly
Concurrent listeners YES Multiple listeners are invoked sequentially in the callback; the counter only decrements after all are done

Positive Observations

  • The fix is minimal and surgical — exactly one line changed to use the correct wait primitive.
  • The root cause analysis in the PR description is precise, citing exact line numbers.
  • TEST_WaitForBackgroundWork() is already used in 12 other test files, so this is an established pattern.

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase
@meta-codesync

meta-codesync Bot commented May 19, 2026

Copy link
Copy Markdown

This pull request has been merged in cba3362.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment