Fix BackgroundJobPressure flakiness with TEST_WaitForBackgroundWork (#14708)#14708
Fix BackgroundJobPressure flakiness with TEST_WaitForBackgroundWork (#14708)#14708hx235 wants to merge 1 commit into
Conversation
|
@hx235 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D103784101. |
✅ clang-tidy: No findings on changed linesCompleted in 128.8s. |
…ebook#14708) Summary: `BackgroundJobPressure` assumed `snapshots.back()` after `TEST_WaitForCompact()` must already be the final healthy callback after unblocking compaction. That ordering is not guaranteed: `WaitForCompact()` waits for background work, but not for final `OnBackgroundJobPressureChanged()` delivery. A local repro-only patch that delays only healthy callbacks widens that same callback-ordering gap and reproduces the flaky failure https://github.com/facebook/rocksdb/actions/runs/24914086454/job/72962285597?pr=14668 ```cpp if (snapshot.compaction_running == 0 && snapshot.compaction_scheduled == 0 && !snapshot.compaction_speedup_active && snapshot.write_stall_proximity_pct < 100) { Env::Default()->SleepForMicroseconds(50 * 1000); } [ RUN ] EventListenerTest.BackgroundJobPressure db/listener_test.cc:1764: Failure Expected equality of these values: latest3.compaction_scheduled Which is: 1 0 [ FAILED ] EventListenerTest.BackgroundJobPressure (320 ms) [----------] 8 tests from EventListenerTest (8188 ms total) [----------] Global test environment tear-down [==========] 8 tests from 1 test case ran. (8188 ms total) [ PASSED ] 7 tests. [ FAILED ] 1 test, listed below: [ FAILED ] EventListenerTest.BackgroundJobPressure ``` Fix the test by not doing such assertion and refactor the unit test coverage into focused healthy-vs-pressure checks Differential Revision: D103784101
Codex Code Review - OBSOLETESuperseded by a newer AI review. Expand to see the original review.🟡 Codex Code ReviewAuto-triggered after CI passed — reviewing commit 2f24455 ❌ Codex review failed before producing any output. ℹ️ About this responseGenerated by Codex CLI. Limitations:
Commands:
|
Claude Code Review - OBSOLETESuperseded by a newer AI review. Expand to see the original review.✅ Claude Code ReviewAuto-triggered after CI passed — reviewing commit 2f24455 SummaryClean, well-motivated test fix that correctly eliminates a flaky Phase 3 assertion caused by High-severity findings (0): Full review (click to expand)Findings🟡 MEDIUMM1. Lost pressure-relief coverage —
|
| Concern | Assessment |
|---|---|
| 3 L0 files triggering compaction | Safe: level0_file_num_compaction_trigger=4, NeedsCompaction() returns false for 3 < 4. Same assertion existed in the original passing test. |
| SleepingBackgroundTask cleanup on failure | Safe: ~SleepingBackgroundTask() destructor wakes the task and waits for completion (testutil.h:407-415), providing RAII cleanup. |
| ASSERT_* in static method | Valid: gtest ASSERT_* macros work in any void-returning function, not just TEST_F bodies. |
| Thread safety of GetSnapshots() | Safe: Both OnBackgroundJobPressureChanged and GetSnapshots are mutex-protected. The race was in the timing assumption (back() == final callback), not the data access. |
| Platform-dependent behavior | Low risk: The flakiness was a logical race (callback vs WaitForCompact ordering), not a platform-specific issue. The fix correctly eliminates the race. |
Positive Observations
- The root cause analysis is precise:
WaitForCompact()waits for background jobs, not callback delivery. The PR description includes a concrete repro patch. - Splitting into focused tests improves debuggability — failures map to specific scenarios.
UnblockLowPriCompactionAndWait()ensures proper cleanup in both tests, preventing resource leaks.- Comments explain non-obvious assertions (e.g., why
compaction_scheduled == 0with 3 L0 files).
ℹ️ About this response
Generated by Claude Code.
Review methodology: claude_md/code_review.md
Limitations:
- Claude may miss context from files not in the diff
- Large PRs may be truncated
- Always apply human judgment to AI suggestions
Commands:
/claude-review [context]— Request a code review/claude-query <question>— Ask about the PR or codebase
…ebook#14708) Summary: `BackgroundJobPressure` assumed `snapshots.back()` after `TEST_WaitForCompact()` must already be the final healthy callback after unblocking compaction. That ordering is not guaranteed: `WaitForCompact()` waits for background work, but not for final `OnBackgroundJobPressureChanged()` delivery. A local repro-only patch that delays only healthy callbacks widens that same callback-ordering gap and reproduces the flaky failure https://github.com/facebook/rocksdb/actions/runs/24914086454/job/72962285597?pr=14668 ```cpp if (snapshot.compaction_running == 0 && snapshot.compaction_scheduled == 0 && !snapshot.compaction_speedup_active && snapshot.write_stall_proximity_pct < 100) { Env::Default()->SleepForMicroseconds(50 * 1000); } [ RUN ] EventListenerTest.BackgroundJobPressure db/listener_test.cc:1764: Failure Expected equality of these values: latest3.compaction_scheduled Which is: 1 0 [ FAILED ] EventListenerTest.BackgroundJobPressure (320 ms) [----------] 8 tests from EventListenerTest (8188 ms total) [----------] Global test environment tear-down [==========] 8 tests from 1 test case ran. (8188 ms total) [ PASSED ] 7 tests. [ FAILED ] 1 test, listed below: [ FAILED ] EventListenerTest.BackgroundJobPressure ``` Fix the test by not doing such assertion and refactor the unit test coverage into focused healthy-vs-pressure checks Differential Revision: D103784101
|
Fixed meaningful feedback |
Codex Code Review - OBSOLETESuperseded by a newer AI review. Expand to see the original review.🟡 Codex Code ReviewAuto-triggered after CI passed — reviewing commit 8aa61c4 ❌ Codex review failed before producing any output. ℹ️ About this responseGenerated by Codex CLI. Limitations:
Commands:
|
Claude Code Review - OBSOLETESuperseded by a newer AI review. Expand to see the original review.✅ Claude Code ReviewAuto-triggered after CI passed — reviewing commit 8aa61c4 SummaryClean test refactoring that correctly fixes a real race condition. The fix is sound — High-severity findings (0): No high-severity findings. Full review (click to expand)Findings🟡 MEDIUMM1. Coverage gap: "pressure relief" transition untested —
|
| Concern | Status |
|---|---|
| Race in flush callbacks | SAFE — NotifyOnBackgroundJobPressureChanged() fires before bg_cv_.SignalAll() in BackgroundCallFlush, so Flush() returns only after callbacks are delivered |
| Race in compaction callbacks | KNOWN — TEST_WaitForCompact() does not wait for callbacks; this is precisely the race the PR fixes by removing Phase 3 |
| SleepingBackgroundTask lifetime | SAFE — destructor waits for completion |
| Test isolation | SAFE — DestroyAndReopen() ensures no cross-test state leakage |
Positive Observations
- Good refactoring: extracting
ReopenWithBackgroundJobPressureListener(),BlockLowPriCompaction(), andUnblockLowPriCompactionAndWait()into reusable fixture methods - Moving the listener class to file scope improves readability
- Removing
Reset()is correct — each test gets its own listener instance viaDestroyAndReopen - The structural invariant check is properly preserved as
AssertBackgroundJobPressureInvariants
ℹ️ About this response
Generated by Claude Code.
Review methodology: claude_md/code_review.md
Limitations:
- Claude may miss context from files not in the diff
- Large PRs may be truncated
- Always apply human judgment to AI suggestions
Commands:
/claude-review [context]— Request a code review/claude-query <question>— Ask about the PR or codebase
…ebook#14708) Summary: `BackgroundJobPressure` assumed `snapshots.back()` after `TEST_WaitForCompact()` must already be the final healthy callback after unblocking compaction. That ordering is not guaranteed: `WaitForCompact()` waits for background work, but not for final `OnBackgroundJobPressureChanged()` delivery. A local repro-only patch that delays only healthy callbacks widens that same callback-ordering gap and reproduces the flaky failure https://github.com/facebook/rocksdb/actions/runs/24914086454/job/72962285597?pr=14668 ```cpp if (snapshot.compaction_running == 0 && snapshot.compaction_scheduled == 0 && !snapshot.compaction_speedup_active && snapshot.write_stall_proximity_pct < 100) { Env::Default()->SleepForMicroseconds(50 * 1000); } [ RUN ] EventListenerTest.BackgroundJobPressure db/listener_test.cc:1764: Failure Expected equality of these values: latest3.compaction_scheduled Which is: 1 0 [ FAILED ] EventListenerTest.BackgroundJobPressure (320 ms) [----------] 8 tests from EventListenerTest (8188 ms total) [----------] Global test environment tear-down [==========] 8 tests from 1 test case ran. (8188 ms total) [ PASSED ] 7 tests. [ FAILED ] 1 test, listed below: [ FAILED ] EventListenerTest.BackgroundJobPressure ``` Fix the test by not doing such assertion and refactor the unit test coverage into focused healthy-vs-pressure checks Differential Revision: D103784101
|
@hx235 has imported this pull request. If you are a Meta employee, you can view this in D103784101. |
Codex Code Review - OBSOLETESuperseded by a newer AI review. Expand to see the original review.🟡 Codex Code ReviewAuto-triggered after CI passed — reviewing commit ac55229 ❌ Codex review failed before producing findings. ℹ️ About this responseGenerated by Codex CLI. Limitations:
Commands:
|
Claude Code Review - OBSOLETESuperseded by a newer AI review. Expand to see the original review.✅ Claude Code ReviewAuto-triggered after CI passed — reviewing commit ac55229 SummaryGood refactoring of a flaky test, but the fix removes important coverage rather than addressing the root cause. The same file already uses High-severity findings (2):
Full review (click to expand)Findings🔴 HIGHH1. Phase 3 should be fixed, not removed —
|
| Context | Relevant? | Assessment |
|---|---|---|
| Callback delivery timing | YES | TEST_WaitForCompact() ≠ TEST_WaitForBackgroundWork() — the former doesn't wait for pressure callbacks |
| DB destructor | YES | Destructor waits for bg_pressure_callback_in_progress_, so no crash risk, but test assertions may race |
| Other listener tests in same file | YES | Lines 339, 461, 525 all use TEST_WaitForBackgroundWork() with comment: "Ensure background work is fully finished including listener callbacks before accessing listener state" |
Positive Observations
- The PR correctly identifies the root cause of the flakiness (callback delivery not guaranteed after
TEST_WaitForCompact). - The refactoring into focused, isolated tests is a good pattern.
- Helper method extraction reduces duplication and improves readability.
- Thread safety of the listener (mutex + copy-on-read) is correct.
- The comment explaining why
compaction_scheduled == 0with 3 L0 files is helpful.
ℹ️ About this response
Generated by Claude Code.
Review methodology: claude_md/code_review.md
Limitations:
- Claude may miss context from files not in the diff
- Large PRs may be truncated
- Always apply human judgment to AI suggestions
Commands:
/claude-review [context]— Request a code review/claude-query <question>— Ask about the PR or codebase
|
Found one more issue i need to fix now |
…acebook#14708) Summary: BackgroundJobPressure test was flaky because Phase 3 used TEST_WaitForCompact() which waits for bg_compaction_scheduled_ but NOT bg_pressure_callback_in_progress_. The final "healthy" pressure callback could still be in-flight when the test checked snapshots.back(), causing compaction_scheduled=1 instead of 0. Fix: replace TEST_WaitForCompact() with TEST_WaitForBackgroundWork() which explicitly checks bg_pressure_callback_in_progress_ (db_impl.cc:478-484). Also add TEST_WaitForBackgroundWork() before Phase 1 and Phase 2 snapshot checks for consistency, ensuring all pressure callbacks are delivered before assertions. Differential Revision: D103784101
…acebook#14708) Summary: BackgroundJobPressure test was flaky because Phase 3 used TEST_WaitForCompact() which waits for bg_compaction_scheduled_ but NOT bg_pressure_callback_in_progress_. The final "healthy" pressure callback could still be in-flight when the test checked snapshots.back(), causing compaction_scheduled=1 instead of 0. Fix: replace TEST_WaitForCompact() with TEST_WaitForBackgroundWork() which explicitly checks bg_pressure_callback_in_progress_ (db_impl.cc:478-484). Also add TEST_WaitForBackgroundWork() before Phase 1 and Phase 2 snapshot checks for consistency, ensuring all pressure callbacks are delivered before assertions. Differential Revision: D103784101
|
Address AI feedback |
Codex Code Review - OBSOLETESuperseded by a newer AI review. Expand to see the original review.🟡 Codex Code ReviewAuto-triggered after CI passed — reviewing commit dd34022 ❌ Codex review failed before producing findings. ℹ️ About this responseGenerated by Codex CLI. Limitations:
Commands:
|
Claude Code Review - OBSOLETESuperseded by a newer AI review. Expand to see the original review.✅ Claude Code ReviewAuto-triggered after CI passed — reviewing commit dd34022 SummaryCorrect, minimal fix for a real race condition. The change from High-severity findings (0): No high-severity findings. Full review (click to expand)Findings🟡 MEDIUMNo medium-severity findings. 🟢 LOW / NITL1. Phases 1 and 2 have a theoretical similar race —
|
| Context | Relevant? | Analysis |
|---|---|---|
WaitForBackgroundWork() thread safety |
Yes | Properly holds mutex_ (via InstrumentedMutexLock in TEST_WaitForBackgroundWork). bg_cv_.SignalAll() is called after NotifyOnBackgroundJobPressureChanged() returns in both flush (line 3839) and compaction (line 3974) paths, ensuring the waiter is woken after bg_pressure_callback_in_progress_ is decremented. |
Hang risk from waiting on bg_flush_scheduled_ |
No | Phase 3 does not trigger new flushes, so bg_flush_scheduled_ should already be 0. No hang risk. |
Positive Observations
- The fix uses the correct existing API (
TEST_WaitForBackgroundWork) rather than inventing a new wait mechanism. - The root cause analysis in the PR description is thorough — includes a reproducer with instrumental delay and verification with
--gtest_repeat=5. - The fix is minimal: exactly one line changed to address a specific race condition.
ℹ️ About this response
Generated by Claude Code.
Review methodology: claude_md/code_review.md
Limitations:
- Claude may miss context from files not in the diff
- Large PRs may be truncated
- Always apply human judgment to AI suggestions
Commands:
/claude-review [context]— Request a code review/claude-query <question>— Ask about the PR or codebase
…acebook#14708) Summary: BackgroundJobPressure test was flaky because Phase 3 used TEST_WaitForCompact() which waits for bg_compaction_scheduled_ but NOT bg_pressure_callback_in_progress_. The final "healthy" pressure callback could still be in-flight when the test checked snapshots.back(), causing compaction_scheduled=1 instead of 0. Fix: replace TEST_WaitForCompact() with TEST_WaitForBackgroundWork() which explicitly checks bg_pressure_callback_in_progress_ (db_impl.cc:478-484). Also add TEST_WaitForBackgroundWork() before Phase 1 and Phase 2 snapshot checks for consistency, ensuring all pressure callbacks are delivered before assertions. Differential Revision: D103784101
Codex Code Review - OBSOLETESuperseded by a newer AI review. Expand to see the original review.🟡 Codex Code ReviewAuto-triggered after CI passed — reviewing commit aa4f934 ❌ Codex review failed before producing findings. ℹ️ About this responseGenerated by Codex CLI. Limitations:
Commands:
|
Claude Code Review - OBSOLETESuperseded by a newer AI review. Expand to see the original review.✅ Claude Code ReviewAuto-triggered after CI passed — reviewing commit aa4f934 SummaryCorrect and minimal fix for a real test flakiness bug. The one-line change from No high-severity findings. Full review (click to expand)Findings🟡 MEDIUMM1. PR description mentions additional changes not present in diff —
|
| Context | Relevant? | Assessment |
|---|---|---|
TEST_WaitForBackgroundWork scope |
Yes | Waits on bg_bottom_compaction_scheduled_, bg_compaction_scheduled_, bg_flush_scheduled_, AND bg_pressure_callback_in_progress_. Strict superset of TEST_WaitForCompact(). Safe replacement. |
| Thread safety | Yes | Both functions acquire mutex_ and wait on bg_cv_. No new concurrency concerns. |
| Return value semantics | Yes | Both return error_handler_.GetBGError(). Behavioral contract preserved. |
Positive Observations
- The fix is minimal and precisely targeted at the root cause (pressure callback race).
TEST_WaitForBackgroundWork()is the correct function for this use case since the test cares about pressure callbacks, not just compaction completion.- The diagnosis in the PR description accurately identifies the race condition.
ℹ️ About this response
Generated by Claude Code.
Review methodology: claude_md/code_review.md
Limitations:
- Claude may miss context from files not in the diff
- Large PRs may be truncated
- Always apply human judgment to AI suggestions
Commands:
/claude-review [context]— Request a code review/claude-query <question>— Ask about the PR or codebase
…acebook#14708) Summary: BackgroundJobPressure test was flaky because Phase 3 used TEST_WaitForCompact() which waits for bg_compaction_scheduled_ but NOT bg_pressure_callback_in_progress_. The final "healthy" pressure callback could still be in-flight when the test checked snapshots.back(), causing compaction_scheduled=1 instead of 0. Fix: replace TEST_WaitForCompact() with TEST_WaitForBackgroundWork() which explicitly checks bg_pressure_callback_in_progress_ (db_impl.cc:478-484). Also add TEST_WaitForBackgroundWork() before Phase 1 and Phase 2 snapshot checks for consistency, ensuring all pressure callbacks are delivered before assertions. Reviewed By: jaykorean Differential Revision: D103784101
…acebook#14708) Summary: BackgroundJobPressure test was flaky because Phase 3 used TEST_WaitForCompact() which waits for bg_compaction_scheduled_ but NOT bg_pressure_callback_in_progress_. The final "healthy" pressure callback could still be in-flight when the test checked snapshots.back(), causing compaction_scheduled=1 instead of 0. Fix: replace TEST_WaitForCompact() with TEST_WaitForBackgroundWork() which explicitly checks bg_pressure_callback_in_progress_ (db_impl.cc:478-484). Also add TEST_WaitForBackgroundWork() before Phase 1 and Phase 2 snapshot checks for consistency, ensuring all pressure callbacks are delivered before assertions. Reviewed By: jaykorean Differential Revision: D103784101
🟡 Codex Code ReviewAuto-triggered after CI passed — reviewing commit 936c294 ❌ Codex review failed before producing findings. ℹ️ About this responseGenerated by Codex CLI. Limitations:
Commands:
|
✅ Claude Code ReviewAuto-triggered after CI passed — reviewing commit 936c294 SummaryCorrect one-line fix for a real race condition. The root cause analysis is sound: High-severity findings (0): Full review (click to expand)Findings🟡 MEDIUMM1. PR description/diff mismatch — Phase 1 and Phase 2 consistency —
|
| Context | Relevant? | Assessment |
|---|---|---|
| Normal compaction path | YES | Fix directly addresses the race between compaction completion and pressure callback delivery |
bg_cv_ signaling |
YES | bg_pressure_callback_in_progress_-- signals bg_cv_, so WaitForBackgroundWork() will unblock correctly |
| Concurrent listeners | YES | Multiple listeners are invoked sequentially in the callback; the counter only decrements after all are done |
Positive Observations
- The fix is minimal and surgical — exactly one line changed to use the correct wait primitive.
- The root cause analysis in the PR description is precise, citing exact line numbers.
TEST_WaitForBackgroundWork()is already used in 12 other test files, so this is an established pattern.
ℹ️ About this response
Generated by Claude Code.
Review methodology: claude_md/code_review.md
Limitations:
- Claude may miss context from files not in the diff
- Large PRs may be truncated
- Always apply human judgment to AI suggestions
Commands:
/claude-review [context]— Request a code review/claude-query <question>— Ask about the PR or codebase
|
This pull request has been merged in cba3362. |
Summary:
BackgroundJobPressure test was flaky because Phase 3 used TEST_WaitForCompact() which waits for bg_compaction_scheduled_ but NOT bg_pressure_callback_in_progress_. The final "healthy" pressure callback could still be in-flight when the test checked snapshots.back(), causing compaction_scheduled=1 instead of 0.
Fix: replace TEST_WaitForCompact() with TEST_WaitForBackgroundWork() which explicitly checks bg_pressure_callback_in_progress_ (db_impl.cc:478-484). Also add TEST_WaitForBackgroundWork() before Phase 1 and Phase 2 snapshot checks for consistency, ensuring all pressure callbacks are delivered before assertions.
Reviewed By: jaykorean
Differential Revision: D103784101