Skip to content

Fix secondary DB stuck on stale versions when atomic_flush=true#14439

Open
hx235 wants to merge 1 commit into
facebook:mainfrom
hx235:repro_af_secondary
Open

Fix secondary DB stuck on stale versions when atomic_flush=true#14439
hx235 wants to merge 1 commit into
facebook:mainfrom
hx235:repro_af_secondary

Conversation

@hx235

@hx235 hx235 commented Mar 9, 2026

Copy link
Copy Markdown
Contributor

Context/Summary:

Bug: When atomic_flush=true and manifest rotation races with primary
compaction, the secondary/follower DB can permanently stop installing
new versions. Symptoms include remote compaction failing with "Cannot
find matched SST files", secondary serving permanently stale data
(release builds), and crashing on manifest switch (debug builds).

Root cause: VersionEditHandlerPointInTime uses a staging map
(atomic_update_versions_) for all-or-nothing atomic group recovery.
When the secondary reads a manifest where some SST files were already
compacted away by the primary, the atomic group becomes permanently
incomplete and the staging map is never cleared. All subsequent version
creation — including force-creates in OnAtomicGroupReplayBegin() and
CheckIterationResult() — routes through the stale staging map instead
of versions_, so versions are trapped and never installed.

Fix: Override OnAtomicGroupReplayEnd() in ManifestTailer to clear
the staging map when the atomic group is incomplete. This is safe
because:
- All staging map entries are nullptr (no versions are created while
in_atomic_group_ is true), so nothing is lost.
- Pre-atomic-group versions are already saved in versions_ by
OnAtomicGroupReplayBegin().
- The all-or-nothing guarantee for primary best-effort recovery
(VersionEditHandlerPointInTime) is preserved — only the subclass
behavior changes.

The fix can create temporary cross-CF inconsistency on the secondary:
when an incomplete atomic group is cleared, some CFs may have their
versions installed while others don't (e.g., CF1 has data but CF0's
atomic-flush SST was already deleted by CompactRange on the primary).
This is mitigated by the next TryCatchUpWithPrimary call, which reads
the updated manifest containing CompactRange results and restores
cross-CF consistency.

Test Plan:
- CompactionServiceTest.AtomicFlushRemoteCompactionMissingFile:
end-to-end test triggering the bug via remote compaction timing.
Without fix: remote compaction fails ("CompactionService failed to
run the compaction job"). With fix: compaction succeeds.
- DBSecondaryTest.AtomicFlushSecondaryIncompleteGroup: reproduces
the bug scenario (incomplete atomic group from CompactRange racing
with secondary manifest read). Verifies SST installation, cross-CF
consistency via NewIterators, and recovery via TryCatchUpWithPrimary.
Without fix: crashes in ManifestTailer::OnColumnFamilyAdd.
- DBSecondaryTest.AtomicFlushSecondaryMultiCFConsistency: baseline
cross-CF consistency with atomic_flush + WAL disabled.

@meta-cla meta-cla Bot added the CLA Signed label Mar 9, 2026
@github-actions

github-actions Bot commented Mar 9, 2026

Copy link
Copy Markdown

✅ clang-tidy: No findings on changed lines

Completed in 363.3s.

@hx235 hx235 force-pushed the repro_af_secondary branch from 3e409ae to b3010a1 Compare March 9, 2026 02:11
@meta-codesync

meta-codesync Bot commented Mar 9, 2026

Copy link
Copy Markdown

@hx235 has imported this pull request. If you are a Meta employee, you can view this in D95749838.

@hx235

hx235 commented Mar 9, 2026

Copy link
Copy Markdown
Contributor Author

linter not related

@hx235 hx235 requested review from anand1976 and jaykorean and removed request for jaykorean March 9, 2026 05:27
@hx235

hx235 commented Mar 9, 2026

Copy link
Copy Markdown
Contributor Author

Ah history md was not auto-generated - just pushed a new one on top of clean testing signals (excluding the unrelated linter warning)

@hx235 hx235 force-pushed the repro_af_secondary branch from b3010a1 to 5fbf050 Compare March 9, 2026 21:45
@meta-codesync

meta-codesync Bot commented Mar 9, 2026

Copy link
Copy Markdown

@hx235 has imported this pull request. If you are a Meta employee, you can view this in D95749838.

@hx235

hx235 commented Mar 9, 2026

Copy link
Copy Markdown
Contributor Author

Turning into draft mode in order to work better on "TODO: consider cross-CF read consistency requirement in secondary db and whether that conflicts with this change."

@hx235 hx235 removed the request for review from anand1976 March 9, 2026 22:30
@hx235 hx235 changed the title Fix stale atomic_update_versions_ trapping version installation in secondary DB Mar 9, 2026
@hx235 hx235 force-pushed the repro_af_secondary branch from 5fbf050 to 86188bb Compare March 19, 2026 05:13
@hx235 hx235 changed the title [WIP] Fix stale atomic_update_versions_ trapping version installation in secondary DB Mar 19, 2026
Summary:

Bug: When `atomic_flush=true` and manifest rotation races with primary
compaction, the secondary/follower DB can permanently stop installing
new versions. Symptoms include remote compaction failing with "Cannot
find matched SST files", secondary serving permanently stale data
(release builds), and crashing on manifest switch (debug builds).

Root cause: `VersionEditHandlerPointInTime` uses a staging map
(`atomic_update_versions_`) for all-or-nothing atomic group recovery.
When the secondary reads a manifest where some SST files were already
compacted away by the primary, the atomic group becomes permanently
incomplete and the staging map is never cleared. All subsequent version
creation — including force-creates in `OnAtomicGroupReplayBegin()` and
`CheckIterationResult()` — routes through the stale staging map instead
of `versions_`, so versions are trapped and never installed.

Fix: Override `OnAtomicGroupReplayEnd()` in `ManifestTailer` to clear
the staging map when the atomic group is incomplete. This is safe
because:
- All staging map entries are nullptr (no versions are created while
  `in_atomic_group_` is true), so nothing is lost.
- Pre-atomic-group versions are already saved in `versions_` by
  `OnAtomicGroupReplayBegin()`.
- The all-or-nothing guarantee for primary best-effort recovery
  (`VersionEditHandlerPointInTime`) is preserved — only the subclass
  behavior changes.

The fix can create temporary cross-CF inconsistency on the secondary:
when an incomplete atomic group is cleared, some CFs may have their
versions installed while others don't (e.g., CF1 has data but CF0's
atomic-flush SST was already deleted by CompactRange on the primary).
This is mitigated by the next TryCatchUpWithPrimary call, which reads
the updated manifest containing CompactRange results and restores
cross-CF consistency.

Test Plan:
- `CompactionServiceTest.AtomicFlushRemoteCompactionMissingFile`:
  end-to-end test triggering the bug via remote compaction timing.
  Without fix: remote compaction fails ("CompactionService failed to
  run the compaction job"). With fix: compaction succeeds.
- `DBSecondaryTest.AtomicFlushSecondaryIncompleteGroup`: reproduces
  the bug scenario (incomplete atomic group from CompactRange racing
  with secondary manifest read). Verifies SST installation, cross-CF
  consistency via NewIterators, and recovery via TryCatchUpWithPrimary.
  Without fix: crashes in ManifestTailer::OnColumnFamilyAdd.
- `DBSecondaryTest.AtomicFlushSecondaryMultiCFConsistency`: baseline
  cross-CF consistency with atomic_flush + WAL disabled.

  $ ./db_secondary_test --gtest_filter="DBSecondaryTest.AtomicFlush*"
  [  PASSED  ] 2 tests.
  $ ./compaction_service_test --gtest_filter="CompactionServiceTest.AtomicFlushRemoteCompactionMissingFile"
  [  PASSED  ] 1 test.
@hx235 hx235 force-pushed the repro_af_secondary branch from 86188bb to c6f19d8 Compare March 19, 2026 05:34
@meta-codesync

meta-codesync Bot commented Mar 19, 2026

Copy link
Copy Markdown

@hx235 has imported this pull request. If you are a Meta employee, you can view this in D95749838.

@hx235 hx235 requested a review from anand1976 March 19, 2026 06:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

2 participants