Fix seqno sync race with WriteUnprepared (#14864) by mszeszko-meta · Pull Request #14864 · facebook/rocksdb

mszeszko-meta · 2026-06-18T18:14:01Z

Summary

With two_write_queues=true, WRITE_UNPREPARED can allocate sequence numbers through FetchAddLastAllocatedSequence() before those numbers are published through SetLastSequence(). Error recovery can then create new memtable/WAL boundaries from a stale LastSequence(), which can later surface as "sequence number going backwards" corruption. Fix this by syncing LastSequence() to LastAllocatedSequence() at the recovery entry point after draining both write queues, and again at the recovery flush fence after FlushAllColumnFamilies() has released/reacquired the DB mutex and before SwitchMemtable() consumes LastSequence(). Apply the same fence to atomic recovery flushes.
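
The gap between allocated and published sequence numbers can be illustrated with a toy model. This is a minimal sketch, not RocksDB code: `ToySeqnoState`, `Allocate`, `Publish`, `SyncPublishedToAllocated`, and `RecoveryBoundary` are hypothetical names, and the two atomics only stand in for `LastAllocatedSequence()` and `LastSequence()`.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Toy model (assumption, not the RocksDB implementation): two counters that
// stand in for LastAllocatedSequence() and LastSequence().
struct ToySeqnoState {
  std::atomic<uint64_t> last_allocated{0};
  std::atomic<uint64_t> last_published{0};

  // Reserve `count` sequence numbers and return the last one reserved.
  uint64_t Allocate(uint64_t count) {
    return last_allocated.fetch_add(count) + count;
  }

  // Make sequence numbers up to `seq` visible.
  void Publish(uint64_t seq) { last_published.store(seq); }

  // Fence: raise the published value to the allocated value. The CAS loop
  // only ever moves the published value forward.
  void SyncPublishedToAllocated() {
    uint64_t allocated = last_allocated.load();
    uint64_t published = last_published.load();
    while (published < allocated &&
           !last_published.compare_exchange_weak(published, allocated)) {
      // On failure `published` is reloaded; the loop condition re-checks it.
    }
  }
};

// Returns the boundary a hypothetical recovery step would derive from the
// published value, with or without the fence applied first.
uint64_t RecoveryBoundary(ToySeqnoState& state, bool with_fence) {
  if (with_fence) state.SyncPublishedToAllocated();
  return state.last_published.load();
}
```

In this model, a writer that has called `Allocate` but not yet `Publish` leaves `last_published` behind `last_allocated`. A boundary taken without the fence then sits below sequence numbers that are already in use, which is the situation the description associates with the later "sequence number going backwards" error.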

Perf comparative results

TL;DR: no visible regression.

I ran a deterministic local recovery benchmark to check whether the new two-write-queue recovery fence adds measurable recovery latency.

Workload (P2385173320)

- Optimized build.
- TransactionDB with WRITE_UNPREPARED.
- two_write_queues=true.
- Each iteration dirties the memtable with WUP transactions, forces a retryable flush IO error through FaultInjectionTestFS, re-enables the filesystem, then measures DB::Resume().
- 50 warmup recoveries + 500 measured recoveries per run.
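
The warmup-then-measure loop described above can be sketched as follows. This is an illustrative harness only: `MeasureRecoveries` and `NearestRankPercentile` are hypothetical helpers, the recovery step is passed in as an opaque callable rather than a real `DB::Resume()` call, and the nearest-rank percentile method is an assumption, since the description does not say how p50 and p95 were computed.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical helper: runs `recover_once` `warmup` times untimed, then
// `measured` times timed, and returns each timed latency in microseconds.
template <typename Fn>
std::vector<double> MeasureRecoveries(Fn&& recover_once, size_t warmup,
                                      size_t measured) {
  for (size_t i = 0; i < warmup; ++i) recover_once();
  std::vector<double> latencies_us;
  latencies_us.reserve(measured);
  for (size_t i = 0; i < measured; ++i) {
    auto start = std::chrono::steady_clock::now();
    recover_once();
    auto end = std::chrono::steady_clock::now();
    latencies_us.push_back(
        std::chrono::duration<double, std::micro>(end - start).count());
  }
  return latencies_us;
}

// Nearest-rank percentile for pct in 1..100 (assumed method). Integer
// arithmetic computes ceil(pct * n / 100) without floating-point rounding.
double NearestRankPercentile(std::vector<double> samples, unsigned pct) {
  if (samples.empty()) return 0.0;
  std::sort(samples.begin(), samples.end());
  size_t rank = (static_cast<size_t>(pct) * samples.size() + 99) / 100;
  rank = std::max<size_t>(1, std::min(rank, samples.size()));
  return samples[rank - 1];
}
```

With 50 warmup and 500 measured iterations, the callable runs 550 times per run, which is consistent with the 550 retryable BG errors reported for each run.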

Why this is relevant

This targets the changed recovery path directly rather than measuring generic write throughput. The change only runs during recovery, so the useful signal is recovery latency around ResumeImpl() and the recovery flush fence.

Code path exercised

- DB::Resume()
- ErrorHandler::RecoverFromBGError(true)
- DBImpl::ResumeImpl()
- new two_write_queues_ recovery queue fence
- non-atomic recovery FlushMemTable() fence before SwitchMemtable()

This does not cover atomic flush recovery or a writer-backlog stress case.
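
The ordering of the two fences along this path can be sketched with a single-threaded toy model. It is not the RocksDB implementation: `ToyDb`, `SyncFence`, and `ToyRecover` are hypothetical, and the `during_flush` callback merely simulates writers allocating sequence numbers while the DB mutex is released.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>

// Toy state (assumption): plain integers standing in for the allocated and
// published sequence numbers.
struct ToyDb {
  uint64_t last_allocated = 0;
  uint64_t last_published = 0;

  // Raise the published value to the allocated value; never lowers it.
  void SyncFence() {
    if (last_published < last_allocated) last_published = last_allocated;
  }
};

// Hypothetical recovery sequence. Returns the published value that the
// memtable switch step would consume.
uint64_t ToyRecover(ToyDb& db, bool second_fence,
                    const std::function<void(ToyDb&)>& during_flush) {
  db.SyncFence();      // Fence 1: recovery entry, after the queues are drained.
  during_flush(db);    // Mutex released and reacquired; writers may allocate.
  if (second_fence) {
    db.SyncFence();    // Fence 2: just before the memtable switch.
  }
  return db.last_published;
}
```

If `during_flush` advances `last_allocated`, the value returned without the second fence lags behind it, which is why the description places a second sync after the mutex has been reacquired.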

Comparative results:

| Revision | Attempt | resume_p50_us | resume_p95_us | Retryable BG errors |
| --- | --- | --- | --- | --- |
| Parent | | 1555.75 | 2092.22 | 550 |
| Fix | 1 | 1537.68 | 1905.65 | 550 |
| Fix | 2 | 1475.10 | 1883.26 | 550 |
| Fix | 3 | 1617.52 | 1993.97 | 550 |
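
One way to read the table is as a percent change of each Fix attempt relative to Parent. The helper below is a hypothetical convenience, not part of the benchmark; it only restates the arithmetic.

```cpp
#include <cassert>
#include <cmath>

// Percent change of `candidate` relative to `baseline`.
// A negative result means the candidate latency is lower.
double PercentChange(double baseline, double candidate) {
  return (candidate - baseline) / baseline * 100.0;
}
```

Applied to the figures above, p50 changes by roughly -1.2%, -5.2%, and +4.0% across the three attempts, and p95 by roughly -8.9%, -10.0%, and -4.7%. The p50 results fall on both sides of the Parent value, which fits the "no visible regression" reading.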

Reviewed By: xingbowang, anand1976

Differential Revision: D108946310

meta-codesync · 2026-06-18T18:14:09Z

@mszeszko-meta has exported this pull request. If you are a Meta employee, you can view the originating Diff in D108946310.

github-actions · 2026-06-18T18:20:08Z

✅ clang-tidy: No findings on changed lines

Completed in 296.5s.

meta-codesync · 2026-06-21T05:59:09Z

This pull request has been merged in 0294101.

meta-cla Bot added the CLA Signed label Jun 18, 2026

meta-codesync Bot added the meta-exported label Jun 18, 2026

meta-codesync Bot changed the title ~~Fix seqno sync race with WriteUnprepared~~ Jun 18, 2026

mszeszko-meta force-pushed the export-D108946310 branch from 528be50 to 5173e6e Compare June 18, 2026 21:35

mszeszko-meta force-pushed the export-D108946310 branch from 5173e6e to 89a44e7 Compare June 18, 2026 21:41

mszeszko-meta force-pushed the export-D108946310 branch from 89a44e7 to 9bc8100 Compare June 18, 2026 22:00

mszeszko-meta force-pushed the export-D108946310 branch from 9bc8100 to 0acf493 Compare June 21, 2026 05:12

meta-codesync Bot closed this in 0294101 Jun 21, 2026

meta-codesync Bot added the Merged label Jun 21, 2026

mszeszko-meta wants to merge 1 commit into facebook:main from mszeszko-meta:export-D108946310
