Fix seqno sync race with WriteUnprepared (#14864)#14864
Closed
mszeszko-meta wants to merge 1 commit into
Closed
Conversation
|
@mszeszko-meta has exported this pull request. If you are a Meta employee, you can view the originating Diff in D108946310. |
✅ clang-tidy: No findings on changed linesCompleted in 296.5s. |
mszeszko-meta
added a commit
to mszeszko-meta/rocksdb
that referenced
this pull request
Jun 18, 2026
Summary: # Summary With `two_write_queues`=`true`, `WRITE_UNPREPARED` can allocate sequence numbers through `FetchAddLastAllocatedSequence()` before those numbers are published through `SetLastSequence()`. Error recovery can then create new memtable/WAL boundaries from a stale `LastSequence()`, which can later surface as "sequence number going backwards" corruption. Fix this by syncing `LastSequence()` to `LastAllocatedSequence()` at the recovery entry point after draining both write queues, and again at the recovery flush fence after `FlushAllColumnFamilies()` has released/reacquired the DB mutex and before `SwitchMemtable()` consumes `LastSequence()`. Apply the same fence to atomic recovery flushes. # Perf comparative results TLDR; no visible regression. I ran a deterministic local recovery benchmark to check whether the new two-write-queue recovery fence adds measurable recovery latency. ## Workload (P2385173320) - Optimized build. - `TransactionDB` with `WRITE_UNPREPARED`. - `two_write_queues=true`. - Each iteration dirties the memtable with WUP transactions, forces a retryable flush IO error through `FaultInjectionTestFS`, re-enables the filesystem, then measures `DB::Resume()`. - 50 warmup recoveries + 500 measured recoveries per run. ## Why this is relevant This targets the changed recovery path directly rather than measuring generic write throughput. The change only runs during recovery, so the useful signal is recovery latency around `ResumeImpl()` and the recovery flush fence. ## Code path exercised - `DB::Resume()` - `ErrorHandler::RecoverFromBGError(true)` - `DBImpl::ResumeImpl()` - new `two_write_queues_` recovery queue fence - non-atomic recovery `FlushMemTable()` fence before `SwitchMemtable()` This does not cover atomic flush recovery or a writer-backlog stress case. Comparative results: | Revision | Attempt | resume_p50_us | resume_p95_us | Retryable BG errors | | Parent | | 1555.75 | 2092.22 | 550 | | Fix | 1 | 1537.68 | 1905.65 | 550 | | Fix | 2 | 1475.10 | 1883.26 | 550 | | Fix | 3 | 1617.52 | 1993.97 | 550 | Reviewed By: anand1976 Differential Revision: D108946310
528be50 to
5173e6e
Compare
mszeszko-meta
added a commit
to mszeszko-meta/rocksdb
that referenced
this pull request
Jun 18, 2026
Summary: # Summary With `two_write_queues`=`true`, `WRITE_UNPREPARED` can allocate sequence numbers through `FetchAddLastAllocatedSequence()` before those numbers are published through `SetLastSequence()`. Error recovery can then create new memtable/WAL boundaries from a stale `LastSequence()`, which can later surface as "sequence number going backwards" corruption. Fix this by syncing `LastSequence()` to `LastAllocatedSequence()` at the recovery entry point after draining both write queues, and again at the recovery flush fence after `FlushAllColumnFamilies()` has released/reacquired the DB mutex and before `SwitchMemtable()` consumes `LastSequence()`. Apply the same fence to atomic recovery flushes. # Perf comparative results TLDR; no visible regression. I ran a deterministic local recovery benchmark to check whether the new two-write-queue recovery fence adds measurable recovery latency. ## Workload (P2385173320) - Optimized build. - `TransactionDB` with `WRITE_UNPREPARED`. - `two_write_queues=true`. - Each iteration dirties the memtable with WUP transactions, forces a retryable flush IO error through `FaultInjectionTestFS`, re-enables the filesystem, then measures `DB::Resume()`. - 50 warmup recoveries + 500 measured recoveries per run. ## Why this is relevant This targets the changed recovery path directly rather than measuring generic write throughput. The change only runs during recovery, so the useful signal is recovery latency around `ResumeImpl()` and the recovery flush fence. ## Code path exercised - `DB::Resume()` - `ErrorHandler::RecoverFromBGError(true)` - `DBImpl::ResumeImpl()` - new `two_write_queues_` recovery queue fence - non-atomic recovery `FlushMemTable()` fence before `SwitchMemtable()` This does not cover atomic flush recovery or a writer-backlog stress case. Comparative results: | Revision | Attempt | resume_p50_us | resume_p95_us | Retryable BG errors | | Parent | | 1555.75 | 2092.22 | 550 | | Fix | 1 | 1537.68 | 1905.65 | 550 | | Fix | 2 | 1475.10 | 1883.26 | 550 | | Fix | 3 | 1617.52 | 1993.97 | 550 | Reviewed By: xingbowang, anand1976 Differential Revision: D108946310
5173e6e to
89a44e7
Compare
mszeszko-meta
added a commit
to mszeszko-meta/rocksdb
that referenced
this pull request
Jun 18, 2026
Summary: # Summary With `two_write_queues`=`true`, `WRITE_UNPREPARED` can allocate sequence numbers through `FetchAddLastAllocatedSequence()` before those numbers are published through `SetLastSequence()`. Error recovery can then create new memtable/WAL boundaries from a stale `LastSequence()`, which can later surface as "sequence number going backwards" corruption. Fix this by syncing `LastSequence()` to `LastAllocatedSequence()` at the recovery entry point after draining both write queues, and again at the recovery flush fence after `FlushAllColumnFamilies()` has released/reacquired the DB mutex and before `SwitchMemtable()` consumes `LastSequence()`. Apply the same fence to atomic recovery flushes. # Perf comparative results TLDR; no visible regression. I ran a deterministic local recovery benchmark to check whether the new two-write-queue recovery fence adds measurable recovery latency. ## Workload (P2385173320) - Optimized build. - `TransactionDB` with `WRITE_UNPREPARED`. - `two_write_queues=true`. - Each iteration dirties the memtable with WUP transactions, forces a retryable flush IO error through `FaultInjectionTestFS`, re-enables the filesystem, then measures `DB::Resume()`. - 50 warmup recoveries + 500 measured recoveries per run. ## Why this is relevant This targets the changed recovery path directly rather than measuring generic write throughput. The change only runs during recovery, so the useful signal is recovery latency around `ResumeImpl()` and the recovery flush fence. ## Code path exercised - `DB::Resume()` - `ErrorHandler::RecoverFromBGError(true)` - `DBImpl::ResumeImpl()` - new `two_write_queues_` recovery queue fence - non-atomic recovery `FlushMemTable()` fence before `SwitchMemtable()` This does not cover atomic flush recovery or a writer-backlog stress case. Comparative results: | Revision | Attempt | resume_p50_us | resume_p95_us | Retryable BG errors | | Parent | | 1555.75 | 2092.22 | 550 | | Fix | 1 | 1537.68 | 1905.65 | 550 | | Fix | 2 | 1475.10 | 1883.26 | 550 | | Fix | 3 | 1617.52 | 1993.97 | 550 | Reviewed By: xingbowang, anand1976 Differential Revision: D108946310
89a44e7 to
9bc8100
Compare
Summary: # Summary With `two_write_queues`=`true`, `WRITE_UNPREPARED` can allocate sequence numbers through `FetchAddLastAllocatedSequence()` before those numbers are published through `SetLastSequence()`. Error recovery can then create new memtable/WAL boundaries from a stale `LastSequence()`, which can later surface as "sequence number going backwards" corruption. Fix this by syncing `LastSequence()` to `LastAllocatedSequence()` at the recovery entry point after draining both write queues, and again at the recovery flush fence after `FlushAllColumnFamilies()` has released/reacquired the DB mutex and before `SwitchMemtable()` consumes `LastSequence()`. Apply the same fence to atomic recovery flushes. # Perf comparative results TLDR; no visible regression. I ran a deterministic local recovery benchmark to check whether the new two-write-queue recovery fence adds measurable recovery latency. ## Workload (P2385173320) - Optimized build. - `TransactionDB` with `WRITE_UNPREPARED`. - `two_write_queues=true`. - Each iteration dirties the memtable with WUP transactions, forces a retryable flush IO error through `FaultInjectionTestFS`, re-enables the filesystem, then measures `DB::Resume()`. - 50 warmup recoveries + 500 measured recoveries per run. ## Why this is relevant This targets the changed recovery path directly rather than measuring generic write throughput. The change only runs during recovery, so the useful signal is recovery latency around `ResumeImpl()` and the recovery flush fence. ## Code path exercised - `DB::Resume()` - `ErrorHandler::RecoverFromBGError(true)` - `DBImpl::ResumeImpl()` - new `two_write_queues_` recovery queue fence - non-atomic recovery `FlushMemTable()` fence before `SwitchMemtable()` This does not cover atomic flush recovery or a writer-backlog stress case. Comparative results: | Revision | Attempt | resume_p50_us | resume_p95_us | Retryable BG errors | | Parent | | 1555.75 | 2092.22 | 550 | | Fix | 1 | 1537.68 | 1905.65 | 550 | | Fix | 2 | 1475.10 | 1883.26 | 550 | | Fix | 3 | 1617.52 | 1993.97 | 550 | Reviewed By: xingbowang, anand1976 Differential Revision: D108946310
9bc8100 to
0acf493
Compare
|
This pull request has been merged in 0294101. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary:
Summary
With
two_write_queues=true,WRITE_UNPREPAREDcan allocate sequence numbers throughFetchAddLastAllocatedSequence()before those numbers are published throughSetLastSequence(). Error recovery can then create new memtable/WAL boundaries from a staleLastSequence(), which can later surface as "sequence number going backwards" corruption. Fix this by syncingLastSequence()toLastAllocatedSequence()at the recovery entry point after draining both write queues, and again at the recovery flush fence afterFlushAllColumnFamilies()has released/reacquired the DB mutex and beforeSwitchMemtable()consumesLastSequence(). Apply the same fence to atomic recovery flushes.Perf comparative results
TLDR; no visible regression.
I ran a deterministic local recovery benchmark to check whether the new two-write-queue recovery fence adds measurable recovery latency.
Workload (P2385173320)
TransactionDBwithWRITE_UNPREPARED.two_write_queues=true.FaultInjectionTestFS, re-enables the filesystem, then measuresDB::Resume().Why this is relevant
This targets the changed recovery path directly rather than measuring generic write throughput. The change only runs during recovery, so the useful signal is recovery latency around
ResumeImpl()and the recovery flush fence.Code path exercised
DB::Resume()ErrorHandler::RecoverFromBGError(true)DBImpl::ResumeImpl()two_write_queues_recovery queue fenceFlushMemTable()fence beforeSwitchMemtable()This does not cover atomic flush recovery or a writer-backlog stress case.
Comparative results:
| Revision | Attempt | resume_p50_us | resume_p95_us | Retryable BG errors |
| Parent | | 1555.75 | 2092.22 | 550 |
| Fix | 1 | 1537.68 | 1905.65 | 550 |
| Fix | 2 | 1475.10 | 1883.26 | 550 |
| Fix | 3 | 1617.52 | 1993.97 | 550 |
Reviewed By: xingbowang, anand1976
Differential Revision: D108946310