Skip to content

Fix seqno sync race with WriteUnprepared (#14864)#14864

Closed
mszeszko-meta wants to merge 1 commit into
facebook:mainfrom
mszeszko-meta:export-D108946310
Closed

Fix seqno sync race with WriteUnprepared (#14864)#14864
mszeszko-meta wants to merge 1 commit into
facebook:mainfrom
mszeszko-meta:export-D108946310

Conversation

@mszeszko-meta

@mszeszko-meta mszeszko-meta commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

Summary:

Summary

With two_write_queues=true, WRITE_UNPREPARED can allocate sequence numbers through FetchAddLastAllocatedSequence() before those numbers are published through SetLastSequence(). Error recovery can then create new memtable/WAL boundaries from a stale LastSequence(), which can later surface as "sequence number going backwards" corruption. Fix this by syncing LastSequence() to LastAllocatedSequence() at the recovery entry point after draining both write queues, and again at the recovery flush fence after FlushAllColumnFamilies() has released/reacquired the DB mutex and before SwitchMemtable() consumes LastSequence(). Apply the same fence to atomic recovery flushes.

Perf comparative results

TLDR; no visible regression.

I ran a deterministic local recovery benchmark to check whether the new two-write-queue recovery fence adds measurable recovery latency.

Workload (P2385173320)

  • Optimized build.
  • TransactionDB with WRITE_UNPREPARED.
  • two_write_queues=true.
  • Each iteration dirties the memtable with WUP transactions, forces a retryable flush IO error through FaultInjectionTestFS, re-enables the filesystem, then measures DB::Resume().
  • 50 warmup recoveries + 500 measured recoveries per run.

Why this is relevant

This targets the changed recovery path directly rather than measuring generic write throughput. The change only runs during recovery, so the useful signal is recovery latency around ResumeImpl() and the recovery flush fence.

Code path exercised

  • DB::Resume()
  • ErrorHandler::RecoverFromBGError(true)
  • DBImpl::ResumeImpl()
  • new two_write_queues_ recovery queue fence
  • non-atomic recovery FlushMemTable() fence before SwitchMemtable()

This does not cover atomic flush recovery or a writer-backlog stress case.

Comparative results:

| Revision | Attempt | resume_p50_us | resume_p95_us | Retryable BG errors |
| Parent | | 1555.75 | 2092.22 | 550 |
| Fix | 1 | 1537.68 | 1905.65 | 550 |
| Fix | 2 | 1475.10 | 1883.26 | 550 |
| Fix | 3 | 1617.52 | 1993.97 | 550 |

Reviewed By: xingbowang, anand1976

Differential Revision: D108946310

@meta-cla meta-cla Bot added the CLA Signed label Jun 18, 2026
@meta-codesync

meta-codesync Bot commented Jun 18, 2026

Copy link
Copy Markdown

@mszeszko-meta has exported this pull request. If you are a Meta employee, you can view the originating Diff in D108946310.

@github-actions

github-actions Bot commented Jun 18, 2026

Copy link
Copy Markdown

✅ clang-tidy: No findings on changed lines

Completed in 296.5s.

@meta-codesync meta-codesync Bot changed the title Fix seqno sync race with WriteUnprepared Jun 18, 2026
mszeszko-meta added a commit to mszeszko-meta/rocksdb that referenced this pull request Jun 18, 2026
Summary:
# Summary


With `two_write_queues`=`true`, `WRITE_UNPREPARED` can allocate sequence numbers through `FetchAddLastAllocatedSequence()` before those numbers are published through `SetLastSequence()`. Error recovery can then create new memtable/WAL boundaries from a stale `LastSequence()`, which can later surface as "sequence number going backwards" corruption. Fix this by syncing `LastSequence()` to `LastAllocatedSequence()` at the recovery entry point after draining both write queues, and again at the recovery flush fence after `FlushAllColumnFamilies()` has released/reacquired the DB mutex and before `SwitchMemtable()` consumes `LastSequence()`. Apply the same fence to atomic recovery flushes.

# Perf comparative results

TLDR; no visible regression.

I ran a deterministic local recovery benchmark to check whether the new two-write-queue recovery fence adds measurable recovery latency.

## Workload (P2385173320)
  - Optimized build.
  - `TransactionDB` with `WRITE_UNPREPARED`.
  - `two_write_queues=true`.
  - Each iteration dirties the memtable with WUP transactions, forces a retryable flush IO error through `FaultInjectionTestFS`, re-enables the filesystem, then measures `DB::Resume()`.
  - 50 warmup recoveries + 500 measured recoveries per run.

## Why this is relevant
This targets the changed recovery path directly rather than measuring generic write throughput. The change only runs during recovery, so the useful signal is recovery latency around `ResumeImpl()` and the recovery flush fence.

## Code path exercised
  - `DB::Resume()`
  - `ErrorHandler::RecoverFromBGError(true)`
  - `DBImpl::ResumeImpl()`
  - new `two_write_queues_` recovery queue fence
  - non-atomic recovery `FlushMemTable()` fence before `SwitchMemtable()`

This does not cover atomic flush recovery or a writer-backlog stress case.

Comparative results:

| Revision | Attempt | resume_p50_us | resume_p95_us | Retryable BG errors |
| Parent |  | 1555.75 | 2092.22 | 550 |
| Fix | 1 | 1537.68 | 1905.65 | 550 |
| Fix | 2 | 1475.10 | 1883.26 | 550 |
| Fix | 3 | 1617.52 | 1993.97 | 550 |

Reviewed By: anand1976

Differential Revision: D108946310
mszeszko-meta added a commit to mszeszko-meta/rocksdb that referenced this pull request Jun 18, 2026
Summary:
# Summary


With `two_write_queues`=`true`, `WRITE_UNPREPARED` can allocate sequence numbers through `FetchAddLastAllocatedSequence()` before those numbers are published through `SetLastSequence()`. Error recovery can then create new memtable/WAL boundaries from a stale `LastSequence()`, which can later surface as "sequence number going backwards" corruption. Fix this by syncing `LastSequence()` to `LastAllocatedSequence()` at the recovery entry point after draining both write queues, and again at the recovery flush fence after `FlushAllColumnFamilies()` has released/reacquired the DB mutex and before `SwitchMemtable()` consumes `LastSequence()`. Apply the same fence to atomic recovery flushes.

# Perf comparative results

TLDR; no visible regression.

I ran a deterministic local recovery benchmark to check whether the new two-write-queue recovery fence adds measurable recovery latency.

## Workload (P2385173320)
  - Optimized build.
  - `TransactionDB` with `WRITE_UNPREPARED`.
  - `two_write_queues=true`.
  - Each iteration dirties the memtable with WUP transactions, forces a retryable flush IO error through `FaultInjectionTestFS`, re-enables the filesystem, then measures `DB::Resume()`.
  - 50 warmup recoveries + 500 measured recoveries per run.

## Why this is relevant
This targets the changed recovery path directly rather than measuring generic write throughput. The change only runs during recovery, so the useful signal is recovery latency around `ResumeImpl()` and the recovery flush fence.

## Code path exercised
  - `DB::Resume()`
  - `ErrorHandler::RecoverFromBGError(true)`
  - `DBImpl::ResumeImpl()`
  - new `two_write_queues_` recovery queue fence
  - non-atomic recovery `FlushMemTable()` fence before `SwitchMemtable()`

This does not cover atomic flush recovery or a writer-backlog stress case.

Comparative results:

| Revision | Attempt | resume_p50_us | resume_p95_us | Retryable BG errors |
| Parent |  | 1555.75 | 2092.22 | 550 |
| Fix | 1 | 1537.68 | 1905.65 | 550 |
| Fix | 2 | 1475.10 | 1883.26 | 550 |
| Fix | 3 | 1617.52 | 1993.97 | 550 |

Reviewed By: xingbowang, anand1976

Differential Revision: D108946310
mszeszko-meta added a commit to mszeszko-meta/rocksdb that referenced this pull request Jun 18, 2026
Summary:
# Summary


With `two_write_queues`=`true`, `WRITE_UNPREPARED` can allocate sequence numbers through `FetchAddLastAllocatedSequence()` before those numbers are published through `SetLastSequence()`. Error recovery can then create new memtable/WAL boundaries from a stale `LastSequence()`, which can later surface as "sequence number going backwards" corruption. Fix this by syncing `LastSequence()` to `LastAllocatedSequence()` at the recovery entry point after draining both write queues, and again at the recovery flush fence after `FlushAllColumnFamilies()` has released/reacquired the DB mutex and before `SwitchMemtable()` consumes `LastSequence()`. Apply the same fence to atomic recovery flushes.

# Perf comparative results

TLDR; no visible regression.

I ran a deterministic local recovery benchmark to check whether the new two-write-queue recovery fence adds measurable recovery latency.

## Workload (P2385173320)
  - Optimized build.
  - `TransactionDB` with `WRITE_UNPREPARED`.
  - `two_write_queues=true`.
  - Each iteration dirties the memtable with WUP transactions, forces a retryable flush IO error through `FaultInjectionTestFS`, re-enables the filesystem, then measures `DB::Resume()`.
  - 50 warmup recoveries + 500 measured recoveries per run.

## Why this is relevant
This targets the changed recovery path directly rather than measuring generic write throughput. The change only runs during recovery, so the useful signal is recovery latency around `ResumeImpl()` and the recovery flush fence.

## Code path exercised
  - `DB::Resume()`
  - `ErrorHandler::RecoverFromBGError(true)`
  - `DBImpl::ResumeImpl()`
  - new `two_write_queues_` recovery queue fence
  - non-atomic recovery `FlushMemTable()` fence before `SwitchMemtable()`

This does not cover atomic flush recovery or a writer-backlog stress case.

Comparative results:

| Revision | Attempt | resume_p50_us | resume_p95_us | Retryable BG errors |
| Parent |  | 1555.75 | 2092.22 | 550 |
| Fix | 1 | 1537.68 | 1905.65 | 550 |
| Fix | 2 | 1475.10 | 1883.26 | 550 |
| Fix | 3 | 1617.52 | 1993.97 | 550 |

Reviewed By: xingbowang, anand1976

Differential Revision: D108946310
Summary:
# Summary


With `two_write_queues`=`true`, `WRITE_UNPREPARED` can allocate sequence numbers through `FetchAddLastAllocatedSequence()` before those numbers are published through `SetLastSequence()`. Error recovery can then create new memtable/WAL boundaries from a stale `LastSequence()`, which can later surface as "sequence number going backwards" corruption. Fix this by syncing `LastSequence()` to `LastAllocatedSequence()` at the recovery entry point after draining both write queues, and again at the recovery flush fence after `FlushAllColumnFamilies()` has released/reacquired the DB mutex and before `SwitchMemtable()` consumes `LastSequence()`. Apply the same fence to atomic recovery flushes.

# Perf comparative results

TLDR; no visible regression.

I ran a deterministic local recovery benchmark to check whether the new two-write-queue recovery fence adds measurable recovery latency.

## Workload (P2385173320)
  - Optimized build.
  - `TransactionDB` with `WRITE_UNPREPARED`.
  - `two_write_queues=true`.
  - Each iteration dirties the memtable with WUP transactions, forces a retryable flush IO error through `FaultInjectionTestFS`, re-enables the filesystem, then measures `DB::Resume()`.
  - 50 warmup recoveries + 500 measured recoveries per run.

## Why this is relevant
This targets the changed recovery path directly rather than measuring generic write throughput. The change only runs during recovery, so the useful signal is recovery latency around `ResumeImpl()` and the recovery flush fence.

## Code path exercised
  - `DB::Resume()`
  - `ErrorHandler::RecoverFromBGError(true)`
  - `DBImpl::ResumeImpl()`
  - new `two_write_queues_` recovery queue fence
  - non-atomic recovery `FlushMemTable()` fence before `SwitchMemtable()`

This does not cover atomic flush recovery or a writer-backlog stress case.

Comparative results:

| Revision | Attempt | resume_p50_us | resume_p95_us | Retryable BG errors |
| Parent |  | 1555.75 | 2092.22 | 550 |
| Fix | 1 | 1537.68 | 1905.65 | 550 |
| Fix | 2 | 1475.10 | 1883.26 | 550 |
| Fix | 3 | 1617.52 | 1993.97 | 550 |

Reviewed By: xingbowang, anand1976

Differential Revision: D108946310
@meta-codesync

meta-codesync Bot commented Jun 21, 2026

Copy link
Copy Markdown

This pull request has been merged in 0294101.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment