[DRAFT] Disable auto-recovery for some WAL write IO error; Re-enable track_and_verify_wals in crash test #13903

Draft
hx235 wants to merge 2 commits into facebook:main from hx235:debug_track_verify_wal_error

Conversation


hx235 commented Aug 26, 2025


I realized that more tests assume WAL write IO errors are auto-recoverable. I need to think about this further, and I wonder why the previous stress test rarely failed with the CF inconsistency.

Context/Summary:
When atomic_flush = false and there are multiple column families, flushing individual CFs during auto recovery from a WAL-related IO error can create data inconsistencies (caught by track_and_verify_wals=1): some column families advance past the corruption point while others remain behind, which prevents the database from restarting successfully. Therefore we disable auto recovery in this case by assigning the higher severity Status::Severity::kFatalError, and we disable this testing combination in the db crash test.

This PR also fixes a stress test bug where Status::Severity::kFatalError was treated as retryable.
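The severity decision and the retryability check described above can be sketched in standalone C++. This is a minimal sketch under stated assumptions, not the code in this PR: the `Severity` enum is a local stand-in that mirrors the ordering of RocksDB's `Status::Severity`, and `SeverityForWalWriteError` and `IsRetryableSeverity` are hypothetical helpers. It also assumes that WAL write errors outside the multi-CF, non-atomic-flush case keep the auto-recoverable `kHardError` severity.

```cpp
#include <cstddef>

// Local stand-in that mirrors the ordering of rocksdb::Status::Severity,
// redefined here so the sketch compiles on its own.
enum class Severity {
  kNoError,
  kSoftError,
  kHardError,
  kFatalError,
  kUnrecoverableError
};

// Hypothetical helper: severity assigned to a WAL write IO error.
Severity SeverityForWalWriteError(bool atomic_flush,
                                  std::size_t num_column_families) {
  if (!atomic_flush && num_column_families > 1) {
    // Flushing CFs individually during auto recovery could leave some CFs
    // past the failed WAL write and others behind, so escalate to fatal
    // to rule out auto recovery.
    return Severity::kFatalError;
  }
  // Assumption: every other case keeps an auto-recoverable severity.
  return Severity::kHardError;
}

// Hypothetical helper: whether a stress test may treat an error as
// retryable, i.e. expect the DB to auto-recover. kFatalError and above are
// never retryable; kNoError is not an error at all.
bool IsRetryableSeverity(Severity s) {
  return s != Severity::kNoError && s < Severity::kFatalError;
}
```

Under these assumptions, `SeverityForWalWriteError(false, 3)` yields `kFatalError`, for which `IsRetryableSeverity` returns false, so the stress test would no longer expect recovery from it.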

Test plan:

  • Rehearsal stress test
meta-cla bot added the CLA Signed label Aug 26, 2025
@facebook-github-bot


@hx235 has imported this pull request. If you are a Meta employee, you can view this in D81056359.

hx235 force-pushed the debug_track_verify_wal_error branch from 70bb732 to 53ad5c8 on August 26, 2025 23:14
@facebook-github-bot


@hx235 has imported this pull request. If you are a Meta employee, you can view this in D81056359.

hx235 marked this pull request as draft on August 27, 2025 07:19
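hx235 changed the title from "Disable auto-recovery for some WAL write IO error; Re-enable track_and_verify_wals in crash test" to "[WIP]Disable auto-recovery for some WAL write IO error; Re-enable track_and_verify_wals in crash test" on Aug 27, 2025
hx235 changed the title from "[WIP]Disable auto-recovery for some WAL write IO error; Re-enable track_and_verify_wals in crash test" to "[DRAFT] Disable auto-recovery for some WAL write IO error; Re-enable track_and_verify_wals in crash test" on Aug 27, 2025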
pdillinger


Not ready for review? (Can you update the internal diff to "changes planned" if so?)
