Skip to content

Per-StressTest fault injection with env/fs cleanup#14750

Closed
hx235 wants to merge 2 commits into
facebook:mainfrom
hx235:export-D104959945
Closed

Per-StressTest fault injection with env/fs cleanup#14750
hx235 wants to merge 2 commits into
facebook:mainfrom
hx235:export-D104959945

Conversation

@hx235

@hx235 hx235 commented May 16, 2026

Copy link
Copy Markdown
Contributor

Summary:
Prerequisite for multi-DB stress test support where each StressTest instance owns one DB. Moves fault injection from a single global to per-StressTest instance so each DB gets isolated fault injection — errors injected into one DB do not leak to others.

FS/Env architecture change:

Before (upstream):
raw_fs → FaultInjectionTestFS (global) → DbStressFSWrapper → db_stress_env → DB::Open

After (this diff):
Global: raw_env (no wrappers)
Used outside StressTest: non-FS ops (threads, time, sleep), test framework FS setup (dirs), DB destruction
Used inside StressTest: cleanup ops that must succeed (external file delete)

Per StressTest:
  raw_fs → DbStressFSWrapper (db_stress_fs_, always)
            → FaultInjectionTestFS (db_fault_injection_fs_, optional)
                  → CompositeEnvWrapper (db_env_, always) → DB::Open

Expected values state: Env::Default() (always local PosixEnv even when raw_env is remote)

FS layer order swapped (DbStressFSWrapper now innermost). Safe because:

  • Error injection returns early — inner wrapper never executes (same behavior)
  • DbStressFSWrapper assertions do not modify data (checksum validation, IOActivity checks)
  • MANIFEST rename tracking slightly better for crash simulation in new order

Stored members (per StressTest):

  • db_stress_fs_: DbStressFSWrapper. Always active regardless of fault injection flags.
  • db_fault_injection_fs_: FaultInjectionTestFS. Only when fault injection flags set. Direct access for Enable/Disable/SetThreadLocal.
  • db_env_: CompositeEnvWrapper. Always present. THE env for all DB I/O (options_.env).

Eliminated globals: db_stress_env → renamed to raw_env (no wrappers, DbStressFSWrapper moved to per-StressTest); db_stress_listener_env, db_stress_raw_fs removed; fault_fs_guard, fault_env_guard moved to per-StressTest.

Other changes:

  • CleanupOutputDirectory simplified (uses raw_env, no disable/enable needed)
  • SstFileManager recreated with per-StressTest env in Open()
  • Remote compaction override env uses options.env directly (fixes pre-existing silent bug)
  • Comments added: Env::Default() always local, DbStressDestroyDb MANIFEST explanation, fault injection log path in TEST_TMPDIR

Differential Revision: D104959945

hx235 and others added 2 commits May 15, 2026 00:01
Summary:
Pure mechanical refactor: replace all direct FLAGS_db / FLAGS_expected_values_dir / FLAGS_secondaries_base reads with accessor methods on StressTest. No new flags, no parameters, no behavior change. Prepares for multi-DB stress test where each StressTest instance has its own DB.

Changes:
- GetDbPath(), GetExpectedValuesDir(), GetSecondariesBase() accessors return the corresponding FLAGS values directly
- Replace ~15 FLAGS_db references with GetDbPath() in db_stress_test_base.cc
- Move SharedState constructor from .h to .cc (needs full StressTest type for GetExpectedValuesDir())
- Move DbStressListener constructor from .h to .cc (same reason)
- Replace FLAGS_db / FLAGS_expected_values_dir in db_stress_driver.cc with accessor calls
- NO changes to db_crashtest.py or db_stress_gflags.cc

Differential Revision: D104959943
Summary:
Prerequisite for multi-DB stress test support where each StressTest instance owns one DB. Moves fault injection from a single global to per-StressTest instance so each DB gets isolated fault injection — errors injected into one DB do not leak to others.

FS/Env architecture change:

  Before (upstream):
    raw_fs → FaultInjectionTestFS (global) → DbStressFSWrapper → db_stress_env → DB::Open

  After (this diff):
    Global: raw_env (no wrappers)
      Used outside StressTest: non-FS ops (threads, time, sleep), test framework FS setup (dirs), DB destruction
      Used inside StressTest: cleanup ops that must succeed (external file delete)

    Per StressTest:
      raw_fs → DbStressFSWrapper (db_stress_fs_, always)
                → FaultInjectionTestFS (db_fault_injection_fs_, optional)
                      → CompositeEnvWrapper (db_env_, always) → DB::Open

    Expected values state: Env::Default() (always local PosixEnv even when raw_env is remote)

FS layer order swapped (DbStressFSWrapper now innermost). Safe because:
- Error injection returns early — inner wrapper never executes (same behavior)
- DbStressFSWrapper assertions do not modify data (checksum validation, IOActivity checks)
- MANIFEST rename tracking slightly better for crash simulation in new order

Stored members (per StressTest):
- db_stress_fs_: DbStressFSWrapper. Always active regardless of fault injection flags.
- db_fault_injection_fs_: FaultInjectionTestFS. Only when fault injection flags set. Direct access for Enable/Disable/SetThreadLocal.
- db_env_: CompositeEnvWrapper. Always present. THE env for all DB I/O (options_.env).

Eliminated globals: db_stress_env → renamed to raw_env (no wrappers, DbStressFSWrapper moved to per-StressTest); db_stress_listener_env, db_stress_raw_fs removed; fault_fs_guard, fault_env_guard moved to per-StressTest.

Other changes:
- CleanupOutputDirectory simplified (uses raw_env, no disable/enable needed)
- SstFileManager recreated with per-StressTest env in Open()
- Remote compaction override env uses options.env directly (fixes pre-existing silent bug)
- Comments added: Env::Default() always local, DbStressDestroyDb MANIFEST explanation, fault injection log path in TEST_TMPDIR

Differential Revision: D104959945
@meta-cla meta-cla Bot added the CLA Signed label May 16, 2026
@meta-codesync

meta-codesync Bot commented May 16, 2026

Copy link
Copy Markdown

@hx235 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D104959945.

@github-actions

Copy link
Copy Markdown

⚠️ clang-tidy: 6 warning(s) on changed lines

Completed in 155.3s.

Summary by check

Check Count
concurrency-mt-unsafe 2
cppcoreguidelines-avoid-non-const-global-variables 4
Total 6

Details

db_stress_tool/db_stress_common.cc (2 warning(s))
db_stress_tool/db_stress_common.cc:22:25: warning: variable 'raw_env' is non-const and globally accessible, consider making it const [cppcoreguidelines-avoid-non-const-global-variables]
db_stress_tool/db_stress_common.cc:22:25: warning: variable 'raw_env' provides global access to a non-const object; consider making the pointed-to data 'const' [cppcoreguidelines-avoid-non-const-global-variables]
db_stress_tool/db_stress_test_base.cc (2 warning(s))
db_stress_tool/db_stress_test_base.cc:128:31: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_test_base.cc:3883:7: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc (2 warning(s))
db_stress_tool/db_stress_tool.cc:39:49: warning: variable 'fault_fs_for_crash_report' is non-const and globally accessible, consider making it const [cppcoreguidelines-avoid-non-const-global-variables]
db_stress_tool/db_stress_tool.cc:39:49: warning: variable 'fault_fs_for_crash_report' provides global access to a non-const object; consider making the pointed-to data 'const' [cppcoreguidelines-avoid-non-const-global-variables]
hx235 added a commit to hx235/rocksdb that referenced this pull request May 20, 2026
Summary:
Pull Request resolved: facebook#14750

Prerequisite for multi-DB stress test support where each StressTest instance owns one DB. Moves fault injection from a single global to per-StressTest instance so each DB gets isolated fault injection; errors injected into one DB do not leak to others.

FS/Env architecture change:

  Before (upstream):
    raw_fs → FaultInjectionTestFS (global) → DbStressFSWrapper → db_stress_env → DB::Open

  After (this diff):
    Global: raw_env (no wrappers)
      Used outside StressTest: non-FS ops (threads, time, sleep), test framework FS setup (dirs), DB destruction
      Used inside StressTest: cleanup ops that must succeed (external file delete)

    Per StressTest:
      raw_fs → DbStressFSWrapper (db_stress_fs_, always)
                → FaultInjectionTestFS (db_fault_injection_fs_, optional)
                      → CompositeEnvWrapper (db_env_, always) → DB::Open

    Expected values state: Env::Default() (always local PosixEnv even when raw_env is remote)

FS layer order swapped (DbStressFSWrapper now innermost). Safe because:
- Error injection returns early; inner wrapper never executes (same behavior)
- DbStressFSWrapper assertions do not modify data (checksum validation, IOActivity checks)
- MANIFEST rename tracking slightly better for crash simulation in new order

Stored members (per StressTest):
- db_stress_fs_: DbStressFSWrapper. Always active regardless of fault injection flags.
- db_fault_injection_fs_: FaultInjectionTestFS. Only when fault injection flags set. Direct access for Enable/Disable/SetThreadLocal.
- db_env_: CompositeEnvWrapper. Always present. THE env for all DB I/O (options_.env).

Eliminated globals: db_stress_env → renamed to raw_env (no wrappers, DbStressFSWrapper moved to per-StressTest); db_stress_listener_env, db_stress_raw_fs removed; fault_fs_guard, fault_env_guard moved to per-StressTest.

Other changes:
- CleanupOutputDirectory simplified (uses raw_env, no disable/enable needed)
- SstFileManager recreated with per-StressTest env in Open()
- Remote compaction override env uses options.env directly (fixes pre-existing silent bug)
- Comments added: Env::Default() always local, DbStressDestroyDb MANIFEST explanation, fault injection log path in TEST_TMPDIR
- TestFSWritableFile::Close() now mirrors the production FSWritableFile close boundary. After the first Close() attempt, later explicit or destructor Close() calls are wrapper-level no-ops, while FaultInjectionTestFS still records the first close attempt for crash/recovery simulation.
- Added targeted fault_injection_fs_test coverage for injected metadata-close failures to ensure FaultInjectionTestFS does not retry the inner Close() path.

Differential Revision: D104959945
hx235 pushed a commit to hx235/rocksdb that referenced this pull request May 22, 2026
Summary:

Prerequisite for multi-DB stress test support where each StressTest instance owns one DB. Moves fault injection from a single global to per-StressTest instance so each DB gets isolated fault injection; errors injected into one DB do not leak to others.

FS/Env architecture change:

  Before (upstream):
    raw_fs → FaultInjectionTestFS (global) → DbStressFSWrapper → db_stress_env → DB::Open

  After (this diff):
    Global: raw_env (no wrappers)
      Used outside StressTest: non-FS ops (threads, time, sleep), test framework FS setup (dirs), DB destruction
      Used inside StressTest: cleanup ops that must succeed (external file delete)

    Per StressTest:
      raw_fs → DbStressFSWrapper (db_stress_fs_, always)
                → FaultInjectionTestFS (db_fault_injection_fs_, optional)
                      → CompositeEnvWrapper (db_env_, always) → DB::Open

    Expected values state: Env::Default() (always local PosixEnv even when raw_env is remote)

FS layer order swapped (DbStressFSWrapper now innermost). Safe because:
- Error injection returns early; inner wrapper never executes (same behavior)
- DbStressFSWrapper assertions do not modify data (checksum validation, IOActivity checks)
- MANIFEST rename tracking slightly better for crash simulation in new order

Stored members (per StressTest):
- db_stress_fs_: DbStressFSWrapper. Always active regardless of fault injection flags.
- db_fault_injection_fs_: FaultInjectionTestFS. Only when fault injection flags set. Direct access for Enable/Disable/SetThreadLocal.
- db_env_: CompositeEnvWrapper. Always present. THE env for all DB I/O (options_.env).

Eliminated globals: db_stress_env → renamed to raw_env (no wrappers, DbStressFSWrapper moved to per-StressTest); db_stress_listener_env, db_stress_raw_fs removed; fault_fs_guard, fault_env_guard moved to per-StressTest.

Other changes:
- CleanupOutputDirectory simplified (uses raw_env, no disable/enable needed)
- SstFileManager recreated with per-StressTest env in Open()
- Remote compaction override env uses options.env directly (fixes pre-existing silent bug)
- Comments added: Env::Default() always local, DbStressDestroyDb MANIFEST explanation, fault injection log path in TEST_TMPDIR
- TestFSWritableFile::Close() now mirrors the production FSWritableFile close boundary. After the first Close() attempt, later explicit or destructor Close() calls are wrapper-level no-ops, while FaultInjectionTestFS still records the first close attempt for crash/recovery simulation.
- Added targeted fault_injection_fs_test coverage for injected metadata-close failures to ensure FaultInjectionTestFS does not retry the inner Close() path.

Reviewed By: anand1976

Differential Revision: D104959945
hx235 pushed a commit to hx235/rocksdb that referenced this pull request May 22, 2026
Summary:

Prerequisite for multi-DB stress test support where each StressTest instance owns one DB. Moves fault injection from a single global to per-StressTest instance so each DB gets isolated fault injection; errors injected into one DB do not leak to others.

FS/Env architecture change:

  Before (upstream):
    raw_fs → FaultInjectionTestFS (global) → DbStressFSWrapper → db_stress_env → DB::Open

  After (this diff):
    Global: raw_env (no wrappers)
      Used outside StressTest: non-FS ops (threads, time, sleep), test framework FS setup (dirs), DB destruction
      Used inside StressTest: cleanup ops that must succeed (external file delete)

    Per StressTest:
      raw_fs → DbStressFSWrapper (db_stress_fs_, always)
                → FaultInjectionTestFS (db_fault_injection_fs_, optional)
                      → CompositeEnvWrapper (db_env_, always) → DB::Open

    Expected values state: Env::Default() (always local PosixEnv even when raw_env is remote)

FS layer order swapped (DbStressFSWrapper now innermost). Safe because:
- Error injection returns early; inner wrapper never executes (same behavior)
- DbStressFSWrapper assertions do not modify data (checksum validation, IOActivity checks)
- MANIFEST rename tracking slightly better for crash simulation in new order

Stored members (per StressTest):
- db_stress_fs_: DbStressFSWrapper. Always active regardless of fault injection flags.
- db_fault_injection_fs_: FaultInjectionTestFS. Only when fault injection flags set. Direct access for Enable/Disable/SetThreadLocal.
- db_env_: CompositeEnvWrapper. Always present. THE env for all DB I/O (options_.env).

Eliminated globals: db_stress_env → renamed to raw_env (no wrappers, DbStressFSWrapper moved to per-StressTest); db_stress_listener_env, db_stress_raw_fs removed; fault_fs_guard, fault_env_guard moved to per-StressTest.

Other changes:
- CleanupOutputDirectory simplified (uses raw_env, no disable/enable needed)
- SstFileManager recreated with per-StressTest env in Open()
- Remote compaction override env uses options.env directly (fixes pre-existing silent bug)
- Comments added: Env::Default() always local, DbStressDestroyDb MANIFEST explanation, fault injection log path in TEST_TMPDIR
- TestFSWritableFile::Close() now mirrors the production FSWritableFile close boundary. After the first Close() attempt, later explicit or destructor Close() calls are wrapper-level no-ops, while FaultInjectionTestFS still records the first close attempt for crash/recovery simulation.
- Added targeted fault_injection_fs_test coverage for injected metadata-close failures to ensure FaultInjectionTestFS does not retry the inner Close() path.

Reviewed By: anand1976

Differential Revision: D104959945
meta-codesync Bot pushed a commit that referenced this pull request May 22, 2026
Summary:
Pull Request resolved: #14757

Pull Request resolved: #14750

Prerequisite for multi-DB stress test support where each StressTest instance owns one DB. Moves fault injection from a single global to per-StressTest instance so each DB gets isolated fault injection; errors injected into one DB do not leak to others.

FS/Env architecture change:

  Before (upstream):
    raw_fs → FaultInjectionTestFS (global) → DbStressFSWrapper → db_stress_env → DB::Open

  After (this diff):
    Global: raw_env (no wrappers)
      Used outside StressTest: non-FS ops (threads, time, sleep), test framework FS setup (dirs), DB destruction
      Used inside StressTest: cleanup ops that must succeed (external file delete)

    Per StressTest:
      raw_fs → DbStressFSWrapper (db_stress_fs_, always)
                → FaultInjectionTestFS (db_fault_injection_fs_, optional)
                      → CompositeEnvWrapper (db_env_, always) → DB::Open

    Expected values state: Env::Default() (always local PosixEnv even when raw_env is remote)

FS layer order swapped (DbStressFSWrapper now innermost). Safe because:
- Error injection returns early; inner wrapper never executes (same behavior)
- DbStressFSWrapper assertions do not modify data (checksum validation, IOActivity checks)
- MANIFEST rename tracking slightly better for crash simulation in new order

Stored members (per StressTest):
- db_stress_fs_: DbStressFSWrapper. Always active regardless of fault injection flags.
- db_fault_injection_fs_: FaultInjectionTestFS. Only when fault injection flags set. Direct access for Enable/Disable/SetThreadLocal.
- db_env_: CompositeEnvWrapper. Always present. THE env for all DB I/O (options_.env).

Eliminated globals: db_stress_env → renamed to raw_env (no wrappers, DbStressFSWrapper moved to per-StressTest); db_stress_listener_env, db_stress_raw_fs removed; fault_fs_guard, fault_env_guard moved to per-StressTest.

Other changes:
- CleanupOutputDirectory simplified (uses raw_env, no disable/enable needed)
- SstFileManager recreated with per-StressTest env in Open()
- Remote compaction override env uses options.env directly (fixes pre-existing silent bug)
- Comments added: Env::Default() always local, DbStressDestroyDb MANIFEST explanation, fault injection log path in TEST_TMPDIR
- TestFSWritableFile::Close() now mirrors the production FSWritableFile close boundary. After the first Close() attempt, later explicit or destructor Close() calls are wrapper-level no-ops, while FaultInjectionTestFS still records the first close attempt for crash/recovery simulation.
- Added targeted fault_injection_fs_test coverage for injected metadata-close failures to ensure FaultInjectionTestFS does not retry the inner Close() path.

Reviewed By: anand1976

Differential Revision: D104959945

fbshipit-source-id: 7cf9bb494dec2b372528d5f119c023b6d392ffca
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment