Skip to content

Add multi-DB stress testing support (--num_dbs flag) (#14749)#14749

Closed
hx235 wants to merge 1 commit into
facebook:mainfrom
hx235:export-D104959942
Closed

Add multi-DB stress testing support (--num_dbs flag) (#14749)#14749
hx235 wants to merge 1 commit into
facebook:mainfrom
hx235:export-D104959942

Conversation

@hx235

@hx235 hx235 commented May 16, 2026

Copy link
Copy Markdown
Contributor

Summary:

Add --num_dbs flag to run N independent DB instances in parallel. Each StressTest instance has its own DB with isolated fault injection (from D104959945). db_crashtest.py defaults to num_dbs=1.

For num_dbs=1: --db and --expected_values_dir are paths used as-is.
For num_dbs>1: they are parent directories; C++ creates db_0/, db_1/, ... subdirs underneath.

Path ownership: C++ owns DB and secondary dir creation (supports remote env). Python owns EV dir creation (always local). C++ also creates EV dirs as fallback for direct CLI usage. DestroyAllDbs cleans up subdirs and the parent dir.

Per-DB: threads, max_key, ops_per_thread, reopen, column_families, and all DB options.
Shared: background env threads (compaction, flush pool), block_cache, write_buffer_manager, compressed_secondary_cache, rate_limiter, compaction_thread_pool_adjust_interval.

Reviewed By: anand1976

Differential Revision: D104959942

@meta-cla meta-cla Bot added the CLA Signed label May 16, 2026
@meta-codesync

meta-codesync Bot commented May 16, 2026

Copy link
Copy Markdown

@hx235 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D104959942.

@hx235 hx235 changed the title Add multi-DB stress testing support (--num_dbs flag) May 16, 2026
@github-actions

github-actions Bot commented May 16, 2026

Copy link
Copy Markdown

⚠️ clang-tidy: 6 warning(s) on changed lines

Completed in 159.9s.

Summary by check

Check Count
concurrency-mt-unsafe 3
cppcoreguidelines-avoid-non-const-global-variables 3
Total 6

Details

db_stress_tool/db_stress_common.cc (2 warning(s))
db_stress_tool/db_stress_common.cc:25:56: warning: variable 'wbm' is non-const and globally accessible, consider making it const [cppcoreguidelines-avoid-non-const-global-variables]
db_stress_tool/db_stress_common.cc:26:49: warning: variable 'rate_limiter' is non-const and globally accessible, consider making it const [cppcoreguidelines-avoid-non-const-global-variables]
db_stress_tool/db_stress_tool.cc (4 warning(s))
db_stress_tool/db_stress_tool.cc:41:5: warning: variable 'fault_fs_for_crash_report' is non-const and globally accessible, consider making it const [cppcoreguidelines-avoid-non-const-global-variables]
db_stress_tool/db_stress_tool.cc:416:9: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:439:7: warning: function is not thread safe [concurrency-mt-unsafe]
db_stress_tool/db_stress_tool.cc:448:9: warning: function is not thread safe [concurrency-mt-unsafe]
@meta-codesync meta-codesync Bot changed the title [CI only] Add multi-DB stress testing support (--num_dbs flag) May 20, 2026
@hx235 hx235 force-pushed the export-D104959942 branch from 6be55c3 to c855385 Compare May 20, 2026 05:51
@meta-codesync meta-codesync Bot changed the title Add multi-DB stress testing support (--num_dbs flag) May 22, 2026
hx235 pushed a commit to hx235/rocksdb that referenced this pull request May 22, 2026
Summary:

Add --num_dbs flag to run N independent DB instances in parallel. Each StressTest instance has its own DB with isolated fault injection (from Diff 2). db_crashtest.py defaults to num_dbs=1.

Path handling design: crash_main functions (blackbox/whitebox) compute specific per-DB paths from a base directory using get_subdirectory_paths() -- pure path computation, no I/O. gen_cmd validates path count matches num_dbs, creates EV directories, and handles destroy_db_initially for expected state. Both db and expected_values_dir follow the same validate logic in gen_cmd. If user passes specific paths (comma-separated), they are validated but not expanded.

Shared resources (block_cache, write_buffer_manager, rate_limiter) are created once in tool.cc and shared across all DB instances. Most flags (threads, max_key, ops_per_thread) are per-DB.

Other changes:
- --num_dbs flag (default 1) with guards for not-yet-supported features
- [db_label] prefix in all stdout output including stats (e.g. [db_0] Stress Test: ...)
- Per-DB crash callback vector (fault_fs_for_crash_report, raw pointer, signal-safe)
- Fault injection log path includes GetDbLabel() for per-DB uniqueness
- Secondary paths are ephemeral, generated directly by C++
- flockfile/funlockfile for multi-DB stats atomicity

Differential Revision: D104959942
@hx235 hx235 force-pushed the export-D104959942 branch from c855385 to 0086484 Compare May 22, 2026 00:31
hx235 pushed a commit to hx235/rocksdb that referenced this pull request May 22, 2026
Summary:

Add --num_dbs flag to run N independent DB instances in parallel. Each StressTest instance has its own DB with isolated fault injection (from Diff 2). db_crashtest.py defaults to num_dbs=1.

Path handling design: crash_main functions (blackbox/whitebox) compute specific per-DB paths from a base directory using get_subdirectory_paths() -- pure path computation, no I/O. gen_cmd validates path count matches num_dbs, creates EV directories, and handles destroy_db_initially for expected state. Both db and expected_values_dir follow the same validate logic in gen_cmd. If user passes specific paths (comma-separated), they are validated but not expanded.

Shared resources (block_cache, write_buffer_manager, rate_limiter) are created once in tool.cc and shared across all DB instances. Most flags (threads, max_key, ops_per_thread) are per-DB.

Other changes:
- --num_dbs flag (default 1) with guards for not-yet-supported features
- [db_label] prefix in all stdout output including stats (e.g. [db_0] Stress Test: ...)
- Per-DB crash callback vector (fault_fs_for_crash_report, raw pointer, signal-safe)
- Fault injection log path includes GetDbLabel() for per-DB uniqueness
- Secondary paths are ephemeral, generated directly by C++
- flockfile/funlockfile for multi-DB stats atomicity

Differential Revision: D104959942
@hx235 hx235 force-pushed the export-D104959942 branch from 0086484 to cb54e9d Compare May 22, 2026 00:38
hx235 pushed a commit to hx235/rocksdb that referenced this pull request May 22, 2026
Summary:

Add --num_dbs flag to run N independent DB instances in parallel. Each StressTest instance has its own DB with isolated fault injection (from Diff 2). db_crashtest.py defaults to num_dbs=1.

Path handling design: crash_main functions (blackbox/whitebox) compute specific per-DB paths from a base directory using get_subdirectory_paths() -- pure path computation, no I/O. gen_cmd validates path count matches num_dbs, creates EV directories, and handles destroy_db_initially for expected state. Both db and expected_values_dir follow the same validate logic in gen_cmd. If user passes specific paths (comma-separated), they are validated but not expanded.

Shared resources (block_cache, write_buffer_manager, rate_limiter) are created once in tool.cc and shared across all DB instances. Most flags (threads, max_key, ops_per_thread) are per-DB.

Other changes:
- --num_dbs flag (default 1) with guards for not-yet-supported features
- [db_label] prefix in all stdout output including stats (e.g. [db_0] Stress Test: ...)
- Per-DB crash callback vector (fault_fs_for_crash_report, raw pointer, signal-safe)
- Fault injection log path includes GetDbLabel() for per-DB uniqueness
- Secondary paths are ephemeral, generated directly by C++
- flockfile/funlockfile for multi-DB stats atomicity

Differential Revision: D104959942
@hx235 hx235 force-pushed the export-D104959942 branch from cb54e9d to 49193e7 Compare May 22, 2026 00:43
hx235 pushed a commit to hx235/rocksdb that referenced this pull request May 22, 2026
Summary:

Add --num_dbs flag to run N independent DB instances in parallel. Each StressTest instance has its own DB with isolated fault injection (from Diff 2). db_crashtest.py defaults to num_dbs=1.

Path handling design: crash_main functions (blackbox/whitebox) compute specific per-DB paths from a base directory using get_subdirectory_paths() -- pure path computation, no I/O. gen_cmd validates path count matches num_dbs, creates EV directories, and handles destroy_db_initially for expected state. Both db and expected_values_dir follow the same validate logic in gen_cmd. If user passes specific paths (comma-separated), they are validated but not expanded.

Shared resources (block_cache, write_buffer_manager, rate_limiter) are created once in tool.cc and shared across all DB instances. Most flags (threads, max_key, ops_per_thread) are per-DB.

Other changes:
- --num_dbs flag (default 1) with guards for not-yet-supported features
- [db_label] prefix in all stdout output including stats (e.g. [db_0] Stress Test: ...)
- Per-DB crash callback vector (fault_fs_for_crash_report, raw pointer, signal-safe)
- Fault injection log path includes GetDbLabel() for per-DB uniqueness
- Secondary paths are ephemeral, generated directly by C++
- flockfile/funlockfile for multi-DB stats atomicity

Differential Revision: D104959942
@hx235 hx235 force-pushed the export-D104959942 branch 2 times, most recently from 4de8a83 to b072a0e Compare May 22, 2026 00:56
hx235 pushed a commit to hx235/rocksdb that referenced this pull request May 22, 2026
Summary:

Add --num_dbs flag to run N independent DB instances in parallel. Each StressTest instance has its own DB with isolated fault injection (from Diff 2). db_crashtest.py defaults to num_dbs=1.

Path handling design: crash_main functions (blackbox/whitebox) compute specific per-DB paths from a base directory using get_subdirectory_paths() -- pure path computation, no I/O. gen_cmd validates path count matches num_dbs, creates EV directories, and handles destroy_db_initially for expected state. Both db and expected_values_dir follow the same validate logic in gen_cmd. If user passes specific paths (comma-separated), they are validated but not expanded.

Shared resources (block_cache, write_buffer_manager, rate_limiter) are created once in tool.cc and shared across all DB instances. Most flags (threads, max_key, ops_per_thread) are per-DB.

Other changes:
- --num_dbs flag (default 1) with guards for not-yet-supported features
- [db_label] prefix in all stdout output including stats (e.g. [db_0] Stress Test: ...)
- Per-DB crash callback vector (fault_fs_for_crash_report, raw pointer, signal-safe)
- Fault injection log path includes GetDbLabel() for per-DB uniqueness
- Secondary paths are ephemeral, generated directly by C++
- flockfile/funlockfile for multi-DB stats atomicity

Differential Revision: D104959942
hx235 pushed a commit to hx235/rocksdb that referenced this pull request May 22, 2026
Summary:

Add --num_dbs flag to run N independent DB instances in parallel. Each StressTest instance has its own DB with isolated fault injection (from Diff 2). db_crashtest.py defaults to num_dbs=1.

Path handling design: crash_main functions (blackbox/whitebox) compute specific per-DB paths from a base directory using get_subdirectory_paths() -- pure path computation, no I/O. gen_cmd validates path count matches num_dbs, creates EV directories, and handles destroy_db_initially for expected state. Both db and expected_values_dir follow the same validate logic in gen_cmd. If user passes specific paths (comma-separated), they are validated but not expanded.

Shared resources (block_cache, write_buffer_manager, rate_limiter) are created once in tool.cc and shared across all DB instances. Most flags (threads, max_key, ops_per_thread) are per-DB.

Other changes:
- --num_dbs flag (default 1) with guards for not-yet-supported features
- [db_label] prefix in all stdout output including stats (e.g. [db_0] Stress Test: ...)
- Per-DB crash callback vector (fault_fs_for_crash_report, raw pointer, signal-safe)
- Fault injection log path includes GetDbLabel() for per-DB uniqueness
- Secondary paths are ephemeral, generated directly by C++
- flockfile/funlockfile for multi-DB stats atomicity

Differential Revision: D104959942
@hx235 hx235 force-pushed the export-D104959942 branch from b072a0e to 723fe00 Compare May 22, 2026 02:12
hx235 pushed a commit to hx235/rocksdb that referenced this pull request May 22, 2026
Summary:

Add --num_dbs flag to run N independent DB instances in parallel. Each StressTest instance has its own DB with isolated fault injection (from Diff 2). db_crashtest.py defaults to num_dbs=1.

Path handling design: crash_main functions (blackbox/whitebox) compute specific per-DB paths from a base directory using get_subdirectory_paths() -- pure path computation, no I/O. gen_cmd validates path count matches num_dbs, creates EV directories, and handles destroy_db_initially for expected state. Both db and expected_values_dir follow the same validate logic in gen_cmd. If user passes specific paths (comma-separated), they are validated but not expanded.

Shared resources (block_cache, write_buffer_manager, rate_limiter) are created once in tool.cc and shared across all DB instances. Most flags (threads, max_key, ops_per_thread) are per-DB.

Other changes:
- --num_dbs flag (default 1) with guards for not-yet-supported features
- [db_label] prefix in all stdout output including stats (e.g. [db_0] Stress Test: ...)
- Per-DB crash callback vector (fault_fs_for_crash_report, raw pointer, signal-safe)
- Fault injection log path includes GetDbLabel() for per-DB uniqueness
- Secondary paths are ephemeral, generated directly by C++
- flockfile/funlockfile for multi-DB stats atomicity

Differential Revision: D104959942
@hx235 hx235 force-pushed the export-D104959942 branch from 723fe00 to 19717b4 Compare May 22, 2026 02:41
hx235 pushed a commit to hx235/rocksdb that referenced this pull request May 22, 2026
Summary:

Add --num_dbs flag to run N independent DB instances in parallel. Each StressTest instance has its own DB with isolated fault injection (from Diff 2). db_crashtest.py defaults to num_dbs=1.

Path handling design: crash_main functions (blackbox/whitebox) compute specific per-DB paths from a base directory using get_subdirectory_paths() -- pure path computation, no I/O. gen_cmd validates path count matches num_dbs, creates EV directories, and handles destroy_db_initially for expected state. Both db and expected_values_dir follow the same validate logic in gen_cmd. If user passes specific paths (comma-separated), they are validated but not expanded.

Shared resources (block_cache, write_buffer_manager, rate_limiter) are created once in tool.cc and shared across all DB instances. Most flags (threads, max_key, ops_per_thread) are per-DB.

Other changes:
- --num_dbs flag (default 1) with guards for not-yet-supported features
- [db_label] prefix in all stdout output including stats (e.g. [db_0] Stress Test: ...)
- Per-DB crash callback vector (fault_fs_for_crash_report, raw pointer, signal-safe)
- Fault injection log path includes GetDbLabel() for per-DB uniqueness
- Secondary paths are ephemeral, generated directly by C++
- flockfile/funlockfile for multi-DB stats atomicity

Differential Revision: D104959942
@hx235 hx235 force-pushed the export-D104959942 branch from 19717b4 to e775a94 Compare May 22, 2026 02:42
@github-actions

github-actions Bot commented May 22, 2026

Copy link
Copy Markdown

Codex Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit e775a94


Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase
@github-actions

github-actions Bot commented May 22, 2026

Copy link
Copy Markdown

Claude Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit e775a94


Summary

Clean, well-structured PR that adds multi-DB stress testing with proper resource sharing and isolation. The design is sound -- per-DB paths, shared cache/WBM/rate_limiter, per-DB fault injection, and clear phase separation in the orchestration code. Most incompatible features are correctly guarded.

High-severity findings (1):

  • [db_stress_stat.h:184] flockfile/funlockfile are POSIX-only; breaks Windows/MSVC builds.
Full review (click to expand)

Findings

🔴 HIGH

H1. flockfile/funlockfile not portable to Windows -- db_stress_stat.h:184
  • Issue: The PR adds flockfile(stdout) and funlockfile(stdout) to Stats::Report() to prevent multi-DB output interleaving. These are POSIX functions and do not exist on Windows/MSVC.
  • Suggested fix: Use a static port::Mutex instead:
    static port::Mutex report_mutex;
    MutexLock l(&report_mutex);

🟡 MEDIUM

M1. Default --db path incompatible with --num_dbs > 1 when running db_stress directly -- db_stress_tool.cc:394-403
  • Issue: When FLAGS_db is empty and num_dbs > 1, the single default path causes a size mismatch error. Users must provide comma-separated paths.
  • Suggested fix: Auto-generate subdirectory paths in C++ (like Python does), or document this clearly.
M2. SyncPoint singleton destruction order -- db_stress_shared_state.h:83-91
  • Issue: SharedState::~SharedState() calls ClearAllCallBacks() globally. Currently safe because all threads are joined first, but fragile if destruction order changes. Add a comment documenting this invariant.
M3. No unit tests for num_dbs > 1 -- tools/db_crashtest_test.py
  • Issue: No tests exercise multi-DB paths (path generation, flag validation, cleanup, finalize_and_sanitize guards).
  • Suggested fix: Add tests for get_subdirectory_paths(), ValidateNumDbsFlags, and finalize_and_sanitize with num_dbs > 1.

🟢 LOW / NIT

L1. Dead else if branch in InitializeOptionsGeneral -- db_stress_test_base.cc:5278
  • Issue: The fallback else if (FLAGS_rate_limiter_bytes_per_sec > 0) creating a per-DB rate limiter is unreachable when the global rate_limiter is pre-created.
L2. manifest_path construction style inconsistency -- db_stress_test_base.cc:4434,4492
  • Issue: Mix of .append("/").append(f) and + "/CURRENT" concatenation styles.

Cross-Component Analysis

Shared Resource Thread-Safe? Guarded?
block_cache Yes N/A (safe to share)
WriteBufferManager Yes N/A (safe to share)
RateLimiter Yes N/A (safe to share)
CompressedSecondaryCache Yes Blocked with num_dbs>1
PoolSizeChangeThread N/A Blocked with num_dbs>1
SyncPoint (debug) Fragile Safe via destruction order

Positive Observations

  • Clean 4-phase orchestration design (create -> init -> run -> cleanup)
  • Proper shared resource factoring (cache, WBM, rate limiter)
  • Good feature incompatibility guards in both C++ and Python
  • Signal-safe crash callback with vector of fault_fs pointers
  • [db_N] output labeling for easy per-DB filtering

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase
hx235 pushed a commit to hx235/rocksdb that referenced this pull request May 22, 2026
Summary:

Add --num_dbs flag to run N independent DB instances in parallel. Each StressTest instance has its own DB with isolated fault injection (from Diff 2). db_crashtest.py defaults to num_dbs=1.

Path handling design: crash_main functions (blackbox/whitebox) compute specific per-DB paths from a base directory using get_subdirectory_paths() -- pure path computation, no I/O. gen_cmd validates path count matches num_dbs, creates EV directories, and handles destroy_db_initially for expected state. Both db and expected_values_dir follow the same validate logic in gen_cmd. If user passes specific paths (comma-separated), they are validated but not expanded.

Shared resources (block_cache, write_buffer_manager, rate_limiter) are created once in tool.cc and shared across all DB instances. Most flags (threads, max_key, ops_per_thread) are per-DB.

Other changes:
- --num_dbs flag (default 1) with guards for not-yet-supported features
- [db_label] prefix in all stdout output including stats (e.g. [db_0] Stress Test: ...)
- Per-DB crash callback vector (fault_fs_for_crash_report, raw pointer, signal-safe)
- Fault injection log path includes GetDbLabel() for per-DB uniqueness
- Secondary paths are ephemeral, generated directly by C++
- flockfile/funlockfile for multi-DB stats atomicity

Differential Revision: D104959942
@hx235 hx235 force-pushed the export-D104959942 branch from e775a94 to d9613c5 Compare May 22, 2026 06:14
hx235 pushed a commit to hx235/rocksdb that referenced this pull request May 22, 2026
Summary:

Add --num_dbs flag to run N independent DB instances in parallel. Each StressTest instance has its own DB with isolated fault injection (from Diff 2). db_crashtest.py defaults to num_dbs=1.

Path handling design: crash_main functions (blackbox/whitebox) compute specific per-DB paths from a base directory using get_subdirectory_paths() -- pure path computation, no I/O. gen_cmd validates path count matches num_dbs, creates EV directories, and handles destroy_db_initially for expected state. Both db and expected_values_dir follow the same validate logic in gen_cmd. If user passes specific paths (comma-separated), they are validated but not expanded.

Shared resources (block_cache, write_buffer_manager, rate_limiter) are created once in tool.cc and shared across all DB instances. Most flags (threads, max_key, ops_per_thread) are per-DB.

Other changes:
- --num_dbs flag (default 1) with guards for not-yet-supported features
- [db_label] prefix in all stdout output including stats (e.g. [db_0] Stress Test: ...)
- Per-DB crash callback vector (fault_fs_for_crash_report, raw pointer, signal-safe)
- Fault injection log path includes GetDbLabel() for per-DB uniqueness
- Secondary paths are ephemeral, generated directly by C++
- flockfile/funlockfile for multi-DB stats atomicity

Differential Revision: D104959942
@hx235 hx235 force-pushed the export-D104959942 branch 2 times, most recently from a5ee2a5 to 29cb7bc Compare May 26, 2026 18:19
hx235 pushed a commit to hx235/rocksdb that referenced this pull request May 26, 2026
Summary:

Add `--num_dbs` flag to run N independent DB instances in parallel. Each `StressTest` instance has its own DB with isolated fault injection (from D104959945). `db_crashtest.py` defaults to `num_dbs=1`.

For `num_dbs=1`: `--db` and `--expected_values_dir` are paths used as-is.
For `num_dbs>1`: they are parent directories; C++ creates `db_0/`, `db_1/`, ... subdirs underneath.

Path ownership: C++ owns DB and secondary dir creation (supports remote env). Python owns EV dir creation (always local). C++ also creates EV dirs as fallback for direct CLI usage. `DestroyAllDbs` cleans up subdirs and the parent dir.

Per-DB: `threads`, `max_key`, `ops_per_thread`, `reopen`, `column_families`, and all DB options.
Shared: background env threads (compaction, flush pool), `block_cache`, `write_buffer_manager`, `compressed_secondary_cache`, `rate_limiter`, `compaction_thread_pool_adjust_interval`.

Reviewed By: anand1976

Differential Revision: D104959942
@github-actions

github-actions Bot commented May 26, 2026

Copy link
Copy Markdown

Codex Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit 29cb7bc


Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase
@github-actions

github-actions Bot commented May 26, 2026

Copy link
Copy Markdown

Claude Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit 29cb7bc


Summary

This PR adds a --num_dbs flag for running N independent DB instances in parallel during stress testing. The overall design is sound for num_dbs=1 (no behavior change), but the num_dbs>1 path has a high-severity background thread lifecycle issue and several medium-severity concerns around shared resource semantics and test coverage.

High-severity findings (2):

  • [db_stress_driver.cc:113-136] std::call_once ties shared background threads (PoolSizeChangeThread, CompressedCacheSetCapacityThread) to the first DB's SharedState. If that DB finishes first, it shuts down the globally-shared threads while other DBs are still running.
  • [db_stress_driver.cc:113,131] ThreadState bg_thread is a stack-local variable in RunStressTestImpl. While mutex/condvar synchronization ensures the bg thread exits before the function returns in the normal case, the design is fragile -- the bg thread lifetime is implicitly coupled to a stack frame rather than explicitly managed.
Full review (click to expand)

Findings

🔴 HIGH

H1. Background thread lifecycle tied to first DB -- db_stress_driver.cc:113-136
  • Issue: std::call_once launches PoolSizeChangeThread and CompressedCacheSetCapacityThread using the first DB's SharedState and stack-local ThreadState. Only that DB's SharedState gets IncBgThreads(). If DB_0 finishes before DB_1..N-1, it signals the bg threads to stop and waits for them to exit. After that, no pool size adjustment or compressed cache capacity changes occur for the remaining DBs.
  • Root cause: Shared global threads are managed through a per-DB SharedState, creating an asymmetric dependency.
  • Suggested fix: Move the shared background threads out of RunStressTestImpl entirely. Create them in db_stress_tool.cc with their own dedicated shutdown mechanism, or ensure the first DB's thread waits for all other DB threads to complete before bg thread shutdown.
H2. Stack-local ThreadState for bg threads -- db_stress_driver.cc:113,131
  • Issue: ThreadState bg_thread(0, shared) lives on the stack. Normal shutdown is safe (mutex/condvar ensures bg thread exits first), but the design is fragile. Future code changes (e.g., replacing exit(1) with return) could create use-after-free.
  • Suggested fix: Heap-allocate or make static (acceptable since call_once ensures single init).

🟡 MEDIUM

M1. Shared rate_limiter not scaled -- db_stress_tool.cc:585-590
  • Issue: Single RateLimiter shared across N DBs. Effective per-DB rate = bytes_per_sec / N.
  • Suggested fix: Document or scale by num_dbs.
M2. Shared WriteBufferManager budget not scaled -- db_stress_tool.cc:581-584
  • Issue: WBM initialized with FLAGS_db_write_buffer_size shared across N DBs. Earlier flushes than single-DB.
  • Suggested fix: Scale budget or document.
M3. DestroyAllDbs partial failure -- db_stress_tool.cc:68-85
  • Issue: If one DB's destroy fails, parent dir cleanup fails too. State is inconsistent.
M4. DECLARE removed from shared_state.h -- db_stress_shared_state.h:27
  • Issue: DECLARE_string(expected_values_dir) removed. May break TUs that include only this header.
M5. Static mutex in Stats::Report -- db_stress_stat.h:184
  • Issue: Acceptable for test tool but technically has static destruction ordering concern.

🟢 LOW / NIT

  • L1. "Per-DB" flag description misleading -- all DBs use same FLAGS values
  • L2. No upper bound on num_dbs -- could OOM with large values
  • L3. Missing incompatibility checks for remote_compaction_worker_threads and others
  • L5. No CI test exercising num_dbs>1 end-to-end

Cross-Component Analysis

Context Multi-DB safe? Notes
Shared block_cache Yes Thread-safe cache
Shared WBM Partial Budget not scaled
Shared RateLimiter Partial Aggregate throttling
FaultInjectionTestFS Yes Per-DB isolated
PoolSizeChangeThread No Tied to first DB lifecycle
CompressedCacheSetCapacityThread No Same issue
Signal handler Yes Vector populated before threads start

Positive Observations

  • Clean per-DB/shared resource separation
  • num_dbs=1 is a true no-op change
  • Good Python-side flag sanitization
  • Report mutex prevents interleaved output
  • Crash callback vector fully populated before threads start

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase
@hx235

hx235 commented May 27, 2026

Copy link
Copy Markdown
Contributor Author

Both high-severity findings are tolerable for simpler designs.

Summary:

Add `--num_dbs` flag to run N independent DB instances in parallel. Each `StressTest` instance has its own DB with isolated fault injection (from D104959945). `db_crashtest.py` defaults to `num_dbs=1`.

For `num_dbs=1`: `--db` and `--expected_values_dir` are paths used as-is.
For `num_dbs>1`: they are parent directories; C++ creates `db_0/`, `db_1/`, ... subdirs underneath.

Path ownership: C++ owns DB and secondary dir creation (supports remote env). Python owns EV dir creation (always local). C++ also creates EV dirs as fallback for direct CLI usage. `DestroyAllDbs` cleans up subdirs and the parent dir.

Per-DB: `threads`, `max_key`, `ops_per_thread`, `reopen`, `column_families`, and all DB options.
Shared: background env threads (compaction, flush pool), `block_cache`, `write_buffer_manager`, `compressed_secondary_cache`, `rate_limiter`, `compaction_thread_pool_adjust_interval`.

Reviewed By: anand1976

Differential Revision: D104959942
@hx235 hx235 force-pushed the export-D104959942 branch from 29cb7bc to 33a915b Compare May 27, 2026 04:11
@github-actions

Copy link
Copy Markdown

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit 33a915b


Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase
@github-actions

Copy link
Copy Markdown

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit 33a915b


Summary

Well-structured PR that cleanly separates per-DB state from shared global resources. The multi-DB orchestration, path management, and thread lifecycle coordination are mostly correct. No high-severity correctness bugs found after deep analysis.

High-severity findings (0):

No high-severity findings.

Full review (click to expand)

Findings

🟡 MEDIUM

M1. Remote DB behavior change in get_ev_parent_dirtools/db_crashtest.py
  • Issue: The old setup_expected_values_dir() had a is_remote_db guard: when is_remote_db=True and _TEST_EXPECTED_DIR_ENV_VAR is unset, it would NOT fall back to _TEST_DIR_ENV_VAR. The new get_ev_parent_dir() always falls back to _TEST_DIR_ENV_VAR regardless of is_remote_db. This changes behavior for remote DB test configurations.
  • Root cause: The is_remote_db condition was dropped during the refactor from setup_expected_values_dir to get_ev_parent_dir.
  • Suggested fix: Restore the is_remote_db guard:
    if not is_remote_db and (test_exp_tmpdir is None or test_exp_tmpdir == ""):
        test_exp_tmpdir = os.environ.get(_TEST_DIR_ENV_VAR)
M2. exit(1) in RunStressTestImpl terminates all DBs — db_stress_tool/db_stress_driver.cc
  • Issue: Multiple exit(1) calls (e.g., lines ~80, ~172) terminate the entire process. In multi-DB mode, a single DB's unverified-state-dir failure kills all other running DB stress tests without proper cleanup or result collection.
  • Root cause: Pre-existing single-DB assumption carried forward.
  • Suggested fix: Consider converting exit(1) calls to error returns, allowing the orchestrator to collect partial results. Alternatively, this could be a follow-up improvement since stress test failures are typically investigated manually.
M3. Missing incompatibility guard for features not yet tested with multi-DB — db_stress_tool/db_stress_tool.cc
  • Issue: Only clear_column_family_one_in and test_multi_ops_txns are guarded against num_dbs > 1. Other features that may have single-DB assumptions (e.g., remote_compaction_worker_threads, backup_one_in, checkpoint_one_in) are not guarded. While these may work correctly, they haven't been validated for multi-DB.
  • Root cause: Conservative guard set; may be intentionally minimal.
  • Suggested fix: Document which features have been validated with multi-DB. Consider adding a broader guard or a TODO for future validation.

🟢 LOW / NIT

L1. static std::once_flag persists across process lifetime — db_stress_tool/db_stress_driver.cc:116,138
  • Issue: pool_size_thread_flag and compressed_cache_thread_flag are static locals. If db_stress_tool() were ever called twice in the same process, bg threads would not start on the second invocation. Not a practical concern since db_stress runs once per process, but worth noting.
  • Suggested fix: No action needed; add a comment if desired.
L2. static port::Mutex report_mutex initialization — db_stress_tool/db_stress_stat.h:185
  • Issue: A static local mutex is used to prevent interleaved multi-DB report output. Static local initialization is thread-safe in C++11+, so this is correct. However, the mutex is never destroyed (lives until process exit), which is standard practice for this pattern.
  • Suggested fix: No action needed; correct as-is.
L3. db_label construction — db_stress_tool/db_stress_driver.cc:70
  • Issue: db_label is computed once per RunStressTestImpl call (not per log line), so performance is fine. The (FLAGS_num_dbs > 1) check uses the global flag which is appropriate.
  • Suggested fix: No action needed.
L4. Python test removed: test_prepare_expected_values_dir_resets_for_fresh_dbtools/db_crashtest_test.py
  • Issue: The prepare_expected_values_dir function was removed and its logic inlined into gen_cmd. The corresponding test was removed. The destroy_db_initially + rmtree behavior is now implicitly tested through gen_cmd, but there's no dedicated unit test for this specific behavior in the multi-DB path.
  • Suggested fix: Consider adding a test for the destroy_db_initially + multi-DB rmtree logic in gen_cmd.

Cross-Component Analysis

Context Applies? Safe? Notes
Shared block_cache across DBs Yes Yes Cache is thread-safe; sharing is intentional
Shared WriteBufferManager Yes Yes WBM is designed for cross-DB sharing
Shared rate_limiter Yes Yes Rate limiter is thread-safe
Shared compressed_secondary_cache Yes Yes CompressedCacheSetCapacityThread checks capacity (not usage); safe
PoolSizeChangeThread via call_once Yes Yes Only modifies global env thread pool; one instance is correct
Per-DB FaultInjectionTestFS Yes Yes Each DB has isolated fault injection
Per-DB DbVerificationThread Yes Yes Each DB verifies its own state independently
bg_thread ThreadState lifetime Yes Safe First DB waits for BgThreadsFinished before returning; stack variable outlives the bg thread

Positive Observations

  • Clean separation of per-DB state (paths, fault injection) from shared global resources (cache, WBM, rate limiter)
  • std::call_once correctly prevents duplicate global bg threads while keeping per-DB bg threads (continuous verification) independent
  • Good use of db_label for multi-DB log disambiguation
  • Python/C++ path ownership split is well-documented in comments
  • Proper validation of incompatible flag combinations (ValidateNumDbsFlags + finalize_and_sanitize)
  • Results vector access is race-free (each thread writes to its own index)
  • Crash callback vector is pre-sized before registration, safe for signal handler iteration
  • New Python unit tests cover flag sanitization for multi-DB

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase
@meta-codesync

meta-codesync Bot commented May 27, 2026

Copy link
Copy Markdown

This pull request has been merged in c724aeb.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment