Skip to content

env: suppress liburing TSAN false positives#14710

Closed
xingbowang wants to merge 3 commits into
facebook:mainfrom
xingbowang:2026_05_06_main
Closed

env: suppress liburing TSAN false positives#14710
xingbowang wants to merge 3 commits into
facebook:mainfrom
xingbowang:2026_05_06_main

Conversation

@xingbowang

@xingbowang xingbowang commented May 6, 2026

Copy link
Copy Markdown
Contributor

Summary:

  • Use the liburing TSAN suppressions in tools/tsan_suppressions.txt instead of defining the process-wide __tsan_default_suppressions() hook, avoiding conflicts with downstream applications.
  • Wire RocksDB TSAN make and crash-test flows to use that suppressions file by default without overriding caller-provided TSAN_OPTIONS.
  • Cover direct db_crashtest.py launches by passing the default suppressions to db_stress subprocesses.
  • Fix the GCC 16 unity-build warning in CacheItemHelper by directly initializing the no-secondary-cache helper fields instead of delegating with this.

Imported from D101303486.

Test Plan:

  • COMPILE_WITH_TSAN=1 make -j128 env_test
  • timeout 60s ./env_test --gtest_filter=EnvPosixTest.IOUringAddressReuseNoTsanFalsePositive
  • python3 tools/db_crashtest_test.py
  • python3 -m py_compile tools/db_crashtest.py tools/db_crashtest_test.py
  • CI: build-linux-unity-and-headers passed after the CacheItemHelper fix
liburing initializes io_uring mappings through a raw __sys_mmap path, so TSAN does not clear stale shadow state before setup.c reads the kernel-managed ring pointers. Suppress the known liburing setup frames while keeping RocksDB's explicit mmap annotations for later accesses.

Imported from D101303486.

Test Plan:

COMPILE_WITH_TSAN=1 make -j128 env_test

timeout 60s ./env_test --gtest_filter=EnvPosixTest.IOUringAddressReuseNoTsanFalsePositive
@meta-cla meta-cla Bot added the CLA Signed label May 6, 2026
@github-actions

github-actions Bot commented May 6, 2026

Copy link
Copy Markdown

✅ clang-tidy: No findings on changed lines

Completed in 31.4s.

@github-actions

github-actions Bot commented May 6, 2026

Copy link
Copy Markdown

Codex Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit b553120


❌ Codex review failed before producing any output.


ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase
@github-actions

github-actions Bot commented May 6, 2026

Copy link
Copy Markdown

Claude Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit b553120


Summary

This PR adds __tsan_default_suppressions() to suppress known liburing TSAN false positives during io_uring initialization. The technical rationale is sound — the false positive is real and well-documented in the comment — but the implementation mechanism has a significant compatibility issue for downstream users.

High-severity findings (1):

  • [env/io_posix.cc:61] __tsan_default_suppressions() is a global singleton (one-definition-per-program) that will silently conflict with any downstream application that also defines it, causing one set of suppressions to be silently lost.
Full review (click to expand)

Findings

🔴 HIGH

H1. ODR conflict: __tsan_default_suppressions() is a program-wide singleton — env/io_posix.cc:61
  • Issue: __tsan_default_suppressions() is a weak symbol in the TSAN runtime that can be overridden exactly once per program. RocksDB is widely used as a library (both static and shared). If a downstream application also defines __tsan_default_suppressions() (a common practice for TSAN users), the linker will silently pick one definition and discard the other. Either RocksDB's io_uring suppressions or the application's suppressions will be lost — with no compiler or linker warning.
  • Root cause: This is a library claiming a global singleton that is conventionally owned by the application. Libraries should not define __tsan_default_suppressions() because they cannot know whether the final application also needs it.
  • Suggested fix: Use a shipped suppressions file instead (e.g., tools/tsan_suppressions.txt) and document that users should set TSAN_OPTIONS=suppressions=/path/to/rocksdb_tsan_suppressions.txt (or append to their existing suppressions file). This eliminates the ODR conflict entirely and follows established industry practice. Alternatively, if a code-level solution is preferred, RocksDB could document the required suppressions in a header or README and let the application aggregate them.

🟡 MEDIUM

M1. Suppression patterns use substring matching — may mask future real races
  • Issue: TSAN suppression patterns like race:io_uring_mmap match any function whose name contains the substring io_uring_mmap. While no RocksDB functions currently match, future liburing versions could add functions like io_uring_mmap_v2() that have real races, which would be silently suppressed.
  • Suggested fix: If the __tsan_default_suppressions() approach is kept, consider documenting that these patterns should be periodically reviewed when upgrading liburing. Alternatively, more specific patterns (e.g., using called_from_lib:liburing.so if available in the TSAN version) would be safer.
M2. No removal plan or version guard
  • Issue: The suppression is permanent. When liburing eventually fixes the underlying issue (e.g., by using libc mmap instead of raw __sys_mmap), the suppression will remain active and could mask new legitimate races in those functions.
  • Suggested fix: Add a comment noting the specific liburing version(s) affected and a TODO to re-evaluate when upgrading liburing.

🟢 LOW / NIT

L1. No direct test for the suppression itself
  • Issue: The test plan references IOUringAddressReuseNoTsanFalsePositive, but that test validates the TsanAnnotateMappedMemory() annotation mechanism, not the __tsan_default_suppressions() function. The suppression is validated indirectly by the absence of TSAN warnings in CI, which is reasonable but means regressions would only show as sporadic CI failures.
  • Suggested fix: Consider adding a trivial unit test that calls __tsan_default_suppressions() (under the appropriate #ifdefs) and checks the returned string contains the expected patterns. This ensures the function exists and returns the right content.
L2. Comment quality is good
  • The explanatory comment is thorough and correctly describes the root cause, the timing issue with TsanAnnotateIOUringMemory(), and why the suppression is safe. This is a positive aspect of the PR.

Cross-Component Analysis

Context Does code execute? Assumptions hold? Action needed?
Non-TSAN builds NO N/A Correctly guarded by __SANITIZE_THREAD__
Non-io_uring builds NO N/A Correctly guarded by ROCKSDB_IOURING_PRESENT
RocksDB as static library YES NO — ODR conflict possible H1 applies
RocksDB as shared library YES NO — symbol conflict possible H1 applies
Clang TSAN YES YES port/lang.h:62 defines __SANITIZE_THREAD__ for Clang
GCC TSAN YES YES GCC defines __SANITIZE_THREAD__ natively

Positive Observations

  • The comment explaining the false positive mechanism is excellent — it documents the root cause (raw __sys_mmap bypassing TSAN interceptors), why the existing annotation is insufficient (too late), and why the suppression is safe.
  • The preprocessor guards are correct and well-scoped.
  • The approach complements the existing TsanAnnotateIOUringMemory() mechanism — the annotation handles post-init shadow state, while the suppression handles during-init reads.

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase
xingbowang added 2 commits May 6, 2026 18:08
Move the liburing TSAN suppressions into a shipped suppressions file and wire RocksDB TSAN make/crash-test flows to use it by default without overriding caller-provided TSAN_OPTIONS.
Initialize the no-secondary-cache helper fields directly instead of delegating through the secondary-cache constructor with this. This avoids GCC 16 -Wmaybe-uninitialized in the unity build while preserving the same without_secondary_compat value.
@github-actions

github-actions Bot commented May 6, 2026

Copy link
Copy Markdown

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit bf450a6


❌ Codex review failed before producing any output.


ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase
@github-actions

github-actions Bot commented May 6, 2026

Copy link
Copy Markdown

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit bf450a6


Summary

Clean, well-structured PR that ships liburing TSAN suppressions as a file and fixes a GCC 16 unity-build warning in CacheItemHelper. Both changes are correct and low-risk.

High-severity findings (0):

No high-severity findings.

Full review (click to expand)

Findings

🔴 HIGH

None.

🟡 MEDIUM

M1. User-provided TSAN_OPTIONS silently loses suppressions — Makefile, db_crashtest.py
  • Issue: If a user (or CI) sets TSAN_OPTIONS=halt_on_error=1 before invoking make or db_crashtest.py, the conditional assignment (?= / not in env) correctly preserves the user's value — but the liburing suppressions are silently dropped. The user must manually merge the suppressions= entry into their own TSAN_OPTIONS.
  • Root cause: Conditional-set semantics treat any pre-existing value as "already configured," even when it lacks a suppressions= clause.
  • Suggested fix: This is documented in the suppressions file header ("merge these entries into that application-owned file"), so it's by design. Consider adding a note in the Makefile/crash_test.mk comments pointing users to this guidance, e.g.:
    # NOTE: If you set TSAN_OPTIONS externally, merge
    # tools/tsan_suppressions.txt entries into your own file.
M2. Missing test for stress_cmd_env() when suppressions file doesn't exist — db_crashtest_test.py
  • Issue: stress_cmd_env() has an os.path.exists(_TSAN_SUPPRESSIONS_FILE) guard, but the test suite doesn't cover the branch where the file is absent (e.g., in a stripped install or out-of-tree build).
  • Root cause: Only the happy path (file exists, env unset) and preservation path (env already set) are tested.
  • Suggested fix: Add a test that temporarily renames or mocks the file path to verify stress_cmd_env() returns env without TSAN_OPTIONS when the file is missing.

🟢 LOW / NIT

L1. Skipped assertions in 2-arg CacheItemHelper constructor — advanced_cache.h:154
  • Issue: The new direct-initialization constructor bypasses the 6-arg constructor's five assertions. While all five are self-referential tautologies (since without_secondary_compat == this), they served as documentation of the constructor's contract.
  • Suggested fix: Consider adding a brief comment to the 2-arg constructor noting the invariants (e.g., // No secondary cache support: size_cb/saveto_cb/create_cb are nullptr, without_secondary_compat points to this).
L2. Suppression pattern completeness — tsan_suppressions.txt
  • Issue: Only io_uring_mmap and io_uring_setup_ring_pointers are suppressed. If liburing's internal function names change (e.g., io_uring_mmap_rings) or other functions also bypass TSAN's mmap interceptor, new false positives would appear.
  • Suggested fix: The file header already includes a "Revisit these suppressions when RocksDB upgrades to a liburing version..." note, which is appropriate. No action needed now.
L3. Two unrelated changes in one PR — general
  • Issue: The TSAN suppressions infrastructure and the CacheItemHelper GCC 16 fix are logically independent. Splitting would simplify review and rollback.
  • Suggested fix: Consider splitting for cleaner git history, though this is minor given the small scope of both changes.

Cross-Component Analysis

Context Safe? Notes
Unity builds (GCC 16) YES Primary motivation for CacheItemHelper change; warning eliminated
TSAN builds via Makefile YES TSAN_OPTIONS exported under COMPILE_WITH_TSAN guard
TSAN builds via crash_test.mk standalone YES crash_test.mk has its own COMPILE_WITH_TSAN guard
Direct db_crashtest.py usage YES stress_cmd_env() handles subprocess propagation
Non-TSAN builds YES All changes guarded by COMPILE_WITH_TSAN or file-existence check
User with custom TSAN_OPTIONS PARTIAL Suppressions dropped (by design, documented in file header)
Other test scripts (write_stress_runner.py, ldb_test.py) N/A Don't exercise io_uring paths heavily; not blocking

CacheItemHelper caller verification: 10 static/global instantiations of the 2-arg constructor were identified across cache.cc, typed_cache.h, tiered_secondary_cache.h, cache_test.cc, lru_cache_test.cc, and block_based_table_factory.cc. All produce identical runtime values with the new direct initialization.

Assumption stress-test results:

  • "Constructor equivalence" claim: Verified. Member init order matches declaration order; values identical; skipped assertions are tautologies.
  • "TSAN_OPTIONS won't override user values" claim: Verified. ?= and not in env correctly preserve existing values. Edge case of TSAN_OPTIONS="" is consistent with standard semantics.

Positive Observations

  • Good use of os.path.exists() guard for graceful degradation when suppressions file is absent.
  • Defensive os.environ.copy() in stress_cmd_env() avoids mutating global state.
  • Well-documented suppressions file with explanation of the root cause and guidance for downstream users.
  • Tests properly save/restore TSAN_OPTIONS in setUp/tearDown to avoid test pollution.
  • The _TSAN_SUPPRESSIONS_FILE uses os.path.abspath(os.path.join(os.path.dirname(__file__), ...)) for reliable path resolution regardless of working directory.

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase
@meta-codesync

meta-codesync Bot commented May 6, 2026

Copy link
Copy Markdown

@xingbowang has imported this pull request. If you are a Meta employee, you can view this in D104103800.

@meta-codesync

meta-codesync Bot commented May 7, 2026

Copy link
Copy Markdown

@xingbowang merged this pull request in 224e849.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

2 participants