Skip to content

Unify LZ4 and LZ4HC compression levels#14819

Closed
pdillinger wants to merge 5 commits into
facebook:mainfrom
pdillinger:lz4_unification
Closed

Unify LZ4 and LZ4HC compression levels#14819
pdillinger wants to merge 5 commits into
facebook:mainfrom
pdillinger:lz4_unification

Conversation

@pdillinger

@pdillinger pdillinger commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

Summary:
kLZ4Compression and kLZ4HCCompression share the same on-disk format and decompressor, but historically kLZ4Compression only honored negative (acceleration) levels while kLZ4HCCompression only honored positive levels. This unifies them so compression_opts.level alone selects the variant: level <= 0 uses LZ4 fast (acceleration = -level) and level >= 1 uses LZ4HC (1..12), regardless of which of the two types is configured.

The configured type now only determines the default compression level (LZ4: acceleration 1, equivalent to level -1; LZ4HC: level 9).

For code simplicity, the recorded per-block type comes from the compression type derived from the level, which could differ from the configured type. To preserve the originally configured choice for debugging/tracking, it is recorded as a _type=<decimal> pseudo-option in the rocksdb.compression_options SST table property.

Out-of-range non-default levels are clamped to the nearest effective value (LZ4 acceleration capped at 65537, which also avoids signed overflow negating INT_MIN; LZ4HC level capped at 12). The cost-aware (auto-tune) compressor's LZ4 level grid is changed to negative accelerations so it actually exercises fast LZ4 (positive levels now route to LZ4HC).

Related inclusion: The ZSTD library has a discontinuity at level=0, which maps to level 3, which is more aggressive than levels 1 and 2, which are more aggressive than levels -1, -2, etc. For better friendliness to auto-tuning (etc.), we now map level 0 to be the same as level -1, so that increasing compression level numbers have non-decreasing aggressiveness.

Test Plan:
New unit tests in compression_test.cc:

  • UnifiedLZ4LZ4HCLevels: for a representative set of non-default levels, both configured types produce identical output and the same recorded type (selected by the level), levels that clamp to the same effective parameter compress identically, and each round-trips; plus per-type default-level behavior.
  • ConfiguredCompressionTypeRecordedInProperties: the _type= pseudo- option appears in the SST table property for each configured type.
  • ZSTDLevelZeroMapsToMinusOne: level 0 behaves like -1, not like 3.
Summary:
kLZ4Compression and kLZ4HCCompression share the same on-disk format and
decompressor, but historically kLZ4Compression only honored negative
(acceleration) levels while kLZ4HCCompression only honored positive
levels. This unifies them so `compression_opts.level` alone selects the
variant: level <= 0 uses LZ4 fast (acceleration = -level) and level >= 1
uses LZ4HC (1..12), regardless of which of the two types is configured.

The configured type now only determines the default compression level
(LZ4: acceleration 1, equivalent to level -1; LZ4HC: level 9).

For code simplicity, the recorded per-block type comes from the compression
type derived from the level, which could differ from the configured type.
To preserve the originally configured choice for debugging/tracking, it
is recorded as a `_type=<decimal>` pseudo-option in the
rocksdb.compression_options SST table property.

Out-of-range non-default levels are clamped to the nearest effective
value (LZ4 acceleration capped at 65537, which also avoids signed
overflow negating INT_MIN; LZ4HC level capped at 12). The cost-aware
(auto-tune) compressor's LZ4 level grid is changed to negative
accelerations so it actually exercises fast LZ4 (positive levels now
route to LZ4HC).

Test Plan:
New unit tests in compression_test.cc:
- UnifiedLZ4LZ4HCLevels: for a representative set of non-default levels,
  both configured types produce identical output and the same recorded
  type (selected by the level), levels that clamp to the same effective
  parameter compress identically, and each round-trips; plus per-type
  default-level behavior.
- ConfiguredCompressionTypeRecordedInProperties: the `_type=` pseudo-
  option appears in the SST table property for each configured type.
@pdillinger pdillinger requested a review from joshkang97 June 4, 2026 16:27
@meta-cla meta-cla Bot added the CLA Signed label Jun 4, 2026
@meta-codesync

meta-codesync Bot commented Jun 4, 2026

Copy link
Copy Markdown

@pdillinger has imported this pull request. If you are a Meta employee, you can view this in D107536580.

@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown

✅ clang-tidy: No findings on changed lines

Completed in 217.7s.

@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown

Codex Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit a19c7a1


Codex review failed before producing findings.

WARNING: proceeding, even though we could not update PATH: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase
@github-actions

github-actions Bot commented Jun 4, 2026

Copy link
Copy Markdown

Claude Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit a19c7a1


Summary

Well-designed unification of LZ4/LZ4HC compression levels that fixes a real usability problem (auto-tune levels were ineffective) and provides a cleaner level API. Decompression compatibility is verified safe. Two medium-severity issues around db_bench and behavioral documentation warrant attention.

High-severity findings (0):

No high-severity findings.

Full review (click to expand)

Findings

🟡 MEDIUM

M1. db_bench rejects unified-routed compressors -- tools/db_bench_tool.cc:3047
  • Issue: db_bench_tool.cc:3047-3051 checks compressor->GetPreferredCompressionType() != FLAGS_compression_type_e and nullifies the compressor on mismatch. With unified routing, --compression_type=lz4 --compression_level=5 returns an LZ4HC compressor whose preferred type is kLZ4HCCompression, which != kLZ4Compression, so the compressor is reset to nullptr and benchmarking silently runs uncompressed.
  • Root cause: db_bench assumes the returned compressor's preferred type matches the configured type.
  • Suggested fix: Update db_bench to accept both kLZ4Compression and kLZ4HCCompression when either LZ4 variant is configured.
M2. Release note incomplete on behavioral change -- unreleased_history/behavior_changes/unify_lz4_lz4hc_levels.md
  • Issue: Does not clearly state that existing users of kLZ4Compression with positive levels (previously silently used acceleration=1) will now get LZ4HC compression instead.
  • Suggested fix: Add a sentence clarifying that positive levels with kLZ4 now route to LZ4HC.
M3. Level=0 semantic change for kLZ4HCCompression -- util/compression.cc:1784
  • Issue: Previously, kLZ4HCCompression with level=0 used lz4hc (level sanitized to default=9 internally). Now level=0 routes to LZ4 fast with acceleration=1. This changes behavior for anyone who explicitly set level=0 with kLZ4HCCompression.
  • Suggested fix: Document this edge case or confirm intentionality.

🟢 LOW / NIT

L1. [[maybe_unused]] on LZ4AccelerationFromLevel -- util/compression.cc:853
  • Fine for non-LZ4 builds; a clarifying comment could help.
L2. Auto-tune levels functionally correct -- util/auto_tune_compressor.cc:20
  • kLZ4 now explores fast acceleration, kLZ4HC explores HC quality. Correct and intentional.
L3. _type= pseudo-option follows established conventions -- no issue.

Cross-Component Analysis

Context Safe? Notes
Block-based table read/write Yes Same decompressor for both types
BlobDB Yes Assertion checks compressor's own preferred type
Remote compaction Yes Per-block type; decompression agnostic
Auto-tune Yes Level grids now correctly exercise variants
db_bench No See M1
Dictionary compression Yes NoDict supports dict via CloneForDict() inheritance

Debunked Findings

  • Dictionary support loss (flagged by 5 agents as HIGH): Debunked -- BuiltinLZ4CompressorV2NoDict inherits dict capability via CompressorWithSimpleDictBase::MaybeCloneSpecialized() -> CloneForDict().
  • LegacyForceBuiltinCompression assertion: Debunked -- assertion checks compressor's own preferred type, which always matches output.
  • Existing test breakage: Debunked -- options_test.cc tests only verify parsing, not compression behavior.

Positive Observations

  • INT_MIN overflow protection in LZ4AccelerationFromLevel is well-considered.
  • Auto-tune fix is valuable: old {1,4,9} levels for kLZ4 were completely ineffective.
  • Thorough test coverage including extreme edge cases.

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase
@meta-codesync

meta-codesync Bot commented Jun 9, 2026

Copy link
Copy Markdown

@pdillinger has imported this pull request. If you are a Meta employee, you can view this in D107536580.

@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

Codex Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit 23e1abc


Codex review failed before producing findings.

WARNING: proceeding, even though we could not create PATH aliases: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

Claude Code Review - OBSOLETE

Superseded by a newer AI review. Expand to see the original review.

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit 23e1abc


Summary

Well-designed unification of LZ4/LZ4HC compression levels with good test coverage. The core approach is sound: level sign selects algorithm, configured type only affects default level. The ZSTD level-0 discontinuity fix and auto-tune level correction are welcome improvements. A few issues around incomplete ZSTD sanitization and a behavior change for dictionary compression with unified routing.

High-severity findings (0):

No high-severity findings.

Full review (click to expand)

Findings

🟡 MEDIUM

M1. Dictionary support lost for LZ4 fast path via unified routing — util/compression.cc:1770
  • Issue: When GetCompressor() is called with kLZ4HCCompression and level <= 0, it returns BuiltinLZ4CompressorV2NoDict. The "NoDict" name is misleading — it inherits from BuiltinLZ4CompressorV2WithDict and CAN be specialized with a dictionary via MaybeCloneSpecialized() -> CloneForDict() (inherited). So dictionary support IS preserved through the specialization path.
  • Root cause: Non-issue on deeper analysis. The NoDict variant simply doesn't carry a dictionary by default but supports dictionary specialization through inheritance. The GetCompressor() return value is the "basic" compressor; dictionary support is added later via CloneForDict().
  • Verdict: No bug. The naming is confusing but the behavior is correct.
M2. _type= pseudo-option appended to SST compression_options property — table/block_based/block_based_table_builder.cc:1128
  • Issue: The PR appends _type=<decimal> to the compression_options SST table property string. This property is stored as an opaque string in SST files and read back verbatim (table/meta_blocks.cc:427). It is used for display/debugging only. It is NOT parsed back into a CompressionOptions struct. The _compression_manager= pseudo-option already follows the same pattern (block_based_table_builder.cc:1135).
  • Suggested fix: None needed. Pattern is consistent with existing conventions.
M3. Behavior change for users with positive levels + kLZ4Compression — util/compression.cc:1770
  • Issue: Users who previously configured compression = kLZ4Compression with a positive compression_opts.level got LZ4 fast with acceleration=1 (positive levels silently ignored). Now they get LZ4HC at that level. This is a behavior change for existing configurations documented in unreleased_history/.
  • Suggested fix: Consider adding a log warning when the effective algorithm differs from the configured type for non-default levels.
M4. Auto-tune LZ4 levels: old {1,4,9} was broken — util/auto_tune_compressor.cc:18
  • Issue: The old LZ4 auto-tune levels {1, 4, 9} were all positive, meaning the old code treated them all as acceleration = 1. All three produced identical output, making auto-tuning ineffective for LZ4. The new {-1, -4, -9} correctly varies the acceleration parameter. This is a bug fix.
  • Suggested fix: None — the fix is correct.

🟢 LOW / NIT

L1. [[maybe_unused]] on LZ4AccelerationFromLevelutil/compression.cc:847
  • Issue: The protected static method may not need [[maybe_unused]] if the class definition is already inside #ifdef LZ4 guards.
  • Suggested fix: Verify whether the attribute is needed.
L2. ZSTD_FinalizeDictionary now maps level 0 to -1 — util/compression.cc:358
  • Issue: ZSTD level -1 is valid (fast mode). If someone explicitly set level=0, they previously got ZSTD default (3), now they get -1 (fast). This is the intended monotonicity improvement.
  • Suggested fix: None needed.

Cross-Component Analysis

Context Safe? Notes
BlobDB (LegacyForceBuiltinCompression) YES Assertion actual_type == GetPreferredCompressionType() holds — both come from the same compressor instance
SST property parsing YES compression_options is display-only, not parsed back. _type= follows existing _compression_manager= pattern
Rolling upgrades YES LZ4/LZ4HC share wire format and decompressor
CompactionService / Remote YES Options are serialized, not compressor objects
Dynamic SetOptions YES GetCompressor called each time

Positive Observations

  • Clean, well-commented unified routing in GetCompressor()
  • SanitizeZSTDCompressionLevel() eliminates 4 instances of duplicated code
  • LZ4AccelerationFromLevel() adds INT_MIN overflow protection previously missing
  • Comprehensive test coverage with boundary levels, clamping, round-trips
  • Auto-tune fix from {1,4,9} to {-1,-4,-9} is a genuine bug fix

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase

@joshkang97 joshkang97 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review automatically exported from Phabricator review in Meta.

@meta-codesync

meta-codesync Bot commented Jun 10, 2026

Copy link
Copy Markdown

@pdillinger has imported this pull request. If you are a Meta employee, you can view this in D107536580.

@github-actions

Copy link
Copy Markdown

🟡 Codex Code Review

Auto-triggered after CI passed — reviewing commit 9b772c6


Codex review failed before producing findings.

WARNING: proceeding, even though we could not create PATH aliases: Refusing to create helper binaries under temporary dir "/tmp" (codex_home: AbsolutePathBuf("/tmp/codex-home"))
error: the argument '--base <BRANCH>' cannot be used with '[PROMPT]'

Usage: codex exec review --commit <SHA> --base <BRANCH> --title <TITLE> --model <MODEL> --config <key=value> --dangerously-bypass-approvals-and-sandbox --output-last-message <FILE> [PROMPT]

For more information, try '--help'.

ℹ️ About this response

Generated by Codex CLI.
Review methodology: claude_md/code_review.md

Limitations:

  • Codex may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /codex-review [context] — Request a code review
  • /codex-query <question> — Ask about the PR or codebase
@github-actions

Copy link
Copy Markdown

✅ Claude Code Review

Auto-triggered after CI passed — reviewing commit 9b772c6


Summary

Well-designed unification of LZ4/LZ4HC compression levels with good test coverage. One medium-severity consistency issue with ZSTD level 0 handling in the dictionary compression path.

High-severity findings (0):

No high-severity findings.

Full review (click to expand)

Findings

🟡 MEDIUM

M1. Inconsistent ZSTD level 0 mapping in CompressionDict constructor -- util/compression.h:230
  • Issue: The new SanitizeZSTDCompressionLevel() maps ZSTD level 0 to -1 to remove a discontinuity (ZSTD internally treats 0 as "use default level 3"). This function is called in ZSTD_FinalizeDictionary() and BuiltinZSTDCompressorV2::ObtainWorkingArea(), but NOT in the CompressionDict constructor at compression.h:230 or CompressionContext::CreateNativeContext() at compression.h:319. The CompressionDict constructor is actively used by BuiltinZSTDCompressorV2 (lines 1090 and 1256 in compression.cc) to create digested dictionaries. If a user sets level=0 with ZSTD dictionary compression, ZSTD_createCDict will receive level 0 (which ZSTD treats as level 3), while the block compression via ObtainWorkingArea() will use level -1. This creates a mismatch between the dictionary training level and the compression level.
  • Root cause: The centralization of ZSTD level sanitization was applied to only 2 of 4 call sites.
  • Suggested fix: Replace the inline default-level check in CompressionDict constructor and CompressionContext::CreateNativeContext() with SanitizeZSTDCompressionLevel():
    // In CompressionDict constructor:
    level = SanitizeZSTDCompressionLevel(level);
    // (remove the if block that only handles kDefaultCompressionLevel)
    Note: CompressionContext::CreateNativeContext() appears to be legacy/unused code, so it is lower priority but should be updated for consistency.
M2. Breaking behavior change for kLZ4Compression with positive levels -- util/compression.cc:1771
  • Issue: Previously, configuring kLZ4Compression with a positive compression_opts.level (e.g., level=3) would use LZ4 fast with acceleration=1 (positive levels were clamped). After this PR, the same configuration uses LZ4HC at that level, which is significantly slower. The release note mentions this ("behavior change for previously tolerated but dubious configurations such as positive compression level with kLZ4Compression") but users may not notice this in a minor release.
  • Root cause: Intentional design decision. The PR description correctly characterizes this as a "dubious" configuration.
  • Suggested fix: Consider logging a warning when kLZ4Compression is configured with a positive level, to make the behavior change more visible to users. The existing release note in unreleased_history/behavior_changes/unify_lz4_lz4hc_levels.md does document this.

🟢 LOW / NIT

L1. [[maybe_unused]] on non-static protected method -- util/compression.cc:862
  • Issue: LZ4AccelerationFromLevel is marked [[maybe_unused]] and is a protected method. The attribute is used to suppress warnings when LZ4 is not compiled in (the method body would be unused). This is correct but slightly unconventional -- typically the #ifdef LZ4 guard would wrap the method declaration itself.
  • Suggested fix: No change needed; the current approach works correctly with conditional compilation.
L2. ZSTD level 0 "backdoor" for parallel compression -- util/compression_test.cc:2544
  • Issue: The test comment notes that ZSTD level 0 is "intentionally left as a 'backdoor' to allow parallel compression even though it otherwise behaves like the fast level -1." This is because GetRecommendedParallelThreads() checks opts_.level < 0 (not the sanitized level), so level=0 maps to level=-1 for compression but still allows parallel threads. This is tested but not documented in the public API docs.
  • Suggested fix: Consider documenting this nuance or aligning the parallel thread check with the sanitized level.
L3. SST property _type= ordering relative to _compression_manager= -- table/block_based/block_based_table_builder.cc:1128
  • Issue: The _type= pseudo-option is appended to props.compression_options before the _compression_manager= pseudo-option (which is appended later at line 1135). The ordering is consistent and both use underscore prefixes, but neither is formally documented. External tools parsing this property might need updating.
  • Suggested fix: No code change needed; the existing approach is reasonable. The underscore prefix convention provides good namespacing.
L4. Test coverage gap: dictionary compression with level-based routing -- util/compression_test.cc
  • Issue: The new UnifiedLZ4LZ4HCLevels test validates compression/decompression round-trips for various levels but does not test the interaction between dictionary compression and the unified level routing. For example: configuring kLZ4HCCompression with max_dict_bytes > 0 and a negative level would route to BuiltinLZ4CompressorV2NoDict which doesn't support dictionaries -- dict support would be silently dropped.
  • Suggested fix: Add a test case verifying dictionary compression behavior when the level routes to the "other" compressor variant. Consider whether dict support should be preserved regardless of level routing (e.g., by using BuiltinLZ4CompressorV2WithDict instead of NoDict when LZ4HC config has dict enabled but negative level).

Cross-Component Analysis

Context Affected? Impact Notes
Decompression No None Both types already share decompressor
CompactionService No None Compressor created fresh per job
WritePreparedTxnDB No None Compression is orthogonal
compression_per_level Yes Low Mixed LZ4/LZ4HC arrays now behave identically for same level
Auto-tune (CostAwareCompressor) Yes Fixed LZ4 levels correctly changed to {-1,-4,-9}
db_bench Yes Fixed CanonicalCompressionType prevents false rejection
SST property parsing Yes Low New _type= field; underscore prefix helps

Positive Observations

  • Clean centralization of ZSTD level handling via SanitizeZSTDCompressionLevel()
  • Thorough test coverage for the core unification logic including level clamping
  • INT_MIN overflow protection in LZ4AccelerationFromLevel is correct
  • Auto-tune LZ4 levels properly changed to negative values to actually exercise acceleration
  • The CanonicalCompressionType() function cleanly handles the db_bench comparison
  • Release notes accurately describe the behavioral changes
  • ZSTD level 0 fix removes a genuine usability issue for auto-tuning

ℹ️ About this response

Generated by Claude Code.
Review methodology: claude_md/code_review.md

Limitations:

  • Claude may miss context from files not in the diff
  • Large PRs may be truncated
  • Always apply human judgment to AI suggestions

Commands:

  • /claude-review [context] — Request a code review
  • /claude-query <question> — Ask about the PR or codebase
@meta-codesync meta-codesync Bot closed this in 6d4a814 Jun 11, 2026
@meta-codesync meta-codesync Bot added the Merged label Jun 11, 2026
@meta-codesync

meta-codesync Bot commented Jun 11, 2026

Copy link
Copy Markdown

@pdillinger merged this pull request in 6d4a814.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

2 participants