Fix right-side stripping of Unicode surrogate pairs in stripChars/trim by JoshRosen · Pull Request #560 · databricks/sjsonnet

JoshRosen · 2025-12-04T09:16:03Z

This is a followup to #555 and the subsequent fix in #557.

Even after that fix, there are two remaining bugs related to stripping from the right side of a string:

Stripping emoji from end: std.rstripChars("hello🎉🎉🎉", "🎉") returned "hello🎉🎉🎉" instead of "hello" (nothing stripped)
Stripping ASCII after emoji: std.trim("🌍 ") returned "?" instead of "🌍" (i.e. the emoji was corrupted)

The root cause (explained by Claude): when iterating from the right of a string that ends in a surrogate pair, index str.length - 1 points to the low surrogate. codePointAt() at that index returns the low surrogate on its own, as if it were unpaired, rather than seeking backwards to find the full code point.
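
The behavior described above can be reproduced with plain java.lang.String calls. This is a minimal sketch, not code from sjsonnet: the class and method names (SurrogateDemo, codePointAtLastIndex, codePointAtPairStart) are made up for illustration, and the emoji is written as \u escapes so the example does not depend on source file encoding.

```java
final class SurrogateDemo {
    // Reads the code point at the last char index, as a naive right-to-left scan would.
    static int codePointAtLastIndex(String s) {
        return s.codePointAt(s.length() - 1);
    }

    // Reads the code point two chars from the end. This is only correct when the
    // string is known to end in a surrogate pair; it is here for comparison.
    static int codePointAtPairStart(String s) {
        return s.codePointAt(s.length() - 2);
    }

    public static void main(String[] args) {
        String s = "hello\uD83C\uDF89"; // "hello" followed by U+1F389 (party popper)

        int last = codePointAtLastIndex(s);
        // The last index holds only the low surrogate half of the pair.
        System.out.printf("codePointAt(length - 1) = U+%04X, low surrogate: %b%n",
                last, Character.isLowSurrogate((char) last));

        // Starting from the high surrogate yields the full code point.
        System.out.printf("codePointAt(length - 2) = U+%04X%n", codePointAtPairStart(s));
    }
}
```

The lone low surrogate never matches the code points in the strip set, which is consistent with the "nothing stripped" symptom in the first bug.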

The fix: use codePointBefore(end) (where end ranges from string.length down to 1) for right-to-left iteration. Unlike codePointAt(), codePointBefore() correctly reads surrogate pairs when scanning backwards.
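
A right-to-left loop in the style described above might look like the following. This is a sketch under stated assumptions, not the actual sjsonnet implementation: RightStripSketch and its rstripChars method are hypothetical names, and the strip set is passed as a plain string whose membership is checked with String.indexOf(int), which accepts supplementary code points.

```java
final class RightStripSketch {
    // Hypothetical helper: strips trailing code points of `str` that occur in `chars`.
    static String rstripChars(String str, String chars) {
        int end = str.length(); // exclusive end position, as in the fix description
        while (end > 0) {
            // Reads the full code point that ends just before `end`,
            // combining a low surrogate with its preceding high surrogate.
            int cp = str.codePointBefore(end);
            if (chars.indexOf(cp) < 0) {
                break; // first code point from the right that is not in the strip set
            }
            // Step back by 1 char for BMP code points, 2 for surrogate pairs.
            end -= Character.charCount(cp);
        }
        return str.substring(0, end);
    }

    public static void main(String[] args) {
        String party = "\uD83C\uDF89"; // U+1F389
        String globe = "\uD83C\uDF0D"; // U+1F30D
        System.out.println(rstripChars("hello" + party + party + party, party));
        System.out.println(rstripChars(globe + " ", " "));
    }
}
```

Because `end` only ever moves by whole code points, the final substring cannot cut a surrogate pair in half, which is what corrupted the emoji in the std.trim case.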

Fix + test are authored by Claude Code.

The unspecializedStrip function failed to correctly strip characters from the right side of strings containing supplementary Unicode characters (such as emoji, which are encoded as surrogate pairs). Two failure modes existed:

1. Stripping emoji from end: std.rstripChars("hello🎉🎉🎉", "🎉") returned "hello🎉🎉🎉" instead of "hello" (nothing stripped)
2. Stripping ASCII after emoji: std.trim("🌍 ") returned "?" instead of "🌍" (emoji corrupted)

Root cause: The original code used end = str.length - 1 with codePointAt(end). For surrogate pairs (like emoji), this index points to the low surrogate. When codePointAt() is called on a low surrogate position, it returns just that surrogate value rather than the full code point, causing the loop to exit early or the final substring to split a surrogate pair.

Fix: Use exclusive end position (end = str.length) with codePointBefore(end) for right-to-left iteration. Unlike codePointAt(), codePointBefore() correctly reads surrogate pairs when scanning backwards.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

stephenamar-db · 2025-12-04T17:27:15Z

thanks!

stephenamar-db merged commit 8673f8c into databricks:master Dec 4, 2025
6 checks passed

JoshRosen deleted the fix-unicode-trim-rstripchars branch December 6, 2025 19:40


JoshRosen commented Dec 4, 2025 •

edited

Loading

Uh oh!

stephenamar-db commented Dec 4, 2025

Labels

2 participants

Conversation

JoshRosen commented Dec 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

stephenamar-db commented Dec 4, 2025

Labels

2 participants

JoshRosen commented Dec 4, 2025 •

edited

Loading