Fix right-side stripping of Unicode surrogate pairs in stripChars/trim #560

Merged
stephenamar-db merged 1 commit into databricks:master from JoshRosen:fix-unicode-trim-rstripchars
Dec 4, 2025

@JoshRosen JoshRosen commented Dec 4, 2025

This is a followup to #555 and the subsequent fix in #557.

Even after that fix, there are two remaining bugs related to stripping from the right side of a string:

  1. Stripping emoji from end: std.rstripChars("hello🎉🎉🎉", "🎉") returned "hello🎉🎉🎉" instead of "hello" (nothing stripped)
  2. Stripping ASCII after emoji: std.trim("🌍 ") returned "?" instead of "🌍" (i.e. the emoji was corrupted)

The root cause (explained by Claude) is that when iterating from the right of a string, codePointAt(str.length - 1) points to a low surrogate and it gets treated as an unpaired surrogate rather than seeking backwards to find the full code point.

The fix: use codePointBefore(end) (where end ranges from string.length down to 1) for right-to-left iteration. Unlike codePointAt(), codePointBefore() correctly reads surrogate pairs when scanning backwards.
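
To make the difference concrete, here is a minimal Java sketch comparing the two calls on a string that ends in a surrogate pair. The class and method names (`SurrogateDemo`, `lastViaCodePointAt`, `lastViaCodePointBefore`) are hypothetical and are not taken from the sjsonnet code; only `String.codePointAt` and `String.codePointBefore` are real JDK methods.

```java
// Illustrative only: not the sjsonnet implementation.
final class SurrogateDemo {
    // Reads the code point that STARTS at the last UTF-16 code unit.
    // If the string ends in a surrogate pair, that unit is the low
    // surrogate, so only the lone surrogate value comes back.
    static int lastViaCodePointAt(String s) {
        return s.codePointAt(s.length() - 1);
    }

    // Reads the code point that ENDS just before the exclusive index
    // s.length(), joining a low surrogate with the high surrogate before it.
    static int lastViaCodePointBefore(String s) {
        return s.codePointBefore(s.length());
    }

    public static void main(String[] args) {
        String s = "hello\uD83C\uDF89"; // "hello" followed by U+1F389 (party popper)
        System.out.println(Integer.toHexString(lastViaCodePointAt(s)));     // df89
        System.out.println(Integer.toHexString(lastViaCodePointBefore(s))); // 1f389
    }
}
```

The first call yields the low surrogate `0xDF89`, which can never match the strip set's `0x1F389`, which is consistent with the "nothing stripped" symptom described above.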

Fix + test are authored by Claude Code.

The unspecializedStrip function failed to correctly strip characters from
the right side of strings containing Unicode characters that are encoded as
UTF-16 surrogate pairs (such as emoji).

Two failure modes existed:
1. Stripping emoji from end: std.rstripChars("hello🎉🎉🎉", "🎉") returned
   "hello🎉🎉🎉" instead of "hello" (nothing stripped)
2. Stripping ASCII after emoji: std.trim("🌍   ") returned "?" instead of
   "🌍" (emoji corrupted)

Root cause: The original code used end = str.length - 1 with codePointAt(end).
For surrogate pairs (like emoji), this index points to the low surrogate.
When codePointAt() is called on a low surrogate position, it returns just that
surrogate value rather than the full code point, causing the loop to exit early
or the final substring to split a surrogate pair.

Fix: Use exclusive end position (end = str.length) with codePointBefore(end)
for right-to-left iteration. Unlike codePointAt(), codePointBefore() correctly
reads surrogate pairs when scanning backwards.
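
The exclusive-end loop described above could look roughly like the following Java sketch. It is an assumption-laden illustration, not the actual unspecializedStrip code: the class name `RStripSketch`, the method `rstripChars`, and the use of a `HashSet` for the strip set are all hypothetical choices made for this example.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative only: a right-strip that walks backwards by code point.
final class RStripSketch {
    static String rstripChars(String str, String chars) {
        // Collect the code points to strip, advancing by full code points.
        Set<Integer> strip = new HashSet<>();
        for (int i = 0; i < chars.length(); ) {
            int cp = chars.codePointAt(i);
            strip.add(cp);
            i += Character.charCount(cp);
        }

        // `end` is an exclusive index: it always sits on a code point boundary.
        int end = str.length();
        while (end > 0) {
            int cp = str.codePointBefore(end);
            if (!strip.contains(cp)) {
                break;
            }
            // Step back by 1 code unit for BMP characters, 2 for surrogate pairs.
            end -= Character.charCount(cp);
        }
        return str.substring(0, end);
    }

    public static void main(String[] args) {
        // "hello" followed by three U+1F389, stripping U+1F389
        System.out.println(rstripChars("hello\uD83C\uDF89\uD83C\uDF89\uD83C\uDF89", "\uD83C\uDF89")); // hello
    }
}
```

Because `end` only ever moves by `Character.charCount(cp)`, the final `substring(0, end)` cannot cut through the middle of a surrogate pair, which is the corruption seen in the second failure mode.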

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@stephenamar-db stephenamar-db merged commit 8673f8c into databricks:master Dec 4, 2025
6 checks passed
@stephenamar-db

thanks!

@JoshRosen JoshRosen deleted the fix-unicode-trim-rstripchars branch December 6, 2025 19:40