Fix right-side stripping of Unicode surrogate pairs in stripChars/trim #560
Merged
stephenamar-db merged 1 commit into databricks:master on Dec 4, 2025
The unspecializedStrip function failed to correctly strip characters from
the right side of strings containing characters outside the Basic
Multilingual Plane (such as emoji), which are stored as UTF-16 surrogate pairs.
Two failure modes existed:
1. Stripping emoji from end: std.rstripChars("hello🎉🎉🎉", "🎉") returned
"hello🎉🎉🎉" instead of "hello" (nothing stripped)
2. Stripping ASCII after emoji: std.trim("🌍 ") returned "?" instead of
"🌍" (emoji corrupted)
Root cause: The original code used end = str.length - 1 with codePointAt(end).
For surrogate pairs (like emoji), this index points to the low surrogate.
When codePointAt() is called on a low surrogate position, it returns just that
surrogate value rather than the full code point, causing the loop to exit early
or the final substring to split a surrogate pair.
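The difference between the two reads can be seen in isolation. The following is a minimal Java sketch, not code from this PR: the class name, helper names, and sample string are made up for illustration. It only assumes the standard java.lang.String methods codePointAt and codePointBefore.

```java
final class SurrogateDemo {
    // "hi" followed by U+1F389, which UTF-16 stores as the pair D83C DF89.
    static final String SAMPLE = "hi\uD83C\uDF89";

    // Mirrors the buggy pattern: read at the index of the last code unit.
    static int readAtLastIndex(String s) {
        return s.codePointAt(s.length() - 1);
    }

    // Mirrors the fixed pattern: read the code point that ends at an
    // exclusive end position.
    static int readBeforeEnd(String s) {
        return s.codePointBefore(s.length());
    }

    public static void main(String[] args) {
        // Prints df89: only the low surrogate is returned.
        System.out.println(Integer.toHexString(readAtLastIndex(SAMPLE)));
        // Prints 1f389: the full code point is returned.
        System.out.println(Integer.toHexString(readBeforeEnd(SAMPLE)));
    }
}
```

Because the first read yields a lone surrogate, a comparison against the code points of the strip set can never match the emoji, which is consistent with the "nothing stripped" symptom.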
Fix: Use exclusive end position (end = str.length) with codePointBefore(end)
for right-to-left iteration. Unlike codePointAt(), codePointBefore() correctly
reads surrogate pairs when scanning backwards.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Collaborator
thanks!
This is a followup to #555 and the subsequent fix in #557.
Even after that fix, there are two remaining bugs related to stripping from the right side of a string:
1. std.rstripChars("hello🎉🎉🎉", "🎉") returned "hello🎉🎉🎉" instead of "hello" (nothing stripped)
2. std.trim("🌍 ") returned "?" instead of "🌍" (i.e. the emoji was corrupted)
The root cause (explained by Claude) is that when iterating from the right of a string, codePointAt(str.length - 1) points to a low surrogate and it gets treated as an unpaired surrogate rather than seeking backwards to find the full code point.
The fix: use codePointBefore(end) (where end ranges from string.length down to 1) for right-to-left iteration. Unlike codePointAt(), codePointBefore() correctly reads surrogate pairs when scanning backwards.
Fix + test are authored by Claude Code.
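The right-to-left scan described above can be sketched as follows. This is a hedged illustration in Java, not the project's unspecializedStrip implementation: the class RightStrip and the method rstripChars are hypothetical, and building a set of code points from the strip characters is an assumption made here to keep the example short.

```java
import java.util.Set;
import java.util.stream.Collectors;

final class RightStrip {
    // Hypothetical helper: removes trailing code points of str that occur in chars.
    static String rstripChars(String str, String chars) {
        // Compare whole code points, so a surrogate pair counts as one character.
        Set<Integer> strip = chars.codePoints().boxed().collect(Collectors.toSet());
        int end = str.length(); // exclusive end position
        while (end > 0) {
            int cp = str.codePointBefore(end); // reads a full pair when one ends here
            if (!strip.contains(cp)) {
                break;
            }
            end -= Character.charCount(cp); // step back by 1 or 2 code units
        }
        return str.substring(0, end);
    }

    public static void main(String[] args) {
        String party = "\uD83C\uDF89"; // U+1F389
        String globe = "\uD83C\uDF0D"; // U+1F30D
        System.out.println(rstripChars("hello" + party + party + party, party)); // hello
        System.out.println(rstripChars(globe + " ", " ").equals(globe));         // true
    }
}
```

Stepping back by Character.charCount(cp) keeps end on a code point boundary, so the final substring call cannot cut a surrogate pair in half.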