Preserve whitespace in the Korean tokenizer #13974

Open
Incheonkirin wants to merge 2 commits into explosion:master from Incheonkirin:fix/ko-tokenizer-whitespace
@Incheonkirin

Fixes #13401.

The Korean tokenizer doesn't preserve whitespace — runs of spaces, newlines and tabs all come back as a single space, so doc.text stops matching the input and the offsets drift:

import spacy

nlp = spacy.blank("ko")
text = "보험금을  지급한다.\n\n면책기간\t특약"
nlp(text).text == text   # False — you get "보험금을 지급한다. 면책기간 특약"

The xx tokenizer round-trips the same string fine, so it's specific to the Korean path. The reason is in KoreanTokenizer.__call__: mecab only returns the morpheme surfaces, and the Doc is rebuilt with spaces=check_spaces(...), which is one boolean per token. A boolean can only encode "one space or none", so anything more than a single space gets dropped (and leading/trailing whitespace too).
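To make the lossy step concrete, here is a minimal pure-Python sketch of what one boolean per token can and cannot encode. It does not call spaCy or mecab: `join_with_flags` is a hypothetical stand-in for how a Doc's text is assembled from words and space flags, and the `surfaces` list is an illustrative whole-word split rather than real morpheme output.

```python
def join_with_flags(words, spaces):
    """Rebuild text from tokens: each word plus at most one trailing space."""
    return "".join(word + (" " if space else "") for word, space in zip(words, spaces))


text = "보험금을  지급한다.\n\n면책기간\t특약"

# Hypothetical surfaces; real mecab output would split these into morphemes.
surfaces = ["보험금을", "지급한다", ".", "면책기간", "특약"]

# One boolean per token: True if any whitespace followed the surface in the input.
flags = [True, False, True, True, False]

rebuilt = join_with_flags(surfaces, flags)

# The double space, the blank line and the tab have all become single spaces.
print(rebuilt == text)
print(len(text) - len(rebuilt))
```

The two characters that go missing are the second space and the second newline; the tab survives only as a space, which is why offsets after the first gap no longer line up with the input.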

I kept the existing behaviour where I could: a single space between morphemes still rides on the space flag, and only the rest (multiple spaces, newlines, tabs, leading/trailing whitespace) is kept as a whitespace token, the way the default tokenizer already represents whitespace. With that, doc.text == text holds again, the morpheme tags and lemmas are unchanged, and single-space tokenization stays exactly the same. I also dropped the now-unused check_spaces helper and added a round-trip test.
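The approach described above can be sketched in plain Python, as an illustration of the idea rather than the code in this PR. `align_whitespace` and `join_with_flags` are hypothetical names, and the sketch assumes that every surface appears in the text in order and that any gap other than exactly one space becomes a single whitespace token; the actual change may split such gaps differently.

```python
def join_with_flags(words, spaces):
    """Rebuild text from tokens: each word plus at most one trailing space."""
    return "".join(word + (" " if space else "") for word, space in zip(words, spaces))


def align_whitespace(text, surfaces):
    """Return (words, spaces) whose join reproduces `text` exactly.

    Assumes the surfaces occur in `text` in order, separated only by whitespace.
    """
    words, spaces = [], []
    pos = 0
    for surface in surfaces:
        start = text.index(surface, pos)
        gap = text[pos:start]
        if gap == " " and words:
            # A single space stays on the previous token's space flag.
            spaces[-1] = True
        elif gap:
            # Anything else (runs, newlines, tabs, leading gap) is its own token.
            words.append(gap)
            spaces.append(False)
        words.append(surface)
        spaces.append(False)
        pos = start + len(surface)
    if text[pos:]:
        # Trailing whitespace is kept as a final whitespace token.
        words.append(text[pos:])
        spaces.append(False)
    return words, spaces


text = "보험금을  지급한다.\n\n면책기간\t특약"
surfaces = ["보험금을", "지급한다", ".", "면책기간", "특약"]  # illustrative split
words, spaces = align_whitespace(text, surfaces)
print(join_with_flags(words, spaces) == text)
```

For single-spaced input such as "a b" the sketch yields the same two tokens and flags as before, which matches the intent of leaving single-space tokenization untouched.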

The Korean tokenizer collapsed every run of whitespace (multiple spaces,
newlines, tabs, and leading/trailing whitespace) into a single space, so
doc.text no longer matched the input and character offsets drifted. Rebuild the
Doc keeping a single space on the space flag and any other whitespace as its own
token, the way the default tokenizer represents whitespace. Remove the
now-unused check_spaces helper and add a round-trip regression test.