Pipeline Functions · spaCy API Documentation

merge_noun_chunks function

Merge noun chunks into a single token. Also available via the string name "merge_noun_chunks".

Example

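import spacy

# Assumes a trained English pipeline with a parser is installed, e.g. en_core_web_sm.
nlp = spacy.load("en_core_web_sm")
texts = [t.text for t in nlp("I have a blue car")]
assert texts == ["I", "have", "a", "blue", "car"]

# Added last, so the component runs after the parser has set the noun chunks.
nlp.add_pipe("merge_noun_chunks")
texts = [t.text for t in nlp("I have a blue car")]
assert texts == ["I", "have", "a blue car"]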

Name	Description	Type
`doc`	The `Doc` object to process, e.g. the `Doc` in the pipeline.	`Doc`
RETURNS	The modified `Doc` with merged noun chunks.	`Doc`

merge_entities function

Merge named entities into a single token. Also available via the string name "merge_entities".

Example

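import spacy

# Assumes a trained English pipeline with a named entity recognizer is installed, e.g. en_core_web_sm.
nlp = spacy.load("en_core_web_sm")
texts = [t.text for t in nlp("I like David Bowie")]
assert texts == ["I", "like", "David", "Bowie"]

# Added last, so the component runs after the entity recognizer.
nlp.add_pipe("merge_entities")

texts = [t.text for t in nlp("I like David Bowie")]
assert texts == ["I", "like", "David Bowie"]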

Name	Description	Type
`doc`	The `Doc` object to process, e.g. the `Doc` in the pipeline.	`Doc`
RETURNS	The modified `Doc` with merged entities.	`Doc`

merge_subtokens function

Merge subtokens into a single token. Also available via the string name "merge_subtokens". As of v2.1, the parser can predict “subtokens” that should later be merged into a single token. This is especially relevant for languages like Chinese, Japanese or Korean, where a “word” isn’t defined as a whitespace-delimited sequence of characters. Under the hood, this component uses the `Matcher` to find sequences of tokens with the dependency label "subtok" and then merges each sequence into a single token.

Example

Note that this example assumes a custom Chinese model that oversegments and was trained to predict subtokens.

doc = nlp("拜托")
print([(token.text, token.dep_) for token in doc])
# [('拜', 'subtok'), ('托', 'subtok')]

nlp.add_pipe("merge_subtokens")
doc = nlp("拜托")
print([token.text for token in doc])
# ['拜托']

Name	Description	Type
`doc`	The `Doc` object to process, e.g. the `Doc` in the pipeline.	`Doc`
`label`	The subtoken dependency label. Defaults to `"subtok"`.	`str`
RETURNS	The modified `Doc` with merged subtokens.	`Doc`
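
The merging step can be pictured on plain data. The sketch below is a simplified illustration under stated assumptions, not spaCy's implementation: `merge_subtoken_runs` is a hypothetical helper that takes `(text, dep)` pairs instead of a `Doc`, treats each run of consecutive tokens carrying the label as one unit, and joins their texts without whitespace, as suits the Chinese example above. The real component works on the `Doc` through the `Matcher` and may differ in detail.

```python
from itertools import groupby
from typing import List, Tuple


def merge_subtoken_runs(
    tokens: List[Tuple[str, str]], label: str = "subtok"
) -> List[str]:
    """Hypothetical helper: merge runs of consecutive tokens labeled `label`.

    `tokens` is a list of (text, dependency_label) pairs. This is an
    illustration of the idea only, not spaCy's own code.
    """
    merged: List[str] = []
    # groupby yields maximal runs of tokens that share the same key,
    # here "is this token labeled as a subtoken?".
    for is_subtok, group in groupby(tokens, key=lambda pair: pair[1] == label):
        texts = [text for text, _ in group]
        if is_subtok:
            # Assumption: subtokens are joined without whitespace.
            merged.append("".join(texts))
        else:
            merged.extend(texts)
    return merged


print(merge_subtoken_runs([("拜", "subtok"), ("托", "subtok")]))
# ['拜托']
```

Passing a different `label` mirrors the `label` argument in the table above: only runs carrying that label are merged, and all other tokens pass through unchanged.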

token_splitter function (v3.0)

Split tokens that reach a minimum length into shorter tokens. Intended for use with transformer pipelines, where long spaCy tokens can produce input text that exceeds the transformer model's maximum length.

Example

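import spacy

# A blank English pipeline is enough here: the splitter only needs tokenized text.
nlp = spacy.blank("en")
config = {"min_length": 20, "split_length": 5}
nlp.add_pipe("token_splitter", config=config, first=True)
doc = nlp("aaaaabbbbbcccccdddddee")
print([token.text for token in doc])
# ['aaaaa', 'bbbbb', 'ccccc', 'ddddd', 'ee']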

Setting	Description	Type
`min_length`	The minimum length for a token to be split. Defaults to `25`.	`int`
`split_length`	The length of the split tokens. Defaults to `5`.	`int`
RETURNS	The modified `Doc` with the split tokens.	`Doc`
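
The splitting rule can be illustrated without spaCy. The following is a minimal plain-Python sketch, not the component's implementation: `split_long_token` is a hypothetical helper, and it assumes that a token whose length is at least `min_length` is cut into consecutive slices of `split_length` characters, with a shorter final slice when the length is not an exact multiple.

```python
from typing import List


def split_long_token(
    text: str, min_length: int = 25, split_length: int = 5
) -> List[str]:
    """Hypothetical helper mirroring the two settings in the table above."""
    # Assumption: tokens shorter than min_length are left untouched.
    if len(text) < min_length:
        return [text]
    # Cut the text into consecutive slices of at most split_length characters.
    return [text[i : i + split_length] for i in range(0, len(text), split_length)]


pieces = split_long_token("aaaaabbbbbcccccdddddee", min_length=20, split_length=5)
print(pieces)
# ['aaaaa', 'bbbbb', 'ccccc', 'ddddd', 'ee']

# A short token falls below the threshold and is returned unchanged.
print(split_long_token("hello", min_length=20, split_length=5))
# ['hello']
```

Because the pieces are consecutive slices, joining them reproduces the original token text, so no characters are lost when a token is split.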
