Pipeline Functions
merge_noun_chunks function
Merge noun chunks into a single token. Also available via the string name
"merge_noun_chunks".
| Name | Description |
|---|---|
| doc | The Doc object to process, e.g. the Doc in the pipeline. Doc |
| RETURNS | The modified Doc with merged noun chunks. Doc |
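A minimal sketch of calling the function directly, assuming spaCy v3. To avoid depending on a trained pipeline, the Doc is built by hand with illustrative part-of-speech tags and dependency heads; in practice these annotations would come from a tagger and parser.

```python
import spacy
from spacy.tokens import Doc
from spacy.pipeline import merge_noun_chunks

# A blank English pipeline supplies the vocab and the English noun chunk rules.
nlp = spacy.blank("en")

# Hand-written annotations (an assumption for illustration only):
# noun chunks need part-of-speech tags and a dependency parse.
doc = Doc(
    nlp.vocab,
    words=["The", "quick", "fox", "jumps"],
    pos=["DET", "ADJ", "NOUN", "VERB"],
    heads=[2, 2, 3, 3],  # absolute index of each token's head
    deps=["det", "amod", "nsubj", "ROOT"],
)

# Merge every noun chunk into a single token.
doc = merge_noun_chunks(doc)
tokens = [token.text for token in doc]
print(tokens)
```

With these annotations, "The quick fox" should come out as one token. In a trained pipeline, the same component can be added by its string name after the parser, e.g. `nlp.add_pipe("merge_noun_chunks")`.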
merge_entities function
Merge named entities into a single token. Also available via the string name
"merge_entities".
| Name | Description |
|---|---|
| doc | The Doc object to process, e.g. the Doc in the pipeline. Doc |
| RETURNS | The modified Doc with merged entities. Doc |
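A short sketch of adding the component by its string name, assuming spaCy v3. An entity ruler with two made-up patterns stands in for a trained named entity recognizer so that the example runs on a blank pipeline.

```python
import spacy

nlp = spacy.blank("en")

# The entity ruler stands in for a trained NER component here;
# the patterns are illustrative assumptions.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns(
    [
        {"label": "PERSON", "pattern": "Ada Lovelace"},
        {"label": "GPE", "pattern": "London"},
    ]
)

# Add the merger after the component that sets doc.ents.
nlp.add_pipe("merge_entities")

doc = nlp("Ada Lovelace was born in London")
tokens = [(token.text, token.ent_type_) for token in doc]
print(tokens)
```

Because the merger reads `doc.ents`, it needs to run after whichever component assigns the entities.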
merge_subtokens function
Merge subtokens into a single token. Also available via the string name
"merge_subtokens". As of v2.1, the parser is able to predict “subtokens” that
should be merged into one single token later on. This is especially relevant for
languages like Chinese, Japanese or Korean, where a “word” isn’t defined as a
whitespace-delimited sequence of characters. Under the hood, this component uses
the Matcher to find sequences of tokens with the dependency
label "subtok" and then merges them into a single token.
| Name | Description |
|---|---|
| doc | The Doc object to process, e.g. the Doc in the pipeline. Doc |
| label | The subtoken dependency label. Defaults to "subtok". str |
| RETURNS | The modified Doc with merged subtokens. Doc |
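A rough sketch of the merging behavior, assuming spaCy v3. No parser is involved: the Doc is constructed by hand, and the "subtok" label on the first token is an illustrative assumption standing in for a parser prediction.

```python
import spacy
from spacy.tokens import Doc
from spacy.pipeline import merge_subtokens

nlp = spacy.blank("en")

# Hand-written parse (illustration only): "hel" is marked as a subtoken
# that belongs together with the token that follows it.
doc = Doc(
    nlp.vocab,
    words=["hel", "lo", "world"],
    spaces=[False, True, False],
    heads=[1, 1, 1],  # absolute index of each token's head
    deps=["subtok", "ROOT", "dep"],
)

# The label argument defaults to "subtok"; it is passed here for clarity.
doc = merge_subtokens(doc, label="subtok")
tokens = [token.text for token in doc]
print(tokens)
```

A pipeline whose parser uses a different label for subtokens would pass that label instead.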
token_splitter function v3.0
Split tokens longer than a minimum length into shorter tokens. Intended for use with transformer pipelines where long spaCy tokens lead to input text that exceeds the transformer model's maximum length.
| Setting | Description |
|---|---|
| min_length | The minimum length for a token to be split. Defaults to 25. int |
| split_length | The length of the split tokens. Defaults to 5. int |
| RETURNS | The modified Doc with the split tokens. Doc |
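A small sketch of configuring the component on a blank pipeline, assuming spaCy v3. Both settings are passed explicitly so the result does not depend on the defaults, and the input string is an arbitrary example.

```python
import spacy

nlp = spacy.blank("en")

# Split any token of 20 or more characters into pieces of up to 5 characters.
config = {"min_length": 20, "split_length": 5}
nlp.add_pipe("token_splitter", config=config, first=True)

doc = nlp("aaaaabbbbbcccccdddddee")
tokens = [token.text for token in doc]
print(tokens)  # ['aaaaa', 'bbbbb', 'ccccc', 'ddddd', 'ee']
```

Adding the component with `first=True` places it ahead of later components such as a transformer, so they only see the shortened tokens.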

