Pipeline Functions

Other built-in pipeline components and helpers

merge_noun_chunks function

Merge noun chunks into a single token. Also available via the string name "merge_noun_chunks".

Name   Description                                                Type
doc    The Doc object to process, e.g. the Doc in the pipeline.   Doc
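
A minimal sketch of calling the function directly on a Doc, assuming spaCy v3. To avoid depending on a trained pipeline, the dependency parse and part-of-speech tags are written by hand here; the words, heads and labels are illustrative values, not model output.

```python
import spacy
from spacy.pipeline.functions import merge_noun_chunks
from spacy.tokens import Doc

# A blank English pipeline supplies the vocab and the noun-chunk iterator.
nlp = spacy.blank("en")

# Hand-written annotations (illustrative); heads are absolute token indices.
words = ["The", "quick", "fox", "jumps"]
heads = [2, 2, 3, 3]
deps = ["det", "amod", "nsubj", "ROOT"]
pos = ["DET", "ADJ", "NOUN", "VERB"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps, pos=pos)

# Merge each noun chunk into one token and collect the resulting texts.
doc = merge_noun_chunks(doc)
merged_chunk_tokens = [token.text for token in doc]
print(merged_chunk_tokens)
```

In a trained pipeline the same effect would normally come from adding the component by its string name after the parser, since noun chunks depend on the dependency parse.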

merge_entities function

Merge named entities into a single token. Also available via the string name "merge_entities".

Name   Description                                                Type
doc    The Doc object to process, e.g. the Doc in the pipeline.   Doc
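
The string name can be used with nlp.add_pipe, as in the sketch below, which assumes spaCy v3. An entity ruler with a single made-up pattern stands in for a trained named entity recognizer so that the example needs no model download; that substitution is an assumption of this sketch, not part of the function itself.

```python
import spacy

nlp = spacy.blank("en")

# The entity ruler stands in for a trained NER component (an assumption
# made to keep the sketch free of model downloads).
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "GPE", "pattern": "New York"}])

# Add the merging component by its string name, after entities are set.
nlp.add_pipe("merge_entities")

doc = nlp("I like New York")
merged_entity_tokens = [(token.text, token.ent_type_) for token in doc]
print(merged_entity_tokens)
```

The component has to run after whatever sets doc.ents, otherwise there is nothing for it to merge.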

merge_subtokens function

Merge subtokens into a single token. Also available via the string name "merge_subtokens". As of v2.1, the parser is able to predict “subtokens” that should be merged into one single token later on. This is especially relevant for languages like Chinese, Japanese or Korean, where a “word” isn’t defined as a whitespace-delimited sequence of characters. Under the hood, this component uses the Matcher to find sequences of tokens with the dependency label "subtok" and then merges them into a single token.

Name    Description                                                Type
doc     The Doc object to process, e.g. the Doc in the pipeline.   Doc
label   The subtoken dependency label. Defaults to "subtok".       str
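
A minimal sketch of the merging behavior, assuming spaCy v3. The "subtok" label is assigned by hand instead of being predicted by a parser, and the words and heads are illustrative values chosen for this example.

```python
import spacy
from spacy.pipeline.functions import merge_subtokens
from spacy.tokens import Doc

nlp = spacy.blank("en")

# Hand-written parse (illustrative): "New" is marked as a subtoken that
# attaches to "York"; heads are absolute token indices.
words = ["New", "York", "rocks"]
heads = [1, 2, 2]
deps = ["subtok", "nsubj", "ROOT"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

# The label argument is spelled out here; "subtok" is also the default.
doc = merge_subtokens(doc, label="subtok")
merged_subtoken_texts = [token.text for token in doc]
print(merged_subtoken_texts)
```

In this sketch the token labelled "subtok" is merged together with the token that follows it, so the Doc ends up with two tokens instead of three.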

token_splitter function (v3.0)

Split tokens longer than a minimum length into shorter tokens. Intended for use with transformer pipelines, where long spaCy tokens can produce input text that exceeds the transformer model's maximum length.

Setting        Description                                                   Type
min_length     The minimum length for a token to be split. Defaults to 25.   int
split_length   The length of the split tokens. Defaults to 5.                int
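
The two settings can be passed as a config when the component is added to a pipeline. The sketch below assumes spaCy v3 and a blank English pipeline; the values 20 and 5 and the input string are chosen only to make the splitting visible.

```python
import spacy

nlp = spacy.blank("en")

# Split any token of 20 or more characters into pieces of 5 characters.
config = {"min_length": 20, "split_length": 5}
nlp.add_pipe("token_splitter", config=config, first=True)

doc = nlp("aaaaabbbbbcccccdddddee")
split_tokens = [token.text for token in doc]
print(split_tokens)
# ['aaaaa', 'bbbbb', 'ccccc', 'ddddd', 'ee']
```

Placing the splitter first means later components, such as a transformer, only ever see the shortened tokens.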