TextCategorizer

class v2

This class is a subclass of Pipe and follows the same API. The pipeline component is available in the processing pipeline via the ID "textcat".

TextCategorizer.Model classmethod

Initialize a model for the pipe. The model should implement the thinc.neural.Model API. Wrappers are under development for most major machine learning libraries.

Name | Type | Description
**kwargs | - | Parameters for initializing the model.

TextCategorizer.__init__ method

Create a new pipeline instance. In your application, you would normally use a shortcut for this and instantiate the component using its string name and nlp.create_pipe.

Name | Type | Description
vocab | Vocab | The shared vocabulary.
model | thinc.neural.Model / True | The model powering the pipeline component. If no model is supplied, the model is created when you call begin_training, from_disk or from_bytes.
exclusive_classes | bool | Make categories mutually exclusive. Defaults to False.
architecture | unicode | Model architecture to use, see architectures for details. Defaults to "ensemble".
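
To make the effect of exclusive_classes concrete, here is a minimal sketch of the difference between independent and mutually exclusive label scores. It assumes independent labels are scored with a logistic function and exclusive labels with a softmax; this page does not state which output layer the component uses, so treat that as an assumption. The function names and the logits values are hypothetical.

```python
import math


def independent_scores(logits):
    """Score each label on its own with a logistic (sigmoid) function."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


def exclusive_scores(logits):
    """Score labels jointly with a softmax, so the scores sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, -1.0]  # hypothetical raw outputs for three labels
multi_label = independent_scores(logits)   # several labels can score above 0.5
single_label = exclusive_scores(logits)    # labels compete for one unit of mass
```

With independent scores, a text can belong to several categories at once. With exclusive scores, raising one category necessarily lowers the others.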

Architectures v2.1

Text classification models can be used to solve a wide variety of problems. Differences in text length, number of labels, difficulty, and runtime performance constraints mean that no single algorithm performs well on all types of problems. To handle a wider variety of problems, the TextCategorizer object allows configuration of its model architecture, using the architecture keyword argument.

Name | Description
"ensemble" | Default: Stacked ensemble of a bag-of-words model and a neural network model. The neural network uses a CNN with mean pooling and attention. The ngram_size and attr arguments can be used to configure the feature extraction for the bag-of-words model.
"simple_cnn" | A neural network model where token vectors are calculated using a CNN. The vectors are mean pooled and used as features in a feed-forward network. This architecture is usually less accurate than the ensemble, but runs faster.
"bow" | An ngram “bag-of-words” model. This architecture should run much faster than the others, but may not be as accurate, especially if texts are short. The features extracted can be controlled using the keyword arguments ngram_size and attr. For instance, ngram_size=3 and attr="lower" would give lower-cased unigram, bigram and trigram features. 2, 3 or 4 are usually good choices of ngram size.

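As a rough illustration of what ngram_size=3 with attr="lower" means for the bag-of-words features, the sketch below enumerates lower-cased unigrams, bigrams and trigrams from a token list. It is a plain-Python approximation, not the library's feature extractor: the function name ngram_features and the boolean lower flag (standing in for attr="lower") are hypothetical.

```python
def ngram_features(tokens, ngram_size=1, lower=True):
    """Return all n-grams of length 1..ngram_size as tuples of strings."""
    if lower:
        tokens = [t.lower() for t in tokens]
    features = []
    for n in range(1, ngram_size + 1):
        # Slide a window of width n across the token sequence.
        for start in range(len(tokens) - n + 1):
            features.append(tuple(tokens[start:start + n]))
    return features


features = ngram_features(["Not", "very", "good"], ngram_size=3)
```

For this three-token input the result contains three unigrams, two bigrams and one trigram, which shows why larger ngram sizes grow the feature space quickly.
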
TextCategorizer.__call__ method

Apply the pipe to one document. The document is modified in place, and returned. This usually happens under the hood when the nlp object is called on a text and all pipeline components are applied to the Doc in order. Both __call__ and pipe delegate to the predict and set_annotations methods.

Name | Type | Description
doc | Doc | The document to process.

TextCategorizer.pipe method

Apply the pipe to a stream of documents. This usually happens under the hood when the nlp object processes texts and all pipeline components are applied to each Doc in order. Both __call__ and pipe delegate to the predict and set_annotations methods.

Name | Type | Description
stream | iterable | A stream of documents.
batch_size | int | The number of texts to buffer. Defaults to 128.
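
The delegation described for __call__ and pipe can be sketched with a toy component. Everything here is hypothetical and stands in for the real classes: ToyDoc, ToyCategorizer, the scores attribute and the question-mark rule are illustration only. What the sketch does show is the pattern itself: predict computes scores without touching the documents, set_annotations writes them, and both entry points call those two methods.

```python
from itertools import islice


class ToyDoc:
    """Hypothetical stand-in for a Doc: holds text and predicted scores."""

    def __init__(self, text):
        self.text = text
        self.scores = {}  # hypothetical attribute, for illustration only


class ToyCategorizer:
    """Hypothetical component showing the delegation pattern only."""

    labels = ("QUESTION",)

    def predict(self, docs):
        # Compute scores without modifying the documents.
        return [[1.0 if doc.text.endswith("?") else 0.0] for doc in docs]

    def set_annotations(self, docs, scores):
        # Write pre-computed scores onto the documents.
        for doc, row in zip(docs, scores):
            doc.scores = dict(zip(self.labels, row))

    def __call__(self, doc):
        scores = self.predict([doc])
        self.set_annotations([doc], scores)
        return doc  # modified in place and returned

    def pipe(self, stream, batch_size=128):
        stream = iter(stream)
        while True:
            # Buffer up to batch_size documents before predicting.
            batch = list(islice(stream, batch_size))
            if not batch:
                break
            scores = self.predict(batch)
            self.set_annotations(batch, scores)
            yield from batch


textcat = ToyCategorizer()
single = textcat(ToyDoc("Is this a question?"))
docs = list(textcat.pipe([ToyDoc("Yes."), ToyDoc("Really?"), ToyDoc("No.")], batch_size=2))
```

Keeping predict free of side effects means the same scores can be inspected or reused before anything is written to the documents.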

TextCategorizer.predict method

Apply the pipeline’s model to a batch of docs, without modifying them.

Name | Type | Description
docs | iterable | The documents to predict.

TextCategorizer.set_annotations method

Modify a batch of documents, using pre-computed scores.

Name | Type | Description
docs | iterable | The documents to modify.
scores | - | The scores to set, produced by TextCategorizer.predict.
tensors | iterable | The token representations used to predict the scores.

TextCategorizer.update method

Learn from a batch of documents and gold-standard information, updating the pipe’s model. Delegates to predict and get_loss.

Name | Type | Description
docs | iterable | A batch of documents to learn from.
golds | iterable | The gold-standard data. Must have the same length as docs.
drop | float | The dropout rate.
sgd | callable | The optimizer. Should take two arguments weights and gradient, and an optional ID.
losses | dict | Optional record of the loss during training. The value keyed by the model’s name is updated.
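
The shapes of the sgd and losses arguments can be illustrated with a small sketch. The optimizer below is a plain gradient-descent step written for this example, not the optimizer the library creates, and make_sgd, record_loss, the key argument name and the "textcat" key are all hypothetical. Accumulating the loss across batches is also an assumption; the table only says the value is updated.

```python
def make_sgd(learn_rate=0.1):
    """Return a hypothetical optimizer callable: sgd(weights, gradient, key=None)."""

    def sgd(weights, gradient, key=None):
        # Update the weights in place with a plain gradient-descent step.
        # The optional key identifies the parameter block; it is unused here.
        for i, grad in enumerate(gradient):
            weights[i] -= learn_rate * grad

    return sgd


def record_loss(losses, name, loss):
    """Add a batch loss to the running total stored under the model's name."""
    if losses is not None:
        losses.setdefault(name, 0.0)
        losses[name] += loss


weights = [1.0, -2.0]
sgd = make_sgd(learn_rate=0.5)
sgd(weights, [0.2, -0.4], key="toy_layer")

losses = {}
record_loss(losses, "textcat", 0.25)
record_loss(losses, "textcat", 0.5)
```

Passing the same losses dict to every update call in a training loop gives one running total per component, which is convenient for logging.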

TextCategorizer.get_loss method

Find the loss and gradient of loss for the batch of documents and their predicted scores.

Name | Type | Description
docs | iterable | The batch of documents.
golds | iterable | The gold-standard data. Must have the same length as docs.
scores | - | Scores representing the model’s predictions.
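
As an illustration of returning a loss together with its gradient, the sketch below uses a squared-error objective averaged over the batch. The choice of squared error is an assumption made for this example, since this page does not name the loss function, and squared_error_loss is a hypothetical helper operating on plain lists rather than model arrays.

```python
def squared_error_loss(scores, truths):
    """Return (loss, gradient) for predicted scores against gold values.

    loss     = (1 / n_docs) * sum over all entries of 0.5 * (score - truth)**2
    gradient = d loss / d score = (score - truth) / n_docs
    """
    n_docs = len(scores)
    gradient = [
        [(s - t) / n_docs for s, t in zip(score_row, truth_row)]
        for score_row, truth_row in zip(scores, truths)
    ]
    loss = sum(
        0.5 * (s - t) ** 2
        for score_row, truth_row in zip(scores, truths)
        for s, t in zip(score_row, truth_row)
    ) / n_docs
    return loss, gradient


scores = [[0.9, 0.2], [0.4, 0.6]]  # predictions for two docs, two labels
truths = [[1.0, 0.0], [0.0, 1.0]]  # gold-standard category values
loss, d_scores = squared_error_loss(scores, truths)
```

The gradient has the same shape as the scores, so it can be passed straight back through the model during an update.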

TextCategorizer.begin_training method

Initialize the pipe for training, using data examples if available. If no model has been initialized yet, the model is added.

Name | Type | Description
gold_tuples | iterable | Optional gold-standard annotations from which to construct GoldParse objects.
pipeline | list | Optional list of pipeline components that this component is part of.
sgd | callable | An optional optimizer. Should take two arguments weights and gradient, and an optional ID. Will be created via TextCategorizer.create_optimizer if not set.

TextCategorizer.create_optimizer method

Create an optimizer for the pipeline component.

TextCategorizer.use_params method contextmanager

Modify the pipe’s model, to use the given parameter values.

Name | Type | Description
params | dict | The parameter values to use in the model. At the end of the context, the original parameters are restored.
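
The swap-and-restore behaviour of a parameter context manager can be sketched with contextlib. ToyModel and its params dict are hypothetical; the sketch only demonstrates the pattern of substituting values inside the with block and restoring the originals afterwards, even if an exception is raised.

```python
from contextlib import contextmanager


class ToyModel:
    """Hypothetical model that stores its parameters in a dict."""

    def __init__(self, params):
        self.params = dict(params)

    @contextmanager
    def use_params(self, params):
        backup = dict(self.params)  # remember the original values
        self.params.update(params)  # temporarily swap in the given values
        try:
            yield self
        finally:
            self.params = backup    # always restore on exit


model = ToyModel({"W": [0.1, 0.2]})
with model.use_params({"W": [0.5, 0.5]}):
    inside = list(model.params["W"])
after = list(model.params["W"])
```

A common reason to use this pattern is evaluating or saving a model with alternative weights, such as averaged parameters, without losing the weights used for further training.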

TextCategorizer.add_label method

Add a new label to the pipe.

Name | Type | Description
label | unicode | The label to add.

TextCategorizer.to_disk method

Serialize the pipe to disk.

Name | Type | Description
path | unicode / Path | A path to a directory, which will be created if it doesn’t exist. Paths may be either strings or Path-like objects.
exclude | list | String names of serialization fields to exclude.

TextCategorizer.from_disk method

Load the pipe from disk. Modifies the object in place and returns it.

Name | Type | Description
path | unicode / Path | A path to a directory. Paths may be either strings or Path-like objects.
exclude | list | String names of serialization fields to exclude.

TextCategorizer.to_bytes method

Serialize the pipe to a bytestring.

Name | Type | Description
exclude | list | String names of serialization fields to exclude.

TextCategorizer.from_bytes method

Load the pipe from a bytestring. Modifies the object in place and returns it.

Name | Type | Description
bytes_data | bytes | The data to load from.
exclude | list | String names of serialization fields to exclude.

TextCategorizer.labels property

The labels currently added to the component.

Serialization fields

During serialization, spaCy will export several data fields used to restore different aspects of the object. If needed, you can exclude them from serialization by passing in the string names via the exclude argument.

Name | Description
vocab | The shared Vocab.
cfg | The config file. You usually don’t want to exclude this.
model | The binary model data. You usually don’t want to exclude this.
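
To show how an exclude list interacts with named serialization fields, here is a minimal sketch using JSON. ToyPipe and its storage format are hypothetical and unrelated to the library's actual binary format; only the field names vocab, cfg and model come from the table above.

```python
import json


class ToyPipe:
    """Hypothetical component with three named serialization fields."""

    def __init__(self, vocab=None, cfg=None, model=None):
        self.vocab = vocab or []
        self.cfg = cfg or {}
        self.model = model or []

    def to_bytes(self, exclude=()):
        fields = {"vocab": self.vocab, "cfg": self.cfg, "model": self.model}
        # Skip every field whose name appears in the exclude list.
        kept = {name: value for name, value in fields.items() if name not in exclude}
        return json.dumps(kept).encode("utf8")

    def from_bytes(self, bytes_data, exclude=()):
        data = json.loads(bytes_data.decode("utf8"))
        for name, value in data.items():
            if name not in exclude:
                setattr(self, name, value)
        return self  # modified in place and returned


source = ToyPipe(vocab=["good", "bad"], cfg={"labels": ["POS"]}, model=[0.1, 0.9])
payload = source.to_bytes(exclude=["vocab"])
restored = ToyPipe(vocab=["shared"]).from_bytes(payload)
```

Excluding vocab in this sketch leaves the receiving object's own vocabulary untouched, which mirrors the typical reason for excluding a shared resource while keeping cfg and model.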