Create a GoldParse. The TextCategorizer component
expects positive examples of a label to have the value 1.0 and negative examples
to have the value 0.0. Labels not in the dictionary are treated as
missing, so the gradient for those labels will be zero.
| Name | Type | Description |
| --- | --- | --- |
| doc | Doc | The document the annotations refer to. |
| words | iterable | A sequence of unicode word strings. |
| tags | iterable | A sequence of strings, representing tag annotations. |
| heads | iterable | A sequence of integers, representing syntactic head offsets. |
| deps | iterable | A sequence of strings, representing the syntactic relation types. |
| entities | iterable | A sequence of named entity annotations, either as BILUO tag strings, or as (start_char, end_char, label) tuples, representing the entity positions. If BILUO tag strings, you can specify missing values by setting the tag to None. |
| cats | dict | Labels for text classification. Each key in the dictionary is a string label for the category and each value is 1.0 (positive) or 0.0 (negative). |
| links | dict | Labels for entity linking. A dict with (start_char, end_char) keys, and the values being dicts with kb_id:value entries, representing external KB IDs mapped to either 1.0 (positive) or 0.0 (negative). |
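The treatment of missing category labels described above can be sketched in plain Python. This is an illustration only, not the library's implementation: the function name `cats_gradient`, the example labels, and the use of a simple `score - target` difference as the gradient are all assumptions made for the example.

```python
def cats_gradient(scores, cats):
    """Return a per-label gradient for predicted scores.

    Hypothetical helper: labels present in `cats` contribute
    `score - target`; labels absent from `cats` are treated as
    missing and contribute a gradient of 0.0.
    """
    gradient = {}
    for label, score in scores.items():
        if label in cats:
            gradient[label] = score - cats[label]
        else:
            # Missing annotation: no learning signal for this label.
            gradient[label] = 0.0
    return gradient


# Predicted scores for three labels (made-up values).
scores = {"SPORTS": 0.75, "POLITICS": 0.25, "TECH": 0.5}
# Gold annotation in the `cats` format: 1.0 is positive, 0.0 is negative.
# "TECH" is left out on purpose, so it counts as missing.
cats = {"SPORTS": 1.0, "POLITICS": 0.0}

gradient = cats_gradient(scores, cats)
print(gradient)
```

Leaving a label out of the dictionary is therefore different from setting it to 0.0: the former says nothing about the label, while the latter marks the example as an explicit negative.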
Convert a list of Doc objects into the
JSON-serializable format used by the
spacy train command. Each input doc will be treated as a ‘paragraph’ in the output doc.
Encode labelled spans into per-token tags, using the
BILUO scheme (Begin, In, Last, Unit, Out). Returns a
list of unicode strings, describing the tags. Each tag string will be of the
form of either "", "O" or "{action}-{label}", where action is one of
"B", "I", "L", "U". The string "-" is used where the entity offsets
don’t align with the tokenization in the Doc object. The training algorithm
will view these as missing values. O denotes a non-entity token. B denotes
the beginning of a multi-token entity, I the inside of an entity of three or
more tokens, and L the end of an entity of two or more tokens. U denotes a
single-token entity.
| Name | Type | Description |
| --- | --- | --- |
| doc | Doc | The document that the entity offsets refer to. The output tags will refer to the token boundaries within the document. |
| entities | iterable | A sequence of (start, end, label) triples. start and end should be character-offset integers denoting the slice into the original string. |
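To make the encoding concrete, here is a minimal, library-free sketch of the BILUO scheme. It is not the library's implementation: `token_spans` and `biluo_from_offsets` are hypothetical helpers, tokens are assumed to be whitespace-delimited, and the handling of misaligned offsets (marking every overlapping token with "-") is an assumption based on the description above.

```python
import re


def token_spans(text):
    """Character (start, end) span of each whitespace-delimited token."""
    return [m.span() for m in re.finditer(r"\S+", text)]


def biluo_from_offsets(spans, entities):
    """Encode (start, end, label) character offsets as per-token BILUO tags."""
    starts = {start: i for i, (start, _) in enumerate(spans)}
    ends = {end: i for i, (_, end) in enumerate(spans)}
    tags = ["O"] * len(spans)
    for start, end, label in entities:
        first, last = starts.get(start), ends.get(end)
        if first is None or last is None or first > last:
            # Offsets do not line up with token boundaries:
            # mark every overlapping token as a missing value.
            for i, (tok_start, tok_end) in enumerate(spans):
                if tok_start < end and tok_end > start:
                    tags[i] = "-"
        elif first == last:
            tags[first] = f"U-{label}"  # single-token entity
        else:
            tags[first] = f"B-{label}"  # first token
            for i in range(first + 1, last):
                tags[i] = f"I-{label}"  # inner tokens
            tags[last] = f"L-{label}"  # last token
    return tags


text = "New York City is big"
spans = token_spans(text)
aligned = biluo_from_offsets(spans, [(0, 13, "GPE")])
single = biluo_from_offsets(spans, [(9, 13, "GPE")])
misaligned = biluo_from_offsets(spans, [(0, 6, "GPE")])
print(aligned)     # ['B-GPE', 'I-GPE', 'L-GPE', 'O', 'O']
print(single)      # ['O', 'O', 'U-GPE', 'O', 'O']
print(misaligned)  # ['-', '-', 'O', 'O', 'O']
```

In the last call the end offset 6 falls inside the token "York", so no token boundary matches and the affected tokens are tagged "-" rather than guessed.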
Convert per-token tags following the BILUO scheme into
character-offset entity annotations.
| Name | Type | Description |
| --- | --- | --- |
| doc | Doc | The document that the BILUO tags refer to. |
| entities | iterable | A sequence of BILUO tags with each tag describing one token. Each tag string will be of the form of either "", "O" or "{action}-{label}", where action is one of "B", "I", "L", "U". |
| RETURNS | list | A sequence of (start, end, label) triples. start and end will be character-offset integers denoting the slice into the original string. |
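The reverse direction can be sketched the same way. Again this is an illustration under stated assumptions rather than the library's code: `offsets_from_biluo` is a hypothetical helper, the token character spans are given as a literal list, and tags that are empty, "O", "-" or None are simply skipped.

```python
def offsets_from_biluo(spans, tags):
    """Decode per-token BILUO tags into (start, end, label) character offsets."""
    offsets = []
    open_start = None  # character start of the entity currently being read
    open_label = None
    for (start, end), tag in zip(spans, tags):
        if tag in (None, "", "O", "-"):
            continue  # non-entity token or missing value
        action, label = tag.split("-", 1)
        if action == "U":
            offsets.append((start, end, label))
        elif action == "B":
            open_start, open_label = start, label
        elif action == "L" and open_start is not None:
            offsets.append((open_start, end, open_label))
            open_start = None
        # "I" tokens need no action: the entity stays open until "L".
    return offsets


# Character spans of the tokens in "New York City is big".
spans = [(0, 3), (4, 8), (9, 13), (14, 16), (17, 20)]
tags = ["B-GPE", "I-GPE", "L-GPE", "O", "O"]

offsets = offsets_from_biluo(spans, tags)
print(offsets)  # [(0, 13, 'GPE')]
```

Slicing the original string with the returned offsets, for example `"New York City is big"[0:13]`, gives back the entity text.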
Convert per-token tags following the BILUO scheme into
Span objects. This can be used to create entity spans from
token-based tags, e.g. to overwrite doc.ents.
| Name | Type | Description |
| --- | --- | --- |
| doc | Doc | The document that the BILUO tags refer to. |
| entities | iterable | A sequence of BILUO tags with each tag describing one token. Each tag string will be of the form of either "", "O" or "{action}-{label}", where action is one of "B", "I", "L", "U". |
| RETURNS | list | A sequence of Span objects with added entity labels. |
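Because spans are defined over tokens rather than characters, the same decoding can be expressed with token indices. The sketch below uses a hypothetical `TokenSpan` named tuple as a stand-in for a real Span object, with an exclusive end index; it is an assumption-laden illustration, not the library's implementation.

```python
from collections import namedtuple

# Hypothetical stand-in for a labelled span: token indices, end exclusive.
TokenSpan = namedtuple("TokenSpan", ["start", "end", "label"])


def token_spans_from_biluo(tags):
    """Decode per-token BILUO tags into labelled token-index spans."""
    spans = []
    open_start = None
    open_label = None
    for i, tag in enumerate(tags):
        if tag in (None, "", "O", "-"):
            continue  # non-entity token or missing value
        action, label = tag.split("-", 1)
        if action == "U":
            spans.append(TokenSpan(i, i + 1, label))
        elif action == "B":
            open_start, open_label = i, label
        elif action == "L" and open_start is not None:
            spans.append(TokenSpan(open_start, i + 1, open_label))
            open_start = None
    return spans


words = ["The", "Acme", "Corp", "in", "Paris"]
tags = ["O", "B-ORG", "L-ORG", "O", "U-LOC"]

entity_spans = token_spans_from_biluo(tags)
for span in entity_spans:
    print(span.label, words[span.start:span.end])
```

Each resulting span can be used to slice the token sequence directly, which is what makes this form convenient for assigning entities back onto a document.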