GoldParse

class
A collection for training annotations

GoldParse.__init__ method

Create a GoldParse. The TextCategorizer component expects true examples of a label to have the value 1.0, and negative examples of a label to have the value 0.0. Labels not in the dictionary are treated as missing – the gradient for those labels will be zero.

| Name | Type | Description |
| --- | --- | --- |
| doc | Doc | The document the annotations refer to. |
| words | iterable | A sequence of unicode word strings. |
| tags | iterable | A sequence of strings, representing tag annotations. |
| heads | iterable | A sequence of integers, representing syntactic head offsets. |
| deps | iterable | A sequence of strings, representing the syntactic relation types. |
| entities | iterable | A sequence of named entity annotations, either as BILUO tag strings, or as (start_char, end_char, label) tuples, representing the entity positions. If BILUO tag strings, you can specify missing values by setting the tag to None. |
| cats | dict | Labels for text classification. Each key in the dictionary is a string label for the category and each value is 1.0 (positive) or 0.0 (negative). |
| links | dict | Labels for entity linking. A dict with (start_char, end_char) keys, and the values being dicts with kb_id:value entries, representing external KB IDs mapped to either 1.0 (positive) or 0.0 (negative). |
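
The treatment of missing labels in cats can be illustrated with a small pure-Python sketch. It assumes a squared-error loss, which is an assumption made for illustration and is not stated on this page. The function masked_gradient and the label names are hypothetical and are not part of spaCy's API.

```python
def masked_gradient(scores, cats, labels):
    """Gradient of 0.5 * (score - truth) ** 2 for each label (illustrative only).

    Labels that are absent from `cats` are treated as missing, so their
    gradient is zero and they do not influence the update.
    """
    grad = []
    for label, score in zip(labels, scores):
        if label in cats:
            grad.append(score - cats[label])
        else:
            grad.append(0.0)  # missing annotation: no learning signal
    return grad


labels = ["POSITIVE", "NEGATIVE", "SPAM"]   # hypothetical label set
cats = {"POSITIVE": 1.0, "NEGATIVE": 0.0}   # "SPAM" is deliberately left out
scores = [0.8, 0.3, 0.6]                    # made-up model outputs
grad = masked_gradient(scores, cats, labels)
```

Here only the POSITIVE and NEGATIVE scores receive a non-zero gradient; the SPAM score is left untouched because no annotation was given for it.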

GoldParse.__len__ method

Get the number of gold-standard tokens.

GoldParse.is_projective property

Whether the provided syntactic annotations form a projective dependency tree.

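
As a rough illustration of what projectivity means, the sketch below checks a list of head indices in pure Python. It is not spaCy's implementation. It assumes that heads holds absolute token indices and that the root token points to itself; both are assumptions made for this example, and the function name is hypothetical.

```python
def is_projective(heads):
    """Return True if no dependency arc spans a token outside its subtree.

    heads[i] is the index of token i's head; the root points to itself.
    """
    def has_ancestor(tok, ancestor):
        seen = set()
        while tok != ancestor:
            if heads[tok] == tok or tok in seen:  # reached the root or a cycle
                return False
            seen.add(tok)
            tok = heads[tok]
        return True

    for dep, head in enumerate(heads):
        if head == dep:
            continue  # root token has no incoming arc
        lo, hi = sorted((dep, head))
        # Every token strictly between head and dependent must descend from head.
        for between in range(lo + 1, hi):
            if not has_ancestor(between, head):
                return False
    return True


# "I like London ." with every token attached to "like"
projective = is_projective([1, 1, 1, 1])
# "A hearing is scheduled on the issue today", where "on" attaches to "hearing"
non_projective = is_projective([1, 3, 3, 3, 1, 6, 4, 3])
```

In the second example the arc between "hearing" and "on" spans "is" and "scheduled", which are not descendants of "hearing", so the tree is not projective.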

Attributes

| Name | Type | Description |
| --- | --- | --- |
| words | list | The words. |
| tags | list | The part-of-speech tag annotations. |
| heads | list | The syntactic head annotations. |
| labels | list | The syntactic relation-type annotations. |
| ner | list | The named entity annotations as BILUO tags. |
| cand_to_gold | list | The alignment from candidate tokenization to gold tokenization. |
| gold_to_cand | list | The alignment from gold tokenization to candidate tokenization. |
| cats (v2.0) | dict | Keys in the dictionary are string category labels with values 1.0 or 0.0. |
| links (v2.2) | dict | Keys in the dictionary are (start_char, end_char) tuples, and the values are dictionaries with kb_id:value entries. |

Utilities

gold.docs_to_json function

Convert a list of Doc objects into the JSON-serializable format used by the spacy train command. Each input doc will be treated as a ‘paragraph’ in the output doc.

| Name | Type | Description |
| --- | --- | --- |
| docs | iterable / Doc | The Doc object(s) to convert. |
| id | int | ID to assign to the JSON. Defaults to 0. |

gold.align function

Calculate alignment tables between two tokenizations, using the Levenshtein algorithm. The alignment is case-insensitive.

| Name | Type | Description |
| --- | --- | --- |
| tokens_a | list | String values of candidate tokens to align. |
| tokens_b | list | String values of reference tokens to align. |

The returned tuple contains the following alignment information:

| Name | Type | Description |
| --- | --- | --- |
| cost | int | The number of misaligned tokens. |
| a2b | numpy.ndarray[ndim=1, dtype='int32'] | One-to-one mappings of indices in tokens_a to indices in tokens_b. |
| b2a | numpy.ndarray[ndim=1, dtype='int32'] | One-to-one mappings of indices in tokens_b to indices in tokens_a. |
| a2b_multi | dict | A dictionary mapping indices in tokens_a to indices in tokens_b, where multiple tokens of tokens_a align to the same token of tokens_b. |
| b2a_multi | dict | A dictionary mapping indices in tokens_b to indices in tokens_a, where multiple tokens of tokens_b align to the same token of tokens_a. |
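
To give a feel for the one-to-one part of the alignment, here is a simplified pure-Python sketch of a case-insensitive, token-level Levenshtein alignment. It is not spaCy's implementation: the function name align_tokens is hypothetical, the returned cost is the plain edit distance (which may differ from the cost spaCy reports), unaligned positions are marked with -1 as an assumption, and the multi-token tables are left out.

```python
def align_tokens(tokens_a, tokens_b):
    """Return (edit_distance, a2b, b2a) for two tokenizations (illustrative)."""
    a = [t.lower() for t in tokens_a]
    b = [t.lower() for t in tokens_b]
    n, m = len(a), len(b)

    # dist[i][j] is the edit distance between a[:i] and b[:j].
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,        # delete a token from a
                dist[i][j - 1] + 1,        # insert a token from b
                dist[i - 1][j - 1] + sub,  # match or substitute
            )

    # Walk back through the table; only exact matches become alignments.
    a2b = [-1] * n
    b2a = [-1] * m
    i, j = n, m
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            a2b[i - 1] = j - 1
            b2a[j - 1] = i - 1
            i, j = i - 1, j - 1
        elif dist[i][j] == dist[i - 1][j - 1] + 1:
            i, j = i - 1, j - 1
        elif dist[i][j] == dist[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return dist[n][m], a2b, b2a


other_tokens = ["Obama", "'", "s", "podcast"]
reference_tokens = ["obama", "'s", "podcast"]
cost, a2b, b2a = align_tokens(other_tokens, reference_tokens)
```

In this example "Obama" and "podcast" align one-to-one, while "'" and "s" have no single counterpart. These are the cases that the a2b_multi and b2a_multi dictionaries described above are meant to cover.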

gold.biluo_tags_from_offsets function

Encode labelled spans into per-token tags, using the BILUO scheme (Begin, In, Last, Unit, Out). Returns a list of unicode strings describing the tags. Each tag string is either "", "O" or "{action}-{label}", where action is one of "B", "I", "L", "U". The string "-" is used where the entity offsets don’t align with the tokenization in the Doc object; the training algorithm treats these as missing values.

O denotes a non-entity token. B denotes the beginning of a multi-token entity, I the inside of an entity of three or more tokens, and L the end of an entity of two or more tokens. U denotes a single-token entity.

| Name | Type | Description |
| --- | --- | --- |
| doc | Doc | The document that the entity offsets refer to. The output tags will refer to the token boundaries within the document. |
| entities | iterable | A sequence of (start, end, label) triples. start and end should be character-offset integers denoting the slice into the original string. |
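
The encoding can be sketched in pure Python without a Doc object. The example below is an illustration of the scheme, not spaCy's code: the regex tokenizer stands in for a real tokenizer, and the names token_offsets and biluo_from_offsets are hypothetical.

```python
import re


def token_offsets(text):
    """Stand-in tokenizer: (start_char, end_char) for each word or punctuation mark."""
    return [(m.start(), m.end()) for m in re.finditer(r"\w+|[^\w\s]", text)]


def biluo_from_offsets(tok_spans, entities):
    """Map (start_char, end_char, label) triples onto per-token BILUO tags."""
    starts = {start: i for i, (start, _) in enumerate(tok_spans)}
    ends = {end: i for i, (_, end) in enumerate(tok_spans)}
    tags = ["-"] * len(tok_spans)  # "-" means missing / misaligned

    entity_chars = set()
    for start_char, end_char, label in entities:
        entity_chars.update(range(start_char, end_char))
        first = starts.get(start_char)
        last = ends.get(end_char)
        if first is None or last is None:
            continue  # offsets do not fall on token boundaries
        if first == last:
            tags[first] = "U-" + label
        else:
            tags[first] = "B-" + label
            for i in range(first + 1, last):
                tags[i] = "I-" + label
            tags[last] = "L-" + label

    # Tokens that share no characters with any entity are plain "O".
    for i, (start, end) in enumerate(tok_spans):
        if not any(c in entity_chars for c in range(start, end)):
            tags[i] = "O"
    return tags


single = biluo_from_offsets(token_offsets("I like London."), [(7, 13, "LOC")])
spans = token_offsets("Alex visited New York City.")
multi = biluo_from_offsets(spans, [(13, 26, "GPE")])
misaligned = biluo_from_offsets(spans, [(13, 20, "GPE")])  # ends inside "York"
```

The last call shows the misaligned case: the tokens touched by the entity keep the "-" tag instead of being marked as "O", so they can be treated as missing rather than as negative examples.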

gold.offsets_from_biluo_tags function

Convert per-token tags following the BILUO scheme into entity offsets.

| Name | Type | Description |
| --- | --- | --- |
| doc | Doc | The document that the BILUO tags refer to. |
| entities | iterable | A sequence of BILUO tags with each tag describing one token. Each tag string will be of the form of either "", "O" or "{action}-{label}", where action is one of "B", "I", "L", "U". |
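
The reverse direction can be sketched the same way. The following pure-Python example is only an illustration of the idea: token boundaries are passed in as a list of (start_char, end_char) pairs instead of a Doc, and offsets_from_biluo is a hypothetical name.

```python
def offsets_from_biluo(tok_spans, tags):
    """Turn per-token BILUO tags into (start_char, end_char, label) triples."""
    entities = []
    start_tok = None
    for i, tag in enumerate(tags):
        if not tag or tag in ("O", "-"):
            start_tok = None  # outside an entity, or a missing value
            continue
        action, label = tag.split("-", 1)
        if action == "U":
            entities.append((tok_spans[i][0], tok_spans[i][1], label))
            start_tok = None
        elif action == "B":
            start_tok = i
        elif action == "L" and start_tok is not None:
            entities.append((tok_spans[start_tok][0], tok_spans[i][1], label))
            start_tok = None
        # "I" tags simply continue the entity opened by "B"
    return entities


# Character boundaries of the tokens in "Alex visited New York City."
tok_spans = [(0, 4), (5, 12), (13, 16), (17, 21), (22, 26), (26, 27)]
tags = ["U-PERSON", "O", "B-GPE", "I-GPE", "L-GPE", "O"]
entities = offsets_from_biluo(tok_spans, tags)
```

Each resulting triple slices the original string, so "Alex visited New York City."[13:26] gives "New York City".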

gold.spans_from_biluo_tags function (v2.1)

Convert per-token tags following the BILUO scheme into Span objects. This can be used to create entity spans from token-based tags, e.g. to overwrite the doc.ents.

| Name | Type | Description |
| --- | --- | --- |
| doc | Doc | The document that the BILUO tags refer to. |
| entities | iterable | A sequence of BILUO tags with each tag describing one token. Each tag string will be of the form of either "", "O" or "{action}-{label}", where action is one of "B", "I", "L", "U". |
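
A span is a token-level slice with a label, so the conversion can be illustrated with plain token indices. The sketch below is pure Python and does not build real Span objects; the (start_token, end_token, label) tuples and the name token_spans_from_biluo are assumptions made for this example.

```python
def token_spans_from_biluo(tags):
    """Turn BILUO tags into (start_token, end_token, label), end exclusive."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if not tag or tag in ("O", "-"):
            start = None
            continue
        action, label = tag.split("-", 1)
        if action == "U":
            spans.append((i, i + 1, label))
            start = None
        elif action == "B":
            start = i
        elif action == "L" and start is not None:
            spans.append((start, i + 1, label))
            start = None
    return spans


words = ["Alex", "visited", "New", "York", "City", "."]
tags = ["U-PERSON", "O", "B-GPE", "I-GPE", "L-GPE", "O"]
spans = token_spans_from_biluo(tags)
entity_texts = [" ".join(words[start:end]) for start, end, _ in spans]
```

The token slices play the role that Span boundaries would play when assigning entities to a document.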