PhraseMatcher class v2
The PhraseMatcher lets you efficiently match large terminology lists. While
the Matcher lets you match sequences based on lists of token
descriptions, the PhraseMatcher accepts match patterns in the form of Doc
objects.
PhraseMatcher.__init__ method
Create the rule-based PhraseMatcher. Setting a different attr to match on
will change the token attributes that will be compared to determine a match. By
default, the incoming Doc is checked for sequences of tokens with the same
ORTH value, i.e. the verbatim token text. Matching on the attribute LOWER
will result in case-insensitive matching, since only the lowercase token texts
are compared. In theory, it’s also possible to match on sequences of the same
part-of-speech tags or dependency labels.
If validate=True is set, additional validation is performed when patterns are
added. At the moment, it will check whether a Doc has attributes assigned that
aren’t necessary to produce the matches (for example, part-of-speech tags if the
PhraseMatcher matches on the token text). Since this can often lead to
significantly worse performance when creating the pattern, a UserWarning will
be shown.
| Name | Type | Description |
|---|---|---|
| vocab | Vocab | The vocabulary object, which must be shared with the documents the matcher will operate on. |
| max_length | int | Deprecated argument - the PhraseMatcher does not have a phrase length limit anymore. |
| attr v2.1 | int / unicode | The token attribute to match on. Defaults to ORTH, i.e. the verbatim token text. |
| validate v2.1 | bool | Validate patterns added to the matcher. |
| RETURNS | PhraseMatcher | The newly constructed object. |
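The effect of the attr argument can be sketched with a short example. It is a minimal illustration, not taken from the original page: it assumes a blank English pipeline from spacy.blank("en") so that no trained model is needed, the ID "OBAMA" and the sample texts are made up, and add is called in the positional style documented on this page (newer spaCy releases also accept a list of Doc objects as the second argument).

```python
import spacy
from spacy.matcher import PhraseMatcher

# Tokenizer-only pipeline; an assumption made so the sketch needs no model.
nlp = spacy.blank("en")

# Compare lowercase token texts instead of the verbatim ORTH values.
matcher = PhraseMatcher(nlp.vocab, attr="LOWER")

# Positional style used on this page: ID, callback (None here), pattern Docs.
matcher.add("OBAMA", None, nlp.make_doc("barack obama"))

doc = nlp.make_doc("BARACK OBAMA gave a speech")
matches = matcher(doc)
spans = [doc[start:end].text for _, start, end in matches]
```

With the default attr of ORTH, the same pattern would not be expected to match the upper-case text, since the verbatim token texts differ.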
PhraseMatcher.__call__ method
Find all token sequences matching the supplied patterns on the Doc.
| Name | Type | Description |
|---|---|---|
| doc | Doc | The document to match over. |
| RETURNS | list | A list of (match_id, start, end) tuples, describing the matches. A match tuple describes a span doc[start:end]. The match_id is the ID of the added match pattern. |
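As a rough sketch of how the returned tuples can be consumed: the example below assumes a blank English pipeline, an illustrative ID "OBAMA" and a made-up sentence. It resolves each match_id back to its string form through the shared vocabulary's string store and slices the matched span out of the Doc.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline
matcher = PhraseMatcher(nlp.vocab)
matcher.add("OBAMA", None, nlp.make_doc("Barack Obama"))

doc = nlp.make_doc("Barack Obama lifts America one last time")
results = []
for match_id, start, end in matcher(doc):
    # match_id is an integer hash; the string store maps it back to the ID.
    rule_name = nlp.vocab.strings[match_id]
    results.append((rule_name, start, end, doc[start:end].text))
```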
PhraseMatcher.pipe method
Match a stream of documents, yielding them in turn.
| Name | Type | Description |
|---|---|---|
| docs | iterable | A stream of documents. |
| batch_size | int | The number of documents to accumulate into a working set. |
| YIELDS | Doc | Documents, in order. |
PhraseMatcher.__len__ method
Get the number of rules added to the matcher. Note that this only returns the number of rules (which is the same as the number of match IDs), not the number of individual patterns.
| Name | Type | Description |
|---|---|---|
| RETURNS | int | The number of rules. |
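The difference between rules and patterns can be illustrated as follows. This is a sketch under the same assumptions as elsewhere on this page (blank English pipeline, made-up IDs and phrases): two patterns registered under one ID count as a single rule.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline
matcher = PhraseMatcher(nlp.vocab)
empty_count = len(matcher)

# Two patterns under one ID still count as a single rule.
matcher.add("OBAMA", None, nlp.make_doc("Barack Obama"), nlp.make_doc("Obama"))
one_rule_count = len(matcher)

# A second ID adds a second rule.
matcher.add("HEALTH", None, nlp.make_doc("health care reform"))
two_rule_count = len(matcher)
```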
PhraseMatcher.__contains__ method
Check whether the matcher contains rules for a match ID.
| Name | Type | Description |
|---|---|---|
| key | unicode | The match ID. |
| RETURNS | bool | Whether the matcher contains rules for this match ID. |
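A membership check is written with Python's in operator. The IDs in this short sketch are illustrative only, and the blank English pipeline is an assumption.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline
matcher = PhraseMatcher(nlp.vocab)
matcher.add("OBAMA", None, nlp.make_doc("Barack Obama"))

has_obama = "OBAMA" in matcher    # an ID that was added
has_merkel = "MERKEL" in matcher  # an ID that was never added
```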
PhraseMatcher.add method
Add a rule to the matcher, consisting of an ID key, one or more patterns, and a
callback function to act on the matches. The callback function will receive the
arguments matcher, doc, i and matches. If patterns already exist for the given
ID, the new patterns are added to them, and any existing on_match callback is
overwritten.
| Name | Type | Description |
|---|---|---|
| match_id | unicode | An ID for the thing you’re matching. |
| on_match | callable or None | Callback function to act on matches. Takes the arguments matcher, doc, i and matches. |
| *docs | Doc | Doc objects of the phrases to match. |
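The callback mechanism might be used along these lines. The sketch is hedged: the callback name collect_match, the list seen, the ID "CITY" and the sample sentence are all hypothetical, and a blank English pipeline is assumed. The callback is invoked once per match, with i pointing at the current entry in matches.

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline
seen = []  # hypothetical container for the matched texts


def collect_match(matcher, doc, i, matches):
    # matches[i] is the (match_id, start, end) tuple that triggered this call.
    match_id, start, end = matches[i]
    seen.append(doc[start:end].text)


matcher = PhraseMatcher(nlp.vocab)
# One rule with a callback and two phrase patterns.
matcher.add(
    "CITY",
    collect_match,
    nlp.make_doc("New York"),
    nlp.make_doc("San Francisco"),
)

doc = nlp.make_doc("I moved from New York to San Francisco")
matches = matcher(doc)
```

Passing None as the second argument registers the patterns without a callback, in which case the matches are only available through the returned list.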
PhraseMatcher.remove method v2.2
Remove a rule from the matcher by match ID. A KeyError is raised if the key
does not exist.
| Name | Type | Description |
|---|---|---|
| key | unicode | The ID of the match rule. |
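Removal and the KeyError behaviour can be sketched like this, again assuming a blank English pipeline and an illustrative ID:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")  # assumption: tokenizer-only pipeline
matcher = PhraseMatcher(nlp.vocab)
matcher.add("OBAMA", None, nlp.make_doc("Barack Obama"))

matcher.remove("OBAMA")
still_present = "OBAMA" in matcher

# Removing an ID that no longer exists raises a KeyError.
raised_key_error = False
try:
    matcher.remove("OBAMA")
except KeyError:
    raised_key_error = True
```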

