Lemma are not (always) lowercased in spacy 2.1 #3256

thomasopsomer · 2019-02-11T02:59:59Z

How to reproduce the behaviour

The behaviour of lemma_ changed between 2.0 and 2.1:

doc = nlp("Wells Fargo Outages Hit Online and Mobile Banking.")
[x.lemma_ for x in doc]
# ["Wells", "Fargo", "Outages", "hit", "Online", "and", "Mobile", "Banking"]

whereas in 2.0, every lemma was lowercased.

Your Environment

Operating System:
Python Version Used:
spaCy Version Used: spacy-nightly==2.1.0a6
Environment Information:

p-sodmann · 2019-02-11T14:30:04Z

The issue is this rule in lemmatizer.py:

    elif univ_pos in (PROPN, "PROPN"):
        return [string]
    else:
        return [string.lower()]

so no PROPN will get lowercased.
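To make the effect of that branch concrete, here is a minimal sketch that isolates it in a standalone function. The name `lemma_fallback` and the hand-written (word, tag) pairs are hypothetical stand-ins for illustration; this is not spaCy's actual lemmatizer, only the quoted fallback logic.

```python
def lemma_fallback(string, univ_pos):
    """Simplified stand-in for the fallback branch quoted above (not spaCy code)."""
    if univ_pos == "PROPN":
        # Proper nouns keep their original casing.
        return [string]
    # Everything else reaching this fallback is lowercased.
    return [string.lower()]


# Hand-written (word, coarse POS) pairs; "Hit" appears twice to show that
# the tag, not the word itself, decides the casing of the lemma.
tagged = [("Wells", "PROPN"), ("Hit", "PROPN"), ("and", "CCONJ"), ("Hit", "VERB")]
lemmas = [lemma_fallback(word, pos)[0] for word, pos in tagged]
print(lemmas)  # ['Wells', 'Hit', 'and', 'hit']
```

So a word such as "Hit" keeps its capital whenever the tagger labels it PROPN, and is lowercased otherwise.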

for token in doc:
    print(token.lemma_, token.pos_, token.tag_)

Wells PROPN NNP
Fargo PROPN NNP
Outages PROPN NNPS
Hit PROPN NNP
Online PROPN NNP
and CCONJ CC
Mobile PROPN NNP
Banking PROPN NNP
. PUNCT .

So the question is, why are they all tagged as PROPN?

thomasopsomer · 2019-02-11T15:13:36Z

Hmm, it's because the models are not very good on capitalized text, so in this case the tagger thinks almost all of the words are proper nouns. That's not new; it was already the case in 2.0.x. However, the new rule for 'PROPN' changes the behaviour of the lemmatization.

I don't know what the best solution is. I'm used to getting lowercased tokens when asking for the lemma, but maybe that's a bad habit :)
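If lowercased lemmas are what downstream code expects, one possible workaround is to lowercase them explicitly rather than rely on the lemmatizer. The sketch below uses a hypothetical helper, `lowercased_lemmas`, and stand-in token objects so that it runs without a spaCy model; with spaCy you would pass the `Doc` itself, since each token exposes `lemma_` as a string.

```python
from types import SimpleNamespace


def lowercased_lemmas(tokens):
    """Return each token's lemma in lowercase, regardless of its POS tag."""
    return [token.lemma_.lower() for token in tokens]


# Stand-in tokens carrying the 2.1 lemmas reported in this thread;
# with spaCy installed you would pass the Doc returned by nlp(...) instead.
fake_doc = [
    SimpleNamespace(lemma_=lemma)
    for lemma in ["Wells", "Fargo", "Outages", "hit", "Online", "and", "Mobile", "Banking"]
]
result = lowercased_lemmas(fake_doc)
print(result)
```

This restores the 2.0-style casing, at the cost of discarding the capitalisation that 2.1 preserves for proper nouns.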

honnibal · 2019-02-17T12:18:50Z

In v2.1 we've been aiming for better compatibility with the Universal Dependencies data. In their scheme, the lemmas of proper nouns are capitalised, so we've switched over to preserving the capitalisation. I know this sort of change can be surprising. Sorry it wasn't communicated clearly.

thomasopsomer · 2019-02-21T10:13:54Z

Sounds right, thanks for the explanation :)

lock · 2019-03-23T10:21:21Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

ines added the labels "🌙 nightly" and "feat / lemmatizer" Feb 11, 2019

honnibal closed this as completed Feb 17, 2019

ines mentioned this issue Mar 15, 2019

Lemmatization of the adjective "more" at start of sentence in SpaCy 2.1 #3411

Closed

lock bot locked as resolved and limited conversation to collaborators Mar 23, 2019
