Lemma are not (always) lowercased in spacy 2.1 #3256
Comments
The issue is a rule in lemmatizer.py, because of which no PROPN gets lowercased. Iterating with for token in doc shows, for example: Wells PROPN NNP. So the question is, why are they all tagged as PROPN?
Hmm, it's because the models are not very good on capitalized text, so in this case the tagger thinks almost all words are proper nouns. That's not new; it was already the case in 2.0.x. However, the new rule regarding 'PROPN' changes the behaviour of the lemmatization. I don't know what the best solution is. I'm used to getting lowercased tokens when asking for the lemma, but maybe that's a bad habit :)
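To check how much of a capitalized text the tagger labels as proper nouns, one could tally the coarse-grained tags. The sketch below is only an illustration: pos_counts and propn_share are hypothetical helpers, and they assume nothing beyond each token exposing a pos_ attribute, as spaCy tokens do. The stand-in tokens are made up so the snippet runs without a model.

```python
from collections import Counter
from types import SimpleNamespace


def pos_counts(tokens):
    """Count coarse-grained POS tags over tokens exposing a `pos_` attribute."""
    return Counter(token.pos_ for token in tokens)


def propn_share(tokens):
    """Return the fraction of tokens tagged PROPN (0.0 for empty input)."""
    counts = pos_counts(tokens)
    total = sum(counts.values())
    return counts["PROPN"] / total if total else 0.0


# Made-up stand-ins for spaCy tokens; with spaCy installed, pass a Doc instead.
fake_doc = [
    SimpleNamespace(text="Wells", pos_="PROPN"),
    SimpleNamespace(text="Fargo", pos_="PROPN"),
    SimpleNamespace(text="Opens", pos_="PROPN"),
    SimpleNamespace(text="a", pos_="DET"),
]

print(pos_counts(fake_doc))
print(propn_share(fake_doc))
```

A high PROPN share on title-cased input would be consistent with the tagger behaviour described above, though the numbers here come from invented data.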
In v2.1 we've been aiming for better compatibility with the Universal Dependencies data. In their scheme, the lemmas of proper nouns are capitalised, so we've switched over to preserving them. I know this sort of change can be surprising. Sorry it wasn't communicated clearly.
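For code that relied on the 2.0 behaviour, one possible workaround is to lowercase the lemma explicitly. This is a sketch, not something proposed in the thread: lowercased_lemmas is a hypothetical helper, and it only assumes each token exposes lemma_, as spaCy tokens do. The stand-in tokens are invented so the snippet runs without a model.

```python
from types import SimpleNamespace


def lowercased_lemmas(tokens):
    """Return each token's lemma lowercased, regardless of its POS tag."""
    return [token.lemma_.lower() for token in tokens]


# Invented stand-ins for spaCy tokens; with spaCy installed, pass a Doc instead.
fake_doc = [
    SimpleNamespace(lemma_="Wells", pos_="PROPN"),
    SimpleNamespace(lemma_="open", pos_="VERB"),
]

print(lowercased_lemmas(fake_doc))  # ['wells', 'open']
```

This discards the capitalisation that v2.1 preserves for proper nouns, so it probably fits only when case does not matter downstream.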
Sounds right, thanks for the explanation :)


thomasopsomer commented Feb 11, 2019
How to reproduce the behaviour
There is a change of behaviour with lemma_ between 2.0 and 2.1: in 2.1 lemmas are not always lowercased, whereas in 2.0 every lemma was lowercased.
Your Environment