Is there an option in Stanford CoreNLP for specifying abbreviations? For example, in the sentence "The reason pt. stayed at home was to rest.", "pt." is an abbreviation for "patient", and CoreNLP incorrectly splits the text into two sentences after it.

How can I pass a list of abbreviations to the Stanford tokenizer?

  • Are you looking specifically at clinical/medical language? If so, don't use Stanford CoreNLP; switch to a toolkit that specializes in biomedical NLP. Fair warning, though: it's a horrendously difficult domain for NLP. Commented May 23, 2015 at 5:26

1 Answer

The short answer is no: as far as I know, there is currently no way to specify custom abbreviations. The longer answer is that the abbreviation handling lives in a *.flex file, and you could add custom abbreviations to it. I think the place to do so is PTBLexer.flex, under the ABBREV1 definition.
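If rebuilding the lexer is not practical, one possible workaround is to post-process the sentence splitter's output and re-join any sentence that ends in a known abbreviation with the sentence that follows it. The sketch below is plain Java and does not call CoreNLP at all; the class `AbbreviationMerger` and its methods are hypothetical names made up for illustration, and the input list stands in for whatever sentence strings your pipeline produced.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, NOT part of CoreNLP: re-joins sentences that a
// splitter broke right after a known abbreviation.
class AbbreviationMerger {

    // True if the last space-delimited token of the sentence is in the
    // abbreviation set (compared case-insensitively, trailing dot included).
    static boolean endsWithAbbreviation(String sentence, Set<String> abbreviations) {
        String trimmed = sentence.trim();
        int lastSpace = trimmed.lastIndexOf(' ');
        String lastToken = trimmed.substring(lastSpace + 1).toLowerCase(Locale.ROOT);
        return abbreviations.contains(lastToken);
    }

    // Walks the sentences in order and glues each one that ends in an
    // abbreviation onto the sentence that follows it.
    static List<String> merge(List<String> sentences, Set<String> abbreviations) {
        List<String> merged = new ArrayList<>();
        StringBuilder pending = new StringBuilder();
        for (String sentence : sentences) {
            if (pending.length() > 0) {
                pending.append(' ');
            }
            pending.append(sentence.trim());
            if (!endsWithAbbreviation(sentence, abbreviations)) {
                merged.add(pending.toString());
                pending.setLength(0);
            }
        }
        // A trailing fragment that ended in an abbreviation is kept as-is.
        if (pending.length() > 0) {
            merged.add(pending.toString());
        }
        return merged;
    }

    public static void main(String[] args) {
        // Assumed splitter output for the example sentence in the question.
        List<String> split = List.of("The reason pt.", "stayed at home was to rest.");
        System.out.println(merge(split, Set.of("pt.")));
    }
}
```

The abbreviation set must be lowercase and include the trailing period. Note the trade-off: a sentence that genuinely ends with one of the listed abbreviations will also be merged with its successor, so this only suits abbreviations that rarely appear sentence-finally.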

  • I changed the file and recompiled, but it didn't work. I added my abbreviations to ABBREV1 in line 641.
    – CentAu
    Commented May 20, 2015 at 17:20
  • Should the .flex file be compiled differently?
    – CentAu
    Commented May 20, 2015 at 18:41
  • Yes, you'll likely need to run JFlex on the .flex file to regenerate the Java lexer source, and then recompile. Commented May 20, 2015 at 21:43
