Extracting Part of Speech (Source and Destinations) using text mining/NLP?

Question

I need to extract the source and destination terms from text documents using text mining, NLP, or information retrieval techniques.

Examples:

1. i am travelling from New York to London.
2. i am heading towards playground from home.
3. i will be going to Sweden from Boston.
4. i was flying from School to Home.

The output should look like this:

S. No. | Source     | Destination
------ | ---------- | -----------
1      | New York   | London
2      | home       | playground
3      | Boston     | Sweden
4      | School     | Home

This looks like a natural language understanding problem. NLTK can generate discourse representation structures that generate the meaning of the text. — Anderson Green, Commented Jun 15, 2017 at 22:00

azeldes · Accepted Answer · 2017-06-13 14:16:38Z

It sounds like you need two things:

1. A dependency parse of the data to identify nouns governed by 'to' and 'from' (if these are really the only two prepositions you care about).
2. A (non-)named entity recognizer to verify that locations are being referred to.

For part 1, there are many dependency parsers out there. You tagged the question with Stanford NLP and NLTK, so it sounds like you're using Java or Python. The Stanford parser can provide dependency parses, so that's a good option, but many options are available.
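As a rough illustration of part 1, here is a minimal sketch in plain Python. It assumes the parser output has already been converted into a list of tokens carrying a head index and a Basic Stanford Dependencies style label (`prep`, `pobj`, `nn`). The `Token` type, the `ROLE_BY_PREP` mapping, and the hand-written parse of example 1 are all hypothetical and are not the output format of any particular parser; 'towards' is included only because it appears in example 2.

```python
from typing import NamedTuple


class Token(NamedTuple):
    idx: int     # 1-based position in the sentence
    word: str
    head: int    # index of the governing token, 0 for the root
    deprel: str  # dependency label (Basic Stanford Dependencies style)


# Hypothetical mapping from preposition to role; extend it as needed.
ROLE_BY_PREP = {"from": "source", "to": "destination", "towards": "destination"}


def extract_route(tokens):
    """Return {'source': ..., 'destination': ...} for objects of known prepositions."""
    by_idx = {t.idx: t for t in tokens}
    route = {}
    for tok in tokens:
        if tok.deprel != "pobj":
            continue
        prep = by_idx.get(tok.head)
        if prep is None:
            continue
        role = ROLE_BY_PREP.get(prep.word.lower())
        if role is None:
            continue
        # Attach compound modifiers such as "New" in "New York".
        parts = [t for t in tokens if t.head == tok.idx and t.deprel == "nn"]
        parts.append(tok)
        parts.sort(key=lambda t: t.idx)
        route[role] = " ".join(t.word for t in parts)
    return route


# Hand-annotated parse of "i am travelling from New York to London."
SENTENCE = [
    Token(1, "i", 3, "nsubj"),
    Token(2, "am", 3, "aux"),
    Token(3, "travelling", 0, "root"),
    Token(4, "from", 3, "prep"),
    Token(5, "New", 6, "nn"),
    Token(6, "York", 4, "pobj"),
    Token(7, "to", 3, "prep"),
    Token(8, "London", 7, "pobj"),
    Token(9, ".", 3, "punct"),
]

print(extract_route(SENTENCE))
# {'source': 'New York', 'destination': 'London'}
```

Because the roles come from the governing preposition rather than from word order, the same logic should handle sentences where the destination is mentioned first, provided the parse is correct.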

For part 2, if you only need named locations (e.g. New York), CoreNLP's NER works well. You could also consider using spaCy (https://spacy.io/), which offers dependency parses and NER out of the box in Python.

If you need to match things like 'playground' as well, you need a Non-Named Entity Recognition component. There are fewer of these around, but you can try using xrenner (https://corpling.uis.georgetown.edu/xrenner/), which is available as a Python package from PyPI as well. It takes a dependency parse using Basic Stanford Dependencies as input (not Universal Dependencies), so you can use those in step 1 and feed the result to xrenner.
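The verification step can be sketched as a simple filter over candidate phrases. This is only an illustration: the `keep_locations` helper, the `LOCATION_LABELS` set, and the label dictionaries below are hypothetical stand-ins for whatever your entity recognizer actually returns, so adjust the label names to match the tool you use.

```python
# Labels treated as location-like. These names are assumptions; replace them
# with the label set your NER or non-named entity tool actually emits.
LOCATION_LABELS = {"LOCATION", "place"}


def keep_locations(candidates, entity_labels):
    """Keep only the roles whose phrase was typed as a location."""
    return {
        role: phrase
        for role, phrase in candidates.items()
        if entity_labels.get(phrase) in LOCATION_LABELS
    }


# Candidates for "i am heading towards playground from home."
TRAVEL = keep_locations(
    {"source": "home", "destination": "playground"},
    {"home": "place", "playground": "place"},
)

# Candidates for a sentence like "the shop is open from Monday to Friday",
# where 'from' and 'to' do not introduce locations at all.
SCHEDULE = keep_locations(
    {"source": "Monday", "destination": "Friday"},
    {"Monday": "time", "Friday": "time"},
)

print(TRAVEL)    # {'source': 'home', 'destination': 'playground'}
print(SCHEDULE)  # {}
```

Phrases the recognizer did not label at all are dropped as well, which errs on the side of missing a location rather than reporting a false one.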

Keep in mind that all of these tools are statistical, so there will be a certain error rate no matter what you do.

Hope this helps!

Since this question is about NLTK, it might be easier to simply generate discourse representation structures from the input text. — Anderson Green, Commented Jun 15, 2017 at 21:58
