How can I extract the source and destination terms from text documents using text mining, NLP, or information retrieval techniques?

Examples:

1. I am travelling from New York to London.
2. I am heading towards playground from home.
3. I will be going to Sweden from Boston.
4. I was flying from School to Home.

The expected output is:

S. No. | Source     | Destination
------ | ---------- | -----------
1      | New York   | London
2      | home       | playground
3      | Boston     | Sweden
4      | School     | Home

1 Answer

It sounds like you need two things:

  1. A dependency parse of the data to identify nouns governed by 'to' and 'from' (if these are really the only two prepositions you care about)
  2. A (non-)named entity recognizer to verify that locations are being referred to.
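
To make step 1 concrete, here is a minimal sketch of the lookup once a parse is available. It does not call any parser: the parse of the first example sentence is written out by hand, the `prep`/`pobj`/`nn` labels are meant to resemble basic Stanford dependencies, and `Token`, `ROLE_BY_PREPOSITION`, and `extract_source_destination` are names made up for this illustration. A real parser may attach or label things differently.

```python
from typing import Dict, List, NamedTuple, Optional


class Token(NamedTuple):
    idx: int     # 1-based position in the sentence
    text: str
    head: int    # idx of the governing token, 0 for the root
    deprel: str  # dependency label (illustrative, basic-Stanford-like)


# Hand-annotated parse of "I am travelling from New York to London".
# This is an assumption for illustration; in practice a parser produces it.
SENTENCE = [
    Token(1, "I", 3, "nsubj"),
    Token(2, "am", 3, "aux"),
    Token(3, "travelling", 0, "root"),
    Token(4, "from", 3, "prep"),
    Token(5, "New", 6, "nn"),
    Token(6, "York", 4, "pobj"),
    Token(7, "to", 3, "prep"),
    Token(8, "London", 7, "pobj"),
]

# Which preposition signals which role (extend as needed).
ROLE_BY_PREPOSITION = {
    "from": "source",
    "to": "destination",
    "towards": "destination",
}


def phrase(tokens: List[Token], head: Token) -> str:
    """Join the head noun with its noun-compound modifiers, in sentence order."""
    parts = [
        t for t in tokens
        if t.idx == head.idx or (t.head == head.idx and t.deprel == "nn")
    ]
    return " ".join(t.text for t in sorted(parts, key=lambda t: t.idx))


def extract_source_destination(tokens: List[Token]) -> Dict[str, Optional[str]]:
    """Find prepositional objects governed by 'from' / 'to' / 'towards'."""
    by_idx = {t.idx: t for t in tokens}
    result: Dict[str, Optional[str]] = {"source": None, "destination": None}
    for t in tokens:
        if t.deprel != "pobj":
            continue
        governor = by_idx.get(t.head)
        if governor is None:
            continue
        role = ROLE_BY_PREPOSITION.get(governor.text.lower())
        if role is not None:
            result[role] = phrase(tokens, t)
    return result


print(extract_source_destination(SENTENCE))
# {'source': 'New York', 'destination': 'London'}
```

Because the role comes from the governing preposition rather than from word order, the same lookup handles sentences where the destination is mentioned first, as in "going to Sweden from Boston". Step 2 would then filter the extracted phrases by entity type.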

For part 1, there are many dependency parsers out there. You tagged the question with Stanford NLP and NLTK, so it sounds like you're using Java or Python. The Stanford parser can provide dependency parses, so that's a good option, but it is far from the only one.

For part 2, if you only need named destinations (New York), CoreNLP's NER works well. You could also consider using spaCy (https://spacy.io/), which offers dependency parses and NER out of the box in Python.

If you need to match things like 'playground' as well, you need a Non-Named Entity Recognition component. There are fewer of these around, but you can try using xrenner (https://corpling.uis.georgetown.edu/xrenner/), which is available as a Python package from PyPI as well. It takes a dependency parse using Basic Stanford Dependencies as input (not Universal Dependencies), so you can use those in step 1 and feed the result to xrenner.

Keep in mind that all of these tools are statistical, so there will be a certain error rate no matter what you do.

Hope this helps!