1

I have to implement a module in which i need to group sentences(strings) having similar meaning into different clusters. I read about k-means , EM clustering etc. But the problem which i am facing is that these algorithms are explained with vector points on a graph. I am not getting how these algorithms can be implemented for a sentence(String) having similar meaning. Please suggest some appropriate ways.

For example , Lets consider a classroom scenario.. 1) Teacher has ample knowledge. 2) Students understand what teacher teaches. 3) Teacher is sometimes punctual in class. 4) Teacher is audible in class.

Lets say we have these 4 sentences. Then looking at them we can say that sentence 1 and 2 are of similar meaning. But sentence 3 and 4 are neither related to each other nor to the first two. In this way i need to classify the sentences. So how can it be done?

2
  • This is a big question. I think that Google's 'Deep Learning' course on Udacity provides a nice and free introduction to text mining using tensorflow with python. Commented Mar 23, 2016 at 21:54
  • I don't think there's a single best answer to this question, so I'm voting to close it as too broad. That said - look at "Word to Vector" or "Word Embedding" models, which show a lot of promise in this area. Commented Mar 23, 2016 at 22:00

2 Answers 2

2

First of all you should make yourself familiar with the bag of words concept. The basic idea ist to map each word in a sentence on the number of occurrences, e.g., for the sentences hello world, hello tanay would get mapped onto

Hello World Tanay
  1    1      0
  1    0      1

This allows you to use one of the standard approaches.

Also worthwhile would be having a look at TF/DF it is made to reweigh words in a bag of words representation, with their importance to distinguish the documents (or sentences in your case)

Secondly, you should look at LDA which was made specifically for clustering words to concepts. Nevertheless, it is made of a view concepts.

Most promising to me sounds like a combination of these approaches. Generate, bags of words, reweigh the bag of words using TF/DF, run LDA and augment the reweighed bag of words with the LDA concepts and then use a standard clustering algorithm.

2
  • Lets consider a classroom scenario.. 1) Teacher has ample knowledge. 2) Students understand what teacher teaches. 3) Teacher is sometimes punctual in class. 4) Teacher is audible in class. Lets say we have there 4 sentences. Then looking at them we can say that sentence 1 and 2 are of similar meaning. But sentence 3 and 4 are neither related to each other nor to the first two. In this way i need to classify the sentences. So how can it be done? Commented Mar 23, 2016 at 22:01
  • 2
    @TanayNarkhede: to improve your question, you should merge this comment into it with an edit. Nevertheless, this example makes me doubt that it is possible. Because, even I as a human have a hard time to see the connection, as it is deeply hidden in the semantics of the sentence.
    – CAFEBABE
    Commented Mar 23, 2016 at 22:05
2

Clustering cannot do this.

Because it looks for stucture in data, but you want to cluster by the abstract human concept of meaning that is hard to capture using statistics...

So you first need to solve the really hard task of making the computer understand language reliably. And not on a "best match" basis, but good enough to quantify similarities.

There are actualy some attempts in this direction, usually involving massive data and deep learning. They can do this on some toy examples such as "Paris - France + USA = ?" - sometimes. Google for IBM Watson and Google word2vec.

Good luck. You will need high-performance GPUs and Exabytes of training data.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.