Clustering algorithms for strings

Question

I have to implement a module in which i need to group sentences(strings) having similar meaning into different clusters. I read about k-means , EM clustering etc. But the problem which i am facing is that these algorithms are explained with vector points on a graph. I am not getting how these algorithms can be implemented for a sentence(String) having similar meaning. Please suggest some appropriate ways.

For example , Lets consider a classroom scenario.. 1) Teacher has ample knowledge. 2) Students understand what teacher teaches. 3) Teacher is sometimes punctual in class. 4) Teacher is audible in class.

Lets say we have these 4 sentences. Then looking at them we can say that sentence 1 and 2 are of similar meaning. But sentence 3 and 4 are neither related to each other nor to the first two. In this way i need to classify the sentences. So how can it be done?

This is a big question. I think that Google's 'Deep Learning' course on Udacity provides a nice and free introduction to text mining using tensorflow with python. — hilberts_drinking_problem, Commented Mar 23, 2016 at 21:54
I don't think there's a single best answer to this question, so I'm voting to close it as too broad. That said - look at "Word to Vector" or "Word Embedding" models, which show a lot of promise in this area. — templatetypedef, Commented Mar 23, 2016 at 22:00

CAFEBABE · Accepted Answer · 2016-03-23 22:01:28Z

2

First of all you should make yourself familiar with the bag of words concept. The basic idea ist to map each word in a sentence on the number of occurrences, e.g., for the sentences hello world, hello tanay would get mapped onto

Hello World Tanay
  1    1      0
  1    0      1

This allows you to use one of the standard approaches.

Also worthwhile would be having a look at TF/DF it is made to reweigh words in a bag of words representation, with their importance to distinguish the documents (or sentences in your case)

Secondly, you should look at LDA which was made specifically for clustering words to concepts. Nevertheless, it is made of a view concepts.

Most promising to me sounds like a combination of these approaches. Generate, bags of words, reweigh the bag of words using TF/DF, run LDA and augment the reweighed bag of words with the LDA concepts and then use a standard clustering algorithm.

edited Mar 23, 2016 at 22:01

answered Mar 23, 2016 at 21:55

CAFEBABE

4,1011 gold badge22 silver badges39 bronze badges

Lets consider a classroom scenario.. 1) Teacher has ample knowledge. 2) Students understand what teacher teaches. 3) Teacher is sometimes punctual in class. 4) Teacher is audible in class. Lets say we have there 4 sentences. Then looking at them we can say that sentence 1 and 2 are of similar meaning. But sentence 3 and 4 are neither related to each other nor to the first two. In this way i need to classify the sentences. So how can it be done?
– Tanay Narkhede
Commented Mar 23, 2016 at 22:01
2

@TanayNarkhede: to improve your question, you should merge this comment into it with an edit. Nevertheless, this example makes me doubt that it is possible. Because, even I as a human have a hard time to see the connection, as it is deeply hidden in the semantics of the sentence.
– CAFEBABE
Commented Mar 23, 2016 at 22:05

Add a comment |

Has QUIT--Anony-Mousse · Accepted Answer · 2016-03-24 06:57:25Z

Clustering cannot do this.

Because it looks for stucture in data, but you want to cluster by the abstract human concept of meaning that is hard to capture using statistics...

So you first need to solve the really hard task of making the computer understand language reliably. And not on a "best match" basis, but good enough to quantify similarities.

There are actualy some attempts in this direction, usually involving massive data and deep learning. They can do this on some toy examples such as "Paris - France + USA = ?" - sometimes. Google for IBM Watson and Google word2vec.

Good luck. You will need high-performance GPUs and Exabytes of training data.

Collectives™ on Stack Overflow

Clustering algorithms for strings

2 Answers 2

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Related