Questions tagged [clustering]

Cluster analysis is the task of partitioning data into subsets of objects according to their mutual "similarity," without using preexisting knowledge such as class labels. [Clustered-standard-errors and/or cluster-samples should be tagged as such; do NOT use the "clustering" tag for them.]

4,046 questions

448 votes

5 answers

180k views

How to understand the drawbacks of K-means

K-means is a widely used method in cluster analysis. In my understanding, this method does NOT require ANY assumptions, i.e., give me a dataset and a pre-specified number of clusters, k, and I just ...

KevinKim

7,039

asked Jan 16, 2015 at 4:38

366 votes

8 answers

172k views

Why is Euclidean distance not a good metric in high dimensions?

I read that 'Euclidean distance is not a good distance in high dimensions'. I guess this statement has something to do with the curse of dimensionality, but what exactly? Besides, what is 'high ...

teaLeef

3,877

asked May 18, 2014 at 17:50

164 votes

6 answers

140k views

Clustering on the output of t-SNE

I've got an application where it'd be handy to cluster a noisy dataset before looking for subgroup effects within the clusters. I first looked at PCA, but it takes ~30 components to get to 90% of the ...

generic_user

13.9k

asked Feb 23, 2017 at 1:39

120 votes

6 answers

181k views

What is the relation between k-means clustering and PCA?

It is a common practice to apply PCA (principal component analysis) before a clustering algorithm (such as k-means). It is believed that it improves the clustering results in practice (noise reduction)...

mic

4,560

asked Nov 23, 2015 at 22:42

114 votes

7 answers

121k views

What is the difference between a multiclass and a multilabel problem?

What is the difference between a multiclass problem and a multilabel problem?

Learner

4,507

asked Jun 13, 2011 at 5:35

109 votes

7 answers

15k views

Detecting a given face in a database of facial images

I'm working on a little project involving the faces of twitter users via their profile pictures. A problem I've encountered is that after I filter out all but the images that are clear portrait ...

ʞɔıu

1,117

asked Feb 14, 2011 at 22:41

97 votes

6 answers

175k views

Why does k-means clustering algorithm use only Euclidean distance metric?

Is there a specific purpose in terms of efficiency or functionality why the k-means algorithm does not use for example cosine (dis)similarity as a distance metric, but can only use the Euclidean norm? ...

curious

1,111

asked Jan 7, 2014 at 11:53

94 votes

7 answers

44k views

Euclidean distance is usually not good for sparse data (and more general case)?

I have seen somewhere that classical distances (like Euclidean distance) become weakly discriminant when we have multidimensional and sparse data. Why? Do you have an example of two sparse data ...

shn

2,987

asked Jun 1, 2012 at 13:55

92 votes

6 answers

84k views

How to tell if data is "clustered" enough for clustering algorithms to produce meaningful results?

How would you know if your (high-dimensional) data exhibits enough clustering that results from k-means or another clustering algorithm are actually meaningful? For the k-means algorithm in particular, ...

xuexue

2,328

asked Jun 8, 2011 at 0:04

84 votes

6 answers

53k views

Choosing a clustering method

When using cluster analysis on a data set to group similar cases, one needs to choose among a large number of clustering methods and measures of distance. Sometimes, one choice might influence the ...

Brett

6,365

asked Oct 18, 2010 at 15:58

83 votes

3 answers

140k views

How can an artificial neural network (ANN) be used for unsupervised clustering?

I understand how an artificial neural network (ANN) can be trained in a supervised manner using backpropagation to improve the fitting by decreasing the error in ...

Vass

1,705

asked Mar 3, 2015 at 16:21

83 votes

2 answers

135k views

Performance metrics to evaluate unsupervised learning

With respect to the unsupervised learning (like clustering), are there any metrics to evaluate performance?

user3125

3,109

asked Dec 9, 2013 at 3:00

81 votes

7 answers

108k views

Where to cut a dendrogram?

Hierarchical clustering can be represented by a dendrogram. Cutting a dendrogram at a certain level gives a set of clusters. Cutting at another level gives another set of clusters. How would you pick ...

Eduardas

2,389

asked Oct 17, 2010 at 21:57

76 votes

8 answers

108k views

Clustering with a distance matrix

I have a (symmetric) matrix M that represents the distance between each pair of nodes. For example, A B C D E F G H I J K L A 0 20 20 ...

yassin

asked Sep 16, 2010 at 11:47

73 votes

6 answers

162k views

Is it important to scale data before clustering?

I found this tutorial, which suggests that you should run the scale function on features before clustering (I believe that it converts data to z-scores). I'm wondering whether that is necessary. I'm ...

Jeremy

1,509

asked Mar 12, 2014 at 21:27

