Questions tagged [clustering]
Cluster analysis is the task of partitioning data into subsets of objects according to their mutual "similarity," without using preexisting knowledge such as class labels. [Clustered-standard-errors and/or cluster-samples should be tagged as such; do NOT use the "clustering" tag for them.]
4,046 questions
448 votes, 5 answers, 180k views
How to understand the drawbacks of K-means
K-means is a widely used method in cluster analysis. In my understanding, this method does NOT require ANY assumptions, i.e., give me a dataset and a pre-specified number of clusters, k, and I just ...
366 votes, 8 answers, 172k views
Why is Euclidean distance not a good metric in high dimensions?
I read that 'Euclidean distance is not a good distance in high dimensions'. I guess this statement has something to do with the curse of dimensionality, but what exactly? Besides, what is 'high ...
164 votes, 6 answers, 140k views
Clustering on the output of t-SNE
I've got an application where it'd be handy to cluster a noisy dataset before looking for subgroup effects within the clusters. I first looked at PCA, but it takes ~30 components to get to 90% of the ...
120 votes, 6 answers, 181k views
What is the relation between k-means clustering and PCA?
It is a common practice to apply PCA (principal component analysis) before a clustering algorithm (such as k-means). It is believed that it improves the clustering results in practice (noise reduction)...
114 votes, 7 answers, 121k views
What is the difference between a multiclass and a multilabel problem?
What is the difference between a multiclass problem and a multilabel problem?
109 votes, 7 answers, 15k views
Detecting a given face in a database of facial images
I'm working on a little project involving the faces of Twitter users via their profile pictures.
A problem I've encountered is that after I filter out all but the images that are clear portrait ...
97 votes, 6 answers, 175k views
Why does k-means clustering algorithm use only Euclidean distance metric?
Is there a specific reason, in terms of efficiency or functionality, why the k-means algorithm does not use, for example, cosine (dis)similarity as a distance metric, but can only use the Euclidean norm? ...
94 votes, 7 answers, 44k views
Is Euclidean distance usually not good for sparse data (and in the more general case)?
I have seen somewhere that classical distances (like Euclidean distance) become weakly discriminant when we have multidimensional and sparse data. Why? Do you have an example of two sparse data ...
92 votes, 6 answers, 84k views
How to tell if data is "clustered" enough for clustering algorithms to produce meaningful results?
How would you know whether your (high-dimensional) data exhibits enough clustering that the results from k-means or another clustering algorithm are actually meaningful?
For k-means algorithm in particular, ...
84 votes, 6 answers, 53k views
Choosing a clustering method
When using cluster analysis on a data set to group similar cases, one needs to choose among a large number of clustering methods and measures of distance. Sometimes, one choice might influence the ...
83 votes, 3 answers, 140k views
How can an artificial neural network (ANN) be used for unsupervised clustering?
I understand how an artificial neural network (ANN) can be trained in a supervised manner using backpropagation to improve the fitting by decreasing the error in ...
83 votes, 2 answers, 135k views
Performance metrics to evaluate unsupervised learning
With respect to unsupervised learning (such as clustering), are there any metrics to evaluate performance?
81 votes, 7 answers, 108k views
Where to cut a dendrogram?
Hierarchical clustering can be represented by a dendrogram. Cutting a dendrogram at a certain level gives a set of clusters. Cutting at another level gives another set of clusters. How would you pick ...
76 votes, 8 answers, 108k views
Clustering with a distance matrix
I have a (symmetric) matrix M that represents the distance between each pair of nodes. For example,
A B C D E F G H I J K L
A 0 20 20 ...
73 votes, 6 answers, 162k views
Is it important to scale data before clustering?
I found this tutorial, which suggests that you should run the scale function on features before clustering (I believe that it converts data to z-scores).
I'm wondering whether that is necessary. I'm ...
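
To illustrate the scaling question above, here is a minimal sketch of what converting features to z-scores does before a distance-based clustering step. It is not taken from the tutorial the question mentions. The function names `zscore_columns` and `euclidean` and the toy data are hypothetical, and the sample standard deviation (n - 1 denominator) is used on the assumption that this mirrors the default behaviour of R's `scale`.

```python
from math import sqrt
from statistics import mean, stdev


def zscore_columns(rows):
    """Standardize each column: subtract its mean, divide by its sample SD."""
    columns = list(zip(*rows))
    scaled_columns = []
    for col in columns:
        mu = mean(col)
        sigma = stdev(col)  # sample standard deviation (n - 1 denominator)
        if sigma == 0:
            # A constant column carries no information; map it to zeros.
            scaled_columns.append([0.0] * len(col))
        else:
            scaled_columns.append([(x - mu) / sigma for x in col])
    # Transpose back to row-major layout.
    return [list(row) for row in zip(*scaled_columns)]


def euclidean(a, b):
    """Plain Euclidean distance between two equal-length points."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical toy data: the second feature is on a much larger scale.
rows = [[1.0, 1000.0], [2.0, 2000.0], [3.0, 3000.0]]
scaled = zscore_columns(rows)

# Before scaling, the distance is dominated by the second feature.
raw_distance = euclidean(rows[0], rows[1])
# After scaling, both features contribute equally.
scaled_distance = euclidean(scaled[0], scaled[1])
```

In this toy case the raw distance between the first two points is almost entirely due to the large-scale feature, whereas after standardization each feature contributes the same amount. Whether that equal weighting is desirable depends on the data, which is the point the question raises.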