
Questions tagged [clustering]

Cluster analysis is the task of partitioning data into subsets of objects according to their mutual "similarity," without using preexisting knowledge such as class labels. [Clustered-standard-errors and/or cluster-samples should be tagged as such; do NOT use the "clustering" tag for them.]

448 votes
5 answers
180k views

K-means is a widely used method in cluster analysis. In my understanding, this method does NOT require ANY assumptions, i.e., give me a dataset and a pre-specified number of clusters, k, and I just ...
asked by KevinKim
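The "no assumptions" intuition in that question is worth probing: the k-means objective (within-cluster sum of squared Euclidean distances) is itself an implicit assumption, favoring roughly spherical, similarly spread clusters. A minimal numpy sketch of Lloyd's algorithm (toy data, not from the question) makes the objective explicit:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: alternate nearest-center assignment and
    mean update. Minimizing within-cluster squared Euclidean distance is
    the 'assumption': it favors spherical, similar-spread clusters."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest center (squared Euclidean)
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # update each center to the mean of its points (keep it if empty)
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# two well-separated spherical blobs (synthetic, illustrative only)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
labels, centers = kmeans(X, k=2)
```

When clusters are elongated or very different in size, this same objective still gets minimized, but the resulting partition can cut through the "true" groups.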
366 votes
8 answers
172k views

I read that 'Euclidean distance is not a good distance in high dimensions'. I guess this statement has something to do with the curse of dimensionality, but what exactly? Besides, what is 'high ...
asked by teaLeef
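The claim in that question can be checked empirically: for i.i.d. points in the unit cube, the relative contrast between the farthest and nearest neighbor of a query point shrinks as the dimension grows, which is one face of the curse of dimensionality. A small numpy sketch (synthetic data; the function name is mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(dim, n=200):
    """(d_max - d_min) / d_min over Euclidean distances from one random
    query point to n random points in [0, 1]^dim."""
    q = rng.random(dim)
    pts = rng.random((n, dim))
    d = np.linalg.norm(pts - q, axis=1)
    return (d.max() - d.min()) / d.min()

low = relative_contrast(2)       # large: neighbors are well discriminated
high = relative_contrast(1000)   # small: all points look equally far away
```

In 1000 dimensions all pairwise distances concentrate near a common value, so "nearest" and "farthest" become nearly indistinguishable, which is what undermines distance-based methods there.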
164 votes
6 answers
140k views

I've got an application where it'd be handy to cluster a noisy dataset before looking for subgroup effects within the clusters. I first looked at PCA, but it takes ~30 components to get to 90% of the ...
asked by generic_user
120 votes
6 answers
181k views

It is a common practice to apply PCA (principal component analysis) before a clustering algorithm (such as k-means). It is believed that it improves the clustering results in practice (noise reduction)...
asked by mic
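The practice that question asks about is usually wired up as PCA followed by k-means in one pipeline. A scikit-learn sketch on toy data (three blobs padded with pure-noise dimensions; all settings here are illustrative, not recommendations):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline

# toy data: 3 Gaussian blobs living in 3 informative dimensions,
# padded with 17 pure-noise dimensions
rng = np.random.default_rng(0)
signal = np.repeat(np.eye(3) * 5, 60, axis=0)          # (180, 3) cluster means
X = np.hstack([signal + rng.normal(0, 1, signal.shape),
               rng.normal(0, 1, (180, 17))])

# PCA as a denoising step before k-means
pipe = make_pipeline(PCA(n_components=3),
                     KMeans(n_clusters=3, n_init=10, random_state=0))
labels = pipe.fit_predict(X)
```

Whether PCA helps depends on whether the cluster structure actually lives in the high-variance directions; when it does not, PCA can just as easily discard it.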
114 votes
7 answers
121k views

What is the difference between a multiclass problem and a multilabel problem?
asked by Learner
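The distinction shows up directly in the shape of the target: multiclass means exactly one class per sample (a 1-D vector), while multilabel means a *set* of labels per sample (a binary indicator matrix). A small scikit-learn sketch with made-up labels:

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer

# multiclass: each sample gets exactly ONE of several classes
y_multiclass = np.array(["cat", "dog", "bird", "dog"])        # shape (4,)

# multilabel: each sample gets a SET of labels (possibly empty or several)
y_sets = [{"outdoor"}, {"outdoor", "beach"}, set(), {"beach"}]
y_multilabel = MultiLabelBinarizer().fit_transform(y_sets)    # shape (4, 2)
```

Columns of the indicator matrix follow the sorted label vocabulary (here `beach`, `outdoor`), so row 2 is all zeros: a multilabel sample may legitimately carry no label at all, which has no multiclass counterpart.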
109 votes
7 answers
15k views

I'm working on a little project involving the faces of twitter users via their profile pictures. A problem I've encountered is that after I filter out all but the images that are clear portrait ...
asked by ʞɔıu
97 votes
6 answers
175k views

Is there a specific purpose in terms of efficiency or functionality why the k-means algorithm does not use for example cosine (dis)similarity as a distance metric, but can only use the Euclidean norm? ...
asked by curious
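Part of the answer is that the cluster mean only minimizes *squared Euclidean* distance, so swapping in another metric breaks the update step. A standard workaround when cosine similarity is wanted: L2-normalize the rows first, because on the unit sphere squared Euclidean distance is a monotone function of cosine similarity. A numpy check of that identity (random vectors, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)   # project onto unit sphere

# for unit vectors u, v:  ||u - v||^2 = 2 - 2 * cos(u, v)
u, v = Xn[0], Xn[1]
lhs = np.sum((u - v) ** 2)
rhs = 2 - 2 * np.dot(u, v)
```

After normalization, ordinary Euclidean k-means therefore behaves like spherical k-means: minimizing one quantity is equivalent to maximizing the other.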
94 votes
7 answers
44k views

I have seen somewhere that classical distances (like Euclidean distance) become weakly discriminant when we have multidimensional and sparse data. Why? Do you have an example of two sparse data ...
asked by shn
92 votes
6 answers
84k views

How would you know if your (high-dimensional) data exhibits enough clustering structure that the results from k-means or another clustering algorithm are actually meaningful? For the k-means algorithm in particular, ...
asked by xuexue
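One label-free diagnostic for this is the Hopkins statistic, which compares nearest-neighbor distances of real data points against those of uniformly scattered points: values near 0.5 suggest no clustering structure, values near 1 suggest clusterable data. A brute-force numpy sketch (toy two-blob data; the function name and defaults are mine):

```python
import numpy as np

def hopkins(X, m=50, seed=0):
    """Hopkins statistic sketch: ~0.5 means no structure, ~1 means
    clusterable. Brute-force nearest neighbors; fine for small X."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    sample = X[rng.choice(n, m, replace=False)]
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), (m, d))

    def nn_dist(q, ref, skip_self=False):
        dists = np.linalg.norm(ref[None, :, :] - q[:, None, :], axis=2)
        if skip_self:
            dists[dists == 0] = np.inf   # ignore each point's own copy
        return dists.min(axis=1)

    u = nn_dist(uniform, X)                  # uniform point -> nearest datum
    w = nn_dist(sample, X, skip_self=True)   # datum -> nearest other datum
    return u.sum() / (u.sum() + w.sum())

rng = np.random.default_rng(1)
clustered = np.vstack([rng.normal(0, 0.1, (100, 2)),
                       rng.normal(5, 0.1, (100, 2))])
h = hopkins(clustered)
```

For two tight, well-separated blobs the statistic lands close to 1; running it on uniform noise instead would hover around 0.5.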
84 votes
6 answers
53k views

When using cluster analysis on a data set to group similar cases, one needs to choose among a large number of clustering methods and measures of distance. Sometimes, one choice might influence the ...
asked by Brett
83 votes
3 answers
140k views

I understand how an artificial neural network (ANN) can be trained in a supervised manner using backpropagation to improve the fitting by decreasing the error in ...
asked by Vass
83 votes
2 answers
135k views

With respect to unsupervised learning (like clustering), are there any metrics to evaluate performance?
asked by user3125
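There are: so-called internal indices score a clustering using only the data and the labels the algorithm produced, no ground truth needed. A short scikit-learn sketch with two of them on toy data (silhouette lies in [-1, 1], higher is better; Davies-Bouldin is nonnegative, lower is better):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

# two well-separated toy blobs (synthetic, illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (75, 2)), rng.normal(4, 0.5, (75, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# internal (label-free) clustering quality scores
sil = silhouette_score(X, labels)       # cohesion vs. separation per point
db = davies_bouldin_score(X, labels)    # average worst-case cluster overlap
```

Both indices reward compact, well-separated clusters, so they tend to agree with Euclidean methods like k-means and can be misleading for clusters those methods cannot represent.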
81 votes
7 answers
108k views

Hierarchical clustering can be represented by a dendrogram. Cutting a dendrogram at a certain level gives a set of clusters. Cutting at another level gives another set of clusters. How would you pick ...
asked by Eduardas
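In SciPy, "cutting" the dendrogram is `fcluster`: you can either fix the number of clusters or fix a merge-height threshold. A sketch on toy blobs (the threshold value here is arbitrary):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# three tight toy blobs along a line (synthetic, illustrative only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.2, (30, 2)), rng.normal(3, 0.2, (30, 2)),
               rng.normal(6, 0.2, (30, 2))])

Z = linkage(X, method="ward")              # build the full dendrogram

# two common ways to cut it:
labels_k = fcluster(Z, t=3, criterion="maxclust")    # fix cluster count
labels_h = fcluster(Z, t=5.0, criterion="distance")  # fix a cut height
```

Choosing the level itself is the hard part; common heuristics include looking for the largest gap between successive merge heights in `Z` or maximizing an internal index such as the silhouette across candidate cuts.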
76 votes
8 answers
108k views

I have a (symmetric) matrix M that represents the distance between each pair of nodes. For example, A B C D E F G H I J K L A 0 20 20 ...
asked by yassin
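Hierarchical clustering works directly from such a matrix, with no coordinates needed: convert the square symmetric form to SciPy's condensed form and feed it to `linkage`. A sketch with a made-up 4-node distance matrix (not the matrix from the question):

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

# toy symmetric distance matrix for nodes A, B, C, D
# (A and B are close; C and D are close; the pairs are far apart)
M = np.array([[0.0, 1.0, 9.0, 9.5],
              [1.0, 0.0, 8.5, 9.0],
              [9.0, 8.5, 0.0, 1.2],
              [9.5, 9.0, 1.2, 0.0]])

condensed = squareform(M)           # upper triangle, length n*(n-1)/2
Z = linkage(condensed, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
```

Methods like `average` or `complete` are safe with an arbitrary precomputed matrix; `ward` and `centroid` assume Euclidean distances, so they should be avoided unless the matrix really is Euclidean.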
73 votes
6 answers
162k views

I found this tutorial, which suggests that you should run the scale function on features before clustering (I believe that it converts data to z-scores). I'm wondering whether that is necessary. I'm ...
asked by Jeremy
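What R's `scale()` does is z-score each column; without it, Euclidean distances are dominated by whichever feature has the largest units. A numpy sketch on made-up two-scale data (note that R's `scale()` divides by the n-1 sample standard deviation, so `ddof=1` would match it exactly; the default `ddof=0` is used below):

```python
import numpy as np

# two features on wildly different scales: income (~1e4) and age (~1e1)
rng = np.random.default_rng(0)
income = rng.normal(50_000, 15_000, 100)
age = rng.normal(40, 12, 100)
X = np.column_stack([income, age])

# z-score each column: mean 0, standard deviation 1
Xz = (X - X.mean(axis=0)) / X.std(axis=0)
```

On the raw `X`, a 10-year age difference is invisible next to routine income noise; after standardization, both features contribute comparably to any distance-based clustering.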
