Source Link
ttnphns

If your variables are of incomparable units (e.g. height in cm and weight in kg) then you should standardize variables, of course. Even if variables are of the same units but show quite different variances it is still a good idea to standardize before K-means. You see, K-means clustering is "isotropic" in all directions of space and therefore tends to produce more or less round (rather than elongated) clusters. In this situation leaving variances unequal is equivalent to putting more weight on variables with smaller variance, so clusters will tend to be separated along variables with greater variance.
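As a quick illustration of this point, here is a minimal sketch (assuming scikit-learn is available; the data, group means, and variances are made up). Two groups differ only in the first, low-variance variable, while the second variable is pure noise on a much larger scale:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two groups that differ only in the first variable; the second variable
# is uninformative noise but has a much larger variance.
g1 = np.column_stack([rng.normal(0, 1, 200), rng.normal(0, 100, 200)])
g2 = np.column_stack([rng.normal(5, 1, 200), rng.normal(0, 100, 200)])
X = np.vstack([g1, g2])
y = np.array([0] * 200 + [1] * 200)

def accuracy(labels, y):
    # Cluster labels are arbitrary, so take the better of the two matchings.
    acc = (labels == y).mean()
    return max(acc, 1 - acc)

km_raw = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
km_std = KMeans(n_clusters=2, n_init=10, random_state=0).fit(
    StandardScaler().fit_transform(X))

print(accuracy(km_raw.labels_, y))  # poor: split happens along the noisy, high-variance axis
print(accuracy(km_std.labels_, y))  # high: the real separation is recovered
```

Without standardization, K-means splits along the high-variance noise axis; after standardization it recovers the true grouping.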

[figure: two-variable example illustrating how, without standardization, clusters become separated along the variable with the greater variance]

Another point worth remembering is that K-means clustering results are potentially sensitive to the order of objects in the data set$^1$. A justified practice is to run the analysis several times with randomized object order; then average the cluster centres of the corresponding/same clusters across those runs$^2$ and input those centres as initial ones for one final run of the analysis.

Here is some general reasoning about the issue of standardizing features in cluster or other multivariate analysis.


$^1$ Specifically, (1) some methods of centre initialization are sensitive to case order; (2) even when the initialization method isn't sensitive, results might sometimes depend on the order in which the initial centres are introduced to the program (in particular, when there are tied, equal distances within the data); (3) the so-called running-means version of the k-means algorithm is naturally sensitive to case order (in this version - which is not often used apart from perhaps online clustering - recalculation of centroids takes place after each individual case is re-assigned to another cluster).
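That running-means behaviour can be sketched with a hypothetical minimal MacQueen-style update (the function name and data are made up); because the winning centroid is updated immediately after each case, presenting the same cases in a different order generally ends at different centres:

```python
import numpy as np

def running_means_kmeans(X, centres, passes=1):
    """MacQueen-style k-means: the winning centroid is updated
    immediately after each case is assigned, so the result depends
    on the order in which cases are presented."""
    centres = centres.astype(float).copy()
    counts = np.ones(len(centres))  # treat each centre as having seen one case
    for _ in range(passes):
        for x in X:
            k = np.argmin(((centres - x) ** 2).sum(axis=1))
            counts[k] += 1
            centres[k] += (x - centres[k]) / counts[k]  # incremental mean update
    return centres

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
init = X[:2].copy()

c1 = running_means_kmeans(X, init)
c2 = running_means_kmeans(X[::-1], init)  # same data, reversed order
print(np.allclose(c1, c2))               # typically False: order matters
```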

$^2$ In practice, which clusters from different runs correspond is often immediately seen from their relative closeness. When it is not easily seen, the correspondence can be established by a hierarchical clustering done among the centres or by a matching algorithm such as the Hungarian one. But, to remark, if the correspondence is so vague that it almost vanishes, then either the data had no cluster structure detectable by K-means, or K is very wrong.
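The multi-run procedure described above can be sketched like this (a rough illustration assuming scikit-learn and scipy; the toy data are made up), using scipy's `linear_sum_assignment` as the Hungarian matcher:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Toy data: three well-separated blobs.
X = np.vstack([rng.normal(m, 0.5, size=(50, 2))
               for m in ([0, 0], [4, 0], [0, 4])])

K, runs = 3, 5
all_centres = []
for seed in range(runs):
    Xs = X[rng.permutation(len(X))]  # randomize object order before each run
    km = KMeans(n_clusters=K, n_init=1, random_state=seed).fit(Xs)
    all_centres.append(km.cluster_centers_)

# Match every run's clusters to the first run's (Hungarian algorithm),
# then average the matched centres.
ref = all_centres[0]
aligned = [ref]
for C in all_centres[1:]:
    rows, cols = linear_sum_assignment(cdist(ref, C))
    aligned.append(C[cols])
mean_centres = np.mean(aligned, axis=0)

# Feed the averaged centres as initial ones for a final run.
final = KMeans(n_clusters=K, init=mean_centres, n_init=1).fit(X)
print(final.cluster_centers_.round(1))
```

With data this clean the matching is trivial, but the same matching-then-averaging step is what stabilizes the initial centres when the runs disagree more.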
