Data standardization vs. normalization for clustering analysis

Question

I'm performing clustering analysis and visualization (hierarchical clustering, PCA, t-SNE, etc.) on a dataset, and I'm a bit confused about how to prepare the data. I understand that the typical options are to standardize, normalize, or log-transform, but it seems there are no hard and fast rules about when to apply one over the other?

With standardization and log transformation, my dataset splits into two clusters under a number of different algorithms. One cluster is large and heterogeneous (which is actually interesting, as this is a biological problem and it makes logical sense). However, if I normalize the data, I get three clusters: the heterogeneous cluster splits into two. This could make sense as well, but it would be a stretch, and the clusters are not as clean. What could be causing this? The non-heterogeneous cluster remains the same, which is reassuring. Is it reasonable to conclude that the "instability" of the second cluster is further evidence of the heterogeneity in the dataset?

Has QUIT--Anony-Mousse · Accepted Answer · 2019-07-13 20:16:04Z


There cannot be a general rule on what to do.

Any automatic normalization is usually "wrong". Such methods only happen to work better, most of the time, than not weighting features at all, so people commonly use them, in particular on data they don't understand. But the right way is to weight and scale features such that they have the right, balanced amount of influence on the results. As there is no mathematical way to capture this "right balance" (it's not uniform!), there cannot be an automatic solution. You have to understand your data and scale each feature to give it the desired amount of influence.
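To make "scale each feature to give it the desired amount of influence" concrete, here is a minimal sketch. The data, the helper names (`scale_features`, `euclidean`) and the weight of 3 are all made up for illustration; in practice the weights would have to come from your own knowledge of the data.

```python
import math

def scale_features(rows, weights):
    """Multiply each feature (column) by its chosen weight."""
    return [[x * w for x, w in zip(row, weights)] for row in rows]

def euclidean(a, b):
    """Plain Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical points with two features already on comparable scales.
a = [1.0, 0.0]
b = [0.0, 1.0]
c = [0.0, 0.0]

# Equal weights: a and b are equally far from c.
equal = scale_features([a, b, c], [1.0, 1.0])
# Assumed domain knowledge: feature 0 should count three times as much.
weighted = scale_features([a, b, c], [3.0, 1.0])

d_equal = (euclidean(equal[0], equal[2]), euclidean(equal[1], equal[2]))
d_weighted = (euclidean(weighted[0], weighted[2]), euclidean(weighted[1], weighted[2]))

print("equal weights:   ", d_equal)
print("feature 0 x 3:   ", d_weighted)
```

With equal weights both points sit at the same distance from `c`; after reweighting, a difference in feature 0 moves a point three times as far as the same difference in feature 1, which is exactly the kind of choice a distance-based clustering will then act on.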



Anony-Mousse provides a good answer. I'd add that often you are looking for sensible clusters that help the data tell a story. From what you've said in your question, the easier-to-interpret 2-cluster solution would seem better.

– zbicyclist, commented Jul 13, 2019 at 20:29

Thank you! I guess my question was also: what component of normalizing/standardizing would give rise to these differences in results? Also, the challenge with my dataset is that it is unlabelled (and impossible to label otherwise, which is common for biological data), so we are trying to use unsupervised clustering to figure out how many clusters exist. We are currently not weighting the features, but we are doing feature selection.

– Elicen, commented Jul 13, 2019 at 20:38

Normalizing is usually much worse because of outliers. Standardization is much more robust.

– Has QUIT--Anony-Mousse, commented Jul 13, 2019 at 21:45
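A rough way to see the point about outliers, assuming "normalizing" here means min-max scaling to [0, 1] (the thread does not say so explicitly). The feature values below are invented, and the helper names are only for this sketch. It measures how much the ordinary points get squeezed together once a single extreme value is added.

```python
import statistics

def min_max(values):
    """Rescale values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Centre on the mean and divide by the population standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

def bulk_spread(scaled, n_bulk):
    """Range covered by the first n_bulk (non-outlier) points after scaling."""
    bulk = scaled[:n_bulk]
    return max(bulk) - min(bulk)

# Made-up feature: 100 ordinary values, then the same values plus one outlier.
clean = [float(i) for i in range(100)]
with_outlier = clean + [1000.0]

# How many times smaller the bulk's spread becomes once the outlier is added.
shrink_min_max = bulk_spread(min_max(clean), 100) / bulk_spread(min_max(with_outlier), 100)
shrink_z_score = bulk_spread(z_score(clean), 100) / bulk_spread(z_score(with_outlier), 100)

print(f"min-max: bulk squeezed by a factor of {shrink_min_max:.1f}")
print(f"z-score: bulk squeezed by a factor of {shrink_z_score:.1f}")
```

On this toy feature the single outlier compresses the ordinary points roughly three times more under min-max scaling than under z-scoring. Neither method is immune, but a feature squeezed this way contributes less to the distances, which is one plausible source of different cluster counts between the two preparations.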


aghd · Accepted Answer · 2019-07-13 20:40:05Z

I think standard scaling mostly depends on the model being used, while normalization depends on how the data originated.

Most distance-based models, e.g. k-means, need standard scaling so that large-scale features don't dominate the variation. The same goes for PCA.
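As a small illustration of what "dominate" means for a distance-based method, the sketch below computes how much of the squared Euclidean distance between two samples comes from each feature, before and after standardizing the columns. The four samples and the helper names are hypothetical.

```python
import statistics

def standardize_columns(rows):
    """Z-score each column: subtract its mean, divide by its population std."""
    cols = list(zip(*rows))
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.pstdev(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]

def squared_distance_share(a, b):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    sq = [(x - y) ** 2 for x, y in zip(a, b)]
    total = sum(sq)
    return [s / total for s in sq]

# Hypothetical samples: feature 0 is in the thousands, feature 1 lies in [0, 1].
rows = [
    [1000.0, 0.1],
    [2000.0, 0.9],
    [3000.0, 0.2],
    [4000.0, 0.8],
]

raw_share = squared_distance_share(rows[0], rows[1])
scaled = standardize_columns(rows)
scaled_share = squared_distance_share(scaled[0], scaled[1])

print("raw share per feature:         ", raw_share)
print("standardized share per feature:", scaled_share)
```

On the raw values practically all of the distance comes from feature 0, simply because of its units. After standardization both features contribute, so k-means or PCA no longer sees only the large-scale one.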

As for normalization, it mostly depends on the data. For example, if you have sensor data (each time step being a variable) with different scales, you need to L2-normalize the data to bring them onto the same scale. Or, if you are working on customer recommendation and your entries are the number of times each customer bought each item (items being variables), you might need to L2-normalize the items if you don't want people who buy a lot to skew the features.
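Here is a minimal sketch of the purchase-count case, assuming the L2 normalization is applied per customer (per row), which is my reading of the example rather than something stated above. The three customers are invented.

```python
import math

def l2_normalize(row):
    """Divide a row by its Euclidean length; leave all-zero rows unchanged."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row] if norm > 0 else list(row)

def euclidean(a, b):
    """Plain Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical purchase counts for three items.
light_buyer = [1.0, 2.0, 2.0]
heavy_buyer = [10.0, 20.0, 20.0]   # same proportions, ten times the volume
other_taste = [2.0, 2.0, 1.0]      # similar volume, different proportions

# Raw counts: the heavy buyer looks far away purely because of volume.
raw_same_taste = euclidean(light_buyer, heavy_buyer)
raw_other_taste = euclidean(light_buyer, other_taste)

# After L2 normalization only the purchase profile matters.
norm_same_taste = euclidean(l2_normalize(light_buyer), l2_normalize(heavy_buyer))
norm_other_taste = euclidean(l2_normalize(light_buyer), l2_normalize(other_taste))

print("raw:       ", raw_same_taste, raw_other_taste)
print("normalized:", norm_same_taste, norm_other_taste)
```

On raw counts the light buyer is closer to the customer with different tastes than to the heavy buyer with identical tastes; after normalization that ordering flips, because buying volume no longer enters the distance.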

Personally, I think if the variables are well-defined, taking their log might result in losing interpretability. So if you get good-looking clusters without the log transform, I'd stick with that.
