Should I standardize my data or not?

Question

I am currently working on a dataset of color magnitudes of astronomical point sources. There are 9 covariates, each representing a specific color of a point source. I used k-means, hierarchical clustering, and self-organizing maps. The results from these three methods are very similar. However, I noticed that all three methods formed clusters based mainly on the two covariates with the largest range of magnitudes.

I know it is recommended to scale the data, and I fear that the clustering is dominated by these two covariates because I didn't scale it.

However, scaling the data would change the physical meaning of the colors and the relationships between them, which is not something I want for this particular project.

Does anyone know the best way to handle this? Thanks!

I think that this is where your goals and domain knowledge need to influence your choice. What are you trying to achieve with these clusters, and how would you decide if they are acceptable or not?
– mkt, Commented Apr 11, 2018 at 6:26
I guess this is the problem. I'm a stats student and I'm very interested in astrostatistics, but my astronomy background is not strong enough to decide what is appropriate. As for the purpose of the clustering, it is essentially to see whether the data show any interesting structure, which would then be interpreted by an expert in astronomy.
– NamelessGods, Commented Apr 11, 2018 at 6:34
Defining what constitutes an 'interesting' cluster requires some criterion that is difficult to decide on without domain knowledge, IMO. If this is purely exploratory, you could try it both ways and discuss the results with astronomers.
– mkt, Commented Apr 11, 2018 at 6:37
How does it take nine covariates to represent "color magnitude"? Your answer to this might reveal the right solution.
– whuber ♦, Commented Apr 11, 2018 at 12:50
In case you didn't know, color in astronomy does not mean everyday RGB colors. See here
– NamelessGods, Commented Apr 11, 2018 at 14:28

Has QUIT--Anony-Mousse · Accepted Answer · 2018-04-14 08:12:08Z

The statistical answer to this question is: there is no statistical answer to this question.

Just normalizing or standardizing variables is common practice, in particular when you don't know the data or what it means. But then you also won't know whether the result is useful at all.

An approach that scales variables based on domain knowledge is always preferable.

Consider the following: what if one of your variables came from a calibration sensor, a variable that is supposed to be zero all the time because, for example, there won't be any light in this spectrum? Assume the sensor reads values from $N(0;10^{-10})$. If you apply any uninformed scaling, this useless noise will become as important as your actual data, and your results will become much worse. Now assume you had more than one variable like this.
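To make the calibration-sensor argument concrete, here is a minimal sketch on simulated data (not the asker's dataset). The two "informative" columns, their spreads, and the helper name `distance_share` are made up for illustration, and I treat $10^{-10}$ as the standard deviation of the noise channel, which is an assumption about the notation. The sketch relies on the fact that the expected squared Euclidean distance between two randomly drawn points is twice the sum of the per-column variances, so each column's share of that distance is its variance divided by the total.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Two hypothetical informative covariates with different spreads.
signal = rng.normal(loc=0.0, scale=[1.0, 0.5], size=(n, 2))
# Calibration channel: pure noise around zero (assumed std of 1e-10).
noise = rng.normal(loc=0.0, scale=1e-10, size=(n, 1))
X = np.hstack([signal, noise])


def distance_share(data):
    """Fraction of the mean squared Euclidean distance due to each column."""
    var = data.var(axis=0)
    return var / var.sum()


# Uninformed z-score standardization: every column gets unit variance.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

share_raw = distance_share(X)
share_scaled = distance_share(Z)

print("share of distance, raw data:    ", share_raw)
print("share of distance, standardized:", share_scaled)
```

On the raw data the noise channel contributes essentially nothing to the distances that k-means or hierarchical clustering work with. After standardization it carries the same weight as each informative column, which is the failure mode described above.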

So usually, I would argue that you should only do this if you don't understand your data, or if you have a good argument for scaling each variable.
