
Normalization/Standardization for Clustering Visualization

I'm visualizing a dataset clustered with k-means. I compute a weight for each cluster and draw a circle whose size is proportional to that weight. But after clustering, some weights are far larger than the rest of the data set: the biggest weight is 1117797 while the smallest is just 2.75. I've performed min-max normalization to [0, 1], but because of that disparity the visualization is still poor.

Should I normalize the data in a different way? I've read about the z-score, but I'm not sure how to apply it so that these huge clusters get less visual weight.

Additional info: average = 16213, standard deviation = 110985.9.
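A quick way to see why both rescalings struggle with a distribution this skewed (the weights below are made-up stand-ins with the same shape as the question: one huge value dwarfing the rest):

```python
import numpy as np

# Illustrative cluster weights: one huge weight dwarfs the rest.
w = np.array([2.75, 40.0, 150.0, 900.0, 1117797.0])

# Min-max normalization to [0, 1]: the big cluster pins the scale,
# so everything else collapses near 0.
minmax = (w - w.min()) / (w.max() - w.min())

# Z-score standardization: center on the mean, divide by the standard
# deviation. The result has mean 0 and std 1, but the outlier still
# dominates, just on a different scale.
zscore = (w - w.mean()) / w.std()
```

Neither transform removes the skew: min-max keeps the ratios intact, and the z-score only shifts and rescales them, so the relative gap between the biggest circle and the others survives both.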

The problem I'm solving: I have around 500k text comments, represented in a vector space model via a term frequency–inverse document frequency (TF-IDF) matrix. In the end, every document is a vector in which each dimension is the weight of a term in the corpus.
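For reference, a representation like the one described above can be built in a few lines with scikit-learn (assuming that library is available; the toy comments below stand in for the real 500k):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for the ~500k comments in the real corpus.
docs = ["great product", "terrible support", "great support team"]

# Each row is one document; each column is the TF-IDF weight of a term.
X = TfidfVectorizer().fit_transform(docs)
print(X.shape)  # (n_documents, n_terms)
```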

Edit:

So far I've got the following: I consider a cluster an "outlier" of the data if its weight is greater than the average plus three times the standard deviation. Formally, cluster $i$ is an outlier if it satisfies:

$w_{i} \ge \mu + 3\sigma$

Then I min-max scale the set of "outliers" to [0, 1] and multiply every element by $a \cdot \max_{w_{j} < \mu + 3\sigma}(w_{j})$. That is, I take the maximum of the "non-outliers" and use it as a baseline to re-weight the outlier points, so that they remain bigger but not by too much. I've set $a = 1.7$ because it gives nice graphical results, but I'm not sure this "experimental" method will work for different kinds of data (which may well occur in my problem).
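The procedure described above can be sketched as follows (a faithful transcription of the steps as stated, not a vetted method; the function name and the edge-case handling for zero or all outliers are my additions):

```python
import numpy as np

def rescale_outlier_weights(w, a=1.7):
    """Shrink extreme cluster weights as described in the edit above.

    Weights >= mu + 3*sigma are min-max scaled to [0, 1] and then
    multiplied by a * max(non-outlier weights), so the outlier circles
    stay large without dominating the drawing. a = 1.7 and the 3-sigma
    cutoff are the values from the text.
    """
    w = np.asarray(w, dtype=float)
    cutoff = w.mean() + 3 * w.std()
    is_outlier = w >= cutoff
    # Nothing to do if there are no outliers, or no baseline to use
    # if everything is an outlier.
    if not is_outlier.any() or is_outlier.all():
        return w
    out = w.copy()
    o = w[is_outlier]
    span = o.max() - o.min()
    # Min-max scale the outliers; if there is a single outlier
    # (span == 0), map it to 1 so it still gets the full baseline.
    scaled = (o - o.min()) / span if span > 0 else np.ones_like(o)
    out[is_outlier] = a * w[~is_outlier].max() * scaled
    return out
```

One caveat of the method as stated: min-max scaling sends the *smallest* outlier to 0, so with two or more outliers the least extreme one ends up drawn smaller than the non-outliers, which may not be what you want.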