Normalization in clustering

Question

I am working on a project where I aim to cluster provinces according to their exposure to river floods. Currently, I am considering the following indicators:

Total number of flood events / total provincial area
Total flooded area / total provincial area
Total flooded area / number of flood events (average event size)
Flood-prone area / total provincial area

These indicators are intended to capture event frequency, overall territorial impact, average event magnitude, and structural exposure.

However, I am uncertain about the most appropriate normalization strategy.

Instead of normalizing by total provincial area, I could normalize by flood-prone area:

Total flood events / flood-prone area
Total flooded area / flood-prone area
Total flooded area / number of flood events
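To make the two candidate normalizations concrete, here is a minimal Python sketch that computes both indicator sets for each province. The `Province` fields, the function names, and the two example provinces are hypothetical placeholders, not real data.

```python
from dataclasses import dataclass


@dataclass
class Province:
    name: str
    area_km2: float          # total provincial area
    flood_prone_km2: float   # flood-prone area
    n_events: int            # total number of flood events
    flooded_km2: float       # total flooded area, summed over events


def indicators_by_total_area(p: Province) -> dict:
    """First scheme: normalize by total provincial area."""
    return {
        "events_per_area": p.n_events / p.area_km2,
        "flooded_share": p.flooded_km2 / p.area_km2,
        "avg_event_size": p.flooded_km2 / p.n_events,
        "flood_prone_share": p.flood_prone_km2 / p.area_km2,
    }


def indicators_by_flood_prone_area(p: Province) -> dict:
    """Second scheme: normalize by flood-prone area."""
    return {
        "events_per_fpa": p.n_events / p.flood_prone_km2,
        "flooded_per_fpa": p.flooded_km2 / p.flood_prone_km2,
        "avg_event_size": p.flooded_km2 / p.n_events,
    }


# Made-up provinces: same size and flood history, different flood-prone area.
provinces = [
    Province("A", 5000.0, 1000.0, 20, 400.0),
    Province("B", 5000.0, 250.0, 20, 400.0),
]

for p in provinces:
    print(p.name, indicators_by_total_area(p), indicators_by_flood_prone_area(p))
```

In this toy case the two provinces differ only in flood-prone share under the first scheme, but differ in both frequency and impact under the second.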

Clearly, these two approaches would lead to different clustering results. My question is: from a methodological perspective, which normalization strategy is preferable when clustering territorial units according to hazard exposure?

Any insights or references on similar clustering approaches would be greatly appreciated.

Thank you.

Peter Flom · Accepted Answer · 2026-02-27 12:50:37Z

I think this decision should be based, first of all, on what you want to find out. It is more a question of meteorology than statistics.

However, from a statistical point of view, it seems to me that the second approach could run into real problems if the provinces have very different flood-prone areas (FPA). You don't say what country or region you are dealing with, but many regions might have some provinces with no flood-prone area at all, in which case the ratio is undefined. And even if FPA is above 0, a small change in a small denominator can make a big difference, and estimates of FPA are surely just that: estimates.
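A rough sketch of the small-denominator problem, with made-up numbers: the same absolute error in the FPA estimate (5 km² here) moves the ratio by about a third when FPA is 10 km², but by well under one percent when FPA is 1000 km². The function names are illustrative only.

```python
def flooded_per_fpa(flooded_km2, fpa_km2):
    """Flooded area relative to flood-prone area; undefined if FPA is zero."""
    if fpa_km2 <= 0:
        return None
    return flooded_km2 / fpa_km2


def relative_change(flooded_km2, fpa_km2, fpa_error_km2):
    """Relative change in the ratio if the FPA estimate is off by fpa_error_km2."""
    base = flooded_per_fpa(flooded_km2, fpa_km2)
    shifted = flooded_per_fpa(flooded_km2, fpa_km2 + fpa_error_km2)
    return (shifted - base) / base


# Hypothetical provinces with the same true ratio (0.8) but very different FPA.
small = relative_change(flooded_km2=8.0, fpa_km2=10.0, fpa_error_km2=5.0)
large = relative_change(flooded_km2=800.0, fpa_km2=1000.0, fpa_error_km2=5.0)

print(f"small FPA: {small:+.1%}, large FPA: {large:+.1%}")
print(flooded_per_fpa(3.0, 0.0))  # no flood-prone area: ratio is undefined
```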

I also wonder about dividing provinces into flood-prone and not flood-prone. Dichotomizing continuous variables is usually a mistake, as it throws away information. Surely some provinces are much more flood-prone than others: some might flood every year or several times a year, others occasionally, and others almost never.

Christian Hennig · Accepted Answer · 2026-02-28 20:12:43Z

There are valuable considerations in the other answers. I make a more general point that is implicitly already present in the other answers, but I thought it may help to say this explicitly.

The way to think about such things is that you need some concept of what makes provinces similar or dissimilar in terms of the subject matter. This requires understanding of the background, and it also depends on the aim of clustering. You should aim to do the data analysis in such a way that the processing of the data appropriately reflects the meaning of the data and the clustering aim. One thing you can do is to look at a number of pairs of provinces, calculate both (or more) versions of your data under the different normalisation schemes, and try to assess, based on your knowledge of the provinces and what you want to achieve, which one better reflects the actual similarity or dissimilarity (in the sense of whether it is appropriate to put these provinces into the same cluster).
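The pairwise check could be sketched roughly as follows. Everything here is an assumption for illustration: the three provinces are invented, the features are z-scored before computing distances, and Euclidean distance stands in for whatever dissimilarity you actually use.

```python
import math
from itertools import combinations
from statistics import mean, pstdev

# Invented example provinces (areas in km^2).
provinces = {
    "A": {"area": 5000.0, "fpa": 1000.0, "events": 20, "flooded": 400.0},
    "B": {"area": 5000.0, "fpa": 250.0, "events": 20, "flooded": 400.0},
    "C": {"area": 8000.0, "fpa": 1600.0, "events": 8, "flooded": 160.0},
}


def features(p, denominator):
    """Frequency, impact and average event size under one normalisation."""
    d = p[denominator]
    return [p["events"] / d, p["flooded"] / d, p["flooded"] / p["events"]]


def standardize(rows):
    """Z-score each column; a constant column is mapped to zeros."""
    out_cols = []
    for col in zip(*rows):
        m, s = mean(col), pstdev(col)
        out_cols.append([(v - m) / s if s > 0 else 0.0 for v in col])
    return [list(r) for r in zip(*out_cols)]


def pairwise_distances(denominator):
    names = list(provinces)
    z = dict(zip(names, standardize([features(provinces[n], denominator) for n in names])))
    return {(a, b): math.dist(z[a], z[b]) for a, b in combinations(names, 2)}


by_area = pairwise_distances("area")
by_fpa = pairwise_distances("fpa")
for pair in by_area:
    print(pair, round(by_area[pair], 3), round(by_fpa[pair], 3))
```

In this toy data, A and B are indistinguishable when normalised by total area but clearly separated when normalised by flood-prone area. Whether they "should" be similar is exactly the subject-matter judgement described above.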

In many cases, different decisions lead to different but potentially equally valid clusterings in the sense that the clusters are driven by different meanings/concepts of similarity, but both could be justified in some sense. If you assess flooded area compared to total provincial area, you make statements about the possibility of being flooded in any place of the province. If you assess it relative to flood-prone areas, you make statements about the actual danger in what is classified as flood-prone areas, which might be relevant as well, particularly regarding people and enterprises located in these areas (this is relevant if what happens elsewhere in the province isn't of primary interest to your clustering aim). It is important to understand though that this is not a statistical decision, but a subject matter one.

So the clusterings you get will allow for different kinds of interpretation, and you need to decide what kind of interpretation is relevant to you. I admit that this looks quite subjective, and you may think that in science we want to find out something "true" rather than finding what we decided to find. So it is important to make decisions in such a way that what you do addresses the right problem (you are responsible for deciding what that problem is!), without "biasing" the actual result. This is sometimes hard to tell apart.

When helping subject matter experts with cluster analysis, my experience is that they often have some a priori ideas about which observations should cluster together. As a statistician it is important (a) to listen to this, as it is usually informed by relevant information and competence that they have and I don't, but (b) also to remain somewhat sceptical, because there may be something interesting and relevant to be found that deviates from the prior expectations of the experts. It is however crucial to understand that clustering depends on decisions based on subject matter background, and that the data and statistical reasoning alone don't have all the information that is needed to make the required decisions. There is no unique truth in clustering; different clusterings will allow for different interpretations, and what is a valid interpretation of a clustering depends on the decisions made earlier.

I add that in principle all these considerations also interact with the clustering method that you will be using. For example, it makes a difference whether your clustering is based on the Euclidean distance (treating all the information in the different variables independently - which is by the way different from assuming that they are independent), or whether you model correlation within clusters, for example by using a Gaussian mixture with flexible covariance matrices. What effect the inclusion of the third variable has (as discussed by @jginestet) will depend on this.

jginestet · Accepted Answer · 2026-02-28 16:12:23Z

I do not think you want to "normalize" by flood-prone area (FPA). FPA is a possible explanatory factor for all your other factors (the higher the FPA is in a province, the higher one would expect the total flooded area to be, the higher one would expect the average event size to be, and maybe even the total number of flood events (it is easier to get an area flooded)). So dividing by it, you are in fact "eliminating" this cause (e.g. a province may experience twice as large a total flooded area relative to the province size, but normalized to FPA, that difference may disappear).

So I would stick to your initial approach (divide by province area). You may even add a few more explanatory variables (e.g. total annual precipitation, number of days with precipitation above so-many cm, length of waterways through the province, etc.).

Once you find some provinces which are different, you can then also try your 2nd approach, and if the difference disappears, then the difference is all due to higher FPA, but if the difference is similar, then other factors are responsible for the difference.
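As a toy illustration of that check (the numbers are invented): two provinces of equal size, one with twice the flooded area and twice the FPA of the other. The factor of two seen per unit of province area vanishes per unit of FPA, which would point to FPA as the driver.

```python
def per_unit(flooded_km2, denominator_km2):
    """Flooded area per unit of the chosen denominator."""
    return flooded_km2 / denominator_km2


# Hypothetical provinces with equal total area but different FPA.
p = {"area": 1000.0, "fpa": 250.0, "flooded": 100.0}
q = {"area": 1000.0, "fpa": 500.0, "flooded": 200.0}

ratio_by_area = per_unit(q["flooded"], q["area"]) / per_unit(p["flooded"], p["area"])
ratio_by_fpa = per_unit(q["flooded"], q["fpa"]) / per_unit(p["flooded"], p["fpa"])

print("q relative to p, per province area:", ratio_by_area)
print("q relative to p, per flood-prone area:", ratio_by_fpa)
```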

Last, a question about your 3rd variable (Total flooded area / number of flood events (average event size)). It is just the ratio of your first 2 variables (the normalizing denominator does not matter, as it cancels out), and hence does not add anything new. It creates correlations between your variables and adds more weight to the first 2, so I would question whether you really need it.
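A quick check of the cancellation, using exact fractions so that rounding does not blur the point. The numbers and the function name are illustrative only.

```python
from fractions import Fraction


def variable_3_from_1_and_2(n_events, flooded, denominator):
    """Rebuild average event size as (variable 2) / (variable 1)."""
    v1 = Fraction(n_events) / Fraction(denominator)  # events per unit area
    v2 = Fraction(flooded) / Fraction(denominator)   # flooded area per unit area
    return v2 / v1


n_events, flooded = 20, 400  # made-up counts and km^2

by_area = variable_3_from_1_and_2(n_events, flooded, 5000)  # province area
by_fpa = variable_3_from_1_and_2(n_events, flooded, 1000)   # flood-prone area
direct = Fraction(flooded, n_events)                        # flooded / events

print(by_area, by_fpa, direct)
```

Whichever denominator is used, the ratio of the second variable to the first reproduces the third.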

Regarding the 3rd variable, whether this is a problem at all (or rather a feature than a bug) depends on the clustering approach. Based on variables 1 and 2 only, standard Euclidean distance-based clustering, for example, will assess provinces as very dissimilar if variables 1 and 2 differ a lot, and will ignore whether they are similar on the 3rd variable. In this sense, bringing in the third variable gives the method new relevant information that otherwise would have no effect on the clustering, regardless of its dependence on variables 1 and 2.
– Christian Hennig, Commented 15 hours ago
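A small sketch of the point in this comment, with invented and unstandardized values: adding the ratio as a third coordinate can change which province counts as nearest under Euclidean distance, even though the ratio is a function of the first two variables.

```python
import math


def with_ratio(v1, v2):
    """Append variable 3 (= variable 2 / variable 1) to the feature vector."""
    return (v1, v2, v2 / v1)


# Hypothetical (variable 1, variable 2) values for three provinces.
P, Q, R = (1.0, 10.0), (2.0, 20.0), (1.0, 20.0)

# Distances from P using variables 1 and 2 only.
d2 = {"PQ": math.dist(P, Q), "PR": math.dist(P, R)}

# Distances from P after adding the ratio as a third variable.
d3 = {
    "PQ": math.dist(with_ratio(*P), with_ratio(*Q)),
    "PR": math.dist(with_ratio(*P), with_ratio(*R)),
}

print("two variables:  ", d2)
print("three variables:", d3)
```

With two variables, R is slightly closer to P than Q is. With the ratio included, Q (which has the same average event size as P) becomes the closer one. In practice the variables would usually be rescaled first, which this sketch skips.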
