$\begingroup$

I am working on a network-security project in which I have to build a deep learning model to detect a specific attack, i.e. to detect whether an organisation's network is a victim of that attack or not. The attack is sophisticated in nature and very hard to detect. However, I am instructed to work on an imbalanced dataset and am strictly prohibited from using any kind of balancing technique. As far as I know, imbalanced data makes a model biased toward the majority class. What might be the reason my supervisor told me to do this? My data contains 56,453 samples of the positive class and 6,345 of the negative class.

$\endgroup$

2 Answers

$\begingroup$
Brief Answer

In your use case, it seems better to keep the dataset imbalanced!

Build a model with a continuous prediction (sigmoid output), then find a policy (e.g. a threshold) that balances false positives and false negatives in a way that suits your case.
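A minimal sketch of that idea: keep the model's continuous risk output and apply the decision threshold afterwards. The sigmoid here stands in for whatever model you train, and the score and threshold values are made up for illustration:

```python
import math

def sigmoid(z):
    """Map a raw model score to a risk estimate in (0, 1)."""
    return 1 / (1 + math.exp(-z))

def alert(raw_score, threshold=0.10):
    """Raise an alarm when the predicted attack risk exceeds the threshold."""
    risk = sigmoid(raw_score)   # continuous risk estimate, not a hard label
    return risk >= threshold

print(alert(-3.0))  # risk ≈ 0.047, below 10% -> False
print(alert(-1.0))  # risk ≈ 0.269, above 10% -> True
```

The point is that the threshold is a separate policy decision, tuned later, rather than something baked into the training data.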

Long Answer

Imbalanced data is not a problem in general, and most ML models can handle a decent amount of imbalance. A (roughly) 90/10 split, as you have, should not be a problem here.

What you probably want is a model that estimates the risk (of a certain case being an attack). Given a pattern that in the training data is an attack in 20% of the cases (I'm simplifying a bit by ignoring generalization and similar patterns; the argument would still hold), the model will predict a 20% risk. If you balanced your dataset, the model would instead learn a risk of maybe 70% (when taking only 1/9 of the positive class). Balancing will result in overestimating the risk.
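The arithmetic behind that ~70% figure can be sketched directly: downsampling one class by a keep-fraction k multiplies the learned odds of the other class by 1/k. A hypothetical calculation, using the 20% pattern and the 1/9 keep-fraction from the example above:

```python
def resampled_probability(p, keep_fraction_other_class):
    """Probability a model would learn after downsampling the other class."""
    odds = p / (1 - p)                            # odds before resampling
    new_odds = odds / keep_fraction_other_class   # downsampling scales the odds
    return new_odds / (1 + new_odds)

p_original = 0.20
p_after = resampled_probability(p_original, 1 / 9)
print(round(p_after, 3))   # ≈ 0.692 — the ~70% mentioned above
```

This is exactly the prior-shift distortion: the model's conditional patterns are unchanged, but the baseline risk it reports is inflated.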

But what about the bias towards the majority class?

What you learned is not completely wrong: if there are only 0.5% cases of one class, then a model will most certainly assign very low probabilities to this class. But that does not need to be a problem if used in the right way.

The bias towards one class is a problem if your training data does not reflect the real or expected distribution. In that case, the model learns the wrong distribution and reports biased probabilities. A famous example is face recognition that was mainly trained on white male faces (although there the imbalance happens in the feature space, not in the label).

Summarizing: first look for bias in your data before worrying about bias in the model!

Recommendation
  1. Build the model without balancing your data.
  2. Check the validity of the risk prediction. A calibration curve will tell you if the predicted probabilities reflect the true risk.
  3. What you make out of the computed risk depends on your concrete use case and priorities. An easy way would be to pick a threshold (e.g. 10%) and send an alarm (or take whatever action you choose) if the risk is higher than that. This is a business decision: how many false alarms do we want to risk? What happens with the cases we do not detect? You can check different thresholds and see what the implications would be for you. Do not blindly use a 50% threshold!
    It is also possible to design more complex policies or use multiple thresholds, e.g. for different levels of alerts.
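A minimal sketch of step 3 above — sweeping a few thresholds and counting the business-relevant errors at each. The predicted risks and labels here are invented purely for illustration:

```python
def alarm_stats(scores, labels, threshold):
    """Count false alarms and missed attacks at a given risk threshold."""
    alarms = [s >= threshold for s in scores]
    false_alarms = sum(a and y == 0 for a, y in zip(alarms, labels))
    missed = sum((not a) and y == 1 for a, y in zip(alarms, labels))
    return false_alarms, missed

# Made-up predicted risks and true labels (1 = attack).
scores = [0.02, 0.05, 0.08, 0.15, 0.30, 0.55, 0.70, 0.90]
labels = [0,    0,    0,    1,    0,    1,    1,    1]

for t in (0.10, 0.25, 0.50):
    fa, miss = alarm_stats(scores, labels, t)
    print(f"threshold={t:.2f}: {fa} false alarms, {miss} missed attacks")
```

Even in this toy setting you can see the trade-off: lowering the threshold catches every attack at the cost of more false alarms, and the "right" point on that curve is a business decision, not a modelling one.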
$\endgroup$
$\begingroup$

The answer by @Broele rightly notes that blindly balancing the training data often isn't appropriate and that we often want to preserve meaningful class prevalence, which is a great place to start from.

The real issue is not the imbalance per se, but how we use the data: the loss, threshold, and evaluation metric. Treating imbalance as something to fix in the data often misunderstands the problem. Resampling doesn't magically align your model with the true objectives; it can distort estimated probabilities and impair calibration if done naively.

We should avoid balancing the dataset when:

  • The observed class frequencies genuinely reflect the true population distribution that we care about.

  • The model outputs calibrated probabilities that are meaningful for decision policies rather than unanchored class labels.

  • We are using metrics and thresholds appropriate for imbalanced settings (e.g., PR-AUC, rank-based scores, calibrated thresholding) instead of raw accuracy.

In these situations, the model can already learn meaningful risk scores without synthetic rebalancing; the critical decisions about false positives vs false negatives can be handled after fitting the model via thresholding or proper utility functions.
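As a sketch of what "calibrated probabilities" means in practice, here is a hand-rolled reliability check (a simplified version of a calibration curve): bin the predicted risks and compare each bin's mean prediction to its observed positive rate. All numbers below are made up:

```python
def calibration_bins(probs, labels, n_bins=5):
    """Per bin: (mean predicted risk, observed positive rate)."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        i = min(int(p * n_bins), n_bins - 1)   # which bin this prediction falls in
        bins[i].append((p, y))
    result = []
    for b in bins:
        if b:
            mean_pred = sum(p for p, _ in b) / len(b)
            observed = sum(y for _, y in b) / len(b)
            result.append((round(mean_pred, 2), round(observed, 2)))
    return result

# Toy predictions that are roughly calibrated: in each bin, the mean
# prediction is close to the fraction of actual positives.
probs = [0.1, 0.1, 0.1, 0.1, 0.1, 0.9, 0.9]
labels = [0, 0, 0, 0, 1, 1, 1]
print(calibration_bins(probs, labels, n_bins=2))
```

In a real project you would use a standard implementation such as scikit-learn's `sklearn.calibration.calibration_curve` rather than rolling this by hand; the point is only that closeness of the two columns, not the class balance, is what tells you the risk scores are trustworthy.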

Balancing/oversampling may still be useful experimentally, but it should be treated as a tuning option rather than a default requirement.

For a broader discussion on why class imbalance itself is rarely the root problem and how metrics and loss interact with imbalance, see:

Is class imbalance really a problem in machine learning?

$\endgroup$
