$\begingroup$

I have a dataset with 5 classes. About 98% of the dataset belongs to class 5. Classes 1-4 share the remaining ~2% roughly equally. However, it is highly important that classes 1-4 are correctly classified.

The accuracy is not a good measure of performance for my task. I found lots of information on metrics for imbalanced binary classification tasks but not on multiclass problems.

Which performance metrics should I use for such a task?

  • TP, TN, FP, FN
  • Precision
  • Sensitivity
  • Specificity
  • F-score
  • ROC-AUC (micro, macro, samples, weighted)
$\endgroup$

5 Answers

$\begingroup$

For unbalanced classes, I would suggest going with the weighted F1-score, or with the average/weighted AUC.

Let's first look at the F1-score for binary classification.

Because it is a harmonic mean, the F1-score gives a larger weight to the lower of the two numbers.

For example,

  • When precision is 100% and recall is 0%, the F1-score is 0%, not 50%.
  • Suppose classifier A has precision = recall = 80%, while classifier B has precision = 60% and recall = 100%. Arithmetically, the mean of precision and recall is the same for both models, but under F1's harmonic-mean formula classifier A scores 80% while classifier B scores only 75%: B's low precision pulls its F1-score down.
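The arithmetic in these bullets can be checked in a few lines of Python (the precision/recall values are the hypothetical ones for classifiers A and B above):

```python
# F1 is the harmonic mean of precision and recall, so the lower of the
# two values dominates the score.
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

# Classifier A: precision = recall = 0.80
# Classifier B: precision = 0.60, recall = 1.00
print(round(f1(0.80, 0.80), 2))  # 0.8
print(round(f1(0.60, 1.00), 2))  # 0.75
```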

Now, let us move to multiclass classification.

Let us suppose we have five classes, class_1, class_2, class_3, class_4, class_5,

and the model produces the following results for each class.

[Screenshot of the per-class results omitted.]

Formula for precision for each class = (true positives for the class)/(count of predicted positives for the class)

e.g. precision for class_1 = (true positives for class_1)/(count predicted as class_1)

Formula for recall for each class = (true positives for the class)/(actual positives for the class)

e.g. recall for class_1 = (true positives for class_1)/(total instances of class_1)

Formula for F1: F1 is the harmonic mean of precision and recall, i.e.

F1 = 2*(Precision*Recall)/(Precision+Recall)

Macro-F1 = (Class_1_F1 + Class_2_F1 + Class_3_F1 + Class_4_F1 + Class_5_F1)/5

Macro-Precision = (Class_1_Precision + Class_2_Precision + Class_3_Precision + Class_4_Precision + Class_5_Precision)/5

Macro-Recall = (Class_1_Recall + Class_2_Recall + Class_3_Recall + Class_4_Recall + Class_5_Recall)/5
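Since the screenshot with the per-class counts is missing, here is a sketch of the same per-class and macro computations on an invented 5-class confusion matrix (the counts are purely illustrative):

```python
# Per-class precision/recall and the macro averages, computed directly
# from a confusion matrix: rows are actual classes, columns are predicted.
import numpy as np

cm = np.array([
    [50,  2,  1,  0,  2],   # class_1
    [ 3, 40,  4,  1,  2],   # class_2
    [ 2,  3, 45,  3,  2],   # class_3
    [ 1,  2,  2, 48,  2],   # class_4
    [ 4,  3,  3,  3, 42],   # class_5
])

tp = np.diag(cm)
precision = tp / cm.sum(axis=0)  # TP / count predicted as that class
recall = tp / cm.sum(axis=1)     # TP / actual instances of that class
f1 = 2 * precision * recall / (precision + recall)

print("macro precision:", precision.mean())
print("macro recall:   ", recall.mean())
print("macro F1:       ", f1.mean())
```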

Problem with the macro average: when computing macro-F1, we give equal weight to each class, no matter how many samples it has.

Weighted F1 Score:

We don’t have to do that: in weighted-average F1-score, or weighted-F1, we weight the F1-score of each class by the number of samples from that class.

Weighted F1 Score = (N1*Class_1_F1 + N2*Class_2_F1 + N3*Class_3_F1 + N4*Class_4_F1 + N5*Class_5_F1)/(N1 + N2 + N3 + N4 + N5)
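The difference between the macro and weighted averages can be seen with scikit-learn on an invented sample that mimics the question's imbalance:

```python
# Macro vs. weighted F1 with scikit-learn. The labels are invented:
# class 5 dominates, classes 1-4 are rare, as in the question.
from sklearn.metrics import f1_score

y_true = [5] * 20 + [1, 1, 2, 3, 4]
y_pred = [5] * 19 + [1] + [1, 2, 2, 3, 4]

# Macro: every class counts equally, so the rare classes' mistakes show up.
print(f1_score(y_true, y_pred, average="macro"))
# Weighted: each class's F1 is weighted by its support, so class 5 dominates.
print(f1_score(y_true, y_pred, average="weighted"))
```

On this sample the macro score is about 0.83 while the weighted score is about 0.93, showing how the weighted average can mask poor performance on the rare classes.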

References: https://towardsdatascience.com/multi-class-metrics-made-simple-part-ii-the-f1-score-ebe8b2c2ca1

$\endgroup$
  • $\begingroup$ Just for the record: Medium articles are not credible sources. (I am not down-voting this post or anything like that, but I have seen "bad" Medium articles from experienced users, so be careful about them.) $\endgroup$ Commented Apr 28, 2020 at 14:38
  • $\begingroup$ Thanks for that. I checked the profile of the writer: he is a PhD student with good experience in data science. Apart from that, I am a data scientist myself and have used this metric to evaluate our models. $\endgroup$ Commented Apr 28, 2020 at 15:23
  • $\begingroup$ Why is giving equal weights to each class in macro-F1 a bad thing? For example, assume class 1 is the majority class and Class_1_F1 is the highest as well. Wouldn't the weighted F1 then mostly represent the F1 of the majority class? But we want a metric that represents the minority classes, don't we? $\endgroup$ Commented Mar 27, 2021 at 6:50
  • $\begingroup$ The F1 score requires some threshold. How do you decide that? The good thing about metrics like the area under the ROC/PR curve is that they are computed across all thresholds. $\endgroup$ Commented Nov 5, 2024 at 9:04
  • $\begingroup$ It is not a good idea to use F1 unless it is a detection or information-retrieval problem where the number of true negatives is genuinely irrelevant. The fact that a problem is imbalanced does not automatically make it a detection or information-retrieval problem. $\endgroup$ Commented Oct 10, 2025 at 8:38
$\begingroup$

The premise of this question is incorrect. We should not choose a performance metric according to the properties of the data, but according to the requirements of the application, i.e. what is important to the user of the classifier system we are trying to create. The information most often ignored is that different kinds of errors can have different costs. For example, in a medical screening test it is a much worse error to say that someone is healthy when they are not (they may get much worse, or even die, before the error is spotted) than to say that they have a disease when they don't (they will probably be sent for a more complicated and expensive test, which will show that they are healthy, and they won't die). If the misclassification costs are equal, then the error rate may be a good metric even for imbalanced tasks, as it represents the (e.g. financial) loss over the test set.

In general, I would recommend using a probabilistic classifier (e.g. [kernel] multinomial logistic regression) which estimates the probabilities of class membership. That way misclassification costs can be taken into account without retraining the model. Use a proper scoring rule (e.g. log loss) to evaluate the quality of those probability estimates as well as metrics more directly targeted to the needs of the application.
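This approach can be sketched with scikit-learn; the dataset below is synthetic, built to mimic the question's roughly 98%/2% imbalance:

```python
# A probabilistic classifier (multinomial logistic regression) evaluated
# with a proper scoring rule (log loss) on its probability estimates.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=2000, n_classes=5, n_informative=8,
    weights=[0.005, 0.005, 0.005, 0.005, 0.98], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)  # estimated class-membership probabilities

# Log loss rewards well-calibrated probabilities; lower is better.
print(log_loss(y_te, proba, labels=clf.classes_))
```

Because the model outputs probabilities rather than hard labels, misclassification costs can be applied at decision time without retraining, as described above.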

In some applications, where the misclassification costs (and operational class frequencies) are known in advance and fixed, you may get better decision performance from a discrete classifier (such as the Support Vector Machine). This is because such classifiers focus on the decision boundary, so there is less compromise introduced by modelling data that doesn't affect it. However, that scenario tends to be fairly uncommon.

Use more than one metric. I often use the loss/error rate (perhaps the balanced error rate) to evaluate the quality of the hard decisions, AUROC (for binary problems) to assess the ranking of patterns, and the cross-entropy to assess the quality of the posterior probability estimates. These are chosen to help diagnose problems with the more important application-specific metrics, which should remain the primary focus.

$\endgroup$
$\begingroup$

Precision, recall, F1, ROC/AUC, and the other metrics you mentioned (such as specificity/sensitivity) can all work for multiclass imbalanced problems. If you want to emphasize the under-represented classes, use macro averaging, the unweighted arithmetic mean over classes. If not, use the micro average, which pools true/false positives and negatives across all samples and is therefore dominated by the majority class.

Another metric people don't often talk about is Cohen's kappa. I like to think of it as accuracy corrected for the "no-information rate", i.e. the random-guessing baseline. It yields a score on a scale similar to accuracy, though it has known weaknesses in certain situations. In general, I've found Cohen's kappa to work well.
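A small illustration of why this correction matters: on an invented sample with the question's kind of imbalance, a degenerate classifier that always predicts the majority class looks good on accuracy but scores zero kappa.

```python
# Cohen's kappa vs. accuracy for a model that always predicts class 5.
from sklearn.metrics import accuracy_score, cohen_kappa_score

y_true = [5] * 24 + [1]
y_pred = [5] * 25

print(accuracy_score(y_true, y_pred))     # 0.96 -- looks great
print(cohen_kappa_score(y_true, y_pred))  # 0.0  -- no better than chance
```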

Others include the litany of metrics listed on Wikipedia's confusion-matrix page, such as the Matthews correlation coefficient (MCC).

$\endgroup$
$\begingroup$

I would use the binary F-beta score, with beta tuned according to your preference for false positives vs. false negatives, and with classes 1-4 grouped (for performance evaluation only) into a single "positive" class, since F-beta is computed with respect to the positive class.
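As a sketch with invented labels, mapping the four rare classes to the positive side so that F-beta scores them:

```python
# Binarized F-beta: collapse the rare classes 1-4 into one positive class
# and treat the majority class 5 as negative. beta > 1 emphasizes recall
# (fewer missed rare cases); beta < 1 emphasizes precision.
from sklearn.metrics import fbeta_score

y_true = [5] * 20 + [1, 2, 3, 4, 4]
y_pred = [5] * 17 + [2, 3, 5] + [1, 2, 5, 4, 4]

# Map to binary: 1 = "some rare class (1-4)", 0 = "majority class 5".
to_bin = lambda ys: [0 if y == 5 else 1 for y in ys]
bt, bp = to_bin(y_true), to_bin(y_pred)

print(fbeta_score(bt, bp, beta=2))    # recall-leaning score
print(fbeta_score(bt, bp, beta=0.5))  # precision-leaning score
```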

$\endgroup$
$\begingroup$

Look for a matching scoring rule

https://en.wikipedia.org/wiki/Scoring_rule

These can act as loss functions as well; the smaller, the better. Ask yourself which scoring rule aligns with your business question (e.g. the ranked probability score?).

Then, in a second step, you can still introduce thresholds on these scores and do cost optimization / priority setting with those thresholds.
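If the five classes have a natural ordering, the ranked probability score can be sketched in a few lines; the probabilities below are invented for illustration (for unordered classes, log loss or the multiclass Brier score is the usual choice):

```python
# Ranked probability score (RPS) for ordinal classes: compares cumulative
# predicted probabilities with the cumulative one-hot outcome.
import numpy as np

def rps(proba, y_true):
    """Mean RPS over samples; proba is (n, K), y_true holds indices 0..K-1."""
    n, k = proba.shape
    obs = np.zeros_like(proba)
    obs[np.arange(n), y_true] = 1.0
    cdf_diff = np.cumsum(proba, axis=1) - np.cumsum(obs, axis=1)
    return float(np.mean(np.sum(cdf_diff ** 2, axis=1) / (k - 1)))

proba = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.2, 0.7]])
print(rps(proba, np.array([0, 2])))  # ~0.05: small, since forecasts are good
```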

$\endgroup$
