I've been working on a multiclass problem (5 classes) and I'm facing some challenges with feature selection and class imbalance.
I have around 1,000 rows and 2,000 features (which I generated exhaustively), consisting of in-app activity. The data has many zeros because of inactive users, which makes the feature distributions right-skewed with long tails.
The target variable is distributed like this:
Class 0: 2%
Class 1: 5%
Class 2: 10%
Class 3: 30%
Class 4: 55%
Please critique my workflow below, and feel free to suggest steps that could improve my solution.
- Split the data into Train-Test and Holdout Sets.
- Check the skewness of each feature; if it exceeds 0.5, apply a square-root transformation.
- After the transformation, ~95% of the features are still skewed, so use RobustScaler() instead of StandardScaler().
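The two steps above can be sketched like this, on a synthetic right-skewed matrix standing in for the real features (assumes the features are non-negative counts, since sqrt is undefined for negatives):

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
X = rng.exponential(2.0, size=(1000, 20))  # right-skewed, many small values

skewed = skew(X, axis=0) > 0.5             # boolean mask, one entry per feature
X_t = X.copy()
X_t[:, skewed] = np.sqrt(X_t[:, skewed])   # dampen the long right tail

X_scaled = RobustScaler().fit_transform(X_t)  # centers on median, scales by IQR
```

One caveat: to avoid leakage, both the skewness mask and the scaler should be fitted on the training split only and then applied to the test/holdout splits.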
- Drop Constant and Quasi-constant features.
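A sketch of this step with VarianceThreshold; "quasi-constant" here means variance below a small cutoff (the 0.01 threshold is an assumed value, not something from my workflow):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
X[:, 0] = 3.0                      # constant feature
X[:, 1] = 0.0
X[0, 1] = 0.001                    # quasi-constant feature (variance ~1e-8)

selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)   # drops the two near-constant columns
```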
- Create different training sets using the following feature selection techniques:
a. Filter method: SelectKBest. Since my features are continuous, use ANOVA (find k via cross-validation).
b. Wrapper method: Recursive Feature Elimination with Random Forest as the estimator (using RFECV).
c. Embedded methods: LassoCV (weird, because it retained only 1 feature out of 2,000), RidgeCV (arbitrarily keep the top 5% of features by coefficient magnitude), and Random Forest feature importance (top 5% of features).
- Create more training sets, but this time check for multicollinearity using the Variance Inflation Factor and pairwise correlation.
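For reference, here is how I understand step (a) should be wired up so the selection is refit inside each CV fold (avoiding leakage), with k tuned by grid search; the data, candidate k values, and downstream classifier are all placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic multiclass data standing in for the real features.
X, y = make_classification(n_samples=500, n_features=50, n_informative=8,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(f_classif)),   # ANOVA F-test per feature
    ("clf", LogisticRegression(max_iter=1000)),
])
grid = GridSearchCV(pipe, {"select__k": [5, 10, 20, 50]},
                    scoring="f1_macro", cv=5)
grid.fit(X, y)
best_k = grid.best_params_["select__k"]
```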
- Train the following models on all of the training sets generated by the different feature selection techniques (someone tell me if this is the right thing to do):
a. LASSO Regression
b. Ridge Regression
c. SVM
d. Decision Tree
e. Random Forest
f. XGBoost
g. LightGBM
h. CatBoost
- Evaluate and compare results using classification metrics and review the confusion matrix (the more the predictions concentrate on the diagonal, the better).
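The evaluation step, sketched on toy predictions (the arrays below are placeholders, not my actual results):

```python
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, f1_score

y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 0]

bal_acc = balanced_accuracy_score(y_true, y_pred)   # mean of per-class recalls
macro_f1 = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-class F1
cm = confusion_matrix(y_true, y_pred)  # rows = true class, cols = predicted class
```

Balanced accuracy averages recall over classes, so it treats the 2% class the same as the 55% class; that's why it can sit near 25% even when plain accuracy looks decent.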
Results: most of the models scored only around 20-25% balanced accuracy and macro F1 score, maybe because of the class imbalance.
I've read some Q&As here saying that the best thing to do is to "do nothing" in these cases, since SMOTE/oversampling/undersampling changes the sample distribution and destroys the calibration.
Will the built-in class weights parameter help me (does it also change the calibration), or is this a case of adjusting the decision threshold based on business considerations?
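For context, here is the class-weights option I mean, sketched on synthetic imbalanced data: class_weight="balanced" reweights the loss by inverse class frequency, so no rows are added or removed, but it does shift the predicted probabilities (which is the calibration concern):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic stand-in: three classes at roughly 5/25/70 percent.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=6,
                           n_classes=3, n_clusters_per_class=1,
                           weights=[0.05, 0.25, 0.70], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
weighted = RandomForestClassifier(class_weight="balanced",
                                  random_state=0).fit(X_tr, y_tr)

score_plain = balanced_accuracy_score(y_te, plain.predict(X_te))
score_weighted = balanced_accuracy_score(y_te, weighted.predict(X_te))
```

My understanding is that if calibrated probabilities matter downstream, a weighted model can be re-calibrated afterwards (e.g. with CalibratedClassifierCV), but I'd appreciate confirmation.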
Sorry for the long post; I'm just wondering why I can't get scores higher than ~25%.
Thank you.