I've been working on a multiclass problem (5 classes) and I'm facing some challenges with feature selection and class imbalance.
I have around 1,000 rows and 2,000 features (which I generated exhaustively), consisting of in-app activity. The data has many zeros because of inactive users, which makes the feature distributions right-skewed with long tails.
The target variable is distributed like this:
Class 0: 2%
Class 1: 5%
Class 2: 10%
Class 3: 30%
Class 4: 55%
Please critique my workflow below, and feel free to suggest steps that could improve my solution.
- Split the data into Train-Test and Holdout Sets.
- Check the skewness of each feature; if it exceeds 0.5, apply a square-root transformation.
- After the transformation, ~95% of the features are still skewed, so use RobustScaler() instead of StandardScaler().
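The two steps above can be sketched like this, on a synthetic right-skewed matrix standing in for the real features (assumes the features are non-negative counts, since sqrt is undefined for negatives):

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
X = rng.exponential(2.0, size=(1000, 20))  # right-skewed, many small values

skewed = skew(X, axis=0) > 0.5             # boolean mask, one entry per feature
X_t = X.copy()
X_t[:, skewed] = np.sqrt(X_t[:, skewed])   # dampen the long right tail

X_scaled = RobustScaler().fit_transform(X_t)  # centers on median, scales by IQR
```

One caveat: to avoid leakage, both the skewness mask and the scaler should be fitted on the training split only and then applied to the test/holdout splits.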
- Drop Constant and Quasi-constant features.
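A sketch of this step with VarianceThreshold; "quasi-constant" here means variance below a small cutoff (the 0.01 threshold is an assumed value, not something from my workflow):

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
X[:, 0] = 3.0                      # constant feature
X[:, 1] = 0.0
X[0, 1] = 0.001                    # quasi-constant feature (variance ~1e-8)

selector = VarianceThreshold(threshold=0.01)
X_reduced = selector.fit_transform(X)   # drops the two near-constant columns
```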
- Create different training sets using the following feature selection techniques:
a. Filter method: SelectKBest. Since my features are continuous, use ANOVA (find k via cross-validation).
b. Wrapper method: Recursive Feature Elimination with Random Forest as the estimator (using RFECV).
c. Embedded methods: LassoCV (weird, because it retained only 1 feature out of 2,000), RidgeCV (arbitrarily keep the top 5% of features by coefficient magnitude), and Random Forest feature importance (top 5% of features).
- Create more training sets, but this time check for multicollinearity using the Variance Inflation Factor and pairwise correlation.
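For reference, here is how I understand step (a) should be wired up so the selection is refit inside each CV fold (avoiding leakage), with k tuned by grid search; the data, candidate k values, and downstream classifier are all placeholders:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic multiclass data standing in for the real features.
X, y = make_classification(n_samples=500, n_features=50, n_informative=8,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(f_classif)),   # ANOVA F-test per feature
    ("clf", LogisticRegression(max_iter=1000)),
])
grid = GridSearchCV(pipe, {"select__k": [5, 10, 20, 50]},
                    scoring="f1_macro", cv=5)
grid.fit(X, y)
best_k = grid.best_params_["select__k"]
```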
- Train the following models on all of the training sets generated by the different feature selection techniques (someone tell me if this is the right thing to do):
a. LASSO Regression
b. Ridge Regression
c. SVM
d. Decision Tree
e. Random Forest
f. XGBoost
g. LightGBM
h. CatBoost
- Evaluate and compare results using classification metrics and review the confusion matrix (the more the predictions concentrate on the diagonal, the better).
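The evaluation step, sketched on toy predictions (the arrays below are placeholders, not my actual results):

```python
from sklearn.metrics import balanced_accuracy_score, confusion_matrix, f1_score

y_true = [0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 0]

bal_acc = balanced_accuracy_score(y_true, y_pred)   # mean of per-class recalls
macro_f1 = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-class F1
cm = confusion_matrix(y_true, y_pred)  # rows = true class, cols = predicted class
```

Balanced accuracy averages recall over classes, so it treats the 2% class the same as the 55% class; that's why it can sit near 25% even when plain accuracy looks decent.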
Results: most of the models scored only around 20-25% balanced accuracy and macro F1 score, maybe because of the class imbalance.
I've read some Q&As here saying that the best thing to do is to "do nothing" in these cases, since SMOTE/oversampling/undersampling changes the sample distribution and destroys the calibration.
Will the built-in class weights parameter help me (does it also change the calibration), or is this a case of adjusting the decision threshold based on business considerations?
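For context, here is the class-weights option I mean, sketched on synthetic imbalanced data: class_weight="balanced" reweights the loss by inverse class frequency, so no rows are added or removed, but it does shift the predicted probabilities (which is the calibration concern):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Imbalanced synthetic stand-in: three classes at roughly 5/25/70 percent.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=6,
                           n_classes=3, n_clusters_per_class=1,
                           weights=[0.05, 0.25, 0.70], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

plain = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
weighted = RandomForestClassifier(class_weight="balanced",
                                  random_state=0).fit(X_tr, y_tr)

score_plain = balanced_accuracy_score(y_te, plain.predict(X_te))
score_weighted = balanced_accuracy_score(y_te, weighted.predict(X_te))
```

My understanding is that if calibrated probabilities matter downstream, a weighted model can be re-calibrated afterwards (e.g. with CalibratedClassifierCV), but I'd appreciate confirmation.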
Sorry for the long post; I'm just wondering why I can't get scores higher than ~25%.
Thank you.