most recent 30 from datascience.stackexchange.com 2026-03-01T06:59:03Z https://datascience.stackexchange.com/feeds/tag?tagnames=boruta https://creativecommons.org/licenses/by-sa/4.0/rdf https://datascience.stackexchange.com/q/134048 0 DT1421 https://datascience.stackexchange.com/users/184043 2025-06-18T12:29:02Z 2025-06-19T12:07:13Z <p>I am using the 'Red Wine Quality' dataset from Kaggle for regression, as self-exploration.</p> <p>I ran a multiple linear regression and found that '***' appeared against only a few of the independent variables. However, when I ran feature selection on the same dataset using 'Boruta', it deemed all independent variables important.</p> <p>I will publish the code and corresponding output after office hours :)</p> <p>In the meantime, I would be obliged if someone could help me understand, in theory, the difference in output between these two methods. I am perplexed as to which variables are 'important' in this case, and when one should use Boruta.</p> <p>I followed these steps:</p> <ol> <li>Removed all previous variables to start with a clean slate.</li> </ol> <blockquote> <pre><code># remove any previous variables from the environment
rm(list = ls(all.names = T))
</code></pre> </blockquote> <ol start="2"> <li>Set the working directory.</li> </ol> <blockquote> <pre><code># set working directory
setwd(&quot;D:/MachineLearning/Kaggle/2_WineQuality&quot;)
getwd()
</code></pre> </blockquote> <ol start="3"> <li>Installed packages using pacman.</li> </ol> <blockquote> <pre><code># install packages
if (!require(pacman)) install.packages(&quot;pacman&quot;)
p_load(janitor, stats19, caret, dplyr, Boruta, ggplot2, reshape2, caTools,
       randomForest, missForest, e1071, rpart)
p_loaded()
</code></pre> </blockquote> <ol start="4"> <li>Imported the dataset.</li> </ol> <blockquote> <pre><code># importing the dataset
dataset = read.csv('winequality-red.csv')
str(dataset)
head(dataset, 8)
</code></pre> </blockquote> <p><strong>OUTPUT for reference:</strong></p> <pre><code>&gt; str(dataset)
'data.frame':   1599 obs. of  12 variables:
 $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
 $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
 $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
 $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
 $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
 $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
 $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
 $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
 $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
 $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
 $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
 $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

&gt; head(dataset, 8)
  fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
1           7.4             0.70        0.00            1.9     0.076
2           7.8             0.88        0.00            2.6     0.098
3           7.8             0.76        0.04            2.3     0.092
4          11.2             0.28        0.56            1.9     0.075
5           7.4             0.70        0.00            1.9     0.076
6           7.4             0.66        0.00            1.8     0.075
7           7.9             0.60        0.06            1.6     0.069
8           7.3             0.65        0.00            1.2     0.065
  free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol quality
1                  11                   34  0.9978 3.51      0.56     9.4       5
2                  25                   67  0.9968 3.20      0.68     9.8       5
3                  15                   54  0.9970 3.26      0.65     9.8       5
4                  17                   60  0.9980 3.16      0.58     9.8       6
5                  11                   34  0.9978 3.51      0.56     9.4       5
6                  13                   40  0.9978 3.51      0.56     9.4       5
7                  15                   59  0.9964 3.30      0.46     9.4       5
8                  15                   21  0.9946 3.39      0.47    10.0       7
</code></pre> <ol start="5"> <li><p>Checked column-wise missing values using a function – none were found.</p> </li> <li><p>Checked for outliers using the <em>dplyr</em> library.</p> </li> <li><p>Then completed the following steps:</p> </li> </ol> <blockquote> <pre><code># Splitting the dataset into the Training set and Test set
# First we are using the entire dataset without identifying important variables.
# install.packages('caTools')
library(caTools)

set.seed(1235)
split = sample.split(dataset_revised$quality, SplitRatio = 0.8)
training_set = subset(dataset_revised, split == TRUE)
test_set = subset(dataset_revised, split == FALSE)

# Fitting multiple linear regression to the training set
regressor = lm(quality ~ ., data = training_set)
summary(regressor)

# Predicting the test set results
y_pred = predict(regressor, newdata = test_set)
y_pred
</code></pre> </blockquote> <p><em>OUTPUT for reference:</em></p> <pre><code>&gt; summary(regressor)

Call:
lm(formula = quality ~ ., data = training_set)

Residuals:
     Min       1Q   Median       3Q      Max
-2.10231 -0.36668 -0.06113  0.45744  1.63829

Coefficients:
                       Estimate Std. Error t value Pr(&gt;|t|)
(Intercept)           1.596e+01  1.726e+01   0.924  0.35556
fixed.acidity         2.603e-02  1.930e-02   1.349  0.17762
volatile.acidity     -8.255e-01  1.333e-01  -6.191 8.04e-10 ***
citric.acid          -2.022e-01  1.409e-01  -1.435  0.15160
residual.sugar        3.117e-02  4.569e-02   0.682  0.49523
chlorides            -6.888e-01  1.333e+00  -0.517  0.60548
free.sulfur.dioxide   2.446e-03  2.332e-03   1.049  0.29450
total.sulfur.dioxide -2.602e-03  8.397e-04  -3.099  0.00198 **
density              -1.252e+01  1.747e+01  -0.716  0.47387
pH                   -3.350e-01  1.693e-01  -1.978  0.04809 *
sulphates             1.554e+00  1.571e-01   9.892  &lt; 2e-16 ***
alcohol               2.504e-01  2.336e-02  10.720  &lt; 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6043 on 1267 degrees of freedom
Multiple R-squared:  0.3383,    Adjusted R-squared:  0.3326
F-statistic: 58.89 on 11 and 1267 DF,  p-value: &lt; 2.2e-16
</code></pre> <p><strong>As per the multiple linear regression, 3 independent variables are deemed highly significant ('***'), while two others are deemed less significant ('**' / '*').</strong></p> <p><strong>OUTPUT for prediction:</strong></p> <pre><code>&gt; # predicting the test set results
&gt; y_pred = predict(regressor, newdata = test_set)
&gt; y_pred
       5        6        8        9       14       17       24       25       28
5.074874 5.094743 5.233555 5.325573 5.180410 5.854369 5.250577 5.522663 5.925952
      31       33       42       65       68       69       85       88       92
5.182381 5.307526 5.111679 5.249088 5.463764 6.057060 5.842495 5.445663 5.631287
      98      105      106      108      109      114      115      118      124
5.362495 5.146589 5.038781 5.316397 5.667034 5.680731 5.833450 5.145062 5.105668
     132      136      152      153      156      165      166      168      170
6.073000 5.546268 5.360423 5.235707 5.794010 4.890636 5.116015 5.027211 5.260609
     173      182      184      189      198      199      202      204      206
(output is truncated on purpose)
</code></pre> <blockquote> <p>#+++++++++++++++++++++++++++++++++++<br /> # WITH BORUTA<br /> #+++++++++++++++++++++++++++++++++++<br /> # Run Boruta feature selection</p>
<pre><code>set.seed(1234)  # For reproducibility
boruta_result &lt;- Boruta(quality ~ ., data = dataset_revised, doTrace = 2, maxRuns = 100)
</code></pre> </blockquote> <pre><code> 1. run of importance source...
 2. run of importance source...
 3. run of importance source...
 4. run of importance source...
 5. run of importance source...
 6. run of importance source...
 7. run of importance source...
 8. run of importance source...
 9. run of importance source...
10. run of importance source...
11. run of importance source...
After 11 iterations, +12 secs: confirmed 11 attributes: alcohol, chlorides, citric.acid, density, fixed.acidity and 6 more; no more attributes left.
</code></pre> <blockquote> <pre><code># Print Boruta results
print(boruta_result)
Boruta performed 11 iterations in 11.93987 secs.
 11 attributes confirmed important: alcohol, chlorides, citric.acid, density, fixed.acidity and 6 more;
 No attributes deemed unimportant.

# Get confirmed important variables
important_vars &lt;- getSelectedAttributes(boruta_result, withTentative = FALSE)
cat(&quot;Confirmed Important Variables:\n&quot;, important_vars, &quot;\n&quot;)
Confirmed Important Variables:
 fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol

# Optional: Statistical summary of importance
boruta_stats &lt;- attStats(boruta_result)
print(boruta_stats)
                       meanImp medianImp    minImp   maxImp normHits decision
fixed.acidity        15.938186 16.149327 14.724135 17.18521        1 Confirmed
volatile.acidity     29.416528 29.596525 27.164920 31.04642        1 Confirmed
citric.acid          17.514146 18.123377 15.352202 18.57271        1 Confirmed
residual.sugar        8.922052  8.875384  7.389886 10.59015        1 Confirmed
chlorides            13.972852 14.111462 12.401181 15.36371        1 Confirmed
free.sulfur.dioxide  13.665830 13.732384 12.539486 14.70550        1 Confirmed
total.sulfur.dioxide 19.104844 19.301512 17.758147 20.12949        1 Confirmed
density              22.191538 21.825438 20.661987 24.31289        1 Confirmed
pH                   11.781957 11.828005 10.721066 12.99994        1 Confirmed
sulphates            38.698071 38.576669 36.752754 41.40466        1 Confirmed
alcohol              46.899268 46.976149 44.181024 49.39952        1 Confirmed
</code></pre> </blockquote> <p><strong>However, in contrast to the multiple linear regression, Boruta says all independent variables are important. Is this a contradiction, or am I reading something wrong?</strong></p> https://datascience.stackexchange.com/q/131816 0 Melissa https://datascience.stackexchange.com/users/181164 2025-05-01T09:08:33Z 2025-05-01T09:08:33Z <p>I am looking at how gut microbiome compositional features (species and genera) relate to anxiety scores (and whether they mediate the HEI-2020 group effect, an exploratory question given n = 46). I ran multiple univariate robust mediation models (robmed() with bootstrapping) to obtain causal mediation statistics for my exploratory question and parameters for path b (gm features --&gt; anxiety, regardless of diet). I was wondering whether it would add any value to also perform Boruta as a “sanity check” (with multiple seeds to check for stability): to confirm, in a multivariate, RF-based framework, the taxa found significant in the robust models, and to spot any additional features or non-linear interactions the robust models might have missed. Would this make sense / be scientifically sound?</p> <p>Given the small n, I cannot use Boruta for feature selection and then run my robust models on the selected features, as that would require a data split I cannot afford.</p> <p>What do you think?</p> https://datascience.stackexchange.com/q/100582 0 spectre https://datascience.stackexchange.com/users/119921 2021-08-28T11:03:40Z 2024-12-05T02:06:35Z <p>I want to use BorutaShap for feature selection in my model. I have my <code>train_x</code> as a <code>numpy.ndarray</code> and I want to pass it to the BorutaShap instance.
When I try to fit, I get the error:</p> <pre><code>AttributeError: 'numpy.ndarray' object has no attribute 'columns'
</code></pre> <p>Below is my code:</p> <pre><code>num_trans = Pipeline(steps = [('impute', SimpleImputer(strategy = 'mean')),
                              ('scale', StandardScaler())])
cat_trans = Pipeline(steps = [('impute', SimpleImputer(strategy = 'most_frequent')),
                              ('encode', OneHotEncoder(handle_unknown = 'ignore'))])

from sklearn.compose import ColumnTransformer
preproc = ColumnTransformer(transformers = [('cat', cat_trans, cat_cols),
                                            ('num', num_trans, num_cols)])
X = preproc.fit_transform(train_data1)
X_final = preproc.transform(test_data1)

from xgboost import XGBRegressor
xgbr_model = XGBRegressor(random_state = 69, tree_method = 'gpu_hist')

from sklearn.model_selection import train_test_split, cross_val_score
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size = 0.2, random_state = 69)

from BorutaShap import BorutaShap
Feature_Selector = BorutaShap(model=xgbr_model, importance_measure='shap', classification=False)
Feature_Selector.fit(train_x, train_y, n_trials=10, random_state=69)
</code></pre> <p>Any help will be appreciated!</p> https://datascience.stackexchange.com/q/126682 0 Andrey https://datascience.stackexchange.com/users/159466 2024-02-01T13:30:53Z 2024-02-01T14:42:31Z <p>I ran Boruta feature selection prior to the XGB training/testing step and didn't see any difference, although ~30 of 200 features were rejected before going into training. Could it be that XGB's internal feature selection is comparable to the Boruta step and assigns nearly zero importance to the same 30 features?</p> <p>My logic is simple: because these 200 features are the initial dataset, it is very hard to believe they are also the optimal set. Therefore feature selection should bring some extra performance, except it does not.</p> <p>How do I test the above assumption?
If I take a simpler model, such as logistic regression, and test it with and without Boruta, can I expect a significant difference in F1 or AUC?</p> <p>Surely someone has already answered this question systematically... any good links to papers? If I have to tell my team that we don't need a feature selection step, I can use all the support I can get :)</p> https://datascience.stackexchange.com/q/85238 3 scaredy_brushwagg https://datascience.stackexchange.com/users/107290 2020-11-11T10:17:37Z 2022-10-27T10:25:55Z <p>I have been asked to look at XGBoost (as implemented in R, and <em>with a maximum of around 50 features</em>) as an alternative to an existing logistic regression model, not developed by me, created from a very large set of credit risk data containing a few thousand predictors.</p> <p>The documentation surrounding the logistic regression is very well prepared, and track has been kept of the reasons for excluding each variable. Among those are:</p> <ul> <li>automated data audit (through an internal tool), i.e. an excessive number of missing values, incredibly low variance, etc.;</li> <li>lack of a monotonic trend, for u-shaped variables after attempts at coarse classing;</li> <li>high correlation (&gt;70%), on the raw level or after binning;</li> <li>low GINI / Information Value, on the raw level or after binning;</li> <li>low representativeness, assessed through the population stability index (PSI);</li> <li>business logic / expert judgement.</li> </ul> <p>A huge number of the variables are derived (incl. aggregates like the min / max / avg of the standard deviation of other predictors) and some have been deemed too synthetic for inclusion.
We have decided not to use those in XGBoost either.</p> <p>The regression was initially run with 44 predictors (the output of a stepwise procedure), whereas the final approved model includes only 10.</p> <p>Because I am rather new to XGBoost, <strong>I was wondering whether the feature selection process differs substantially from what has already been done in preparation for the logistic regression, and what some rules / good practices would be</strong>.</p> <p>Based on what I have been reading, perfect correlation and missing values are both handled automatically in XGBoost. I suspect monotonicity of trend should not be a concern (as the focus, unlike in regression, is on non-linear relations), hence binning is likely out; otherwise, I am a bit unsure about the handling of u-shaped variables. Since GINI is used to decide on the best split in decision trees under the CART (“Classification and Regression Trees”) approach, maybe this is one criterion worth keeping.</p> <p>I have been entertaining the idea of using our internal automated data audit tool, removing std aggregates (too synthetic, as per above), removing low-GINI and low-PSI variables, potentially treating very high (95+%) correlation, and then applying lasso / elastic net and taking it from there. I am aware that Boruta is relevant here, but as of now I have no solid opinion on it.</p> https://datascience.stackexchange.com/q/93861 0 AI_Revolt https://datascience.stackexchange.com/users/115796 2021-05-02T14:34:17Z 2021-05-11T12:29:21Z <p>I was trying to select the most important features of a dataset using <a href="https://github.com/scikit-learn-contrib/boruta_py" rel="nofollow noreferrer">Boruta in Python</a>. I split the data into a training and a test set, then used an <a href="https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html" rel="nofollow noreferrer">SVM regressor</a> to fit the data.
Then I used Boruta to measure feature importance. The code is as follows:</p> <pre><code>from sklearn.svm import SVR
svclassifier = SVR(kernel='rbf', C=1e4, gamma=0.1)
svm_model = svclassifier.fit(x_train, y_train)

from boruta import BorutaPy
feat_selector = BorutaPy(svclassifier, n_estimators='auto', verbose=2, random_state=1)
feat_selector.fit(x_train, y_train)
feat_selector.support_
feat_selector.ranking_
X_filtered = feat_selector.transform(x_train)
</code></pre> <p>But I get this error: <code>KeyError: 'max_depth'</code>.</p> <p>What might be causing this error?</p> <p>Does Boruta work with any kind of model, i.e. linear models, tree-based models, neural nets, etc.?</p> https://datascience.stackexchange.com/q/80623 1 mccurcio https://datascience.stackexchange.com/users/45799 2020-08-21T15:20:00Z 2020-08-21T17:52:59Z <p>I am interested in learning what routine others use (if any) for feature reduction/selection.</p> <p>For example, if my data has several thousand features, I typically try two to four of the following right away, depending on circumstances.</p> <ol> <li><p>Zero variance / near-zero variance</p> <ul> <li>Using the <a href="https://topepo.github.io/caret/pre-processing.html#zero--and-near-zero-variance-predictors" rel="nofollow noreferrer">R package caret</a>, <code>nzv</code>.</li> <li>I find a very small percentage of features have zero variance and a few more have near-zero variance.</li> <li>Then, using <a href="https://youtu.be/qhvkVxuwvLk?t=385" rel="nofollow noreferrer">nzv$PercentUnique</a>, I may remove the bottom quartile of features, depending on the range of PercentUnique values.</li> </ul> </li> <li><p>Correlation, to find multicollinearity</p> <ul> <li>I compute the correlation matrix and <a href="https://topepo.github.io/caret/pre-processing.html#identifying-correlated-predictors" rel="nofollow noreferrer">remove features with correlations &gt; 0.75</a>.</li> <li>I have seen others use cutoffs of 0.5 or 0.6, but I don't have any references for that.</li> </ul> </li> <li><p>Boruta / Random Forest</p> <ul> <li>I love the <a href="https://www.jstatsoft.org/article/view/v036i11" rel="nofollow noreferrer">Boruta package</a>, but it takes a while.</li> <li>Then, here again, use forward feature selection.</li> </ul> </li> <li><p>PCA</p> <ul> <li>Depending on the nature of the data, I will try PCA last.</li> <li>If the model must be explainable, then I skip this.</li> <li>I may use several criteria: 80, 90, 95% of variance explained.</li> <li>Forward feature selection; look for the first ~3 to 10 orthogonal components.</li> </ul> </li> </ol> <p><strong>NOTE</strong>: I am not suggesting this is the best/worst routine, but I'm opening the floor to <a href="https://en.wikipedia.org/wiki/Civil_discourse" rel="nofollow noreferrer">civil debate</a>. If you need a definition of <em>civil debate</em>, see Wikipedia.</p>
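<p>Steps 1 and 2 of a routine like the one above (drop near-zero-variance columns, then prune highly correlated pairs) can be sketched in plain numpy. This is a minimal illustration, not any package's API: the function names, the variance threshold, and the greedy keep-the-first-column rule are my own choices; the 0.75 cutoff echoes the caret example linked above.</p>

```python
import numpy as np

def drop_low_variance(X, names, threshold=1e-8):
    # Keep only columns whose variance exceeds the threshold
    # (a crude stand-in for caret's nearZeroVar).
    keep = X.var(axis=0) > threshold
    return X[:, keep], [n for n, k in zip(names, keep) if k]

def drop_correlated(X, names, cutoff=0.75):
    # For each pair with |corr| > cutoff, greedily keep the
    # earlier column and drop the later one.
    corr = np.abs(np.corrcoef(X, rowvar=False))
    dropped = set()
    for i in range(corr.shape[0]):
        for j in range(i + 1, corr.shape[1]):
            if i not in dropped and j not in dropped and corr[i, j] > cutoff:
                dropped.add(j)
    keep = [i for i in range(X.shape[1]) if i not in dropped]
    return X[:, keep], [names[i] for i in keep]
```

<p>A Boruta pass (e.g. BorutaPy with a random forest) would then run on the reduced matrix; the point of filtering first is only to cut its runtime, which question 3 above notes is considerable.</p>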