I am using the Kaggle 'Red Wine Quality' dataset for regression, as a self-exploration exercise.
I did the analysis using multiple linear regression and found that '***' appeared against only a few of the independent variables. However, when I ran feature selection on the same dataset using 'Boruta', it deemed all independent variables important.
I will publish the code and corresponding output after office hours :)
In the meantime, I would be obliged if someone could help me understand, in theory, the difference in output between these two methods. I am perplexed as to which variables are 'important' in this case, and when one should use Boruta.
I followed these steps:
- Removed all previous variables from the environment, to start with a clean slate:
rm(list = ls(all.names = TRUE))
- Set the working directory:
setwd("D:/MachineLearning/Kaggle/2_WineQuality")
getwd()
- Installed and loaded packages using pacman:
if (!require(pacman)) install.packages("pacman")
p_load(janitor, stats19, caret, dplyr, Boruta, ggplot2, reshape2, caTools, randomForest, missForest, e1071, rpart)
p_loaded()
- Imported the dataset:
dataset = read.csv('winequality-red.csv')
str(dataset)
head(dataset, 8)
OUTPUT for reference:
> str(dataset)
'data.frame':	1599 obs. of  12 variables:
 $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
 $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
 $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
 $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
 $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
 $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
 $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
 $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
 $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
 $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
 $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
 $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

> head(dataset, 8)
  fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
1           7.4             0.70        0.00            1.9     0.076
2           7.8             0.88        0.00            2.6     0.098
3           7.8             0.76        0.04            2.3     0.092
4          11.2             0.28        0.56            1.9     0.075
5           7.4             0.70        0.00            1.9     0.076
6           7.4             0.66        0.00            1.8     0.075
7           7.9             0.60        0.06            1.6     0.069
8           7.3             0.65        0.00            1.2     0.065
  free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol quality
1                  11                   34  0.9978 3.51      0.56     9.4       5
2                  25                   67  0.9968 3.20      0.68     9.8       5
3                  15                   54  0.9970 3.26      0.65     9.8       5
4                  17                   60  0.9980 3.16      0.58     9.8       6
5                  11                   34  0.9978 3.51      0.56     9.4       5
6                  13                   40  0.9978 3.51      0.56     9.4       5
7                  15                   59  0.9964 3.30      0.46     9.4       5
8                  15                   21  0.9946 3.39      0.47    10.0       7
- Checked each column for missing values; none were found.
- Checked for outliers using dplyr.
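Since I haven't posted the exact cleaning code yet, here is a minimal sketch of the kind of checks I ran (illustrative only; the IQR-based outlier rule here is an assumption, not my exact code). Note that in my run nothing was removed, so the `dataset_revised` data frame used in the modelling steps below is effectively the full dataset:

#Column-wise NA counts -- all came back zero
colSums(is.na(dataset))

#Illustrative outlier check with dplyr: count values outside 1.5 * IQR per column
library(dplyr)
dataset %>%
  summarise(across(everything(),
                   ~ sum(.x < quantile(.x, 0.25) - 1.5 * IQR(.x) |
                         .x > quantile(.x, 0.75) + 1.5 * IQR(.x))))

#No rows were dropped, so the modelling data frame is a copy of the original
dataset_revised <- dataset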
Then, I completed the following steps:
#Splitting the dataset into the Training set and Test set
#First we are using the entire dataset, without identifying important variables.
#install.packages('caTools')
library(caTools)
set.seed(1235)
split = sample.split(dataset_revised$quality, SplitRatio = 0.8)
training_set = subset(dataset_revised, split == TRUE)
test_set = subset(dataset_revised, split == FALSE)

#Fitting multiple linear regression to the training set
regressor = lm(quality ~ ., data = training_set)
summary(regressor)

#Predicting the test set results
y_pred = predict(regressor, newdata = test_set)
y_pred
OUTPUT for reference:
> summary(regressor)

Call:
lm(formula = quality ~ ., data = training_set)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.10231 -0.36668 -0.06113  0.45744  1.63829 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)           1.596e+01  1.726e+01   0.924  0.35556    
fixed.acidity         2.603e-02  1.930e-02   1.349  0.17762    
volatile.acidity     -8.255e-01  1.333e-01  -6.191 8.04e-10 ***
citric.acid          -2.022e-01  1.409e-01  -1.435  0.15160    
residual.sugar        3.117e-02  4.569e-02   0.682  0.49523    
chlorides            -6.888e-01  1.333e+00  -0.517  0.60548    
free.sulfur.dioxide   2.446e-03  2.332e-03   1.049  0.29450    
total.sulfur.dioxide -2.602e-03  8.397e-04  -3.099  0.00198 ** 
density              -1.252e+01  1.747e+01  -0.716  0.47387    
pH                   -3.350e-01  1.693e-01  -1.978  0.04809 *  
sulphates             1.554e+00  1.571e-01   9.892  < 2e-16 ***
alcohol               2.504e-01  2.336e-02  10.720  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6043 on 1267 degrees of freedom
Multiple R-squared:  0.3383,	Adjusted R-squared:  0.3326 
F-statistic: 58.89 on 11 and 1267 DF,  p-value: < 2.2e-16
As per multiple linear regression, three independent variables (volatile.acidity, sulphates, alcohol) are deemed highly significant ('***'), while two others are less significant (total.sulfur.dioxide with '**' and pH with '*').
OUTPUT for prediction:
> y_pred = predict(regressor, newdata = test_set)
> y_pred
       5        6        8        9       14       17       24       25       28 
5.074874 5.094743 5.233555 5.325573 5.180410 5.854369 5.250577 5.522663 5.925952 
      31       33       42       65       68       69       85       88       92 
5.182381 5.307526 5.111679 5.249088 5.463764 6.057060 5.842495 5.445663 5.631287 
      98      105      106      108      109      114      115      118      124 
5.362495 5.146589 5.038781 5.316397 5.667034 5.680731 5.833450 5.145062 5.105668 
     132      136      152      153      156      165      166      168      170 
6.073000 5.546268 5.360423 5.235707 5.794010 4.890636 5.116015 5.027211 5.260609 
     173      182      184      189      198      199      202      204      206 
(output truncated on purpose)
#+++++++++++++++++++++++++++++++++++
#WITH BORUTA
#+++++++++++++++++++++++++++++++++++

#Run Boruta feature selection
set.seed(1234) # For reproducibility
boruta_result <- Boruta(quality ~ ., data = dataset_revised, doTrace = 2, maxRuns = 100)
- run of importance source...
- run of importance source...
- run of importance source...
- run of importance source...
- run of importance source...
- run of importance source...
- run of importance source...
- run of importance source...
- run of importance source...
- run of importance source...
- run of importance source...
After 11 iterations, +12 secs:
 confirmed 11 attributes: alcohol, chlorides, citric.acid, density, fixed.acidity and 6 more;
 no more attributes left.
#Print Boruta results
print(boruta_result)

OUTPUT:
Boruta performed 11 iterations in 11.93987 secs.
 11 attributes confirmed important: alcohol, chlorides, citric.acid, density, fixed.acidity and 6 more;
 No attributes deemed unimportant.
#Get confirmed important variables
important_vars <- getSelectedAttributes(boruta_result, withTentative = FALSE)
cat("Confirmed Important Variables:\n", important_vars, "\n")

OUTPUT:
Confirmed Important Variables:
 fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
#Optional: statistical summary of importance
boruta_stats <- attStats(boruta_result)
print(boruta_stats)

OUTPUT:
                       meanImp medianImp    minImp   maxImp normHits  decision
fixed.acidity        15.938186 16.149327 14.724135 17.18521        1 Confirmed
volatile.acidity     29.416528 29.596525 27.164920 31.04642        1 Confirmed
citric.acid          17.514146 18.123377 15.352202 18.57271        1 Confirmed
residual.sugar        8.922052  8.875384  7.389886 10.59015        1 Confirmed
chlorides            13.972852 14.111462 12.401181 15.36371        1 Confirmed
free.sulfur.dioxide  13.665830 13.732384 12.539486 14.70550        1 Confirmed
total.sulfur.dioxide 19.104844 19.301512 17.758147 20.12949        1 Confirmed
density              22.191538 21.825438 20.661987 24.31289        1 Confirmed
pH                   11.781957 11.828005 10.721066 12.99994        1 Confirmed
sulphates            38.698071 38.576669 36.752754 41.40466        1 Confirmed
alcohol              46.899268 46.976149 44.181024 49.39952        1 Confirmed
However, in contrast to multiple linear regression, Boruta says all independent variables are important. Is this a contradiction, or am I misreading something?