
For an ML course, I am supposed to build a model from the training set to predict the variable "classe" in a validation set. I removed all unnecessary variables from the training set, used cross-validation to prevent over-fitting, and made sure the validation set matched the training set in terms of which columns were removed. When I predict classe in the validation set, it yields class A for every row, which I know is incorrect.

I included the entire script below.

Where did I go wrong?

library(caret)

download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", "train.csv")
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", "test.csv")

train <- read.csv("./train.csv")
val <- read.csv("./test.csv")

# getting rid of columns with a large number of NAs
nas <- sapply(train, function(x) sum(is.na(x)))
train <- train[, nas < 1900]

# removing near-zero-variance columns
remove <- nearZeroVar(train)
train <- train[, -remove]

# create a partition within the training set
set.seed(8675309)
inTrain <- createDataPartition(train$classe, p = .7, list = FALSE)
training <- train[inTrain,]
testing <- train[-inTrain,]

model <- train(classe ~ ., method = "rf", data = training)

confusionMatrix(predict(model, testing), testing$classe)

# make sure the validation set has the same features as the training set
trainforvalid <- subset(training, select = -classe)
val <- val[, colnames(trainforvalid)]

predict(model, val)
# the above step yields all predictions as class A
  • How can it report 0.99 accuracy if it predicts all A, when A accounts for only 0.284 of the data (according to the distribution of the targets in Khelifi's answer)?
  • Furthermore, the near-zero-variance removal is something you should think through. If you have a feature X1 in the range [10^(-10), 10^(-8)], it would have "low variance" by nature of its scale, but it might still clearly separate your classes; e.g. each class could be a medical product and X1 the amount of a given active chemical. Use the normalized/scaled variance instead, IMO (a sketch follows below).
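
As a hedged illustration of the scaled-variance suggestion above (this sketch is not from the thread, and the 0.01 cutoff is an arbitrary placeholder): one scale-aware alternative to screening on raw variance is the coefficient of variation.

# Sketch: flag numeric columns whose coefficient of variation (sd relative to
# the mean) is tiny, instead of screening on raw variance, so small-scale
# features are not discarded just because of their units.
# The 0.01 cutoff is a made-up placeholder, not a recommended value.
numCols <- sapply(train, is.numeric)
cv <- sapply(train[, numCols], function(x) sd(x, na.rm = TRUE) / abs(mean(x, na.rm = TRUE)))
lowCV <- names(cv)[is.finite(cv) & cv < 0.01]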

1 Answer


This might be happening because the data is unbalanced. If the data has many more data points for class A than for class B, the model can simply learn to always predict class A.
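
A quick way to check the balance, reusing the partition from the script in the question:

# Proportion of each class in the training partition
prop.table(table(training$classe))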

Try using a metric better suited to this case, such as the F1 score.
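
caret can report per-class precision, recall, and F1 directly: confusionMatrix() accepts mode = "everything". A minimal sketch using the objects from the script above:

# Per-class precision, recall, and F1 for the hold-out predictions
cm <- confusionMatrix(predict(model, testing), testing$classe, mode = "everything")
cm$byClass[, c("Precision", "Recall", "F1")]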

I also recommend techniques such as oversampling or undersampling to mitigate the class-imbalance issue.
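
Within caret, resampling can be requested through trainControl()'s sampling argument ("up" for oversampling the minority classes, "down" for undersampling the majority class). A hedged sketch, reusing the training data from the question:

# Down-sample the majority class within each resampling iteration
ctrl <- trainControl(method = "cv", number = 5, sampling = "down")
model_ds <- train(classe ~ ., method = "rf", data = training, trControl = ctrl)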


3 Comments

Here are the counts of the "classe" variable in the training set: A = 3906, B = 2658, C = 2396, D = 2252, E = 2525. So the response variable is not incredibly skewed towards A.
Well, the data does not seem to be that unbalanced (the count of rows with class A is between 1.4 and 1.9 times that of the other labels). If you use weak features or a weak model, that might be the cause of the issue. Try increasing the number of estimators and creating/using other meaningful features. Don't stick to one model; if you prefer tree-based models, XGBoost or LightGBM will likely perform better than RF (a sketch follows after these comments).
What does the confusion matrix look like? It might give you a hint about which classes the model struggles with.
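
A hedged sketch of the boosted-tree suggestion above ("xgbTree" is caret's built-in wrapper around the xgboost package; the cross-validation setup here is an assumption, not something from the thread):

# Try a gradient-boosted tree model through the same caret interface
boosted <- train(classe ~ ., method = "xgbTree", data = training,
                 trControl = trainControl(method = "cv", number = 5))
confusionMatrix(predict(boosted, testing), testing$classe)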
