I am using the random forest.My test accuracy is 70% on the other hand train accuracy is 34% ? what to do ? How can I solve this problem.
3 Answers
Test accuracy should not be higher than train since the model is optimized for the latter. Ways in which this behavior might happen:
you did not use the same source dataset for test. You should do a proper train/test split in which both of them have the same underlying distribution. Most likely you provided a completely different (and more agreeable) dataset for
testan unreasonably high degree of
regularizationwas applied. Even so there would need to be some element of "test data distribution is not the same as that of train" for the observed behavior to occur.
5 Comments
adversarial models are a different animalThe other answers are correct in most cases. But I'd like to offer another perspective. There are specific training regimes that could cause the training data to be harder for the model to learn - for instance, adversarial training or adding Gaussian noise to the training examples. In these cases, the benign test accuracy could be higher than train accuracy, because benign examples are easier to evaluate. This isn't always a problem, however!
If this applies to you, and the gap between train and test accuracies is larger than you'd like (~30%, as in your question, is a pretty big gap), then this indicates that your model is underfitting to the harder patterns, so you'll need to increase the expressibility of your model. In the case of random forests, this might mean training the trees to a higher depth.
1 Comment
First you should check the data that is used for training. I think there is some problem with the data, the data may not be properly pre-processed.
Also, in this case, you should try more epochs. Plot the learning curve to analyze when the model is going to converge.
You should check the following:
- Both training and validation accuracy scores should increase and loss should decrease.
- If there is something wrong in step 1 after any particular epoch, then train your model until that epoch only, because your model is over-fitting after that.