
I am learning machine learning and exploring nested cross-validation. I don't understand the example given in scikit-learn, because the model seems to learn from the whole dataset and the evaluation is not performed on a hold-out set.

scikit documentation

scikit implementation
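
If I read the example correctly, its pattern boils down to something like the following. This is only a minimal sketch; the SVC estimator, the iris data, and the parameter grid are just placeholders standing in for whatever the docs actually use:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}

inner_cv = KFold(n_splits=4, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=0)

# Inner loop: GridSearchCV tunes the hyper-parameters via inner_cv.
clf = GridSearchCV(estimator=SVC(), param_grid=param_grid, cv=inner_cv)

# Outer loop: cross_val_score evaluates the grid-search estimator
# across the outer_cv splits.
nested_scores = cross_val_score(clf, X, y, cv=outer_cv)
print(nested_scores.mean())
```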

From what I read in Applied Predictive Modeling by Kuhn & Johnson, the model resulting from the inner loop should be evaluated on the hold-out set of the outer loop, and the following post adheres to this point: machinelearningmastery blog
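
The blog post, as I understand it, spells the outer loop out by hand, roughly like this (again only a sketch, with the same placeholder data and estimator as above):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}

outer_cv = KFold(n_splits=4, shuffle=True, random_state=0)
outer_scores = []

for train_idx, test_idx in outer_cv.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]

    # Inner loop: tune the hyper-parameters on the outer training fold only.
    inner_cv = KFold(n_splits=4, shuffle=True, random_state=0)
    search = GridSearchCV(SVC(), param_grid, cv=inner_cv)
    search.fit(X_train, y_train)

    # Evaluate the tuned model on the outer hold-out fold, which the
    # inner tuning never saw.
    outer_scores.append(search.score(X_test, y_test))

print(np.mean(outer_scores))
```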

As I am far from a Python expert, could you tell me the advantages, drawbacks, and purposes of each of these implementations?

  • I don't see a difference between the methods in your three links. Can you clarify what difference you're asking about? – Commented Jun 4 at 22:11
  • In short, I feel the first approach, which uses the whole dataset in the inner loop, involves data leakage. The first two links show the same approach: an inner loop tuning hyper-parameters on the whole dataset, and an outer loop evaluating the model's performance, again on the whole dataset. The third link shows what I consider a truly nested approach: the outer loop splits the dataset into a training part, which feeds the inner loop that tunes the hyper-parameters, and a hold-out part, which is used to evaluate the performance of the tuned model. – Commented Jun 5 at 5:11

