Suppose we want to use 5-fold cross-validation for a support vector regression (SVR) model. Should we normalize the whole data set before the cross-validation process, or should we normalize each training part separately and apply the same normalization properties to the corresponding test data?
1 Answer
You should "normalize all the training data together, as one block of data, and then use the same normalization properties of the training data (i.e. the mean and standard deviation of the training data) to normalize all the test data". The reason is that if the data destined to be held out for cross-validation (CV) is used when normalizing the training data, the information contained in that CV data will have leaked into the training data. In k-fold CV this means the mean and standard deviation are recomputed from the training portion of each fold and then applied unchanged to that fold's held-out portion.
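To make the per-fold procedure concrete, here is a minimal sketch in plain Python. It is only an illustration, not code from the answer: the data `x`, the helper names (`k_fold_indices`, `fit_scaler`, `transform`) and the unshuffled contiguous folds are all assumptions, and the SVR fit itself is left out so that only the scaling step is shown.

```python
import statistics


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds (no shuffling)."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def fit_scaler(values):
    """Estimate mean and standard deviation from the given values only."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return mean, (std if std > 0 else 1.0)  # guard against a constant feature


def transform(values, mean, std):
    """Standardize values with statistics that were estimated elsewhere."""
    return [(v - mean) / std for v in values]


# Hypothetical single-feature data set (illustration only).
x = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0]

scaled_folds = []
for test_idx in k_fold_indices(len(x), k=5):
    held_out = set(test_idx)
    train = [x[i] for i in range(len(x)) if i not in held_out]
    test = [x[i] for i in test_idx]

    # Statistics come from the training part of this fold only ...
    mean, std = fit_scaler(train)
    train_scaled = transform(train, mean, std)
    # ... and are reused unchanged for the held-out part.
    test_scaled = transform(test, mean, std)

    # An SVR would be fitted on train_scaled and scored on test_scaled here.
    scaled_folds.append((mean, std, train_scaled, test_scaled))
```

If you use scikit-learn, putting `StandardScaler` and `SVR` into a `Pipeline` and passing the pipeline to `cross_val_score` should give the same behaviour, because the whole pipeline is refit on the training split of every fold.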
So, as you mentioned, we should normalize each training fold separately and use the same normalization properties for the test fold? (I didn't see this procedure in the default MATLAB or Python code.) Suppose that after this process we want to use the final model (trained on the whole data set; we used cross-validation to find the best model parameters) to predict future values (this is a prediction model). Which normalization properties should we use for this step? – user2991243, Apr 11, 2016 at 13:16