
I have an XGB model ready to go to production. During validation I discovered that the random seed makes a noticeable difference in the model's performance: it is pretty good overall, but for some seeds it's just good, while for others it's very good.

My intuition is that the random seed shouldn't make much difference for a robust model, which probably means my model is overfitting; but if it is, the overfitting hasn't shown up in any of the other validation tests I've done.

Edit: extra info: this is a regression problem. I have used K-fold CV to optimize hyperparameters, including alpha and gamma. The only part susceptible to randomness is the train/test split, so this tells me there is some kind of probability distribution in my data that gets represented better by some splits than by others. If this were a classification task, I could use a stratified split to deal with this, but in this case, what's the correct approach? Finally, I haven't actually set the seed anywhere; I just ran the experiment 30 times and compared the results.
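For reference, the experiment loop looks roughly like this (a minimal sketch: the data loading and the tuned hyperparameters are omitted as placeholders, and I'm using the scikit-learn wrapper of the Python XGBoost package):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error
    from xgboost import XGBRegressor

    # X, y: the full dataset (~100k rows); loading omitted
    scores = []
    for seed in range(30):
        # Only the train/test split depends on the seed.
        # (In my actual runs I never set a seed explicitly; the loop below
        # just stands in for re-running the experiment with fresh randomness.)
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.3, random_state=seed
        )
        model = XGBRegressor()  # tuned hyperparameters omitted
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        scores.append(np.sqrt(mean_squared_error(y_test, pred)))

    print(f"RMSE over 30 seeds: mean={np.mean(scores):.3f}, std={np.std(scores):.3f}")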

  1. Does this mean my model is overfitting? Should I try to regularize it and see if the random effect diminishes or disappears?

  2. When actually training the model that will go into production and on re-training in the future, how should I deal with the random seed? Optimizing it like a hyperparameter feels wrong to me, so should I just leave it at random? What's the recommended approach?

  • It's not overfitting; the seed is just for reproducibility, and you should not optimize it as a hyperparameter. Here is a good article. Commented Feb 14, 2022 at 16:34
  • What kind of randomization have you switched on? Or did you just use the defaults? How large is your dataset? Commented Feb 18, 2022 at 7:09
  • How large is the performance difference you observe? How do you set the seed? What implementation of XGBoost do you use (R, Python, ...)? Commented Feb 18, 2022 at 7:23
  • @frank I just used the defaults and ran the experiment 30 times. The dataset is roughly 100k rows; I split it 70/30 and used K-fold cross-validation for hyperparameter optimization. The difference is reasonably large: say the worst case is a 6 out of 10 and the best case is a 9 out of 10, with an average around 7.5. I'm using Python's XGB. Commented Feb 20, 2022 at 13:45
  • What function do you use for setting the seed? Commented Feb 20, 2022 at 14:14

2 Answers


This is very strange. I have never seen the random seed matter much, especially in XGB, which has no random component in its default configuration, as far as I can remember.

XGB can implement random forests, and those are probably more sensitive to random initialization, but even then it should not make a "reasonable difference".

Maybe you can post details of the parametrization of the XGB model and also details of the dataset, especially the proportion of the classes. The only source of problems I can think of right now is a dataset with very few examples of some classes, where some internal sampling of the training set (which, as far as I know, XGB does not do by default; see the sampling_method parameter) leaves some sampled sets without any examples of those minority classes.

@Lerner Zhang's link about reproducibility is very interesting.

Finally, there is no such thing as a "random seed search": the seed has no structure to search over. If a seed of 42 yields a good result, a seed of 43 may yield a bad result and 44 an even better one, so there is no meaningful search. I think the usual practice is to always fix the random seed to a known value; at least then you can reproduce the results!
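For example, a minimal sketch of fixing the seed to a known value (assuming the scikit-learn wrapper of the Python XGBoost package; 42 is an arbitrary choice):

    from sklearn.model_selection import train_test_split
    from xgboost import XGBRegressor

    SEED = 42  # any fixed value; the point is reproducibility, not the particular number

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=SEED  # reproducible train/test split
    )
    model = XGBRegressor(random_state=SEED)  # reproducible model, if any randomization is enabled
    model.fit(X_train, y_train)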

  • Thanks for the help @Jacques Wainer. Indeed I use the random seed when splitting the train/test data. This is a regression task, so I'm measuring performance as RMSE. I believe you might be right that for some splits I get a better representation of my data in the training set, but how should I handle this? I can't just use a stratified split because my target is continuous; is there any way to split based on the probability distribution of the data? Commented Feb 20, 2022 at 13:50
  • One paper I remember that deals with resampling for regression is scholar.google.com.br/… by Luis Torgo. Maybe this paper or similar ones deal with stratifying on the regression output in some way. Also, there is an obvious solution that may or may not work: divide the output into quantiles and treat those quantiles as classes for the stratification, as in the sketch below. Commented Feb 24, 2022 at 12:30
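A minimal sketch of that quantile idea, assuming pandas and scikit-learn (the number of bins is arbitrary):

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Bin the continuous target into quantiles and use the bins only for stratification
    y_bins = pd.qcut(y, q=10, labels=False, duplicates="drop")

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0, stratify=y_bins
    )
    # Train on the continuous target as usual; the bins are only used for the split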

Your first question: since you are using the default settings of XGB, you are not using any of its built-in features to fight overfitting, so your model probably is indeed overfitting.

XGBoost provides two randomization techniques to fight overfitting; see the section "Control Overfitting" here.
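For illustration, a minimal sketch of turning on those two randomization features in the Python package (the exact values are placeholders and should be tuned):

    from xgboost import XGBRegressor

    model = XGBRegressor(
        subsample=0.8,         # row subsampling: each tree is grown on 80% of the training rows
        colsample_bytree=0.8,  # column subsampling: each tree sees 80% of the features
        random_state=42,       # makes the subsampling reproducible
    )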

Your second question: your setting of the seed will currently only affect your cross-validation (CV) splitting. If I understand correctly, the variations in model performance you see refer to the different folds in CV. Once you have turned on XGB's internal randomization features, the overfitting should disappear and the CV results should become similar.

CV is used for two purposes: first, to estimate the error on unseen data, and second, to use this estimated error to tune hyperparameters. Since you don't have any hyperparameters to tune (do you?), the only reason left to use CV is estimating the generalization error. But since you have a large dataset (100k rows), I recommend using just a single 80/20 partition, training on the larger part and testing on the smaller one. This should be entirely sufficient to estimate your generalization capabilities; there is no need for CV. But, of course, you must make sure that your 80/20 partition is really random.

And then you leave the seed alone. Don't try to tune the seed; that is pretty much a rule.

  • Thanks for the help, Frank! I have used K-fold cross-validation to optimize hyperparameters, including alpha and gamma (the regularization parameters). And yes, the only effect of randomness will be on the train/test split. I'm aware that for classification problems you can use a stratified split to keep the splits balanced, but my problem is regression; how do I deal with this in that case? Commented Feb 22, 2022 at 12:53
  • Both randomization methods described in the given link also work for regression. Commented Feb 22, 2022 at 15:04
  • In fact, I think that all the parameters mentioned there for fighting overfitting should work for both classification and regression. Commented Feb 22, 2022 at 15:12
