I’m working on species distribution modeling with binary data (presence/absence, 1/0). My target species is extremely rare (prevalence ~0.014), so my dataset is almost all zeros and just a handful of ones. To avoid throwing away any precious data, I trained a Random Forest on all the points instead of doing the usual train/validation/test split.
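For concreteness, the setup looks roughly like this (a minimal sketch, assuming scikit-learn and a pandas DataFrame; the file name and column names are placeholders, not my actual data):

```python
# Minimal sketch of the setup: fit a Random Forest on ALL points, no split.
# File and column names are placeholders, not the actual dataset.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("occurrence_points.csv")      # hypothetical file
X = df.drop(columns=["presence"])              # environmental predictors
y = df["presence"]                             # 1 = presence, 0 = absence

print(f"prevalence: {y.mean():.3f}")           # ~0.014 in my case

rf = RandomForestClassifier(n_estimators=500, random_state=42)
rf.fit(X, y)                                   # trained on the entire dataset
```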
After fitting the model and checking performance with block cross-validation, I noticed that the predicted probabilities and the observed frequencies didn’t line up: in areas with almost no presences, the model still predicted high occurrence probabilities, exactly where I’d expect low ones. That told me the probabilities needed calibration. I applied Platt scaling on the entire dataset, and now the reliability diagram is almost perfect: all points sit right on the diagonal.
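Here’s roughly what the calibration step looked like (again a sketch, assuming scikit-learn, with Platt scaling approximated by a logistic regression on the RF scores; the key point is that it’s fitted on the very same data the forest was trained on):

```python
# Platt scaling approximated by a logistic regression on the RF scores,
# fitted on the SAME data used to train the forest (the shortcut in question).
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import calibration_curve

p_raw = rf.predict_proba(X)[:, 1]              # raw RF scores for "presence"

platt = LogisticRegression()
platt.fit(p_raw.reshape(-1, 1), y)             # calibration on the full dataset
p_cal = platt.predict_proba(p_raw.reshape(-1, 1))[:, 1]

# Reliability diagram: with in-sample calibration the points hug the diagonal,
# which is the "almost perfect" picture described above.
obs_freq, mean_pred = calibration_curve(y, p_cal, n_bins=10, strategy="quantile")
for mp, of in zip(mean_pred, obs_freq):
    print(f"mean predicted {mp:.3f} -> observed frequency {of:.3f}")
```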
But here’s the dilemma: calibration is normally done on held-out (validation) data, not the full dataset. Do you think this shortcut is defensible? I know it hurts transferability to new regions, but with so few presences, splitting feels impossible without losing critical information.
In a perfect world, I’d:
Split into training, calibration (validation) and testing sets
Train the RF on training data
Calibrate probabilities on validation data
Predict on unseen areas (testing)
But with such an imbalanced dataset, that split would leave me almost no positive cases to work with.
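For reference, here’s roughly what that ideal pipeline would look like (a sketch, assuming scikit-learn; I use a plain stratified random split for brevity, whereas in practice the splits would follow the same spatial blocks as the cross-validation):

```python
# "Perfect world" pipeline: train / calibration / test split, then train,
# calibrate on held-out data, and predict on unseen data.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 60% train / 20% calibration / 20% test, stratified so each part keeps some
# presences; with only a handful of positives, each split may end up with
# just one or two of them, which is exactly the problem.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42)
X_cal, X_test, y_cal, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)

rf = RandomForestClassifier(n_estimators=500, random_state=42)
rf.fit(X_train, y_train)                                    # 1. train on training data

s_cal = rf.predict_proba(X_cal)[:, 1]
platt = LogisticRegression()
platt.fit(s_cal.reshape(-1, 1), y_cal)                      # 2. calibrate on validation data

s_test = rf.predict_proba(X_test)[:, 1]
p_test = platt.predict_proba(s_test.reshape(-1, 1))[:, 1]   # 3. predict on unseen data
```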
What do you think? Can I still trust these calibrated probabilities, or should I be extra cautious when interpreting them? Any suggestions or alternative strategies would be super welcome!