
I’m working on species distribution modeling with binary data (presence / absence, 1 / 0). My target species is extremely rare (prevalence ~0.014), so my dataset is almost all zeros and just a handful of ones. To avoid throwing away any precious data, I trained a Random Forest on all the points instead of doing the usual train/validation/test split.

After fitting the model and checking performance via block cross-validation, I noticed the predicted probabilities and the observed frequencies don’t line up. In areas where there are almost no presences, the model still spits out high occurrence probabilities—exactly when I’d expect low ones. That told me the probabilities need calibration. I applied Platt Scaling on the entire dataset, and now the reliability diagram is almost perfect: all points sit right on the diagonal.
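For concreteness, here is a minimal sketch of the shortcut I took (Python/scikit-learn; `X` and `y` are placeholder names for the full feature matrix and the 0/1 presence vector, and the hyperparameters and binning are illustrative, not my exact settings):

```python
# Minimal sketch of the "fit and calibrate on everything" shortcut.
# X (features) and y (0/1 presence/absence) are placeholders for the full dataset.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import calibration_curve

rf = RandomForestClassifier(n_estimators=500, class_weight="balanced", random_state=0)
rf.fit(X, y)
p_raw = rf.predict_proba(X)[:, 1]  # raw RF occurrence probabilities

# Platt scaling: a one-feature logistic regression mapping RF scores to probabilities,
# fitted here on the very same data the forest saw (the questionable part).
platt = LogisticRegression()
platt.fit(p_raw.reshape(-1, 1), y)
p_cal = platt.predict_proba(p_raw.reshape(-1, 1))[:, 1]

# Reliability diagram points: observed presence frequency vs. mean predicted probability per bin.
obs_freq, mean_pred = calibration_curve(y, p_cal, n_bins=10, strategy="quantile")
```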

But here’s the dilemma: calibration is normally done on held-out (validation) data, not the full dataset. Do you think this shortcut is defensible? I know it hurts transferability to new regions, but with so few presences, splitting feels impossible without losing critical information.

In a perfect world, I'd do the following (rough code sketch below):

1. Split into training, calibration (validation), and testing sets
2. Train the RF on the training data
3. Calibrate the probabilities on the validation data
4. Predict on unseen areas (testing)

But with such an imbalanced dataset, that split would leave me almost no positive cases to work with.
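For what it's worth, here is roughly how that ideal pipeline would look (again `X` and `y` are placeholders, the split fractions and settings are illustrative, and I use plain stratified random splits for brevity even though spatially blocked splits would match my block CV better):

```python
# Sketch of the ideal train / calibrate / test pipeline; X, y and all settings are placeholders.
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

# Stratified splits so the ~1.4% presences are spread across all three sets
# (a spatially blocked split would be more appropriate for transferability).
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)
X_cal, X_test, y_cal, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)

# 1. Train the RF on the training data only
rf = RandomForestClassifier(n_estimators=500, class_weight="balanced", random_state=0)
rf.fit(X_train, y_train)

# 2. Platt-scale the probabilities on the held-out calibration set
p_cal = rf.predict_proba(X_cal)[:, 1]
platt = LogisticRegression().fit(p_cal.reshape(-1, 1), y_cal)

# 3. Evaluate / predict on data the model has never seen
p_test = platt.predict_proba(rf.predict_proba(X_test)[:, 1].reshape(-1, 1))[:, 1]
print("Test Brier score:", brier_score_loss(y_test, p_test))
```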

What do you think? Can I still trust these calibrated probabilities, or should I be extra cautious when interpreting them? Any suggestions or alternative strategies would be super welcome!

  • Welcome to Cross Validated! This is an interesting question that I have upvoted. My instinct is to bootstrap the entire process and compare the performance of the bootstrap-trained models on the whole data set to the performance of the original model applied to the whole data set, but with random forest models having bootstrapping built in, I wonder if this does anything. Perhaps an answer can address this. Commented Jul 10 at 11:17
  • Given the sparseness of the information, some methodologies are better than others. IMO, RF is not one of them. I understand that your data is binary, but Poisson regression is intended for rare-event modeling. One can ignore the fact that it's also intended for count data valued greater than one and simply assume underdispersion. Next, Bayesian regression assumptions and tools such as MCMC have literature supporting their superiority with sparse data; focus your evaluations on the posterior. Finally, contingency table modeling as in Fienberg's Analysis of Cross-Classified Data is worth a look. Commented Jul 10 at 12:47
  • I have posted a question about the general use of bootstrap validation of random forest models. These questions are not duplicates of each other, but if bootstrap validation makes sense for random forest models, it would seem to be your friend in this situation. Commented Jul 10 at 13:56
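A rough sketch of the "bootstrap the entire process" idea from the first comment, for concreteness (a Harrell-style optimism bootstrap; `X`, `y`, the metric, the number of resamples, and the hyperparameters are all placeholders, and with ~1.4% prevalence the resampling would likely need to be stratified so every resample contains some presences):

```python
# Rough sketch of the optimism bootstrap applied to the whole RF + Platt pipeline:
# refit everything on each resample and see how much worse it does on the original
# data than on its own resample. X, y and all settings are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

def fit_and_calibrate(X_fit, y_fit):
    """Refit the entire pipeline (RF + Platt scaling); return a calibrated-probability function."""
    rf = RandomForestClassifier(n_estimators=500, class_weight="balanced", random_state=0)
    rf.fit(X_fit, y_fit)
    p = rf.predict_proba(X_fit)[:, 1]
    platt = LogisticRegression().fit(p.reshape(-1, 1), y_fit)
    return lambda X_new: platt.predict_proba(
        rf.predict_proba(X_new)[:, 1].reshape(-1, 1)
    )[:, 1]

rng = np.random.default_rng(0)

# Apparent performance: pipeline built on the full data and scored on that same data
predict_full = fit_and_calibrate(X, y)
apparent = brier_score_loss(y, predict_full(X))

optimism = []
for _ in range(200):  # the number of resamples is arbitrary here
    idx = rng.choice(len(y), size=len(y), replace=True)
    # With ~1.4% prevalence, stratified resampling may be needed so each resample keeps presences.
    predict_boot = fit_and_calibrate(X[idx], y[idx])
    perf_boot = brier_score_loss(y[idx], predict_boot(X[idx]))  # score on its own resample
    perf_orig = brier_score_loss(y, predict_boot(X))            # score on the original data
    optimism.append(perf_orig - perf_boot)                      # how much worse out of sample

# Optimism-corrected Brier score (lower is better); a large correction would suggest the
# near-perfect in-sample reliability diagram should not be taken at face value.
corrected = apparent + np.mean(optimism)
```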
