I’m working on species distribution modeling with binary data (presence/absence, 1/0). My target species is extremely rare (prevalence ~0.014), so my dataset is almost all zeros and just a handful of ones. To avoid throwing away any precious data, I trained a Random Forest on all the points instead of doing the usual train/validation/test split.
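For concreteness, the setup looks roughly like this (a minimal sketch, assuming scikit-learn and a pandas DataFrame; the file name and column names are placeholders, not my actual data):

```python
# Minimal sketch of the setup: fit a Random Forest on ALL points, no split.
# File and column names are placeholders, not the actual dataset.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv("occurrence_points.csv")      # hypothetical file
X = df.drop(columns=["presence"])              # environmental predictors
y = df["presence"]                             # 1 = presence, 0 = absence

print(f"prevalence: {y.mean():.3f}")           # ~0.014 in my case

rf = RandomForestClassifier(n_estimators=500, random_state=42)
rf.fit(X, y)                                   # trained on the entire dataset
```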
After fitting the model and checking performance with block cross-validation, I noticed that the predicted probabilities and the observed frequencies didn’t line up: in areas with almost no presences, the model still predicted high occurrence probabilities, exactly where I’d expect low ones. That told me the probabilities needed calibration. I applied Platt scaling on the entire dataset, and now the reliability diagram is almost perfect: all points sit right on the diagonal.
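Here’s roughly what the calibration step looked like (again a sketch, assuming scikit-learn, with Platt scaling approximated by a logistic regression on the RF scores; the key point is that it’s fitted on the very same data the forest was trained on):

```python
# Platt scaling approximated by a logistic regression on the RF scores,
# fitted on the SAME data used to train the forest (the shortcut in question).
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import calibration_curve

p_raw = rf.predict_proba(X)[:, 1]              # raw RF scores for "presence"

platt = LogisticRegression()
platt.fit(p_raw.reshape(-1, 1), y)             # calibration on the full dataset
p_cal = platt.predict_proba(p_raw.reshape(-1, 1))[:, 1]

# Reliability diagram: with in-sample calibration the points hug the diagonal,
# which is the "almost perfect" picture described above.
obs_freq, mean_pred = calibration_curve(y, p_cal, n_bins=10, strategy="quantile")
for mp, of in zip(mean_pred, obs_freq):
    print(f"mean predicted {mp:.3f} -> observed frequency {of:.3f}")
```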
But here’s the dilemma: calibration is normally done on held-out (validation) data, not the full dataset. Do you think this shortcut is defensible? I know it hurts transferability to new regions, but with so few presences, splitting feels impossible without losing critical information.
In a perfect world, I’d:
Split into training, calibration (validation) and testing sets
Train the RF on training data
Calibrate probabilities on validation data
Predict on unseen areas (testing)
But with such an imbalanced dataset, that split would leave me almost no positive cases to work with.
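For reference, here’s roughly what that ideal pipeline would look like (a sketch, assuming scikit-learn; I use a plain stratified random split for brevity, whereas in practice the splits would follow the same spatial blocks as the cross-validation):

```python
# "Perfect world" pipeline: train / calibration / test split, then train,
# calibrate on held-out data, and predict on unseen data.
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# 60% train / 20% calibration / 20% test, stratified so each part keeps some
# presences; with only a handful of positives, each split may end up with
# just one or two of them, which is exactly the problem.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=42)
X_cal, X_test, y_cal, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42)

rf = RandomForestClassifier(n_estimators=500, random_state=42)
rf.fit(X_train, y_train)                                    # 1. train on training data

s_cal = rf.predict_proba(X_cal)[:, 1]
platt = LogisticRegression()
platt.fit(s_cal.reshape(-1, 1), y_cal)                      # 2. calibrate on validation data

s_test = rf.predict_proba(X_test)[:, 1]
p_test = platt.predict_proba(s_test.reshape(-1, 1))[:, 1]   # 3. predict on unseen data
```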
What do you think? Can I still trust these calibrated probabilities, or should I be extra cautious when interpreting them? Any suggestions or alternative strategies would be super welcome!