Two-stage statistical model that estimates each team's probability of winning the 2026 FIFA World Cup (and of reaching each knockout round).
| Stage | Module | What it does |
|---|---|---|
| Data | `src/data/download.py` | Pull `results.csv` (martj42) from GitHub; no Kaggle token needed. Kaggle datasets optional. |
| Clean | `src/data/clean.py` | Canonical team-name map, date parsing, dedupe. |
| Elo | `src/models/elo.py` | Self-computed as-of-date Elo (leakage-free) from match history. |
| Features | `src/data/features.py` | Time-decay + match-importance weights. |
| Match model | `src/models/poisson.py` | Elo-driven Poisson with Dixon-Coles low-score correction (weighted MLE). |
| Eval | `src/eval/` | RPS (primary), log-loss, Brier, accuracy on a time-split holdout. |
| Simulator | `src/simulate/` | Official 2026 bracket + Monte Carlo → title / round probabilities. |
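
To make the match-model row concrete, here is a minimal sketch of how an Elo-driven Poisson model with the Dixon-Coles low-score correction can produce win/draw/loss probabilities. The log-linear link from Elo difference to goal rates and the values of `base`, `beta`, and `rho` are illustrative assumptions, not the fitted parameters or the exact parameterization in `src/models/poisson.py`.

```python
import math

def poisson_pmf(k, lam):
    """Poisson probability of exactly k goals at rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def dc_tau(x, y, lam, mu, rho):
    """Dixon-Coles adjustment; only the four low-score cells are changed."""
    if x == 0 and y == 0:
        return 1.0 - lam * mu * rho
    if x == 0 and y == 1:
        return 1.0 + lam * rho
    if x == 1 and y == 0:
        return 1.0 + mu * rho
    if x == 1 and y == 1:
        return 1.0 - rho
    return 1.0

def match_probs(elo_diff, base=0.25, beta=0.0025, rho=-0.05, max_goals=10):
    """Return (win, draw, loss) for team A given elo_diff = Elo_A - Elo_B.

    base, beta and rho are made-up values for illustration only.
    """
    lam = math.exp(base + beta * elo_diff)  # assumed scoring rate of team A
    mu = math.exp(base - beta * elo_diff)   # assumed scoring rate of team B
    win = draw = loss = 0.0
    for x in range(max_goals + 1):
        for y in range(max_goals + 1):
            p = dc_tau(x, y, lam, mu, rho) * poisson_pmf(x, lam) * poisson_pmf(y, mu)
            if x > y:
                win += p
            elif x == y:
                draw += p
            else:
                loss += p
    total = win + draw + loss  # renormalize after truncating at max_goals
    return win / total, draw / total, loss / total
```

With a negative `rho`, the correction moves some probability mass toward 0-0 and 1-1 scorelines, which is the usual motivation for using it in football models.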

    python -m venv .venv && .venv/Scripts/python -m pip install -r requirements.txt
    python -m src.data.download              # -> data/raw/results.csv (+ shootouts, goalscorers)
    python -m src.data.clean                 # -> data/interim/results_clean.pkl
    python -m src.models.elo                 # -> data/interim/results_elo.pkl (+ top-20 ratings)
    python -m src.eval.train_eval            # fit match model, report RPS/log-loss
    python -m src.eval.backtest              # 2018 & 2022 out-of-sample validation
    python -m src.models.hybrid              # XGBoost + ensemble vs Dixon-Coles
    python -m src.news.run_news              # Stage 3: news -> bounded Elo deltas (needs OPENAI_API_KEY)
    python -m src.simulate.montecarlo 20000  # -> outputs/title_probabilities*.csv

Set `OPENAI_API_KEY` (and optionally `OPENAI_MODEL`, default `gpt-4o-mini`) in a `.env` file at the repo root for the Stage 3 news layer.

Title odds — 20,000 simulations, with the strength-uncertainty and news layers applied. "90% range" is the credible interval across simulation batches.
| # | Team | Win title | 90% range | Reach final | Reach semis |
|---|---|---|---|---|---|
| 1 | Spain | 25.7% | 23.4–28.2% | 37.3% | 50.1% |
| 2 | Argentina | 18.5% | 17.4–19.9% | 29.5% | 41.9% |
| 3 | France | 10.3% | 8.9–11.9% | 18.9% | 34.6% |
| 4 | England | 6.5% | 5.6–7.6% | 13.3% | 24.7% |
| 5 | Brazil | 4.9% | 3.8–5.5% | 10.0% | 21.2% |
| 6 | Colombia | 4.7% | 3.2–5.9% | 10.1% | 18.7% |
| 7 | Portugal | 4.2% | 3.3–5.1% | 9.0% | 16.9% |
| 8 | Mexico | 3.5% | 2.7–4.2% | 8.1% | 18.4% |
| 9 | Ecuador | 2.9% | 2.2–3.8% | 7.2% | 17.2% |
| 10 | Germany | 1.9% | 1.3–2.5% | 5.1% | 13.1% |

The full table (all 48 teams) is in `outputs/title_probabilities_news.csv`; the model-only version (no news) is in `outputs/title_probabilities.csv`.
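
One way a "90% range across simulation batches" could be computed is sketched below: split the per-simulation title flags for a team into equal batches, take the title share in each batch, and report the 5th and 95th percentiles of those shares. The function name `batch_interval`, the batch count, and the synthetic flags are hypothetical; the simulator's actual batching scheme is not documented here.

```python
import random
import statistics

def batch_interval(win_flags, n_batches=20):
    """Return (mean share, 5th percentile, 95th percentile) across batches.

    win_flags holds one 0/1 entry per simulated tournament
    (1 = the team won the title in that simulation).
    """
    size = len(win_flags) // n_batches
    shares = [
        sum(win_flags[i * size:(i + 1) * size]) / size
        for i in range(n_batches)
    ]
    # n=20 yields 19 cut points: the first is the 5th percentile, the last the 95th.
    cuts = statistics.quantiles(shares, n=20, method="inclusive")
    return statistics.mean(shares), cuts[0], cuts[-1]

# Synthetic flags for illustration only (a team with a true 25% title chance).
rng = random.Random(0)
flags = [1 if rng.random() < 0.25 else 0 for _ in range(20000)]
mean_share, low, high = batch_interval(flags)
```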
Match model (train < 2024-01-01, test ≥ 2024-01-01, 2,547 matches): accuracy 60.4%, RPS 0.167 (naive base-rate 0.227), log-loss 0.867. The ordering of favorites matches the betting market, and the favorites sit in a plausible 15–25% band.
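
For readers unfamiliar with the primary metric, the ranked probability score (RPS) for a single three-outcome match can be sketched as follows. It compares cumulative forecast probabilities with the cumulative observed outcome, so a forecast that misses by one ordered category (predicting a win when the result is a draw) is penalized less than one that misses by two. The function name `rps` is illustrative and not necessarily what `src/eval/` uses.

```python
def rps(probs, outcome_index):
    """Ranked probability score for one match (lower is better).

    probs is ordered (home win, draw, away win) and sums to 1;
    outcome_index is the position of the observed result.
    """
    cum_forecast = 0.0
    cum_observed = 0.0
    score = 0.0
    # The last cumulative term is always 1 vs 1, so it is skipped.
    for i in range(len(probs) - 1):
        cum_forecast += probs[i]
        cum_observed += 1.0 if i == outcome_index else 0.0
        score += (cum_forecast - cum_observed) ** 2
    return score / (len(probs) - 1)
```

The reported figure would then be the mean of this score over all matches in the holdout set.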
Uncertainty layer: each simulated tournament re-draws team strengths from N(Elo, σ), with σ larger for teams that have played fewer recent matches (noisier ratings). This pulls the top favorite down (~26.5% → ~24.7%), fattens the field, and yields a 90% credible interval per team. Point-estimate output is also saved for comparison (`outputs/title_probabilities_pointest.csv`).
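
The strength re-draw can be sketched as below. The rule in `sigma_for` (σ decays toward a floor as the recent-match count grows), its constants, and the team ratings are all hypothetical placeholders; only the idea of sampling from N(Elo, σ) with larger σ for less active teams comes from the description above.

```python
import random

def sigma_for(recent_matches, base_sigma=30.0, extra_sigma=60.0, half_count=10.0):
    """Hypothetical rule: fewer recent matches -> larger rating uncertainty."""
    return base_sigma + extra_sigma * half_count / (half_count + recent_matches)

def draw_strengths(elo, recent_matches, rng):
    """Sample one set of team strengths for a single simulated tournament."""
    return {
        team: rng.gauss(rating, sigma_for(recent_matches[team]))
        for team, rating in elo.items()
    }

# Placeholder inputs, not real ratings.
elo = {"Team A": 2000.0, "Team B": 1800.0}
recent = {"Team A": 30, "Team B": 5}
rng = random.Random(2026)  # seeded so the draw is reproducible
one_tournament = draw_strengths(elo, recent, rng)
```

Each simulated tournament would call `draw_strengths` once and then play every match with the sampled ratings, so rating error is shared across all of a team's matches in that tournament instead of averaging out match by match.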
- Backtest (`src/eval/backtest.py`): out-of-sample on WC 2018 (acc 56%, RPS 0.209) and 2022 (acc 55%, RPS 0.224), strictly no-leakage. Both beat base-rate.
- Hybrid + ensemble (`src/models/hybrid.py`): XGBoost on Elo + form features. Honest finding: Dixon-Coles (RPS 0.167) still edges XGBoost (0.173) because `elo_diff` dominates and we lack market-value covariates. Hybrid is ready for when squad value/age data is added (needs the Kaggle token).
- Stage 3 news layer (`src/news/`): Google News RSS → OpenAI Structured Outputs (injuries/suspensions/returns with confidence) → bounded, shrunk Elo deltas (capped ±40) → re-simulate. Adjusts model inputs, not final percentages, so a weakened rival coherently lifts everyone else's odds. Output: `outputs/title_probabilities_news.csv` + auditable `outputs/news_adjustments.json`.
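
A minimal sketch of the "bounded, shrunk" step in the news layer: each extracted event carries a raw Elo impact and a confidence, the confidence-weighted sum is shrunk toward zero, and the result is clipped to the ±40 cap. The ±40 cap comes from the description above; the confidence weighting, the `shrink` factor of 0.5, and the function name `bounded_delta` are assumptions for illustration.

```python
def bounded_delta(events, cap=40.0, shrink=0.5):
    """Combine news events for one team into a single capped Elo delta.

    events is a list of (raw_elo_impact, confidence) pairs, with
    confidence in [0, 1]. shrink is a hypothetical damping factor.
    """
    weighted = sum(impact * confidence for impact, confidence in events)
    shrunk = shrink * weighted
    # Clip so no amount of news can move a team by more than the cap.
    return max(-cap, min(cap, shrunk))

# Example: a confirmed key injury plus a doubtful suspension.
delta = bounded_delta([(-60.0, 1.0), (-50.0, 0.4)])
```

Because the delta is applied to the team's rating before simulation, every opponent's odds shift consistently when the tournament is re-simulated.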
- Elo is computed from the same `results.csv` used for training, which eliminates cross-source team-name mismatches and gives full control of as-of-date logic.
- Host advantage is modeled as an effective Elo bump (USA +60, Mexico/Canada +40).
- All RNGs are seeded; sim count + seed are logged to `outputs/sim_meta.json`.
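
The host bump can be illustrated with the standard Elo expected-score formula. The bump values are the ones listed above; applying the bump to every match a host plays, and the names `HOST_BUMP`, `effective_elo`, and `expected_score`, are assumptions of this sketch, not a description of the simulator's code.

```python
# Bump values from the notes above.
HOST_BUMP = {"USA": 60.0, "Mexico": 40.0, "Canada": 40.0}

def effective_elo(team, rating):
    """Rating used in simulation: base Elo plus any host bump."""
    return rating + HOST_BUMP.get(team, 0.0)

def expected_score(rating_a, rating_b):
    """Standard Elo expected score of A against B (logistic curve, 400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# Two teams with the same placeholder base rating: the host becomes a slight favorite.
host_edge = expected_score(effective_elo("USA", 1800.0), effective_elo("Spain", 1800.0))
```

On the 400-point scale, a +60 bump between otherwise equal teams corresponds to an expected score of roughly 0.59 instead of 0.50.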
