
hasseneafif/worldcup-2026-advanced-prediction


2026 World Cup Prediction Model

Two-stage statistical model that estimates each team's probability of winning the 2026 FIFA World Cup (and of reaching each knockout round).


Pipeline

| Stage | Module | What it does |
| --- | --- | --- |
| Data | src/data/download.py | Pull results.csv (martj42) from GitHub; no Kaggle token needed. Kaggle datasets optional. |
| Clean | src/data/clean.py | Canonical team-name map, date parsing, dedupe. |
| Elo | src/models/elo.py | Self-computed as-of-date Elo (leakage-free) from match history. |
| Features | src/data/features.py | Time-decay + match-importance weights. |
| Match model | src/models/poisson.py | Elo-driven Poisson with Dixon-Coles low-score correction (weighted MLE). |
| Eval | src/eval/ | RPS (primary), log-loss, Brier, accuracy on a time-split holdout. |
| Simulator | src/simulate/ | Official 2026 bracket + Monte Carlo → title / round probabilities. |
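To make the match-model row concrete, here is a minimal sketch of an Elo-driven Poisson scoreline model with the Dixon-Coles low-score correction. The correction factor follows the standard Dixon-Coles form; the log-linear link from Elo difference to expected goals and the values of `base`, `beta`, and `rho` are illustrative assumptions, not the fitted parameters from src/models/poisson.py, and the weighted MLE fit itself is not shown.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k goals under a Poisson(lam) distribution."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def dc_tau(x: int, y: int, lam: float, mu: float, rho: float) -> float:
    """Dixon-Coles factor: reweights only the 0-0, 0-1, 1-0 and 1-1 scorelines."""
    if x == 0 and y == 0:
        return 1.0 - lam * mu * rho
    if x == 0 and y == 1:
        return 1.0 + lam * rho
    if x == 1 and y == 0:
        return 1.0 + mu * rho
    if x == 1 and y == 1:
        return 1.0 - rho
    return 1.0

def expected_goals(elo_diff: float, base: float = 0.25, beta: float = 0.45):
    """Hypothetical log-linear link; base and beta are placeholder values."""
    lam = math.exp(base + beta * elo_diff / 400.0)  # team A scoring rate
    mu = math.exp(base - beta * elo_diff / 400.0)   # team B scoring rate
    return lam, mu

def outcome_probs(elo_diff: float, rho: float = -0.05, max_goals: int = 10):
    """Return (P(A wins), P(draw), P(B wins)) summed over a truncated score grid."""
    lam, mu = expected_goals(elo_diff)
    win = draw = loss = 0.0
    for x in range(max_goals + 1):
        for y in range(max_goals + 1):
            p = dc_tau(x, y, lam, mu, rho) * poisson_pmf(x, lam) * poisson_pmf(y, mu)
            if x > y:
                win += p
            elif x == y:
                draw += p
            else:
                loss += p
    total = win + draw + loss  # renormalize the small mass lost to truncation
    return win / total, draw / total, loss / total

even_match = outcome_probs(0.0)
favorite_match = outcome_probs(200.0)
```

With a negative `rho`, the 0-0 and 1-1 scorelines gain probability at the expense of 0-1 and 1-0, which is the usual reason for adding the correction to an independent-Poisson model.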

Quickstart

python -m venv .venv && .venv/Scripts/python -m pip install -r requirements.txt   # Windows venv layout; on macOS/Linux the interpreter is .venv/bin/python
python -m src.data.download        # -> data/raw/results.csv (+ shootouts, goalscorers)
python -m src.data.clean           # -> data/interim/results_clean.pkl
python -m src.models.elo           # -> data/interim/results_elo.pkl  (+ top-20 ratings)
python -m src.eval.train_eval      # fit match model, report RPS/log-loss
python -m src.eval.backtest        # 2018 & 2022 out-of-sample validation
python -m src.models.hybrid        # XGBoost + ensemble vs Dixon-Coles
python -m src.news.run_news        # Stage 3: news -> bounded Elo deltas (needs OPENAI_API_KEY)
python -m src.simulate.montecarlo 20000   # -> outputs/title_probabilities*.csv

Set OPENAI_API_KEY (and optionally OPENAI_MODEL, default gpt-4o-mini) in a .env file at the repo root for the Stage 3 news layer.

Results

Title odds — 20,000 simulations, with the strength-uncertainty and news layers applied. "90% range" is the credible interval across simulation batches.

| # | Team | Win title | 90% range | Reach final | Reach semis |
| --- | --- | --- | --- | --- | --- |
| 1 | Spain | 25.7% | 23.4–28.2% | 37.3% | 50.1% |
| 2 | Argentina | 18.5% | 17.4–19.9% | 29.5% | 41.9% |
| 3 | France | 10.3% | 8.9–11.9% | 18.9% | 34.6% |
| 4 | England | 6.5% | 5.6–7.6% | 13.3% | 24.7% |
| 5 | Brazil | 4.9% | 3.8–5.5% | 10.0% | 21.2% |
| 6 | Colombia | 4.7% | 3.2–5.9% | 10.1% | 18.7% |
| 7 | Portugal | 4.2% | 3.3–5.1% | 9.0% | 16.9% |
| 8 | Mexico | 3.5% | 2.7–4.2% | 8.1% | 18.4% |
| 9 | Ecuador | 2.9% | 2.2–3.8% | 7.2% | 17.2% |
| 10 | Germany | 1.9% | 1.3–2.5% | 5.1% | 13.1% |

Full table (all 48 teams) in outputs/title_probabilities_news.csv; the model-only version (no news) is in outputs/title_probabilities.csv.
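One way to read the "90% range" column: split the simulated tournaments into batches, compute each team's title share per batch, and report the 5th and 95th percentiles of those shares. The sketch below assumes that reading. The champion list is toy data drawn from a made-up three-team distribution, and `batch_interval`, the batch count, and the team labels are hypothetical, since how the repo forms its batches is not described here.

```python
import random

def percentile(sorted_vals, q):
    """Linear-interpolation percentile of a sorted list, for q in [0, 1]."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)

def batch_interval(champions, n_batches, team, lo_q=0.05, hi_q=0.95):
    """Title share of `team` in each batch, summarized as a (low, high) range."""
    size = len(champions) // n_batches
    shares = sorted(
        champions[i * size:(i + 1) * size].count(team) / size
        for i in range(n_batches)
    )
    return percentile(shares, lo_q), percentile(shares, hi_q)

# Toy stand-in for simulator output: one champion label per simulated tournament.
rng = random.Random(0)
champions = rng.choices(["A", "B", "C"], weights=[0.5, 0.3, 0.2], k=20000)

low, high = batch_interval(champions, n_batches=40, team="A")
overall = champions.count("A") / len(champions)
print(f"A: {overall:.1%} (90% range {low:.1%} to {high:.1%})")
```

In this toy version the spread reflects sampling noise only, so it shrinks as batches grow; a range that also reflects rating uncertainty would need strengths to vary between batches.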

Match model (train < 2024-01-01, test ≥ 2024-01-01, 2,547 matches): accuracy 60.4%, RPS 0.167 (naive base-rate 0.227), log-loss 0.867. The ordering of favorites matches the betting market, and the top favorites' title odds land in a plausible range of roughly 15–25%.
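For readers unfamiliar with the primary metric, the ranked probability score (RPS) compares cumulative forecast probabilities with the cumulative observed outcome over the ordered outcomes (home win, draw, away win); lower is better, and 0 is a perfect forecast. The sketch below is a generic implementation of that definition, not the code in src/eval/, and the function names are hypothetical.

```python
def rps(probs, outcome_index):
    """Ranked probability score for one match.

    probs: forecast probabilities over ordered outcomes, e.g. [home, draw, away].
    outcome_index: index of the outcome that actually happened.
    """
    cum_p = 0.0
    cum_o = 0.0
    total = 0.0
    # The last cumulative term is always (1 - 1)^2 = 0, so it is skipped.
    for i, p in enumerate(probs[:-1]):
        cum_p += p
        if i == outcome_index:
            cum_o = 1.0
        total += (cum_p - cum_o) ** 2
    return total / (len(probs) - 1)

def mean_rps(forecasts, outcomes):
    """Average RPS over a holdout set of matches."""
    return sum(rps(p, o) for p, o in zip(forecasts, outcomes)) / len(outcomes)

# A confident home-win forecast is penalized far more when the away side wins
# than when the match ends in a draw, because RPS respects outcome ordering.
forecast = [0.60, 0.25, 0.15]
print(rps(forecast, 0), rps(forecast, 1), rps(forecast, 2))
```

Unlike log-loss or plain accuracy, RPS treats a draw as "closer" to a home win than an away win is, which is why it is a common primary metric for three-way football forecasts.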

Uncertainty layer: each simulated tournament re-draws team strengths from N(Elo, σ), with σ larger for teams with fewer recent matches (noisier ratings). This pulls the top favorite down (~26.5% → ~24.7%), spreads more probability across the rest of the field, and yields a 90% credible interval per team. Point-estimate output is also saved for comparison (outputs/title_probabilities_pointest.csv).
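The re-draw step can be sketched as follows. The rule mapping recent match count to σ (`sigma_for` and its constants) is a hypothetical placeholder, as are the two team ratings; only the idea of drawing from N(Elo, σ) with a larger σ for less active teams comes from the description above. The demo uses the standard Elo win expectancy to show why averaging over noisy strengths lowers a favorite's win probability relative to the point estimate.

```python
import random

def sigma_for(recent_matches, base_sigma=30.0, extra=60.0, ref_matches=20):
    """Hypothetical rule: fewer recent matches -> larger rating uncertainty."""
    shortfall = max(0, ref_matches - recent_matches) / ref_matches
    return base_sigma + extra * shortfall

def redraw_strengths(elo, recent_matches, rng):
    """Draw one plausible strength per team from N(Elo, sigma)."""
    return {t: rng.gauss(elo[t], sigma_for(recent_matches[t])) for t in elo}

def win_prob(elo_a, elo_b):
    """Standard Elo win expectancy for team A against team B."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))

# Toy ratings and activity counts (illustrative only).
elo = {"Favorite": 2100.0, "Outsider": 1900.0}
recent = {"Favorite": 20, "Outsider": 5}

rng = random.Random(42)  # seeded, so the run is reproducible
point = win_prob(elo["Favorite"], elo["Outsider"])

n_sims = 20000
acc = 0.0
for _ in range(n_sims):
    draw = redraw_strengths(elo, recent, rng)
    acc += win_prob(draw["Favorite"], draw["Outsider"])
noisy = acc / n_sims

print(f"point estimate {point:.3f}, with strength uncertainty {noisy:.3f}")
```

The favorite's probability drops because the win-expectancy curve is concave above 50%: a downward rating error costs more than an equal upward error gains.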

Validation & layers (built)

  • Backtest (src/eval/backtest.py): out-of-sample on WC 2018 (acc 56%, RPS 0.209) and 2022 (acc 55%, RPS 0.224), strictly no-leakage. Both beat base-rate.
  • Hybrid + ensemble (src/models/hybrid.py): XGBoost on Elo + form features. Honest finding — Dixon-Coles (RPS 0.167) still edges XGBoost (0.173) because elo_diff dominates and we lack market-value covariates. Hybrid is ready for when squad value/age data is added (needs the Kaggle token).
  • Stage 3 news layer (src/news/): Google News RSS → OpenAI Structured Outputs (injuries/suspensions/returns with confidence) → bounded, shrunk Elo deltas (capped ±40) → re-simulate. Adjusts model inputs, not final percentages, so a weakened rival coherently lifts everyone else's odds. Output: outputs/title_probabilities_news.csv + auditable outputs/news_adjustments.json.
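As an illustration of the news layer's final step, the sketch below turns extracted events into bounded, shrunk Elo deltas and applies them to the ratings before re-simulation. Only the ±40 cap and the idea of confidence-weighted shrinkage come from the description above; the `NewsEvent` fields, the `SHRINK` factor, the raw impact values, and the team labels are hypothetical, and the RSS fetch and OpenAI extraction are not shown.

```python
from dataclasses import dataclass

CAP = 40.0     # maximum absolute Elo adjustment per team (from the README)
SHRINK = 0.5   # hypothetical shrinkage factor applied to every event

@dataclass
class NewsEvent:
    team: str
    kind: str              # "injury", "suspension" or "return"
    raw_elo_impact: float  # hypothetical pre-shrink impact; negative weakens
    confidence: float      # extraction confidence in [0, 1]

def elo_deltas(events):
    """Sum shrunk, confidence-weighted impacts per team, then clip to +/-CAP."""
    totals = {}
    for e in events:
        totals[e.team] = totals.get(e.team, 0.0) + SHRINK * e.confidence * e.raw_elo_impact
    return {team: max(-CAP, min(CAP, d)) for team, d in totals.items()}

def apply_deltas(elo, deltas):
    """Adjust model inputs (ratings), leaving final percentages to the simulator."""
    return {team: rating + deltas.get(team, 0.0) for team, rating in elo.items()}

# Made-up events for two placeholder teams.
events = [
    NewsEvent("Team A", "injury", -60.0, 0.9),
    NewsEvent("Team A", "suspension", -50.0, 0.8),
    NewsEvent("Team B", "return", 20.0, 0.5),
]
deltas = elo_deltas(events)
adjusted = apply_deltas({"Team A": 2000.0, "Team B": 1950.0, "Team C": 1900.0}, deltas)
print(deltas)
print(adjusted)
```

Because the adjustment is made to ratings and the tournament is then re-simulated, Team C's title odds would rise when Team A is weakened even though Team C has no news of its own.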

Notes

  • Elo is computed from the same results.csv used for training, which eliminates cross-source team-name mismatches and gives full control of as-of-date logic.
  • Host advantage is modeled as an effective Elo bump (USA +60, Mexico/Canada +40).
  • All RNGs are seeded; sim count + seed are logged to outputs/sim_meta.json.
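The first two notes can be sketched together: an as-of-date Elo pass that records each team's rating before the match is played (so no feature ever sees the result it is used to predict), plus the host bump applied as an effective rating. The K-factor, starting rating, and match tuples are illustrative assumptions rather than the settings in src/models/elo.py; the bump values are the ones listed above, and when exactly the repo applies them is not specified here.

```python
from collections import defaultdict

HOST_BUMP = {"USA": 60.0, "Mexico": 40.0, "Canada": 40.0}  # values from the notes

def expected_score(r_a, r_b):
    """Standard Elo expectancy for the first team."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def as_of_date_elo(matches, k=20.0, start=1500.0):
    """matches: (iso_date, home, away, home_goals, away_goals) tuples.

    Returns (rows, final_ratings), where each row holds the ratings as they
    stood BEFORE that match. k and start are placeholder values.
    """
    ratings = defaultdict(lambda: start)
    rows = []
    for date, home, away, hg, ag in sorted(matches):  # chronological order
        pre_h, pre_a = ratings[home], ratings[away]
        rows.append((date, home, away, pre_h, pre_a))  # leakage-free features
        score = 1.0 if hg > ag else 0.5 if hg == ag else 0.0
        delta = k * (score - expected_score(pre_h, pre_a))
        ratings[home] = pre_h + delta  # update only after recording
        ratings[away] = pre_a - delta
    return rows, dict(ratings)

def effective_elo(team, rating):
    """Rating used in simulation, with the host bump where applicable."""
    return rating + HOST_BUMP.get(team, 0.0)

# Toy history, deliberately out of order to show the sort.
matches = [
    ("2024-01-02", "B", "A", 0, 0),
    ("2024-01-01", "A", "B", 1, 0),
]
rows, final = as_of_date_elo(matches)
print(rows)
print(effective_elo("USA", 1800.0), effective_elo("Spain", 1800.0))
```

Recording the pre-match rating before the update is what keeps the training features strictly as-of-date.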

About

One of the most advanced World Cup prediction models in the world, combining data, machine learning, and an LLM that analyzes news for injuries and other issues.
