2026 World Cup Prediction Model

Two-stage statistical model that estimates each team's probability of winning the 2026 FIFA World Cup (and of reaching each knockout round).

Pipeline

Stage	Module	What it does
Data	`src/data/download.py`	Pull `results.csv` (martj42) from GitHub — no Kaggle token needed. Kaggle datasets optional.
Clean	`src/data/clean.py`	Canonical team-name map, date parsing, dedupe.
Elo	`src/models/elo.py`	Self-computed as-of-date Elo (leakage-free) from match history.
Features	`src/data/features.py`	Time-decay + match-importance weights.
Match model	`src/models/poisson.py`	Elo-driven Poisson with Dixon-Coles low-score correction (weighted MLE).
Eval	`src/eval/`	RPS (primary), log-loss, Brier, accuracy on a time-split holdout.
Simulator	`src/simulate/`	Official 2026 bracket + Monte Carlo → title / round probabilities.
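As an illustration of the match-model row above, here is a minimal sketch of an Elo-driven Poisson score model with the Dixon-Coles low-score correction. It is not the code in `src/models/poisson.py`: the log-linear link `expected_goals` and its `base` and `beta` values are hypothetical, and `rho` would normally come out of the weighted MLE rather than be set by hand.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """P(K = k) for a Poisson variable with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def expected_goals(elo_diff: float, base: float = 0.2, beta: float = 0.4) -> float:
    """Hypothetical link from Elo difference to a goal rate (assumed, not from the repo)."""
    return math.exp(base + beta * elo_diff / 400.0)


def dc_tau(x: int, y: int, lam: float, mu: float, rho: float) -> float:
    """Dixon-Coles adjustment; only the four scorelines 0-0, 0-1, 1-0, 1-1 are changed."""
    if x == 0 and y == 0:
        return 1.0 - lam * mu * rho
    if x == 0 and y == 1:
        return 1.0 + lam * rho
    if x == 1 and y == 0:
        return 1.0 + mu * rho
    if x == 1 and y == 1:
        return 1.0 - rho
    return 1.0


def outcome_probs(lam: float, mu: float, rho: float, max_goals: int = 10):
    """Return (home win, draw, away win) from a truncated score grid."""
    home = draw = away = 0.0
    for x in range(max_goals + 1):
        for y in range(max_goals + 1):
            p = dc_tau(x, y, lam, mu, rho) * poisson_pmf(x, lam) * poisson_pmf(y, mu)
            if x > y:
                home += p
            elif x == y:
                draw += p
            else:
                away += p
    total = home + draw + away  # renormalise for the truncated tail
    return home / total, draw / total, away / total


# Toy usage: a 150-point Elo edge for the first team, illustrative rho.
lam = expected_goals(+150.0)
mu = expected_goals(-150.0)
print(outcome_probs(lam, mu, rho=-0.1))
```

A negative `rho` shifts probability from 0-1 and 1-0 toward 0-0 and 1-1, so it raises the draw probability relative to independent Poisson scores.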

Quickstart

python -m venv .venv && .venv/Scripts/python -m pip install -r requirements.txt   # Windows layout; on macOS/Linux use .venv/bin/python
python -m src.data.download        # -> data/raw/results.csv (+ shootouts, goalscorers)
python -m src.data.clean           # -> data/interim/results_clean.pkl
python -m src.models.elo           # -> data/interim/results_elo.pkl  (+ top-20 ratings)
python -m src.eval.train_eval      # fit match model, report RPS/log-loss
python -m src.eval.backtest        # 2018 & 2022 out-of-sample validation
python -m src.models.hybrid        # XGBoost + ensemble vs Dixon-Coles
python -m src.news.run_news        # Stage 3: news -> bounded Elo deltas (needs OPENAI_API_KEY)
python -m src.simulate.montecarlo 20000   # -> outputs/title_probabilities*.csv

Set OPENAI_API_KEY (and optionally OPENAI_MODEL, default gpt-4o-mini) in a .env file at the repo root for the Stage 3 news layer.

Results

Title odds from 20,000 simulations, with the strength-uncertainty and news layers applied. The "90% range" column is the 90% credible interval across simulation batches.

#	Team	Win title	90% range	Reach final	Reach semis
1	Spain	25.7%	23.4–28.2%	37.3%	50.1%
2	Argentina	18.5%	17.4–19.9%	29.5%	41.9%
3	France	10.3%	8.9–11.9%	18.9%	34.6%
4	England	6.5%	5.6–7.6%	13.3%	24.7%
5	Brazil	4.9%	3.8–5.5%	10.0%	21.2%
6	Colombia	4.7%	3.2–5.9%	10.1%	18.7%
7	Portugal	4.2%	3.3–5.1%	9.0%	16.9%
8	Mexico	3.5%	2.7–4.2%	8.1%	18.4%
9	Ecuador	2.9%	2.2–3.8%	7.2%	17.2%
10	Germany	1.9%	1.3–2.5%	5.1%	13.1%

Full table (all 48 teams) in outputs/title_probabilities_news.csv; the model-only version (no news) is in outputs/title_probabilities.csv.

Match model (train < 2024-01-01, test ≥ 2024-01-01, 2,547 matches): accuracy 60.4%, RPS 0.167 (naive base-rate 0.227), log-loss 0.867. Lower RPS is better, so the model beats the base-rate baseline. The ordering of favorites matches the betting market, and the top favorites' title odds land in a plausible band of roughly 15–25%.
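For readers unfamiliar with the primary metric, the ranked probability score (RPS) for a three-outcome match can be sketched as follows. This is a generic implementation of the standard formula, not the code in `src/eval/`; the outcome ordering (home win, draw, away win) and the toy forecasts are assumptions for illustration.

```python
def rps(probs, outcome_index):
    """Ranked probability score for one match.

    probs: forecast probabilities in the ordered categories
           (home win, draw, away win).
    outcome_index: index of the category that actually occurred.
    Lower is better; 0.0 is a perfect forecast.
    """
    cum_forecast = 0.0
    cum_observed = 0.0
    total = 0.0
    # Sum squared differences of the cumulative distributions,
    # skipping the last category (both cumulatives equal 1 there).
    for i in range(len(probs) - 1):
        cum_forecast += probs[i]
        cum_observed += 1.0 if i == outcome_index else 0.0
        total += (cum_forecast - cum_observed) ** 2
    return total / (len(probs) - 1)


def mean_rps(forecasts, outcomes):
    """Average RPS over a holdout set."""
    return sum(rps(p, o) for p, o in zip(forecasts, outcomes)) / len(outcomes)


# Toy forecasts (invented numbers, not model output).
forecasts = [(0.6, 0.25, 0.15), (0.3, 0.3, 0.4)]
outcomes = [0, 1]  # first match: home win, second match: draw
print(mean_rps(forecasts, outcomes))
```

Because RPS works on cumulative probabilities, a forecast that puts its mass on an adjacent outcome (a draw when the home side won) is penalised less than one that favoured the opposite result, which is why it suits ordered win/draw/loss outcomes.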

Uncertainty layer: each simulated tournament re-draws team strengths from N(Elo, σ), with σ larger for teams that have played fewer recent matches (their ratings are noisier). This pulls the top favorite down (~26.5% → ~24.7%), spreads probability across the rest of the field, and yields a 90% credible interval per team. The point-estimate output is also saved for comparison (outputs/title_probabilities_pointest.csv).
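The re-draw step can be sketched roughly as below. This is a toy under stated assumptions, not the simulator in `src/simulate/`: the `sigma_for` formula and its constants, the four-team single-elimination bracket, and the team names and ratings are all hypothetical.

```python
import random


def sigma_for(recent_matches: int, base: float = 30.0, extra: float = 60.0) -> float:
    """Hypothetical rule: fewer recent matches -> larger rating uncertainty."""
    return base + extra / (1 + recent_matches) ** 0.5


def win_prob(elo_a: float, elo_b: float) -> float:
    """Standard Elo expectancy that A beats B (draws ignored in this toy)."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))


def simulate_title_odds(teams, n_sims: int, seed: int):
    """teams maps name -> (elo, recent_matches); size must be a power of two."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    wins = {name: 0 for name in teams}
    for _ in range(n_sims):
        # Re-draw every team's strength once per simulated tournament.
        drawn = {
            name: rng.gauss(elo, sigma_for(n_recent))
            for name, (elo, n_recent) in teams.items()
        }
        alive = list(teams)
        while len(alive) > 1:
            next_round = []
            for a, b in zip(alive[::2], alive[1::2]):
                a_wins = rng.random() < win_prob(drawn[a], drawn[b])
                next_round.append(a if a_wins else b)
            alive = next_round
        wins[alive[0]] += 1
    return {name: w / n_sims for name, w in wins.items()}


# Invented teams and ratings, purely for illustration.
teams = {"A": (2000.0, 25), "B": (1700.0, 25), "C": (1650.0, 6), "D": (1600.0, 12)}
print(simulate_title_odds(teams, n_sims=2000, seed=7))
```

Drawing the strengths once per tournament, rather than once per match, keeps a team's luck correlated across rounds, which is what widens the title distribution compared with a point-estimate run.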

Validation & layers (built)

Backtest (src/eval/backtest.py): out-of-sample on WC 2018 (acc 56%, RPS 0.209) and 2022 (acc 55%, RPS 0.224), strictly no-leakage. Both beat base-rate.
Hybrid + ensemble (src/models/hybrid.py): XGBoost on Elo + form features. Dixon-Coles (RPS 0.167) still edges XGBoost (0.173), because elo_diff dominates the feature set and market-value covariates are missing. The hybrid is ready for squad value/age data once it is added (this needs the Kaggle token).
Stage 3 news layer (src/news/): Google News RSS → OpenAI Structured Outputs (injuries/suspensions/returns with confidence) → bounded, shrunk Elo deltas (capped ±40) → re-simulate. Adjusts model inputs, not final percentages, so a weakened rival coherently lifts everyone else's odds. Output: outputs/title_probabilities_news.csv + auditable outputs/news_adjustments.json.
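To make "bounded, shrunk Elo deltas" concrete, here is a minimal sketch of how extracted news items might be turned into per-team adjustments. Only the ±40 cap comes from this README; the `NewsItem` fields, the `SHRINK` factor, and the sample items are hypothetical and do not describe the actual schema in `src/news/`.

```python
from dataclasses import dataclass

ELO_CAP = 40.0   # cap stated in the README
SHRINK = 0.5     # hypothetical shrinkage factor, not taken from the repo


@dataclass
class NewsItem:
    team: str
    kind: str          # e.g. "injury", "suspension", "return"
    raw_delta: float   # proposed Elo change before shrinkage
    confidence: float  # extractor confidence in [0, 1]


def team_deltas(items):
    """Sum confidence-weighted, shrunk deltas per team, then clamp to the cap."""
    totals = {}
    for item in items:
        conf = min(max(item.confidence, 0.0), 1.0)
        totals[item.team] = totals.get(item.team, 0.0) + SHRINK * conf * item.raw_delta
    return {team: max(-ELO_CAP, min(ELO_CAP, d)) for team, d in totals.items()}


# Invented items, purely for illustration.
items = [
    NewsItem("Team A", "injury", -20.0, 0.8),
    NewsItem("Team B", "injury", -90.0, 1.0),
    NewsItem("Team B", "suspension", -60.0, 1.0),
]
print(team_deltas(items))  # Team B's total is clamped at the cap
```

The resulting deltas are added to team ratings before re-simulating, so the adjustment flows through every match probability instead of being applied to the final percentages.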

Notes

Elo is computed from the same results.csv used for training, which eliminates cross-source team-name mismatches and gives full control of as-of-date logic.
Host advantage is modeled as an effective Elo bump (USA +60, Mexico/Canada +40).
All RNGs are seeded; sim count + seed are logged to outputs/sim_meta.json.
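The as-of-date Elo logic and the host bump from the notes above can be sketched as follows. This is an assumption-laden toy, not `src/models/elo.py`: the K-factor, the 1500 starting rating, the match tuple layout, and the sample matches are hypothetical, and any home-advantage or match-importance terms are omitted. Only the host bump values come from this README.

```python
HOST_BUMP = {"USA": 60.0, "Mexico": 40.0, "Canada": 40.0}  # values from the README


def expected(elo_a: float, elo_b: float) -> float:
    """Standard Elo expected score for A against B."""
    return 1.0 / (1.0 + 10 ** ((elo_b - elo_a) / 400.0))


def as_of_date_elo(matches, k: float = 20.0, start: float = 1500.0):
    """matches: (iso_date, home, away, home_goals, away_goals) tuples.

    Returns (rows, ratings). Each row stores the ratings as they stood
    BEFORE that match, so features built from rows never see the result
    they are meant to predict (no leakage).
    """
    ratings = {}
    rows = []
    for date, home, away, hg, ag in sorted(matches):  # chronological order
        r_home = ratings.get(home, start)
        r_away = ratings.get(away, start)
        rows.append((date, home, away, r_home, r_away))  # pre-match snapshot
        score = 1.0 if hg > ag else 0.5 if hg == ag else 0.0
        change = k * (score - expected(r_home, r_away))
        ratings[home] = r_home + change
        ratings[away] = r_away - change
    return rows, ratings


def effective_elo(team: str, elo: float) -> float:
    """Rating used in the simulator, with the host bump added for host nations."""
    return elo + HOST_BUMP.get(team, 0.0)


# Invented results, purely for illustration.
matches = [
    ("2024-03-02", "Team A", "Team C", 0, 0),
    ("2024-01-10", "Team A", "Team B", 2, 1),
]
rows, ratings = as_of_date_elo(matches)
print(rows)
print(effective_elo("USA", 1700.0))
```

Recording the snapshot before applying the update is the whole trick: the rating attached to a match reflects only earlier results, which is what keeps the time-split evaluation honest.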
