Model Calibration Report
Across 980 World Cup matches, the model's top-result probabilities are reasonably calibrated overall, with a Brier score of 0.5746 and an expected calibration error (ECE) of 3.6%. The same walk-forward test gives a 56.3% hit rate for the model's top win/draw/loss pick. These are probabilities, not guarantees.
World Cup matches: 980
Tournaments: 23
Top-pick hit rate: 56.3%
Brier score: 0.5746
Expected calibration error: 3.6%
Report update date: 2026-06-16
Citable definition: calibration checks whether predicted probabilities match observed outcomes over many matches. A well-calibrated 60% forecast should come true close to 60% of the time over a large enough sample.
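The Brier score quoted in the summary can be illustrated with a short sketch. The report does not say which variant it uses; the value 0.5746 suggests the multi-class form, which sums squared errors over win, draw and loss and ranges from 0 (perfect) to 2 (confidently wrong). That reading is an assumption, and the function names below are illustrative only.

```python
from typing import Sequence, Tuple

# Outcome order assumed for every probability triple in this sketch.
OUTCOMES = ("win", "draw", "loss")


def brier_score(probs: Sequence[float], actual: str) -> float:
    """Multi-class Brier score for one match (assumed variant).

    Sums the squared error between each predicted probability and the
    0/1 indicator of whether that outcome actually happened.
    """
    return sum(
        (p - (1.0 if outcome == actual else 0.0)) ** 2
        for p, outcome in zip(probs, OUTCOMES)
    )


def mean_brier(forecasts: Sequence[Tuple[Sequence[float], str]]) -> float:
    """Average the per-match Brier score over many (probs, actual) pairs."""
    return sum(brier_score(p, a) for p, a in forecasts) / len(forecasts)


# A forecaster who always says 1/3 for each outcome scores about 0.667.
uniform = brier_score((1 / 3, 1 / 3, 1 / 3), "draw")

# A 70% favourite that wins scores low; the same forecast scores high on an upset.
favourite_wins = brier_score((0.7, 0.2, 0.1), "win")    # 0.09 + 0.04 + 0.01
favourite_loses = brier_score((0.7, 0.2, 0.1), "loss")  # 0.49 + 0.04 + 0.81
```

Under this assumed definition, a score of 0.5746 sits below the roughly 0.667 that a uniform one-third guess would earn, so lower is better and the model beats that naive baseline.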
What This Means
The model is best calibrated in the 40-80% confidence range, where the gap between predicted probability and actual hit rate stays within 4.5 percentage points. Above 80% the model has been overconfident (for example, 84.3% predicted against 72.9% observed in the 80-90% bucket), and those buckets are based on fewer historical matches, so they should be treated with more caution. In plain English: a 60-70% pick has historically behaved like a strong but still beatable favourite, not a guaranteed winner.
Reliability by Confidence Bucket
Each row groups matches by the model's highest predicted result probability before kick-off. For example, the 60-70% bucket asks: when the model's top pick was around two-thirds likely, how often did it actually happen?
| Model confidence | Matches | Avg predicted probability | Actual hit rate | Gap (actual minus predicted, percentage points) |
|---|---|---|---|---|
| 0-10% | 0 | — | — | — |
| 10-20% | 0 | — | — | — |
| 20-30% | 0 | — | — | — |
| 30-40% | 99 | 38.6% | 36.4% | -2.2% |
| 40-50% | 283 | 45.0% | 49.5% | +4.5% |
| 50-60% | 251 | 54.7% | 54.2% | -0.6% |
| 60-70% | 177 | 64.8% | 67.8% | +3.0% |
| 70-80% | 110 | 74.3% | 70.9% | -3.4% |
| 80-90% | 48 | 84.3% | 72.9% | -11.4% |
| 90-100% | 12 | 93.2% | 58.3% | -34.9% |
Reliability Diagram
Buckets with very few matches are noisy. In this report, the 90-100% bucket has only 12 matches (7 of which the top pick got right), so its gap should not be over-interpreted.
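The headline figures can be cross-checked against the bucket table with a few lines of arithmetic. The sketch below assumes ECE is the match-weighted average of the absolute gap per bucket, which is the usual definition but is not spelled out in the report. The bucket values are copied from the table; the variable and function names are illustrative.

```python
import math

# (matches, avg predicted probability %, actual hit rate %) from the table.
# Empty buckets (0-30%) are omitted because they carry no weight.
BUCKETS = [
    (99, 38.6, 36.4),
    (283, 45.0, 49.5),
    (251, 54.7, 54.2),
    (177, 64.8, 67.8),
    (110, 74.3, 70.9),
    (48, 84.3, 72.9),
    (12, 93.2, 58.3),
]

total_matches = sum(n for n, _, _ in BUCKETS)

# Assumed ECE definition: match-weighted mean of |actual - predicted|.
ece = sum(n * abs(actual - pred) for n, pred, actual in BUCKETS) / total_matches

# Overall top-pick hit rate: match-weighted mean of the bucket hit rates.
hit_rate = sum(n * actual for n, _, actual in BUCKETS) / total_matches


def hit_rate_std_error(n: int, hit_rate_pct: float) -> float:
    """Binomial standard error of a hit rate, in percentage points."""
    p = hit_rate_pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


# Sampling noise in the smallest and the largest bucket.
se_90_100 = hit_rate_std_error(12, 58.3)
se_40_50 = hit_rate_std_error(283, 49.5)
```

With the table's rounded inputs this gives 980 matches, an ECE that rounds to 3.6% and a hit rate that rounds to 56.3%, matching the summary. The binomial standard error for the 12-match bucket is roughly 14 percentage points, which is why that row's gap is so uncertain.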
Method
We replay every international match in the dataset (49,421 games) in date order. Before each historical World Cup match, the model predicts using only the Elo ratings available at that moment, then the real result updates the ratings. No future match information is used.
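The replay described above can be sketched as a single loop: predict from the current ratings first, then update them with the result. This is a minimal illustration and not the report's code. The `Match` fields, the K-factor of 20, the 1500 starting rating and the plain Elo expected-score formula are all assumptions chosen to keep the example short.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Match:
    date: str            # ISO date, so string order equals chronological order
    home: str
    away: str
    home_goals: int
    away_goals: int
    is_world_cup: bool


K_FACTOR = 20.0          # hypothetical; the report does not state its K-factor
START_RATING = 1500.0    # hypothetical starting rating for unseen teams


def expected_score(r_home: float, r_away: float) -> float:
    """Standard Elo expected score for the home side."""
    return 1.0 / (1.0 + 10 ** ((r_away - r_home) / 400.0))


def walk_forward(matches: List[Match]) -> Tuple[List[Tuple[Match, float]], Dict[str, float]]:
    """Replay matches in date order, forecasting before each rating update."""
    ratings: Dict[str, float] = {}
    forecasts: List[Tuple[Match, float]] = []
    for m in sorted(matches, key=lambda x: x.date):
        r_home = ratings.get(m.home, START_RATING)
        r_away = ratings.get(m.away, START_RATING)
        exp_home = expected_score(r_home, r_away)

        # Record the forecast BEFORE the result is used, World Cup matches only.
        if m.is_world_cup:
            forecasts.append((m, exp_home))

        # Only now does the real result update the ratings.
        if m.home_goals > m.away_goals:
            score_home = 1.0
        elif m.home_goals == m.away_goals:
            score_home = 0.5
        else:
            score_home = 0.0
        delta = K_FACTOR * (score_home - exp_home)
        ratings[m.home] = r_home + delta
        ratings[m.away] = r_away - delta
    return forecasts, ratings
```

Because every match, not just World Cup games, passes through the update step, friendlies and qualifiers shape the ratings that later World Cup forecasts rely on, while no forecast ever sees its own result.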
The probability model uses Elo + Poisson with `eloToGoals=0.0022`, base goals 1.35, and a 100-point Elo host advantage applied only to non-neutral matches. The report covers all 23 World Cups from 1930 to 2026.
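To make the Elo + Poisson step concrete, here is one way the stated parameters could turn two ratings into win/draw/loss probabilities. The report gives the constants but not the formula, so the linear mapping from rating difference to expected goals, the 0.1 floor, the independent-Poisson assumption and the 10-goal cap are all assumptions of this sketch, and the function names are illustrative.

```python
import math
from typing import Tuple

ELO_TO_GOALS = 0.0022    # from the report
BASE_GOALS = 1.35        # from the report
HOST_ADVANTAGE = 100.0   # Elo points, from the report (non-neutral matches only)
MAX_GOALS = 10           # assumption: truncate the score grid here
MIN_LAMBDA = 0.1         # assumption: floor so expected goals stay positive


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k goals under a Poisson distribution."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def outcome_probs(elo_a: float, elo_b: float, a_is_host: bool = False) -> Tuple[float, float, float]:
    """Return (win, draw, loss) probabilities for team A.

    Assumed mapping: each side's expected goals shift linearly with the
    rating difference around the base rate.
    """
    diff = elo_a - elo_b + (HOST_ADVANTAGE if a_is_host else 0.0)
    lam_a = max(MIN_LAMBDA, BASE_GOALS + ELO_TO_GOALS * diff)
    lam_b = max(MIN_LAMBDA, BASE_GOALS - ELO_TO_GOALS * diff)

    pmf_a = [poisson_pmf(k, lam_a) for k in range(MAX_GOALS + 1)]
    pmf_b = [poisson_pmf(k, lam_b) for k in range(MAX_GOALS + 1)]

    win = draw = loss = 0.0
    for goals_a, p_a in enumerate(pmf_a):
        for goals_b, p_b in enumerate(pmf_b):
            p = p_a * p_b  # independent scoring is an assumption
            if goals_a > goals_b:
                win += p
            elif goals_a == goals_b:
                draw += p
            else:
                loss += p

    # Renormalise to absorb the small mass lost by truncating the grid.
    total = win + draw + loss
    return win / total, draw / total, loss / total


neutral = outcome_probs(1800.0, 1800.0)
hosting = outcome_probs(1800.0, 1800.0, a_is_host=True)
```

The largest of the three probabilities is the "model confidence" that the reliability table groups into buckets. Under this sketch, two equally rated teams on neutral ground get identical win and loss probabilities, and hosting tilts the balance toward the home side.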
Match data comes from the open-source martj42/international_results dataset (every international match since 1872); World Cup 2026 fixtures come from TheSportsDB.
FAQ
What does model calibration mean?
Calibration checks whether predicted probabilities match real outcomes. If a model gives many teams a 60% chance, those teams should win close to 60% of the time over a large enough sample.
How many matches are included?
This report covers 980 World Cup finals matches across 23 tournaments from 1930 to 2026, evaluated with a walk-forward test that uses no future data.
How should I read the confidence buckets?
Each bucket compares what the model said before kick-off with what actually happened. The model is best calibrated in the 40-80% range, while very high-confidence buckets have fewer historical matches and should be treated with more caution.
Do calibrated probabilities guarantee results?
No. Calibration is a historical reliability check, not a guarantee. Football remains uncertain, and small buckets can be noisy.
Updated 2026-06-16.