Model Calibration Report

Across 980 World Cup matches, the model's top-result probabilities are reasonably calibrated overall, with a Brier score of 0.5746 and an expected calibration error (ECE) of 3.6%. The same walk-forward test gives a 56.3% hit rate for the top win/draw/loss pick. These are probabilities, not guarantees.

World Cup matches: 980
Tournaments: 23
Top-pick hit rate: 56.3%
Brier score: 0.5746
Expected calibration error: 3.6%
Report update date: 2026-06-16

Citable definition: calibration checks whether predicted probabilities match observed outcomes over many matches. A well-calibrated 60% forecast should come true close to 60% of the time over a large enough sample.
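The Brier score quoted in the summary measures something related but distinct: the squared distance between forecast and outcome. The sketch below assumes the multiclass form (squared errors summed over win, draw and loss, then averaged over matches), which the report does not confirm. Under that form a perfect forecast scores 0 and a uniform one-third forecast scores about 0.667. The forecasts and outcomes are hypothetical.

```python
def brier_score(forecasts, outcomes):
    """Mean multiclass Brier score (assumed form, not confirmed by the report).

    forecasts: list of (p_win, p_draw, p_loss) tuples
    outcomes:  list of indices, 0 = win, 1 = draw, 2 = loss
    """
    total = 0.0
    for probs, actual in zip(forecasts, outcomes):
        # Squared error against a one-hot vector for the real result.
        total += sum(
            (p - (1.0 if i == actual else 0.0)) ** 2
            for i, p in enumerate(probs)
        )
    return total / len(forecasts)


# Hypothetical forecasts: a confident correct pick and an uninformative one.
forecasts = [(0.60, 0.25, 0.15), (1 / 3, 1 / 3, 1 / 3)]
outcomes = [0, 2]
score = brier_score(forecasts, outcomes)
uniform_score = brier_score([(1 / 3, 1 / 3, 1 / 3)], [0])
```

Lower is better. A score can be low because forecasts are sharp, well calibrated, or both, which is why the report lists the calibration error separately.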

What This Means

The model is best calibrated in the 40-80% confidence range, where predicted probabilities stay within 4.5 percentage points of real World Cup outcomes. Above 80% the model has been overconfident, but those buckets rest on far fewer historical matches (48 in the 80-90% bucket, 12 in the 90-100% bucket), so they should be treated with more caution. In plain English: a 60-70% pick has historically behaved like a strong but still beatable favourite, not a guaranteed winner.

Reliability by Confidence Bucket

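Each row groups matches by the model's highest predicted result probability before kick-off. Because the most likely of three results (win, draw, loss) always has a probability of at least one third, the buckets below 30% are empty. For example, the 60-70% bucket asks: when the model's top pick was around two-thirds likely, how often did it actually happen?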
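The grouping described above can be sketched as follows. This is an illustration, not the report's own code: the function name, the record layout and the sample records are all hypothetical, and 10% bucket widths are assumed from the table.

```python
from collections import defaultdict


def reliability_table(records):
    """Group (top_prob, hit) records into 10% confidence buckets.

    records: iterable of (top_prob, hit) pairs, where top_prob is the
             model's highest result probability (0..1) and hit is True
             when that top pick actually happened.
    """
    bins = defaultdict(list)
    for top_prob, hit in records:
        # Multiply rather than divide by 0.1 to limit float rounding;
        # a probability of exactly 1.0 falls into the last bucket.
        idx = min(int(top_prob * 10), 9)
        bins[idx].append((top_prob, hit))

    rows = []
    for idx in sorted(bins):
        items = bins[idx]
        n = len(items)
        avg_model = sum(p for p, _ in items) / n
        hit_rate = sum(1 for _, h in items if h) / n
        rows.append({
            "bucket": f"{idx * 10}-{idx * 10 + 10}%",
            "matches": n,
            "avg_model": avg_model,
            "hit_rate": hit_rate,
            "gap": hit_rate - avg_model,  # actual minus model
        })
    return rows


# Hypothetical records, not taken from the report's dataset.
sample = [(0.62, True), (0.66, True), (0.68, False), (0.45, False), (0.41, True)]
rows = reliability_table(sample)
```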

Model confidence	Matches	Avg model probability	Actual hit rate	Gap (actual minus model, points)
0-10%	0	—	—	—
10-20%	0	—	—	—
20-30%	0	—	—	—
30-40%	99	38.6%	36.4%	-2.2%
40-50%	283	45.0%	49.5%	+4.5%
50-60%	251	54.7%	54.2%	-0.6%
60-70%	177	64.8%	67.8%	+3.0%
70-80%	110	74.3%	70.9%	-3.4%
80-90%	48	84.3%	72.9%	-11.4%
90-100%	12	93.2%	58.3%	-34.9%
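The headline figures can be cross-checked from the table itself. The sketch below assumes ECE means the match-weighted average of the absolute gaps, which is the usual definition but is not spelled out in this report. It works from the rounded percentages in the table, so the last digit can differ slightly from a calculation on unrounded values.

```python
# (bucket, matches, avg model %, actual hit rate %) copied from the table.
# The three empty buckets below 30% are omitted.
BUCKETS = [
    ("30-40%", 99, 38.6, 36.4),
    ("40-50%", 283, 45.0, 49.5),
    ("50-60%", 251, 54.7, 54.2),
    ("60-70%", 177, 64.8, 67.8),
    ("70-80%", 110, 74.3, 70.9),
    ("80-90%", 48, 84.3, 72.9),
    ("90-100%", 12, 93.2, 58.3),
]

total = sum(n for _, n, _, _ in BUCKETS)

# Assumed ECE definition: match-weighted mean absolute gap.
ece = sum(n * abs(actual - model) for _, n, model, actual in BUCKETS) / total

# The overall hit rate is the match-weighted mean of the bucket hit rates.
hit_rate = sum(n * actual for _, n, _, actual in BUCKETS) / total

print(f"matches={total} ECE={ece:.1f}% hit rate={hit_rate:.1f}%")
```

Under these assumptions the table reproduces the reported 980 matches, 3.6% ECE and 56.3% hit rate. The two buckets above 80% hold about 6% of the matches but contribute over a quarter of the weighted error.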

Reliability Diagram

Figure: reliability diagram comparing the average model probability with the actual hit rate for each 10% confidence bucket. The 0-30% buckets are empty; the values for the remaining buckets are those listed in the table above.

Buckets with very few matches are noisy. In this report, the 90-100% bucket has only 12 matches, so its gap should not be over-interpreted.
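To put a number on that noise, one option is a Wilson score interval around each bucket's hit rate. This is an added illustration, not part of the report's method. The hit counts are back-calculated from the rounded percentages in the table (58.3% of 12 is 7, and 49.5% of 283 is 140), so they are assumptions.

```python
import math


def wilson_interval(hits, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = hits / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Hit counts inferred from the table's rounded percentages (assumption).
lo_small, hi_small = wilson_interval(7, 12)      # 90-100% bucket
lo_large, hi_large = wilson_interval(140, 283)   # 40-50% bucket

print(f"90-100%: {lo_small:.1%} to {hi_small:.1%}")
print(f"40-50%:  {lo_large:.1%} to {hi_large:.1%}")
```

With only 12 matches the interval spans roughly 32% to 81%, several times wider than the interval for the 283-match bucket.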

Method

We replay every international match in the dataset (49,421 games) in date order. Before each historical World Cup match, the model predicts using only the Elo ratings available at that moment, then the real result updates the ratings. No future match information is used.
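The replay order matters: each prediction has to be recorded before that match's result touches the ratings. A minimal sketch of such a walk-forward loop is shown below. The field names, the starting rating of 1500, the update factor of 20 and the standard logistic Elo expectation are all assumptions for illustration; the report does not state its update rule.

```python
from collections import defaultdict

K = 20.0         # assumed update factor (not given in the report)
START = 1500.0   # assumed starting rating (not given in the report)


def expected_score(r_a, r_b):
    """Standard logistic Elo expectation for side A."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def walk_forward(matches):
    """Replay matches in date order, snapshotting ratings before World Cup games.

    matches: list of dicts with hypothetical keys
             date (ISO string), home, away, home_goals, away_goals, is_world_cup
    """
    ratings = defaultdict(lambda: START)
    snapshots = []
    for m in sorted(matches, key=lambda m: m["date"]):
        r_home, r_away = ratings[m["home"]], ratings[m["away"]]

        # 1. Record the pre-match ratings; the prediction may use only these.
        if m["is_world_cup"]:
            snapshots.append((m["date"], m["home"], m["away"], r_home, r_away))

        # 2. Only afterwards does the real result update the ratings.
        if m["home_goals"] > m["away_goals"]:
            actual = 1.0
        elif m["home_goals"] == m["away_goals"]:
            actual = 0.5
        else:
            actual = 0.0
        delta = K * (actual - expected_score(r_home, r_away))
        ratings[m["home"]] = r_home + delta
        ratings[m["away"]] = r_away - delta
    return snapshots, dict(ratings)


# Hypothetical two-match history, deliberately given out of order.
history = [
    {"date": "2000-06-01", "home": "A", "away": "B",
     "home_goals": 1, "away_goals": 1, "is_world_cup": True},
    {"date": "2000-01-01", "home": "A", "away": "B",
     "home_goals": 2, "away_goals": 0, "is_world_cup": False},
]
snapshots, final_ratings = walk_forward(history)
```

In this toy history the World Cup snapshot reflects the earlier friendly but not the World Cup result itself, which is the property that keeps future information out of the test.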

The probability model combines Elo ratings with a Poisson goal model, using `eloToGoals=0.0022`, base goals of 1.35, and a 100-point host advantage applied only to non-neutral matches. The report covers all 23 World Cups from 1930 to 2026.
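The report names the parameters but not the formula that combines them. The sketch below shows one plausible reading, offered only as an illustration: the Elo difference scales each side's expected goals multiplicatively around the base rate, goals are treated as independent Poisson draws, and result probabilities come from summing over a score grid. The multiplicative form, the independence assumption, the 10-goal cap and all function names are assumptions.

```python
import math

ELO_TO_GOALS = 0.0022    # from the report
BASE_GOALS = 1.35        # from the report
HOST_ADVANTAGE = 100.0   # from the report, non-neutral matches only


def poisson_pmf(k, lam):
    """Probability of exactly k goals when the expected count is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def outcome_probs(elo_home, elo_away, neutral, max_goals=10):
    """Return (win, draw, loss) probabilities for the first-named team."""
    diff = elo_home - elo_away + (0.0 if neutral else HOST_ADVANTAGE)

    # ASSUMPTION: multiplicative link between Elo difference and goals.
    lam_home = BASE_GOALS * math.exp(ELO_TO_GOALS * diff)
    lam_away = BASE_GOALS * math.exp(-ELO_TO_GOALS * diff)

    win = draw = loss = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, lam_home) * poisson_pmf(a, lam_away)
            if h > a:
                win += p
            elif h == a:
                draw += p
            else:
                loss += p

    # Renormalise to absorb the small mass beyond the score grid.
    total = win + draw + loss
    return win / total, draw / total, loss / total


even_neutral = outcome_probs(1800, 1800, neutral=True)
even_hosted = outcome_probs(1800, 1800, neutral=False)
favourite = outcome_probs(2000, 1700, neutral=True)
```

The top-result probability used for the confidence buckets would then be the largest of the three values returned for each match.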

Match data comes from the open-source martj42/international_results dataset (every international match since 1872). World Cup 2026 fixtures come from TheSportsDB.

FAQ

What does model calibration mean?

Calibration checks whether predicted probabilities match real outcomes. If a model gives many teams a 60% chance, those teams should win close to 60% of the time over a large enough sample.

How many matches are included?

This report covers 980 World Cup finals matches across 23 tournaments from 1930 to 2026, evaluated with a walk-forward test that uses no future data.

How should I read the confidence buckets?

Each bucket compares what the model said before kick-off with what actually happened. The model is best calibrated in the 40-80% range, while very high-confidence buckets have fewer historical matches and should be treated with more caution.

Do calibrated probabilities guarantee results?

No. Calibration is a historical reliability check, not a guarantee. Football remains uncertain, and small buckets can be noisy.

Updated 2026-06-16.