I put 12 AI models into a public World Cup prediction arena.
Not because I think anyone should use LLMs for betting. They should not. The page says “entertainment only” for a reason.
I did it because sports prediction is a surprisingly clean stress test for models:
- structured facts
- stale priors
- uncertainty
- calibration
- price-performance
- and the most painful thing for LLMs: admitting a favorite might draw
After 169 predictions and 21 settled scoring entries, the leaderboard is technically tied.
But the misses are already more useful than the winners.
TL;DR
- No, there is no “best World Cup AI model” yet. The sample is too small.
- All 12 models are currently tied on 3 points.
- Qwen3.5 Flash, Claude Opus 4.7, and Claude Sonnet 4.6 show 100% winner accuracy, but only on one settled pre-match prediction each.
- All 12 models got Colombia over Uzbekistan directionally right.
- All nine models with a valid pre-match prediction missed Portugal 1-1 Congo DR because each picked Portugal to win.
- The early lesson is not “flagship models win.” It is “favorite bias is real, and cheap models are good enough to poll at scale.”
Full live scoreboard: WorldCup AI Arena
What I actually tracked
The public dashboard tracks model forecasts, match results, team context, and prediction accuracy.
Snapshot used here: 2026-06-18 05:53 UTC.
| Metric | Value |
|---|---|
| Models tracked | 12 |
| Total predictions | 169 |
| Settled scoring entries | 21 |
| Total leaderboard points | 36 |
| Exact score hits | 0 |
| Correct-winner hits | 12 |
| Average winner accuracy | 62.5% |
The model list includes Claude, GPT, Gemini, DeepSeek, Qwen, Kimi, and Grok variants.
Important caveat: only pre-match predictions count toward accuracy. Post-match reviews are useful for explanation, but they already know the result. They are not forecasts.
The current leaderboard
Every model has 3 points right now.
That sounds boring until you look at the sample size.
| Model | Tier | Predictions | Settled | Winner hits | Points | Accuracy |
|---|---|---|---|---|---|---|
| Qwen3.5 Flash | wildcard | 13 | 1 | 1 | 3 | 100% |
| Claude Opus 4.7 | flagship | 14 | 1 | 1 | 3 | 100% |
| Claude Sonnet 4.6 | flagship | 14 | 1 | 1 | 3 | 100% |
| GPT-5.4 | flagship | 15 | 2 | 1 | 3 | 50% |
| Gemini 3.1 Pro | flagship | 15 | 2 | 1 | 3 | 50% |
| DeepSeek V4 Pro | value | 15 | 2 | 1 | 3 | 50% |
| Qwen 3.7 Plus | value | 14 | 2 | 1 | 3 | 50% |
| Kimi K2.6 | value | 14 | 2 | 1 | 3 | 50% |
| Gemini 2.5 Flash | value | 14 | 2 | 1 | 3 | 50% |
| Grok 4.1 Fast Reasoning | wildcard | 14 | 2 | 1 | 3 | 50% |
| DeepSeek V4 Flash | wildcard | 14 | 2 | 1 | 3 | 50% |
| GPT-5 Nano | wildcard | 13 | 2 | 1 | 3 | 50% |
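A note on the 62.5% average from the summary table: it is consistent with a mean of the per-model accuracy column, which is not the same as pooling all hits over all settled entries. A minimal sketch to check both against the rows above (the function names are mine, for illustration only):

```python
# (model, settled pre-match predictions, winner hits), copied from the table above
rows = [
    ("Qwen3.5 Flash", 1, 1),
    ("Claude Opus 4.7", 1, 1),
    ("Claude Sonnet 4.6", 1, 1),
    ("GPT-5.4", 2, 1),
    ("Gemini 3.1 Pro", 2, 1),
    ("DeepSeek V4 Pro", 2, 1),
    ("Qwen 3.7 Plus", 2, 1),
    ("Kimi K2.6", 2, 1),
    ("Gemini 2.5 Flash", 2, 1),
    ("Grok 4.1 Fast Reasoning", 2, 1),
    ("DeepSeek V4 Flash", 2, 1),
    ("GPT-5 Nano", 2, 1),
]


def macro_accuracy(rows):
    """Mean of per-model hit rates: every model counts equally."""
    return sum(hits / settled for _, settled, hits in rows) / len(rows)


def pooled_accuracy(rows):
    """Total hits over total settled entries: every prediction counts equally."""
    total_hits = sum(hits for _, _, hits in rows)
    total_settled = sum(settled for _, settled, _ in rows)
    return total_hits / total_settled


print(f"mean of per-model accuracy: {macro_accuracy(rows):.1%}")
print(f"pooled hits / settled:      {pooled_accuracy(rows):.1%}")
```

The two numbers diverge because the three models with a single settled prediction weigh as much in the mean as the nine models with two. With samples this small, the choice of average moves the headline figure by several points.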
My read: the leaderboard is not mature enough to crown a winner.
The first useful signal is elsewhere.
The obvious match: everyone got Colombia right
Uzbekistan vs Colombia ended 1-3.
All 12 models picked Colombia.
None got the exact score.
| Model | Prediction | Final | Winner hit |
|---|---|---|---|
| Claude Opus 4.7 | 0-2 Colombia | 1-3 Colombia | Yes |
| Claude Sonnet 4.6 | 1-2 Colombia | 1-3 Colombia | Yes |
| GPT-5.4 | 1-2 Colombia | 1-3 Colombia | Yes |
| Gemini 3.1 Pro | 0-2 Colombia | 1-3 Colombia | Yes |
| DeepSeek V4 Pro | 0-2 Colombia | 1-3 Colombia | Yes |
| Qwen 3.7 Plus | 0-2 Colombia | 1-3 Colombia | Yes |
| Kimi K2.6 | 0-2 Colombia | 1-3 Colombia | Yes |
| Gemini 2.5 Flash | 0-2 Colombia | 1-3 Colombia | Yes |
| Grok 4.1 Fast Reasoning | 0-2 Colombia | 1-3 Colombia | Yes |
| DeepSeek V4 Flash | 0-2 Colombia | 1-3 Colombia | Yes |
| GPT-5 Nano | 0-1 Colombia | 1-3 Colombia | Yes |
| Qwen3.5 Flash | 0-1 Colombia | 1-3 Colombia | Yes |
This is the kind of match where a cheap model can be enough.
If all you need is “which side is more likely,” then polling cheap models may beat paying a flagship model for every pick.
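Winner hits hide how close each scoreline was. A rough sketch of a goal-difference check on this match, assuming the scores in the table are written home-away (Uzbekistan first); the helper names are hypothetical:

```python
# Predicted scorelines as (Uzbekistan, Colombia) goals, copied from the table above.
# Assumption: the table lists scores in home-away order.
FINAL = (1, 3)
predictions = {
    "Claude Opus 4.7": (0, 2),
    "Claude Sonnet 4.6": (1, 2),
    "GPT-5.4": (1, 2),
    "Gemini 3.1 Pro": (0, 2),
    "DeepSeek V4 Pro": (0, 2),
    "Qwen 3.7 Plus": (0, 2),
    "Kimi K2.6": (0, 2),
    "Gemini 2.5 Flash": (0, 2),
    "Grok 4.1 Fast Reasoning": (0, 2),
    "DeepSeek V4 Flash": (0, 2),
    "GPT-5 Nano": (0, 1),
    "Qwen3.5 Flash": (0, 1),
}


def goal_diff(score):
    """Home goals minus away goals."""
    home, away = score
    return home - away


def goal_diff_error(predicted, final):
    """Absolute gap between predicted and actual goal difference."""
    return abs(goal_diff(predicted) - goal_diff(final))


errors = {model: goal_diff_error(p, FINAL) for model, p in predictions.items()}
exact_margin = [model for model, err in errors.items() if err == 0]
print(f"{len(exact_margin)} of {len(errors)} models matched the two-goal margin")
```

On this one match, the models that predicted 0-2 got the margin right without getting the score right, which is the kind of partial credit an exact-score column cannot show.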
The useful miss: every valid model missed Portugal-Congo DR
Portugal vs Congo DR ended 1-1.
Every valid pre-match model picked Portugal.
| Model | Prediction | Final | Outcome |
|---|---|---|---|
| GPT-5.4 | 2-0 Portugal | 1-1 | Miss |
| Gemini 3.1 Pro | 2-0 Portugal | 1-1 | Miss |
| DeepSeek V4 Pro | 2-0 Portugal | 1-1 | Miss |
| Qwen 3.7 Plus | 2-0 Portugal | 1-1 | Miss |
| Kimi K2.6 | 2-0 Portugal | 1-1 | Miss |
| Gemini 2.5 Flash | 2-0 Portugal | 1-1 | Miss |
| Grok 4.1 Fast Reasoning | 3-0 Portugal | 1-1 | Miss |
| DeepSeek V4 Flash | 2-0 Portugal | 1-1 | Miss |
| GPT-5 Nano | 2-1 Portugal | 1-1 | Miss |
That is the part I care about.
The models did not just get unlucky independently. They shared the same prior: Portugal strong, Congo DR weaker, therefore Portugal win.
That is a classic LLM failure mode.
It shows up outside sports too:
- “OpenAI usually ships X, so the next release will be X”
- “Claude is the premium model, so it must win this task”
- “The famous team/vendor/person is probably the right answer”
- “Historical quality beats current uncertainty”
In other words, the World Cup is a cute interface for a serious eval problem: models are often too willing to convert reputation into certainty.
The cost angle
The dashboard includes listed price tiers for each model.
Here is the funny part: the cheapest model currently has the cleanest-looking row.
| Model | Listed input / output price (USD per 1M tokens) | Current result |
|---|---|---|
| Qwen3.5 Flash | $0.026 / $0.263 | 1/1 winner hits |
| GPT-5 Nano | $0.049 / $0.388 | 1/2 winner hits |
| Claude Opus 4.7 | $5 / $25 | 1/1 winner hits |
| GPT-5.4 | $2.45 / $14.7 | 1/2 winner hits |
Do not overread that. One match is not proof.
But the unit economics are hard to ignore.
Suppose a prediction prompt uses 10K input tokens and 1K output tokens.
Approximate cost:
Qwen3.5 Flash:
10K * $0.026 / 1M + 1K * $0.263 / 1M = $0.000523
Claude Opus 4.7:
10K * $5 / 1M + 1K * $25 / 1M = $0.075
That is roughly a 143x spread for one prediction-shaped call.
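A small sketch for redoing this arithmetic with other token counts or prices. The prices come from the table above; the 10K/1K token counts are the same illustrative assumption as in the text, and `call_cost` is a name I made up for this example:

```python
# Listed (input, output) prices in USD per 1M tokens, from the price table.
PRICES = {
    "Qwen3.5 Flash": (0.026, 0.263),
    "GPT-5 Nano": (0.049, 0.388),
    "Claude Opus 4.7": (5.0, 25.0),
    "GPT-5.4": (2.45, 14.7),
}


def call_cost(input_tokens, output_tokens, input_price, output_price):
    """USD cost of one call, given prices quoted per 1M tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Assumed prompt shape: 10K input tokens, 1K output tokens.
costs = {
    model: call_cost(10_000, 1_000, inp, out)
    for model, (inp, out) in PRICES.items()
}

for model, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{model:16s} ${cost:.6f} per call")

spread = costs["Claude Opus 4.7"] / costs["Qwen3.5 Flash"]
print(f"Opus / Qwen Flash spread: {spread:.0f}x")
```

The spread depends on the input/output mix, so a prompt with much longer outputs would give a different ratio than the one above.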
If I were building a prediction system, I would not send every match to the most expensive model. I would route it.
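def pick_prediction_route(match_uncertainty, model_disagreement, budget_mode):
    # Budget mode overrides everything: poll three cheap models.
    if budget_mode == "cheap_poll":
        return ["qwen3.5-flash", "gpt-5-nano", "deepseek-v4-flash"]
    # Obvious match and the models agree: one cheap model is enough.
    if match_uncertainty == "low" and model_disagreement == "low":
        return ["qwen3.5-flash"]
    # Uncertain or contested match: escalate to a mixed panel.
    if match_uncertainty == "high" or model_disagreement == "high":
        return [
            "qwen3.5-flash",
            "deepseek-v4-pro",
            "gemini-3.1-pro",
            "claude-sonnet-4.6",
        ]
    # Everything in between: one cheap model plus one flagship.
    return ["qwen3.5-flash", "claude-sonnet-4.6"]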
Cheap models for breadth. Expensive models for disagreement.
That is the same routing logic I use for normal API workloads.
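The router takes `model_disagreement` as an input without saying where it comes from. One way it could be derived from a cheap poll is sketched below; `poll_summary` and its 0.75 threshold are hypothetical choices for illustration, not tuned values and not what the dashboard does:

```python
from collections import Counter


def poll_summary(picks, agree_threshold=0.75):
    """Summarize outcome picks such as "home", "draw", "away".

    Returns (consensus pick, share of votes for it, disagreement level).
    agree_threshold is an illustrative cutoff, not a tuned value.
    """
    if not picks:
        raise ValueError("need at least one pick")
    counts = Counter(picks)
    consensus, votes = counts.most_common(1)[0]
    share = votes / len(picks)
    level = "low" if share >= agree_threshold else "high"
    return consensus, share, level


# Colombia-style poll: every model picks the away side.
print(poll_summary(["away"] * 12))

# A split poll that would trigger escalation to the bigger panel.
print(poll_summary(["home", "home", "draw", "away"]))
```

One limit of this signal: a unanimous poll is not the same as a safe pick. The nine Portugal picks would register as low disagreement, so disagreement alone cannot catch a prior that every model shares.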
What I would measure next
Winner accuracy is not enough.
I want these metrics:
| Metric | Why it matters |
|---|---|
| Winner accuracy | Basic direction |
| Exact score | Hard mode |
| Goal difference | More informative than exact score alone |
| Brier score | Calibration |
| Confidence bucket accuracy | Overconfidence detection |
| Cost per correct winner | Production routing |
| Draw recall | Favorite-bias detector |
| Disagreement value | Whether ensembles help |
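Of the metrics in the table, the Brier score needs something the scoreline predictions shown here do not provide: a probability for each outcome. A minimal sketch of the three-way version, assuming each model were also asked for home/draw/away probabilities; the two forecasts below are invented for illustration, not real model outputs:

```python
OUTCOMES = ("home", "draw", "away")


def brier_score(probs, actual):
    """Multi-class Brier score for one match.

    Sum of squared gaps between forecast probabilities and the one-hot
    actual outcome. 0.0 is perfect, 2.0 is the worst possible.
    """
    if set(probs) != set(OUTCOMES):
        raise ValueError("probs must have exactly home, draw, away")
    if abs(sum(probs.values()) - 1.0) > 1e-6:
        raise ValueError("probabilities must sum to 1")
    if actual not in OUTCOMES:
        raise ValueError("unknown outcome")
    return sum((probs[o] - (1.0 if o == actual else 0.0)) ** 2 for o in OUTCOMES)


# Hypothetical forecasts for a favorite-at-home match.
confident = {"home": 0.80, "draw": 0.15, "away": 0.05}
hedged = {"home": 0.50, "draw": 0.30, "away": 0.20}

for name, forecast in (("confident", confident), ("hedged", hedged)):
    print(
        f"{name:9s} if home wins: {brier_score(forecast, 'home'):.3f}  "
        f"if draw: {brier_score(forecast, 'draw'):.3f}"
    )
```

The confident forecast scores better when the favorite wins and much worse when the match is drawn, which is the trade-off a winner-hit column cannot see.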
The biggest one is draw recall.
Portugal-Congo DR already suggests the models may underpredict draws when a prestigious team is involved.
If that pattern holds, it is more important than the leaderboard.
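Draw recall is simple to compute from scorelines alone. A sketch using the nine Portugal vs Congo DR predictions from the table earlier; the function names are mine, and a single match is far too little data to treat the result as a rate:

```python
def outcome(score):
    """Map a (home, away) scoreline to "home", "draw", or "away"."""
    home, away = score
    if home > away:
        return "home"
    if home < away:
        return "away"
    return "draw"


def draw_recall(records):
    """Share of actual draws that were predicted as draws.

    records is a list of (predicted_score, final_score) pairs.
    Returns None when the sample contains no drawn matches.
    """
    on_draws = [pred for pred, final in records if outcome(final) == "draw"]
    if not on_draws:
        return None
    return sum(1 for pred in on_draws if outcome(pred) == "draw") / len(on_draws)


def draw_prediction_rate(records):
    """Share of all predictions that called a draw, whatever the result."""
    if not records:
        return None
    return sum(1 for pred, _ in records if outcome(pred) == "draw") / len(records)


# Portugal vs Congo DR, final 1-1: seven models said 2-0, one 3-0, one 2-1.
FINAL = (1, 1)
portugal_congo = [((2, 0), FINAL)] * 7 + [((3, 0), FINAL), ((2, 1), FINAL)]

print("draw recall:", draw_recall(portugal_congo))
print("draw prediction rate:", draw_prediction_rate(portugal_congo))
```

Tracking the prediction rate next to recall helps separate two cases: models that never predict draws at all, and models that predict draws but on the wrong matches.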
What I’d do if I were tracking this live
I would not declare a winner until each model has at least 30 to 50 settled pre-match predictions.
For now:
- Track every match.
- Exclude post-match reviews from accuracy.
- Compare cheap vs flagship models by cost per correct winner.
- Watch draw prediction rate.
- Add a baseline from betting markets or Elo.
- Update after each matchday.
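To make the sample-size point concrete, here is a sketch of a Wilson score interval for winner accuracy. This is a standard way to put error bars on a hit rate, used here as my own illustration; the 20-of-40 row is a hypothetical future record, not data from the dashboard:

```python
import math


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a hit rate (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= hits <= n:
        raise ValueError("hits must be between 0 and n")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Current records from the leaderboard, plus one hypothetical larger sample.
for label, hits, n in (
    ("1 of 1 (100%)", 1, 1),
    ("1 of 2 (50%)", 1, 2),
    ("20 of 40 (hypothetical)", 20, 40),
):
    low, high = wilson_interval(hits, n)
    print(f"{label:24s} plausible accuracy: {low:.0%} to {high:.0%}")
```

At one or two settled predictions the interval covers most of the range between 0% and 100%, so a 100% row and a 50% row are statistically indistinguishable. Around 40 predictions the interval is still wide, but narrow enough to start separating models.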
If you want the full data-cited writeup and live links, I wrote the original breakdown here: AI World Cup Predictions 2026: 12 Models, Early Leaderboard.
Disclosure: I work on the research side at TokenMix, which is why I can wire this kind of multi-model scoreboard quickly.
Bottom line
The early World Cup AI leaderboard does not tell us which model is best yet.
It does tell us something useful: cheap models can match flagship consensus on obvious favorites, and all models can share the same bad prior on a draw.
That is a model-evaluation lesson, not betting advice.
If you were scoring this, would you reward exact score heavily, or focus on calibrated probabilities instead?