I Let 12 AI Models Predict the World Cup. The First 169 Picks Already Show a Pattern.

Greg Gifford

June 18, 2026

I put 12 AI models into a public World Cup prediction arena.

Not because I think anyone should use LLMs for betting. They should not. The page says entertainment only for a reason.

I did it because sports prediction is a surprisingly clean stress test for models:

structured facts
stale priors
uncertainty
calibration
price-performance
and the most painful thing for LLMs: admitting a favorite might draw

After 169 predictions and 21 settled scoring entries, the leaderboard is technically tied.

But the misses are already more useful than the winners.

TL;DR

No, there is no “best World Cup AI model” yet. The sample is too small.
12 models are currently tied on 3 points.
Qwen3.5 Flash, Claude Opus 4.7, and Claude Sonnet 4.6 show 100% winner accuracy, but only on one settled pre-match prediction each.
All 12 models got Colombia over Uzbekistan directionally right.
All nine models with a valid pre-match prediction missed Portugal 1-1 Congo DR because every one of them picked Portugal.
The early lesson is not “flagship models win.” It is “favorite bias is real, and cheap models are good enough to poll at scale.”

Full live scoreboard: WorldCup AI Arena

What I actually tracked

The public dashboard tracks model forecasts, match results, team context, and prediction accuracy.

Snapshot used here: 2026-06-18 05:53 UTC.

Metric	Value
Models tracked	12
Total predictions	169
Settled scoring entries	21
Total leaderboard points	36
Exact score hits	0
Correct-winner hits	12
Average winner accuracy	62.5%

The model list includes Claude, GPT, Gemini, DeepSeek, Qwen, Kimi, and Grok variants.

Important caveat: I count only pre-match predictions toward accuracy. Post-match reviews are useful for explanation, but they are written with the result already known, so they are not forecasts.
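A minimal sketch of that pre-match filter, assuming each stored prediction carries a creation timestamp and each match has a known kickoff time. The field names (`created_at`, `match_id`), the `kickoffs` lookup, and the sample timestamps are hypothetical; the dashboard's real schema is not shown in this post.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Prediction:
    model: str
    match_id: str
    created_at: datetime  # hypothetical field: when the forecast was stored


def pre_match_only(predictions, kickoffs):
    """Keep only forecasts written strictly before their match kicked off."""
    return [p for p in predictions if p.created_at < kickoffs[p.match_id]]


# Made-up timestamps, for illustration only.
kickoffs = {"m1": datetime(2026, 6, 17, 18, 0, tzinfo=timezone.utc)}
predictions = [
    Prediction("model-a", "m1", datetime(2026, 6, 17, 9, 0, tzinfo=timezone.utc)),
    Prediction("model-b", "m1", datetime(2026, 6, 17, 21, 0, tzinfo=timezone.utc)),
]

scored = pre_match_only(predictions, kickoffs)
print([p.model for p in scored])  # ['model-a']
```

Using timezone-aware timestamps on both sides keeps the comparison valid; Python raises an error if an aware and a naive datetime are compared.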

The current leaderboard

Every model has 3 points right now.

That sounds boring until you look at the sample size.

Model	Tier	Predictions	Settled	Winner hits	Points	Accuracy
Qwen3.5 Flash	wildcard	13	1	1	3	100%
Claude Opus 4.7	flagship	14	1	1	3	100%
Claude Sonnet 4.6	flagship	14	1	1	3	100%
GPT-5.4	flagship	15	2	1	3	50%
Gemini 3.1 Pro	flagship	15	2	1	3	50%
DeepSeek V4 Pro	value	15	2	1	3	50%
Qwen 3.7 Plus	value	14	2	1	3	50%
Kimi K2.6	value	14	2	1	3	50%
Gemini 2.5 Flash	value	14	2	1	3	50%
Grok 4.1 Fast Reasoning	wildcard	14	2	1	3	50%
DeepSeek V4 Flash	wildcard	14	2	1	3	50%
GPT-5 Nano	wildcard	13	2	1	3	50%
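The summary figures can be re-derived from these rows. The sketch below copies the settled, winner-hit, and points columns from the table and recomputes the totals. It assumes the 62.5% figure is a mean of per-model accuracies, which is the reading that matches the numbers; pooling all hits over all settled entries gives a lower figure.

```python
# (model, settled, winner_hits, points) copied from the leaderboard table.
rows = [
    ("Qwen3.5 Flash", 1, 1, 3),
    ("Claude Opus 4.7", 1, 1, 3),
    ("Claude Sonnet 4.6", 1, 1, 3),
    ("GPT-5.4", 2, 1, 3),
    ("Gemini 3.1 Pro", 2, 1, 3),
    ("DeepSeek V4 Pro", 2, 1, 3),
    ("Qwen 3.7 Plus", 2, 1, 3),
    ("Kimi K2.6", 2, 1, 3),
    ("Gemini 2.5 Flash", 2, 1, 3),
    ("Grok 4.1 Fast Reasoning", 2, 1, 3),
    ("DeepSeek V4 Flash", 2, 1, 3),
    ("GPT-5 Nano", 2, 1, 3),
]

settled = sum(r[1] for r in rows)
hits = sum(r[2] for r in rows)
points = sum(r[3] for r in rows)

# Mean of per-model accuracies (each model weighted equally).
avg_accuracy = sum(r[2] / r[1] for r in rows) / len(rows)
# Pooled accuracy (each settled prediction weighted equally).
pooled_accuracy = hits / settled

print(settled, hits, points)     # 21 12 36
print(f"{avg_accuracy:.1%}")     # 62.5%
print(f"{pooled_accuracy:.1%}")  # 57.1%
```

The gap between 62.5% and 57.1% comes entirely from the three models that have only one settled entry each.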

My read: the leaderboard is not mature enough to crown a winner.
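One way to put a number on "not mature enough" is a confidence interval around each accuracy figure. The sketch below uses the Wilson score interval, which is my choice for illustration and not something the dashboard reports.

```python
import math


def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n == 0:
        return (0.0, 1.0)
    p = hits / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


for label, hits, n in [("1 of 1", 1, 1), ("1 of 2", 1, 2)]:
    low, high = wilson_interval(hits, n)
    print(f"{label}: {low:.2f} to {high:.2f}")
# 1 of 1: 0.21 to 1.00
# 1 of 2: 0.09 to 0.91
```

The two intervals overlap almost completely, so a "100%" row and a "50%" row are statistically indistinguishable at this sample size.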

The first useful signal is elsewhere.

The obvious match: everyone got Colombia right

Uzbekistan vs Colombia ended 1-3.

All 12 models picked Colombia.

None got the exact score.

Model	Prediction	Final	Winner hit
Claude Opus 4.7	0-2 Colombia	1-3 Colombia	Yes
Claude Sonnet 4.6	1-2 Colombia	1-3 Colombia	Yes
GPT-5.4	1-2 Colombia	1-3 Colombia	Yes
Gemini 3.1 Pro	0-2 Colombia	1-3 Colombia	Yes
DeepSeek V4 Pro	0-2 Colombia	1-3 Colombia	Yes
Qwen 3.7 Plus	0-2 Colombia	1-3 Colombia	Yes
Kimi K2.6	0-2 Colombia	1-3 Colombia	Yes
Gemini 2.5 Flash	0-2 Colombia	1-3 Colombia	Yes
Grok 4.1 Fast Reasoning	0-2 Colombia	1-3 Colombia	Yes
DeepSeek V4 Flash	0-2 Colombia	1-3 Colombia	Yes
GPT-5 Nano	0-1 Colombia	1-3 Colombia	Yes
Qwen3.5 Flash	0-1 Colombia	1-3 Colombia	Yes

This is the kind of match where a cheap model can be enough.

If all you need is “which side is more likely,” then polling cheap models may beat paying a flagship model for every pick.
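A rough sketch of what "polling cheap models" could look like: convert each predicted scoreline to a three-way outcome and take the majority. The function names and the `team1` / `team2` / `draw` labels are my own; the scorelines are the Uzbekistan vs Colombia picks from the table above, in listed team order.

```python
from collections import Counter


def outcome(team1_goals, team2_goals):
    """Map a scoreline to a three-way outcome label."""
    if team1_goals > team2_goals:
        return "team1"
    if team1_goals < team2_goals:
        return "team2"
    return "draw"


def poll(predicted_scores):
    """Return the most common outcome and the share of models backing it.

    Ties are broken by first-seen order, which is arbitrary.
    """
    votes = Counter(outcome(a, b) for a, b in predicted_scores.values())
    label, count = votes.most_common(1)[0]
    return label, count / len(predicted_scores)


# Uzbekistan (team1) vs Colombia (team2): three low-cost models' picks.
cheap_picks = {
    "Qwen3.5 Flash": (0, 1),
    "GPT-5 Nano": (0, 1),
    "DeepSeek V4 Flash": (0, 2),
}
print(poll(cheap_picks))  # ('team2', 1.0)
```

On this match the three-model poll lands on the same side as all twelve models did.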

The useful miss: every valid model missed Portugal-Congo DR

Portugal vs Congo DR ended 1-1.

Every valid pre-match model picked Portugal.

Model	Prediction	Final	Outcome
GPT-5.4	2-0 Portugal	1-1	Miss
Gemini 3.1 Pro	2-0 Portugal	1-1	Miss
DeepSeek V4 Pro	2-0 Portugal	1-1	Miss
Qwen 3.7 Plus	2-0 Portugal	1-1	Miss
Kimi K2.6	2-0 Portugal	1-1	Miss
Gemini 2.5 Flash	2-0 Portugal	1-1	Miss
Grok 4.1 Fast Reasoning	3-0 Portugal	1-1	Miss
DeepSeek V4 Flash	2-0 Portugal	1-1	Miss
GPT-5 Nano	2-1 Portugal	1-1	Miss

That is the part I care about.

The models did not just get unlucky independently. They shared the same prior: Portugal strong, Congo DR weaker, therefore Portugal win.

That is a classic LLM failure mode.

It shows up outside sports too:

“OpenAI usually ships X, so the next release will be X”
“Claude is the premium model, so it must win this task”
“The famous team/vendor/person is probably the right answer”
“Historical quality beats current uncertainty”

In other words, the World Cup is a cute interface for a serious eval problem: models are often too willing to convert reputation into certainty.

The cost angle

The dashboard includes listed price tiers for each model.

Here is the funny part: the cheapest model currently has the cleanest-looking row.

Model	Listed input / output price (USD per 1M tokens)	Current result
Qwen3.5 Flash	$0.026 / $0.263	1/1 winner hit
GPT-5 Nano	$0.049 / $0.388	1/2 winner hit
Claude Opus 4.7	$5 / $25	1/1 winner hit
GPT-5.4	$2.45 / $14.7	1/2 winner hit

Do not overread that. One match is not proof.

But the unit economics are hard to ignore.

Suppose a prediction prompt uses 10K input tokens and 1K output tokens.

Approximate cost:

Qwen3.5 Flash:
10K * $0.026 / 1M + 1K * $0.263 / 1M = $0.000523

Claude Opus 4.7:
10K * $5 / 1M + 1K * $25 / 1M = $0.075

That is roughly a 143x spread for one prediction-shaped call.
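The same arithmetic as a small helper, using the listed prices and the 10K-input / 1K-output assumption from above. The function name is mine.

```python
def call_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Dollar cost of one call, given prices quoted per 1M tokens."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


qwen = call_cost(10_000, 1_000, 0.026, 0.263)
opus = call_cost(10_000, 1_000, 5, 25)

print(f"${qwen:.6f}")         # $0.000523
print(f"${opus:.3f}")         # $0.075
print(f"{opus / qwen:.0f}x")  # 143x
```

The ratio shifts with the input/output mix, so a prompt with far more output tokens would give a different spread.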

If I were building a prediction system, I would not send every match to the most expensive model. I would route it.

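def pick_prediction_route(match_uncertainty, model_disagreement, budget_mode):
    """Return the list of model IDs to query for one match."""
    # Budget override: poll a small panel of low-cost models, whatever the match.
    if budget_mode == "cheap_poll":
        return ["qwen3.5-flash", "gpt-5-nano", "deepseek-v4-flash"]

    # Clear favorite and the models agree: one cheap call is enough.
    if match_uncertainty == "low" and model_disagreement == "low":
        return ["qwen3.5-flash"]

    # Uncertain match or split models: widen the panel with value and flagship models.
    if match_uncertainty == "high" or model_disagreement == "high":
        return [
            "qwen3.5-flash",
            "deepseek-v4-pro",
            "gemini-3.1-pro",
            "claude-sonnet-4.6",
        ]

    # Everything in between: one cheap model plus one flagship cross-check.
    return ["qwen3.5-flash", "claude-sonnet-4.6"]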

Cheap models for breadth. Expensive models for disagreement.

That is the same routing logic I use for normal API workloads.
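The routing function takes a disagreement label as input. One hedged way to derive that label is from a cheap first-pass poll: measure how dominant the top outcome is. The function name and the 0.67 cutoff below are placeholders of mine, not tuned values.

```python
from collections import Counter


def disagreement_level(outcome_votes, agreement_cutoff=0.67):
    """Label a poll 'low' when one outcome dominates, else 'high'.

    agreement_cutoff is an arbitrary placeholder, not a tuned value.
    """
    top_count = Counter(outcome_votes).most_common(1)[0][1]
    top_share = top_count / len(outcome_votes)
    return "low" if top_share >= agreement_cutoff else "high"


print(disagreement_level(["team2", "team2", "team2"]))  # low
print(disagreement_level(["team1", "draw", "team2"]))   # high
```

This signal has a blind spot: Portugal vs Congo DR was unanimous and still wrong, so agreement alone would not have escalated that match. An independent uncertainty input, such as a betting-market or Elo baseline, would have to cover that case.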

What I would measure next

Winner accuracy is not enough.

I want these metrics:

Metric	Why it matters
Winner accuracy	Basic direction
Exact score	Hard mode
Goal difference	More informative than exact score alone
Brier score	Calibration
Confidence bucket accuracy	Overconfidence detection
Cost per correct winner	Production routing
Draw recall	Favorite-bias detector
Disagreement value	Whether ensembles help
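For the calibration row, here is a minimal sketch of a multi-class Brier score over the three outcomes. It assumes the models are asked for win/draw/loss probabilities, which the tables in this post do not show (they list scorelines only), so both forecasts below are invented for illustration.

```python
def brier_score(probs, actual):
    """Multi-class Brier score: squared error summed over outcomes.

    0 is a perfect forecast; 2 is the worst possible.
    """
    return sum(
        (p - (1.0 if label == actual else 0.0)) ** 2
        for label, p in probs.items()
    )


# Hypothetical forecasts for a match that ended in a draw.
confident = {"team1": 0.80, "draw": 0.15, "team2": 0.05}
hedged = {"team1": 0.50, "draw": 0.30, "team2": 0.20}

print(round(brier_score(confident, "draw"), 4))  # 1.365
print(round(brier_score(hedged, "draw"), 4))     # 0.78
```

Both forecasts favor the same side, but the hedged one is penalized far less when the draw happens, which is exactly the difference winner accuracy cannot see.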

The biggest one is draw recall.

Portugal-Congo DR already suggests the models may underpredict draws when a prestigious team is involved.

If that pattern holds, it is more important than the leaderboard.
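A sketch of how draw recall could be computed from settled records. The record layout and labels are my own; the counts come from the two settled matches above (nine Portugal picks against a 1-1 final, twelve Colombia picks against a 1-3 final).

```python
def draw_recall(records):
    """Share of actual draws that were predicted as draws; None if no draws yet."""
    actual_draws = [r for r in records if r["actual"] == "draw"]
    if not actual_draws:
        return None
    caught = sum(1 for r in actual_draws if r["predicted"] == "draw")
    return caught / len(actual_draws)


# Portugal (team1) vs Congo DR, final 1-1: nine pre-match picks, all Portugal.
portugal_congo = [{"predicted": "team1", "actual": "draw"} for _ in range(9)]
# Uzbekistan vs Colombia (team2), final 1-3: twelve picks, all Colombia.
uzbekistan_colombia = [{"predicted": "team2", "actual": "team2"} for _ in range(12)]

records = portugal_congo + uzbekistan_colombia
draw_rate = sum(r["predicted"] == "draw" for r in records) / len(records)

print(draw_recall(records))  # 0.0
print(draw_rate)             # 0.0
```

Across the 21 settled entries, no model predicted a draw at all, while one of the two settled matches ended in one. Two matches is far too few to call it a trend, but it is the number worth watching.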

What I’d do if I were tracking this live

I would not declare a winner until each model has at least 30-50 settled pre-match predictions.

For now:

Track every match.
Exclude post-match reviews from accuracy.
Compare cheap vs flagship models by cost per correct winner.
Watch draw prediction rate.
Add a baseline from betting markets or Elo.
Update after each matchday.
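For the cost comparison in that list, a rough sketch of cost per correct winner. It reuses the listed prices, the settled and hit counts from the tables above, and the same assumed 10K-input / 1K-output prompt; the function name and the choice to count only settled forecasts are mine.

```python
PROMPT_TOKENS = (10_000, 1_000)  # assumed input/output tokens per forecast

# model: (input $/1M, output $/1M, settled, winner hits), from the tables above
rows = {
    "Qwen3.5 Flash": (0.026, 0.263, 1, 1),
    "GPT-5 Nano": (0.049, 0.388, 2, 1),
    "Claude Opus 4.7": (5, 25, 1, 1),
    "GPT-5.4": (2.45, 14.7, 2, 1),
}


def cost_per_correct_winner(in_price, out_price, settled, hits, tokens=PROMPT_TOKENS):
    """Spend on settled forecasts divided by correct-winner hits."""
    in_tok, out_tok = tokens
    per_call = (in_tok * in_price + out_tok * out_price) / 1_000_000
    if hits == 0:
        return float("inf")  # no correct winners yet
    return per_call * settled / hits


for model, row in rows.items():
    print(f"{model}: ${cost_per_correct_winner(*row):.6f}")
# Qwen3.5 Flash: $0.000523
# GPT-5 Nano: $0.001756
# Claude Opus 4.7: $0.075000
# GPT-5.4: $0.078400
```

With one or two settled entries per model these figures mostly restate the price list, so they only become meaningful once the sample grows.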

If you want the full data-cited writeup and live links, I wrote the original breakdown here: AI World Cup Predictions 2026: 12 Models, Early Leaderboard.

Disclosure: I work on the research side at TokenMix, which is why I can wire this kind of multi-model scoreboard quickly.

Bottom line

The early World Cup AI leaderboard does not tell us which model is best yet.

It does tell us something useful: cheap models can match flagship consensus on obvious favorites, and all models can share the same bad prior on a draw.

That is a model-evaluation lesson, not betting advice.

If you were scoring this, would you reward exact score heavily, or focus on calibrated probabilities instead?
