I Let 12 AI Models Predict the World Cup. The First 169 Picks Already Show a Pattern.

I put 12 AI models into a public World Cup prediction arena.

Not because I think anyone should use LLMs for betting. They should not. The page says entertainment only for a reason.

I did it because sports prediction is a surprisingly clean stress test for models:

  • structured facts
  • stale priors
  • uncertainty
  • calibration
  • price-performance
  • and the most painful thing for LLMs: admitting a favorite might draw

After 169 predictions and 21 settled scoring entries, the leaderboard is technically tied.

But the misses are already more useful than the winners.

TL;DR

  • No, there is no “best World Cup AI model” yet. The sample is too small.
  • 12 models are currently tied on 3 points.
  • Qwen3.5 Flash, Claude Opus 4.7, and Claude Sonnet 4.6 show 100% winner accuracy, but only on one settled pre-match prediction each.
  • All 12 models got Colombia over Uzbekistan directionally right.
  • Nine valid pre-match models all missed Portugal 1-1 Congo DR because they picked Portugal.
  • The early lesson is not “flagship models win.” It is “favorite bias is real, and cheap models are good enough to poll at scale.”

Full live scoreboard: WorldCup AI Arena

What I actually tracked

The public dashboard tracks model forecasts, match results, team context, and prediction accuracy.

Snapshot used here: 2026-06-18 05:53 UTC.

| Metric | Value |
| --- | --- |
| Models tracked | 12 |
| Total predictions | 169 |
| Settled scoring entries | 21 |
| Total leaderboard points | 36 |
| Exact score hits | 0 |
| Correct-winner hits | 12 |
| Average winner accuracy | 62.5% |
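The 62.5% figure appears to be the mean of the per-model accuracy rates, not hits divided by settled entries. Here is a small sketch that recomputes the snapshot totals from the leaderboard rows in this post and shows both versions. The tuple layout and variable names are mine, not the dashboard's.

```python
# (model, predictions, settled, winner_hits), copied from the leaderboard in this post.
ROWS = [
    ("Qwen3.5 Flash", 13, 1, 1),
    ("Claude Opus 4.7", 14, 1, 1),
    ("Claude Sonnet 4.6", 14, 1, 1),
    ("GPT-5.4", 15, 2, 1),
    ("Gemini 3.1 Pro", 15, 2, 1),
    ("DeepSeek V4 Pro", 15, 2, 1),
    ("Qwen 3.7 Plus", 14, 2, 1),
    ("Kimi K2.6", 14, 2, 1),
    ("Gemini 2.5 Flash", 14, 2, 1),
    ("Grok 4.1 Fast Reasoning", 14, 2, 1),
    ("DeepSeek V4 Flash", 14, 2, 1),
    ("GPT-5 Nano", 13, 2, 1),
]

total_predictions = sum(row[1] for row in ROWS)
total_settled = sum(row[2] for row in ROWS)
total_hits = sum(row[3] for row in ROWS)

# Mean of per-model accuracies: every model counts equally.
per_model_mean = sum(row[3] / row[2] for row in ROWS) / len(ROWS)
# Pooled accuracy: every settled prediction counts equally.
pooled = total_hits / total_settled

print(total_predictions, total_settled, total_hits)
print(f"per-model mean: {per_model_mean:.1%}, pooled: {pooled:.1%}")
```

The per-model mean comes out at 62.5% and the pooled rate at about 57.1%. The gap exists because the three models with a single settled entry each contribute a 100% rate to the mean.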

The model list includes Claude, GPT, Gemini, DeepSeek, Qwen, Kimi, and Grok variants.

Important caveat: I count pre-match predictions only for accuracy. Post-match reviews are useful for explanation, but they know the result. They are not forecasts.

The current leaderboard

Every model has 3 points right now.

That sounds boring until you look at the sample size.

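| Model | Tier | Predictions | Settled | Winner hits | Points | Accuracy |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen3.5 Flash | wildcard | 13 | 1 | 1 | 3 | 100% |
| Claude Opus 4.7 | flagship | 14 | 1 | 1 | 3 | 100% |
| Claude Sonnet 4.6 | flagship | 14 | 1 | 1 | 3 | 100% |
| GPT-5.4 | flagship | 15 | 2 | 1 | 3 | 50% |
| Gemini 3.1 Pro | flagship | 15 | 2 | 1 | 3 | 50% |
| DeepSeek V4 Pro | value | 15 | 2 | 1 | 3 | 50% |
| Qwen 3.7 Plus | value | 14 | 2 | 1 | 3 | 50% |
| Kimi K2.6 | value | 14 | 2 | 1 | 3 | 50% |
| Gemini 2.5 Flash | value | 14 | 2 | 1 | 3 | 50% |
| Grok 4.1 Fast Reasoning | wildcard | 14 | 2 | 1 | 3 | 50% |
| DeepSeek V4 Flash | wildcard | 14 | 2 | 1 | 3 | 50% |
| GPT-5 Nano | wildcard | 13 | 2 | 1 | 3 | 50% |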

My read: the leaderboard is not mature enough to crown a winner.
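To put a number on "not mature enough," here is a minimal sketch of a Wilson score interval for a hit rate. The function name is mine, the 95% level (z = 1.96) is a conventional choice rather than anything the dashboard uses, and the "20 of 40" row is a hypothetical later sample, not real data.

```python
import math

def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n == 0:
        return (0.0, 1.0)  # no data: the rate could be anything
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))

# First two rows match the current leaderboard; the third is hypothetical.
for label, hits, n in [("1 of 1", 1, 1), ("1 of 2", 1, 2), ("20 of 40", 20, 40)]:
    low, high = wilson_interval(hits, n)
    print(f"{label}: {low:.2f} to {high:.2f}")
```

A 1-of-1 record is compatible with a true hit rate anywhere from about 0.21 to 1.00, and 1-of-2 with about 0.09 to 0.91. Those ranges overlap almost completely, which is why the 100% and 50% rows say nothing about which model is better yet.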

The first useful signal is elsewhere.

The obvious match: everyone got Colombia right

Uzbekistan vs Colombia ended 1-3.

All 12 models picked Colombia.

None got the exact score.

| Model | Prediction | Final | Winner hit |
| --- | --- | --- | --- |
| Claude Opus 4.7 | 0-2 Colombia | 1-3 Colombia | Yes |
| Claude Sonnet 4.6 | 1-2 Colombia | 1-3 Colombia | Yes |
| GPT-5.4 | 1-2 Colombia | 1-3 Colombia | Yes |
| Gemini 3.1 Pro | 0-2 Colombia | 1-3 Colombia | Yes |
| DeepSeek V4 Pro | 0-2 Colombia | 1-3 Colombia | Yes |
| Qwen 3.7 Plus | 0-2 Colombia | 1-3 Colombia | Yes |
| Kimi K2.6 | 0-2 Colombia | 1-3 Colombia | Yes |
| Gemini 2.5 Flash | 0-2 Colombia | 1-3 Colombia | Yes |
| Grok 4.1 Fast Reasoning | 0-2 Colombia | 1-3 Colombia | Yes |
| DeepSeek V4 Flash | 0-2 Colombia | 1-3 Colombia | Yes |
| GPT-5 Nano | 0-1 Colombia | 1-3 Colombia | Yes |
| Qwen3.5 Flash | 0-1 Colombia | 1-3 Colombia | Yes |

This is the kind of match where a cheap model can be enough.

If all you need is “which side is more likely,” then polling a few cheap models may be better value than paying a flagship model for every pick.

The useful miss: every valid model missed Portugal-Congo DR

Portugal vs Congo DR ended 1-1.

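Every valid pre-match model picked Portugal. Only nine of the 12 models had a valid pre-match pick for this match, which is why Qwen3.5 Flash, Claude Opus 4.7, and Claude Sonnet 4.6 show one settled entry instead of two.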

| Model | Prediction | Final | Outcome |
| --- | --- | --- | --- |
| GPT-5.4 | 2-0 Portugal | 1-1 | Miss |
| Gemini 3.1 Pro | 2-0 Portugal | 1-1 | Miss |
| DeepSeek V4 Pro | 2-0 Portugal | 1-1 | Miss |
| Qwen 3.7 Plus | 2-0 Portugal | 1-1 | Miss |
| Kimi K2.6 | 2-0 Portugal | 1-1 | Miss |
| Gemini 2.5 Flash | 2-0 Portugal | 1-1 | Miss |
| Grok 4.1 Fast Reasoning | 3-0 Portugal | 1-1 | Miss |
| DeepSeek V4 Flash | 2-0 Portugal | 1-1 | Miss |
| GPT-5 Nano | 2-1 Portugal | 1-1 | Miss |

That is the part I care about.

The models did not just get unlucky independently. They shared the same prior: Portugal strong, Congo DR weaker, therefore Portugal win.

That is a classic LLM failure mode.

It shows up outside sports too:

  • “OpenAI usually ships X, so the next release will be X”
  • “Claude is the premium model, so it must win this task”
  • “The famous team/vendor/person is probably the right answer”
  • “Historical quality beats current uncertainty”

In other words, the World Cup is a cute interface for a serious eval problem: models are often too willing to convert reputation into certainty.
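To make the shared-prior point concrete, here is a rough sketch that tallies the Portugal vs Congo DR picks from the table above. The dictionary and the `consensus` helper are illustrative names of mine, not part of the dashboard.

```python
from collections import Counter

# Pre-match winner picks for Portugal vs Congo DR, from the table in this post.
PORTUGAL_PICKS = {
    "GPT-5.4": "Portugal",
    "Gemini 3.1 Pro": "Portugal",
    "DeepSeek V4 Pro": "Portugal",
    "Qwen 3.7 Plus": "Portugal",
    "Kimi K2.6": "Portugal",
    "Gemini 2.5 Flash": "Portugal",
    "Grok 4.1 Fast Reasoning": "Portugal",
    "DeepSeek V4 Flash": "Portugal",
    "GPT-5 Nano": "Portugal",
}
ACTUAL = "draw"  # the match ended 1-1

def consensus(picks):
    """Return (most common pick, share of models that made it)."""
    counts = Counter(picks.values())
    top_pick, top_count = counts.most_common(1)[0]
    return top_pick, top_count / len(picks)

top_pick, share = consensus(PORTUGAL_PICKS)
hits = sum(1 for pick in PORTUGAL_PICKS.values() if pick == ACTUAL)
print(f"consensus: {top_pick} ({share:.0%}), individual hits: {hits}/{len(PORTUGAL_PICKS)}")
```

When the consensus share is 100% and it is wrong, a majority vote cannot rescue anything. An ensemble only adds value when its members make different mistakes.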

The cost angle

The dashboard includes listed price tiers for each model.

Here is the funny part: the cheapest model currently has the cleanest-looking row.

| Model | Listed input / output price | Current result |
| --- | --- | --- |
| Qwen3.5 Flash | $0.026 / $0.263 per 1M tokens | 1/1 winner hit |
| GPT-5 Nano | $0.049 / $0.388 per 1M tokens | 1/2 winner hit |
| Claude Opus 4.7 | $5 / $25 per 1M tokens | 1/1 winner hit |
| GPT-5.4 | $2.45 / $14.7 per 1M tokens | 1/2 winner hit |

Do not overread that. One match is not proof.

But the unit economics are hard to ignore.

Suppose a prediction prompt uses 10K input tokens and 1K output tokens.

Approximate cost:

Qwen3.5 Flash:
10K * $0.026 / 1M + 1K * $0.263 / 1M = $0.000523

Claude Opus 4.7:
10K * $5 / 1M + 1K * $25 / 1M = $0.075

That is roughly a 143x spread for one prediction-shaped call.
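The same arithmetic as a small helper, in case you want to plug in other models or prompt sizes. The function name is mine, and the 10K/1K token counts are the assumed prompt shape from above, not measured usage.

```python
def call_cost(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """Cost in USD for one call, given listed prices per 1M tokens."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000

# Assumed prompt shape: 10K input tokens, 1K output tokens.
qwen_flash = call_cost(10_000, 1_000, 0.026, 0.263)
claude_opus = call_cost(10_000, 1_000, 5.0, 25.0)

print(f"Qwen3.5 Flash: ${qwen_flash:.6f}")
print(f"Claude Opus 4.7: ${claude_opus:.6f}")
print(f"spread: {claude_opus / qwen_flash:.0f}x")
```

The spread depends on the input/output mix, so a longer context or a longer answer will move it.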

If I were building a prediction system, I would not send every match to the most expensive model. I would route it.

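def pick_prediction_route(match_uncertainty, model_disagreement, budget_mode):
    """Return the list of model IDs to query for one match."""
    # Budget mode overrides everything: poll three cheap models and stop.
    if budget_mode == "cheap_poll":
        return ["qwen3.5-flash", "gpt-5-nano", "deepseek-v4-flash"]

    # Clear favorite and the models agree: one cheap model is enough.
    if match_uncertainty == "low" and model_disagreement == "low":
        return ["qwen3.5-flash"]

    # Uncertain match or split opinions: escalate to a mixed-tier panel.
    if match_uncertainty == "high" or model_disagreement == "high":
        return [
            "qwen3.5-flash",
            "deepseek-v4-pro",
            "gemini-3.1-pro",
            "claude-sonnet-4.6",
        ]

    # Anything in between: one cheap model plus one flagship as a check.
    return ["qwen3.5-flash", "claude-sonnet-4.6"]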

Cheap models for breadth. Expensive models for disagreement.

That is the same routing logic I use for normal API workloads.
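One open question is where a "low" or "high" disagreement label comes from. A minimal sketch, assuming it is derived from a first-pass poll of cheap models: the helper name and the 0.75 threshold are hypothetical choices of mine, not something the dashboard defines.

```python
from collections import Counter

def disagreement_level(picks, threshold=0.75):
    """Map a list of outcome picks to "low" or "high" disagreement.

    threshold is a hypothetical cutoff: if the most common pick holds at
    least this share of the votes, disagreement is treated as low.
    """
    if not picks:
        return "high"  # no information: treat as maximally uncertain
    top_count = Counter(picks).most_common(1)[0][1]
    return "low" if top_count / len(picks) >= threshold else "high"

print(disagreement_level(["Colombia", "Colombia", "Colombia"]))  # unanimous poll
print(disagreement_level(["Portugal", "draw", "Portugal"]))      # split poll
```

Note the limit of this signal: a unanimous Portugal poll would have been labeled "low" and still missed the draw. Disagreement cannot detect a prior that every model shares, so an uncertainty input needs to come from somewhere other than the models themselves.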

What I would measure next

Winner accuracy is not enough.

I want these metrics:

| Metric | Why it matters |
| --- | --- |
| Winner accuracy | Basic direction |
| Exact score | Hard mode |
| Goal difference | More informative than exact score alone |
| Brier score | Calibration |
| Confidence bucket accuracy | Overconfidence detection |
| Cost per correct winner | Production routing |
| Draw recall | Favorite-bias detector |
| Disagreement value | Whether ensembles help |
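For the calibration row, here is a sketch of a three-outcome Brier score. It assumes each model is asked for home/draw/away probabilities, which the current picks do not include, and both forecasts below are made-up numbers for illustration.

```python
OUTCOMES = ("home", "draw", "away")

def brier_score(probs, actual):
    """Multiclass Brier score: squared error against the one-hot result.

    0.0 is a perfect forecast; 2.0 is the worst possible.
    """
    if abs(sum(probs[o] for o in OUTCOMES) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum((probs[o] - (1.0 if o == actual else 0.0)) ** 2 for o in OUTCOMES)

# Hypothetical forecasts for a match that ends in a draw.
confident_favorite = {"home": 0.70, "draw": 0.20, "away": 0.10}
hedged_favorite = {"home": 0.45, "draw": 0.30, "away": 0.25}

print(round(brier_score(confident_favorite, "draw"), 3))
print(round(brier_score(hedged_favorite, "draw"), 3))
```

Both forecasts favor the home side and both would count as a miss on winner accuracy, but the hedged one scores 0.755 against 1.14 for the confident one. That difference is exactly what winner accuracy throws away.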

The biggest one is draw recall.

Portugal-Congo DR already suggests the models may underpredict draws when a prestigious team is involved.

If that pattern holds, it is more important than the leaderboard.
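A minimal sketch of how draw recall and draw prediction rate could be computed from settled picks. The record format is an assumption of mine; the data is the 21 settled pre-match picks described in this post.

```python
def draw_metrics(records):
    """records: iterable of (predicted_outcome, actual_outcome) pairs.

    Returns (draw_recall, draw_prediction_rate). draw_recall is None when
    no settled match ended in a draw.
    """
    records = list(records)
    if not records:
        return None, None
    picks_on_actual_draws = [pred for pred, actual in records if actual == "draw"]
    predicted_draws = sum(1 for pred, _ in records if pred == "draw")
    if picks_on_actual_draws:
        caught = sum(1 for pred in picks_on_actual_draws if pred == "draw")
        recall = caught / len(picks_on_actual_draws)
    else:
        recall = None
    return recall, predicted_draws / len(records)

# 12 Colombia picks on a Colombia win, 9 Portugal picks on a 1-1 draw.
settled = [("Colombia", "Colombia")] * 12 + [("Portugal", "draw")] * 9
recall, rate = draw_metrics(settled)
print(recall, rate)
```

On the current data both numbers are zero: no model has predicted a draw in any settled match, and the one draw so far was missed by all nine models that picked it. One draw is far too few to call it a trend, but it is the number to watch.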

What I’d do if I were tracking this live

I would not declare a winner until each model has at least 30-50 settled pre-match predictions.

For now:

  • Track every match.
  • Exclude post-match reviews from accuracy.
  • Compare cheap vs flagship models by cost per correct winner.
  • Watch draw prediction rate.
  • Add a baseline from betting markets or Elo.
  • Update after each matchday.

If you want the full data-cited writeup and live links, I wrote the original breakdown here: AI World Cup Predictions 2026: 12 Models, Early Leaderboard.

Disclosure: I work on the research side at TokenMix, which is why I can wire this kind of multi-model scoreboard quickly.

Bottom line

The early World Cup AI leaderboard does not tell us which model is best yet.

It does tell us something useful: cheap models can match flagship consensus on obvious favorites, and all models can share the same bad prior on a draw.

That is a model-evaluation lesson, not betting advice.

If you were scoring this, would you reward exact score heavily, or focus on calibrated probabilities instead?
