LLMs for Classification: One Example is All You Need

The first few-shot example delivers a +55.9% gain on average. The next 49? Just about +28% more.

[Figure: One-Shot Impact]

The Counter-Intuitive Discovery

After running 35 benchmark configurations across 5 datasets with rigorous Monte Carlo cross-validation (30 iterations each), we discovered something that challenges conventional prompt engineering wisdom:

The first example you give a Large Language Model improves performance by +55.9% on average. The second? Just +3.1%.

This finding has massive implications for how we design GenAI pipelines — and how much we spend on them.

The Data: What We Tested

We benchmarked GPT-4.1-nano across multiple tasks using publicly available datasets:

| Dataset | Task Type | Classes | Zero-Shot F1 |
|---|---|---|---|
| SMS Spam | Binary classification | 2 | 0.820 |
| Emotion | Multi-class | 6 | 0.344 |
| Twitter Sentiment | Sentiment | 3 | 0.488 |
| DBpedia | Multi-class | 14 | 0.321 |
| Keyword Extraction (Inspec) | Extraction | — | 0.155 |

Each configuration was tested with 0, 1, 3, 5, 10, 20, and 50 few-shot examples.

All datasets are open-source and available on Hugging Face Datasets.
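
As a point of reference, here is a minimal sketch of what evaluating one configuration looks like. The `classify_fn` wrapper around the model call and the way examples are sampled are illustrative assumptions, not our exact harness:

```python
import random
from sklearn.metrics import f1_score

# Shot counts tested for every dataset.
SHOT_COUNTS = [0, 1, 3, 5, 10, 20, 50]

def run_configuration(train_set, test_set, k, classify_fn, rng):
    """Evaluate one few-shot configuration and return its macro F1.

    `classify_fn(text, examples)` is a hypothetical wrapper around the
    GPT-4.1-nano call: it prepends `examples` (a list of (text, label)
    pairs) to the prompt and returns the predicted label as a string.
    """
    examples = rng.sample(train_set, k) if k > 0 else []
    y_true = [label for _, label in test_set]
    y_pred = [classify_fn(text, examples) for text, _ in test_set]
    return f1_score(y_true, y_pred, average="macro")
```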

The Diminishing Returns Curve

[Figure: Learning Curves]

The pattern is striking across all datasets:

| Transition | Average Gain | What It Means |
|---|---|---|
| 0 → 1 shot | +55.9% | Massive improvement |
| 1 → 3 shots | +3.1% | Diminishing returns begin |
| 3 → 5 shots | +2.8% | Marginal gains |
| 5 → 10 shots | +1.3% | Near plateau |
| 10 → 50 shots | +8.8%* | Only significant for multi-class |

*Inflated by DBpedia (14 classes), which keeps improving all the way to 50 examples.

Why Does This Happen?

The first example teaches the model three critical things:

  1. Output format — How to structure the response
  2. Task semantics — What “classify” or “extract” means in this context
  3. Label space — What categories or outputs are expected

Once these are established, additional examples provide marginal refinement.
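
To make those three points concrete, here is what a single worked example adds to a chat-style prompt. The wording and the SMS Spam labels are illustrative, not the exact prompt we benchmarked:

```python
LABELS = ["spam", "ham"]  # (3) label space: the only outputs allowed

messages = [
    # (2) task semantics: what "classify" means for this dataset
    {"role": "system",
     "content": "Classify the SMS message as exactly one of: "
                + ", ".join(LABELS) + ". Answer with the label only."},
    # (1) output format: the worked example shows the expected shape
    {"role": "user", "content": "SMS: WIN a free cruise! Reply YES now"},
    {"role": "assistant", "content": "spam"},
    # The message to classify follows the same pattern.
    {"role": "user", "content": "SMS: Running late, see you at 7"},
]
```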

Statistical Proof: This Is Not Random

We used Welch’s ANOVA to verify these findings weren’t due to chance:

| Dataset | F-statistic | p-value | Significant? |
|---|---|---|---|
| SMS Spam | 76.75 | <0.0001 | YES |
| Emotion | 73.89 | <0.0001 | YES |
| Twitter Sentiment | 92.57 | <0.0001 | YES |
| DBpedia | 439.88 | <0.0001 | YES |
| Keyword Extraction | 160.30 | <0.0001 | YES |

The effect is statistically significant for ALL datasets. DBpedia shows the strongest effect (F=439.88), reflecting its large improvement range from zero-shot to 50-shot.
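
For readers who want to reproduce the test, here is a sketch using the `pingouin` library's `welch_anova`, assuming the per-iteration F1 scores are collected in a long-format DataFrame (the numbers below are placeholders, not our results):

```python
import pandas as pd
import pingouin as pg

# One row per (shot count, Monte Carlo iteration); placeholder scores.
results = pd.DataFrame({
    "shots": [0, 0, 0, 1, 1, 1, 3, 3, 3],
    "f1":    [0.80, 0.82, 0.84, 0.93, 0.95, 0.94, 0.95, 0.96, 0.94],
})

# Welch's ANOVA does not assume equal variances across shot-count groups.
aov = pg.welch_anova(data=results, dv="f1", between="shots")
print(aov[["F", "p-unc"]])
```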

Pairwise Comparisons (Bonferroni-corrected)

| Comparison | Cohen's d | Interpretation |
|---|---|---|
| 0 vs 1 | -4.3 | Large (always significant) |
| 1 vs 3 | -0.5 | Medium (rarely significant) |
| 3 vs 5 | -0.2 | Negligible |
| 5 vs 10 | -0.1 | Negligible |

Key insight: While the overall effect is significant, pairwise comparisons confirm that differences between adjacent levels (3 vs 5, 5 vs 10) are not statistically significant after correction.
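
The post-hoc comparisons can be sketched with the same library: `pingouin.pairwise_tests` applies the Bonferroni correction and reports Cohen's d directly (reusing the placeholder `results` frame and the `pg` import from the previous snippet):

```python
# Bonferroni-corrected pairwise comparisons between shot counts,
# with Cohen's d as the effect size (negative d favours more shots).
posthoc = pg.pairwise_tests(
    data=results, dv="f1", between="shots",
    padjust="bonf", effsize="cohen",
)
print(posthoc[["A", "B", "p-corr", "cohen"]])
```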

The Exception: Multi-Class Tasks

[Figure: DBpedia Exception]

DBpedia (14 classes) tells a different story:

| Few-shots | F1-Score | Gain vs Zero-Shot |
|---|---|---|
| 0 | 0.321 | — |
| 1 | 0.650 | +102.5% |
| 5 | 0.760 | +136.8% |
| 20 | 0.876 | +173.0% |
| 50 | 0.937 | +192.1% |

The rule of thumb: For tasks with >10 classes, plan for ~3-4 examples per class.

The ROI Reality Check

[Figure: Cost vs Performance]

Here’s what nobody talks about — the economics:

| Strategy | Performance Gain | Cost Multiplier | ROI Score |
|---|---|---|---|
| 1-shot | +53.2% | 1.75x | 30.4 |
| 3-shot | +59.0% | 3.25x | 18.2 |
| 5-shot | +63.5% | 4.75x | 13.4 |
| 10-shot | +66.9% | 8.5x | 7.9 |
| 50-shot | +81.1% | 37.5x | 2.2 |

The verdict: 1-shot delivers the best ROI by far. Going from 1 to 50 examples costs 21x more for only +28% additional performance.
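
The ROI column above appears to be nothing more exotic than performance gain divided by cost multiplier; here is a quick sanity check using the table's own figures:

```python
# ROI = performance gain (%) / cost multiplier, using the table's figures.
strategies = {
    "1-shot":  (53.2, 1.75),
    "3-shot":  (59.0, 3.25),
    "5-shot":  (63.5, 4.75),
    "10-shot": (66.9, 8.5),
    "50-shot": (81.1, 37.5),
}

for name, (gain_pct, cost_mult) in strategies.items():
    print(f"{name}: ROI = {gain_pct / cost_mult:.1f}")
```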

Practical Recommendations

The Decision Matrix

| Your Task | Recommended Examples | Why |
|---|---|---|
| Binary classification | 3-5 | Plateau at 5, marginal gains after |
| Sentiment (3 classes) | 3 | Very early plateau, oscillation after |
| Emotion (6 classes) | 5 | Early saturation |
| Multi-class (>10 classes) | 20-50 | Continues improving significantly |
| Extraction | 10 | Benefits from format examples |

Key Takeaways

  1. Invest in your first example — it delivers the +55.9% average improvement on its own, the bulk of the total gain
  2. Stop at 3-5 for simple tasks — Binary and low-cardinality multi-class plateau early
  3. Scale up only for complex taxonomies — >10 classes benefit from 20-50 examples
  4. Measure, don’t assume — Run your own benchmarks with Monte Carlo cross-validation

Methodology

  • Protocol: Monte Carlo Cross-Validation (30 random train/test splits; see the sketch below)
  • Model: GPT-4.1-nano
  • Metrics: F1-score (macro-averaged for multi-class)
  • Statistical tests: Welch’s ANOVA, Bonferroni-corrected post-hoc tests
  • Effect size: Cohen’s d
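
Putting the protocol together, a minimal Monte Carlo cross-validation loop might look like the sketch below, reusing the hypothetical `run_configuration` helper from earlier; the 70/30 split ratio is an assumption:

```python
import random
import numpy as np
from sklearn.model_selection import ShuffleSplit

def monte_carlo_cv(texts, labels, k, classify_fn, n_iterations=30, seed=0):
    """Average macro F1 over `n_iterations` random train/test splits."""
    data = list(zip(texts, labels))
    splitter = ShuffleSplit(n_splits=n_iterations, test_size=0.3,
                            random_state=seed)
    rng = random.Random(seed)
    scores = []
    for train_idx, test_idx in splitter.split(data):
        train_set = [data[i] for i in train_idx]
        test_set = [data[i] for i in test_idx]
        scores.append(run_configuration(train_set, test_set, k,
                                        classify_fn, rng))
    return float(np.mean(scores)), float(np.std(scores))
```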

Next in this series: “Why Classical ML Still Crushes GenAI at Regression” — The surprising tasks where sklearn wins 100% of the time.
