The first few-shot example delivers +55.9% gains. The next 49? Just +25% more.
The Counter-Intuitive Discovery
After running 35 benchmark configurations across 5 datasets with rigorous Monte Carlo cross-validation (30 iterations each), we discovered something that challenges conventional prompt engineering wisdom:
The first example you give a Large Language Model improves performance by +55.9% on average. The second? Just +3.1%.
This finding has massive implications for how we design GenAI pipelines — and how much we spend on them.
The Data: What We Tested
We benchmarked GPT-4.1-nano across multiple tasks using publicly available datasets:
| Dataset | Task Type | Classes | Zero-Shot F1 |
|---|---|---|---|
| SMS Spam | Binary Classification | 2 | 0.820 |
| Emotion | Multi-class | 6 | 0.344 |
| Twitter Sentiment | Sentiment | 3 | 0.488 |
| DBpedia | Multi-class | 14 | 0.321 |
| Keyword Extraction (Inspec) | Extraction | – | 0.155 |
Each configuration was tested with 0, 1, 3, 5, 10, 20, and 50 few-shot examples.
All datasets are open-source and available on Hugging Face Datasets.
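To ground this, here is a minimal sketch of how a k-shot prompt can be assembled from one of these datasets. The Hugging Face identifier (`sms_spam`), its field names (`sms`, `label`), and the prompt wording are illustrative assumptions, not our exact benchmark harness.

```python
import random
from datasets import load_dataset  # Hugging Face Datasets

# Load one of the benchmark datasets (identifier and field names assumed for illustration).
ds = load_dataset("sms_spam", split="train")
label_names = ["ham", "spam"]

def build_prompt(shots, query_text):
    """Assemble a k-shot classification prompt from sampled examples."""
    lines = ["Classify each SMS message as 'ham' or 'spam'.", ""]
    for ex in shots:
        lines += [f"Message: {ex['sms']}", f"Label: {label_names[ex['label']]}", ""]
    lines += [f"Message: {query_text}", "Label:"]
    return "\n".join(lines)

# k is one of the tested levels: 0, 1, 3, 5, 10, 20, or 50.
k = 3
shots = random.sample(list(ds), k)
print(build_prompt(shots, "Congratulations! You won a free cruise, call now to claim."))
```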
The Diminishing Returns Curve
The pattern is striking across all datasets:
| Transition | Average Gain | What It Means |
|---|---|---|
| 0 → 1 shot | +55.9% | Massive improvement |
| 1 → 3 shots | +3.1% | Diminishing returns begin |
| 3 → 5 shots | +2.8% | Marginal gains |
| 5 → 10 shots | +1.3% | Near plateau |
| 10 → 50 shots | +8.8%* | Only significant for multi-class |
*Inflated by DBpedia (14 classes), which keeps improving well past 10 shots.
Why Does This Happen?
The first example teaches the model three critical things:
- Output format — How to structure the response
- Task semantics — What “classify” or “extract” means in this context
- Label space — What categories or outputs are expected
Once these are established, additional examples provide marginal refinement.
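A concrete (and purely illustrative) way to see this is to compare a zero-shot prompt with its one-shot counterpart. The wording below is an assumption, not our exact template.

```python
# Zero-shot: output format, task semantics, and label space are all left implicit.
zero_shot = (
    "Classify the sentiment of the following tweet.\n"
    "Tweet: The new update is painfully slow.\n"
    "Sentiment:"
)

# One-shot: a single worked example pins down all three at once.
one_shot = (
    "Classify the sentiment of the following tweet.\n\n"
    "Tweet: Loving the new camera, best phone I've ever owned.\n"
    "Sentiment: positive\n\n"
    # Output format: one lowercase word after 'Sentiment:'.
    # Label space: shows what a valid label looks like (here, a sentiment word).
    # Task semantics: 'classify' means map the whole tweet to one label.
    "Tweet: The new update is painfully slow.\n"
    "Sentiment:"
)
```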
Statistical Evidence: This Is Not Random
We used Welch’s ANOVA to verify these findings weren’t due to chance:
| Dataset | F-statistic | p-value | Significant? |
|---|---|---|---|
| SMS Spam | 76.75 | <0.0001 | YES |
| Emotion | 73.89 | <0.0001 | YES |
| Twitter Sentiment | 92.57 | <0.0001 | YES |
| DBpedia | 439.88 | <0.0001 | YES |
| Keyword Extraction | 160.30 | <0.0001 | YES |
The effect is statistically significant for ALL datasets. DBpedia shows the strongest effect (F=439.88), reflecting its large improvement range from zero-shot to 50-shot.
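For context, running a Welch's ANOVA per dataset takes only a few lines. This is a minimal sketch assuming the `pingouin` package and a long-format results file with hypothetical `dataset`, `n_shots`, and `f1` columns.

```python
import pandas as pd
import pingouin as pg

# One row per Monte Carlo iteration x few-shot level; column names are hypothetical.
results = pd.read_csv("benchmark_results.csv")  # columns: dataset, n_shots, f1

for name, df in results.groupby("dataset"):
    # Welch's ANOVA does not assume equal variances across few-shot levels.
    aov = pg.welch_anova(data=df, dv="f1", between="n_shots")
    print(f"{name}: F = {aov.loc[0, 'F']:.2f}, p = {aov.loc[0, 'p-unc']:.2g}")
```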
Pairwise Comparisons (Bonferroni-corrected)
| Comparison | Cohen’s d | Interpretation |
|---|---|---|
| 0 vs 1 | d = -4.3 | Large (always significant) |
| 1 vs 3 | d = -0.5 | Medium (rarely significant) |
| 3 vs 5 | d = -0.2 | Negligible |
| 5 vs 10 | d = -0.1 | Negligible |
Key insight: While the overall effect is significant, pairwise comparisons confirm that differences between adjacent levels (3 vs 5, 5 vs 10) are not statistically significant after correction.
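The post-hoc step can be reproduced with plain SciPy: a Welch t-test for each adjacent pair, a manual Bonferroni correction, and a pooled-SD Cohen's d. The data below is synthetic and only stands in for the 30 Monte Carlo F1 scores per level.

```python
import numpy as np
from scipy.stats import ttest_ind

def cohens_d(a, b):
    """Cohen's d using a pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2))
    return (np.mean(a) - np.mean(b)) / pooled

# Synthetic stand-in for the 30 Monte Carlo F1 scores observed at each few-shot level.
rng = np.random.default_rng(0)
f1_by_shots = {k: rng.normal(loc=mu, scale=0.02, size=30)
               for k, mu in [(0, 0.49), (1, 0.76), (3, 0.78), (5, 0.80), (10, 0.81)]}

pairs = [(0, 1), (1, 3), (3, 5), (5, 10)]
for lo, hi in pairs:
    a, b = f1_by_shots[lo], f1_by_shots[hi]
    _, p = ttest_ind(a, b, equal_var=False)   # Welch's t-test (unequal variances)
    p_bonf = min(p * len(pairs), 1.0)         # Bonferroni: scale p by the number of comparisons
    print(f"{lo} vs {hi}: d = {cohens_d(a, b):+.2f}, corrected p = {p_bonf:.3g}")
```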
The Exception: Multi-Class Tasks
DBpedia (14 classes) tells a different story:
| Few-shots | F1-Score | Gain vs Zero-Shot |
|---|---|---|
| 0 | 0.321 | — |
| 1 | 0.650 | +102.5% |
| 5 | 0.760 | +136.8% |
| 20 | 0.876 | +173.0% |
| 50 | 0.937 | +192.1% |
The rule of thumb: For tasks with >10 classes, plan for ~3-4 examples per class.
The ROI Reality Check
Here’s what nobody talks about — the economics:
| Strategy | Performance Gain | Cost Multiplier | ROI Score |
|---|---|---|---|
| 1-shot | +53.2% | 1.75x | 30.4 |
| 3-shot | +59.0% | 3.25x | 18.2 |
| 5-shot | +63.5% | 4.75x | 13.4 |
| 10-shot | +66.9% | 8.5x | 7.9 |
| 50-shot | +81.1% | 37.5x | 2.2 |
The verdict: 1-shot delivers the best ROI by far. Going from 1 to 50 examples costs 21x more for only +28% additional performance.
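The ROI score in this table is simply the average performance gain divided by the cost multiplier, so the column is easy to sanity-check:

```python
# ROI score = average performance gain (%) / prompt-cost multiplier relative to zero-shot.
strategies = {
    "1-shot":  (53.2, 1.75),
    "3-shot":  (59.0, 3.25),
    "5-shot":  (63.5, 4.75),
    "10-shot": (66.9, 8.5),
    "50-shot": (81.1, 37.5),
}

for name, (gain_pct, cost) in strategies.items():
    print(f"{name:>7}: +{gain_pct}% gain at {cost}x cost -> ROI = {gain_pct / cost:.1f}")
```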
Practical Recommendations
The Decision Matrix
| Your Task | Recommended Examples | Why |
|---|---|---|
| Binary classification | 3-5 | Plateau at 5, marginal gains after |
| Sentiment (3 classes) | 3 | Very early plateau, oscillation after |
| Emotion (6 classes) | 5 | Early saturation |
| Multi-class (>10) | 20-50 | Continues improving significantly |
| Extraction | 10 | Benefits from format examples |
Key Takeaways
- Invest in your first example — It alone delivers a +55.9% average gain, the bulk of the improvement few-shot prompting will ever buy you
- Stop at 3-5 for simple tasks — Binary and low-cardinality multi-class plateau early
- Scale up only for complex taxonomies — >10 classes benefit from 20-50 examples
- Measure, don’t assume — Run your own benchmarks with Monte Carlo cross-validation
Methodology
- Protocol: Monte Carlo Cross-Validation (30 random train/test splits; see the sketch after this list)
- Model: GPT-4.1-nano
- Metrics: F1-score (macro-averaged for multi-class)
- Statistical tests: Welch’s ANOVA, Bonferroni-corrected post-hoc tests
- Effect size: Cohen’s d
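For readers who want to replicate the protocol, here is a minimal sketch of the Monte Carlo loop. The `classify_fn` argument is a hypothetical placeholder for whatever wraps the actual GPT-4.1-nano call; scikit-learn is used only for splitting and scoring.

```python
import random
import numpy as np
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import f1_score

def run_monte_carlo(texts, labels, classify_fn, k_shots, n_iterations=30, test_size=100):
    """Repeat random train/test splits and score macro F1 on each one.

    classify_fn(prompt) -> predicted label string; in our runs this wrapped GPT-4.1-nano.
    """
    splitter = ShuffleSplit(n_splits=n_iterations, test_size=test_size, random_state=42)
    scores = []
    for train_idx, test_idx in splitter.split(texts):
        # Sample the k few-shot examples from the training side of the split.
        shot_idx = random.sample(list(train_idx), k_shots)
        header = "\n\n".join(f"Text: {texts[i]}\nLabel: {labels[i]}" for i in shot_idx)
        preds = []
        for i in test_idx:
            prompt = f"{header}\n\nText: {texts[i]}\nLabel:"
            preds.append(classify_fn(prompt))
        truth = [labels[i] for i in test_idx]
        scores.append(f1_score(truth, preds, average="macro"))
    return float(np.mean(scores)), float(np.std(scores))
```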
Next in this series: “Why Classical ML Still Crushes GenAI at Regression” — The surprising tasks where sklearn wins 100% of the time.



