The first few-shot example delivers +55.9% gains. The next 49? Just +25% more.
The Counter-Intuitive Discovery
After running 35 benchmark configurations across 5 datasets with rigorous Monte Carlo cross-validation (30 iterations each), we discovered something that challenges conventional prompt engineering wisdom:
The first example you give a Large Language Model improves performance by +55.9% on average. The second? Just +3.1%.
This finding has massive implications for how we design GenAI pipelines — and how much we spend on them.
The Data: What We Tested
We benchmarked GPT-4.1-nano across multiple tasks using publicly available datasets:
| Dataset | Task Type | Classes | Zero-Shot F1 |
|---|---|---|---|
| SMS Spam | Binary Classification | 2 | 0.820 |
| Emotion | Multi-class | 6 | 0.344 |
| Twitter Sentiment | Sentiment | 3 | 0.488 |
| DBpedia | Multi-class | 14 | 0.321 |
| Keyword Extraction (Inspec) | Extraction | – | 0.155 |
Each configuration was tested with 0, 1, 3, 5, 10, 20, and 50 few-shot examples.
All datasets are open-source and available on Hugging Face Datasets.
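To ground this, here is a minimal sketch of how a k-shot prompt can be assembled from one of these datasets. The Hugging Face identifier (`sms_spam`), its field names (`sms`, `label`), and the prompt wording are illustrative assumptions, not our exact benchmark harness.

```python
import random
from datasets import load_dataset  # Hugging Face Datasets

# Load one of the benchmark datasets (identifier and field names assumed for illustration).
ds = load_dataset("sms_spam", split="train")
label_names = ["ham", "spam"]

def build_prompt(shots, query_text):
    """Assemble a k-shot classification prompt from sampled examples."""
    lines = ["Classify each SMS message as 'ham' or 'spam'.", ""]
    for ex in shots:
        lines += [f"Message: {ex['sms']}", f"Label: {label_names[ex['label']]}", ""]
    lines += [f"Message: {query_text}", "Label:"]
    return "\n".join(lines)

# k is one of the tested levels: 0, 1, 3, 5, 10, 20, or 50.
k = 3
shots = random.sample(list(ds), k)
print(build_prompt(shots, "Congratulations! You won a free cruise, call now to claim."))
```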
The Diminishing Returns Curve
The pattern is striking across all datasets:
| Transition | Average Gain | What It Means |
|---|---|---|
| 0 → 1 shot | +55.9% | Massive improvement |
| 1 → 3 shots | +3.1% | Diminishing returns begin |
| 3 → 5 shots | +2.8% | Marginal gains |
| 5 → 10 shots | +1.3% | Near plateau |
| 10 → 50 shots | +8.8%* | Only significant for multi-class |
*Inflated by DBpedia (14 classes), which keeps improving well past 10 shots.
Why Does This Happen?
The first example teaches the model three critical things:
- Output format — How to structure the response
- Task semantics — What “classify” or “extract” means in this context
- Label space — What categories or outputs are expected
Once these are established, additional examples provide marginal refinement.
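A concrete (and purely illustrative) way to see this is to compare a zero-shot prompt with its one-shot counterpart. The wording below is an assumption, not our exact template.

```python
# Zero-shot: output format, task semantics, and label space are all left implicit.
zero_shot = (
    "Classify the sentiment of the following tweet.\n"
    "Tweet: The new update is painfully slow.\n"
    "Sentiment:"
)

# One-shot: a single worked example pins down all three at once.
one_shot = (
    "Classify the sentiment of the following tweet.\n\n"
    "Tweet: Loving the new camera, best phone I've ever owned.\n"
    "Sentiment: positive\n\n"
    # Output format: one lowercase word after 'Sentiment:'.
    # Label space: shows what a valid label looks like (here, a sentiment word).
    # Task semantics: 'classify' means map the whole tweet to one label.
    "Tweet: The new update is painfully slow.\n"
    "Sentiment:"
)
```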
Statistical Evidence: This Is Not Random
We used Welch’s ANOVA to verify these findings weren’t due to chance:
| Dataset | F-statistic | p-value | Significant? |
|---|---|---|---|
| SMS Spam | 76.75 | <0.0001 | YES |
| Emotion | 73.89 | <0.0001 | YES |
| Twitter Sentiment | 92.57 | <0.0001 | YES |
| DBpedia | 439.88 | <0.0001 | YES |
| Keyword Extraction | 160.30 | <0.0001 | YES |
The effect is statistically significant for ALL datasets. DBpedia shows the strongest effect (F=439.88), reflecting its large improvement range from zero-shot to 50-shot.
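For context, running a Welch's ANOVA per dataset takes only a few lines. This is a minimal sketch assuming the `pingouin` package and a long-format results file with hypothetical `dataset`, `n_shots`, and `f1` columns.

```python
import pandas as pd
import pingouin as pg

# One row per Monte Carlo iteration x few-shot level; column names are hypothetical.
results = pd.read_csv("benchmark_results.csv")  # columns: dataset, n_shots, f1

for name, df in results.groupby("dataset"):
    # Welch's ANOVA does not assume equal variances across few-shot levels.
    aov = pg.welch_anova(data=df, dv="f1", between="n_shots")
    print(f"{name}: F = {aov.loc[0, 'F']:.2f}, p = {aov.loc[0, 'p-unc']:.2g}")
```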
Pairwise Comparisons (Bonferroni-corrected)
| Comparison | Cohen’s d | Interpretation |
|---|---|---|
| 0 vs 1 | d = -4.3 | Large (always significant) |
| 1 vs 3 | d = -0.5 | Medium (rarely significant) |
| 3 vs 5 | d = -0.2 | Negligible |
| 5 vs 10 | d = -0.1 | Negligible |
Key insight: While the overall effect is significant, pairwise comparisons confirm that differences between adjacent levels (3 vs 5, 5 vs 10) are not statistically significant after correction.
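The post-hoc step can be reproduced with plain SciPy: a Welch t-test for each adjacent pair, a manual Bonferroni correction, and a pooled-SD Cohen's d. The data below is synthetic and only stands in for the 30 Monte Carlo F1 scores per level.

```python
import numpy as np
from scipy.stats import ttest_ind

def cohens_d(a, b):
    """Cohen's d using a pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2))
    return (np.mean(a) - np.mean(b)) / pooled

# Synthetic stand-in for the 30 Monte Carlo F1 scores observed at each few-shot level.
rng = np.random.default_rng(0)
f1_by_shots = {k: rng.normal(loc=mu, scale=0.02, size=30)
               for k, mu in [(0, 0.49), (1, 0.76), (3, 0.78), (5, 0.80), (10, 0.81)]}

pairs = [(0, 1), (1, 3), (3, 5), (5, 10)]
for lo, hi in pairs:
    a, b = f1_by_shots[lo], f1_by_shots[hi]
    _, p = ttest_ind(a, b, equal_var=False)   # Welch's t-test (unequal variances)
    p_bonf = min(p * len(pairs), 1.0)         # Bonferroni: scale p by the number of comparisons
    print(f"{lo} vs {hi}: d = {cohens_d(a, b):+.2f}, corrected p = {p_bonf:.3g}")
```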
The Exception: Multi-Class Tasks
DBpedia (14 classes) tells a different story:
| Few-shots | F1-Score | Gain vs Zero-Shot |
|---|---|---|
| 0 | 0.321 | — |
| 1 | 0.650 | +102.5% |
| 5 | 0.760 | +136.8% |
| 20 | 0.876 | +173.0% |
| 50 | 0.937 | +192.1% |
The rule of thumb: For tasks with >10 classes, plan for ~3-4 examples per class.
The ROI Reality Check
Here’s what nobody talks about — the economics:
| Strategy | Performance Gain | Cost Multiplier | ROI Score |
|---|---|---|---|
| 1-shot | +53.2% | 1.75x | 30.4 |
| 3-shot | +59.0% | 3.25x | 18.2 |
| 5-shot | +63.5% | 4.75x | 13.4 |
| 10-shot | +66.9% | 8.5x | 7.9 |
| 50-shot | +81.1% | 37.5x | 2.2 |
The verdict: 1-shot delivers the best ROI by far. Going from 1 to 50 examples costs 21x more for only +28% additional performance.
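The ROI score in this table is simply the average performance gain divided by the cost multiplier, so the column is easy to sanity-check:

```python
# ROI score = average performance gain (%) / prompt-cost multiplier relative to zero-shot.
strategies = {
    "1-shot":  (53.2, 1.75),
    "3-shot":  (59.0, 3.25),
    "5-shot":  (63.5, 4.75),
    "10-shot": (66.9, 8.5),
    "50-shot": (81.1, 37.5),
}

for name, (gain_pct, cost) in strategies.items():
    print(f"{name:>7}: +{gain_pct}% gain at {cost}x cost -> ROI = {gain_pct / cost:.1f}")
```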
Practical Recommendations
The Decision Matrix
| Your Task | Recommended Examples | Why |
|---|---|---|
| Binary classification | 3-5 | Plateau at 5, marginal gains after |
| Sentiment (3 classes) | 3 | Very early plateau, oscillation after |
| Emotion (6 classes) | 5 | Early saturation |
| Multi-class (>10) | 20-50 | Continues improving significantly |
| Extraction | 10 | Benefits from format examples |
Key Takeaways
- Invest in your first example — It alone delivers a +55.9% average gain, the bulk of the improvement few-shot prompting will ever buy you
- Stop at 3-5 for simple tasks — Binary and low-cardinality multi-class plateau early
- Scale up only for complex taxonomies — >10 classes benefit from 20-50 examples
- Measure, don’t assume — Run your own benchmarks with Monte Carlo cross-validation
Methodology
- Protocol: Monte Carlo Cross-Validation (30 random train/test splits; see the sketch after this list)
- Model: GPT-4.1-nano
- Metrics: F1-score (macro-averaged for multi-class)
- Statistical tests: Welch’s ANOVA, Bonferroni-corrected post-hoc tests
- Effect size: Cohen’s d
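For readers who want to replicate the protocol, here is a minimal sketch of the Monte Carlo loop. The `classify_fn` argument is a hypothetical placeholder for whatever wraps the actual GPT-4.1-nano call; scikit-learn is used only for splitting and scoring.

```python
import random
import numpy as np
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import f1_score

def run_monte_carlo(texts, labels, classify_fn, k_shots, n_iterations=30, test_size=100):
    """Repeat random train/test splits and score macro F1 on each one.

    classify_fn(prompt) -> predicted label string; in our runs this wrapped GPT-4.1-nano.
    """
    splitter = ShuffleSplit(n_splits=n_iterations, test_size=test_size, random_state=42)
    scores = []
    for train_idx, test_idx in splitter.split(texts):
        # Sample the k few-shot examples from the training side of the split.
        shot_idx = random.sample(list(train_idx), k_shots)
        header = "\n\n".join(f"Text: {texts[i]}\nLabel: {labels[i]}" for i in shot_idx)
        preds = []
        for i in test_idx:
            prompt = f"{header}\n\nText: {texts[i]}\nLabel:"
            preds.append(classify_fn(prompt))
        truth = [labels[i] for i in test_idx]
        scores.append(f1_score(truth, preds, average="macro"))
    return float(np.mean(scores)), float(np.std(scores))
```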
Next in this series: “Why Classical ML Still Crushes GenAI at Regression” — The surprising tasks where sklearn wins 100% of the time.



