Zero-Shot & Few-Shot Prompting: Teaching Models by Example
You ask the model to classify customer feedback as positive, negative, or neutral. Zero-shot: “Classify this feedback: ‘The delivery was late but the product quality was amazing.’” The model responds with a paragraph explaining the nuances of the sentiment instead of giving you a one-word classification. You add two examples showing exactly what format you want. Now every response is a clean single-word label. Same model, same capability - but the examples taught it what you actually needed.
This is the fundamental difference between zero-shot and few-shot prompting: examples are not just hints, they are specifications. They define the task format, output structure, edge case handling, and quality bar in a way that instructions alone cannot.
What zero-shot and few-shot actually mean
Zero-shot: You describe the task and provide the input. No examples of completed tasks. The model relies entirely on what it learned during training to understand what you want.
Classify the sentiment: "The app crashes every time I open it"
One-shot: You provide one example of a completed task, then the actual input.
Classify the sentiment:
"Love the new update!" → positive
"The app crashes every time I open it" →
Few-shot: You provide 2-8 examples of completed tasks, then the actual input.
Classify the sentiment:
"Love the new update!" → positive
"Terrible customer service" → negative
"It works fine I guess" → neutral
"The app crashes every time I open it" →
The “shot” is the example. Zero examples, one example, few examples.
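The three variants differ only in how many demonstrations precede the real input. A minimal sketch of that idea, assuming a hypothetical helper named `build_prompt` (not from any library) and reusing the sentiment examples above:

```python
from typing import Sequence


def build_prompt(instruction: str, examples: Sequence[tuple[str, str]], query: str) -> str:
    """Render a zero-, one-, or few-shot prompt depending on len(examples)."""
    lines = [instruction]
    for text, label in examples:
        lines.append(f'"{text}" → {label}')
    # With no examples there is no arrow pattern to continue, so the
    # query is passed as-is; otherwise it ends with an open arrow.
    lines.append(f'"{query}" →' if examples else f'"{query}"')
    return "\n".join(lines)


examples = [
    ("Love the new update!", "positive"),
    ("Terrible customer service", "negative"),
    ("It works fine I guess", "neutral"),
]
query = "The app crashes every time I open it"

zero_shot = build_prompt("Classify the sentiment:", [], query)
one_shot = build_prompt("Classify the sentiment:", examples[:1], query)
few_shot = build_prompt("Classify the sentiment:", examples, query)
print(few_shot)
```

Keeping the examples as data rather than hard-coding them into a string makes it easy to test the same task with zero, one, or several shots.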
graph LR
subgraph zs["Zero-Shot"]
Z1["Task description"]
Z2["Input"]
Z3["Hope for the best"]
end
subgraph fs["Few-Shot"]
F1["Task description"]
F2["Example 1: input → output"]
F3["Example 2: input → output"]
F4["Example 3: input → output"]
F5["Actual input → ?"]
end
style Z3 fill:#FAEEDA,stroke:#854F0B,color:#633806
style F2 fill:#E1F5EE,stroke:#0F6E56,color:#085041
style F3 fill:#E1F5EE,stroke:#0F6E56,color:#085041
style F4 fill:#E1F5EE,stroke:#0F6E56,color:#085041
style F5 fill:#EEEDFE,stroke:#534AB7,color:#3C3489
When zero-shot works
Zero-shot is not always inferior. For tasks the model has seen extensively in training, zero-shot can match or exceed few-shot performance:
- Simple classification with clear categories (spam/not-spam)
- Translation between common language pairs
- Summarization of well-structured text
- Code generation with clear specifications
- General knowledge questions the model knows from training
The model has already seen a very large number of examples of these tasks during pre-training, so your own examples add little.
Zero-shot also saves tokens. If your context window is tight and each example costs 200 tokens, spending 600 tokens on 3 examples might not be worth the marginal improvement.
When few-shot is essential
Few-shot becomes critical when:
1. Custom output format: You need JSON with specific keys, a particular CSV structure, or a domain-specific notation. Examples are the most reliable way to specify format.
2. Domain-specific reasoning: The model has not seen enough examples of your specific domain during training. Medical coding, legal citation formats, or internal company classification schemes need examples.
3. Edge case handling: You want specific behavior on ambiguous inputs. An example showing “I guess it’s okay” → neutral teaches the model your calibration for borderline cases.
4. Consistency across calls: Without examples, the model might format outputs differently on each call. Examples anchor the format.
5. Novel tasks: If the task does not map cleanly to something in the training data, examples are how you teach it.
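For the custom-format case in particular, examples pair naturally with a strict parser on the way back. The sketch below is illustrative only: the `sentiment`/`topic` schema, the helper names, and the hard-coded `reply` are assumptions, and no real model API is called.

```python
import json

REQUIRED_KEYS = {"sentiment", "topic"}  # hypothetical schema for illustration

EXAMPLES = [
    ("Love the new update!", {"sentiment": "positive", "topic": "product"}),
    ("Terrible customer service", {"sentiment": "negative", "topic": "support"}),
]


def build_json_prompt(query: str) -> str:
    """Show the exact JSON shape through examples, then leave the last output open."""
    parts = ["Return one JSON object with keys: sentiment, topic."]
    for text, output in EXAMPLES:
        parts.append(f"Input: {text}\nOutput: {json.dumps(output)}")
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)


def parse_reply(reply: str) -> dict:
    """Parse a model reply and reject anything that deviates from the schema."""
    data = json.loads(reply)
    if not isinstance(data, dict) or set(data) != REQUIRED_KEYS:
        raise ValueError(f"unexpected shape: {reply!r}")
    return data


prompt = build_json_prompt("The app crashes every time I open it")
# Stand-in for a model response; replace with your own client call.
reply = '{"sentiment": "negative", "topic": "product"}'
print(parse_reply(reply))
```

Validating every reply gives you a cheap signal for when the examples have stopped anchoring the format.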
How to construct effective few-shot prompts
Rule 1: Examples should cover the output space
If you have 5 categories, include at least one example per category. If your task produces varying length outputs, show both short and long examples:
Extract action items from meeting notes:
Meeting: "Let's ship v2 by Friday. John will handle the migration script."
Action items:
- John: Write migration script (deadline: Friday)
- Team: Ship v2 (deadline: Friday)
Meeting: "Good sync. No blockers."
Action items:
- None
Meeting: "We discussed the Q3 roadmap extensively..."
Action items:
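For classification, coverage is easy to check mechanically before a prompt ships. A small sketch, assuming examples are stored as `(text, label)` pairs and using a hypothetical helper name:

```python
from typing import Iterable, Sequence


def missing_labels(examples: Sequence[tuple[str, str]], labels: Iterable[str]) -> list[str]:
    """Return the labels that no example demonstrates."""
    covered = {label for _, label in examples}
    return sorted(set(labels) - covered)


LABELS = ["positive", "negative", "neutral"]
examples = [
    ("Love the new update!", "positive"),
    ("Terrible customer service", "negative"),
]

gaps = missing_labels(examples, LABELS)
if gaps:
    print(f"No example for: {', '.join(gaps)}")
```

Here the check reports that `neutral` has no demonstration, which is exactly the gap Rule 1 warns about.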
Rule 2: Examples should represent difficulty distribution
Do not only show easy cases. Include at least one tricky example that demonstrates how to handle ambiguity:
Sentiment: "Not bad, actually" → positive
Sentiment: "Could be better, could be worse" → neutral
Sentiment: "I wouldn't say I hate it" → neutral
Rule 3: Order matters
Order examples deliberately. For classification, alternate categories rather than grouping all positives and then all negatives. Otherwise the model can show recency bias, favoring whichever category appeared last.
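One way to alternate categories is a round-robin over the examples grouped by label. This is a sketch with a hypothetical helper name, not a prescribed algorithm:

```python
from collections import defaultdict
from itertools import zip_longest


def interleave_by_label(examples):
    """Reorder (text, label) pairs so labels alternate instead of clustering."""
    groups = defaultdict(list)
    for text, label in examples:
        groups[label].append((text, label))
    ordered = []
    # Take one example from each label in turn; shorter groups run out first.
    for row in zip_longest(*groups.values()):
        ordered.extend(item for item in row if item is not None)
    return ordered


grouped = [
    ("Love the new update!", "positive"),
    ("Best purchase this year", "positive"),
    ("Terrible customer service", "negative"),
    ("Broke after two days", "negative"),
    ("It works fine I guess", "neutral"),
]
alternating = interleave_by_label(grouped)
print([label for _, label in alternating])
```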
Rule 4: Use delimiters consistently
Separate examples clearly. Use ---, ###, or numbered formatting. The model needs to distinguish between “this is an example” and “this is the actual input”:
### Example 1
Input: "..."
Output: "..."
### Example 2
Input: "..."
Output: "..."
### Your Task
Input: "..."
Output:
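The template above can be generated from data. The sketch below also rejects inputs that already contain the delimiter, since a stray `###` in user text could be read as a new section. That guard is an added assumption, not something the template requires, and `render_delimited` is a hypothetical name.

```python
DELIMITER = "###"


def render_delimited(examples, query):
    """Render examples and the real input under '###' headings."""
    for text in [t for t, _ in examples] + [query]:
        if DELIMITER in text:
            raise ValueError(f"input contains the delimiter {DELIMITER!r}")
    blocks = []
    for i, (text, output) in enumerate(examples, start=1):
        blocks.append(f'{DELIMITER} Example {i}\nInput: "{text}"\nOutput: "{output}"')
    # The final block leaves Output open for the model to complete.
    blocks.append(f'{DELIMITER} Your Task\nInput: "{query}"\nOutput:')
    return "\n".join(blocks)


prompt = render_delimited(
    [("Love the new update!", "positive"), ("Terrible customer service", "negative")],
    "The app crashes every time I open it",
)
print(prompt)
```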
graph TD
subgraph good["Effective Few-Shot Design"]
G1["Covers all output categories"]
G2["Includes edge cases"]
G3["Consistent formatting"]
G4["Clear delimiters"]
G5["3-5 examples (sweet spot)"]
end
subgraph bad["Common Mistakes"]
B1["All examples are easy cases"]
B2["Only one category shown"]
B3["Inconsistent format between examples"]
B4["Too many examples (token waste)"]
B5["Examples contradict each other"]
end
style G1 fill:#E1F5EE,stroke:#0F6E56,color:#085041
style G2 fill:#E1F5EE,stroke:#0F6E56,color:#085041
style G3 fill:#E1F5EE,stroke:#0F6E56,color:#085041
style G4 fill:#E1F5EE,stroke:#0F6E56,color:#085041
style G5 fill:#E1F5EE,stroke:#0F6E56,color:#085041
style B1 fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
style B2 fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
style B3 fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
style B4 fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
style B5 fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
Where it breaks or gets interesting
The copy problem
Models sometimes copy patterns from examples too literally. If all your examples have short inputs, the model might truncate its reasoning on a long input to match the pattern length. If an example contains a specific entity (“Amazon”), the model might reference that entity in its response to the real input.
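Entity leakage of this kind can be caught with a rough post-check. The heuristic below, which treats capitalized words as candidate entities, is deliberately crude and only a sketch; the helper name and the sample strings are made up for illustration.

```python
import re


def leaked_terms(example_texts, user_input, model_output):
    """Capitalized words that appear in the examples and the output but not in the real input."""

    def capitalized(text):
        # Crude proxy for named entities: any word starting with a capital letter.
        return set(re.findall(r"\b[A-Z][a-zA-Z]+\b", text))

    from_examples = set().union(*(capitalized(t) for t in example_texts))
    return sorted((from_examples & capitalized(model_output)) - capitalized(user_input))


leaks = leaked_terms(
    example_texts=["My Amazon order arrived two days late"],
    user_input="The package from the local store was damaged",
    model_output="The customer is unhappy with their Amazon delivery",
)
print(leaks)
```

A non-empty result does not prove the model copied from an example, but it is a cheap flag for outputs worth reviewing.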
Sensitivity to example selection
Few-shot performance can vary significantly depending on which examples you select. In some cases, random selection from a large pool outperforms carefully curated “best” examples, because diversity matters more than individual quality. Conversely, a single misleading example can hurt performance more than three good ones help.
Dynamic few-shot selection
The most effective approach for production systems: do not use static examples. Instead, embed your example pool and at inference time, retrieve the examples most similar to the current input. This gives the model the most relevant demonstrations for each specific query.
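# Dynamic few-shot: retrieve similar examples at runtime.
# embed, vector_db, and format_few_shot are placeholders for your own
# embedding function, vector store client, and prompt template.
query_embedding = embed(user_input)
similar_examples = vector_db.search(query_embedding, top_k=3)
prompt = format_few_shot(similar_examples, user_input)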
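To make the retrieval step concrete without external services, here is a self-contained sketch. It swaps in a toy bag-of-words "embedding" and a linear scan in place of a real embedding model and vector database, so the similarity scores are far cruder than a production system would give. All function names are illustrative.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def select_examples(pool, user_input, top_k=3):
    """Rank the pool by similarity to the input and keep the top_k examples."""
    query = embed(user_input)
    ranked = sorted(pool, key=lambda ex: cosine(query, embed(ex[0])), reverse=True)
    return ranked[:top_k]


def format_few_shot(examples, user_input):
    lines = [f'"{text}" → {label}' for text, label in examples]
    lines.append(f'"{user_input}" →')
    return "\n".join(lines)


POOL = [
    ("Love the new update!", "positive"),
    ("Terrible customer service", "negative"),
    ("It works fine I guess", "neutral"),
    ("The app crashes on startup", "negative"),
    ("Checkout page crashes when I pay", "negative"),
]

user_input = "The app crashes every time I open it"
selected = select_examples(POOL, user_input, top_k=2)
prompt = format_few_shot(selected, user_input)
print(prompt)
```

In practice you would embed the pool once and store the vectors rather than re-embedding every example on each request.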
The sweet spot is 3-5 examples
In practice, 3-5 examples is a good starting range for most tasks. Below 3, you often lack coverage of the output space. Above 5-7, you get diminishing returns while spending significant tokens. At 10+, you risk the model anchoring on surface patterns in the examples rather than generalizing.
Real-world systems that use these techniques
- GitHub Copilot - uses the surrounding code context as implicit few-shot examples (nearby functions demonstrate the coding style and patterns)
- OpenAI’s function calling - the models are fine-tuned to decide when and how to call tools; example calls in the prompt are a common way to steer that behavior further
- Jasper AI - provides tone examples to steer marketing copy generation
- Retool AI - uses dynamic few-shot retrieval from a company’s database schema to generate accurate SQL queries
- Classification APIs - Cohere’s classify endpoint and similar services use few-shot examples as the primary configuration mechanism
How to apply in practice
For classification tasks: Start with few-shot (1-2 examples per class). Only drop to zero-shot if token budget is extremely tight and accuracy remains acceptable. Use dynamic example selection for heterogeneous inputs.
For generation tasks: Use few-shot to define style, format, and length. Zero-shot with detailed instructions works for well-understood formats (JSON, markdown), but few-shot is more robust across model versions.
For extraction tasks: Always use few-shot. Show the model exactly what to extract, what to ignore, and how to format the output. Extraction with zero-shot is unreliable for anything beyond trivial cases.
For multi-step reasoning: Few-shot examples showing the reasoning steps (chain-of-thought) often outperform zero-shot instructions to “think step by step.” Show the intermediate work, not just the final answer.
Token budget allocation: If you have 4000 tokens for context, spending 500-800 on few-shot examples is often a higher-ROI use of those tokens than adding more retrieved documents or a longer system prompt.
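For the multi-step reasoning case, a few-shot chain-of-thought prompt might be assembled as in the sketch below. The two worked problems, the `Q:/Reasoning:/Answer:` layout, and the helper names are all illustrative choices rather than a standard.

```python
COT_EXAMPLES = [
    {
        "question": "Pens cost $2 for a pack of 3. How much do 12 pens cost?",
        "reasoning": "12 pens / 3 per pack = 4 packs. 4 packs x $2 = $8.",
        "answer": "$8",
    },
    {
        "question": "A train travels 60 km in 1.5 hours. What is its average speed?",
        "reasoning": "Speed = distance / time = 60 km / 1.5 h = 40 km/h.",
        "answer": "40 km/h",
    },
]


def build_cot_prompt(question: str) -> str:
    """Each example shows the intermediate work before the final answer."""
    parts = [
        f"Q: {ex['question']}\nReasoning: {ex['reasoning']}\nAnswer: {ex['answer']}"
        for ex in COT_EXAMPLES
    ]
    # End on an open 'Reasoning:' so the model writes its steps first.
    parts.append(f"Q: {question}\nReasoning:")
    return "\n\n".join(parts)


def extract_answer(reply: str) -> str:
    """Pull the final answer out of a reply that follows the demonstrated layout."""
    for line in reversed(reply.splitlines()):
        if line.startswith("Answer:"):
            return line[len("Answer:"):].strip()
    raise ValueError("no 'Answer:' line found")


prompt = build_cot_prompt("A recipe needs 250 g of flour per loaf. How much for 6 loaves?")
print(prompt)
```

Because the examples fix where the answer appears, a simple line-based parser is enough to separate the reasoning from the result.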
FAQ
Q: If few-shot is better, why would I ever use zero-shot?
Token cost and latency. Each example costs tokens. For high-volume, low-complexity tasks (simple translation, trivial classification), zero-shot with a clear instruction is cheaper and faster with minimal quality loss. Also, for tasks where the model already excels (common coding patterns, standard summarization), examples can actually introduce unintended constraints. Test both and let your eval decide.
Q: How do I know if my examples are good enough?
Run your eval with different example sets and compare scores. Good examples produce consistent outputs across diverse inputs. If the model handles your eval well but fails on real traffic, your examples likely do not cover the full input distribution. Track production failures and add examples that address each new failure mode.
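A bare-bones version of that comparison is sketched below. The `stub_model` function is a stand-in that simply returns the label of the example sharing the most words with the input; it exists only so the harness runs end to end, and you would replace it with a real model call.

```python
from typing import Callable, Sequence

Example = tuple[str, str]


def accuracy(
    model: Callable[[Sequence[Example], str], str],
    examples: Sequence[Example],
    eval_set: Sequence[Example],
) -> float:
    """Fraction of eval inputs the model labels correctly with this example set."""
    correct = sum(model(examples, text) == gold for text, gold in eval_set)
    return correct / len(eval_set)


def stub_model(examples: Sequence[Example], text: str) -> str:
    """Stand-in for an LLM call: copies the label of the most word-similar example."""
    words = set(text.lower().split())
    best = max(examples, key=lambda ex: len(words & set(ex[0].lower().split())))
    return best[1]


EVAL_SET = [
    ("love this app", "positive"),
    ("terrible app", "negative"),
    ("works fine", "neutral"),
]
SET_A = [("love it", "positive"), ("terrible service", "negative")]
SET_B = SET_A + [("it works fine I guess", "neutral")]

scores = {name: accuracy(stub_model, ex_set, EVAL_SET) for name, ex_set in [("A", SET_A), ("B", SET_B)]}
print(scores)
```

Set A has no neutral demonstration and misses the neutral eval item, which is the kind of coverage gap a per-set score makes visible.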
Q: Does the order of few-shot examples affect the output?
Yes, significantly. Models exhibit recency bias - the last example has disproportionate influence. For classification, the last example’s category is more likely to be chosen for ambiguous inputs. Mitigation: randomize example order across calls, or deliberately place the most “neutral” example last. For generation, the last example’s style and length most strongly influence the output.
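Both mitigations are a few lines each. A sketch, with hypothetical helper names and the assumption that examples are `(text, label)` pairs:

```python
import random


def shuffled_examples(examples, seed=None):
    """Return a reordered copy; pass a seed to reproduce a specific order when debugging."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    return shuffled


def neutral_last(examples, neutral_label="neutral"):
    """Move examples with the neutral label to the end, keeping other relative order."""
    # sorted() is stable and False sorts before True, so only neutral items move.
    return sorted(examples, key=lambda ex: ex[1] == neutral_label)


EXAMPLES = [
    ("It works fine I guess", "neutral"),
    ("Love the new update!", "positive"),
    ("Terrible customer service", "negative"),
]

per_call_order = shuffled_examples(EXAMPLES)
fixed_order = neutral_last(EXAMPLES)
print([label for _, label in fixed_order])
```

Randomizing spreads the bias across calls, while a fixed neutral-last order keeps outputs reproducible; which trade-off is better depends on whether you need determinism.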
Interview questions
Q: You are building a product review classification system. Reviews need to be tagged with multiple labels (quality, shipping, support, pricing). Would you use zero-shot or few-shot, and how would you structure the prompt?
Few-shot is essential here because multi-label classification is inherently ambiguous - the model needs examples showing that a single review can have multiple tags, and examples showing reviews with only one tag. Structure: 4-5 examples covering single-label reviews, multi-label reviews, and one example with no applicable labels. Use a consistent output format (JSON array of labels). For production scale, use dynamic few-shot: embed a pool of labeled example reviews and retrieve those most similar to the incoming review, so the examples cover the most likely categories for that input.
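A static version of that prompt could be sketched as follows. The example reviews are invented for illustration, the helper names are hypothetical, and the parser assumes the model answers with a bare JSON array.

```python
import json

ALLOWED = ["quality", "shipping", "support", "pricing"]

# Invented reviews covering single-label, multi-label, and no-label cases.
EXAMPLES = [
    ("Sturdy build, feels premium.", ["quality"]),
    ("Arrived a week late and support never replied.", ["shipping", "support"]),
    ("Great value, but the box was crushed in transit.", ["pricing", "shipping"]),
    ("Bought this for my nephew's birthday.", []),
]


def build_multilabel_prompt(review: str) -> str:
    header = f"Tag the review with zero or more of: {', '.join(ALLOWED)}. Answer with a JSON array."
    shots = [f"Review: {text}\nLabels: {json.dumps(labels)}" for text, labels in EXAMPLES]
    return "\n\n".join([header, *shots, f"Review: {review}\nLabels:"])


def parse_labels(reply: str) -> list[str]:
    """Accept only a JSON array whose items are all known labels."""
    labels = json.loads(reply)
    if not isinstance(labels, list) or any(label not in ALLOWED for label in labels):
        raise ValueError(f"unexpected labels: {reply!r}")
    return labels


prompt = build_multilabel_prompt("Cheap, but it stopped working after a month.")
print(prompt)
```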
Q: Your few-shot prompt works great in testing but degrades in production. What might be happening?
Production inputs are more diverse than test inputs. The few-shot examples likely cover the “clean” cases but not the messy real-world variants: typos, mixed languages, extremely short or long inputs, ambiguous edge cases. Fix: log production failures, categorize them, add representative examples for each failure mode. Consider dynamic example selection so each input gets the most relevant demonstrations. Also check if the production model version differs from your test model - few-shot sensitivity varies across model versions.
Q: Compare few-shot prompting vs fine-tuning. When would you choose each?
Few-shot: fast to iterate (change examples, no retraining), works immediately, no training data requirements beyond the examples themselves. Best for tasks where you need flexibility and the example pool might change. Fine-tuning: higher ceiling for performance, lower per-request cost (no example tokens consumed), more consistent outputs. Best when you have 100+ labeled examples, the task is stable, and you are optimizing for cost at scale. The practical threshold: if you are spending more than 30% of your context window on few-shot examples and need this for thousands of daily calls, fine-tuning often pays for itself within weeks.
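The rule of thumb at the end of that answer can be written as a quick check. The 30% share comes from the text above; reading "thousands of daily calls" as 1000 or more is an assumption, as are the function name and parameters.

```python
def should_consider_fine_tuning(
    example_tokens: int,
    context_window: int,
    daily_calls: int,
    share_threshold: float = 0.30,  # from the rule of thumb above
    min_daily_calls: int = 1000,    # assumed reading of "thousands of daily calls"
) -> bool:
    """True when examples take a large share of the context at high volume."""
    share = example_tokens / context_window
    return share > share_threshold and daily_calls >= min_daily_calls


# 1500 of 4000 tokens is 37.5% of the window, at 5000 calls per day.
print(should_consider_fine_tuning(1500, 4000, 5000))
# 600 of 4000 tokens is 15%, below the threshold.
print(should_consider_fine_tuning(600, 4000, 5000))
```

This only flags candidates; whether fine-tuning actually pays off still depends on your training cost, pricing, and how stable the task is.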