Table of Contents

One of the best ways to get better outputs from LLMs is to include examples in your prompt. This method is called few-shot prompting (a “shot” is an example). By providing examples in your prompt you're showing the model exactly what you are looking for in terms of output structure, tone, and style.

This guide dives deep into everything related to few shot prompting (also known as few shot learning, or in-context learning). We’ll cover the different ways to use this method, when you should use it, common questions (like how many examples is best), and limitations + biases.

We’ll also be looking at tons of concrete examples and will provide free templates.

Whether you’re new to prompt engineering or currently running prompts in production, you’ll be able to get value from this guide. After reading this, you will walk away with actionable tactics you can use to enhance your prompts and get better outputs from LLMs.

What is few shot prompting?

Few shot prompting is a prompt engineering technique where you insert examples in your prompt, training the model on what you want the output to look and sound like.

This method builds on LLM’s ability to learn and generalize information from a small amount of data. This makes it particularly helpful when you don’t have enough data to fine-tune a model.

Here is a very simple example of a few-shot prompt.

The goal of the prompt is for the LLM to determine the sentiment of the movie review.

As you can see, we send three example pairs of data. This approach not only helps the model learn what we would deem as positive, negative, or neutral, but it also shows our desired output format is a single word, all lowercase.

Zero shot and one shot prompting

You may also hear about “Zero shot prompting” or “One shot prompting”. These are prompts with zero or one example, respectively, rather than a few.

Zero shot vs few shot prompting

One shot vs few shot prompting

Few shot prompting examples

Content creation

Let’s say you’re a digital marketing firm and you want to use AI to generate customized content for each of your different clients. Let’s use few-shot prompting to create a template that both:

  1. Creates content that is in the correct tone and style of the client
  2. Is scalable and adaptable to be used for any client.

Here’s what the prompt might look like:

By passing along previous briefs and content generated from those briefs, the model will get an understanding of the tone and style for the specific client. I wrapped the examples in delimiters (three quotation marks) to format the prompt and help the model better understand which part of the prompt is the examples versus the instructions.

This prompt, while basic, is adaptable to any client, all we would need to do is update the variables. You could even turn the prompt into a PromptHub form and share it with your team. By surfacing only the variables that need to change based on the client, anyone on your team can run the few-shot prompt just by updating which client they are working on and what the brief is.

A form with a header, sub-header, and 3 input fields
PromptHub Forms make it easy to quickly prototype on top of prompts

Code generation few shot prompt

Let’s say we want to use an LLM to write a python function that calculates the factorial of a number.

Here's a prompt we could use:

Here’s a few shot prompt:

Here was the output for the zero shot prompt :

Here was the output for the few shot prompt :

Taking a look at the outputs, a few things stand out.

  • The zero-shot prompt produced a succinct recursive factorial function, but didn't add input validation for negative integers.
  • The few-shot prompt output, however, included input checks and used an iterative approach with a docstring for clarity, aligning with Python’s preference for readability.

Overall, the few-shot prompt returned a more robust function. It's more reliable, and offers better maintainability and input validation.

This is a quick example how just a little bit of work via few-shot prompting can make material differences in the outputs you get from LLMs.

Few shot prompting with multiple prompts

Another, slightly more complex, way to implement few shot prompting is by using multiple prompts to provide the examples.

This involves “pre-baking” a few user and AI messages before sending the last prompt which is the one we want the AI to respond to. We’ll use PromptHub’s chat testing feature to do this.

Following the movie sentiment example, this is how you would implement it with many prompts.

4 messages stacked on top of each other
Few shot prompting via multiple messages

In this case you’re sending the model multiple messages. Rather than just showing it how it should respond, we've gone ahead and actually created responses. All these messages will be sent at once, giving the model an even better understanding of how it should respond.

So which method is better? It depends, but here are a few things to keep in mind.

Multiple messages might be better when:

  1. Simulating Interaction: The real-world application involves a back-and-forth interaction, such as in a customer service chatbot, where the model needs to understand and respond within the flow of a conversation.
  2. Contextual Continuity: You're aiming to maintain a narrative or contextual continuity over several interactions, and the model needs to generate responses that are coherent within the ongoing sequence.
  3. Incremental Complexity: The task benefits from a step-by-step buildup of context, where each message might add a layer of complexity or nuance to the conversation that a single prompt might not encapsulate.

A single prompt might be better when:

  1. Streamlining Processing: Efficiency is a priority and you want the model to process the examples and generate a response in one go
  2. Uniformity in Output: You're seeking a consistent style or format in the outputs, which may be more reliably produced if all examples are provided in a single prompt.

I’d recommend you test both out and see how they perform. You can use PromptHub’s testing tools to do this side-by-side to see how each method performs.

Common questions about few shot prompting:

How many examples should I include?

Adding more examples does not necessarily improve accuracy; in some cases adding more examples can actually reduce accuracy. Multiple research papers point to major gains after 2 examples and then a plateau. After 2 examples you might just be burning tokens.

a graph showing performance versus number of examples in context
Source: Language Models are Few-Shot Learners

A table showing the performance on different datasets with different number of examples
K represents the number of examples.
Source: Large Language Models as Analogical Reasoners

Does the order of examples matter?

Yes, the order matters. The extent to which it affects output quality depends on the model you’re using. The paper, Calibrate Before Use: Improving Few-Shot Performance of Language Models, demonstrated this by altering the order of the same examples in a prompt for GPT-3. I think it is safe to assume that ‘smarter’ models should be influenced less by ordering.

The researchers found that the model's predictions varied dramatically based on the sequence of examples. In some instances, the right permutation of examples led to near state-of-the-art performance, while others fell to nearly chance levels. The graph below shows more details.

Multiple bar charts showing how performance can vary when changing the order of the examples used in a prompt
Source: Calibrate Before Use:
Improving Few-Shot Performance of Language Models

One strategy worth testing is placing your most critical example last in the order. LLMs have been known to place significant weight on the last piece of information they process.

What about the prompt format, what should come first, the examples or instructions?

While the more typical approach is to lead with the instructions followed by the examples, either approach can work and the best method might vary based on the model.

If you place the examples second and it seems like the model is either overemphasizing the last example or 'forgetting' the instructions, then consider having the instructions come last.

Another approach is to omit the instructions completely, like we did for our movie sentiment classifier. If the task is simple enough that the model can infer what to do, then basic instructions may not be necessary at all.

When to use few shot prompting

Okay great, but when should you use few shot prompting? Luckily, few shot prompting can be applied to almost any prompt and will help you get better and more consistent outputs. Here are a few use cases where few shot prompting can be particularly helpful.

  1. Specialized Domains: When working in specialized fields such as legal, medical, or technical domains, where gathering vast amounts of data can be difficult, few shot prompting allows for high-quality, domain-specific outputs without the need for extensive datasets.
  2. Dynamic Content Creation: Ideal for tasks like content generation where consistent styles and tone are paramount.
  3. Strict Output Structure Requirements: Few shot prompting is particularly helpful in showing the model how you’d like your outputs to be structured.
  4. Customized User Experiences: In personalized applications, such as chatbots or recommendation systems, where the AI needs to quickly adjust to individual user preferences and inputs.

Why use few shot prompting

Here are some of the top reasons to try out few shot prompting:

  1. Resource Efficiency: Few shot prompting only requires a few example pieces of data
  2. Time Savings: It accelerates the model's ability to adapt to new tasks, which means quicker deployment times and faster time-to-market for AI-powered features and products.
  3. Cost Reduction: Compared to the time spent gathering and labeling data to fine-tune a model, few-shot prompting is considerably cheaper, especially for smaller teams.
  4. Small Lift, Big Gains: Setting up and testing few shot prompting is relatively easy and can help you get much better outputs.

An example from the research

It wouldn’t be a PromptHub article if we didn’t dive deep into some research. We’ll be checking out this paper from April 2024, from researchers at the University of London: The Fact Selection Problem in LLM-Based Program Repair.

Overview

The paper is centered around the use of various “facts” (examples) in prompts used to solve bugs in open-source projects on Github.

Methodology

  1. Fact Collection: The researchers gathered a set of bug-related examples. These included details about buggy code, error messages and other types of documentation that could be helpful when solving future bugs.
  2. Prompt Construction: The examples were incorporated into the prompts using few-shot prompting.
  3. Evaluating Impact: The researchers then evaluated how different combinations of these examples affected the model’s ability to correctly solve the bugs

Findings

  • Utility of Examples: Each example contributed uniquely, highlighting the importance of having a diverse set of examples
  • More examples doesn’t mean better outputs: Interestingly, adding more examples didn’t always lead to better outcomes and sometimes degraded performance if the prompt becomes too cluttered or complex. (See graph below )
  • Fact Selection Model: The researchers built a statistical model named MANIPLE that algorithmically selected the most effective subset of facts for each bug, optimizing the prompt's effectiveness.

line graph showing the performance scores versus number of facts used
Another example of the diminishing returns of adding examples into your prompt

Overview of MANIPLE

The primary goal of MANIPLE is to maximize the gains from few shot prompting by identifying the optimal subset of examples that are most relevant and effective for each bug.

How MANIPLE works

  1. Statistical Modeling: MANIPLE examines patterns patterns from past bug fixes to decide which examples to include to get successful outcomes.
  2. Probabilistic Inference: The model isolates each example to determine how much it contributed positively to the successful bug fix.
  3. Fact Selection Optimization: Based on the probabilities, MANIPLE selects the subset of facts that maximizes the likelihood of successful bug repair.

The MANIPLE framework led to a 17% increase in bug fixes. While this setup is a little advanced, it is a great example of how you can extend few shot prompting to achieve more significant results.

Limitations and challenges of few shot prompting

As great as few shot prompting is, it isn’t perfect. The biggest limitation is its dependency on the quality and variety of the examples provided. Garbage in, garbage out, as they say.

Some times the examples can even degrade the performance, or send the model in the wrong direction.

There is also the risk of overfitting - where the model fails to generalize the examples and creates outputs that mimic the examples too closely. Additionally there are some biases to be aware of:

  1. Majority Label Bias: The model tends to favor answers that are more frequent in the prompt. For example, going back to our movie sentiment task, if our prompt includes more positive than negative examples, the model may be biased towards predicting a positive sentiment. The magnitude of this bias varies based on the model.
  2. Recency Bias: LLMs are known to favor the last chunk of information they receive. Revisiting our movie sentiment prompt, if the last few examples in a prompt are negative, the model may be more likely to predict a negative sentiment.

Wrapping up

There you have it, a comprehensive guide on what we believe to be the most effective prompt engineering method out there. Few shot prompting has the best bang for buck in relation to its accessibility and how it can drastically enhance output quality. We hope this helps you get better outputs!

Dan Cleary
Founder