
If you’ve spent any time writing prompts, you’ve probably noticed just how sensitive LLMs are to minor changes in the prompt. For example, look at the two prompts below. The semantic differences are minor, but the performance difference is huge. Try to guess which one is better.

Two similar prompts on top of each other
Find the answer in my LinkedIn post

This is why prompt testing is so critical. It is hard to know how these little changes will affect performance; the knife can cut in either direction.

Luckily, there has been a recent flurry of papers related to prompt sensitivity. In this article, we’ll dive deep into the latest research, the implications of prompt sensitivity, and what you need to do if you’re using LLMs in any type of application.

For reference, these are the three papers we’ll be pulling data and insights from:

  1. How are Prompts Different in Terms of Sensitivity?
  2. What Did I Do Wrong? Quantifying LLMs’ Sensitivity and Consistency to Prompt Engineering
  3. On the Worst Prompt Performance of Large Language Models

What is prompt sensitivity

Prompt sensitivity refers to how strongly a model’s output changes in response to even minor variations in the prompt. The higher the sensitivity, the greater the variation in the output. Every model exhibits some level of prompt sensitivity.

For example, the chart below shows how even a minor syntactic rephrasing of the prompt can lead to a complete change in the distribution of outputs.

Two bar charts showing prompt performance on top of each other
Different prompt variants lead to extremely different output distributions

What is prompt consistency

Prompt consistency measures how uniform the model's predictions are across different samples of the same class.

What’s the difference between consistency and sensitivity?

Sensitivity measures the variation in a model's predictions due to different prompts, while consistency assesses how uniform these predictions are across samples of the same class.

High consistency combined with low sensitivity indicates stable model performance.
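To make the two definitions concrete, here is a minimal sketch of how they can be computed. The `toy_classify` function is a hypothetical, deterministic stand-in for an LLM call, and the formulas are simplifications for illustration, not the papers’ exact metrics:

```python
from collections import Counter

def sensitivity(predict, prompts, sample):
    """Fraction of prompt variants that disagree with the majority label
    for a single sample. 0.0 means every prompt gives the same answer."""
    preds = [predict(p, sample) for p in prompts]
    majority = Counter(preds).most_common(1)[0][1]
    return 1 - majority / len(preds)

def consistency(predict, prompt, samples_of_one_class):
    """Fraction of same-class samples that receive the majority label
    under a single prompt. 1.0 means perfectly uniform predictions."""
    preds = [predict(prompt, s) for s in samples_of_one_class]
    majority = Counter(preds).most_common(1)[0][1]
    return majority / len(preds)

# Hypothetical deterministic stand-in for an LLM classifier.
def toy_classify(prompt, text):
    if "Classify the sentiment" in prompt:
        return "positive" if "love" in text else "negative"
    return "positive"  # the sloppier prompt always answers "positive"

prompts = ["Classify the sentiment of this review.",
           "Is this review good or bad?"]

print(sensitivity(toy_classify, prompts, "I love this product"))  # 0.0 (prompts agree)
print(sensitivity(toy_classify, prompts, "It broke on day one"))  # 0.5 (prompts disagree)
```

Low sensitivity on both samples plus high consistency within each class is the stable regime you want.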

How different prompt engineering methods affect sensitivity

Across all three papers, there were various experiments that analyzed how different models and prompt engineering methods related to sensitivity and performance.

We’ll start by taking a look at different prompt engineering methods, starting with the paper How are Prompts Different in Terms of Sensitivity?

The researchers tested 8 methods:

A list of prompt engineering methods in a table

Let’s look at the results, broken down by model.

6 graphs showing prompt engineering performance on different models
The average accuracy and sensitivity of each model using various prompts across different datasets.

You’ll see that there is a strong negative correlation between accuracy and sensitivity: as sensitivity goes up, accuracy goes down.

Impact of human-designed vs. LM-generated prompts on sensitivity

The researchers then tested how human-designed prompts compared to LM-generated prompts in regard to accuracy and sensitivity. Base_b was the human-designed prompt, and APE (Automatic Prompt Engineer) was the LM-generated prompt.

A table showing the accuracy and sensitivity of human-written prompt versus LLM-generated prompt
Comparing the performance of models using base_b (human-written prompts) and APE (LLM-generated prompts)

As you can see, the two prompts led to similar accuracy and sensitivity on the given datasets, signaling that human-designed and LM-generated prompts had similar effects.

Generated Knowledge Prompting

The next prompting method analyzed was Generated Knowledge Prompting (GKP). GKP is when you leverage knowledge generated by the LLM to give more information in the prompt.
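In practice, GKP is a two-stage pipeline: first ask the model to generate relevant background facts, then prepend those facts to the actual question. A minimal sketch, where `fake_llm` is a hypothetical deterministic stand-in for a real model API:

```python
def generated_knowledge_prompt(llm, question, n_facts=3):
    # Stage 1: ask the model for background knowledge about the question.
    knowledge = llm(f"Generate {n_facts} relevant facts about: {question}")
    # Stage 2: prepend that knowledge to the actual question.
    return llm(f"Knowledge: {knowledge}\n\nUsing the knowledge above, answer: {question}")

# Deterministic stub so the sketch runs without a real model.
def fake_llm(prompt):
    if prompt.startswith("Generate"):
        return "Greece borders the Mediterranean Sea."
    return "ANSWER based on: " + prompt.splitlines()[0]

print(generated_knowledge_prompt(fake_llm, "Is Greece a coastal country?"))
```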

A table of results, broken down by model, comparing a basic prompt and generated knowledge prompt

As you can see from the results, GKP led to higher accuracy and lower sensitivity most of the time.

This suggests that including instructions and generated knowledge has cumulative effects on performance.

Chain-of-Thought Prompting

Next up was Chain-of-Thought (CoT) prompting, one of the more popular techniques. This approach involves structuring prompts to guide the model through a logical reasoning process, potentially enhancing its ability to derive correct conclusions.
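The simplest (zero-shot) form of CoT just appends a reasoning trigger to the prompt; few-shot CoT instead includes worked examples. A minimal sketch of the zero-shot variant:

```python
question = ("A bat and a ball cost $1.10 in total. The bat costs $1.00 "
            "more than the ball. How much does the ball cost?")

# Plain prompt: the model answers directly.
base_prompt = f"Q: {question}\nA:"

# Zero-shot CoT variant: the added phrase nudges the model to lay out
# intermediate reasoning steps before committing to an answer.
cot_prompt = f"Q: {question}\nA: Let's think step by step."
```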

A table of results, broken down by model, comparing a basic prompt and CoT prompt

The table above shows that CoT leads to similar accuracy but higher sensitivity compared to the base_b prompt.

Some more CoT data:

6 bar charts showing prompt sensitivity for 6 models

As you can see in the graph above, CoT_base_a outperforms base_a but is worse than base_b in most cases. This suggests that for these datasets, reasoning chains do help improve performance, but are not as effective as instructions (GKP).

Simple, detailed, and 1-shot prompting

We’ll continue on with our analysis of different prompting strategies and how they relate to sensitivity and, by proxy, accuracy. We’ll turn to a different paper now: What Did I Do Wrong? Quantifying LLMs’ Sensitivity and Consistency to Prompt Engineering.

This paper tested three different prompt engineering methods across a variety of models and datasets:

  1. Simple: The prompt consists of just the task description.
  2. Detail: A detailed description of the task is provided.
  3. 1-shot: Similar to simple, but includes one example to illustrate the task.
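The three strategies above can be sketched as prompt templates for a hypothetical sentiment task (the wording here is illustrative, not the paper’s exact templates):

```python
task = "Classify the sentiment of the review as positive or negative."

# Simple: just the task description.
simple = f"{task}\n\nReview: {{review}}\nSentiment:"

# Detail: the task plus explicit guidance and output constraints.
detail = (
    f"{task} Consider the overall tone, any sarcasm, and whether the "
    "reviewer would recommend the product. Respond with exactly one word: "
    "'positive' or 'negative'.\n\nReview: {review}\nSentiment:"
)

# 1-shot: the simple prompt plus one worked example.
one_shot = (
    f"{task}\n\n"
    "Review: I absolutely love this blender!\nSentiment: positive\n\n"
    "Review: {review}\nSentiment:"
)

print(simple.format(review="It broke on day one."))
```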

A large table of results of different prompting methods, datasets, and models, measuring sensitivity, consistency, and F1 scores
Sensitivity Sτ (lower is better), average Consistency Cy (higher is better), and Micro-F1 scores (higher is better) for various datasets, models, and prompting strategies.

  • Simple and Detail prompting are more effective across all metrics for Llama3 and GPT-3.5.
  • Detail and 1-shot tend to work better for Mixtral and GPT-4o.
  • There is no consistent pattern for the best sensitivity, consistency, and F1 scores.

It's mentioned above, but it's worth repeating: look again at Llama3 and GPT-4o. The best-performing method is completely different for each, reinforcing the idea that one size does not fit all.

This highlights an important point for developers and teams using LLMs: You need to extensively test your prompts when switching from one LLM to another. A prompt that worked well for one model might lead to instability and decreased performance with another.
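That kind of cross-model prompt testing can be as simple as scoring every model-prompt pair against a labeled dataset. A minimal harness sketch, using hypothetical lambdas as deterministic stand-ins for real model APIs:

```python
from itertools import product

def evaluate(predict, prompt, dataset):
    """Accuracy of one (model, prompt) pair on labeled (text, label) examples."""
    hits = sum(predict(prompt, text) == label for text, label in dataset)
    return hits / len(dataset)

def prompt_matrix(models, prompts, dataset):
    """Score every model x prompt pair; the best prompt for one model
    is often not the best for another."""
    return {(name, prompt): evaluate(fn, prompt, dataset)
            for (name, fn), prompt in product(models.items(), prompts)}

# Hypothetical deterministic stand-ins for two different model APIs.
models = {
    "model_a": lambda p, t: ("positive" if "love" in t else "negative")
               if "sentiment" in p else "positive",
    "model_b": lambda p, t: "positive" if "love" in t else "negative",
}
prompts = ["Classify the sentiment.", "Good or bad?"]
dataset = [("I love it", "positive"), ("Meh.", "negative")]

scores = prompt_matrix(models, prompts, dataset)
# In this toy setup, model_a scores 1.0 with the first prompt but only
# 0.5 with the second, while model_b scores 1.0 with both.
```

Running the full matrix before a model switch surfaces exactly the kind of instability described above.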

Self-refinement, voting and distillation prompting

Turning to our third and last paper, On the Worst Prompt Performance of Large Language Models, we’ll look at a few more prompting methods.

In an effort to enhance the performance of prompts that underperformed due to high sensitivity, the researchers tested several prompt engineering methods:

Raw: This method uses the original, unaltered prompts to establish a baseline performance.

Self-refinement: This method involves the LLM iteratively refining prompts based on the model's previous outputs to enhance performance.

Voting: This approach aggregates the outputs from multiple variations of the prompt and lets the model vote for the best result to improve reliability.

Distillation: This technique involves training the model to generalize better by distilling knowledge from multiple training iterations into a single, more robust model.
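Of these, voting is the easiest to sketch: one common implementation is simple majority voting over answers from paraphrased prompts (the paper’s exact aggregation may differ). The `fake_llm` stub below is a hypothetical, deterministic stand-in for a real model:

```python
from collections import Counter

def vote(llm, prompt_variants, question):
    """Run the same question through several paraphrased prompts and
    return the majority answer, trading extra calls for robustness."""
    answers = [llm(f"{p}\n\n{question}") for p in prompt_variants]
    return Counter(answers).most_common(1)[0][0]

# Deterministic stub standing in for a real model.
def fake_llm(prompt):
    return "B" if "carefully" in prompt else "A"

variants = ["Answer the question.",
            "Please answer the question carefully.",
            "Respond to the following question."]

print(vote(fake_llm, variants, "Which option is correct?"))  # "A" (2 of 3 agree)
```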

A large table of results showing the different performance metrics and changes based on which prompt engineering method was used
Model performance after prompt engineering (Self-refinement and Voting) and prompt consistency regularization (Distillation). The red numbers indicate a decrease in performance, the green numbers represent an improvement.

  • Self-Refinement: Significantly decreased performance for Llama-2-7/13/70B-chat models, with declines of 10.04%, 13.15%, and 13.53% respectively
  • Voting Method: The voting method boosted the worst-case performance significantly (e.g., a 21.98% increase for Llama-2-70B-chat), though it reduced the best and average performances by 6.71% for Llama-2-13b-chat.
  • Distillation: Improved consistency but reduced overall performance, likely due to overfitting to lower-quality, self-generated outputs, showcasing the difficulty of balancing refinement with the risk of bias or errors.

Which parts of the prompt are the most sensitive?

As we saw above, different prompt engineering methods focus on different components, such as instructions, examples, and chains of reasoning.

The researchers broke down which components of the prompt affect the output the most, i.e., which are the most sensitive.

The image below displays these sensitivity scores:

A small, 1-row table, showing the saliency scores of four different prompt components
The average mean saliency scores of prompt components

  • S_input (4.33): Shows moderate sensitivity, indicating that direct inputs to the model have a substantial impact on the output.
  • S_knowledge (2.56): Demonstrates (surprisingly) lower sensitivity.
  • S_option (6.37): Indicates higher sensitivity, which implies that the options or choices presented within the prompt are critical in shaping the model's response.
  • S_prompt (12.86): Exhibits the highest sensitivity, underscoring the significant effect of the overall prompt structure on the model's behavior.

The main takeaway here is that the prompt instructions will have the most impact in guiding the model’s response. This is why we always tell teams that writing clear and specific instructions is step 1 in the prompt engineering process.

Which models are the most sensitive?

We’ve taken a deep dive into how different prompt engineering methods affect sensitivity, but what about different models?

We’ll turn our focus to On the Worst Prompt Performance of Large Language Models.

The researchers created their own dataset, ROBUSTALPACAEVAL, to better match real-world user queries than other popular benchmarks do. It does so by generating semantically similar queries that cover a broad range of phrasings.

The table below shows model performance on ROBUSTALPACAEVAL.

A table showing performance metrics for a few different models
Results on the ROBUSTALPACAEVAL benchmark. The model order is arranged according to their original performance.

  • A larger gap between the worst and best performance indicates higher sensitivity in the model
  • Llama-2-70B-chat had a large range, from 0.094 to 0.549. This huge range of values shows how sensitive LLMs can be. Semantically identical prompts can lead to vastly different results.
  • Although scaling up model sizes enhances performance, it does not necessarily improve robustness or decrease sensitivity
  • For instance, Llama-2-7B/13B/70B-chat shows improved instruction-following performance, rising from 0.195 to 0.292; however, robustness slightly declines, as indicated by an increase in standard deviation from 0.133 to 0.156
  • Similarly, scaling up the Gemma models increases average performance (from 0.153 in the 2b model to 0.31 in the 7b model) but results in lower robustness (0.191 compared to 0.118 in the 2b model).

Having identified the worst-performing prompts for different models, the study next explored specific trends, particularly:

  • Whether the worst prompts overlapped across models
  • Whether prompt rankings were consistent across various models

The following graph tells the story:

A chart with four lines
The overlap rate of model-agnostic worst-k prompts across different models

  • The overlap between the worst prompts across all models (the red line) is very low. This shows that there is no such thing as a universally “bad prompt”; it is all relative to the model.
  • You see somewhat better consistency within the same family of models, but the rate is still low. This suggests that even individual models within the same family have their own unique strengths and weaknesses.
  • It is essentially impossible to characterize the worst prompts without knowing the model.

Who is better at picking the better prompt, humans or LLMs?

Remember that first example we looked at, where I asked you to guess which prompt scored better? If you got it wrong, don’t feel bad; you probably did about as well as ChatGPT.

The researchers tested the model’s ability to discern prompt quality by presenting it with two prompts and asking it to pick the one that would “be more likely to yield a more helpful, accurate, and comprehensive response". I was shocked at the performance.

A table with four models and their scores on guessing which prompt performed best

All the models were right around the 50% mark, which is the same performance you would get if you just guessed randomly. So if the models can’t discern what a better performing prompt looks like, how could we depend on them to write prompts for us?

Wrapping up

Prompts are tricky, LLMs are tricky, and this whole new stack brings a set of problems that we aren’t used to solving in traditional software development. Slight changes in a prompt can significantly impact performance. This underscores the importance of having a prompt management tool (we can help with that) where you can easily test, version, and iterate on prompts as new models continue to come out.

Dan Cleary