Small edits to prompts can have major impacts on outputs from LLMs. We recently wrote about how adding a few polite keywords can help lead to better outputs.
One of the most popular and simplest modifications for getting better outputs is Chain-of-Thought (CoT) reasoning: adding a phrase like “Think step by step” to your prompt. But this modification does not universally increase prompt performance for every model or use case. For example, the PaLM 2 technical report found that using CoT directives in prompts sometimes led to adverse effects.
This illustrates an extremely important point about prompt engineering: what works well for one model might not work for another. A recent paper from VMware further illustrates this point.
The researchers set out to study the impact of various “positive thinking” additions to the system message of a prompt. For example, appending “This will be fun” to a prompt.
On top of dissecting how specific human-written phrases affected prompts, the researchers also used an LLM to optimize prompts and pitted the two groups against each other.
The paper is packed with takeaways, so let’s jump in.
Experiment setup
The researchers created 60 prompt variations from 5 openers, 3 task descriptions, and 4 closers (5 × 3 × 4 = 60).
Here's what the template looked like:
Here are the “positive thinking” components (a short sketch of how they combine into full system messages follows the lists):
Openers
- None.
- You are as smart as ChatGPT.
- You are highly intelligent.
- You are an expert mathematician.
- You are a professor of mathematics.
Task Descriptions
- None.
- Solve the following math problem.
- Answer the following math question.
Closers
- None.
- This will be fun!
- Take a deep breath and think carefully.
- I really need your help!
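To make the setup concrete, here is a minimal sketch of how the 60 system-message variations could be assembled from these components. The component text comes from the paper; how the pieces are joined into a single string is our assumption, since the paper’s exact template isn’t reproduced here.

```python
from itertools import product

# The component lists from the paper; "None." in the paper means the slot is left empty.
openers = [
    "",
    "You are as smart as ChatGPT.",
    "You are highly intelligent.",
    "You are an expert mathematician.",
    "You are a professor of mathematics.",
]
task_descriptions = [
    "",
    "Solve the following math problem.",
    "Answer the following math question.",
]
closers = [
    "",
    "This will be fun!",
    "Take a deep breath and think carefully.",
    "I really need your help!",
]

# How the paper formats its template is not reproduced here; concatenating the
# non-empty parts is an assumption made purely for illustration.
system_messages = [
    " ".join(part for part in (opener, task, closer) if part)
    for opener, task, closer in product(openers, task_descriptions, closers)
]

assert len(system_messages) == 5 * 3 * 4  # 60 variations
```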
The researchers tested the performance of these 60 combinations on a math dataset, and tested versions with and without Chain of Thought reasoning.
Dataset: GSM8K
Models: Mistral-7B, Llama2-13B, and Llama2-70B
Scoring: Exact Match (EM). Did the model provide the exact numerical answer or not? The prompts are compared against a baseline, which serves as a control for evaluating the impact of the positive thinking components. The baseline in this case is the model receiving no system message.
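As a rough illustration, EM on GSM8K can be scored by pulling the final number out of the model’s response and comparing it to the reference answer. The extraction regex below is our own simplification, not the paper’s evaluation harness.

```python
import re

def extract_final_number(text: str) -> str | None:
    """Return the last number that appears in a model response, if any."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None

def exact_match(model_answer: str, reference: str) -> bool:
    """Exact Match: does the model's final number equal the reference answer?"""
    predicted = extract_final_number(model_answer)
    return predicted is not None and float(predicted) == float(reference)

# A correct GSM8K-style response scores 1, anything else scores 0.
print(exact_match("Step by step ... so the answer is 42.", "42"))  # True
```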
Experiment method
To analyze the impact of 'positive thinking' phrases, the researchers broke down the dataset into subsets containing the first 10, 25, 50, and 100 questions. This allows the analysis to show the impact of the “positive thinking” phrases as the dataset size increases.
The researchers incorporated examples into the prompt (few-shot prompting).
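For context, few-shot prompting simply means prepending worked examples before the real question. A minimal sketch follows; the exemplar is illustrative, not one of the examples the researchers actually used.

```python
# Few-shot prompting: prepend worked examples before the question we actually care about.
# The exemplar below is made up for illustration; the paper's exemplars are not reproduced here.
few_shot_examples = [
    {
        "question": "Sara has 3 apples and buys 2 more. How many apples does she have?",
        "answer": "Sara starts with 3 apples and buys 2 more, so 3 + 2 = 5. The answer is 5.",
    },
]

def build_prompt(question: str) -> str:
    parts = []
    for ex in few_shot_examples:
        parts.append(f"Q: {ex['question']}\nA: {ex['answer']}")
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

print(build_prompt("A farmer has 12 cows and sells 4. How many cows are left?"))
```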
Automatic prompt optimization
In addition to testing 'positive thinking' phrases, the researchers tested prompts optimized by LLMs. They pitted the human-generated positive thinking prompts against the auto optimized prompts.
Each model optimized its own prompts; there was no cross-model optimization.
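The paper’s exact optimization procedure isn’t reproduced here, but the loop below sketches the general idea behind LLM-driven prompt optimization: the model proposes a new system message, the candidate is scored on an optimization set, and the best-scoring message is kept. The two helper functions are placeholders for whatever inference and evaluation harness you actually use, not a real API.

```python
def generate(model, prompt: str) -> str:
    """Placeholder: call the model and return its completion."""
    raise NotImplementedError

def score_on_subset(model, system_message: str, questions) -> float:
    """Placeholder: run the questions with this system message and return Avg EM."""
    raise NotImplementedError

def optimize_system_message(model, seed_message: str, questions, rounds: int = 5) -> str:
    """Repeatedly ask the model for a better system message; keep the best-scoring one."""
    best_message = seed_message
    best_score = score_on_subset(model, best_message, questions)
    for _ in range(rounds):
        proposal = generate(
            model,
            "Here is a system message used for solving grade-school math problems:\n"
            f"{best_message}\n"
            "Rewrite it so that more problems are answered correctly. "
            "Return only the new system message.",
        )
        score = score_on_subset(model, proposal, questions)
        if score > best_score:
            best_message, best_score = proposal, score
    return best_message
```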
Experiment results: Human optimized prompts
Here are the results that we’ll review for all three of the models.
A quick, yet relevant note about model names. The '7B' in 'Mistral-7B' signifies that the model has 7 billion parameters, which is relatively small compared to, for instance, GPT-4's rumored more than one trillion parameters.
The two main metrics we’ll look at to judge the effectiveness of the positive thinking phrases are:
- Standard deviation (EM Std Dev): A low standard deviation means that all the prompts performed similarly, and thus the different directives didn’t make much of a difference.
- EM Baseline vs EM Mean: if the Baseline is greater than the Mean, the directives hurt performance; if the Mean is greater, they helped (both checks are sketched in code below).
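The sketch below assumes you already have a per-prompt EM score for each variant plus the no-system-message baseline score on the same question subset; the scores shown are placeholder values, not results from the paper.

```python
from statistics import mean, stdev

# Placeholder per-prompt EM scores for the "positive thinking" variants
# (in practice there would be 60 entries), plus the no-system-message baseline.
prompt_scores = {
    "expert_mathematician + solve + deep_breath": 0.62,
    "no_opener + answer + no_closer": 0.60,
    "professor + solve + this_will_be_fun": 0.59,
}
baseline_score = 0.58

em_mean = mean(prompt_scores.values())
em_std_dev = stdev(prompt_scores.values())

# Low std dev -> the directives barely changed anything.
# Mean below baseline -> the directives hurt; mean above baseline -> they helped.
print(f"EM Std Dev: {em_std_dev:.3f}")
print(f"EM Mean {em_mean:.3f} vs EM Baseline {baseline_score:.3f}")
```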
Mistral-7B
Without Chain of Thought reasoning, Mistral’s performance remained extremely consistent. There was zero deviation on the 10 and 25 question subsets, and only a very small amount on the 100 question subset (0.007). All that to say, the different positive thinking directives didn’t seem to have much of an effect.
When prompted with Chain of Thought, the standard deviation decreases as the number of questions increases. Additionally, you can see that the EM Mean outperforms the EM Baseline by a fair amount. So in this case, the positive thinking prompts made a significant positive difference in output correctness.
Llama2-13B
Without Chain of Thought reasoning, Llama shows the opposite trend to Mistral, with deviation decreasing from 0.014 at 25 questions to 0.003 at 100 questions.
With Chain of Thought prompting included, the trend isn’t as clear. Overall, the standard deviation does decrease from 0.026 at 10 questions to 0.016 at 100, but at 50 questions it’s even lower, at 0.012.
Overall, the standard deviations in both cases are quite low, pointing to the fact that, for Llama2-13B, the positive thinking directives had little effect.
Llama2-70B
Without Chain of Thought prompting, the standard deviation follows a similar trend to Llama2-13B, decreasing from 0.017 at 25 questions to 0.005 at 100 questions.
Without CoT, positive thinking prompts significantly underperformed relative to baselines, marking a notable divergence from the other models' trends.
Overall
Across all three models, the standard deviation tended to increase when Chain of Thought reasoning was used. This might suggest that CoT reasoning can lead to a wider range of outcomes, possibly due to the models engaging in more complex reasoning paths that are sensitive to the nuances of the prompts.
Aside from that, the only trend was no trend. There wasn’t a single prompt snippet that could be used across all models to increase performance.
This may be the most important takeaway from the paper, and is extremely relevant to anyone using prompts to do anything. What is best for a given model, on a given dataset, may be specific to that combination.
Because of this, the researchers moved from hand-tuning the prompts with positive thinking messaging, to having the LLMs automatically optimize the prompts.
Experiment results: Automatic prompt optimization
Overall, the prompts that were optimized by LLMs often performed on par with, or outperformed, the manually crafted prompts with the positive thinking directives.
So is prompt engineering dead? Should we just let LLMs do it? I would say no, for a few reasons:
- I don’t think the human-crafted prompts were all that good (see the examples above). They just tacked phrases onto a core prompt without much rhyme or reason.
- A very important part of prompt engineering is knowing what got you to your current iteration. If you abstract all of that away and let an LLM control the whole process, the prompt becomes hard to maintain.
We’ve written about using LLMs to optimize prompts a few times now (Using LLMs to Optimize Your Prompts, How to Optimize Long Prompts, RecPrompt). Our opinion then and now is that a mix of humans and LLMs gets the best results.
With that out of the way, let’s look at some results.
The human 'optimized' prompts and the LLM-optimized prompts were evaluated using two metrics:
- Raw performance scores (Avg EM)
- The delta between scores on the optimization set and the evaluation set: a low delta implies the prompt is generalizable. The best prompts have a high Avg EM and a low delta (see the sketch after this list).
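Computing the delta is simple. The sketch below assumes you have an Avg EM score for a given prompt on both the optimization set and the held-out evaluation set; the numbers are placeholders, not results from the paper.

```python
def generalization_delta(optimization_em: float, evaluation_em: float) -> float:
    """Gap between optimization-set and evaluation-set EM; smaller means the
    prompt generalizes better beyond the questions it was tuned on."""
    return abs(optimization_em - evaluation_em)

# Placeholder scores: a prompt that keeps a high Avg EM while showing a small
# delta is the one you want.
print(round(generalization_delta(0.71, 0.68), 3))  # 0.03
```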
Looking at Mistral, you’ll see that the “Positive Thinking” prompts have a lower delta for 10, 25, and 50 questions, but the automatically optimized prompts have a lower delta at 100 questions. In contrast, the larger Llama2 models consistently show a lower delta across all cases when an LLM is left to automatically optimize the prompts.
Why does that matter? It translates to the following takeaway: model size matters when deciding whether to use the model itself in the prompt engineering process. If the model is larger than 7B parameters, the research suggests leveraging it to optimize the prompt.
The most interesting part of this whole paper was checking out the automatically optimized prompt examples and how much they differ from what a human may come up with.
This was the highest-scoring prompt generated by Llama2-70B:
Here are a few more examples:
- Llama2-13B Optimized Prompt & Prefix, NoQ=10
- Mistral-7B Optimized Prompt & Prefix, NoQ=50
- Llama2-70B Optimized Prompt & Prefix, NoQ=10
One last finding
The researchers compared the results they saw on the GSM8K dataset to the scores reported by the model providers (Meta, Mistral).
Interestingly, the average difference was quite large for Mistral-7B and Llama2-13B, but for Llama2-70B it was closer to an acceptable margin of error.
Unfortunately, neither Mistral nor Meta released the prompts used in their tests, so reproducing their benchmark scores can be challenging.
Benchmark scores should be taken with a grain of salt because of this lack of reproducibility and transparency. If your prompt engineering skills aren’t as strong as Meta’s, then your outputs are going to underperform relative to the benchmark.
Wrapping up
We’ve said it before and we’ll say it again: small modifications to a prompt can have major effects.
What this paper makes exceptionally clear is that what works for one model won’t necessarily translate to another model. This highlights the importance of thoroughly testing your prompts as models get updated, or when trying out new techniques.