Small edits to prompts can have major impacts on LLM outputs. We recently wrote about how adding a few polite keywords can lead to better outputs.

One of the more popular and simple ways to achieve better outputs is to leverage Chain-of-Thought (CoT) reasoning by adding phrases like “Think step by step” to your prompts. But this modification does not universally increase prompt performance for every model or use case. For example, the PaLM 2 technical report found that CoT directives in prompts sometimes led to worse outputs.

This illustrates an extremely important point about prompt engineering: what works well for one model might not work for another. A recent paper from VMware drives this point home.

The researchers set out to study the impact of various “positive thinking” additions to the system message of a prompt. For example, appending “This will be fun” to a prompt.

On top of dissecting how specific human-written phrases affected prompts, the researchers also used an LLM to optimize prompts and pitted the two groups against each other.

The paper is packed with takeaways, so let’s jump in.

Experiment setup

The researchers created 60 prompt variations from 5 openers, 3 task descriptions, and 4 closers.

Here's what the template looked like:

Here are the “positive thinking” components:

Openers

  • None.
  • You are as smart as ChatGPT.
  • You are highly intelligent.
  • You are an expert mathematician.
  • You are a professor of mathematics.

Task Descriptions

  • None.
  • Solve the following math problem.
  • Answer the following math question.

Closers

  • None.
  • This will be fun!
  • Take a deep breath and think carefully.
  • I really need your help!

The researchers tested the performance of these 60 combinations on a math dataset, and tested versions with and without Chain of Thought reasoning.
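
To make the combinatorics concrete, here is a minimal Python sketch (not from the paper) of how the 60 system-message variants could be assembled. The build_system_message helper is a hypothetical name, and the component strings mirror the lists above, with “None.” represented as an empty string.

from itertools import product

openers = ["", "You are as smart as ChatGPT.", "You are highly intelligent.",
           "You are an expert mathematician.", "You are a professor of mathematics."]
task_descriptions = ["", "Solve the following math problem.",
                     "Answer the following math question."]
closers = ["", "This will be fun!", "Take a deep breath and think carefully.",
           "I really need your help!"]

def build_system_message(opener, task, closer):
    # Concatenate the non-empty pieces into a single system message.
    return " ".join(part for part in (opener, task, closer) if part)

variants = [build_system_message(o, t, c)
            for o, t, c in product(openers, task_descriptions, closers)]
print(len(variants))  # 5 * 3 * 4 = 60 combinations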

Dataset: GSM8K

Models: Mistral-7B, Llama2-13B, and Llama2-70B

Scoring: Exact Match (EM). Did the model provide the exact numerical answer or not? The prompts are compared to a baseline, which serves as a control for evaluating the impact of the positive thinking components. The baseline in this case is when the model receives no system message.
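
As a rough illustration of the scoring, and not the authors’ actual evaluation code, Exact Match on GSM8K-style answers might be computed along these lines; the number-extraction regex is an assumption about how the final answer gets pulled out of a response.

import re

def extract_final_number(text: str):
    # Pull the last number out of the model's response (an assumed heuristic;
    # GSM8K gold answers end with a final numeric value).
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None

def exact_match(model_output: str, gold_answer: str) -> bool:
    # EM counts only if the extracted number equals the gold answer exactly.
    return extract_final_number(model_output) == gold_answer.strip()

def em_score(outputs, gold_answers):
    return sum(exact_match(o, g) for o, g in zip(outputs, gold_answers)) / len(gold_answers)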

Experiment method

To analyze the impact of 'positive thinking' phrases, the researchers broke down the dataset into subsets containing the first 10, 25, 50, and 100 questions. This allows the analysis to show the impact of the “positive thinking” phrases as the dataset size increases.

The researchers incorporated examples into the prompt (few-shot learning).
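
Putting the subsets and few-shot examples together, a sketch of what the evaluation loop could look like is below. The ask_model and score_fn hooks are hypothetical stand-ins for the model call and the Exact Match scorer; the paper doesn’t publish this exact code.

SUBSET_SIZES = [10, 25, 50, 100]

def build_user_prompt(question, few_shot_examples):
    # Prepend worked examples (few-shot learning) before the actual question.
    shots = "\n\n".join(f"Q: {ex['question']}\nA: {ex['answer']}" for ex in few_shot_examples)
    return f"{shots}\n\nQ: {question}\nA:"

def evaluate_subsets(system_message, dataset, few_shot_examples, ask_model, score_fn):
    # ask_model(system, user) returns the model's text; score_fn(outputs, golds) returns EM.
    results = {}
    for n in SUBSET_SIZES:
        subset = dataset[:n]  # subsets are the first 10/25/50/100 questions
        outputs = [ask_model(system_message, build_user_prompt(q["question"], few_shot_examples))
                   for q in subset]
        results[n] = score_fn(outputs, [q["answer"] for q in subset])
    return results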

Automatic prompt optimization

In addition to testing 'positive thinking' phrases, the researchers tested prompts optimized by LLMs, pitting the human-written positive thinking prompts against the automatically optimized ones.

Each model optimized its own prompts; there was no cross-model optimization.
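
The paper’s exact optimization procedure isn’t reproduced here, but the general idea, where a model proposes new system messages based on the scores of previous candidates and keeps the best one, can be sketched roughly as follows. The meta-prompt wording and helper names are assumptions.

META_PROMPT = (
    "You are optimizing a system message for solving grade-school math problems. "
    "Here are the current best system messages and their scores:\n{history}\n"
    "Propose one new system message that you expect to score higher."
)

def optimize_prompt(ask_model, score_fn, seed_prompt, rounds=10):
    # ask_model(prompt) -> text completion; score_fn(system_message) -> EM on the optimization set.
    # Both are hypothetical hooks; the same model proposes and is scored (no cross-model optimization).
    history = [(seed_prompt, score_fn(seed_prompt))]
    for _ in range(rounds):
        top = sorted(history, key=lambda x: -x[1])[:5]
        formatted = "\n".join(f"{s:.3f}: {p}" for p, s in top)
        candidate = ask_model(META_PROMPT.format(history=formatted)).strip()
        history.append((candidate, score_fn(candidate)))
    return max(history, key=lambda x: x[1])  # best (prompt, score) pair found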

Experiment results: Human optimized prompts

Here are the results that we’ll review for all three of the models.

A quick yet relevant note about model names: the '7B' in 'Mistral-7B' signifies the model has 7 billion parameters, which is relatively small compared to, for instance, GPT-4's rumored size of more than a trillion parameters.

The two main metrics we’ll look at to judge the effectiveness of the positive thinking phrases are:

  • Standard deviation (EM Std Dev): A low standard deviation means that all the prompts performed similarly, and thus the different directives didn’t make much of a difference.
  • EM Baseline vs. EM Mean: if the baseline is higher than the mean, the directives hurt performance, and vice versa (see the sketch below).
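
In code terms, those two checks boil down to something like this sketch (my variable names, not the authors’):

from statistics import mean, stdev

def summarize(variant_em_scores, baseline_em):
    # variant_em_scores: one EM score per system-message variant; baseline_em: EM with no system message.
    em_mean, em_std = mean(variant_em_scores), stdev(variant_em_scores)
    verdict = "directives helped" if em_mean > baseline_em else "directives hurt"
    return {"EM Mean": em_mean, "EM Std Dev": em_std, "EM Baseline": baseline_em, "verdict": verdict}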

Table of results for positive thinking prompts
Results from testing the 60 positive thinking prompt combinations

Mistral-7B

Without Chain of Thought reasoning, Mistral’s performance remained extremely consistent. There was zero deviation on the 10 and 25 question sets, and a very small amount on the 100 question subset (0.007). All that to say, the different positive thinking directives didn’t seem to have much of an effect.

When prompted with Chain of Thought, the standard deviation decreases as the number of questions increases. Additionally, you can see that the EM Mean outperforms the EM Baseline by a fair amount. So in this case, the positive thinking prompts made a significant positive difference in output correctness.

List of example combinations of positive thinking prompt components and their scores
A few examples of the best performing prompt variants

Llama2-13B

Without Chain of Thought reasoning, Llama2-13B shows the opposite trend to Mistral, with deviation decreasing from 0.014 at 25 questions to 0.003 at 100 questions.

With Chain of Thought prompting included, the trend isn’t as clear. Overall, the standard deviation does decrease from 0.026 at 10 questions to 0.016 at 100, but at 50 it’s even lower, at 0.012.

Overall, the standard deviations in both cases are quite low, pointing to the fact that, for Llama2-13B, the positive thinking directives had little effect.

Llama2-70B

Without Chain of Thought prompting, the standard deviation follows a similar trend to Llama2-13B, decreasing from 0.017 at 25 questions to 0.005 at 100 questions.

Without CoT, positive thinking prompts significantly underperformed relative to baselines, marking a notable divergence from the other models' trends.

Overall

Across all three models, the standard deviation tended to increase when Chain of Thought reasoning was used. This might suggest that CoT reasoning can lead to a wider range of outcomes, possibly due to the models engaging in more complex reasoning paths that are sensitive to the nuances of the prompts.

Aside from that, the only trend was no trend. There wasn’t a single prompt snippet that could be used across all models to increase performance.

This may be the most important takeaway from the paper, and is extremely relevant to anyone using prompts to do anything. What is best for a given model, on a given dataset, may be specific to that combination.

Because of this, the researchers moved from hand-tuning the prompts with positive thinking messaging to having the LLMs automatically optimize the prompts.

Experiment results: Automatic prompt optimization

Overall, the prompts optimized by LLMs often performed on the same level as, or outperformed, the manually written prompts with the positive thinking directives.

So is prompt engineering dead? Should we just let LLMs do it? I would say no, for a few reasons:

  • I don’t think the human-crafted prompts were all that good (see examples above). They just tacked phrases onto a core prompt, without much rhyme or reason.
  • A very important part of prompt engineering is knowing what has gotten you to your current iteration. If you abstract that all away and let an LLM control the whole process, the prompt becomes hard to maintain.

We’ve written about using LLMs to optimize prompts a few times now (Using LLMs to Optimize Your Prompts, How to Optimize Long Prompts, RecPrompt). Our opinion then and now is that a mix of humans and LLMs gets the best results.

With that out of the way, let’s look at some results.

Table of results from automated prompt optimization
“OS EM” is Exact Match on the Optimization Set. “ES EM” is for the Evaluation Set. “Avg EM” is the average for the two sets. “EM Delta” is the difference between the Exact Match for the two sets. All prompts are with Chain of Thought.

Human 'optimized' versus LLM-optimized prompts were evaluated using two metrics:

  • Raw performance scores (Avg EM)
  • The delta between scores on the optimization set and the evaluation set. A low delta implies the prompt is generalizable. The best prompts have a high Avg EM and a low delta (see the sketch below).
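
In code terms, the two metrics reduce to something like this sketch (the function name and the sign convention for the delta are my assumptions):

def avg_em_and_delta(os_em: float, es_em: float):
    # os_em: Exact Match on the optimization set; es_em: Exact Match on the evaluation set.
    avg_em = (os_em + es_em) / 2
    delta = abs(os_em - es_em)  # low delta -> the prompt generalizes beyond the set it was tuned on
    return avg_em, delta

# Example: avg_em_and_delta(0.80, 0.72) -> roughly (0.76, 0.08)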

Looking at Mistral, you’ll see that the “Positive Thinking” prompts have a lower delta for 10, 25, and 50 questions, but the automatically optimized prompts have a lower delta for 100 questions. In contrast, the larger Llama-2 models consistently show a lower delta across all cases when an LLM automatically optimizes the prompts.

Why does that matter? It translates to the following takeaway: model size matters when deciding whether to use the model to help with prompt engineering. If the model is larger than 7B, the research suggests leveraging the model to optimize the prompt.

The most interesting part of this whole paper was checking out the automatically optimized prompt examples and how much they differ from what a human may come up with.

This was the highest-scoring prompt generated by Llama2-70B:

Here are a few more examples:

Llama2-13B Optimized Prompt & Prefix NoQ=10

Mistral-7B Optimized Prompt & Prefix NoQ=50

Llama2-70B Optimized Prompt & Prefix NoQ=10

One last finding

The researchers compared the results they saw on the GSM8K dataset to the scores reported by the model providers (Meta, Mistral).

Table of results comparing the reported EM scores versus the EM scores from this study

Interestingly, the average difference was quite large for Mistral-7B and Llama2-13B, but for Llama2-70B it was closer to an acceptable margin of error.

Unfortunately, neither Mistral nor Meta released the prompts used in their tests, so reproducing their benchmark scores can be challenging.

Benchmark scores should be taken with a grain of salt because of this lack of reproducibility and transparency. If your prompt engineering skills aren’t as strong as Meta’s, your outputs are going to underperform relative to the benchmark.

Wrapping up

We’ve said it before and we’ll say it again, small modifications to a prompt can have major effects.

What this paper makes exceptionally clear is that what works for one model won’t necessarily translate to another model. This highlights the importance of thoroughly testing your prompts as models get updated, or when trying out new techniques.

Dan Cleary
Founder