There has been a lot of research into how LLMs can enhance prompts for better outputs (we wrote about a few of these methods here and here). But these methods and research papers have focused on shorter prompts.

So today we are focusing on how to systematically optimize long prompts, drawing from the Google research paper: Automatic Engineering of Long Prompts.

Why long prompts are challenging

Since ChatGPT launched, there has been a ton of research into prompt engineering. There are now several established design principles, like including specific instructions, demonstrating in-context examples, and using chain-of-thought reasoning.

These principles really do work, but they lead to longer prompts, which can be harder to handle.

Additionally, minor modifications to a prompt can significantly impact output quality. The longer the prompt, the more variables there are to control for. This highlights the importance of having a prompt management tool that allows for rigorous testing.

Previous automatic prompt optimization studies have focused on short prompts. Their methods center on replacing specific words or using an LLM to rewrite the entire prompt.

These methods don’t translate well to long prompts because:

  • The search space (the number of words) for word-replacement is too large
  • Completely rewriting a long prompt, while maintaining clarity and coherence, is challenging

Enter a new framework for optimizing longer prompts!

Hey guys, how's it going? Dan here, co-founder of PromptHub, back again to give you some of the latest research into prompt engineering. Today, we're focusing on long prompts and how to optimize them.

2023 has revealed a lot about prompt engineering. We're figuring out best practices, whether that be more specific instructions, adding in-context examples, or chain-of-thought reasoning. There are a bunch of different types of prompt engineering methods, from Chain of Thought to Algorithm of Thoughts. All these new methods work well, and we've written a lot about them, but they lead to longer and longer prompts. We're talking 2,000-3,000 token prompts. For example, the ChatGPT system message is about 1,600 tokens. We actually just did a deep dive on ChatGPT system messages, as well as those of a couple of other leading tech companies like Perplexity, TL;DR, and Vercel, and that's on our blog too.

The point here is that these prompts are getting a lot longer. Long prompts are more challenging to handle, iterate on, and make better. The more tokens there are, the more variables there are, and it's harder to control for all of those. There's been a lot of research into how to optimize prompts, but prompt optimizers are mostly suited for short prompts. They just don't work well on large prompts. This is because they usually focus on some sort of word replacement, and for long prompts, the search space is just too large for something like that. Alternatively, these prompt optimizers work on completely rewriting a prompt, which is not feasible for a long prompt. It's hard to maintain the coherence, clarity, and intent when doing that.

It's really important that when your prompt is optimized, you're left with something that at least looks familiar and sounds familiar. The team from Google DeepMind came out with a paper at the end of November that looked specifically at a framework to optimize long prompts. The goal here is to generate something that is semantically very similar to the original long prompt but achieves better outputs by whatever evaluation metric. The semantic similarity is very important because having something that's completely new and different from your original prompt isn't super helpful. You need something that feels the same, something that you can iterate on in the future, something that you can understand and grasp.

The method's pretty straightforward, with three major components. First, take the prompt and chop it up into its individual sentences. Then use an LLM to rephrase some of those sentences while preserving their meaning. Finally, add the rephrased sentences back if they performed well, and keep a pool of the top-performing prompts along the way.
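To make that concrete, here's a minimal Python sketch of the loop as I understand it. This is my own simplification, not the paper's code, and the split, rephrase, and evaluate callables are stand-ins for whatever sentence splitter, LLM call, and eval metric you plug in:

```python
import random

def optimize_long_prompt(prompt, split, rephrase, evaluate,
                         iterations=50, pool_size=4):
    """Illustrative sketch of the three-component loop (not the paper's code).

    split:    str -> list[str], chops the prompt into sentences
    rephrase: str -> str, an LLM call that preserves meaning
    evaluate: str -> float, scores a prompt on your eval set
    """
    pool = [(evaluate(prompt), prompt)]                # pool of (score, prompt)
    for _ in range(iterations):
        _, current = random.choice(pool)               # pick a candidate to refine
        sentences = split(current)                     # component 1: chop it up
        i = random.randrange(len(sentences))
        sentences[i] = rephrase(sentences[i])          # component 2: rephrase one
        candidate = " ".join(sentences)
        pool.append((evaluate(candidate), candidate))  # component 3: score and keep
        pool = sorted(pool, key=lambda sp: sp[0], reverse=True)[:pool_size]
    return max(pool, key=lambda sp: sp[0])[1]
```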

We'll jump into each one of these. Step one: breaking down the prompt into sentences. Very straightforward, just chop it up. Step two: rephrasing the sentences. Here is the prompt the researchers used to rephrase individual sentences from the original long prompt. It gives the model a role as a sentence rephraser, includes some examples, and then presents the next sentence to be rephrased. The examples are sentences whose rephrasings, when rerun on the evaluation metrics, are known to have increased performance. These get injected back in so the model is looking at good past rephrasings that can guide future ones.

Lastly is the prompt pool. Rather than having just one prompt that you're constantly rephrasing and adding back to, they kept a pool of prompts. If a rephrased sentence led to detrimental effects, they could revert to a different version. This also keeps the process from falling into a self-contained loop of editing the same sentences over and over again. By keeping a pool of prompts, they always had multiple candidates to continue the refinement process.

We'll look at the results here. This method is compared against a couple of other methods: the original prompt, a genetic algorithm, a Chain of Thought prompting method used to update the original prompt, and a greedy algorithm which uses a single prompt rather than a pool of prompts.

We can see our method over on the right, the long prompt optimization method, and the other methods here. On average, the new method outperforms by about 8.2%. There are some pretty big wins, especially in the logical deduction task set. The original prompt got a score of about 40, and the new one got 60, so you're looking at almost a 20-point gain. That's just from changing the semantics of sentences and running through this algorithm. We can also see the difference between the greedy method and this one. The only functional difference is the pool of prompts: the greedy method does not keep one, and adding that step leads to pretty good additional gains, about 4-5% across the board.

To look at a specific example, this is from the logical deduction test set. The original prompt had an accuracy of 38.8%, which increased to 56%, a relative gain of roughly 44%. That's just from changing some words. It took about 48 iterations, which might sound like a lot, but this is really a great thing for prompt engineers. It shows that you don't need anything fancy; just changing around words can lead to performance gains. For instance, changing "let's think step by step" to "let's think through one at a time" was one of the edits behind that jump.

Here's another example, going from about 60% accuracy to 92%, a gain of roughly 30 points, at iteration 91. It took more iterations, but this is a good sign for prompt engineers. If you want to put this into practice, mimicking the exact algorithm used by the team might be challenging if you don't have the same setup or time. So we built a PromptHub form that runs an algorithm mimicking the researchers' as closely as it can: input your prompt, and it works through it sentence by sentence, generates semantically similar versions, chooses the best ones, and gives you an optimized prompt on the other side. There's no guarantee the generated prompt will work better, so you should test it against the original. We have batch features within our product that let you run the prompt many times to see how it performs. This is just one way to put this into practice. We also have a template that's open source and available in the product if you want to see what it looks like under the hood.

Let us know if this helps you get better output, and happy prompting!

Methodology

The goal is to generate a new prompt that is semantically similar to the original prompt but achieves better outputs. Semantic similarity is key because we don't want a prompt version that is hard to interpret and understand.

The main operation in this technique involves replacing existing sentences in the prompt with semantically equivalent sentences.

Generating the new sentences is done via a simple prompt:

An image of the first version of the LLM-mutator prompt from the research paper
Version 1 of the LLM-mutator prompt

The first step involves breaking down the prompt into individual sentences.
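A simple way to do that split, as one plausible approach (the paper doesn't hinge on any particular splitter):

```python
import re

def split_into_sentences(prompt: str) -> list[str]:
    # Naive split on sentence-ending punctuation followed by whitespace.
    # A library splitter like nltk's sent_tokenize would be more robust.
    return [s for s in re.split(r"(?<=[.!?])\s+", prompt.strip()) if s]
```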

Next, the LLM begins to rephrase the sentences while preserving the original meaning.
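As a rough sketch of what that rephrasing call might look like, here's a version using the OpenAI Python client. The system message below is my paraphrase of the idea, not the paper's exact LLM-mutator prompt (that's shown in the image above), and the model name is just a placeholder:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

def llm_rephrase(sentence: str) -> str:
    """Ask the model for a semantically equivalent rewrite of one sentence."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any capable chat model works
        messages=[
            {"role": "system", "content": (
                "You rephrase sentences. Rewrite the user's sentence so it "
                "keeps exactly the same meaning but uses different wording."
            )},
            {"role": "user", "content": sentence},
        ],
    )
    return response.choices[0].message.content.strip()
```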

A pool of the top-performing prompts is kept. During each iteration, one prompt from this pool is chosen for further refinement. This ensures that even if the method makes detrimental edits, it can always fall back to a different version of the prompt from the pool.
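The pool bookkeeping itself is simple. Here's a hedged sketch of one way to do it (my own helpers, not the paper's implementation):

```python
import random

def update_pool(pool, candidate, score, k=4):
    """Add a new (score, prompt) pair and keep only the k best seen so far."""
    pool = pool + [(score, candidate)]
    pool.sort(key=lambda sp: sp[0], reverse=True)
    return pool[:k]

def pick_for_refinement(pool):
    # Sampling instead of always refining the single best prompt keeps the
    # search from endlessly re-editing the same sentences, and a bad edit
    # can always be abandoned for another candidate still in the pool.
    return random.choice(pool)[1]
```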

To further improve the optimization method, the researchers implemented a search history. They updated the LLM-mutator prompt to include examples of previously rephrased sentences that led to better outcomes.

Here’s the updated prompt:

An image of the second version of the LLM-mutator prompt from the research paper
Version 2 of the LLM-mutator prompt
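In code terms, the update just means prepending past (original, rewrite) pairs that improved results to the rephrasing request. Here's a sketch building on the client from the earlier snippet; the wording is illustrative, not the paper's exact V2 prompt:

```python
def llm_rephrase_with_history(sentence: str,
                              history: list[tuple[str, str]]) -> str:
    """history holds (original, rewrite) pairs whose rewrites improved accuracy."""
    examples = "\n\n".join(
        f"Original: {old}\nRephrased: {new}" for old, new in history
    )
    prompt = (
        "You rephrase sentences while preserving their meaning.\n"
        "Here are past rephrasings that improved performance:\n\n"
        f"{examples}\n\n"
        f"Now rephrase this sentence:\n{sentence}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```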

But how can we ensure we are selecting the most impactful sentences?

The researchers implemented several algorithms to solve the selection problem. I’m not going to dive into the details here, but you can see more in the paper if you’d like.

Experiment results

The test prompts for this method were divided into two parts: a Task Description and Demos. The Task Description provides the instructions, and each Demo includes a question, a chain-of-thought demonstration, and the final answer.

An example prompt used in the dataset for the experiments
An example prompt from the dataset
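As a rough illustration of that structure, here's how such a prompt might be assembled; the field names are my own, not the paper's:

```python
def build_prompt(task_description: str, demos: list[dict]) -> str:
    """Assemble a test prompt from a task description plus worked demos."""
    parts = [task_description]
    for demo in demos:
        parts.append(
            f"Q: {demo['question']}\n"
            f"{demo['chain_of_thought']}\n"
            f"A: {demo['answer']}"
        )
    return "\n\n".join(parts)
```

The optimizer then edits this whole assembled text, sentences from the demos included, not just the task description.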

The researchers compared a few different automatic prompt engineering methods:

  • Original prompt (baseline): The original human-designed prompt from the dataset
  • Genetic Algorithm: An implementation of the genetic algorithm to tune long prompts, differing slightly from the primary method discussed above.
  • Evolve “step-by-step”: Optimizes the original prompt via single-sentence optimization of the popular chain-of-thought trigger (“let’s think step by step”).
  • Greedy: Stores only the single top-performing prompt, rather than a pool of prompts

Here are the results!

A table of results from the experiment

Takeaways

  • Across all 8 tasks, the new method (“Our Method”) achieves an average gain of 8.2% in test accuracy and 9.2% in overall evaluation accuracy (test + train).
  • The largest gains (~18.5%) were on the logical deduction tasks
  • Comparing Evolve to Original Prompt, it’s clear that just adding “think it through step-by-step” doesn’t provide substantial improvements. Since these are long prompts, it makes sense that adding a single sentence doesn’t move the needle materially.

A specific example of a prompt being optimized from the study
A prompt optimization example

The prompt’s accuracy improved from 38.8% to 56% (a ~44% relative increase). This improvement took 48 iterations, which goes to show that prompt engineering and prompt management require a lot of iteration to get to better results.

Most of the changes are minor adjustments to the original sentences. This is really promising for anyone working on prompts, and reinforces the concept that small changes can go a long way.

A specific example of a prompt being optimized from the study
A prompt optimization example

Here’s another example, where the prompt’s accuracy improved by roughly 30 points by iteration 91.

Putting this into practice

Implementing the same algorithms discussed in the paper is possible, but would require some effort. Instead, here are a few easier ways to apply these same insights to your prompts.

Generate semantically similar prompt versions using LLMs

We made a form using PromptHub that takes a prompt as input, applies an algorithm to rephrase some sentences while retaining their meaning, and outputs a new prompt version. You can try it out below or here.

This functionality is also available as a template in PromptHub, allowing you to generate multiple semantically similar prompt versions simultaneously using Batch testing.

PromptHub template

Test variations in PromptHub

The core of this optimization method lies in making minor semantic edits. One way to put this into practice is to test prompts and outputs side-by-side in PromptHub: make semantic edits until your new version outperforms the old one, then repeat.
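If you'd rather script that comparison than eyeball it, the loop is only a few lines. A sketch, where run_prompt stands in for your own model call and test_cases are your own labeled examples:

```python
def accuracy(prompt, test_cases, run_prompt):
    """Share of test cases where the model's answer matches the expected one.

    test_cases: list of (question, expected_answer) pairs
    run_prompt: callable (prompt, question) -> model output string
    """
    correct = sum(
        run_prompt(prompt, question).strip() == expected
        for question, expected in test_cases
    )
    return correct / len(test_cases)
```

Score the original and the edited version on the same cases, keep whichever wins, then edit and repeat.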

Wrapping up

My biggest takeaway is that swapping out words and phrases can get you extremely far in prompt engineering.

No magic or intricate engineering needed. All you need is a way to generate semantically similar prompt versions and a systematic way to test prompts. We can help you out with both!

Dan Cleary
Founder