So today we are focusing on how to systematically optimize long prompts, drawing from the Google research paper: Automatic Engineering of Long Prompts.
Why long prompts are challenging
Since ChatGPT launched, there has been a ton of research into prompt engineering. There are now several established design principles, like including specific instructions, demonstrating in-context examples, and using chain-of-thought reasoning.
These principles do really work, but they lead to longer prompts, which can be harder to handle.
Additionally, minor modifications to a prompt can significantly impact output quality. The longer the prompt, the more variables there are to account for. This highlights the importance of having a prompt management tool that allows for rigorous testing.
Previous automatic prompt optimization studies focused on short prompts. Their methods either replace specific words or use the LLM to rewrite the entire prompt.
These methods don’t translate well to long prompts because:
- The search space (the number of words) for word-replacement is too large
- Completely rewriting a long prompt, while maintaining clarity and coherence, is challenging
Enter a new framework for optimizing longer prompts!
The goal is to generate a new prompt that is semantically similar to the original prompt but achieves better outputs. Semantic similarity is key because we don’t want a prompt version that is hard to interpret and understand.
The main operation in this technique involves replacing existing sentences in the prompt with semantically equivalent sentences.
Generating the new sentences is done via a simple prompt:
The first step involves breaking down the prompt into individual sentences.
Next, the LLM begins to rephrase the sentences while preserving the original meaning.
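To make the sentence-level operation concrete, here is a minimal sketch of splitting a prompt into sentences and swapping one of them for a rephrased version. The helper names (`split_sentences`, `replace_sentence`, `toy_rephrase`) are hypothetical, the splitter is deliberately naive, and `toy_rephrase` is only a stand-in for the LLM call. None of this is taken from the paper's code.

```python
import re


def split_sentences(prompt: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", prompt.strip())
    return [p for p in parts if p]


def replace_sentence(sentences: list[str], index: int, new_sentence: str) -> str:
    """Return a new prompt with exactly one sentence swapped out."""
    updated = sentences.copy()
    updated[index] = new_sentence
    return " ".join(updated)


def toy_rephrase(sentence: str) -> str:
    """Stand-in for an LLM rephrasing call (assumption: a real setup would query a model)."""
    return sentence.replace("Answer", "Respond to")


prompt = (
    "You are a careful assistant. "
    "Answer the question step by step. "
    "Give the final answer last."
)
sentences = split_sentences(prompt)
new_prompt = replace_sentence(sentences, 1, toy_rephrase(sentences[1]))
print(new_prompt)
```

A regex splitter like this will trip over abbreviations and decimals, so a real pipeline would likely want a proper sentence tokenizer.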
A pool of the top performing prompts is kept. During each iteration, one prompt from this pool is chosen for further refinement. This ensures that even if the method makes detrimental edits, it can always fall back to a different version of the prompt from the pool.
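The pool idea can be sketched in a few lines. This is a simplified illustration rather than the paper's implementation: `pool_search`, `toy_mutate`, and `toy_score` are hypothetical names, the parent is picked uniformly at random, and the scoring function is a toy stand-in for measuring accuracy on a training set.

```python
import random
from typing import Callable


def pool_search(
    initial_prompt: str,
    mutate: Callable[[str, random.Random], str],
    score: Callable[[str], float],
    pool_size: int = 4,
    iterations: int = 50,
    seed: int = 0,
) -> tuple[str, float]:
    """Keep the top `pool_size` prompts; refine one of them per iteration."""
    rng = random.Random(seed)
    pool = [(score(initial_prompt), initial_prompt)]
    for _ in range(iterations):
        _, parent = rng.choice(pool)       # pick one pooled prompt to refine
        child = mutate(parent, rng)
        if any(child == p for _, p in pool):
            continue                       # skip variants already in the pool
        pool.append((score(child), child))
        pool.sort(key=lambda item: item[0], reverse=True)
        del pool[pool_size:]               # a bad edit simply drops out here
    best_score, best_prompt = pool[0]
    return best_prompt, best_score


# Toy stand-ins (assumptions): a word-swap mutator and a keyword-counting score.
SWAPS = {"Answer": "Solve", "Solve": "Answer", "quickly": "carefully", "carefully": "quickly"}


def toy_mutate(prompt: str, rng: random.Random) -> str:
    words = prompt.split()
    candidates = [i for i, w in enumerate(words) if w in SWAPS]
    if not candidates:
        return prompt
    i = rng.choice(candidates)
    words[i] = SWAPS[words[i]]
    return " ".join(words)


def toy_score(prompt: str) -> float:
    return float(sum(w in {"Solve", "carefully"} for w in prompt.split()))


best_prompt, best_score = pool_search("Answer the question quickly", toy_mutate, toy_score)
print(best_prompt, best_score)
```

Because the pool is re-sorted and truncated after every edit, the best score it holds can never go down, which is the fallback behavior described above.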
To further improve the optimization method, the researchers implemented a search history. They updated the LLM-mutator prompt to include examples of previously rephrased sentences that led to better outcomes.
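As a rough illustration of the idea (not the paper's wording), a history-aware mutator prompt could be assembled like this. The function name `build_mutator_prompt`, the template text, and the `max_examples` cutoff are all assumptions made for the sketch.

```python
def build_mutator_prompt(
    sentence: str,
    history: list[tuple[str, str]],
    max_examples: int = 3,
) -> str:
    """Build a rephrasing prompt that shows recent successful rewrites as examples.

    `history` holds (original, rewrite) pairs that previously improved accuracy.
    The template wording is illustrative only.
    """
    lines = ["Rephrase the sentence below while keeping its meaning unchanged."]
    recent = history[-max_examples:] if max_examples > 0 else []
    if recent:
        lines.append("Here are earlier rewrites that led to better results:")
        for before, after in recent:
            lines.append(f"Original: {before}")
            lines.append(f"Rewrite: {after}")
    lines.append(f"Sentence: {sentence}")
    lines.append("Rewrite:")
    return "\n".join(lines)


history = [
    ("Answer the question.", "Work out the answer to the question."),
    ("Be brief.", "Keep the response short."),
]
print(build_mutator_prompt("Explain your reasoning.", history))
```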
Here’s the updated prompt:
But how can we ensure we are selecting the most impactful sentences?
The researchers implemented several algorithms to solve the selection problem. I’m not going to dive into the details here, but you can see more in the paper if you’d like.
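For intuition only, here is one simple way to bias selection toward sentences whose edits have paid off before. This is a plain epsilon-greedy heuristic that I am swapping in for illustration; it is not one of the algorithms from the paper, and `SentencePicker` is a hypothetical name.

```python
import random


class SentencePicker:
    """Epsilon-greedy choice of which sentence to rephrase next (illustrative heuristic)."""

    def __init__(self, num_sentences: int, epsilon: float = 0.2, seed: int = 0):
        self.mean_gain = [0.0] * num_sentences  # average accuracy gain per sentence
        self.counts = [0] * num_sentences       # how often each sentence was edited
        self.epsilon = epsilon
        self.rng = random.Random(seed)

    def pick(self) -> int:
        untried = [i for i, c in enumerate(self.counts) if c == 0]
        if untried:
            return self.rng.choice(untried)     # try every sentence at least once
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.counts))  # occasional exploration
        return max(range(len(self.counts)), key=lambda i: self.mean_gain[i])

    def update(self, index: int, gain: float) -> None:
        """Record the accuracy change observed after editing sentence `index`."""
        self.counts[index] += 1
        self.mean_gain[index] += (gain - self.mean_gain[index]) / self.counts[index]


picker = SentencePicker(num_sentences=3, epsilon=0.0)
for index, gain in [(0, 0.0), (1, 0.5), (2, 0.1)]:
    picker.update(index, gain)
print(picker.pick())
```

With `epsilon=0.0` the picker always exploits, so after the three updates above it chooses the sentence with the highest recorded gain.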
The test prompts for this method were divided into two parts: Task Description and Demos. The Task Description provides the instructions, and each Demo includes the question, a chain-of-thought demonstration, and the final answer.
The researchers compared a few different automatic prompt engineering methods:
- Original prompt (baseline): The original human-designed prompt from the dataset
- Genetic Algorithm: An implementation of the genetic algorithm to tune long prompts, differing slightly from the primary method discussed above.
- Evolve “step-by-step”: Optimizing the original prompt by using single sentence optimization via the popular chain-of-thought prompting method.
- Greedy: Only stores the single top performing prompt, rather than a pool of prompts
Here are the results!
- Across all 8 tasks, the new method (“Our Method”) achieves an average gain of 8.2% in test accuracy and 9.2% in overall evaluation accuracy (test + train).
- The largest gains (~18.5%) were on the logical deduction tasks
- Comparing Evolve to Original Prompt, it’s clear that just adding “think it through step-by-step” doesn’t provide substantial improvements. Since these are long prompts it makes sense that adding a single sentence doesn’t move the needle materially.
The prompt’s accuracy improved from 38.8% to 56% (a ~44% relative increase). This improvement took 48 iterations, which goes to show that prompt engineering and prompt management require a lot of iterations to get to better results.
Most of the changes are minor adjustments to the original sentences. This is really promising for anyone working on prompts, and reinforces the concept that small changes can go a long way.
Here’s another example where the prompt’s accuracy improved by ~30% at iteration 91.
Putting this into practice
Implementing the same algorithms discussed in the paper is possible, but would require some effort. Instead, here are a few easier ways to apply these same insights to your prompts.
Generate semantically similar prompt versions using LLMs
We made a form using PromptHub that takes a prompt as input, applies an algorithm to rephrase some sentences while retaining their meaning, and outputs a new prompt version. You can try it out below or here.
This functionality is also available as a template in PromptHub, allowing you to generate multiple semantically similar prompt versions simultaneously using Batch testing.
Test variations in PromptHub
The core of this optimization method lies in making minor semantic edits. One way to put this into practice would be to test prompts and outputs side-by-side in PromptHub. You can make semantic edits until your new version outperforms the old and repeat.
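If you want to see what a side-by-side comparison boils down to, here is a small sketch that scores two prompt versions on the same labeled examples. Everything in it is an assumption for illustration: `accuracy` and `compare_prompts` are hypothetical helpers, and `fake_model` is a toy stand-in for a real LLM call.

```python
from typing import Callable

Example = tuple[str, str]  # (question, expected answer)


def accuracy(prompt: str, examples: list[Example], model: Callable[[str, str], str]) -> float:
    """Share of examples where the model's answer exactly matches the expected one."""
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(model(prompt, q).strip() == a for q, a in examples)
    return correct / len(examples)


def compare_prompts(
    versions: dict[str, str],
    examples: list[Example],
    model: Callable[[str, str], str],
) -> dict[str, float]:
    """Score every prompt version on the same examples so results are comparable."""
    return {name: accuracy(prompt, examples, model) for name, prompt in versions.items()}


def fake_model(prompt: str, question: str) -> str:
    """Toy stand-in for an LLM: only adds correctly when told to be careful."""
    left, right = (int(x) for x in question.split("+"))
    if "carefully" in prompt:
        return str(left + right)
    return str(left)


examples = [("2+3", "5"), ("4+0", "4"), ("1+1", "2")]
versions = {"old": "Add the numbers.", "new": "Add the numbers carefully."}
results = compare_prompts(versions, examples, fake_model)
print(results)
```

Exact-match scoring is the simplest possible metric. Free-form outputs usually need a more forgiving comparison.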
My biggest takeaway is that swapping out words and phrases can get you extremely far in prompt engineering.
No magic or intricate engineering needed. All you need is a way to generate semantically similar prompt versions and a systematic way to test prompts. We can help you out with both!