There has been a lot of research into how LLMs can enhance prompts for better outputs (we wrote about a few of these methods here and here). But that research has focused mostly on shorter prompts.

So today we are focusing on how to systematically optimize long prompts, drawing from the Google research paper: Automatic Engineering of Long Prompts.

Why long prompts are challenging

Since ChatGPT launched, there has been a ton of research into prompt engineering. There are now several established design principles, like including specific instructions, demonstrating in-context examples, and using chain-of-thought reasoning.

These principles do really work, but they lead to longer prompts, which can be harder to handle.

Additionally, minor modifications to a prompt can significantly impact output quality, and the longer the prompt, the more variables there are to get wrong. This highlights the importance of having a prompt management tool that allows for rigorous testing.

Previous automatic prompt optimization studies focused on short prompts. Their methods centered on replacing specific words or using an LLM to rewrite the entire prompt.

These methods don’t translate well to long prompts because:

  • The search space (the number of words) for word-replacement is too large
  • Completely rewriting a long prompt, while maintaining clarity and coherence, is challenging

Enter a new framework for optimizing longer prompts!


The goal is to generate a new prompt that is semantically similar to the original prompt but achieves better outputs. Semantic similarity is key because we don’t want a prompt version that is hard to interpret and understand.

The main operation in this technique involves replacing existing sentences in the prompt with semantically equivalent sentences.

Generating the new sentences is done via a simple prompt:

Version 1 of the LLM-mutator prompt

The first step involves breaking down the prompt into individual sentences.

Next, the LLM begins rephrasing individual sentences while preserving their original meaning.
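These two steps can be sketched in Python. The regex splitter and the mutator template below are simplified stand-ins, not the paper’s exact implementation; the template paraphrases the version-1 LLM-mutator prompt rather than quoting it verbatim.

```python
import re

def split_sentences(prompt: str) -> list[str]:
    """Naive regex-based sentence splitter; the paper doesn't specify its exact method."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", prompt.strip()) if s]

# Paraphrase of the version-1 LLM-mutator prompt from the paper
MUTATOR_TEMPLATE = (
    "Generate a variation of the following instruction while keeping the "
    "semantic meaning.\n"
    "Input: {sentence}\n"
    "Output:"
)

def build_mutator_prompt(sentence: str) -> str:
    """Fill the mutator template with one sentence from the long prompt."""
    return MUTATOR_TEMPLATE.format(sentence=sentence)
```

Each filled-in template would then be sent to the LLM, and the model’s completion replaces the original sentence in the long prompt.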

A pool of the top performing prompts is kept. During each iteration, one prompt from this pool is chosen for further refinement. This ensures that even if the method makes detrimental edits, it can always fall back to a different version of the prompt from the pool.
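The pool-and-refine loop described above can be sketched as follows. Here `mutate` and `score` are hypothetical stand-ins for the LLM-mutator call and the task-accuracy evaluation, which the paper implements with an actual model and benchmark data.

```python
import random

def optimize(initial_prompt, mutate, score, pool_size=4, iterations=50):
    """Keep a pool of the top-scoring prompts; each round, pick one,
    mutate it, score the result, and retain the best pool_size prompts."""
    pool = [(score(initial_prompt), initial_prompt)]
    for _ in range(iterations):
        _, candidate = random.choice(pool)   # pick a prompt to refine
        new_prompt = mutate(candidate)       # rephrase one sentence
        pool.append((score(new_prompt), new_prompt))
        # keep only the best versions, so a detrimental edit is discarded
        pool = sorted(pool, key=lambda p: p[0], reverse=True)[:pool_size]
    return pool[0][1]  # best prompt found
```

Because the pool always retains the highest-scoring prompts seen so far, a bad edit can never permanently degrade the search.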

To further improve the optimization method, the researchers implemented a search history. They updated the LLM-mutator prompt to include examples of previously rephrased sentences that led to better outcomes.
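A history-aware mutator prompt might be built like this; the wording is a paraphrase of the paper’s version-2 template, not a verbatim copy, and `history` is assumed to hold (old, new) sentence pairs that previously improved accuracy.

```python
def build_mutator_prompt_v2(sentence, history):
    """History-aware mutator prompt (in the spirit of version 2 in the paper).
    history: list of (old_sentence, new_sentence) pairs that improved accuracy."""
    examples = "\n".join(f"Old: {old}\nNew: {new}" for old, new in history)
    return (
        "Here are examples of sentence rephrasings that improved task "
        "accuracy in earlier iterations:\n"
        f"{examples}\n\n"
        "Generate a variation of the following instruction while keeping "
        "the semantic meaning.\n"
        f"Input: {sentence}\n"
        "Output:"
    )
```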

Here’s the updated prompt:

Version 2 of the LLM-mutator prompt

But how can we ensure we are selecting the most impactful sentences?

The researchers implemented several algorithms to solve the selection problem. I’m not going to dive into the details here, but you can see more in the paper if you’d like.

Experiment results

The test prompts for this method were divided into two parts: Task Description and Demos. The Task Description provides the instructions, and each Demo includes a question, a chain-of-thought demonstration, and the final answer.

An example prompt from the dataset

The researchers compared a few different automatic prompt engineering methods:

  • Original prompt (baseline): The original human-designed prompt from the dataset
  • Genetic Algorithm: An implementation of the genetic algorithm to tune long prompts, differing slightly from the primary method discussed above.
  • Evolve “step-by-step”: Appends the popular chain-of-thought trigger (“think it through step-by-step”) to the original prompt.
  • Greedy: Stores only the single top-performing prompt, rather than a pool of prompts.

Here are the results!

A table of results from the experiment


  • Across all 8 tasks, the new method (”Our Method”) achieves an average gain of 8.2% in test accuracy and 9.2% in overall evaluation accuracy (test + train)
  • The largest gains (~18.5%) were on the logical deduction tasks
  • Comparing Evolve “step-by-step” to the Original Prompt, it’s clear that just adding “think it through step-by-step” doesn’t provide substantial improvements. Since these are long prompts, it makes sense that adding a single sentence doesn’t move the needle materially.

A prompt optimization example

The prompt’s accuracy improved from 38.8% to 56% (a ~44% relative increase). This improvement took 48 iterations, which goes to show that prompt engineering and prompt management require many iterations to reach better results.

Most of the changes are minor adjustments to the original sentences. This is really promising for anyone working on prompts, and reinforces the concept that small changes can go a long way.

A prompt optimization example

Here’s another example where the prompt’s accuracy improved by ~30% at iteration 91.

Putting this into practice

Implementing the same algorithms discussed in the paper is possible, but would require some effort. Instead, here are a few easier ways to apply these same insights to your prompts.

Generate semantically similar prompt versions using LLMs

We made a form using PromptHub that takes a prompt as input, applies an algorithm to rephrase some sentences while retaining their meaning, and outputs a new prompt version. You can try it out below or here.

This functionality is also available as a template in PromptHub, allowing you to generate multiple semantically similar prompt versions simultaneously using Batch testing.

PromptHub template

Test variations in PromptHub

The core of this optimization method lies in making minor semantic edits. One way to put this into practice would be to test prompts and outputs side-by-side in PromptHub: make semantic edits until your new version outperforms the old one, then repeat.

Wrapping up

My biggest takeaway is that swapping out words and phrases can get you extremely far in prompt engineering.

No magic or intricate engineering needed. All you need is a way to generate semantically similar prompt versions and a systematic way to test prompts. We can help you out with both!

Dan Cleary