Who's better at writing prompts, humans or LLMs? I've found that a blend of human and LLM input works best.

One thing I know for sure is that the way you instruct a model with a prompt is extremely important. Even single-word differences in a prompt can have an outsized effect.

That's why this research from the folks over at DeepMind stuck out to me: Large Language Models as Optimizers.

They developed a method called Optimization by PROmpting (OPRO) to help with the prompt optimization process. At its core, OPRO leverages LLMs to iteratively evaluate outputs and optimize prompts.

Why OPRO?

While prompt engineering is relatively new, there has been a flurry of research studies resulting in countless methods and techniques. These range from multi-person collaboration to the Tree of Thoughts method and many others. Each of these methods has its own set of advantages and disadvantages.

OPRO stands out among these methods by leveraging LLMs as prompt optimizers. It iteratively generates new candidate prompts, refining them based on the problem description and previously discovered solutions (we'll walk through the steps below).

Its dynamic, feedback-driven process ensures that the optimization is not only accurate but also adaptive to the nuances of the specific task.

How OPRO works

Flowchart of the OPRO framework

At the heart of OPRO is a framework designed to integrate two LLMs (an optimizer and an evaluator) to iteratively improve the prompts generated from the meta-prompt.

Problem Setup:

Every optimization journey starts with a clear problem setup. This involves representing the task using datasets that have both training and test splits. The training set refines the optimization process, while the test set evaluates the efficacy of the optimized method.

(We have a full concrete example below.)
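As a rough sketch, the setup for a classification task might look like this. The toy data and the 80/20 split are my own illustration, not from the paper:

```python
import random

# Toy labeled dataset of (input_text, label) pairs; purely illustrative
examples = [
    ("example input 1", "label A"),
    ("example input 2", "label B"),
    ("example input 3", "label A"),
    ("example input 4", "label C"),
    ("example input 5", "label B"),
]

random.seed(0)
random.shuffle(examples)

split = int(0.8 * len(examples))
train_set = examples[:split]  # used inside the optimization loop to score candidate prompts
test_set = examples[split:]   # held out to evaluate the final optimized prompt
```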

Meta-Prompt Creation:

Central to OPRO, the meta-prompt contains three pieces of information: previously generated prompts with their training accuracies, a few examples from the task, and a description of the optimization problem.

Example meta-prompt: the blue text contains solution-score pairs, the purple text contains the examples, and the orange text contains the meta-instructions.
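Here's a minimal sketch of how those three pieces could be assembled in code. The exact wording and ordering are simplifications, not the paper's verbatim template:

```python
def build_meta_prompt(solution_score_pairs, examples, task_description):
    """Assemble a meta-prompt from prior prompts and their scores, a few
    task examples, and meta-instructions describing the optimization problem."""
    # 1. Previously generated prompts with their training accuracies (the blue text)
    history = "\n".join(
        f"Instruction: {prompt}\nAccuracy: {score:.1f}"
        for prompt, score in sorted(solution_score_pairs, key=lambda pair: pair[1])
    )
    # 2. A few input/output examples from the training set (the purple text)
    shots = "\n".join(f"Input: {text}\nLabel: {label}" for text, label in examples)
    # 3. Meta-instructions telling the optimizer LLM what to do (the orange text)
    return (
        f"{task_description}\n\n"
        f"Previous instructions and their accuracies, lowest first:\n{history}\n\n"
        f"Example problems:\n{shots}\n\n"
        "Write a new instruction that is different from the ones above and "
        "achieves a higher accuracy."
    )
```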

Iterative Solution Generation

OPRO's strength lies in its iterative nature: every new attempt at optimizing the prompt takes the previous attempts into account, and the new prompts and their scores are added to the meta-prompt (see the blue text in the image above).

Harnessing the power of trajectory

Since this framework takes into account previous solutions and their scores, the LLM can excel at what it does best: identifying patterns and trends.

Feedback and refinement

After each optimization step, the generated solutions are evaluated, and this feedback is looped back into the meta-prompt.
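Putting the pieces together, a bare-bones version of the loop might look like the sketch below. It reuses the build_meta_prompt helper from earlier, and optimizer_llm / scorer_llm are hypothetical wrappers around whichever models you use (they are not part of the paper's code):

```python
def opro_loop(task_description, train_set, examples, n_steps=10):
    """Sketch of OPRO's iterative optimization loop."""
    solution_score_pairs = []  # the trajectory of (prompt, accuracy) pairs

    for _ in range(n_steps):
        meta_prompt = build_meta_prompt(solution_score_pairs, examples, task_description)

        # Optimizer LLM proposes new candidate instructions from the meta-prompt.
        # optimizer_llm is a hypothetical helper that returns a list of strings.
        candidates = optimizer_llm(meta_prompt)

        # Scorer LLM evaluates each candidate on the training set.
        # scorer_llm is a hypothetical helper that returns the predicted label.
        for prompt in candidates:
            correct = sum(
                scorer_llm(prompt, text).strip().lower() == label.lower()
                for text, label in train_set
            )
            accuracy = 100 * correct / len(train_set)
            solution_score_pairs.append((prompt, accuracy))

    # Return the best instruction found over the whole trajectory
    return max(solution_score_pairs, key=lambda pair: pair[1])
```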

Putting OPRO into practice

Before jumping into the experiments and the results of the study, let's look at a quick example.

Let’s assume we’re a venture capital firm. Our goal is to categorize startups we've invested in or communicated with, based on their stage or sector (e.g., pre-seed, seed, Series A, healthcare, fintech).

Step 1: Define the Task

Classify startups into predefined categories based on the content of their emails.

Step 2: Data Collection and Preprocessing

Now we’ll gather a dataset of emails from the startups and label them with the correct classifications. This labeled data will be important for training and evaluation.

We don’t need anything complex; a simple Google Sheet will do. Column A will contain the email content, and Column B will contain the startup’s category.
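If you export that sheet as a CSV, loading and splitting it takes only a few lines. The filename, column order, and 80/20 split are assumptions for this walkthrough:

```python
import csv
import random

# Exported from the Google Sheet: column A = email content, column B = category
with open("startup_emails.csv", newline="") as f:
    rows = [(row[0], row[1]) for row in csv.reader(f)]

random.seed(42)
random.shuffle(rows)

split = int(0.8 * len(rows))
train_set, test_set = rows[:split], rows[split:]
```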

Step 3: Create initial Meta-Prompt

Now we’ll write an initial meta-prompt that describes the task at hand, with a few examples from our labeled dataset. Something like:
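Here's one way that initial meta-prompt could read; the wording and example emails are made up for this walkthrough:

```python
# Illustrative initial meta-prompt; the example emails are invented
initial_meta_prompt = """You will classify a startup into one of the following categories
based on the content of an email: pre-seed, seed, Series A, healthcare, fintech.

Here are some examples:

Email: "We just closed our $500k pre-seed round and are hiring our first engineer."
Category: pre-seed

Email: "Our clinic-scheduling platform is now live in 40 hospitals across three states."
Category: healthcare

Write an instruction that would help a model classify new emails as accurately as possible."""
```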

Step 4: Iterative Optimization with OPRO

The meta-prompt is fed into the optimizer LLM, which interprets it to understand the classification task, reviewing the examples and the type of data it’s dealing with.

Next comes the optimization step: the optimizer analyzes the meta-prompt to find improvements and generates a set of new prompts that it believes could improve classification results.

The responses will be a list of prompts to test:

  1. "Analyze the primary product or service mentioned in the email. Determine its industry relevance and classify the startup based on its current stage and sector."
  2. "Consider the startup's product, user base, and partnerships. Classify it into a category that best represents its industry and growth phase."
  3. "Based on the email's content, identify the startup's core offering and its market traction. Classify it into the most fitting category."
  4. "Evaluate the startup's product, its target audience, and any mentioned achievements. Assign a category that best encapsulates its industry and stage."
  5. "Review the email for clues about the startup's main product, user engagement, and collaborations. Classify the startup accordingly."

After these prompts are tested, their results will be appended to the meta-prompt:
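In code, that bookkeeping might look like the snippet below. The accuracies are placeholder numbers, just to show the format; real scores come from running each candidate prompt over the labeled training emails:

```python
# Placeholder accuracies, purely to illustrate the format appended to the meta-prompt
solution_score_pairs = [
    ("Review the email for clues about the startup's main product, user engagement, "
     "and collaborations. Classify the startup accordingly.", 58.0),
    ("Analyze the primary product or service mentioned in the email. Determine its "
     "industry relevance and classify the startup based on its current stage and sector.", 66.0),
]

# Each pair is rendered into the meta-prompt as an instruction/accuracy line
# (the blue text from earlier), so the optimizer can see which phrasings scored best
for prompt, accuracy in sorted(solution_score_pairs, key=lambda pair: pair[1]):
    print(f"Instruction: {prompt}\nAccuracy: {accuracy:.0f}\n")
```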

From here, the iterative process continues. After evaluating new prompts, feedback is used to refine the meta-prompt, and the process is repeated until you’re satisfied with the prompt’s performance.

Experiment setup

OPRO was put to the test across a diverse set of tasks, ranging from classic optimization problems like linear regression to more contemporary challenges in movie recommendations and various natural language processing tasks.

Setup

The researchers used a range of LLMs for both optimization and scoring:

  • Optimizer LLMs:
      ◦ Pre-trained PaLM 2-L (Anil et al., 2023)
      ◦ Instruction-tuned PaLM 2-L (denoted as PaLM 2-L-IT)
      ◦ Text-bison
      ◦ GPT-3.5-turbo
      ◦ GPT-4
  • Scorer LLMs:
      ◦ Pre-trained PaLM 2-L
      ◦ Text-bison

The temperature of the scorer LLM was set to 0, since its job is largely deterministic: simply evaluating accuracies.

However, the temperature for the optimizer LLM was set to 1.0 to allow for more creative and diverse prompt generations.
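If you were reproducing this split with, say, the OpenAI Python SDK, it would look roughly like this. The model names and prompts are placeholders; the paper used PaLM 2 variants and text-bison as scorers:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

meta_prompt = "..."       # the meta-prompt assembled earlier
candidate_prompt = "..."  # one instruction proposed by the optimizer
email_text = "..."        # one training example to classify

# Optimizer LLM: temperature 1.0 encourages more diverse candidate instructions
optimizer_response = client.chat.completions.create(
    model="gpt-4",
    temperature=1.0,
    messages=[{"role": "user", "content": meta_prompt}],
)

# Scorer LLM: temperature 0 keeps the evaluation deterministic
scorer_response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    temperature=0,
    messages=[{"role": "user", "content": f"{candidate_prompt}\n\nEmail: {email_text}"}],
)
```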

Experiment Results

Efficient Prompt Optimization: Even while using only a fraction (3.5%) of the training data, OPRO outperformed many other prompting benchmarks.

Superior Performance with Zero-Shot Prompting: OPRO's top instructions matched chain-of-thought performance and exceeded zero-shot, human-crafted prompts by 8%.

Diverse Optimization Styles: The prompts that performed best for each model varied greatly in length. Check out the instructions from PaLM 2-L-IT and text-bison compared to GPT-4 (last in the list).

Top instructions for different models and their accuracy scores; the length of the top instruction varied greatly among models.

Sensitivity to Word Choice: Even slight variations in semantically similar instructions led to significant differences in accuracies. Word choice matters in prompt engineering!

Critical Role of Examples: The number of examples in the meta-prompt significantly impacted optimization, especially going from 0 to 3 examples. However, the benefit diminished from 3 to 10 examples. This emphasizes the importance of balancing how many examples you include.

Accuracy vs. number of optimization steps for different numbers of examples in the meta-prompt; accuracy peaks with 3 examples and declines as more are added (up to 10).

Balancing Exploration and Exploitation with Temperature: A temperature of 1.0 had the best results. Lower temperatures led to a lack of exploration, resulting in stagnant optimization curves, while higher temperatures often overlooked the trajectory of previous instructions.

Accuracy vs. number of optimization steps for different optimizer temperatures.

Implications and looking forward

A couple of final points:

  • Effects on Prompt Engineering: OPRO highlights the importance of iterative, feedback-driven approaches to prompt engineering. We strongly believe in this and have seen firsthand how a little iterative work goes a long way. PromptHub makes this easy.
  • Training and Fine-Tuning: OPRO was extremely efficient with very limited training data.
  • Versatility Across Domains: OPRO can be applied across domains and may reduce the need for domain-specific fine-tuning.

OPRO reinforces a belief that I already had: LLMs greatly speed up the prompt writing process. OPRO's systematic approach is promising and can be a huge point of leverage.

Dan Cleary
Founder